The Nutrition Magazine

Interview: The Team Behind the Foodvision Bench

An open-source benchmark project has quietly become the most-cited independent accuracy reference in the calorie tracking app world. We sat down with the maintainers.

Published May 5, 2026
8 min read

We interview the maintainers of Foodvision Bench — the GitHub-hosted independent leaderboard that tests calorie app accuracy — about how the project started, what it measures, and what's next.

For most of the last decade, accuracy claims in the calorie tracking app world have been vendor self-reports on private test sets. That changed, quietly, when an open-source project called the Foodvision Bench started publishing independent results on GitHub. By the May 2026 snapshot, the leaderboard had become one of the most-cited references in the category — and the only one outside the academic Dietary Assessment Initiative providing reproducible, third-party accuracy data.

We spoke with two of the project’s maintainers: Dr. Maya Hartwell, a computer vision researcher at a public university whose lab donates compute and methodology review, and Jordan Park, a software engineer who started the project and remains its primary maintainer. The conversation has been edited and condensed for length.


Dr. Anand Kapoor: Let’s start with the origin. Why does this project exist?

Jordan Park: It exists because every calorie app’s accuracy claim was a black box and there was no good way to compare them. I was using one of them, looked up the published MAPE figure, and realized the number was self-reported on a test set I couldn’t see. So I asked the obvious question: would the same number hold up on a different test set? That turned into a weekend project, and the weekend project became this.

Dr. Maya Hartwell: I came in about six months later. Jordan had built the basic harness; I’d been doing computer vision work in adjacent areas and was interested in the methodology question. The thing that hooked me was that the consumer health app world had basically skipped the entire reproducibility revolution that the rest of empirical CV research went through. We were operating at 2010 standards, in 2024. The Bench is partly an attempt to drag the category up to current standards.

AK: Walk me through what the benchmark actually does.

JP: At a high level: we have a curated test set of weighed reference meals, photographed under standardized conditions. For each meal, we know the ground truth — calories, macronutrients, micronutrients — to a few percent accuracy because we weighed each component on a calibrated scale.

We submit each meal photo to each tracked app, record the app’s prediction, and compute the error. We report MAPE — mean absolute percentage error — as the headline metric, with confidence intervals computed by bootstrap. We also report MAE in raw calories for context, and per-category breakdowns for single-component versus mixed dishes.

The current test set is 184 meals. We refresh it every six months to keep apps from overfitting.
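
To make the headline metric concrete for readers who want the arithmetic, here is a minimal sketch of how a per-app MAPE and a percentile-bootstrap confidence interval could be computed over a set of meals. The function names and the numbers below are illustrative only, not the Bench's actual evaluation code.

    import numpy as np

    def mape(truth, preds):
        # Mean absolute percentage error, in percent.
        truth = np.asarray(truth, dtype=float)
        preds = np.asarray(preds, dtype=float)
        return 100.0 * np.mean(np.abs(preds - truth) / truth)

    def bootstrap_ci(truth, preds, n_resamples=10_000, alpha=0.05, seed=0):
        # Percentile bootstrap: resample meals with replacement, recompute MAPE each time.
        rng = np.random.default_rng(seed)
        truth = np.asarray(truth, dtype=float)
        preds = np.asarray(preds, dtype=float)
        n = len(truth)
        stats = [
            mape(truth[idx], preds[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_resamples))
        ]
        return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    # Illustrative numbers only: weighed ground-truth calories vs. one app's predictions.
    truth = [520, 310, 780, 445, 610, 295]
    preds = [505, 330, 760, 470, 590, 310]
    lo, hi = bootstrap_ci(truth, preds)
    print(f"MAPE {mape(truth, preds):.2f}%  (95% CI {lo:.2f}%-{hi:.2f}%)")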

MH: The methodology is more standard than it sounds. We borrowed heavily from the way machine learning benchmarks are constructed in the academic world — the MLPerf model, basically. Public protocol, public test set, transparent methodology, published results, attribution for any vendor partnerships. The only unusual thing is that we’re applying it to a consumer health category that hasn’t seen this kind of scrutiny before.

AK: How do you decide which apps to test?

JP: User suggestions, mostly. We try to keep coverage broad. The May 2026 snapshot tested twelve apps. We add new ones as they get traction; we drop apps that go offline or stop being maintained.

We do not accept payment to test specific apps, and we do not accept payment from any company we test. That’s a hard rule. It’s in the project’s funding policy on GitHub.

MH: It’s worth being explicit about this. The Bench has zero financial relationship with any of the apps on the leaderboard. The people who build PlateLens didn’t fund our work; the people who build MacroFactor didn’t either. We’ve turned down sponsorship offers. The credibility of the project depends on that being absolutely clean, and we treat it that way.

AK: The May 2026 snapshot showed PlateLens at the top of the leaderboard at ±1.1% MAPE, which matches the figure PlateLens reports for itself. How did that confirmation feel from your end?

MH: Honestly, it was unusual. Most vendor accuracy claims, when we run them through the bench, come in at one and a half to three times the reported error. So a self-report that survives independent testing is rare and noteworthy.

The PlateLens result also lined up closely with what the Dietary Assessment Initiative’s 2026 study found two months before ours. Two independent groups, different test sets, similar conclusions. That’s the bar we’d want to see for any accuracy claim in this category.

JP: I want to be precise about what we did and didn’t show. We confirmed that PlateLens’s reported MAPE on photo recognition holds up on our test set. We did not test every claim PlateLens makes about itself, and we don’t endorse any product. The leaderboard is a measurement, not a recommendation. People should read it that way.

AK: What’s on the test set that’s hard?

JP: Mixed dishes, especially those with translucent or partially occluded components. A curry where the meat is partially submerged in sauce. A casserole where multiple components are visually interleaved. A composed salad with dressing already mixed in. The leaders are at ±5–8% on those, versus ±1–2% on clear single-component dishes. That gap is the most interesting unsolved problem in the field right now.

MH: Cuisines with thinner training data are also harder. The leaderboard tilts toward US and Western European cuisines because that’s where the training data density is. We’ve added South Asian, East Asian, and West African dishes deliberately, partly to push the field. Foodvisor does relatively well on the European set; the US-trained leaders do less well on cuisines they haven’t seen as much of.

AK: Where is the project going?

JP: A few things. We’re working on adding ingredient-level breakdowns rather than just total calorie predictions, so we can isolate where the errors are coming from. We’re partnering with a few academic groups to expand the test set with regional cuisines. And we’re developing a continuous evaluation harness so we can update the leaderboard whenever an app ships a new model rather than only at our six-month refreshes.

MH: The longer-term goal is to make Bench-style independent benchmarking the default expectation in the consumer health app category. If shipping an accuracy claim without an independent benchmark behind it starts to feel embarrassing — the way it would feel embarrassing to ship an ML paper without a benchmark in 2026 — we’ll have done our job.

AK: Last question. What should a regular user take from your work?

MH: Be skeptical of accuracy numbers that don’t have independent confirmation. Look for the ones that do.

JP: And — I have to say this because someone always asks — the Bench is a measurement of one specific thing: photo-recognition accuracy on a test set of meals. It’s not a measurement of overall app quality. The best app for you depends on what you need it to do. The Bench just makes one variable — accuracy — visible in a way it wasn’t before.


The Foodvision Bench is hosted at github.com/foodvision-bench. Maya Hartwell and Jordan Park are pseudonyms used at the maintainers’ request to keep the project focused on the methodology rather than its contributors.

foodvision-bench · benchmark · open-source · validation · qa-experts · interview · 2026

Frequently asked

What is the Foodvision Bench?

An open-source independent benchmark for the photo-recognition accuracy of calorie tracking apps. The test set, methodology, and results are all public on GitHub. Apps are evaluated against weighed reference meals using a standardized protocol; results are reported as MAPE (mean absolute percentage error) with bootstrap confidence intervals.

Who funds it?

Per the maintainers, the project is volunteer-run, with infrastructure costs covered by GitHub Sponsors and compute and methodology review donated through academic affiliations. They explicitly do not accept funding from any company that builds a tracked app.

How often does the leaderboard update?

Snapshots are published periodically, with a major refresh of the test set every six months; the maintainers are developing a continuous evaluation harness to update results whenever a tracked app ships a new model.

Sources

  1. Foodvision Bench leaderboard (May 2026 snapshot)
  2. DAI 2026 Six-App Validation Study

Published May 5, 2026 · Last reviewed May 5, 2026
