Why the Lancet study suggesting a far higher Gaza death toll is deeply flawed

A few months ago, The Lancet Global Health published a study by Professor Michael Spagat et al. claiming that the Hamas-run Gaza Ministry of Health had undercounted violent deaths by roughly 35 per cent, and very quickly the finding took on a life of its own. Journalists and advocacy groups cited it as independent confirmation that the official death toll significantly underestimated reality, and some even mistakenly assumed that the same 35 per cent correction factor also applied to the most recent official fatality figures.

However, as Professor Sergio DellaPergola and I showed in a formal response recently published in the same journal, the survey suffers from major methodological flaws that undermine its headline claim.

Before turning to those flaws, one important clarification is needed: our criticism is not an attempt to assess the accuracy of the Gaza Health Ministry’s figures. That question raises complex issues of its own, and both I and other researchers have noted shortcomings in the ministry’s reporting, including the absence of a distinction between civilians and combatants and indications that some recorded deaths may not be conflict related. The focus here is the dramatic claim, advanced by Professor Spagat and colleagues, that the ministry’s data miss roughly one in three violent deaths. As I show below, the survey does not substantiate this conclusion.

How the study was supposed to work

On paper, the study's approach is straightforward: collect wartime mortality data from a representative sample of 2,000 households and then extrapolate those findings to estimate Gaza-wide death totals. It's a standard method used in mortality surveys – with one crucial caveat common to all of them: your estimate is only as good as your sample.

A simple way to see this is to imagine estimating average income from people you happened to meet on a single busy street. Even if you collect the data carefully, you have no way of knowing whether your estimate is too high or too low for the simple reason that you do not know how that group differs from the wider population.

The authors themselves appear aware of this risk. They set out a methodology intended to approximate representativeness, including multiple safeguards aimed at catching both obvious and subtle breakdowns: monitoring GPS locations to ensure teams remained within their assigned areas, and flagging results that looked statistically unusual, including demographic patterns such as household size and the share of children.

These safeguards were meant to ensure that interviewers followed the survey design, so that the resulting sample would be representative of the general population. However, it appears this is not what happened in practice.

When the sampling plan breaks down

Examination of the raw survey data shows that the actual fieldwork significantly diverged from the survey design – and not as a single isolated issue, but as a pattern across multiple dimensions. This gap between design and implementation is central to understanding the survey’s limitations.

For example, take the GPS tracking the authors themselves introduced to confirm interviewers stayed within their designated areas. The traces show something different: interviewers moving along main roads, crossing boundaries, and clustering interviews in narrow zones instead of covering the full area. In addition, in several cases, multiple teams appear to have surveyed the same sampling unit, rather than separate ones.

These patterns are clearly visible in the raw data and run counter to the protocol described in the paper – the practical effect being interviews skewed toward whoever was easiest to reach, rather than a broad, representative sample. None of these deviations are discussed in the published analysis – even though they're exactly what the study's quality controls were meant to catch, suggesting those safeguards were never effectively applied.

This is not to dismiss the severe constraints of conducting fieldwork in an active war zone. But if those constraints made it impossible to follow the sampling plan, that limitation – and its implications for representativeness – should have been clearly acknowledged. Regardless of the reason for such deviations, the result is no longer the kind of representative sample the survey was meant to produce but, in effect, a “convenience sample”, based on which households happened to be easiest to reach rather than a true cross-section of the population. This survey may thus allow very loose approximations, but it cannot support precise population-wide estimates.

Extreme outliers that skewed the results

The problems described above are not just theoretical. They show up clearly when we look at the results produced by different survey teams.

In a properly conducted random sample in the same area, different teams are typically expected to produce broadly similar results – both in reported mortality and in the demographic composition of the households they interview – with only modest differences due to chance. Instead, two teams reported results that are dramatically different from the rest.

One team – labelled Gaza9 – conducted about 8 per cent of the interviews yet reported roughly a quarter of all violent deaths in the survey – several times higher than most other teams. Another team, Gaza3, also reported elevated mortality, roughly double the rate of the rest of the sample.
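To put the Gaza9 figures in perspective, the short calculation below is a rough sketch using only the rounded shares quoted above (about 8 per cent of interviews, roughly a quarter of violent deaths). It assumes, purely for illustration, that a team's share of interviews is a fair proxy for the share of the sample it covered; the function name is hypothetical and this is not the calculation used in the study or in the published response.

```python
def relative_death_rate(interview_share: float, death_share: float) -> float:
    """Ratio of one team's deaths-per-interview to that of all other teams combined.

    Both arguments are fractions of the survey total (0 < share < 1).
    """
    team_rate = death_share / interview_share              # deaths per unit of interviews, this team
    rest_rate = (1 - death_share) / (1 - interview_share)  # same quantity for everyone else
    return team_rate / rest_rate


# Rounded figures from the article: ~8% of interviews, ~25% of violent deaths.
gaza9_ratio = relative_death_rate(interview_share=0.08, death_share=0.25)
print(f"Gaza9 reported roughly {gaza9_ratio:.1f} times the rate of the rest of the sample")
```

Because the inputs are rounded, the result is only indicative, but it is consistent with the description of Gaza9's rate as several times higher than that of the other teams.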

Critically, the differences are not limited to death counts. Gaza's general population averages 5.5 people per household, with children making up about half. Most other teams cluster reasonably close to these figures, but Gaza3 and Gaza9 stand well apart from all of them: Gaza9 reports an average household size of roughly 3.7 and only about 28 per cent children, while Gaza3's figures are even further off, at 2.8 people per household and just 19 per cent children.
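The size of these demographic gaps can be expressed as a simple percentage deviation from the population benchmark. The sketch below uses only the figures quoted above, treating "about half" children as 0.50; the variable and function names are illustrative assumptions, and this is not the flagging procedure described in the survey protocol.

```python
# Population benchmarks quoted in the article ("about half" is taken as 0.50).
POPULATION = {"household_size": 5.5, "child_share": 0.50}

# Team-level figures quoted in the article.
TEAMS = {
    "Gaza9": {"household_size": 3.7, "child_share": 0.28},
    "Gaza3": {"household_size": 2.8, "child_share": 0.19},
}


def relative_gap(observed: float, benchmark: float) -> float:
    """Signed deviation from the benchmark, as a fraction of the benchmark."""
    return (observed - benchmark) / benchmark


gaps = {
    team: {key: relative_gap(value, POPULATION[key]) for key, value in profile.items()}
    for team, profile in TEAMS.items()
}

for team, team_gaps in gaps.items():
    print(
        f"{team}: household size {team_gaps['household_size']:+.0%}, "
        f"share of children {team_gaps['child_share']:+.0%}"
    )
```

On these rounded figures, both teams fall a third or more below the benchmark on both measures, with Gaza3 the further off of the two.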

Taken together, these two teams combine unusually high reported mortality with clearly unrepresentative household structures. When their data are excluded, the estimated death toll drops by more than one fifth – bringing it within the survey’s margin of error relative to the Ministry of Health figures.

Dramatic outliers like these are a clear red flag, and according to the survey's own design, this kind of problem was supposed to be caught during fieldwork. In our critique, we pointed out that it wasn't: neither Gaza9 nor Gaza3 was flagged as an outlier during data collection.

In response to our critique, the authors dispute this and claim that these anomalies "were visible from early in the survey and actively discussed." However, the paper itself contains no indication that the anomalous data were flagged to field workers or supervisors as they emerged. Instead, the discussion of the Gaza9 anomalies appears only during the analysis phase, when the authors calculated violent death estimates, after all survey data had already been collected.

As part of that analysis, they also estimated the effect of excluding Gaza9, noting that doing so "does substantially lower our estimate for the size of the MoH undercount." Only then do the authors report looking into the reason for the outlier, writing: "So, we investigated further and found that three PSUs [survey sampling areas] covered by this team were in shelters that give special preference to families that have lost members during the fighting…"

And that after-the-fact analysis covered only Gaza9's mortality numbers. Nowhere in the paper are Gaza9 or Gaza3 identified as having unusual demographic profiles – smaller households and far fewer children – which should have also been flagged as anomalies. The paper does discuss demographics, but only to show that the data set as a whole doesn’t match what's known about Gaza's population overall. That is a different claim entirely: it says nothing about whether any specific team's results were completely out of line, which is exactly the question Gaza9 and Gaza3 raise.

Rather than address that gap, the authors try to downplay the importance of these imbalances by pointing to their use of statistical weighting – a procedure meant to adjust the results so that the full data set matches the population's known characteristics. However, while weighting can correct some small deviations, it cannot repair a situation where a handful of teams produced vastly different results. It's a bit like putting a band-aid on a structural crack: it may make the surface look more even, but it does not repair the underlying problem, and it does not explain why the problem wasn't addressed early on.

The authors further suggest that the Gaza9 results may reflect a "genuine spatial concentration of violence". However, the data show that Gaza9 consistently reported higher mortality than other teams operating in the same areas. In at least one particularly clear case, Gaza9 reported deaths in 10 out of 20 households, while another team working just a few buildings away reported none. Such a discrepancy is difficult to reconcile with any claim of a “genuine spatial concentration of violence”.

Differences of that magnitude cannot plausibly be explained by local variation. If location were the primary factor, teams operating side by side would be expected to report broadly similar results. Instead, the discrepancies appear to track the teams, not the locations – pointing to a problem with how the data were collected rather than where they were collected.

The bottom line

The survey's headline claim – that the Ministry of Health undercounted deaths by roughly 35 per cent – rests on the assumption that the sample represents the population. The study’s own data say it doesn't. Fieldwork repeatedly seems to have departed from the sampling plan, a small number of teams drove a disproportionate share of the results, and those same teams produced demographic profiles unlike the wider population. Remove their data, and the gap with the Ministry of Health figures shrinks to within the survey's own margin of error.

This isn't some technical nitpicking for its own sake. What's at stake here is a dramatic claim that got treated as settled fact – cited, repeated, and stretched to cover deaths it never actually measured – when, in fact, the sample behind it was just too broken to support it.

Dr Mark Zlochin is an independent researcher and data analyst
