What is the use of lots of data when we don’t know what they mean?

There are lots of sources for data on Covid-19. Many of them just take data from government or hospital sources, and let you compare them however you want to. Others are a little more careful. Financial Times data journalists outline some of the problems:

Comparing the spread of coronavirus in different countries is difficult using the data being released by governments. Confirmed case counts depend heavily on the extent of countries’ very different testing regimes, so higher totals may simply reflect more testing.
Deaths are somewhat more reliable, but remain problematic because countries have different rules for what deaths to include in their official numbers.

I just finished reading Spiegelhalter and Masters’ recent book Covid by Numbers: Making sense of the pandemic with data. The authors, being British statisticians, focus on the situation in the UK, although they do mention ways of comparing, for example, excess deaths between countries.

I am well aware of some of the differences in how data are collected. In the spring of 2020, many statisticians were thinking about data quality. On April 1, 2020 I wrote in a Facebook post:

So why are the data on cases total garbage? First of all, a large proportion of those infected have no symptoms, and are therefore not tested. That may be a third to half of the cases. Second, here is what happened to three friends of mine who got the disease: One was tested, and diagnosed. One was tested, had all the symptoms, but no test results have come back after 12 days. One was refused testing. All three quite clearly had the disease, and their doctors agreed.

Because there was only limited access to tests (and the US insisted on developing their own test rather than using the available German test promoted and distributed by the World Health Organization), testing in the US originally required a doctor’s order. And different doctors had different ideas about the importance of testing. At that time there were no real treatments, and no vaccine. Since testing policies have changed over time in most places, case data are not directly comparable over time either: in particular, recent counts cannot meaningfully be set against counts from early in the epidemic.

A colleague commented on my post, writing:

Differential testing rates by state and country using different criteria for who gets the test; type I and II error rates vary by testing kit manufacturer; and, just to add confusion in other reported statistics, deaths are attributed to cause using alternative criteria across areas (country, region, hospital). May the value of statistical thinking and international statistical standards become evident to leadership…somewhere!

Another colleague posted in a separate thread about the same time:

I’ve been seeing a lot made about the proportion of SARS-Covid-2 tests that are positive in India being low right now (around 4%, compared to the current 20% in the US) as indicative of India somehow dodging the bullet.

I have basic question. What population parameter is this positivity percentage estimating? And why is it a good estimate of anything, given the sampling scheme being used? Is this statistic useful for comparing across countries given variations in testing rate, criteria for testing, reporting requirements in different jurisdictions, and the like? And, if the answer is no, why are we trying to make (confirmation bias influenced) inference using it?!!!

The Greater Seattle Coronavirus Assessment Network (SCAN) allows you to do in-home tests for COVID-19. On May 10, 2020, I stuck the stick up my nose, turned it around several times, put it in the SCAN box and texted them to come and pick it up. Four days later, I accessed my results page. It said that SCAN had been told by the feds that they were not authorized to do in-home testing any longer (they originally had a preliminary authorization). So, although I had done the test, I was not able to find out the results. Ever. A few weeks later the SCAN in-home test was approved by CDC. I requested and took the test, and it was negative.

About the same time, Nate Silver published a story on 538 about why case counts are meaningless. He writes:

… if you’re not accounting for testing patterns, it can throw your conclusions entirely out of whack. You don’t just run the risk of being a little bit wrong: Your analysis could be off by an order of magnitude. Or even worse, you might be led in the opposite direction of what is actually happening.

He presents some detailed examples of the testing pattern problems that arise in case data. Case data are not at all like survey data, where the sample is chosen to resemble the population.
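To make the point concrete, here is a minimal sketch with entirely hypothetical numbers. The two regions, their populations, infection rates, and testing rules are all assumptions made up for illustration, not data from any country. Both regions have exactly the same number of infected people; they differ only in who gets tested.

```python
def confirmed_cases(population, infection_rate, symptomatic_share,
                    test_rate_symptomatic, test_rate_other):
    """Return (confirmed cases, tests done, positivity) under a simple testing rule.

    Simplifying assumptions (all hypothetical): tests are perfectly accurate,
    and uninfected people are tested at the same rate as infected people
    without symptoms.
    """
    infected = population * infection_rate
    symptomatic = infected * symptomatic_share
    asymptomatic = infected - symptomatic
    uninfected = population - infected

    positives = (symptomatic * test_rate_symptomatic
                 + asymptomatic * test_rate_other)
    tests = positives + uninfected * test_rate_other
    return positives, tests, positives / tests


# Same population and same true infection rate in both regions.
common = dict(population=1_000_000, infection_rate=0.02, symptomatic_share=0.6)

# Region A: testing mostly on a doctor's order for people with symptoms.
region_a = confirmed_cases(**common, test_rate_symptomatic=0.5,
                           test_rate_other=0.005)

# Region B: widely available testing.
region_b = confirmed_cases(**common, test_rate_symptomatic=0.9,
                           test_rate_other=0.10)

for name, (cases, tests, positivity) in [("A", region_a), ("B", region_b)]:
    print(f"Region {name}: {cases:,.0f} confirmed cases, "
          f"{tests:,.0f} tests, positivity {positivity:.1%}")
```

With these made-up inputs, region B reports nearly twice as many confirmed cases as region A (about 11,600 against about 6,000) and a positivity of roughly 11% against roughly 55%, although both have the same 20,000 infections. Neither statistic, taken alone, says which region is worse off.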


So how about other COVID-19 statistics? A COVID-19 death can be defined in different ways. In some countries any death within 28 days of a positive COVID-19 test counts, while other countries require that COVID-19 be listed on the death certificate, perhaps even that it be listed as the leading cause of death. Some countries only count hospital deaths in their COVID-19 statistics. Table 1 in Bauer et al. (2021) lists the different definitions in several European countries and the USA. The different definitions make comparisons of these numbers difficult to interpret, and are expected to explain some of the differences in COVID-19 death rates between countries.
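A small sketch shows how much the definition alone can matter. The six death records below are invented for illustration, and the four counting rules are simplified versions of the kinds of definitions described above, not the exact rules of any particular country.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Death:
    days_since_positive_test: Optional[int]  # None if never tested positive
    covid_on_certificate: bool
    covid_leading_cause: bool
    died_in_hospital: bool


# Hypothetical records, made up for this example.
DEATHS = [
    Death(10, True, True, True),       # tested, died in hospital of COVID-19
    Death(40, True, True, True),       # died after a long hospital stay
    Death(5, False, False, False),     # tested positive, died of something else
    Death(None, True, True, False),    # never tested, doctor listed COVID-19
    Death(20, True, False, False),     # COVID-19 contributed, not leading cause
    Death(None, False, False, False),  # unrelated death
]

# Simplified counting rules (assumptions, not official definitions).
RULES = {
    "within 28 days of a positive test":
        lambda d: d.days_since_positive_test is not None
        and d.days_since_positive_test <= 28,
    "COVID-19 on death certificate":
        lambda d: d.covid_on_certificate,
    "COVID-19 as leading cause":
        lambda d: d.covid_leading_cause,
    "hospital deaths with COVID-19 on certificate":
        lambda d: d.died_in_hospital and d.covid_on_certificate,
}

counts = {name: sum(1 for d in DEATHS if rule(d))
          for name, rule in RULES.items()}

for name, n in counts.items():
    print(f"{name}: {n}")
```

The same six deaths give counts of 3, 4, 3 and 2 under the four rules, and the two rules that both give 3 are not counting the same people.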

Excess deaths beyond pre-pandemic rates may be a more comparable quantity, assuming that all deaths are recorded in some way. If the death rates are much higher than pre-pandemic rates, this may be an indication that many of the excess deaths are due to COVID-19. However, public health policies such as masks, hand washing, physical separation, and work from home affect other causes of death as well. In many countries, the 2020 influenza epidemic was smaller than in most recent years. Traffic in lockdown countries was lower, resulting in fewer traffic fatalities. So just using excess deaths to estimate the number of COVID-19 deaths will not be particularly accurate either.
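The arithmetic can be sketched in a few lines. The yearly death totals here are hypothetical, and the baseline is a plain average of earlier years, which is a crude assumption: it ignores trends such as population growth and aging. The figure for avoided deaths from other causes is likewise invented, only to show the direction of the bias.

```python
from statistics import mean


def excess_deaths(observed, prior_years):
    """Observed deaths minus a simple average of earlier years."""
    return observed - mean(prior_years)


# Hypothetical all-cause deaths in one region for five pre-pandemic years.
prior = [50_000, 51_000, 50_500, 51_500, 52_000]
observed_2020 = 58_000

excess = excess_deaths(observed_2020, prior)

# Hypothetical: fewer influenza and traffic deaths than in a normal year.
avoided_other_deaths = 1_500

# If other causes fell, the pandemic accounts for more than the raw excess.
implied_pandemic_deaths = excess + avoided_other_deaths

print(f"Excess deaths: {excess}")
print(f"Implied pandemic-related deaths: {implied_pandemic_deaths}")
```

With these numbers the raw excess is 7,000, but if 1,500 deaths from other causes were avoided, the deaths attributable to the pandemic would be 8,500. The size of that correction is exactly what is hard to know.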

Data on COVID-19 hospitalizations are sparse. According to Our World in Data, several European countries, the USA, Canada, Algeria, and Malaysia provide hospitalization data. Not all of these provide further data on the number of patients in intensive care units (ICUs), which is a crucial quantity in policy making, since authorities want to ensure sufficient ICU capacity for patients other than those with COVID-19 as well.

There are about a dozen COVID-19 vaccines available in the world. No country has access to all of them. Our World in Data has a list of what vaccines are used in what countries. Different countries also have very different criteria for vaccine eligibility, and the availability of vaccines varies. Not every country gives data on whether an individual is fully vaccinated, and not every country defines fully vaccinated in the same way. Some vaccines only require one dose, some three, and most two, although recent research indicates that booster shots may be needed.

It struck me that it would be interesting to try to bring together statisticians from different countries who have paid attention to local data collection methods to try to figure out what data should go into the global numbers. Consideration should be given to how each country obtains:

Testing results (who gets tested, who does the testing, which test methods are used, what results are recorded)
Deaths (what constitutes a COVID-19 death, and does that change over time)
Hospitalizations (are there data from all hospitals in the country, are COVID-19 intensive care cases separated from other COVID-19 cases)
Vaccination (who is allowed to get what vaccine when)

and maybe an overview of the approach the authorities have taken to public health policy during the pandemic.
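One way to picture such a collection is as a structured record per country. The sketch below is only an illustration: the class name, the field names, and the two example countries are all hypothetical, chosen to mirror the items in the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CollectionProfile:
    """Hypothetical summary of how one country collects COVID-19 data."""
    country: str
    who_gets_tested: str
    test_methods: str
    death_definition: str
    hospital_coverage: str
    icu_reported_separately: bool
    vaccine_eligibility: str


def differing_fields(a, b):
    """Names of the collection practices on which two countries differ."""
    return [f.name for f in fields(a)
            if f.name != "country"
            and getattr(a, f.name) != getattr(b, f.name)]


# Two invented countries, for illustration only.
country_a = CollectionProfile(
    "Country A", "symptomatic, on a doctor's order", "PCR",
    "within 28 days of a positive test", "all hospitals", True,
    "by age group")
country_b = CollectionProfile(
    "Country B", "anyone on request", "PCR",
    "COVID-19 on death certificate", "public hospitals only", False,
    "by age group")

differences = differing_fields(country_a, country_b)
print(differences)
```

Listing the differences before comparing any numbers makes explicit which comparisons need adjustment and which might be taken at face value.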

My feeling is that no two countries collect the relevant data in the same way. The World Health Organization has standardized nomenclature and diagnosis, but not data collection. Since the International Statistical Institute (ISI) family has members in 136 countries, it would be useful to put together a white book with all this information for as many countries as possible. This could lead to a future data collection policy suggestion. With enough information (and appropriate assumptions), it may even be possible to build a statistical model that allows us to produce an improved estimate of the actual number of COVID-19 cases, and the actual number of COVID-19 deaths worldwide, with appropriate measures of uncertainty.