Projections from mathematical models of infectious disease are guiding policy decisions around the world in the fight against COVID-19. While most of these models are specific to the institutions that develop them, they share basic mathematical principles: they divide a population into different groups and try to simulate how the population transitions from one group to the next. In particular, the SIRD model — one of the most frequently used — divides the population into four groups: those who are susceptible to the virus (S), have become infected (I), recovered (R) and died (D). In this article we explore a modified version of the SIRD model to study COVID-19 time series data across the US. Specifically, we use polynomial regression to fit an SIRD model to real world data and utilize the estimated parameters of the model to cluster geographies across the US.

*Animation of an SIRD model with changing parameters (from Wikipedia)*

**What are disease models?**

Disease models (such as SIRD) stem from research done by a group of British doctors, mathematicians and epidemiologists that was presented in a series of articles published from 1927 to 1933. Their aim was to explain the rapid rise and fall in the number of infected patients observed in epidemics such as the London cholera outbreak of 1865. Contemporary disease models are used to predict properties of an epidemic such as its duration, prevalence, and the virality of its spread. The simplest models make some basic assumptions: everyone has the same chance of becoming infected, the infected are equally infectious, and the level of contact between people in a population is equally distributed. Advanced implementations subdivide populations into more granular groups based on factors like, age, sex, and health status. They may also incorporate exogenous factors such as population density and policy decisions.

**The mechanics of an SIRD model**

An SIRD model can be described through a series of differential equations.

Where γ is the rate of recovery (the inverse of the duration of sickness), β is the transmission rate, μ is the mortality rate and N is the total population. The first equation governs the rate of change for the susceptible group, which decreases at a rate proportional to β as individuals go from susceptible to infected. Individuals in the infected group move at rates proportional to μ and γ into either the recovered or deceased group. The ratio of β/γ is an estimation of the R0 which is the reproduction number or probability of an individual in a population infecting someone else.

In their basic form these models do not consider geographic, demographic or policy contexts. However, geo-specific features such as commuter patterns and housing density are significant factors in understanding how COVID-19 spreads through communities (as discussed in our previous article examining geographic similarity in relation to COVID-19). Here we modify the basic SIRD model to capture variations in geography: we add a term to capture local adherence to social distancing, and we adjust the number of susceptible individuals to factor in vulnerable populations across the country. The equations are adjusted as follows:

Where the addition of ρ captures the extent to which individuals are choosing to shelter in place, thus flattening both the infections and mortality curves. Here N becomes the proportion of the population who are eligible for testing[1].

*Animation of an SIRD model with changing rho parameter*

**Curve fitting and parameter extraction**

With this modified SIRD model in hand, we set about fitting it to county-level US COVID-19 data. Our goal is to develop a robust understanding of the characteristics of curves across counties (what counties saw gentler progressions, exponential growth, or sudden spikes). We assume all parameters are unknown and attempt to optimize the parameters using nonlinear least squares regression to fit our model to actual COVID-19 data. Under the hood, we’re iteratively optimizing the bounded parameters of the SIRD model β, γ and ρ (minimizing the sum of squared differences) so that the model closely follows the curve/growth patterns seen in actual data[2]. Using estimated parameters from the model gives us a succinct way to describe the impact of social-distancing, the virus’ virality and the contact rate of populations over time, across the country.

With governments regularly revising the number of cases and deaths reported as new data comes in, we cannot be 100% certain that the numbers we see reported on a given day will be the same the following week. To account for this uncertainty, we artificially add noise to our dataset by assuming that errors will be normally distributed and that variation will diminish over time as reported numbers stabilize.

*Example of output from polynomial regression showing values of the three parameters and the curve fitted to real world data for the Los Angeles county*

**Relationship between model parameters and spatial features**

After extracting model parameters using nonlinear least squares regression, we study the relationship between these parameters and our county-level spatial features. The relationships to the parameter capturing adherence to social distancing are statistically significant, albeit fairly weak. It is positively and negatively correlated with higher percentages of white and black populations respectively. This speaks to similar findings that show adherence to social distancing policy is lower in communities of color, as individuals are more likely to work in jobs that are deemed essential or have higher in-person contact. This parameter is also negatively correlated with the rate of testing per 100k people, which suggests that counties with more rigorous testing frameworks are also taking social distancing measures more seriously.

*Selected correlates of the extracted Rho parameter (capturing the effectiveness of social distancing)*

Looking at the derived contact rate (ρ and β) and Topos’ geospatial features surfaces some intuitively sensible relationships: The contact rate is higher in denser, urban areas and in areas where residents have travelled further in the last 7 days, but lower in areas where the median distance to hospitals is larger (sparse/ rural communities). There are also some interesting relationships between the derived contact rate and ethnicity and voting. These findings echo correlations described in our previous article.

*Selected correlates of the extracted contact-rate parameter Beta*Rho*

**Clustering geographies based on curves**

The output of our regression is a set of curves that closely approximates[3] the observed spread of COVID-19 in the US and the three parameters of the equations that describe the various rates of change in population groups. With these parameters in hand, we can now group counties in the US based on the functions governing their curves. In particular, we use K-means clustering to group counties based on estimated parameters[4].

*Total number of cases over time by cluster.*

Looking at the above chart, we see that our clustering has successfully distinguished various characteristics of the outbreak across the US. The counties that had early, rapid outbreaks (like those in Westchester, New York City and Yonkers) are grouped together (cluster-4). The counties that had outbreaks a few weeks later (in mid-March) but also saw rapid growth (like those in Ann Arbor and Baton Rouge) are also clustered together, as are other groups that saw later outbreaks with gentler slopes.

*Growth in new infections versus population per square mile, sized according to total cases and colored by cluster.*

Visualizing the relationship between recent growth, density and total cases by cluster reveals some interesting insights. We can see that counties that are currently (as of April 23, 2020) seeing the largest case percentage increases are those in cluster 0, 1, and 3 — those that saw outbreaks start later. We also observe counties that are bucking the “high density leads to a large outbreak” trend. Namely, those in Virginia such as Harrisonburg and Clayton which have above-average population density but below-average case numbers and growth.

*Left: Counties that are seeing above average growth in new cases with below average density (people per square mile). Right: Counties with below average growth in new cases with above average density (as of 23rd April 2020)*

Virginia Beach City, VA (cluster-1), has a population of 2500/square mile, well above the national average. But it has seen very low recent growth in new cases (as of 23rd April) and low numbers of total cases. Conversely, Apache County AZ (cluster-3), has a population density of 6.4/square mile and less than half the population of Norfolk City (~71k) but is seeing double digit growth in new cases and is above the 75th percentile in terms of total number of cases (as of 23rd April).

*Downtown Virginia Beach City, VA*

*Percentage of the population staying at home over the last 7 days*

*Five Clusters displayed on the map*

The average contact rate in counties with early and large outbreaks are significantly higher than all other counties which is to be expected. But recent social distancing measures have been heeded as mobility data from the last 7 days shows that these counties have a larger proportion of people staying at home, as can be seen in the above table.

In this article, we demonstrated how the variations in outbreaks over time reflect adherence to social distancing and the behavioral patterns of people. We also showed that although density plays a critical role in how the outbreak has spread, there are a number of smaller, highly dense communities that appear to be bucking the trend. This article reflects insights we gathered while developing the Topos COVID-19 Compiler — we encourage you to check out our site and share your feedback.

Footnotes

- We acknowledge that eligibility might vary by region and over time; our assumptions about who is eligible for testing are those over 50, health care workers and those with underlying medical conditions
- In our implementation μ was not a bounded parameter and so was not included at this step
- We used the coefficient of determination R2 as measure for goodness of fit for the regression. The mean R2 across all counties analyzed on 23rd April 2020 was 0.92
- We chose five clusters by using a heuristic that graphs the number of clusters versus the average internal sum of squares distances per cluster, where the “optimal” number is found when the internal distances no longer decrease linearly.

Dense urban counties with large populations have the most concentrated numbers of cases and deaths caused by COVID-19. This can be attributed to many factors beyond their respective population density, particularly the fact that New York City and Boston (cities containing counties with some of the highest concentration of cases) are global travel hubs. While population density, to a large degree explains the high level of mortality⁽¹⁾ these cities have experienced, there are other factors that also have a statistically significant relationship to mortality and infection⁽²⁾: the number of daily commuters using public transportation networks, the percentage of residential buildings with 50+ units, even the density of pizza restaurants (a subject near and dear to our hearts) in a neighborhood. In this article, we will explore the similarities between counties based on a variety of factors and see whether this can give us insight into death rates across the U.S caused by COVID-19.⁽³⁾ Left: Number of commuters who take public transportation in relation to the total number of deaths. Size of the circles indicates the number of days since stay at home policy was enacted. Both and X and Y axis follow a logarithmic scale. Right: Percentage of large multifamily residential buildings versus total number of deaths as of April 13th

*Percentage of large multifamily residential buildings versus total number of deaths as of April 13th*

**Correlation to total death count**

As a location intelligence company, Topos works with thousands of features that describe the way we live together and move around as humans — the types of buildings we inhabit, the modes of transport we use to get to work, the variety of establishments that make up our cities and neighborhoods. We transform these features into billions of data points that give us a holistic understanding of place.

One of the first questions we asked as the pandemic started to unfold was, “What can our features tell us about how the virus is spreading through communities?” Looking at the relationship between the total number of deaths per county and our features, two of the most significant relationships we found were⁽⁴⁾ the percentage of the population who take public transportation (explaining ~70% of the variation in total deaths across the country) and the percentage of 50+ unit residential buildings (explaining ~45%). These relationships remain significant, although with a slightly lower R2 value, when removing counties in New York state.

*Selection of features correlated to total death counts per county across the country*

**Similarities and dissimilarities between counties**

As a company, one of the core questions we have sought to answer from the beginning is: how do we understand similarity and ‘distance’ between geographies in the 21st century?

Given that the density of residential buildings with 50+ units and use of public transportation show such a strong relationship to the number of deaths caused by COVID-19 we can study the similarities and differences between counties based on mode of transit and residential housing type⁽⁵⁾ to see if they can give any insight into how the virus may impact a community. We construct our transit + housing based similarity metric using ‘cosine distance’ to measure the similarity across geographies. Cosine distance is a calculation frequently used to study the distance between entities in high dimensional vector spaces; here it enables us to compare counties across several relevant dimensions at once.

Based on this transit + housing based similarity metric, the two most similar counties to Brooklyn (we chose Brooklyn as it has one of the highest mortality rates in the country as of April 13th, 2020) in terms of commuter mode of transportation and housing type are Manhattan and The Bronx, not surprising, given that New York City has the most extensive public transportation network and highest concentration of high rise buildings in the country. Following closely behind are counties in DC, San Francisco, and Chicago.

*Left: Breakdown of percentages of transportation type for most similar counties. Right: Breakdown of percentages of transportation type for dissimilar counties*

*Left: Percentage breakdown of housing type for dissimilar counties Right: Percentage breakdown of housing type for most similar counties.*

*Top: 20 most similar counties to Brooklyn (Kings) in terms of our similarity metric encompassing residential building type and commuter mode of transportation.*

![image not found][deaths_since]
[deaths_since]: https://miro.medium.com/max/1000/1*hIvG2qI8rsBxf5ArEc-wgg.png
*Moving average of deaths since deaths reached > 10 of most similar counties to Brooklyn in terms of transportation and housing type*

Similar trajectories of daily deaths can be seen in many of the most similar (as measured by the transport + housing-based similarity metric) counties to Brooklyn. This is not surprising as both the density of public transportation and percentage of 50+ unit residential buildings correlate highly with the numbers of deaths we’ve seen so far. San Francisco however seems to be bucking the trend. This could be due to a myriad of factors, below we highlight three: SF has a lower proportion of public transit commuters, a slightly lower density of large residential buildings (both metrics feeding our similarity calculation) and the county implemented the stay at home policy five days earlier than Brooklyn did.

*Stay Home Policy status in San Francisco and Kings County*

*Number of cases over time for most dissimilar counties to Brooklyn*

If we look at the number of cases over time for the most dissimilar counties to Brooklyn, in terms of the distributions of commuter transportation and housing type, we can see that the growth of cases is a lot slower than that of the counties in NYC. Whereas initially Brooklyn saw cases double every 5 days, in Waynesboro, Virginia or Attala, Mississippi cases are doubling every 14–16 days. None of these counties have seen any deaths so far (as of April 13th, 2020).

**Similarities between cities**

When we zoom out to look at the data at the city-level (here we use Combined Statistical Area to define a city), the features that correlate most highly with death rates differ slightly from those at the county level. There is a strong relationship between older populations and higher death rates from COVID-19; and with the removal of New York City from the analysis, the density of single family detached homes becomes a highly correlated feature, explaining ~65% of the variation.

*Selection of features correlated to total death counts per CSA across the country excluding New York City*

This time, we will look at similar cities to a current (as of April 13th, 2020) well known hotspot — Detroit — and see which cities are most similar based on the highly correlated features listed above: age of the population, density of residential building types, and the distribution of household income.

*Top: 20 closest/most similar cities to Detroit in terms of age of the population, transportation, and building units*

The most similar cities are widely dispersed geographically, from Louisville to Greensboro. Picking 3 of the most similar cities, Louisville, Dayton and Birmingham, we can see that although their daily death rates are currently much lower than Detroit’s, the growth rates are starting to ramp up. Given the similar distributions of older vulnerable populations, these cities may see cases rise even further.

Footnotes

- 1: Population density accounts for 45% of the variance in mortality rates across the country, p < 0.0001
- 2: We used Pearson correlation to study the relationship between the number of COVID-19 cases + deaths and other factors.
- 3: We look at death rates as opposed to the number of cases due to the variation in testing policy across the country.
- 4: Pearson correlation calculated April 13th 2020 across the US. Data comes from the 2018 American Community Survey
- 5: Similarity as measured by the mode of transportation + housing type includes: percentage of commuters who drive, take public transit, cycle, work remotely, walk, and use a car service. As well as percentage of single family detached, single family attached, mobile homes, RV vans, and buildings with units ranging from 2, 3–4, 5–9, 10–19, 20–49, and 50+

Data Sources:

- COVID-19 cases by county, The New York Times (source)
- 2018 demographics by county, American Community Survey, 2018 (source)
- US election data, MIT Election Lab (source)
- Local Covid-19 policy, Kaiser Family Foundation (source)