Posts on Carlo Bailey

Predicting COVID-19 With Tax Returns

Wed, 10 Jun 2020 14:18:53 -0400

Post I made for the Topos COVID-19 app – original post on medium here

Income disparities and COVID-19

The relationship between income inequality and COVID-19 has been widely covered by various sources over the last 60 days. Findings show that the number of deaths and hospitalizations is much higher in low-income neighborhoods and in cities with high levels of inequality.[1] Most of these studies rely on income data provided by the US Census, which is self-reported, often extrapolated from relatively small samples (as in the ACS[2]), and fairly simple in the way income is measured (i.e. “Median Household Income”). A more nuanced and complete view of income can be gleaned from studying IRS income tax data. Tax data is only available publicly at the zip code level, but it provides a highly detailed economic portrait of neighborhoods, particularly in the types of deductions that are claimed (dependents, capital gains, education credits, etc.). And rather than being voluntarily self-reported (as is the case with the Census/ACS), tax returns are mandated by law, with consequences for false reporting.

Thus while Census data can tell us how many people in a geography have salaries within specific intervals or what the median income of a neighborhood is, tax data can tell us how many people earned $25k — $50k, what proportion of those earnings were deductible (and for what reasons), and how much went to healthcare contributions. In this article, we study the relationship between fine-grained income tax data and COVID-19 cases at the zipcode level in NYC.

To focus our study, we started with tax metrics that generally indicate the top and bottom of the economic spectrum, looking at tax deductions that are only open to earners below a certain threshold (like the Child tax credit), or income from financial instruments generally utilized by the wealthy (like Capital gains). The table below shows a selection of the correlation coefficients between cases per capita and income metrics in NYC.

Table showing correlations between cases per capita and various IRS income metrics

The strongest correlation (R = 0.79) to cases per capita is the average Child tax credit amount per tax return. In 2017, the most recent year for which tax data is available, an individual or family had to earn less than $75k or $110k (respectively) to qualify for it. The amount you receive is determined by how much you earn and how many children are in your household; thus it serves as an indicator for low to middle income families with many mouths to feed. Zip codes that see the highest rates of child tax credits are lower-middle class neighborhoods like 11239 (Canarsie), 11436 (Jamaica), and 10462 (Pelham). On the other end of the spectrum, zip codes that see the lowest levels of child tax credits per return are 10005 (Lower Manhattan), 10021 (Upper East Side), and 10003 (Union Square), neighborhoods with some of the wealthiest residents in the city.

Bars indicate the average child tax credit amount per capita, while the red lines indicate cases per capita

Another strong indicator for COVID-19 cases per capita is the proportion of returns claiming Capital gains. Capital gains reflect income earned from the sale of property or an investment, a form of wealth that typically benefits the highest rungs of society. Zip codes with the largest proportion of tax returns claiming capital gains are 10282 (Tribeca) and 10021 (Lenox Hill), where 60% and 56% of tax returns, respectively, include capital gains. On the other end of the spectrum, 10452 (High Bridge) and 10457 (West Bronx) have 1% of returns with capital gains claims. Residents of neighborhoods in the bottom quartile of tax returns claiming capital gains are roughly 4 times more likely to die from COVID-19 than residents of neighborhoods in the top quartile.

Total deaths related to COVID-19 for zip codes in the upper and bottom quartile in terms of proportion of tax returns claiming capital gains

One of the strongest negatively correlated metrics is the percentage of returns with self-employed health insurance deductions. To be eligible for this deduction, an individual must be a business owner who pays for their own insurance. Zip codes with high proportions of self-employed individuals with health insurance deductions are 10024 (Upper West Side), 10021 (Upper East Side) and 10022 (Gramercy Park), each with between 7–9% of returns claiming this deduction. At the other end of the spectrum are 10455 (Hunts Point), 10453 (Morris Heights), and 10472 (Pelham), each with 0 returns claiming this deduction.

Predicting Cases

With so many strong linear relationships between per capita COVID-19 cases and IRS income metrics in New York, we decided to build a regression model to see if we can predict cases[3] within a zip code using only tax data.[4]

Predicted vs actual cumulative cases per capita in NYC. R2 = 0.88, NRMSE = 9%

Selecting appropriate features to power the model without overfitting or biasing the predictions is a tricky problem. We established 2 criteria to guide our selection: (1) p values had to be below 0.001 and (2) the 95% confidence interval (established by doing “non-parametric” bootstrap resampling) of the estimated R value had to be above 0.5. These two conditions left us with 48 unique features to train the model. To mitigate overfitting and multicollinearity, we chose to model the data using a custom[5] kernel ridge regression algorithm. This algorithm combines two properties that alleviate our concerns — regularization and the kernel trick. At a high level, the kernel trick allows us to discover patterns in the data in a higher dimensional space while keeping the input dimensionality low. Regularization penalizes features to minimize their impact, thus preventing overfitting. Using this approach we achieved an R2 of 0.88 and an NRMSE[6] of 9% using 10-fold cross validation.
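The modeling step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code behind the results above: the data is synthetic, the `alpha` and `degree` values are arbitrary, NRMSE is assumed to be RMSE divided by the range of the target, and all function names are hypothetical. The power kernel is assumed to take the form k(x, y) = -||x - y||^degree. Because that kernel is only conditionally positive definite, the sketch uses a general linear solve rather than a Cholesky factorization.

```python
import numpy as np


def power_kernel(A, B, degree=1.0):
    """Assumed power kernel: k(x, y) = -||x - y||**degree for every row pair."""
    diff = A[:, None, :] - B[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return -(dist ** degree)


def fit_kernel_ridge(X, y, alpha=1.0, degree=1.0):
    """Return dual coefficients a solving (K + alpha * I) a = y."""
    K = power_kernel(X, X, degree)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)


def predict_kernel_ridge(X_train, dual_coef, X_new, degree=1.0):
    """Predict as the kernel-weighted sum over training rows."""
    return power_kernel(X_new, X_train, degree) @ dual_coef


def kfold_predictions(X, y, n_splits=10, alpha=1.0, degree=1.0, seed=0):
    """Out-of-fold predictions: each row is predicted by a model that never saw it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    preds = np.empty(len(X))
    for test_idx in folds:
        train = np.ones(len(X), dtype=bool)
        train[test_idx] = False
        coef = fit_kernel_ridge(X[train], y[train], alpha, degree)
        preds[test_idx] = predict_kernel_ridge(X[train], coef, X[test_idx], degree)
    return preds


def nrmse(y_true, y_pred):
    """RMSE divided by the observed range of the target (an assumed normalization)."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())


# Synthetic stand-in for zip-code-level tax features and case rates.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 5))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=150)

oof = kfold_predictions(X, y)
print(f"10-fold NRMSE: {nrmse(y, oof):.3f}")
```

The regularization strength `alpha` and the kernel `degree` would normally be tuned inside the cross-validation loop rather than fixed as they are here.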

Extensibility to other geographies

With these strong results in hand, we wanted to see how extensible a model trained on NYC tax data is to other geographies. Tax and COVID-19 data are highly skewed: NYC has far higher income and case counts than anywhere else in the country. To mitigate this, we looked at tax and COVID data normalized relative to each region. We gathered data for four additional cities (Chicago, Baltimore, San Francisco, and Richmond) and ran the model to predict normalized cumulative cases for each city. Model accuracy is shown in the table below:

Model performance in other geographies. Cosine Distance measures similarity to NYC, smaller => more similar

Left: choropleth of model performance by zipcode in Chicago. Right: choropleth of model performance by zipcode in Baltimore. Darker => model is more accurate

The model translates very well to Chicago, achieving an R2 of 0.72 and an NRMSE of 11%, meaning the predicted cumulative cases are usually within ±11% of the actual number. In contrast, the model performed poorly in Richmond, with an R2 of 0.5 and an NRMSE of 15%. One possible explanation for the difference in performance across geographies is the similarity between NYC and the other cities. As in our previous post exploring COVID across similar geographies, we constructed vectors for each city from metrics with a statistically significant relationship to COVID cases, such as commuter transportation mode and residential housing types, and looked at the distance between places in this high dimensional space. Although we only have four data points, the model appears to perform better in cities that are more similar to NYC in terms of transportation and building types. On average, the model also has a lower error rate in denser, more urban neighborhoods. Looking at the above map of error rates for Chicago and Baltimore, predictions for centrally located zip codes like Magnificent Mile, Chicago or Midway East, Baltimore are within approximately ±1% of the actual case numbers. In contrast, predictions for primarily residential zip codes on the peripheries like Montclare, Chicago or Essex, Baltimore are off by well over 20%.

In addition to training a model on NYC data, we also tested a model trained on four cities together, predicting cases for a withheld fifth. For example, we trained a model using data from Baltimore, Richmond, NYC, San Francisco and predicted cumulative cases in Chicago. Including multiple cities significantly improved the model performance for Chicago, achieving an R2 of 0.84 and an NRMSE of 9%. However, introducing additional cities lowered the model performance in the case of Baltimore and NYC. The below table shows model performances for various configurations of training and test sets.

Model performance across various geographical training sets

The fact that IRS data alone can explain over 80% of the variation in COVID-19 cases speaks to the many findings that link the impact of the pandemic to income inequality. Exploring data points that are outside the purview of traditional epidemiology could allow policy makers, medical professionals, and others to better anticipate the effects of future pandemics.

Footnotes:

1: Inequality measured by Gini index at the state level
2: ACS is sent to approximately 295,000 addresses monthly (or 3.5 million per year).
3: Of course, the number of cases in New York is growing by the day, so for this exercise we are predicting the relative/normalized number of cumulative cases in New York for a particular date (30th May 2020).
4: As the data is highly right skewed (just a few zip codes have very large case counts, while many others have relatively low counts), we transformed the data by taking the square root to obtain a more normal distribution.
5: Extending the built-in list of kernels in scikit-learn to include a Power kernel (also known as the unrectified triangular kernel)
6: Normalized root mean squared error

Disease models & differential equations: connecting geographies with time series clustering

Sat, 02 May 2020 14:18:53 -0400

Post I made for the Topos COVID-19 app – original post on medium here

Projections from mathematical models of infectious disease are guiding policy decisions around the world in the fight against COVID-19. While most of these models are specific to the institutions that develop them, they share basic mathematical principles: they divide a population into different groups and try to simulate how the population transitions from one group to the next. In particular, the SIRD model, one of the most frequently used, divides the population into four groups: those who are susceptible to the virus (S), have become infected (I), have recovered (R), and have died (D). In this article we explore a modified version of the SIRD model to study COVID-19 time series data across the US. Specifically, we use nonlinear least squares regression to fit an SIRD model to real world data and use the estimated parameters of the model to cluster geographies across the US.

Animation of an SIRD model with changing parameters (from Wikipedia)

What are disease models?

Disease models (such as SIRD) stem from research done by a group of British doctors, mathematicians and epidemiologists that was presented in a series of articles published from 1927 to 1933. Their aim was to explain the rapid rise and fall in the number of infected patients observed in epidemics such as the London cholera outbreak of 1865. Contemporary disease models are used to predict properties of an epidemic such as its duration, prevalence, and the virality of its spread. The simplest models make some basic assumptions: everyone has the same chance of becoming infected, the infected are equally infectious, and the level of contact between people in a population is equally distributed. Advanced implementations subdivide populations into more granular groups based on factors like age, sex, and health status. They may also incorporate exogenous factors such as population density and policy decisions.

The mechanics of an SIRD model

An SIRD model can be described through a series of differential equations.
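In its standard textbook form (assumed here, with the symbols defined just below), the system reads:

```latex
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta I S}{N} \\
\frac{dI}{dt} &= \frac{\beta I S}{N} - \gamma I - \mu I \\
\frac{dR}{dt} &= \gamma I \\
\frac{dD}{dt} &= \mu I
\end{aligned}
```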

Where γ is the rate of recovery (the inverse of the duration of sickness), β is the transmission rate, μ is the mortality rate and N is the total population. The first equation governs the rate of change for the susceptible group, which decreases at a rate proportional to β as individuals go from susceptible to infected. Individuals in the infected group move at rates proportional to γ and μ into the recovered or deceased group, respectively. The ratio β/γ is an estimate of R0, the basic reproduction number: the expected number of people that one infected individual goes on to infect in a fully susceptible population.

In their basic form these models do not consider geographic, demographic or policy contexts. However, geo-specific features such as commuter patterns and housing density are significant factors in understanding how COVID-19 spreads through communities (as discussed in our previous article examining geographic similarity in relation to COVID-19). Here we modify the basic SIRD model to capture variations in geography: we add a term to capture local adherence to social distancing, and we adjust the number of susceptible individuals to factor in vulnerable populations across the country. The equations are adjusted as follows:

Where the addition of ρ captures the extent to which individuals are choosing to shelter in place, thus flattening both the infections and mortality curves. Here N becomes the proportion of the population who are eligible for testing[1].

Animation of an SIRD model with changing rho parameter

Curve fitting and parameter extraction

With this modified SIRD model in hand, we set about fitting it to county-level US COVID-19 data. Our goal is to develop a robust understanding of the characteristics of curves across counties (what counties saw gentler progressions, exponential growth, or sudden spikes). We assume all parameters are unknown and attempt to optimize the parameters using nonlinear least squares regression to fit our model to actual COVID-19 data. Under the hood, we’re iteratively optimizing the bounded parameters of the SIRD model β, γ and ρ (minimizing the sum of squared differences) so that the model closely follows the curve/growth patterns seen in actual data[2]. Using estimated parameters from the model gives us a succinct way to describe the impact of social-distancing, the virus’ virality and the contact rate of populations over time, across the country.
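The fitting procedure can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the results: ρ is assumed to scale the transmission term (so the effective contact rate is β·ρ), μ is held fixed, the "observed" curve is synthetic, and the parameter bounds, starting values, and function names are hypothetical. It uses `scipy.optimize.least_squares`, which supports bounded nonlinear least squares.

```python
import numpy as np
from scipy.optimize import least_squares

MU = 0.01          # mortality rate, held fixed in this sketch
N = 1.0            # work in population fractions
DAYS = 100
I0 = 1e-4          # assumed initial infected fraction


def simulate_infected(beta, gamma, rho, days=DAYS, steps_per_day=4):
    """Integrate the assumed modified SIRD model with fixed-step Euler updates
    and return the infected fraction at the end of each day."""
    dt = 1.0 / steps_per_day
    S, I = N - I0, I0
    infected = np.empty(days)
    for day in range(days):
        for _ in range(steps_per_day):
            new_infections = rho * beta * S * I / N
            S, I = (S - dt * new_infections,
                    I + dt * (new_infections - (gamma + MU) * I))
        infected[day] = I
    return infected


def residuals(params, observed):
    """Difference between simulated and observed infected curves."""
    beta, gamma, rho = params
    return simulate_infected(beta, gamma, rho) - observed


# Synthetic "observed" data generated from known parameters.
observed = simulate_infected(beta=0.6, gamma=0.1, rho=0.5)

start = np.array([0.5, 0.2, 0.8])                 # initial guess
bounds = ([0.0, 0.0, 0.0], [2.0, 1.0, 1.0])       # assumed parameter bounds
initial_cost = 0.5 * np.sum(residuals(start, observed) ** 2)

fit = least_squares(residuals, start, bounds=bounds, args=(observed,))
beta_hat, gamma_hat, rho_hat = fit.x
print(f"beta*rho = {beta_hat * rho_hat:.3f}, gamma = {gamma_hat:.3f}")
```

In this assumed form β and ρ enter only as a product, so the data constrains β·ρ rather than each value separately; the bounds are what keep the individual estimates in a plausible range.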

With governments regularly revising the number of cases and deaths reported as new data comes in, we cannot be 100% certain that the numbers we see reported on a given day will be the same the following week. To account for this uncertainty, we artificially add noise to our dataset by assuming that errors will be normally distributed and that variation will diminish over time as reported numbers stabilize.
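One way to express this is sketched below. The noise model is an assumption made for illustration: the standard deviation is taken to be a fraction of each day's count, largest for the most recent report and decaying exponentially with the age of the report. The `max_rel_sd` and `half_life_days` values and the function names are hypothetical.

```python
import numpy as np


def noise_scale(n_days, max_rel_sd=0.10, half_life_days=7.0):
    """Relative standard deviation per day: the newest report (last entry)
    gets max_rel_sd, and it halves for every half_life_days of age."""
    age = np.arange(n_days)[::-1]          # age in days; 0 for the newest report
    return max_rel_sd * 0.5 ** (age / half_life_days)


def add_reporting_noise(counts, rng, **kwargs):
    """Perturb a daily series with zero-mean normal noise, clipped at zero."""
    counts = np.asarray(counts, dtype=float)
    sd = noise_scale(len(counts), **kwargs) * counts
    return np.clip(counts + rng.normal(0.0, sd), 0.0, None)


rng = np.random.default_rng(0)
reported = np.array([10, 25, 60, 140, 300, 580, 990, 1500], dtype=float)
noisy = add_reporting_noise(reported, rng)
```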

Example of output from the regression showing values of the three parameters and the curve fitted to real world data for Los Angeles County

Relationship between model parameters and spatial features

After extracting model parameters using nonlinear least squares regression, we study the relationship between these parameters and our county-level spatial features. The relationships to the parameter capturing adherence to social distancing are statistically significant, albeit fairly weak. It is positively and negatively correlated with higher percentages of white and black populations respectively. This speaks to similar findings that show adherence to social distancing policy is lower in communities of color, as individuals are more likely to work in jobs that are deemed essential or have higher in-person contact. This parameter is also negatively correlated with the rate of testing per 100k people, which suggests that counties with more rigorous testing frameworks are also taking social distancing measures more seriously.

Selected correlates of the extracted Rho parameter (capturing the effectiveness of social distancing)

Comparing the derived contact rate (ρ and β) with Topos’ geospatial features surfaces some intuitively sensible relationships: the contact rate is higher in denser, urban areas and in areas where residents have travelled further in the last 7 days, but lower in areas where the median distance to hospitals is larger (sparse, rural communities). There are also some interesting relationships between the derived contact rate and ethnicity and voting. These findings echo correlations described in our previous article.

Selected correlates of the extracted contact-rate parameter Beta*Rho

Clustering geographies based on curves

The output of our regression is a set of curves that closely approximates[3] the observed spread of COVID-19 in the US and the three parameters of the equations that describe the various rates of change in population groups. With these parameters in hand, we can now group counties in the US based on the functions governing their curves. In particular, we use K-means clustering to group counties based on estimated parameters[4].
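A minimal sketch of this step with scikit-learn's `KMeans` follows. The parameter table is synthetic, the standardization step is an assumption, and the loop over `k` simply records the within-cluster sum of squares (`inertia_`) that an elbow plot would show.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one row per county with fitted (beta, gamma, rho).
rng = np.random.default_rng(1)
centers = np.array([
    [0.9, 0.10, 0.9],
    [0.6, 0.10, 0.6],
    [0.4, 0.20, 0.5],
    [0.3, 0.05, 0.3],
    [0.2, 0.15, 0.8],
])
params = np.vstack([c + 0.01 * rng.normal(size=(40, 3)) for c in centers])

# Put the three parameters on a comparable scale before clustering.
scaled = StandardScaler().fit_transform(params)

# Within-cluster sum of squares for each candidate number of clusters.
inertia = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertia[k] = model.inertia_

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scaled)
```

Plotting `inertia` against `k` gives the curve used by the elbow heuristic: the chosen `k` is where adding another cluster stops reducing the within-cluster distances by much.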

Total number of cases over time by cluster.

Looking at the above chart, we see that our clustering has successfully distinguished various characteristics of the outbreak across the US. The counties that had early, rapid outbreaks (like those in Westchester, New York City and Yonkers) are grouped together (cluster-4). The counties that had outbreaks a few weeks later (in mid-March) but also saw rapid growth (like those in Ann Arbor and Baton Rouge) are also clustered together, as are other groups that saw later outbreaks with gentler slopes.

Growth in new infections versus population per square mile, sized according to total cases and colored by cluster.

Visualizing the relationship between recent growth, density and total cases by cluster reveals some interesting insights. We can see that the counties seeing the largest percentage increases in cases (as of April 23, 2020) are those in clusters 0, 1, and 3, where outbreaks started later. We also observe counties that are bucking the “high density leads to a large outbreak” trend, namely those in Virginia such as Harrisonburg and Clayton, which have above-average population density but below-average case numbers and growth.

Left: Counties that are seeing above average growth in new cases with below average density (people per square mile). Right: Counties with below average growth in new cases with above average density (as of 23rd April 2020)

Virginia Beach City, VA (cluster-1) has a population density of 2,500 per square mile, well above the national average. But it has seen very low recent growth in new cases (as of 23rd April) and low numbers of total cases. Conversely, Apache County, AZ (cluster-3) has a population density of 6.4 per square mile and less than half the population of Norfolk City (~71k), but is seeing double-digit growth in new cases and is above the 75th percentile in terms of total number of cases (as of 23rd April).

Downtown Virginia Beach City, VA

Percentage of the population staying at home over the last 7 days

Five Clusters displayed on the map

The average contact rate in counties with early and large outbreaks is significantly higher than in all other counties, which is to be expected. But recent social distancing measures have been heeded: mobility data from the last 7 days shows that these counties have a larger proportion of people staying at home, as can be seen in the above table.

In this article, we demonstrated how the variations in outbreaks over time reflect adherence to social distancing and the behavioral patterns of people. We also showed that although density plays a critical role in how the outbreak has spread, there are a number of smaller, highly dense communities that appear to be bucking the trend. This article reflects insights we gathered while developing the Topos COVID-19 Compiler — we encourage you to check out our site and share your feedback.

Footnotes

1: We acknowledge that eligibility might vary by region and over time; we assume that those eligible for testing are people over 50, health care workers, and those with underlying medical conditions.
2: In our implementation μ was not a bounded parameter and so was not included at this step.
3: We used the coefficient of determination R2 as a measure of goodness of fit for the regression. The mean R2 across all counties analyzed on 23rd April 2020 was 0.92.
4: We chose five clusters by using a heuristic that graphs the number of clusters versus the average internal sum of squares distances per cluster, where the “optimal” number is found when the internal distances no longer decrease linearly.

What do the similarities and differences between places tell us about how COVID-19 is spreading?

Wed, 15 Apr 2020 20:54:28 -0400

Post I made for the Topos COVID-19 app – original post on medium here

Dense urban counties with large populations have the most concentrated numbers of cases and deaths caused by COVID-19. This can be attributed to many factors beyond their respective population density, particularly the fact that New York City and Boston (cities containing counties with some of the highest concentration of cases) are global travel hubs. While population density to a large degree explains the high level of mortality⁽¹⁾ these cities have experienced, there are other factors that also have a statistically significant relationship to mortality and infection⁽²⁾: the number of daily commuters using public transportation networks, the percentage of residential buildings with 50+ units, even the density of pizza restaurants (a subject near and dear to our hearts) in a neighborhood. In this article, we will explore the similarities between counties based on a variety of factors and see whether this can give us insight into death rates across the U.S. caused by COVID-19.⁽³⁾

Left: Number of commuters who take public transportation in relation to the total number of deaths. Size of the circles indicates the number of days since stay at home policy was enacted. Both the X and Y axes follow a logarithmic scale. Right: Percentage of large multifamily residential buildings versus total number of deaths as of April 13th

Correlation to total death count

As a location intelligence company, Topos works with thousands of features that describe the way we live together and move around as humans — the types of buildings we inhabit, the modes of transport we use to get to work, the variety of establishments that make up our cities and neighborhoods. We transform these features into billions of data points that give us a holistic understanding of place.

One of the first questions we asked as the pandemic started to unfold was, “What can our features tell us about how the virus is spreading through communities?” Looking at the relationship between the total number of deaths per county and our features, two of the most significant relationships we found were⁽⁴⁾ the percentage of the population who take public transportation (explaining ~70% of the variation in total deaths across the country) and the percentage of 50+ unit residential buildings (explaining ~45%). These relationships remain significant, although with a slightly lower R2 value, when removing counties in New York state.
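The "explaining ~70% of the variation" phrasing corresponds to the square of the Pearson correlation coefficient. A small sketch with synthetic numbers (the feature and death values below are made up for illustration, and `pearson_r` is a hypothetical helper):

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Made-up county values: share of transit commuters and total deaths.
transit_share = np.array([0.02, 0.05, 0.10, 0.20, 0.35, 0.55])
deaths = np.array([3.0, 9.0, 15.0, 41.0, 66.0, 118.0])

r = pearson_r(transit_share, deaths)
print(f"r = {r:.2f}, share of variance explained (r^2) = {r ** 2:.2f}")
```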

Selection of features correlated to total death counts per county across the country

Similarities and dissimilarities between counties

As a company, one of the core questions we have sought to answer from the beginning is: how do we understand similarity and ‘distance’ between geographies in the 21st century?

Given that the density of residential buildings with 50+ units and use of public transportation show such a strong relationship to the number of deaths caused by COVID-19, we can study the similarities and differences between counties based on mode of transit and residential housing type⁽⁵⁾ to see if they give any insight into how the virus may impact a community. We construct our transit + housing based similarity metric using ‘cosine distance’ to measure the similarity across geographies. Cosine distance is a calculation frequently used to study the distance between entities in high dimensional vector spaces; here it enables us to compare counties across several relevant dimensions at once.
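As a concrete illustration, cosine distance is one minus the cosine of the angle between two feature vectors. The sketch below uses hypothetical five-dimensional county vectors; the real metric uses the full set of transit and housing shares listed in the footnotes.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means identical direction."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical county vectors: shares of commuters who drive, take transit,
# and walk, followed by shares of single-family homes and 50+ unit buildings.
county_a = [0.20, 0.60, 0.10, 0.10, 0.45]
county_b = [0.25, 0.55, 0.10, 0.15, 0.40]
county_c = [0.90, 0.01, 0.02, 0.80, 0.01]

print(cosine_distance(county_a, county_b))  # small: similar profiles
print(cosine_distance(county_a, county_c))  # larger: dissimilar profiles
```

Because the measure depends only on the direction of the vectors, two counties with the same mix of transit and housing types are treated as similar regardless of their absolute size.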

Based on this transit + housing based similarity metric, the two counties most similar to Brooklyn in terms of commuter mode of transportation and housing type are Manhattan and The Bronx (we chose Brooklyn as it has one of the highest mortality rates in the country as of April 13th, 2020). This is not surprising, given that New York City has the most extensive public transportation network and highest concentration of high rise buildings in the country. Following closely behind are counties in DC, San Francisco, and Chicago.

Left: Breakdown of percentages of transportation type for most similar counties. Right: Breakdown of percentages of transportation type for dissimilar counties

Left: Percentage breakdown of housing type for dissimilar counties Right: Percentage breakdown of housing type for most similar counties.

Top: 20 most similar counties to Brooklyn (Kings) in terms of our similarity metric encompassing residential building type and commuter mode of transportation.

Moving average of deaths since deaths reached > 10 for the counties most similar to Brooklyn in terms of transportation and housing type

Similar trajectories of daily deaths can be seen in many of the counties most similar to Brooklyn (as measured by the transport + housing-based similarity metric). This is not surprising, as both the density of public transportation and the percentage of 50+ unit residential buildings correlate highly with the numbers of deaths we’ve seen so far. San Francisco, however, seems to be bucking the trend. This could be due to a myriad of factors; below we highlight three: SF has a lower proportion of public transit commuters, a slightly lower density of large residential buildings (both metrics feed our similarity calculation), and the county implemented its stay at home policy five days earlier than Brooklyn did.

Stay Home Policy status in San Francisco and Kings County

Number of cases over time for most dissimilar counties to Brooklyn

If we look at the number of cases over time for the most dissimilar counties to Brooklyn, in terms of the distributions of commuter transportation and housing type, we can see that the growth of cases is a lot slower than that of the counties in NYC. Whereas initially Brooklyn saw cases double every 5 days, in Waynesboro, Virginia or Attala, Mississippi cases are doubling every 14–16 days. None of these counties have seen any deaths so far (as of April 13th, 2020).
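Doubling times like these can be derived from two cumulative counts under an assumption of exponential growth between them. A minimal sketch (the counts are hypothetical and `doubling_time` is an illustrative helper, not the calculation used for the figures above):

```python
import math


def doubling_time(count_start, count_end, days_elapsed):
    """Days for cases to double, assuming exponential growth between two counts."""
    return days_elapsed * math.log(2) / math.log(count_end / count_start)


# Hypothetical counts: growing from 100 to 200 cases over 5 days.
print(doubling_time(100, 200, 5))
```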

Similarities between cities

When we zoom out to look at the data at the city-level (here we use Combined Statistical Area to define a city), the features that correlate most highly with death rates differ slightly from those at the county level. There is a strong relationship between older populations and higher death rates from COVID-19; and with the removal of New York City from the analysis, the density of single family detached homes becomes a highly correlated feature, explaining ~65% of the variation.

Selection of features correlated to total death counts per CSA across the country excluding New York City

This time, we will look at similar cities to a current (as of April 13th, 2020) well known hotspot — Detroit — and see which cities are most similar based on the highly correlated features listed above: age of the population, density of residential building types, and the distribution of household income.

Top: 20 closest/most similar cities to Detroit in terms of age of the population, transportation, and building units

The most similar cities are widely dispersed geographically, from Louisville to Greensboro. Picking 3 of the most similar cities, Louisville, Dayton and Birmingham, we can see that although their daily death rates are currently much lower than Detroit’s, the growth rates are starting to ramp up. Given the similar distributions of older vulnerable populations, these cities may see cases rise even further.

Footnotes

1: Population density accounts for 45% of the variance in mortality rates across the country, p < 0.0001
2: We used Pearson correlation to study the relationship between the number of COVID-19 cases + deaths and other factors.
3: We look at death rates as opposed to the number of cases due to the variation in testing policy across the country.
4: Pearson correlation calculated April 13th 2020 across the US. Data comes from the 2018 American Community Survey
5: Similarity as measured by the mode of transportation + housing type includes the percentage of commuters who drive, take public transit, cycle, work remotely, walk, and use a car service, as well as the percentage of single family detached homes, single family attached homes, mobile homes, RVs and vans, and buildings with 2, 3–4, 5–9, 10–19, 20–49, and 50+ units.

Data Sources:

COVID-19 cases by county, The New York Times (source)
2018 demographics by county, American Community Survey, 2018 (source)
US election data, MIT Election Lab (source)
Local Covid-19 policy, Kaiser Family Foundation (source)