COVID-19: An attempt to predict Confirmed Cases in India
The COVID-19 pandemic continues to ravage the world. Even as global infections crossed 2.6 million, India’s number at around 21,370 seems modest, given we are home to one-sixth of the world’s population. Based on data from Johns Hopkins, in per capita terms, only 16 in a million people in India are infected by COVID-19, vs 338 in a million people globally (as of 22nd April 2020). Things in India are not as bad… but what does the future look like?
Given my interest in numbers and trends, I have been trying to figure out if we could forecast the trends for COVID-19. I requested the data behind the popular forecasts (Johns Hopkins-CDDEP, BCG and others), but these were allegedly disputed or not for public dissemination, and I did not get a response. In general, I noticed that most of the forecasts did not provide day-wise numbers. On a log scale without supporting numbers, it was difficult to decipher from the presentations and reports what the forecasters wanted to say. I would not have been able to read even my own forecast chart without the accompanying numbers.
Expert forecasts for the US compiled by FiveThirtyEight showed huge variation. My colleague Gunnvant has created a data scraper and visualization tool for COVID-19. However, I did not find any good forecasts for India.
Some bad ones are out there. “A five-member Central team has projected that the number of COVID-19 cases in Mumbai will touch an estimated 42,604 by April 30 and spiral to 6,56,407 by May 15. Based on mathematical modelling for Mumbai by the Union Ministry of Health on April 16”
Source: The Hindu
The assumptions are too simplistic: a doubling time of about 3.8 days is maintained throughout the forecast period. Such high numbers are great for scaremongering, grabbing eyeballs and making headlines. The state government is disputing these numbers. They should. Such “mathematical modelling” has been done by team members who had no understanding of either mathematics or modelling. These forecasts add negligible value. May I direct these ill-trained forecasters to some courses at Jigsaw Academy …
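As a rough check on that projection, the doubling time implied by the two quoted Mumbai figures can be backed out in a few lines of Python. This is only a sketch: it assumes pure exponential growth over the 15 days between April 30 and May 15, and the function name is my own, not something from the Ministry's model.

```python
import math


def implied_doubling_time(start_cases, end_cases, days):
    """Days needed for cases to double, assuming constant exponential growth."""
    return days * math.log(2) / math.log(end_cases / start_cases)


# Figures quoted in The Hindu: 42,604 by April 30 and 6,56,407 by May 15.
doubling_days = implied_doubling_time(42_604, 656_407, 15)
print(f"Implied doubling time: {doubling_days:.1f} days")  # prints 3.8 days
```

In other words, the projection simply keeps doubling the case count every 3.8 days, with no allowance for the growth rate slowing.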
Given my absolute lack of knowledge on diseases, I was initially hesitant to try to forecast it. I take solace from the words of Mark Weir of Ohio State’s ecology, epidemiology, and population health program:
I looked at this as a data forecasting problem and decided to build a simple time series model. Having spent over a decade forecasting revenues, profits and the unknowable stock prices of my coverage universe, I was used to being wrong and forecasting things I had no idea of! Here is the result, the link to my COVID-19 confirmed infections predictions for India: https://docs.google.com/spreadsheets/d/1dc9hwCSz7hoqkgymPghar0AnN80weDgRICQ2qXrmxB0/edit?usp=sharing
When I built the model, these are the things I wanted to have:
- A narrow gap between the upper and lower bounds of the estimates. The range is therefore not a 95% confidence interval.
- A mean estimate that would, hopefully, have less than 5% error against actuals.
- Steady/sticky estimates that would update as new information came in but not be too sensitive to minor changes.
You may view the details in the Google sheet. However, you will not be able to edit or change anything. You may copy it to your own Google Drive if you would like to make any changes. All changes in the forecasts are recorded, and ideally these will be updated once a day.
The data is sourced from Johns Hopkins (details in the Google spreadsheet). As some of the data is country-wise and some is state-wise (for countries like the US, China and Australia), we use groupby in Python to aggregate it by country and export the result as an Excel file. We use a simple time series forecasting model to predict the number of confirmed COVID-19 infections over the next seven days. We also highlight the upper and lower bounds of the estimates, and we check the difference between our mean estimate and the actual numbers. My daily forecasts are available from 11th April, and since then the actual number has been within 5% of the forecast. Here are my forecasts for the next seven days.
The model is a work in progress and I am considering some fine-tuning. The lower bound is easier to predict, as cumulative confirmed cases cannot fall below the latest actual figure. The upper bound needs to be tested, especially once the lockdown is lifted and the rate of spread may increase. I look forward to extending the duration of the forecast, as well as seeing if we can predict the peak of the infection in India. I hope to share the model soon.
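For readers who want to experiment in the meantime, here is a minimal sketch of the kind of pipeline described above. It is not the model behind the Google sheet. It assumes a plain log-linear trend fitted to the last seven days, a crude band taken from the fit residuals, and standard-library code standing in for the groupby step. All function names and sample numbers are illustrative.

```python
import math


def aggregate_by_country(rows):
    """Sum state/province-level series into one cumulative series per country.

    `rows` is a list of (country, cumulative_counts) pairs; a country that is
    reported state-wise appears several times.
    """
    totals = {}
    for country, series in rows:
        if country in totals:
            totals[country] = [a + b for a, b in zip(totals[country], series)]
        else:
            totals[country] = list(series)
    return totals


def forecast_log_linear(series, horizon=7, window=7):
    """Extrapolate a least-squares line fitted to log(cases) (assumed model).

    Returns one (lower, mean, upper) tuple per future day. Needs strictly
    positive counts and at least two points in the window.
    """
    recent = series[-window:]
    n = len(recent)
    xs = list(range(n))
    ys = [math.log(v) for v in recent]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    # Crude band: the largest in-sample residual on the log scale.
    spread = max(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
    last = series[-1]
    forecasts = []
    for step in range(1, horizon + 1):
        mean = math.exp(intercept + slope * (n - 1 + step))
        # Cumulative counts cannot fall, so nothing goes below the last actual.
        mean = max(last, mean)
        lower = max(last, mean * math.exp(-spread))
        upper = mean * math.exp(spread)
        forecasts.append((lower, mean, upper))
    return forecasts


def pct_error(forecast, actual):
    """Absolute percentage error of a forecast against the actual figure."""
    return abs(forecast - actual) / actual * 100


# Made-up numbers, purely to show the call pattern.
rows = [("Australia", [10, 12, 15]), ("Australia", [5, 6, 8]), ("India", [100, 110, 121])]
by_country = aggregate_by_country(rows)
next_week = forecast_log_linear(by_country["India"], horizon=7, window=3)
```

Clamping every estimate at the latest actual figure mirrors the point that the lower bound cannot be less than actuals. The residual-based band is deliberately narrow and should not be read as a statistical confidence interval.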
Given these limitations, honestly, I am surprised the simple model has reasonably good predictive power. I decided to post it on a public forum to (i) make myself update it daily and (ii) see if the model continues to be as good at predicting the numbers, especially under public scrutiny!
Note that my predictions keep changing each day as fresh data comes in. My prediction for today’s (23rd April) confirmed cases has increased by 4% over the last seven days. I am searching for the peak and waiting to see the numbers fall. Hopefully, my numbers will prove excessive and we will see them reduce… Unfortunately, the forecasts seem to be edging up. All models are right until they go wrong! Hopefully, this one errs by predicting too much, and the numbers end up being lower than forecast…
Let’s have a more sensible discussion on numbers and expectations. I had estimated that India would have around 11,000 confirmed infections by 14th April and that there would be a push to keep the lockdown intact. With cases at around 20,000 currently, going to around 35,000 by 30th April and expected to cross 40,000 by 3rd May, are we looking at a partial lockdown, at least, continuing? We will know soon enough…
Ok, we all agree that Mumbai hitting 6.5 lakh cases by 15th May is baloney. However, while the experts in the Union Ministry of Health expect over 42,000 confirmed cases in Mumbai alone by 30th April, do I have the audacity to suggest that the whole of India will have fewer than 42,000 cases by 30th April?
Yes, I do. Game on! And because I back myself, may the better forecaster win!
I offer my views with the knowledge that diseases, medicine and healthcare are not my area of expertise. This is an attempt at predictive time series analysis. There are a lot of bad models out there, and I am confident this will be better than most.
Also, given that many discussions on the topic have been polarized by political leanings and viewpoints, I would like to stress that these forecasts are not meant to promote any ideology or offer judgment on government policy decisions.
My only wish is that the governments, both state and central, focus on improving healthcare infrastructure and facilities in India, while they leave the forecasting to those who can!