Statisticians, like artists, have the bad habit of falling in love with their models.
— George Box
An unsophisticated forecaster uses statistics as a drunken man uses lamp posts — for support rather than for illumination.
— Andrew Lang
All predictive modelers are familiar with these quotes and similar anecdotes. A personal favorite still makes me feel clever even after I have uttered it for the thousandth time: “All models are wrong, but some are useful” (George Box).
These tidbits remind us that models and statistical tests are tools that complement domain knowledge and common sense but cannot substitute for either. Despite reminding myself of the pitfalls of overconfidence, I have been guilty of hypocrisy.
As data mushrooms, models become more complex, roles become more specialized, and terminology becomes more confusing (and over-hyped), we need to be honest with ourselves and with stakeholders, and not allow hubris in our models to displace common sense.
Easier said than done, but following some guidelines can help. Some steps are outlined below, but this is by no means an exhaustive list:
- Understand the assumptions of the model and its limitations.
- Consider exogenous forces that may or may not be measurable.
- How was the data collected and reported? Could noise or bias be introduced through the process?
- Test early and often. Question surprising results. Consider an agile approach allowing yourself to fail fast. A lot can be learned from failures, and the longer it takes to admit failure, the more attached we become.
- Understand the model's purpose and focus on that aspect (not every goal is a point estimate!).
- Understand the variables: Are any variables proxies? Is data leakage a risk?
- Consider relationships when designing and interpreting the model. Are confounding or colliding effects accurately being captured? Are the relationships causal, correlative or spurious?
The final point is often the most overlooked, since traditional statistics and machine learning do not provide sufficient tools to address the nature of the relationships among variables. Methods aimed at teasing out variable relationships are mostly focused on other aspects such as reducing over-fitting (regularization, randomized controlled trials, etc.). However, a causal revolution is underway!
To learn more, I highly recommend Judea Pearl’s new book, The Book of Why: The New Science of Cause and Effect, where he demonstrates the unbelievable effectiveness of causal modeling and its simplicity. Basic statistics is all that’s required to understand the powerful concepts that can be deployed through diagrams (graphs).
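To make the distinction between causal and spurious relationships concrete, here is a minimal simulation sketch. It assumes a hypothetical three-variable diagram in which Z drives both X and Y and there is no arrow between X and Y; the variable names, sample size and noise levels are illustrative assumptions, not taken from the book or from any real dataset.

```python
import random

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """Residuals from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical diagram: Z -> X and Z -> Y, with no arrow between X and Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

raw = corr(x, y)                                   # association induced by Z alone
adjusted = corr(residuals(x, z), residuals(y, z))  # association after adjusting for Z

print(f"raw correlation:      {raw:.2f}")
print(f"adjusted correlation: {adjusted:.2f}")
```

Under these assumptions X and Y are clearly correlated in the raw data, yet the association largely disappears once the common cause Z is adjusted for. The diagram, not the correlation, tells us which adjustment is appropriate; with a collider instead of a confounder, the same adjustment would create an association rather than remove one.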
Mortality Modeling in an Uncertain Environment
Mortality modeling in life insurance went through a renaissance from the early 1990s through the 2000s, as the models now often referred to as the “M” models were published and refined and software libraries became available. Surveys such as “A Quantitative Comparison of Stochastic Mortality Models Using Data From England and Wales and the United States” [1], along with later variations and extensions, go into detail on the differences and applications of each.
Recent deterioration in US mortality trends, particularly among specific demographics [2], has caused concern for insurers over the impact on their balance sheets. “Opioid crisis” has unfortunately become part of our vocabulary, and insurers want to know: how much does the epidemic, along with other growing concerns including obesity and diabetes, affect the mortality of their insured portfolios?
One area receiving attention is the use of population mortality data to supplement insured mortality data, whether for setting base mortality assumptions (where insured data has low credibility) or for setting mortality improvement and trend assumptions (where most homogeneous insured populations cover only a narrow time window). Do we need to update our models, our data or a combination of both?
Government data is slowly becoming a more reliable tool. In the US, the CDC [3] and Census Bureau [4] have recently provided APIs (Application Programming Interfaces) that make it easier to access and download larger datasets, including additional demographic data that may help segment the general population. This provides an additional data source beyond the Human Mortality Database, where the only demographic data provided are sex and age. Figure 1 illustrates the use of educational attainment as a segmentation factor.
Mortality Modeling with Cause-of-Death
Modeling mortality using a multi-decrement approach, typically by cause-of-death (COD), is a challenge that many have attempted with varying degrees of success. Segmenting mortality into separate causes, forecasting each cause (often independently) and then aggregating is proving to be very challenging. A few reasons are:
- The causes with the highest improvement lose share of total deaths over time (and vice versa), so the slowest-improving causes come to dominate and aggregate mortality improvement is underestimated [5]
- Interactions among causes are often ignored; for example, a cure or medical breakthrough in one cause implies “saved” individuals will die from another cause
- Too few COD groups means grouping together causes that may have dissimilar health and mortality dynamics
- Too many COD groups increase noise and exposure to reporting volatility
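The first reason in the list can be checked with simple arithmetic. The sketch below assumes two hypothetical causes, each projected independently at its own constant improvement rate; the starting rates and improvement rates are made up for illustration and do not come from any dataset.

```python
# Hypothetical two-cause example; all figures are illustrative, not real data.
rates = {"cause_a": 0.006, "cause_b": 0.004}        # starting death rates
improvement = {"cause_a": 0.03, "cause_b": 0.005}   # constant annual improvement

def project(years):
    """Project each cause independently at its own constant improvement rate."""
    return {c: rates[c] * (1 - improvement[c]) ** years for c in rates}

def aggregate_improvement(year):
    """One-year improvement in all-cause mortality implied by the projections."""
    now = sum(project(year).values())
    nxt = sum(project(year + 1).values())
    return 1 - nxt / now

for t in (0, 10, 25, 50):
    projected = project(t)
    share_a = projected["cause_a"] / sum(projected.values())
    print(f"year {t:2d}: share of cause_a = {share_a:.3f}, "
          f"aggregate improvement = {aggregate_improvement(t):.4f}")
```

The aggregate improvement in any year is the share-weighted average of the cause-specific improvements. As the fast-improving cause loses share, the aggregate drifts toward the slowest cause's rate, so the projected all-cause improvement falls below the level observed at the start even though no individual cause has slowed down.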
COD reporting is very complicated and elevates process variance. Standardized COD codes (International Classification of Diseases) evolve over time, and reporting guidelines and forms can vary from state to state. Furthermore, a variety of professionals can be responsible for reporting the COD including coroners, medical examiners, physicians, nurses or forensic pathologists. To add to the complication, primary cause-of-death is often not very clear.
Consider the scenario of Herbert. Herbert is 65 years old, and a life of heavy drinking contributed to early-onset Alzheimer’s. Now he relies on a nurse to visit him daily. One snowy day the nurse forgets to visit him; he fails to take his medication and subsequently dies of hypothermia after getting lost on a walk outside.
What is his primary COD — alcoholism, Alzheimer’s, negligence or hypothermia? Should only the primary causes be considered, or should the secondary causes listed on death certificates be given weight? These are not easy questions and often depend on the goal of the forecasting model, the allowable tolerance and other factors.
An example of the complications of relying on cause-of-death reporting has been highlighted in recent articles on NPR [6] and FiveThirtyEight [7] and in the academic journal Addiction [8]. The main theme is that, despite the alarming statistics on recent spikes, the number of opioid-related deaths is being underreported [5].
Another surprising result emerges when comparing the five states with the most drug overdose deaths in 2016, each with at least 3,500 total deaths, shown in Chart 2. At one end of the spectrum, death certificates in Ohio and New York identified at least one drug attributed to the overdose for 95% and 97% of deaths, respectively. At the other end, Pennsylvania identified at least one drug on only 55% of drug overdose deaths.
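To see why incomplete certificates lead to undercounting, consider a deliberately naive proportional adjustment. This is only a sketch: it assumes that certificates with no drug named involve opioids at the same rate as certificates that do name a drug, which is a strong assumption and not the method used in the cited studies. The function name and the death counts are hypothetical; only the 55% share echoes the figure quoted above.

```python
def adjusted_opioid_deaths(total_overdoses, share_specified, opioid_deaths_reported):
    """Scale reported opioid deaths up, assuming certificates with no drug named
    involve opioids at the same rate as certificates that name a drug."""
    specified = total_overdoses * share_specified
    if specified <= 0:
        raise ValueError("need at least some certificates with a drug specified")
    opioid_rate = opioid_deaths_reported / specified
    return opioid_rate * total_overdoses

# Hypothetical state: 4,000 overdose deaths, 55% of certificates name a drug,
# and 1,540 certificates name an opioid (illustrative counts only).
reported = 1540
estimate = adjusted_opioid_deaths(4000, 0.55, reported)
print(f"reported: {reported}, adjusted estimate: {estimate:.0f}")
```

Under this assumption the reported count understates the estimate by the factor 1 / 0.55, so a state with complete reporting and a state with 55% reporting are not comparable on raw counts alone.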
The lack of standardization in cause-of-death reporting across states, countries and professions adds another complication in modeling cause-of-death. Despite the data and modeling challenges, progress is being made.
A recent collaboration between our R&D centers in the US and Paris worked on producing a mortality model using cause of death. From this work, we have gained some valuable insights, a few of which are listed below:
- Mortality improvements by cause change over time. For example, statins were a huge benefit in the 1990s and 2000s, and a lot of research was focused on that area. Eventually returns diminish, and research money and time get focused on other areas. Hence, a two-step approach should lead to better results: model all-cause mortality first, then cause-specific mortality on the residuals.
- We found it easier to work with the age-at-death COD conditional distributions than with the mortality rates when incorporating dynamics among CODs [5].
- Adjustments, which may need to be subjective, may be necessary. For example, we have learned that recent increases in mortality from Alzheimer’s and dementia across ages are an artifact of improved COD reporting guidelines and not indicative of an actual trend.
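One way to read the first two insights together is as a top-down decomposition: forecast all-cause mortality first, then allocate it across causes using a conditional distribution that always sums to one. The sketch below is an illustration under that reading only and is not the model produced by the collaboration. The function names, cause groups, starting values and drift parameters are all hypothetical.

```python
import math

def forecast_all_cause(rate_now, annual_improvement, years):
    """Step 1: project the all-cause rate at a constant improvement rate."""
    return rate_now * (1 - annual_improvement) ** years

def forecast_shares(shares_now, log_drift, years):
    """Step 2: drift each cause's log-share linearly, then renormalize so the
    conditional distribution over causes still sums to one."""
    raw = {c: s * math.exp(log_drift[c] * years) for c, s in shares_now.items()}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}

# Hypothetical inputs for a single age; none of these come from real data.
shares_now = {"circulatory": 0.35, "cancer": 0.30, "other": 0.35}
log_drift = {"circulatory": -0.02, "cancer": -0.005, "other": 0.01}

all_cause = forecast_all_cause(0.010, 0.015, years=20)
shares = forecast_shares(shares_now, log_drift, years=20)
by_cause = {c: all_cause * s for c, s in shares.items()}

for cause, rate in by_cause.items():
    print(f"{cause:12s} share = {shares[cause]:.3f}, rate = {rate:.5f}")
```

Because the shares are renormalized, the cause-specific rates add up to the all-cause forecast by construction, and a gain in one cause automatically shifts deaths toward the others. That speaks to two of the difficulties listed earlier: the drag on aggregate improvement and the ignored interaction among causes.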
If you would like to learn more, please contact me, Tim Roy, at TRoy@SCOR.com.
[1] Cairns, Andrew J. G., David P. Blake, Kevin Dowd, Guy Coughlan, and David Epstein. 2007. “A Quantitative Comparison of Stochastic Mortality Models Using Data from England & Wales and the United States” (March 2007). Available at SSRN: https://ssrn.com/abstract=1340389 or http://dx.doi.org/10.2139/ssrn.1340389
[2] Case, Anne, and Angus Deaton. 2015. “Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century.” Proceedings of the National Academy of Sciences, 1-6.
[5] Oeppen, Jim. 2008. “Coherent forecasting of multiple-decrement life tables: a test using Japanese cause of death data.”
[8] Ruhm, Christopher J. 2018. “Corrected US opioid-involved drug poisoning deaths and mortality rates, 1999–2015.” Addiction. ISSN 0965-2140.