Using public data to develop a model to predict PAD
In this analysis, data from reliable public sources are analyzed. Modern supervised machine learning techniques permit us to develop a method to predict who has Peripheral Arterial Disease (PAD) and to assess the risk of death amongst these patients. Because the data source is a robust database on the CDC website that is representative of the population of the United States of America, the mathematical model derived should apply best to those who reside in the USA. This analysis should by no means replace assessment by a physician, and that is certainly not the intent of the investigator. Rather, it should be considered a 'patient empowerment tool' that would permit patients who are at risk of developing PAD to assess themselves. It could then be considered a 'leg self exam', similar to the 'breast self exam' idea amongst those at risk for breast cancer. The investigator plans to put forth a webpage or an application in the near future where one could enter measurement data and the mathematical model would predict whether or not one has PAD.
Publicly available data sources
Several research studies are conducted in the USA using public funding, and data from such studies are available online. One such data source is the NHANES study.
'The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.'
NHANES is a representative sample of the US population and has been conducted over several decades. NHANES includes data on patient symptoms, physical examination and biometric measurements, along with mortality linkage files. Our idea was to take these data and diagnose Peripheral Arterial Disease (PAD) using the ankle brachial index (ABI), the ratio of the systolic blood pressure measured at the ankle to that measured in the arm. An ABI of less than 0.9 was taken to indicate PAD. Patients with an ABI greater than 1.3 were excluded from the analysis.
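The ABI rule described above can be sketched in a few lines of base R. This is only an illustration: the function name `classify_pad`, its argument names and the example pressures are made up here, and the actual NHANES variable names are not shown.

```r
# Hypothetical helper (not from the original analysis): classify PAD from
# ankle and arm systolic blood pressures, both in mmHg.
classify_pad <- function(ankle_sbp, arm_sbp) {
  abi <- ankle_sbp / arm_sbp
  # ABI > 1.3 is excluded (NA); ABI < 0.9 is labelled PAD; otherwise None
  status <- ifelse(abi > 1.3, NA_character_,
                   ifelse(abi < 0.9, "PAD", "None"))
  data.frame(abi = round(abi, 2), status = status, stringsAsFactors = FALSE)
}

# Made-up pressures for three illustrative people
example <- classify_pad(ankle_sbp = c(100, 130, 170),
                        arm_sbp   = c(125, 120, 120))
print(example)
```

Rows with an `NA` status correspond to the excluded group and would be dropped before modelling.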
Why is this important?
A lot of time and effort has gone into data collection and maintenance. The data tables are well defined and, although not in tidy data format, they are quite usable. R and RStudio are free software that can be easily downloaded. CRAN is a useful, publicly available repository for software downloads, and several useful help sites exist where one can learn how to use R. Additionally, for those with a desire to learn, several public sites such as Coursera offer great courses in statistics, data science and machine learning.
What does it matter?
We believe that medical research and breakthroughs can be crowdsourced. Interested individuals can learn and master the techniques at their own pace, download public data and reach important conclusions on medical conditions. We present here our own effort at developing a method to diagnose PAD and predict mortality.
Where is the data from?
Data from NHANES datasets are freely available at http://www.cdc.gov/nchs/nhanes/nhanes_questionnaires.htm. Of these datasets, those from 1999 to 2004 included ankle and arm pressures, allowing calculation of the ankle brachial index (ABI), which in turn allowed us to diagnose PAD in these patients. Although mortality linkage data were available, we had to write directly to the data team in order to get correct R code and successfully link the mortality and the NHANES data. Thus data from NHANES were sliced and diced using R and then analyzed.
Data Cleaning and processing
Data from the CDC were downloaded to a folder /Downloads/nhanes. The R markdown file is available online at —-. Once the required files are downloaded to the nhanes folder and the code is run, the results of the analysis are available. In order to reproduce the analysis, R, RStudio and the required R packages should be available on the computer. The variable seqn provided the means to merge the data tables together. Code provided by the Data Linkage team from the CDC allowed publicly available mortality data to be merged with the NHANES data.
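As a rough illustration of merging on seqn, the sketch below joins two toy tables with base R's `merge`. The tables and every column name other than `seqn` are invented for the example. They are not the actual NHANES or mortality linkage files, and this is not the CDC-provided linkage code.

```r
# Toy tables standing in for an examination file and a mortality file
exam <- data.frame(seqn = c(1, 2, 3),
                   calf_cm = c(36.5, 40.1, 33.0))   # hypothetical column
mortality <- data.frame(seqn = c(2, 3, 4),
                        deceased = c(0, 1, 0))      # hypothetical column

# Inner join: keep only participants present in both tables
merged <- merge(exam, mortality, by = "seqn")

# Left join: keep every examined participant, with NA where no mortality row exists
merged_all <- merge(exam, mortality, by = "seqn", all.x = TRUE)

print(merged)
print(merged_all)
```

Whether an inner or a left join is appropriate depends on how unmatched participants should be handled. The original analysis does not say which was used.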
Required R Packages
Data Quality and imputation
As a result of the data processing, a table with data on 6267 individuals and 11 columns was produced. The smoking variable had 2860 missing (NA) values. Smoking information was missing for about 2.6% of those with PAD, compared with about 43.0% of those without PAD. Thus simply dropping the NA values from the dataset would bias the estimates.
The NPBayesImpute package and the rfImpute function from the randomForest package, both with default settings, were used to impute categorical and continuous variables respectively.
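A per-group missingness check of the kind described above might look like the following base R sketch. The toy data frame and its column names are invented for illustration and do not reproduce the percentages reported here.

```r
# Toy data: PAD status and smoking status with some missing values
toy <- data.frame(
  PAD    = c("PAD", "PAD", "None", "None", "None", "None"),
  Smoker = c("Yes", "No", NA, "Yes", NA, "No"),
  stringsAsFactors = FALSE
)

# Proportion of missing smoking values within each PAD group
missing_rate <- tapply(is.na(toy$Smoker), toy$PAD, mean)
print(missing_rate)
```

When the rates differ sharply between groups, as they do in the real data, dropping incomplete rows removes far more people from one group than the other.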
What is Machine Learning?
'Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions.'
There are a number of machine learning approaches, including decision tree learning, association rule learning, artificial neural networks, inductive logic programming, support vector machines, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, sparse dictionary learning and genetic algorithms. After thoughtful evaluation of these, a random forest model was felt to be appropriate for our application.
After data cleaning and imputation to eliminate missing values, no further pre-processing of the data was done. Data were divided into train and test groups, with 75% of the data in the train dataset. The train data were used to develop the model, and the test data were held back to evaluate it.
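A 75/25 split can be sketched in base R as below. This is a minimal illustration on made-up data. The original analysis does not say which function performed the split, so the simple random sampling, the seed and the toy data frame are all assumptions.

```r
set.seed(42)  # arbitrary seed so the illustration is repeatable

# Toy data standing in for the cleaned table
dat <- data.frame(id = 1:100,
                  PAD = factor(rep(c("None", "PAD"), times = c(90, 10))))

# Randomly choose 75% of the row indices for training
train_idx <- sample(nrow(dat), size = floor(0.75 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

With a rare outcome such as PAD, a split stratified on the outcome may be preferable so that both groups contain a similar share of cases.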
Random Forest Machine Learning output
Output from Random Forest Machine learning is interpreted using a confusion matrix.
From Data School:
'A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.'
##           Reference
## Prediction None  PAD
##       None 1430  128
##       PAD     5    3
In order to understand the confusion matrix above it is important to understand what the random forest method did for us. It took the data from the train dataset and used it to develop a classification model. Then this model was applied to the test dataset. As a result of this the model predicted who would have PAD and who wouldn't. The test dataset already had information on who had PAD and who did not. Thus in the confusion matrix we have information on which patients were classified by the model as having PAD and actually had PAD and which ones the model said had PAD but actually did not. From this we can calculate specificity and sensitivity.
##                Accuracy : 0.9151
##                  95% CI : (0.9002, 0.9284)
##
##             Sensitivity : 0.9965
##             Specificity : 0.0229
##          Pos Pred Value : 0.9178
##          Neg Pred Value : 0.3750
##
##        'Positive' Class : None
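The reported statistics follow directly from the four counts in the PAD confusion matrix. The base R sketch below recomputes them by hand, treating 'None' as the positive class as in the output. The variable names are chosen here for illustration.

```r
# Counts copied from the PAD confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(1430, 5, 128, 3), nrow = 2,
             dimnames = list(Prediction = c("None", "PAD"),
                             Reference  = c("None", "PAD")))

# "None" is the positive class, as in the reported output
tp <- cm["None", "None"]  # predicted None, truly None
fn <- cm["PAD", "None"]   # predicted PAD, truly None
fp <- cm["None", "PAD"]   # predicted None, truly PAD
tn <- cm["PAD", "PAD"]    # predicted PAD, truly PAD

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
print(round(metrics, 4))
```

Rounded to four decimals, these match the reported accuracy, sensitivity, specificity and predictive values.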
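Thus the output tells us that although we wanted to model the presence of PAD, the majority of the patients did not have PAD, and so our random forest model is much better at identifying who does not have PAD. Because 'None' is the positive class, the sensitivity of 0.9965 describes patients without PAD, while the specificity of 0.0229 means that only 3 of the 131 test patients with PAD were correctly identified.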
The graph above illustrates that Age in years, Body Mass Index (BMI) and Calf circumference are important features in the random forest model predicting presence or absence of PAD.
Random Forest Model Predicting Death
In a similar fashion a random forest model for death was developed. Results are below.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 1214  176
##          1   62  114
##
##                Accuracy : 0.848
##                  95% CI : (0.8293, 0.8655)
##
##             Sensitivity : 0.9514
##             Specificity : 0.3931
##
##        'Positive' Class : 0
Thus, similarly, because more people were alive than dead in our dataset, the model is better at predicting who is alive. Its overall accuracy is about 85% (0.848).
The figure above illustrates that age in years, body mass index and calf circumference are adequate to predict who will live with about 85% accuracy and 95% sensitivity.
Application of this research
Using R and yhat we hope to create an application where you can enter your personal measurements and learn whether the model predicts that you do not have PAD. Those who are predicted as having PAD could then approach their physicians, with the understanding that the model is better at predicting those who do not have PAD. Given that PAD is an important disease that costs 4.37 billion dollars in the USA and can result in limb loss and death, it would be beneficial to empower people at large to screen themselves for PAD. Thus a simple tape measurement of the calf, an online tool and a motivated person could give us a calf self examination to diagnose PAD, in a fashion similar to breast self examination.
Generalized Linear Model
A common criticism of machine learning is that it is essentially a 'black box'. In order to assess the data in a different way, generalized linear model techniques were also applied to the data.
'In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Generalized linear models were formulated by John Nelder and Robert Wedderburn as a way of unifying various other statistical models, including linear regression, logistic regression and Poisson regression. ' - Wikipedia
The specific R command used to generate the model, glm(PAD ~ ., family=binomial, data=train[,-c(2)]), resulted in the following output.
## Response: PAD
##
## Terms added sequentially (first to last)
##
##
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## Ethnicity           4   12.906      4696     2699.5  0.011743 *
## Diabetes            3   37.799      4693     2661.7 3.117e-08 ***
## Hypertension        2   35.651      4689     2625.8 1.814e-08 ***
## Age_in_years        1  233.793      4686     2390.8 < 2.2e-16 ***
## Calf_Circumference  1    9.687      4684     2380.0  0.001856 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of Generalized Linear Model Output
A generalized linear model for PAD showed the features above as being highly significantly associated with the outcome PAD. The values quoted here appear to be model coefficients, which a binomial GLM reports on the log-odds scale (an odds value itself cannot be negative); exponentiating a coefficient gives the corresponding odds ratio. The log-odds of PAD decrease by 0.0737461 for every cm increase in calf circumference, while they increase by 0.0671621 per year of age and by 0.0502271 per unit increase in BMI. In comparison to Mexican Americans, the coefficient for Non-Hispanic whites was 0.3062672, while for Non-Hispanic blacks it was 0.6113957. Those without diabetes, when compared to those with diabetes, had a coefficient of -0.5282153. Those without hypertension, when compared to those with hypertension, had a coefficient of -0.2642184.
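If the reported values are coefficients on the log-odds scale (an assumption, since a binomial GLM reports coefficients this way and some of the values are negative), they can be turned into odds ratios by exponentiation. The sketch below uses the three continuous-variable values from the text. The vector name and the label `BMI` are chosen here for illustration.

```r
# Values copied from the text above; assumed to be log-odds coefficients
coefs <- c(Calf_Circumference = -0.0737461,
           Age_in_years       =  0.0671621,
           BMI                =  0.0502271)  # "BMI" label is hypothetical

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio below 1 (calf circumference) means the odds of PAD fall as the variable increases, and a ratio above 1 (age, BMI) means they rise.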