Will an XGBoost model reveal the same risk factors as those used by a physician?
Cardiovascular disease (CVD), or heart disease, is one of the leading causes of death in the United States. The Centers for Disease Control and Prevention estimates 647,000 deaths per year¹. CVD is an umbrella term that encompasses different heart conditions, including diseased blood vessels (atherosclerosis or vasculitis), structural problems (cardiomegaly), and irregular heartbeats (arrhythmia). Of these, the most common type of heart disease in the United States is coronary artery disease. Most of the time CVD is “silent,” and there is no diagnosis until individuals experience signs or symptoms of a heart attack, heart failure, or arrhythmia². Research has identified risk factors associated with developing CVD. These risk factors are either non-modifiable (factors that cannot be changed) or modifiable (factors that can be changed).
The non-modifiable risk factors are³:
- Increasing age
- Biological sex — men are at greater risk than women
The modifiable risk factors include³:
- Smoking tobacco
- High blood cholesterol
- High blood pressure
- Physical inactivity
A physician can use these risk factors to recommend lifestyle changes or treatment strategies for the patient. The question I want to investigate is: can an XGBoost tree model predict whether someone has CVD based on these same risk factors that physicians use?
The data used in this analysis comes from a dataset compiled from four hospitals in Cleveland, Hungary, Switzerland, and VA Long Beach, commonly referred to as the UCI Heart Disease dataset. It consists of 303 individuals with 14 attributes, of whom 138 presented without CVD and 165 presented with CVD. Originally there were 76 attributes, but published experiments use a subset of only 14. The target variable is the diagnosis of heart disease based on the diameter narrowing in any major blood vessel, with a cutoff of 50% (see attribute #14 below).
The 14 attributes used are:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type — 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results — 0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
8. thalach: maximum heart rate achieved
9. exang: exercise-induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST-segment — 1: upsloping, 2: flat, 3: downsloping
12. ca: number of major vessels (0–3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target: diagnosis of heart disease (angiographic disease status) — 0: < 50% diameter narrowing, 1: > 50% diameter narrowing
The dataset and full variable information can be found here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
An XGBoost tree model was used for two reasons: 1) the model is built by splitting on specific features, so the features driving a prediction can be identified, and 2) it is more robust than other forms of decision tree models⁴.
No feature engineering was done because 1) there were only 14 features, and 2) each feature was treated as an independent variable so that each feature’s contribution to the prediction could be examined. To verify this assumption, I checked for collinearity among the 14 attributes. Pearson’s correlation showed no strong correlation between variables (Fig 1).
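The collinearity check can be sketched as a small helper over a pandas correlation matrix. The function name and the 0.7 cutoff are my own choices, not from the article; with the real data you would pass in the 13 predictor columns of the UCI frame.

```python
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """Return predictor pairs whose |Pearson r| exceeds the threshold."""
    corr = df.corr(method="pearson")
    cols = corr.columns
    return [
        (a, b, round(float(corr.loc[a, b]), 3))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if abs(corr.loc[a, b]) > threshold
    ]

# With the UCI data (hypothetical local copy) this would be, e.g.:
# df = pd.read_csv("heart.csv")
# print(strong_pairs(df.drop(columns=["target"])))  # [] means no strong collinearity
```

An empty result is what justifies skipping feature engineering: no predictor is largely redundant with another.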
Furthermore, the data was split into a 70:30 train:test ratio. This split was necessary because the dataset contains only 303 individuals, which is relatively small.
To establish a baseline, I used the default values of the XGBoost tree with the objective set to “binary:logistic”. Then 10-fold cross-validation using stratified k-folds was run with accuracy as the validation metric. The test accuracy was about 81%, while the average validation accuracy across folds was 77.81% with a standard deviation of 11.09%. The gap suggests the baseline model overfit the data, and the high standard deviation across folds suggests that some folds underperformed. Hyperparameter tuning resolved both issues.
A randomized search with 10-fold cross-validation was used to determine the best parameters for the XGBoost tree. The best parameters were:
colsample_bytree: 1, learning_rate: 0.1, max_depth: 4, min_child_weight: 1e-05, n_estimators: 200, objective: ‘binary:logistic’, subsample: 0.5
Hyperparameter Tuned Model
Similar to the baseline model, 10-fold cross-validation was performed on the tuned model with the accuracy metric. The average accuracy across folds was 82.12% with a standard deviation of 7.47%, much better than the baseline. The accuracy on the test set was 84.6% and the area under the ROC curve (AUC) was 0.84. The tuned XGBoost performed better than the baseline without overfitting.
Validity of Model
A paper by Dinh et al. used an XGBoost model to predict diabetes and cardiovascular disease from the National Health and Nutrition Examination Survey (NHANES) dataset⁶. For the cardiovascular disease classification, their XGBoost AUC was 0.831. Even though these are somewhat different datasets, there was overlap between my model’s top five features and Dinh et al.’s. Dinh et al.’s XGBoost identified 1) age, 2) systolic blood pressure, 3) self-reported weight, 4) occurrence of chest pain, and 5) diastolic blood pressure as key contributors, whereas the top features in my model were 1) age, 2) cholesterol (chol), 3) maximum heart rate achieved (thalach), 4) resting blood pressure (trestbps), and 5) ST depression induced by exercise relative to rest (oldpeak). Despite the differing variables, the two models agreed on three features: age, blood pressure, and chest pain. (Note that oldpeak, the ST depression induced by exercise, is a reliable finding for the diagnosis of obstructive coronary atherosclerosis, which can cause chest pain due to decreased blood flow.)
The feature importance graph shows the number of times (the F score, not the F1 score) the XGBoost model splits on a feature. Age, for example, was split on many times to determine the presence of CVD.
One of the top features the XGBoost tree used to classify whether someone had heart disease was age, a non-modifiable risk factor. From a physiological standpoint, age is a determinant risk factor for cardiovascular disease. With age, the compliance of the aorta and carotid arteries decreases: they become stiffer, giving the elderly higher blood pressure than normal, which is itself a risk factor for CVD and atherosclerosis. Moreover, the 65-and-older age group is more likely to develop CVD⁷. Figure 5 depicts the percentage of people within each age group with CVD. Two insights can be gathered from this graph. First, many people over the age of 60 have CVD, but not all people over 65 are at risk. This can be due to successful aging, in which individuals maintain the physical function of their body with proper exercise⁸, whereas usual aging is the absence of overt cardiac pathology but with some functional decline. Second, the graph shows a good number of young and middle-aged individuals with CVD, which can be due to the modifiable risk factors mentioned in the introduction.
Modifiable Risk Factors
From the top five features, cholesterol and blood pressure are known risk factors of CVD that medical studies have shown lead to heart disease. However, one feature the XGBoost used to determine whether someone has CVD was the maximum heart rate (MHR) achieved. MHR tells us the average number of times the heart should beat per minute during exercise. The data dictionary does not specify how it was calculated, but MHR is usually estimated by subtracting one’s age from 220. The notable trend is that maximum heart rate declines with age, possibly due to the decrease in the number of SA nodal cells from apoptosis as one ages. Below is a table from the American Heart Association of MHR by age.
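The age-predicted rule mentioned above is a one-liner; the function name is mine, and the formula is the common 220-minus-age estimate rather than anything specified in the data dictionary:

```python
def max_heart_rate(age: int) -> int:
    """Age-predicted maximum heart rate in beats/min, via the common 220 - age rule."""
    return 220 - age

# MHR falls linearly with age under this rule:
for age in (30, 45, 60, 75):
    print(age, max_heart_rate(age))  # 30→190, 45→175, 60→160, 75→145
```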
The two graphs below (Figure 6 and Figure 7) depict the average MHR with and without CVD. The results were a little odd to me. Neither group reached the MHR indicated by the American Heart Association, but the group with CVD came closer to it than the group without. According to the academic literature, the maximal exercise-induced heart rate is inversely associated with cardiovascular mortality¹¹, so the higher the MHR, the lower the chance of CVD. One possibility is that these individuals have a resting heart rate that is already set high due to the heart’s compensatory mechanism. Another important aspect to note is the nature of the machine learning model: XGBoost is a mathematical model that classifies only by the numbers fed into it. Since there was a significant difference between the two groups, the model used MHR as a feature to make splits.
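The group comparison behind Figures 6 and 7 reduces to a mean of thalach by diagnosis. A sketch with made-up rows in place of the real data (the values here are illustrative only):

```python
import pandas as pd

# Hypothetical sample; the real analysis uses the UCI thalach and target columns.
df = pd.DataFrame({
    "thalach": [150, 165, 172, 140, 180, 155],
    "target":  [0,   1,   1,   0,   1,   0],
})

# Mean maximum heart rate achieved, split by CVD status (0 = no CVD, 1 = CVD).
means = df.groupby("target")["thalach"].mean()
print(means)
```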
A limitation of this analysis is that the data was not run through other tree-based models, such as LightGBM or random forest, which could have given better results. Another limitation is that the dataset did not come with a separate test set, so I had to carve my test set out of the 303 individuals, which reduces the number of training samples and can affect the results.
The XGBoost model did reveal risk factors similar to those used by a physician to assess the potential for CVD. This small assessment suggests that there is validity in using data science algorithms in medicine, and that a robust machine learning model built on a larger cardiovascular dataset could act as a preliminary screening tool for CVD.