# Team - Scikit Earn
Team Lead - Clif Peters
Analyst - Jarrett Devereaux
Analyst - Alexander Pacheco
Analyst - Rebecca Morris
![[capstone.gif]]
## Table of Contents
[Table of Contents](#Table%20of%20Contents)
[Abstract/Executive Summary](#Abstract/Executive%20Summary)
[Project Plan](#Project%20Plan)
[Introduction](#Introduction)
[Analysis Opportunity](#Analysis%20Opportunity)
[Research Questions](#Research%20Questions)
[Hypotheses](#Hypotheses)
[Data](#Data)
[Measurements](#Measurements)
[Methodology](#Methodology)
[Computational Methods and Outputs](#Computational%20Methods%20and%20Outputs)
[Output Summaries](#Output%20Summaries)
[Campaign Implementation](#Campaign%20Implementation)
[Literature Review](#Literature%20Review)
[Final Research Questions](#Final%20Research%20Questions)
[Exploratory Data Analysis](#Exploratory%20Data%20Analysis)
[Methodology](#Methodology)
[RQ 1: What features are most correlated with patient readmission?](#RQ%201%20What%20features%20are%20most%20correlated%20with%20patients’%20being%20readmitted?)
[RQ 2: Are there disparities in readmission rates across demographic groups?](#RQ%202%20Are%20there%20disparities%20in%20readmission%20rates%20across%20demographic%20groups?)
[RQ 3: Are there identifiable patterns amongst the sampled data to allow preemptive identification of patients that are at high risk of readmission?](#RQ%203%20Are%20there%20identifiable%20patterns%20amongst%20the%20sampled%20data%20to%20allow%20preemptive%20identification%20of%20patients%20that%20are%20at%20high%20risk%20of%20readmission?)
[Analysis](#Analysis)
[Research Question 1: What features are most correlated with patient readmission?](#Research%20Question%201%20What%20features%20are%20most%20correlated%20with%20patient%20readmission?)
[Research Question 2: Are there disparities in readmission rates across demographic groups?](#Research%20Question%202%20Are%20there%20disparities%20in%20readmission%20rates%20across%20demographic%20groups?)
[Research Question 3: Are there identifiable patterns amongst the sampled data to allow preemptive identification of patients that are at high risk of readmission?](#Research%20Question%203%20Are%20there%20identifiable%20patterns%20amongst%20the%20sampled%20data%20to%20allow%20preemptive%20identification%20of%20patients%20that%20are%20at%20high%20risk%20of%20readmission?)
[Ethical Recommendations](#Ethical%20Recommendations)
[Challenges](#Challenges)
[Recommendations and Next Steps](#Recommendations%20and%20Next%20Steps)
[References](#References)
[Appendix](#Appendix)
[Visualizations](#Visualizations)
[Source Code](#Source%20Code)
[Live Notebook](#Live%20Notebook)
## Abstract/Executive Summary
The prevalence of diabetes in the US has been an issue of concern for decades. Scikit-Earn examined more than 100,000 diabetes-related hospital admissions at 130 hospitals in the US. In our EDA, we identified demographic groups that showed a higher risk of readmission, including elderly patients, women between the ages of 20 and 30, and Asian males. We used chi-square tests to quantify the significance of the relationships between categorical variables and readmission, and we then used the results of these tests to help inform the feature selection process when experimenting with models.
We found that many of the findings from our logistic regression and tree-based classification models agreed with each other, singling out the number of prior inpatient visits as the most important predictor of readmission. Further experimentation was performed to see whether the tree-based models would achieve better results with a reduced feature space; these models performed better with more features. One of the shortcomings of this dataset was the major gaps in certain clinical measurements, like weight and HbA1c levels. Despite the lack of these important measures, we were able to develop a logistic regression model with 87% accuracy using inpatient history alone.
## Project Plan
### Introduction
Diabetes is a chronic disease that affects approximately 38 million people in the United States. Those affected are more likely to experience adverse health outcomes and are at greater risk of hospital readmission following hospitalization. This increase in hospital readmission not only places a substantial burden on the healthcare system, but also indicates potential gaps in patient care and an opportunity to reduce medical costs. By analyzing the hospital admission data of patients with diabetes, we will be able to identify the primary factors that influence readmission risk and give healthcare providers the insight they need to implement more effective patient care strategies and prevent unnecessary readmissions.
### Analysis Opportunity
Leveraging a dataset representative of ten years of clinical care data for 130 hospitals and other treatment centers across the US, our team plans to investigate several research questions and build predictive models to evaluate readmission risks for diabetes patients. These models will provide data that can be used to evaluate disparities across socioeconomic groups, cluster patients into groups based on risk profile, and determine effective treatment plans for patients based on similarity to previous ones. Distributions of race and readmission status in the dataset are shown in the Appendix.

### Research Questions
RQ 1: What features are most correlated with patient readmission?
Hospitals and treatment centers would find value in knowing which tests and/or treatment options are most correlated with having to readmit patients for further treatment. For example, what is the relationship between hospital stay criteria and readmission risk? Which drug treatment options are predictive of readmission rates? If a hospital can predict that patients with a given diagnosis and treatment plan are likely to be readmitted, it can adjust the treatment to align with a plan that has demonstrated success at preventing readmission. The value of examining this question is to ensure best practices for patient management are known, minimizing the readmission rate, which is a key metric of treatment center performance.
RQ 2: Are there disparities in readmission rates across demographic groups?
Regulators and research institutions that rely on grants for a portion of their funding are often required to demonstrate that they treat all patients equitably. It is important to objectively evaluate and demonstrate adherence to these regulations. In addition, equitable treatment across demographic groups may mean that treatment options need to be adjusted to meet the needs of different patients. An analysis of readmission status across all of these groups can indicate whether treatment plans need to be examined further. The goal of this research question is to identify any disparities present in the readmission rates and to visually present objective evidence of those disparities if they exist.
RQ 3: Are there identifiable patterns amongst the sampled data to allow preemptive identification of patients that are at high risk of readmission?
Identification of patterns among patient populations is very valuable to caregivers, as this knowledge allows them to make the best choices for all parties as quickly as possible. This question focuses on identifying whether there are risk factors that allow a patient to be considered objectively high-risk and trigger alternative treatment plans. The value here is to provide data that can be used to review patient management practices and to evaluate these risk factors when making discharge decisions.
### Hypotheses
H1: Socioeconomic factors contribute to healthcare options in ways that may be reflected in readmission rate.
By clustering patients by socioeconomic factors, disparities amongst the groups can be identified from the data. Identifying these clusters and comparing the readmission rate among them will test this hypothesis.
H2: A longer hospital stay will be positively correlated with a higher risk of readmission.
It logically follows that a patient who has been in the hospital for an extended period of time has a more severe disease state and is therefore more likely to be readmitted once discharged. We will test this hypothesis by analyzing the relationship between length of stay and readmission in the data.
H3: There are risk factors that can allow a patient to be considered objectively high-risk, triggering alternative treatment plans to try to prevent readmission.
Typically, when a patient comes into a treatment center, their condition is evaluated and treated according to best practices. With data, we can determine whether there are factors, identifiable at initial diagnosis, that let us find the most similar previous patients and evaluate the risk of the current patient requiring readmission based on their outcomes.
### Data
The dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) and contains 101,766 inpatient records from patients diagnosed with diabetes over a 10-year period (1999-2008). Each observation represents a single unique hospital admission during which some form of diabetes was recorded as a diagnosis, although not necessarily the primary diagnosis. The data consists of 50 features that can be broadly categorized as: patient demographics, admission data, medical history, medications, and laboratory testing.
#### Patient Demographics
Information on patient demographics is included in the form of the categorical variables race, gender, and age. Possible values for race are Caucasian, Asian, African American, Hispanic, and other. Gender is classified as male, female, or unknown/invalid, and age has been grouped into 10-year intervals (e.g., 10-19, 20-29). These demographic characteristics will allow us to identify differences in readmission rates among social groups.
#### Admission Data
The data contains information on various aspects of hospital admission, such as the type of admission (emergency, urgent, etc.), the source of admission (e.g., physician referral, clinic referral, transfer from another hospital), and the length of time between admission and discharge. Also included in this category are attributes such as the medical specialty of the admitting physician and the number of days until the patient was readmitted if readmission occurred. The variable for readmission takes the value no if there was no record of readmission and the values <30 or >30 otherwise. <30 indicates that the patient was readmitted in less than 30 days and >30 indicates that the patient was readmitted more than 30 days later.
#### Medical History
Any records of patient hospital visits from the year preceding the current hospitalization are also incorporated into the data. The number of inpatient, outpatient, and emergency visits are all recorded individually and can be analyzed separately as potential predictors. In addition, a primary diagnosis, secondary diagnosis, and any other diagnoses are noted for each patient, along with the total number of diagnoses recorded in the system.
#### Medication
Prescriptions for 24 different drugs were tracked, and each drug is associated with its own variable indicating whether it was prescribed and whether the dosage was adjusted during the hospital stay. Possible values include no (the drug was not prescribed), up (the dosage was increased during the visit), down (the dosage was decreased), and steady (no change in dosage). The total number of medications administered to the patient throughout the hospital stay was also recorded.
#### Laboratory Testing
The dataset contains a variable indicating the total number of lab procedures performed while the patient was hospitalized, in addition to variables for the results of multiple lab tests that are particularly relevant to patients with diabetes. These include the patient’s maximum glucose serum as well as their HbA1c levels. Maximum glucose serum values were categorized into the following groups: >200, >300, normal, and none if a measurement was not provided. HbA1c levels were also grouped into categories: >7 if the test result was greater than 7% but less than 8%, >8 if the result was greater than 8%, normal if the result was within the normal range (i.e., less than 7%), and none if no measurement was provided.
### Measurements
To ensure our models are accurate and predictive, we’ll establish a key set of measurement objectives. When searching for patterns that identify patients likely to be readmitted, we should assess feature importance, bias, and reliability. To recognize the features most relevant to predicting our target, we will run a feature importance analysis, improving output quality. Biases are inherent errors within our data that can skew our predictions and lead to unfair or improper treatment; we explore methodologies to mitigate the biases present in our dataset.
Reliability confirms that our model can be trusted to produce accurate results consistently; cross-validation techniques help establish this. When interpreting the results of a predictive model, we will also examine precision, recall, and F1 score to verify that our predictions are valid.
### Methodology
For our second research question (demographic disparities), we’ll use a combination of chi-square tests and logistic regression analysis. During our EDA, we’ll also examine the correlation matrix, focusing on the entries that relate demographic features to the target variable. The chi-square tests can help us identify associations between demographic factors and readmission rates. We can then use logistic regression models to quantify the impact of demographic features. We’ll need to encode the categorical variables in our dataset using one-hot encoding.
For our first research question (the features most correlated with readmission), we’ll look at the feature importance of multiple models. We’ll analyze the feature importances of a fitted Random Forest model, along with the SHAP values from an XGBoost model. We’re leaning toward tree-based models because they typically work well with large, tabular datasets and tend to handle missing data well. We can try multiple approaches while training: one workflow with no imputation, and then other workflows with dropped rows, median imputation, and/or mean imputation. Analyzing the feature importances and SHAP values will allow us to rank the clinical and demographic features by their importance in predicting readmission outcomes.
To identify patterns among high-risk patients for the third question, we’ll start by conducting a k-means clustering analysis during the EDA phase. This will hopefully help us see whether there are distinct patterns or profiles of high-risk patients (e.g., clusters of elderly patients with comorbidities). After this cluster analysis, we can move on to the typical XGBoost classification workflow to predict readmission risk, then analyze the feature interactions using SHAP values. To evaluate the model, we can look at the AUC-ROC and precision-recall curves along with the F1 scores. Initially, we were just going to use the evaluation metrics from the XGBoost classifier to answer this question, but the phrase “identifiable patterns” in the question brings clustering into the equation. Also, since several of the studies in our literature review analyzed similar datasets using an XGBoost/logistic regression workflow, we want to answer this question from a new perspective.
### Computational Methods and Outputs
For examining demographic disparities, our group will leverage scipy.stats for conducting the chi-square tests and scikit-learn for implementing a logistic regression model. We’ll use the produced p-values to quantify how significant the associations between demographic factors and readmission rates are. During the EDA portion, we will generate crosstabs using Pandas to break down readmission rates by category.
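A minimal sketch of this crosstab-plus-chi-square step (the file name and column choices are assumptions for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("diabetic_data.csv")  # hypothetical file name

# Crosstab of readmission status by race, as in the EDA breakdowns.
crosstab = pd.crosstab(df["race"], df["readmitted"])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and expected frequencies under the independence assumption.
chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
```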
To determine the features correlated with readmission risk, we’ll use a combination of scikit-learn’s ensemble module to implement the Random Forest model and feature importance values, and the XGBoost library to implement XGBoost along with SHAP values.
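A sketch of this comparison, assuming a preprocessed feature matrix `X` (a DataFrame) and binary target `y`:

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier

# Impurity-based importances from a fitted Random Forest.
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
rf_importance = pd.Series(rf.feature_importances_, index=X.columns)
print(rf_importance.sort_values(ascending=False).head(10))

# SHAP values from a fitted XGBoost model rank features by the average
# magnitude of their contribution to individual predictions.
model = xgb.XGBClassifier(eval_metric="logloss").fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```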
Finally, we’ll implement k-means clustering using the cluster module from scikit-learn. We’ll use the elbow technique, based on the inertia values scikit-learn reports for each candidate k, to determine the optimal number of clusters. For the XGBoost classifier, we’ll again use the XGBoost library in Python.
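A sketch of the elbow technique, assuming a scaled numeric feature matrix `X_scaled`:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of k and record the inertia (within-cluster
# sum of squared distances) for each fit.
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()  # the "elbow" where inertia stops dropping sharply suggests k
```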
### Output Summaries
#### RQ 1: What features are most correlated with patients’ being readmitted?
The analysis will identify the features in the dataset that have the most impact on a patient's readmission risk. Multiple models and techniques will be used to evaluate each feature's level of correlation with our target variable. Once identified, a simple table will be created that includes a numerical representation of the correlation, along with other visual techniques to highlight significant levels of correlation.
Example table of Highly-Correlated Features
|Feature Name|Correlation Value with Target Variable|
|---|---|
|number_inpatient|0.178208|
|number_diagnoses|0.104660|
|number_emergency|0.090516|
|number_outpatient|0.082255|
|diabetesMed|0.057168|
|num_procedures|-0.044150|
|admission_source_id|0.041193|
|time_in_hospital|0.040309|
|num_medications|0.037399|
|num_lab_procedures|0.035441|
#### RQ 2: Are there disparities in readmission rates across demographic groups?
The analysis will identify whether there are disparities across different groups. A crosstab of readmission rates for each demographic group will show these trends in a clean, understandable way, and a table for each demographic grouping will be created to display the output of this analysis.
Example Crosstab of Readmission by Race
|Race|<30 Days|>30 Days|None|All|
|---|---|---|---|---|
|African American|2155|6634|10421|19210|
|Asian|65|161|415|641|
|Caucasian|8592|27124|40383|76099|
|Hispanic|212|642|1183|2037|
|Other|145|446|915|1506|
|Unknown|188|538|1547|2273|
|All|11357|35545|54864|101766|
#### RQ 3: Are there identifiable patterns amongst the sampled data to allow preemptive identification of patients that are at high risk of readmission?
The analysis will leverage clustering techniques to identify if there are commonalities among patients that are high-risk for readmission. Presentation of the results will include graphs with centroids plotted across features that are identified as highly correlated with readmission risk.
Cluster analysis and plotting of highest correlated features
### Campaign Implementation
Healthcare providers should incorporate the demographic disparity findings into their care plans, which can help hospitals define targeted intervention programs. When our analysis reveals that specific factors are correlated with a higher chance of readmission, hospitals can modify their discharge criteria and post-discharge care plans; for example, they might extend hospital stays for high-risk patients or prescribe different treatments. Our analysis of predictive features will also allow healthcare providers to develop a checklist for evaluating patient risk at admission.
## Literature Review
Hospital readmission rates are considered an important measure of performance in the healthcare industry. High readmission rates may signal problems with how a hospital operates, such as the quality of care patients receive, post-discharge support, and patient education (Dungan, 2012). Readmission rates among patients with diabetes are particularly troubling. While people with diabetes make up only 9% of the US population, they account for roughly 25% of all hospitalizations, of which a substantial portion are due to unnecessary readmissions (Rubin, 2015). Consequently, targeting diabetes related readmission rates as an area for improvement can significantly reduce the burden that unnecessary readmissions place on our healthcare system while simultaneously improving the quality of care received by those affected.
Frequent readmissions to the hospital increase the stress on the healthcare system. As such, readmission metrics are often cited for "research, quality improvement, and reimbursement decisions" (James et al., 2023). As the costs of healthcare continue to increase across the globe, additional research has been conducted into ways to reduce preventable readmissions. The research has led to the consensus that there are many different factors that contribute to a patient needing to be readmitted within 30 days of discharge. James et al. (2023) concluded that these readmission rates are representative of both the treatment quality of the care center and the community in which the patient resides.
In an analysis of diabetes-related hospital readmissions, Dungan (2012) found disparities across socioeconomic lines that impacted the rehospitalization rate for certain groups, including Hispanic and African American patients; uninsured and lower-income communities were also impacted. Our project aims to build on this understanding by examining a dataset with over 100,000 patient records distributed across 130 US hospitals, which may let us see whether these disparities are consistent across different datasets. Others have identified indications, at least, that such subpopulations exist: one group was able to identify “anomalous subpopulations where the likelihood of readmission was up to 1.8 times higher than that of the overall population” (Ogallo et al., 2021).
Diabetes mellitus, a chronic disease, leads to higher evaluation and management costs. Shang et al. (2021) have shown disparities in readmission rates driven by many factors: inpatient admissions, age, diagnosis, number of emergencies, and sex. These disparities showcase the need for plans that tailor treatment to those affected. Given such an analysis, a hospital could implement solutions to reduce the likelihood of readmission based on each patient's context.
Beyond the background research reviewed above, the dataset we plan to use has been selected and analyzed by other researchers with similar goals. That team determined that “number of inpatient admissions, age, diagnosis, number of emergencies, and sex” were influential in determining the readmission rate (Shang et al., 2021). In addition, they found that a random forest model was most suitable for making these predictions and recommended further clinical evaluation of their results. This information is useful for our analysis; however, we plan to continue with our approaches with an open mind and evaluate whether we get similar results.
## Final Research Questions
Research Question 1: What features are most correlated with higher readmission risk? For example, what is the relationship between hospital stay criteria and readmission risk? Which drug treatment options are predictive of readmission rates? One of our hypotheses is that a longer hospital stay will be positively correlated with a higher risk of readmission. The value of examining this question is to ensure best practices for patient management are known, minimizing the readmission rate, which is a key metric of treatment center performance.
Research Question 2: Are there disparities in readmission rates across demographic groups? Our hypothesis is that socioeconomic factors contribute to healthcare options in ways that may be reflected in this data. The value in examining this question is to identify and ensure compliance with applicable laws and regulations to continue to qualify for funding from sources that consider these factors.
Research Question 3: Are there identifiable patterns amongst the sampled data to allow preemptive identification of patients that are at high risk of readmission? Our hypothesis is that there are risk factors that can allow a patient to be considered objectively high-risk, triggering alternative treatment plans. The value here is to provide data that can be used to review patient management practices to include an evaluation of these risk factors when making discharge decisions.
## Exploratory Data Analysis
In this EDA, we aim to understand the factors that influence hospital readmission rates among patients with diabetes. Our investigation serves two purposes. First, we want to analyze our predictors so that we can identify which variables to include in our models. Second, we want to highlight potential healthcare disparities that may affect different patient populations. The data was obtained from the UC Irvine Machine
Learning Repository [https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008). It contains patient data covering ten years of clinical care for diabetes patients across 130 hospitals in the US, with 101,766 observations, each representing a single patient’s stay. Since we are ultimately evaluating predictions and analysis around the readmission rate, all rows are relevant. Some patient identification and administrative features are not useful for prediction and therefore will be omitted from modeling and analysis. They include:
- patient_nbr
- encounter_id
- medical_specialty
Excluding those features, we will examine the distribution of each variable and note any significant imbalances or patterns that emerge, keeping an eye out for correlations and relationships between individual variables and our target variable. For numerical variables, we will calculate correlation coefficients and generate a heatmap to measure the strength of their relationships with each other and with the target variable. For categorical variables, p-values generated by chi-square tests will help us determine whether the differences in readmission rates across these variables are statistically significant.
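A minimal sketch of the numeric-correlation step, assuming the cleaned DataFrame `df` and the numeric binary flag `readmitted_30` introduced later in this section:

```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric = df.select_dtypes(include="number")
corr = numeric.corr()

# Heatmap of pairwise correlations among numeric features.
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlations among numeric features")
plt.show()

# Correlation of each numeric feature with the target, strongest first.
print(corr["readmitted_30"].drop("readmitted_30")
          .sort_values(key=abs, ascending=False))
```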
Evaluating the target variable revealed an imbalance between the patients who were readmitted in less than 30 days and the patients who weren’t readmitted (Appendix A). This imbalance can be addressed using sampling techniques seen in the studies covered by our literature review (e.g., SMOTE, upsampling).
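A sketch of the SMOTE option, using the imbalanced-learn package and assuming existing training splits `X_train` and `y_train`:

```python
from imblearn.over_sampling import SMOTE

# Synthesize minority-class examples so both classes are balanced.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print(y_train.value_counts())      # original class balance
print(y_resampled.value_counts())  # balanced after resampling
```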
In performing some unsupervised analysis using k-means clustering, we found that there is not an extensive difference between our target classes when evaluating the number of medications and the number of lab procedures during the patient stay (Appendix B). These two features were chosen based on an initial correlation matrix evaluation of the data. We do see an upward trend for the mean patient: as the number of lab procedures increases, the number of medications also increases. Comparing across our target categories, there is also an increase in the overall number of procedures and medications for patients who ultimately required readmission in <30 days.
#### Correlations between Medication Changes and the Target
To evaluate whether there are correlations between drug treatment options and our target variable, we isolated all of the columns that document drug treatment modifications during the hospital stay into 3 categories: increased dosage, steady dosage, or reduced dosage. From there, we plotted the distributions of observations in each target category by each dosage-change category (Appendix C, D, E). Though the distributions varied across the dosage changes, there was no significant pattern indicating a specific correlation. As such, we may consider creating a variable that summarizes the drug changes for each observation into increase, decrease, or hold steady. This would greatly reduce the number of features required to model and may make the results easier to interpret.
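A sketch of that proposed summary variable, where `drug_cols` is assumed to list the 24 medication columns (values "Up", "Down", "Steady", "No"); the priority order below is an assumption for this illustration:

```python
def summarize_changes(row):
    """Collapse per-drug dosage changes into one label per encounter."""
    values = set(row)
    # Assumed priority: any increase wins, then any decrease,
    # otherwise treat the regimen as steady.
    if "Up" in values:
        return "increase"
    if "Down" in values:
        return "decrease"
    return "steady"

df["med_change_summary"] = df[drug_cols].apply(summarize_changes, axis=1)
print(df["med_change_summary"].value_counts())
```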
#### Correlation Between Numerical Variables and Target
The feature number_inpatient has the strongest positive correlation with readmission when the target is reduced to a binary value (Appendix F). This is a hint that this feature is a strong predictor to consider in our models. We want to avoid multicollinearity in our models, so numerical features that are highly correlated with other numerical features will need to be flagged for removal during the feature selection process. These include:
- num_medications and complexity_score (r = 0.97)
- num_procedures and complexity_score (r = 0.52)
#### Relationships Between Categorical Variables and Target
Scatter plots don’t work well for analyzing categorical variables against our categorical target (readmitted_30); instead, we can create stacked bar charts that show the percentage of readmitted vs. non-readmitted patients for each category, and use chi-square tests to calculate p-values (statistical significance) between each categorical feature and readmitted_30. We began by adding a feature flagging observations with a readmitted value of ‘<30’ to enable this additional analysis.
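A sketch of one such stacked bar chart, using age as the example category:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Row-normalized crosstab: percent readmitted vs. not per age group.
pct = pd.crosstab(df["age"], df["readmitted_30"], normalize="index") * 100
pct.plot(kind="bar", stacked=True)
plt.ylabel("Percent of patients")
plt.title("Readmission within 30 days by age group")
plt.show()
```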
The correlation between A1CResult and readmission was not statistically significant (p > 0.05) (Appendix G). This does not align with our initial expectation that the A1C result would show a strong relationship with readmission. We’ll still try modeling with and without this variable, but the A1C result may not be as predictive of readmission as we expected.
It’s interesting to see that one of the younger age groups has a higher readmission rate (Appendix H). This feature is seen to be statistically significant when compared against the readmission binary.
Whether or not the patient is on diabetes medication has a relationship with the readmission status (Appendix I).
Race is a statistically significant variable when looking at readmission rates (Appendix J). This feature will be fed to the logistic regression and XGBoost models. The feature importance and SHAP values will help determine how good of a predictor race and other demographic features are when it comes to predicting readmission.
The relationship between the admission type and the readmission binary is statistically significant (Appendix AM). The admission type with the highest readmission rate is “Emergency”.
Whether or not the patient changed diabetes medications is statistically significant (Appendix K). Gender has few missing values in the dataset, but its chi-square test yields a higher p-value (Appendix M). We’ll continue to analyze gender along with other demographic factors (age, race) during the modeling process to explore demographic disparities.
The most statistically significant relationship out of all the categorical variables compared against the target was for insulin, which indicates the dosage changes for patients treated with insulin during their stay (Appendix L). Patients in this dataset who were not on any insulin treatment had the lowest readmission rate, while patients who had their dose increased or decreased had a higher readmission rate. This relationship will be examined further during modeling. Cases where the dose of metformin was decreased also have more readmissions (Appendix O).
The payer code has a noticeable relationship with readmission (Appendix P). HM is notable here, as it’s associated with a lower readmission rate, and it’s the second most common payer code in the dataset.
The dosage of repaglinide has a strong relationship with readmission (Appendix Q). This will be a feature to include in the modeling process. Cases where the dosage was increased have more readmissions.
Weight is a nuanced feature in this dataset, as most of its values are missing. One approach worth trying is to apply chi-square tests to the subset of records that do have weight values (Appendix R). Logistic regression and XGBoost models could then be trained on this subset.
#### Distribution of predictor features
In comparing the racial distribution in the dataset (Appendix S) to US population statistics for 2008, Caucasian and African American patients are over-represented in the data (Caucasian +10%, African American +6.6%) while Hispanic and Asian patients are underrepresented (Hispanic -13.7%, Asian -4%) (Our Changing Population: United States, 2025). When creating our test sets for modeling, we should consider the implications of this if conclusions are to be drawn and extended to the rest of the population.
The age distribution shows that most patients are elderly (Appendix T).
Most patients in this dataset are on a form of diabetes medication (Appendix U).
Most of the patients aren’t prescribed repaglinide (Appendix V). In the previous section, we saw that patients that either have their dose reduced or aren’t on repaglinide have lower readmission rates.
Most of the values for weight in this dataset are missing; however, there are more than 3,000 entries with weight values (Appendix W). The modeling approach used can be applied to this subset of entries to determine the impact of weight on readmission.
The relatively even gender split should be helpful in reducing gender-based sampling bias in our predictions (Appendix X).
## Methodology
### RQ 1: What features are most correlated with patient readmission?
We will utilize multiple feature selection algorithms to determine the optimal set of features and leverage the output across multiple classification models. Models will include a random forest classification model and a logistic regression model, with accuracy used to determine each model’s goodness of fit. Consideration will also be given to simplifying the problem into a binary classification one by collapsing the target variable from 3 values to 2 (readmitted versus not) and reevaluating the correlation of the predictor features. Model performance using the chosen predictors will be used to evaluate the effectiveness of the feature selection process. Leveraging k-fold cross-validation and grid search techniques as part of the modeling process (see the sketch after the list below) will assist in objectively determining the best feature set and therefore the features most correlated with readmission.
Modeling techniques:
- Logistic Regression
- We plan to use the calculated coefficients to evaluate the importance of each feature.
- XGBoost and Random Forest
- We plan to use tree-based ensemble models to evaluate feature importance for the predictor features.
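A sketch of the cross-validation and grid-search step referenced above, assuming a preprocessed feature matrix `X` and binary target `y`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small illustrative grid; the real search space would be larger.
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",  # accuracy as the goodness-of-fit measure
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```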
### RQ 2: Are there disparities in readmission rates across demographic groups?
To address our second research question, we will employ a combination of chi-square tests and logistic regression analysis. This approach will allow us to identify and quantify disparities in readmission rates across different demographics.
To begin our Exploratory Data Analysis (EDA), we will first obtain descriptive statistics to visualize the dataset before applying models or computations. This step will help identify trends, skew, missing data, outliers, and other inconsistencies. We will then use stacked bar charts to show the percentage of readmitted vs. non-readmitted patients for each demographic group (age, gender, race), helping us identify differences in readmission status across the demographic groups.
During the EDA, we will use chi-square tests to determine if there is a statistically significant relationship between the categorical variables and readmission rate. However, because the result of a chi-square test (a p-value) only indicates the likelihood of an association, it is not a direct measurement of a demographic variable's impact on the target. We can therefore further our analysis with logistic regression, which remains a fundamental method of analysis in machine learning and data science and will let us quantify the effect of our demographic variables through predictive analysis. Given that logistic regression models the probability of a binary outcome, we will use it to predict the likelihood of patient readmission based on our variables. This requires flattening our target variable into a binary indicator of whether the patient was readmitted or not. The flattening does create a scenario where additional investigation within the readmitted category will be necessary to determine if there is more to be learned, but it is an appropriate step for this initial investigation because it enables the use of more models in our analysis.
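A sketch of the flattening-and-regression step, assuming the raw `readmitted` column takes the values "<30", ">30", and "NO", and that `X` is an already-encoded feature matrix:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Flatten the three-valued target into a binary readmission flag.
df["readmitted_binary"] = (df["readmitted"] != "NO").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, df["readmitted_binary"], test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```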
### RQ 3: Are there identifiable patterns amongst the sampled data to allow preemptive identification of patients that are at high risk of readmission?
In order to identify patterns in the data that might indicate which patients are particularly high-risk, we’ll conduct a k-means clustering analysis during the EDA phase. This will help us determine if there are distinct patient profiles common to patients that are at high risk of readmission (e.g., clusters of elderly patients with comorbidities). Following the cluster analysis, we will move on to predicting readmission risk using an XGBoost classification model, then analyze the data for feature interactions using the SHAP values. This step is integral to understanding to what extent readmission rate is influenced by each data feature.
For evaluating model performance, we will use AUC-ROC and precision-recall curves along with the F1 scores. Using multiple measures to assess accuracy allows us to ensure that our model has been thoroughly evaluated for predictive power. We had originally considered relying solely on the evaluation metrics from the XGBoost classifier model, but incorporating clustering into this stage allows us to better address the “identifiable patterns” aspect of this research question. In addition, because prior studies included in our review of the literature used a combination of XGBoost and logistic regression to analyze similar datasets, we chose to take a different route in order to provide a fresh perspective on this research question.
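A sketch of that evaluation step, assuming a fitted classifier `model` and held-out `X_test`, `y_test`:

```python
from sklearn.metrics import (f1_score, precision_recall_curve,
                             roc_auc_score)

proba = model.predict_proba(X_test)[:, 1]  # P(readmitted) per patient
preds = model.predict(X_test)

print("AUC-ROC:", round(roc_auc_score(y_test, proba), 3))
print("F1:", round(f1_score(y_test, preds), 3))

# The precision/recall arrays can be plotted against each other to
# inspect the trade-off across classification thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, proba)
```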
## Analysis
### Research Question 1: What features are most correlated with patient readmission?
The raw CSV file was imported and some cleaning tasks were identified and undertaken. Chief among these was flattening the 3-state target variable (readmission) into a binary value: either no readmission (0) or a readmission at any point (1). This decision was made because it eliminates the problem of dealing with an imbalanced target feature. From there, the original target column was dropped, as well as identification-only columns including encounter_id, medical_specialty, the diagnosis code features, and patient_nbr. Weight was dropped due to the sparseness of its values. Next, several columns were one-hot encoded to create binary columns suitable for modeling; this technique was applied to race, gender, and payer_code. With all categorical features dealt with, a Logistic Regression model and a Random Forest Classifier with 200 estimators were created, and the coefficients and importance values were printed along with the associated feature names. The top 25 results are captured below.
#### Top 25 Features by Different Importance Metrics

|Feature Name (Logistic Regression)|Coefficient (Absolute Value)|Feature Name (Random Forest)|Importance|Feature Name (XGBoost Tree Model)|Importance|
|---|---|---|---|---|---|
|number_inpatient|0.255439592|number_inpatient|0.006896518|number_inpatient|0.125693|
|diabetesMed_No|0.254828033|discharge_disposition_id|0.002593603|diabetesMed_No|0.025299|
|chlorpropamide|0.228048824|num_medications|0.002042268|number_emergency|0.020454|
|acarbose|0.226762876|num_lab_procedures|0.001993138|number_diagnoses|0.019607|
|miglitol|0.225261136|number_emergency|0.001690524|race_?|0.019515|
|glyburide-metformin|0.21556272|time_in_hospital|0.001604544|payer_code_BC|0.018156|
|nateglinide|0.211529561|number_diagnoses|0.001234952|age_80_90|0.017317|
|glimepiride|0.21116988|age|0.001205439|payer_code_?|0.016398|
|metformin|0.191825614|num_procedures|0.000869906|number_outpatient|0.016258|
|pioglitazone|0.17256024|number_outpatient|0.000703995|repaglinide_No|0.015713|
|rosiglitazone|0.171801222|admission_type_id|0.000667693|age_90_100|0.015586|
|repaglinide|0.160973333|insulin|0.000635973|max_glu_serum_Norm|0.014912|
|change_No|0.15460347|admission_source_id|0.000565341|acarbose_Steady|0.014735|
|payer_code_UK|0.126866892|metformin|0.000387642|age_70_80|0.013868|
|gender_Female|0.12027621|A1Cresult|0.000346618|max_glu_serum_>300|0.012969|
|glyburide|0.118814146|glipizide|0.000309569|age_20_30|0.012628|
|gender_Male|0.101852363|gender_Female|0.000304966|rosiglitazone_Steady|0.012517|
|payer_code_BC|0.081161294|gender_Male|0.000296173|A1Cresult_Norm|0.012213|
|glipizide|0.079119809|payer_code_UK|0.000279383|race_Caucasian|0.01207|
|race_AfricanAmerican|0.075517048|race_Caucasian|0.000276077|A1Cresult_>8|0.01197|
|change_Ch|0.067921342|payer_code_MC|0.000259752|max_glu_serum_>200|0.011843|
|max_glu_serum|0.06120959|race_AfricanAmerican|0.000258306|payer_code_UN|0.011584|
|A1Cresult|0.057960978|glyburide|0.000235351|payer_code_SP|0.01152|
|payer_code_CP|0.056053346|max_glu_serum|0.000226655|metformin_No|0.011364|
|race_Unknown|0.054101807|pioglitazone|0.000217059|metformin_Down|0.011323|
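A condensed sketch of the pipeline described above (the file name is a placeholder, and for brevity only the numeric and one-hot-encoded columns are kept; the full pipeline also encoded the medication and age features):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("diabetic_data.csv")  # placeholder file name

# Flatten the target: 0 = no readmission, 1 = readmission at any point.
df["readmitted_binary"] = (df["readmitted"] != "NO").astype(int)

# Drop the original target, identification-only columns, diagnosis
# codes (assumed names diag_1..diag_3), and the sparse weight column.
df = df.drop(columns=["readmitted", "encounter_id", "patient_nbr",
                      "medical_specialty", "diag_1", "diag_2", "diag_3",
                      "weight"])

# One-hot encode race, gender, and payer_code into binary columns.
df = pd.get_dummies(df, columns=["race", "gender", "payer_code"], dtype=int)

X = df.select_dtypes(include="number").drop(columns=["readmitted_binary"])
y = df["readmitted_binary"]

logreg = LogisticRegression(max_iter=1000).fit(X, y)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Rank features by |coefficient| and by impurity-based importance.
coefs = pd.Series(logreg.coef_[0], index=X.columns).abs()
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(coefs.sort_values(ascending=False).head(25))
print(importances.sort_values(ascending=False).head(25))
```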
#### Modeling with Different Model and Importance Metric Combinations
The Random Forest model built with the first n features from the feature importance list demonstrated that the first few features were strongly correlated with the target variable once it was flattened to a binary value (Appendix Y). Scoring over 88% accuracy and 87% precision indicates that it is possible to create a very simple model and still achieve a high level of precision. The models become less successful as features are added, rebounding around 10 or so features, although they never match the accuracy or precision of the simpler models.
A logistic regression model built by adding features one by one in order of feature importance maintains a fairly steady level of accuracy (Appendix Z). However, precision varied somewhat and peaked with the 21-feature model. If we ultimately decide to use a logistic regression model, there seems to be some benefit to making it more complex for the added precision. It is important to keep in mind that the difference from best to worst overall is <1%; with that knowledge, simplicity may be the more important attribute to consider.
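A sketch of this incremental workflow, where `ranked_features` is assumed to be the importance-ordered list of column names and train/test splits already exist:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score

# Refit with the top-n features, tracking accuracy and precision.
for n in range(1, len(ranked_features) + 1):
    cols = ranked_features[:n]
    model = LogisticRegression(max_iter=1000).fit(X_train[cols], y_train)
    preds = model.predict(X_test[cols])
    print(f"{n:2d} features: "
          f"accuracy={accuracy_score(y_test, preds):.3f}, "
          f"precision={precision_score(y_test, preds):.3f}")
```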
Now, taking a look at the opposite, training a logistic regression model using the feature set identified via logistic regression, we see a fairly consistent level of accuracy and precision (Appendix AA). The peak is a model of middling complexity with ~12 features which topped out at 84.7% precision and 88.8% accuracy on our test set. In comparison to the previous two models, this may be a compromise choice between the complexity suggested by model 2 and the simplistic model indicated in option 1.
Evaluating a random forest model using features identified by logistic regression coefficients does not provide much information (Appendix AB). This is likely, at least in part, because the sorting of the features from this model considers the absolute value of the coefficient, so some of the most impactful features in the logistic regression model are likely negatively correlated with the output variable. Nevertheless, the modeling here shows that the simplest model's performance is the same as above, since the most impactful feature was the same in all of our evaluations of feature importance. Accuracy and precision declined fairly consistently as more features were added.
### Research Question 2: Are there disparities in readmission rates across demographic groups?
We began by looking at the readmission rates by race in the data (Appendix AC). Males have higher readmission rates across all known races except Caucasian, where females are 0.4 percentage points higher. In the Other category, females also had a 1.3 percentage point higher rate of readmission. Asian males had the highest overall rate of readmission at 12.7 percent. Based on our exploratory data analysis, gender had a high p-value, indicating that the relationship between gender and readmission risk was not significant. Race, however, has a low p-value, which suggests that race does play a factor in determining readmission risk.
Once we add age to the evaluation criteria, a clearer trend develops. Readmission risk increases as patients get older, with older patients having a significantly higher risk than the youngest patients in this dataset (Appendix AD). This is in alignment with the general idea that health outcomes worsen as patients age. Some racial groups break this trend, however, with their highest risk occurring in younger age groups: African American, Other, and Caucasian patients' highest risk occurs in the 20-30 or 30-40 age groups.
Shifting focus to gender reveals that females in the 20-30 age group had the highest overall readmission risk in the dataset (Appendix AE). At almost 4 percentage points higher than any other gender and age combination, there appears to be a trend worth examining further in this group.
A more thorough investigation is needed to determine the relationship between gender and readmission risk. Given that the p-value of the relationship between gender and readmission is high after performing a chi-square test, we’re not able to quantify the relationship between gender and readmission. The best performing models in our analysis didn’t give much weight to race or gender, despite race having a low p-value in our EDA. Age, however, was one of the ten most impactful features for the random forest and XGBoost classifiers.
### Research Question 3: Are there identifiable patterns amongst the sampled data to allow preemptive identification of patients that are at high risk of readmission?
Patients with a higher number of previous inpatient, outpatient, or emergency visits as well as patients who spend more time in the hospital are at a higher risk of readmission.
A logistic regression model based on the single feature of previous inpatient visits was approximately 87% accurate in predicting the need for a readmission. This aligns with intuition: if a patient has many previous conditions requiring inpatient intervention, it follows that this pattern would continue once they were discharged from the current visit. Digging further, this information could be an indication that more research is needed into methods for breaking this pattern of patients cycling in and out of the hospital over time.
Clustering with these features in mind shows that patients with higher numbers of these visits are more often patients who ultimately required readmission (Appendix AF and AG).
The above approach demonstrates some measure of clarity in the distribution of our target variable across these two features, but it relies on manual selection of the dimensions used to plot and create the clusters. An alternative approach for reducing the data to two dimensions is Principal Component Analysis (PCA). The resulting plot is less interpretable than the others, but this method allows us to see how much each variable contributes to the variance in the data.
After analyzing what the components consist of, the first component (x-direction on the plot), explaining 25.4% of the variance, is determined by the features representing the number of medications, time in hospital, and the number of lab procedures. The second component (y-direction), explaining 17.1% of the variance, consists of the features representing the number of prior inpatient visits, number of prior emergency visits, and the number of prior outpatient visits that the patient had. These features do align with the most important features identified through our feature analysis for research question 1.
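A sketch of how those component loadings and explained-variance figures can be obtained, assuming `X_num` is a DataFrame of the numeric features:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize so each feature contributes on a comparable scale.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_num),
                        columns=X_num.columns)

pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance explained per component

# Loadings: each original feature's contribution to each component.
loadings = pd.DataFrame(pca.components_.T, index=X_scaled.columns,
                        columns=["PC1", "PC2"])
print(loadings["PC1"].abs().sort_values(ascending=False).head())
print(loadings["PC2"].abs().sort_values(ascending=False).head())
```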
Interpreting the plot with that information, we would expect that moving in the positive direction of either principal component should produce a clear delineation between the values of our target variable. In our case, however, this clear divide does not materialize (Appendix AH). When moving in the positive y-direction there appear to be more readmitted cases, but moving in the x-direction shows no noticeable trend in the distribution of readmitted patients.
Starting with an XGBoost classifier that included most of the features from the dataset, pruning the features with high p-values caused model performance to decrease as we went through the feature selection process (Appendix AJ). The model achieves a higher recall on the readmitted cases when given more features, regardless of their statistical significance or multicollinearity (Appendix AI).
Using the base model with most features included had mixed results (Appendix AK). For patients in the test set with no readmission, the model had a precision of 64.1% and recall of 72.9%, while for patients who had been readmitted, it scored 62.4% precision and 52.4% recall. The model was much more successful (~20 percentage points higher recall) for patients without a readmission than for those who had been readmitted, which puts a bit of a damper on the seemingly high scores. Our stated goal was to predict patients who, based on their treatment plan and history, would have a readmission; our modeling demonstrated that predicting those who would not is the simpler task with this base model.
Our next attempt was to use a grid search to tune the hyperparameters of the XGBoost tree model. This model showed reduced recall for patients without a readmission (72.9% to 65.6%) but increased recall on readmitted patients (52.4% to 60.7%) (Appendix AL). Though the weighted averages across all classes ended up being very similar, additional research is required to determine which class is more impactful to the business decisions that will be made based on the model's predictions. If, for example, the impact of a readmission can be quantified in costs to the healthcare system, that value can be used to calculate which class's predictability is more advantageous to the stakeholders involved. Overall, the optimized model had a higher F1 score as well as better performance on the class that matters most to our research questions. Based on these results, if a single model were the expected output, the optimized model would be our choice; however, additional research is needed before putting the model into any decision-making practice.
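A sketch of that grid search, with an illustrative parameter grid (the actual grid we searched may have differed) and assuming existing train/test splits:

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 6, 9],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Per-class precision/recall/F1 on the held-out test set.
print(classification_report(y_test, search.predict(X_test)))
```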
## Ethical Recommendations
While data science and machine learning have the potential to revolutionize the healthcare industry, their use with patient data also raises ethical concerns that cannot be ignored. Much of the research in this area has focused on privacy and data protection, but even when these issues are managed correctly there is still the potential for harm, particularly when it comes to data bias and discrimination (Ienca et al., 2018). Although machine learning models may not be as susceptible to cognitive biases as human physicians, existing societal biases that are reflected in healthcare-related datasets may unintentionally lead to models that reinforce and amplify existing health disparities, especially among historically marginalized groups.
One notable example of how biased data can create biased algorithms was documented in a study by Obermeyer et al. (2019) that found racial bias in an algorithm widely used in US healthcare at the time. The authors found that the algorithm was erroneously assigning a lower level of risk to black patients even when they were just as sick as the white patients. This resulted in many black patients not being properly flagged as in need of extra medical care. The racial bias present in the algorithm stemmed from the use of medical costs as an indicator of medical need, the logic being that sicker patients require and receive more medical care (Obermeyer et al., 2019). While this relationship does in fact exist, lower medical costs can also be indicative of unequal access to medical care. However, this disparity between groups was not taken into account by the algorithm, which led to one group receiving worse medical recommendations. The lesson we can learn from this research is how carefully biases in data must be handled. Data biases can be addressed using methods like stratification and clustering analysis, which can help detect any differences that may exist among underrepresented subgroups, but we also have to be cognizant of how existing group disparities can sometimes be reflected in data features in unexpected ways that may not always be clear through data analysis techniques alone.
Unfortunately, addressing data biases may not always be enough to ensure equitable outcomes. Attempting to mitigate issues caused by disparities using a one-dimensional framework is problematic, and fairness in healthcare is often oversimplified to just ‘getting rid of bias’. In their paper, Giovanola and Tiribelli (2023) argue that fairness should be viewed through the lens of ‘distributive justice’, a principle for distributing benefits fairly across different demographics. Distributive justice offers three options: equal outcomes, equal performance, and equal allocation. This framework is especially relevant when benefits might be misallocated to those who are not in need. So, in machine learning, particularly in healthcare, addressing biases appropriately involves avoiding algorithmic biases while accounting for the equitable distribution of benefits across different demographics.
Addressing ethical considerations in the application of our project to healthcare is essential. By incorporating fairness techniques, protecting patient privacy, maintaining transparency, and respecting informed consent, we can develop studies and analyses that are both computationally and ethically sound. Improvements in patient care and outcomes can increase ethical benefits for all parties involved. However, the moral issue of ensuring fairness and appropriate treatment of patients remains. This is why, on top of compliance with regulations such as HIPAA, we will account for biases affecting different patients in our study. While no model is free from ethical challenges, continuously addressing these issues will ensure our data and work are fairly representative of different patients.
## Challenges
One of the more challenging aspects of working on this project was the number of missing values for several notable variables, namely weight, HbA1c levels, and maximum glucose serum. Almost 97% of the entries were missing a value for weight, roughly 95% of entries lacked a glucose serum measurement, and around 83% of entries were missing a measurement for HbA1c levels. This is unfortunate because all of these variables represent health metrics that are often considered useful for diagnosing and monitoring the progression of diabetes. Past research has shown that obesity is associated with an increase in the length of hospital stay in patients with diabetes as well as the potential for complications post-discharge, and poor HbA1c levels have historically been used as an indication of uncontrolled diabetes (Alexopoulos et al., 2016; Kaiafa et al., 2020). It’s possible that a more complete dataset in these areas might have had a significant impact on the accuracy of our predictions.
Another challenge encountered while working on this project was the imbalance in the target variable between the number of patients who were readmitted in less than 30 days and patients who were not readmitted at all. The majority of the patients in the dataset were not readmitted, with far fewer patients being readmitted in less than 30 days after discharge. This uneven distribution of the target variable makes predicting readmissions that occur within a month of discharge more challenging than it might be otherwise. Fortunately, we were more concerned with predicting readmissions as a whole, so we were able to largely bypass this issue by converting readmission into a binary variable and combining all readmissions regardless of timing into a single category.
## Recommendations and Next Steps
While we have completed the initial investigation and analysis, there are a number of additional items that would add value to the results and ultimately determine the value of the modeling we have done. We identified several areas that are primed for additional investigation and would serve stakeholders well as topics for further research.
In our analysis we combined the <30 day and >30 day readmission data into a single class to enable different kinds of modeling and to simplify the analysis. Now that this work has concluded, further research should evaluate whether there are trends or analytical differences between patients readmitted within 30 days and those readmitted after that mark.
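As a starting point, a sketch like the following (again assuming the raw `diabetic_data.csv`) could compare utilization patterns between early and late readmissions:

```python
import pandas as pd

df = pd.read_csv("diabetic_data.csv")
readmits = df[df["readmitted"] != "NO"]
# Compare average utilization between <30-day and >30-day readmissions
cols = ["time_in_hospital", "number_inpatient",
        "number_emergency", "num_medications"]
print(readmits.groupby("readmitted")[cols].mean())
```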
In our data we identified that the female, age 20-30 group had the highest readmission risk, though our modeling did not surface an obvious explanation. Further investigation should be undertaken to determine whether this is an oddity of the data or whether another explanation can be found.
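A crosstab of readmission rates by gender and age bracket (similar to Appendix AE) is one place to start; this sketch assumes the raw file:

```python
import pandas as pd

df = pd.read_csv("diabetic_data.csv")
df["readmitted_binary"] = (df["readmitted"] != "NO").astype(int)
# Mean readmission rate for each gender/age-bracket combination
rates = pd.crosstab(df["gender"], df["age"],
                    values=df["readmitted_binary"], aggfunc="mean")
print(rates.round(3))
```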
Our principal component analysis did not reveal a clear separation between the classes we were predicting. Further investigation should be undertaken to identify whether other factors allow the difference between classes to be visualized more clearly. Being able to show a clear dividing line may make a more persuasive argument to stakeholders once investment decisions come into play.
Finally, our analysis has produced a recommended model for predicting readmission. However, before it is put into practice, business stakeholders should perform a cost/benefit analysis on the impact of false positives versus false negatives in this space. The output of that analysis may motivate different modeling choices that prioritize performance on one class or the other, ensuring that business goals are met.
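One hedged sketch of how such an analysis could feed back into modeling: sweep the classification threshold and pick the cutoff that minimizes total expected cost. The cost figures below are placeholders that stakeholders would supply, and the predictions are toy data standing in for a model's held-out probabilities:

```python
import numpy as np

# Hypothetical per-case costs (placeholders for stakeholder inputs)
COST_FN = 5000.0  # missed readmission (false negative)
COST_FP = 500.0   # unnecessary intervention (false positive)

def best_threshold(y_true, y_proba):
    """Return the probability cutoff minimizing total expected cost."""
    thresholds = np.linspace(0.05, 0.95, 19)
    total_costs = []
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        total_costs.append(fn * COST_FN + fp * COST_FP)
    return float(thresholds[int(np.argmin(total_costs))])

# Toy example: replace with a model's held-out predicted probabilities
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 1000)
y_proba = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)
print(f"Cost-minimizing threshold: {best_threshold(y_true, y_proba):.2f}")
```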
## References
Alexopoulos, A., Fayfman, M., Zhao, L., Weaver, J., Buehler, L., Smiley, D., Pasquel, F. J., Vellanki, P., Haw, J. S., & Umpierrez, G. E. (2016). Impact of obesity on hospital complications and mortality in hospitalized patients with hyperglycemia and diabetes. BMJ Open Diabetes Research & Care, 4(1), e000200. [https://doi.org/10.1136/bmjdrc-2016-000200](https://doi.org/10.1136/bmjdrc-2016-000200)
Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., & Clore, J. N. (2014). Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014, Article 781670. [https://www.proquest.com/docview/1566522208/fulltextPDF/2C2CB6A3C6EE4D84PQ/1?accountid=4485&sourcetype=Scholarly%20Journals](https://www.proquest.com/docview/1566522208/fulltextPDF/2C2CB6A3C6EE4D84PQ/1?accountid=4485&sourcetype=Scholarly%20Journals)
Dungan, K. M. (2012). The effect of diabetes on hospital readmissions. Journal of Diabetes Science and Technology, 6(5), 1045–1052. [https://doi.org/10.1177/193229681200600508](https://doi.org/10.1177/193229681200600508)
Giovanola, B., & Tiribelli, S. (2023). Beyond bias and discrimination: Redefining the AI ethics principle of fairness in healthcare machine-learning algorithms. AI & Society. [https://link.springer.com/article/10.1007/s00146-022-01455-6](https://link.springer.com/article/10.1007/s00146-022-01455-6)
Ienca, M., Ferretti, A., Hurst, S., Puhan, M., Lovis, C., & Vayena, E. (2018). Considerations for ethics review of big data health research: A scoping review. PloS One, 13(10), e0204937–e0204937. [https://doi.org/10.1371/journal.pone.0204937](https://doi.org/10.1371/journal.pone.0204937)
James, J., Tan, S., Stretton, B., Kovoor, J. G., Gupta, A. K., Gluck, S., Gilbert, T., Sharma, Y., & Bacchi, S. (2023). Why do we evaluate 30‐day readmissions in general medicine? A historical perspective and contemporary data. Internal Medicine Journal, 53(6), 1070–1075. [https://doi.org/10.1111/imj.16115](https://doi.org/10.1111/imj.16115)
Kaiafa, G., Veneti, S., Polychronopoulos, G., Pilalas, D., Daios, S., Kanellos, I., Didangelos, T., Pagoni, S., & Savopoulos, C. (2020). Is HbA1c an ideal biomarker of well-controlled diabetes? Postgraduate Medical Journal, 97(1148), 380–383. [https://doi.org/10.1136/postgradmedj-2020-138756](https://doi.org/10.1136/postgradmedj-2020-138756)
Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science (American Association for the Advancement of Science), 366(6464), 447–453. [https://doi.org/10.1126/science.aax2342](https://doi.org/10.1126/science.aax2342)
Ogallo, W., Tadesse, G. A., Speakman, S., & Walcott-Bryant, A. (2021). Detection of Anomalous Patterns Associated with the Impact of Medications on 30-Day Hospital Readmission Rates in Diabetes Care. AMIA ... Annual Symposium Proceedings, 2021, 495–504. [https://pmc.ncbi.nlm.nih.gov/articles/PMC8378652/](https://pmc.ncbi.nlm.nih.gov/articles/PMC8378652/)
Rubin, D. J. (2015). Hospital Readmission of Patients with Diabetes. Current Diabetes Reports, 15(4), 17–19. [https://doi.org/10.1007/s11892-015-0584-7](https://doi.org/10.1007/s11892-015-0584-7)
Shang, Y., Jiang, K., Wang, L., Zhang, Z., Zhou, S., Liu, Y., Dong, J., & Wu, H. (2021). The 30-days hospital readmission risk in diabetic patients: Predictive modeling with Machine Learning Classifiers. BMC Medical Informatics and Decision Making, 21(S2). [https://doi.org/10.1186/s12911-021-01423-y](https://doi.org/10.1186/s12911-021-01423-y)
Sharma, A., Agrawal, P., Madaan, V., & Goyal, S. (2019). Prediction on diabetes patient’s hospital readmission rates. ACM International Conference Proceeding Series, 1–5. [https://dl-acm-org.ezproxy1.lib.asu.edu/doi/epdf/10.1145/3339311.3339349](https://dl-acm-org.ezproxy1.lib.asu.edu/doi/epdf/10.1145/3339311.3339349)
USAFacts Institute. (2025, February 17). Our changing population: United States. USAFacts. [https://usafacts.org/data/topics/people-society/population-and-demographics/our-changing-population/?endDate=2022-01-01&startDate=2008-01-01](https://usafacts.org/data/topics/people-society/population-and-demographics/our-changing-population/?endDate=2022-01-01&startDate=2008-01-01)
## Appendix
### Visualizations
Appendix A
Distribution of the Target Feature

Appendix B
Clustering Results across Number of Lab Procedures and Number of Medications

Appendix C
Reduced Dosage

Appendix D
Steady Dosage

Appendix E
Increased Dosage

Appendix F
Crosstab of Numerical Features
Appendix G
A1CResult vs Readmission

There are fewer than 20,000 records with an entry for A1Cresult.
Appendix H
Age vs Readmission

Appendix I
Diabetes Medications Status vs Readmission

Appendix J
Race vs Readmission

Appendix K
Diabetes Med Change vs Readmission

Appendix L
Insulin Change vs Readmission


Appendix M
Gender vs Readmission

Appendix N
Tolbutamide vs Readmission

Appendix O
Metformin vs Readmission

Most patients aren’t prescribed metformin. Based on the stacked bar chart, patients whose dose was reduced or who aren’t on metformin show higher readmission rates.
Appendix P
Payer Code vs Readmission

MC is a common payer code in this dataset. HM shows one of the lowest readmission rates on the stacked bar chart.
Appendix Q
Repaglinide vs Readmission

Appendix R
Weight vs Readmission

Appendix S
Distribution of Race

Appendix T
Distribution of Age

Appendix U
Distribution of Diabetes Medication Status

Appendix V
Distribution of Repaglinide Treatment

Appendix W
Distribution of Weight

Appendix X
Distribution of Gender

Appendix Y
Scoring Random Forest Model - Feature Importances

Appendix Z
Scoring Logistic Regression Model - Feature Importances

Appendix AA
Scoring Logistic Regression Model - Coefficient Feature Modeling

Appendix AB
Scoring Random Forest Model - Coefficient Feature Modeling

Appendix AC
Crosstab - Readmission Rates by Race and Gender

Appendix AD
Crosstab - Readmission Rates by Race and Age

Appendix AE
Crosstab - Readmission Rates by Gender and Age

Appendix AF
Cluster Results - Number of Inpatient Visits vs Emergency Visits

Appendix AG
Cluster Results - Total Previous Inpatient Visits vs Emergency Visits

Appendix AH
PCA Results

Appendix AI
XGBoost Base Model Performance Summary

Appendix AJ
Base Model Feature Importance Ranking

Appendix AK
Confusion Matrix and Performance Details - Base Model

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Readmission | 0.641 | 0.729 | 0.682 |
| Readmission | 0.624 | 0.524 | 0.569 |
| Weighted Average | 0.633 | 0.634 | 0.630 |
Appendix AL
Confusion Matrix and Performance Details - Tuned Model

AUC-ROC Score: 0.668

| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| No Readmission | 0.656 | 0.643 | 0.650 |
| Readmission | 0.594 | 0.607 | 0.600 |
| Weighted Average | 0.627 | 0.626 | 0.627 |
Appendix AM
Admission Type vs Readmission
### Source Code
#### Methodology - Example Cluster Plots
```python
from sklearn.cluster import KMeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
filename = ".\\RawData\\diabetic_data.csv"
df = pd.read_csv(filename)
kMeansCluster = KMeans(n_clusters=4, random_state=123)
df['Cluster'] = kMeansCluster.fit_predict(df[['time_in_hospital', 'number_inpatient']])
# Create a DataFrame for cluster centroids
centroids_df = pd.DataFrame(kMeansCluster.cluster_centers_, columns=['time_in_hospital', 'number_inpatient'])
centroids_df['Cluster'] = range(4)
# Create the Faceted Scatter Plot
g = sns.FacetGrid(df, col='readmitted', hue='Cluster', palette='Set1', height=5, aspect=1.2)
# Add scatter points for patients
g.map(sns.scatterplot, 'time_in_hospital', 'number_inpatient', alpha=0.6, s=100)
# Add centroids
for ax, readmit_status in zip(g.axes.flat, df['readmitted'].unique()):
    for cluster_id in centroids_df['Cluster']:
        # Filter for the current readmission status and cluster
        cluster_points = df[(df['readmitted'] == readmit_status)
                            & (df['Cluster'] == cluster_id)]
        # Only add a centroid if the cluster exists for this readmission status
        if not cluster_points.empty:
            centroid = centroids_df[centroids_df['Cluster'] == cluster_id]
            ax.scatter(centroid['time_in_hospital'], centroid['number_inpatient'],
                       c='black', s=150, marker='X', label=f'Centroid {cluster_id}')
# Final plot adjustments
g.fig.suptitle('Scatter Plot of Time in Hospital vs Number of Inpatient Visits with Centroids Highlighted', y=1.02)
plt.show()
```
#### Clustering Exploration
```python
import pandas as pd
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt
filename = "Cleaned_Data.csv"
df = pd.read_csv(filename)
# K-Means clustering on num_lab_procedures and num_medications
kMeansCluster = KMeans(n_clusters=3, random_state=123)
df['Cluster'] = kMeansCluster.fit_predict(df[['num_lab_procedures', 'num_medications']])
# Create a DataFrame for cluster centroids
centroids_df = pd.DataFrame(kMeansCluster.cluster_centers_, columns=['num_lab_procedures', 'num_medications'])
centroids_df['Cluster'] = range(3)
# Create the Faceted Scatter Plot
g = sns.FacetGrid(df, col='readmitted', hue='Cluster', palette='Set1', height=5, aspect=1.2)
# Add scatter points for patients
g.map(sns.scatterplot, 'num_lab_procedures', 'num_medications', alpha=0.6, s=100)
# Add centroids
for ax, readmit_status in zip(g.axes.flat, df['readmitted'].unique()):
for cluster_id in centroids_df['Cluster']:
# Filter for current readmit status and cluster
cluster_points = df[(df['readmitted'] == readmit_status)
& (df['Cluster'] == cluster_id)]
if not cluster_points.empty: # Only add centroid if cluster exists for the current readmit status
centroid = centroids_df[centroids_df['Cluster'] == cluster_id]
ax.scatter(centroid['num_lab_procedures'], centroid['num_medications'], c='black', s=150, marker='X',
label=f'Centroid {cluster_id}')
# Final plot adjustments
g.fig.suptitle('Scatter Plot of Number of Procedures vs Number of Medications with Centroids Highlighted', y=1.02)
plt.show()
```
#### Cleaning
```python
import pandas as pd
filename = ".\\RawData\\diabetic_data.csv"
df = pd.read_csv(filename)
# Replace ? with Unknown for race
df.loc[df['race'] == '?', 'race'] = 'Unknown'
df.loc[df['weight'] == '?', 'weight'] = '-1'
df.loc[df['payer_code'] == '?', 'payer_code'] = 'UK'
df.loc[df['medical_specialty'] == '?', 'medical_specialty'] = 'Unknown'
df.loc[df['A1Cresult'] == '>7', 'A1Cresult'] = 'High'
df.loc[df['A1Cresult'] == '>8', 'A1Cresult'] = 'Extreme'
df.loc[df['A1Cresult'] == 'Norm', 'A1Cresult'] = 'Normal'
# Missing values load as real NaN (not the string 'nan'), so fill them directly
df['A1Cresult'] = df['A1Cresult'].fillna('Not Measured')
df['metformin'] = df['metformin'].fillna('Invalid')
df['max_glu_serum'] = df['max_glu_serum'].fillna('Not Measured')
#df[['AgeLow', 'AgeHigh']] = df['age'].str.extract(r'\[(\d+)-(\d+)\)')
df = df.drop(columns=['weight'])
df = pd.get_dummies(df, prefix_sep='_',dummy_na=True, dtype=int,
columns = ['metformin','repaglinide','nateglinide', 'chlorpropamide',
'glimepiride','acetohexamide','glipizide', 'glyburide',
'tolbutamide', 'pioglitazone', 'rosiglitazone','acarbose',
'miglitol', 'troglitazone', 'tolazamide', 'examide',
'citoglipton', 'insulin', 'glyburide-metformin',
'glipizide-metformin', 'glimepiride-pioglitazone',
'metformin-rosiglitazone', 'metformin-pioglitazone',
'change', 'diabetesMed', 'race', 'gender', 'medical_specialty'])
df.to_csv("Cleaned_Data.csv")
```
#### Cleaning and encoding
```python
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
filename = ".\\RawData\\diabetic_data.csv"
df = pd.read_csv(filename)
# Replace ? with Unknown for race
df.loc[df['race'] == '?', 'race'] = 'Unknown'
df.loc[df['weight'] == '?', 'weight'] = '-1'
df.loc[df['payer_code'] == '?', 'payer_code'] = 'UK'
df.loc[df['medical_specialty'] == '?', 'medical_specialty'] = 'Unknown'
df.loc[df['A1Cresult'] == '>7', 'A1Cresult'] = 'High'
df.loc[df['A1Cresult'] == '>8', 'A1Cresult'] = 'Extreme'
df.loc[df['A1Cresult'] == 'Norm', 'A1Cresult'] = 'Normal'
# Missing values load as real NaN (not the string 'nan'), so fill them directly
df['A1Cresult'] = df['A1Cresult'].fillna('Not Measured')
df['metformin'] = df['metformin'].fillna('Invalid')
df['max_glu_serum'] = df['max_glu_serum'].fillna('Not Measured')
#df[['AgeLow', 'AgeHigh']] = df['age'].str.extract(r'\[(\d+)-(\d+)\)')
df = df.drop(columns=['weight'])
df = pd.get_dummies(df, prefix_sep='_',dummy_na=True, dtype=int,
columns = ['change', 'diabetesMed', 'race', 'gender'])
# Encode age brackets ordinally (0 = '[0-10)' ... 9 = '[90-100)');
# passing the category order explicitly keeps the intended ordering,
# since OrdinalEncoder otherwise sorts categories alphabetically
age_order = ['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)',
             '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)']
encAge = OrdinalEncoder(categories=[age_order])
df[['age']] = encAge.fit_transform(df[['age']])
# Encode readmitted with the intended codes: <30 = 1, >30 = 2, NO = 3
# (downstream scripts treat codes 1 and 2 as "readmitted")
df['readmitted'] = df['readmitted'].map({'<30': 1, '>30': 2, 'NO': 3})
# Encode max_glu_serum and A1Cresult on ordered severity scales
df['max_glu_serum'] = df['max_glu_serum'].map(
    {'Not Measured': 0, 'Norm': 1, '>200': 2, '>300': 3})
df['A1Cresult'] = df['A1Cresult'].map(
    {'Not Measured': 0, 'Normal': 1, 'High': 2, 'Extreme': 3})
# Encode every medication column on the same dosage scale:
# No = 0, Down = 1, Steady = 2, Up = 3
med_cols = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
            'glimepiride', 'acetohexamide', 'glipizide', 'glyburide',
            'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',
            'miglitol', 'troglitazone', 'tolazamide', 'examide',
            'citoglipton', 'insulin', 'glyburide-metformin',
            'glipizide-metformin', 'glimepiride-pioglitazone',
            'metformin-rosiglitazone', 'metformin-pioglitazone']
encMeds = OrdinalEncoder(categories=[['No', 'Down', 'Steady', 'Up']] * len(med_cols))
df[med_cols] = encMeds.fit_transform(df[med_cols])
df.to_csv("Cleaned_Single_Encoded_Meds.csv")
```
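A note on the encoding above: the explicit category orders and mappings preserve the intended ordinal codes, whereas OrdinalEncoder’s default alphabetical ordering would scramble them (for example, sorting ‘NO’ after ‘<30’ and ‘>30’). The shared No < Down < Steady < Up scale assumes dosage changes are ordered, which keeps each medication in a single column rather than expanding it into four dummy columns.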
#### Feature Importance Calculations
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
filename = 'Cleaned_Single_Encoded_Meds.csv'
df = pd.read_csv(filename)
df['readmitted_binary'] = df['readmitted'].isin([1,2]).astype(int)
df.drop(columns = ['Unnamed: 0','encounter_id', 'patient_nbr', 'diag_1',
'diag_2','diag_3', 'medical_specialty'], inplace = True)
df = pd.get_dummies(df, prefix_sep='_',dummy_na=True, dtype=int,
columns = ['payer_code'])
df.fillna(0, inplace = True)
X = df.iloc[:,0:154]
y = df['readmitted_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
RFC = RandomForestClassifier(n_estimators = 200,
random_state=42)
RFC.fit(X,y)
Feature_Importances = RFC.feature_importances_
# Feature_Importances = sorted(Feature_Importances, reverse=True)
# for feature, importance in zip(X, Feature_Importances):
# print(feature, importance)
# Index the importances by every column of X (a partial slice would not match the array length)
feature_columns = X.columns
importance_calc = pd.Series(RFC.feature_importances_, index=feature_columns)
feature_importance = pd.DataFrame(Feature_Importances, index=feature_columns, columns=["Importance"])
feature_importance = feature_importance.sort_values(by=["Importance"], ascending=False)
print(feature_importance)
# Drop the target columns from the ranking so the trimmed models never train on them
feature_importance.drop(['readmitted', 'readmitted_binary'], axis=0, inplace=True)
# feature_importance.to_csv('FixedFeatureImportance.csv')
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, test_size = 0.30, random_state=42)
CVScoreList = []
ACScoreList = []
PScoreList = []
RScoreList = []
F1ScoreList = []
for i in range(25):
#Add the X'th features
print(i)
X_train_trimmed = pd.DataFrame(X_train_full.loc[:,feature_importance.index[0:i+1]])
y_train_trimmed = y_train_full
X_test_trimmed = pd.DataFrame(X_test_full.loc[:,feature_importance.index[0:i+1]])
y_test_trimmed = y_test_full
print(X_train_trimmed)
#Train The model
TreeModel = RandomForestClassifier(random_state=42)
TreeModel.fit(X_train_trimmed, y_train_trimmed)
# Check accuracy on the test set
tree_pred = TreeModel.predict(X_test_trimmed)
#Add accuracy score to list
ACScoreList.append(accuracy_score(tree_pred,y_test_trimmed))
PScoreList.append(precision_score(y_test_trimmed, tree_pred, average='weighted'))
RScoreList.append(recall_score(y_test_trimmed, tree_pred, average='weighted'))
F1ScoreList.append(f1_score(y_test_trimmed, tree_pred, average='weighted'))
MaxACScore = max(ACScoreList)
MaxAIndex = ACScoreList.index(MaxACScore)
MaxPScore = max(PScoreList)
MaxPIndex = PScoreList.index(MaxPScore)
MaxRScore = max(RScoreList)
MaxRIndex = RScoreList.index(MaxRScore)
MaxF1Score = max(F1ScoreList)
MaxF1Index = F1ScoreList.index(MaxF1Score)
plt.plot(ACScoreList, label ='Accuracy', linestyle = '-.')
plt.plot(PScoreList, label = 'Precision')
plt.plot(RScoreList, label = 'Recall', linestyle = ':')
plt.plot(F1ScoreList, label = 'F1 Score', linestyle = ':')
plt.scatter(MaxAIndex, MaxACScore, color = 'blue', label='Max Accuracy Score')
plt.text(MaxAIndex, MaxACScore, "{:.3f}".format(MaxACScore))
plt.scatter(MaxPIndex, MaxPScore, color = 'orange', label='Max Precision Score')
plt.text(MaxPIndex, MaxPScore, "{:.3f}".format(MaxPScore))
plt.scatter(MaxRIndex, MaxRScore, color = 'green', label='Max Recall Score')
plt.text(MaxRIndex, MaxRScore, "{:.3f}".format(MaxRScore))
plt.scatter(MaxF1Index, MaxF1Score, color = 'red', label='Max F1 Score')
plt.text(MaxF1Index, MaxF1Score, "{:.3f}".format(MaxF1Score))
plt.legend(fontsize = 'xx-small')
plt.title('Scoring Random Forest Model - Feature Importances')
plt.xlabel('Number of features')
plt.ylabel('Score Value')
plt.grid(True)
plt.show()
LRACScoreList = []
LRPScoreList = []
LRRScoreList = []
LRF1ScoreList = []
for i in range(25):
#Add the X'th features
print(i)
X_train_trimmed = pd.DataFrame(X_train_full.loc[:,feature_importance.index[0:i+1]])
y_train_trimmed = y_train_full
X_test_trimmed = pd.DataFrame(X_test_full.loc[:,feature_importance.index[0:i+1]])
y_test_trimmed = y_test_full
print(X_train_trimmed)
#Train The model
LogModel = LogisticRegression(random_state=42)
LogModel.fit(X_train_trimmed, y_train_trimmed)
# Check accuracy on the test set
tree_pred = LogModel.predict(X_test_trimmed)
#Add accuracy score to list
LRACScoreList.append(accuracy_score(tree_pred,y_test_trimmed))
LRPScoreList.append(precision_score(y_test_trimmed, tree_pred, average='weighted'))
LRRScoreList.append(recall_score(y_test_trimmed, tree_pred, average='weighted'))
LRF1ScoreList.append(f1_score(y_test_trimmed, tree_pred, average='weighted'))
MaxACScore = max(LRACScoreList)
MaxAIndex = LRACScoreList.index(MaxACScore)
MaxPScore = max(LRPScoreList)
MaxPIndex = LRPScoreList.index(MaxPScore)
MaxRScore = max(LRRScoreList)
MaxRIndex = LRRScoreList.index(MaxRScore)
MaxF1Score = max(LRF1ScoreList)
MaxF1Index = LRF1ScoreList.index(MaxF1Score)
plt.plot(LRACScoreList, label ='Accuracy', linestyle = '--')
plt.plot(LRPScoreList, label = 'Precision')
plt.plot(LRRScoreList, label = 'Recall', linestyle = ':')
plt.plot(LRF1ScoreList, label = 'F1 Score', linestyle = ':')
plt.scatter(MaxAIndex, MaxACScore, color = 'blue', label='Max Accuracy Score')
plt.text(MaxAIndex, MaxACScore, "{:.3f}".format(MaxACScore))
plt.scatter(MaxPIndex, MaxPScore, color = 'orange', label='Max Precision Score')
plt.text(MaxPIndex, MaxPScore, "{:.3f}".format(MaxPScore))
plt.scatter(MaxRIndex, MaxRScore, color = 'green', label='Max Recall Score')
plt.text(MaxRIndex, MaxRScore, "{:.3f}".format(MaxRScore))
plt.scatter(MaxF1Index, MaxF1Score, color = 'red', label='Max F1 Score')
plt.text(MaxF1Index, MaxF1Score, "{:.3f}".format(MaxF1Score))
plt.legend(fontsize = 'xx-small')
plt.title('Scoring Logistic Regression Model - Feature Importances')
plt.xlabel('Number of features')
plt.ylabel('Score Value')
plt.grid(True)
plt.show()
```
#### Feature Importance - Logistic Regression
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
filename = 'Cleaned_Single_Encoded_Meds.csv'
df = pd.read_csv(filename)
df['readmitted_binary'] = df['readmitted'].isin([1,2]).astype(int)
df.drop(columns = ['Unnamed: 0','encounter_id', 'patient_nbr', 'diag_1',
'diag_2','diag_3', 'medical_specialty', 'readmitted'], inplace = True)
df = pd.get_dummies(df, prefix_sep='_',dummy_na=True, dtype=int,
columns = ['payer_code'])
df.fillna(0, inplace = True)
X = df.iloc[:, 0:74].copy()
# Guard: drop the target if it landed in the slice
X = X.drop(columns=['readmitted_binary'], errors='ignore')
y = df['readmitted_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
LR = LogisticRegression()
LR.fit(X,y)
coef_table = pd.DataFrame(list(X.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",LR.coef_.transpose())
coef_table['ABS'] = abs(coef_table['Coefs'])
coef_table.sort_values('ABS', ascending = False, inplace = True)
coef_table.reset_index(inplace = True)
coef_table.drop(columns = ['index', 'Coefs'], inplace = True)
coef_table.to_csv('Coefficients.csv')
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, test_size = 0.30, random_state=42)
ACScoreList = []
PScoreList = []
RScoreList = []
F1ScoreList = []
for i in range(25):
#Add the X'th features
print(i)
X_train_trimmed = pd.DataFrame(X_train_full.loc[:,coef_table.iloc[0:i+1,0]])
y_train_trimmed = y_train_full
X_test_trimmed = pd.DataFrame(X_test_full.loc[:,coef_table.iloc[0:i+1,0]])
y_test_trimmed = y_test_full
print(X_train_trimmed)
#Train The model
LogModel = LogisticRegression(random_state=42)
LogModel.fit(X_train_trimmed, y_train_trimmed)
# Check accuracy on the test set
tree_pred = LogModel.predict(X_test_trimmed)
#Add accuracy score to list
ACScoreList.append(accuracy_score(tree_pred,y_test_trimmed))
PScoreList.append(precision_score(y_test_trimmed, tree_pred, average='weighted'))
RScoreList.append(recall_score(y_test_trimmed, tree_pred, average='weighted'))
F1ScoreList.append(f1_score(y_test_trimmed, tree_pred, average='weighted'))
MaxACScore = max(ACScoreList)
MaxAIndex = ACScoreList.index(MaxACScore)
MaxPScore = max(PScoreList)
MaxPIndex = PScoreList.index(MaxPScore)
MaxRScore = max(RScoreList)
MaxRIndex = RScoreList.index(MaxRScore)
MaxF1Score = max(F1ScoreList)
MaxF1Index = F1ScoreList.index(MaxF1Score)
plt.plot(ACScoreList, label ='Accuracy', linestyle = '--')
plt.plot(PScoreList, label = 'Precision')
plt.plot(RScoreList, label = 'Recall', linestyle = ':')
plt.plot(F1ScoreList, label = 'F1 Score', linestyle = ':')
plt.scatter(MaxAIndex, MaxACScore, color = 'blue', label='Max Accuracy Score')
plt.text(MaxAIndex, MaxACScore, "{:.3f}".format(MaxACScore))
plt.scatter(MaxPIndex, MaxPScore, color = 'orange', label='Max Precision Score')
plt.text(MaxPIndex, MaxPScore, "{:.3f}".format(MaxPScore))
plt.scatter(MaxRIndex, MaxRScore, color = 'green', label='Max Recall Score')
plt.text(MaxRIndex, MaxRScore, "{:.3f}".format(MaxRScore))
plt.scatter(MaxF1Index, MaxF1Score, color = 'red', label='Max F1 Score')
plt.text(MaxF1Index, MaxF1Score, "{:.3f}".format(MaxF1Score))
plt.legend(fontsize = 'xx-small')
plt.title('Scoring Logistic Regression Model - Coefficient Feature Modeling')
plt.xlabel('Number of features')
plt.ylabel('Score Value')
plt.grid(True)
plt.show()
#Grading a random forest model with the features identified via Logistic Regression Coefficient Values
RFACScoreList = []
RFPScoreList = []
RFRScoreList = []
RFF1ScoreList = []
for i in range(25):
#Add the X'th features
print(i)
X_train_trimmed = pd.DataFrame(X_train_full.loc[:,coef_table.iloc[0:i+1,0]])
y_train_trimmed = y_train_full
X_test_trimmed = pd.DataFrame(X_test_full.loc[:,coef_table.iloc[0:i+1,0]])
y_test_trimmed = y_test_full
print(X_train_trimmed)
#Train The model
TreeModel = RandomForestClassifier(random_state=42)
TreeModel.fit(X_train_trimmed, y_train_trimmed)
# Check accuracy on the test set
tree_pred = TreeModel.predict(X_test_trimmed)
#Add accuracy score to list
RFACScoreList.append(accuracy_score(tree_pred,y_test_trimmed))
RFPScoreList.append(precision_score(y_test_trimmed, tree_pred, average='weighted'))
RFRScoreList.append(recall_score(y_test_trimmed, tree_pred, average='weighted'))
RFF1ScoreList.append(f1_score(y_test_trimmed, tree_pred, average='weighted'))
MaxACScore = max(RFACScoreList)
MaxAIndex = RFACScoreList.index(MaxACScore)
MaxPScore = max(RFPScoreList)
MaxPIndex = RFPScoreList.index(MaxPScore)
MaxRScore = max(RFRScoreList)
MaxRIndex = RFRScoreList.index(MaxRScore)
MaxF1Score = max(RFF1ScoreList)
MaxF1Index = RFF1ScoreList.index(MaxF1Score)
plt.plot(RFACScoreList, label ='Accuracy', linestyle = '--')
plt.plot(RFPScoreList, label = 'Precision')
plt.plot(RFRScoreList, label = 'Recall', linestyle = ':')
plt.plot(RFF1ScoreList, label = 'F1 Score', linestyle = ':')
plt.scatter(MaxAIndex, MaxACScore, color = 'blue', label='Max Accuracy Score')
plt.text(MaxAIndex, MaxACScore, "{:.3f}".format(MaxACScore))
plt.scatter(MaxPIndex, MaxPScore, color = 'orange', label='Max Precision Score')
plt.text(MaxPIndex, MaxPScore, "{:.3f}".format(MaxPScore))
plt.scatter(MaxRIndex, MaxRScore, color = 'green', label='Max Recall Score')
plt.text(MaxRIndex, MaxRScore, "{:.3f}".format(MaxRScore))
plt.scatter(MaxF1Index, MaxF1Score, color = 'red', label='Max F1 Score')
plt.text(MaxF1Index, MaxF1Score, "{:.3f}".format(MaxF1Score))
plt.legend(fontsize = 'xx-small')
plt.title('Scoring Random Forest Model - Coefficient Feature Modeling')
plt.xlabel('Number of features')
plt.ylabel('Score Value')
plt.grid(True)
plt.show()
```
#### Post Modeling Clustering
```python
import pandas as pd
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt
filename = "Cleaned_Data.csv"
df = pd.read_csv(filename)
df['readmitted_binary'] = df['readmitted'].isin(['<30','>30']).astype(int)
df['total_in'] = df['number_inpatient'] + df['number_emergency']
# K-Means clustering on number_inpatient and number_emergency
kMeansCluster = KMeans(n_clusters=5, random_state=123, init = 'random')
df['Cluster'] = kMeansCluster.fit_predict(df[['number_inpatient', 'number_emergency']])
# Create a DataFrame for cluster centroids
centroids_df = pd.DataFrame(kMeansCluster.cluster_centers_, columns=['number_inpatient', 'number_emergency'])
centroids_df['Cluster'] = range(5)
# Create the Faceted Scatter Plot
g = sns.FacetGrid(df, col='readmitted_binary', hue='Cluster', palette='Set1', height=5, aspect=1.2)
# Add scatter points for patients
g.map(sns.scatterplot, 'number_inpatient', 'number_emergency', alpha=0.6, s=100)
# Add centroids
for ax, readmit_status in zip(g.axes.flat, df['readmitted_binary'].unique()):
for cluster_id in centroids_df['Cluster']:
# Filter for current readmit status and cluster
cluster_points = df[(df['readmitted_binary'] == readmit_status)
& (df['Cluster'] == cluster_id)]
if not cluster_points.empty: # Only add centroid if cluster exists for the current readmit status
centroid = centroids_df[centroids_df['Cluster'] == cluster_id]
ax.scatter(centroid['number_inpatient'], centroid['number_emergency'], c='black', s=150, marker='X',
label=f'Centroid {cluster_id}')
# Final plot adjustments
g.fig.suptitle('Scatter Plot of Number of Inpatient Visits vs Number of Emergency Visits with Centroids Highlighted', y=1.02)
plt.show()
# K-Means clustering on total_in and number_emergency
kMeansCluster = KMeans(n_clusters=3, random_state=123, init = 'random')
df['Cluster'] = kMeansCluster.fit_predict(df[['total_in', 'number_emergency']])
# Create a DataFrame for cluster centroids
centroids_df = pd.DataFrame(kMeansCluster.cluster_centers_, columns=['total_in', 'number_emergency'])
centroids_df['Cluster'] = range(3)
# Create the Faceted Scatter Plot
g = sns.FacetGrid(df, col='readmitted_binary', hue='Cluster', palette='Set1', height=5, aspect=1.2)
# Add scatter points for patients
g.map(sns.scatterplot, 'total_in', 'number_emergency', alpha=0.6, s=100)
# Add centroids
for ax, readmit_status in zip(g.axes.flat, df['readmitted_binary'].unique()):
for cluster_id in centroids_df['Cluster']:
# Filter for current readmit status and cluster
cluster_points = df[(df['readmitted_binary'] == readmit_status)
& (df['Cluster'] == cluster_id)]
if not cluster_points.empty: # Only add centroid if cluster exists for the current readmit status
centroid = centroids_df[centroids_df['Cluster'] == cluster_id]
ax.scatter(centroid['total_in'], centroid['number_emergency'], c='black', s=150, marker='X',
label=f'Centroid {cluster_id}')
# Final plot adjustments
g.fig.suptitle('Scatter Plot of Total Hospital Visits (Emergency + Inpatient) vs Emergency Visits with Centroids Highlighted', y=1.02)
plt.show()
```
#### EDA
```python
#%%
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (roc_auc_score, precision_recall_curve, auc,
confusion_matrix, classification_report, silhouette_score)
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
import xgboost as xgb
from typing import List, Dict, Tuple
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import os
#%%
full_numerical = ['time_in_hospital','num_lab_procedures','num_procedures','num_medications','number_outpatient','number_emergency','number_inpatient','number_diagnoses']
full_categorical = ['change', 'repaglinide', 'pioglitazone', 'glimepiride-pioglitazone','payer_code', 'tolazamide', 'metformin-pioglitazone', 'age','nateglinide', 'metformin', 'tolbutamide', 'troglitazone','insulin', 'glyburide', 'citoglipton', 'acetohexamide','miglitol', 'gender', 'glimepiride', 'glyburide-metformin','chlorpropamide', 'race', 'metformin-rosiglitazone','medical_specialty', 'rosiglitazone', 'diabetesMed','max_glu_serum', 'glipizide-metformin','glipizide', 'examide', 'acarbose', 'A1Cresult']
def prepare_features(df: pd.DataFrame, numerical_features: List[str],
categorical_features: List[str]) -> pd.DataFrame:
df_subset = df[numerical_features + categorical_features].copy()
# encode categoricals
df_encoded = pd.get_dummies(df_subset, columns=categorical_features)
scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
return df_encoded
def train_and_evaluate_model(X: pd.DataFrame, y: pd.Series, model_name: str = "Model"):
# train test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# toggle scale_pos_weight, it's to account for class imbalance.
# not much class imbalance here, so experiment
model = xgb.XGBClassifier(objective='binary:logistic',eval_metric='auc',scale_pos_weight=len(y[y==0]) / len(y[y==1]),max_depth=4,min_child_weight=6,subsample=0.8,colsample_bytree=0.8,random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# f1, recall
y_pred_proba = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': model.feature_importances_}).sort_values('importance', ascending=False)
metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred, output_dict=True),'precision': precision,'recall': recall}
return model, feature_importance, metrics
# function to search different combinations of parameters for XGBoost
def train_and_evaluate_model_with_grid_search(X: pd.DataFrame, y: pd.Series, model_name: str = "Model"):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# define search space
param_grid = {'max_depth': [3, 4, 5],'min_child_weight': [4, 5, 6],'gamma': [0.1, 0.2, 0.3],'subsample': [0.7, 0.8, 0.9],'colsample_bytree': [0.7, 0.8, 0.9],'scale_pos_weight': [len(y[y==0]) / len(y[y==1])]}
base_model = xgb.XGBClassifier(objective='binary:logistic',eval_metric='auc',random_state=42)
grid_search = GridSearchCV(estimator=base_model,param_grid=param_grid,scoring='roc_auc',cv=5,verbose=1,n_jobs=-1)
# find something to do while this runs
print(f"grid searching underway for {model_name}")
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print("best params found:")
for param, value in grid_search.best_params_.items():
print(f"{param}: {value}")
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': best_model.feature_importances_}).sort_values('importance', ascending=False)
metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred, output_dict=True),'precision': precision,'recall': recall,'best_params': grid_search.best_params_,'cv_results': grid_search.cv_results_}
return best_model, feature_importance, metrics
def visualize_confusion_matrix(confusion_matrices: Dict[str, np.ndarray], save_path: str = None):
n_models = len(confusion_matrices)
fig, axes = plt.subplots(1, n_models, figsize=(6*n_models, 5))
if n_models == 1:
axes = [axes]
for ax, (model_name, conf_matrix) in zip(axes, confusion_matrices.items()):
sns.heatmap(conf_matrix,annot=True,fmt='d',cmap='Blues',ax=ax)
ax.set_xlabel('pred')
ax.set_ylabel('actual')
ax.set_title(f'confusion matrix - {model_name}')
plt.tight_layout()
# set dpi=300 for higher resolution
plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
diabetes_df = pd.read_csv('/kaggle/input/diabetes-dataset/diabetic_data.csv')
diabetes_df['readmitted_binary'] = (diabetes_df['readmitted'] != 'NO').astype(int)
diabetes_df['age'] = diabetes_df['age'].str.replace('[', '', regex=False).str.replace('-', '_', regex=False).str.replace(')', '', regex=False)
X_full = prepare_features(diabetes_df, full_numerical, full_categorical)
model, importance, metrics = train_and_evaluate_model(
X_full, diabetes_df['readmitted_binary'], "Full Model"
)
print(f"auc rocc Score: {metrics['auc_roc']:.3f}")
print(f"crossval : {metrics['cv_mean']:.3f} (+/- {metrics['cv_std']*2:.3f})")
print("\n10 most importantt features:")
print(importance.head(10))
confusion_matrices = {"Optimized Model": metrics['confusion_matrix']}
visualize_confusion_matrix(confusion_matrices, save_path='confusion_matrices.png')
full_numerical = ['time_in_hospital','num_lab_procedures','num_procedures','num_medications','number_outpatient','number_emergency','number_inpatient','number_diagnoses']
full_categorical = ['change', 'repaglinide', 'pioglitazone', 'glimepiride-pioglitazone','payer_code', 'tolazamide', 'metformin-pioglitazone', 'age','nateglinide', 'metformin', 'tolbutamide', 'troglitazone','insulin', 'glyburide', 'citoglipton', 'acetohexamide','miglitol', 'gender', 'glimepiride', 'glyburide-metformin','chlorpropamide', 'race', 'metformin-rosiglitazone','rosiglitazone', 'diabetesMed','max_glu_serum', 'glipizide-metformin','glipizide', 'examide', 'acarbose', 'A1Cresult']
def prepare_features(df, numerical_features,categorical_features):
df_subset = df[numerical_features + categorical_features].copy()
df_encoded = pd.get_dummies(df_subset, columns=categorical_features)
# standard scaling for now
scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
return df_encoded
def train_and_evaluate_model(X: pd.DataFrame, y: pd.Series, model_name: str = "Model") -> Tuple:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
'''
colsample_bytree: 0.9
gamma: 0.3
max_depth: 4
min_child_weight: 4
scale_pos_weight: 1.169758219265703
subsample: 0.9
'''
model = xgb.XGBClassifier(objective='binary:logistic',colsample_bytree=0.9,gamma=0.3,min_child_weight=4,scale_pos_weight=1.169758219265703,subsample=0.9,eval_metric='auc',use_label_encoder=False,random_state=42)
'''
# base model
model = xgb.XGBClassifier(
objective='binary:logistic',
eval_metric='auc',
use_label_encoder=False,
random_state=42
)
'''
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': model.feature_importances_}).sort_values('importance', ascending=False)
    metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred, output_dict=True),'precision': precision,'recall': recall}
return model, feature_importance, metrics
def visualize_confusion_matrix(confusion_matrices: Dict[str, np.ndarray], save_path: str = None):
n_models = len(confusion_matrices)
fig, axes = plt.subplots(1, n_models, figsize=(6*n_models, 5))
if n_models == 1:
axes = [axes]
for ax, (model_name, conf_matrix) in zip(axes, confusion_matrices.items()):
sns.heatmap(conf_matrix,annot=True,fmt='d',cmap='Blues',ax=ax)
ax.set_xlabel('pred')
ax.set_ylabel('actual')
        ax.set_title(f'confusion matrix - {model_name}')
plt.tight_layout()
plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
def display_classification_metrics(metrics: Dict):
report = metrics['classification_report']
for class_label in ['0', '1']:
        class_name = 'No Readmission' if class_label == '0' else 'Readmission'
        print(f"Class: {class_name}")
        print(f"Precision: {report[class_label]['precision']:.3f}")
        print(f"Recall: {report[class_label]['recall']:.3f}")
        print(f"F1-Score: {report[class_label]['f1-score']:.3f}")
    print("Overall Weighted Averages:")
    print(f"Precision: {report['weighted avg']['precision']:.3f}")
    print(f"Recall: {report['weighted avg']['recall']:.3f}")
    print(f"F1-Score: {report['weighted avg']['f1-score']:.3f}")
display_classification_metrics(metrics)
def analyze_feature_importance(importance_df: pd.DataFrame, top_n: int = 15):
print(f"{top_n} most important predictors:")
for idx, row in importance_df.head(top_n).iterrows():
print(f"{row['feature']:<30} Importance: {row['importance']:.4f}")
plt.figure(figsize=(12, 6))
sns.barplot(x='importance', y='feature',data=importance_df.head(top_n))
plt.title('Feature Importance for Readmission Prediction')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()
analyze_feature_importance(importance)
def analyze_shap_values(model: xgb.XGBClassifier, X: pd.DataFrame, max_display: int = 20, save_path: str = ''):
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
plt.figure(figsize=(10, 12))
shap.summary_plot(shap_values, X, max_display=max_display, show=False)
plt.tight_layout()
plt.savefig(save_path, bbox_inches='tight', dpi=300)
plt.show()
feature_importance = pd.DataFrame(np.abs(shap_values).mean(0),index=X.columns,columns=['importance']).sort_values('importance', ascending=False)
top_features = feature_importance.head(3).index
plt.figure(figsize=(15, 5))
for i, feature in enumerate(top_features, 1):
plt.subplot(1, 3, i)
shap.dependence_plot(feature, shap_values, X, show=False)
        plt.title(f'SHAP Dependence Plot for {feature}')
plt.tight_layout()
base, ext = os.path.splitext(save_path)
plt.savefig(f"{base}_dependence{ext}", bbox_inches='tight', dpi=300)
plt.show()
print("analyzing feature importance with shapp values")
analyze_shap_values(model, X_full, save_path='shap_summary.png')
def visualize_pca_clusters(X: pd.DataFrame, y: pd.Series,
numerical_features: List[str]):
X_nums = X[numerical_features]
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_nums)
plot_df = pd.DataFrame({'PC1': X_pca[:, 0],'PC2': X_pca[:, 1],'Readmitted': y.map({0: 'No Readmission', 1: 'Readmitted'})})
var_explained = pca.explained_variance_ratio_ * 100
pc1_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[0],'Abs_Contribution': abs(pca.components_[0])}).sort_values('Abs_Contribution', ascending=False)
pc2_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[1],'Abs_Contribution': abs(pca.components_[1])}).sort_values('Abs_Contribution', ascending=False)
plt.figure(figsize=(12, 8))
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='Readmitted',alpha=0.6, style='Readmitted')
    plt.title('PCA for patients')
plt.xlabel(f'First Principal Component {var_explained[0]:.1f}% variance explained')
plt.ylabel(f'Second Principal Component {var_explained[1]:.1f}% variance explained')
plt.tight_layout()
plt.savefig(f"PCA.png", bbox_inches='tight', dpi=300)
plt.show()
print(f" variance explained by these two components: {sum(var_explained):.1f}%")
print(f" explains {var_explained[0]:.1f}% of the variance")
for _, row in pc1_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC1 (contribution: {abs(row['Contribution']):.3f})")
print(f"Explains {var_explained[1]:.1f}% of the variance")
# make sure to handle 'decreasing' case!
for _, row in pc2_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC2 (contribution: {abs(row['Contribution']):.3f})")
visualize_pca_clusters(X_full[full_numerical],diabetes_df['readmitted_binary'], full_numerical)
def visualize_pca_clusters_mod(X: pd.DataFrame, y: pd.Series,
numerical_features: List[str], n_clusters: int = 3):
X_nums = X[numerical_features]
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_nums)
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
clusters = kmeans.fit_predict(X_pca)
silhouette_avg = silhouette_score(X_pca, clusters)
plot_df = pd.DataFrame({'PC1': X_pca[:, 0],'PC2': X_pca[:, 1],'Readmitted': y.map({0: 'No Readmission', 1: 'Readmitted'}),'Cluster': clusters})
cluster_centers = kmeans.cluster_centers_
pc1_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[0],'Abs_Contribution': abs(pca.components_[0])}).sort_values('Abs_Contribution', ascending=False)
pc2_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[1],'Abs_Contribution': abs(pca.components_[1])}).sort_values('Abs_Contribution', ascending=False)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='Readmitted', alpha=0.6, style='Readmitted', ax=ax1)
ax1.set_title('PCA by Readmission Status')
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='Cluster',palette='deep', alpha=0.6, ax=ax2)
ax2.scatter(cluster_centers[:, 0], cluster_centers[:, 1],c='red', marker='x', s=200, linewidths=3, label='Cluster Centers')
ax2.set_title(f'K-means Clusters (k={n_clusters}, silhouette={silhouette_avg:.3f})')
plt.tight_layout()
plt.show()
cluster_stats = []
for i in range(n_clusters):
cluster_data = plot_df[plot_df['Cluster'] == i]
readmission_rate = (cluster_data['Readmitted'] == 'Readmitted').mean() * 100
cluster_size = len(cluster_data)
cluster_stats.append({'Cluster': i,'Size': cluster_size,'Readmission Rate (%)': readmission_rate})
cluster_stats_df = pd.DataFrame(cluster_stats)
print("\nCluster Statistics:")
print(cluster_stats_df.to_string(index=False))
print(f"\nTotal variance explained by these two components: {sum(pca.explained_variance_ratio_ * 100):.1f}%")
print("first PCA:")
print(f"Explains {pca.explained_variance_ratio_[0] * 100:.1f}% of the variance")
print("\nTop contributing features:")
for _, row in pc1_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC1 (contribution: {abs(row['Contribution']):.3f})")
print("second oca:")
print(f"Explains {pca.explained_variance_ratio_[1] * 100:.1f}% of the variance")
print("\nTop contributing features:")
for _, row in pc2_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC2 (contribution: {abs(row['Contribution']):.3f})")
return cluster_stats_df, kmeans, X_pca
#TODO: use elbow-method
cluster_stats, kmeans_model, X_pca = visualize_pca_clusters_mod(X_full[full_numerical],diabetes_df['readmitted_binary'],full_numerical,n_clusters=3)
# Random Forest
def train_and_evaluate_random_forest(X: pd.DataFrame, y: pd.Series,
model_name: str = "Random Forest") -> Tuple:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100,max_depth=None,min_samples_split=2,min_samples_leaf=1,random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': model.feature_importances_}).sort_values('importance', ascending=False)
metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred,output_dict=True),'precision': precision,'recall': recall}
return model, feature_importance, metrics
rf_model, rf_importance, rf_metrics = train_and_evaluate_random_forest(X_full, diabetes_df['readmitted_binary'], "Random Forest Model")
print(f"Randomforest")
print(f"aucruc Score: {rf_metrics['auc_roc']:.3f}")
print(f"cross-val Score: {rf_metrics['cv_mean']:.3f} "
f"(+/- {rf_metrics['cv_std']*2:.3f})")
print("10 most stronkest features:")
print(rf_importance.head(10))
confusion_matrices = {}
confusion_matrices["Random Forest"] = rf_metrics['confusion_matrix']
visualize_confusion_matrix(confusion_matrices, save_path='confusion_matrices.png')
display_classification_metrics(rf_metrics)
analyze_feature_importance(rf_importance)
```
#### Model Comparisons
```python
#%%
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (roc_auc_score, precision_recall_curve, auc,
confusion_matrix, classification_report, silhouette_score)
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
import xgboost as xgb
from typing import List, Dict, Tuple
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import os
#%%
full_numerical = ['time_in_hospital','num_lab_procedures','num_procedures','num_medications','number_outpatient','number_emergency','number_inpatient','number_diagnoses']
full_categorical = ['change', 'repaglinide', 'pioglitazone', 'glimepiride-pioglitazone','payer_code', 'tolazamide', 'metformin-pioglitazone', 'age','nateglinide', 'metformin', 'tolbutamide', 'troglitazone','insulin', 'glyburide', 'citoglipton', 'acetohexamide','miglitol', 'gender', 'glimepiride', 'glyburide-metformin','chlorpropamide', 'race', 'metformin-rosiglitazone','medical_specialty', 'rosiglitazone', 'diabetesMed','max_glu_serum', 'glipizide-metformin','glipizide', 'examide', 'acarbose', 'A1Cresult']
def prepare_features(df: pd.DataFrame, numerical_features: List[str],
categorical_features: List[str]) -> pd.DataFrame:
df_subset = df[numerical_features + categorical_features].copy()
# encode categoricals
df_encoded = pd.get_dummies(df_subset, columns=categorical_features)
scaler = StandardScaler()
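    # note: the scaler is fit on the full dataset here, before the train/test
    # split, so test-set statistics leak into the scaling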
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
return df_encoded
def train_and_evaluate_model(X: pd.DataFrame, y: pd.Series, model_name: str = "Model"):
# train test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
    # scale_pos_weight = (# negatives) / (# positives) compensates for class
    # imbalance; the classes here are nearly balanced (ratio ~1.17), so it is
    # worth experimenting with toggling it
model = xgb.XGBClassifier(objective='binary:logistic',eval_metric='auc',scale_pos_weight=len(y[y==0]) / len(y[y==1]),max_depth=4,min_child_weight=6,subsample=0.8,colsample_bytree=0.8,random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
    # predicted probabilities for the precision-recall curve
y_pred_proba = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': model.feature_importances_}).sort_values('importance', ascending=False)
metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred, output_dict=True),'precision': precision,'recall': recall}
return model, feature_importance, metrics
# function to search different combinations of parameters for XGBoost
def train_and_evaluate_model_with_grid_search(X: pd.DataFrame, y: pd.Series, model_name: str = "Model"):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# define search space
param_grid = {'max_depth': [3, 4, 5],'min_child_weight': [4, 5, 6],'gamma': [0.1, 0.2, 0.3],'subsample': [0.7, 0.8, 0.9],'colsample_bytree': [0.7, 0.8, 0.9],'scale_pos_weight': [len(y[y==0]) / len(y[y==1])]}
base_model = xgb.XGBClassifier(objective='binary:logistic',eval_metric='auc',random_state=42)
grid_search = GridSearchCV(estimator=base_model,param_grid=param_grid,scoring='roc_auc',cv=5,verbose=1,n_jobs=-1)
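    # 3^5 = 243 parameter combinations x 5 CV folds = 1,215 model fits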
# find something to do while this runs
print(f"grid searching underway for {model_name}")
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print("best params found:")
for param, value in grid_search.best_params_.items():
print(f"{param}: {value}")
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': best_model.feature_importances_}).sort_values('importance', ascending=False)
metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred, output_dict=True),'precision': precision,'recall': recall,'best_params': grid_search.best_params_,'cv_results': grid_search.cv_results_}
return best_model, feature_importance, metrics
def visualize_confusion_matrix(confusion_matrices: Dict[str, np.ndarray], save_path: str = None):
n_models = len(confusion_matrices)
fig, axes = plt.subplots(1, n_models, figsize=(6*n_models, 5))
if n_models == 1:
axes = [axes]
for ax, (model_name, conf_matrix) in zip(axes, confusion_matrices.items()):
sns.heatmap(conf_matrix,annot=True,fmt='d',cmap='Blues',ax=ax)
        ax.set_xlabel('Predicted')
        ax.set_ylabel('Actual')
        ax.set_title(f'Confusion Matrix - {model_name}')
plt.tight_layout()
    # save only when a path is given; dpi=300 for higher resolution
    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
diabetes_df = pd.read_csv('/kaggle/input/diabetes-dataset/diabetic_data.csv')
# binary target: 1 if readmitted at all ('<30' or '>30'), 0 otherwise
diabetes_df['readmitted_binary'] = (diabetes_df['readmitted'] != 'NO').astype(int)
# normalize age brackets, e.g. '[70-80)' -> '70_80'; regex=False so '[' and ')' are treated literally
diabetes_df['age'] = diabetes_df['age'].str.replace('[', '', regex=False).str.replace('-', '_', regex=False).str.replace(')', '', regex=False)
X_full = prepare_features(diabetes_df, full_numerical, full_categorical)
model, importance, metrics = train_and_evaluate_model(
X_full, diabetes_df['readmitted_binary'], "Full Model"
)
print(f"auc rocc Score: {metrics['auc_roc']:.3f}")
print(f"crossval : {metrics['cv_mean']:.3f} (+/- {metrics['cv_std']*2:.3f})")
print("\n10 most importantt features:")
print(importance.head(10))
confusion_matrices = {"Optimized Model": metrics['confusion_matrix']}
visualize_confusion_matrix(confusion_matrices, save_path='confusion_matrices.png')
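# redefine the feature lists for a second pass; note that 'medical_specialty'
# is dropped from the categorical set here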
full_numerical = ['time_in_hospital','num_lab_procedures','num_procedures','num_medications','number_outpatient','number_emergency','number_inpatient','number_diagnoses']
full_categorical = ['change', 'repaglinide', 'pioglitazone', 'glimepiride-pioglitazone','payer_code', 'tolazamide', 'metformin-pioglitazone', 'age','nateglinide', 'metformin', 'tolbutamide', 'troglitazone','insulin', 'glyburide', 'citoglipton', 'acetohexamide','miglitol', 'gender', 'glimepiride', 'glyburide-metformin','chlorpropamide', 'race', 'metformin-rosiglitazone','rosiglitazone', 'diabetesMed','max_glu_serum', 'glipizide-metformin','glipizide', 'examide', 'acarbose', 'A1Cresult']
def prepare_features(df, numerical_features,categorical_features):
df_subset = df[numerical_features + categorical_features].copy()
df_encoded = pd.get_dummies(df_subset, columns=categorical_features)
# standard scaling for now
scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
return df_encoded
def train_and_evaluate_model(X: pd.DataFrame, y: pd.Series, model_name: str = "Model") -> Tuple:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # best hyperparameters found by the grid search above:
    #   colsample_bytree: 0.9
    #   gamma: 0.3
    #   max_depth: 4
    #   min_child_weight: 4
    #   scale_pos_weight: 1.169758219265703
    #   subsample: 0.9
    model = xgb.XGBClassifier(
        objective='binary:logistic',
        colsample_bytree=0.9,
        gamma=0.3,
        max_depth=4,
        min_child_weight=4,
        scale_pos_weight=1.169758219265703,
        subsample=0.9,
        eval_metric='auc',
        use_label_encoder=False,
        random_state=42
    )
'''
# base model
model = xgb.XGBClassifier(
objective='binary:logistic',
eval_metric='auc',
use_label_encoder=False,
random_state=42
)
'''
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': model.feature_importances_}).sort_values('importance', ascending=False)
    metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred, output_dict=True),'precision': precision,'recall': recall}
return model, feature_importance, metrics
def visualize_confusion_matrix(confusion_matrices: Dict[str, np.ndarray], save_path: str = None):
n_models = len(confusion_matrices)
fig, axes = plt.subplots(1, n_models, figsize=(6*n_models, 5))
if n_models == 1:
axes = [axes]
for ax, (model_name, conf_matrix) in zip(axes, confusion_matrices.items()):
sns.heatmap(conf_matrix,annot=True,fmt='d',cmap='Blues',ax=ax)
        ax.set_xlabel('Predicted')
        ax.set_ylabel('Actual')
        ax.set_title(f'Confusion Matrix - {model_name}')
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
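# re-run feature preparation and training so the reduced feature set and tuned
# hyperparameters above take effect (assumption: the original notebook re-ran
# this cell at this point)
X_full = prepare_features(diabetes_df, full_numerical, full_categorical)
model, importance, metrics = train_and_evaluate_model(X_full, diabetes_df['readmitted_binary'], "Optimized Model")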
def display_classification_metrics(metrics: Dict):
report = metrics['classification_report']
for class_label in ['0', '1']:
class_name = 'No Readmission' if class_label == '0' else 'Readmission'
print(f"Class {class_name}")
print(f"Precision {report[class_label]['precision']:.3f}")
print(f"Recall{report[class_label]['recall']:.3f}")
print(f"F1-Score {report[class_label]['f1-score']:.3f}")
print(f"precision{report['weighted avg']['precision']:.3f}")
print(f"recall{report['weighted avg']['recall']:.3f}")
print(f"f1-Score{report['weighted avg']['f1-score']:.3f}")
display_classification_metrics(metrics)
def analyze_feature_importance(importance_df: pd.DataFrame, top_n: int = 15):
print(f"{top_n} most important predictors:")
for idx, row in importance_df.head(top_n).iterrows():
print(f"{row['feature']:<30} Importance: {row['importance']:.4f}")
plt.figure(figsize=(12, 6))
sns.barplot(x='importance', y='feature',data=importance_df.head(top_n))
plt.title('Feature Importance for Readmission Prediction')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()
analyze_feature_importance(importance)
def analyze_shap_values(model: xgb.XGBClassifier, X: pd.DataFrame, max_display: int = 20, save_path: str = ''):
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
plt.figure(figsize=(10, 12))
shap.summary_plot(shap_values, X, max_display=max_display, show=False)
plt.tight_layout()
    if save_path:
        plt.savefig(save_path, bbox_inches='tight', dpi=300)
plt.show()
feature_importance = pd.DataFrame(np.abs(shap_values).mean(0),index=X.columns,columns=['importance']).sort_values('importance', ascending=False)
top_features = feature_importance.head(3).index
plt.figure(figsize=(15, 5))
for i, feature in enumerate(top_features, 1):
plt.subplot(1, 3, i)
shap.dependence_plot(feature, shap_values, X, show=False)
        plt.title(f'SHAP Dependence Plot for {feature}')
plt.tight_layout()
    if save_path:
        base, ext = os.path.splitext(save_path)
        plt.savefig(f"{base}_dependence{ext}", bbox_inches='tight', dpi=300)
plt.show()
print("analyzing feature importance with shapp values")
analyze_shap_values(model, X_full, save_path='shap_summary.png')
def visualize_pca_clusters(X: pd.DataFrame, y: pd.Series,
numerical_features: List[str]):
X_nums = X[numerical_features]
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_nums)
plot_df = pd.DataFrame({'PC1': X_pca[:, 0],'PC2': X_pca[:, 1],'Readmitted': y.map({0: 'No Readmission', 1: 'Readmitted'})})
var_explained = pca.explained_variance_ratio_ * 100
pc1_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[0],'Abs_Contribution': abs(pca.components_[0])}).sort_values('Abs_Contribution', ascending=False)
pc2_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[1],'Abs_Contribution': abs(pca.components_[1])}).sort_values('Abs_Contribution', ascending=False)
plt.figure(figsize=(12, 8))
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='Readmitted',alpha=0.6, style='Readmitted')
    plt.title('PCA for Patients')
plt.xlabel(f'First Principal Component {var_explained[0]:.1f}% variance explained')
plt.ylabel(f'Second Principal Component {var_explained[1]:.1f}% variance explained')
plt.tight_layout()
plt.savefig(f"PCA.png", bbox_inches='tight', dpi=300)
plt.show()
print(f" variance explained by these two components: {sum(var_explained):.1f}%")
print(f" explains {var_explained[0]:.1f}% of the variance")
for _, row in pc1_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC1 (contribution: {abs(row['Contribution']):.3f})")
print(f"Explains {var_explained[1]:.1f}% of the variance")
# make sure to handle 'decreasing' case!
for _, row in pc2_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC2 (contribution: {abs(row['Contribution']):.3f})")
visualize_pca_clusters(X_full[full_numerical],diabetes_df['readmitted_binary'], full_numerical)
def visualize_pca_clusters_mod(X: pd.DataFrame, y: pd.Series,
numerical_features: List[str], n_clusters: int = 3):
X_nums = X[numerical_features]
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_nums)
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
clusters = kmeans.fit_predict(X_pca)
silhouette_avg = silhouette_score(X_pca, clusters)
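    # silhouette ranges from -1 to 1; higher values indicate better-separated clusters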
plot_df = pd.DataFrame({'PC1': X_pca[:, 0],'PC2': X_pca[:, 1],'Readmitted': y.map({0: 'No Readmission', 1: 'Readmitted'}),'Cluster': clusters})
cluster_centers = kmeans.cluster_centers_
pc1_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[0],'Abs_Contribution': abs(pca.components_[0])}).sort_values('Abs_Contribution', ascending=False)
pc2_contributions = pd.DataFrame({'Feature': numerical_features,'Contribution': pca.components_[1],'Abs_Contribution': abs(pca.components_[1])}).sort_values('Abs_Contribution', ascending=False)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='Readmitted', alpha=0.6, style='Readmitted', ax=ax1)
ax1.set_title('PCA by Readmission Status')
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='Cluster',palette='deep', alpha=0.6, ax=ax2)
ax2.scatter(cluster_centers[:, 0], cluster_centers[:, 1],c='red', marker='x', s=200, linewidths=3, label='Cluster Centers')
ax2.set_title(f'K-means Clusters (k={n_clusters}, silhouette={silhouette_avg:.3f})')
plt.tight_layout()
plt.show()
cluster_stats = []
for i in range(n_clusters):
cluster_data = plot_df[plot_df['Cluster'] == i]
readmission_rate = (cluster_data['Readmitted'] == 'Readmitted').mean() * 100
cluster_size = len(cluster_data)
cluster_stats.append({'Cluster': i,'Size': cluster_size,'Readmission Rate (%)': readmission_rate})
cluster_stats_df = pd.DataFrame(cluster_stats)
print("\nCluster Statistics:")
print(cluster_stats_df.to_string(index=False))
print(f"\nTotal variance explained by these two components: {sum(pca.explained_variance_ratio_ * 100):.1f}%")
print("first PCA:")
print(f"Explains {pca.explained_variance_ratio_[0] * 100:.1f}% of the variance")
print("\nTop contributing features:")
for _, row in pc1_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC1 (contribution: {abs(row['Contribution']):.3f})")
print("second oca:")
print(f"Explains {pca.explained_variance_ratio_[1] * 100:.1f}% of the variance")
print("\nTop contributing features:")
for _, row in pc2_contributions.head(3).iterrows():
direction = "increases" if row['Contribution'] > 0 else "decreases"
print(f"- {row['Feature']}: {direction} PC2 (contribution: {abs(row['Contribution']):.3f})")
return cluster_stats_df, kmeans, X_pca
# TODO: choose n_clusters via the elbow method; a sketch follows the call below
cluster_stats, kmeans_model, X_pca = visualize_pca_clusters_mod(X_full[full_numerical],diabetes_df['readmitted_binary'],full_numerical,n_clusters=3)
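# a minimal elbow-method sketch for the TODO above (assumption: it reuses the
# X_pca projection returned by the call above); plot inertia for k = 1..9 and
# look for the "elbow" where the curve flattens
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_pca)
    inertias.append(km.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster SSE)')
plt.title('Elbow Method for Choosing k')
plt.tight_layout()
plt.show()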
diabetes_df
# Random Forest
def train_and_evaluate_random_forest(X: pd.DataFrame, y: pd.Series,
model_name: str = "Random Forest") -> Tuple:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100,max_depth=None,min_samples_split=2,min_samples_leaf=1,random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
feature_importance = pd.DataFrame({'feature': X.columns,'importance': model.feature_importances_}).sort_values('importance', ascending=False)
metrics = {'auc_roc': roc_auc_score(y_test, y_pred_proba),'cv_mean': cv_scores.mean(),'cv_std': cv_scores.std(),'confusion_matrix': confusion_matrix(y_test, y_pred),'classification_report': classification_report(y_test, y_pred,output_dict=True),'precision': precision,'recall': recall}
return model, feature_importance, metrics
rf_model, rf_importance, rf_metrics = train_and_evaluate_random_forest(X_full, diabetes_df['readmitted_binary'], "Random Forest Model")
print(f"Randomforest")
print(f"aucruc Score: {rf_metrics['auc_roc']:.3f}")
print(f"cross-val Score: {rf_metrics['cv_mean']:.3f} "
f"(+/- {rf_metrics['cv_std']*2:.3f})")
print("10 most important features:")
print(rf_importance.head(10))
confusion_matrices = {}
confusion_matrices["Random Forest"] = rf_metrics['confusion_matrix']
visualize_confusion_matrix(confusion_matrices, save_path='confusion_matrices.png')
display_classification_metrics(rf_metrics)
analyze_feature_importance(rf_importance)
```
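The listing trains the XGBoost and Random Forest models in separate cells but never tabulates them side by side; below is a minimal sketch for doing so, assuming the `metrics` (XGBoost) and `rf_metrics` (Random Forest) dictionaries from the listing above are still in scope.

```python
# hypothetical side-by-side summary of the two models' headline metrics;
# assumes `metrics` and `rf_metrics` from the listing above are in scope
comparison = pd.DataFrame([
    {'Model': 'XGBoost', 'AUC-ROC': metrics['auc_roc'],
     'CV mean (roc_auc)': metrics['cv_mean'], 'CV std': metrics['cv_std']},
    {'Model': 'Random Forest', 'AUC-ROC': rf_metrics['auc_roc'],
     'CV mean (roc_auc)': rf_metrics['cv_mean'], 'CV std': rf_metrics['cv_std']},
])
print(comparison.to_string(index=False))
```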
## Live Notebook
https://www.kaggle.com/code/dataranch/diabetes-readmission-prediction