Pediatric Appendicitis Classification - EDA

[Kaggle Notebook](https://www.kaggle.com/code/dataranch/pediatric-appendicitis-classification-eda)

- This project focuses on developing a machine learning model to assist in the diagnosis of appendicitis in pediatric patients using the Regensburg Pediatric Appendicitis dataset. The main goal is to create a reliable classification model that can help identify appendicitis cases based on clinical, laboratory, and ultrasound findings.
- This is the EDA portion.

![[Appendicitis_Image.png]]

### Dataset Description

- **Source**: Children's Hospital St. Hedwig in Regensburg, Germany (2016-2021)
- **Size**: 782 patients
- **Features**: 53 variables including:
  - Clinical measurements (e.g., Age, BMI, Body Temperature)
  - Laboratory findings (e.g., WBC Count, CRP, Neutrophil Percentage)
  - Physical examination results (e.g., Migratory Pain, Rebound Tenderness)
  - Clinical scoring systems (Alvarado Score, Pediatric Appendicitis Score)
  - Ultrasound findings
- **Target Variables**:
  - Primary: Diagnosis (appendicitis vs. no appendicitis)
  - Secondary: Management (surgical vs. conservative)
  - Tertiary: Severity (complicated vs. uncomplicated/no appendicitis)

https://archive.ics.uci.edu/dataset/938/regensburg+pediatric+appendicitis

```python
# imports needed for milestone 1
import pandas as pd
import matplotlib.pyplot as plt
# better looking plots
import seaborn as sns
```

```python
df = pd.read_excel('/kaggle/input/childrens-hospital-regensburg-appendicitis/app_data.xlsx')
# this is so i can view the data locally
# df.to_csv('app_data.csv', index=False)
```

- Let's take a look at the first few rows...
```python df.head(10) ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>BMI</th> <th>Sex</th> <th>Height</th> <th>Weight</th> <th>Length_of_Stay</th> <th>Management</th> <th>Severity</th> <th>Diagnosis_Presumptive</th> <th>Diagnosis</th> <th>...</th> <th>Abscess_Location</th> <th>Pathological_Lymph_Nodes</th> <th>Lymph_Nodes_Location</th> <th>Bowel_Wall_Thickening</th> <th>Conglomerate_of_Bowel_Loops</th> <th>Ileus</th> <th>Coprostasis</th> <th>Meteorism</th> <th>Enteritis</th> <th>Gynecological_Findings</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>12.68</td> <td>16.9</td> <td>female</td> <td>148.0</td> <td>37.0</td> <td>3.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>appendicitis</td> <td>...</td> <td>NaN</td> <td>yes</td> <td>reUB</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>14.10</td> <td>31.9</td> <td>male</td> <td>147.0</td> <td>69.5</td> <td>2.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>no appendicitis</td> <td>...</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>yes</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>2</th> <td>14.14</td> <td>23.3</td> <td>female</td> <td>163.0</td> <td>62.0</td> <td>4.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>no appendicitis</td> <td>...</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>yes</td> <td>yes</td> <td>NaN</td> </tr> <tr> <th>3</th> <td>16.37</td> <td>20.6</td> <td>female</td> <td>165.0</td> <td>56.0</td> <td>3.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>no 
appendicitis</td> <td>...</td> <td>NaN</td> <td>yes</td> <td>reUB</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>yes</td> <td>NaN</td> </tr> <tr> <th>4</th> <td>11.08</td> <td>16.9</td> <td>female</td> <td>163.0</td> <td>45.0</td> <td>3.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>appendicitis</td> <td>...</td> <td>NaN</td> <td>yes</td> <td>reUB</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>yes</td> <td>NaN</td> </tr> <tr> <th>5</th> <td>11.05</td> <td>30.7</td> <td>male</td> <td>121.0</td> <td>45.0</td> <td>3.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>no appendicitis</td> <td>...</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>6</th> <td>8.98</td> <td>19.4</td> <td>female</td> <td>140.0</td> <td>38.5</td> <td>3.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>no appendicitis</td> <td>...</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>7</th> <td>7.06</td> <td>NaN</td> <td>female</td> <td>NaN</td> <td>21.5</td> <td>2.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>no appendicitis</td> <td>...</td> <td>NaN</td> <td>yes</td> <td>re UB</td> <td>no</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>8</th> <td>7.90</td> <td>15.7</td> <td>male</td> <td>131.0</td> <td>26.7</td> <td>3.0</td> <td>conservative</td> <td>uncomplicated</td> <td>appendicitis</td> <td>no appendicitis</td> <td>...</td> <td>NaN</td> <td>yes</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>yes</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>9</th> <td>14.34</td> <td>14.9</td> <td>male</td> <td>174.0</td> <td>45.5</td> <td>3.0</td> <td>conservative</td> 
<td>uncomplicated</td> <td>appendicitis</td> <td>appendicitis</td> <td>...</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> <td>NaN</td> </tr> </tbody> </table> <p>10 rows × 58 columns</p> </div>

- The table has 58 columns, which gives us 57 features once the 'Diagnosis' target is set aside.
- From the first 10 rows, we can see a missing value in the 'Height' column (row 7), so we'll need to investigate further.
- Many columns toward the right side of the df also have missing values.
- The target column in this case is 'Diagnosis', a binary column with the values 'appendicitis' and 'no appendicitis'.
- We'll need to convert this and the other categorical variables to a numeric format for our model.

```python
len(df)
```

    782

- 782 instances
- Before computing any statistics, we need to isolate the numerical columns and check for missing values.

```python
# isolate the numeric features
numerical_features = df.select_dtypes(include=['float64']).columns
```

```python
df[numerical_features].head()
```

<div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>BMI</th> <th>Height</th> <th>Weight</th> <th>Length_of_Stay</th> <th>Alvarado_Score</th> <th>Paedriatic_Appendicitis_Score</th> <th>Appendix_Diameter</th> <th>Body_Temperature</th> <th>WBC_Count</th> <th>Neutrophil_Percentage</th> <th>Segmented_Neutrophils</th> <th>RBC_Count</th> <th>Hemoglobin</th> <th>RDW</th> <th>Thrombocyte_Count</th> <th>CRP</th> <th>US_Number</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>12.68</td> <td>16.9</td> <td>148.0</td> <td>37.0</td> <td>3.0</td> <td>4.0</td> <td>3.0</td> <td>7.1</td> <td>37.0</td> <td>7.7</td> <td>68.2</td> <td>NaN</td> <td>5.27</td> <td>14.8</td> <td>12.2</td> <td>254.0</td> <td>0.0</td> <td>882.0</td> </tr> <tr> <th>1</th> <td>14.10</td>
<td>31.9</td> <td>147.0</td> <td>69.5</td> <td>2.0</td> <td>5.0</td> <td>4.0</td> <td>NaN</td> <td>36.9</td> <td>8.1</td> <td>64.8</td> <td>NaN</td> <td>5.26</td> <td>15.7</td> <td>12.7</td> <td>151.0</td> <td>3.0</td> <td>883.0</td> </tr> <tr> <th>2</th> <td>14.14</td> <td>23.3</td> <td>163.0</td> <td>62.0</td> <td>4.0</td> <td>5.0</td> <td>3.0</td> <td>NaN</td> <td>36.6</td> <td>13.2</td> <td>74.8</td> <td>NaN</td> <td>3.98</td> <td>11.4</td> <td>12.2</td> <td>300.0</td> <td>3.0</td> <td>884.0</td> </tr> <tr> <th>3</th> <td>16.37</td> <td>20.6</td> <td>165.0</td> <td>56.0</td> <td>3.0</td> <td>7.0</td> <td>6.0</td> <td>NaN</td> <td>36.0</td> <td>11.4</td> <td>63.0</td> <td>NaN</td> <td>4.64</td> <td>13.6</td> <td>13.2</td> <td>258.0</td> <td>0.0</td> <td>886.0</td> </tr> <tr> <th>4</th> <td>11.08</td> <td>16.9</td> <td>163.0</td> <td>45.0</td> <td>3.0</td> <td>5.0</td> <td>6.0</td> <td>7.0</td> <td>36.9</td> <td>8.1</td> <td>44.0</td> <td>NaN</td> <td>4.44</td> <td>12.6</td> <td>13.6</td> <td>311.0</td> <td>0.0</td> <td>887.0</td> </tr> </tbody> </table> </div>

```python
df[numerical_features].isna().sum()
```

    Age                                1
    BMI                               27
    Height                            26
    Weight                             3
    Length_of_Stay                     4
    Alvarado_Score                    52
    Paedriatic_Appendicitis_Score     52
    Appendix_Diameter                284
    Body_Temperature                   7
    WBC_Count                          6
    Neutrophil_Percentage            103
    Segmented_Neutrophils            728
    RBC_Count                         18
    Hemoglobin                        18
    RDW                               26
    Thrombocyte_Count                 18
    CRP                               11
    US_Number                         22
    dtype: int64

- For these numeric columns, let's use a median imputation strategy to fill in the missing values. Given that there are fewer than 800 entries in the dataset, we don't want to drop any rows if we can avoid it.
- Median imputation might be problematic for Segmented_Neutrophils and Appendix_Diameter, as they have a high percentage of missing values (728 and 284 of 782 rows, roughly 93% and 36%). We'll need to keep this in mind and may need to revisit this problem in the future.
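The concern about sparse columns can be made concrete by looking at the share of missing values per column rather than the raw counts. The snippet below is a minimal sketch on a small made-up frame (the values and the `threshold` cutoff are assumptions for illustration, not taken from the dataset); on the real data the same two lines would run on `df[numerical_features]`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df[numerical_features]; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [12.7, 14.1, np.nan, 16.4, 11.1],
    "Appendix_Diameter": [7.1, np.nan, np.nan, np.nan, 7.0],
    "Segmented_Neutrophils": [np.nan, np.nan, np.nan, np.nan, 64.5],
})

# isna() gives booleans, so the column mean is the fraction of missing entries.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: columns above it deserve a second look before imputing.
threshold = 0.3
sparse = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(sparse)
```

Columns flagged this way are the ones where a single median value would end up replacing most of the column.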
```python #suppress warnings import warnings warnings.filterwarnings("ignore") for column in numerical_features: if df[column].isnull().sum() > 0: #calculate median of column median_value = df[column].median() df[column].fillna(median_value, inplace=True) ``` - Let's check and make sure there are no missing values in the numerical columns. ```python df[numerical_features].isna().sum() ``` Age 0 BMI 0 Height 0 Weight 0 Length_of_Stay 0 Alvarado_Score 0 Paedriatic_Appendicitis_Score 0 Appendix_Diameter 0 Body_Temperature 0 WBC_Count 0 Neutrophil_Percentage 0 Segmented_Neutrophils 0 RBC_Count 0 Hemoglobin 0 RDW 0 Thrombocyte_Count 0 CRP 0 US_Number 0 dtype: int64 - Perfect! Let's dig into the statistics of these numerical columns. ```python df[numerical_features].describe() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Age</th> <th>BMI</th> <th>Height</th> <th>Weight</th> <th>Length_of_Stay</th> <th>Alvarado_Score</th> <th>Paedriatic_Appendicitis_Score</th> <th>Appendix_Diameter</th> <th>Body_Temperature</th> <th>WBC_Count</th> <th>Neutrophil_Percentage</th> <th>Segmented_Neutrophils</th> <th>RBC_Count</th> <th>Hemoglobin</th> <th>RDW</th> <th>Thrombocyte_Count</th> <th>CRP</th> <th>US_Number</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> <td>782.000000</td> </tr> <tr> <th>mean</th> <td>11.346601</td> <td>18.877753</td> <td>148.071739</td> <td>43.165742</td> <td>4.277494</td> <td>5.927110</td> 
<td>5.236573</td> <td>7.667263</td> <td>37.402685</td> <td>12.665537</td> <td>72.279668</td> <td>64.529668</td> <td>4.799041</td> <td>13.378645</td> <td>13.164322</td> <td>285.039642</td> <td>31.044847</td> <td>424.755754</td> </tr> <tr> <th>std</th> <td>3.527720</td> <td>4.311546</td> <td>19.403000</td> <td>17.357896</td> <td>2.569092</td> <td>2.083053</td> <td>1.893189</td> <td>2.027507</td> <td>0.899825</td> <td>5.346191</td> <td>13.534515</td> <td>3.931203</td> <td>0.493237</td> <td>1.377175</td> <td>4.463417</td> <td>71.667551</td> <td>57.100060</td> <td>267.770056</td> </tr> <tr> <th>min</th> <td>0.000000</td> <td>7.827983</td> <td>53.000000</td> <td>3.960000</td> <td>1.000000</td> <td>0.000000</td> <td>0.000000</td> <td>2.700000</td> <td>26.900000</td> <td>2.600000</td> <td>27.200000</td> <td>32.000000</td> <td>3.620000</td> <td>8.200000</td> <td>11.200000</td> <td>91.000000</td> <td>0.000000</td> <td>1.000000</td> </tr> <tr> <th>25%</th> <td>9.209377</td> <td>15.804082</td> <td>138.000000</td> <td>29.500000</td> <td>3.000000</td> <td>4.000000</td> <td>4.000000</td> <td>7.000000</td> <td>36.800000</td> <td>8.300000</td> <td>63.825000</td> <td>64.500000</td> <td>4.540000</td> <td>12.700000</td> <td>12.300000</td> <td>236.000000</td> <td>1.000000</td> <td>204.250000</td> </tr> <tr> <th>50%</th> <td>11.438741</td> <td>18.062284</td> <td>149.650000</td> <td>41.400000</td> <td>3.000000</td> <td>6.000000</td> <td>5.000000</td> <td>7.500000</td> <td>37.200000</td> <td>12.000000</td> <td>75.500000</td> <td>64.500000</td> <td>4.780000</td> <td>13.300000</td> <td>12.700000</td> <td>276.000000</td> <td>7.000000</td> <td>398.500000</td> </tr> <tr> <th>75%</th> <td>14.080082</td> <td>21.014438</td> <td>162.225000</td> <td>54.000000</td> <td>5.000000</td> <td>8.000000</td> <td>6.000000</td> <td>8.000000</td> <td>37.900000</td> <td>16.200000</td> <td>82.375000</td> <td>64.500000</td> <td>5.010000</td> <td>14.000000</td> <td>13.300000</td> <td>328.750000</td> 
<td>32.000000</td> <td>603.750000</td> </tr> <tr> <th>max</th> <td>18.360000</td> <td>38.156221</td> <td>192.000000</td> <td>103.000000</td> <td>28.000000</td> <td>10.000000</td> <td>10.000000</td> <td>17.000000</td> <td>40.200000</td> <td>37.700000</td> <td>97.700000</td> <td>91.000000</td> <td>14.000000</td> <td>36.000000</td> <td>86.900000</td> <td>708.000000</td> <td>365.000000</td> <td>992.000000</td> </tr> </tbody> </table> </div>

```python
sns.heatmap(df[numerical_features].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```

![[output_21_0.png]]

- That's a lot of information! Let's look at a smaller subset of features at a time.

```python
numerical_features
```

    Index(['Age', 'BMI', 'Height', 'Weight', 'Length_of_Stay', 'Alvarado_Score',
           'Paedriatic_Appendicitis_Score', 'Appendix_Diameter',
           'Body_Temperature', 'WBC_Count', 'Neutrophil_Percentage',
           'Segmented_Neutrophils', 'RBC_Count', 'Hemoglobin', 'RDW',
           'Thrombocyte_Count', 'CRP', 'US_Number'],
          dtype='object')

```python
subset = numerical_features[1:8]
sns.heatmap(df[subset].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```

![[output_24_0.png]]

- From this subset, we can see the following features are highly correlated:
  - Height and Weight
  - BMI and Weight
  - BMI and Height
- Alvarado Score and Pediatric Appendicitis Score are slightly correlated.
- We may consider dropping one feature from each highly correlated pair to reduce multicollinearity in the model. This is similar in spirit to PCA in that both reduce redundancy among features, although PCA builds new combined components instead of dropping original columns.
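Reading highly correlated pairs off a heatmap gets tedious as the number of features grows, so the pairs can also be listed programmatically. The following is a minimal sketch on synthetic data; the `correlated_pairs` helper, the toy columns, and the 0.8 cutoff are all assumptions made for illustration and are not part of the notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for every column pair with |r| above threshold."""
    corr = frame.corr()
    pairs = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) > threshold
    ]
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Synthetic data: weight is built from height, temperature is independent noise.
rng = np.random.default_rng(0)
height = rng.normal(150, 20, 200)
toy = pd.DataFrame({
    "height": height,
    "weight": 0.5 * height - 30 + rng.normal(0, 2, 200),
    "temperature": rng.normal(37, 0.5, 200),
})

pairs = correlated_pairs(toy)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

On the real data the call would be `correlated_pairs(df[numerical_features])`, and one column from each reported pair becomes a candidate for dropping.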
```python
subset
```

    Index(['BMI', 'Height', 'Weight', 'Length_of_Stay', 'Alvarado_Score',
           'Paedriatic_Appendicitis_Score', 'Appendix_Diameter'],
          dtype='object')

```python
subset = numerical_features[8:]
sns.heatmap(df[subset].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```

![[output_27_0.png]]

- The highly correlated features here are:
  - RBC_Count and Hemoglobin
  - WBC_Count and Neutrophil_Percentage

# Distributions

- Let's check out the distribution of some of these features.
- For brevity, we'll focus on a few features here.

```python
sns.histplot(df['BMI'], kde=True)
```

    <Axes: xlabel='BMI', ylabel='Count'>

![[output_31_1.png]]

- BMI is roughly bell-shaped with a slight right skew.
- Median BMI is around 18.

```python
sns.histplot(df['Height'], kde=True)
```

    <Axes: xlabel='Height', ylabel='Count'>

![[output_33_1.png]]

- Height is roughly bell-shaped with a left skew.
- Median height is around 150.
- One height value is around 50-60 (the minimum is 53).
  - Although this is an outlier, we'll keep it for now as it may be a valid entry.

```python
sns.histplot(df['Weight'], kde=True)
```

    <Axes: xlabel='Weight', ylabel='Count'>

![[output_35_1.png]]

- Let's see how the Alvarado score is related to the diagnosis of appendicitis.

```python
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(data=df, x='Diagnosis', y='Alvarado_Score')
plt.title('Alvarado Score by Diagnosis')
```

    Text(0.5, 1.0, 'Alvarado Score by Diagnosis')

![[output_37_1.png]]

- Cases with appendicitis have a higher median Alvarado score (7) than those without appendicitis (5).
- Let's do the same for the pediatric appendicitis score.

```python
plt.subplot(1, 2, 2)
sns.boxplot(data=df, x='Diagnosis', y='Paedriatic_Appendicitis_Score')
plt.title('Pediatric Appendicitis Score by Diagnosis')
plt.tight_layout()
```

![[output_40_0.png]]

- Similar to the Alvarado score, cases with appendicitis have a higher median pediatric appendicitis score (6) than those without appendicitis (4).
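The skew described for the BMI and Height histograms in this section was judged by eye; it can also be quantified. Below is a minimal sketch using `Series.skew()` on two small made-up samples (the numbers are invented for illustration, not drawn from the dataset). A positive value indicates a longer right tail and a negative value a longer left tail.

```python
import pandas as pd

# Invented sample with a long right tail (a few large values).
right_tailed = pd.Series([15, 16, 16, 17, 17, 18, 19, 21, 25, 34], dtype=float)
# Mirror image of the same sample, which has a long left tail instead.
left_tailed = 50 - right_tailed

for name, sample in [("right_tailed", right_tailed), ("left_tailed", left_tailed)]:
    # In a right-skewed sample the mean sits above the median, and vice versa.
    print(name, "skew:", round(sample.skew(), 2),
          "mean:", sample.mean(), "median:", sample.median())
```

On the real data, `df['BMI'].skew()` and `df['Height'].skew()` would give the corresponding numbers for the two histograms.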
- Let's convert the target column to a binary column with values 0 and 1.

```python
# encode appendicitis as 1 and no appendicitis as 0
# (keys match the lowercase values seen in df.head())
#df['Diagnosis'] = df['Diagnosis'].map({'appendicitis': 1, 'no appendicitis': 0})
# we'll do the same for severity
#df['Severity'] = df['Severity'].map({'complicated': 1, 'uncomplicated': 0})
#df['Sex'] = df['Sex'].map({'female': 1, 'male': 0})
```

- The following columns are sparse and categorical. Let's set them aside for now and maybe revisit them later.

```python
sparse_columns = [
    "Appendix_Wall_Layers",
    "Target_Sign",
    "Appendicolith",
    "Perfusion",
    "Perforation",
    "Surrounding_Tissue_Reaction",
    "Appendicular_Abscess",
    "Abscess_Location",
    "Pathological_Lymph_Nodes",
    "Lymph_Nodes_Location",
    "Bowel_Wall_Thickening",
    "Conglomerate_of_Bowel_Loops",
    "Ileus",
    "Coprostasis",
    "Meteorism",
    "Enteritis",
    "Gynecological_Findings"
]
# we would then drop these columns or handle them in a different way
# not going to drop them for now
```

- Let's sum up the symptoms across all columns so that for each entry, we know how many symptoms they have.

```python
symptom_columns = [
    'Migratory_Pain',
    'Lower_Right_Abd_Pain',
    'Contralateral_Rebound_Tenderness',
    'Coughing_Pain',
    'Nausea',
    'Loss_of_Appetite',
    'Neutrophilia',
    'Dysuria',
    'Peritonitis',
    'Psoas_Sign',
    'Ipsilateral_Rebound_Tenderness'
]

# count the 'yes' answers per row; missing values count as no symptom
df['symptom_count'] = df[symptom_columns].apply(lambda x: (x == 'yes').sum(), axis=1)
df['symptom_count']
```

    0      4
    1      6
    2      3
    3      6
    4      6
          ..
    777    5
    778    2
    779    2
    780    5
    781    1
    Name: symptom_count, Length: 782, dtype: int64

- It seems like more symptoms should be indicative of appendicitis, but let's check the distribution of symptoms for appendicitis vs. no appendicitis.

```python
plt.subplot(1, 3, 1)
sns.boxplot(x='Diagnosis', y='symptom_count', data=df)
plt.title('Symptom Count by Diagnosis')
```

    Text(0.5, 1.0, 'Symptom Count by Diagnosis')

![[output_49_1.png]]

- Yep! Cases with appendicitis tend to have more symptoms than those without appendicitis.
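The symptom count and the comparison read from the boxplot can also be expressed as numbers. The sketch below uses a tiny invented frame with three of the symptom columns (the rows are assumptions for illustration only); it counts `'yes'` answers with a vectorized comparison, which gives the same result as the row-wise `apply`, and then summarizes the count per diagnosis.

```python
import pandas as pd

# Invented rows for illustration; None stands in for a missing answer.
toy = pd.DataFrame({
    "Diagnosis": ["appendicitis", "no appendicitis", "appendicitis", "no appendicitis"],
    "Nausea": ["yes", "no", "yes", None],
    "Migratory_Pain": ["yes", "no", "no", "yes"],
    "Coughing_Pain": ["yes", None, "yes", "no"],
})
symptom_columns = ["Nausea", "Migratory_Pain", "Coughing_Pain"]

# eq('yes') is False for 'no' and for missing values, so only 'yes' is counted.
toy["symptom_count"] = toy[symptom_columns].eq("yes").sum(axis=1)

# Mean and median symptom count per diagnosis group.
summary = toy.groupby("Diagnosis")["symptom_count"].agg(["mean", "median"])
print(summary)
```

Applied to `df`, the same `groupby` would show whether the difference seen in the boxplot holds for the mean as well as the median.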
- Let's use some of the key lab values to create interaction terms.
- These are key inflammatory markers that doctors use to diagnose appendicitis:
  - WBC (White Blood Cell) count indicates acute infection/inflammation.
  - CRP (C-Reactive Protein) is another inflammatory marker.
  - Neutrophils are the type of white blood cells that respond to bacterial infections.
- The idea is that these interaction terms can capture cases where two inflammatory markers are high at the same time.

```python
df['WBC_CRP_Interaction'] = df['WBC_Count'] * df['CRP']
df['WBC_Neutrophil_Interaction'] = df['WBC_Count'] * df['Neutrophil_Percentage']
df['CRP_Neutrophil_Interaction'] = df['CRP'] * df['Neutrophil_Percentage']
```

- Let's see how these interaction terms are related to the diagnosis of appendicitis.
- We'll use boxplots to visualize this relationship.

```python
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x='Diagnosis', y='WBC_CRP_Interaction', data=df)
plt.title('WBC × CRP by Diagnosis')
plt.yscale('log')  # needed for large range of values

plt.subplot(1, 3, 2)
sns.boxplot(x='Diagnosis', y='WBC_Neutrophil_Interaction', data=df)
plt.title('WBC × Neutrophil % by Diagnosis')

plt.subplot(1, 3, 3)
sns.boxplot(x='Diagnosis', y='CRP_Neutrophil_Interaction', data=df)
plt.title('CRP × Neutrophil % by Diagnosis')

plt.tight_layout()
plt.show()
```

![[output_54_0.png]]

- For two of the interaction terms (WBC × CRP and WBC × Neutrophil %), cases with appendicitis have higher median values than those without appendicitis.
- CRP × Neutrophil % seems to have a similar or lower median value for appendicitis cases compared to no appendicitis cases.
- Maybe these interaction terms will be useful for our model(s).
- Just out of curiosity, let's see how these interaction terms are correlated with some of the original numerical features.
```python
# correlation plot with the new features and the original numerical features
new_features = ['WBC_CRP_Interaction', 'WBC_Neutrophil_Interaction', 'CRP_Neutrophil_Interaction']
features = numerical_features.tolist()[0:8] + new_features
sns.heatmap(df[features].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
```

    Text(0.5, 1.0, 'Correlation Matrix')

![[output_57_1.png]]

- We can see the WBC_Neutrophil interaction term is highly correlated with the Alvarado Score and Pediatric Appendicitis Score. This is not surprising, since both scores include white blood cell and neutrophil criteria.
- These scores are indicative of appendicitis, so maybe this interaction term alone can be a good predictor of appendicitis.
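One quick way to probe whether a single feature separates the two diagnosis groups is the area under the ROC curve (AUC), which for one feature equals the probability that a randomly chosen positive case has a higher value than a randomly chosen negative case. The snippet below is a minimal sketch on invented values; the `single_feature_auc` helper and the toy rows are assumptions for illustration and say nothing about how the interaction term performs on the real data.

```python
import pandas as pd


def single_feature_auc(feature: pd.Series, target: pd.Series) -> float:
    """AUC of one feature against a 0/1 target via the Mann-Whitney rank formula."""
    ranks = feature.rank(method="average")  # ties share their average rank
    n_pos = int((target == 1).sum())
    n_neg = int((target == 0).sum())
    rank_sum_pos = ranks[target == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Invented values for illustration only.
toy = pd.DataFrame({
    "WBC_Neutrophil_Interaction": [400.0, 520.0, 610.0, 900.0, 1100.0, 1300.0],
    "Diagnosis": ["no appendicitis", "no appendicitis", "appendicitis",
                  "no appendicitis", "appendicitis", "appendicitis"],
})
target = (toy["Diagnosis"] == "appendicitis").astype(int)

auc = single_feature_auc(toy["WBC_Neutrophil_Interaction"], target)
print(round(auc, 3))  # 8 of the 9 positive/negative pairs are ordered correctly
```

A value near 0.5 would mean the feature carries little signal on its own, while values closer to 1 suggest it separates the groups well.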