[Kaggle Notebook](https://www.kaggle.com/code/dataranch/pediatric-appendicitis-classification-eda)
- This project focuses on developing a machine learning model to assist in the diagnosis of appendicitis in pediatric patients using the Regensburg Pediatric Appendicitis dataset. The main goal is to create a reliable classification model that can help identify appendicitis cases based on clinical, laboratory, and ultrasound findings.
- This is the exploratory data analysis (EDA) portion.
![[Appendicitis_Image.png]]
### Dataset Description
- **Source**: Children's Hospital St. Hedwig in Regensburg, Germany (2016-2021)
- **Size**: 782 patients
- **Features**: 53 variables including:
  - Clinical measurements (e.g., Age, BMI, Body Temperature)
  - Laboratory findings (e.g., WBC Count, CRP, Neutrophil Percentage)
  - Physical examination results (e.g., Migratory Pain, Rebound Tenderness)
  - Clinical scoring systems (Alvarado Score, Pediatric Appendicitis Score)
  - Ultrasound findings
- **Target Variables**:
  - Primary: Diagnosis (appendicitis vs. no appendicitis)
  - Secondary: Management (surgical vs. conservative)
  - Tertiary: Severity (complicated vs. uncomplicated/no appendicitis)
https://archive.ics.uci.edu/dataset/938/regensburg+pediatric+appendicitis
```python
# imports needed for milestone 1
import pandas as pd
import matplotlib.pyplot as plt
# better looking plots
import seaborn as sns
```
```python
df = pd.read_excel('/kaggle/input/childrens-hospital-regensburg-appendicitis/app_data.xlsx')
# this is so i can view the data locally
# df.to_csv('app_data.csv', index=False)
```
- Let's take a look at the first few rows...
```python
df.head(10)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Age</th>
<th>BMI</th>
<th>Sex</th>
<th>Height</th>
<th>Weight</th>
<th>Length_of_Stay</th>
<th>Management</th>
<th>Severity</th>
<th>Diagnosis_Presumptive</th>
<th>Diagnosis</th>
<th>...</th>
<th>Abscess_Location</th>
<th>Pathological_Lymph_Nodes</th>
<th>Lymph_Nodes_Location</th>
<th>Bowel_Wall_Thickening</th>
<th>Conglomerate_of_Bowel_Loops</th>
<th>Ileus</th>
<th>Coprostasis</th>
<th>Meteorism</th>
<th>Enteritis</th>
<th>Gynecological_Findings</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12.68</td>
<td>16.9</td>
<td>female</td>
<td>148.0</td>
<td>37.0</td>
<td>3.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>yes</td>
<td>reUB</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>14.10</td>
<td>31.9</td>
<td>male</td>
<td>147.0</td>
<td>69.5</td>
<td>2.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>no appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>yes</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>14.14</td>
<td>23.3</td>
<td>female</td>
<td>163.0</td>
<td>62.0</td>
<td>4.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>no appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>yes</td>
<td>yes</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>16.37</td>
<td>20.6</td>
<td>female</td>
<td>165.0</td>
<td>56.0</td>
<td>3.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>no appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>yes</td>
<td>reUB</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>yes</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>11.08</td>
<td>16.9</td>
<td>female</td>
<td>163.0</td>
<td>45.0</td>
<td>3.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>yes</td>
<td>reUB</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>yes</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>11.05</td>
<td>30.7</td>
<td>male</td>
<td>121.0</td>
<td>45.0</td>
<td>3.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>no appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>8.98</td>
<td>19.4</td>
<td>female</td>
<td>140.0</td>
<td>38.5</td>
<td>3.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>no appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>7.06</td>
<td>NaN</td>
<td>female</td>
<td>NaN</td>
<td>21.5</td>
<td>2.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>no appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>yes</td>
<td>re UB</td>
<td>no</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>7.90</td>
<td>15.7</td>
<td>male</td>
<td>131.0</td>
<td>26.7</td>
<td>3.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>no appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>yes</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>yes</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>14.34</td>
<td>14.9</td>
<td>male</td>
<td>174.0</td>
<td>45.5</td>
<td>3.0</td>
<td>conservative</td>
<td>uncomplicated</td>
<td>appendicitis</td>
<td>appendicitis</td>
<td>...</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>10 rows × 58 columns</p>
</div>
- The frame has 58 columns, so setting 'Diagnosis' aside as the target leaves 57 candidate features.
- In the first 10 rows, row 7 is missing both 'BMI' and 'Height', so we'll need to investigate further.
- Many columns toward the right side of the df also contain missing values.
- The target column in this case is 'Diagnosis', a binary column with values 'appendicitis' and 'no appendicitis'.
- We'll need to convert this and other categorical variables to a numeric (e.g., 0/1) format for our model.
```python
len(df)
```
782
- 782 instances
- Before computing any statistics, we need to isolate the numerical columns and check for missing values.
```python
# isolate the numeric features
numerical_features = df.select_dtypes(include=['float64']).columns
```
```python
df[numerical_features].head()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Age</th>
<th>BMI</th>
<th>Height</th>
<th>Weight</th>
<th>Length_of_Stay</th>
<th>Alvarado_Score</th>
<th>Paedriatic_Appendicitis_Score</th>
<th>Appendix_Diameter</th>
<th>Body_Temperature</th>
<th>WBC_Count</th>
<th>Neutrophil_Percentage</th>
<th>Segmented_Neutrophils</th>
<th>RBC_Count</th>
<th>Hemoglobin</th>
<th>RDW</th>
<th>Thrombocyte_Count</th>
<th>CRP</th>
<th>US_Number</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12.68</td>
<td>16.9</td>
<td>148.0</td>
<td>37.0</td>
<td>3.0</td>
<td>4.0</td>
<td>3.0</td>
<td>7.1</td>
<td>37.0</td>
<td>7.7</td>
<td>68.2</td>
<td>NaN</td>
<td>5.27</td>
<td>14.8</td>
<td>12.2</td>
<td>254.0</td>
<td>0.0</td>
<td>882.0</td>
</tr>
<tr>
<th>1</th>
<td>14.10</td>
<td>31.9</td>
<td>147.0</td>
<td>69.5</td>
<td>2.0</td>
<td>5.0</td>
<td>4.0</td>
<td>NaN</td>
<td>36.9</td>
<td>8.1</td>
<td>64.8</td>
<td>NaN</td>
<td>5.26</td>
<td>15.7</td>
<td>12.7</td>
<td>151.0</td>
<td>3.0</td>
<td>883.0</td>
</tr>
<tr>
<th>2</th>
<td>14.14</td>
<td>23.3</td>
<td>163.0</td>
<td>62.0</td>
<td>4.0</td>
<td>5.0</td>
<td>3.0</td>
<td>NaN</td>
<td>36.6</td>
<td>13.2</td>
<td>74.8</td>
<td>NaN</td>
<td>3.98</td>
<td>11.4</td>
<td>12.2</td>
<td>300.0</td>
<td>3.0</td>
<td>884.0</td>
</tr>
<tr>
<th>3</th>
<td>16.37</td>
<td>20.6</td>
<td>165.0</td>
<td>56.0</td>
<td>3.0</td>
<td>7.0</td>
<td>6.0</td>
<td>NaN</td>
<td>36.0</td>
<td>11.4</td>
<td>63.0</td>
<td>NaN</td>
<td>4.64</td>
<td>13.6</td>
<td>13.2</td>
<td>258.0</td>
<td>0.0</td>
<td>886.0</td>
</tr>
<tr>
<th>4</th>
<td>11.08</td>
<td>16.9</td>
<td>163.0</td>
<td>45.0</td>
<td>3.0</td>
<td>5.0</td>
<td>6.0</td>
<td>7.0</td>
<td>36.9</td>
<td>8.1</td>
<td>44.0</td>
<td>NaN</td>
<td>4.44</td>
<td>12.6</td>
<td>13.6</td>
<td>311.0</td>
<td>0.0</td>
<td>887.0</td>
</tr>
</tbody>
</table>
</div>
```python
df[numerical_features].isna().sum()
```
Age 1
BMI 27
Height 26
Weight 3
Length_of_Stay 4
Alvarado_Score 52
Paedriatic_Appendicitis_Score 52
Appendix_Diameter 284
Body_Temperature 7
WBC_Count 6
Neutrophil_Percentage 103
Segmented_Neutrophils 728
RBC_Count 18
Hemoglobin 18
RDW 26
Thrombocyte_Count 18
CRP 11
US_Number 22
dtype: int64
- For these numeric columns, let's use a median imputation strategy to fill in the missing values. Given that there are fewer than 800 entries in the dataset, we don't want to drop any rows if we can avoid it.
- Median imputation might be problematic for Segmented_Neutrophils (missing in 728 of 782 rows, about 93%) and Appendix_Diameter (missing in 284 rows, about 36%), since most or much of each column would become a single repeated value. We'll need to keep this in mind and may need to revisit this problem in the future.
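- Median imputation erases the fact that a value was never recorded, so one option is to store a missing-value flag next to each heavily imputed column before filling. The sketch below runs on a small made-up frame, not the real data. The 30% cutoff and the `_missing` suffix are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Appendix_Diameter": [7.1, np.nan, np.nan, 8.0, np.nan],
    "Weight": [37.0, 69.5, np.nan, 56.0, 45.0],
})

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Assumed cutoff: flag columns with more than 30% missing.
THRESHOLD = 0.30
flagged = missing_frac[missing_frac > THRESHOLD].index.tolist()

# Record missingness before the fill, then impute with the column median.
for col in flagged:
    toy[f"{col}_missing"] = toy[col].isna().astype(int)
toy = toy.fillna(toy.median(numeric_only=True))

print(missing_frac)
print(flagged)
```

- On the real frame, the same pattern would need to run before the imputation cell, because the flags cannot be recovered once the NaNs are filled.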
```python
# suppress warnings
import warnings
warnings.filterwarnings("ignore")

for column in numerical_features:
    if df[column].isnull().sum() > 0:
        # fill missing entries with the column median
        median_value = df[column].median()
        # assign back instead of calling fillna(inplace=True) on a column selection
        df[column] = df[column].fillna(median_value)
```
- Let's check and make sure there are no missing values in the numerical columns.
```python
df[numerical_features].isna().sum()
```
Age 0
BMI 0
Height 0
Weight 0
Length_of_Stay 0
Alvarado_Score 0
Paedriatic_Appendicitis_Score 0
Appendix_Diameter 0
Body_Temperature 0
WBC_Count 0
Neutrophil_Percentage 0
Segmented_Neutrophils 0
RBC_Count 0
Hemoglobin 0
RDW 0
Thrombocyte_Count 0
CRP 0
US_Number 0
dtype: int64
- Perfect! Let's dig into the statistics of these numerical columns.
```python
df[numerical_features].describe()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Age</th>
<th>BMI</th>
<th>Height</th>
<th>Weight</th>
<th>Length_of_Stay</th>
<th>Alvarado_Score</th>
<th>Paedriatic_Appendicitis_Score</th>
<th>Appendix_Diameter</th>
<th>Body_Temperature</th>
<th>WBC_Count</th>
<th>Neutrophil_Percentage</th>
<th>Segmented_Neutrophils</th>
<th>RBC_Count</th>
<th>Hemoglobin</th>
<th>RDW</th>
<th>Thrombocyte_Count</th>
<th>CRP</th>
<th>US_Number</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
<td>782.000000</td>
</tr>
<tr>
<th>mean</th>
<td>11.346601</td>
<td>18.877753</td>
<td>148.071739</td>
<td>43.165742</td>
<td>4.277494</td>
<td>5.927110</td>
<td>5.236573</td>
<td>7.667263</td>
<td>37.402685</td>
<td>12.665537</td>
<td>72.279668</td>
<td>64.529668</td>
<td>4.799041</td>
<td>13.378645</td>
<td>13.164322</td>
<td>285.039642</td>
<td>31.044847</td>
<td>424.755754</td>
</tr>
<tr>
<th>std</th>
<td>3.527720</td>
<td>4.311546</td>
<td>19.403000</td>
<td>17.357896</td>
<td>2.569092</td>
<td>2.083053</td>
<td>1.893189</td>
<td>2.027507</td>
<td>0.899825</td>
<td>5.346191</td>
<td>13.534515</td>
<td>3.931203</td>
<td>0.493237</td>
<td>1.377175</td>
<td>4.463417</td>
<td>71.667551</td>
<td>57.100060</td>
<td>267.770056</td>
</tr>
<tr>
<th>min</th>
<td>0.000000</td>
<td>7.827983</td>
<td>53.000000</td>
<td>3.960000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>2.700000</td>
<td>26.900000</td>
<td>2.600000</td>
<td>27.200000</td>
<td>32.000000</td>
<td>3.620000</td>
<td>8.200000</td>
<td>11.200000</td>
<td>91.000000</td>
<td>0.000000</td>
<td>1.000000</td>
</tr>
<tr>
<th>25%</th>
<td>9.209377</td>
<td>15.804082</td>
<td>138.000000</td>
<td>29.500000</td>
<td>3.000000</td>
<td>4.000000</td>
<td>4.000000</td>
<td>7.000000</td>
<td>36.800000</td>
<td>8.300000</td>
<td>63.825000</td>
<td>64.500000</td>
<td>4.540000</td>
<td>12.700000</td>
<td>12.300000</td>
<td>236.000000</td>
<td>1.000000</td>
<td>204.250000</td>
</tr>
<tr>
<th>50%</th>
<td>11.438741</td>
<td>18.062284</td>
<td>149.650000</td>
<td>41.400000</td>
<td>3.000000</td>
<td>6.000000</td>
<td>5.000000</td>
<td>7.500000</td>
<td>37.200000</td>
<td>12.000000</td>
<td>75.500000</td>
<td>64.500000</td>
<td>4.780000</td>
<td>13.300000</td>
<td>12.700000</td>
<td>276.000000</td>
<td>7.000000</td>
<td>398.500000</td>
</tr>
<tr>
<th>75%</th>
<td>14.080082</td>
<td>21.014438</td>
<td>162.225000</td>
<td>54.000000</td>
<td>5.000000</td>
<td>8.000000</td>
<td>6.000000</td>
<td>8.000000</td>
<td>37.900000</td>
<td>16.200000</td>
<td>82.375000</td>
<td>64.500000</td>
<td>5.010000</td>
<td>14.000000</td>
<td>13.300000</td>
<td>328.750000</td>
<td>32.000000</td>
<td>603.750000</td>
</tr>
<tr>
<th>max</th>
<td>18.360000</td>
<td>38.156221</td>
<td>192.000000</td>
<td>103.000000</td>
<td>28.000000</td>
<td>10.000000</td>
<td>10.000000</td>
<td>17.000000</td>
<td>40.200000</td>
<td>37.700000</td>
<td>97.700000</td>
<td>91.000000</td>
<td>14.000000</td>
<td>36.000000</td>
<td>86.900000</td>
<td>708.000000</td>
<td>365.000000</td>
<td>992.000000</td>
</tr>
</tbody>
</table>
</div>
```python
sns.heatmap(df[numerical_features].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
![[output_21_0.png]]
- That's a lot of information in one plot! Let's look at smaller subsets of the features, one heatmap at a time.
```python
numerical_features
```
Index(['Age', 'BMI', 'Height', 'Weight', 'Length_of_Stay', 'Alvarado_Score',
'Paedriatic_Appendicitis_Score', 'Appendix_Diameter',
'Body_Temperature', 'WBC_Count', 'Neutrophil_Percentage',
'Segmented_Neutrophils', 'RBC_Count', 'Hemoglobin', 'RDW',
'Thrombocyte_Count', 'CRP', 'US_Number'],
dtype='object')
```python
subset = numerical_features[1:8]
sns.heatmap(df[subset].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
![[output_24_0.png]]
- From this subset, we can see the following features are highly correlated:
  - Height and Weight
  - BMI and Weight
  - BMI and Height
- Alvarado Score and Pediatric Appendicitis Score are slightly correlated.
- We may consider dropping one feature from each highly correlated pair to reduce multicollinearity in the model. The goal is similar to PCA (fewer, less redundant inputs), although PCA builds new combined components rather than selecting among the original columns.
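- Reading pairs off a heatmap by eye gets harder as the number of features grows. As a sketch, the pairs above a chosen cutoff can be listed programmatically. The toy data and the 0.8 cutoff below are assumptions for illustration, not values taken from this dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up): weight closely tracks height, score is unrelated.
toy = pd.DataFrame({
    "Height": [120.0, 130.0, 140.0, 150.0, 160.0, 170.0],
    "Weight": [25.0, 30.0, 36.0, 43.0, 50.0, 60.0],
    "Score": [5.0, 3.0, 6.0, 4.0, 5.0, 4.0],
})

# Absolute correlations, so strong negative pairs are caught too.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed cutoff for "highly correlated".
THRESHOLD = 0.8
pairs = upper.stack()
high_pairs = pairs[pairs > THRESHOLD].index.tolist()
print(high_pairs)
```

- On the notebook's data, `toy` would be replaced by `df[numerical_features]`.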
```python
subset
```
Index(['BMI', 'Height', 'Weight', 'Length_of_Stay', 'Alvarado_Score',
'Paedriatic_Appendicitis_Score', 'Appendix_Diameter'],
dtype='object')
```python
subset = numerical_features[8:]
sns.heatmap(df[subset].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
![[output_27_0.png]]
- The highly correlated features here are:
  - RBC_Count and Hemoglobin
  - WBC_Count and Neutrophil_Percentage
# Distributions
- Let's check out the distribution of some of these features.
- For brevity, we'll focus on a few features here.
```python
sns.histplot(df['BMI'], kde=True)
```
<Axes: xlabel='BMI', ylabel='Count'>
![[output_31_1.png]]
- BMI is roughly bell-shaped with a slight right skew.
- Median BMI is around 18 (18.06 in the summary table above).
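- Skew can also be checked numerically instead of only by eye. The sketch below uses a made-up right-skewed sample, where a positive `skew()` and a mean above the median both point to a longer right tail.

```python
import pandas as pd

# Toy right-skewed sample (made up): most values low, a few large ones.
bmi_like = pd.Series([15.0, 16.0, 16.5, 17.0, 17.5, 18.0, 19.0, 22.0, 30.0])

skewness = bmi_like.skew()   # > 0 means a longer right tail
median = bmi_like.median()
mean = bmi_like.mean()
print(skewness, median, mean)
```

- On the real data this would be `df['BMI'].skew()`. Note that the imputed medians are included in that calculation.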
```python
sns.histplot(df['Height'], kde=True)
```
<Axes: xlabel='Height', ylabel='Count'>
![[output_33_1.png]]
- Height is roughly bell-shaped with a left skew.
- Median height is around 150.
- One height is around 50-60 (the minimum in the summary table is 53).
  - Although this is an outlier, we'll keep it for now as it may be a valid entry.
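- A common way to flag such values without dropping them is the 1.5 × IQR rule. The sketch below applies it to made-up heights. The rule and its multiplier are a convention assumed here, not something the notebook itself uses.

```python
import pandas as pd

# Toy heights (made up), with one very low value.
heights = pd.Series([53.0, 138.0, 142.0, 149.0, 150.0, 155.0, 162.0, 170.0])

q1, q3 = heights.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than drop, so the rows can be inspected by hand.
is_outlier = (heights < lower) | (heights > upper)
print(heights[is_outlier])
```

- Flagged rows can then be checked against Age and Weight to judge whether the entry is plausible.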
```python
sns.histplot(df['Weight'], kde=True)
```
<Axes: xlabel='Weight', ylabel='Count'>
![[output_35_1.png]]
- Let's see how the Alvarado score is related to the diagnosis of appendicitis.
```python
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(data=df, x='Diagnosis', y='Alvarado_Score')
plt.title('Alvarado Score by Diagnosis')
```
Text(0.5, 1.0, 'Alvarado Score by Diagnosis')
![[output_37_1.png]]
- Cases with appendicitis have a higher median Alvarado score (7) than those without appendicitis (5).
- Let's do the same for the Pediatric Appendicitis Score.
```python
plt.subplot(1, 2, 2)
sns.boxplot(data=df, x='Diagnosis', y='Paedriatic_Appendicitis_Score')
plt.title('Pediatric Appendicitis Score by Diagnosis')
plt.tight_layout()
```
![[output_40_0.png]]
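- The medians that the boxplots display can also be computed directly with a groupby. This is a sketch on made-up rows that reuse the notebook's column names, so the numbers printed here are not the dataset's.

```python
import pandas as pd

# Toy rows (made up) using the notebook's column names and label spelling.
toy = pd.DataFrame({
    "Diagnosis": ["appendicitis", "no appendicitis", "appendicitis",
                  "no appendicitis", "appendicitis", "no appendicitis"],
    "Alvarado_Score": [8.0, 3.0, 9.0, 5.0, 6.0, 4.0],
    "Paedriatic_Appendicitis_Score": [7.0, 2.0, 8.0, 4.0, 5.0, 3.0],
})

score_cols = ["Alvarado_Score", "Paedriatic_Appendicitis_Score"]
medians = toy.groupby("Diagnosis")[score_cols].median()
print(medians)
```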
- Similar to the Alvarado score, cases with appendicitis have a higher median Pediatric Appendicitis Score (6) than those without appendicitis (4).
- Eventually we'll convert the target column to a binary column with values 0 and 1. The mapping is left commented out for now.
```python
# encode appendicitis as 1 and no appendicitis as 0
# (keys must match the lowercase labels in the data, otherwise map() returns NaN)
#df['Diagnosis'] = df['Diagnosis'].map({'appendicitis': 1, 'no appendicitis': 0})
# we'll do the same for severity
#df['Severity'] = df['Severity'].map({'complicated': 1, 'uncomplicated': 0})
#df['Sex'] = df['Sex'].map({'female': 1, 'male': 0})
```
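- `Series.map` returns NaN for any label that does not match a dictionary key exactly, so case or whitespace differences silently turn into missing values. A minimal sketch on made-up labels is below. It normalizes the text first and writes to a separate, hypothetical `Diagnosis_encoded` column so the readable labels stay available for plots.

```python
import pandas as pd

# Toy labels (made up) with inconsistent case and stray spaces.
toy = pd.DataFrame(
    {"Diagnosis": ["appendicitis", "No Appendicitis ", "no appendicitis", None]}
)

label_map = {"appendicitis": 1, "no appendicitis": 0}

# Normalize text first so the map keys match; unmatched labels become NaN.
normalized = toy["Diagnosis"].str.strip().str.lower()
toy["Diagnosis_encoded"] = normalized.map(label_map)

# Count labels that were present but failed to map.
unmapped = int((toy["Diagnosis_encoded"].isna() & toy["Diagnosis"].notna()).sum())
print(toy)
print(unmapped)
```

- A non-zero `unmapped` count would mean the dictionary is missing a label that exists in the data.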
- The following columns are sparse and categorical. We list them here as candidates to drop, but keep them for now and may revisit them later.
```python
sparse_columns = [
"Appendix_Wall_Layers", "Target_Sign", "Appendicolith", "Perfusion",
"Perforation", "Surrounding_Tissue_Reaction", "Appendicular_Abscess",
"Abscess_Location", "Pathological_Lymph_Nodes", "Lymph_Nodes_Location",
"Bowel_Wall_Thickening", "Conglomerate_of_Bowel_Loops", "Ileus",
"Coprostasis", "Meteorism", "Enteritis", "Gynecological_Findings"
]
# we would then drop these columns or handle them in a different way
# not going to drop them for now
```
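- One alternative to dropping sparse categorical columns is to treat "not recorded" as its own category. The sketch below shows this on made-up values for two of the listed columns. The `not_recorded` label is a placeholder assumed for illustration.

```python
import numpy as np
import pandas as pd

# Toy version of two sparse ultrasound columns (values made up).
toy = pd.DataFrame({
    "Ileus": [np.nan, "yes", np.nan, "no", np.nan],
    "Meteorism": ["yes", np.nan, np.nan, np.nan, "yes"],
})

# Share of missing entries per column, to confirm sparsity.
missing_share = toy.isna().mean()

# Assumed placeholder label: treat "not recorded" as its own category.
filled = toy.fillna("not_recorded")
counts = filled["Ileus"].value_counts()
print(missing_share)
print(counts)
```

- Whether a missing finding means "absent" or "not examined" is a clinical question, so this is only one possible encoding.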
- Let's sum up the symptoms across all columns so that for each entry, we know how many symptoms they have.
```python
symptom_columns = [
'Migratory_Pain', 'Lower_Right_Abd_Pain', 'Contralateral_Rebound_Tenderness',
'Coughing_Pain', 'Nausea', 'Loss_of_Appetite', 'Neutrophilia', 'Dysuria',
'Peritonitis', 'Psoas_Sign', 'Ipsilateral_Rebound_Tenderness'
]
df['symptom_count'] = df[symptom_columns].apply(lambda x: (x == 'yes').sum(), axis=1)
df['symptom_count']
```
0 4
1 6
2 3
3 6
4 6
..
777 5
778 2
779 2
780 5
781 1
Name: symptom_count, Length: 782, dtype: int64
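- The row-wise `apply` works, and the same count can be written as a single vectorized comparison. The sketch below checks on made-up values that both forms agree. It also counts how many symptoms were actually recorded per row, since a NaN is counted the same as a 'no'.

```python
import numpy as np
import pandas as pd

# Toy symptom columns (values made up); NaN means not recorded.
toy = pd.DataFrame({
    "Nausea": ["yes", "no", np.nan],
    "Dysuria": ["yes", "yes", "no"],
    "Psoas_Sign": ["no", "yes", np.nan],
})
cols = ["Nausea", "Dysuria", "Psoas_Sign"]

# Row-wise lambda, as in the cell above.
count_apply = toy[cols].apply(lambda x: (x == "yes").sum(), axis=1)

# Vectorized equivalent: compare the whole frame, then sum across columns.
count_vec = (toy[cols] == "yes").sum(axis=1)

# Number of symptoms actually recorded per row, for context.
recorded = toy[cols].notna().sum(axis=1)
print(count_vec.tolist(), recorded.tolist())
```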
- It seems like more symptoms should be indicative of appendicitis, but let's check the distribution of symptoms for appendicitis vs. no appendicitis.
```python
plt.subplot(1, 3, 1)
sns.boxplot(x='Diagnosis', y='symptom_count', data=df)
plt.title('Symptom Count by Diagnosis')
```
Text(0.5, 1.0, 'Symptom Count by Diagnosis')
![[output_49_1.png]]
- Yep! Cases with appendicitis tend to have more symptoms than those without appendicitis.
- Let's use some of the key lab values to create interaction terms.
  - These are key inflammatory markers that doctors use to diagnose appendicitis.
  - WBC (White Blood Cell) count indicates acute infection/inflammation.
  - CRP (C-Reactive Protein) is another inflammatory marker.
  - Neutrophils are the type of white blood cells that respond to bacterial infections.
- The idea is that these interaction terms can capture cases where these inflammatory markers are high together.
```python
df['WBC_CRP_Interaction'] = df['WBC_Count'] * df['CRP']
df['WBC_Neutrophil_Interaction'] = df['WBC_Count'] * df['Neutrophil_Percentage']
df['CRP_Neutrophil_Interaction'] = df['CRP'] * df['Neutrophil_Percentage']
```
- Let's see how these interaction terms are related to the diagnosis of appendicitis.
- We'll use boxplots to visualize this relationship.
```python
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
sns.boxplot(x='Diagnosis', y='WBC_CRP_Interaction', data=df)
plt.title('WBC × CRP by Diagnosis')
plt.yscale('log') #needed for large range of values
plt.subplot(1, 3, 2)
sns.boxplot(x='Diagnosis', y='WBC_Neutrophil_Interaction', data=df)
plt.title('WBC × Neutrophil % by Diagnosis')
plt.subplot(1, 3, 3)
sns.boxplot(x='Diagnosis', y='CRP_Neutrophil_Interaction', data=df)
plt.title('CRP × Neutrophil % by Diagnosis')
plt.tight_layout()
plt.show()
```
![[output_54_0.png]]
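- One caveat with the log-scaled panel: the summary table shows a minimum CRP of 0, so WBC × CRP is exactly 0 for some rows, and a log axis cannot place a zero. A possible workaround, sketched on made-up values, is to transform with `log1p` and plot on a linear axis. The `WBC_CRP_log1p` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy lab values (made up); the first CRP is 0, as can occur in the data.
toy = pd.DataFrame({
    "WBC_Count": [7.7, 12.0, 16.2],
    "CRP": [0.0, 7.0, 32.0],
})

toy["WBC_CRP_Interaction"] = toy["WBC_Count"] * toy["CRP"]

# log1p(x) = log(1 + x) is defined at 0, unlike log(x).
toy["WBC_CRP_log1p"] = np.log1p(toy["WBC_CRP_Interaction"])
print(toy)
```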
- For two interaction terms, cases with appendicitis have higher median values than those without appendicitis.
- CRP * Neutrophil Percentage seems to have a similar/lower median value for appendicitis cases compared to no appendicitis cases.
- Maybe these interaction terms will be useful for our model(s).
- Just out of curiosity, let's see how these interaction terms are correlated with some of the original numerical features.
```python
# correlation plot with the new features and the original numerical features
new_features = ['WBC_CRP_Interaction', 'WBC_Neutrophil_Interaction', 'CRP_Neutrophil_Interaction']
features = numerical_features.tolist()[0:8] + new_features
sns.heatmap(df[features].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
```
Text(0.5, 1.0, 'Correlation Matrix')
![[output_57_1.png]]
- We can see the WBC_Neutrophil interaction term is highly correlated with the Alvarado Score and Pediatric Appendicitis Score.
  - Part of this is expected, since both scores include white blood cell and neutrophil criteria among their components.
- These scores are indicative of appendicitis, so maybe this interaction term alone can be a good predictor of appendicitis.
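- One way to test that hunch later is to measure how well a single feature ranks appendicitis cases above non-cases, which is what ROC AUC measures. The sketch below computes it from ranks (the Mann-Whitney U formulation) on made-up values. `rank_auc` is a hypothetical helper written for this example, and it assumes the labels are already encoded as 0/1.

```python
import pandas as pd

def rank_auc(scores: pd.Series, labels: pd.Series) -> float:
    """AUC via the Mann-Whitney U statistic; labels must be 0/1."""
    ranks = scores.rank(method="average")
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    u_stat = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return float(u_stat / (n_pos * n_neg))

# Toy feature values and labels (made up).
toy = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "label":   [0,   0,   1,   0,   1,   1],
})
auc = rank_auc(toy["feature"], toy["label"])
print(auc)
```

- A value near 0.5 would mean the feature ranks cases no better than chance, while a value near 1.0 would mean it separates the two groups almost perfectly.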