BREAST CANCER
WISCONSIN (DIAGNOSTIC)
DATASET
Supervised Analysis for Malignancy Detection
Authors: Pablo Laso Mielgo, Helena Sofía Yaben
1. PRESENTATION OF THE PROBLEM
Breast cancer is the most common type of
cancer among women across the world.
It is also the leading cause of death from cancer in women.
Biopsy is essential to distinguish between malignant and benignant tissue, but it requires
expensive and bulky equipment and highly trained professionals.
Digitalization of pathology slides and application of AI can make diagnosis faster and cheaper,
and provide a useful tool for pathologists.
Inferential analysis can help statistically determine differences between malignant and
benignant populations. It can also help determine important features for further AI
modelling.
Is there any difference between
benignant and malignant
populations for each feature?
In other words, which features
could significantly help
determine the diagnosis of each
sample?
2. MATERIALS
2.2 Dataset Information: acquisition of the data
1. Tissue sample from Breast
Tumor by Fine-Needle
Aspiration (FNA)
569 tissue samples
2. Histopathological analysis of 10 characteristics for each
nucleus:
Radius (mean of distances from the center to points
on the perimeter)
Texture (standard deviation of gray-scale values)
Perimeter
Area
Smoothness (local variation in radius lengths)
Compactness (perimeter^2 / area - 1.0)
Concavity (severity of concave portions of the
contour)
Concave points (number of concave portions of the
contour)
Symmetry
Fractal Dimension ("coastline approximation" - 1)
*No units were provided
3. Mean, Standard Error (SE) and Worst Values of all nuclei
characteristics in each sample
30 features
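As a worked illustration of the compactness definition above (perimeter^2 / area - 1.0), the sketch below evaluates it for two simple shapes. The function name `compactness` is hypothetical and is not taken from the original analysis; the shapes are only there to show how the value grows as a contour departs from a circle.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


# A circle of radius r: perimeter 2*pi*r, area pi*r^2.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A square of side s: perimeter 4*s, area s^2.
s = 3.0
square = compactness(4 * s, s ** 2)
```

For a circle the value is 4*pi - 1 regardless of the radius, and any less regular contour with the same area scores higher.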
2.2 Dataset Information: predictors and target variable
For each nucleus in each sample:
1. Radius (mean of distances from the center to points
on the perimeter)
2. Texture (standard deviation of gray-scale values)
3. Perimeter
4. Area
5. Smoothness (local variation in radius lengths)
6. Compactness (perimeter^2 / area - 1.0)
7. Concavity (severity of concave portions of the
contour)
8. Concave points (number of concave portions of the
contour)
9. Symmetry
10. Fractal Dimension ("coastline approximation" - 1)
Mean, Standard Error (SE) and Worst Values of nuclei
characteristics in each sample
30 features
+
Diagnosis:
1. Benignant (B)
2. Malignant (M)
1 target variable
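The 30 predictors are simply the 10 characteristics crossed with the 3 per-sample summaries. The sketch below enumerates them, assuming a "<characteristic>_<statistic>" naming scheme; that scheme is inferred from names such as "texture_se" used later in this deck, and the exact spelling of the other column names is an assumption.

```python
from itertools import product

# The 10 nuclear characteristics listed above.
CHARACTERISTICS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# The 3 per-sample summaries: mean, standard error, and worst value.
STATISTICS = ["mean", "se", "worst"]

# Assumed naming scheme: "<characteristic>_<statistic>".
feature_names = [f"{c}_{s}" for s, c in product(STATISTICS, CHARACTERISTICS)]
target_name = "diagnosis"  # B (benignant) or M (malignant)
```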
3. EXPLORATORY DATA ANALYSIS
3.1 Data Types
Identifier: ID (numerical and discrete). Drop column.
Target variable: Diagnosis (categorical and nominal). Transform to numerical discrete.
Empty column (numerical, 0). Drop column.
Predictors: Mean, SE and Worst of Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave points, Symmetry and Fractal dimension (numerical and continuous).
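The column handling described above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the authors' code: the column names `id`, `diagnosis`, `radius_mean` and `empty` are assumptions, and the B -> 0 / M -> 1 encoding is one possible choice.

```python
import pandas as pd

# Toy rows with made-up values; column names are assumptions.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [18.0, 12.5, 11.0],
    "empty": [float("nan")] * 3,
})

# Drop the identifier and the empty column: neither carries diagnostic signal.
df = df.drop(columns=["id", "empty"])

# Encode the nominal target as a discrete number (B -> 0, M -> 1).
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1}).astype(int)
```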
3.2 Descriptive statistics of the dataset
To summarize the basic statistical characteristics of the dataset, we
use the describe method.
No abnormal maximum or minimum values were initially identified (e.g., 0
values or max/min values far above/below the mean).
Plotting the count of samples for each class shows:
Moderate class imbalance: 62.7% B (majority class) vs. 37.3% M (minority class).
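The class shares above can be reproduced with a simple count. The sketch assumes the 569 samples split into 357 B and 212 M, which is consistent with the reported percentages; the label list is rebuilt here instead of being read from the data file.

```python
from collections import Counter

# Assumed class counts (357 B + 212 M = 569 samples).
labels = ["B"] * 357 + ["M"] * 212

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
```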
3.3 Univariate Analysis: Graphical
1. Mean values for radius, texture,
perimeter, area, compactness,
concavity and concave points
seem to be larger in malignant
tissue.
2. Within each class (malignant/benignant),
features follow, approximately,
a normal distribution.
3.4 Multivariate Analysis: Correlation
1. Strong positive relationship between target variable
and mean and worst values for radius, area,
perimeter, concavity and concave points (Pearson
correlation coefficient > 0.7).
2. Strongest relationship with worst value for concave
points.
3. Strong correlation between radius, perimeter and
area.
4. Strong correlation between concave points,
concavity and compactness.
Correlation between features may imply
redundant information for diagnosis.
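The strong correlation between radius, perimeter and area is expected on geometric grounds. The sketch below illustrates this with synthetic circular contours (not values from the dataset) and lists the feature pairs whose Pearson coefficient exceeds the 0.7 threshold used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic radii (arbitrary units); not values from the dataset.
radius = rng.uniform(5.0, 25.0, size=200)
perimeter = 2 * np.pi * radius   # exactly linear in radius
area = np.pi * radius ** 2       # monotonic, but quadratic in radius

# np.corrcoef treats rows as variables and columns as observations.
corr = np.corrcoef(np.vstack([radius, perimeter, area]))

# Feature pairs above the 0.7 threshold.
names = ["radius", "perimeter", "area"]
strong_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if corr[i, j] > 0.7
]
```

When one feature is almost a deterministic function of another, keeping both adds little information for diagnosis.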
3.5 Multivariate Analysis: Scatter Plots (Diagnosis vs. each Feature / Feature vs. Feature)
Only 1 feature suitable for Logistic Regression
Only 2 features suitable for models such as KNN
3.6 Test for Normality
Kolmogorov-Smirnov (KS) test for
normality (n > 50) for each population
(malignant/benignant), for each feature:
H_0: The sample data distribution is not
significantly different from a normal
distribution.
H_1: The sample data distribution is
significantly different from a normal
distribution.
In general, for a given feature, the malignant and
benignant populations are not both
normally distributed.
Non-parametric methods should therefore be
used for hypothesis testing.
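A minimal sketch of the normality check described above, using `scipy.stats.kstest` on synthetic samples instead of the real features. Standardizing with the sample mean and standard deviation is an assumption about how the test was set up; because those parameters are estimated from the same sample, the resulting p-value tends to be conservative, and a Lilliefors-type correction would be the stricter choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples standing in for one feature in one population.
normal_like = rng.normal(loc=10.0, scale=2.0, size=300)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=300)


def ks_normality_pvalue(x: np.ndarray) -> float:
    """Two-sided KS test of the standardized sample against N(0, 1)."""
    z = (x - x.mean()) / x.std(ddof=1)
    return stats.kstest(z, "norm").pvalue


alpha = 0.05
p_normal = ks_normality_pvalue(normal_like)
p_skewed = ks_normality_pvalue(skewed)

# Reject H_0 (normality) when p < alpha.
reject_normal = p_normal < alpha
reject_skewed = p_skewed < alpha
```

In the analysis itself the function would be applied once per feature and per population (malignant/benignant).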
4. PRE-PROCESSING OF THE DATASET
4.1 Missing Values
1. No missing values, neither explicit (NaN values) nor
implicit (e.g. repeated 0 values for an instance across
different features).
2. Samples with 0 values show the same behavior: all are
associated with diagnosis = B and have the same zero-valued features,
which suggests that these values are indeed
correct.
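The two checks above (explicit NaN values, and zeros that might hide missing data) can be sketched as follows. The small matrix is made up for illustration; only the logic is meant to carry over.

```python
import numpy as np

# Toy feature matrix (rows = samples, columns = features); made-up values.
X = np.array([
    [12.1, 0.08, 0.00],
    [17.9, 0.12, 0.07],
    [11.4, 0.09, 0.00],
])
y = np.array(["B", "M", "B"])

# Explicit missing values: NaN entries per feature.
nan_per_feature = np.isnan(X).sum(axis=0)

# Candidate implicit missing values: rows that contain at least one zero.
rows_with_zero = (X == 0).any(axis=1)

# Which classes do the zero-containing rows belong to?
classes_with_zero = set(y[rows_with_zero].tolist())
```

If the zero-containing rows all share one class and the same zero-valued columns, as in this toy case, the zeros are more plausibly real measurements than missing entries.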
4.2 Outliers
1. Outliers were detected independently for benignant and malignant samples, as they showed different distributions.
2. Values with abs(z-score) >= 2.5 were considered outliers.
3. Very low percentage of outliers per feature (0.3-2%).
4. 20.5% of rows (117 instances) have at least one outlier. We should not consider dropping this much data.
5. Since variables are highly correlated, random/mean/median imputation methods can introduce bias in the analysis.
A tailored imputation method would be needed, which is computationally expensive.
6. From previous work with AI methods: performance metrics do not improve after dropping outliers; they
remain similar. Outliers may not be incorrect values, just values far from the rest of the population.
7. We decided to keep these values.
Original dataset vs. dataset without outliers (deletion)
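The per-class z-score rule above can be sketched as below. The helper name `flag_outliers`, the use of the population standard deviation, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def flag_outliers(X: np.ndarray, y: np.ndarray, threshold: float = 2.5) -> np.ndarray:
    """Boolean mask of entries with |z-score| >= threshold, computed per class."""
    mask = np.zeros(X.shape, dtype=bool)
    for label in np.unique(y):
        rows = y == label
        group = X[rows]
        std = group.std(axis=0, ddof=0)
        std = np.where(std == 0, 1.0, std)  # constant columns flag nothing
        z = (group - group.mean(axis=0)) / std
        mask[rows] = np.abs(z) >= threshold
    return mask


rng = np.random.default_rng(0)
# Synthetic data: 100 samples per class, 3 features.
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(5.0, 1.0, (100, 3))])
y = np.array(["B"] * 100 + ["M"] * 100)
X[0, 0] = 10.0  # injected extreme value in class B

mask = flag_outliers(X, y)
outliers_per_feature = mask.mean(axis=0)     # share of flagged values per feature
rows_with_outlier = mask.any(axis=1).mean()  # share of rows with >= 1 flagged value
```

The share of rows with at least one flagged value is always at least as large as any per-feature share, which is why a few percent per feature can add up to a sizeable fraction of rows.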
5. HYPOTHESIS TESTING
Two populations for each feature: malignant and benignant.
Parametric:
H_0: Mean values for malignant and benignant populations are the same.
H_1: Mean values for malignant populations are different from those for benignant populations.
Non-parametric:
H_0: Medians for malignant and benignant populations are the same.
H_1: Medians for malignant populations are different from those for benignant populations.
This is a two-sided hypothesis testing problem with two independent samples.
Non-parametric methods are preferred.
Both parametric and non-parametric (Mann-Whitney-Wilcoxon) methods will be compared for
learning purposes.
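The two-sample, two-sided comparison described above can be sketched with SciPy on synthetic data. The deck does not name the parametric test, so Welch's t-test is an assumption here; the group sizes (357 and 212) and the lognormal samples are likewise illustrative, not the real features.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-ins for one feature in each population.
benignant = rng.lognormal(mean=0.0, sigma=0.5, size=357)
malignant = rng.lognormal(mean=1.0, sigma=0.5, size=212)

# Non-parametric: Mann-Whitney-Wilcoxon rank-sum test, two-sided.
u_stat, p_nonparam = stats.mannwhitneyu(
    malignant, benignant, alternative="two-sided"
)

# Parametric (assumed here): Welch's t-test, two-sided by default.
t_stat, p_param = stats.ttest_ind(malignant, benignant, equal_var=False)

alpha = 0.05
reject_nonparam = p_nonparam < alpha
reject_param = p_param < alpha
```

Running both tests per feature and comparing which null hypotheses are rejected gives the kind of side-by-side comparison the deck describes.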
E.g.: Non-parametric vs. parametric (just some features)
According to the results of the
non-parametric method, mean
and median values are different
between benignant and
malignant samples for every
feature except
"fractal_dimension_mean",
"texture_se", and
"smoothness_se".
6. CONCLUSION
Dataset with excellent quality for its purpose:
1. Simple data visualization gives great insight into the data distribution.
We saw the behavior of malignant nuclei just by scatter plotting B vs. M for each feature.
Malignant samples tend to have greater values than benignant samples. It was statistically
determined that malignant mean and median values were different from those of benignant
samples.
It was statistically concluded that nuclei characteristics vary depending on the malignancy/benignancy
of the sample. Along with AI models, inferential analysis can help determine decision boundaries and
translate them to clinical practice.
2. Exhaustive and complicated pre-processing is not needed to perform inferential analysis and draw
relevant conclusions.
3. Results show that some features are more relevant than others when determining the diagnosis, which is
relevant for feature selection → discard "fractal_dimension_mean", "texture_se", and "smoothness_se".
4. Feature selection can improve the efficiency of the histopathological analysis by discarding non-relevant
features.
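The feature selection step in point 3 amounts to filtering the predictor list. A minimal sketch, assuming the "<characteristic>_<statistic>" naming scheme suggested by the three names quoted above (the remaining names are assumptions):

```python
# Assumed naming scheme: "<characteristic>_<statistic>".
characteristics = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
summaries = ["mean", "se", "worst"]
all_features = [f"{c}_{s}" for s in summaries for c in characteristics]

# Features for which the non-parametric test found no significant difference.
not_significant = {"fractal_dimension_mean", "texture_se", "smoothness_se"}

# Keep every predictor that is not in the discard set.
selected = [f for f in all_features if f not in not_significant]
```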