Analysis of the Breast Cancer Wisconsin (Diagnostic) Data Set.

jarnokoetsier/ScientificProgramming

Scientific Programming Project (MSB1015)

For the Scientific Programming (MSB1015) course, an adjusted version of the Breast Cancer Wisconsin (Diagnostic) Data Set was analysed. This repository contains all the scripts that were used for this analysis.

  1. Data
  2. Research aim
  3. Analysing the data
  4. App
  5. Contact

Data

The original Breast Cancer Wisconsin (Diagnostic) Data Set can be downloaded from Kaggle. However, for the current analysis a modified version of this data set was used. Contact me to access the adjusted data set.

The data set consists of 569 samples and includes the sample ID, the sample diagnosis (malignant (M): 212; benign (B): 357), and 30 variables computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These 30 variables describe features of the cell nuclei in these images and encompass the mean, the standard error (SE), and the mean of the three largest values (worst) of the following 10 characteristics:

  1. Radius: The mean of distances from center to points on the border of the cell nucleus.
  2. Texture: The standard deviation of gray-scale values of the digitized image.
  3. Perimeter: The total length of the border of the cell nucleus.
  4. Area: The size of the surface of the cell nucleus.
  5. Smoothness: The local variation in radius lengths.
  6. Compactness: Perimeter² / Area - 1.0
  7. Concavity: The severity of concave portions of the contour of the cell nucleus.
  8. Concave points: The number of concave portions of the contour of the cell nucleus.
  9. Symmetry: Similarity of the radius length on both sides of the diameter.
  10. Fractal dimension: Coastline approximation - 1

More information about the variables can be found on page 8 in this paper by Westerdijk (2018).
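
As a small illustration of the compactness definition in the list above, the following sketch evaluates Perimeter² / Area - 1.0 for two idealised outlines. The function name `compactness` and the example shapes are assumptions made for illustration and are not taken from the analysis scripts; the values stored in the data set may be on a different scale.

```r
# Compactness as defined in the list above: perimeter^2 / area - 1.0
# (hypothetical helper, not part of the repository scripts)
compactness <- function(perimeter, area) {
  stopifnot(all(area > 0))
  perimeter^2 / area - 1.0
}

# Idealised circular outline with radius r
r <- 5
circle <- compactness(perimeter = 2 * pi * r, area = pi * r^2)

# Idealised square outline with the given side length
side <- 4
square <- compactness(perimeter = 4 * side, area = side^2)

print(c(circle = circle, square = square))
```

Under this formula a circle yields the smallest possible value (4π - 1), so less regular outlines score higher.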

Research aim

The aim of the analysis is three-fold:

  1. Construct a robust classifier to distinguish malignant from benign samples (Classification).
  2. Identify subclasses within the malignant samples (Clustering).
  3. Create an app for the prediction and visualization of new samples (App).

Analysing the data

When performing the analysis, be aware of the following:

  1. Put the data file (Data.xlsx) into the main folder (..PATH../ScientificProgramming/).
  2. Run the scripts in the following order:
    • Pre-processing/Preprocessing.R
    • Classification/Classification.R
    • Clustering/Clustering.R
    • App
  3. Follow the instructions in the scripts carefully to ensure a successful analysis.
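
Before starting, it can help to confirm that the data file and the scripts are where the steps above expect them. The sketch below is one possible check; the helper `check_project_layout` and the variable names are hypothetical and not part of the repository, and `..PATH..` still has to be replaced with the location of the main folder on your machine.

```r
# Files named in this README, relative to the main folder
expected_files <- c(
  "Data.xlsx",
  "Pre-processing/Preprocessing.R",
  "Classification/Classification.R",
  "Clustering/Clustering.R"
)

# Hypothetical helper: returns a named logical vector,
# TRUE where the expected file exists under project_dir
check_project_layout <- function(project_dir, files = expected_files) {
  found <- file.exists(file.path(project_dir, files))
  names(found) <- files
  found
}

# Replace ..PATH.. with the location of the main folder
layout <- check_project_layout("..PATH../ScientificProgramming")
missing_files <- names(layout)[!layout]

if (length(missing_files) > 0) {
  message("Missing: ", paste(missing_files, collapse = ", "))
} else {
  message("All expected files found.")
}
```

The check only looks at file locations; it does not run any of the scripts, which still need to be executed in the order listed above.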

[Figure: Workflow]

App

To run the app in RStudio, open App/ui.R, App/server.R, or App/global.R in RStudio and click "Run App" in the top right corner.

[Figure: Start App]

If this is not possible, run the following commands in the R console:

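# Install the shiny package if it is not installed yet
if (!requireNamespace("shiny", quietly = TRUE)) install.packages("shiny")

# Load the shiny package
library(shiny)

# Run the shiny app (replace ..PATH.. with the location of the repository)
runApp("..PATH../ScientificProgramming/App")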

Now you can use the classification model to predict the class of new samples!

[Figure: App Demo]

Contact

Feel free to contact me via email: j.koetsier@student.maastrichtuniversity.nl
