
Final Project

In this final project, the knowledge and skills acquired in the Nanodegree are put to use: the heart failure prediction data is evaluated with both AutoML and HyperDrive in order to determine the best model and to compare the two technologies. Below is a representation of what is implemented in this project with the help of MS Azure and its technological possibilities.

(Image: overview of the project architecture)

Dataset

Source of the data set

For this project we use a dataset from Kaggle containing data on cardiovascular diseases (CVDs). CVDs are the number one cause of death worldwide, claiming the lives of an estimated 17 million people each year, which represents approximately 31% of all deaths worldwide. Heart failure is one of the common events caused by CVDs. This dataset contains 12 features that can be used to predict mortality from heart failure. The aim is to improve prediction, so that people with cardiovascular disease, or at high cardiovascular risk due to one or more risk factors such as hypertension, diabetes, or hyperlipidaemia, can receive early detection and treatment.

Content of the data set

The dataset contains 12 features that can be used to predict mortality from heart failure:

  • age: Age of the patient
  • anaemia: Decrease of red blood cells or hemoglobin
  • creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
  • diabetes: Whether the patient has diabetes
  • ejection_fraction: Percentage of blood leaving the heart at each contraction
  • high_blood_pressure: Whether the patient has hypertension
  • platelets: Platelets in the blood (kiloplatelets/mL)
  • serum_creatinine: Level of serum creatinine in the blood (mg/dL)
  • serum_sodium: Level of serum sodium in the blood (mEq/L)
  • sex: Woman or man (gender at birth)
  • smoking: Whether the patient smokes
  • time: Follow-up period (days)
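As a quick sanity check, a record from this dataset can be represented as a plain Python dictionary and validated against the feature list (a minimal sketch; the field values are taken from one of the example patients used in the scoring request further below in this README):

```python
# The 12 features the model expects, as listed above.
FEATURES = {
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
}

# One of the example patients from the scoring request further below.
patient = {
    "age": 70.0, "anaemia": 1, "creatinine_phosphokinase": 4020,
    "diabetes": 1, "ejection_fraction": 32, "high_blood_pressure": 1,
    "platelets": 234558.23, "serum_creatinine": 1.4, "serum_sodium": 125,
    "sex": 1, "smoking": 0, "time": 12,
}

# Verify the record carries exactly the 12 expected features.
assert set(patient) == FEATURES
print(len(FEATURES))  # 12
```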

Target

Our goal is to develop a machine learning algorithm that can detect whether a person is likely to die from heart failure. This will help in diagnosis and early prevention. For this, the 12 features mentioned above are used to develop a model for the detection.

Attention!

This is an experiment that was developed in the course of a test for the Udacity learning platform. Do not use this model in a medical environment or for acute indications. Always consult your doctor for medical questions or the medical emergency service in acute cases!

Hyperparameter Tuning

The model used here is a logistic regression model that is trained with a custom script, train.py. The dataset is fetched from the raw CSV file in this repository (see the path in train.py below). The hyperparameters chosen for the scikit-learn model are the regularisation strength (C) and the maximum number of iterations (max_iter). The trained model is evaluated on 25% of the data held out from the original dataset; the remaining 75% is used for training.

Hyperparameter tuning with HyperDrive requires several steps:

  • define the parameter search space
  • define a sampling method
  • select a primary metric for optimisation
  • select an early stopping policy

The parameter sampling method used for this project is Random Sampling. It randomly selects hyperparameter values from the defined search space, so the entire space does not need to be searched exhaustively. Random Sampling saves time and is much faster than Grid Sampling and Bayesian Sampling, which are only recommended if you have the budget to explore the entire search space.

The early stopping policy used in this project is the Bandit policy, which is based on a slack factor (in this case 0.1) and an evaluation interval (in this case 1). This policy terminates runs whose primary metric is not within the specified slack factor, compared to the run with the best performance. This saves time and resources, as runs that are unlikely to produce good results are terminated early.
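In code, the policy and the surrounding HyperDrive configuration could look roughly like this (a sketch, not the verbatim notebook: `src` is an assumed ScriptRunConfig wrapping train.py, "Accuracy" is the assumed name of the metric logged by train.py, `param_sampling` is defined in the next section, and the run limits are illustrative):

```python
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Stop runs whose metric drifts outside the slack factor of the best run,
# evaluated at every interval.
early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1)

hyperdrive_config = HyperDriveConfig(
    run_config=src,                         # ScriptRunConfig for train.py (assumed)
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="Accuracy",         # metric logged by train.py (assumed)
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                      # illustrative budget
    max_concurrent_runs=4,
)
```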

Parameters

in the Jupyter Notebook

from azureml.train.hyperdrive import RandomParameterSampling, uniform, choice

# Create the different params that will be needed during training
param_sampling = RandomParameterSampling(
    {
        "--C": uniform(0.001, 100),
        "--max_iter": choice(50, 90, 125, 170)
    }
)

and in the train.py

# Path to dataset
path_to_data = "https://github.com/raw/Petopp/Udacity_Final_Project/main/heart_failure_clinical_records_dataset.csv"

# Split data into train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
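Put together, the core of train.py can be sketched as a single function (an illustrative reconstruction, not the verbatim script; the argparse wiring that receives --C and --max_iter from HyperDrive is omitted, and the column name DEATH_EVENT follows the dataset description above):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train(df: pd.DataFrame, C: float = 1.0, max_iter: int = 100) -> float:
    """Fit a logistic regression on the 12 features and return test accuracy.

    DEATH_EVENT is the label column; the remaining columns are features.
    In train.py, C and max_iter arrive via argparse from HyperDrive.
    """
    y = df["DEATH_EVENT"]
    x = df.drop(columns=["DEATH_EVENT"])

    # 25% of the data is held out for evaluation, as described above.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    model = LogisticRegression(C=C, max_iter=max_iter)
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)
```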

Results

(Screenshot: details from the Jupyter Notebook)

(Screenshot: details from Azure Experiments)

(Screenshot: experiment completed)

The result in detail

['--C', '97.2861169940756', '--max_iter', '125']
['azureml-logs/55_azureml-execution-tvmps_b3d8a370fdab6acc496b1fa398220948b9ae8dd605d8df21bbd0582f1cc744bc_d.txt', 'azureml-logs/65_job_prep-tvmps_b3d8a370fdab6acc496b1fa398220948b9ae8dd605d8df21bbd0582f1cc744bc_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_b3d8a370fdab6acc496b1fa398220948b9ae8dd605d8df21bbd0582f1cc744bc_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/106_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/model.joblib']
Best Run Accuracy: 0.84
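These values can be retrieved from the completed HyperDrive run roughly as follows (a sketch; `hyperdrive_run` is the assumed name of the submitted run, "Accuracy" the assumed logged metric, and the model name is illustrative):

```python
# Pick the best child run by the primary metric.
best_run = hyperdrive_run.get_best_run_by_primary_metric()

# Hyperparameters of the winning run and its accuracy.
print(best_run.get_details()["runDefinition"]["arguments"])
print("Best Run Accuracy:", best_run.get_metrics()["Accuracy"])

# Register the trained model stored by train.py under outputs/.
best_run.register_model(model_name="hyperdrive-heart-failure",
                        model_path="outputs/model.joblib")
```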

Automated ML

The Automated ML run was created using an instance of AutoMLConfig. The AutoMLConfig class is the way to configure an automated machine learning experiment with the AutoML SDK. The following parameters were used for the AutoML run.

| Parameter | Value | Description |
| --- | --- | --- |
| task | 'classification' | Classification is selected since we are performing binary classification, i.e. whether or not a death event occurs |
| debug_log | 'automl_errors.log' | The debug information is written to this file instead of the automl.log file |
| training_data | train_data | train_data is passed, which contains the data to be used for training |
| label_column_name | 'DEATH_EVENT' | Since the DEATH_EVENT column contains what we need to predict, it is passed |
| compute_target | compute_cluster | The compute target on which we want this AutoML experiment to run |
| experiment_timeout_minutes | 30 | Specifies the time that all iterations combined can take. Due to the lack of resources this is set to 30 |
| primary_metric | 'accuracy' | The metric that AutoML will optimise for model selection. Accuracy is selected as it is well suited to binary classification problems |
| enable_early_stopping | True | Early stopping is enabled to terminate a run if the score is not improving in the short term. This allows AutoML to explore better models in less time |
| featurization | 'auto' | Featurization is set to auto so that the featurization step is done automatically |
| n_cross_validations | 4 | Specified so that there are 4 different trainings and each training uses 1/4 of the data for validation |
| verbosity | logging.INFO | Specifies the verbosity level for writing to the log file |
| enable_onnx_compatible_models | True | Export to ONNX format from Azure ML is enabled for later export; more about ONNX can be found HERE |

The parameters in code:

# automl settings 
automl_settings = {
    "enable_early_stopping" : True,
    "experiment_timeout_minutes": 30,
    "n_cross_validations": 4,
    "featurization": "auto",
    "primary_metric": "accuracy",
    "verbosity": logging.INFO
}

# automl config (with onnx compatible modus)
automl_config = AutoMLConfig(
    task="classification",
    debug_log = "automl_errors.log",
    training_data=train_data,
    label_column_name="DEATH_EVENT",
    compute_target=compute_cluster,
    enable_onnx_compatible_models=True,
    **automl_settings
)
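The config is then submitted as an experiment, and the best model is retrieved once the run completes (a sketch; `ws` is the assumed workspace object and the experiment name is illustrative):

```python
from azureml.core import Experiment

# Submit the AutoML run to the workspace and wait for completion.
experiment = Experiment(ws, "automl-heart-failure")  # assumed name
remote_run = experiment.submit(automl_config, show_output=True)
remote_run.wait_for_completion()

# Retrieve the best run and its fitted pipeline.
best_run, fitted_model = remote_run.get_output()
print(best_run.get_metrics()["accuracy"])
```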

Running

(Screenshot: during the search for the best model)

(Screenshot: after finding the best model)

Results

The best result in the Jupyter view:

(Screenshot: best result in the Jupyter view)

and in Azure Experiments:

(Screenshot: best result in Azure Experiments)

Model Deployment

The best model was the "MaxAbsScaler, GradientBoosting" model from the AutoML experiment. We will now deploy and test this model in the next steps.

The first step is the deployment with these parameters:

from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Create inference config
script_file_name = "inference/score.py"
inference_config = InferenceConfig(entry_script=script_file_name)

aciconfig = AciWebservice.deploy_configuration(cpu_cores=2,
                                               memory_gb=4,
                                               tags={"area": "hfData", "type": "automl_classification"},
                                               description="Heart Failure Prediction (Experiment!)")

aci_service_name = "automl-heart-failure-model"
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

# Enable Application Insights
aci_service.update(enable_app_insights=True)
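The entry script inference/score.py referenced above follows the usual init()/run() pattern of Azure ML web services. A minimal, self-contained sketch (the real script loads the registered AutoML model, e.g. via joblib and Model.get_model_path(); here a stub model stands in so that only the shape of the code is shown):

```python
import json

model = None  # populated by init() when the container starts

class _StubModel:
    """Stands in for the deserialised AutoML pipeline in this sketch."""
    def predict(self, rows):
        # Dummy rule for illustration only; not the real model.
        return [1 for _ in rows]

def init():
    # Real score.py would load the registered model instead, e.g.:
    #   model = joblib.load(Model.get_model_path("automl-heart-failure-model"))
    global model
    model = _StubModel()

def run(raw_data):
    """Parse the JSON request body and return predictions as JSON."""
    rows = json.loads(raw_data)["data"]
    result = model.predict(rows)
    return json.dumps({"result": list(result)})

init()
print(run(json.dumps({"data": [{"age": 70.0}, {"age": 65.0}]})))
# → {"result": [1, 1]}
```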

Finally, we have the confirmation in Jupyter:

automl-heart-failure-model
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-05-12 08:44:14+00:00 Creating Container Registry if not exists..
2021-05-12 08:44:25+00:00 Use the existing image.
2021-05-12 08:44:25+00:00 Generating deployment configuration.
2021-05-12 08:44:25+00:00 Submitting deployment to compute.
2021-05-12 08:44:29+00:00 Checking the status of deployment automl-heart-failure-model..
2021-05-12 08:48:15+00:00 Checking the status of inference endpoint automl-heart-failure-model.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy

and in the Web (Endpoints):

(Screenshot: the deployed endpoint in Azure)

After these steps, we now test the model with the following code/parameters:

import time
import requests
import json

# Short waiting time; the process is more stable with it
time.sleep(30)

# URL for the web service, should be similar to:
print ("Scoring URL: "+aci_service.scoring_uri)
scoring_uri = aci_service.scoring_uri


# Two data sets are evaluated, we then receive two results back for this
data = {"data":
        [
          {
            "age": 70.0,
            "anaemia": 1,
            "creatinine_phosphokinase": 4020,
            "diabetes": 1,
            "ejection_fraction": 32,
            "high_blood_pressure": 1,
            "platelets": 234558.23,
            "serum_creatinine": 1.4,
            "serum_sodium": 125,
            "sex": 1,
            "smoking": 0,
            "time": 12
          },
          {
            "age": 65.0,
            "anaemia": 0,
            "creatinine_phosphokinase": 4221,
            "diabetes": 0,
            "ejection_fraction": 22,
            "high_blood_pressure": 0,
            "platelets": 404567.23,
            "serum_creatinine": 1.1,
            "serum_sodium": 115,
            "sex": 0,
            "smoking": 1,
            "time": 7
          },
      ]
    }
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {"Content-Type": "application/json"}
# If authentication is enabled, set the authorization header

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

and the result of this is:

Scoring URL: http://ed926a23-aca1-4a20-980a-71f05596ce2b.southcentralus.azurecontainer.io/score
{"result": [1, 1]}

This shows that the model has been successfully published; otherwise it would not be reachable and we would receive an error message.

Save the best model in ONNX

from azureml.automl.runtime.onnx_convert import OnnxConverter
automl_best_run_onnx, automl_fitted_model_onnx = remote_run.get_output(return_onnx_model=True)
OnnxConverter.save_onnx_model(automl_fitted_model_onnx, './outputs/AutoML.onnx' )

The best result is stored in the ONNX format so that the calculations can be understood by other systems. With ONNX, AI developers can exchange models between different tools and choose the best combination of these tools for them.

Screen Recording

See on YouTube

Standout Suggestions

  • Use a longer experiment runtime
  • Use larger datasets for training
  • Use the ONNX file to bring the model to other edge devices
  • Bring more robustness into the code, to react better to missing data or when releases are delayed in Azure.
