
Final Project

In this final project, the knowledge and skills acquired in the Nanodegree are put to use: the heart failure prediction data is evaluated with both AutoML and HyperDrive in order to determine the best model and to compare the two technologies. Below is a representation of what is implemented in this project with the help of MS Azure and its technological possibilities.

(Image: overview of the project architecture)

Dataset

Source of the data set

For this project we use a dataset from Kaggle containing data on cardiovascular diseases (CVDs). CVDs are the number one cause of death worldwide, claiming the lives of an estimated 17 million people each year, which represents approximately 31% of all deaths worldwide. Heart failure is one of the common events caused by CVDs. This dataset contains 12 features that can be used to predict mortality from heart failure. The aim is to improve prediction, so that people with cardiovascular disease, or at high cardiovascular risk due to one or more risk factors such as hypertension, diabetes, or hyperlipidaemia, can receive early detection and treatment.

Content of the data set

The dataset contains 12 features that can be used to predict mortality from heart failure:

  • age: Age of the patient
  • anaemia: Decrease of red blood cells or hemoglobin
  • creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
  • diabetes: Whether the patient has diabetes
  • ejection_fraction: Percentage of blood leaving the heart at each contraction
  • high_blood_pressure: Whether the patient has hypertension
  • platelets: Platelets in the blood (kiloplatelets/mL)
  • serum_creatinine: Level of serum creatinine in the blood (mg/dL)
  • serum_sodium: Level of serum sodium in the blood (mEq/L)
  • sex: Woman or man (gender at birth)
  • smoking: Whether the patient smokes
  • time: Follow-up period (days)
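As a quick sanity check, a record from this dataset can be represented as a plain Python dictionary and validated against the feature list (a minimal sketch; the field values are taken from one of the example patients used in the scoring request further below in this README):

```python
# The 12 features the model expects, as listed above.
FEATURES = {
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
}

# One of the example patients from the scoring request further below.
patient = {
    "age": 70.0, "anaemia": 1, "creatinine_phosphokinase": 4020,
    "diabetes": 1, "ejection_fraction": 32, "high_blood_pressure": 1,
    "platelets": 234558.23, "serum_creatinine": 1.4, "serum_sodium": 125,
    "sex": 1, "smoking": 0, "time": 12,
}

# Verify the record carries exactly the 12 expected features.
assert set(patient) == FEATURES
print(len(FEATURES))  # 12
```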

Target

Our goal is to develop a machine learning algorithm that can detect whether a person is likely to die from heart failure. This will help in diagnosis and early prevention. For this, the 12 features mentioned above are used to develop a model for the detection.

Attention!

This is an experiment that was developed in the course of a test for the Udacity learning platform. Do not use this model in a medical environment or for acute indications. Always consult your doctor for medical questions or the medical emergency service in acute cases!

Hyperparameter Tuning

The model used here is a logistic regression model that is trained with a custom script, train.py. The dataset is fetched from the raw CSV file in this repository (see the path in train.py below). The hyperparameters chosen for the scikit-learn model are the regularisation strength (C) and the maximum number of iterations (max_iter). The trained model is evaluated on 25% of the data held out from the original dataset; the remaining 75% is used for training.

Hyperparameter tuning with HyperDrive requires several steps:

  • define the parameter search space
  • define a sampling method
  • select a primary metric for optimisation
  • select an early stopping policy

The parameter sampling method used for this project is Random Sampling. It randomly selects hyperparameter values from the defined search space, so the entire space does not need to be searched exhaustively. Random Sampling saves time and is much faster than Grid Sampling and Bayesian Sampling, which are only recommended if you have the budget to explore the entire search space.

The early stopping policy used in this project is the Bandit policy, which is based on a slack factor (in this case 0.1) and an evaluation interval (in this case 1). This policy terminates runs whose primary metric is not within the specified slack factor, compared to the run with the best performance. This saves time and resources, as runs that are unlikely to produce good results are terminated early.
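In code, the policy and the surrounding HyperDrive configuration could look roughly like this (a sketch, not the verbatim notebook: `src` is an assumed ScriptRunConfig wrapping train.py, "Accuracy" is the assumed name of the metric logged by train.py, `param_sampling` is defined in the next section, and the run limits are illustrative):

```python
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Stop runs whose metric drifts outside the slack factor of the best run,
# evaluated at every interval.
early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1)

hyperdrive_config = HyperDriveConfig(
    run_config=src,                         # ScriptRunConfig for train.py (assumed)
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="Accuracy",         # metric logged by train.py (assumed)
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                      # illustrative budget
    max_concurrent_runs=4,
)
```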

Parameters

in the Jupyter Notebook

from azureml.train.hyperdrive import RandomParameterSampling, uniform, choice

# Create the different params that will be needed during training
param_sampling = RandomParameterSampling(
    {
        "--C": uniform(0.001, 100),
        "--max_iter": choice(50, 90, 125, 170)
    }
)

and in the train.py

# Path to dataset
path_to_data = "https://github.com/raw/Petopp/Udacity_Final_Project/main/heart_failure_clinical_records_dataset.csv"

# Split data into train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
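Put together, the core of train.py can be sketched as a single function (an illustrative reconstruction, not the verbatim script; the argparse wiring that receives --C and --max_iter from HyperDrive is omitted, and the column name DEATH_EVENT follows the dataset description above):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train(df: pd.DataFrame, C: float = 1.0, max_iter: int = 100) -> float:
    """Fit a logistic regression on the 12 features and return test accuracy.

    DEATH_EVENT is the label column; the remaining columns are features.
    In train.py, C and max_iter arrive via argparse from HyperDrive.
    """
    y = df["DEATH_EVENT"]
    x = df.drop(columns=["DEATH_EVENT"])

    # 25% of the data is held out for evaluation, as described above.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    model = LogisticRegression(C=C, max_iter=max_iter)
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)
```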

Results

(Screenshot: details from the Jupyter Notebook)

(Screenshot: details from Azure Experiments)

(Screenshot: experiment completed)

The result in detail

['--C', '97.2861169940756', '--max_iter', '125']
['azureml-logs/55_azureml-execution-tvmps_b3d8a370fdab6acc496b1fa398220948b9ae8dd605d8df21bbd0582f1cc744bc_d.txt', 'azureml-logs/65_job_prep-tvmps_b3d8a370fdab6acc496b1fa398220948b9ae8dd605d8df21bbd0582f1cc744bc_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_b3d8a370fdab6acc496b1fa398220948b9ae8dd605d8df21bbd0582f1cc744bc_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/106_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/model.joblib']
Best Run Accuracy: 0.84
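These values can be retrieved from the completed HyperDrive run roughly as follows (a sketch; `hyperdrive_run` is the assumed name of the submitted run, "Accuracy" the assumed logged metric, and the model name is illustrative):

```python
# Pick the best child run by the primary metric.
best_run = hyperdrive_run.get_best_run_by_primary_metric()

# Hyperparameters of the winning run and its accuracy.
print(best_run.get_details()["runDefinition"]["arguments"])
print("Best Run Accuracy:", best_run.get_metrics()["Accuracy"])

# Register the trained model stored by train.py under outputs/.
best_run.register_model(model_name="hyperdrive-heart-failure",
                        model_path="outputs/model.joblib")
```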

Automated ML

The Automated ML run was created using an instance of AutoMLConfig. The AutoMLConfig class is the way to configure an automated machine learning experiment with the AutoML SDK. The following parameters were used for the AutoML run.

| Parameter | Value | Description |
| --- | --- | --- |
| task | 'classification' | Classification is selected since we are performing binary classification, i.e. whether or not a death event occurs |
| debug_log | 'automl_errors.log' | The debug information is written to this file instead of the automl.log file |
| training_data | train_data | train_data is passed, which contains the data to be used for training |
| label_column_name | 'DEATH_EVENT' | Since the DEATH_EVENT column contains what we need to predict, it is passed |
| compute_target | compute_cluster | The compute target on which we want this AutoML experiment to run |
| experiment_timeout_minutes | 30 | Specifies the time that all iterations combined can take. Due to the lack of resources this is set to 30 |
| primary_metric | 'accuracy' | The metric that AutoML will optimise for model selection. Accuracy is selected as it is well suited to binary classification problems |
| enable_early_stopping | True | Early stopping is enabled to terminate a run if the score is not improving in the short term. This allows AutoML to explore better models in less time |
| featurization | 'auto' | Featurization is set to auto so that the featurization step is done automatically |
| n_cross_validations | 4 | Specified so that there are 4 different trainings and each training uses 1/4 of the data for validation |
| verbosity | logging.INFO | Specifies the verbosity level for writing to the log file |
| enable_onnx_compatible_models | True | Export to ONNX format from Azure ML is enabled for later export; more about ONNX can be found HERE |

The parameters in code:

# automl settings 
automl_settings = {
    "enable_early_stopping" : True,
    "experiment_timeout_minutes": 30,
    "n_cross_validations": 4,
    "featurization": "auto",
    "primary_metric": "accuracy",
    "verbosity": logging.INFO
}

# automl config (with onnx compatible modus)
automl_config = AutoMLConfig(
    task="classification",
    debug_log = "automl_errors.log",
    training_data=train_data,
    label_column_name="DEATH_EVENT",
    compute_target=compute_cluster,
    enable_onnx_compatible_models=True,
    **automl_settings
)
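The config is then submitted as an experiment, and the best model is retrieved once the run completes (a sketch; `ws` is the assumed workspace object and the experiment name is illustrative):

```python
from azureml.core import Experiment

# Submit the AutoML run to the workspace and wait for completion.
experiment = Experiment(ws, "automl-heart-failure")  # assumed name
remote_run = experiment.submit(automl_config, show_output=True)
remote_run.wait_for_completion()

# Retrieve the best run and its fitted pipeline.
best_run, fitted_model = remote_run.get_output()
print(best_run.get_metrics()["accuracy"])
```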

Running

(Screenshot: during the search for the best model)

(Screenshot: after finding the best model)

Results

The best result in the Jupyter view:

(Screenshot: best result in the Jupyter view)

and in Azure Experiments:

(Screenshot: best result in Azure Experiments)

Model Deployment

The best model was the "MaxAbsScaler, GradientBoosting" model from the AutoML experiment. We will now deploy and test this model in the next steps.

The first step is the deployment with these parameters:

from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Create inference config
script_file_name = "inference/score.py"
inference_config = InferenceConfig(entry_script=script_file_name)

aciconfig = AciWebservice.deploy_configuration(cpu_cores=2,
                                               memory_gb=4,
                                               tags={"area": "hfData", "type": "automl_classification"},
                                               description="Heart Failure Prediction (Experiment!)")

aci_service_name = "automl-heart-failure-model"
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

# Enable Application Insights
aci_service.update(enable_app_insights=True)
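The entry script inference/score.py referenced above follows the usual init()/run() pattern of Azure ML web services. A minimal, self-contained sketch (the real script loads the registered AutoML model, e.g. via joblib and Model.get_model_path(); here a stub model stands in so that only the shape of the code is shown):

```python
import json

model = None  # populated by init() when the container starts

class _StubModel:
    """Stands in for the deserialised AutoML pipeline in this sketch."""
    def predict(self, rows):
        # Dummy rule for illustration only; not the real model.
        return [1 for _ in rows]

def init():
    # Real score.py would load the registered model instead, e.g.:
    #   model = joblib.load(Model.get_model_path("automl-heart-failure-model"))
    global model
    model = _StubModel()

def run(raw_data):
    """Parse the JSON request body and return predictions as JSON."""
    rows = json.loads(raw_data)["data"]
    result = model.predict(rows)
    return json.dumps({"result": list(result)})

init()
print(run(json.dumps({"data": [{"age": 70.0}, {"age": 65.0}]})))
# → {"result": [1, 1]}
```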

Finally, we have the confirmation in Jupyter:

automl-heart-failure-model
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-05-12 08:44:14+00:00 Creating Container Registry if not exists..
2021-05-12 08:44:25+00:00 Use the existing image.
2021-05-12 08:44:25+00:00 Generating deployment configuration.
2021-05-12 08:44:25+00:00 Submitting deployment to compute.
2021-05-12 08:44:29+00:00 Checking the status of deployment automl-heart-failure-model..
2021-05-12 08:48:15+00:00 Checking the status of inference endpoint automl-heart-failure-model.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy

and in the Web (Endpoints):

(Screenshot: the deployed endpoint in Azure)

After these steps, we now test the model with the following code/parameters:

import time
import requests
import json

# Short waiting time; the process is more stable with it
time.sleep(30)

# URL for the web service, should be similar to:
print ("Scoring URL: "+aci_service.scoring_uri)
scoring_uri = aci_service.scoring_uri


# Two data sets are evaluated, we then receive two results back for this
data = {"data":
        [
          {
            "age": 70.0,
            "anaemia": 1,
            "creatinine_phosphokinase": 4020,
            "diabetes": 1,
            "ejection_fraction": 32,
            "high_blood_pressure": 1,
            "platelets": 234558.23,
            "serum_creatinine": 1.4,
            "serum_sodium": 125,
            "sex": 1,
            "smoking": 0,
            "time": 12
          },
          {
            "age": 65.0,
            "anaemia": 0,
            "creatinine_phosphokinase": 4221,
            "diabetes": 0,
            "ejection_fraction": 22,
            "high_blood_pressure": 0,
            "platelets": 404567.23,
            "serum_creatinine": 1.1,
            "serum_sodium": 115,
            "sex": 0,
            "smoking": 1,
            "time": 7
          },
      ]
    }
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {"Content-Type": "application/json"}
# If authentication is enabled, set the authorization header

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

and the result of this is:

Scoring URL: http://ed926a23-aca1-4a20-980a-71f05596ce2b.southcentralus.azurecontainer.io/score
{"result": [1, 1]}

This shows that the model has been successfully published; otherwise it would not be reachable and we would receive an error message.

Save the best model in ONNX

from azureml.automl.runtime.onnx_convert import OnnxConverter
automl_best_run_onnx, automl_fitted_model_onnx = remote_run.get_output(return_onnx_model=True)
OnnxConverter.save_onnx_model(automl_fitted_model_onnx, './outputs/AutoML.onnx' )

The best result is stored in the ONNX format so that the calculations can be understood by other systems. With ONNX, AI developers can exchange models between different tools and choose the best combination of these tools for them.

Screen Recording

See on YouTube

Standout Suggestions

  • Use a longer experiment runtime
  • Use larger datasets for training
  • Use the ONNX file to bring the model to other edge devices
  • Bring more robustness into the code, to react better to missing data or when releases are delayed in Azure.
