diff --git a/notebooks/ptype_ml.ipynb b/notebooks/ptype_ml.ipynb
index d25f7c8..3db6321 100644
--- a/notebooks/ptype_ml.ipynb
+++ b/notebooks/ptype_ml.ipynb
@@ -46,14 +46,9 @@
 "| [Pyplot tutorial](https://matplotlib.org/stable/tutorials/pyplot.html) | Helpful | Necessary |\n",
 "| [Numpy: the absolute basics for beginners](https://numpy.org/doc/stable/user/absolute_beginners.html) | Great to have | arrays are the language of machine learning |\n",
 "\n",
- "- **Time to learn**: \n",
+ "- **Time to learn**: 45 minutes\n",
 "\n",
- "Under an hour. While it can be easy to get started with the scikit learn syntax, it can take a while to fully understand and learn all of the in's and out's of ML systems. This is designed to just be a very quick introduction. \n",
- "\n",
- "- **System requirements**:\n",
- " - Populate with any system, version, or non-Python software requirements if necessary\n",
- " - Otherwise use the concepts table above and the Imports section below to describe required packages as necessary\n",
- " - If no extra requirements, remove the **System requirements** point altogether"
+ "While it can be easy to get started with the scikit-learn syntax, it can take a while to fully understand all of the ins and outs of ML systems. This is designed to be just a very quick introduction."
 ]
 },
 {
@@ -404,7 +399,9 @@
 "source": [
 "Notice any trends so far? What input features might be the most important? \n",
 "\n",
- "Next we can plot the Correlation Matrix. As the name suggests, this will show us the correlation between variables. The closer the absolute value is to 1, the stronger the relationship between these variables is. Notice how all of our diagonal values equal to 1? this is because they represent the correlation between a variable and itself. Can you see which other variables have strong correlations?"
+ "Next we can plot the Correlation Matrix. As the name suggests, this will show us the correlation between variables. The closer the absolute value is to 1, the stronger the relationship between the two variables. Notice how all of our diagonal values equal 1? This is because they represent the correlation between a variable and itself. Can you see which other variables have strong correlations?\n",
+ "\n",
+ "For further reading, visit [Correlation Matrix, Demystified](https://towardsdatascience.com/correlation-matrix-demystified-3ae3405c86c1)."
 ]
 },
 {
@@ -593,7 +590,7 @@
 "id": "42bbaf4e-bf83-4eef-8e0a-b1a164a532ad",
 "metadata": {},
 "source": [
- "This is a simple problem, we can choose logistic regression or support vector machine as our classification model."
+ "We will use a linear regression model:"
 ]
 },
 {
@@ -1051,7 +1048,7 @@
 "id": "dbe5b7ab-6f93-4721-8d0d-9ef075276a0d",
 "metadata": {},
 "source": [
- "Next step, let's use the testing data"
+ "Next, let's use the testing data and plot the new predictions against the true values."
 ]
 },
 {
@@ -1142,7 +1139,13 @@
 "id": "5131ad3f",
 "metadata": {},
 "source": [
- "R-squared (R²) and Root Mean Squared Error (RMSE) are both metrics used to evaluate the performance of regression models, but they convey different types of information. R², also known as the coefficient of determination, measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1, where a higher value indicates a better fit of the model to the data, with 1 representing a perfect fit. RMSE, on the other hand, quantifies the average magnitude of the prediction errors, providing an absolute measure of fit in the same units as the dependent variable. It calculates the square root of the average squared differences between predicted and observed values, with a lower RMSE indicating a model that predicts more accurately. While R² gives a sense of how well the model explains the variability of the data, RMSE provides a direct measure of the model’s prediction accuracy. This [blog post](https://www.unidata.ucar.edu/blogs/news/entry/r-sup-2-sup-downsides) covers some of the downsides to looking at R2 alone."
+ "R-squared (R²) and Root Mean Squared Error (RMSE) are both metrics used to evaluate the performance of regression models, but they convey different types of information. \n",
+ "\n",
+ "R², also known as the coefficient of determination, measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1, where a higher value indicates a better fit of the model to the data, with 1 representing a perfect fit. \n",
+ "\n",
+ "RMSE, on the other hand, quantifies the average magnitude of the prediction errors, providing an absolute measure of fit in the same units as the dependent variable. It calculates the square root of the average squared differences between predicted and observed values, with a lower RMSE indicating a model that predicts more accurately. While R² gives a sense of how well the model explains the variability of the data, RMSE provides a direct measure of the model’s prediction accuracy. \n",
+ "\n",
+ "This [blog post](https://www.unidata.ucar.edu/blogs/news/entry/r-sup-2-sup-downsides) covers some of the downsides to looking at R² alone."
 ]
 },
 {
@@ -1201,7 +1204,7 @@
 "id": "5a4c963b-c722-4b5c-9a72-8c76d71c8636",
 "metadata": {},
 "source": [
- "Let's look at another dataset. This dataset just has snow and freezing rain as the p-types, so overall it will be colder. Let's see if we "
+ "Let's look at another dataset. This dataset just has snow and freezing rain as the p-types, so overall it will be colder. Let's see if we get similar results."
 ]
 },
 {
@@ -2100,10 +2103,15 @@
 "metadata": {},
 "source": [
 "## Resources and references\n",
+ "1. [Scikit-learn](https://scikit-learn.org/stable/)\n",
+ "1. [Correlation Matrix, Demystified](https://towardsdatascience.com/correlation-matrix-demystified-3ae3405c86c1)\n",
 "1. [What is the Difference Between Test and Validation Datasets?](https://machinelearningmastery.com/difference-test-validation-datasets/)\n",
 "1. [Machine Learning Foundations in the Earth Systems Sciences](https://elearning.unidata.ucar.edu/dataeLearning/Cybertraining/foundations/#/)\n",
 "1. [Scikit-learn's StandardScaler Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)\n",
- "1. [What and why behind fit_transform() and transform() in scikit-learn!](https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe)"
+ "1. [What and why behind fit_transform() and transform() in scikit-learn!](https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe)\n",
+ "1. [R2: Downsides and Potential Pitfalls for ESS ML Prediction](https://www.unidata.ucar.edu/blogs/news/entry/r-sup-2-sup-downsides)\n",
+ "1. [Scikit-learn's Decision Trees](https://scikit-learn.org/stable/modules/tree.html)\n",
+ "1. [StatQuest video: Decision and Classification Trees, Clearly Explained!!!](https://www.youtube.com/watch?v=_L39rN6gz7Y)"
 ]
 }
 ],
@@ -2123,7 +2132,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.11.8"
+ "version": "3.11.6"
 }
 },
 "nbformat": 4,
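
Supplementary illustration, not part of the notebook diff above: the markdown cells added in these hunks describe fitting a linear regression model and judging it with R² and RMSE. The short scikit-learn sketch below shows that evaluation step under assumed conditions; the data are synthetic stand-ins for the notebook's precipitation-type observations, and the names (X_train, y_test, model, and so on) are illustrative only.

# Minimal sketch of the R^2 / RMSE evaluation described in the new markdown cells.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (the real notebook uses p-type observations).
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)                       # proportion of variance explained
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # error in the units of the target
print(f"R^2: {r2:.3f}  RMSE: {rmse:.3f}")

A higher R² and a lower RMSE both point to a better fit, though, as the linked Unidata post cautions, R² on its own can be misleading.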