{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "oZpJHlWqnvBf" }, "source": [ "# Predictive Modeling for Beginners" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GCNORrN0nvCB" }, "source": [ "In this tutorial, we will walk through the basics of predictive modeling and provide example code along the way." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iy0E_YLKnvCD" }, "source": [ "### Table of Contents\n", "\n", "1. What is predictive modeling?\n", "2. The Learning Machine\n", "3. Training / Validation / Test Split\n", "4. The Predict Function\n", "5. The Correct Classification Rate (CCR) Function" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "3z-utcBKnvCE" }, "source": [ "### 1. Predictive Modeling" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SvZUESjAnvCF" }, "source": [ "In the past 15 years, hospitals have begun warehousing patient data, and large medical datasets are now available. The outcome (e.g., diagnosis or death) for each patient is recorded. Other information about the patient that may be available includes demographic information, the medications prescribed to the patient, various physiological measurements such as core body temperature and heart rate, and the procedures ordered for the patient.\n", "\n", "We can use predictive modeling to build *models* that predict the outcomes for new patients by generalizing from existing data. This works because patients with similar demographics, physiological measurements, and medical histories are likely to have similar outcomes." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iVJnXRo9nvCG" }, "source": [ "A model takes a number of *predictor variables* as input and uses them to predict the *outcome variable*.\n", "\n", "Predictive modeling is a powerful tool that can be applied to a vast number of sectors. 
For example, businesses can use predictive modeling to analyze customer behavior data and predict how customers will react to a price change. Predictive models can be used to forecast stock prices from past stock market data. Political scientists use predictive modeling to forecast the outcome of an upcoming election from the current political and economic situation, using datasets that contain past electoral outcomes and the country's political and economic history.\n", "\n", "Predictive models are almost never 100% accurate: they merely generalize patterns in the dataset to predict the most likely outcome for new situations." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "c8TSsNP3nvCH" }, "source": [ "#### Overview \n", "\n", "First, we'll discuss how to use what we call a *Learning Machine* to build a model. We will focus on predicting patient outcomes in the Intensive Care Unit (ICU). We'll show how to generate three important subsets of our dataset: the *training set*, the *validation set*, and the *test set*. We'll then see how our `predict` function can predict the outcomes for new situations -- we will predict whether patients that the Learning Machine hasn't seen before will survive in the ICU. We'll then assess the model's performance by computing the *correct classification rate* (CCR) of the model." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "I7GMs24KnvCJ" }, "source": [ "### 2. The Learning Machine" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "DCXkmdaunvCK" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bLU7kzbFnvCK" }, "source": [ "We will perform predictive modeling using the Learning Machine (LM). The Learning Machine is an abstraction. We introduce it here to simplify the presentation. 
Think of the LM as a function that takes in a dataset (which contains predictor variables as well as the outcomes for historical data) and produces a model that can predict the outcomes for new situations. The field of Machine Learning deals with building better LMs. Logistic Regression and Neural Networks are examples of LMs.\n", "\n", "Data is fed into the LM. The LM learns patterns from the data and spits out a model that the LM deems to be the best way of capturing the relationships in the data. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "n2SsNuvKnvCL" }, "source": [ "Let's focus on predicting in-hospital patient deaths in the ICU. The data was collected in the ICU of the Beth Israel Deaconess Medical Center in Boston. We provide the data in `ICU_data.csv`. The dataset contains demographic and physiological variables that were collected for each patient during their stay in the ICU. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "aCwop8t7nvCM" }, "source": [ "For more information on the dataset, check [this webpage](https://physionet.org/content/challenge-2012/1.0.0/). " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "l8jlA6idnvCM" }, "source": [ "Let's now get to the code. You will need the following libraries. 
We recommend installing the [Anaconda](https://www.anaconda.com/distribution/) Python distribution, which comes with all the packages that you need.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "ctCzCEVtnvCO" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt \n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn import preprocessing\n", "import learningmachine as lm\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OzQQbOVynvCR" }, "source": [ "As part of the set-up, we fix NumPy's random seed so that the results are reproducible from run to run." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "lNoENqVonvCS" }, "outputs": [], "source": [ "np.random.seed(0)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_zHZOE3LnvCV" }, "source": [ "We will read `ICU_data.csv` into Python (note that you can open `ICU_data.csv` in spreadsheet software such as Excel, or even with a text editor).\n", "\n", "We will use the `Pandas` package to read in the data. However, we will mostly not use `Pandas` in this tutorial: we will convert the data to a list-of-lists format almost right away." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 375 }, "colab_type": "code", "id": "8WgyT4DknvCW", "outputId": "65ef4772-6154-4454-b403-e70598f1c905" }, "outputs": [], "source": [ "df = pd.read_csv(\"ICU_data.csv\", index_col=0)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-O4CH3f_nvCa" }, "source": [ "We can display the first several rows of the dataset as follows. 
(Open the dataset in Excel or a text editor to confirm that the data was read in correctly.)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": {}, "colab_type": "code", "id": "BjgCZb8KnvCa", "outputId": "c787631d-69f6-45c6-b551-496e6fd58896" }, "outputs": [ { "data": { "text/html": [ "
\n", "| | Age | DiasABP | FiO2 | GCS | Gender | HR | Height | K | MAP | NIDiasABP | ... | SysABP | Temp | Urine | Weight | pH | SAPS-I | SOFA | Length_of_stay | Survival | In-hospital_death |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 1 | 42.0 | 59.543420 | 0.792857 | 15.000000 | M | 118.239130 | 169.787227 | 3.625000 | 81.055075 | 79.000000 | ... | 118.591225 | 37.658333 | 89.729730 | 138.100000 | 7.43250 | 10.00000 | 5.0 | 7.0 | 351.063772 | 0.0 |\n",
"| 5 | 77.0 | 46.101449 | 0.454545 | 5.956522 | M | 79.269231 | 169.787227 | 4.333333 | 72.217391 | 55.296296 | ... | 132.913043 | 36.466667 | 134.017937 | 111.592208 | 7.46875 | 14.96168 | 9.0 | 11.0 | 10.000000 | 1.0 |\n",
"| 6 | 60.0 | 57.753623 | 0.549199 | 13.333333 | M | 91.652174 | 175.300000 | 4.050000 | 77.985507 | 58.260870 | ... | 99.014493 | 37.634694 | 290.625000 | 85.772059 | 7.39800 | 12.00000 | 0.0 | 6.0 | 351.063772 | 0.0 |\n",
"| 8 | 83.0 | 60.528090 | 0.496429 | 11.533333 | F | 79.078652 | 169.787227 | 4.400000 | 86.505618 | 69.666667 | ... | 132.280899 | 36.716667 | 113.829787 | 63.000000 | 7.33750 | 18.00000 | 7.0 | 10.0 | 88.000000 | 0.0 |\n",
"| 16 | 59.0 | 76.250000 | 0.607143 | 9.454545 | M | 63.200000 | 169.787227 | 4.150000 | 106.093023 | 61.833333 | ... | 150.045455 | 37.066667 | 91.714286 | 91.000000 | 7.34800 | 16.00000 | 9.0 | 6.0 | 351.063772 | 0.0 |\n",
"5 rows × 25 columns
\n", "\n", "| | Age | DiasABP | FiO2 | GCS | Gender | HR | Height | K | MAP | NIDiasABP | ... | SysABP | Temp | Urine | Weight | pH | SAPS-I | SOFA | Length_of_stay | Survival | In-hospital_death |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 1 | 42.0 | 59.543420 | 0.792857 | 15.000000 | 1 | 118.239130 | 169.787227 | 3.625000 | 81.055075 | 79.000000 | ... | 118.591225 | 37.658333 | 89.729730 | 138.100000 | 7.43250 | 10.00000 | 5.0 | 7.0 | 351.063772 | 0.0 |\n",
"| 5 | 77.0 | 46.101449 | 0.454545 | 5.956522 | 1 | 79.269231 | 169.787227 | 4.333333 | 72.217391 | 55.296296 | ... | 132.913043 | 36.466667 | 134.017937 | 111.592208 | 7.46875 | 14.96168 | 9.0 | 11.0 | 10.000000 | 1.0 |\n",
"| 6 | 60.0 | 57.753623 | 0.549199 | 13.333333 | 1 | 91.652174 | 175.300000 | 4.050000 | 77.985507 | 58.260870 | ... | 99.014493 | 37.634694 | 290.625000 | 85.772059 | 7.39800 | 12.00000 | 0.0 | 6.0 | 351.063772 | 0.0 |\n",
"| 8 | 83.0 | 60.528090 | 0.496429 | 11.533333 | 0 | 79.078652 | 169.787227 | 4.400000 | 86.505618 | 69.666667 | ... | 132.280899 | 36.716667 | 113.829787 | 63.000000 | 7.33750 | 18.00000 | 7.0 | 10.0 | 88.000000 | 0.0 |\n",
"| 16 | 59.0 | 76.250000 | 0.607143 | 9.454545 | 1 | 63.200000 | 169.787227 | 4.150000 | 106.093023 | 61.833333 | ... | 150.045455 | 37.066667 | 91.714286 | 91.000000 | 7.34800 | 16.00000 | 9.0 | 6.0 | 351.063772 | 0.0 |\n",
"5 rows × 25 columns
\n", "