How to Build a Data Science Web App in Python (Penguin Classifier)

Part 3: ML-Powered Web App in a Little Over 100 Lines of Code

In this Part 3 of the Streamlit tutorial series, I will show you how to build a machine learning-powered data science web app in Python using the Streamlit library in a little over 100 lines of code.

The web app that we will be building today is the Penguins Classifier. A live demo of this web app is available at http://dp-penguins.herokuapp.com/.

Previously, in Part 1 of this Streamlit tutorial series, I showed you how to build your first data science web app in Python, which fetches stock price data from Yahoo! Finance and displays a simple line chart. In Part 2, I showed you how to build a machine learning web app using the Iris dataset.

As explained in previous articles of this Streamlit tutorial series, model deployment is the essential final component of the data science life cycle, bringing the power of data-driven insights into the hands of end users, whether they are business stakeholders, managers or customers.

Data science lifecycle. Image drawn by Chanin Nantasenamat.

This article is based on a video that I made on the same topic on the Data Professor YouTube channel (How to Build a Penguin Classification Web App in Python), which you can watch alongside reading this article.

Overview of the Penguin Classification Web App

In this article, we will be building a Penguins Classifier web app that predicts the penguin species (Adelie, Chinstrap or Gentoo) as a function of 4 quantitative variables and 2 qualitative variables.

Artwork by @allison_horst

Penguins dataset

The data used in this machine learning-powered web app is the Palmer Penguins dataset, released as an R package by Allison Horst. Specifically, the data is derived from the published work of Dr. Kristen Gorman and colleagues entitled Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis).

The dataset comprises 4 quantitative variables:

  • Bill length (mm)

  • Bill depth (mm)

  • Flipper length (mm)

  • Body mass (g)

And 2 qualitative variables:

  • Sex (male/female)

  • Island (Biscoe/Dream/Torgersen)

Let’s take a look at the Penguins dataset (the truncated version below shows only the first 3 rows for each of the 3 penguin species):

(Note: The full version of the Penguins dataset is available on the Data Professor GitHub)
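If you would like to reproduce this truncated view yourself, here is a short snippet (a sketch, assuming pandas is installed) that loads the full dataset from the Data Professor GitHub and prints the first 3 rows of each species:

import pandas as pd

# Load the cleaned Palmer Penguins dataset from the Data Professor GitHub
penguins = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv')

# Print the first 3 rows for each of the 3 penguin species
print(penguins.groupby('species').head(3))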

Components of the Penguins Classifier web app

The Penguins Classifier web app comprises a Front-end and a Back-end:

Front-end — This is what we see upon loading the web app. The front-end can be further broken down into the Side Panel and the Main Panel. A screenshot of the web app is shown below.

The Side Panel is found on the left and is labeled with the header title “User Input Features”. Here, the user can either upload a CSV file containing the input features (2 qualitative and 4 quantitative variables) or specify them manually: the 4 quantitative variables are entered by adjusting the slider bars, while the 2 qualitative variables are selected via the drop-down menus.

These user input features serve as input to the machine learning model discussed in the back-end. Once a prediction is made, the resulting class label (the penguin species) along with the prediction probability values are sent back to the front-end for display on the Main Panel.
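To give a flavor of the Streamlit widget calls behind these input features (the full code is walked through later in this article), here is a minimal sketch using the same labels and value ranges as the app:

import streamlit as st

# Drop-down menu for a qualitative variable
island = st.sidebar.selectbox('Island', ('Biscoe', 'Dream', 'Torgersen'))

# Slider bar for a quantitative variable: (label, min, max, default)
bill_length_mm = st.sidebar.slider('Bill length (mm)', 32.1, 59.6, 43.9)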

Back-end — The user input features will be converted into a dataframe and sent to the machine learning model for predictions to be made. Herein, we will be using a pre-trained model that was previously saved as a pickle file, penguins_clf.pkl, so that it can be quickly loaded by the web app (without having to rebuild the machine learning model each time the web app is loaded by the user).

Screenshot of the Penguins Classifier web app.

Install prerequisite libraries

For this tutorial, we will be using 5 Python libraries: streamlit, pandas, numpy, scikit-learn and pickle. The first 4 will have to be installed if they are not already present on your computer, while the last one (pickle) comes built into Python.

To install the libraries, you can easily do this via the pip install command as follows:

pip install streamlit

Then, repeat the above command, replacing streamlit with the name of each remaining library, e.g. pip install pandas, and so forth.

Or, you can install them all at once with this one-liner:

pip install streamlit pandas numpy scikit-learn
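Alternatively, a common convention (not required for this tutorial, but handy later when deploying the app) is to list the dependencies in a requirements.txt file:

streamlit
pandas
numpy
scikit-learn

You can then install them all with pip install -r requirements.txt.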

Code of the web app

Now, let’s look under the hood of the web app. You will see that the web app is made up of 2 files: penguins-model-building.py and penguins-app.py.

The first file (penguins-model-building.py) is used to build the machine learning model, which is then saved as a pickle file, penguins_clf.pkl.

Subsequently, the second file (penguins-app.py) will apply the trained model (penguins_clf.pkl) to predict the class label (the penguin species: Adelie, Chinstrap or Gentoo) using input parameters from the sidebar panel of the web app’s front-end.

Line-by-line explanation of the code

penguins-model-building.py

Let’s start with the explanation of this first file, which allows us to pre-build a trained machine learning model before running the web app. Why are we doing this? To save computational resources in the long run: we build the model once and then apply it to make predictions (at least until we re-train the model) on the user input parameters entered via the sidebar panel of the web app.

import pandas as pd
penguins = pd.read_csv('penguins_cleaned.csv')

# Ordinal feature encoding
# https://www.kaggle.com/pratik1120/penguin-dataset-eda-classification-and-clustering
df = penguins.copy()
target = 'species'
encode = ['sex','island']

for col in encode:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df,dummy], axis=1)
    del df[col]

target_mapper = {'Adelie':0, 'Chinstrap':1, 'Gentoo':2}
def target_encode(val):
    return target_mapper[val]

df[target] = df[target].apply(target_encode)

# Separating X and Y
X = df.drop(target, axis=1)
Y = df[target]

# Build random forest model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, Y)

# Saving the model
import pickle
pickle.dump(clf, open('penguins_clf.pkl', 'wb'))
  • Line 1
    Imports the pandas library with the alias pd.

  • Line 2
    Reads the cleaned penguins dataset from a CSV file and assigns it to the penguins variable.

  • Lines 4–19
    Performs ordinal feature encoding on the 3 qualitative variables: the target Y variable (species) and the 2 X variables (sex and island). A toy example of the encoding is given after the before/after images below.

Before performing ordinal feature encoding.

After performing ordinal feature encoding.
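To make the encoding concrete, here is a small toy example (illustrative only, not part of the web app) of what pd.get_dummies() does to a qualitative column:

import pandas as pd

# A toy dataframe with a single qualitative column
toy = pd.DataFrame({'island': ['Biscoe', 'Dream', 'Torgersen']})

# get_dummies creates one binary column per category:
# island_Biscoe, island_Dream, island_Torgersen
print(pd.get_dummies(toy['island'], prefix='island'))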

  • Lines 21–23
    Separates the df dataframe into the X and Y matrices.

  • Lines 25–28
    Trains a random forest model.

  • Lines 30–32
    Saves the trained random forest model to a pickled file called penguins_clf.pkl.
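As an optional sanity check (a sketch you could append to the end of penguins-model-building.py, assuming X is still in scope), you can reload the pickled model and confirm that it makes predictions:

import pickle

# Reload the saved model from disk
with open('penguins_clf.pkl', 'rb') as f:
    clf_loaded = pickle.load(f)

# Smoke test: predict on the first training row
print(clf_loaded.predict(X[:1]))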

penguins-app.py

This second file serves the web app that allows predictions to be made using the machine learning model loaded from the pickled file. As mentioned above, the web app accepts input values from 2 sources:

  1. Feature values from the slider bars.

  2. Feature values from the uploaded CSV file.

import streamlit as st
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier

st.write("""
# Penguin Prediction App
This app predicts the **Palmer Penguin** species!
Data obtained from the [palmerpenguins library](https://github.com/allisonhorst/palmerpenguins) in R by Allison Horst.
""")

st.sidebar.header('User Input Features')

st.sidebar.markdown("""
[Example CSV input file](https://raw.githubusercontent.com/dataprofessor/data/master/penguins_example.csv)
""")

# Collects user input features into dataframe
uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
if uploaded_file is not None:
    input_df = pd.read_csv(uploaded_file)
else:
    def user_input_features():
        island = st.sidebar.selectbox('Island',('Biscoe','Dream','Torgersen'))
        sex = st.sidebar.selectbox('Sex',('male','female'))
        bill_length_mm = st.sidebar.slider('Bill length (mm)', 32.1,59.6,43.9)
        bill_depth_mm = st.sidebar.slider('Bill depth (mm)', 13.1,21.5,17.2)
        flipper_length_mm = st.sidebar.slider('Flipper length (mm)', 172.0,231.0,201.0)
        body_mass_g = st.sidebar.slider('Body mass (g)', 2700.0,6300.0,4207.0)
        data = {'island': island,
                'bill_length_mm': bill_length_mm,
                'bill_depth_mm': bill_depth_mm,
                'flipper_length_mm': flipper_length_mm,
                'body_mass_g': body_mass_g,
                'sex': sex}
        features = pd.DataFrame(data, index=[0])
        return features
    input_df = user_input_features()

# Combines user input features with entire penguins dataset
# This will be useful for the encoding phase
penguins_raw = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv')
penguins = penguins_raw.drop(columns=['species'], axis=1)
df = pd.concat([input_df,penguins],axis=0)

# Encoding of ordinal features
# https://www.kaggle.com/pratik1120/penguin-dataset-eda-classification-and-clustering
encode = ['sex','island']
for col in encode:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df,dummy], axis=1)
    del df[col]
df = df[:1] # Selects only the first row (the user input data)

# Displays the user input features
st.subheader('User Input features')

if uploaded_file is not None:
    st.write(df)
else:
    st.write('Awaiting CSV file to be uploaded. Currently using example input parameters (shown below).')
    st.write(df)

# Reads in saved classification model
load_clf = pickle.load(open('penguins_clf.pkl', 'rb'))

# Apply model to make predictions
prediction = load_clf.predict(df)
prediction_proba = load_clf.predict_proba(df)

st.subheader('Prediction')
penguins_species = np.array(['Adelie','Chinstrap','Gentoo'])
st.write(penguins_species[prediction])

st.subheader('Prediction Probability')
st.write(prediction_proba)
  • Lines 1–5
    Imports the streamlit, pandas and numpy libraries with the aliases st, pd and np, respectively. Next, imports the pickle library and finally the RandomForestClassifier class from sklearn.ensemble.

  • Lines 7–13
    Writes the web app title and intro text.

Sidebar Panel

  • Line 15
    Header title of the sidebar panel.

  • Lines 17–19
    Link to download an example CSV file.

  • Lines 21–41
    Collects the feature values and puts them into a dataframe. We use if and else conditional statements to determine whether the user has uploaded a CSV file (if so, read the CSV file and convert it into a dataframe) or has entered feature values via the slider bars and drop-down menus (these values are likewise converted into a dataframe).

  • Lines 43–47
    Combines the user input features (either from the CSV file or from the slider bars) with the entire penguins dataset. The reason for doing this is to ensure that every variable contains the maximal number of possible values. For instance, if the user input contains data for only 1 penguin, the ordinal feature encoding will not work, because the code will detect only 1 possible value for each qualitative variable. For the encoding to work, each qualitative variable needs to contain all of its possible values, as the two situations below (and the short sketch that follows them) illustrate.

Situation A

In this first scenario, the qualitative variable island has only 1 possible value, which is Biscoe.

island
Biscoe

The above input feature will produce the following ordinal features after encoding.

island_Biscoe
1

Situation B

In this second scenario, the qualitative variable island has all 3 possible values:

island
Biscoe
Dream
Torgersen

The above input features will produce the following ordinal features after encoding.

island_Biscoe   island_Dream   island_Torgersen
1               0              0
0               1              0
0               0              1
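The following short sketch (illustrative only) shows this behavior in code:

import pandas as pd

# Situation A: a one-row user input produces only 1 dummy column
one_row = pd.DataFrame({'island': ['Biscoe']})
print(pd.get_dummies(one_row['island'], prefix='island').columns.tolist())
# ['island_Biscoe'] -- island_Dream and island_Torgersen are missing,
# so the columns no longer match those the model was trained on

# Situation B: concatenating with rows covering all islands fixes this
full = pd.DataFrame({'island': ['Biscoe', 'Dream', 'Torgersen']})
combined = pd.concat([one_row, full], axis=0)
print(pd.get_dummies(combined['island'], prefix='island').columns.tolist())
# ['island_Biscoe', 'island_Dream', 'island_Torgersen']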
  • Lines 49–56
    Performs ordinal feature encoding in the same fashion as explained above for the model-building phase (penguins-model-building.py).

  • Lines 58–65
    Displays the dataframe of the user input features. Conditional statements allow the code to automatically determine whether to display the dataframe built from the CSV file or from the slider bars.

  • Lines 67–68
    Loads the predictive model from the pickled file, penguins_clf.pkl.

  • Lines 70–72
    Applies the loaded model to make predictions on the df variable, which corresponds to input from the CSV file or from the slider bars.

  • Lines 74–76
    The predicted class label (the penguin species) is displayed here.

  • Lines 78–79
    Prediction probability values for each of the 3 penguin species are shown here.
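As an optional cosmetic tweak (a sketch, not part of the original app; it assumes the pd, st and prediction_proba names from penguins-app.py are in scope), you could label the probability columns with the species names, since predict_proba returns the columns in class order 0, 1, 2 (Adelie, Chinstrap, Gentoo):

# Optional: wrap the raw probabilities in a labeled dataframe
proba_df = pd.DataFrame(prediction_proba, columns=['Adelie', 'Chinstrap', 'Gentoo'])
st.write(proba_df)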

Running the web app

Now that we have finished coding the web app, let’s launch it. First, fire up your command prompt (terminal window) and type the following command:

streamlit run penguins-app.py

The following message should then be displayed in the command prompt:

> streamlit run penguins-app.py

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://10.0.0.11:8501

A screenshot of the penguins classifier web app is shown below:

Deployed penguins classifier web app on Heroku.

Deploying and showcasing the web app

Great job! You have now created a machine learning-powered web app. Let’s deploy the web app to the internet so that you can share it with your friends and family. The next part of this Streamlit tutorial series will go in-depth into how you can deploy the web app on Heroku. In the meantime, check out the following video:

And while you’re at it, you can include this in your data science portfolio.