How to Build an EDA App in Python

Step-by-step tutorial from scratch in 50 lines of code

Exploratory data analysis (EDA) is an essential first step for exploring and summarizing the main characteristics of a dataset. EDA helps us better understand the variables and their relationships. This is achieved through non-graphical (descriptive statistics) and graphical (data visualization) techniques.

In this article, we will create an EDA web app that you can use to speed up your EDA or to let your colleagues perform EDA without having to code in Python. Uploading an input CSV file is all it takes to run the analysis.

You can also refer to my YouTube video, How to build an Exploratory Data Analysis app using Pandas Profiling, on the Data Professor channel as a supplement to this article.

1. The EDA tool (Pandas Profiling)

A simple and intuitive EDA tool is the pandas-profiling library in Python. It generates a profile report for a DataFrame df via the df.profile_report() method.
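As a minimal sketch of how a report might be generated outside of any web app, the snippet below profiles a small random DataFrame and saves the result as an HTML file. It assumes pandas-profiling is installed under the import name pandas_profiling (the same name used in the app's imports later in this article). The helper names make_example_frame and write_profile_report, the report title and the output file name are illustrative choices, not part of the library.

```python
import numpy as np
import pandas as pd


def make_example_frame(rows=100, seed=0):
    """Build a small random DataFrame to profile (illustrative data only)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.random((rows, 5)), columns=["a", "b", "c", "d", "e"])


def write_profile_report(df, output_path="report.html"):
    """Profile df and save the interactive HTML report to output_path."""
    # Imported here so the rest of the sketch runs even without the library.
    from pandas_profiling import ProfileReport

    report = ProfileReport(df, title="Example profile report")
    report.to_file(output_path)
    return output_path


example_df = make_example_frame()
print(example_df.shape)
# write_profile_report(example_df)  # uncomment once pandas-profiling is installed
```

Opening the saved HTML file in a browser should show the same kind of report that the app displays further below.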

The generated profile is an interactive HTML report that displays the following statistics for each variable (please see the documentation for more details):

  • Type inference: Identifies the data types of columns in a dataframe.

  • Essentials: Type, unique values and missing values.

  • Quantile statistics: Minimum, maximum, Q1, median, Q3, range and interquartile range.

  • Descriptive statistics: Mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis and skewness.

  • Most frequent values

  • Histograms: Distribution of value ranges.

  • Correlations: Highlighting highly correlated variables, Spearman, Pearson and Kendall matrices.

  • Missing values: Matrix, count, heatmap and dendrogram of missing values.

  • Duplicate rows: Lists the most occurring duplicate rows.

  • Text analysis: Learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
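To make a few of these statistics concrete, here is a rough sketch that computes some of them by hand with plain pandas on a tiny made-up table. This is not how pandas-profiling works internally; the data, column names and variable names are invented purely for illustration.

```python
import pandas as pd

# Tiny made-up dataset: one duplicated row and one missing value.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0, 8.0],
        "label": ["a", "b", None, "c", "c"],
    }
)

# Quantile statistics for one numeric column.
q1, median, q3 = df["x"].quantile([0.25, 0.5, 0.75])
summary = {
    "min": df["x"].min(),
    "max": df["x"].max(),
    "range": df["x"].max() - df["x"].min(),
    "iqr": q3 - q1,
}

# Missing values per column and the number of repeated rows.
missing_counts = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# Pearson correlation matrix of the numeric columns; the method argument
# also accepts "spearman" and "kendall".
pearson = df[["x", "y"]].corr(method="pearson")

print(summary)
print(missing_counts)
print(duplicate_rows)
print(pearson)
```

The profile report gathers these kinds of numbers for every column at once and adds the plots on top.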

2. About the EDA App that we are building today

The EDA app that we are building today will allow us to upload our own CSV dataset, from which it will automatically generate a profile report (an interactive HTML report that facilitates EDA).

You can try it out for yourself here: EDA App demo

3. Code of the EDA App

The code spans a little over 50 lines, 52 to be exact.

Let’s take a look at the tech behind the app. It is based on 5 Python libraries consisting of the following:

  • numpy — used for generating a synthetic dataset consisting of 5 columns (variables) that serves as an example dataset (notice the Press to use Example Dataset button in the main panel under the web app’s page title).

  • pandas — used for housing the data in the form of a DataFrame.

  • streamlit — the web framework that makes it possible to build a simple web app.

  • pandas_profiling — the library that makes it possible to generate the interactive HTML report for performing EDA.

  • streamlit_pandas_profiling — the library that allows the pandas_profiling library to be used in a Streamlit web app.

import numpy as np
import pandas as pd
import streamlit as st
from pandas_profiling import ProfileReport
from streamlit_pandas_profiling import st_profile_report

# Web App Title
st.markdown('''
# **The EDA App**
This is the **EDA App** created in Streamlit using the **pandas-profiling** library.
**Credit:** App built in `Python` + `Streamlit` by [Chanin Nantasenamat](https://medium.com/@chanin.nantasenamat) (aka [Data Professor](http://youtube.com/dataprofessor))
---
''')

# Upload CSV data
with st.sidebar.header('1. Upload your CSV data'):
    uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
    st.sidebar.markdown("""
[Example CSV input file](https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv)
""")

# Pandas Profiling Report
if uploaded_file is not None:
    @st.cache
    def load_csv():
        csv = pd.read_csv(uploaded_file)
        return csv
    df = load_csv()
    pr = ProfileReport(df, explorative=True)
    st.header('**Input DataFrame**')
    st.write(df)
    st.write('---')
    st.header('**Pandas Profiling Report**')
    st_profile_report(pr)
else:
    st.info('Awaiting for CSV file to be uploaded.')
    if st.button('Press to use Example Dataset'):
        # Example data
        @st.cache
        def load_data():
            a = pd.DataFrame(
                np.random.rand(100, 5),
                columns=['a', 'b', 'c', 'd', 'e']
            )
            return a
        df = load_data()
        pr = ProfileReport(df, explorative=True)
        st.header('**Input DataFrame**')
        st.write(df)
        st.write('---')
        st.header('**Pandas Profiling Report**')
        st_profile_report(pr)

4. Line-by-Line Explanation of the Code

Lines 1–5

  • Import prerequisite libraries consisting of numpy, pandas, streamlit, pandas_profiling and streamlit_pandas_profiling

Lines 7–13

  • Line 7 — commented text to denote that the subsequent code block pertains to the web app’s title.

  • Lines 8–13 — the st.markdown() function to print the web app’s title. The input arguments are in Markdown syntax.

Lines 15–20

  • Line 15 — Commented text to describe that Lines 16–20 are responsible for the CSV upload functionality.

  • Line 16 — The with statement groups the subsequent lines into one code block. st.sidebar.header() displays the heading text 1. Upload your CSV data in the left panel.

  • Line 17 — The st.sidebar.file_uploader() function creates the upload box. It should be noted that st.sidebar places the upload box in the sidebar (the collapsible left panel). The input argument type=["csv"] found inside the file_uploader() function specifies that the upload box only accepts the csv file type.

  • Line 18 — The st.sidebar.markdown() function displays a link to the example CSV file in the left panel.
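The upload box hands back nothing (None) until a file is chosen, and otherwise a file-like object that pd.read_csv() can read directly. The sketch below imitates that behaviour with io.BytesIO standing in for the uploaded file, so it can be run without Streamlit. The CSV content and the read_uploaded_csv helper are made up for illustration.

```python
import io

import pandas as pd


def read_uploaded_csv(uploaded_file):
    """Return a DataFrame for a file-like upload, or None if nothing was uploaded."""
    if uploaded_file is None:
        return None
    return pd.read_csv(uploaded_file)


# Stand-in for what the upload box would hand back (made-up CSV content).
fake_upload = io.BytesIO(b"name,value\nalpha,1\nbeta,2\n")

df = read_uploaded_csv(fake_upload)
print(df)
print(read_uploaded_csv(None))
```

This is the same None check that the app relies on to decide whether a file has been uploaded.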

Lines 22–52

  • Line 22 — Commented text to describe that the subsequent code blocks are responsible for creating the Pandas Profiling report. It should be noted here that this section consists of an if-else statement whereby the else code block (Lines 35–52) runs by default every time the web app loads. Upon uploading an input CSV file, the if code block (Lines 23–34) is activated and run instead.

  • Lines 23–34 — This code block represents the if statement. As mentioned above in the previous bullet point, this code block will run upon uploading of the input CSV file.
    Line 23 — An if condition is set to detect that a file is uploaded as indicated by the uploaded_file is not None.
    Line 24 — The @st.cache decorator allows the app to skip costly, repeated execution of the function defined in the subsequent lines of code.
    Lines 25–27 — A custom function called load_csv that reads in the input CSV file from the upload box via the uploaded_file variable.
    Line 28 — The contents of the CSV file, as read by the load_csv custom function, are assigned to the df variable.
    Line 29 — The output of the ProfileReport() function is assigned to the pr variable.
    Lines 30–31 — A header for **Input DataFrame** is printed as heading text via the st.header() function (Line 30). The contents of the DataFrame stored in the df variable are printed (Line 31).
    Line 32 — st.write('---') prints a horizontal bar.
    Lines 33–34 — A header for **Pandas Profiling Report** is printed as heading text via the st.header() function (Line 33). The contents of the pr variable, which stores the output of the ProfileReport() function, are rendered via the st_profile_report() function (Line 34). This is the interactive HTML profile report that provides the EDA functionality.

  • Lines 35–52 — This code block represents the else statement.
    Line 35 — An else condition is set as the default state that is run every time the web app loads.
    Line 36 — A text saying Awaiting for CSV file to be uploaded. is printed in an info box via the st.info() function.
    Lines 37–52 — An if condition is used to specify the consequences (statements underneath it will be run) of clicking the Press to use Example Dataset button, which is to use the example dataset to generate the profile report.
    Line 38 — Commented text to indicate that the following lines of code pertain to the example data.
    Line 39 — As described earlier for Line 24, the @st.cache decorator allows the app to skip a potentially costly run of the subsequent lines of code (i.e. otherwise the app would re-generate the synthetic dataset every time the app is reloaded).
    Lines 40–45 — A custom function is defined to create a synthetic dataset consisting of 5 columns using the NumPy library.
    Line 46 — The synthetic dataset as defined by the custom function from Lines 40–45 is assigned to the df variable.
    Lines 47–52 — As in Lines 29–34, the profile report is generated from df and displayed together with the input DataFrame.
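The effect of caching can be illustrated without Streamlit. The sketch below uses functools.lru_cache from the Python standard library as a stand-in for @st.cache. It is only an analogy, since Streamlit's cache has its own rules for when a result is reused, and the call_count counter is added purely to show how often the function body actually runs.

```python
from functools import lru_cache

import numpy as np
import pandas as pd

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def load_data():
    """Create the 100 x 5 synthetic dataset; the body runs only on the first call."""
    global call_count
    call_count += 1
    return pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])


first = load_data()
second = load_data()  # served from the cache, so no new random numbers are drawn

print(call_count)
print(first.equals(second))
```

Without the decorator, the second call would draw a fresh set of random numbers and the two DataFrames would differ.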

5. Running the EDA App

It’s now time to run the app.

5.1. Create the conda environment

Let’s assume that you are starting from scratch, thus you will have to create a new conda environment (which is a good idea in order to ensure reproducibility of your code).

First, create a new conda environment called eda by running the following in a terminal:

conda create -n eda python=3.7.9

Second, activate the eda environment

conda activate eda

5.2. Install prerequisite libraries

Firstly, download the requirements.txt file

wget https://raw.githubusercontent.com/dataprofessor/eda-app/main/requirements.txt

Secondly, install the libraries as shown below

pip install -r requirements.txt
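To confirm that the installation worked, a quick check along the following lines can be run inside the eda environment. This is a sketch using only the Python standard library; the module names are taken from the import statements of the app, and the find_missing_modules helper is a hypothetical name.

```python
import importlib.util

# Import names taken from the app's import statements.
REQUIRED_MODULES = [
    "numpy",
    "pandas",
    "streamlit",
    "pandas_profiling",
    "streamlit_pandas_profiling",
]


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

If any module is reported as missing, re-running the pip command above in the activated environment is the first thing to try.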

5.3. Download the EDA App files

You can either download the web app files hosted on the Data Professor GitHub repo or use the 52 lines of code found above.

wget https://github.com/dataprofessor/eda-app/archive/main.zip

Next, unzip the file contents

unzip main.zip

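Now enter the extracted directory via the cd command (GitHub names the extracted folder after the repository and branch, so it is eda-app-main rather than main)

cd eda-app-main

Now that you’re inside the eda-app-main directory you should be able to see the app.py file.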
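If wget and unzip are not available (for example on Windows), the download and extraction can be approximated with Python's standard library. The following is a sketch under that assumption; the function names are hypothetical and the URL is the one given above. extract_archive also returns the top-level folder that the archive creates, which is the directory to cd into.

```python
import urllib.request
import zipfile
from pathlib import Path

ARCHIVE_URL = "https://github.com/dataprofessor/eda-app/archive/main.zip"


def download_archive(url, zip_path):
    """Download the file at url and save it as zip_path."""
    urllib.request.urlretrieve(url, str(zip_path))
    return Path(zip_path)


def extract_archive(zip_path, dest_dir="."):
    """Extract zip_path into dest_dir and return the sorted top-level names."""
    with zipfile.ZipFile(str(zip_path)) as archive:
        archive.extractall(str(dest_dir))
        top_level = {name.split("/")[0] for name in archive.namelist()}
    return sorted(top_level)


# Typical use (requires network access, so it is left commented out here):
# zip_file = download_archive(ARCHIVE_URL, "main.zip")
# print(extract_archive(zip_file))
```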

5.4. Launching the web app

The app can be launched by typing the following command into a terminal prompt (make sure that the app.py file is in the current working directory):

streamlit run app.py

Afterwards, you will see the following message in the terminal prompt.

> streamlit run app.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://10.0.0.11:8501

Finally, a browser should pop up and the app appears.

Screenshot of the EDA App. To try out the app please refer to the EDA App demo.

Screenshot of the EDA App in action. We can see here the correlation plots. Please refer to the image above in section 2 for an animated example.

Congratulations!

You have now created an EDA app that lets you easily perform EDA on the go. Simply upload an input CSV file to get started.

Deploying the EDA App to the Internet

If you would like to make your web app publicly available to the world, you can deploy it to the internet. I’ve created YouTube videos showing how you can do that on Heroku and Streamlit Sharing.

An example of the EDA App deployed to the internet via the Streamlit Sharing platform can be accessed here: EDA App demo