How to Build an EDA App in Python
Step-by-step tutorial from scratch in 50 lines of code
Exploratory data analysis (EDA) is an essential first step for exploring and summarizing the main characteristics of a dataset. EDA helps us better understand variables and their relationships. This is achieved through non-graphical (descriptive statistics) and graphical (data visualization) techniques.
In this article, we will create an EDA web app that you can use to speed up your EDA or to let your colleagues perform EDA without having to code in Python. Uploading an input CSV file is all it takes.
As a supplement to this article, you can also refer to my video How to build an Exploratory Data Analysis app using Pandas Profiling on my YouTube channel (Data Professor).
1. The EDA tool (Pandas Profiling)
A simple and intuitive EDA tool is the pandas-profiling library in Python. It generates a profile report for a df dataframe via the df.profile_report() function.
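As a minimal sketch of that call, the snippet below builds a small random DataFrame and wraps report generation in a helper. The helper name make_report, the title text and the report.html output path are illustrative assumptions. ProfileReport(df) is the class-based equivalent of df.profile_report(), and pandas-profiling must be installed for the helper to run.

```python
import numpy as np
import pandas as pd

# Synthetic example data (an assumption for illustration): 100 rows, 5 numeric columns.
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])


def make_report(frame, output_path="report.html"):
    """Write an HTML profile report for `frame` and return the output path.

    Hypothetical helper. The import sits inside the function so the rest of
    the snippet still runs where pandas-profiling is not installed.
    """
    from pandas_profiling import ProfileReport

    profile = ProfileReport(frame, title="Example profile", explorative=True)
    profile.to_file(output_path)  # saves the interactive HTML report to disk
    return output_path
```

Calling make_report(df) should write the HTML report next to the script, where it can be opened in a browser.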
The generated profile is an interactive HTML report that displays the following statistics for each variable (please see the documentation for more details):
Type inference: Identifies the data types of columns in a dataframe.
Essentials: Type, unique values and missing values.
Quantile statistics: Minimum, maximum, Q1, median, Q3, range and interquartile range.
Descriptive statistics: Mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis and skewness.
Most frequent values
Histograms: Distribution of value ranges.
Correlations: Highlighting highly correlated variables, Spearman, Pearson and Kendall matrices.
Missing values: Matrix, count, heatmap and dendrogram of missing values.
Duplicate rows: Lists the most occurring duplicate rows.
Text analysis: Learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
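Several of these statistics can also be computed directly with pandas, which can help when checking what the report shows. The following is a minimal sketch, not how pandas-profiling computes them internally. The synthetic data, the injected missing values and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Reproducible synthetic data: 100 rows, 3 numeric columns.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.random((100, 3)), columns=["a", "b", "c"])
df.loc[::10, "c"] = np.nan  # inject a missing value into every 10th row of "c"

col = df["a"]
q1, median, q3 = col.quantile([0.25, 0.5, 0.75])

# Quantile and descriptive statistics for a single column.
summary = {
    "min": col.min(),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": col.max(),
    "range": col.max() - col.min(),
    "iqr": q3 - q1,
    "mean": col.mean(),
    "std": col.std(),
    "cv": col.std() / col.mean(),  # coefficient of variation
    "skewness": col.skew(),
    "kurtosis": col.kurt(),
}

missing_counts = df.isna().sum()           # missing values per column
n_duplicates = int(df.duplicated().sum())  # number of duplicated rows
pearson = df.corr(method="pearson")        # pairwise correlation matrix

print(pd.Series(summary).round(3))
print(missing_counts)
```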
2. About the EDA App that we are building today
The EDA app that we are going to build today will allow us to upload our own CSV dataset, from which the app will automatically generate a profile report (an interactive HTML report that facilitates EDA).
You can try it out for yourself here: EDA App demo
3. Code of the EDA App
The code spans a little over 50 lines, 52 to be exact.
Let’s take a look at the tech behind the app. It is based on 5 Python libraries consisting of the following:
numpy: used for generating a synthetic dataset of 5 columns (variables) that serves as an example dataset (notice the Press to use Example Dataset button in the main panel under the web app’s page title).
pandas: used for housing the data in the form of a DataFrame.
streamlit: the web framework that makes it possible to build a simple web app.
pandas_profiling: the library that makes it possible to generate the interactive HTML report for performing EDA.
streamlit_pandas_profiling: the library that allows the pandas_profiling library to be used in a Streamlit web app.
import numpy as np
import pandas as pd
import streamlit as st
from pandas_profiling import ProfileReport
from streamlit_pandas_profiling import st_profile_report

# Web App Title
st.markdown('''
# **The EDA App**
This is the **EDA App** created in Streamlit using the **pandas-profiling** library.
**Credit:** App built in `Python` + `Streamlit` by [Chanin Nantasenamat](https://medium.com/@chanin.nantasenamat) (aka [Data Professor](http://youtube.com/dataprofessor))
---
''')

# Upload CSV data
with st.sidebar.header('1. Upload your CSV data'):
    uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
    st.sidebar.markdown("""
[Example CSV input file](https://raw.githubusercontent.com/dataprofessor/data/master/delaney_solubility_with_descriptors.csv)
""")

# Pandas Profiling Report
if uploaded_file is not None:
    @st.cache
    def load_csv():
        csv = pd.read_csv(uploaded_file)
        return csv
    df = load_csv()
    pr = ProfileReport(df, explorative=True)
    st.header('**Input DataFrame**')
    st.write(df)
    st.write('---')
    st.header('**Pandas Profiling Report**')
    st_profile_report(pr)
else:
    st.info('Awaiting for CSV file to be uploaded.')
    if st.button('Press to use Example Dataset'):
        # Example data
        @st.cache
        def load_data():
            a = pd.DataFrame(
                np.random.rand(100, 5),
                columns=['a', 'b', 'c', 'd', 'e']
            )
            return a
        df = load_data()
        pr = ProfileReport(df, explorative=True)
        st.header('**Input DataFrame**')
        st.write(df)
        st.write('---')
        st.header('**Pandas Profiling Report**')
        st_profile_report(pr)
4. Line-by-Line Explanation of the Code
Lines 1–5
Import the prerequisite libraries: numpy, pandas, streamlit, pandas_profiling and streamlit_pandas_profiling.
Lines 7–13
Line 7: A comment noting that the code block that follows sets the web app’s title.
Lines 8–13: The st.markdown() function prints the web app’s title. The input argument is written in Markdown syntax.
Lines 15–20
Line 15: A comment noting that Lines 16–20 are responsible for the CSV upload functionality.
Line 16: with groups the indented lines that follow into a single code block. st.sidebar.header() uses the header() function to display the heading text 1. Upload your CSV data in the left panel.
Line 17: The st.sidebar.file_uploader() function creates the upload box. Note that st.sidebar places the upload box in the sidebar (the collapsible left panel). The input argument type=["csv"] inside file_uploader() restricts the upload box to the csv file type.
Lines 18–20: The st.sidebar.markdown() function uses markdown() to display a link to the example CSV file in the left panel.
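The object returned by the upload box can be passed straight to pd.read_csv() because it behaves like an in-memory binary file (Streamlit documents it as a subclass of BytesIO). The sketch below imitates that outside Streamlit with a plain io.BytesIO object. The variable name fake_upload and the CSV content are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for the object returned by st.sidebar.file_uploader();
# the CSV content is invented for this example.
fake_upload = io.BytesIO(b"name,value\nalpha,1.5\nbeta,2.5\n")

# pd.read_csv() accepts any file-like object, so no temporary file is needed.
df = pd.read_csv(fake_upload)

# Reading consumes the stream, so rewind before reading the same object again.
fake_upload.seek(0)
df_again = pd.read_csv(fake_upload)

print(df)
```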
Lines 22–52
Line 22: A comment noting that the subsequent code blocks create the Pandas Profiling report. This section consists of an if-else statement in which the else code block (Lines 35–52) runs by default every time the web app loads. Once an input CSV file is uploaded, the if code block (Lines 23–34) runs instead.
Lines 23–34: This code block is the if branch. As mentioned above, it runs once an input CSV file has been uploaded.
- Line 23: An if condition detects that a file has been uploaded, as indicated by uploaded_file is not None.
- Line 24: The @st.cache decorator lets the app skip a costly, repetitive re-run of the function defined in the lines that follow.
- Lines 25–27: A custom function called load_csv reads the input CSV file from the upload box via the uploaded_file variable.
- Line 28: The contents of the CSV file, as returned by the load_csv custom function, are assigned to the df variable.
- Line 29: The report object created by ProfileReport() is assigned to the pr variable.
- Lines 30–31: **Input DataFrame** is printed as heading text via the st.header() function (Line 30). The contents of the DataFrame stored in the df variable are printed (Line 31).
- Line 32: st.write('---') prints a horizontal bar.
- Lines 33–34: **Pandas Profiling Report** is printed as heading text via the st.header() function (Line 33). The contents of the pr variable, which stores the output of ProfileReport(), are rendered via st_profile_report() (Line 34). This is the interactive HTML profile report that provides the EDA functionality.
Lines 35–52: This code block is the else branch.
- Line 35: The else branch is the default state that runs every time the web app loads.
- Line 36: The text Awaiting for CSV file to be uploaded. is printed in an info box via the st.info() function.
- Lines 37–52: An if condition specifies what happens when the Press to use Example Dataset button is clicked: the statements underneath it run and use the example dataset to generate the profile report.
- Line 38: A comment noting that the following lines of code pertain to the example data.
- Line 39: As described earlier for Line 24, the @st.cache decorator lets the app skip a potentially costly re-run of the function that follows (otherwise the app would regenerate the synthetic dataset every time it reloads).
- Lines 40–45: A custom function creates a synthetic dataset of 5 columns using the NumPy library.
- Line 46: The synthetic dataset returned by the custom function from Lines 40–45 is assigned to the df variable.
- Lines 47–52: The profile report is generated and displayed in the same way as in Lines 29–34.
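The effect of caching described for Lines 24 and 39 can be illustrated outside Streamlit. The sketch below uses functools.lru_cache from the standard library as a stand-in: it is only an analogy, not how @st.cache is implemented, and the calls counter is added purely for demonstration.

```python
from functools import lru_cache

import numpy as np
import pandas as pd

calls = {"n": 0}  # demonstration-only counter of how often the body really runs


@lru_cache(maxsize=None)  # analogy for @st.cache, not Streamlit's implementation
def load_data():
    calls["n"] += 1
    return pd.DataFrame(
        np.random.rand(100, 5),
        columns=["a", "b", "c", "d", "e"],
    )


first = load_data()
second = load_data()  # served from the cache; the function body is skipped

print(calls["n"])       # number of times the body executed
print(first is second)  # whether both calls returned the same object
```

Without the decorator, each call would generate a new random dataset, which is what would happen on every reload of the app.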
5. Running the EDA App
It’s now time to run the app.
5.1. Create the conda environment
Let’s assume that you are starting from scratch, so you will have to create a new conda environment (a good idea for keeping your code reproducible).
First, create a new conda environment called eda by running the following in a terminal:
conda create -n eda python=3.7.9
Next, activate the eda environment:
conda activate eda
5.2. Install prerequisite libraries
Firstly, download the requirements.txt file
wget https://raw.githubusercontent.com/dataprofessor/eda-app/main/requirements.txt
Secondly, install the libraries as shown below
pip install -r requirements.txt
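After the installation, a quick check can confirm that the five libraries used by the app are importable in the active environment. This is a small sketch using only the standard library. The function name check_modules is hypothetical, and the module list is taken from the import statements of the app.

```python
from importlib.util import find_spec

# Module names as they appear in the app's import statements.
REQUIRED = [
    "numpy",
    "pandas",
    "streamlit",
    "pandas_profiling",
    "streamlit_pandas_profiling",
]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


status = check_modules(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any module reported as MISSING suggests that the pip install step did not complete in the currently active environment.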
5.3. Download the EDA App files
You can either download the web app files hosted on the Data Professor GitHub repo or use the 52 lines of code found above.
wget https://github.com/dataprofessor/eda-app/archive/main.zip
Next, unzip the file contents
unzip main.zip
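Now enter the extracted directory via the cd command (a GitHub archive unzips to a directory named after the repo and branch, here eda-app-main)
cd eda-app-main
Now that you’re inside the eda-app-main directory, you should be able to see the app.py file.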
5.4. Launching the web app
The app can be launched by typing the following command into a terminal prompt (ensure that the app.py file is in the current working directory):
streamlit run app.py
Afterwards, you will see the following message in the terminal prompt.
> streamlit run app.py

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://10.0.0.11:8501
Finally, a browser should pop up and the app appears.
Screenshot of the EDA App. To try out the app please refer to the EDA App demo.
Screenshot of the EDA App in action. We can see here the correlation plots. Please refer to the image above in section 2 for an animated example.
Congratulations!
You have now created an EDA app that lets you easily perform EDA on the go: simply upload an input CSV file to get started.
Deploying the EDA App to the Internet
If you would like to make your web app publicly available to the world, you can deploy it to the internet. I’ve created YouTube videos showing how you can do that on Heroku and Streamlit Sharing.
An example of the EDA App deployed on the internet via the Streamlit Sharing platform can be accessed here: EDA App demo