Exploratory Data Analysis With Python and Pandas



About Project
Objective
Apply practical Exploratory Data Analysis (EDA) techniques on any tabular dataset using Python packages such as Pandas and Numpy.
Tools & Technologies
Python, Data Analysis, Pandas, EDA
Share
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
Data source: https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
The growth of supermarkets in most populated cities are increasing and market competitions are also high. The dataset is one of the historical sales of supermarket company which has recorded in 3 different brtanches for 3 months data.
## read our data set
df= pd.read_csv('/kaggle/input/supermarket-sales/supermarket_sales - Sheet1.csv')
## shows top 10 of the row, use tail(x), x = rows to show the last rows
df.head(10)
## shows columns
df.columns
## shows column definition
df.dtypes
## converts our date as string to datetime
df['Date'] = pd.to_datetime(df['Date'])
df['Date']
## set the index of our dataset
df.set_index('Date', inplace=True)
df.head(5)
## computes a summary of statistics pertaining to the DataFrame columns
df.describe()
What does the distribution of customer ratings looks like? is it skewed?
sns.displot(df['Rating'])
plt.axvline(x=np.mean(df['Rating']), c='red', ls='--', label='mean')
plt.axvline(x=np.percentile(df['Rating'], 25), c='green', ls='--', label='25th-75th percentile')
plt.axvline(x=np.percentile(df['Rating'], 75), c='green', ls='--')
Do aggregate sales numbers differ by much between branches?
df.hist(figsize=(10,10))
sns.countplot(data=df, x='Branch')
sns.countplot(data=df, x='Payment')
Is ther a relationship between gross income and customer ratings?
sns.regplot(data=df, x='Rating', y='gross income')
sns.boxplot(x=df['Branch'],y=df['gross income'])
sns.boxplot(x=df['Gender'],y=df['gross income'])
Is there a noticeabke time trend in gross income?
df.head()
sns.lineplot(x=df.groupby(df.index).mean().index, y=df.groupby(df.index).mean()['gross income'])
df.duplicated().sum()
## shows duplicated rows
# df[df.duplicated() == True]
## removes duplicate rows if there is any
# df.drop_duplicates(inplace=True)
df.isna().sum()
## get ratio of empty cells by columns
# df.isna().sum()/len(df)
sns.heatmap(df.isnull(), cbar=False)
## fill missing data for numerical values
#df.fillna(df.mean(), inplace=True)
## fill missing data for non-numerical values
#df.fillna(df.mode().iloc[0],inplace=True)
dataset = pd.read_csv('/kaggle/input/supermarket-sales/supermarket_sales - Sheet1.csv')
profile = ProfileReport(dataset)
profile
round(np.corrcoef(df['gross income'], df['Rating'])[1][0],2)
np.round(df.corr(),2)
sns.heatmap(np.round(df.corr(),2), annot=True)
> soliver mazo