Here we will do Exploratory Data Analysis on marketplace sales data.
About the Dataset of online sales marketplace data analysis
Column Info
Dataset Source
Dataset can be downloaded from Kaggle website Online Sales Dataset – Popular Marketplace Data
Libraries Importing
#Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Dataset Loading
#load the dataset
dt = pd.read_csv('Online Sales Data.csv')
dt
Data exploring and cleaning
Before EDA, first we will explore the dataset and perform the cleaning process.
dt.head()
Here we are checking column names of the data.
dt.columns
dt.info()
from datetime import datetime
dt['Date'] = pd.to_datetime(dt['Date'])
Data column has object data type. We will convert to datetime.
Important info related data types
convert_disc = {'Transaction ID' : 'int32', 'Product Category' : 'string', 'Product Name' : 'string', 'Units Sold' : 'int16',
'Region' : 'string', 'Payment Method' : 'category' }
dt = dt.astype(convert_disc)
dt.info()
dt.describe()
It tells us description of each integer variable.
dt.isnull().sum()
dt.duplicated().sum()
No duplicate value
Exploratory Data Analysis on online marketplace sales data
Univariate Analysis
(1) Which payment method is mostly used?
pm = sns.countplot(data = dt, x = dt['Payment Method'])
plt.figure(figsize = (12,6))
pm.set_xlabel('Payment method used', fontsize = 15)
pm.set_ylabel('Count of records', fontsize = 15)
pm.set_title('Univariate analysis of Payment Method', fontsize= 14)
region = dt['Region'].value_counts()
plt.pie(region.values,
labels=region.index,
autopct='%1.1f%%')
plt.show()
(2) Top 10 selling products
# Top-selling products by frequency
top_products = dt['Product Name'].value_counts().head(10)
top_products.plot(kind='bar', figsize=(16, 6))
plt.title('Top 10 Selling Products by Frequency')
plt.xlabel('Product Name')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
(3) Distribution of Quantities (Total Number of Product Units) sold
plt.figure(figsize=(10, 6))
sns.histplot(dt['Units Sold'], bins=10, kde=True)
plt.title('Distribution of Quantities (Number of Product Units) Sold')
plt.xlabel('Quantity (Number of Units)')
plt.ylabel('Frequency')
plt.show()
# Summary statistics for quantity
quantity_stats = dt['Units Sold'].describe()
print(quantity_stats)
(4) Distribution of each product unit price
# Distribution of unit prices
plt.figure(figsize=(10, 6))
sns.histplot(dt['Unit Price'], bins=10, kde=True)
plt.title('Distribution of Unit Prices')
plt.xlabel('Unit Price')
plt.ylabel('Frequency')
plt.show()
# Summary statistics for unit price
unit_price_stats = dt['Unit Price'].describe()
print(unit_price_stats)
Bivariate Analysis
(5) How much overall revenue (sales) do you get each month throughout time?
# Extract day, month, year, and weekday
dt['Day'] = dt['Date'].dt.day
dt['Month'] = dt['Date'].dt.month
dt['Year'] = dt['Date'].dt.year
dt['Weekday'] = dt['Date'].dt.weekday
dt.set_index('Date')['Total Revenue'].resample('ME').sum().plot(kind='line', figsize=(12, 6))
plt.title('Monthly sales over time')
plt.xlabel('Date')
plt.ylabel('Total sales')
plt.show()
plt.figure(figsize=(10, 6))
sns.countplot(x=dt['Product Category'], order=dt['Product Category'].value_counts().index)
plt.title('Frequency of Product Categories')
plt.xlabel('Count')
plt.ylabel('Category')
plt.show()
(6) For every product category, what is the total income (Product unit * Unit Price)?
pt_category_revenue = dt.groupby('Product Category')['Total Revenue'].sum().sort_values()
pt_category_revenue.plot(kind='barh', figsize=(10, 6))
plt.title('Total Revenue per each Product Category')
plt.xlabel('Total Revenue')
plt.ylabel('Product Category')
plt.show()
(7) What is the name of the product that brought in the most money?
# Top-selling products by total revenue
top_products_revenue = dt.groupby('Product Name')['Total Revenue'].sum().sort_values(ascending=False).head(10)
top_products_revenue.plot(kind='bar', figsize=(10, 6))
plt.title('Top 10 Selling Products by Total Revenue')
plt.xlabel('Product Name')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.show()
(8) Which region has the largest overall revenue generation?
region_revenue = dt.groupby('Region')['Total Revenue'].sum().sort_values()
region_revenue.plot(kind='barh', figsize=(10, 6))
plt.title('Total Revenue per Region')
plt.xlabel('Total Revenue')
plt.ylabel('Region')
plt.show()
(9) How is the entire revenue divided among the various payment options (Units Sold * Unit Price)?
payment_revenue = dt.groupby('Payment Method', observed=True)['Total Revenue'].sum().sort_values()
payment_revenue.plot(kind='barh', figsize=(10, 6))
plt.title('Total Revenue (Units sold * Unit Price) by Payment Method')
plt.xlabel('Total Revenue')
plt.ylabel('Payment Method')
plt.show()
Multivariate Analysis
(10) What is the distribution of sales (total revenue) for each product category across regions, and how do product categories perform in various regions?
# Heatmap for visualizing sales distribution geographically
pivot_table_region = dt.pivot_table(values='Total Revenue', index='Product Category', columns='Region', aggfunc='sum', fill_value=0)
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table_region, annot=True, fmt='.0f', cmap='Blues')
plt.title('Total Revenue by Category and Region')
plt.xlabel('Region')
plt.ylabel('Category')
plt.show()
(11) What was each Product Category’s Total Revenue throughout time (Units Sold * Unit Price)?
# Line charts with different lines for each category to see trends over time
dt.set_index('Date').groupby('Product Category')['Total Revenue'].resample('ME').sum().unstack().plot(kind='line', figsize=(12, 8))
plt.title('Monthly Sales Over Time by Category')
plt.xlabel('Date')
plt.ylabel('Total Revenue')
plt.legend(title='Category')
plt.show()
(12) Which payment methods have the most sales (revenue) within each product category?
# Stacked bar charts to show sales distribution across payment methods for each category
pivot_table_cat_pay = dt.pivot_table(values='Total Revenue', index='Product Category', columns='Payment Method', aggfunc='sum', fill_value=0, observed = False)
pivot_table_cat_pay.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Total Sales by Product Category and Payment Method')
plt.xlabel('Product Category')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.legend(title='Payment Method')
plt.show()
(13) What regional differences exist in the payment method preferences?
# Stacked bar charts to compare payment method preferences across regions
pivot_table_reg_payment = dt.pivot_table(values='Total Revenue', index='Region', columns='Payment Method', aggfunc='sum', fill_value=0, observed = True)
pivot_table_reg_payment.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Total Revenue by Region and Payment Method')
plt.xlabel('Region')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.legend(title='Payment Method')
plt.show()
I tried my best to finish this project. If you are interesting more data analysis EDA project you can check Customer Segmentation Data Analysis.