Online Marketplace Sales Data Analysis EDA

Here we will do Exploratory Data Analysis on marketplace sales data.

About the Dataset of online sales marketplace data analysis

Column Info

Dataset Source

Dataset can be downloaded from Kaggle website Online Sales Dataset – Popular Marketplace Data

Libraries Importing

#Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Dataset Loading

#load the dataset
dt = pd.read_csv('Online Sales Data.csv')
dt

Data exploring and cleaning

Before EDA, first we will explore the dataset and perform the cleaning process.

dt.head()

Here we are checking column names of the data.

dt.columns

dt.info()

from datetime import datetime
dt['Date'] = pd.to_datetime(dt['Date'])

Data column has object data type. We will convert to datetime.

Important info related data types

convert_disc = {'Transaction ID' : 'int32', 'Product Category' : 'string', 'Product Name' : 'string', 'Units Sold' : 'int16', 
                'Region' : 'string', 'Payment Method' : 'category'  }

dt = dt.astype(convert_disc)
dt.info()

dt.describe()

It tells us description of each integer variable.

dt.isnull().sum()

dt.duplicated().sum()

No duplicate value

Exploratory Data Analysis on online marketplace sales data

Univariate Analysis

(1) Which payment method is mostly used?

pm = sns.countplot(data = dt, x = dt['Payment Method'])
plt.figure(figsize = (12,6))
pm.set_xlabel('Payment method used', fontsize = 15)
pm.set_ylabel('Count of records', fontsize = 15)
pm.set_title('Univariate analysis of Payment Method', fontsize= 14)

Univariate analysis of Payment method column of dataset

region = dt['Region'].value_counts() 
plt.pie(region.values, 
        labels=region.index, 
        autopct='%1.1f%%') 
plt.show()

Pie chart of region column of online marketplace sales data

(2) Top 10 selling products

# Top-selling products by frequency
top_products = dt['Product Name'].value_counts().head(10)
top_products.plot(kind='bar', figsize=(16, 6))
plt.title('Top 10 Selling Products by Frequency')
plt.xlabel('Product Name')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

(3) Distribution of Quantities (Total Number of Product Units) sold

plt.figure(figsize=(10, 6))
sns.histplot(dt['Units Sold'], bins=10, kde=True)
plt.title('Distribution of Quantities (Number of Product Units) Sold')
plt.xlabel('Quantity (Number of Units)')
plt.ylabel('Frequency')
plt.show()

# Summary statistics for quantity
quantity_stats = dt['Units Sold'].describe()
print(quantity_stats)

(4) Distribution of each product unit price

# Distribution of unit prices
plt.figure(figsize=(10, 6))
sns.histplot(dt['Unit Price'], bins=10, kde=True)
plt.title('Distribution of Unit Prices')
plt.xlabel('Unit Price')
plt.ylabel('Frequency')
plt.show()

# Summary statistics for unit price
unit_price_stats = dt['Unit Price'].describe()
print(unit_price_stats)

Bivariate Analysis

(5) How much overall revenue (sales) do you get each month throughout time?

# Extract day, month, year, and weekday
dt['Day'] = dt['Date'].dt.day
dt['Month'] = dt['Date'].dt.month
dt['Year'] = dt['Date'].dt.year
dt['Weekday'] = dt['Date'].dt.weekday

dt.set_index('Date')['Total Revenue'].resample('ME').sum().plot(kind='line', figsize=(12, 6))
plt.title('Monthly sales over time')
plt.xlabel('Date')
plt.ylabel('Total sales')
plt.show()

plt.figure(figsize=(10, 6))
sns.countplot(x=dt['Product Category'], order=dt['Product Category'].value_counts().index)
plt.title('Frequency of Product Categories')
plt.xlabel('Count')
plt.ylabel('Category')
plt.show()

(6) For every product category, what is the total income (Product unit * Unit Price)?

pt_category_revenue = dt.groupby('Product Category')['Total Revenue'].sum().sort_values()
pt_category_revenue.plot(kind='barh', figsize=(10, 6))
plt.title('Total Revenue per each Product Category')
plt.xlabel('Total Revenue')
plt.ylabel('Product Category')
plt.show()

(7) What is the name of the product that brought in the most money?

# Top-selling products by total revenue
top_products_revenue = dt.groupby('Product Name')['Total Revenue'].sum().sort_values(ascending=False).head(10)
top_products_revenue.plot(kind='bar', figsize=(10, 6))
plt.title('Top 10 Selling Products by Total Revenue')
plt.xlabel('Product Name')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.show()

Top selling products of marketplace data nalysis

(8) Which region has the largest overall revenue generation?

region_revenue = dt.groupby('Region')['Total Revenue'].sum().sort_values()
region_revenue.plot(kind='barh', figsize=(10, 6))
plt.title('Total Revenue per Region')
plt.xlabel('Total Revenue')
plt.ylabel('Region')
plt.show()

Total revenue per each region of dataset

(9) How is the entire revenue divided among the various payment options (Units Sold * Unit Price)?

payment_revenue = dt.groupby('Payment Method', observed=True)['Total Revenue'].sum().sort_values()
payment_revenue.plot(kind='barh', figsize=(10, 6))
plt.title('Total Revenue (Units sold * Unit Price) by Payment Method')
plt.xlabel('Total Revenue')
plt.ylabel('Payment Method')
plt.show()

Multivariate Analysis

(10) What is the distribution of sales (total revenue) for each product category across regions, and how do product categories perform in various regions?


# Heatmap for visualizing sales distribution geographically
pivot_table_region = dt.pivot_table(values='Total Revenue', index='Product Category', columns='Region', aggfunc='sum', fill_value=0)
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table_region, annot=True, fmt='.0f', cmap='Blues')
plt.title('Total Revenue by Category and Region')
plt.xlabel('Region')
plt.ylabel('Category')
plt.show()

Total revenue generated by each category and region of dataset

(11) What was each Product Category’s Total Revenue throughout time (Units Sold * Unit Price)?

# Line charts with different lines for each category to see trends over time
dt.set_index('Date').groupby('Product Category')['Total Revenue'].resample('ME').sum().unstack().plot(kind='line', figsize=(12, 8))
plt.title('Monthly Sales Over Time by Category')
plt.xlabel('Date')
plt.ylabel('Total Revenue')
plt.legend(title='Category')
plt.show()

Monthly sales over time by each category

(12) Which payment methods have the most sales (revenue) within each product category?

# Stacked bar charts to show sales distribution across payment methods for each category
pivot_table_cat_pay = dt.pivot_table(values='Total Revenue', index='Product Category', columns='Payment Method', aggfunc='sum', fill_value=0, observed = False)
pivot_table_cat_pay.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Total Sales by Product Category and Payment Method')
plt.xlabel('Product Category')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.legend(title='Payment Method')
plt.show()

Sales in each product category by payment method

(13) What regional differences exist in the payment method preferences?

# Stacked bar charts to compare payment method preferences across regions
pivot_table_reg_payment = dt.pivot_table(values='Total Revenue', index='Region', columns='Payment Method', aggfunc='sum', fill_value=0, observed = True)
pivot_table_reg_payment.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Total Revenue by Region and Payment Method')
plt.xlabel('Region')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.legend(title='Payment Method')
plt.show()

Total revenue in each region and which payment method used

I tried my best to finish this project. If you are interesting more data analysis EDA project you can check Customer Segmentation Data Analysis.