Data Science Salaries 2023

Here we will perform Exploratory Data Analysis (EDA) on the Data Science Salaries 2023 dataset from Kaggle.

About Dataset

Data Science Salaries dataset variables information

Importing the required Libraries

#Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Dataset Loading

#load the dataset
ds_salaries = pd.read_csv('ds_salaries.csv')
ds_salaries

Data Exploration and Cleaning Process

Before starting the EDA, we explore and clean the data, which is the basic first step. We will check the dataset for null values and for duplicated entries.

ds_salaries.head()
First five rows of the dataset

ds_salaries.columns
Column names of the dataset

What are the data types of the columns in the Data Science Salaries dataset?

ds_salaries.info()

Before analyzing the Data Science Salaries 2023 dataset, we will look at the summary statistics of every numeric variable: count, mean, standard deviation, min, max, and the 25th, 50th, and 75th percentiles.

ds_salaries.describe()
Summary statistics of the numeric variables
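
By default, describe() summarizes only the numeric columns. A minimal sketch of how the text columns could be summarized as well is shown here; the small `sample` DataFrame is a hypothetical stand-in for ds_salaries, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows (not real data) with the same kind of columns
sample = pd.DataFrame({
    'experience_level': ['SE', 'SE', 'MI', 'EN'],
    'company_size': ['M', 'M', 'L', 'S'],
    'salary_in_usd': [150000, 120000, 90000, 70000],
})

# Keep only the non-numeric columns, then describe them:
# the result has the rows count, unique, top (most frequent value) and freq
categorical_summary = sample.select_dtypes(exclude='number').describe()
print(categorical_summary)
```

On the real data, the same two calls could be applied to ds_salaries directly.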

Check the null values

ds_salaries.isnull().sum()

No null values are present in the dataset.

How many unique values does each column have?

ds_salaries.nunique()
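
The duplicate check mentioned at the start of this section can be sketched as follows. The `sample` DataFrame is a hypothetical stand-in for ds_salaries with made-up rows. Whether duplicates should be dropped is a judgment call: in a salary survey, two identical rows may be two different people with the same job and pay.

```python
import pandas as pd

# Hypothetical sample rows (not real data); the first two rows are identical
sample = pd.DataFrame({
    'work_year': [2023, 2023, 2022, 2023],
    'experience_level': ['SE', 'SE', 'MI', 'EN'],
    'job_title': ['Data Scientist', 'Data Scientist', 'Data Analyst', 'ML Engineer'],
    'salary_in_usd': [150000, 150000, 90000, 70000],
})

# Count rows that exactly repeat an earlier row
n_duplicates = int(sample.duplicated().sum())
print(f"Duplicated rows: {n_duplicates}")

# Optional: keep only the first occurrence of each row
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```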

Exploratory Data Analysis

Now it’s time to perform Exploratory Data Analysis on the dataset. I have explained EDA on my blog; if you do not yet have a clear idea of EDA, you can read my article Data Analysis EDA.

Univariate Analysis

Here we will analyze the dataset by considering only one variable at a time.

# Histograms
sns.histplot(data=ds_salaries, x='salary_in_usd')
plt.suptitle('Distribution of salary_in_usd', fontsize=20)
plt.show()
Histogram of the salary_in_usd column
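
A histogram shows the shape of the distribution, and a few numbers can back up what the eye sees. The sketch below compares mean and median and computes the skewness; the salary values are hypothetical and chosen only to show a right-skewed case.

```python
import pandas as pd

# Hypothetical salaries (not real data) with one very high value
salaries = pd.Series([60000, 80000, 100000, 120000, 400000], name='salary_in_usd')

mean_salary = salaries.mean()
median_salary = salaries.median()
skewness = salaries.skew()  # positive values indicate a long right tail

print(f"Mean: {mean_salary:.0f}, median: {median_salary:.0f}, skewness: {skewness:.2f}")
```

When the mean sits well above the median, as in this sample, a few high salaries are pulling the average up, so the median is usually the safer summary.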

Next, we analyze the distribution of the categorical variables of the Data Science Salaries 2023 dataset.

# Set up the matplotlib figure
plt.figure(figsize=(14, 10))

# Plot count plots for categorical columns
ct_columns = [ 'experience_level', 'employment_type', 'company_size','work_year']

for i, column in enumerate(ct_columns, 1):
    plt.subplot(2, 2, i)
    sns.countplot(x=ds_salaries[column], hue=ds_salaries[column], palette='Set2', legend=False)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Count')

plt.tight_layout()
plt.show()
Distribution of the categorical variables of the dataset

Which country do most employees live in?

# Count the number of employees from each country
country_counts = ds_salaries['employee_residence'].value_counts()

# Find the country with the highest number of employees
most_common_country = country_counts.idxmax()
most_common_country_count = country_counts.max()

print(f"The country with the most employees is {most_common_country} with {most_common_country_count} employees.")
The country with the most employees is US with 3004 employees.
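
To see how dominant the top country is, the counts can be turned into percentage shares. This is a minimal sketch on a hypothetical `residence` Series with made-up country codes; on the real data, ds_salaries['employee_residence'] would take its place.

```python
import pandas as pd

# Hypothetical residence codes (not real data)
residence = pd.Series(['US', 'US', 'US', 'GB', 'CA', 'US', 'GB', 'IN'],
                      name='employee_residence')

counts = residence.value_counts()
# normalize=True returns fractions that sum to 1; convert to percent
shares = residence.value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({'employees': counts, 'share_pct': shares}).head(3)
print(summary)
```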

Bivariate Analysis of Data Science Salaries 2023

Bivariate analysis examines the relationship between two variables at a time.

# Bar Plots
plt.figure(figsize=(12, 6))
sns.countplot(data=ds_salaries, x='experience_level', hue='employment_type')
plt.title('Experience Level by Employment Type')
plt.show()
Experience level by employment type

Company Size by Remote Ratio

plt.figure(figsize=(12, 6))
sns.countplot(data=ds_salaries, x='company_size', hue='remote_ratio')
plt.title('Company Size by Remote Ratio')
plt.show()
Company size by remote ratio analysis
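
Raw counts are hard to compare when the company-size groups differ a lot in size. One possible complement to the count plot is a cross-tabulation with row percentages, sketched here on a hypothetical `sample` DataFrame with made-up values.

```python
import pandas as pd

# Hypothetical sample rows (not real data)
sample = pd.DataFrame({
    'company_size': ['S', 'S', 'M', 'M', 'M', 'L'],
    'remote_ratio': [100, 0, 0, 0, 100, 50],
})

# normalize='index' makes every row (company size) sum to 1
remote_share = (
    pd.crosstab(sample['company_size'], sample['remote_ratio'], normalize='index')
    .mul(100)
    .round(1)
)
print(remote_share)
```

Each row then reads as "of all companies of this size, what percentage has each remote ratio".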

Which position pays the most money in US dollars?

# Group by job title and find the maximum salary in USD for each job title
max_salary_by_job = ds_salaries.groupby('job_title')['salary_in_usd'].max().reset_index()

# Find the job title with the highest salary
highest_salary_job = max_salary_by_job[max_salary_by_job['salary_in_usd'] == max_salary_by_job['salary_in_usd'].max()]

print(highest_salary_job)

The highest salary in the dataset is paid for a Research Scientist position.
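
The maximum reflects a single row, so one unusually high salary decides the answer. A sketch of a more robust comparison, using the median per job title, is shown below; the `sample` DataFrame and its values are hypothetical and only illustrate how the two rankings can differ.

```python
import pandas as pd

# Hypothetical sample rows (not real data)
sample = pd.DataFrame({
    'job_title': ['Research Scientist', 'Research Scientist', 'Research Scientist',
                  'Data Engineer', 'Data Engineer', 'Data Analyst'],
    'salary_in_usd': [450000, 90000, 100000, 180000, 170000, 80000],
})

# Maximum, median and number of rows per job title, sorted by median
salary_by_job = (
    sample.groupby('job_title')['salary_in_usd']
    .agg(['max', 'median', 'count'])
    .sort_values('median', ascending=False)
)
print(salary_by_job)
```

The count column is worth keeping in view: a title with only one or two rows gives an unreliable median as well.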

Multivariate Analysis

Multivariate analysis considers more than two variables at once.

# Select relevant columns for pair plot
pairplot_columns = ['salary_in_usd', 'remote_ratio', 'experience_level']

# Create pair plot
sns.pairplot(ds_salaries[pairplot_columns], hue='experience_level')
plt.suptitle('Pair Plot of Selected Variables', y=1.02)
plt.show()
Pair plot of selected variables of dataset
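
The same three variables can also be summarized as a table. The following is a minimal sketch of a pivot table with the median salary for each combination of experience level and remote ratio; the `sample` DataFrame is hypothetical and its values are made up.

```python
import pandas as pd

# Hypothetical sample rows (not real data)
sample = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'MI', 'MI', 'SE', 'SE'],
    'remote_ratio': [0, 100, 0, 100, 0, 100],
    'salary_in_usd': [60000, 70000, 90000, 95000, 150000, 140000],
})

# Rows: experience level, columns: remote ratio, cells: median salary
median_salary = sample.pivot_table(
    index='experience_level',
    columns='remote_ratio',
    values='salary_in_usd',
    aggfunc='median',
)
print(median_salary)
```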

from mpl_toolkits.mplot3d import Axes3D

# Create 3D scatter plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

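# Color each point by experience level, encoded as integer category codes
level_codes = ds_salaries['experience_level'].astype('category').cat.codes
sc = ax.scatter(
    ds_salaries['salary'], ds_salaries['salary_in_usd'], ds_salaries['remote_ratio'],
    c=level_codes, cmap='viridis', alpha=0.6, s=100,
)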
ax.set_xlabel('Salary')
ax.set_ylabel('Salary in USD')
ax.set_zlabel('Remote Ratio')
plt.title('3D Scatter Plot of Salary, Salary in USD, and Remote Ratio')

# Create color bar (its values are the integer category codes of experience_level)
cbar = plt.colorbar(sc)
cbar.set_label('Experience Level')
plt.show()
3D scatter plot