
Jewellery Customer Segmentation Analysis

Project Background

A jewellery retail company provided a customer segmentation dataset sourced from Kaggle. The dataset contained extensive information on customer behavior, preferences, and performance metrics. However, it was cluttered with redundant, irrelevant, and inconsistent data. The company required assistance in preparing this dataset for structured storage and future analytical processing. Our task was to clean and standardize the data, then load it into a MySQL database where it could be easily queried and analyzed.

Objectives

Data Cleaning: Remove irrelevant or redundant fields that do not contribute to segmentation analysis.

Duplicate Removal: Identify and eliminate any duplicate customer entries to ensure data uniqueness.

Data Standardization: Ensure consistent formatting of customer-related attributes such as names, dates, and identifiers.

Database Integration: Load the cleaned dataset into a structured MySQL database to support future reporting or machine learning tasks.

Methodology

Initial Data Assessment:

Load the Data: The dataset was imported into a Pandas DataFrame using pd.read_csv().

Explore the Data: The structure and content of the dataset were examined with the .head() and .info() methods and the .dtypes attribute.

Identify Issues: Unnecessary columns, potential duplicates, and missing values were flagged for treatment; see the sketch below.
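A minimal sketch of this assessment step, assuming a hypothetical file name jewellery_customers.csv (the actual file name is not given in this write-up):

    import pandas as pd

    # Load the raw Kaggle export into a DataFrame
    df = pd.read_csv("jewellery_customers.csv")

    # Inspect the first rows, overall structure, and column dtypes
    print(df.head())
    df.info()
    print(df.dtypes)

    # Flag candidate issues: exact duplicate rows and per-column null counts
    print("duplicate rows:", df.duplicated().sum())
    print(df.isnull().sum())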

Data Cleaning Process:

Column Reduction: Removed numerous columns irrelevant to the final use case, including marketing metrics and profitability scores.

Missing and Error Handling: Checked for null values and formatting issues using .isnull() and Pandas string operations.

Consistency Checks: Standardized column names and data formatting with Pandas string functions; see the sketch below.
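A minimal sketch of the cleaning step; the dropped column names are illustrative stand-ins for the dataset's marketing and profitability fields, which are not listed in this write-up:

    # Drop columns that don't feed the segmentation analysis
    # (these names are assumed, not taken from the actual dataset)
    drop_cols = ["marketing_score", "profitability_score", "campaign_flag"]
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])

    # Normalize column names to a consistent snake_case style
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Trim stray whitespace from every text column
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()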

Duplicate Detection and Removal:

Used .duplicated() to find duplicate rows.

Used .drop_duplicates() to retain only unique customer records; see the sketch below.
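A sketch of the deduplication step; the customer_id column used in the identifier-based pass is an assumed name:

    # Count exact duplicate rows before removing them
    n_dupes = df.duplicated().sum()

    # Keep the first occurrence of each fully duplicated record
    df = df.drop_duplicates().reset_index(drop=True)

    # If the dataset carries a customer identifier (name assumed here),
    # dedupe on it as well so each customer appears exactly once
    df = df.drop_duplicates(subset=["customer_id"], keep="first")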

Data Standardization:

Standardized column formats, including proper handling of date fields with the datetime module and pd.to_datetime().

Ensured that all remaining fields followed a consistent style; see the sketch below.
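A sketch of the standardization step; signup_date, customer_name, and customer_id are assumed column names used for illustration:

    # Parse date strings into a proper datetime64 column;
    # errors="coerce" turns unparseable values into NaT rather than raising
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

    # Apply a consistent text style: lower-case names, upper-case identifiers
    df["customer_name"] = df["customer_name"].str.lower()
    df["customer_id"] = df["customer_id"].astype(str).str.upper()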

Database Integration:

Used SQLAlchemy to create a secure connection to a MySQL database.

Loaded the cleaned dataset into a MySQL table using .to_sql() with if_exists='replace'.

Wrapped the load in try/except blocks for error handling, and logged each ETL phase with a custom logging function; see the sketch below.
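A sketch of the load step, assuming a pymysql driver and placeholder credentials; customer_segments is an illustrative table name, and log_progress stands in for the custom logging function mentioned above:

    from datetime import datetime
    from sqlalchemy import create_engine

    def log_progress(message):
        # Prefix each ETL message with a consistently formatted timestamp
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} - {message}")

    # Connection string is a placeholder; real credentials, host, and
    # schema name will differ
    engine = create_engine("mysql+pymysql://user:password@localhost/jewellery_db")

    try:
        log_progress("Loading cleaned data into MySQL")
        df.to_sql("customer_segments", con=engine,
                  if_exists="replace", index=False)
        log_progress("Load complete")
    except Exception as exc:
        log_progress(f"Load failed: {exc}")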

Pandas Tools and Techniques Used

Data Loading:

pd.read_csv(): To import the CSV file.

Data Exploration:

df.head(), df.info(), df.dtypes: For examining structure and content.

Data Cleaning:

df.drop(columns=[...]): To remove unnecessary columns.

df.isnull(), df.duplicated(): To identify issues.

df.drop_duplicates(): To remove duplicate rows.

Data Standardization:

str.lower(), str.upper(): For text normalization.

datetime.now(), strftime(): For logging with consistent timestamp formatting.

Database Loading:

create_engine() from SQLAlchemy: For connecting to MySQL.

to_sql(): To send data into the MySQL table.

Project Result

The project resulted in a thoroughly cleaned and structured dataset suitable for storage and analysis. The final outcomes included:

Cleaned Dataset: Removed over a dozen irrelevant or redundant columns, leaving only meaningful customer segmentation data.

No Duplicates: Ensured that all customer records were unique.

Standardized Formats: Applied consistent formatting across all fields for reliability and ease of integration.

MySQL Integration: The refined dataset was successfully uploaded to a MySQL database, ready for further analytical or reporting tasks.
