
Data Cleaning using MySQL

This page presents a case study of cleaning the World_layoffs dataset using MySQL, covering the project overview, the tools used, and live links to the project.


Project Overview

Data Cleaning on the World_layoffs Dataset Using MySQL
In my project on data cleaning with the World_layoffs dataset, I performed a series of SQL operations to ensure data quality and consistency. Here is a detailed, technical description of the steps I took:
1. Creating a Staging Table: I created a staging table with the same structure as the original layoffs table using the CREATE TABLE ... LIKE syntax. The staging table served as a workspace for manipulating and cleaning the data while keeping the original dataset intact. Data was copied from the original table into the staging table with an INSERT ... SELECT statement.
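A minimal sketch of this step (the staging table name layoffs_staging is an assumption; the original table is layoffs):

-- Create an empty staging table with the same column definitions as layoffs
CREATE TABLE layoffs_staging LIKE layoffs;

-- Copy every row across so the original table remains untouched
INSERT INTO layoffs_staging
SELECT *
FROM layoffs;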
2. Removing Duplicates: I identified duplicates using the ROW_NUMBER() window function, partitioned by key columns such as company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, and funds_raised_millions. Within each group of identical key values, the function numbers the rows sequentially, so any row numbered greater than one is a duplicate. I used a common table expression (CTE) to select those rows, and the duplicates were subsequently deleted from the staging table, ensuring that each record was unique.
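The duplicate check might look like the sketch below. One caveat: MySQL does not allow a DELETE directly against a CTE, so a common workaround (assumed here) is to materialize the row number in a second staging table and delete from that:

-- Number rows within each group of identical key columns;
-- any row with row_num > 1 is a duplicate of an earlier row
WITH duplicate_cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY company, location, industry, total_laid_off,
                            percentage_laid_off, `date`, stage, country,
                            funds_raised_millions
           ) AS row_num
    FROM layoffs_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;

-- Materialize row_num in a second staging table (layoffs_staging2 is an
-- assumed name), then delete the duplicates there
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, stage, country,
                        funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

DELETE FROM layoffs_staging2
WHERE row_num > 1;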
3. Standardizing Data: I standardized entries to fix inconsistencies. For example, variations in the industry field (such as "Crypto Currency" and "CryptoCurrency") were consolidated into a single value ("Crypto"), and inconsistent country names were fixed by removing trailing periods with the TRIM function. The industry field also contained both nulls and blank strings; the blanks were first set to null for easier handling. I then populated these null values by joining the table to itself and borrowing existing non-null values from other rows with the same company.
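The standardization statements could be sketched as follows (the LIKE patterns are assumptions based on the examples above; this and later steps assume the layoffs_staging2 table from the duplicate-removal workaround):

-- Collapse variants like 'Crypto Currency' and 'CryptoCurrency' into one label
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Strip trailing periods from country names (e.g. 'United States.')
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country);

-- Turn blank industry strings into NULL so they can be handled uniformly
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';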
4. Handling Null Values: I reviewed the null values in critical columns such as industry. For missing industry values, I used a self-join to fill in the gaps wherever possible, ensuring no usable information was lost. I chose to retain nulls in numerical columns like total_laid_off and funds_raised_millions, since missing values there could be meaningful for future analyses.
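A sketch of the self-join backfill: a row with a missing industry borrows the value from another row for the same company:

-- Fill NULL industries from non-NULL rows of the same company
UPDATE layoffs_staging2 AS t1
JOIN layoffs_staging2 AS t2
    ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;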
5. Formatting Date Columns: The date field initially stored dates as text. I converted these strings to a standardized date format using the STR_TO_DATE function. Once converted, I altered the column to the DATE data type, ensuring consistency and enabling date-based operations and analyses.
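A sketch of the conversion, assuming the strings followed a month/day/year pattern (the '%m/%d/%Y' format specifier is an assumption about the source data):

-- Parse the text dates into proper DATE values
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Change the column type now that every value is a valid date
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;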
6. Removing Unnecessary Data: I identified and removed rows with null values in both the total_laid_off and percentage_laid_off columns, since such records carried no layoff information. This cleanup ensured that only relevant, complete records were retained, improving the dataset's overall quality; the statement is sketched below.

By following these steps, I ensured the World_layoffs dataset was clean, standardized, and ready for further analysis. These operations demonstrate my proficiency in SQL and my ability to handle complex data cleaning tasks, including duplicate removal, data standardization, null handling, and consistent data formats. This project highlights my capability to prepare real-world datasets for accurate and efficient data analysis.
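A sketch of that final cleanup, including dropping the helper row_num column once it is no longer needed (this assumes the duplicate-removal workaround shown earlier):

-- Drop rows that carry no layoff information at all
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;

-- Remove the helper column used during duplicate removal
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;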

In my MySQL project titled "World Layoffs Data Cleaning," I meticulously cleaned and standardized the World_layoffs dataset to ensure data quality and consistency. By creating a staging table, I preserved the original data while performing necessary manipulations. This involved removing duplicates using advanced window functions, standardizing entries for consistency, addressing null values with strategic self-joins, and formatting date columns for uniformity. Additionally, I eliminated non-informative records to retain only relevant data. These operations highlighted my proficiency in SQL and my ability to handle complex data cleaning tasks, preparing the dataset for accurate and efficient analysis.

Tools Used

SQL
MySQL
Excel
Git
GitHub