top of page

SPROCKET CENTRAL CONSULTING PROJECT

Project Description:

This project was part of my participation in the KPMG Data Analytics Consultancy Virtual Internship, where I created a data-driven plan and developed a dashboard that addresses the challenge of identifying the right customers to target for the client, Sprocket Central, a company that specialises in the sale of bicycles.

 

Tools Used In This Project:

1.     Microsoft Excel: used to explore and clean the data.

2.     Microsoft SSMS: used to create relational databases and writing SQL queries to model, manipulate and clean the data.

3.     Microsoft Power BI: used to analyse, visualize and report the results to the client company.

 

Background Information:

Sprocket Central Pty Ltd , a fictitious, medium size bikes & cycling accessories organisation and a long-standing KPMG client whom specialises in high-quality bikes and accessible cycling accessories to riders, needs help with its customer and transactions data. The organisation has a large dataset relating to its customers, but their team is unsure how to effectively analyse it to help optimise its marketing strategy.

The marketing team at Sprocket Central Pty Ltd is sure that, if correctly analysed, the data would reveal useful customer insights which could help optimise resource allocation for targeted marketing. Hence, improve performance by focusing on high value customers.

 

Problem Statement:

Sprocket Central Pty Ltd marketing team is looking to boost business by analysing their existing customer dataset to determine customer trends and behaviour. The client provided KPMG with 3 datasets:

·        Customer Demographic

·        Customer Addresses

·        Transactions data in the last year

The client has also provided a new list of 1000 potential customers with their demographics and attributes. However, these customers do not have prior transaction history with the organisation. The client asked to recommend which of these 1000 new customers should be targeted to drive the most value for the organisation.

 

Dataset:

The provided datasets were as following:

1- Customers Demographics data:

This dataset consists of 13 attributes and 4000 records, and a preview is given below:

in the sale of bicycles.

2- Customers Addresses data:

This dataset consists of 6 attributes and 4000 records, and a preview is given below:

3- Transactions data in the last year (2017):

This dataset consists of 13 attributes and 20,000 records, and a preview is given below:

4- New Customers List:

This dataset consists of 18 attributes and 1,000 records, and a preview is given below:

Solving The Problem:

In order to solve the client's problem, I took the following steps in this project:

Step 1: Asking the right questions:

The questions I asked and built the project around were:

1.     What key performance indicators (KPIs) can be created to determine which clients to target?

2.     What are the client's customers' product preferences?

3.     What trends and patterns can we see in the data?

4.     What other datasets could be helpful in gaining more understanding of consumer preferences and likelihood to buy the products?

Strategy:

1.     Making the total number of purchases (orders) the primary determinant of client value.

2.     Categorising customers based on demographics to determine the top buying category (for example, gender: who buys more? Males or females, what age group made the most purchases?,...etc).

3.     Analysing time period data to look for any patterns.

4.     Extracting months from dates and categorising them into climate seasons (Winter, Summer, Autumn, and Spring) to check whether there are any seasonal trends.

5.     Check to see whether there is a rise or drop in the percentage of online orders during a given season.

6.     Examine male and female preferences throughout the yearly seasons.

7.     Analyse the client's products to determine which ones are more popular with clients.

8.     Recommending the top 100 old and new customers based on the previous analysis results.

 

Step 2: Data Validation:

The data validation framework which I used to validate the data consisted of the following standard data quality dimensions:

1.     Accuracy: this requires informing the client in order to guarantee that the data they handled is correct; but, because the client company is fictitious, no communication with the client can be made, and I must simply assume that the data is accurate.

2.     Completeness: missing values were discovered in many of the records, causing the data to be inconsistent. Because the customer is a fake company, there was no way to complete the data, forcing me to remove the incomplete records, lowering the analytical accuracy.

3.     Consistency: this requires the data to be free from any type of contradiction.

4.     Currency: To successfully track trends, the data must be up to date. This was true given that this virtual internship project was initiated in the year 2017.

5.     Relevancy: entails utilising the data relevant to the investigation while avoiding include any unnecessary data.

6.     Validity: involves ensuring that the data has allowable values.

7.     Uniqueness: implies that the data is free of duplication.

Step 3: Data Processing & Cleaning:

As previously pointed out, the data contained several errors and inconsistencies that required cleaning. This is a demonstration of those issues and how I solved them, keeping in mind that these methods were applied to other datasets with the same issues:

 

1- Numbers stored as Text:

Issue: Numbers stored in the text format within the "past_3_years_bike_related_purchases", "property_valuation" and "postcode" columns.

Solution:

Using the "Value function" in MS Excel to convert the numeric values to its proper format.

2- Dates Formatted as Numbers:

Issue: dates stored in the "transaction_date" and "product_first_sold_date" columns was found to be in numeric format.

Solution:

Correcting the format of the wrong formatted dates and ensuring all date records formats are united.

3- Irrelevant Data:

Issue: presence of irrelevant columns within the data such as "default" column in the customers demographics dataset, and the undefined columns in the new customers list data.

Solution:

Removing the irrelevant columns.

4- Missing Data:

Issue: presence of blanks and n/a values in the "last_name", "job_industry" and "job_title" columns.

Solution:

Instead of simply deleting the missing values, which would force me to delete the entire record and lower the accuracy of the analysis, the missing values were replaced with (not available) in the case of the "last_name" column and with (n/a) in the case of the "job_industry" and "job_title" columns. This was valid because, when the customer_id is provided, the values in these columns don't matter as much.

3_Missing_Last_Name_Data_Solved.png

5- Inconsistent Gender Values & Customers With Age > 110:

Issue: Inconsistent data has been found in the gender column, with the values for "Female" and "Male" being spelt incorrectly at times as "Femal" and "Femaale" with "Female" being recorded as "F" at one point and "Female" at another. The gender is a key component in the strategy used to address the client's problem, so it is crucial to find a solution. However, there has been a strange value ("U") present in the gender column, and when the column was filtered to show the ("U") values, the customers' ages were discovered to be over 110 years old and the deceased indicator was showing that these customers were still alive, raising doubts about the credibility of the data.

The "state" column has the same problem with inconsistent values, with the states of "New South Wales" and "Victoria" being shortened as "NSW" and "VIC" one time and completely spelt the next.

Solution: Records with gender values of "U" were eliminated, and gender and state names that were inconsistent were unified by replacing the incorrect and shortened values with the proper values.

Step 4: Data Manipulation & Preparing The Final Dataset:

1- Extracting the year of birth (YOB) from birth dates & Calculating Customers Ages:

  • The "YOB" and "Age" columns were created to aid in categorising the customers for the final analysis by age. In order to do that, the Excel "YEAR" function was utilised, and to determine the age, the YOB was subtracted from 2017 (the year the project was created):

 

​

​

1_Creating_YOB.png
3_Creating_Age.png

2- Categorizing Customers into Age groups:

  • The customers were divided into age groups using a "nested IF" function combined with the "AND" function as follows:

​

​

3- Making The "Full Name" Column:

  • The "CONCATENATE" function was used to combine the first and last names of customers as follows:

​

​

2_Creating_Full_Name.png

4- Extracting Months From Dates & Generating The Season Data:

  • Months have been extracted with the "TEXT" function from the "transaction_date" data. Additionally, the start of the season in Australia (the country of the client) was searched up online. The "Season" column was then created by using the "INDEX-MATCH" technique to match the month with the season:

​

​

4- Product Data Creation, Table Joining, and Final Dataset Creation:  (SQL)

  • The following actions have been taken to accomplish this::

    • Building a database to store data and create tables.

    • Importing data into SSMS from Excel.

    • Writing a SQL query and employing the SQL "CONCAT" function to generate "product" data, which is a mixture of the brand, product line, class, and size to be utilised in the analysis to investigate customers preferences.

    • Joining the transactions, demographics and addresses tables with an inner-join query to create the final dataset.

​

10_Creating_DataBase.png

Step 5: Data Analysis & Creating The Final Dashboard :  (MS Power BI)

1- Loading data into MS Power BI and establishing relationships between the tables:

  • Data was loaded to MS Power BI and a simple data model was established by taking the "customer_id" as the primary key. The model contains the following relationships:

    • One-To-One relationship between demographics and addresses tables.​

    • One-To-Many relationship between the demographics and transactions tables.

    • One-To-Many relationship between the addresses and transactions tables.

​

​

1_Data_Loaded_And_Relationships_Established.png

2- Creating the necessary measures using DAX:

  • The following measure were created:

    • # of Customers = DISTINCTCOUNT(Final_Transactions2[customer_id])

    • # of Orders = COUNT(Final_Transactions2[customer_id])

    • Approved Orders = DIVIDE(CALCULATE(COUNT(Final_Transactions2[order_status]),Final_Transactions2[order_status] IN {"Approved"}),COUNTA(Final_Transactions2[order_status]),0)

    • # of Online Orders = SUM(Final_Transactions2[online_order])

    • Online Orders % = DIVIDE([# of Online Orders],[# of Orders],0)

    • Sales = SUM(Final_Transactions2[list_price])

    • Total Past 3 Years Purchases (Old Customers) = SUM(Customer_Demographic[past_3_years_bike_related_purchases])

    • Total Past 3 Years Purchases (New Customers) = SUM(Customer_Demographic[past_3_years_bike_related_purchases])

​

3- Designing The Final Report:

  • The final dashboard was designed using a simple, user-friendly layout that helps in data storytelling. It was primarily made to visualise the following trends and patterns found in the data:

    • Number of orders by state.

    • Number of orders by brand, product line, class and size.​​

    • Orders trend over time for males and females.

    • Total number of customers.

    • Number of male customers and female customers.

    • Total number of orders.

    • Total number of orders made by males and orders made by females.

    • Percentages of approved and cancelled orders.

    • Orders by season and the breakdown by car ownership.

    • Orders by wealth segment.

    • Orders by age and the breakdown by gender.

    • Orders by job industry.

    • Orders, sales, number of customers who purchased, and online ordering for every single product.

    • Cancelled orders trend and its breakdown by gender.

    • Top 100 old customers by number of orders.

    • Ages of new customers and the breakdown by gender.

    • Recommendation of the top 100 customer of the new customers to target.

  • Slicers have been added to the dashboard to allow the user to explore additional insights as desired, and the images below are from the final dashboard:​

​

1.png

Step 6: Insights & Recommendations :

  • Customers in the dashboard above were advised based on the following findings and insights from the analysis of the datasets provided:

  1. Customers between the ages of 20 and 59 make the most purchases, with the (40-49) age group being the biggest buyer.

  2. The top two selling brands were Solex and WereA2B, with the medium class and size being the most popular. Keeping this in mind, the most frequently purchased combination (product) by consumers is given in the table on the report's highlights page. It is advised to target customers with the most popular products.

  3. Customers categorised as mass customers were discovered to purchase the most, as evidenced by the fact that this sector accounted for 50% of total transactions.

  4. Purchases made by males and females were found to be roughly 2% different, implying that gender is not a factor when determining which clients to target.

  5. The state of New South Wales was shown to have the highest sales (52% of total sales), indicating that the customers who live there are more active.

  6. Men seemed to purchase more in the spring season, while female purchases have been found to decrease, and opposite behaviour was observed in the autumn season.

  7. Customers who possess a car appear to buy more in the spring.

  8. An analysis of cancelled orders (1% of total orders) was performed to see if there was a trend that could reveal the cause of the cancellations and to see if one of the genders has a higher tendency to cancel orders, so that more emphasis can be placed on the gender with a lower cancellation tendency. The data also reveal that gender does not have a role in order cancellation, as shown in the (Cancelled Orders Trend) chart to the right in the old customers page of the report.

  9. The analysis revealed that roughly 50% percent of the orders were online it can be ignored when it comes to determining which customers to target.

  10. Customers working in the manufacturing industry. Health, financial services, and clients whose employment industries are not available were discovered to order more, accounting for 75% of all orders. As a result, it is better to target those customers.

​

​

Step 7: Sharing The Results:

Explore the interactive dashboard on this page:

​

​

​

Or View the report in Microsoft Power BI platform :

References:

Sprocket Central Logo: Ansari, M. A. (n.d.). KPMG - Virtual Experience Program. KPMG - Virtual Experience Program. https://ironstark007.github.io/Portfolio/portfolio/project-6/

KPMG Logo: https://www.gettyimages.ae/detail/news-photo/the-logo-of-kpmg-a-multinational-tax-advisory-and-news-photo/1297929057

​

bottom of page