This is a walkthrough of the cleaning I undertook on excess of 25 million lines of data as part of the #maventaxichallenge , which is the monthly data visualization challenge set up by Maven Analytics. This month the challenge involved looking at detailed NYC Taxi data between 2017 and 2020 and developing a usable single page dashboard to analyze weekly trends.
Details of the challenge can be found here. This includes the data files being provided and the requirements for cleaning and presenting the data. I have also created a walkthrough of the below on a video link if you want to see it in live action.
My plan was fairly simple but structured:
- Write a clear list of the cleaning steps linked back to the data dictionary
- Create a Data Sample
- Apply the cleaning steps to the sample and verify
- Copy out the M Code for each step
- Load the full data set and apply the M Code for each step
Step 1 – Cleaning List
So, the following were the steps required, and what that translated into using the data dictionary:
|Cleaning Request||Data Dictionary Translation|
|Let’s stick to trips that were NOT sent via “store and forward”||store_and_fwd_flag = N|
|I’m only interested in street-hailed trips paid by card or cash, with a standard rate||trip_type = 1 , |
payment_type = 1 or 2,
RatecodeID = 1
|We can remove any trips with dates before 2017 or after 2020||lpep_pickup_datetime = between 01/01/2017 and 31/12/2020 lpep_dropoff_datetime = between 01/01/2017 and 31/12/2020|
|…along with any trips with pickups or drop-offs in unknown zones||PULocationID <> blank/empty|
|Let’s assume any trips with no recorded passengers had 1 passenger||If passenger_count = 0 or null, then replace the 0 with 1|
|If a pickup date/time is AFTER the drop-off date/time, let’s swap them||If lpep_pickup_datetime > lpep_dropoff_datetime|
if lpep_dropoff_datetime < lpep_pickup_datetime
|We can remove trips lasting longer than a day, and any trips which show both a distance and fare amount of zero||add column |
lpep_dropoff_datetime – lpep_pickup_datetime
then filter out values >= 24 hours
if trip_distance AND fare_amount = 0
then filter out values
|If you notice any records where the fare, taxes and surcharges are ALL negative, please make them positive||if fare_amount <0|
and mta_tax <0
and extra <0
and improvement_surcharge <0
and congestion_surcharge <0
then “all negative”
then apply “trick” replacement to change values (see below for more detail)
|For any trips that have a fare amount but have a trip distance of 0, calculate the distance this way: (Fare amount – 2.5)/2.5||If fare_amount > 0|
And trip_distance = 0
Then ((Fare amount – 2.5) / 2.5)
|For any trips that have a trip distance but have a fare amount of 0, calculate the fare amount this way: 2.5 + (trip distance x 2.5)||If trip_distance > 0|
And fare_amount = 0
Then (2.5 + (trip_distance x 2.5))
Step 2 – Create Data Sample
It would be next to impossible to apply the steps to over 25 million lines of data and then easily verify that all the filters, additions, replacements and modifications had taken place and you got the results you were looking for.
A much more digestible method is to recreate a sample data set based on the data we were provided, and ensure that at least one example of each cleaning step scenario listed above is included.
To do this, I took the smallest file (2020), and performed a series of deduplications on values across the locations, passenger numbers, etc. so that I was able to get a small sample of approximately 50 varied line items.
Then, in order to recreate some of the cleaning scenarios, I made minor adjustments to some values. e.g. swap the drop off and pick up times. Above is the resulting sample data set.
Step 3 – Apply Cleaning Steps to Sample and Verify
Next, I created a new PowerBI file, uploaded the sample data set and then moved to edit the set in Power Query.
After performing the usual steps on checking data types (text, date, time, numbers, etc) have been applied, it was then a case of applying each cleaning step, and working through. For example, the first step became: =Table.SelectRows(#”Changed Type”, each [store_and_fwd_flag] = “N”)
M Code Replace Hack
There is one hack that is really worth highlighting here, and will save some added columns and processing time in your Power Query tasks, especially in larger data sets. The below gallery shows a snapshot of each step, but here is a brief description:
- Create a “Dummy Replace” step for the column you which to replace values on.
- Select the Transform ribbon, and click on Replace Values (picture 1)
- Choose two dummy values to replace that would not be values in your column. As an example here, I chose replace 5000 with 6000 (picture 2), where most values would actually be single digit values in that column.
- Click OK, and you will see the M code in the formula box at the top (picture 3)
- Create a “Dummy Custom Column” step to allow you to write the conditional statement you want to insert into your Dummy Replace code
- Select the Add Column ribbon, and click on Custom Column (picture 4)
- use intellisense to help you write the conditional statement you wish to create (picture 5)
- Copy the statement, and click cancel. This way you are not creating an additional column.
- Return to your Dummy Replace step and perform the following:
- For the first value (5000), amend this value to refer to the column you want to replace values in, and precede it with “each”.
- In this example, I replace 5000 with each [fare_amount]
- For the second value (6000), amend this value to refer to the conditional statement you want to apply to the selected column, and again precede it with “each”.
- In this example. I replace 6000 with each if [negatives_charges] = “all negative” then [fare_amount]*-1 else [fare_amount]
- Once you hit return, the replacement of values occurs within the column, without the need to create an additional column. This will increase efficiency for any refreshes.
If you are interested, the “normal language” logic behind this step is:
- For each row value in the the column fare_amount, if the corresponding row in the negative_charges column is equal to “all negative”, then we would like to multiply the fare_amount by -1, thus changing it from a negative to a positive value. Else, if it is not “all negative”, then just keep the fare_amount value as it is (no change).
Step 4 – Copy out M Code
Once you have gone through each of the steps on your sample set and verified that it has had the desired effect on you line items, you will now have the list of steps under your Applied Steps on the right hand side of the Power Query interface. A good tip is to rename these steps based on your need (e.g. Change Negative Values or Swap Dates). This will help when you want to copy steps to use in the full data set.
You can see a list of this information in the Advanced Editor Window (see below). This can be copy and pasted out and saved for future use.
If you click on each step you will see the corresponding M code just above your data table.
Step 5 – Apply to Full Data Set
Now that you have your all your steps and code written, tested and verified, it is now a pretty straight forward proposition to apply them to your full data set.
An easy way to add steps, is to right-click on a step then select “Insert Step After” (see below). This will then allow you to paste in the M code that you have saved. One tip is to check the step reference. It will refer to the name of the preceding step. Therefore make sure you do that in the first instance. For following steps, it should be easy to use your copied values from the sample data, provided you use the same names for your steps.
Once you have completed all your steps, you are done – all that’s left to save is Apply and Close.
The “trial and error” approach is removed, which means that when Power BI applies the updates to the 25 + million line items, you can be reasonably confident you will not have to revisit your Power Query. This is important here, as with such large data sets, the updates can sometimes take hours to complete depending on your computer’s processing capabilities.
You are now free to move on to the next part of the challenge and concentrate on creating you DAX calculations and a nice neat dashboard.
If I have time to finish my dashboard on this one, I will add it on a future post. As always, any comments or queries, please let me know in the comments below!