Maven Challenge – Telecoms Churn Analysis

Alright, alright, alright…..

I had been promising this for a few weeks, and I guess since I was recently successful in this challenge, I better keep my promise! So here goes…..

These are my notes which comprise a mix of strategy and technique on how I came up with and developed a winning entry on the Maven Telecoms Churn challenge #mavenchurnchallenge.

Delayed Start

Before we start, I note that my main constraint was that I had been back home in Ireland (after 4.5 years) on holiday for several weeks, and was thinking of giving this challenge a miss. I got back on 19th July, which left only a few days until the challenge finished, so thought I definitely would not have time.

catching a break from the daily grind

However, that’s where my “friend” jetlag came in. I found myself wide awake for a few nights in a row between 12am and 4am, so I decided to do something useful and see if I could create an entry.

They say “necessity is the mother of invention”, and as I knew I would only have 2-3 nights to complete, I needed to create something fairly targeted and “simple”.

Scope and Brief

For those unfamiliar with the challenge the details are available at https://www.mavenanalytics.io/data-playground

The brief was to analyze demographic details and key account records of 7,043 clients at a Telecoms company and try to answer some of the following queries for the board:

  1. What constituted high value clients?
  2. What influences or drives churn risk?
  3. What steps can be taken to retain clients

Analysis Strategy

Keep it simple

As the brief was centred around defining high value and churn risk, I was thinking of ways to analyse just those two aspects. However, the data is not that straight forward. It contains information on age, tenure, services, costs, marital status, location, etc.

Remembering I had little time to create and weave a story looking at individual variables and creating a jigsaw of analysis, and also still recalling some of the analysis that I learned about in Maven Analytics Machine Learning courses, I thought it would be much simpler to create two metrics which combine multiple variables, as follows:

  • Value metric – allowing me to combine value related variables to differentiate between high and low value clients
  • Churn risk metric – allowing me to combine key churn risk variables to differentiate between higher and lower churn risk clients

Aim

Keeping the end user in mind (the board), I thought this approach of using summarised metrics had several benefits:

  • Focused on the key briefing points
  • Condensed a large number of variables i
  • Made it easy to digest, with no need to look at multiple charts and tables and try to relate and correlate.

Visualisation Strategy

Staying with the theme of simplification, I wanted to create a simple flow, from high level customer revenue values, through to a slightly more detailed look at the spread of clients between value and churn risk, and then finally look at some summary key reasons for churn and recommendations to the board.

The simplification was also an attempt to limit any clutter in the visual. Looking back, if I was critical, maybe the kpi / BAN visuals may have had a little too much information, but I think in general I managed to control the amount of detail in the final product.

I sketched up and outline or wireframe on how I conceptually wished to present the info – see below.

Outline / Wireframe

High – Low Big Aggregated Numbers (BAN)

The purpose of these were to give high level insights into the revenue split between high value and low value clients, and also further into staying, churning and joining clients.

I eventually added some additional information on account numbers and % values, but kept them small in size so that the focus was on the big numbers.

Scatterplots

The purpose here was to show a quadrant split between high and low value clients and those with high and low potential churn risk.

This could be overlaid with the coloured detail of staying, churning and joining clients to see which quadrants each type of client predominantly fell into.

Key Reasons

As I was using metrics that were combining many variables together, thereby making the original variables almost invisible, I thought it important to actually bring out a little data on at least the categories which contributed most to high churn risk.

This would also help form the basis of the recommendations.

Recommendations

The recommendations don’t need a detailed explanation, but the intention was to summarise what the BAN and scatterplots were saying, then make recommendations based on that, and taking into account the key categories contributing to churn risk.

Colour Association

To bring this all together visually, I decided to primarily rely on colour association to focus attention between staying, churned and joining clients. Having that intentional and consistent association across the entire visual means that there is limited need for legends, and the end user can easily scour the entire report for a particular category just by looking at that colour.

Gestalt Principles

As always, I would look to incorporate a few of these, such as enclosure, connection and continuity.

Technique

Ok, on to the juicy topic of technique and how this was actually executed. Firstly, all the work and analysis was performed in Power Query and Power BI.

Secondly, I took some inspiration from the Maven Analytics Machine Learning Modules 1 and 2, which I had taken in the months prior to this challenge.

Value Metrics

The first metric was the value metric. When looking across the variables, there were several that I could associate with ‘value’, namely:

  • Referrals – someone who has actively added more clients to our network
  • Tenure – a measure of loyalty and how long the client has stayed with our network
  • Monthly Charge – clients total monthly charge
  • Total Charge – clients total quarterly charge
  • Total Revenue – revenue that the network has received from the client from their entire tenure

I decided to go with Referrals and Tenure as two variables, but wanted to pick only one of the charge/revenue related values. I decided to go with Monthly Charge for the following reasons:

  • Total Charge was a quarterly figure. As the network had monthly contracts it is conceivable that there are clients that only joined in the last month or two, and their “quarterly” value would be skewed as a lower value than others.
  • Total Revenue was tempting, but I think the spread between the maximum and minimum was so large, there would be a big differential. My view was that as I would already be considering Tenure as a variable factor, I could defer to just considering the Monthly Charge.

Calculation

Ok, so using the three variables and restricting the calculation to staying and churned clients, I created a normalised value by doing the following:

  • Split the allowance between the variables in order of my perceived importance as:
    • 60% Monthly Charge (MC)
    • 30% Tenure (T)
    • 10% Referrals (R)
  • Created the following calculation in Power Query as per the following:
Power Query Steps for Value Calculation
  • Created a normalised version of this value (values 0 – 1) by using the standard formula below and the corresponding Power Query formula as follows:
Power Query Steps for Normalising Value Calculation

This provided a full set of normalised value ratings in a new column with a range similar to the sample shown below. The average shown here is 0.51, and the overall for the larger sample was approximately 0.52. This is the value I would use to classify whether a customer was higher or lower value.

Churn Metrics

This metric is again the normalisation of a sum of variables, however this one is a little more complex, as it includes several more variables, and the values used are the average values from the average churn:stay ratio.

Ok, that is a lot to consider, so let me break it down a little.

If you look at the overall split between churn and stay clients, it is approximately 28% churn to 72% stay.

Now, if you examine different variables you will see deviations from this value. For example, if you examine the contract type below you will see that although the total split is still 28%:72%, those clients with Month-to-Month contracts are actually 52%:48% in favour of churned clients. This represents a deviation of 24% from the average value, and suggests Month-to-Month Contracts as a significant factor in churn.

And, as such, the same for One and Two Year contracts suggests the opposite is the case.

Contract Type

If we substitute the churned % values as deviations from the 28% average, we would get

  • Month-to-Month = 28% – 52% = 24%
  • One Year = 28% – 11% = +17%
  • Two Year = 28% – 3% = +25%

The greater the negative value, the greater the likelihood that the variable contributes to churn risk.

Now, if we consider further variables, like the below split between married and unmarried clients, we can make the same type of calculation.

In the end, I used 16 variables as part of the overall calculation and aggregated the values to give an overall score. Below is the extract from Power Query.

Now as a caveat, I note that this is a lot of manual entry, and there is perhaps a more automated and efficient manner of calculating each number, but my jetlag symptoms meant I was more inclined to do this kind of manual calculation than do some additional research on an efficient M code series – I am open to feedback and suggestions on making this “smarter”.

= Table.AddColumn(#"Rounded Off", "Churn Risk Rating", each (if [Contract] = "Month-to-Month" then (0.2837-0.5169)
else if [Contract] = "One Year" then (0.2837-0.1088)
else if [Contract] = "Two Year" then (0.2837-0.0258)
else null)
+
(if [Married] = "No" then (0.2837-0.3669)
else if [Married] = "Yes" then (0.2837-0.2016)
else null)
+
(if [Payment Method] = "Bank Withdrawal" then (0.2837-0.3565)
else if [Payment Method] = "Credit Card" then (0.2837-0.1581)
else if [Payment Method] = "Mailed Check" then (0.2837-0.414)
else null)
+
(if [Offer] = "None" then (0.2837-0.2921)
else if [Offer] = "Offer A" then (0.2837-0.0673)
else if [Offer] = "Offer B" then (0.2837-0.1226)
else if [Offer] = "Offer C" then (0.2837-0.2289)
else if [Offer] = "Offer D" then (0.2837-0.2674)
else if [Offer] = "Offer E" then (0.2837-0.6762)
else null)
+
(if [Number of Referrals] = 0 then (0.2837-0.3611)
else if [Number of Referrals] = 1 then (0.2837-0.4734)
else if [Number of Referrals] = 2 then (0.2837-0.1135)
else if [Number of Referrals] = 3 then (0.2837-0.1296)
else if [Number of Referrals] = 4 then (0.2837-0.0773)
else if [Number of Referrals] = 5 then (0.2837-0.0824)
else if [Number of Referrals] = 6 then (0.2837-0.0374)
else if [Number of Referrals] = 7 then (0.2837-0.0247)
else if [Number of Referrals] = 8 then (0.2837-0.0097)
else if [Number of Referrals] = 9 then (0.2837-0.0177)
else if [Number of Referrals] = 10 then (0.2837-0)
else if [Number of Referrals] = 11 then (0.2837-0)
else null)
+
(if [Device Protection Plan] = "No" then (0.2837-0.4242)
else if [Device Protection Plan] = "Yes" then (0.2837-0.228)
else (0.2837-0.0841))
+
(if [Online Security] = "No" then (0.2837-0.4465)
else if [Online Security] = "Yes" then (0.2837-0.1495)
else (0.2837-0.0841))
+
(if [Online Backup] = "No" then (0.2837-0.4296)
else if [Online Backup] = "Yes" then (0.2837-0.2202)
else (0.2837-0.0841))
+
(if [Premium Tech Support] = "No" then (0.2837-0.4452)
else if [Premium Tech Support] = "Yes" then (0.2837-0.1552)
else (0.2837-0.0841))
+
(if [Streaming Movies] = "No" then (0.2837-0.3661)
else if [Streaming Movies] = "Yes" then (0.2837-0.3049)
else (0.2837-0.0841))
+
(if [Streaming Music] = "No" then (0.2837-0.3660)
else if [Streaming Music] = "Yes" then (0.2837-0.2989)
else (0.2837-0.0841))
+
(if [Streaming TV] = "No" then (0.2837-0.3641)
else if [Streaming TV] = "Yes" then (0.2837-0.3062)
else (0.2837-0.0841))
+
(if [Paperless Billing] = "No" then (0.2837-0.1793)
else if [Paperless Billing] = "Yes" then (0.2837-0.3523)
else null)
+
(if [Internet Service] = "No" then (0.2837-0.0841)
else if [Internet Service] = "Yes" then (0.2837-0.3348)
else null)
+
(if [Internet Type] = "Cable" then (0.2837-0.2752)
else if [Internet Type] = "DSL" then (0.2837-0.1997)
else if [Internet Type] = "Fiber Optic" then (0.2837-0.4213)
else (0.2837-0.0841))
+
(if [Tenure in Months] <= 12 then (0.2837-0.5987)
else if [Tenure in Months] <= 24 then (0.2837-0.2871)
else if [Tenure in Months] <= 36 then (0.2837-0.2163)
else if [Tenure in Months] <= 48 then (0.2837-0.1903)
else if [Tenure in Months] <= 60 then (0.2837-0.1442)
else if [Tenure in Months] <= 72 then (0.2837-0.0661)
else null))

Examining the results of a sample 1000 clients below shows a resultant minimum value of -2.29 and a maximum value of 3.08, with an average close to zero, which is what to be expected. The clients with the most negative value should in theory be more likely to be at risk of churn.

Now, instead of using this spread, I decided to normalise these values again, using the below formula in Power Query

Looking again at the results of a sample 1000 clients, the min and max were close to 0 and 1, and the average was around 0.43, which was actually the average for the entire data set.

Now this was done, the last step I made in Power Query was to re-categorise the Joined and Stayed client status based on these normalised values. If the normalised value was greater than 0.43, then it is likely that Joined or Stayed clients will remain and be “safe”. However, if it is less than 0.43 there is a higher risk that the client will churn, and would therefore classify as “@ risk”. Below shows this step:

Looking at two tables showing the extremes of the normalised values demonstrates the original and adjusted customer status based on the assessment. It shows that those clients with low scores who have actually stayed are probably at risk of churning in the future.

Low Normalised Values

At the opposite end of the scale, pretty much anyone who scored a very high rating was a staying customer and not at risk of churning.

High Normalised Values

Phew, that was a lot to consume and relive…..I think most was done in a haze the first time round, but that is the main grunt work complete.

One final data manipulation task was to do a little pivoting magic to help me create my Top 12 churn variables.

Top 12 Churn Variables – Pivoting

This was fairly simple. The first step was to create a duplicate of my table inside Power Query.

duplication customer churn table

Next, I selected all the variables I had used in my analysis, and unpivoted them using the below M code

= Table.Unpivot(#"Added Custom4", {"Number of Referrals", "Tenure in Months", "Offer", "Phone Service", "Internet Service", "Internet Type", "Online Security", "Online Backup", "Device Protection Plan", "Premium Tech Support", "Streaming TV", "Streaming Movies", "Streaming Music", "Contract", "Paperless Billing", "Payment Method"}, "Attribute", "Value")

This then gave me two columns that looked a little like this for each individual client.

The final step was to actually join these two values together (separated with a | )to create combined variable, like the below:

I could use this combination of variable and value to see which combination of each was most likely in churning customers.

Ok, so now we can get on to actually creating the visual – I am starting to think writing this blog is taking more work that creating the visual at the time!!!

Building the Visual

Background

Just to note that the immediate background and rounded transparent white border were created in powerpoint. This allows a few more effects to be created than you can in Power BI, and also means it is a few less components that affect performance. There are various Youtube tutorials on how to do this.

BAN – Kpi

The BAN Kpi cards contained several pieces of information, namely:

  • Total Charge (Monthly)
  • % of the Total Charge
  • Average Account Charge
  • No. of Accounts
  • % of Total Accounts
BAN / Kpi

That’s actually quite a lot of information. Obviously, the focus was on the Total Charge with a larger and more prominent font, but the other values were provided in case someone wanted to go back for a second look and deeper dive later on.

The DAX for the calculations followed similar patterns for each of the BAN and are a good contender for using “Calculated Groups”. In this instance, I was looking at High Value, therefore the Total Monthly Charge was filtered for clients with a normalised value rating higher than the average of 0.52.

Total Monthly Charge - High = 
CALCULATE([Total Monthly Charge],
FILTER(telecom_customer_churn,telecom_customer_churn[Normalised Value Rating] >= 0.52
)
)

After having all these calculations I was able to copy and paste to create this stack of BAN / Kpis

Scatterplot

Next came the scatter plot. The hard work was all done in Power Query. So this was simply a matter of dragging and dropping, and applying some simple filters as shown below, as well as applying the consistent colour coding.

You can see the effect that the normalised churn risk value has, where we see the vast majority of churning customers had higher than average churn risk

Splitting Churned, Stayed and Joined

Having completed these charts for the churned and stayed clients, it was a matter of applying the same format to the newly joined clients.

Examining this, it looks like there are a high number of new clients that have a high churn risk. This should sound a warning alarm to the board.

Reasons

Again, the hard work was done using unpivot in Power Query. So it was simply drag and drop, and filter for the top 12 items.

Here we see the power of that unpivot – we can look at all the variables in a single column and see which were most prevalent in churned customers. For example, Month-to-Month Contracts had 1,655 churning customers.

As mentioned earlier, this is a good additional chart to provide supporting information and context should the board want to know at a high level what the top churn risk variables are.

Recommendations

Last, but not least, are the recommendations. These were written to offer some potential steps to reduce the evident risk of churn. They were closely aligned to countering the top reasons for churn.

Final Product

Below shows the final product together. Apart from the above commentary, there were a few other aesthetic items considered which are worth mentioning in addition to the colour.

  • space – I made sure there was some “breathing room” between adjacent visuals and also between enclosed areas. This helps to reduce the visual burden and overload.
  • enclosure – I used enclosure to purposely segregate certain sections to reinforce the fact they were in a group. On a larger scale this would be the two large white boxes on the left, so they could be looked at and assessed in isolation. On a smaller scale you will see I used boxes in the scatterplots to show grouped values.
  • connection – following from the above, I used connecting lines to link text to areas within charts and visuals
  • alignment – to try and improve aesthetics I made sure the text, visuals and charts were aligned with each other to promote clean lines and help with keeping white space

So, that’s my lot. I think you might agree, this visual is a little like a duck in water – it looks quite calm and relaxed on the surface, but there is definitely a lot of paddling going on below the surface.

I was extremely happy to make the final selection again, and even more pleased when I was announced as the winner, becoming a two time winner, and achieving it in a solo capacity this time round.

If you enjoyed this blog, or have and feedback on it’s content or on my methods, please let me know., either here or in my second home (LinkedIn).

If you think an accompanying video might be useful, also hit me up. If I have enough requests, I will try and fit it in.

Maven Challenge – Net Promoter Score and Likert Analysis – Airline Survey

Another Maven challenge, another blog post!

However, rather than a capturing a summary of a particular strategy, this one is on the development of a set of visuals that are well suited to represent the results of survey data and find the categories with the largest proportions of positive and negative attitudes.

The Challenge

The overall challenge was to analyse a survey response data set from around 130,000 passengers who have travelled on Maven Airlines to determine what areas were contributing to the satisfaction rate falling below 50%, and what data-backed steps could be taken to make improvements.

The Data

The data contained various personal information on each passenger, as well as details about their journeys and class of travel.

In addition there were a number of questions which ask to rate particular services on a scale of 1 to 5, with 1 being the worst service and 5 the best (see below extract from the data dictionary).

This type of survey query is typically known as a Likert Scale survey question, named after Renis Likert, a famous social scientist. Here a 5 number scale has been used, but you may also encounter 7 or 10 number scales. You may also come across non-numeric versions (e.g. “very likely” to “not likely at all”).

These types of scales allow for a little more nuance in the sentiment surrounding a survey response where the answer may not be a binary yes/no.

Analysis of Likert Scales

I had read about Likert scales and survey data when working on the remote working challenge. Although I didn’t fully use it there, I was now keen to learn more about it and look at ways it could be visually represented.

One method I had previously come across in my endless trawling of Youtube tutorials and browsing other challenges such as Workout Wednesday and Makeover Monday, was to show the results in a horizontal bar chart, and show divergence between positive and negative results.

Below are a few different ways of presenting data using the scales. This ranges from a simple stacked bar chart, to basing it around a neutral zero value, to extracting neutral values, and all the way to a full deconstruction.

Each method has it’s own particular advantages depending on what you are trying to present, but they are effective at showing the relative distribution of a spectrum of responses.

Net Promoter Score

A child of the Likert scale is the Net Promoter Score (NPS). Strictly speaking, this is used on survey results with scales of 1-10 (see below), and it is calculated by subtracting the % of promoter scores (9-10) from the % of detractor scores (0-6).

If you return a positive result, then people are more enthusiastic about that aspect of your brand or company, whereas if it is negative, that could be damaging to your company.

What I wanted to show

Taking this into account, I wished to show a version of the bar chart type analysis of the Likert data shown above, along with the corresponding NPS for each category – basically a tidied up version of the below.

However, in this case we only have a scale of 1-5. Therefore I opted to re-scale the analysis based on the following:

  • 1-2 = Negative
  • 3 = Neutral
  • 4-5 = Positive

Again, strictly if I strictly followed the NPS chart it probably should be 1-3 as negative, 4 as neutral and 5 as positive, but I would maybe argue we are looking for “satisfaction” rather than “evangelist promoters”, so seeing 4-5 as positive is good enough for me.

Research on building the Likert bar chart and NPS

I did some “research” ( aka Google and YouTube searching), on how to put together these charts, specifically in Power BI. Although I didn’t search for hours, I didn’t manage to find anything useful as a guide.

Step into the rescue, two invaluable sources of information:

  1. The Big Book of Dashboards (Wexler, Shaffer and Cotgreave)
  2. Andy Kriebel’s Tableau YouTube video

The book provided some great advice on visually what does and doesn’t work in these types of charts, and the video provided an overview of the steps and measures to allow me to build the charts.

Apply the techniques to Power BI

Now it was a matter of translating the textbook advice and Tableau tutorial over to Power BI desktop. This was done in several steps.

1. Power Query Work

First I imported the dataset into Power Query, then created a duplicate of the table for this particular work.

Next, I selected the column headers for all the columns containing the Likert categories. After that, I navigated to the Transform ribbon at the top, then selected “Unpivot Columns”, and then “Unpivot Selected columns.

This resulted in the below layout. There is a column called “Attributes”, which relates to all the Likert categories, and another column called “Values” which are the corresponding Likert scale numbers – effectively all the categories and their corresponding values are in two columns.

While I was in Power Query, I created a “sentiment score”, which was -1 for “neutral or dissatisfied”, and +1 for “satisfied”. I never used this in my final report, but was using it in some experimenting small multiple visuals shown below, effectively showing the 1-5 rating in each category, but mirroring the satisfied and unsatisfied passengers.

I found it showed some interesting patterns, but it took quite a close inspection to glean those patterns, therefore I discarded it for my final report, in favour of the stacked bar chart.

2. Measures

Now that the rejigging of the data was done in Power Query, I moved on to start creating the measures that would help me build the below draft visual. This included the:

  • Count the positive scores (4-5)
  • Count the negative scores (1-2)
  • Count the neutral scores (3) – this was done in two parts as they straddled the zero line
  • % positive
  • % negative
  • % neutral
  • NPS

Counts

The fairly simple count DAX measures were as follows:

Note that the below calculation is multiplied by -1 so that is will appear in the negative side of the y-axis.

Next, as mentioned, for the neutral scores, these straddle the zero line, therefore the calculation is split into two, one for the positive side, and one for the negative side.

All Selected Attributes

In order to transform these count measures into % values, I need a denominator. Typically, you could use an ALL() DAX function so that you are measuring as a portion of the total of a given column, but as I have created an unpivoted table with multiple categories, it requires something a little more refined.

Instead of ALL(), I opted for ALLEXCEPT() with several columns referenced – see below:

Percentage Values

Now this was done, all that was left to do was to create a few simple DIVIDE() functions to finalise the percentage values – simply dividing each count by the All Selected Attribute:

again, remember we need two values for the neutral (positive and negative).

NPS

Finally, after all those measures were created, I can now come to finalising the NPS. If we relook at the below calculation, I recreated a version using my measures. (Note that I add my positive to negative scores, as my negative are actually negative numbers)

Rank NPS – (Bonus Calculation)

I almost forgot, there is a bonus calculation that I needed to order my Likert chart by the NPS. I used the RANX function to rank the attributes by the NPS is descending order, as shown below. I will show you a little trick on how to apply this in the next few paragraphs.

3. Building the Visuals

Ok, now onto the applying all those measures into something visual.

First I select the stacked bar chart. Then I add Attribute to the X-axis. Following that, I add the 4 % measures in the order shown below. It looks a little strange, but the add the negative value first, going from the least negative to most negative, then add the positive values after that.

Finally, drop the Rank NPS in the tool tips.

By adding the Rank NPS to the tooltip, it allows you to select it as a sorting option for your chart. To do this, select the ellipsis (…) at the top right of the chart, then select the sort axis as shown below. This will allow you to rank both your Likert chart and NPS chart by the NPS scorings, and this allows you to have common y-axis categories across both charts.

The NPS chart is simply put together, adding the Attribute in the Y-axis and NPS Grouped score in the X-axis, as shown below.

Once the charts have been built, it is a matter of applying your preferred colour palettes, fonts and formats. in the formatting pane.

4. Final product

Once I was happy with my general visual content, I brought them into my final dashboard, and applied final touches to axis labels, legends and covering titles shown below.

I then applied this set of visuals twice; one time for analysing business class passengers, and another for analysing economy and economy plus passengers. This formed my most detailed analysis in the report, allowing me to drill down in detail from those initial large percentage numbers to finding the issues that were affecting passenger satisfaction.

Let me know in the comments below, or feedback on LinkedIn if this kind of content was of use, or if there were other ways you may have gone about executing this.

Thanks for reading.

Maven Challenge – Unicorns – Interviewing the Data

Maven’s data challenge for April-May 2022 involves the analysis of a set of data related to “Unicorns”.

A few of my work colleagues saw my LinkedIn post related to the announcement of this challenge, and thought I was looking at data related to the horsey-type unicorns, and they were pleasantly amused at the thought.

Needless to say, they now understand what the term means in a business sense (as I had learned what it meant a mere 48 hours beforehand).

Anyway, this is a shorter than usual blog post related to my initial approach to this challenge, and again links back to an earlier post I made in LinkedIn about a technique I picked up from Ben Jones in a SWD podcast he had with Cole Nussbaumer Knaflic.

In it, he described how we can take on the role of a journalist, and list out the questions we would like to “ask” the data in order to draw out information that you could use to build a structured story or article with real depth of meaning.

Step 1 – Looking at the type of data

I did not wish to look directly at the data initially in case it skewed or biased my view on the questions I might be able to ask. This is where a data dictionary comes in very handy. It helps give an overview and context for the subject without giving too much away about the content.

The below is the data dictionary provided with the dataset.

FieldDescription
CompanyCompany name
ValuationCompany valuation in billions (B) of dollars
Date JoinedThe date in which the company reached $1 billion in valuation
IndustryCompany industry
CityCity the company was founded in
CountryCountry the company was founded in
ContinentContinent the company was founded in
Year FoundedYear the company was founded
FundingTotal amount raised across all funding rounds in billions (B) or millions (M) of dollars
Select InvestorsTop 4 investing firms or individual investors (some have less than 4)

From an initial review, I divided the types of data up into several categories which could help me form a list of questions

  • Dates – These can help form time and trend related questions
  • Money – These can form queries on average values, return on investment and growth
  • Geography – These can help make comparisons at several hierarchical levels
  • Companies – These can assist in analysis at both high and detailed levels from both and company and investor viewpoint

Now this initial breakdown was complete, I could move on to developing a list of questions I could potentially “ask” the data and form the crux of my analysis.

Step 2 – Ask the Questions Already!

OK, so here are a shortlist of questions I made, along with some “subquestions”. I explored and used some of them in my final design, but left them all listed here in case they help fire the imagination of anyone else in their journey.

Questions Related to Date/Time

  1. Are unicorns a recent phenomena, and is there any trend over the past number of years in the number of unicorns being created? How is this trending with the overall economy?
  2. How long does a company typically take to become a unicorn? Is this impacted by the type of industry or country? Is there any apparent trend over time?
  3. Are there certain industries which have become more prevalent in recent years as opposed to say 10 or 20 years ago?

Questions Related to Companies, Money, Funding and Growth

  1. Which industries make up the majority of unicorns and have the highest average valuations?
  2. Which industry or country is seeing the most funding from venture capitalists, and are there any particular industries providing high capital investment return rates?
  3. Which industry or company has experienced the highest or lowest value annual growth rate since joining the unicorn club?
  4. Are investors typically attracted to particular industries, or do they “spread their bets”?

Questions Related to Location

  1. Where in the world (cities, countries or continents) are the highest prevalence for unicorns? Has this changed over time?
  2. Is there any connection between these locations and the type of industry?
  3. Where in the world has the highest value unicorns or those that provide the highest investment return rates?
  4. How do unicorns measure up against % GDP of GDP per capita in particular countries? (might need some additional data for this one!)
  5. Are unicorns confined to historic world powers, or are emerging markets and countries taking a lead?

Step 3 – Start your analysis

It would maybe not be wise to attempt to answer all these questions, but maybe focus on a selection that you think will tell a story to “illustrate the current landscape of unicorn companies around the globe“.

I will leave you here with at least an alternative technique to approaching these challenges, and hopefully also a little inspiration or some ideas to start or evaluate your own approach.

Below is my final result of answering at least one or two questions from each section.

Maven Challenge – Remote Work – Strategy Execution

Before we start….

There was quite a bit to this challenge, so I am covering it in one blog and one video. This one will look at the strategy and execution of the planned steps I laid out in a LinkedIn post and in the below photo shot; the separate video looks at the technical aspects of cleaning and manipulating the data, as well as assembling the visuals.

For anyone who has stumbled across this blog accidently, and is not familiar with the challenge, details can be found here.

So let’s begin!

As well as Maven providing the raw data from the remote work survey and some background on the contents, the below is what I took as the brief for the challenge, with the crux purposely highlighted in bold by the Maven team.

For this challenge, your task is to assess the productivity and morale implications of working remotely and outline an ideal policy for the post-pandemic future, presented in the form of a single-page report or dashboard.

Maven Remote Challenge Blog

I used this as the basis for my overall plan for the challenge, which I quickly drafted up using the old school pen and paper method.

Now that was done, it was time to put the plan into motion….

Part A – Considering Outcomes and Goals

When reviewing the brief, I asked myself a few questions that would dictate my outcomes and goals.

  • Why am I actually creating this report / dash?
  • Who is it for?
  • What decision needs to be made as a result of it?
  • What can I do by way of design to facilitate that decision process?

Apart from wanting to compete in the Maven challenge, I am creating this report to summarize a large amount of survey data and analyze how remote work impacts productivity and morale among workers.

In terms of “Who?” and “What decision?”, I considered my key stakeholder to be the senior management at a generic company, and anticipated that they would require a high level assessment with a summary of the positive and negative affects of remote working on mainly productivity, but also morale, and how this would shape how employees felt about remote work. This would in turn help the company decide on what plans they would need to enact in a post-COVID world.

After mulling over these questions and answers, it was fairly straightforward in my mind that my overall outcomes would include:

  • Outlining the basis for an ideal future remote working policy for a company
  • Consider any implications (positive or negative) on worker morale and wellbeing
  • Measure how productivity was impacted during remote working
  • Summarize the above in a high level explanatory report

These aspects, as well as the fact the requirement was for a single page static report/dash, meant that firstly, I would be analyzing at a high level rather than a granular level. The only reason to go to a granular level would be to analyze a specific industry or demographic, which I did not consider part of my brief (rightly or wrongly).

Secondly, I would also be looking to produce and explanatory set of visuals rather than an exploratory set.

What’s the difference?

Exploratory vs Explanatory

The storytelling with data blog will define this more eloquently than me, but basically an exploratory report would allow the user to manipulate data through filters and slicers and search for their own answers to questions they may have. For explanatory reports, someone has already done the “exploration” on the data, and is presenting what are hopefully the key points with supporting data and visuals.

Part B – Metrics and Comparisons

Now a general approach has been defined, I looked at potential comparisons and metrics that would be used to determine company policy. I needed to keep in mind that if a company is going to base their policy on these, they need to be robust enough to allow a decision to be made.

You may think it’s strange as I haven’t mentioned looking at the data yet, so how would I know what metrics or comparisons I will be able to make?

While this is true, I believe it is good to have an idea of what the most appropriate metrics could be in order to fulfil your brief, even before you look at the data in any kind of detail.

Productivity

I know I have two data sets, one for 2020 and one for 2021. The obvious metrics would be to try and measure productivity in each separate year, and then do a year-over-year (YoY) comparison. If data is also available, I could also look at a further comparison with “normal working” productivity. This will tell me whether remote working has had a positive or negative effect and whether that was sustained between 2020 and 2021.

Morale

Similar to productivity, I would look at changes in any issues that may affect morale or working conditions between “normal work” and the periods of remote work in 2020 and 2021. Again, if possible, I would like to see how the stats change over time.

General Opinion

I am guessing that the survey will ask people’s opinions of working remotely, and whether they are generally in favour of it and whether they would like to retain it as a benefit long term. If so, I could measure any change over time, or the ratio of people in favour versus those not in favour, and this could also feed into the overall future policy to be developed.

Part C – Reviewing the Data

With ultimate goals and an idea of the type of metrics I feel would fulfil the brief, it was time to look at the data itself.

This was a little dauting at first. It wasn’t the fact that each survey had around 1,500 respondents; rather it was that the 2020 survey had 73 questions, and the 2021 survey had 109 questions. Add to that, many of the questions were quite lengthy, and were the headings of each column. This made it quite difficult to know exactly what you were looking at!

Quick solution

My quick solution for reviewing? Copy and transpose the head row from each file into a new table.  This makes it much easier to read, and easier to see how many of the questions are actually grouped together.

It also allowed me to add a few columns to each table aligned to the metrics. As I reviewed each years’ questions I could tick whether I thought particular questions may be related to assessing morale, productivity or an overall policy. Having this list also came in very handy as a reference when the data was eventually loaded into Power BI.

As an example of the transposed table, below is a sample of a few questions from the 2020 and 2021 files that would touch on productivity.

2020 survey – productivity
2021 survey – productivity

Also, to assist anyone that may try out the challenge, I have added a copy of a blank file that could be used to conduct this exercise.

Once I reviewed and listed out all the questions I thought were applicable to my goals, I was happy to move onto the next step and start some provisional review and analysis of the survey results.

Part D – Pull in Data and Develop Plan

There were two distinct csv files for the surveys, and I loaded them into Power Query and Power BI. The respondents in each survey are not linked, therefore there was no immediate need to link the files via a model.

On reviewing the questions and answers for each data set, I narrowed my focus to a limited set that I could build a plan around. This allowed me to create a storyboard and then further detail each aspect – all of which is laid out below.

Storyboard

I usually try to sketch together a outline plan or storyboard when I have reviewed the data and have a general idea of what I want to show and how I want it to structure and flow. There are usually a few changes along the way, as you will see in the final product. However, you will note here that I wanted to show a main headline to capture productivity, morale and the policy, and then expand on each aspect in separate sections.

Productivity – Self Assessment

I wanted to calculate the productivity ratings in 2020 and 2021 and compare. Both surveys had a similar question to ask how the employee rated their own productivity when compared to working in the workplace. As such, I could use it as an effective comparison.

The answers were categorised in a large number of % values displaying degrees of more, less or the same productivity. I decided to combine and reduce the number of categories to 5, as shown below in my sandbox draft version. I also wanted to show the % change for each category to emphasize the increase in productivity between the two years, also shown below. Text could be further added to further summarize and explain the point.

Productivity – Management Assessment

As the above were self-assessments, I wanted to develop the next level of narrative and to see if managers saw the same impacts on productivity. I was able to find a survey question directed to managers, to see how they rated the productivity of their own employees.

Productivity – Occupation

To complete my narrative on productivity, I wanted to go slightly more granular. The results so far were showing generally improved productivity, but for balance I wanted to see if there were particular occupations that were more likely to experience lower productivity. This could allow management to then potentially request further analysis if their companies had employees that had occupations with low rates of productivity.

Morale – Barriers

There were no questions in the surveys which explicitly dealt with morale; however, I noted that in 2021 there were a set of questions related to potential barriers to remote working, and whether people had improved, worsened or largely the same experiences for each barrier when compared to normal working. These barriers included issues such as:

  • motivation
  • isolation
  • health
  • collaboration
  • poor management

I took these as being related to morale. I consolidated these queries using an unpivot technique in Power Query, and again consolidated the ratings to a simple three ratings of, “worsened”, “same/no barrier” or “improved”. This allowed me to quickly see how each issue affected employees in 2021.

To supplement this, I noticed there were additional related standalone questions related to isolation, health and wellbeing. My plan was to bring these all together in an extended visual with a supporting narrative (see below draft result). It was also an opportunity to try out Gestalt’s principle of connection by boxing around some of the bar chart graphics in an attempt to connect them.

Sandbox sample graphs for morale

Morale – Working Times

From looking at the data and seeing that people were more active and feeling better, I thought then next level would be to examine the typical working day to see what impact remote work was having on people’s general activities.

There were a set of questions in 2021 related to hours people were assigning to daily tasks. I note there were similar questions in 2020, but there were significant errors in the arithmetic, which meant I could not make an equivalent comparison, and therefore discarded them.

I again performed an unpivot of columns in Power Query to consolidate the data, and found an average significant drop in time required for commuting (expected), and an increase in time for personal and family time, which you could potentially relate to increased wellbeing and morale. This was in spite of people actually working slightly longer workday on average.

Policy – Employee Expectations

Finally, I thought the policy section should provide the story arc to bring the productivity and morale factors together. I came across a question that I thought could knit it together – This was how much time employees wished to work remotely in the future.

Again, the answers were categorised into % values in intervals of 10, with some text answers also. I decided to consolidate these into day values to give a more simplified and easier to read view. This would show management that there was significant appetite from employees to work remotely, but that there was also a large number of people who wanted very minimal or no remote work. I highlighted this in red to purposely draw attention, and show there was a potential need for flexibility.

Policy – Retention and Attraction

Finally, and to again supplement the above with closely related data, there were questions to managers as to whether they believed offering remote working would improve employee retention and also attract new talent. Instead of using charts or graphs for these types of supplements, I decided to just use aggregated numbers and supporting text.

Stitch it all together

Once I had completed all my visuals in separate sandbox tabs, I copied them into a combined report/dash, which resembled the sketch I had shown earlier as my initial storyboard.

My initial draft was complete, and I added some summary headings to each section to highlight the key take-aways from each set of charts.

I was reasonably happy with the draft, but I wasn’t quite finished just yet.

Part E – Review Outcomes and Goals

I felt that individual components of my goals had been achieved by each of the visuals I had included, as well as how they were grouped.

However, as part of my review, I took a tip from storytelling with data and Maven’s thinking like an analyst course, and used the opportunity to get feedback from a member of the Maven community, and also one of my subscribers to this blog (Marjolein Opsteegh).

After working with the data on and off for a few days, the intention was to step back and get a set of fresh eyes to briefly study the collection of visuals and see if it was relatively understood, and whether it flowed naturally and captured the brief.

Marjolein was generous enough to take the time to give some constructive feedback and criticism, which is a real learning process I actually enjoy. There were some general layout and formatting tips, but the crucial comment was to bring the overall outcomes and policy to the fore, and put them top and centre of the dash, rather than at the bottom which I had in my draft.

Part F – Fine Tune and Submit

After the review, I rearranged the running order of the dashboard, which made the summary and overall policy the first thing you see. I then followed it by productivity and morale, with some slight formatting on the headers.

Overall, I was quite pleased with the result, and was happy to apply some of the techniques I had been reading about in Cole Nussbaumer Knaflic’s storytelling with data book, a highly recommended read.

One change I might like to make after having the chance to look at it for a few days however, might be to increase the overall size of the visual, and provide a little more white space to give some of the visuals a little more breathing space from each other.

[UPDATE: As noted at the intro – I opted to support this blog with a walkthrough video of creating each of the visuals, especially those that used the unpivot and consolidation functions in Power Query] – for further work on Maven Challenges, stay tuned on LinkedIn, follow my twitter account which is notified when I write a blog, or alternatively just follow this blog.

Any feedback or queries, please drop them in the comments below – Thanks!

My Data Analysis Portfolio

Introduction

From reading a lot of data viz experts and tutors comments on platforms such as LinkedIn, one recommendation that keeps popping up is that any aspiring or practising “visualizer” should maintain a portfolio of work to showcase the skills and breadth of work that they may have.

I have taken that advice, and put together this portfolio style blog. Most of the visualizations that I create in a professional field are not something that I can necessarily share in the public domain, therefore I am opting to showcase the fun challenges and trials I have taken part in over the last 6-8 months.

These mostly include the Maven Analytics challenges, and are mostly performed in Power BI, although there are a few Tableau examples in there too. There are other examples from Onyx and Dataworld.

I have decided to post the visualizations in reverse chronological order for each section, to (hopefully) show some kind of progress in technique and presentation.

I will keep this “live” as a repository for my public displays.

Maven Challenges

The details for all these challenges can be found here.

2022

Maven Space Challenge – Current Entry

This is the current challenge in August and September 2022, and relates to analysing space mission data from the 1950s to present, and coming up with a visual that captures the awe of space travel. I felt that the data required some supporting context to bring alive the “awe” of travel, and focused on missions which either visited other planets within our solar system, or journeyed beyond our realm.

As such, I experimented in some visuals to try and bring out that story. I took a little inspiration from the Beastie Boys, with the title of the visual, and the fonts taken from their album (Hello Nasty). Below is my entry, and my LinkedIn submission post.

Maven Telecoms Churn Challenge – Winner

This challenge was in finalised in July 2022, and related to a set of customer profile data for a fictional telecoms service provider in California. The challenge brief was to identify high value clients, examine churn risk, and look at what steps can be taken to retain clients.

This analysis involved creating a set of normalised metrics to look at both value and risk, based on the profile characteristics of the clients. I have done a more detailed write up on my approach here.

This entry was the winner out of over 300 entries, with the Linkedin notification here, and the judging panel video here.

Power BI – Maven Churn Challenge

Maven Airlines Challenge – Finalist

This challenge ran from May through to mid June 2022, and was in relation to a fictional airline which had crossed the line of having more than 50% of passengers feeling unsatisfied by their experience in travelling with them.

The challenge was to analyze around 130,000 survey responses which included Likert type data, and investigate key areas of improvement which could help get the airline back on track with their passengers. Here is a link to my LinkedIn post, and a snapshot below of my work performed in Power Query and Power BI.

Power BI

Maven Unicorn Challenge – Joint Winner

This challenge was between April and May 2022, and used a summary dataset looking at the valuation, funding, location and investors involved in global unicorns – the $1bn+ privately owned companies, and not the other fantasy figures!

The brief was to illustrate the global landscape of unicorns, and below is my proposed report, which was posted on LinkedIn as usual.

Maven Remote Work Challenge – Finalist (2nd)

This challenge took place during March 2022. It involved analysing the results of two comprehensive surveys which posed over 100 questions to in excess of 1,500 people on the subject of remote working both during and after the recent COVID pandemic lockdowns in NSW, Australia.

I had to provide advice on a proposed remote working policy in a post-COVID world, and translate the qualitative data from the survey to provide quantitative insights into what effects remote work had on productivity and morale. Below is my final report, which was also posted on LinkedIn. I note that I also used this visual in the March SWD challenge.

Power BI

Maven SuperBowl Challenge – Finalist

The challenge ran from January to February 2022, and involved analysing historical SuperBowl advertising data in order to propose an upcoming advertising strategy for an up and coming car company. My original posting is on LinkedIn.

Power BI

2021

Maven Magic (Harry Potter) Challenge – Finalist

This challenge ran from December 2021 to January 2022. It involved reviewing the film scripts for the Harry Potter movie series and coming up with a way to visualize the “magic” of the movies.

My submission scored as a Finalist on this challenge (my third in a row), with the critique for my submission recorded here by the team at Maven Analytics. I also created a few of my own videos showing some of the techniques I used in my data prep and analysis.

Power BI

Maven Taxi Challenge – Finalist

This challenge ran from November to December 2021, and involved following a set criteria on what needed to be presented. This was a real challenge from a data prep point of view, as it encompassed in the region of 27 million line items of journey data.

Because of the complexity of the data load and set of data cleaning steps, I created a video showing my strategy and steps, which has proved to be relatively popular. I made the finals for the second time, which I was very pleased with, and took into account the comments provided by the Maven team.

Power BI

Maven Restaurant Investments Challenge – Finalist

This challenge ran from October to November 2021. It involved assessing customer and restaurant related data in several cities in Mexico to assist investors in selecting a location and type of restaurant that would prove popular.

This was around the time I was experimenting in Tableau, and decided to use it for this challenge. A good advantage of Tableau, is the public platform allows everyone to visit and fully interact with each visualization or dashboard.

I was pretty chuffed and surprised to make the finals for the first time, and I learned that the use of key questions and summary take-aways were powerful techniques in laying out a presentation.

Tableau

Maven Olympics Challenge – Entrant

This challenge ran around the time of the Tokyo Olympics, and was my first foray into the Maven Challenges, after having taken their Excel courses and a few Power BI courses. Instead of using all the data from both the summer and winter games, I decided to focus on only female participation during the summer games only.

Looking back on it now, although I tried to provide some structure and flow, it was quite busy and maybe included too much information. I was quite interested to see the difference in participation between communist and non-communist countries during the cold war.

Looking back, I should have honed in and explored that in more detail. But it was a good introduction, and spurred me on to learn more.

Power BI

Onyx

I have followed Onyx challenges for several months, but finally decided to join in in June 2022, starting with the Forbes Billionaires challenge.

Forbes Billionaires

This base requirements of this challenge was reviewing the raw Forbes data and providing a summary of the greatest philanthropists, as well as the industries which had the most successful billionaires. Here, I focused on those base requirements only, and aimed for the feel of a Forbes type article, giving a high level synopsis rather than a deep dive. The work was done in Power BI, and the LinkedIn post is here.

storytelling with data

Being an avid listener of the storytelling with data podcast and the book by Cole Nussbaumer Knaflic, I joined their online community to take part in their monthly challenges. These are more “sprint” like challenges when compared to Maven, and are good to hone in on particular aspects of improving visual communication.

February 2002 – Focus and Declutter

This challenge can be found here. The goal here was to take busy and potentially confusing data visual, and apply techniques that are taught by Cole in order to declutter the noise surrounding the visual, and attempt to focus in on the story behind the graph. Removing unnecessary components, and appropriate use of colour to focus, I wanted to be able to take the following away within 5 seconds of looking at the graph.

  1. Philadelphia is the only city to have a higher crime rate than 1990
  2. Other cities generally have seen reductions in crime rates
  3. New York is the stand out in terms of overall reduction.
Power BI

Dataworld

I had started to look to take part in the Makeover Monday data viz series, but as soon as I started, it ceased to exist! I had thought I would have time to go back and try out the old challenges, but just haven’t had any spare time lately, so there is only one example below.

Workout Wednesday has however taken on the mantle, and I intend to work on some of the Power BI related challenges, which are sometimes a challenge to recreate a Tableau visual.

The datasets are usually quite small when compared to the Maven challenges, so in theory they should be used for “quick” practise.

The Dark Web

This challenge involved looking at the price index for various illegally obtained goods on the dark web.

I used this challenge to practise the use of parameters in Tableau, which I thought was put to relatively good use when I try it out in Tableau Public, as well as trying a kind of “dark mode” background with light coloured text.

Tableau

Other Miscellaneous Work

Social Media

I created some template Power BI dashboards to present the interactions of a fictional company on LinkedIn, as well as Twitter. I used this mostly as practise for Time Intelligence functions. I created an accompanying video and blog, which allowed people to recreate it, and it has been relatively popular.

Power BI

Weather Data

This was quite a simple dashboard put together in Power BI to look at data collected by a weather balloon travelling over Germany. I used this primarily to practise working with geospatial data as well as grouping scatter charts.

Power BI

Sankey Charts

This was in some way related to my work life, as I used it as a pilot to then present at work.

This is a visual rather than a full dashboard, but I had thought that Sankey Charts would be a good way of visualising the change in risk status before and after mitigating steps were applied.

I also learned how to use particular DAX functions to create the necessary data table structure to utilise the Sankey Chart visual in Power BI. Again, I created an accompanying video and blog, which I still see people interacting with most days.

Where it all began!

This is where it began about 9 months ago in mid 2021. As I was taking the basic Tableau course with Maven Analytics, I began to experiment with some data I picked up. In this instance it was the great debate of Ronaldo vs Messi.

Looking back now, I can see plenty of little touches I could make to improve or other ways I could display this, but it is also a good reminder of the journey taken, mostly with Maven, in the last 9 months!

Tableau

So, if anyone is looking for inspiration to create their own portfolio, or see what improvements you can make to your visuals inside a few months, hopefully you can find it here.

Good luck, and keep vizzing!

Maven Challenge – Superbowl Commercial – My Plan and Strategy

After submitting my Power BI report on LinkedIn as part of the Maven Super Bowl Challenge, I was messaged by a few people asking if I could put together an article or video blog on how I created it. So, here I am!!

Rather than jumping onto the snapshots of the Power BI platform and going through the process of putting the visuals together, I thought I would take a few steps back, and go through the plan and strategy I now go through before attempting these types of challenges, as without that you can often create a confused message. I pick up new lessons with every challenge, and find it an excellent form of self-development.

Final Submission

My Key Steps in Planning and Strategizing

For this particular challenge, I broke my process down into the following steps, but they are equally applicable to other work I do:

  1. Read the Brief
  2. Consider the End User
  3. Summarise Scope and Objectives
  4. Review the Raw Data
  5. Consider available metrics
  6. Develop a story or flow
  7. Sketch your layout and structure
  8. Execute, Review and Polish

Step 1 – Read the Brief

The challenge noted the following:

We’ve just added a brand new data set to the Data Playground, containing data from Super Bowl commercials for 10 popular brands this century

and

For this challenge, you’ll be assuming the role of Marketing Analyst at Maven Motors, an up-and-coming US car manufacturer looking to make a splash in the market. They have approved the budget to run a TV spot during the 2022 Super Bowl, but need you to analyze historical data to help guide the creative direction.

Your task is to recommend a data-driven strategy for the Maven Motors Super Bowl spot, and present it in the form of a single page report or dashboard.

They are asking you for two things, namely to “analyze the historical data to help guide the creative direction…” and to “….recommend a data-driven strategy for the Maven Motors Super Bowl spot.”

So, we need to keep these two points in the forefront of our mind as we progress.

Step 2 – Consider the End User

Perhaps the most important step and question to ask yourself – who am I creating this report for and how do I want them to feel when they look at it? This will dictate the style, format and presentation.

For example, the Harry Potter challenge or the Jordan vs LeBron challenge would encourage the use of colours, graphics and specialised visuals to engage the general viewer and add an element of fun and excitement.

My Harry Potter Magic Submission

However, as this challenge is aimed at the decision makers within a large corporation, the style and format should be very different. For this, if I imagined being in their shoes, I would like to see the following type of summary report or dashboard:

  • Structured – easy to read and flows well
  • Professional – the product should align with the standards and values of the company
  • Related to the brief – no need for superfluous information that distracts or wastes time, something I don’t have a lot of!
  • Comprehensive but to the point – only pick and present the key findings
  • Able to invite a decision – use the data and recommendations to enable an easy decision to be made

Step 3 – Summarise Scope and Objectives

Following Project Management 101 techniques, I then documented my scope and objectives. This may seem like a duplication of the above, but is helpful as a reference to always come back to. For this reason, I usually write this out and have it sitting to the side to glance at as I work on the final solution:

Scope Notes

In case you can’t read my writing:

Scope

  • Provide a 1 Page Report
  • Analyze Historical Data
  • Provide Recommendations for a data-driven strategy

Objectives

  • Enable board members, who are “time-poor” people, to have sufficiently succinct information in order to make a key decision on the direction of the marketing strategy
  • Be clear, concise and professional
  • Enable board members to be confident in your analysis (i.e. structure, methodology, analysis all appear credible, therefore we can put a measure of trust in the recommendations).

Step 4 – Review the Raw Data

It’s only now that I actually come to open the data files and look to see what is in there.

I found that, having performed the above steps, when I started looking at the data I was immediately considering whether this particular type of information or column was relevant to my brief or objectives.

I will try and briefly go through the my main thoughts when reviewing the data (below snapshot for 2000 and 2001), and how it related back to the scope, and what I would look at for more detailed analysis.

I would note at this point that this usually takes a few days, as I often like to let it ferment a little while I sip on a flat white or two.

2000-2001 data snapshot

Data Not Considered Relevant

Here’s point number oneNOT ALL DATA WILL BE RELEVANT.

Therefore, it was important to put aside any items that are not particularly useful in relation to the scope of works. Looking at the above table, these are highlighted in light blue.

Individual Brands

You may argue that the brand is important, but I would argue otherwise.

What’s the difference between Coke and Pepsi, or Bud and Bud Light in the context of this data? I don’t think this data set is going to tell you especially with the individual sample sizes.

What may give insight is the type of brand (food, cars, alcohol, etc. which I will address later). So for me, I took the decision to ignore individual brands.

The links themselves have no data analysis value for me here. Their only use may be to look at the actual ads, or provide a link in your dashboard or report. Therefore, I ignored these.

Estimated Cost

You may think- Wowwww! How can you ignore cost??? Well, I am not totally ignoring cost, but the total cost provides little insight, as there is the variable of ad length to consider.

The longer the ad, the higher the cost, right?

So, I checked if there is a correlation with a quick cost/time calculation. Guess what? The $ cost per second is constant for each year. Therefore I ignored the total cost, and focussed on the $ cost per second instead.

YouTube Likes

Again, you may wonder why I might be interested in YouTube views, but not likes? Well, this is my personal insight. I watch a lot of YouTube, and I enjoy a lot of what I watch, but I rarely feel the need to click “like” – I would be more likely to subscribe. I heard that YouTube itself does not place too much importance on “likes” these days too, they are more likely to track views, duration, click through, etc.

Additionally, the number of views is not underestimating the number of people who have engaged with the video (even if they only watched for a few seconds), but the number of likes will actually underestimate the number of people who actually like the video, as it is not compulsory to click the like button if you liked it! Again, I knocked that one off my analysis list.

TV Viewers

This may or may not be another surprise to some. If you take the time to look at the TV views for any given year, you will see they are pretty much constant – see below ( I recall there was one outlier). You can read into this that the money you pay to advertise during the Super Bowl is giving you the direct access to everyone that is watching.

As it is constant, the duration, brand, type, etc. are not directly impacting the number of viewers. I would only take this into account if you had a live feed of figures while the ads were being watched.

So, you guessed it, I kicked TV Viewers to the kerb!

TV Viewers – Year 2000

Data Considered Relevant

OK, ok….. so I have pretty much now ignored a large part of the dataset. So what exactly am I actually interested in, and what information relevant to the scope do I think I can get out of it? Well, let’s see now….

Year

An easy one – we want to potentially see how our variables change over time, so we will definitely be keeping Year

Brand Category

With ten brands, its a little difficult to compare. However if we can reduce this to a smaller set of categories, it may be of benefit. In the back of my mind, I was thinking that as Maven Motors is a motor company, it might be useful to look at the performance of Kia, Hyundai and Toyota as a group, rather than as individual companies as it effectively provides a larger single sample size.

Following that, it made sense to group the others into Alcohol, Snacks and Drinks, Sports and Services, and see if there were any trends among those categories.

By the way, I created a YouTube video showing you how to create those categories.

Video Characteristics

It was difficult to glean any real insight into the 7 categories without more detailed analysis. Further, it is part of the brief to influence the “creative direction” of the advertisement. Therefore, it was a no-brainer to retain this data.

Again, I created YouTube video actually showing you how to “unpivot” these characteristics for more effective analysis.

Length

Further, I saw that there were varying lengths of ads, and there may be a potential trend over time which may allow me to impart some recommendation on what the current trend is – so that was kept in too.

YouTube Views

I retained this as one of my key data points. As Maven Motors were already gaining access to the TV network through having a large marketing budget, I believed that a true test of whether an ad was “creatively successful” (viral), was whether people then took the time to seek it out on YouTube to watch again, or felt the need to share it with friends, family and their social networks.

$/s

As discussed earlier, I saw that the $/s for advertisements was constant for each year, which makes sense. I kept and used this data point as I would be able to see the trend in cost over time and potentially give further assurance to the board on likely final expenditure (even though budgets were already approved).

Number of Ads per Brand per Year

One further data point I noticed was that some brands were having more than one ad per year. The below example shows Budweiser using 13 ads over 4 Super Bowl events. Again, this is something I thought would be interesting to look at further to see if there were additional recommendations I could make on the potential number of ads to run.

Budweiser Ads – 2000 – 2003

Step5 – Consider Available Metrics

Considering all of the above, I came up with the following list of data points or metrics I wished to examine and present and make data-driven recommendations on:

The number of ads in a campaign year – What was the average number of ads per year per brand category.

Ad Duration – what is considered a standard ad length, and what is the trend over time telling us we should do this year.

$/second – what is the historical trend in ad cost, and see if we can forecast the range for the coming year.

Average YouTube Views vs Ad Characteristics – analyze the each characteristic and see which is more likely to garner additional views.

Step 6 and 7 – Develop a Story or Flow and Sketch Out

The next steps were to capture the entire process above is a single structured sheet that would form my final report. My story needed to:

  • Set the scene (provide context),
  • Lay out my methodology,
  • Establish the scope of the data analysed
  • Present the key data and insights
  • Give a summary recommendation

Again, I find it useful to sketch this out on a piece of paper (see below), along with even some potential visuals I have in my head.

I note that for this report I included a significant amount of text. This was on purpose as one of my key objectives was to enable the the board members to have confidence in my analysis. I felt that if I laid out the context, methodology and recommendations in clear concise language, that is then further backed up by data, this will provide that level of confidence required in such situations.

Report Structure Sketch

Step 8 – Execute, Review and Polish

Once the above it set, it is now a matter of prepping and cleaning any data (categorization, etc.), establishing the measures, then building the visuals into the structure I had envisaged.

One thing I like about Tableau is creating visuals on separate pages, and then constructing a final dashboard. I sometimes replicate that feel in Power BI, where I will trial and build individual visuals on separate tabs before bringing them in. Below are a few samples.

Data Scope and Ad Length Visuals

This gets the concepts down, so I know the type of data I want to show and in what general format.

Options for Ad Characteristics – Phase 1

I will experiment with how things are presented and how well the visual can communicate the point I want to make.

Options for Add Characteristics – Phase 2

Once I am comfortable with the concept, I will look to “polish” it. What does this mean to me?

Polishing Steps

As I said, once I am happy with the story, flow, structure and type of visuals I am using, I will move on to “polish”, which for me usually involves the following:

  • Selecting an appropriate colour theme/palette and applying consistent colours for headings, text and visuals. There are lots of websites that help generate themes and palettes.
  • Apply consistent fonts and font sizes
  • Remove superfluous information:
    • axis titles (if already obvious)
    • axis values (if using data labels)
    • repeated legends, which can be done with use of consistent colouring
    • gridlines, axis lines, etc. unless really needed
    • give concise visual titles
    • any other item that is not conveying
  • Align titles, text boxes, visuals (the eagle eyed will note a few of my visuals are off centre!)
  • Provide consistent spacings

Applying these steps allows me to move from the above visuals to the below final versions, which I believe met my goal of giving clean, concise and professional looking visuals.

If you have made it to the end of this, congratulations!! I apologise that my blogs are maybe not as short and to the point as my dashboard, but I hope this provided some insight into the process I took in developing my challenge submission, and maybe provided you with some tips and tricks for the next one.

Let me know what you think – would you change anything, or do you have a good alternative approach? I would be interested to hear.

Also, If you would like to see a video version of this and maybe some exploration of actually creating the above visuals, leave a comment below, or drop me a message on LinkedIn.

Maven Challenge – Harry Potter Scripts

Using Power Query to match number of particular spells to characters

When I first saw the new Maven Magic Challenge description, I was drawn to the words “share a…visual that captures the MAGIC. And so, when I started looking at the data I had little real interest in the budgets, box office and run times. It was all about the characters, their journey, and the spells they cast… both in the movies and on the movie goers.

Connecting the Places and Spells to the Characters

So, I started to look at how I could connect the story characters and their dialogues to the magical places around the Potter universe, and the data as it was arranged allowed me to do that without too much hassle.

I then came to look at the data associated with the spells. There was no path in the dataset as provided that allowed me to directly link the spells both to the characters, the dialogues and the places. All it would allow me to do is look at them in isolation, which was no good!!

The challenge of this data set was officially set – how can I link the spells to the dialogue, and then onto the characters and places, and how can I pull this weaving story into a single page visual???

I came up with two methods to try and achieve this

My first method would be to count the number of spells cast by a character. The next would be to also identify the type and name of each individual spell cast.

It was time to cast those “spells”…..Powerus Querio!!

Method 1 – Calculating the Number of Spells Cast by a Character

Step 1 – Create a List of Spells

I needed to create a separate list containing only the incantations from the spell file. So, I imported a copy of the “Spells” csv file. I then selected the Incantation column and the Transform tab, selected Convert to List (shown below) and muttered “Reducio” under my breath.

Voila! The file was reduced to a single list of the spell incantations, as shown below. I would use this list as a reference to look at instances where the incantations were mentioned in the character dialogue. Keep reading to find out more!

Step 2 – Searching the Dialogue for Incantations

The next step was to search the Dialogue for any of the line items from my newly acquired list. I imported the Dialogue csv, then set to work deploying a combination of List.Count and Splitter.SplitTextByAnyDelimeter. The line of M code I used was:

= Table.AddColumn(#"Changed Type", "No. of Spells - Splitter", each List.Count(Splitter.SplitTextByAnyDelimiter(Spells)([Dialogue])) -1)

This would count the number of times a spell was mentioned in the line of dialogue. Let me explain a little how this works, starting from the inside out.

Splitter.SplitTextByAnyDelimeter(Spells)[Dialogue]

This code allows us to work through the lines of dialogue, and when it meets a spell from our list, it effectively stops and splits that line into a list. This new mini list will contain a separate line for each time a spell is encountered, and will have a final line to dump the text from the dialogue that does not appear in the spell list.

This new mini list does not capture which spell it has encountered, only that it has encountered one. In fact, it returns a blank line each time it finds a spell. Confused? I was a little. The below picture may help visualize it.

List.Count – Wrap Around

The Text Splitter does not do a whole lot on it’s own. Therefore, the next piece of the puzzle was to wrap it in a List.Count. This would count the number of items in each mini list we have created. However, as the above shows, it would count once for the spell, and once for the remaining text.

The next action would be to add “-1”. The result would be to give us a count of the number of times an incantation from our list had been used in the dialogue. See below example highlighting where a count has been made of an incantation.

Step 3 – Tidy Up

The final step in this initial tidy up is simply to filter the column for values greater than zero. You have now identified all the lines of dialogue that contain an incantation, as well the numbers of incantations used – sometimes there was more than 1 – see below.

Not Satisfied Yet!!!!!!

Method 2 – Extracting the name of the spells cast by each character

Step 1 – Custom Column with List

Create a list of Spells in the same way as Method 1, Step 1 above – Easy as!

Step 2 – Add Spells List to Dialogue Line Items

Add a custom column to your dialogue sheet in Power Query called “Find Spell”, and simply add “= Spells” and hit ok. You will see a new column added containing the word List in each cell. This list contains the full list of spells.

Step 3 – Expanded List

This next step may not be recommended for very large data sets, like the Maven Taxi Challenge, but it performs fine here.

We will now expand the list. This is done by clicking on the double arrows and selecting “Expand to New Rows” as shown below.

This will then duplicate each line of dialogue to equal the number of spells in our list. Below is an example, where we see the line of dialogue repeated, with each spell listed beside it. The reason for doing this will become clear below.

Step 4 – Find the Spell

Next we can use a Boolean calculation in a custom column to find if our line of dialogue contains the spell that is living in the “Find Spell” column.

I have used Text.Contains([Dialogue],[Find Spell]) . This will look at the text in the Dialogue column and see if it contains the text shown in the corresponding Find Spell column. If it finds it, it will return TRUE, and if not it will return FALSE. See the below example, where if I can recall correctly Hermione repairs Harry’s glasses with an Occulus Reparo spell!!

Oculus Reparo - theQuotes.me

Step 5 – Filter for TRUE values

Once the above step is taken, it is a matter of applying a simple filter on your custom column to only show TRUE values. Looking at the below, I can now see only the lines of dialogue that contain a spell, and now also the name of the actual spell that has been used!! This was what I originally had in mind.

Using this Information

As the Dialogue data sheet contains the movie, chapter, place and character IDs, I can now link the spells to these people and locations!

Next Step

The next step is to now build the visualisation I had in mind to work with this new insight.

Watch this space…..

Expecto Patronum!

Social Media Dashboard – Power BI

Overview

I have created a walkthrough so that you can use just 6 DAX formula to develop data extracted from your social media accounts into a metrics dashboard like that below, whether that be LinkedIn, Twitter, blog accounts or your own website.

Sample Fictional Data from LinkedIn Corporate Account
LinkedIn Metrics Dashboard developed in Power BI

A Youtube video of this walkthrough is linked here. Some of you may have come here after viewing it – thanks, and welcome! I will get down to what you came here for…

6 Key Metric DAX Formula

From the video, you will see that I used six key DAX expressions or formula again and again to create a comprehensive set of metrics to allow you to develop the dashboard. Here they are in order of development.

Totals

Total LinkedIn New Followers = 
SUM(LinkedIn_Data[LinkedIn New Followers])

You can use SUMX in lieu of SUM if you wish here, noting you will need to provide a table and expression in lieu of a column.

Year to Date (YTD)

LinkedIn New Followers YTD = 
TOTALYTD(
    [Total LinkedIn New Followers],
    LinkedIn_Data[Date]
)

Latest Month (MTD)

LinkedIn New Followers This Month = 
Calculate(
    [Total LinkedIn New Followers],
    LASTDATE(LinkedIn_Data[Date])
)

Previous Month

LinkedIn Followers Previous Month = 
Calculate(
    SUM(LinkedIn_Data[LinkedIn New Followers])
    ,PREVIOUSMONTH(LinkedIn_Data[Date]
    )
)

Month Over Month Difference


LinkedIn Followers Diff MoM = 
VAR CurrentS = Sum(LinkedIn_Data[LinkedIn New Followers])
VAR PreviousS = [LinkedIn Followers Previous Month]
VAR Result = CurrentS - PreviousS
Return
    Result

Month Over Month % Growth

LinkedIn Followers MoM Growth % = 
Divide(
    [LinkedIn Followers Diff MoM],
    [LinkedIn Followers Previous Month]
)

Bonus – Last Date

One simple bonus DAX formula that will extract the latest date in your date table. This is useful for title blocks and banners in you reports and dashs, as it will automatically update when you add your monthly data.

Latest Date = 
LASTDATE(
    LinkedIn_Data[Date]
)

Resulting Data

Once I have developed a set of calculations, I like to test them in a matrix to make sure they have the desired outcome. This is what I have done below. You can then be confident that the cards and visuals you create accurately reflect your data set.

Resulting DAX calculations with verification table showing outcomes

Comments and Feedback

If you have any comments, feedback, or requests, please let me know below or leave a comment on my Youtube channel.

Thanks

Datasets

Maven Challenge – Power Query Mega Clean – Taxi Challenge

Summary

This is a walkthrough of the cleaning I undertook on excess of 25 million lines of data as part of the #maventaxichallenge , which is the monthly data visualization challenge set up by Maven Analytics. This month the challenge involved looking at detailed NYC Taxi data between 2017 and 2020 and developing a usable single page dashboard to analyze weekly trends.

Details of the challenge can be found here. This includes the data files being provided and the requirements for cleaning and presenting the data. I have also created a walkthrough of the below on a video link if you want to see it in live action.

Plan

My plan was fairly simple but structured:

  1. Write a clear list of the cleaning steps linked back to the data dictionary
  2. Create a Data Sample
  3. Apply the cleaning steps to the sample and verify
  4. Copy out the M Code for each step
  5. Load the full data set and apply the M Code for each step

Step 1 – Cleaning List

So, the following were the steps required, and what that translated into using the data dictionary:

Cleaning RequestData Dictionary Translation
Let’s stick to trips that were NOT sent via “store and forward”store_and_fwd_flag = N
I’m only interested in street-hailed trips paid by card or cash, with a standard ratetrip_type = 1 ,
payment_type = 1 or 2,
RatecodeID = 1
We can remove any trips with dates before 2017 or after 2020lpep_pickup_datetime = between 01/01/2017 and 31/12/2020 lpep_dropoff_datetime = between 01/01/2017 and 31/12/2020
…along with any trips with pickups or drop-offs in unknown zonesPULocationID <> blank/empty
DOLocationID <>blank/empty
Let’s assume any trips with no recorded passengers had 1 passengerIf passenger_count = 0 or null, then replace the 0 with 1
If a pickup date/time is AFTER the drop-off date/time, let’s swap themIf lpep_pickup_datetime > lpep_dropoff_datetime
then lpep_dropoff_datetime
else lpep_pickup_datetime

if lpep_dropoff_datetime < lpep_pickup_datetime
then lpep_pickup_datetime
else lpep_dropoff_datetime
We can remove trips lasting longer than a day, and any trips which show both a distance and fare amount of zeroadd column
lpep_dropoff_datetimelpep_pickup_datetime
then filter out values >= 24 hours

if trip_distance AND fare_amount = 0
then filter out values
If you notice any records where the fare, taxes and surcharges are ALL negative, please make them positiveif fare_amount <0
and mta_tax <0
and extra <0
and improvement_surcharge <0
and congestion_surcharge <0
then “all negative”
else “ok”

then apply “trick” replacement to change values (see below for more detail)
For any trips that have a fare amount but have a trip distance of 0, calculate the distance this way: (Fare amount – 2.5)/2.5If fare_amount > 0
And trip_distance = 0
Then ((Fare amount – 2.5) / 2.5)
Else trip_distance
For any trips that have a trip distance but have a fare amount of 0, calculate the fare amount this way: 2.5 + (trip distance x 2.5)If trip_distance > 0
And fare_amount = 0
Then (2.5 + (trip_distance x 2.5))
Else fare_amount
Cleaning Requests and Action Steps

Step 2 – Create Data Sample

It would be next to impossible to apply the steps to over 25 million lines of data and then easily verify that all the filters, additions, replacements and modifications had taken place and you got the results you were looking for.

A much more digestible method is to recreate a sample data set based on the data we were provided, and ensure that at least one example of each cleaning step scenario listed above is included.

To do this, I took the smallest file (2020), and performed a series of deduplications on values across the locations, passenger numbers, etc. so that I was able to get a small sample of approximately 50 varied line items.

Then, in order to recreate some of the cleaning scenarios, I made minor adjustments to some values. e.g. swap the drop off and pick up times. Above is the resulting sample data set.

Step 3 – Apply Cleaning Steps to Sample and Verify

Next, I created a new PowerBI file, uploaded the sample data set and then moved to edit the set in Power Query.

Sample data in Power Query

After performing the usual steps on checking data types (text, date, time, numbers, etc) have been applied, it was then a case of applying each cleaning step, and working through. For example, the first step became: =Table.SelectRows(#”Changed Type”, each [store_and_fwd_flag] = “N”)

M Code Replace Hack

There is one hack that is really worth highlighting here, and will save some added columns and processing time in your Power Query tasks, especially in larger data sets. The below gallery shows a snapshot of each step, but here is a brief description:

  • Create a “Dummy Replace” step for the column you which to replace values on.
    • Select the Transform ribbon, and click on Replace Values (picture 1)
    • Choose two dummy values to replace that would not be values in your column. As an example here, I chose replace 5000 with 6000 (picture 2), where most values would actually be single digit values in that column.
    • Click OK, and you will see the M code in the formula box at the top (picture 3)
  • Create a “Dummy Custom Column” step to allow you to write the conditional statement you want to insert into your Dummy Replace code
    • Select the Add Column ribbon, and click on Custom Column (picture 4)
    • use intellisense to help you write the conditional statement you wish to create (picture 5)
    • Copy the statement, and click cancel. This way you are not creating an additional column.
  • Return to your Dummy Replace step and perform the following:
    • For the first value (5000), amend this value to refer to the column you want to replace values in, and precede it with “each”.
    • In this example, I replace 5000 with each [fare_amount]
    • For the second value (6000), amend this value to refer to the conditional statement you want to apply to the selected column, and again precede it with “each”.
    • In this example. I replace 6000 with each if [negatives_charges] = “all negative” then [fare_amount]*-1 else [fare_amount]
  • Once you hit return, the replacement of values occurs within the column, without the need to create an additional column. This will increase efficiency for any refreshes.

If you are interested, the “normal language” logic behind this step is:

  • For each row value in the the column fare_amount, if the corresponding row in the negative_charges column is equal to “all negative”, then we would like to multiply the fare_amount by -1, thus changing it from a negative to a positive value. Else, if it is not “all negative”, then just keep the fare_amount value as it is (no change).

Step 4 – Copy out M Code

Once you have gone through each of the steps on your sample set and verified that it has had the desired effect on you line items, you will now have the list of steps under your Applied Steps on the right hand side of the Power Query interface. A good tip is to rename these steps based on your need (e.g. Change Negative Values or Swap Dates). This will help when you want to copy steps to use in the full data set.

You can see a list of this information in the Advanced Editor Window (see below). This can be copy and pasted out and saved for future use.

If you click on each step you will see the corresponding M code just above your data table.

Advanced Editor showing M Code Steps

There is another way that I learned from the Curbal Youtube page, which is pretty powerful. I wont repeat all the steps here, but here is a link to the tutorial.

Step 5 – Apply to Full Data Set

Now that you have your all your steps and code written, tested and verified, it is now a pretty straight forward proposition to apply them to your full data set.

An easy way to add steps, is to right-click on a step then select “Insert Step After” (see below). This will then allow you to paste in the M code that you have saved. One tip is to check the step reference. It will refer to the name of the preceding step. Therefore make sure you do that in the first instance. For following steps, it should be easy to use your copied values from the sample data, provided you use the same names for your steps.

Insert Step

Once you have completed all your steps, you are done – all that’s left to save is Apply and Close.

The “trial and error” approach is removed, which means that when Power BI applies the updates to the 25 + million line items, you can be reasonably confident you will not have to revisit your Power Query. This is important here, as with such large data sets, the updates can sometimes take hours to complete depending on your computer’s processing capabilities.

You are now free to move on to the next part of the challenge and concentrate on creating you DAX calculations and a nice neat dashboard.

If I have time to finish my dashboard on this one, I will add it on a future post. As always, any comments or queries, please let me know in the comments below!

5 Quick Tips for next level BAN in Tableau

Everyone loves a BAN (Big Aggregate Number). They are your all important key numbers in your dataset and should be jumping off the screen and and ingraining themselves in the back of your retinas!

But the standard way of creating them in Tableau can be a bit dull and monochrome.

Below I’ll step through the usual way of creating a set of BAN, followed by 5 quick tips to take them up a level in a recent dashboard I created as part of a Maven challenge

Standard BAN Creation

Normally, you will have a relatively small set of categories that you wish to show an aggregate value for (example being “Age Group” below.

We traditionally create this by dragging the category (Age Group) into the columns, then pulling the calculation into Text marks area.


Voila! We have a BAN. Not very pretty, but a BAN all the same.

You can adjust the header and value for font type, size and colour. Normally, that is about as far as most people go.

What’s the alternative?

But hey – what if you want your BAN in a single horizontal or vertical line, or you want to colour code based on the category or value?

What can we do to customise our BAN and make it that little more memorable?

5 Simple Tips to go from “Boring BAN” to “Badass BAN”

1. Orientation

It is easy to change from a single horizontal to a vertical line, by simply dragging the category from column to rows. This allows you to maximise your data real estate depending on how how you are structuring your overall visuals.

2. Headers

We can drop a duplicate category “Age Group” onto the Text marks card. Then right click on the category header and select remove. This will result in the second image below – still not too pretty, but we are on our way.

You can keep the category above or below the BAN by shifting it up or down on the marks card.

However, my preference is to keep it below, as it keeps the focus on your Big Number!

3. Font

Many people may have different views, but my preference is to keep a single font on a dashboard. Having multiple fonts can become an unwanted distraction and give a clunky look.

4. Size

For the numbers, bigger is better!! Make the size of the numbers much larger than any adjacent text to emphasize the contrast.

To do this, select the Text icon in the marks card, and click on the three little dots on the side.


This will bring up the “edit label” input box. Here you can adjust the size and font attributes (bold, underline, italic).
In my example, I chose 36 for the Aggregate Number and 16 for the underlying category.

5. Colour

Adding some colour can help place emphasis on numbers or categories, and help improve the aesthetics and feel of your visual.

Ctrl dropping the category onto the colour mark would allow you to assign distinct colours based on each category, whereas Ctrl dropping the calculations “CNT(Consumers)” onto the colour mark will allow you to assign colour.

You can also maybe apply a quick table calculation. Here I opted to show a % of total, rather than the straight numbers. This gives a good overall perspective. As shown in the below picture, you can right click on the aggregate number and select the quick table calculation.


Lastly, in my example, I opted for adjusting the colour based on the BAN value itself. As I wanted to draw the eye to the highest value, I used a diverging scale from a green (#00aa00) to a white, which was offset at -20%. This enabled my lowest value to still be almost visible, while keeping the focus squarely on the largest number.

Overall

I was pretty happy with the outcome, and was able to apply the same effect to two sets of BAN. This helped maintain the overall important consistency and look when they were brought into the main dashboard.

What do you think?

As always, if there are any questions or comments, please reach out. I am happy to help where I can, and always open to feedback on alternative methods and learning new tricks from the data fam.

Happy Analysing – DG