Sparkify — to churn or not to churn

Stefan Krause
Published in Analytics Vidhya · May 7, 2021 · 8 min read


Photo by Brett Jordan on Unsplash

The project

Sparkify is a fictional music streaming service and the final project of my Udacity Data Scientist degree. It is a big data machine learning challenge in which I want to develop a model that predicts which users are about to cancel their subscription based on their interaction with the website. This is called “churn”.

The dataset is big: about 12 GB of user logs, too much to process on a single machine, and PySpark is the tool for handling data of this size. The dataset is hosted in an AWS (Amazon Web Services) S3 bucket (s3n://udacity-dsnd/sparkify/sparkify_event_data.json), and to work with it, it is necessary to set up an EMR cluster on AWS.
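
For orientation, loading the data on the cluster could look roughly like the sketch below. This is standard PySpark; only the S3 path is taken from the project, and `event_log` is a name I chose here.

```python
# Sketch: start a Spark session and load the 12 GB event log from S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Sparkify") \
    .getOrCreate()

event_log = spark.read.json("s3n://udacity-dsnd/sparkify/sparkify_event_data.json")
event_log.printSchema()
```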

Before I get into the details of my work I want to outline my approach to it.

The approach

Working on AWS costs money, so a good approach is to start with just a small portion of the data and test the code as well as the approach on a local machine first. Udacity also provides a sample dataset of about 123 MB (s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json) that is just right for that. I did exactly that and provide the code I developed on my GitHub. In order to produce a suitable dataframe for my ML algorithms I went through the following steps.

get to know your data

It is always important to explore your dataset first and get to know your features. There was no description provided, so I give an overview table here:

feature overview of dataset

For my project I wanted to predict the churn rate and therefore I will focus on the user interaction with the website that is stored in the “page” feature. The page feature has the following unique values:

unique values of page feature

clean your data

Now that I know the features I wanted to check how clean the data is. There are NaN values in almost every row, but the ones that really have an impact on my model are the empty “userId” entries. Cleaning those is therefore my only concern. Here is a quick overview of the NaN values:

NaN values in the big dataset

The NaN “userId” entries are probably from users who did not log in when visiting the webpage. Since I cannot assign their actions to any user, I dropped those rows.
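
A minimal sketch of this cleaning step, assuming `event_log` from above and that “userId” is a string column where anonymous sessions show up as nulls or empty strings:

```python
from pyspark.sql.functions import col

# Keep only rows that can be attributed to a logged-in user.
df_clean = event_log.filter(
    col("userId").isNotNull() & (col("userId") != "")
)
```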

create new features where necessary

“ts” is a great absolute time reference, but every user has their own experience with Sparkify. A natural feature that comes to mind is the time that has passed since the user registered. I created the feature “membership_days”, which gives all users a mutual time reference. Judging from my own experience, I usually try out a new service for a certain time and then at some point decide whether I want to stick with it or not.

In my data exploration I also wanted to drill down into Sparkify’s development as a whole. Is the service growing or shrinking, and how active are the users? A great way of doing this is to aggregate numbers by week and see how they evolve, which also reduces the noise significantly. Therefore I created the feature “week” that I could use for aggregation.
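
Both features could be derived along these lines (a sketch; it assumes “ts” and “registration” are Unix timestamps in milliseconds, which is how this dataset is usually encoded):

```python
from pyspark.sql.functions import col, from_unixtime, weekofyear

df_feat = (
    df_clean
    # days since registration at the time of each log entry
    .withColumn(
        "membership_days",
        ((col("ts") - col("registration")) / (1000 * 60 * 60 * 24)).cast("int"),
    )
    # calendar week of the event, used for the weekly aggregations
    .withColumn("week", weekofyear(from_unixtime(col("ts") / 1000)))
)
```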

create a “userId”-aggregated table for ML

The final step in preparing my data for ML was to aggregate the “relevant” features by “userId” in a suitable manner. I defined all “page” values as relevant and created new columns that mark their occurrence in a row with a 1, just like the “get_dummies” function in Pandas. I did the same for “gender”, “level”, and “status”, which I also considered relevant.

After that I summed those 1s up for every user, except for “level” and “gender”, where I only wanted the maximum, i.e. the value at their last entry in the user logs.

From “membership_days” I actually extracted two features:

  1. Maximum: how long the user had been a member at their last record
  2. CountDistinct: a measure of how many days a user was active in the dataset
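
Put together, the aggregation could look like the following sketch. The page list is only an example subset, the column names are my own, and the churn label derived from the cancellation page is an assumption about how churn is defined:

```python
from pyspark.sql.functions import col, countDistinct, when
from pyspark.sql.functions import max as spark_max, sum as spark_sum

# example subset; in practice, use every unique "page" value
pages = ["NextSong", "Thumbs Up", "Roll Advert", "Cancellation Confirmation"]

df_enc = df_feat
for p in pages:
    # one column per page value, 1 where the row matches (like get_dummies)
    df_enc = df_enc.withColumn(
        p.replace(" ", "_"), when(col("page") == p, 1).otherwise(0)
    )

df_enc = (
    df_enc
    .withColumn("gender_F", when(col("gender") == "F", 1).otherwise(0))
    .withColumn("level_paid", when(col("level") == "paid", 1).otherwise(0))
)

user_df = df_enc.groupBy("userId").agg(
    *[spark_sum(p.replace(" ", "_")).alias(p.replace(" ", "_")) for p in pages],
    spark_max("gender_F").alias("gender_F"),
    spark_max("level_paid").alias("level_paid"),
    spark_max("membership_days").alias("membership_days"),  # length of membership
    countDistinct("membership_days").alias("active_days"),  # distinct active days
    # assumption: churn = user ever reached "Cancellation Confirmation"
    spark_max("Cancellation_Confirmation").alias("churn"),
)
```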

The final dataframe had the following schema:

Schema of ML dataframe

Now the data was ready for a deep dive.

Explore the data

The first starting point is to look at how the users interact with Sparkify. For more detailed information I looked at the occurrences of the unique “page” values. The fastest way to do this is a bar plot:

occurrences of unique “page” values with log y-scale
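
The plot could be produced with a few lines like these (a sketch; the grouped result is small enough to collect into Pandas for plotting):

```python
import matplotlib.pyplot as plt

page_counts = (
    df_clean.groupBy("page").count()
    .orderBy("count", ascending=False)
    .toPandas()
)

plt.figure(figsize=(10, 5))
plt.bar(page_counts["page"], page_counts["count"])
plt.yscale("log")                 # log scale spans the orders of magnitude
plt.xticks(rotation=90)
plt.ylabel("occurrences")
plt.tight_layout()
plt.show()
```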

A first glance at this plot already reveals some interesting facts:

  • the number of upgrades is higher than the number of downgrades, so more users switched to a paid subscription than back to a free one
  • there is more than an order of magnitude between users registering (new users) and users cancelling (churn users) — does Sparkify have a problem finding and keeping new users?

This got me interested into the weekly development of such numbers.

week-aggregated numbers

Here we can see a clear trend of user loss. A mere 30 to 50 new users register each week, whereas between 300 and 400 churn. That does not look good for Sparkify’s business case.

The development of service up- and downgrades does not look promising either. At the beginning of the dataset there were over 3000 users upgrading and about 600 downgrading, a 5:1 ratio. This has melted down to a 1:1 ratio in the last week of the dataset.

I wanted to see how the ratio of the different subscription levels evolves for the active users.

weekly evolution of active users for both subscription levels

Here you can see that the paid user base is quite stable between 6000 and 8000 users, while the free user base is steadily dropping.

The final question I wanted to explore was after how many days of membership users churn, and whether there is a difference between the subscription levels.

maximum membership days at churn by subscription level

This is a very nice distribution, with maxima around 25 and 50 days for free users and around 60 days for paid users. It seems that paid users give Sparkify a longer trial period before they make a final decision.

My conclusion from this evaluation is that two points have to be addressed at Sparkify:

  • Why do we not attract more new users? Do we have to change our marketing approach?
  • Why can we not keep our users at Sparkify?

While the first point is a marketing issue, I hope to find the answer to the second one with the help of an ML classifier.

the power of machine learning

At the beginning of each machine learning journey stands the question: which metric do I want to use to pick the best approach? I treated the problem as a classification of each user into either churn or not churn (hence the title).

implementing the metric

scikit-learn has a large variety of functions that provide good metrics, but the PySpark ML library seems less well equipped. Since I could not always follow how the built-in metric functions arrive at their values, I decided to write my own function that counts the true positives and true negatives as well as their false counterparts. In the PySpark ML library the dataframe has to be vectorized first, and my final model adds a new column “prediction” to the test data. From there it is just a matter of comparing “prediction” and “churn” and applying the definitions of accuracy, precision, recall, and f1. The confusion matrix is basically a byproduct of this approach.
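
A sketch of such a hand-rolled metric function, assuming a dataframe that holds an integer “churn” label and the model’s “prediction” column:

```python
from pyspark.sql.functions import col

def evaluate(pred_df, label="churn", pred="prediction"):
    # count the four cells of the confusion matrix
    tp = pred_df.filter((col(label) == 1) & (col(pred) == 1)).count()
    tn = pred_df.filter((col(label) == 0) & (col(pred) == 0)).count()
    fp = pred_df.filter((col(label) == 0) & (col(pred) == 1)).count()
    fn = pred_df.filter((col(label) == 1) & (col(pred) == 0)).count()

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1,
            "confusion": [[tn, fp], [fn, tp]]}
```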

setting up an ML pipeline

In preparation for hyperparameter tuning of my preferred classifier, I set up an ML pipeline. It begins with a VectorAssembler. Then I use a StandardScaler to prevent a feature from dominating simply because of the scale of its values. After that I only needed a classifier. The dataset is not too imbalanced, and I had good results in the past with decision tree and random forest classifiers, so those two were my starting points.
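
A sketch of that pipeline, assuming the aggregated `user_df` from above with a “churn” label column:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StandardScaler, VectorAssembler

feature_cols = [c for c in user_df.columns if c not in ("userId", "churn")]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
rf = RandomForestClassifier(labelCol="churn", featuresCol="features", seed=42)

pipeline = Pipeline(stages=[assembler, scaler, rf])

train, test = user_df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
```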

results for initial run of the two classifiers

Both yielded pretty similar results. The RandomForestClassifier, however, used all my features for its predictions. Since I wanted to understand why users churn, I decided to enter the hyperparameter tuning with this one.

hyperparameter tuning

My approach here was to try out a wider range of parameters on the smaller dataset first, and to then tune only a smaller number of parameters over a narrower range on the full dataset. I used the f1-score as the metric that defines the best model. This was my setup and the results:

parameter grid and results for f1
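
A tuning setup along these lines could look like the following sketch; the grid values here are placeholders, not the ones from the figure, and `rf`, `pipeline`, and `train` come from the pipeline sketch above:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

param_grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [20, 50, 100])   # placeholder values
    .addGrid(rf.maxDepth, [5, 10])
    .build()
)

evaluator = MulticlassClassificationEvaluator(
    labelCol="churn", predictionCol="prediction", metricName="f1"
)

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
)

cv_model = cv.fit(train)
best_model = cv_model.bestModel
```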

As you can see, the f1-score reported by the built-in metric is significantly higher. Applying my own metric to the best model gave the following results:

final metric of best model

The most important information that I wanted to get was the feature importance. The best model had the following ranking:

feature importance ranking of best model
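
The ranking can be read off the fitted forest, roughly like this (assuming `best_model` and `feature_cols` from the sketches above):

```python
# the classifier is the last stage of the fitted pipeline
rf_model = best_model.stages[-1]

ranking = sorted(
    zip(feature_cols, rf_model.featureImportances.toArray()),
    key=lambda kv: kv[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:25s} {score:.3f}")
```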

From this ranking I can see clear indicators for users churning:

  1. As I already saw in the data exploration, users give the service 1 to 2 months to check it out and then stay or leave (“membership_days”). This is where Sparkify has to increase its effort to keep new users.
  2. The number of “active days” is a good indicator of whether a user stays interested in Sparkify. If it drops, it becomes more likely that they churn.
  3. The dependence on “Roll Advert” is probably most relevant for the group of free users. There seems to be a limit to how many advertisements a user is willing to accept.
  4. Interesting to me is that Sparkify is doing a good job of being attractive to both genders. Gender is the least important factor in my model.

Conclusion

The combination of exploratory data analysis and machine learning gave good insight into the business “health” of Sparkify and helped to identify fields of action where it could work on keeping its users and attracting new ones.

Due to my limited time I did not fully explore all the classifier options or the full parameter grid space of, e.g., my VectorAssembler or StandardScaler. So there is definitely room to improve the quality of my model should it need to be deployed.

However, the insights into why users churn could already be extracted quite well with this approach.

Please find the full set of files on my GitHub:

https://github.com/snkrause/Sparkify.git
