Training a Spark Model for Predicting User Churn

Malav Shah
Oct 28, 2020 · 11 min read

A. Project Definition

This section provides an overview of the data we have and of what the project is about, along with details about the metrics used to evaluate and choose the model.

  1. Project Overview

This project uses Spark ML to engineer new features from existing ones and to train a machine learning model that predicts churn for users of a music streaming application. In addition, we deploy the solution to the IBM Cloud so that the model can be trained on the full data set, which is much larger. The steps performed to train the model are outlined in the sections and sub-sections below.

2. Problem Statement

We are given a data set of user interactions with the music streaming application. The data set has attributes like the name of the user, their gender, which page of the application they navigated to, the timestamp of the interaction, etc. We need to identify which users churned and the reasons that drive their move away from the application. This can in turn be used to predict future churners; to reduce churn, the company could offer them promotions so that they stay on the application longer, thereby increasing revenue.

3. Metrics

The metric used to evaluate the various machine learning models for this task is the F1 score. We chose the F1 score over individual metrics like accuracy, precision and recall because it combines precision and recall into a single, more holistic number.

The score reported by the models is then used to compare how they perform on the train, validation and test sets, and the better of the two models is chosen. More on this is discussed in the results and conclusion sections.

B. Data Exploration and Visualization

The following sub-sections cover everything from loading and cleaning the data to performing exploratory data analysis, with visualizations that are then used to decide what new features can be engineered.

  1. Loading and Cleaning Data set

The data set describes how users interact with the music streaming application. It has attributes like firstName, gender, song, artist, etc. The image below gives a view of the attributes that we have.

Data set Schema

Each row in this data set corresponds to an interaction by a user on a particular page of the application. The original data set is close to 12 GB in size and is stored in an Amazon S3 bucket. Before testing on the large data set, we use a smaller subset to run our analyses and tests. To assess how clean the data is, we define a function that checks for things like empty IDs and NaN values. We drop the rows that have a blank ID and remove columns that have more missing values than workable ones.
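A minimal sketch of this loading and cleaning step in PySpark is shown below. The file name is illustrative, and the quality check is limited to nulls and empty strings:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Sparkify churn").getOrCreate()

# File name is illustrative; point this at the event log you are using.
df = spark.read.json("mini_sparkify_event_data.json")

# Report how many null or blank values each column has.
for c in df.columns:
    n_missing = df.filter(F.col(c).isNull() | (F.col(c) == "")).count()
    print(c, n_missing)

# Drop interactions that cannot be tied to a user.
df = df.filter(F.col("userId") != "")
```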

2. Exploratory Data Analysis and Visualizations

Once the data is clean, the next important step is to explore it before we decide how to do feature engineering. We label a user as churned if they have a "Cancellation Confirmation" event. Next, we delve deeper into the attributes of the users who churned.
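As a sketch, the churn label can be built with a window function over each user, assuming the event name appears in the `page` column (the column and event names here are assumptions based on typical logs of this kind):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Flag the cancellation event, then propagate it to every row of that user,
# so each user carries a single churn label.
df = df.withColumn(
    "churn_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0))

user_window = Window.partitionBy("userId")
df = df.withColumn("churn", F.max("churn_event").over(user_window))
```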

Churn w.r.t Gender

As can be seen in the image, the population of churners includes more men than women. Other attributes we look into relate to user demographics; we should be able to answer questions like "where are most users from?". One of the first plots to look at is the count of users from each state in the US. On analyzing, we find that most users are from California, almost double the counts for the closest states: TX, FL and the tri-state area. Extending this, we also look at whether users are concentrated in particular cities. Both views are shown in the following two images.

Count of users by State (Top 10)
Count of users by City (Top 15)
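A rough sketch of how these counts can be computed, assuming the raw `location` column holds values of the form "City, ST" (an assumption, since the schema image is not reproduced here):

```python
# Split the assumed "City, ST" location value into city and state parts.
df = (df.withColumn("city", F.split("location", ", ").getItem(0))
        .withColumn("state", F.split("location", ", ").getItem(1)))

# Distinct users per state, top 10.
(df.groupBy("state")
   .agg(F.countDistinct("userId").alias("users"))
   .orderBy(F.desc("users"))
   .show(10))
```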

Another thing we look at as part of the EDA is the number of sessions a user has with respect to churn. It is evident from the image below that customers with fewer sessions are more likely to churn than those with more sessions. The same holds for the number of songs and artists heard by the users: users with low counts of unique songs and unique artists are more likely to churn than those with high counts. All of these metrics are shown in the images below.

Number of unique songs vs Churn
Number of unique artists vs Churn
Number of sessions vs Churn
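The session comparison can be sketched as a per-user aggregate, reusing the `churn` column defined earlier:

```python
# Distinct sessions per user, carried alongside the churn label.
sessions_per_user = (df.groupBy("userId", "churn")
                       .agg(F.countDistinct("sessionId").alias("n_sessions")))

# Average session count for churners (1) vs non-churners (0).
sessions_per_user.groupBy("churn").agg(F.avg("n_sessions")).show()
```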

C. The Methodology — Data Pre-processing, Model Implementation and Refinement

In the following sub-sections, we perform the data pre-processing steps: engineering new features, then indexing, encoding and scaling them so that they are ready to be used as model input. We also implement the models and fine-tune them using grid search.

  1. Feature Engineering

Feature engineering is an important step in the machine learning process. It is the step where we decide which features to use, or create new ones from existing features based on the analysis in the previous step. For this data set, we aggregate various details for each user. Based on our analysis, we decide to create the following features (a sketch of the aggregation follows the list):

  • No. of unique sessions per user
  • Activity during AM hours
  • Activity during PM hours
  • Average song length
  • Longest session by the user
  • User state
  • User city
  • No. of Next Song events
  • No. of Thumbs Up events
  • No. of Thumbs Down events
  • No. of Playlist Added events
  • No. of Friend Added events
  • No. of Downgrade events per user
  • Activity during the first half of the month
  • Activity during the second half of the month
  • Home click count
  • Device type

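A sketch of how a few of these per-user aggregates can be computed is shown below; the page names ("NextSong", "Thumbs Up", etc.) are assumptions based on typical event logs of this kind:

```python
def count_page(page_name):
    """Count occurrences of a given page event within the group."""
    return F.sum(F.when(F.col("page") == page_name, 1).otherwise(0))

features = df.groupBy("userId").agg(
    F.countDistinct("sessionId").alias("n_sessions"),
    F.avg("length").alias("avg_song_length"),
    count_page("NextSong").alias("n_next_song"),
    count_page("Thumbs Up").alias("n_thumbs_up"),
    count_page("Thumbs Down").alias("n_thumbs_down"),
    count_page("Add Friend").alias("n_add_friend"),
    F.first("state").alias("state"),
    F.first("churn").alias("label"),
)
```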
2. Pre-modeling steps (Indexing, Encoding and Scaling the features)

Only once data pre-processing and feature engineering are complete can the actual "modeling" process begin. In the modeling process, we split the full data set into train, test and validation sets and test two different types of models, as discussed below. Before beginning, we again do a sanity check for missing values in the newly created features. The major pre-modeling steps we perform before the features are ready to be fed into the models are as follows (a pipeline sketch follows the list):

a. String Indexing — This step converts categorical column values into numerical indices that the model can understand.

b. One Hot Encoding — This step creates a one-hot encoder for each of the indexed columns.

c. Vector Assembler — This step creates a single vector of the one hot encoded categorical columns and the numerical columns.

d. Standard Scaler — This step scales the final feature vector so that features with large ranges do not dominate the model.

e. Data Splitting — The data is then split into train, test and validation data sets.
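A sketch of these steps as a Spark ML pipeline is shown below. The column names are illustrative, and the multi-column OneHotEncoder assumes Spark 3.x:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoder,
                                VectorAssembler, StandardScaler)

# Illustrative subsets of the engineered categorical and numeric columns.
categorical_cols = ["state"]
numeric_cols = ["n_sessions", "avg_song_length", "n_thumbs_up", "n_thumbs_down"]

indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in categorical_cols]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in categorical_cols],
                        outputCols=[c + "_vec" for c in categorical_cols])
assembler = VectorAssembler(
    inputCols=[c + "_vec" for c in categorical_cols] + numeric_cols,
    outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

prep = Pipeline(stages=indexers + [encoder, assembler, scaler])

# Split first, then fit the pipeline on the training portion only.
train, validation, test = features.randomSplit([0.7, 0.15, 0.15], seed=42)
prep_model = prep.fit(train)
train, validation, test = [prep_model.transform(d)
                           for d in (train, validation, test)]
```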

3. Model Implementation and Refinement

Finally, we are now ready to actually train the model. It is important to understand that most of the time is not actually spent training the model but rather on the pre-processing steps that make the data ready for modeling. For this project, we try two different models, discussed below.

Random Forest Classifier

A random forest is made up of a large number of individual decision trees that act as an ensemble. Each tree in the forest outputs a prediction for the class, and the class with the most votes becomes the model's prediction.

We use the parameter grid builder to select the optimal parameters for the classifier. In addition, we use a 4-fold cross-validator to get more reliable estimates. The steps performed for the Random Forest Classifier are visible in the image below.
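A sketch of this setup follows; the grid values are illustrative, not the exact ones from the original run:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# Candidate hyperparameters for the grid search (illustrative values).
rf_grid = (ParamGridBuilder()
           .addGrid(rf.numTrees, [20, 50])
           .addGrid(rf.maxDepth, [5, 10])
           .build())

# F1 is the metric used throughout this project.
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

rf_cv = CrossValidator(estimator=rf, estimatorParamMaps=rf_grid,
                       evaluator=evaluator, numFolds=4)
rf_model = rf_cv.fit(train)
```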

Another algorithm that we test is the Decision Tree classifier. Its implementation is shown below.

Decision Tree Classifier

A Decision Tree classifier is a basic representation for classifying examples into one class or the other (in this case, whether the user has churned or not). It is a supervised machine learning algorithm, i.e. the labels are available in the training data, and it works by repeatedly splitting the data according to a certain parameter.

Just like the Random Forest Classifier, we use the Parameter Grid Builder to choose the best parameters for the model, along with a 4-fold cross-validator. The implementation for this algorithm is shown in the image below.
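A corresponding sketch, reusing the evaluator and tuning imports from the Random Forest snippet (grid values again illustrative):

```python
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")

# Candidate hyperparameters for the grid search (illustrative values).
dt_grid = (ParamGridBuilder()
           .addGrid(dt.maxDepth, [5, 10])
           .addGrid(dt.impurity, ["gini", "entropy"])
           .build())

dt_cv = CrossValidator(estimator=dt, estimatorParamMaps=dt_grid,
                       evaluator=evaluator, numFolds=4)
dt_model = dt_cv.fit(train)
```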

The parameters tested in the parameter grid for both algorithms are refined and chosen iteratively based on performance on the validation set. Just as the name suggests, the validation set helps us validate which ranges of parameters perform well for the data set and provide better results. The results of both algorithms are discussed in the next section.

D. Results

1. Model Evaluation and Validation

In this section, we discuss the results for each of the two algorithms discussed above. To reinforce the metric again, we use the F1 score to choose the better algorithm and to perform validation. The function we use to calculate the metrics is shown below.
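Since the original function is only shown as an image, here is a minimal sketch of what such a metric function can look like, reusing the F1 evaluator defined earlier:

```python
def report_f1(model, datasets):
    """Print the F1 score of `model` on each named data set."""
    for name, data in datasets.items():
        predictions = model.transform(data)
        print(f"{name} F1 score: {evaluator.evaluate(predictions):.2f}")

report_f1(rf_model, {"train": train, "validation": validation, "test": test})
```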

Random Forest Classifier

After training the model over various sets of parameters, the best parameters that we get for this particular classifier are as follows:

The following are the metrics that we get from the trained model for the train, validation as well as test data for this classifier.

As seen in the image, the F1 score for the train set (the data the model had access to) is 0.89, whereas the F1 score for the test set (data the model did not see) is 0.81. These values are relatively good and indicate that the model does not badly overfit the data. Next, we discuss the same scores for the Decision Tree Classifier.

Decision Tree Classifier

After training the model over various sets of parameters, the best parameters that we get for this particular classifier are as follows:

The following are the metrics that we get from the trained model for the train, validation as well as test data for this classifier.

As seen in the image, the F1 score for the train set is 0.91, whereas the F1 score for the test set is 0.76. These values are still reasonable, but the larger gap between train and test scores suggests more overfitting than with the Random Forest Classifier: the Decision Tree performs worse on unseen data while performing only slightly better on seen data.

2. Justifications

To reiterate, the metric we used to decide which algorithm performed better is the F1 score, because it takes into account both precision and recall and presents a holistic view as a single number. The F1 score is calculated as follows:
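$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$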

Both algorithms perform relatively well on the train, validation and test sets, based on the F1 score. However, the Random Forest classifier has a slight edge: it performs better on unseen data than the Decision Tree Classifier, while performing only slightly worse on seen data. In real-world scenarios, we want the model to make good predictions on future data, which is unseen in most cases. To quantify this, the F1 score on the test set is 0.81 for the Random Forest Classifier and 0.76 for the Decision Tree Classifier.

Hence, the model of choice in this case is the Random Forest Classifier.

E. Conclusion

1. Reflections and future improvements

This was the end-to-end process of training a machine learning model from scratch, from exploratory data analysis to actually training the model using Spark. One of the interesting aspects of this project was the opportunity to use Spark on a large data set and train models using the Spark ML API. Another was gaining insight into how various data points can inform the decision of which new features to create.

One point that could be improved is using feature importances or coefficients, along with other visualization libraries, to better understand the data when deciding which features to build or create as part of the feature engineering process.
