Kelly Adams

My First Predictive Model: Explanation (Part 1)

Building a model to predict customer churn has been both an exciting and challenging experience. As the main data analyst at my company, I see this project as a major milestone for me. My background in mathematics and experience creating mathematical models helped me significantly, but this is the first time I've applied those skills to a project within a business context.


This post is the first part of a two-part series. In this part, I'll walk you through the background and the overall process I followed in developing the model. The second part (My First Predictive Model: Code (Part 2)) dives into the technical aspects, including the pseudo-code that I created. Why two parts? 

  • These articles take time to write, and I wanted to give myself time not only to digest what I’ve done but also to reflect on the experience.

  • I felt that separating the explanation from the coding would make the content more focused. Combining the two would have resulted in a post that was too long and potentially overwhelming.


Notes

  • While this post goes into my process of creating the model, specific definitions and metrics have been modified for confidentiality and privacy reasons. This isn't an exact one-to-one explanation of the real model, but the core ideas are the same.

  • I'm not an expert in data science; this is meant to explain my process, not to be an instructional or how-to article. I’m always open to learning and improving.


Process

1 Planning: Aligning Metrics and Setting Clear Goals

The first stage of this project was all about understanding what we wanted to achieve. The goal was to build a predictive model that could identify customers likely to churn (i.e., stop doing business with the company), providing our marketing team with a list of customers for targeted campaigns. Collaborating with stakeholders was important in this phase, since we needed to decide on the data to feed into the model. Given that this was the company’s first predictive model, a lot of this involved figuring things out as I went, which gave me flexibility in choosing which model to use and how to deploy it.


2 Researching: Exploring Which Models Fit Best and How to Train Them

The next step was to research the type of model that would work best. For a churn prediction model, logistic regression came up as the most recommended option, and it aligned well with our specific goals. Logistic regression is a statistical method used for binary classification (e.g., deciding whether a photo shows a cat or a dog; it can only be one or the other), which made it a good starting point for predicting whether a customer would churn or not. To prepare, I revisited some statistical and mathematical concepts: although I used a Python library to implement the model, I wanted to understand the underlying ideas. I chose to work with Scikit-learn, a widely used Python library that’s particularly good for building basic models.
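To give a flavor of what this looks like, here’s a minimal sketch of a logistic regression churn classifier in Scikit-learn. The feature names and numbers are made up for illustration; they aren’t my actual data:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: one row per customer, with a binary churn label
df = pd.DataFrame({
    "total_logins": [42, 3, 17, 55, 1, 29],
    "games_played": [120, 5, 40, 200, 2, 80],
    "churned":      [0, 1, 0, 0, 1, 0],
})
X = df[["total_logins", "games_played"]]
y = df["churned"]

# Hold out part of the data so the model is tested on customers it hasn't seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba gives a churn probability rather than just a 0/1 label
print(model.predict_proba(X_test)[:, 1])

Part of what makes logistic regression appealing here is that last line: instead of a hard yes/no, it outputs a probability, which a marketing team can use to rank and prioritize customers.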


3 Data Preparation: Extracting and Cleaning Data from Our Database

The planning and research stages didn't involve any coding, but that changed in the data preparation phase. I used a combination of SQL and Python to extract and clean the data from our database: SQL to pull the raw data, and Python to clean and prepare it for input into the model. I started by building a very basic model using features like total logins and games played. This initial model served as a test to ensure that everything was working correctly, and once that was confirmed, I moved on to further evaluation and updates.
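Here’s a rough sketch of that extract-and-clean step. The connection string, table, and column names are all hypothetical (and it obviously won’t run without a real database), but the shape of the work is the same:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and schema; the real database differs
engine = create_engine("postgresql://user:password@host:5432/analytics")

query = """
    SELECT customer_id,
           COUNT(login_id) AS total_logins,
           COUNT(game_id)  AS games_played
    FROM customer_activity
    GROUP BY customer_id
"""
raw = pd.read_sql(query, engine)

# Basic cleaning before anything goes into the model
clean = raw.fillna(0)                      # missing activity counts become zero
clean = clean.drop_duplicates("customer_id")
clean = clean[clean["total_logins"] >= 0]  # guard against bad rows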


4 Evaluating: Testing Model Performance with Statistical Analysis

With this base model in place, I evaluated its performance. I ran basic statistical analyses, including cross-validation, a confusion matrix, and ROC curves along with the AUC (Area Under the Curve) score. These metrics provided a benchmark for how well the model was performing with the initial features (a sketch of these checks follows the list below).

  • Cross-validation: This technique was used to assess how the model would generalize to an independent dataset, ensuring that the model wasn't overfitting.

  • Confusion Matrix: Provided a detailed breakdown of the model’s predictions, showing the number of true positives, true negatives, false positives, and false negatives. This helped in understanding the types of errors the model was making and in evaluating its overall accuracy.

  • ROC Curve and AUC Score: The ROC curve allowed me to visualize the performance of the model across different thresholds, and the AUC score gave me a single metric to summarize the model’s ability to distinguish between classes.
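Putting those three checks together, this is roughly what the evaluation code looks like in Scikit-learn. I’m using synthetic stand-in data here so the sketch is self-contained:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; the real features were things like logins and games played
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cross-validation: does performance hold up across different splits of the data?
print("CV AUC:", cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean())

# Confusion matrix: true/false positives and negatives at the default 0.5 threshold
print(confusion_matrix(y_test, model.predict(X_test)))

# ROC curve and AUC: performance across all thresholds, summarized in one number
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print("Test AUC:", roc_auc_score(y_test, y_prob))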


5 Iterating: Improving the Model with Feature Engineering (and More)


With these baseline metrics established, I moved on to improving the model. To be transparent, I used a combination of Google, YouTube, and ChatGPT (for drafting code) along the way:

  • V1: Feature Engineering

    • Added features beyond raw data, such as trends, averages, and ratios.

    • Tested the effectiveness of these features using correlation analysis (see the sketch after this list).

  • V2: Hyperparameter Tuning

    • Tuned the model’s hyperparameters rather than relying on the defaults.

  • V3: Testing a New Model - Random Forest

    • Shifted to a random forest model to capture more complex patterns in the data.

    • Random forests suit classification tasks like churn prediction and are better at handling non-linearities and interactions between features.

  • V4: Threshold Adjustment, Model Tuning (Grid Search), and Handling Class Imbalance (SMOTE)

    • Adjusted the decision threshold to find the optimal balance between sensitivity and specificity.

    • Used grid search for hyperparameter optimization.

    • Addressed class imbalance using SMOTE (Synthetic Minority Over-sampling Technique).

  • V5: New Combination of Features

    • Added more features and adjusted the combination to improve the model’s accuracy.
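To illustrate the V1 feature engineering, here’s a small sketch of building average, ratio, and trend features with pandas, then checking them against the churn label with a correlation. Every number and column name here is invented:

import pandas as pd

# Invented monthly activity data: three customers over three months
df = pd.DataFrame({
    "customer_id":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "month":        [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "logins":       [10, 8, 4, 20, 22, 25, 15, 15, 16],
    "games_played": [30, 20, 10, 60, 70, 80, 45, 50, 55],
})

# Averages and totals per customer
features = df.groupby("customer_id").agg(
    avg_logins=("logins", "mean"),
    total_games=("games_played", "sum"),
)

# A ratio feature: games played per login
features["games_per_login"] = features["total_games"] / df.groupby("customer_id")["logins"].sum()

# A simple trend feature: change in logins from the first month to the last
monthly = df.sort_values("month").groupby("customer_id")["logins"]
features["login_trend"] = monthly.last() - monthly.first()

# Correlation analysis against the (invented) churn labels
labels = pd.Series([1, 0, 0], index=pd.Index([1, 2, 3], name="customer_id"), name="churned")
print(features.join(labels).corr()["churned"])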


I iterated through different versions of the model, testing each one and analyzing the results using the basic statistical techniques I mentioned in the previous section. This continued until I arrived at the final feature set.
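To show how the later versions fit together, here’s a sketch in the spirit of V3 and V4: a random forest with SMOTE and grid search wired into one pipeline, followed by a threshold adjustment. The data and parameter values are illustrative, not my production settings:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data: roughly 15% of customers churn
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# SMOTE sits inside the pipeline, so synthetic minority samples are created
# only from the training folds and never leak into the validation folds
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("forest", RandomForestClassifier(random_state=42)),
])

# Grid search over a small hyperparameter grid, scored by AUC
param_grid = {
    "forest__n_estimators": [100, 300],
    "forest__max_depth": [5, 10, None],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Threshold adjustment: flag churn above a custom probability cutoff
# (0.35 is an invented value; the point is trading sensitivity against specificity)
y_prob = search.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.35).astype(int)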


6 Deployment: Deploying the Model for Production Use 

The final step was deploying the model into production. My company uses Google Cloud Platform to store our database, so the challenge was adapting code that ran locally on my computer to run in the cloud. This involved ensuring that all the necessary Python libraries were available in the cloud environment and figuring out how to store and access the model on Google Cloud. The model feeds into a table that populates a report I built for stakeholders, letting them review the customers it returned and prioritize resources on specific customers. Although this step took longer than I expected, it was a valuable learning experience that highlighted the complexities of deploying models at scale.
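I’ll save the deployment details for Part 2, but one common pattern for storing and accessing a model on Google Cloud looks roughly like this. The bucket and object names are made up, it assumes GCP credentials are already configured, and the tiny model is just a stand-in for the real one:

import joblib
from google.cloud import storage
from sklearn.linear_model import LogisticRegression

# Stand-in for the real trained model
model = LogisticRegression().fit([[0, 1], [1, 0]], [0, 1])

# Serialize the model locally, then push it to a Cloud Storage bucket
joblib.dump(model, "churn_model.joblib")
client = storage.Client()  # assumes credentials are configured in the environment
bucket = client.bucket("example-models-bucket")
bucket.blob("churn/churn_model.joblib").upload_from_filename("churn_model.joblib")

# Later, in the scheduled cloud job: pull the model back down and load it
bucket.blob("churn/churn_model.joblib").download_to_filename("/tmp/churn_model.joblib")
model = joblib.load("/tmp/churn_model.joblib")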


7 Monitoring: Continuously Monitoring the Model 

The final stage of the project, which I’m in right now, involves continuously monitoring the model’s performance in production. As new data becomes available, I’ll retrain the model to make sure it remains accurate and effective. This ongoing process will be vital to the model’s long-term success, and I plan to provide updates on how things progress in the future.
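As one example of what that monitoring could look like, here’s a simple sketch of one approach: score the model on freshly labeled data and refit it if the AUC slips below the original benchmark. The baseline value and tolerance are placeholders, not my actual numbers:

from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.80  # placeholder for the benchmark from the original evaluation

def check_and_retrain(model, X_new, y_new, X_full, y_full, tolerance=0.05):
    """Refit the model if AUC on newly labeled data drops below the baseline."""
    current_auc = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])
    if current_auc < BASELINE_AUC - tolerance:
        model.fit(X_full, y_full)  # retrain on the full, updated dataset
    return model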


Conclusion 

While I’m not an experienced data scientist (though I want to learn more about the field), I approached this project with a desire to learn and improve. I may have made some mistakes along the way, but the model was created, passed the basic benchmarks, and is now in use. Looking back, I’m sure there are many areas where I can improve, but this was my experience, and I’m proud of what I’ve done. Remember, this blog post isn’t a guide on how to build a predictive model; it’s a reflection of my journey and learning process. I hope my experience can offer insights to others in similar situations, and I’m always open to feedback and new ideas.


