Kelly Adams

My First Predictive Model: Code (Part 2)

In the first part of this blog series (My First Predictive Model: Explanation Part 1), I explained the methodology behind my predictive churn model: the framework and the decisions that shaped the project. In this post I get into the technical details, walking through the pseudo-code to give you a clearer picture of the data manipulation, the features engineered for the prediction, and how the model is evaluated. Part 1 focuses on the 'why' and the 'how' at a high level; Part 2 covers the 'what' and the 'how-to'.


While this post explains the technical aspects, the specific code has been modified for confidentiality and privacy reasons and simplified to convey the overall ideas. It isn't an exact one-to-one copy of the real model, but the core ideas are the same. Note that this series isn't a step-by-step guide; it highlights my learning process while developing a predictive model.


Process: 

  1. Load Data - The code I used to extract the data.

  2. Train Model - The code I used to train the model.

  3. Deploy Locally - The code I used to run the model on my own machine.

  4. Deploy to Google Cloud Platform - The process I used to deploy to the cloud (doesn't include code).


Below I’ll include breakdowns of code segments with my commentary, the code itself, and, where relevant, output samples.


View all of these files in my GitHub repo: predictive-model-1.


Code

1 Load Data


The first step is to get the data I need to feed into my model. This is in the file: load_data.py

Below I import the necessary libraries and initialize the BigQuery client (aka set up the code to use BigQuery). 

# import libraries
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os

# BigQuery client
client = bigquery.Client()

Next I set variables for the date range. I use parameterized queries to prevent SQL injection (better security) and to improve reusability (it's easier to change the dates here than inside the query). Then I define the query that pulls the data from BigQuery, passing the date range as parameters to the WHERE clause.

# set variables 
start_date = '2023-01-01'
end_date = '2023-12-31'

query = """
SELECT
    user_id,
    DATE_TRUNC(last_login_date, MONTH) AS month,
    COUNT(DISTINCT session_id) AS num_logins,
    SUM(case when activity_type = 'game_play' then 1 else 0 end) AS games_played,
    SUM(case when activity_type = 'purchase' then amount else 0 end) AS total_purchases,
    MIN(last_login_date) AS first_login_date,
    MAX(last_login_date) AS last_login_date
FROM
    gaming.activity_log
WHERE
    DATE(last_login_date) BETWEEN @start AND @end
GROUP BY
    user_id, month
ORDER BY
    user_id, month
"""

# Set up query parameters
job_config = bigquery.QueryJobConfig(
    query_parameters=[
       bigquery.ScalarQueryParameter("start", "STRING", start_date),
        bigquery.ScalarQueryParameter("end", "STRING", end_date)
    ]
)

Then I run the query and save the results to a pandas DataFrame. I also print out the DataFrame (this can be done before saving to double-check the results are correct). I save this DataFrame as a CSV file locally on my machine, which I'll use to train my model.

query_job = client.query(query, job_config=job_config)
df = query_job.to_dataframe()

print(df)

# save as a CSV file
base_path = 'C:\\Users\\MyUser\\Projects\\'
filename = os.path.join(base_path, 'model_data.csv')
df.to_csv(filename, index=False)
print(f'Created new CSV File: {filename}')

Output: 

user_id    month       num_logins    games_played    total_purchases    first_login_date    last_login_date
user123    1/1/2023    15            30              200                1/2/2023            1/31/2023
user123    2/1/2023    10            20              150                2/1/2023            2/28/2023
user456    1/1/2023    5             8               50                 1/5/2023            1/25/2023
user456    2/1/2023    8             16              75                 2/3/2023            2/24/2023
user789    1/1/2023    20            40              300                1/1/2023            1/30/2023

Created new CSV File: C:\Users\MyUser\Projects\model_data.csv

Final code

# import libraries
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os

# BigQuery client
client = bigquery.Client()

# set variables 
start_date = '2023-01-01'
end_date = '2023-12-31'

query = """
SELECT
    user_id,
    DATE_TRUNC(last_login_date, MONTH) AS month,
    COUNT(DISTINCT session_id) AS num_logins,
    SUM(case when activity_type = 'game_play' then 1 else 0 end) AS games_played,
    SUM(case when activity_type = 'purchase' then amount else 0 end) AS total_purchases,
    MIN(last_login_date) AS first_login_date,
    MAX(last_login_date) AS last_login_date
FROM
    gaming.activity_log
WHERE
    DATE(last_login_date) BETWEEN @start AND @end
GROUP BY
    user_id, month
ORDER BY
    user_id, month
"""

# Set up query parameters
job_config = bigquery.QueryJobConfig(
    query_parameters=[
       bigquery.ScalarQueryParameter("start", "STRING", start_date),
        bigquery.ScalarQueryParameter("end", "STRING", end_date)
    ]
)

query_job = client.query(query, job_config=job_config)
df = query_job.to_dataframe()

print(df)

# save as a CSV file
base_path = 'C:\\Users\\MyUser\\Projects\\'
filename = os.path.join(base_path, 'model_data.csv')
df.to_csv(filename, index=False)
print(f'Created new CSV File: {filename}')

2 Train Model

Next, I use the data I extracted to train the model. This is in the file: train_model.py


First I import the following libraries: 

  • Data manipulation: pandas and numpy

  • Model training: scikit-learn (sklearn) and imblearn

  • Saving and loading files: pickle, joblib, and os

# import libraries
import pickle
import joblib
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, make_scorer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as make_imbalance_pipeline
import pandas as pd
import numpy as np
import os

Then I load the CSV file with my raw data (from the load_data.py script). I convert month from a string to datetime (to make sorting easier). Then I pivot the DataFrame to a wide format to analyze changes over time. This produces a multi-index DataFrame: each metric forms the first level of the column header, with a sub-column for each month underneath. Finally I fill the NaN values with 0 (months with no activity).

# Load data
base_path = 'C:\\Users\\MyUser\\Projects\\'
filename = os.path.join(base_path, 'model_data.csv')
df = pd.read_csv(filename)

# Convert 'month' from string to datetime format for sorting purposes
df['month'] = pd.to_datetime(df['month'])

# Pivot the DataFrame to wide format
df_wide = df.pivot_table(index='user_id', columns='month', values=['num_logins', 'games_played', 'total_purchases'], aggfunc='sum')

# Fill NaN values that occur from pivoting (no activity months will be NaN)
df_wide.fillna(0, inplace=True)

The new wide DataFrame would look like:

user_id    num_logins                 games_played               total_purchases
           2023-01-01    2023-02-01   2023-01-01    2023-02-01   2023-01-01    2023-02-01
user123    15            10           30            20           200           150
user456    5             8            8             16           50            75
user789    20            15           40            30           300           225


I calculate monthly trends and averages for each user activity to find patterns. Then I assign the churn label: users with no activity in the last month are labeled as churned. Finally, the DataFrame columns are flattened so they're easier to work with and to feed into the model.

# Calculate monthly changes (trend) and overall averages for each activity
for activity in ['num_logins', 'games_played', 'total_purchases']:
    monthly = df_wide[activity].copy()  # snapshot of the per-month columns only
    df_wide[(activity, 'trend')] = monthly.diff(axis=1).mean(axis=1)
    df_wide[(activity, 'average')] = monthly.mean(axis=1)

# Assign churn or not based on activity in the last month (assumed: no activity = churn)
last_month = df['month'].max()
df_wide['churned'] = ((df_wide[('num_logins', last_month)]
                       + df_wide[('games_played', last_month)]
                       + df_wide[('total_purchases', last_month)]) == 0).astype(int)

# Flatten the multi-index (metric, month) columns into names like 'num_logins_2023-01-01'
df_wide.columns = [
    '_'.join(level.strftime('%Y-%m-%d') if isinstance(level, pd.Timestamp) else str(level)
             for level in col).rstrip('_')
    for col in df_wide.columns
]

# Now df_wide is ready for feature selection and model training
print(df_wide.head())

Results: 

user_id    num_logins_2023-01-01    games_played_2023-01-01    total_purchases_2023-01-01    num_logins_2023-02-01    games_played_2023-02-01    total_purchases_2023-02-01
user123    15                       30                         200                           10                       20                         150
user456    5                        8                          50                            8                        16                         75
user789    20                       40                         300                           15                       30                         225


I select features related to trends and averages, split data into training and testing sets, and address class imbalance using SMOTE. A pipeline is created for scaling data and applying a random forest classifier, followed by cross-validation to evaluate the model's performance.

  • Feature Selection:

    • Trends: Average month-to-month differences in user metrics (e.g., logins, game plays, purchases) to identify increasing, decreasing, or stable patterns.

    • Averages: Mean values of user metrics across all months to gauge general usage behavior.

  • Data Splitting and Class Imbalance:

    • Data Splitting: Divides the dataset into training (80%) and testing (20%) sets, allowing for both model training and independent evaluation.

    • Class Imbalance: Applies SMOTE to the training set to correct imbalances by synthesizing new examples in the minority class, enhancing model fairness and accuracy.

  • Model Training with Pipeline:

    • Scaling: Uses StandardScaler to standardize features, ensuring all input features contribute on a comparable scale and preventing bias towards variables with larger ranges.

    • Random Forest Classifier: An ensemble method that improves prediction stability and accuracy, making it less prone to overfitting compared to individual decision trees.

# Select final features for the model; use corrected DataFrame and column names
final_feature_set = [col for col in df_wide.columns if 'trend' in col or 'average' in col]  # modify as needed
X = df_wide[final_feature_set]
y = df_wide['churned']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Address class imbalance
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Set up a pipeline with scaling and random forest
pipeline = make_imbalance_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=42)
)

# Perform cross-validation on the resampled training data
cv_scores = cross_val_score(pipeline, X_train_resampled, y_train_resampled, cv=5, scoring='roc_auc')
print(f'Cross-Validation AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}')

Output:

Cross-Validation AUC: 0.8754 ± 0.0312

Hyperparameters of the random forest classifier are tuned using GridSearchCV to optimize the ROC-AUC score. The best parameters are then applied and the pipeline is re-trained on the resampled training data to finalize it. The model's performance is then evaluated on the hold-out test set, using a threshold chosen to balance precision and recall.

  • Grid Search CV: Exhaustively searches through specified parameter values, using cross-validation to find the combination that maximizes the ROC-AUC score, which measures the model's discriminative ability.

  • Hyperparameters Tuned: Includes n_estimators (number of trees), max_depth (depth of each tree), min_samples_split (minimum samples for a split), and min_samples_leaf (minimum samples at a leaf node).

  • Best Parameters: Once the optimal parameters are found, they are applied and the model is re-trained on the resampled training data, finalizing it for deployment.

# Tune the hyperparameters using GridSearchCV
param_grid = {
    'randomforestclassifier__n_estimators': [100, 200, 300],
    'randomforestclassifier__max_depth': [None, 10, 20, 30],
    'randomforestclassifier__min_samples_split': [2, 5, 10],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train_resampled, y_train_resampled)

# Print the best parameters and the corresponding AUC score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Cross-Validation AUC: {grid_search.best_score_:.4f}')

# Re-train the pipeline on the resampled training data with the best parameters
pipeline.set_params(**grid_search.best_params_)
pipeline.fit(X_train_resampled, y_train_resampled)

# Evaluate on the hold-out test set using a threshold chosen to maximize F1
test_probs = pipeline.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, test_probs)
f1_scores = (2 * precision * recall) / (precision + recall + 1e-9)
optimal_idx = np.argmax(f1_scores[:-1])  # the last precision/recall point has no threshold
optimal_threshold = thresholds[optimal_idx]
test_predictions = (test_probs >= optimal_threshold).astype(int)
test_auc = roc_auc_score(y_test, test_probs)
print(f'Optimal Threshold: {optimal_threshold:.2f}')
print(f'Test Set AUC: {test_auc:.4f}')

Output: 

Best Parameters: {'randomforestclassifier__n_estimators': 200, 'randomforestclassifier__max_depth': 20, 'randomforestclassifier__min_samples_split': 2, 'randomforestclassifier__min_samples_leaf': 1}
Best Cross-Validation AUC: 0.8912
Optimal Threshold: 0.52
Test Set AUC: 0.8653

The trained model and the final set of features are saved locally. This ensures that both the model and its context are preserved for future use, such as deployment or further analysis.

# Save the finalized model
model_filename = os.path.join(base_path, 'model_rf.pkl')
joblib.dump(pipeline, model_filename)

# Save the feature set used
feature_names_filename = os.path.join(base_path, 'feature_names_model_rf.pkl')
with open(feature_names_filename, 'wb') as f:
    pickle.dump(final_feature_set, f)

print(f'Model and feature names saved as {model_filename} and {feature_names_filename}.')

Output:

Model and feature names saved as C:\Users\MyUser\Projects\model_rf.pkl and C:\Users\MyUser\Projects\feature_names_model_rf.pkl

Final code

# import libraries
import pickle
import joblib
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, make_scorer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as make_imbalance_pipeline
import pandas as pd
import numpy as np
import os

# Load data
base_path = 'C:\\Users\\MyUser\\Projects\\'
filename = os.path.join(base_path, 'model_data.csv')
df = pd.read_csv(filename)

# Convert 'month' from string to datetime format for sorting purposes
df['month'] = pd.to_datetime(df['month'])

# Pivot the DataFrame to wide format
df_wide = df.pivot_table(index='user_id', columns='month', values=['num_logins', 'games_played', 'total_purchases'], aggfunc='sum')

# Fill NaN values that occur from pivoting (no activity months will be NaN)
df_wide.fillna(0, inplace=True)

# Create trend and average features: monthly changes (trend) and overall averages for each activity
for activity in ['num_logins', 'games_played', 'total_purchases']:
    monthly = df_wide[activity].copy()  # snapshot of the per-month columns only
    df_wide[(activity, 'trend')] = monthly.diff(axis=1).mean(axis=1)
    df_wide[(activity, 'average')] = monthly.mean(axis=1)

# Assign churn or not based on activity in the last month (assumed: no activity = churn)
last_month = df['month'].max()
df_wide['churned'] = ((df_wide[('num_logins', last_month)]
                       + df_wide[('games_played', last_month)]
                       + df_wide[('total_purchases', last_month)]) == 0).astype(int)

# Flatten the multi-index (metric, month) columns into names like 'num_logins_2023-01-01'
df_wide.columns = [
    '_'.join(level.strftime('%Y-%m-%d') if isinstance(level, pd.Timestamp) else str(level)
             for level in col).rstrip('_')
    for col in df_wide.columns
]

# Now df_wide is ready for feature selection and model training
print(df_wide.head())

# Model

# Select final features for the model; use corrected DataFrame and column names
final_feature_set = [col for col in df_wide.columns if 'trend' in col or 'average' in col]  # modify as needed
X = df_wide[final_feature_set]
y = df_wide['churned']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Address class imbalance
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Set up a pipeline with scaling and random forest
pipeline = make_imbalance_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=42)
)

# Perform cross-validation on the resampled training data
cv_scores = cross_val_score(pipeline, X_train_resampled, y_train_resampled, cv=5, scoring='roc_auc')
print(f'Cross-Validation AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}')

# Tune the hyperparameters using GridSearchCV
param_grid = {
    'randomforestclassifier__n_estimators': [100, 200, 300],
    'randomforestclassifier__max_depth': [None, 10, 20, 30],
    'randomforestclassifier__min_samples_split': [2, 5, 10],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train_resampled, y_train_resampled)

# Print the best parameters and the corresponding AUC score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Cross-Validation AUC: {grid_search.best_score_:.4f}')

# Re-train the pipeline on the resampled training data with the best parameters
pipeline.set_params(**grid_search.best_params_)
pipeline.fit(X_train_resampled, y_train_resampled)

# Evaluate on the hold-out test set using a threshold chosen to maximize F1
test_probs = pipeline.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, test_probs)
f1_scores = (2 * precision * recall) / (precision + recall + 1e-9)
optimal_idx = np.argmax(f1_scores[:-1])  # the last precision/recall point has no threshold
optimal_threshold = thresholds[optimal_idx]
test_predictions = (test_probs >= optimal_threshold).astype(int)
test_auc = roc_auc_score(y_test, test_probs)
print(f'Optimal Threshold: {optimal_threshold:.2f}')
print(f'Test Set AUC: {test_auc:.4f}')

# Save the finalized model
model_filename = os.path.join(base_path, 'model_rf.pkl')
joblib.dump(pipeline, model_filename)

# Save the feature set used
feature_names_filename = os.path.join(base_path, 'feature_names_model_rf.pkl')
with open(feature_names_filename, 'wb') as f:
    pickle.dump(final_feature_set, f)

print(f'Model and feature names saved as {model_filename} and {feature_names_filename}.')

3 Deploy Locally 

This code runs locally on my computer. It loads the data, cleans it, uses the model saved above, and returns the users predicted to be likely to churn. With this I could run the model manually. This is in the file: deploy.py


First, import all necessary libraries for handling data, making queries, and loading models.

import pandas as pd
import os
from google.cloud import bigquery
import joblib
from datetime import datetime, timedelta

Define the paths for the model and output results, and initialize the BigQuery client.

base_path = 'C:\\Users\\MyUser\\Projects\\'
model_path = os.path.join(base_path, 'model_rf.pkl')
results_path = os.path.join(base_path, 'predictions.csv')
client = bigquery.Client()

Calculate the date range for the most recent complete month of data.

today = datetime.today()
first_of_this_month = today.replace(day=1)
last_month = first_of_this_month - timedelta(days=1)
start_date = last_month.replace(day=1).strftime('%Y-%m-%d')
end_date = last_month.strftime('%Y-%m-%d')

Run a SQL query to retrieve user activity from BigQuery within the specified date range.

query = """
SELECT
    user_id,
    DATE_TRUNC(last_login_date, MONTH) AS month,
    COUNT(DISTINCT session_id) AS num_logins,
    SUM(case when activity_type = 'game_play' then 1 else 0 end) AS games_played,
    SUM(case when activity_type = 'purchase' then amount else 0 end) AS total_purchases,
    MIN(last_login_date) AS first_login_date,
    MAX(last_login_date) AS last_login_date
FROM
    gaming.activity_log
WHERE
    DATE(last_login_date) BETWEEN @start AND @end
GROUP BY
    user_id, month
ORDER BY
    user_id, month
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start", "STRING", start_date),
        bigquery.ScalarQueryParameter("end", "STRING", end_date)
    ]
)
query_job = client.query(query, job_config=job_config)
df = query_job.to_dataframe()

Prepare the data by converting dates, pivoting the table into the wide format, and calculating the trend and average features. Then flatten the column names so they match what the model expects.

df['month'] = pd.to_datetime(df['month'])
df_wide = df.pivot_table(index='user_id', columns='month', values=['num_logins', 'games_played', 'total_purchases'], aggfunc='sum')
df_wide.fillna(0, inplace=True)

for activity in ['num_logins', 'games_played', 'total_purchases']:
    monthly = df_wide[activity].copy()  # per-month columns only
    # With a single month of data diff() is all NaN, so fill missing trends with 0
    df_wide[(activity, 'trend')] = monthly.diff(axis=1).mean(axis=1).fillna(0)
    df_wide[(activity, 'average')] = monthly.mean(axis=1)

# Flatten the multi-index columns so the names match the training features
df_wide.columns = [
    '_'.join(level.strftime('%Y-%m-%d') if isinstance(level, pd.Timestamp) else str(level)
             for level in col).rstrip('_')
    for col in df_wide.columns
]

Load the trained model and use it to predict churn probability.

model = joblib.load(model_path)
features = [col for col in df_wide.columns if 'trend' in col or 'average' in col]
predictions = model.predict(df_wide[features])
probabilities = model.predict_proba(df_wide[features])[:, 1]
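
One thing the script above doesn't do is read back the feature_names_model_rf.pkl file saved during training. As an optional safeguard (my own addition, not part of the original script), that file can be used to force the scoring data into exactly the same columns, in the same order, that the model was trained on:

import pickle

# Load the feature names saved by train_model.py (optional safeguard, my own addition)
feature_names_filename = os.path.join(base_path, 'feature_names_model_rf.pkl')
with open(feature_names_filename, 'rb') as f:
    trained_features = pickle.load(f)

# Reindex so the columns match training exactly (same names, same order);
# any feature missing for this scoring window is filled with 0
X_score = df_wide.reindex(columns=trained_features, fill_value=0)

The predict and predict_proba calls above would then take X_score in place of df_wide[features].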

Create a DataFrame with the results; it contains the user_id, churn_prediction, and churn_probability. Save the predictions to a CSV file and print the result path.

results_df = pd.DataFrame({
    'user_id': df_wide.index,
    'churn_prediction': predictions,
    'churn_probability': probabilities
})
results_df.to_csv(results_path, index=False)
print(f'Results saved to {results_path}')

Output: 

user_id    churn_prediction    churn_probability
user123    0                   0.15
user456    1                   0.85
user789    0                   0.1


Final code

# import libraries
import pandas as pd
import os
from google.cloud import bigquery
import joblib
from datetime import datetime, timedelta

# Set variables
base_path = 'C:\\Users\\MyUser\\Projects\\'
model_path = os.path.join(base_path, 'model_rf.pkl')
results_path = os.path.join(base_path, 'predictions.csv')

# Set up BigQuery client
client = bigquery.Client()

# Calculate the date range for the most recent complete month
today = datetime.today()
first_of_this_month = today.replace(day=1)
last_month = first_of_this_month - timedelta(days=1)
start_date = last_month.replace(day=1).strftime('%Y-%m-%d')
end_date = last_month.strftime('%Y-%m-%d')

# Define the SQL query to fetch prediction data
query = """
SELECT
    user_id,
    DATE_TRUNC(last_login_date, MONTH) AS month,
    COUNT(DISTINCT session_id) AS num_logins,
    SUM(case when activity_type = 'game_play' then 1 else 0 end) AS games_played,
    SUM(case when activity_type = 'purchase' then amount else 0 end) AS total_purchases,
    MIN(last_login_date) AS first_login_date,
    MAX(last_login_date) AS last_login_date
FROM
    gaming.activity_log
WHERE
    DATE(last_login_date) BETWEEN @start AND @end
GROUP BY
    user_id, month
ORDER BY
    user_id, month
"""
# Set up query parameters
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start", "STRING", start_date),
        bigquery.ScalarQueryParameter("end", "STRING", end_date)
    ]
)

# Run the query
query_job = client.query(query, job_config=job_config)
df = query_job.to_dataframe()

# Clean and pivot data to wide format
df['month'] = pd.to_datetime(df['month'])
df_wide = df.pivot_table(index='user_id', columns='month', values=['num_logins', 'games_played', 'total_purchases'], aggfunc='sum')
df_wide.fillna(0, inplace=True)

# Add features
for activity in ['num_logins', 'games_played', 'total_purchases']:
    monthly = df_wide[activity].copy()  # per-month columns only
    # With a single month of data diff() is all NaN, so fill missing trends with 0
    df_wide[(activity, 'trend')] = monthly.diff(axis=1).mean(axis=1).fillna(0)
    df_wide[(activity, 'average')] = monthly.mean(axis=1)

# Flatten the multi-index columns so the names match the training features
df_wide.columns = [
    '_'.join(level.strftime('%Y-%m-%d') if isinstance(level, pd.Timestamp) else str(level)
             for level in col).rstrip('_')
    for col in df_wide.columns
]

# Load in the model
model = joblib.load(model_path)

# Predict using the pipeline
features = [col for col in df_wide.columns if 'trend' in col or 'average' in col]  # ensure features match training
predictions = model.predict(df_wide[features])
probabilities = model.predict_proba(df_wide[features])[:, 1]

# Save results as a dataframe
results_df = pd.DataFrame({
    'user_id': df_wide.index,
    'churn_prediction': predictions,
    'churn_probability': probabilities
})

# Save results as a CSV
results_df.to_csv(results_path, index=False)
print(f'Results saved to {results_path}')

4 Deploy to Google Cloud Platform

Why am I not including the code I used to deploy to Google Cloud Platform? After deploying this model locally, I also moved it to the cloud using a combination of Google Cloud Functions and Cloud Scheduler. However, since this approach is quite specific to our use case at work and not commonly used for deploying machine learning models, I’ve decided not to delve into those details here. This strategy, while effective for us, might not be the standard practice. I'm exploring more conventional methods of cloud deployment for machine learning models and will share those insights in future updates.
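
For anyone curious what that kind of setup looks like in general terms, here is a minimal, generic sketch of an HTTP-triggered Cloud Function entry point that a Cloud Scheduler job could call on a monthly schedule. This is not my actual deployment code: predict_churn and run_predictions are placeholder names, and run_predictions just stands in for the deploy.py steps (query BigQuery, build features, load the model, write the predictions somewhere durable).

# main.py - a generic sketch, not the production deployment
import functions_framework

def run_predictions():
    # Placeholder for the deploy.py logic: query BigQuery, engineer features,
    # load model_rf.pkl, and write the predictions to durable storage
    return 'predictions written'

@functions_framework.http
def predict_churn(request):
    """HTTP entry point that a Cloud Scheduler job calls once a month."""
    status = run_predictions()
    return status, 200

A Cloud Scheduler job with a cron schedule (for example, the first of each month) then calls the function's trigger URL, so the predictions refresh without anyone running a script by hand.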


The model outputs are loaded into a table, which then populates a report I made for stakeholders. This report lets them review the customers it returned and prioritize resources on specific customers. 
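
As a rough illustration of that last step, here is a minimal sketch of loading the predictions into a BigQuery table that a report can sit on top of. The table name gaming.churn_predictions is a placeholder, not the actual table I use.

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
results_df = pd.read_csv('C:\\Users\\MyUser\\Projects\\predictions.csv')

# Overwrite the reporting table with the latest batch of predictions
# (loading from a DataFrame requires the pyarrow package)
job = client.load_table_from_dataframe(
    results_df,
    'gaming.churn_predictions',  # placeholder dataset.table name
    job_config=bigquery.LoadJobConfig(write_disposition='WRITE_TRUNCATE'),
)
job.result()  # wait for the load job to finish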


Conclusion 

This blog series explains my journey in developing and deploying a churn prediction model. Through this project, I’ve learned about model training and adapted techniques for practical use in cloud environments. As I continue to update my methods and explore other deployment strategies, I will share insights and lessons learned from this experience, hoping to help others on their data science journey.
