
Serverless Spam Detection API: Deploying a Scikit-Learn Model with AWS Lambda and API Gateway

Last updated: 2026-05-01 · Intermediate

Overview

Spam is more than an annoyance—it's a security risk. While training a machine learning model in a notebook is straightforward, the real challenge is deploying it as a scalable, production-ready service. This guide walks you through building an end-to-end serverless spam classifier using Scikit-learn for model development and AWS Lambda, S3, and API Gateway for deployment. The result is a lightweight, cost-efficient API that classifies messages in real time. The system is modular: you can retrain the model independently without affecting the live API.

Source: www.freecodecamp.org

Prerequisites

Skills & Tools

  • Python proficiency: Basic knowledge of Python and machine learning concepts (classification).
  • AWS account: Access with permissions to create Lambda functions, S3 buckets, and API Gateway resources.
  • Local environment: Python 3.11 installed with scikit-learn, pandas, and joblib.
  • AWS CLI: Configured on your machine for uploading files.
  • (Optional) Hugging Face account: You can download a pre-trained model from my repository.

Step-by-Step Instructions

1. Building the Model: The Brain

The classifier uses supervised learning. Instead of hardcoding spam rules, the algorithm learns patterns from labeled data.

1.1 Vectorization: Converting Text to Numbers

Models cannot read raw text. We use TF-IDF (Term Frequency–Inverse Document Frequency) to transform email content into numerical vectors.

from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

The TF-IDF formula assigns a weight to each word:

w(i,j) = tf(i,j) * log(N / df(i))

  • w(i,j): final importance score of word i in document j
  • tf(i,j): how often word i appears in document j
  • N: total number of documents
  • df(i): number of documents containing word i
  • log(N/df(i)): penalty for common words like “the” or “is”
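To make the formula concrete, here is a minimal hand computation of the weight for a toy corpus (plain Python, no scikit-learn; note that `TfidfVectorizer` applies a smoothed variant of this formula plus normalization, so its numbers will differ slightly):

```python
import math

# Toy corpus of four "documents"
docs = [
    "win a free prize now",
    "meeting at noon tomorrow",
    "free entry win cash",
    "lunch tomorrow with the team",
]

def tfidf_weight(word, doc_index):
    words = docs[doc_index].split()
    tf = words.count(word)                      # tf(i,j): count in document j
    df = sum(word in d.split() for d in docs)   # df(i): documents containing word i
    return tf * math.log(len(docs) / df)        # w(i,j) = tf(i,j) * log(N / df(i))

# "free" appears in 2 of 4 documents, so idf = log(4/2) = log 2
print(round(tfidf_weight("free", 0), 4))  # 0.6931
print(tfidf_weight("free", 1))            # 0.0 (word absent, so tf = 0)
```

A rare word like "prize" gets a higher idf than "free", which is exactly the down-weighting of common terms the formula is designed for.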

1.2 Training and Saving the Model

We use a Logistic Regression classifier (or other algorithm) and save both the vectorizer and the trained model with joblib.

from sklearn.linear_model import LogisticRegression
import joblib

model = LogisticRegression()
model.fit(X_train_features, y_train)

# Save artifacts
joblib.dump(model, 'spam_classifier.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')
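Before deploying, it's worth sanity-checking that the saved artifacts round-trip correctly. The sketch below uses a tiny hypothetical dataset (label 1 = spam, 0 = ham) to train, save, reload, and predict, mirroring what the Lambda function will do:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset (label 1 = spam, 0 = ham)
X_train = [
    "win a free prize now", "claim your cash reward", "free entry win big",
    "meeting at noon tomorrow", "lunch with the team", "see you at the office",
]
y_train = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
model = LogisticRegression()
model.fit(vectorizer.fit_transform(X_train), y_train)

joblib.dump(model, 'spam_classifier.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

# Reload and predict, exactly as the Lambda function will
model2 = joblib.load('spam_classifier.pkl')
vec2 = joblib.load('vectorizer.pkl')
pred = model2.predict(vec2.transform(["free cash prize"]))[0]
print('spam' if pred == 1 else 'ham')
```

If the reloaded model classifies an obvious spam phrase correctly, the artifacts are safe to upload.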

2. Deploying the Model to AWS

2.1 Upload to S3

Create an S3 bucket (e.g., spam-classifier-models) and upload both .pkl files.

aws s3 mb s3://spam-classifier-models
aws s3 cp spam_classifier.pkl s3://spam-classifier-models/
aws s3 cp vectorizer.pkl s3://spam-classifier-models/

2.2 Create the Lambda Function

Write a Lambda function that loads the model and vectorizer from S3, processes incoming text, and returns a prediction.

import json
import boto3
import joblib
import os

s3 = boto3.client('s3')
BUCKET = 'spam-classifier-models'

def load_model():
    model_path = '/tmp/spam_classifier.pkl'
    vec_path = '/tmp/vectorizer.pkl'
    if not os.path.exists(model_path):
        s3.download_file(BUCKET, 'spam_classifier.pkl', model_path)
        s3.download_file(BUCKET, 'vectorizer.pkl', vec_path)
    model = joblib.load(model_path)
    vectorizer = joblib.load(vec_path)
    return model, vectorizer

# Load once at import time so warm invocations reuse the cached objects
model, vectorizer = load_model()

def lambda_handler(event, context):
    body = json.loads(event['body'])
    text = body['message']
    features = vectorizer.transform([text])
    prediction = model.predict(features)[0]
    label = 'spam' if prediction == 1 else 'ham'
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': label})
    }

2.3 Set Up API Gateway

Create a REST API in API Gateway with a POST method that triggers the Lambda function. Deploy the API to a stage (e.g., prod). Note the endpoint URL.


3. Testing the API

Once the API is deployed, send a request to the endpoint with a short client script:

import requests

url = 'https://your-api-id.execute-api.region.amazonaws.com/prod/classify'
response = requests.post(url, json={'message': 'Congratulations! You won a free iPhone!'})
print(response.json())  # {'prediction': 'spam'}

You can also run the entire pipeline locally by loading the model files and calling the transformation directly.
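For a quick offline check of the handler logic itself, you can stub out the model and vectorizer and feed the handler a fake API Gateway event. The stubs below are stand-ins for the real artifacts (the keyword rule is purely illustrative):

```python
import json

class StubVectorizer:
    def transform(self, texts):
        return texts  # pass-through stand-in for TF-IDF features

class StubModel:
    def predict(self, features):
        # Crude keyword rule standing in for the trained classifier
        return [1 if 'free' in features[0].lower() else 0]

vectorizer = StubVectorizer()
model = StubModel()

def lambda_handler(event, context):
    # Same logic as the deployed handler
    body = json.loads(event['body'])
    features = vectorizer.transform([body['message']])
    prediction = model.predict(features)[0]
    label = 'spam' if prediction == 1 else 'ham'
    return {'statusCode': 200, 'body': json.dumps({'prediction': label})}

# Fake API Gateway proxy event
event = {'body': json.dumps({'message': 'Congratulations! You won a free iPhone!'})}
print(lambda_handler(event, None))  # {'statusCode': 200, 'body': '{"prediction": "spam"}'}
```

This verifies the request parsing and response shape without touching AWS at all.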

Common Mistakes

  • Library version mismatches: The scikit-learn version used to train the model must match the version available to the Lambda function (typically supplied via a Lambda layer or container image), and the Python versions should match as well; otherwise joblib may fail to load the pickled artifacts.
  • Large model size: Lambda has a deployment package limit of 250 MB compressed. If your pickled files are large, consider using S3 with /tmp caching (as shown above) instead of bundling them in the zip.
  • Cold start delays: Downloading the models from S3 on every invocation adds latency. Load them once at module level so warm invocations reuse the cached objects, and consider provisioned concurrency if cold starts are still too slow.
  • IAM permission errors: Ensure your Lambda role has s3:GetObject permission on the bucket containing the models, and that API Gateway has permission to invoke the Lambda function.
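For the S3 permission, a minimal policy attached to the Lambda execution role might look like this (a sketch, assuming the bucket name from earlier):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::spam-classifier-models/*"
    }
  ]
}
```

Scoping the resource to the single bucket prefix follows least privilege: the function can read the model files and nothing else.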

Summary

You now have a serverless spam classifier API that can scale from zero to thousands of requests without managing servers. The modular design allows you to retrain the model offline and update it in S3 without touching the API. This approach bridges the gap between ML experimentation and production deployment, making it easy to detect phishing attempts or spam messages in real time. For further reading, check the references on AWS Lambda best practices and scikit-learn model persistence.