lamech.dev

South Broward Football - Serverless Stat Tracking Pipeline

A serverless data pipeline built on AWS that tracked play-by-play statistics for 60+ student-athletes across a full varsity football season.

Tech Stack: Python · AWS Lambda · PostgreSQL (RDS) · Amazon S3 · EC2 · VPC (Private + Public Subnets) · Streamlit · AWS QuickSight

🔗 Live Dashboard · Backend Repo · Dashboard Repo

Architecture

Data Flow: Google Sheets → S3 Bucket (Raw Directory) → AWS Lambda (ETL) → PostgreSQL on RDS (Private Subnet) → SQL Aggregation View → S3 Bucket (Analytics Directory) → Streamlit Dashboard

[Architecture diagram]

The Problem

Most high school football stats are taken by hand: a coach on the sideline with a manila folder or a clipboard, tallying marks during a fast-paced game. The data is rarely accurate, almost never stored properly, and completely inaccessible once the season ends. I played high school football years ago, and it was the same way back then. That shouldn't still be a problem.

So I built a system to fix it.

What I Built

Import Modules

  • Roster Import - Ingests player data into the database with an auto-generated Player ID system. IDs follow a deterministic format: first four letters of last name + first two of first name + sequence number. So John Williams becomes WILLJO01. If a second John Williams joins the roster, it automatically increments to WILLJO02. This kept player references unique and consistent across every game file all season.
  • Schedule Import - Loads the full season schedule into the database, automatically registering opponent teams and creating game entries for future stat tracking.
  • Game Stats Import - The weekly workhorse. The Lambda parses the filename itself to extract context: for example, 10_South_Broward_GameStats.xlsx tells the system it's Week 10, the team is South Broward, and the file type is game stats. The week number maps back to the schedule import to automatically identify the opponent. Play-by-play data captured from the sideline is ingested using an atomic event model, one row per play. It covers rushing, passing, receiving, tackles, sacks, TFLs, returns, and special teams.
  • Database Migration - Initializes the full schema. Run once before anything else.

The first three modules run as AWS Lambda functions triggered by file uploads to the S3 raw bucket.
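The Player ID scheme from the roster import can be sketched in a few lines of Python. The function name and signature here are illustrative, not the actual implementation:

```python
import re
from itertools import count

def make_player_id(last_name: str, first_name: str, existing_ids: set) -> str:
    """Deterministic player ID: first four letters of the last name plus the
    first two of the first name, uppercased, with a two-digit sequence number
    that increments when the stem is already taken."""
    clean = lambda s: re.sub(r"[^A-Za-z]", "", s)
    stem = (clean(last_name)[:4] + clean(first_name)[:2]).upper()
    for n in count(1):
        candidate = f"{stem}{n:02d}"
        if candidate not in existing_ids:
            return candidate

# Example: two players named John Williams on the same roster
ids = set()
for _ in range(2):
    ids.add(make_player_id("Williams", "John", ids))
# ids == {"WILLJO01", "WILLJO02"}
```

Because the ID is derived purely from the name plus a counter, re-running an import produces the same references every time, which is what keeps game files consistent across the season.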

Database

PostgreSQL hosted on Amazon RDS inside a private subnet, so it is never publicly accessible. Query access runs through an EC2 bastion host in a public subnet, connecting via DBeaver for validation and troubleshooting.

The database stores roster data, schedule data, and every individual play logged across the season.

SQL Aggregation View

A SQL view that transforms the raw row-by-row play data into a per-player season summary: totals by week and cumulative season stats. This view generates the season_summary.csv that gets exported to the S3 analytics bucket.
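In spirit, the view's rollup looks like this in plain Python. The column names (player_id, stat, value, week) are placeholders, since the real schema isn't shown here:

```python
from collections import defaultdict

def season_summary(plays):
    """Roll atomic play rows (one dict per play) up into per-player,
    per-stat season totals, mirroring what the SQL view computes."""
    totals = defaultdict(lambda: defaultdict(int))
    for play in plays:
        totals[play["player_id"]][play["stat"]] += play["value"]
    return {pid: dict(stats) for pid, stats in totals.items()}

plays = [
    {"player_id": "WILLJO01", "week": 1, "stat": "rush_yards", "value": 12},
    {"player_id": "WILLJO01", "week": 1, "stat": "rush_yards", "value": 7},
    {"player_id": "WILLJO01", "week": 2, "stat": "tackles", "value": 1},
]
# season_summary(plays)["WILLJO01"] == {"rush_yards": 19, "tackles": 1}
```

Keeping the source data atomic means this aggregation is a pure GROUP BY, which is exactly why it could live in a SQL view rather than application code.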

Streamlit Dashboard

  • A public-facing dashboard that reads directly from season_summary.csv in the analytics directory. Players checked their own stats throughout the season, and the data fed directly into MaxPreps submissions for recruiting visibility.

Data Templates

Standardized Google Sheets templates for roster, schedule, and game stats, all stored in the backend repo. Strict filename validation on every Lambda prevents bad data from reaching the database.

Backup

All season data is backed up in Parquet format for long-term storage and future analysis.

Dashboard

The Streamlit dashboard gave players a way to look up their own stats throughout the season. It featured a leaderboard, individual player view with weekly and season totals, and game-by-game summaries covering rushing, passing, receiving, tackles, sacks, and special teams.

Dashboard - Home Page
Home Page: Season leaderboard at a glance. Navigation links to weekly totals and individual player views.
Dashboard - Weekly View
Weekly Totals: Week-by-week stat breakdown with sortable columns, allowing users to filter leaders by any statistic.
Dashboard - Player Detail View
Player Detail: Individual player view showing a full statistical profile across the season.

How It Worked in Production

Every Friday night I tracked plays live from the sideline on an iPad. After each game the workflow was the same: upload the stats file to the S3 raw directory, let Lambda parse the filename and load the data into Postgres, validate in DBeaver through the bastion host, refresh the season summary view, and export the CSV to the analytics directory, where the Streamlit dashboard picks it up. Final numbers were also pushed to MaxPreps for recruiting visibility.

The system ran reliably for a full 10+ game season with zero missed weeks.

Key Engineering Decisions

Atomic event model - Every play logged as its own row. This kept data clean and made aggregation predictable across the entire season.

Deterministic Player IDs - Auto-generated IDs (WILLJO01) with automatic increment for duplicates (WILLJO02) meant no manual ID management and consistent, unique references across every file upload all season.

Private subnet + bastion host - The database was never publicly accessible. All query access routed through an EC2 bastion on a public subnet, following AWS security best practices.

Directory-based S3 organization - Single bucket with separate directories for raw uploads and processed analytics outputs. Clean separation between incoming data and dashboard-ready exports without the overhead of managing multiple buckets.

Filename-driven routing - Instead of requiring metadata in a separate config or manual input, the Lambda extracts everything it needs from the filename itself. 10_South_Broward_GameStats.xlsx tells the system the week, the team, and the data type: no manual configuration per upload.

Strict filename validation - Each Lambda enforced naming conventions on uploaded files before processing, preventing bad data from ever reaching the database.
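A minimal sketch of the routing-plus-validation step, assuming a regex-based convention. The pattern, function name, and rejection behavior are my own illustration, not the production code:

```python
import re

# Naming convention enforced before any processing:
# <week>_<Team_Name>_<FileType>.xlsx
FILENAME_RE = re.compile(r"^(\d{1,2})_([A-Za-z_]+)_(GameStats|Roster|Schedule)\.xlsx$")

def parse_upload(key: str) -> dict:
    """Extract routing context from an uploaded object key,
    or reject the file before it can reach the database."""
    m = FILENAME_RE.match(key)
    if m is None:
        raise ValueError(f"rejected upload: {key!r} does not match naming convention")
    week, team, kind = m.groups()
    return {"week": int(week), "team": team.replace("_", " "), "type": kind}

# parse_upload("10_South_Broward_GameStats.xlsx")
# → {"week": 10, "team": "South Broward", "type": "GameStats"}
```

Failing fast on the filename means a mistyped upload is a visible error rather than a silent row of bad data.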

SQL first analytics - Instead of building transformation logic in Python, I used a SQL view to handle all aggregation. Simpler, faster, easier to maintain.

Dashboard isolation - Streamlit reads from a CSV in S3, not directly from the database. A public-facing dashboard had no business connecting to my RDS instance. The CSV handoff was intentional: it kept the database secure behind the private subnet.
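On the dashboard side, rendering a leaderboard from the exported file reduces to a sort over CSV rows. A self-contained sketch with made-up column names, since the actual season_summary.csv schema isn't shown here:

```python
import csv
import io

def leaderboard(csv_text: str, stat: str, top: int = 3):
    """Rank players by a single stat column from the exported summary CSV.
    Column names ('player' plus stat columns) are illustrative."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: int(r[stat]), reverse=True)
    return [(r["player"], int(r[stat])) for r in rows[:top]]

sample = """player,tackles,rush_yards
WILLJO01,42,310
SMITJA01,57,12
BROWMI01,18,540
"""
# leaderboard(sample, "tackles")
# → [('SMITJA01', 57), ('WILLJO01', 42), ('BROWMI01', 18)]
```

Because the dashboard only ever touches this flat file, a bug or a traffic spike on the public side can never reach the database.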

Post Season Analysis

After the season ended, I connected the data to AWS QuickSight to dig deeper into performance trends. Three metrics were tracked week over week: total tackles, points scored, and the final ranking of each opponent.

The data revealed a clear pattern: the two lowest weeks for tackles and scoring both came against the highest-ranked opponents on the schedule. The players were performing with visibly less effort in games where they expected to lose. The data made it measurable.

This is the kind of insight that gives a coaching staff something concrete to address in the offseason, not just "we lost," but "here's where and why effort dropped, and here's the proof."

Weekly Total Tackles
Total team tackles on defense, by week.
Point Difference By Week
Point differential by week (South Broward minus opponent).
Opponents' Final MaxPreps Ranking
Each opponent's end-of-season ranking, as determined by MaxPreps.

What I'd Improve

Automate the season summary export. The CSV export from the SQL view was manual every week. An EventBridge trigger or Lambda automation would have saved time and reduced a repeated manual step.

Start with React instead of Streamlit. Streamlit worked, but a React-based dashboard would have given me more control over the UI and a better experience for players viewing their stats.

Build a better data entry interface. Tracking plays by tapping into a Google Sheet on the sideline was functional but clunky during fast-paced games. A purpose-built input GUI would have been faster and less error-prone.

Plan for weather. It rained a lot during the season. I had a weatherproof iPad case, but the overall workflow wasn't designed with bad conditions in mind. Field-side data collection needs to account for that.

Skip Postgres, use DynamoDB instead. The RDS Postgres instance was expensive and overengineered for what was essentially key-value lookups and weekly stat inserts. A DynamoDB table would have been cheaper, serverless, and simpler to maintain. I wasn't comfortable with DynamoDB at the time, but knowing what I know now, it's the right call, which is exactly why I'm using it for Phase 2.

Video Walkthrough

(embed Loom link here)

Blog Post

What High School Football Stats Revealed About Effort and Scheduling

A post season deep dive into effort stats, scheduling impact, and what the data revealed about how mid-level programs respond to back-to-back top opponents.

Read on Medium
