How It Works | Sibyls Edge

The Core Idea

Most sports bettors lose — not because they're bad at sports, but because they're fighting an opponent (the sportsbook) that has priced every game using its own statistical models. Beating a sportsbook long-term requires finding games where your probability estimate is more accurate than theirs.

Sibyls Edge builds machine learning models trained on thousands of historical games. These models output a win probability for each team. When our probability is meaningfully higher than the probability implied by the sportsbook's odds, we flag the game as an edge. The bet is then sized in proportion to that edge using the Kelly Criterion, a formula that maximizes long-run bankroll growth when the probabilities fed into it are accurate.

This is the same framework used by professional sports bettors, quantitative hedge funds, and casino advantage players. The only thing that separates a losing strategy from a winning one is the accuracy of the underlying probability model.

What Goes Into the Model

Every prediction starts with data. We collect, clean, and process several categories of information for each sport:

Team Performance

Season-level offensive and defensive efficiency metrics. Not just wins and losses — the underlying statistics that predict future outcomes better than records do.

Strength of Schedule

We compute each team's adjusted rating accounting for who they beat and who beat them. A 10-win team that beat bad opponents is rated very differently from one that beat good opponents.

Recent Form

Rolling performance windows over recent games — because a team's form over the last 5–10 games often predicts this week's result better than their full-season average.
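As a rough illustration of a rolling form window, the sketch below computes a team's win rate over its previous few games. The function name `rolling_form`, the window length, and the sample results are assumptions made for this example, not the features the production model uses.

```python
from collections import deque


def rolling_form(results, window=5):
    """Win rate over the previous `window` games, one value per game.

    `results` is a chronological list of 1 (win) / 0 (loss). The value for
    game i uses only games played before i, so the feature never includes
    the outcome it is meant to predict. Games with no history get None.
    """
    recent = deque(maxlen=window)  # automatically drops the oldest game
    form = []
    for outcome in results:
        form.append(sum(recent) / len(recent) if recent else None)
        recent.append(outcome)
    return form


# Made-up sequence of results for one team.
season = [1, 0, 1, 1, 1, 0, 0, 1]
form = rolling_form(season, window=5)
```

Computing the window strictly from earlier games is the design point here: a form feature that includes the current game would leak the answer into training.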

Contextual Factors

Rest schedules, travel distances, back-to-back games, injury reports, and home/away splits. Sports outcomes are affected by more than talent alone.

Historical Game Logs

Up to 25+ years of game-by-game results per sport (coverage varies; see the Training Data column under Model Performance by Sport). This is the foundation the model learns from. Patterns that held across decades are far more reliable than recent trends alone.

Market Odds

Real-time lines from licensed sportsbooks via The Odds API. We compare our model's probability against the implied market probability to identify edges.

What we don't publish: The specific combination of features, their relative weights, and the exact formulas we use to compute derived statistics are proprietary. Publishing them would allow others to replicate, and eventually arbitrage away, the edge we've built. Think of it like a poker player who doesn't reveal exactly how they read opponents' hands.

How the Model Learns

We use an ensemble machine learning approach: multiple model types trained on the same data, whose predictions are blended together. Ensembles often outperform any single model because each component makes different errors, and those errors partially cancel out when averaged.

Feature Engineering

Raw statistics (points scored, yards gained, etc.) are transformed into differentials and ratios — the form that best captures the relative quality gap between two teams before they play.
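A minimal sketch of the differential-and-ratio idea, assuming each team is described by a dictionary of season statistics. The stat names and values are invented for illustration; the actual derived statistics are not published.

```python
def matchup_features(home, away):
    """Turn two teams' raw stats into home-minus-away differentials and ratios."""
    features = {}
    for stat in home:
        features[f"{stat}_diff"] = home[stat] - away[stat]
        # A ratio is only defined when the denominator is non-zero.
        if away[stat] != 0:
            features[f"{stat}_ratio"] = home[stat] / away[stat]
    return features


# Hypothetical season averages for two teams.
home = {"points_per_game": 27.0, "yards_per_play": 6.0}
away = {"points_per_game": 22.5, "yards_per_play": 5.0}
feats = matchup_features(home, away)
```

The resulting features describe the gap between the two teams rather than either team in isolation, which is the quantity a matchup model needs.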

Gradient Boosting Models

The primary model type. Gradient boosting builds decision trees sequentially, each one correcting the errors of the last. It excels at capturing non-linear relationships — for example, the interaction between a team's rest days and their road travel distance.
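To make "each one correcting the errors of the last" concrete, here is a toy gradient booster in plain Python. It is a sketch under simplifying assumptions: one feature, single-split trees (stumps), and squared-error loss, whereas libraries such as XGBoost typically optimize log loss for classification. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on one feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # last value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each to the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Base prediction plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mean_squared_error(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: one made-up feature (say, a rating differential) and win/loss labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
baseline_model = boost(xs, ys, rounds=0)   # predicts the average win rate only
boosted_model = boost(xs, ys, rounds=50)
```

Comparing `mean_squared_error(boosted_model, xs, ys)` with the baseline shows the training error shrinking as stumps accumulate. Training error alone says nothing about out-of-sample performance.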

Regularized Logistic Regression (Baseline)

A deliberately simple model included in every ensemble. When the linear model matches or beats the complex one, it tells us the signal in the data is strong and clean. It also provides a stable floor that prevents the ensemble from overreacting to noise.

Probability Calibration

Raw model scores are transformed into calibrated probabilities using isotonic regression, a technique that maps the model's output to observed win rates. This step is critical for Kelly sizing: a model that says 65% when the real rate is 58% will systematically overbet and can go broke.
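Isotonic regression can be sketched with the pool-adjacent-violators algorithm. The version below is an illustrative step-function fit in plain Python, assuming distinct scores; library implementations (for example scikit-learn's `IsotonicRegression`) also interpolate between points. The scores and outcomes are made up.

```python
def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators: fit a non-decreasing map from score to win rate.

    Returns a list of (highest score in block, observed win rate) steps.
    """
    blocks = []  # each block: [sum of outcomes, count, highest score]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([outcome, 1, score])
        # Merge backwards while an earlier block has a higher win rate
        # than the block after it (a monotonicity violation).
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def calibrate(steps, score):
    """Look up the win rate of the first block whose range covers `score`."""
    for top, rate in steps:
        if score <= top:
            return rate
    return steps[-1][1]  # scores above the training range use the last block


# Hypothetical raw model scores and actual results (1 = win).
scores = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
outcomes = [0, 1, 0, 1, 0, 1, 1, 1]
steps = fit_isotonic(scores, outcomes)
calibrated_065 = calibrate(steps, 0.65)
```

In this toy data the raw score of 0.65 maps to an observed win rate of 0.5, the kind of gap that would inflate a Kelly stake if left uncorrected.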

Ensemble Blending

The calibrated outputs of each model are combined using weighted averaging. Weights are determined empirically — models that perform better in validation receive higher weight.
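Weighted averaging itself is simple. The sketch below assumes three calibrated model outputs for one game and invented weights; the real blend ratios are proprietary and are not reproduced here.

```python
def blend(probabilities, weights):
    """Weighted average of calibrated probabilities (weights need not sum to 1)."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


# Hypothetical calibrated outputs from three models and made-up weights.
model_probs = [0.62, 0.58, 0.66]
validation_weights = [0.5, 0.2, 0.3]
blended = blend(model_probs, validation_weights)
```

Dividing by the total weight keeps the result a valid probability even if the weights are later changed and no longer sum to one.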

How We Know It's Actually Working

This is the part most prediction services skip. Showing a model's performance on data it was trained on is meaningless — of course it looks good on what it already saw. The only test that matters is performance on data the model has never seen.

We use walk-forward cross-validation: the model is trained exclusively on past seasons, then tested on the immediately following season it has never seen. This is repeated across 20 different training/test splits spanning the full history of each sport. The reported metrics are averages across all 20 out-of-sample test periods.

Example: NFL Walk-Forward Validation

Round 1: train on 2000–2004, test on 2005.
Round 2: train on 2000–2005, test on 2006.
…
Round 20: train on 2000–2023, test on 2024.
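The expanding-window schedule above can be generated with a few lines of Python. This is a sketch of the split logic only; `walk_forward_splits` and its parameters are names chosen for this example.

```python
def walk_forward_splits(first_season, last_season, min_train_seasons):
    """Yield (train_seasons, test_season) pairs with an expanding training window."""
    first_test = first_season + min_train_seasons
    for test_season in range(first_test, last_season + 1):
        # Train on every season strictly before the test season.
        train_seasons = list(range(first_season, test_season))
        yield train_seasons, test_season


splits = list(walk_forward_splits(2000, 2024, min_train_seasons=5))
```

With seasons 2000 through 2024 and a five-season minimum, this yields 20 rounds, and no test season ever appears in its own training set.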

The final reported AUC of 0.9447 is the average performance across all 20 test seasons — none of which the model saw during training. This is the number that matters.

AUC-ROC (area under the receiver operating characteristic curve) is our primary accuracy metric. It measures the model's ability to rank a winning team above a losing team across all possible probability thresholds; equivalently, it is the chance that a randomly chosen winner was given a higher probability than a randomly chosen loser. A random coin flip scores 0.50. A perfect model scores 1.00. Real-world predictive systems in competitive markets typically range from 0.55 to 0.75.
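The ranking interpretation of AUC can be computed directly by comparing every winner against every loser. The sketch below does exactly that on made-up predictions; it is quadratic in the number of games, so it is meant for illustration rather than production use.

```python
def auc_roc(probabilities, outcomes):
    """Share of (winner, loser) pairs where the winner got the higher probability.

    Ties count as half a correct ranking.
    """
    winners = [p for p, y in zip(probabilities, outcomes) if y == 1]
    losers = [p for p, y in zip(probabilities, outcomes) if y == 0]
    correct = 0.0
    for w in winners:
        for l in losers:
            if w > l:
                correct += 1.0
            elif w == l:
                correct += 0.5
    return correct / (len(winners) * len(losers))


# Hypothetical predicted probabilities and actual results (1 = win).
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
results = [1, 1, 0, 1, 0, 0]
auc = auc_roc(probs, results)
```

Here eight of the nine winner-loser pairs are ranked correctly, so the AUC is 8/9. A model that gives every game the same probability scores exactly 0.50.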

Model Performance by Sport

The following metrics are derived from walk-forward cross-validation on held-out seasons only. They reflect the model as of the most recent training run.

Sport	AUC-ROC	Accuracy	Training Data	AUC vs. Chance
NFL	0.9447	86.6%	2000–2024 · 25 seasons	+44.5%
NBA	0.7510	68.2%	1980–2024 · 44 seasons	+25.1%
NHL	0.6640	62.1%	2007–2024 · 17 seasons	+16.4%
MLB	0.6480	59.4%	2000–2024 · 25 seasons	+14.8%
WNBA	0.6820	63.8%	2009–2025 · 17 seasons	+18.2%
Golf (Top 10)	0.720	—	2019–2026 · 114 tournaments	+22.0%
Golf (Win)	0.760	—	2019–2026 · 114 tournaments	+26.0%
Tennis (ATP)	0.720	—	2000–2026 · 134K matches	+22.0%
Tennis (WTA)	0.704	—	2000–2026 · 134K matches	+20.4%
Futbol	N/A*	—	2015–2026 · 51K+ matches	—

Random chance baseline = 0.50 AUC / 50.0% accuracy. The AUC vs. Chance column is the AUC minus 0.50, shown in percentage points. NFL's high AUC reflects that NFL outcomes are disproportionately driven by quarterback quality, a measurable, persistent factor. MLB and NHL are the hardest sports to predict because single-game outcomes are high-variance: in baseball, team quality only separates over a ~162-game season, and hockey's low scoring lets a goal or two decide a game. WNBA's model leverages 17 seasons of game logs with recent-form and home/away splits, and includes a dedicated totals (O/U) model. Metrics are updated with each model retraining cycle.

* Futbol model is an XGBoost classifier covering EPL, Bundesliga, Ligue 1, Serie A, La Liga & MLS. AUC varies by league; out-of-sample validation in progress for the 2026 season.

Why doesn't higher AUC always mean bigger profits? A model can have high accuracy but still find few betting edges if the sportsbook's lines are already pricing the same signals correctly. Profitability depends on the gap between your model's probability and the market's implied probability — not on accuracy alone. An 86% accurate NFL model is valuable only on the games where our estimate diverges meaningfully from the market.

Specialty Sport Models

Golf, Tennis, and Futbol each require sport-specific architectures. The core principles — XGBoost, walk-forward validation, calibrated probabilities — carry over, but the feature engineering and output format are purpose-built for how each sport works.

Golf Predictions

Uses XGBoost with 25 features per player: rolling top-finish rates over 10 and 20 event windows, course-specific history, major championship history, season stats (events, cut%, top-10s, wins, scoring average), and an OWGR strength proxy. Trained on 114 PGA Tour tournaments. Produces four calibrated probabilities per player: Outright Win, Top 10, Top 20, and Top 30. The Outrights tab ranks every player in the field by their win probability — the same model, different threshold. Validated with 5-fold stratified cross-validation on 10,700 player-tournament observations.

Tennis Predictions

Two independent models, ATP and WTA, each trained on 134,000+ matches from 2000–2026 using an XGBoost + LightGBM ensemble. Features include surface-adjusted Elo ratings, head-to-head record, recent form windows, tournament level, and ranking differential. Surface (clay/hard/grass) is treated as a first-class feature: dominance on clay earns a player no credit in a hard-court match. The H2H Predictor tab lets users run any two players against each other on any surface using the same model.
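One way to make Elo surface-aware is to keep a separate rating per player and surface, as sketched below. This is an assumption about what "surface-adjusted" could look like, not the production formula; the starting rating of 1500, the K-factor of 32, and the 400-point scale are conventional Elo defaults.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo win expectancy for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_surface_elo(ratings, winner, loser, surface, k=32.0):
    """Update only the ratings for the surface the match was played on."""
    winner_rating = ratings[(winner, surface)]
    loser_rating = ratings[(loser, surface)]
    gain = k * (1.0 - expected_score(winner_rating, loser_rating))
    ratings[(winner, surface)] = winner_rating + gain
    ratings[(loser, surface)] = loser_rating - gain  # zero-sum update


# Every (player, surface) pair starts at 1500 until it has match history.
ratings = defaultdict(lambda: 1500.0)
update_surface_elo(ratings, "Player A", "Player B", "clay")
```

After one clay win between equally rated players, Player A's clay rating rises to 1516 while their hard-court rating stays at the 1500 default, which is the "no credit across surfaces" behavior described above.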

Futbol (Soccer) Predictions

Covers six leagues: EPL, Bundesliga, Ligue 1, Serie A, La Liga, and MLS. Trained on 51,000+ matches from 2015–2026. Features include recent form differentials, home/away performance splits, xG (expected goals), clean sheet rates, and league-adjusted team ratings. The model is retrained continuously as 2026 season data becomes available. Predicts moneyline (home/draw/away) probabilities; edges are calculated against Pinnacle's sharp closing lines.

From Probability to Pick

A model probability alone isn't a pick. The workflow that turns our prediction into a recommended wager has three more steps:

1. Edge Calculation

We convert the sportsbook's moneyline odds into an implied probability. If our model says 60% and the book implies 52%, the edge is 8 percentage points. Games below our minimum edge threshold are ignored regardless of the predicted winner.
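The conversion from American moneyline odds to an implied probability is standard arithmetic, sketched below. The `min_edge` value is purely hypothetical, since the real threshold is not published. The sketch also uses the raw implied probability, which still contains the bookmaker's margin.

```python
def implied_probability(moneyline):
    """Convert American moneyline odds to the probability they imply."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100.0)
    # Underdog: risk 100 to win `moneyline`.
    return 100.0 / (moneyline + 100.0)


def edge(model_probability, moneyline):
    """Model probability minus the book's implied probability."""
    return model_probability - implied_probability(moneyline)


def is_playable(model_probability, moneyline, min_edge=0.05):
    """Hypothetical filter: keep only games whose edge clears the threshold."""
    return edge(model_probability, moneyline) >= min_edge


favorite_implied = implied_probability(-150)
game_edge = edge(0.60, -108)
```

A -150 favorite implies 60%, and a model probability of 60% against a -108 line gives an edge of roughly 8 percentage points.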

2. Kelly Sizing

Bet size is calculated as a fraction of the stake the Kelly Criterion prescribes. Given accurate probabilities, the Kelly stake grows a bankroll faster in the long run than any other fixed-fraction strategy, and because it scales with both the edge and the current bankroll, it limits drawdown risk.

3. Fractional Kelly

We apply a conservative fraction of full Kelly (never full Kelly). This reduces variance and protects against model miscalibration — accepting slightly lower expected growth in exchange for significantly smoother drawdowns.
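Steps 2 and 3 can be sketched together. The Kelly formula for a simple win/lose bet is f = (b·p − q) / b, where p is the win probability, q = 1 − p, and b is the net profit per unit staked. The `kelly_multiplier` of 0.25 below is an illustrative assumption; the fraction actually applied is not published.

```python
def kelly_fraction(win_probability, decimal_odds):
    """Full-Kelly share of bankroll for a win/lose bet; 0 when there is no edge."""
    b = decimal_odds - 1.0          # net profit per unit staked on a win
    q = 1.0 - win_probability       # probability of losing the stake
    fraction = (b * win_probability - q) / b
    return max(fraction, 0.0)       # never bet when the expected value is negative


def stake(bankroll, win_probability, decimal_odds, kelly_multiplier=0.25):
    """Fractional Kelly: scale the full-Kelly stake down by a fixed multiplier."""
    return bankroll * kelly_multiplier * kelly_fraction(win_probability, decimal_odds)


full_kelly = kelly_fraction(0.60, 2.0)          # 60% model probability at even money
quarter_kelly_stake = stake(1000.0, 0.60, 2.0)  # hypothetical 1,000-unit bankroll
```

At even money with a 60% win probability, full Kelly is 20% of the bankroll, so a quarter-Kelly bettor with 1,000 units would stake 50. If the true probability were lower than the model claims, the smaller stake is what limits the damage.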

What We Don't Share — and Why

We're deliberately transparent about how the system works but protective of exactly what it uses. This isn't unusual: DraftKings, FanDuel, and every professional sports betting operation protect their models for the same reason.

Kept Proprietary

The complete feature list for each sport
Feature weights and importance rankings
Ensemble blend ratios between models
Specific derived statistics and transformation formulas
Minimum edge thresholds and Kelly fraction applied
Any post-processing or override logic

If we published the exact feature set, sophisticated users could identify which games our model is most confident on — and bet against us in cases where they believe the model is wrong. More importantly, as any signal becomes widely known, the market absorbs it and the edge disappears. Keeping the methodology proprietary is what keeps the edge alive.

What you do get is the output: calibrated win probabilities, edge estimates, and Kelly-sized picks delivered daily — built on a system whose performance is independently validated and openly disclosed on this page.

Honest Limitations

No model is perfect. Here's what ours doesn't handle well — and why that's okay to say out loud:

Breaking News

A late-breaking injury reported 30 minutes before tip-off may not be reflected in the model's features. We pull injury data daily, but real-time roster changes between updates are a known blind spot.

Low-Data Matchups

Early-season games, expansion teams, or unusual schedule circumstances have less historical context to draw from. Model confidence is lower in these cases.

High-Variance Sports

MLB and NHL have intrinsically high outcome variance. Even a team with a 65% win probability loses 35% of the time. Variance is not model error — it's the nature of the sport.

Model Drift

Sports evolve. The NBA of 2010 plays differently from the NBA of 2024. We retrain models periodically to account for this, but there will always be some lag between structural changes in a sport and the model adapting to them.

We disclose these limitations not to undermine confidence in the system, but because a user who understands them will use the picks more intelligently — and will experience fewer surprises on losing nights.
