Every year, the same questions reappear.
Key Questions
It’s not one question - it’s several, tightly connected. And behind them is a deeper one: can we turn patterns from the past into reliable guidance for the future?
That’s what this article is about.
We’re going to walk through the end-to-end process of building a seasonal prediction system - one that uses historical data, enriches it with weather context, extracts commercial patterns, and then projects forward.
The steps include:
- Capturing and cleaning order data
- Mapping each order to its weather environment
- Structuring the combined data into a daily record
- Engineering features for machine learning
- Training regression models on weather vs. performance
- Forecasting sales and orders using simulated or real weather
- Visualising predictions in a way that's clear, contextual, and actionable
You’ll see how each script, file, and dataset fits together - not in theory, but in real application.
By the end, you’ll have a complete blueprint for a predictive sales system - and the confidence to ask smarter questions about the future.
Let’s get started.
From Siloed Insight to Integrated Prediction
Many of the components involved in this process - sales data, analytics, operations, even logistics - typically live in isolation.
In larger organisations, they exist as distinct departments. Each has its own data, its own dashboards, and its own set of priorities.
But prediction doesn’t work that way.
To forecast with any real accuracy, you need to unify these streams.
Not in theory, but in practice. That means:
- Sales data that connects to customer logistics
- Traffic patterns that inform fulfilment planning
- External signals (like weather) tied to internal readiness
The goal isn’t just to model the past - it’s to create a system that reflects how your business actually functions.
That’s what we’re building: a prediction machine - one that:
- Uses past data
- Compares key commercial periods
- Identifies recurring patterns
- Connects those patterns to specific outcomes
- Forecasts future performance based on structured input
It’s not a dashboard. It’s not a report.
It’s a system.
And now that we’ve framed the challenge, let’s begin where every good prediction starts - with what we already know.
Step 1: Capture Past Sales Data
The first step is building a clean, reliable foundation.
We start by exporting every order placed between January 2020 and today - five full years of transaction history. This is done in quarterly batches from the CMS, filtered to include only successful orders. Once downloaded, we combine all files into a single CSV, resulting in a master dataset containing 48,184 unique records.
We retain the following columns:
- Order ID - to uniquely identify each transaction
- Date/Time - to anchor each order to a specific moment (critical for weather matching)
- Cart Details & Item Codes - to verify uniqueness and help classify order complexity
- Delivery Postcode - required to fetch weather data per order location
- Revenue (inc. VAT) - to calculate and compare total revenue over time
- Status - used to filter only completed, successful transactions
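The export-and-combine step might look like the following sketch. The filenames, column names, and status value are assumptions for illustration (here the quarterly batches are stubbed as in-memory CSV text); the real exports will differ.

```python
from io import StringIO

import pandas as pd

# Hypothetical quarterly exports; real filenames and columns will differ.
q1 = StringIO("order_id,date,postcode,revenue,status\n1001,2020-01-15 10:30,SW1A 1AA,59.99,complete\n")
q2 = StringIO("order_id,date,postcode,revenue,status\n1002,2020-04-02 14:05,M1 1AE,24.50,complete\n")

# Concatenate all quarterly batches into one master frame,
# keeping only successful orders and dropping duplicate order IDs.
frames = [pd.read_csv(f, parse_dates=["date"]) for f in (q1, q2)]
orders = pd.concat(frames, ignore_index=True)
orders = orders[orders["status"] == "complete"].drop_duplicates("order_id")
orders.to_csv("master_orders.csv", index=False)
```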
However, raw exports are rarely clean.
Warning: Always validate your data exports before processing. Raw CSV files often contain formatting inconsistencies, special characters, or encoding issues that can corrupt your analysis if not handled properly.
In this case, the CSVs had occasional data spillage - typically caused by special characters (commas, quotes, etc.) in the cart detail fields. These would throw off column alignment and push data into the wrong headers.
To fix this, we built a Python script that scans for misaligned rows using known formatting patterns. When it finds errors, it realigns the columns by searching for anchors: recurring item code formats, expected field lengths, or columns that should always have the same values.
This preserves the original data while restoring structural consistency.
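A minimal sketch of that realignment idea, assuming a six-column layout and a hypothetical two-letters-four-digits item-code format as the anchor (the real script's patterns will differ):

```python
import re

EXPECTED_COLS = 6   # order_id, date, cart details, item code, postcode, revenue (assumed layout)
ITEM_CODE = re.compile(r"^[A-Z]{2}\d{4}$")   # hypothetical item-code format used as an anchor

def realign(row):
    """Re-join fields that spilled when unquoted commas appeared in cart details."""
    if len(row) <= EXPECTED_COLS:
        return row
    for i, field in enumerate(row):
        if ITEM_CODE.match(field.strip()):
            # Everything between the cart column and the item-code anchor
            # belongs to the cart details; merge it back into one field.
            merged = ",".join(row[2:i])
            return row[:2] + [merged] + row[i:]
    return row  # no anchor found; leave for manual review

broken = ["1001", "2020-01-15", "Gift box", "red, large", "AB1234", "SW1A 1AA", "59.99"]
print(realign(broken))
```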
The result is a single, validated CSV of orders.
Structured, clean, and ready for pairing with external context.
Step 2: Compare Periods With Real Weather Data
With clean sales data in place, the next layer is context - environmental conditions at the time of each transaction.
This step allows us to match what customers bought with the real-world conditions they experienced. By doing so, we open the door to understanding seasonality, demand triggers, and the external factors shaping performance.
Here’s how the process works:
- Postcode Validation - First, every delivery postcode is checked against a UK postcode database to ensure it’s valid and mappable.
- Coordinate Lookup - Each postcode is converted to latitude and longitude to comply with the weather API’s format.
- Timestamp Conversion - Order times are translated into UNIX format, which most weather APIs require for querying historical records.
- API Caching - To save time and cost, API calls are cached. Identical timestamps and coordinates are only queried once.
- 12AM Averaging - Weather data is collapsed into daily averages at 00:00. This ensures consistent granularity across all entries and smooths out fluctuations within the same day.
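The timestamp-conversion and caching steps above can be sketched as follows. The API call itself is stubbed out here, since the real endpoint and response shape depend on the provider; the point is that `lru_cache` guarantees each (coordinate, day) pair is only queried once.

```python
from datetime import datetime, timezone
from functools import lru_cache

def to_unix(ts: str) -> int:
    """Convert an order timestamp to the UNIX epoch seconds most weather APIs expect."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

@lru_cache(maxsize=None)
def fetch_weather(lat: float, lon: float, unix_day: int) -> dict:
    """Placeholder for the real API call; identical arguments hit the cache, not the API."""
    return {"temp_c": 18.2, "humidity": 64}   # stubbed response

# Collapse every order timestamp to midnight so all lookups share one cache key per day.
day = to_unix("2021-06-14 00:00")
w1 = fetch_weather(51.5, -0.12, day)
w2 = fetch_weather(51.5, -0.12, day)   # served from cache, no second request
print(fetch_weather.cache_info().hits)
```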
The result is a clean JSON dataset where each order is now enriched with its corresponding weather data - temperature, humidity, UV, wind speed, and more.
This gives us a fuller picture: not just what was sold, but when, where, and under what conditions.
Step 3: Find Patterns In The Data
Once weather and sales data are matched, we structure everything around a core unit: the day.
Rather than working at the transaction level, we aggregate into daily snapshots. Each date becomes a single JSON object that includes:
{
  "2021-06-14": {
    "total_orders": 317,
    "total_revenue": 18324.98,
    "weather": { ... },
    "orders": {
      "#12345": { "price": 59.99, "cart_items": [...] },
      ...
    }
  }
}
This format captures three critical layers:
- Commercial performance - total orders and revenue
- Environmental context - a daily weather snapshot
- Order-level detail - including prices and cart structure
It’s now easy to inspect trends, run comparisons, and feed the data into training pipelines.
By structuring everything around the day, we reduce noise, keep the model’s temporal focus tight, and create a consistent format that scales across years.
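The aggregation itself is straightforward. Here is a minimal sketch that rolls enriched order records up into the daily shape shown above; the record fields are illustrative, not the real schema.

```python
import json
from collections import defaultdict

# Hypothetical enriched order records (weather already attached per order).
orders = [
    {"id": "#12345", "date": "2021-06-14", "price": 59.99,
     "cart_items": ["AB1234"], "weather": {"temp_c": 18.2}},
    {"id": "#12346", "date": "2021-06-14", "price": 24.50,
     "cart_items": ["CD5678"], "weather": {"temp_c": 18.2}},
]

# Aggregate transactions into one snapshot per day, mirroring the daily JSON format.
days = defaultdict(lambda: {"total_orders": 0, "total_revenue": 0.0,
                            "weather": None, "orders": {}})
for o in orders:
    day = days[o["date"]]
    day["total_orders"] += 1
    day["total_revenue"] = round(day["total_revenue"] + o["price"], 2)
    day["weather"] = o["weather"]   # already a daily average, so identical per order
    day["orders"][o["id"]] = {"price": o["price"], "cart_items": o["cart_items"]}

print(json.dumps(days, indent=2))
```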
From here, we move toward regression - and real prediction.
Step 4: Connect Events Using Regression Analysis
With sales and weather data structured, we now shift from collection to modelling.
This step begins by eliminating any variables that would unfairly inflate the model’s accuracy - revenue and order count among them. These are the targets, not the inputs. Including them directly would cause data leakage and invalidate the predictions.
Next, we begin feature engineering.
This transforms raw data into signals the model can learn from:
- Day of the week and month help identify seasonality.
- Weekend flags expose behavioural shifts tied to leisure time.
- Rolling averages create a smoother version of the data - ideal for tracking trends.
- Lagging values (e.g., yesterday’s sales, last week's UV index) help the model understand inertia, momentum, and time-based dependencies.
Each of these is computed via dedicated functions within the pipeline - modular, reusable, and clearly scoped.
For example, generate_lag_features(df, lag_days=[1, 7, 14]) builds lag-based predictors for different time frames. Another function, add_time_features(df), injects time-based tags like month, weekday, and weekend status.
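Plausible implementations of those two functions might look like this. This is a sketch, not the pipeline's actual code: it assumes a DataFrame indexed by date and lags every numeric column uniformly.

```python
import pandas as pd

def generate_lag_features(df: pd.DataFrame, lag_days=(1, 7, 14)) -> pd.DataFrame:
    """Add lagged copies of each column so the model sees recent history."""
    out = df.copy()
    for lag in lag_days:
        for col in df.columns:
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    return out

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Inject calendar signals: month, weekday, and a weekend flag."""
    out = df.copy()
    out["month"] = out.index.month
    out["weekday"] = out.index.weekday
    out["is_weekend"] = (out["weekday"] >= 5).astype(int)
    return out

# Toy daily frame: three weeks of a single weather signal.
idx = pd.date_range("2021-06-01", periods=21, freq="D")
daily = pd.DataFrame({"temp_c": range(21)}, index=idx)
features = add_time_features(generate_lag_features(daily, lag_days=(1, 7)))
```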
These scripts aren’t standalone - they operate as stages in a sequence. Once the master JSON is parsed into a structured DataFrame, we pass it through this transformation pipeline before splitting it into training and validation sets.
This step transforms our dataset from historical records into predictive engine fuel.
The model isn’t just learning what happened - it’s learning why.
Step 5: Forecasting the Future
With the training complete, we’re ready to look forward.
Forecasting isn’t just about filling future dates with guesses. It’s about generating synthetic-but-plausible inputs - then seeing how the model interprets them.
Here’s how it works:
We take the trailing two weeks of real weather data and build a forward-looking weather sequence. This becomes our proxy forecast. The script generate_synthetic_forecast.py creates this file - typically output as simulated_weather_forecast.csv - which acts as the new input for prediction.
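The core move in generate_synthetic_forecast.py can be sketched in a few lines: take the trailing observed window and project it onto future dates. The 14-day horizon and the weather columns here are assumptions; the real script may shape its proxy forecast differently.

```python
import pandas as pd

# Hypothetical trailing two weeks of observed weather.
history = pd.DataFrame(
    {"temp_c": [15 + i * 0.5 for i in range(14)],
     "humidity": [70 - i for i in range(14)]},
    index=pd.date_range("2025-05-18", periods=14, freq="D"),
)

horizon = 14  # forecast length in days (assumed)
future_idx = pd.date_range(history.index[-1] + pd.Timedelta(days=1),
                           periods=horizon, freq="D")
forecast = history.tail(horizon).set_axis(future_idx)   # shift the pattern forward
forecast.to_csv("simulated_weather_forecast.csv")
```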
We use two pre-trained models:
- model.joblib - predicts revenue
- orders_model.joblib - predicts order count
Both models load the forecasted weather features, run through their respective pipelines, and return predicted outcomes for each future day.
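In the real pipeline the two models are deserialised with joblib.load("model.joblib") and joblib.load("orders_model.joblib"); the sketch below fits tiny scikit-learn stand-ins instead so it runs end to end. The single temperature feature and the training values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-ins for the pre-trained models loaded from disk in the real pipeline.
X_train = np.array([[12.0], [16.0], [20.0], [24.0]])   # e.g. daily mean temperature
revenue_model = LinearRegression().fit(X_train, [900, 1100, 1300, 1500])
orders_model = LinearRegression().fit(X_train, [30, 36, 42, 48])

# Score the forecasted weather features with both models.
X_future = np.array([[18.0], [22.0]])
pred_revenue = revenue_model.predict(X_future)
pred_orders = orders_model.predict(X_future)
print(pred_revenue, pred_orders)
```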
These forecasts are then stored in a structured output - a DataFrame of daily predictions, often including confidence intervals or rolling windows to smooth volatility.
The logic behind this setup is modular:
- Forecast generator → produces plausible inputs
- Predictive models → consume and score those inputs
- Output renderer → stores structured results
The entire pipeline can be run via a single orchestration script (run_prediction_pipeline.py), which ensures reproducibility and sequence integrity.
It’s not magic. It’s math, structure, and intent - layered precisely.
Step 6: Visualising the Forecast
Predictions are valuable. But they become actionable when seen.
Instead of inspecting a spreadsheet full of numbers, we generate a time-series graph that overlays predicted revenue and predicted order volume across the forecast window.
This is handled by predict_and_plot_forecast.py, which uses Plotly to render:
- Daily revenue predictions
- Daily order volume
- A time axis running from the forecast start to year-end
It’s fully interactive, making it easy to inspect peaks, compare segments, or isolate anomalies.
This step brings the system full circle. You started with raw orders and weather data. You transformed them into a feature-rich model. Then you generated a forecast. And now, you can see the outcome - clearly, visually, and in context.
This isn’t a static report. It’s a living tool. Something you can re-run, adjust, and iterate as conditions change.
And that’s the power of engineering over guessing.
Conclusion
This project began with a question: When will I make more money? From that, we built a forecasting system grounded in real data — not intuition.
We extracted five years of order history, aligned it with weather data, cleaned and structured it into a usable format. From there, we engineered relevant features, trained regression models, and produced forward-looking forecasts based on simulated weather conditions.
The result isn’t a static report. It’s a modular, versioned system — one that can evolve as inputs change. Add paid media, stock levels, or new feature logic, and the model adjusts accordingly.
It’s not perfect. But it’s functional, testable, and repeatable.
And in many cases, that’s enough to make better decisions — earlier, and with more confidence.