Reverse Engineering Enterprise APIs W/ Elastic

Business intelligence using web scraping, hidden APIs & data engineering.

Web Scraping · Python · Elastic · Proxies · Data Engineering
January 27, 2025 · By JP Garbaccio

This project began with a simple goal: extract every public-facing product record from a complex commerce platform and make it searchable. From there, it evolved into a technical campaign—scraping against paginated APIs, dealing with rate limits and WAFs, adapting to scale, and turning a standalone script into a production pipeline.

Because half-measures don't win data wars.

When we talk about reverse engineering APIs at scale, we're not talking about poking around with developer tools and hoping for the best. We're talking about systematically mapping, decoding, and extracting structured data from large-scale e-commerce ecosystems—cleanly, programmatically, and at volume.

When we first cracked open e-commerce sitemaps and reverse-engineered their product endpoints, the objective was crystal clear: build a scraper that could hoover up every SKU, price, and rating exposed to the public web. First, we built something quick—lightweight, scrappy, good enough to prove the point. It reverse-engineered endpoints, parsed JSON, and indexed over 150,000 products into Elasticsearch. It worked—until we pushed it harder.
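As a rough illustration, sitemap harvesting can be that simple at its core: pull every `<loc>` entry, then apply a URL heuristic to separate products from categories. The `.html`-suffix check below is a hypothetical stand-in for whatever pattern the target site actually uses:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace, per sitemaps.org
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_locs(sitemap_xml: str) -> list[str]:
    """Pull every <loc> URL out of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

def looks_like_product(url: str) -> bool:
    # Hypothetical heuristic: product pages end in an ID plus ".html",
    # matching the URL shape of the records we indexed.
    return url.rstrip("/").endswith(".html")
```

The same pass that extracts URLs can therefore tag them, which matters later when mixed product and category URLs start inflating scope.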

Because a good scraper isn't measured by its ability to run; it's measured by its ability to keep running under pressure.

What followed wasn't a rewrite—it was a reinvention. We pivoted from a single-threaded crawler into a fault-tolerant, proxy-powered, metrics-obsessed data pipeline that could parse 180,000+ category URLs without blinking.

Foundations, Then Friction

That early build was fast to construct and surprisingly effective. It showed us what was possible. But as soon as we asked it to operate at scale, it showed us where it would break.

Here's what worked—and where it started to crack:

| Aspect | Capability | Why It Wasn't Enough |
| --- | --- | --- |
| Endpoint Discovery | Parsed four PLP sitemaps → 183,000+ URLs | Mixed product + category URLs inflated scope and wasted cycles |
| Scraping Engine | curl_cffi with browser impersonation | Single IP = predictable ban pattern |
| Data Store | Local Docker Elasticsearch & Kibana | Laptop bandwidth + uptime bottleneck |
| Success Metrics | 90.9% request success, 151K+ products indexed | Success ≠ completeness (pagination bug skipped 95% of data) |

That first version proved the concept, exposed the data model, and—crucially—surfaced the first bug of scale: we were only grabbing page 1 of every category. Rather than patching that hole with duct tape, we used it as the catalyst to redesign the whole engine.
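The fix for that bug is conceptually simple: never trust a single page. A minimal sketch of the idea, with `fetch_page` standing in for whatever call hits the paginated endpoint:

```python
from typing import Callable

def scrape_category(fetch_page: Callable[[int], list[dict]],
                    max_pages: int = 500) -> list[dict]:
    """Walk every page of a category instead of stopping at page 1.

    `fetch_page(page)` returns that page's product records, or an
    empty list once the category is exhausted. `max_pages` is a
    safety cap so a misbehaving endpoint can't loop forever.
    """
    products: list[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # empty page = end of category
            break
        products.extend(batch)
    return products
```

Wrapping pagination in one place like this also makes it auditable: log the page count per category and a "95% of data missing" surprise becomes visible on day one.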

The Pivot: Designing for Abuse, Not Sunshine

After that first build, we weren't starting over—we were levelling up. The fundamentals were sound, and the goal stayed the same: total catalogue coverage. But scraping at scale introduced a new set of constraints we could no longer ignore:

  1. Rate-limiting and fingerprinting were guaranteed.
  2. Every endpoint would paginate, silently.
  3. Scraping would run for days; restarts had to be seamless.
  4. Metrics would beat gut-feel—if it's not logged, it didn't happen.

This mindset produced three core upgrades: a proxy layer, a centralised config system, and resilience baked into every request.


Proxy Infrastructure: Obscuring Outgoing IP Requests

This is where things got serious. If you want to run high-volume scrapers against real-world e-commerce APIs, you can't afford to get blocked by lunchtime. Static IPs? Dead in minutes. Proxy lists? Only useful if you can manage, rotate, and recover them intelligently.

Here's how we built a proxy layer that stayed alive, self-healed, and delivered 99.996 % success—even with 85 out of 100+ proxies benched.

┌──────────┐   health_check()   ┌──────────┐
│ proxies  │ ─────────────────▶ │ metrics  │
└──────────┘ ◀───────────────── └──────────┘
     │  rotate()                     ▲
     ▼                               │
production_scraper_with_proxies.py ──┘

Key Mechanics

  • Proxy Loader ingests proxy lists from secrets/proxies.txt, validating authentication and protocol integrity.
  • Proxy Manager monitors health with 5-minute checks, rotating with round_robin, random, or health_based logic.
  • Failover Logic automatically escalates to the next proxy if a request fails.
  • Cooldown & Retry benches unhealthy proxies for 10 minutes before revalidating.
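The production manager carries more state than fits here, but the health_based idea can be sketched in a few lines. The one-failure bench and the success-ratio scoring below are illustrative simplifications, not the production thresholds:

```python
import time

COOLDOWN_SECONDS = 600  # bench an unhealthy proxy for 10 minutes

class ProxyManager:
    """Minimal sketch of health-based rotation with cooldown."""

    def __init__(self, proxies):
        self.stats = {p: {"ok": 0, "fail": 0, "benched_until": 0.0}
                      for p in proxies}

    def healthy(self):
        """Proxies whose cooldown has expired."""
        now = time.time()
        return [p for p, s in self.stats.items() if s["benched_until"] <= now]

    def pick(self):
        """health_based rotation: prefer the best success ratio."""
        pool = self.healthy()
        if not pool:
            raise RuntimeError("no healthy proxies available")
        return max(pool, key=lambda p:
                   (self.stats[p]["ok"] + 1) / (self.stats[p]["fail"] + 1))

    def report(self, proxy, success):
        """Record an outcome; here a single failure benches the proxy."""
        s = self.stats[proxy]
        if success:
            s["ok"] += 1
        else:
            s["fail"] += 1
            s["benched_until"] = time.time() + COOLDOWN_SECONDS
```

The key design choice is that failure handling lives in the manager, not the request loop: the scraper just calls `pick()`, `report()`, and moves on.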

Result: 27,000+ requests, 99.996 % success rate. Only 16 out of 100+ proxies were healthy, but the system adapted, rerouted, and delivered.

Configuration Overhaul: One Source of Truth

Managing a scraper at scale means controlling your environment—because when you're running tens of thousands of requests, even small config changes can cause major failures. Hardcoding values across scripts leads to inconsistencies, errors, and the inability to safely swap environments.

We migrated all tunables into config/settings.py using python-dotenv, enabling clean environment switching and full config parity between dev and prod:

USE_PROXIES = True
PROXY_ROTATION = "health_based"
PROXY_REQUEST_TIMEOUT = 30
BATCH_SIZE = 25
MAX_RETRIES = 3
SQLITE_DB = "data/production/scraper_progress.db"

This meant every operational behaviour—timeouts, retries, proxy mode, database path—was now controllable in one place. It kept the system predictable, portable, and far easier to scale.
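Under the hood, that file mostly boils down to reading everything from the environment with sane defaults. A stripped-down sketch of the pattern—after python-dotenv's `load_dotenv()` has populated `os.environ`, every tunable resolves through `os.getenv`:

```python
import os

# In the real settings.py, python-dotenv's load_dotenv() runs first
# so a .env file can override any of these per environment.

def _bool(name: str, default: str) -> bool:
    """Parse a truthy env var ('1', 'true', 'yes')."""
    return os.getenv(name, default).lower() in ("1", "true", "yes")

USE_PROXIES = _bool("USE_PROXIES", "true")
PROXY_ROTATION = os.getenv("PROXY_ROTATION", "health_based")
PROXY_REQUEST_TIMEOUT = int(os.getenv("PROXY_REQUEST_TIMEOUT", "30"))
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "25"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
SQLITE_DB = os.getenv("SQLITE_DB", "data/production/scraper_progress.db")
```

Because every value passes through one parsing layer, a typo in `.env` fails loudly at import time instead of silently mid-run.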

Where the Data Lands

Once the data was scraped and cleaned, every product record was pushed directly into Elasticsearch. Initially, this was a local instance running in Docker, giving us full control over schema design and fast feedback during early iterations. Later, we migrated indexing to a cloud-hosted instance—removing hardware constraints and letting us scale without bottlenecks.

Each item was indexed using a well-defined schema:

{
  "id": "JXXX33",
  "modelNumber": "KXXX2",
  "title": "Tennis Wristbands",
  "price": 13,
  "rating": 5,
  "image": "...",
  "url": "/tennis-wristbands/JXXX33.html"
}

Elasticsearch wasn't just a place to store data—it powered everything downstream: visualisation in Kibana, API search endpoints, and all category-level aggregations.
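Indexing at this volume means Elasticsearch's `_bulk` API rather than one HTTP request per document. A sketch of how a bulk body gets assembled—the index name `products` is an assumption for illustration:

```python
import json

INDEX = "products"  # assumed index name

def bulk_payload(products: list[dict]) -> str:
    """Build an Elasticsearch _bulk body: one action line plus one
    source line per document, newline-delimited JSON."""
    lines = []
    for p in products:
        lines.append(json.dumps({"index": {"_index": INDEX, "_id": p["id"]}}))
        lines.append(json.dumps(p))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline
```

Using the product ID as `_id` also makes re-runs idempotent: re-indexing the same SKU overwrites the old document instead of duplicating it.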

Resilience & Observability

When you're working with unreliable proxies, long scraping windows, and thousands of moving parts, things are going to go wrong. The goal isn't to avoid failure—it's to recover from it fast, without losing progress or duplicating work. This section breaks down how we made our scraper more durable, predictable, and transparent.

| Feature | Implementation | Payoff |
| --- | --- | --- |
| Progress Persistence | SQLite url_status table | Resume after crash, no duplicate work |
| Adaptive Batching | Shrinks/enlarges batch size based on rolling success | Keeps pipelines flowing under heavy bans |
| Request-Level Retry | Exponential back-off per proxy | Converts transient failures into eventual successes |
| Structured Logging | JSON logs + proxy stats exported every run | Immediate insight into throughput & health |
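The retry row hides a simple pattern: exponential back-off. A minimal sketch, with `send` standing in for a proxied request and the sleep function injectable so the behaviour is testable:

```python
import time

def with_retry(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a failing request with exponential back-off: 1s, 2s, 4s, ...

    `send` is any zero-argument callable that raises on failure.
    After `max_retries` failed retries, the last exception propagates.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
```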

If the job crashes at 02:00, it resumes at 02:01 without repeating a single request.
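The url_status table behind that guarantee needs very little: seed URLs idempotently, mark them done, and only ever hand out what's still pending. A minimal sketch—the column names here are illustrative, only the table name comes from the build:

```python
import sqlite3

def open_progress(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the progress database with a url_status table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS url_status ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return db

def seed(db, urls):
    """Idempotent: re-seeding after a restart never duplicates work."""
    db.executemany("INSERT OR IGNORE INTO url_status (url) VALUES (?)",
                   [(u,) for u in urls])
    db.commit()

def pending(db):
    """URLs still to scrape, in discovery order."""
    cur = db.execute(
        "SELECT url FROM url_status WHERE status = 'pending' ORDER BY rowid")
    return [row[0] for row in cur]

def mark_done(db, url):
    db.execute("UPDATE url_status SET status = 'done' WHERE url = ?", (url,))
    db.commit()
```

Because `seed` uses `INSERT OR IGNORE` and completion is committed per URL, a restart simply reloads the same table and picks up where the crash left off.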

Performance Snapshot

Numbers matter. If you're building something that claims to be production-ready, it should have the stats to back it up. This section covers exactly how the system performed under pressure—volume, success rates, throughput, and reliability—after weeks of continuous scraping.

  • URLs Discovered: 183,000+
  • Categories Completed: 18,000+ (≈10 %)
  • Average Products/Category: 17–120+
  • Live Success Rate: 99.996 %
  • Proxy Pool: 100+ total / 16 healthy / 85 on cooldown
  • Elasticsearch Index Size: 52 MB and growing

Each indexed product includes ID, model number, title, price block, ratings, images, and canonical URL—ready for search, analytics, or competitive intelligence.

What Can We Do With This Data?

Once the data is indexed and structured, it stops being just output—it becomes a dataset you can actually work with. Here's how that plays out:

Practical Use Cases:

  • Searchable Product Interface – Build a full-featured internal search tool or front-end browser.
  • Price Tracking – Monitor price changes over time or across categories.
  • Stock Monitoring – Identify when products go out of stock or reappear.
  • Product Launch Detection – Spot new product IDs that weren't in previous runs.
  • Cross-Site Benchmarking – Scrape multiple sites using the same schema and compare performance or pricing.

Modelling with Elasticsearch:

  • Trend Detection – Use time-series rollups or scripted fields to visualise pricing trends.
  • Custom Ranking Models – Adjust Elasticsearch scoring to prioritise high-rated or heavily discounted items.
  • Anomaly Detection – Feed product price/availability changes into watch alerts for unexpected movements.
  • Faceted Aggregations – Combine filters (e.g. only 4+ star items under £30) with statistical summaries.
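To make the last of those concrete: "4+ star items under £30, with statistical summaries" maps to a compact piece of Elasticsearch query DSL. The field names `rating` and `price` follow the indexed schema shown earlier; the rest is standard bool-filter-plus-aggregations shape:

```python
# Faceted aggregation sketch: filter to 4+ stars under £30,
# then summarise prices and bucket the results by rating.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"rating": {"gte": 4}}},
                {"range": {"price": {"lt": 30}}},
            ]
        }
    },
    "aggs": {
        "price_stats": {"stats": {"field": "price"}},  # min/max/avg/sum
        "by_rating": {"terms": {"field": "rating"}},
    },
    "size": 0,  # aggregations only, skip the hits themselves
}
```

Filters (rather than scored queries) keep this cacheable on the Elasticsearch side, which matters once dashboards start re-running the same facets constantly.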

In other words, once the hard part is done—the scraping, the schema design, the indexing—you're left with a search engine that knows your competitors better than they know themselves.

What This Build Taught Us

Building a scraper that can run at this scale isn't just about getting it working—it's about understanding what breaks, why it breaks, and what to build around it. Here are the key things this project taught us:

  1. Plan for blocks. Assume you'll get rate-limited or banned. Design around it.
  2. Keep a paper trail. Metrics and logs aren't just helpful—they're necessary for debugging and optimisation.
  3. Build for recovery. If something fails, the system should bounce back without losing time or duplicating effort.
  4. Standardise your setup. One environment config for everything makes it easier to test, scale, and deploy.
  5. Check for pagination early. If you miss this, your data will look fine—and be completely incomplete.

These lessons are now baked into the architecture. They don't just make the system work—they make it maintainable.

Things to Consider Next Time

Every project leaves behind a list of "if we were doing this again..." insights. These are the unresolved issues, rough edges, and architectural considerations we'd build into our planning earlier if we were starting from scratch:

  • Proxy Quality Assurance – Over 80 % of our proxy list failed under load. Better upfront validation or automated filtering would cut downtime significantly.
  • Endpoint Classification – Many of the 183k discovered URLs weren't actual category endpoints. A clearer system to classify discovery targets would reduce waste.
  • Pagination Auditing – This was a blind spot early on. Future projects should include automated checks to detect and confirm pagination coverage.
  • System Health Dashboards – Manual log inspection worked, but having a real-time dashboard would've surfaced issues faster and with less friction.
  • Scraper Orchestration – Currently everything is managed manually or via ad hoc scripts. Introducing scheduled runners, job queues, or lightweight orchestration (e.g. with cron or Airflow) would add another layer of polish and control.

These are the things we'd bake in sooner next time—not because the system failed, but because we now know where the real tension lives when operating at scale.

Wrapping It Up

This project started with a straightforward ambition: extract product data cleanly and reliably. But by the time we'd fought through the complexity of proxy management, pagination blind spots, and scale-induced chaos, what we ended up with was something much more resilient.

We built a scraper that doesn't just work—it holds up under pressure, recovers from failure, and leaves behind data that's immediately usable for analysis, modelling, or insight. And along the way, we learned how to make it smoother, smarter, and ready to scale further.

Every improvement—whether it was retry logic, health checks, or environment config—was driven by a specific problem we encountered. And that's what made the build real: not theory, not abstraction, but engineering through friction.

It's not perfect. But it's damn solid. And next time, it'll be better still.