“Should I use the official API or scrape the site?”
This question comes up every time a developer needs to integrate a new data source. The answer isn’t always obvious, and making the wrong choice costs you time — either maintaining a fragile scraper or working around an API that doesn’t expose what you need.
Here’s how to think through the decision.
The case for official APIs
When a platform offers a proper API, it should almost always be your first choice.
Advantages:
- Stability: Official APIs maintain backward compatibility. Your code doesn’t break every time the UI changes.
- Terms of service: You’re explicitly authorized. No grey area.
- Structured data: You get clean JSON or XML. No HTML parsing, no data cleaning.
- Authentication: OAuth flows give you user-specific data.
- Webhooks: Push-based events rather than polling.
When official APIs fall short:
- They don’t expose the data you need (e.g., the free tier of the X/Twitter API offers almost no read access)
- The quota is too restrictive (YouTube API: 10,000 units/day by default)
- Access requires approval and months of waiting (TikTok, LinkedIn)
- The pricing is designed to extract maximum revenue from data-hungry use cases (five-figure annual contracts for LinkedIn data access)
- The API simply doesn’t exist (most marketing sites, competitor pages, news sites)
The case for scraping
Web scraping means fetching a page and parsing the HTML (or intercepting the JSON payloads the site’s own frontend fetches).
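The parsing half can be sketched with Python's standard-library `html.parser`. The markup and class names below are invented for illustration; a real scraper would first fetch the page with an HTTP client, and real pages are far messier:

```python
from html.parser import HTMLParser

# Hypothetical product-page snippet; in practice this HTML comes from
# an HTTP fetch (urllib.request, requests, or a headless browser).
HTML = '<div class="price">$19.99</div><div class="title">Widget</div>'

class FieldExtractor(HTMLParser):
    """Collects the text content of elements whose class is in `targets`."""
    def __init__(self, targets):
        super().__init__()
        self.targets = targets
        self.current = None   # class name of the element we're inside
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.targets:
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = data.strip()
            self.current = None

parser = FieldExtractor({"price", "title"})
parser.feed(HTML)
print(parser.fields)  # → {'price': '$19.99', 'title': 'Widget'}
```

This is the part that breaks whenever the site renames a class or restructures the DOM, which is exactly the maintenance burden discussed below.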
Advantages:
- Accesses any data that’s publicly visible
- No API key or application approval needed
- Flexible: capture exactly what you see
- Works even when no official option exists
When scraping falls short:
- Maintenance: Any DOM change breaks your selectors
- Bot detection: Cloudflare, reCAPTCHA, browser fingerprinting block naive scrapers
- Dynamic content: JavaScript-rendered pages require headless browsers (memory-intensive, slow)
- IP bans: Repeated scraping from a single IP gets blocked
- Legal grey area: Whether scraping is permissible depends on the site’s ToS, the jurisdiction, and the use case
- Infrastructure: Proxy pools, browser clusters, and queues add complexity and cost
What it actually costs
| Approach | Development time | Maintenance | Infrastructure |
|---|---|---|---|
| Official API | Low | Near-zero | Minimal |
| Custom scraper | High | High | Moderate – High |
| Managed worker | Near-zero | None (maintained externally) | None |
Custom scrapers look cheap up front but carry hidden long-term costs. A scraper that breaks monthly and takes two hours to fix consumes roughly 24 hours of developer time per year on maintenance alone.
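The arithmetic behind that estimate, with the hourly rate as an added assumption:

```python
# Back-of-envelope maintenance cost model. All figures are assumptions;
# substitute your own break frequency and loaded developer cost.
breaks_per_year = 12      # scraper breaks roughly monthly
hours_per_fix = 2
hourly_rate = 100         # fully loaded developer cost, USD (assumed)

annual_hours = breaks_per_year * hours_per_fix   # 24 hours/year
annual_cost = annual_hours * hourly_rate         # $2,400/year
print(f"{annual_hours} h/year ≈ ${annual_cost:,}/year")
```

At that rate, a per-call worker fee often undercuts the maintenance bill alone, before counting proxy and browser infrastructure.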
The decision framework
- Is there an official API?
  - YES → Does it cover your data needs?
    - YES → Does it fit within quota and pricing?
      - YES → Use the official API ✅
      - NO → Use the official API + supplement with workers
    - NO → Use a worker ✅
  - NO → Use a worker ✅ (or build a scraper if no worker exists)
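The tree collapses to a small function. The flag names are placeholders describing the data source, not any real API:

```python
def choose_integration(has_api: bool, covers_needs: bool = False,
                       fits_quota: bool = False,
                       worker_exists: bool = False) -> str:
    """Encodes the decision tree above as nested conditions."""
    if has_api:
        if not covers_needs:
            return "worker"
        # API covers the data: quota and pricing decide the rest.
        return "official API" if fits_quota else "official API + workers"
    return "worker" if worker_exists else "custom scraper"

print(choose_integration(has_api=True, covers_needs=True, fits_quota=True))
# → official API
print(choose_integration(has_api=False, worker_exists=True))
# → worker
```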
When to build a custom scraper
You should build your own scraper when:
- No official API exists AND no worker covers the source
- The data source is internal/private (your own backend)
- The structure is trivial and unlikely to change
- You need complete control for compliance reasons
For everything else, the maintenance burden of a custom scraper is rarely justified.
Workers: the middle ground
Managed workers occupy the space between brittle DIY scrapers and limited official APIs:
- They’re pre-built and tested against real sites
- Someone else handles bot detection, proxies, and DOM changes
- You get a clean JSON API regardless of the underlying complexity
- Per-call pricing replaces infrastructure overhead
Consider the LinkedIn use case:
- Official API: requires application, limited data, starts at $50K/year for bulk access
- DIY scraper: violates ToS, blocked within days, legally risky
- LinkedIn worker on Seek API: $0.01/profile, maintained by specialists, clean JSON output
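From the caller's side, a worker is just an authenticated HTTP request that returns JSON. A minimal sketch using only the standard library; the endpoint URL, payload fields, and auth scheme are placeholders, since the actual Seek API routes aren't specified here:

```python
import json
import urllib.request

def build_worker_request(profile_url: str, api_key: str) -> urllib.request.Request:
    """Builds (but does not send) a POST request to a hypothetical
    managed LinkedIn-profile worker endpoint."""
    payload = json.dumps({"url": profile_url}).encode()
    return urllib.request.Request(
        "https://api.example.com/v1/workers/linkedin-profile",  # placeholder URL
        data=payload,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_worker_request("https://www.linkedin.com/in/someone", "sk_test")
# urllib.request.urlopen(req) would return the worker's clean JSON response.
```

No proxies, no selectors, no headless browser: all of that complexity lives behind the endpoint.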
The decision tree becomes: “Does a worker exist for this source?” Yes? Use the worker. No? Evaluate a custom scraper.
Hybrid architectures
Real-world data pipelines often mix all three:
- News aggregation: RSS feeds where available (official) → sites whose feeds are broken or missing (workers) → brand-new sources (custom scraper)
- Lead enrichment: HubSpot API (official CRM data) + LinkedIn worker (public profile data)
- Competitor monitoring: some competitors have APIs (official); most don’t (workers)
Mix and match by source. Use official APIs where they provide what you need. Delegate the rest to workers.
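Per-source routing can stay trivially simple. A sketch with placeholder source names and stub fetchers standing in for real integrations:

```python
# Each fetcher stands in for a real integration (official SDK, worker
# HTTP call, or scraper). The strings returned here are stand-ins.
def fetch_official(source: str) -> str: return f"official:{source}"
def fetch_worker(source: str) -> str: return f"worker:{source}"
def fetch_scraper(source: str) -> str: return f"scraper:{source}"

ROUTES = {
    "hubspot": fetch_official,    # stable official CRM API
    "linkedin": fetch_worker,     # worker handles bot detection and DOM drift
    "niche-blog": fetch_scraper,  # no API, no worker, trivial structure
}

def fetch(source: str):
    # Unknown sources default to a worker, per the framework above.
    return ROUTES.get(source, fetch_worker)(source)
```

Keeping the routing table explicit makes it cheap to promote a source from scraper to worker, or from worker to official API, as options improve.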
Summary
| Scenario | Recommendation |
|---|---|
| Public data, no API | Worker |
| API exists but too restrictive | Worker or hybrid |
| API exists and fits | Official API |
| Internal data | Direct API integration |
| Simple, stable target, no worker | Custom scraper |
| Complex, frequently changing target | Worker |