Smart Page Fetcher — HTML, Markdown & Text

Design notes for an adaptive batch URL fetcher built for AI agents.

Design notes for smart-page-fetcher, an adaptive web-fetching Apify Actor on the Apify Store. The tool picks the cheapest fetch method that works for each URL in a batch — plain HTTP, a real browser running JavaScript, or a stealth session routed through a residential proxy — and returns each page as HTML, Markdown, or clean text, billing only at the tier that actually produced the content. It’s designed to be called from AI agents (via the Apify MCP server, with pay-per-event billing on x402 and Skyfire agentic-payment rails) but works the same from any code that can hit a REST endpoint.

The Apify Store page covers the schema, current pricing, and how to try it. This page is for the design questions — why the Actor has the shape it does, what alternatives I considered, and what it deliberately refuses.

What this is

Submit a batch of URLs. For each URL, the Actor walks an escalation chain until something returns usable content:

Basic HTTP — plain GET, no JavaScript, no proxy. Fast and cheap, good for static pages, JSON-LD-heavy product pages, documentation, RSS-style content.
JavaScript render — real browser, no anti-detection shims. Loads the page, runs JS, captures the rendered DOM. Good for SPAs and lazy-rendered content.
Stealth + residential proxy — hardened browser session routed through residential IPs, used only when the cheaper tiers can’t get past bot defenses.

Each tier can be locked on or off per request. The default is auto on all three — escalate from cheapest, stop as soon as a tier returns usable content. The customer pays at the realized tier; failed URLs and URLs that ran out of runtime budget are free.
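
The escalation chain above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Actor's implementation: the tier names, the `fetch_with_escalation` function, and the stub fetchers are all hypothetical, and a real fetcher would do network I/O and content checks instead of returning canned values.

```python
from typing import Callable, Optional

# Hypothetical tier names, cheapest first; not the Actor's real identifiers.
TIER_ORDER = ["basic", "js", "stealth"]

# A fetcher returns page content, or None when the result is unusable.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_escalation(url: str, fetchers: dict[str, Fetcher],
                          allowed: set[str]) -> dict:
    """Try each allowed tier from cheapest to most expensive; stop on first success."""
    attempted = []
    for tier in TIER_ORDER:
        if tier not in allowed:
            continue  # tier locked off for this request
        attempted.append(tier)
        content = fetchers[tier](url)
        if content is not None:
            return {"url": url, "attempted": attempted,
                    "realized_tier": tier, "content": content}
    # Every allowed tier failed: no realized tier, so nothing to bill.
    return {"url": url, "attempted": attempted,
            "realized_tier": None, "content": None}


# Stub fetchers standing in for real network calls.
stubs = {
    "basic": lambda url: None,                    # e.g. got a challenge page
    "js": lambda url: "<html>rendered</html>",    # browser render succeeded
    "stealth": lambda url: "<html>stealth</html>",
}
result = fetch_with_escalation("https://example.com", stubs, set(TIER_ORDER))
print(result["attempted"], result["realized_tier"])
```

In this sketch the stealth fetcher is never called, because the JS tier already returned content.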

Output is a dataset of records, one per URL, in input order. Per-URL records report which tiers were attempted, which one produced the content, and the requested output formats — raw HTML, cleaned HTML (scripts/styles/tracking stripped while preserving semantic structure), boilerplate-stripped text, Markdown, links, media elements, document heading outline, tables, Schema.org JSON-LD, OpenGraph values, other meta tags, accessibility tree, and full-page screenshot. Large outputs — HTML, accessibility tree, screenshot — go to the Apify Key-Value Store so dataset records stay small; the record carries a public URL.

Two pieces of context that frame everything below: the Actor is unauthenticated (it can’t carry your cookies or auth headers), and it batches by default. The batch is where the design becomes interesting.

Why I built it this way

The interesting choices are about cost: specifically, the enormous gap between fetch methods, and what happens to a customer’s bill when the wrong method is picked for a URL.

The cost gap is the central problem

A plain HTTP GET against a static page takes around 100ms and a fraction of a cent of resources. A full stealth render against a Cloudflare-protected page can take 30 seconds and tie up a residential proxy session, at a cost orders of magnitude higher. If you commit to one fetch method up front, you either overpay massively for the easy pages or fail on the hard ones.

There’s a worse failure mode that gets less attention: the page renders but is silently wrong. Some bot defenses serve a 200 response with a JavaScript challenge interstitial. A naive HTTP client sees an HTTP 200 and a body, declares success, and hands obfuscated challenge JavaScript to a downstream LLM that has no way to know it didn’t get the article. From the LLM’s perspective the request succeeded; from the agent operator’s perspective the agent hallucinated.

So the Actor needs to detect when a response that looks like a page is not the real content (known anti-bot markers in the HTML, JavaScript-required signals, the typical 403/429/503 status codes) and escalate to a method that can actually solve the challenge. The cheapest tier that returns content, not an interstitial, wins.
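
A rough sketch of what such a check could look like is below. The marker strings and the `needs_escalation` helper are illustrative assumptions, not the Actor's actual rule set, which is not published here; a production list would be longer and would need ongoing maintenance.

```python
# Illustrative markers only (assumed, not the Actor's real rules).
CHALLENGE_MARKERS = (
    "just a moment...",                           # common interstitial title
    "cf-chl",                                     # fragment seen in some challenge pages
    "enable javascript and cookies to continue",  # typical JS-required notice
)
# Status codes the text names as typical bot-defense responses.
BLOCK_STATUSES = {403, 429, 503}


def needs_escalation(status: int, body: str) -> bool:
    """Return True when a response should not be treated as real page content."""
    if status in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

The point of the sketch is the second branch: a 200 status alone is not accepted as success, so an interstitial body still triggers escalation.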

Why escalation rather than per-call selection

The obvious alternative is to ask the caller to specify the tier per URL. “You know what’s defended, tell us.” Two problems: callers usually don’t know in advance (especially LLM-driven callers building URL lists from search results), and even when they think they know, they’re wrong about a significant fraction of URLs. URLs flip between defended and undefended depending on the time of day, the originating IP, or the proxy reputation pool, and the URL the agent generates isn’t necessarily the URL it’ll be redirected to.

Auto-escalation makes the right call per URL based on what each one actually returns. For the typical mixed batch of URLs in agent workflows — 90%+ tier-1-solvable, a small JS tail, an even smaller stealth tail — escalation lands most URLs at the cheapest tier without the caller having to think about it. Callers who do know can still pin a specific tier (e.g. stealth: "true" to skip directly to the stealth path for known-defended targets, saving the cost of two failed lower-tier attempts).

Why batch by default

A real browser takes 3-5 seconds to launch: Playwright, Chromium, the works. Launching per URL is wasteful. The Actor launches at most one browser per tier per batch and reuses it across all URLs that need it. Pass 50 URLs and the browser startup cost amortizes across the whole batch; pass 1 URL and you pay the full launch cost for a single fetch.

This shapes the pricing: per-URL cost is lower as batch size goes up. It also shapes the recommended use: callers with large URL lists batch them; callers with one-off URLs pay the full per-URL overhead. The Actor doesn’t try to hide this — the pricing structure reflects it transparently.
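
To make the amortization concrete, here is a small back-of-the-envelope calculation. The 4-second launch time sits inside the 3-5 second range quoted above; the 2-second per-page fetch time and the assumption that pages are fetched one after another are hypothetical simplifications for illustration.

```python
def per_url_seconds(batch_size: int, launch_s: float = 4.0, fetch_s: float = 2.0) -> float:
    """Average wall-clock seconds per URL when one browser launch is shared by the batch."""
    return launch_s / batch_size + fetch_s


for n in (1, 10, 50):
    print(n, round(per_url_seconds(n), 2))
```

Under these assumed numbers a single-URL batch spends two thirds of its time on browser launch, while at 50 URLs the launch share is a few percent.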

Why success-only billing

Failed URLs (every allowed tier was tried and errored) and deferred URLs (the runtime budget ran out before the URL was attempted) don’t bill. Only successful pushes to the dataset trigger a per-URL charge.

This is deliberate. The Actor is meant for agents that can’t always predict which URLs will work. Punishing the agent for trying URLs that fail would force defensive batching — smaller batches with pre-validation — which costs more total. It would also poison the agent’s incentives: agents that ought to try fifteen URLs to find the right three would only try the safe ones, and the long tail of agentic web workflows would degrade. Success-only billing means an LLM agent can submit “the ten URLs I think might have what I need” without worrying about being charged for the seven that don’t.
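
The billing rule in this section can be illustrated with a short sketch. The rates below are made-up numbers in arbitrary units (real rates are on the Store page), and the record shape is an assumption beyond the `realized_tier` field name.

```python
# Hypothetical per-URL rates in arbitrary units; not the Actor's real prices.
RATES = {"basic": 1, "js": 5, "stealth": 20}


def batch_charge(records: list[dict]) -> int:
    """Sum charges over records, billing only those that have a realized tier."""
    return sum(RATES[r["realized_tier"]] for r in records
               if r.get("realized_tier") is not None)


records = [
    {"url": "https://a.example", "realized_tier": "basic"},
    {"url": "https://b.example", "realized_tier": "js"},
    {"url": "https://c.example", "realized_tier": None},  # failed: every tier errored
    {"url": "https://d.example", "realized_tier": None},  # deferred: budget ran out
]
total = batch_charge(records)
print(total)
```

Only the first two records contribute to the total; the failed and deferred ones add nothing.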

Why a fixed allow-list of request headers

The Actor lets callers set six request headers per URL: Accept, Accept-Language, Accept-Encoding, User-Agent, Referer, Content-Type. That’s the whole list. Cookie, Authorization, Proxy-Authorization, and any X-* headers are rejected at input validation.

The point is to stop the Actor from being usable as an authenticated-session proxy on demand. Letting an agent submit Cookie: and Authorization: headers would turn a general-purpose public-page fetcher into a tool for riding someone else’s session. The Actor is anonymous from the target’s perspective, and the design keeps it that way. If you need authenticated scraping, that’s a different Actor with a different scope of consent.
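
A minimal sketch of allow-list validation along these lines is shown below. The `validate_headers` function and its error message are hypothetical; the case-insensitive comparison is my assumption, chosen because HTTP header names are case-insensitive.

```python
# The six header names listed above, lowercased for comparison.
ALLOWED_HEADERS = {"accept", "accept-language", "accept-encoding",
                   "user-agent", "referer", "content-type"}


def validate_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return headers unchanged if every name is allow-listed; raise otherwise."""
    rejected = [name for name in headers if name.lower() not in ALLOWED_HEADERS]
    if rejected:
        raise ValueError(f"headers not allowed: {sorted(rejected)}")
    return headers
```

An allow-list fails closed: a header nobody thought about, such as a new vendor-specific auth header, is rejected by default rather than passed through.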

Why HTML returns byte-for-byte

When html is in the requested outputs, the Actor returns the bytes that the target server actually returned — no DOM normalization, no script removal, no analytics scripts injected by the storage layer. This matters when the downstream is an LLM with strict tokenization, a diff tool comparing two captures of the same page, or any DOM parser that’s particular about its inputs. The Apify Key-Value Store API has a habit of injecting a cookie-blocking <script> into text/html responses on serve; the Actor uploads HTML as text/plain to dodge that and preserve the bytes exactly.

Why the runtime budget defaults to 4m30s

The default runtime_budget_ms is 270000 milliseconds — four and a half minutes. This is deliberately under the five-minute timeout that Apify’s run-sync-get-dataset-items endpoint enforces. The 30-second headroom is for cleanup, push delays, and final-result return.

Synchronous calling matters specifically for agents on the x402 payment rail (sync-only by protocol design): the call has to fire and finish inside the sync window or the agent loses its run identifier. URLs not yet attempted when the budget runs out come back as deferred records (zero charge). The caller can retry just those — cheaper than starting over.
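
The budget-and-deferral behaviour can be sketched as below. This is a simplified model under stated assumptions: URLs are processed sequentially, the `status` field and both function names are hypothetical, and the `clock` parameter exists only so the logic can be exercised without waiting.

```python
import time


def run_batch(urls, fetch_one, budget_ms=270_000, clock=time.monotonic):
    """Fetch URLs in order until the budget is spent; mark the rest as deferred."""
    deadline = clock() + budget_ms / 1000
    records = []
    for url in urls:
        if clock() >= deadline:
            # Not attempted at all, so it carries no charge.
            records.append({"url": url, "status": "deferred"})
            continue
        records.append({"url": url, "status": "ok", "content": fetch_one(url)})
    return records


def deferred_urls(records):
    """URLs the caller can resubmit in a follow-up run."""
    return [r["url"] for r in records if r["status"] == "deferred"]
```

A caller would pass the result of `deferred_urls` as the `urls` input of a second run, so only the unattempted URLs are fetched again.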

For larger async batches, the budget goes up to 60 minutes and callers use the async runs endpoint with polling.

How to use it

A realistic mixed batch — some easy URLs, some that need JS, all with markdown + Schema.org + OpenGraph extraction:

{
  "urls": [
    "https://example.com",
    "https://news.ycombinator.com",
    {
      "url": "https://old.reddit.com/r/programming/.json",
      "headers": { "Accept": "application/json" }
    }
  ],
  "outputs": ["markdown", "json_ld", "og"]
}

The Reddit entry uses the per-URL headers form to route through the JSON variant of Reddit’s listings — only Accept/Accept-Language/Accept-Encoding/User-Agent/Referer/Content-Type are accepted as header names.

Via curl through Apify’s REST API:

curl -X POST "https://api.apify.com/v2/acts/shelvick~smart-page-fetcher/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "outputs": ["markdown"]}'

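Through the Apify Python client (the apify-client package):

import os

from apify_client import ApifyClient

# Read the token from the environment; the variable name here is an assumption.
API_TOKEN = os.environ["APIFY_TOKEN"]

client = ApifyClient(token=API_TOKEN)
# Start the Actor and block until the run finishes.
run = client.actor("shelvick/smart-page-fetcher").call(
    run_input={"urls": ["https://example.com"], "outputs": ["markdown"]}
)
# One dataset record per input URL.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["realized_tier"], item.get("outputs", {}).get("markdown"))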

If you’re calling from an MCP-enabled agent, the same Actor surfaces as a tool through mcp.apify.com. The input schema is structured tightly enough that an LLM can construct correct calls from the tool description alone: one required field (urls), unambiguous enums for the tier flags, and a fixed allow-list for output formats. Payment is per call via the Actor’s pay-per-event model, which works with the x402 and Skyfire agentic-payment rails.

How it compares to alternatives

Approach	Static HTML	JavaScript-rendered	Bot-defended	Cost on easy pages
Plain HTTP fetcher	works	fails	fails	cheapest
Always-stealth fetcher	works	works	works	overpaying for every page
Per-call-selected fetcher	depends on caller’s guess	depends	depends	depends
Smart Page Fetcher (this Actor)	works	works	works	basic-tier price

The differentiating axis is who decides which fetch method to use per URL. Plain HTTP fetchers refuse to escalate; always-stealth fetchers can’t de-escalate; per-call-selected fetchers force the caller to be right. Smart Page Fetcher decides per URL based on what each URL actually returns, then bills at the tier that produced the content.

For agent workflows specifically, where the agent doesn’t know in advance which URLs are defended, the per-URL decision is what makes this work as a callable primitive. The agent submits URLs and gets back content. It doesn’t have to model the defense posture of every domain on the web.

Not the right tool for every workflow, though. For form interaction and authenticated sessions, a dedicated browser-automation tool is the better fit. For batches that need same-domain rate-limiting or built-in pagination, a crawler specialized in multi-page traversal handles those concerns. For sites with custom protected APIs, the platform’s own SDK beats general-purpose fetching every time.

Pricing model

Pay-per-event, billed only on success. Each URL is charged once at the tier that produced its content — basic, JS, or stealth. Failed URLs (every allowed tier tried and errored) and deferred URLs (runtime budget exhausted) don’t bill. A platform-managed actor-start event fires once per run at the platform minimum, so per-batch overhead is effectively zero.

Higher tiers cost more because they involve more infrastructure: a real browser at the JS tier, a real browser plus a residential proxy at the stealth tier. On a typical mixed batch most URLs land on the basic tier and the effective per-URL cost is dominated by that floor. Batch size matters too — browser launch is amortized across all URLs that need a browser, so the per-URL effective cost goes down as batches get larger.

For current per-event rates and any active subscriber discounts, see the Pricing tab on the Apify Store page.

Open questions / future work

A few things I’m watching or thinking about:

Per-URL header allow-list breadth. The current six are content-negotiation and polite-identification. Some agent workflows would benefit from being able to set a DNT (Do Not Track) header or a more specific Save-Data hint for the JS tier. Likely safe additions; haven’t yet had a concrete request that motivates them.
Concurrent same-domain rate-limiting. Currently the batch fires up to ~50 tier-1 fetches in parallel without same-domain coalescing. For batches that happen to be dominated by URLs from a single domain, this is impolite at best and counterproductive (rate-limited by the target) at worst. A per-domain concurrency cap is on the list.
Detection-rule transparency. The escalation logic uses heuristics for “this looks like a challenge page, not a content page.” Surfacing those rules in the output (e.g. an escalation_reason field) would help debugging when an agent gets a failed record and wants to know why.
Structured extraction. The current output formats are all deterministic transformations — no LLM in the loop. A natural next step is an extraction mode where the caller provides a schema or prompt (“extract product name, price, and rating from this page”) and the Actor returns structured JSON. That’s a different cost profile (LLM per call) and likely a separate Actor rather than a feature flag on this one.
Reduce singleton-URL cost. Cold-start dominates the cost of a 1-URL batch; agents that genuinely want to fetch one URL at a time pay disproportionately. A future variant might raise the per-URL fee on small batches to discourage that usage pattern, or — more interesting — fall back to a cheaper non-browser shared service for stealth-tier singletons.