
Where SimilarWeb, SEMrush & Ahrefs Get Their Data

A technical breakdown of how competitive-intel platforms collect data — clickstream, SERP scraping, crawlers — and why the numbers lie more than you think.

  • Data quality
  • Competitive intelligence
  • SEO

The four ways anyone produces "competitor traffic" data

Strip the marketing off every competitive-intelligence platform and you find the same four mechanisms. Knowing which one a number came from tells you most of what you need to know about how to trust it.

1. Panel data (clickstream)

Opt-in users hand over their browsing through a browser extension, a mobile SDK embedded in some app they installed, or in a few cases an ISP partnership. The provider then extrapolates a panel of N million users to a global view. SimilarWeb, Comscore, and the legacy Alexa index were all built on this. SEMrush.Trends adds a clickstream component on top of its core SERP product.

Panel data is the only way to measure non-organic traffic — direct visits, paid traffic, social, app-to-web — for sites you don't own. Everything else in the industry estimates organic only.
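The extrapolation step can be sketched in a few lines. This is an illustrative model with made-up numbers, not any provider's actual methodology; real panels also reweight by geography, device, and demographics before scaling.

```python
# Illustrative sketch of panel extrapolation (all numbers are made up):
# a provider observes visits inside its opt-in panel, then scales by
# the panel's estimated share of the target population.

def extrapolate_panel(panel_visits: int, panel_size: int, population: int) -> int:
    """Scale visits observed in a panel up to a population-level estimate."""
    coverage = panel_size / population        # fraction of users the panel can see
    return round(panel_visits / coverage)     # naive inverse-probability scaling

# 40,000 visits seen in a 2M-user panel, extrapolated to 4B internet users
estimate = extrapolate_panel(panel_visits=40_000,
                             panel_size=2_000_000,
                             population=4_000_000_000)
print(estimate)  # 80000000
```

The point of the sketch is the fragility: every bias in who joins the panel is multiplied by the same factor as the signal.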

2. SERP-derived traffic estimates

Scrape Google's results pages at scale, record what ranks where for which keyword, then multiply search_volume × position_CTR × in_country_factor across every keyword a domain ranks for. The output is what Ahrefs and SEMrush call "Estimated Organic Traffic."

It's an entirely modeled number. Two things have to be roughly right for it to be useful: the search-volume estimate, and the position-based CTR curve. We'll get to how wrong both can be.
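The multiplication described above can be written out directly. The CTR curve and keyword volumes below are illustrative placeholders, not any vendor's actual figures.

```python
# Minimal sketch of a SERP-derived organic traffic estimate.
# The CTR-by-position values are placeholders for a real curve.

CTR_BY_POSITION = {1: 0.28, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def estimated_organic_traffic(rankings, in_country_factor=1.0):
    """rankings: list of (monthly_search_volume, serp_position) pairs."""
    total = 0.0
    for volume, position in rankings:
        ctr = CTR_BY_POSITION.get(position, 0.02)   # long-tail fallback CTR
        total += volume * ctr * in_country_factor
    return round(total)

# A domain ranking #1 for an 8,100/mo keyword and #3 for a 22,000/mo keyword
print(estimated_organic_traffic([(8_100, 1), (22_000, 3)]))  # 4468
```

Every input here is itself an estimate, which is exactly why the output is a model, not a measurement.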

3. Direct measurement

Google Analytics 4, Search Console, Adobe Analytics, server logs. Real users, real clicks, real timestamps. The catch: only the site owner has access. So this is ground truth for one domain — yours — and unavailable for everyone else.

4. Crawlers

Send bots out to fetch and parse the open web. Ahrefs runs the second-largest crawler in the world after Googlebot and uses it to build the backlink graph it's known for. Common Crawl publishes a free monthly snapshot of around 3 billion pages. Crawlers see structure (links, content, metadata), not behavior. They tell you who points at whom, not who visits.

What each major platform actually does

The table below maps the big competitive-intel tools to the mechanisms above. Read it as a lineage chart, not a feature list.

| Platform | Primary mechanism | Best at | Structural blind spot |
| --- | --- | --- | --- |
| SimilarWeb | Panel + ISP + ML imputation | Total-traffic comparison across non-owned sites | US/EU/desktop skew; weak under ~50K monthly visits |
| SEMrush | SERP + CTR curve (+ clickstream add-on) | Keyword-level competitor breakdown | Misses everything that isn't organic search |
| Ahrefs | Crawler + SERP + CTR curve | Backlink graph; content gap analysis | Traffic numbers swing with a few high-volume keywords |
| DataForSEO | SERP / keywords / backlinks API | Pay-per-call programmatic access | Same SERP+CTR derivation; you own the modeling |
| Cloudflare Radar | Real network data (1.1.1.1, edge) | DNS-level trends, regional/network mix | Only sees traffic touching Cloudflare infrastructure |
| Common Crawl | Open web crawl (~3B pages/month) | Free, replayable corpus for research | No traffic data; misses JS-rendered content |
| GA4 / GSC | Direct measurement | Ground truth for sites you own | You can't see anyone else's |

SimilarWeb

The dominant panel-data brand in competitive intelligence. Public filings and reverse-engineering put their data inputs at four sources: a self-built browser-extension panel, public-data partners (mostly ISP-derived in select regions), "direct measurement" deals where site owners volunteer their analytics, and ML imputation to fill the gaps.

What they sell is the only number in the industry that tries to capture total traffic — organic, paid, direct, social, referral — for sites you don't own. That is genuinely useful and there is no real substitute. But the number is a model output. A site with 12.3M reported visits did not have 12.3M visits counted; it had a panel signal, which a model translated into 12.3M.

SEMrush

Hybrid. Their core organic-traffic estimate is SERP+CTR derived: a large keyword database, a position tracker, a CTR curve. They layered on a clickstream offering (".Trends") to cover what SERP scraping alone cannot.

Useful for: which keywords drive a competitor's organic search traffic. Treat the absolute traffic number as soft, and remember it doesn't include direct, social, or paid.

Ahrefs

The strongest crawler in the commercial space. Their marketing claim of a 30+ trillion-URL backlink index is directionally true; the link graph is their moat.

Their traffic numbers are pure SERP+CTR: there is no panel data. That makes them swingy — a domain ranking for one accidentally viral term can show a 10× traffic estimate that is mostly modeling artifact. For backlinks, content explorer, and keyword difficulty, Ahrefs is industry-leading. For absolute traffic, take the number as a directional sketch.

DataForSEO

Wholesale infrastructure. Pay-per-call APIs for SERP results, keyword research, backlinks, domain analytics. They don't run a panel — they run scrapers and buy data — and they price by request rather than by seat. Most of the new wave of "AI SEO agents" (including ours) sit on top of DataForSEO or a peer.

The data is the same kind of data; the price model and the UX are different. The same caveats about CTR curves and search-volume buckets apply.

Cloudflare Radar

A different category. Cloudflare resolves a meaningful share of the world's DNS through 1.1.1.1 and fronts a meaningful share of web traffic through its edge. Radar publishes aggregated signals — request volume, popular domains, country-level traffic mix — drawn directly from that infrastructure.

It's real network measurement, not extrapolation. The catch: if a site sits behind AWS CloudFront, Akamai, Fastly, or Google Cloud edge, Cloudflare can't see it. So Radar tells you what the Cloudflare-fronted slice of the web is doing, which is large but not universal.

Common Crawl

Free, monthly, open-web crawl. Around 3 billion pages per snapshot. It is the dataset under most academic web research, much of the training data debate, and a non-trivial chunk of LLM pretraining corpora. For competitive intelligence specifically, it's useful as a free way to study competitor site structure, content cadence, and external links — but it has no traffic, no behavioral data, and no JavaScript rendering.

The new AI-agent layer

Tools that wrap an LLM around the data sources above and answer in natural language. Some — Perplexity in research mode, ChatGPT with browsing — fetch one page at a time and read it. They're useful for unstructured questions, useless for the kind of join-across-sources work that competitive intel actually needs.

Others — Manus and the data-agent category we sit in — call the underlying APIs (DataForSEO, GA4, GSC, Cloudflare Radar) and join the structured results. The data quality of the agent is exactly the data quality of the sources it calls. The agent doesn't improve the underlying numbers; it improves how fast you can get them and how many sources you can cross-reference per question.
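Mechanically, the join step looks something like the sketch below. The source dicts are hypothetical stand-ins for real API responses; no actual API shapes are implied.

```python
# Hypothetical sketch of the agent's join step: take structured results
# from several sources and merge them into one row per domain. The input
# dicts stand in for real API responses, not actual response formats.

ahrefs_like = {"example.com": {"organic_est": 2_000_000}}
panel_like  = {"example.com": {"total_visits": 3_100_000}}

def join_sources(*sources):
    """Merge per-domain metric dicts into one combined row per domain."""
    merged = {}
    for source in sources:
        for domain, metrics in source.items():
            merged.setdefault(domain, {}).update(metrics)
    return merged

rows = join_sources(ahrefs_like, panel_like)
print(rows["example.com"])  # {'organic_est': 2000000, 'total_visits': 3100000}
```

Note what the join does not do: it carries both numbers forward with their original error bars intact.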

Where the bias hides

Every number above carries a structural bias. The ones below are the ones that actually move decisions.

Panel selection bias

Browser extensions are installed by self-selecting users. The easiest way to grow a panel is to bundle the data collection with a coupon, cashback, or VPN extension — users opt in for the rebate, the panel gets the browsing data. Net effect: panels are skewed toward shopping-tracker users, tech-aware users, and people who don't closely manage their browser extensions.

This skew shows up everywhere downstream. Categories popular with coupon users (consumer e-commerce, deals, marketplaces) tend to be over-represented. Categories popular with privacy-conscious users (developer tools, security, finance) tend to be under-represented.

Geographic blindness

Almost every commercial panel anchors in the US and Europe. China, Japan, Korea, India, and most of Southeast Asia consistently show large under-reporting in independent comparisons. If you're benchmarking a site whose audience is global, the panel is showing you mostly Western traffic and modeling the rest.

Device and app blindness

Mobile is over half of global web traffic, but most panels are predominantly desktop browser extensions. Mobile coverage comes from ISP partnerships (region-limited) or SDKs embedded in mobile apps (app-limited). The mobile traffic figure you see in SimilarWeb is more modeled than measured.

It gets worse for properties that live mostly inside native apps. SimilarWeb can show you chatgpt.com web traffic; what it cannot show is the much larger app and API surface. Same for TikTok, Instagram, and most large social apps. The web property is the tip; the iceberg is invisible.

The CTR curve broke and nobody updated

SEMrush, Ahrefs, and every SERP-derived traffic estimator on the market converts a ranking into a traffic number using a position-based click-through-rate curve. The curves come from third-party clickstream studies — Backlinko, Advanced Web Ranking, Sistrix — that pin position-1 CTR somewhere between 22% and 35% depending on the year and the methodology.

Two problems. First, those curves average across all SERPs, but SERPs are not equal: a query that triggers a featured snippet, a People Also Ask block, an image pack, or a video carousel has a very different organic CTR profile from a clean ten-blue-links SERP. Most estimators don't differentiate.

Second — and this is the one nobody has priced in — Google's AI Overviews went mainstream in 2024. For informational queries where an AI Overview appears, position-1 organic CTR has dropped well below the historical curve. Independent studies put the new ceiling for informational SERPs in the single digits to low teens. The estimators are still using pre-AIO curves. So every "estimated organic traffic" number for an information-heavy site in 2025-2026 is structurally too high.
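The adjustment the estimators haven't made is easy to state in code. The 10% ceiling below is an assumption drawn from the single-digit-to-low-teens range cited above, not a published vendor figure, and the pre-AIO value is likewise illustrative.

```python
# Sketch of an AI-Overview-aware CTR adjustment: cap position-1 CTR for
# queries where an AI Overview appears. Both constants are assumptions.

PRE_AIO_P1_CTR = 0.28   # illustrative historical position-1 curve value
AIO_P1_CEILING = 0.10   # assumed post-AI-Overview ceiling (informational SERPs)

def position1_ctr(has_ai_overview: bool) -> float:
    """Position-1 CTR, capped when an AI Overview is present on the SERP."""
    if has_ai_overview:
        return min(PRE_AIO_P1_CTR, AIO_P1_CEILING)
    return PRE_AIO_P1_CTR

# Same #1 ranking, 8,100 searches/mo, with and without an AI Overview
print(round(8_100 * position1_ctr(False)))  # 2268
print(round(8_100 * position1_ctr(True)))   # 810
```

Same ranking, same volume, roughly a third of the estimated clicks: that gap is what "structurally too high" means in practice.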

Search-volume buckets compound the error

The search-volume number you see ("this keyword gets 8,100 searches per month") is itself derived. Most providers start from Google Keyword Planner data, which Google publishes in bucketed ranges (10–100, 100–1k, 1k–10k, 10k–100k...) and which Google explicitly flags as imprecise. Providers smooth the buckets with clickstream signals and machine learning. The resulting number can carry ±50% error before you do anything with it.

Then you multiply that by a CTR curve that's also approximate. Errors don't cancel — they compound. The five-significant-digit traffic number on the dashboard is, at best, a single-significant-digit signal.
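A back-of-envelope calculation makes the compounding concrete. The ±50% on volume comes from the paragraph above; the ±40% on the CTR curve is an assumed figure for illustration.

```python
# How the errors compound: multiply the optimistic and pessimistic ends
# of each input. Error magnitudes here are illustrative assumptions.

volume_est, volume_err = 8_100, 0.50   # bucketed-and-smoothed search volume
ctr_est, ctr_err = 0.28, 0.40          # averaged position-1 CTR curve value

low  = volume_est * (1 - volume_err) * ctr_est * (1 - ctr_err)
high = volume_est * (1 + volume_err) * ctr_est * (1 + ctr_err)

print(round(low), round(high))  # 680 4763
```

A several-fold spread between the plausible low and high ends, hidden behind one confident-looking point estimate.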

Aggregation is not measurement

When a competitive-intel dashboard shows you "12,347,829 visits last month," that is the output of a model whose inputs are panel signal, partnership data, ML imputation, and smoothing. The exact figure exists for the same reason your weather app shows 73°F instead of "warm-ish": the model emits a number, and the UI displays it.

This isn't a flaw — there's no other way to estimate traffic for a site you don't own. But the precision is decorative. Treat any single-month, single-platform absolute number as accurate to one significant digit at most.
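One way to operationalize "one significant digit" is to round dashboard numbers before they enter any report or decision, for example:

```python
import math

def one_sig_fig(n: int) -> int:
    """Round a positive integer to one significant digit."""
    if n <= 0:
        return 0
    magnitude = 10 ** math.floor(math.log10(n))   # largest power of 10 in n
    return round(n / magnitude) * magnitude

print(one_sig_fig(12_347_829))  # 10000000
```

"About 10M" communicates what the model actually knows; "12,347,829" communicates a precision it doesn't have.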

How to read these numbers without being fooled

Trust trends, not absolutes

Whatever bias a platform carries is roughly stable over time. If SimilarWeb says a competitor went from 10M to 15M monthly visits, the 50% relative move is more reliable than either endpoint. Use deltas. Compare quarter over quarter, not point estimates.
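The reason deltas survive a biased platform is simple algebra: if the platform reports observed = b × true for some roughly stable bias factor b, then b cancels out of any ratio. A quick sketch with assumed numbers:

```python
# If a platform's bias is a stable multiplier b, it cancels out of
# relative change: (b*t2 - b*t1) / (b*t1) == (t2 - t1) / t1.

def relative_change(prev: float, curr: float) -> float:
    """Relative change between two readings."""
    return (curr - prev) / prev

# Platform undercounts everything by 40% (b = 0.6, an assumed figure);
# the true 50% quarter-over-quarter growth survives intact.
true_q1, true_q2 = 10_000_000, 15_000_000
obs_q1, obs_q2 = 0.6 * true_q1, 0.6 * true_q2
print(relative_change(obs_q1, obs_q2))  # 0.5
```

This only holds while the bias stays stable, which is why panel-methodology changes (new data partners, new imputation models) can produce fake "trend breaks."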

Triangulate across two or three platforms

SimilarWeb, SEMrush, and Cloudflare Radar all use different mechanisms. If they agree directionally, you can be fairly confident. If they disagree, treat the signal as noise rather than picking the platform whose answer you liked.
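A simple mechanical version of this rule: only act on a direction when every platform agrees on the sign of the move, and treat small moves as noise. The 10% threshold below is a judgment call, not a standard.

```python
# Triangulation sketch: act only when independent platforms agree on
# the sign of a move. The 10% flat-zone threshold is an assumption.

def direction(prev: float, curr: float, min_move: float = 0.10) -> int:
    """+1 growing, -1 shrinking, 0 flat/noise (move under min_move)."""
    change = (curr - prev) / prev
    if abs(change) < min_move:
        return 0
    return 1 if change > 0 else -1

def platforms_agree(readings) -> bool:
    """readings: list of (prev, curr) pairs, one per platform."""
    signs = {direction(prev, curr) for prev, curr in readings}
    return len(signs) == 1 and 0 not in signs

print(platforms_agree([(10e6, 14e6), (8e6, 11e6), (2e6, 2.6e6)]))  # True
print(platforms_agree([(10e6, 14e6), (8e6, 7e6)]))                 # False
```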

Anchor on something you can directly measure

For your own site, you have GA4 and GSC. Calibrate the platform's bias for your category by comparing what it claims about your traffic against what your analytics actually show. If SimilarWeb consistently undercounts you by 35%, you can mentally correct competitor numbers in the same category by a similar factor. This is the single biggest accuracy win available.
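The calibration step is a two-line calculation. The 35% undercount below is the article's own example; the competitor figure is an assumed input.

```python
# Calibration sketch: derive the platform's bias for your category from
# your own GA4 ground truth, then apply the inverse correction to
# same-category competitor estimates.

def bias_factor(platform_says: float, ga4_says: float) -> float:
    """How much of your true traffic the platform reports (e.g. 0.65)."""
    return platform_says / ga4_says

def corrected(competitor_estimate: float, factor: float) -> float:
    """Undo the category bias on a competitor's reported number."""
    return competitor_estimate / factor

# Platform reports 650K for you; GA4 shows 1M => 35% undercount (factor 0.65)
f = bias_factor(650_000, 1_000_000)
print(round(corrected(1_300_000, f)))  # 2000000
```

The correction only transfers cleanly within your own category and audience mix; a coupon-heavy e-commerce bias factor tells you nothing about a developer-tools competitor.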

Be skeptical of small sites

Panel data has a noise floor. Sites under ~50K monthly visits show large month-over-month swings that are mostly sampling artifact. SimilarWeb hides numbers below an internal confidence threshold; many other tools quietly don't. If a competitor's reported traffic doubled this month and they're below 100K baseline, your default assumption should be sampling noise, not a real swing.

Don't mix "organic" with "total"

Ahrefs estimated traffic = organic search only. SimilarWeb total visits = all sources. SEMrush traffic numbers depend on which widget you're looking at. If a site shows 2M in Ahrefs and 3M in SimilarWeb, that's probably 2M organic plus 1M direct/referral/social — not a contradiction. Always check what each number is actually measuring before comparing them.

Picking the right tool for the question

The honest version of the "which tool should I buy" question is: what question am I actually asking? The answer flows from there.

  • Who links to my competitor? Ahrefs. The backlink index is dominant.
  • Which keywords drive a competitor's organic search? SEMrush or Ahrefs. Both use SERP+CTR; results converge for keywords with stable rankings.
  • What is my competitor's total traffic across all sources? SimilarWeb, with the caveats above. There is no real alternative.
  • What is my own traffic doing? GA4 plus GSC. Don't pay anyone to estimate the ground truth you already own.
  • I want raw API access at low cost. DataForSEO and its peers. Cheaper per call than the seat-based tools, but you build the analysis layer yourself.
  • I need actual network-level evidence (DDoS, country mix, real DNS volume). Cloudflare Radar, where their visibility extends.
  • I need free, replayable data for research. Common Crawl plus your own analysis stack.

A closing note on precision

The data is approximate. That isn't a knock on the platforms — it can't be otherwise unless you own the destination. The mistake is treating the number on the dashboard as more precise than it is, then sizing real decisions to that false precision.

Use these numbers to detect direction: is this market growing, is this competitor accelerating, is our category winning or losing share. Use direct measurement when you need magnitude: how many users, how much revenue, how many conversions. Don't mix the two up. The cost of confusing "directional signal" with "ground truth" is the most expensive recurring mistake in competitive intelligence.