Common Crawl × Hugging Face Storage Buckets

The April 2026 web,
by the numbers.

Every aggregate on this page was computed by querying Common Crawl's columnar URL index with DuckDB — read straight from a Hugging Face Storage Bucket over hf://, on a single Job. Nothing was downloaded.

~2.19B web pages in crawl CC-MAIN-2026-17 · counted from parquet metadata in seconds

Run your own analysis

The charts below are a snapshot I pre-computed, but the whole point is that you don't need me to. Anyone can query the index without downloading it. Prefer to have your own agent dig in for you? Hand it the prompt below: it links the quickstart and buckets guide, and tells the agent to install the HF CLI skill (hf skills add) so it can drive the CLI itself.

The Common Crawl April 2026 crawl (CC-MAIN-2026-17, ~2.19B pages) is on a
Hugging Face Storage Bucket: commoncrawl/commoncrawl. It includes the columnar
URL index (one parquet row per crawled page). Help me explore it WITHOUT
downloading the underlying petabytes.

Get set up (read these, then use the hf CLI):
- huggingface_hub quickstart: https://huggingface.co/docs/huggingface_hub/quick-start
- Buckets guide: https://huggingface.co/docs/huggingface_hub/guides/buckets
- Install the HF CLI skill so you can drive the CLI:
    hf skills add --claude    # Claude Code
    hf skills add             # Codex / Cursor / other agents
- Auth (needed for Jobs): hf auth login   (or set HF_TOKEN)
- Browse the data: hf buckets ls commoncrawl/commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2026-17

Two zero-download ways to query it:

1) DuckDB over hf:// (locally, or inside an HF Job):
   import duckdb, fsspec
   con = duckdb.connect()
   con.register_filesystem(fsspec.filesystem("hf"))
   idx = ("hf://buckets/commoncrawl/commoncrawl/cc-index/table/cc-main/"
          "warc/crawl=CC-MAIN-2026-17/subset=warc/*.parquet")
   con.sql(f"""SELECT url_host_registered_domain, count(*) AS pages
               FROM read_parquet('{idx}')
               GROUP BY 1 ORDER BY 2 DESC LIMIT 20""").show()

2) Scale the full 2.19B-row scan onto Hugging Face Jobs:
   hf jobs uv run --flavor cpu-performance --secrets HF_TOKEN my_query.py

Useful index columns: url, url_host_name, url_host_registered_domain,
url_host_tld, content_languages, content_mime_detected, fetch_status,
fetch_time, warc_filename, warc_record_offset, warc_record_length
(the last three locate a page's bytes in the crawl-data/ WARCs, so you can
go from "rows that match" to the actual page content).

Note: counts are crawl-capture counts (shaped by robots.txt and crawl rate
limits), not popularity. Help me write queries to answer: <my question>.

What this is

Common Crawl mirrors its monthly crawl archive to the Hugging Face Hub as a Storage Bucket. Alongside the raw pages, it now publishes the columnar URL index — one parquet row per crawled page (host, language, MIME type, fetch status, and a pointer to the page's bytes). That makes the whole crawl queryable without touching the petabytes of underlying WARCs.

So you can ask questions of 2+ billion web pages with plain SQL, from your laptop or a small cloud Job, paying only for the columns your query reads.
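The "pointer to the page's bytes" is the trio warc_filename, warc_record_offset and warc_record_length. Below is a minimal sketch of how an offset and length turn into a byte range, assuming (as is usual for Common Crawl WARCs) that each record is stored as its own gzip member. The helper names are hypothetical, and the "WARC file" here is a synthetic stand-in built in memory, not real crawl data.

```python
import gzip


def warc_range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering one record (end is inclusive)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def slice_record(blob: bytes, offset: int, length: int) -> bytes:
    """Cut one gzip member out of a larger byte string and decompress it."""
    return gzip.decompress(blob[offset:offset + length])


# Synthetic stand-in for a WARC file: two independently gzipped records.
first = gzip.compress(b"record one")
second = gzip.compress(b"record two")
fake_warc = first + second

# The second record starts where the first one ends.
print(warc_range_header(len(first), len(second)))
print(slice_record(fake_warc, len(first), len(second)))  # b'record two'
```

With a real index row, the same offset and length would be applied to the file named by warc_filename, so only that one record is transferred rather than the whole WARC.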

Languages of the web

Top values of content_languages (Common Crawl's per-page language ID; comma-separated when multiple).
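Because content_languages is comma-separated, grouping on the raw string counts "eng,rus" and "rus,eng" as two different values. If you want per-language totals instead, each cell has to be split first. A small sketch of that idea in plain Python, using made-up sample cells (the codes and counts are illustrative only, and split_languages is a hypothetical helper):

```python
from collections import Counter


def split_languages(value):
    """Split one content_languages cell into individual language codes."""
    if not value:  # NULL or empty string: no language was detected
        return []
    return [code.strip() for code in value.split(",") if code.strip()]


# Hypothetical cells standing in for index rows.
sample_cells = ["eng", "eng,rus", "deu", None, "rus,eng"]

per_language = Counter(
    code for cell in sample_cells for code in split_languages(cell)
)
print(per_language.most_common())  # [('eng', 3), ('rus', 2), ('deu', 1)]
```

In DuckDB the same split can likely be expressed with unnest(string_split(content_languages, ',')) before the GROUP BY, at the cost of a page being counted once per language it lists.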

Top domains

Top TLDs

Domains are url_host_registered_domain — subdomains roll up to the registered domain (ICANN suffixes), so blogspot.com is all Blogger blogs combined, not a single site. That rollup is why blog/UGC hosts top the list.
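To see why the rollup reorders the ranking, here is a toy illustration with invented rows: each pair is one crawled page, given as (url_host_name, url_host_registered_domain). The hostnames and page counts are hypothetical, chosen only to show the effect.

```python
from collections import Counter

# Hypothetical pages: three small blogs on one platform, one larger standalone site.
rows = (
    [("alice.blogspot.com", "blogspot.com")] * 2
    + [("bob.blogspot.com", "blogspot.com")] * 2
    + [("carol.blogspot.com", "blogspot.com")] * 2
    + [("www.example.org", "example.org")] * 4
)

by_host = Counter(host for host, _ in rows)        # like GROUP BY url_host_name
by_domain = Counter(domain for _, domain in rows)  # like GROUP BY url_host_registered_domain

print(by_host.most_common(1))    # [('www.example.org', 4)]
print(by_domain.most_common(1))  # [('blogspot.com', 6)]
```

No single blog outranks the standalone site, but their combined pages do once they share a registered domain. Grouping by url_host_name instead gives the per-site view.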

Content types

Top values of content_mime_detected. The crawl is ~90% HTML, but it also contains tens of millions of PDFs and a long tail of feeds, plain text, calendars and more.

How it was made

Register the Hugging Face filesystem with DuckDB, point read_parquet at the index on the bucket, and aggregate. DuckDB pushes the projection down, so a GROUP BY streams only the one column it needs:

# pip install duckdb huggingface_hub fsspec
import duckdb, fsspec
con = duckdb.connect()
con.register_filesystem(fsspec.filesystem("hf"))   # teach DuckDB about hf://

idx = "hf://buckets/commoncrawl/commoncrawl/cc-index/table/cc-main/" \
      "warc/crawl=CC-MAIN-2026-17/subset=warc/*.parquet"

con.sql(f"""
  SELECT content_languages, count(*) AS pages
  FROM read_parquet('{idx}')
  GROUP BY 1 ORDER BY 2 DESC LIMIT 20
""").show()   # 2.19B rows · zero download

The full run aggregates all 300 index partitions (~2.19B rows) on one Hugging Face Job in a few minutes for a few cents, reading over hf:// and never staging the data to disk.
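For reference, a script handed to hf jobs uv run could look roughly like the sketch below. It is not the exact script behind this page: build_query, main and INDEX_GLOB are hypothetical names, and the comment header is standard inline script metadata that uv reads to install dependencies. The heavy imports are deferred into main() so the query builder can be checked without DuckDB installed.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["duckdb", "huggingface_hub", "fsspec"]
# ///
"""my_query.py: top values of one index column (a sketch, not the page's exact script)."""
import sys

INDEX_GLOB = (
    "hf://buckets/commoncrawl/commoncrawl/cc-index/table/cc-main/"
    "warc/crawl=CC-MAIN-2026-17/subset=warc/*.parquet"
)


def build_query(column: str, limit: int = 20) -> str:
    """Return a top-N GROUP BY over a single column of the index."""
    if not column.isidentifier():  # refuse anything that is not a bare column name
        raise ValueError(f"not a plain column name: {column!r}")
    return (
        f"SELECT {column}, count(*) AS pages "
        f"FROM read_parquet('{INDEX_GLOB}') "
        f"GROUP BY 1 ORDER BY 2 DESC LIMIT {int(limit)}"
    )


def main(column: str) -> None:
    # Imported here so build_query stays usable without these packages.
    import duckdb
    import fsspec

    con = duckdb.connect()
    con.register_filesystem(fsspec.filesystem("hf"))  # teach DuckDB about hf://
    con.sql(build_query(column)).show()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: my_query.py <index_column>")
```

Run locally, this would be python my_query.py content_mime_detected. Whether the Jobs command forwards a trailing column argument to the script is an assumption worth checking against the CLI help before relying on it.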