Building Blink: A Documentation Scraper for Toph

Programming contests present unique challenges that often require creative solutions. When Toph, our competitive programming platform, needed to provide offline access to programming language documentation during contests, we faced a few specific problems that existing solutions couldn’t quite solve. This led to the creation of Blink, a specialized tool for scraping and mirroring programming language documentation.


The Problem

Competitive programming contests, especially on-site events, operate under strict network restrictions. Participants typically have access only to the contest platform (toph.co) and related services. Yet contestants still need programming language references and documentation to write effective code.

While there are excellent existing solutions like devdocs.io, they didn’t quite fit Toph’s requirements. In particular, devdocs.io requires a Ruby backend, which adds deployment complexity, and certain aspects aren’t configurable enough for our needs.

The Solution: A Custom Approach

Blink takes a focused approach to documentation scraping, explicitly designed for contest environments. Built with Go, it produces completely static sites that Toph can serve without any additional infrastructure.

Architecture Overview

Blink follows a pipeline architecture with three main components:

  • Crawlers
  • Processing Pipeline
  • Static Site Generation

1. Crawlers

Blink supports two types of crawlers:

  • Web Crawler: Uses Colly to scrape live websites
  • Filesystem Crawler: Processes downloaded documentation archives (like Python’s HTML docs)

In the following example, we set up the crawler to start from a base URL and follow only links within the allowed domains that match the specified filters. This configuration keeps the crawl focused on relevant content.

return site.New(
	"com.cppreference/c",
	web.New(
		baseURL,
		web.AllowedDomains("en.cppreference.com"),
		web.URLFilters(urlFilters...),
		web.DisallowedURLFilters(disallowedURLFilters...),
		web.DisallowedPaths(disallowedPaths...),
	),
	pipe.New(
		pipe.Filters(

2. Processing Pipeline

The pipeline system transforms the crawled pages into plain, sanitized, and minimal HTML.

	web.DisallowedPaths(disallowedPaths...),
),
pipe.New(
	pipe.Filters(
		pipe.DefaultMeta(),  // Extracts the page title
		pipe.SanitizeHTML(), // Removes all link, style, and script elements
		Marks(),             // Extracts marks (key topics of a page) to be made available as bookmarks
		Meta(),              // Extracts additional metadata from the page
		CleanHTML(),         // Cleans up the HTML content
		pipe.Container("#content"),
		pipe.RewriteURLs(    // Rewrites URLs to be relative to the base URL
			"com.cppreference/c",
			baseURL,
		).
			WithURLFilters(urlFilters...).
			WithDisallowedURLFilters(disallowedURLFilters...).
			WithDisallowedPaths(disallowedPaths...),
		pipe.CleanClassName().
			WithPreserveClasses(
				"t-dcl-begin",
				"t-dsc-header",
				"t-mark-rev",
				"t-li1",
			),
		pipe.CleanStyle(),
		pipe.SyntaxHighlight(),

		pipe.If(pipe.IsURL(baseURL)). // Conditional pipeline rules
			Then(
				Heading("C Programming Language"),
			),
	),
),
site.TrimPathPrefix("/w/c"),
site.Title("C Reference"),

Key pipeline filters include:

  • HTML Sanitization: Removes unwanted elements and scripts
  • Content Extraction: Focuses on specific page sections (e.g., #content)
  • URL Rewriting: Makes links work in the offline environment
  • Syntax Highlighting: Adds code highlighting using Chroma
  • Style Cleanup: Removes unnecessary CSS while preserving important styling
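The URL rewriting step is what makes the mirror navigable offline. As a rough sketch of the idea (not Blink’s actual implementation; the `/docs/c` prefix and the trimmed `/w/c` path are illustrative, echoing the `site.TrimPathPrefix("/w/c")` option above), absolute documentation links can be mapped onto local paths with the standard library alone:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteURL converts a documentation URL into a path that works when
// the mirror is served offline under a local prefix. External links
// are left untouched.
func rewriteURL(raw, base, localPrefix string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u, err := b.Parse(raw) // resolves relative links against the base
	if err != nil {
		return "", err
	}
	if u.Host != b.Host {
		return raw, nil // external link: keep as-is
	}
	p := strings.TrimPrefix(u.Path, "/w/c")
	return localPrefix + p, nil
}

func main() {
	out, _ := rewriteURL("https://en.cppreference.com/w/c/string",
		"https://en.cppreference.com/w/c", "/docs/c")
	fmt.Println(out) // /docs/c/string
}
```

Resolving links against the base URL first means relative and absolute links in the source HTML are handled uniformly.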

3. Static Site Generation

The scraper produces an entirely static site structure. All generated HTML files include a frontmatter section for metadata. Blink also minifies the HTML to keep things compact.

In addition, it emits a site.json file containing site-specific metadata as well as a data structure of all the extracted marks. This file also tracks redirects discovered during the crawl, so the server can answer requests for moved pages.

Implementation Details

Smart Content Processing

Blink implements intelligent content filtering. For example, when scraping the C documentation we exclude experimental features and historical content, since those aren’t relevant to programming contests.
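Filters like the `disallowedURLFilters` passed to the crawler above can express this kind of exclusion as regular expressions. The patterns below are illustrative, not Blink’s actual filter list:

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative disallow patterns in the spirit of the C reference
// scrape: skip experimental-feature pages, historical revisions, and
// wiki internals. The exact patterns Blink uses may differ.
var disallowedURLFilters = []*regexp.Regexp{
	regexp.MustCompile(`/w/c/experimental(/|$)`),
	regexp.MustCompile(`[?&]oldid=`), // MediaWiki historical revisions
	regexp.MustCompile(`/mwiki/`),    // edit, history, and other wiki pages
}

func allowed(u string) bool {
	for _, re := range disallowedURLFilters {
		if re.MatchString(u) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allowed("https://en.cppreference.com/w/c/string/byte/strcpy")) // true
	fmt.Println(allowed("https://en.cppreference.com/w/c/experimental/fpext")) // false
}
```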

Flexible Serving

Blink includes a built-in server tool for quickly exploring the scraped documentation. It is handy for sanity checks before deployment and for debugging while integrating a new documentation source.

The tool uses Pico CSS for clean, minimal styling that doesn’t distract from the documentation content.

Incremental Updates

Blink is smart about avoiding unnecessary work. It compares generated content with existing files to skip unchanged pages.

It also has a built-in file-based cache that the crawler uses to avoid sending requests to the documentation source for pages it has already crawled.

Why This Approach Works

Setting up documentation for a contest is straightforward. Any web server can serve the generated static files in the out/ directory.
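Since the output is plain static files, a stock file server is enough; for instance, a minimal stdlib server (the out/ directory name matches the output directory mentioned above, while the port is arbitrary):

```go
package main

import (
	"log"
	"net/http"
)

// docsHandler serves the generated static documentation directory.
func docsHandler(dir string) http.Handler {
	return http.FileServer(http.Dir(dir))
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", docsHandler("out")))
}
```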

Toph serves it right in the contest arena. Several properties make this approach a good fit:

  1. Static Output: No external website dependencies during contests
  2. Focused Scope: Only includes essential documentation
  3. Fast Processing: Go’s performance enables quick updates
  4. Clean Content: Removes distractions and navigation cruft
  5. Offline First: Everything works without full internet access

Current Support

As of this writing, Blink processes:

  • C/C++ References: From cppreference.com
  • Go documentation: From the docs that come packaged with the toolchain
  • Python documentation: From official Python docs archives

The modular architecture makes adding new documentation sources straightforward by implementing new site configurations.

The Result

What started as a specific need for contest environments has resulted in a flexible, fast documentation scraper. Blink shows that sometimes the best answer to a niche problem is a purpose-built tool rather than an adaptation of an existing one.

For competitive programming organizers and anyone who needs clean, offline documentation mirrors, Blink provides a lightweight alternative that gets the job done without unnecessary complexity.