Remembering the Web

the problem

Before the Internet Archive, the web had no memory.

Pages appeared and disappeared without warning. A URL that worked on Tuesday might return nothing by Thursday. There was no backup. No redundancy. No institution tasked with remembering.

days avg lifespan

∞ pages lost

> GET /~mike/homepage.html HTTP/1.1
> Host: www.geocities.com
> Date: Tue, 14 Oct 1997 08:22:31 GMT

> HTTP/1.1 404 Not Found
> The requested URL was not found on this server.

everything temporary.

the archive

In 1996, Brewster Kahle founded the Internet Archive in San Francisco. A non-profit dedicated to the permanent preservation of digital material.

Its premise: to save the entire internet. All of it.

Organisation: Internet Archive
Founded: 1996, San Francisco
Type: 501(c)(3) non-profit library
Mission: "Universal access to all knowledge"

The physical archive is housed in a former church. Rows of server racks where pews once stood. A modern Library of Alexandria — built with the awareness that the original burned.

the crawlers

The archive operates through automated crawlers: software that moves from link to link across the open web. Wherever a crawler lands, it records what the server sends back. The crawler does not render or evaluate the page. It stores what the server delivered, byte for byte, and reads the markup only to find the next links to follow.

  seed URL            crawler             archive
  --------            -------             -------

  [http://...]  --->  GET request   --->  server responds
                      to server           with HTML + assets
                                                |
                                                v
                                         crawler stores
                                         raw response
                                         (WARC format)
                                                |
                                                v
                                         extract links
                                                |
                                        +-------+-------+
                                        |               |
                                        v               v
                                    follow           follow
                                    link A           link B
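
The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the Internet Archive's actual crawler: the `crawl` function, the injected `fetch` callable, and the in-memory `store` dictionary are hypothetical names chosen for the example. A production crawler would write WARC records, honour robots.txt, and throttle its requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags; nothing else is interpreted."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl.

    `fetch(url)` is a hypothetical callable that returns the raw response
    body as bytes, or None when the page could not be retrieved.
    """
    store = {}                  # url -> raw bytes, exactly as received
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(store) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue            # nothing came back, nothing to store
        store[url] = body       # stored unmodified
        parser = LinkExtractor()
        parser.feed(body.decode("utf-8", errors="replace"))
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store


# A fake two-page "web" so the sketch runs without a network.
# The link to /b.html is dead: the fetch returns None for it.
fake_web = {
    "http://example.test/": b'<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": b"<p>page A</p>",
}
archive = crawl("http://example.test/", fake_web.get)
print(sorted(archive))
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the network, which is also what makes the sketch testable offline.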

snapshots

Each crawl produces what the archive calls a snapshot — a frozen record of the server's response at one specific moment. Dated to the second.

A snapshot is not a copy of a page.
It is a dated performance of a page.
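
"Dated to the second" has a concrete form: the Wayback Machine addresses each snapshot by a 14-digit UTC timestamp (YYYYMMDDhhmmss) followed by the original URL. The sketch below builds such an address. The helper name `snapshot_url` is an assumption made for illustration, and the timestamp is borrowed from the 1997 request shown earlier purely as sample input; it does not imply that a capture exists at that moment.

```python
from datetime import datetime, timezone


def snapshot_url(original_url, captured_at):
    """Build a Wayback-style address: /web/<YYYYMMDDhhmmss>/<original URL>."""
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{original_url}"


# Illustrative capture time only (taken from the sample request's Date header).
captured = datetime(1997, 10, 14, 8, 22, 31, tzinfo=timezone.utc)
url = snapshot_url("http://www.geocities.com/~mike/homepage.html", captured)
print(url)
```

Because the timestamp is part of the address, two captures of the same URL are two distinct records rather than one page that gets overwritten.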

CRAWL LOG: GeoCities, November 1997

1997-11-08T14:22:07Z 200 geocities.com/SunsetStrip/8764/index.html
1997-11-08T14:22:09Z 200 geocities.com/SunsetStrip/8764/music.html
1997-11-08T14:22:11Z 404 geocities.com/SunsetStrip/8764/photos.html
1997-11-08T14:22:14Z 200 geocities.com/SunsetStrip/8764/links.html
1997-11-08T14:22:18Z 200 geocities.com/SunsetStrip/8764/guestbook.html
1997-11-08T14:22:19Z 404 geocities.com/SunsetStrip/8764/about.html

The 404s: two pages already gone before the crawler arrived. The guestbook survived. The photos did not.
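
The tally above can be reproduced by reading the log mechanically. A minimal sketch, assuming each line has exactly the three whitespace-separated fields shown (timestamp, status code, URL); `parse_line` is a hypothetical helper, not part of any archive tooling.

```python
from collections import Counter

LOG = """\
1997-11-08T14:22:07Z 200 geocities.com/SunsetStrip/8764/index.html
1997-11-08T14:22:09Z 200 geocities.com/SunsetStrip/8764/music.html
1997-11-08T14:22:11Z 404 geocities.com/SunsetStrip/8764/photos.html
1997-11-08T14:22:14Z 200 geocities.com/SunsetStrip/8764/links.html
1997-11-08T14:22:18Z 200 geocities.com/SunsetStrip/8764/guestbook.html
1997-11-08T14:22:19Z 404 geocities.com/SunsetStrip/8764/about.html
"""


def parse_line(line):
    """Split one log line into (timestamp, status code, url)."""
    timestamp, status, url = line.split(maxsplit=2)
    return timestamp, int(status), url


entries = [parse_line(line) for line in LOG.splitlines() if line.strip()]

# Count responses per status code, then list the file names that returned 404.
counts = Counter(status for _, status, _ in entries)
missing = [url.rsplit("/", 1)[-1] for _, status, url in entries if status == 404]

print(counts[200], "saved,", counts[404], "missing:", missing)
```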

what was lost

Today the Wayback Machine holds hundreds of billions of archived pages. And still only a fraction of what existed.

Entire categories of content were archived poorly or not at all: password-protected pages, Flash animations, dynamically generated content, sites that blocked crawlers.

GeoCities alone contained ~38 million user-built pages when Yahoo shut it down in 2009. The majority is gone.

800B+ pages archived

100 PB total data

% of web saved

sources: Kahle (1997) · Masanes (2006) · archive.org