Before the Internet Archive, the web had no memory.
Pages appeared and disappeared without warning. A URL that worked on Tuesday might return nothing by Thursday. There was no backup. No redundancy. No institution tasked with remembering.
Everything was temporary.
In 1996, Brewster Kahle founded the Internet Archive in San Francisco. A non-profit dedicated to the permanent preservation of digital material.
The physical archive is housed in a former church. Rows of server racks where pews once stood. A modern Library of Alexandria — built with the awareness that the original burned.
The archive operates through automated crawlers — software that moves from link to link across the open web. Wherever a crawler lands, it records what the server sends back. The crawler does not interpret the page. It stores what the server delivered, byte for byte.
seed URL          crawler             archive
--------          -------             -------
[http://...] ---> GET request    ---> server responds
                  to server           with HTML + assets
                                              |
                                              v
                                      crawler stores
                                      raw response
                                      (WARC format)
                                              |
                                              v
                                      extract links
                                              |
                                      +-------+-------+
                                      |               |
                                      v               v
                                    follow          follow
                                    link A          link B
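The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the archive's actual crawler: the names `crawl`, `fetch`, `LinkExtractor`, and `FAKE_WEB` are hypothetical, the `example.test` URLs are placeholders, and responses are kept in a plain dictionary rather than written out in WARC format. The `fetch` argument stands in for the GET request, so the sketch runs against an in-memory set of pages instead of the live web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting at seed_url.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the raw response body as bytes, or None if nothing came back.
    Returns a dict mapping each visited URL to the stored bytes (or None).
    """
    stored = {}
    queue = deque([seed_url])
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        if url in stored:
            continue  # already recorded during this crawl
        body = fetch(url)
        stored[url] = body  # kept exactly as delivered, byte for byte
        if body is None:
            continue  # nothing to extract links from
        parser = LinkExtractor()
        # Decoding is only for link discovery; the stored copy stays raw.
        parser.feed(body.decode("utf-8", errors="replace"))
        for href in parser.links:
            queue.append(urljoin(url, href))  # resolve relative links
    return stored


# A tiny in-memory "web" standing in for real servers (hypothetical URLs).
FAKE_WEB = {
    "http://example.test/": b'<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": b"<p>page A</p>",
}

snapshot = crawl("http://example.test/", FAKE_WEB.get)
```

Here the seed page links to two others. One is still there and is stored as delivered. The other is already gone, so the crawler records that nothing came back and moves on.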
Each crawl produces what the archive calls a snapshot — a frozen record of the server's response at one specific moment. Dated to the second.
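In Wayback Machine URLs, that moment appears as a 14-digit timestamp in the form YYYYMMDDhhmmss. The sketch below shows how such a stamp might be turned back into a date and time. The function name `parse_snapshot_timestamp` and the sample value are illustrative only, and treating the stamp as UTC is an assumption of this example.

```python
from datetime import datetime, timezone


def parse_snapshot_timestamp(stamp):
    """Parse a 14-digit YYYYMMDDhhmmss string into a timezone-aware datetime.

    The UTC timezone attached here is an assumption of this sketch.
    """
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Made-up example stamp, not a reference to a real capture.
moment = parse_snapshot_timestamp("19961231235959")
print(moment.isoformat())  # 1996-12-31T23:59:59+00:00
```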
The 404s: two pages already gone before the crawler arrived. The guestbook survived. The photos did not.
Today the Wayback Machine holds hundreds of billions of archived pages. And still only a fraction of what existed.
Entire categories of content were never archived: password-protected pages, Flash animations, dynamically generated content, sites that blocked crawlers.
GeoCities alone contained ~38 million user-built pages when Yahoo shut it down in 2009. The majority is gone.