AUSTIN, Texas - As much as subscription services want you to believe otherwise, not everything can be found on Amazon or Netflix. Want to read Brett Kavanaugh buddy Mark Judge's old book, for instance (or even their now-infamous yearbook)? Curious to watch a bunch of vintage smoking ads? How about perusing the largest collection of Tibetan Buddhist literature in the world? There's one place to turn today, and it's not Google or any pirate sites you may or may not frequent.
“I've got government video of how to wash your hands or prep for nuclear war,” says Mark Graham, director of the Wayback Machine at the Internet Archive. “We could easily make a list of .ppt files in all the websites from .mil, the Military Industrial PowerPoint Complex.”
Graham recently talked with several small groups of attendees at the 2018 Online News Association conference, and Ars was lucky enough to be part of one. He later made a full presentation to the conference, which is now available in audio form. And the immediate takeaway is that the scale of the Internet Archive today may be as hard to fathom as the scale of the Internet itself.
The longtime non-profit's physical space remains easy to comprehend, at least, so Graham starts there. The main operation now runs out of an old church (pews still intact) in San Francisco, with the Internet Archive today employing nearly 200 staffers. The archive also maintains a nearby warehouse for storing physical media: not just books, but things like vinyl records, too. That's where Graham jokes the main unit of measurement is “shipping container.” The archive gets that much material every two weeks.
The organization currently stands as the second-largest scanner of books in the world, behind Google. Graham put the current total above four million. The archive even has a wishlist for its next 1.5 million scans, including anything cited on Wikipedia. Yes, the Wayback Machine is in the process of making sure you're not finding 404s during any Wiki rabbit hole (Graham recently told the BBC that Wayback bots have restored nearly six million pages lost to linkrot as part of that effort). Today, books published prior to 1923 are free to download through the Internet Archive, and a lot of the material published afterward can be borrowed as a digital copy.
So grateful for the extraordinary work our friends at @internetarchive are doing to fight 404s and digitally preserve millions of links to websites and sources Wikipedians cite, as they build the world's largest encyclopedia. https://t.co/LRN2uyFQKQ
— WikiResearch (@WikiResearch) October 2, 2018
Of course, the Internet Archive offers much more than text these days. Its broadcast-news collection holds more than 200 million hours, with tools such as the ability to search for words in chyrons and access to recent news (broadcasts are embargoed for 24 hours and then delivered to visitors in searchable two-minute chunks). The growing audio and music portion of the Internet Archive covers radio news, podcasting, and physical media (like a collection of 200,000 78s recently donated by the Boston Public Library). And as Ars has written about, the organization boasts an extensive classic video game collection that anyone can boot up in a browser-based emulator for research or leisure. Officially, that section involves 300,000-plus overall software titles, “so you can actually play Oregon Trail on an old Apple C computer through a browser right now: no advertising, no tracking users,” Graham says.
“Some might call us hoarders,” he says. “I like to say we're archivists.”
In total, Graham says the Internet Archive adds four petabytes of information per year (that's four million gigabytes, for context). The organization's current data totals 22 petabytes, but the Internet Archive actually holds on to 44 petabytes' worth. “Because we're paranoid,” Graham says. “Machines can go down, and we have a reputation.” That NASA-ish ethos helped the non-profit once survive nearly $600,000 worth of fire damage, all without any archived data loss.
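For readers who want to check the arithmetic behind those storage figures, here is a minimal Python sketch. It assumes decimal (SI) units, where one petabyte is one million gigabytes, and it reads the gap between 22 and 44 petabytes as an average number of copies kept. The function names are ours for illustration, not anything the Internet Archive publishes.

```python
# Figures are the ones Graham quotes in the article.
# Assumption: decimal (SI) units, so 1 petabyte = 1,000,000 gigabytes.
GB_PER_PB = 1_000_000

UNIQUE_PB = 22           # distinct data the archive holds
STORED_PB = 44           # what is actually kept on disk
GROWTH_PB_PER_YEAR = 4   # new data added each year


def pb_to_gb(petabytes: float) -> float:
    """Convert petabytes to gigabytes using SI prefixes."""
    return petabytes * GB_PER_PB


def copies_kept(stored_pb: float, unique_pb: float) -> float:
    """Average number of copies implied by stored vs. unique totals."""
    return stored_pb / unique_pb


print(pb_to_gb(GROWTH_PB_PER_YEAR))       # gigabytes added per year
print(copies_kept(STORED_PB, UNIQUE_PB))  # copies of each byte, on average
```

Under those assumptions, the numbers work out to four million gigabytes a year and two copies of everything.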
Universal access to knowledge (and facts, so many facts)
The mission statement of the Internet Archive throughout its 22 years has been simple: “universal access to all knowledge.” Doing that in the Web era means deploying a small army of bots, of course, and Graham notes the Internet Archive constantly has software crawling for content. Roughly 7,000 simultaneous processes reach across the Web to snag 1.5 billion things per week. Some pages, like the Google or New York Times home pages, may get looked at many times in a day; others are visited less often.
“We try to get everything, but it's challenging,” Graham notes. “Embeds, Javascripts, interactive apps: we can't get some of this stuff, but we're working on this.”
That working-on-it list includes ephemeral media such as Snapchat or public Telegram groups, and the Wayback Machine maintains on-the-ground contacts in places where media archives or servers may be at risk (Graham mentions recent partners in Egypt, for instance).
The upshot of all this is that the Wayback Machine has evolved into something with far more utility than simply amusing trips to LiveJournals of yore. Ars has used it numerous times, for everything from catching changes in Comcast's net neutrality pledge to seeing how Defense Distributed's organizational description evolved. And Graham points to a 2018 controversy when President Trump tweeted that Google didn't promote the State of the Union on its homepage (as it had done in the past). Before Google responded, the company reached out to the Internet Archive with a simple question: have a copy?
“I love Google, but their job isn't to make copies of the homepage every 10 minutes,” Graham says. “Ours is.”
Graham shares that the Wayback Machine had, in fact, captured 835 instances of the Google homepage that day in January 2018. “So we were able to help set the record straight. We don't take sides, but we're in favor of the truth.”
The site played a similar role when the White House recently deleted the entirety of its newsletter archives, and a number of organizations (not just news outlets, but entities like environmental organizations or the ACLU) reached out for captures. And evidence from the Wayback Machine has been admissible in court. “There's a lot that happens in terms of time stamping,” he adds. As a former VP at NBC News (hence his willingness to attend ONA, perhaps), Graham also proudly points to the site being referenced roughly five times a day within media.
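To put that crawl rate in more familiar terms, a quick back-of-the-envelope sketch in Python follows. It takes the two figures Graham cites (1.5 billion captures per week, roughly 7,000 processes) and assumes the work is spread evenly across the week and across processes, which real crawlers almost certainly do not do.

```python
# Rough averages only. Assumption: captures are spread evenly over the
# week and over all crawler processes.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def captures_per_second(captures_per_week: float) -> float:
    """Average capture rate across the whole crawl."""
    return captures_per_week / SECONDS_PER_WEEK


def per_process_rate(captures_per_week: float, processes: int) -> float:
    """Average captures per second handled by a single process."""
    return captures_per_second(captures_per_week) / processes


print(round(captures_per_second(1.5e9)))          # overall, per second
print(round(per_process_rate(1.5e9, 7000), 2))    # per process, per second
```

On those assumptions, the crawl averages about 2,480 captures every second, or roughly one capture every three seconds per process.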
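Anyone can run a similar count. The sketch below builds a query against the Wayback Machine's public CDX endpoint, which lists captures of a URL, and tallies the rows that come back. The date is an assumption on our part: the article only says “that day in January 2018,” and we use 30 January 2018, the date of that year's State of the Union address. We also assume an eight-digit `from`/`to` value covers the whole day. The function names are illustrative, and the network call is left uninvoked.

```python
import json
import urllib.request
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(site: str, day: str) -> str:
    """Build a CDX query for captures of `site` on `day` (YYYYMMDD)."""
    params = {"url": site, "from": day, "to": day, "output": "json"}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def count_captures(rows: list) -> int:
    """With output=json the first row holds field names, so skip it."""
    return max(len(rows) - 1, 0)


def average_interval_seconds(captures_per_day: int) -> float:
    """Average gap between captures if they were evenly spaced."""
    return 24 * 60 * 60 / captures_per_day


def fetch_rows(site: str, day: str) -> list:
    """Network call: download and parse the CDX rows for one day."""
    with urllib.request.urlopen(cdx_query_url(site, day), timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumed date; prints the query URL without contacting the server.
    print(cdx_query_url("google.com", "20180130"))
    print(average_interval_seconds(835))
```

For scale, 835 evenly spaced captures in a day would mean a copy of the homepage a little more often than once every two minutes.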
To support these kinds of efforts, Graham says the Wayback Machine has been quietly improving its user-facing tools. On the bottom left of the main Wayback Machine page, you'll find publicly available APIs, for instance. Graham points to folks using these to build things like a differentiator, where you can take two captures side-by-side and see the changes. Another user-created tool that caught his eye lets you look at a site and make a radial tree graph to see its structure changing over time.
Perhaps the simplest and most effective tool of all comes from the Wayback Machine itself: the site allows anyone to manually send a link to the Internet Archive for archiving right from its homepage. “If I'm walking my cat in the garden and I see a story in Google News, you can send it to a printer. But today you can also send to the Internet Archive,” Graham says. He estimated up to one million captures per week can come from that.
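The comparison idea is simple enough to sketch with Python's standard `difflib` module. This is not the tool Graham described, just a minimal illustration of diffing the text of two captures. The two sample texts are invented for the example, and fetching real captures is left out.

```python
import difflib


def diff_captures(old_text: str, new_text: str) -> list:
    """Return a unified diff between the text of two captures."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="older capture",
        tofile="newer capture",
        lineterm="",
    ))


def changed_lines(diff: list) -> list:
    """Keep only added (+) and removed (-) lines.

    The first two diff lines are the ---/+++ file headers, so skip them.
    """
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Invented sample text standing in for two captures of the same page.
OLDER = "We do not block content.\nWe do not throttle.\nWe do not prioritize traffic."
NEWER = "We do not block content.\nWe do not throttle."

for line in changed_lines(diff_captures(OLDER, NEWER)):
    print(line)
```

A side-by-side viewer would present the same information more readably, but the underlying step is the same: line up two captures and report what was added or removed.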
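The same submission can be scripted. A minimal sketch, assuming the Wayback Machine's “Save Page Now” address accepts a plain request of the form `https://web.archive.org/save/` followed by the page URL, and that anonymous requests are allowed (rate limits may apply). The function names are ours, and the network call is defined but never run here.

```python
import urllib.request
from urllib.parse import urlparse

SAVE_PREFIX = "https://web.archive.org/save/"


def save_url(page_url: str) -> str:
    """Build the Save Page Now address for an http(s) page."""
    parsed = urlparse(page_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an http(s) URL: {page_url!r}")
    return SAVE_PREFIX + page_url


def request_capture(page_url: str) -> int:
    """Network call: ask for a capture and return the HTTP status code."""
    with urllib.request.urlopen(save_url(page_url), timeout=60) as resp:
        return resp.status


if __name__ == "__main__":
    # Prints the address only; nothing is submitted.
    print(save_url("https://example.com/story"))
```

Checking the scheme first keeps obviously malformed links from being sent at all.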
“We cast a really big net without pretense,” he says. And whether the bots find something or a dedicated amateur archivist does, the rest of us can simply appreciate the ability to find content like, oh, the original Ars Technica mission. (Luckily, 20 years later, no one has yet reported us for “bad, bad things like NT, Linux, and BeOS content under the same roof.”)
Listing image by Nathan Mattise
Ars Technica