Web Archiving Community

💬 Join us on our new ArchiveBox community chat server: https://Zulip.ArchiveBox.io

🔢 Just getting started and want to learn more about why Web Archiving is important? Check out this article: On the Importance of Web Archiving.

The internet archiving community is surprisingly far-reaching and almost universally friendly! It has some overlap with the scraping and OSINT worlds, but it's also kinda its own thing.

Whether you want to learn which organizations are the big players in the web archiving space, find a specific open source tool for your web archiving needs, or just see where archivists hang out online, this is my attempt at an index of the entire web archiving community. I can't promise that this list is up to date; the bulk of it was written in ~2022.

The Master Lists Community-maintained indexes of web archiving tools and groups by IIPC, COPTR, ArchiveTeam, Wikipedia, & the SAA.
Web Archiving Software Open source tools and projects in the internet archiving space.
Reading List Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.
Communities A collection of the most active internet archiving communities and initiatives.

The Master Lists

Indexes of archiving institutions and software maintained by other people. If there's anything archivists love doing, it's making lists.

COPTR Wiki of Web Archiving Tools (COPTR)
Awesome Web Archiving Tools (IIPC)
My up-to-date list of starred archiving GitHub projects
Spreadsheet Comparison of Archiving Tools (DataTogether)
Awesome Web Crawling Tools
Awesome Web Scraping Tools
ArchiveTeam's List of Software (ArchiveTeam.org)
List of Web Archiving Initiatives (Wikipedia.org)
Directory of Archiving Organizations (Society of American Archivists)

Web Archiving Projects

Bookmarking Services

Linkwarden Modern bookmarking UI with singlefile archiving
Hoarder
Gosuki A lightweight, open-source, privacy-first bookmark manager that unifies bookmarks across multiple browsers
~~Pocket Premium~~ ~~Bookmarking tool that provides an archiving service in their paid version, run by Mozilla~~
Pinboard Bookmarking tool that provides archiving in a paid version, run by a single independent developer
Raindrop Bookmarking tool with archiving in their paid version, run by a company est. 2011
Instapaper Bookmarking alternative to Pocket/Pinboard (with no archiving)
Wallabag / Wallabag.it Self-hostable web archiving server that can import via RSS
Shaarli Self-hostable bookmark tagging, archiving, and sharing service
ReadWise A paid Pocket/Pinboard alternative that includes article snippet and highlight saving
Diigo Another bookmarking/annotation service with archiving as a paid feature

From the Archive.org & Archive-It teams

Archive.org The O.G. Wayback Machine provided publicly by the Internet Archive (Archive.org)
Archive.it commercial Wayback Machine solution
Heritrix The king of internet archiving crawlers, powers the Wayback Machine
Brozzler chrome headless crawler + WARC archiver maintained by Archive.org
WarcProx WARC proxy recording and playback utility
WarcTools utilities for dealing with WARCs
Grab-Site An easy preconfigured web crawler designed for backing up websites
WPull A pure python implementation of wget with WARC saving
More on their GitHub...

From Webrecorder

Webrecorder develops a suite of open source tools to capture websites and replay them at a later time as accurately as possible. Webrecorder also publishes the WACZ file format spec.

Browsertrix Fully integrated (self hostable) SaaS web archiving platform
ArchiveWeb.page Chrome extension for manual, interactive archiving of websites as you browse the web. Good for capturing high-fidelity complex interactions
ReplayWeb.page Web archive viewer that runs entirely in the browser and doesn't require any server-hosted component to view WARC and WACZ files. Also available as a standalone electron app for local desktop use
Browsertrix Crawler Command-line crawling application that powers Browsertrix's core crawling features
pywb aka Python Wayback, the open source toolkit forked from archive.org for self-hosting your own wayback machine among other web archiving tools
warcit Create a WARC file out of a folder full of assets
warcio fast streaming asynchronous WARC reader and writer
More on their GitHub...

From Rhizome.org (Conifer)

Conifer by Rhizome.org An open-source personal archiving server that uses pywb under the hood. Previously affiliated with Webrecorder

From the Old Dominion University: Web Science Team

ipwb A distributed web archiving solution using pywb with IPFS for storage
archivenow tool that pushes URLs into all the online archive services like Archive.is and Archive.org
node-warc Parse And Create Web ARChive (WARC) files with node.js
WAIL Web archiver GUI using Heritrix and OpenWayback
Squidwarc User-scriptable, archival crawler using Chrome
WAIL (Electron) Electron app version of the original WAIL for creating and interacting with web archives
warcreate a Chrome extension for creating WARCs from any webpage
More on their GitHub...

From the Archives Unleashed Team

AUT Archives Unleashed Toolkit for analyzing web archives (formerly WarcBase)
Warclight A Rails engine for finding and searching web archives
More on their GitHub...

From the IIPC team

OpenWayback Open source project developing core Wayback Machine components
awesome-web-archiving Large list of archiving projects and orgs
JWARC A Java library for reading and writing WARC files.
More on their GitHub...

Other Public Archiving Services

https://archive.is / https://archive.today
https://ghostarchive.org
https://perma.cc
https://arquivo.pt
https://www.pagefreezer.com
https://www.smarsh.com
https://www.stillio.com
https://archive.st
https://theoldnet.com/
https://timetravel.mementoweb.org/
https://freezepage.com/
https://webcitation.org/archive
https://archiveofourown.org/
https://megalodon.jp/
https://www.webarchive.org.uk/ukwa/
https://github.com/HelloZeroNet/ZeroNet (super cool project)
Google, Bing, DuckDuckGo, and other search engine caches

Other ArchiveBox Alternatives

There are much more recent projects listed here: https://github.com/stars/pirate/lists/internet-archiving

Browsertrix + ArchiveWeb.page + ReplayWeb.page Webrecorder's archiving suite has the highest fidelity, and can flawlessly archive YouTube, X, Facebook, and other complex, JS-heavy SPAs
SingleFile Web Extension / CLI util for Firefox and Chrome to save a web page as a single HTML file
Memex by Worldbrain.io a beautiful, user-friendly browser extension that archives all history with full-text search, annotation support, and more
Hypothes.is a web/pdf/ebook annotation tool that also archives content
Reminiscence extremely similar to ArchiveBox, uses a Django backend + UI and provides auto-tagging and summary features with NLTK
Shaarchiver very similar project that archives Firefox, Shaarli, or Delicious bookmarks and all linked media, generating a markdown/HTML index
Archivy Python-based self-hosted knowledge base embedded into your filesystem
Polarized a desktop application for bookmarking, annotating, and archiving articles offline
LinkWarden Link archival and curation web app, very similar to ArchiveBox
Photon a fast crawler with archiving and asset extraction support
Scoop Create high-fidelity WARC/WACZ captures using a playwright browser, with support for signing, media extraction, PDFs, etc. (by the Perma.cc team)

Ones I haven't personally vetted:

Shiori Simple bookmark manager + readability archiver built with Go (like a clone of Pocket)
Percollate A command-line tool to turn web pages into beautiful, readable PDF, EPUB, or HTML docs.
LinkAce A self-hosted bookmark management tool that saves snapshots to archive.org
LinkDing Self-hosted bookmark manager that is designed to be minimal, fast, and easy to set up using Docker.
LinkWallet A self-hosted bookmark database with full-text page content search and limited archiving features
Espial Bookmark manager and search tool with limited archiving features
Diskernet Archiving tool that uses the Chrome debugger protocol to save each page as-loaded in the browser (aka 22120 by c0fe or i5ik)
Trilium Personal web UI based knowledge-base with web clipping and note-taking
Herodotus Django-based web archiving tool with a focus on collecting text-based content
Buku Browser-independent bookmark manager CLI written in Python3 and SQLite3
ReadableWebProxy A proxying archiver that downloads content from sites and can snapshot multiple versions of sites over time
Perkeep "Perkeep lets you permanently keep your stuff, for life."
Fossilo A commercial archiving solution that appears to be very similar to ArchiveBox
NeonLink Simple self-hosted bookmark management + Benotes note-taking app with limited archiving features
Archivematica web GUI for institutional long-term archiving of web and other content
Headless Chrome Crawler distributed web crawler built on puppeteer with screenshots
WWWofle old proxying recorder software similar to ArchiveBox
Erised Super simple CLI utility to bookmark and archive webpages
Zotero collect, organize, cite, and share research (mainly for technical/scientific papers & citations)
TiddlyWiki Non-linear bookmark and note-taking tool with archiving support
Joplin Desktop + mobile app for knowledge-base-style info collection and notes (w/ optional plugin for archiving)
Hunchly A paid web archiving / session recording tool designed for OSINT
Monolith CLI tool for saving complete web pages as a single HTML file
Obelisk Go package and CLI tool for saving a web page as a single HTML file
Munin Archiver Social media archiver for Facebook, Instagram and VKontakte accounts.
Wayback Archiving in style like ArchiveBox, but with a chat.

Smaller Utilities

Random helpful utilities for web archiving, WARC creation and replay, and more...

https://github.com/TheCakeIsNaOH/xbs-to-archivebox A utility to sync xBrowserSync bookmarks with ArchiveBox
https://github.com/karlicoss/promnesia A browser extension that collects and collates all the URLs you visit into a hierarchical/graph structure with metadata
https://github.com/vrtdev/save-page-state A Chrome extension for saving the state of a page in multiple formats
https://github.com/jsvine/waybackpack command-line tool that lets you download the entire Wayback Machine archive for a given URL
https://github.com/hartator/wayback-machine-downloader Download an entire website from the Internet Archive Wayback Machine.
https://github.com/Lifesgood123/prevent-link-rot Replace any broken URLs in some content with Wayback machine URL equivalents
https://en.archivarix.com download an archived page or entire site from the Wayback Machine
https://proofofexistence.com prove that a certain file existed at a given time using the blockchain
https://github.com/chfoo/warcat for merging, extracting, and verifying WARC files
https://github.com/mozilla/readability tool for extracting article contents and text
https://github.com/mholt/timeliner All your digital life on a single timeline, stored locally
https://github.com/wkhtmltopdf/wkhtmltopdf Webkit HTML to PDF archiver/saver
Sheetsee-Pocket project that provides a pretty auto-updating index of your Pocket links (without archiving them)
Pocket -> IFTTT -> Dropbox Post by Christopher Su on his Pocket saving IFTTT recipe
http://squidman.net/squidman/index.html
https://wordpress.org/plugins/broken-link-checker/
https://github.com/ArchiveTeam/wpull
http://freedup.org/
https://en.wikipedia.org/wiki/Furl
https://preservica.com/digital-archive-software-1/active-digital-preservation For-profit company offering a digital preservation software suite
https://github.com/karlicoss/grasp capture webpages from Firefox and Chrome into Org-mode documents
https://github.com/dgtlmoon/changedetection.io Change detection and monitoring of web page content changes
And many more on the other lists...

Reading List

A collection of blog posts and articles about internet archiving. Contact me or open an issue if you want to add a link here!

Articles We Like About Internet Archiving

https://items.ssrc.org/parameters/on-the-importance-of-web-archiving/
https://theconversation.com/your-internet-data-is-rotting-115891
https://www.bbc.com/future/story/20190401-why-theres-so-little-left-of-the-early-internet
https://sr.ithaka.org/publications/the-state-of-digital-preservation-in-2018/
https://gizmodo.com/delete-never-the-digital-hoarders-who-collect-tumblrs-1832900423
https://siarchives.si.edu/blog/we-are-not-alone-progress-digital-preservation-community
https://www.gwern.net/Archiving-URLs
http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-call-for-a-distributed-web-2/
https://lwn.net/Articles/766374/
https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
https://medium.com/@giovannidamiola/making-the-internet-archives-full-text-search-faster-30fb11574ea9
https://xkcd.com/1909/
https://samsaffron.com/archive/2012/06/07/testing-3-million-hyperlinks-lessons-learned#comment-31366
https://www.gwern.net/docs/linkrot/2011-muflax-backup.pdf
https://thoughtstreams.io/higgins/permalinking-vs-transience/
http://ait.blog.archive.org/files/2014/04/archiveit_life_cycle_model.pdf
https://blog.archive.org/2016/05/26/web-archiving-with-national-libraries/
https://blog.archive.org/2014/10/28/building-libraries-together/
https://ianmilligan.ca/2018/03/27/ethics-and-the-archived-web-presentation-the-ethics-of-studying-geocities/
https://ianmilligan.ca/2018/05/22/new-article-if-these-crawls-could-talk-studying-and-documenting-web-archives-provenance/
https://ws-dl.blogspot.com/2019/02/2019-02-08-google-is-being-shuttered.html

If any of these links are dead, you can find an archived version on https://archive.sweeting.me or https://web.archive.org.
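As a rough illustration of that fallback, here is a minimal sketch that builds a Wayback Machine lookup address for a dead link. It assumes the commonly used `https://web.archive.org/web/<timestamp>/<url>` address pattern; the names `WAYBACK_PREFIX` and `wayback_url` are hypothetical helpers made up for this example and are not part of ArchiveBox or any Internet Archive library.

```python
# Assumption: the Wayback Machine serves captures under this prefix and
# redirects to the nearest capture it holds for the requested address.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine lookup address for `url` (hypothetical helper).

    `timestamp` is an optional YYYYMMDDhhmmss value or a prefix of one
    (for example "2019"). When it is empty, no timestamp segment is added.
    """
    url = url.strip()  # tolerate stray whitespace from copy-pasted links
    if timestamp:
        return f"{WAYBACK_PREFIX}/{timestamp}/{url}"
    return f"{WAYBACK_PREFIX}/{url}"
```

Calling `wayback_url("https://xkcd.com/1909/")` gives an address you can paste into a browser. The function only formats a string; it does not check whether a capture actually exists.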

ArchiveBox-Specific Posts, Tutorials, and Guides

Beware: many of these may be outdated, as ArchiveBox has frequent updates and continual improvement.

"Install ArchiveBox on SaltBox.dev" https://docs.saltbox.dev/sandbox/apps/archivebox/#3-setup
"ArchiveBox is an open-source self-hosted web archiving system for the web and the desktop" https://medevel.com/archivebox/
"Install ArchiveBox on a One-Click Docker Application" https://www.vultr.com/docs/install-archivebox-on-a-oneclick-docker-application/
"ArchiveBox, una solución para crear nuestro propio Archive.org en miniatura y personalizado" (Spanish: "ArchiveBox, a solution for creating our own miniature, personalized Archive.org") https://www.genbeta.com/herramientas/archivebox-solucion-para-crear-nuestro-propio-archive-org-miniatura-personalizado
"网页存档的开源工具ArchiveBox，可以将网页文字、图片、媒体文件等都保存下来，供日后查看。基于Python的开源项目，可搭建私人的网络存档服务。" (Chinese: "ArchiveBox, an open-source web archiving tool, can save web page text, images, media files, and more for later viewing. A Python-based open-source project that can be used to set up a private web archiving service.") https://www.bilibili.com/s/video/BV1ib4y1X7SL
"Персональный интернет-архив без боли" (Russian: "A personal internet archive without the pain") https://habr.com/ru/company/vdsina/blog/550180/
"Preserve the Internet With ArchiveBox" https://www.cyberpunks.com/preserve-the-internet-with-archivebox/
"Сам себе архивариус. Изучаем возможности ArchiveBox" (Russian: "Be your own archivist: exploring what ArchiveBox can do") https://xakep.ru/2021/02/01/archivebox/
"使用存档盒制作自己的Internet存档" (Chinese: "Make your own Internet archive with ArchiveBox") http://www.diglog.com/story/1045192.html
"How to Make Your Own Internet Archive With ArchiveBox" https://nixintel.info/osint-tools/make-your-own-internet-archive-with-archive-box/
"Mit ArchiveBox Webseiten auf der Festplatte archivieren" (German: "Archiving websites to your hard drive with ArchiveBox") https://www.linux-community.de/ausgaben/linuxuser/2020/12/mit-archivebox-webseiten-auf-der-festplatte-archivieren/
"ArchiveBox：开源的WEB存档" (Chinese: "ArchiveBox: open-source web archiving") https://zhen.bushini.de/14738.html / https://www.1fishsauce.com/?p=4206
"两个基于爬虫的项目: Kiwix & ArchiveBox" (Chinese: "Two crawler-based projects: Kiwix & ArchiveBox") https://blog.csdn.net/JackLang/article/details/108328791
"如何创建自己的私人自托管即时阅读应用程序" (Chinese: "How to create your own private self-hosted read-it-later app") https://www.pcpc.me/tech/self-hosted-read-later-app
"How to install ArchiveBox to preserve websites you care about" https://blog.sleeplessbeastie.eu/2019/06/19/how-to-install-archivebox-to-preserve-websites-you-care-about/
"How to remotely archive websites using ArchiveBox" https://blog.sleeplessbeastie.eu/2019/06/26/how-to-remotely-archive-websites-using-archivebox/
"How to Create Your Own Private Self-Hosted Read-It-Later App" https://www.makeuseof.com/tag/self-hosted-read-later-app/
"How to use CutyCapt inside ArchiveBox" https://blog.sleeplessbeastie.eu/2019/07/10/how-to-use-cutycapt-inside-archivebox/
"Automate ArchiveBox with Google Spreadsheet to Backup your internet" https://manfred.life/archivebox
"【デモ有♪】ConoHaのArchiveBoxアプリケーションを使ってみたよ" (Japanese: "[With demo] I tried out the ArchiveBox application on ConoHa") https://qiita.com/CloudRemix/items/691caf91efa3ef19a7ad
"WEB-ARCHIV TEIL 8: WALLABAG UND ARCHIVEBOX" (German: "Web Archive Part 8: Wallabag and ArchiveBox") http://webermartin.net/blog/web-archiv-teil-8-wallabag-und-archivebox/
https://metaxyntax.neocities.org/entries/7.html

ArchiveBox Discussions in News & Social Media

Aggregators: ProductHunt, AlternativeTo, SaaSHub, Logiciels, SteemHunt, Recurse Center: The Joy of Computing, GitHub Changelog, Dev.To Ultra List, O'Reilly 4 Short Links, JaxEnter
Blog Posts & Podcasts: Korben.info, Defining Desktop Linux Podcast #296 (0:55:00), Binärgewitter Podcast #221, Schrankmonster.de, La Ferme Du Web
Hacker News threads and comments: #1, #2, #3, #4, and many more...
Reddit r/DataHoarder, r/SelfHosted, etc. posts and comments: #1, #2, #3, #4, #5, #6, #7, #8, and many more...
Twitter: Python Trending, PyCoder's Weekly, Python Hub, Smashing Magazine, and many more...

Communities

Most Active Communities

The Internet Archive (Archive.org) (USA)
International Internet Preservation Consortium (IIPC) (International)
The Archive Team, URL Team, r/ArchiveTeam (International)
Rhizome.org The digital preservation group that works on Conifer by Rhizome, formerly Webrecorder.io (USA)
Webrecorder (formerly known as Webrecorder.io) is a company led by Ilya Kreymer that researches and develops web archiving tools widely used by the community.
Old Dominion University: Web Science and Digital Libraries (WS-DL @ ODU) (Virginia, USA)
r/DataHoarder, r/Archivists, r/DHExchange (International)
The Eye Non-profit working on content archival and long-term preservation (Europe)
Digital Preservation Coalition & their Software Tool Registry (COPTR) (UK & Wales)
Archives Unleashed Project and UAP GitHub (Canada)

Web Archiving Communities

Follow these technological and organizational archiving hubs for the latest archiving news.

Canadian Web Archiving Coalition (Canada)
Web Archives for Historical Research Group (Canada)
Smithsonian Institution Archives: Digital Curation (Washington D.C., USA)
National Digital Stewardship Alliance (NDSA) (USA)
Digital Library Federation (DLF) (USA)
Council on Library and Information Resources (CLIR) (USA)
Digital Curation Centre (DCC) (UK)
ArchiveMatica & their Community Wiki (International)
Professional Development Institutes for Digital Preservation (POWRR) (USA)
Institute of Museum and Library Services (IMLS) (USA)
Stanford Libraries Web Archiving (USA)
Society of American Archivists: Electronic Records (SAA) (USA)
BitCurator Consortium (BCC) (USA)
Ethics & Archiving the Web Conference (Rhizome) (USA)
Archivists Round Table of NYC (USA)

General Archiving Foundations, Coalitions, Initiatives, and Institutes

Find your local archiving group in the list and see how you can contribute!

You can find more organizations and initiatives on these other lists:

ArchiveBox Community Resources

ArchiveBox Chat Rooms

ArchiveBox on Package Distribution Platforms


