Publishing Your Archive
There are two ways to publish your archive: using the archivebox server
or by exporting and hosting it as static HTML.
The server is enabled out-of-the-box if you're using docker-compose to run ArchiveBox, and a commented-out example nginx config with SSL set up is included as well. If hosting publicly, it's essential to place an SSL termination server in front of ArchiveBox.
[!TIP] Advanced: You can use nginx to serve the static /archive/ dir directly from the filesystem to increase performance. To protect the /admin/ dashboard, the archive should ideally be served from a separate domain using redirects.
Here's a sample nginx configuration that works to serve your static archive folder:
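As a minimal sketch (not the project's official config), a server block along these lines serves /archive/ straight from disk. The hostname archive.example.com and the path /path/to/archivebox/data are hypothetical placeholders, and TLS is assumed to be terminated by a separate listener or proxy in front:

```nginx
# Sketch only: hostname and filesystem path are placeholders, adjust to your setup.
server {
    listen 80;
    server_name archive.example.com;   # assumption: your isolated archive domain

    # assumption: the ArchiveBox data directory that contains the archive/ folder
    root /path/to/archivebox/data;

    location /archive/ {
        # Serve files directly from disk; return 404 for anything missing.
        # No fastcgi/proxy handlers here, so nothing is executed as CGI or PHP.
        index index.html;
        try_files $uri $uri/ =404;
    }

    location / {
        # Do not expose the rest of the data directory (database, config, logs).
        return 404;
    }
}
```

Restricting the block to /archive/ keeps nginx from serving other files that live alongside the archive folder.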
Make sure you're not running any content as CGI or PHP; you only want to serve static files!
URLs look like: https://demo.archivebox.io/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem.html
[!CAUTION] Re-hosting untrusted archived content on a domain can potentially compromise all apps on that domain (including other subdomains)!
(This is why we don't support serving ArchiveBox from a subdirectory like myapps.example.com/archivebox/; it's too dangerous to share domains.)
The industry-standard approach is to use a separate domain for untrusted content. For example, GitHub uses githubusercontent.com and Google uses googleusercontent.com for all user-uploaded files. If hosting ArchiveBox publicly, do the same and keep it on an isolated domain to limit the potential damage from leaked cookies, CORS, and CSRF attacks.
To protect the Admin dashboard, it's also recommended to serve all content under /archive/ on a separate domain from /admin/. We do this on our servers using a simple redirect rule in nginx/cloudflare like so:
https://demo.archivebox.io: only serves /, redirects /archive/* to demo-static.
https://demo-static.archivebox.io: only serves /archive/, redirects everything else to demo.
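The two-domain split described above could be sketched in nginx roughly as follows. This is an illustration under assumptions, not the configuration used on the demo servers: the hostnames demo.example.com and demo-static.example.com are placeholders, the ArchiveBox server is assumed to listen on 127.0.0.1:8000, and TLS is assumed to be terminated in front of these listeners (for example at the CDN).

```nginx
# App domain: serves everything except /archive/, which is redirected away.
server {
    listen 80;
    server_name demo.example.com;              # placeholder hostname

    location /archive/ {
        # Send archived content to the isolated static domain.
        return 301 https://demo-static.example.com$request_uri;
    }

    location / {
        proxy_pass http://127.0.0.1:8000;      # assumption: local ArchiveBox server
        proxy_set_header Host $host;
    }
}

# Static domain: serves only /archive/, redirects everything else back.
server {
    listen 80;
    server_name demo-static.example.com;       # placeholder hostname

    location /archive/ {
        proxy_pass http://127.0.0.1:8000;      # assumption: same backend
        proxy_set_header Host $host;
    }

    location / {
        return 301 https://demo.example.com$request_uri;
    }
}
```

Because the admin session cookie is only ever set on the app domain, scripts inside archived pages loaded from the static domain cannot read it.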
More info:
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#%EF%B8%8F-things-to-watch-out-for-%EF%B8%8F
[!WARNING] Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons. It's up to you to host responsibly and respond to takedown requests appropriately based on the laws in your jurisdiction.
As a general rule of thumb:
Copies cannot be made for commercial purposes
The copying cannot be systematic (e.g., to replace subscriptions)
All copies made must include a notice stating that the materials may be protected under copyright.
Keep in mind individuals, companies, schools, and libraries all have different copyright exemptions in different countries. Double check the specific laws for your situation in your own jurisdiction!
https://www.copyright.gov/title17/
https://help.archive.org/help/rights/
https://blog.archive.org/2024/03/01/fair-use-in-action-at-the-internet-archive/
https://www.lib.ncsu.edu/workshops/understanding-copyright-and-fair-use-archival-research
https://libguides.colorado.edu/c.php?g=1154758&p=8428124
https://fairuse.stanford.edu/2003/11/10/digital_preservation_and_copyr/
https://guides.library.oregonstate.edu/copyright/libraries
https://www.clir.org/pubs/reports/pub112/body/
https://github.com/pirate/internet-archiving-talk
Make sure you thoroughly understand the dangers of re-hosting untrusted archived content, and how viewing it can enable attacks across all apps on the same domain. If a logged-in user happens to visit an archived page with malicious JavaScript embedded, the script could hijack that user's cookies on the domain and impersonate them, potentially exfiltrating or modifying other Snapshots/data on your server.
Note: Using a separate domain is still recommended, but less critical if your /archive/ folder does not contain any archived JS.
Archiving for personal backups, research, and some other use cases is generally covered by fair use in the USA, but if your archive can deprive the original author of revenue (e.g. if you rehost it for profit), then your use case might no longer be covered and you have to respond to DMCA takedown notices.
Please modify the config variable to add your contact info to the footer of your index.
Note: ArchiveBox prevents search engines from indexing your archives by default. It's not recommended to disable this, as doing so often leads to a flood of automated takedown requests and abuse reports to your hosting provider (from anti-piracy bots that scan for cloned copyrighted content via search engines).