Publishing Your Archive
There are two ways to publish your archive: using the archivebox server
or by exporting and hosting it as static HTML.
1. Use the built-in web server
This server is enabled out-of-the-box if you're using docker-compose to run ArchiveBox, and there is a commented-out example nginx config with SSL set up as well. If hosting publicly, it's essential to place an SSL termination server in front of ArchiveBox (e.g. traefik, caddy, or cloudflared).
[!TIP] Advanced: You can use nginx to serve the static /archive/ dir directly from the filesystem to increase performance. To protect the /admin/ dashboard, the archived content should ideally be served from a different domain using redirects.
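As a rough sketch of the tip above, nginx can serve /archive/ straight from disk and proxy everything else to ArchiveBox. The hostname, certificate paths, data directory path, and upstream address below are all placeholders and assumptions, not values taken from the ArchiveBox project; adjust them to your own setup.

```nginx
server {
    listen 443 ssl;
    server_name archive.example.com;                      # placeholder hostname

    ssl_certificate     /etc/ssl/example/fullchain.pem;   # placeholder cert path
    ssl_certificate_key /etc/ssl/example/privkey.pem;     # placeholder key path

    # Serve archived snapshot files directly from the filesystem.
    location /archive/ {
        alias /path/to/ArchiveBox/data/archive/;          # placeholder data dir
    }

    # Everything else (index, admin UI) goes to the ArchiveBox server,
    # assumed here to be listening on 127.0.0.1:8000.
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

Files served this way never pass through ArchiveBox itself, so any access checks the application applies to snapshots will not apply to them.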
2. Export and host it as static HTML
Here's a sample nginx configuration that works to serve your static archive folder:
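A minimal sketch, assuming the exported archive lives at the placeholder path /path/to/ArchiveBox/output and is published under the placeholder hostname archive.example.com, with TLS terminated by a proxy in front of nginx:

```nginx
server {
    listen 80;
    server_name archive.example.com;      # placeholder hostname

    # Placeholder path to the exported static archive folder.
    root /path/to/ArchiveBox/output;
    index index.html;

    location / {
        # Optional: show directory listings for folders without an index.html.
        autoindex on;
        # Serve only files that exist on disk; anything else is a 404.
        try_files $uri $uri/ =404;
    }
}
```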
Make sure you're not running any content as CGI or PHP; you only want to serve static files!
URLs look like: https://demo.archivebox.io/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem.html
Security Concerns
[!CAUTION] Re-hosting untrusted archived content on a domain can potentially compromise all apps on that domain, including those on other subdomains!
Make sure you thoroughly understand the dangers of hosting untrusted HTML/JS/CSS that may be captured during archiving, and how viewing it can enable CSRF attacks across all apps on the same domain. If a logged-in user visits an archived page with malicious JavaScript embedded, that script could hijack any cookies on the domain and impersonate the user, potentially exfiltrating or modifying other Snapshots/data on your server.
(This is why we don't support serving ArchiveBox from a subdirectory like myapps.example.com/archivebox/; it's too dangerous to share domains.)
The industry-standard approach is to use a separate domain for untrusted content. For example, GitHub uses githubusercontent.com and Google uses googleusercontent.com for all user-uploaded files. If hosting ArchiveBox publicly, do the same and keep it on an isolated domain to limit the potential damage from leaked cookies, CORS issues, and CSRF attacks.
Protecting the Admin Dashboard
To protect the Admin dashboard, it's also recommended to serve all content under /archive/ on a separate domain from /admin/. We do this on our servers using a simple redirect rule in nginx/Cloudflare like so:
https://demo.archivebox.io: only serves /, redirects /archive/* to demo-static.
https://demo-static.archivebox.io: only serves /archive/, redirects everything else to demo.
Note: This is still recommended, but less critical if your /archive/ folder does not contain any archived JS (e.g. if you set SAVE_WGET=False and SAVE_DOM=False).
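The two-domain split described above could be sketched in nginx roughly as follows. This is an illustration under stated assumptions, not the project's actual server config: the example.com hostnames, certificate paths, and the 127.0.0.1:8000 upstream are placeholders, and ArchiveBox would need to accept requests for both hostnames.

```nginx
# Main domain: serves the index and admin UI, redirects archived content away.
server {
    listen 443 ssl;
    server_name demo.example.com;                         # placeholder hostname
    ssl_certificate     /etc/ssl/example/fullchain.pem;   # placeholder cert path
    ssl_certificate_key /etc/ssl/example/privkey.pem;     # placeholder key path

    location /archive/ {
        return 302 https://demo-static.example.com$request_uri;
    }

    location / {
        proxy_pass http://127.0.0.1:8000;                 # assumed ArchiveBox address
        proxy_set_header Host $host;
    }
}

# Static domain: serves only /archive/, redirects everything else back.
server {
    listen 443 ssl;
    server_name demo-static.example.com;                  # placeholder hostname
    ssl_certificate     /etc/ssl/example/fullchain.pem;   # placeholder cert path
    ssl_certificate_key /etc/ssl/example/privkey.pem;     # placeholder key path

    location /archive/ {
        proxy_pass http://127.0.0.1:8000;                 # assumed ArchiveBox address
        proxy_set_header Host $host;
    }

    location / {
        return 302 https://demo.example.com$request_uri;
    }
}
```

Because login cookies are scoped to the main hostname, scripts embedded in pages served from the static hostname cannot read them, provided the cookie is not set for the shared parent domain.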
More info:
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#%EF%B8%8F-things-to-watch-out-for-%EF%B8%8F
Copyright Concerns
[!WARNING] Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons. It's up to you to host responsibly and respond to takedown requests appropriately based on the laws in your jurisdiction.
Archiving for personal backups, research, and some other use cases is covered by fair use copyright exemptions in the USA, but if your archive can deprive the original author of revenue (e.g. if you rehost it for profit), then your use case might no longer be covered and you will have to respond to DMCA takedown notices.
As a general rule of thumb:
Copies cannot be made for commercial purposes.
The copying cannot be systematic (e.g., to replace subscriptions).
All copies made must include a notice stating that the materials may be protected under copyright.
Please modify the FOOTER_INFO config variable to add your contact info to the footer of your index.
Note: By default, ArchiveBox uses /robots.txt to tell search engines not to index your archives. Disabling this is not recommended, as it often leads to a flood of automated takedown requests and abuse reports to your hosting provider (from anti-piracy bots that scan for cloned copyrighted content via search engines).
Keep in mind individuals, companies, schools, and libraries all have different copyright exemptions in different countries. Double check the specific laws for your situation in your own jurisdiction!
Further Reading: USA Copyright Law & Fair Use Exemptions
https://www.copyright.gov/title17/
https://help.archive.org/help/rights/
https://blog.archive.org/2024/03/01/fair-use-in-action-at-the-internet-archive/
https://www.lib.ncsu.edu/workshops/understanding-copyright-and-fair-use-archival-research
https://libguides.colorado.edu/c.php?g=1154758&p=8428124
https://fairuse.stanford.edu/2003/11/10/digital_preservation_and_copyr/
https://guides.library.oregonstate.edu/copyright/libraries
https://www.clir.org/pubs/reports/pub112/body/
https://github.com/pirate/internet-archiving-talk