Chromium-Install
Chrome / Chromium Setup
By default, ArchiveBox looks for any existing installed version of Chrome/Chromium and uses it if found. You can optionally install a specific version and set the environment variable CHROME_BINARY to force ArchiveBox to use that one, e.g.:
CHROME_BINARY=google-chrome-beta
CHROME_BINARY=/usr/bin/chromium-browser
CHROME_BINARY='/Applications/Chromium.app/Contents/MacOS/Chromium'
CHROME_BINARY='~/Library/Caches/ms-playwright/chromium-857950/chrome-mac/Chromium.app/Contents/MacOS/Chromium'
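Because CHROME_BINARY can be either a command name on your PATH or a path to the binary, it can be worth confirming that a value actually resolves to something executable before giving it to ArchiveBox. Here is a minimal sketch for a POSIX shell. The helper name resolve_chrome_binary is hypothetical (it is not part of ArchiveBox). It expands a leading ~/ by hand, since the shell does not expand a tilde inside quotes.

```shell
# Hypothetical helper (not part of ArchiveBox): print the path that a
# CHROME_BINARY value resolves to, or return non-zero if nothing executable
# is found there.
resolve_chrome_binary() {
    candidate=$1

    # Quoted values like '~/Library/...' arrive with a literal tilde,
    # so expand a leading "~/" to $HOME manually.
    case $candidate in
        '~/'*)
            rest=${candidate#\~/}
            candidate=$HOME/$rest
            ;;
    esac

    case $candidate in
        */*)
            # Contains a slash: treat it as a path to a file.
            [ -f "$candidate" ] && [ -x "$candidate" ] && printf '%s\n' "$candidate"
            ;;
        *)
            # Bare name such as google-chrome-beta: look it up on PATH.
            command -v "$candidate"
            ;;
    esac
}
```

For example, `resolve_chrome_binary google-chrome-beta` prints the resolved location if that command exists on your PATH, and returns a non-zero status otherwise.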
If you don't already have Chrome installed, I recommend installing Chromium instead of Google Chrome, as it's the open-source project that Chrome is built on and it doesn't send as much tracking data to Google.
Check for existing Chrome/Chromium install:
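One way to do this check is to look for the usual binary names on your PATH. The sketch below is only an illustration: the function name and the list of candidate names are assumptions covering common package names, so add or remove names to match your system.

```shell
# Print the path of every Chrome/Chromium binary found on PATH.
# Returns non-zero if none of the candidate names exist.
# The candidate list is an assumption; adjust it for your distro.
list_chrome_installs() {
    found=1
    for name in chromium-browser chromium google-chrome google-chrome-stable google-chrome-beta; do
        if bin_path=$(command -v "$name" 2>/dev/null); then
            printf '%s\n' "$bin_path"
            found=0
        fi
    done
    return "$found"
}

list_chrome_installs || echo "No Chrome/Chromium binary found on PATH" >&2
```

Run any binary it prints with --version to see which release it is. On macOS, app bundles such as /Applications/Chromium.app are not on PATH, so check for those separately.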
Installing Chromium
⭐️ Any OS (recommended)
playwright (by the Microsoft team) and puppeteer (by the Google team) are two options to get stable, repeatable Chromium distributions on many OSs.
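As a sketch of the playwright route, the snippet below wraps the same command this page gives in the remote-host section (pip install playwright && playwright install --with-deps chromium). The function name is hypothetical, and the wrapper exists only so that pasting or sourcing the snippet does not install anything until you call it.

```shell
# Sketch: install a Playwright-managed Chromium build.
# --with-deps also installs the system libraries Chromium needs.
# Nothing is installed until you call the function explicitly.
install_chromium_with_playwright() {
    pip install playwright &&
        playwright install --with-deps chromium
}
```

After running it, point CHROME_BINARY at the downloaded browser. On macOS it lands under ~/Library/Caches/ms-playwright/, as in the CHROME_BINARY example near the top of this page; the exact subdirectory depends on the Chromium build that was downloaded.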
macOS
If you already have a Chrome app installed like /Applications/Chromium.app, you don't need to run this.
Ubuntu/Debian
If you already have chromium-browser >= v111 installed (run chromium-browser --version to check), you don't need to run this.
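If you would rather script the ">= v111" check than read the version by eye, something like the following could work. It is a sketch under one assumption: that --version prints a line shaped like "Chromium 111.0.5563.64" (the sample version numbers here are illustrative). Both function names are hypothetical.

```shell
# Read `--version` output on stdin (e.g. "Chromium 111.0.5563.64")
# and print just the major version number.
chrome_major_version() {
    sed -n 's/^[^0-9]*\([0-9][0-9]*\)\..*$/\1/p' | head -n 1
}

# Succeed if the given binary reports a major version of 111 or newer.
chrome_is_new_enough() {
    major=$("$1" --version 2>/dev/null | chrome_major_version)
    [ -n "$major" ] && [ "$major" -ge 111 ]
}
```

For example, `chrome_is_new_enough chromium-browser && echo "already installed, nothing to do"`. The same check works for google-chrome.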
Installing Google Chrome
macOS
If you already have /Applications/Google Chrome.app, you don't need to run this.
Ubuntu/Debian
If you already have google-chrome >= v111 installed (run google-chrome --version to check), you don't need to run this.
Troubleshooting Chromium Install
If you encounter problems setting up Google Chrome or Chromium, see the Troubleshooting page.
Setting Up a Chromium User Profile
You may choose to set up a Chrome/Chromium user profile in order to use your cookies/sessions to log into sites behind authentication/paywall during archiving.
Note: not all extractors use Chrome (e.g. wget, mercury, media), so COOKIES_FILE should be set up as well after this.
[!WARNING] We strongly recommend you use separate burner credentials dedicated to archiving, e.g. don't provide cookies for your normal daily Facebook/Instagram/Google/etc. accounts as server responses and page content will often contain your name/email/PII, session cookies, private tokens, etc. which then get preserved in your snapshots for eternity.
Future viewers of your archive may be able to use any reflected archived session tokens to log in as you, or at the very least, associate the content with your real identity. Even if this tradeoff seems acceptable now or you plan to keep your archive data private, you may want to share a snapshot with others in the future, and snapshots are very hard to sanitize/anonymize after the fact!
For this reason, it's best to set up dedicated fake profile accounts for each site you want to archive, and consider them burned if you ever share any of your archived snapshots of those sites with untrusted people.
Docker VNC Setup
If using ArchiveBox in Docker, the easiest way to set up session credentials is by remote controlling the ArchiveBox Chrome browser over VNC, and using it to log in to the sites you want to save.
1. Enable the novnc server using these settings in your docker-compose.yml:
2. Start the novnc window server container.
3. Start ArchiveBox's Chrome inside Docker (make sure you set DISPLAY & CHROME_USER_DATA_DIR and added the line to volumes: above first!).
4. Open http://localhost:8080/vnc.html in your browser. You should see a remote Linux desktop with Chrome open, allowing you to remote-control ArchiveBox's browser. Use it to log into any sites where you want to save credentials.

✅ Close the browser, stop & remove novnc, and then run archivebox normally. It will use the profile stored in CHROME_USER_DATA_DIR=/data/personas/Default/chrome_profile going forward, so you should now be able to archive sites as if you were logged in!

Under the hood this uses Xvfb + Fluxbox + novnc to provide a virtual display, window manager, and VNC server + novnc websocket viewer.
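The "start the novnc container" and "stop & remove novnc" steps can be sketched as two small shell functions. This assumes Docker Compose v2 (the docker compose subcommand), that you run it from the directory holding your docker-compose.yml, and that the service is named novnc as described above. The function names are hypothetical, and the sketch covers only the novnc container, not launching Chrome itself.

```shell
# Assumptions: Docker Compose v2, run from the directory containing
# docker-compose.yml, with a service named "novnc" defined in it.

# Start the novnc window server in the background.
start_novnc() {
    docker compose up -d novnc
}

# Stop and remove the novnc container once you have finished logging in.
remove_novnc() {
    docker compose stop novnc &&
        docker compose rm -f novnc
}
```

Call start_novnc before opening http://localhost:8080/vnc.html, and remove_novnc after you have closed the browser.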
Non-Docker Setup (Local Host)
If running ArchiveBox on your local machine without Docker, this process is fairly easy.
First, tell archivebox where you want to store your Chrome profile.
Then run Chrome (with that profile dir) to open a visible browser window where you can log into things, e.g.:
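The two steps above can be sketched together as follows. It relies on the archivebox config --set CHROME_USER_DATA_DIR=... command shown elsewhere on this page and on Chrome's --user-data-dir flag. The function name and the default binary name are assumptions, and archivebox config is assumed to be run from inside your ArchiveBox data directory.

```shell
# Sketch: point ArchiveBox at a profile directory, then open a visible Chrome
# window on that same directory so you can log in to sites.
#   $1 = profile directory (any writable path you choose)
#   $2 = Chrome/Chromium binary (defaults to $CHROME_BINARY, then chromium-browser)
setup_chrome_profile() {
    profile_dir=$1
    chrome_bin=${2:-${CHROME_BINARY:-chromium-browser}}

    mkdir -p "$profile_dir" &&
        archivebox config --set CHROME_USER_DATA_DIR="$profile_dir" &&
        "$chrome_bin" --user-data-dir="$profile_dir"
}
```

For example, `setup_chrome_profile "$HOME/.config/archivebox-chrome"` (a hypothetical path). The function returns once you quit the browser window it opened.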
Once it's open, log in to all the sites you want to be logged in to for archiving, then close/quit Chrome.
✅ All ArchiveBox extractors that use Chrome (e.g. Screenshot, PDF, DOM, Singlefile) should now use that profile.
Don't forget to set up COOKIES_FILE for the rest!
Non-Docker Setup (Remote Host)
You must set up the profile using the exact same version of Chrome that ArchiveBox is running (which can be found with archivebox version). You can download the latest Chromium with pip install playwright && playwright install --with-deps chromium, or get older versions of Chrome from https://chromium.cypress.io.
General steps:
1. Make sure you are running the same OS and have the same version of Chrome installed as the host running ArchiveBox.
2. Follow the Non-Docker Setup (Local Host) steps above to create a Chrome profile locally.
3. Rsync your Chrome profile from your local machine to the remote archivebox host (the trailing slash on the source copies the directory's contents instead of nesting a profile folder inside the destination):
   rsync --archive /path/to/profile/ remotehost:/path/to/profile/on/remote/host/
4. Configure ArchiveBox on the remote host to use the rsync'ed Chrome profile:
   archivebox config --set CHROME_USER_DATA_DIR=/path/to/profile/on/remote/host

You may need to run chown -R archivebox /path/to/profile/on/remote/host on the remote host to make the profile editable by the archivebox user on that machine.
✅ All ArchiveBox extractors that use Chrome (e.g. Screenshot, PDF, DOM, Singlefile) should now use that profile.
Don't forget to set up COOKIES_FILE for the rest!
More Info & Troubleshooting
https://github.com/ArchiveBox/ArchiveBox/issues/952
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#archiving-private-content
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#%EF%B8%8F-things-to-watch-out-for-%EF%B8%8F
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_user_data_dir
https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_binary
https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#cookies_file