Chromium-Install
By default, ArchiveBox looks for any existing installed version of Chrome/Chromium and uses it if found. You can optionally install a specific version and set the environment variable CHROME_BINARY to force ArchiveBox to use that one, e.g.:
CHROME_BINARY=google-chrome-beta
CHROME_BINARY=/usr/bin/chromium-browser
CHROME_BINARY='/Applications/Chromium.app/Contents/MacOS/Chromium'
CHROME_BINARY='~/Library/Caches/ms-playwright/chromium-857950/chrome-mac/Chromium.app/Contents/MacOS/Chromium'
If you don't already have Chrome installed, I recommend installing Chromium instead of Google Chrome, as it's the open-source project that Chrome is built on and it doesn't send as much tracking data to Google.
Check for existing Chrome/Chromium install:
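A minimal sketch of such a check, assuming the browser is on your PATH under one of a few common names. The helper name find_chrome_binary and the candidate list are illustrative assumptions, not something ArchiveBox defines. macOS app bundles such as /Applications/Chromium.app are not on PATH and would need to be checked separately.

```shell
# Hypothetical helper: print the path of the first Chrome/Chromium binary on PATH.
# The candidate names are common defaults and an assumption; extend the list as needed.
find_chrome_binary() {
  for name in chromium-browser chromium google-chrome google-chrome-stable google-chrome-beta; do
    # `command -v` prints the resolved path and succeeds only if the command exists
    if path=$(command -v "$name"); then
      printf '%s\n' "$path"
      return 0
    fi
  done
  return 1
}

# Usage: report what was found, or say that nothing was.
# find_chrome_binary || echo "no Chrome/Chromium found on PATH"
```

If the function prints a path, that path is a reasonable candidate value for CHROME_BINARY.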
Playwright (by the Microsoft team) and Puppeteer (by the Google team) are two options to get stable, repeatable Chromium distributions on many OSs.
If you already have a Chrome app installed like /Applications/Chromium.app, you don't need to run this.
If you already have chromium-browser >= v111 installed (run chromium-browser --version), you don't need to run this.
If you already have /Applications/Google Chrome.app, you don't need to run this.
If you already have google-chrome >= v111 installed (run google-chrome --version), you don't need to run this.
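The ">= v111" check above can be scripted. The following is a rough sketch that parses the output of --version. The helper names are hypothetical, and it assumes the version string contains a dotted version number such as 120.0.6099.109.

```shell
# Hypothetical helper: extract the major version from a `--version` string,
# e.g. "Chromium 120.0.6099.109 built on Debian" -> 120
chrome_major_version() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+)\..*$/\1/'
}

# Succeeds if the version string reports a major version >= the minimum (default 111).
chrome_is_new_enough() {
  major=$(chrome_major_version "$1")
  min="${2:-111}"
  # Reject anything that is not a plain integer (e.g. an empty or unparsable string)
  case "$major" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$major" -ge "$min" ]
}

# Usage (assumes chromium-browser is on PATH):
# chrome_is_new_enough "$(chromium-browser --version)" && echo "new enough"
```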
You may choose to set up a Chrome/Chromium user profile in order to use your cookies/sessions to log into sites behind authentication/paywall during archiving.
[!WARNING] We strongly recommend using separate burner credentials dedicated to archiving. For example, don't provide cookies for your normal daily Facebook/Instagram/Google/etc. accounts: server responses and page content will often contain your name/email/PII, session cookies, private tokens, etc., which then get preserved in your snapshots for eternity.
Future viewers of your archive may be able to use any reflected archived session tokens to log in as you, or at the very least, associate the content with your real identity. Even if this tradeoff seems acceptable now or you plan to keep your archive data private, you may want to share a snapshot with others in the future, and snapshots are very hard to sanitize/anonymize after-the-fact!
For this reason, it's best to set up dedicated fake profile accounts for each site you want to archive, and consider them burned if you ever share any of your archived snapshots of those sites with untrusted people.
If using ArchiveBox in Docker, the easiest way to set up session credentials is by remote controlling the ArchiveBox Chrome browser over VNC, and using it to log in to the sites you want to save.
Enable the novnc server using these settings in your docker-compose.yml:
Start the novnc window server container
Start ArchiveBox's Chrome inside Docker (make sure you set DISPLAY & CHROME_USER_DATA_DIR and added the line to volumes: above first!)
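The two steps above might look roughly like the sketch below. Several details are assumptions rather than confirmed ArchiveBox defaults: the compose services are named novnc and archivebox, the browser is available as chromium-browser inside the container, and --no-sandbox is needed for Chrome to start there. The function name start_login_browser is hypothetical.

```shell
# Sketch: start the novnc container, then launch Chrome inside the archivebox container.
# Service names, the binary name, and --no-sandbox are assumptions; adjust to your setup.
start_login_browser() {
  # Defaults to the profile path used in this guide
  profile_dir="${1:-/data/personas/Default/chrome_profile}"
  docker compose up -d novnc
  # Runs in the foreground until the browser is closed
  docker compose run archivebox chromium-browser \
    --no-sandbox \
    --user-data-dir="$profile_dir"
}

# Usage:
# start_login_browser
```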
✅ Close the browser, stop & remove novnc, and then run archivebox normally. It will use the profile stored in CHROME_USER_DATA_DIR=/data/personas/Default/chrome_profile going forward, so you should now be able to archive sites as if you were logged in!
If running ArchiveBox on your local machine without Docker, this process is fairly easy.
First, tell archivebox where you want to store your Chrome profile.
Then run Chrome (with that profile dir) to open a visible browser window where you can log into things, e.g.:
Once it's open, log in to all the sites you want to be logged in to for archiving, then close/quit Chrome.
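The steps above can be sketched as a small helper. The function name setup_chrome_profile, the default binary name, and the example profile path are assumptions for illustration. The sketch also assumes it is run from your ArchiveBox data directory so that archivebox config applies to the right collection.

```shell
# Hypothetical helper: point ArchiveBox at a profile dir, then open Chrome with it.
#   $1 = profile directory to create/use
#   $2 = Chrome/Chromium binary (assumed default: chromium-browser)
setup_chrome_profile() {
  profile_dir="$1"
  chrome_binary="${2:-chromium-browser}"
  mkdir -p "$profile_dir"
  # Tell ArchiveBox where the profile lives
  archivebox config --set CHROME_USER_DATA_DIR="$profile_dir"
  # Open a visible browser window using that profile; blocks until Chrome is closed
  "$chrome_binary" --user-data-dir="$profile_dir"
}

# Usage (the path is only an example):
# setup_chrome_profile "$HOME/archivebox-chrome-profile" chromium-browser
```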
You must set up the profile using the exact same version of Chrome that ArchiveBox is running (which can be found with archivebox version). You can download the latest Chromium with pip install playwright && playwright install --with-deps chromium, or get older versions of Chrome from https://chromium.cypress.io.
General steps:
Make sure you are running the same OS and have the same version of Chrome installed as the host running ArchiveBox
Follow the Non-Docker Setup (Local Host) steps above to create a Chrome profile locally
Rsync your Chrome profile from your local machine to the remote ArchiveBox host
rsync --archive /path/to/profile remotehost:/path/to/profile/on/remote/host
Configure ArchiveBox on the remote host to use the rsync'ed Chrome profile
archivebox config --set CHROME_USER_DATA_DIR=/path/to/profile/on/remote/host
You may need to run chown -R archivebox /path/to/profile/on/remote/host on the remote host to make the profile editable by the archivebox user on that machine.
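Since the first step requires matching Chrome versions on both machines, a quick comparison before syncing can save a failed attempt. This is a rough sketch with hypothetical helper names; remotehost and the binary name in the usage comment are placeholders.

```shell
# Hypothetical helper: pull the dotted version number out of a `--version` string.
chrome_full_version() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+(\.[0-9]+)*).*$/\1/'
}

# Succeeds only if both `--version` strings report the same version number.
chrome_versions_match() {
  a=$(chrome_full_version "$1")
  b=$(chrome_full_version "$2")
  # Treat strings without a leading version number as a mismatch
  case "$a" in
    [0-9]*) ;;
    *) return 1 ;;
  esac
  [ "$a" = "$b" ]
}

# Usage (remotehost and chromium-browser are placeholders):
# chrome_versions_match "$(chromium-browser --version)" \
#   "$(ssh remotehost chromium-browser --version)" \
#   || echo "Chrome versions differ; install matching versions before syncing"
```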
https://github.com/ArchiveBox/ArchiveBox/issues/952
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#archiving-private-content
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#%EF%B8%8F-things-to-watch-out-for-%EF%B8%8F
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_user_data_dir
https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_binary
https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#cookies_file
If you encounter problems setting up Google Chrome or Chromium, see the Troubleshooting page.
Note: not all extractors use Chrome (e.g. wget, mercury, media), so COOKIES_FILE should be set up as well after this.
Open the novnc viewer in your browser. You should see a remote Linux desktop with Chrome open, allowing you to remote-control ArchiveBox's browser. Use it to log into any sites where you want to save credentials.
Under the hood, this provides a virtual display, window manager, and VNC server, plus the novnc websocket viewer.
✅ All ArchiveBox extractors that use Chrome (e.g. Screenshot, PDF, DOM, Singlefile) should now use that profile. Don't forget to set up COOKIES_FILE for the rest!