ArchiveBox can be configured using the archivebox config command, by modifying the ArchiveBox.conf file in the data folder, or via environment variables. All three methods work equivalently, including when running under Docker.
Here are some equivalent examples of setting a configuration option:
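For instance, all three of the following set the same TIMEOUT value (TIMEOUT is just an example; any option on this page works the same way, and the echo line simply appends the key to the config file — editing it directly works just as well):

```bash
# 1. via the archivebox config command (run inside your data folder)
archivebox config --set TIMEOUT=120

# 2. via the ArchiveBox.conf file in the data folder
echo 'TIMEOUT = 120' >> ArchiveBox.conf

# 3. via an environment variable, applied only for a single run
env TIMEOUT=120 archivebox add 'https://example.com'
```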
Environment variables take precedence over the config file, which is useful if you only want to use a certain option temporarily during a single run. For more examples see Usage: Configuration...
In case this document is ever out of date, check the source code for config definitions: archivebox/config/common.py ➡️
General Settings
General options around the archiving process, output format, and timing.
ONLY_NEW
Possible Values: [True]/False Whether to leave old incomplete links alone and only archive newly-added links, or to also recheck old links each time new ones are added.
By default, ArchiveBox will only archive new links on each import. If you want it to go back through all links in the index and download any missing files on every run, set this to False.
Note: Regardless of how this is set, ArchiveBox will never re-download sites that have already succeeded previously. When this is False, it only attempts to fix previously added pages that are missing archive extractor outputs; it does not re-archive pages that have already been successfully archived.
OVERWRITE
Possible Values: [False]/True When set to True, ArchiveBox will re-archive URLs even if they have already been successfully archived before, overwriting any existing output.
TIMEOUT
Possible Values: [60]/120/... Maximum allowed download time per archive method for each link in seconds. If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.
Note: Do not set this to anything less than 5 seconds as it will cause Chrome to hang indefinitely and many sites to fail completely.
MAX_URL_ATTEMPTS
Possible Values: [50]/100/... Maximum number of times ArchiveBox will attempt to archive a URL before giving up. Useful for handling transient failures.
RESOLUTION
Possible Values: [1440,2000]/1024,768/... Default screenshot/PDF resolution in pixels width,height. Used as the fallback for SCREENSHOT_RESOLUTION, PDF_RESOLUTION, and CHROME_RESOLUTION.
CHECK_SSL_VALIDITY
Possible Values: [True]/False Whether to enforce HTTPS certificate and HSTS chain of trust when archiving sites. Set this to False if you want to archive pages even if they have expired or invalid certificates. Be aware that when False you cannot guarantee that you have not been man-in-the-middle'd while archiving content, so the content cannot be verified to be what's on the original site.
USER_AGENT
Possible Values: [Mozilla/5.0 ... ArchiveBox/{VERSION} ...]/"Mozilla/5.0 ..."/... The default user agent string used during archiving. Individual extractors (wget, Chrome, curl, etc.) can override this with their own *_USER_AGENT settings, or fall back to this value.
COOKIES_FILE
Possible Values: [None]//path/to/cookies.txt/...
Cookies file to pass to wget, curl, yt-dlp, and other extractors that don't use Chrome (with its CHROME_USER_DATA_DIR) for authentication. To capture sites that require a logged-in user, point this option at a Netscape-format cookies.txt file containing all the cookies you want to use during archiving.
You can generate this cookies.txt file by using a number of different browser extensions that can export your cookies in this format, or by using wget on the command line with --save-cookies + --user=... --password=....
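For example, a sketch using wget to log in and save session cookies (the login URL and credentials are placeholders — substitute your own):

```bash
# log in once and save the resulting session cookies in Netscape format
wget --save-cookies=cookies.txt --keep-session-cookies \
     --user=my-burner-user --password=my-burner-pass \
     https://example.com/login -O /dev/null

# then tell ArchiveBox to use that file
archivebox config --set COOKIES_FILE=/path/to/cookies.txt
```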
Alternatively, you can create a persona and import cookies directly from your browser profile:
[!WARNING] Make sure you use separate burner credentials dedicated to archiving, e.g. don't re-use your normal daily Facebook/Instagram/Youtube/etc. account cookies as server responses often contain your name/email/PII, session tokens, etc. which then get preserved in your snapshots!
Possible Values: [Default]/personal/work/... The persona profile to use by default when archiving. Personas allow you to have separate sets of cookies, Chrome profiles, and user agent strings for different archiving contexts.
URL_DENYLIST
Possible Values: [\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$]/.+\.exe$/...
A regular expression used to exclude certain URLs from archiving.
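Before saving a pattern, you can sanity-check it against sample URLs, e.g. with grep -E (shown here with the default denylist pattern):

```bash
DENYLIST='\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$'

# static assets match the default pattern and would be skipped
echo 'https://example.com/static/style.css' | grep -qE "$DENYLIST" && echo 'skipped'

# normal pages don't match and would still be archived
echo 'https://example.com/article.html' | grep -qE "$DENYLIST" || echo 'archived'
```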
Possible Values: [None]/^http(s)?:\/\/(.+)?example\.com\/?.*$/...
A regular expression used to exclude from archiving all URLs that don't match the given pattern. Useful for recursive crawling within a single domain.
SAVE_ALLOWLIST
Possible Values: [{}]/{".*example\\.com.*": ["screenshot", "pdf"]}/... A JSON dictionary mapping URL regex patterns to lists of archive methods. Only the specified methods will be used for URLs matching each pattern.
SAVE_DENYLIST
Possible Values: [{}]/{".*\\.pdf$": ["screenshot", "dom"]}/... A JSON dictionary mapping URL regex patterns to lists of archive methods to skip.
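Because these values are JSON dictionaries, it's easiest to set them via an environment variable with single quotes around the whole value (the patterns and methods below are illustrative):

```bash
# only run screenshot + pdf for example.com URLs...
env SAVE_ALLOWLIST='{".*example\\.com.*": ["screenshot", "pdf"]}' archivebox update

# ...and skip screenshot + dom for direct links to PDF files
env SAVE_DENYLIST='{".*\\.pdf$": ["screenshot", "dom"]}' archivebox update
```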
TAG_SEPARATOR_PATTERN
Possible Values: [[,]]/[,;]/... Regex pattern used to split tag strings into individual tags.
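For example, with the pattern [,;] a tag string splits on either commas or semicolons; you can approximate the split on the command line with grep -oE (each non-separator run becomes one tag):

```bash
# prints news, tech, linux — one per line
echo 'news;tech,linux' | grep -oE '[^,;]+'
```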
Server Settings
Options for the web UI, authentication, and reverse proxy configuration.
ADMIN_USERNAME / ADMIN_PASSWORD
Possible Values: [None]/"admin"/...
Only used on first run / initial setup in Docker. ArchiveBox will create an admin user with the specified username and password when these options are found in the environment.
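For example, on a first run with Docker (assuming the official archivebox/archivebox image and its /data volume layout; the credentials shown are placeholders — use your own):

```bash
docker run -v "$PWD/data:/data" \
    -e ADMIN_USERNAME=admin \
    -e ADMIN_PASSWORD=SomeSecretPassword \
    archivebox/archivebox init --setup
```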
Possible Values: [True]/False Configure whether or not login is required to use each area of ArchiveBox.
SECRET_KEY
Possible Values: [auto-generated random string] Django's secret key for cryptographic signing (sessions, CSRF tokens, etc.). Automatically generated on first run.
BIND_ADDR
Possible Values: [127.0.0.1:8000]/0.0.0.0:8000/... Address and port for the ArchiveBox web server to listen on.
LISTEN_HOST
Possible Values: [archivebox.localhost:8000]/archive.example.com:443/... The public hostname and port that ArchiveBox is accessible at.
ALLOWED_HOSTS
Possible Values: [*]/archive.example.com,localhost/... Comma-separated list of allowed HTTP Host header values. Set this to your domain name(s) in production.
CSRF_TRUSTED_ORIGINS
Possible Values: [http://admin.archivebox.localhost:8000]/https://archive.example.com/... Comma-separated list of trusted origins for CSRF validation. Must include the scheme (http/https).
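A typical production setup behind an HTTPS reverse proxy might combine these options like so (archive.example.com is a placeholder domain):

```bash
archivebox config --set BIND_ADDR=127.0.0.1:8000
archivebox config --set ALLOWED_HOSTS=archive.example.com
archivebox config --set CSRF_TRUSTED_ORIGINS=https://archive.example.com
```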
ADMIN_BASE_URL
Possible Values: [""]//admin//... Base URL path for the Django admin interface.
ARCHIVE_BASE_URL
Possible Values: [""]//archive//... Base URL path for serving archived content.
SNAPSHOTS_PER_PAGE
Possible Values: [40]/100/... Maximum number of Snapshots to show per page on Snapshot list pages.
PREVIEW_ORIGINALS
Possible Values: [True]/False Whether to show inline previews of the original URL on snapshot detail pages.
FOOTER_INFO
Possible Values: [Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.]/... Text to display in the footer of the archive index.
CUSTOM_TEMPLATES_DIR
Possible Values: [data/custom_templates]//path/to/custom_templates/... Path to a directory containing custom html/css/images for overriding the default UI styling.
REVERSE_PROXY_USER_HEADER
Possible Values: [Remote-User]/X-Remote-User/... HTTP header containing user name from authenticated proxy.
Possible Values: [911]/1000/... User and Group ID that the data directory should be owned by. Note: only applicable for Docker users, and settable via environment variables only.
Possible Values: [windows]/unix/ascii/... Restrict output filenames to be compatible with the given filesystem type.
ENFORCE_ATOMIC_WRITES
Possible Values: [True]/False Whether to use atomic writes when saving files.
TMP_DIR
Possible Values: [data/tmp/<machine_id>]//tmp/archivebox/abc5d851/... Path for temporary files, unix sockets, and supervisor config. Must be a local, fast, short-path directory.
LIB_DIR
Possible Values: [data/lib/<arch>-<os>]//usr/local/share/archivebox/abc5/... Path for installed binary dependencies.
LIB_BIN_DIR
Possible Values: [LIB_DIR/bin] Path where installed binaries are symlinked for easy PATH management.
Search Settings
Options for full-text search backend configuration.
USE_INDEXING_BACKEND
Possible Values: [True]/False Enable the search indexing backend.
USE_SEARCHING_BACKEND
Possible Values: [True]/False Enable the search querying backend.
SEARCH_BACKEND_ENGINE
Possible Values: [ripgrep]/sqlite/sonic Which search backend engine to use. ripgrep (default) requires no setup. sqlite uses FTS5. sonic requires a running Sonic instance.
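For example, to switch to the SQLite FTS5 backend (no extra services required) and rebuild the search index for existing snapshots:

```bash
archivebox config --set SEARCH_BACKEND_ENGINE=sqlite
archivebox update --index-only
```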
SEARCH_PROCESS_HTML
Possible Values: [True]/False Whether to strip HTML tags before indexing content for search.
Shell Options
Options around the format of the CLI output.
DEBUG
Possible Values: [False]/True Enable debug mode. Automatically set to True if --debug is passed on the command line.
IS_TTY
Possible Values: [auto-detected] Whether stdout is a TTY (interactive terminal).
USE_COLOR
Possible Values: [True]/False Colorize console output. Defaults to True if stdin is a TTY.
SHOW_PROGRESS
Possible Values: [True]/False Show real-time progress bar in console output. Defaults to True if stdin is a TTY.
IN_DOCKER
Possible Values: [False]/True Whether ArchiveBox is running inside a Docker container.
IN_QEMU
Possible Values: [False]/True Whether ArchiveBox is running inside QEMU emulation.
Plugin Settings
ArchiveBox uses a plugin system where each extractor defines its own configuration via config.json files. All plugin config options can be set the same way as core options — via environment variables, ArchiveBox.conf, or archivebox config --set.
Infiniscroll Settings
Scrolls pages during Chrome-based archiving to trigger lazy-loading of page elements and click 'load more' buttons for comments.
INFINISCROLL_MIN_HEIGHT
Default: [16000] Minimum page height to scroll to in pixels
INFINISCROLL_SCROLL_DELAY
Default: [2000] Delay between scrolls in milliseconds
INFINISCROLL_SCROLL_DISTANCE
Default: [1600] Distance to scroll per step in pixels
INFINISCROLL_SCROLL_LIMIT
Default: [10] Maximum number of scroll steps
INFINISCROLL_TIMEOUT
Default: [120] (falls back to TIMEOUT) Maximum timeout for scrolling in seconds
DOM Outlinks Parser Settings
PARSE_DOM_OUTLINKS_ENABLED
Default: [True] Enable DOM outlinks parsing from archived pages
PARSE_DOM_OUTLINKS_TIMEOUT
Default: [30] (falls back to TIMEOUT) Timeout for DOM outlinks parsing in seconds
HTML URL Parser Settings
PARSE_HTML_URLS_ENABLED
Default: [True] Enable HTML URL parsing
JSONL URL Parser Settings
PARSE_JSONL_URLS_ENABLED
Default: [True] Enable JSON Lines URL parsing
Netscape URL Parser Settings
PARSE_NETSCAPE_URLS_ENABLED
Default: [True] Enable Netscape bookmarks HTML URL parsing
Text URL Parser Settings
PARSE_TXT_URLS_ENABLED
Default: [True] Enable plain text URL parsing
RSS URL Parser Settings
PARSE_RSS_URLS_ENABLED
Default: [True] Enable RSS/Atom feed URL parsing
Claude Code Settings
ANTHROPIC_API_KEY
Default: [""] Anthropic API key for Claude Code authentication
CLAUDECODE_BINARY
Default: [claude] Path to Claude Code CLI binary
CLAUDECODE_ENABLED
Default: [False] Enable Claude Code AI agent integration. Controls whether the claudecode plugin participates in crawl-time extraction; child plugins still need the claudecode plugin installed and a working Claude binary.
CLAUDECODE_MAX_TURNS
Default: [10] Maximum number of agentic turns per invocation
CLAUDECODE_MODEL
Default: [sonnet] Claude model to use (e.g. sonnet, opus, haiku)
CLAUDECODE_TIMEOUT
Default: [120] (falls back to TIMEOUT) Timeout for Claude Code operations in seconds
Claude Chrome Settings
CLAUDECHROME_ENABLED
Default: [False] Enable Claude for Chrome browser extension for AI-driven page interaction
CLAUDECHROME_MAX_ACTIONS
Default: [15] Maximum number of agentic loop iterations (screenshots + actions) per page
CLAUDECHROME_MODEL
Default: [sonnet] Claude model to use (e.g. sonnet, opus, haiku). Availability depends on your plan.
CLAUDECHROME_PROMPT
Default: [see defaults] Prompt for Claude to execute on the page. Claude can click buttons, fill forms, download files, and interact with any page element.
CLAUDECHROME_TIMEOUT
Default: [120] (falls back to TIMEOUT) Timeout for Claude for Chrome operations in seconds
Claude Code Extract Settings
CLAUDECODEEXTRACT_ENABLED
Default: [False] Enable Claude Code AI extraction
CLAUDECODEEXTRACT_MAX_TURNS
Default: [10] (falls back to CLAUDECODE_MAX_TURNS) Maximum number of agentic turns for extraction
CLAUDECODEEXTRACT_MODEL
Default: [sonnet] (falls back to CLAUDECODE_MODEL) Claude model to use for extraction (e.g. sonnet, opus, haiku)
CLAUDECODEEXTRACT_PROMPT
Default: [see defaults] Custom prompt for Claude Code extraction. Use this to define what Claude should extract or generate from the snapshot.
CLAUDECODEEXTRACT_TIMEOUT
Default: [120] (falls back to CLAUDECODE_TIMEOUT) Timeout for Claude Code extraction in seconds
Claude Code Cleanup Settings
CLAUDECODECLEANUP_ENABLED
Default: [False] Enable Claude Code AI cleanup of snapshot files
CLAUDECODECLEANUP_MAX_TURNS
Default: [15] (falls back to CLAUDECODE_MAX_TURNS) Maximum number of agentic turns for cleanup
CLAUDECODECLEANUP_MODEL
Default: [sonnet] (falls back to CLAUDECODE_MODEL) Claude model to use for cleanup (e.g. sonnet, opus, haiku)
CLAUDECODECLEANUP_PROMPT
Default: [see defaults] Custom prompt for Claude Code cleanup. Defines what Claude should clean up and how to determine which duplicates to keep.
CLAUDECODECLEANUP_TIMEOUT
Default: [120] (falls back to CLAUDECODE_TIMEOUT) Timeout for Claude Code cleanup in seconds