Scheduled Archiving
ArchiveBox contains a built-in scheduler that supports pulling in URLs regularly from the web or from the local filesystem.
ArchiveBox ignores links that are imported multiple times (keeping the earliest version that it's seen). This means you can add cron jobs that regularly poll the same file or URL for new links, adding only new ones as necessary, or you can pass --overwrite to save a fresh copy each time the scheduled task runs.
The list of defined scheduled tasks can be inspected and cleared with archivebox schedule --show and archivebox schedule --clear.
⚠️ Many popular sites such as Twitter, Reddit, and Facebook make efforts to block, ratelimit, or lazy-load content to avoid being scraped by bots like ArchiveBox. It may be better to use an alternative frontend with minimal JS when archiving those sites: https://github.com/mendel5/alternative-front-ends
The scheduled interval can be passed easily using --every={day,week,month,year} or by passing a cron-style schedule, e.g. --every='5 4 * * *' to run at 04:05 every day.
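The two forms of --every can be sketched as follows. The feed URL is a hypothetical placeholder, --depth=1 is assumed so that links found in the feed are archived too, and the commands are guarded so the snippet does nothing on a machine without the archivebox CLI.

```shell
#!/bin/sh
# Hypothetical feed URL, used only for illustration.
FEED_URL='https://example.com/feed.xml'

# Cron fields: minute hour day-of-month month day-of-week.
# '5 4 * * *' therefore means minute 5 of hour 4, i.e. 04:05 every day.
DAILY_AT_0405='5 4 * * *'

# Run from inside your ArchiveBox data directory. Use one form or the other.
if command -v archivebox >/dev/null 2>&1; then
    # Preset interval: one of day, week, month, year.
    archivebox schedule --every=week --depth=1 "$FEED_URL"
    # Cron-style interval.
    archivebox schedule --every="$DAILY_AT_0405" --depth=1 "$FEED_URL"
fi
```

The cron expression is quoted so the shell does not expand the asterisks as filename globs.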
The scheduler can also be run in --foreground mode to avoid relying on your host system's cron scheduler to be running.
In foreground mode, it will run all tasks previously added using archivebox schedule in a long-running foreground process. This is useful for running scheduled tasks inside docker-compose or supervisord.
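A minimal entrypoint sketch for foreground mode, of the kind a container or supervisord program might run. DATA_DIR and the run_scheduler function are hypothetical names introduced here, and the guard makes the script a no-op where ArchiveBox or the data directory is missing.

```shell
#!/bin/sh
# Hypothetical location of the ArchiveBox collection; override via the environment.
DATA_DIR="${DATA_DIR:-$HOME/archivebox/data}"

run_scheduler() {
    cd "$DATA_DIR" || return 1
    # exec replaces this shell, so the supervisor tracks and signals
    # the scheduler process directly instead of a wrapper shell.
    exec archivebox schedule --foreground
}

# Only start when the CLI is installed and the collection exists.
if command -v archivebox >/dev/null 2>&1 && [ -d "$DATA_DIR" ]; then
    run_scheduler
fi
```

Tasks still have to be registered beforehand with archivebox schedule; the foreground process only runs what has already been added.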
Docker Usage
docker-compose.yml:
For a full Docker Compose example config see here: https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml#:~:text=schedule
For more examples of plain Docker and Docker Compose usage with scheduling, see: https://github.com/ArchiveBox/ArchiveBox/issues/1155#issuecomment-1590146616
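As a rough sketch of the Docker Compose workflow, a task can be registered through the main container and then picked up by a separate container running the scheduler in foreground mode. The service names archivebox and archivebox_scheduler and the feed URL are assumptions; match them to your own docker-compose.yml. Restarting the scheduler after adding a task is also assumed to be necessary.

```shell
#!/bin/sh
# Assumed service names; adjust to the ones defined in your docker-compose.yml.
MAIN_SERVICE='archivebox'
SCHEDULER_SERVICE='archivebox_scheduler'
# Hypothetical feed URL, used only for illustration.
FEED_URL='https://example.com/feed.xml'

# Guarded: does nothing without docker or outside a Compose project directory.
if command -v docker >/dev/null 2>&1 && [ -f docker-compose.yml ]; then
    # Register the scheduled task inside the collection.
    docker compose run --rm "$MAIN_SERVICE" schedule --every=day --depth=1 "$FEED_URL"
    # Restart the scheduler container so it reloads the task list.
    docker compose restart "$SCHEDULER_SERVICE"
fi
```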
Example: Archive a Twitter user's Tweets and the content linked within them once a week
Nitter is an alternative frontend recommended for Twitter that formats the content better for archiving/bots and avoids ratelimits.
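A minimal sketch of such a weekly task. The Nitter instance and account in PROFILE_URL are hypothetical placeholders, and the call is guarded so it only runs where the archivebox CLI is installed.

```shell
#!/bin/sh
# Hypothetical Nitter instance and account; substitute ones you actually use.
PROFILE_URL='https://nitter.net/ArchiveBoxApp'

if command -v archivebox >/dev/null 2>&1; then
    # --every=week  run once a week
    # --overwrite   save a fresh copy of the profile page on each run
    # --depth=1     also archive the pages linked from the profile
    archivebox schedule --every=week --overwrite --depth=1 "$PROFILE_URL"
fi
```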
Example: Archive a Reddit subreddit and discussions for every post once a week
Teddit is an alternative frontend recommended for Reddit that formats the content better for archiving/bots and avoids ratelimits.
--overwrite is passed to save a fresh copy each week; otherwise the URL would be ignored, as it is already present in the collection after the first time it's added.
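The weekly subreddit task might look like the sketch below. The Teddit instance and subreddit in SUBREDDIT_URL are hypothetical, and the guard keeps the snippet harmless where ArchiveBox is not installed.

```shell
#!/bin/sh
# Hypothetical Teddit instance and subreddit; substitute your own.
SUBREDDIT_URL='https://teddit.net/r/DataHoarder/'

if command -v archivebox >/dev/null 2>&1; then
    # --overwrite re-archives the listing page every week instead of skipping it;
    # --depth=1 follows the listing to the individual post and discussion pages.
    archivebox schedule --every=week --overwrite --depth=1 "$SUBREDDIT_URL"
fi
```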
Example: Archive the HackerNews front page and some linked articles every 24 hours
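A sketch of a daily task for the front page, assuming --depth=1 is what pulls in the linked articles. Without --overwrite, the front page snapshot itself is kept from the first run, and later runs only add links that are not yet in the collection.

```shell
#!/bin/sh
FRONT_PAGE_URL='https://news.ycombinator.com'

# Guarded so the snippet is a no-op where ArchiveBox is not installed.
if command -v archivebox >/dev/null 2>&1; then
    # --every=day runs the import once every 24 hours.
    archivebox schedule --every=day --depth=1 "$FRONT_PAGE_URL"
fi
```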
Example: Archive all URLs in an RSS feed from Pocket every 12 hours
This example imports your Pocket bookmark feed and archives any new links every 12 hours:
First, set your Pocket RSS feed to "public" under https://getpocket.com/privacy_controls.
Then tell ArchiveBox to pull it regularly:
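A minimal sketch, assuming the public feed lives at a URL of the form shown in FEED_URL. POCKET_USER is a placeholder, and the exact feed path should be checked against your own Pocket account.

```shell
#!/bin/sh
# Placeholder username; the feed URL pattern is an assumption.
POCKET_USER='yourusernamehere'
FEED_URL="https://getpocket.com/users/$POCKET_USER/feed/all"

# '0 */12 * * *' fires at minute 0 of every 12th hour (00:00 and 12:00).
EVERY_12_HOURS='0 */12 * * *'

if command -v archivebox >/dev/null 2>&1; then
    # --depth=1 archives the bookmarked pages listed in the feed.
    archivebox schedule --every="$EVERY_12_HOURS" --depth=1 "$FEED_URL"
fi
```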
Example: Archive only a GitHub repository's source code, once a month
--extract=git tells it to use only the Git source extractor and skip the other extractor methods (HTML, screenshot, etc.).
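A sketch of the monthly task using the --extract=git option described above. Whether archivebox schedule accepts --extract depends on your ArchiveBox version, so treat that flag placement as an assumption and check archivebox schedule --help. --overwrite is added on the assumption that a fresh clone is wanted each month.

```shell
#!/bin/sh
REPO_URL='https://github.com/ArchiveBox/ArchiveBox'

# Guarded so the snippet is a no-op where ArchiveBox is not installed.
if command -v archivebox >/dev/null 2>&1; then
    # --extract=git  only run the Git source extractor
    # --overwrite    re-archive the repository each month instead of skipping it
    archivebox schedule --every=month --extract=git --overwrite "$REPO_URL"
fi
```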
Example: Archive a list of URLs pulled from the filesystem every 30 minutes
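A sketch of polling a local file every 30 minutes. URL_LIST is a hypothetical path, and --depth=1 is assumed here so that the URLs listed inside the file are imported rather than only the file itself.

```shell
#!/bin/sh
# Hypothetical path to a text file containing one URL per line.
URL_LIST="${URL_LIST:-$HOME/Downloads/urls_to_archive.txt}"

# '*/30 * * * *' fires every 30 minutes (at minute 0 and 30 of each hour).
EVERY_30_MIN='*/30 * * * *'

# Only schedule when the CLI is installed and the list file exists.
if command -v archivebox >/dev/null 2>&1 && [ -f "$URL_LIST" ]; then
    archivebox schedule --every="$EVERY_30_MIN" --depth=1 "$URL_LIST"
fi
```

Because already-imported links are ignored, appending new URLs to the file is enough for them to be picked up on the next run.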
Advanced Scheduling Using Cron
To schedule regular archiving you can also use any other task scheduler, such as cron, at, or systemd, instead of the built-in archivebox schedule.
For some example configs, see the etc/cron.d and etc/supervisord folders.
Example: Export and archive Firefox browser history every 24 hours
This example exports your browser history and archives it once a day, saving a summary to disk:
First, download the ArchiveBox helper script for exporting browser history from https://github.com/ArchiveBox/ArchiveBox/blob/dev/bin/export_browser_history.sh and save it to ./bin/export_browser_history.sh.
Then create /home/ArchiveBox/archivebox/bin/scheduled_firefox_import.sh:
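One possible shape for this script is sketched below. Every path is an assumption, in particular HISTORY_FILE: check where your copy of the helper writes its export, and whether it takes --firefox as its only argument. The import_firefox_history function name is introduced here for illustration.

```shell
#!/bin/sh
# Sketch of scheduled_firefox_import.sh. All paths are assumptions; override
# them via the environment or edit them to match your install.
ARCHIVEBOX_DIR="${ARCHIVEBOX_DIR:-/home/ArchiveBox/archivebox}"
HISTORY_FILE="${HISTORY_FILE:-$ARCHIVEBOX_DIR/output/sources/firefox_history.json}"
LOG_FILE="${LOG_FILE:-/var/log/ArchiveBox.log}"

# The function body is a subshell, so the cd does not leak to the caller.
import_firefox_history() (
    cd "$ARCHIVEBOX_DIR" || exit 1
    # Export the Firefox history with the helper script downloaded earlier.
    ./bin/export_browser_history.sh --firefox || exit 1
    # Feed the export to ArchiveBox on stdin and append its summary to the log.
    archivebox add < "$HISTORY_FILE" >> "$LOG_FILE" 2>&1
)

import_firefox_history || echo "Firefox history import failed" >&2
```

The script has to be executable (chmod +x) before cron can run it.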
Then tell cron to run your script every 24 hours:
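A sketch of the corresponding crontab entry, assuming the script lives at the path given above. The snippet only composes and prints the line; paste it into the file opened by crontab -e.

```shell
#!/bin/sh
SCRIPT='/home/ArchiveBox/archivebox/bin/scheduled_firefox_import.sh'

# Cron fields: minute hour day-of-month month day-of-week.
# '0 0 * * *' fires at midnight every day, i.e. once every 24 hours.
CRON_LINE="0 0 * * * $SCRIPT"

# Print the entry for copying into your crontab.
printf '%s\n' "$CRON_LINE"
```

Entries in a system-wide file such as /etc/cron.d take an extra user field between the schedule and the command.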
Example: Import an RSS feed from Pocket every 12 hours
If you need to customize the import process or archive a password-locked RSS feed, you can do it manually with a bash script and cron. Create /home/ArchiveBox/archivebox/bin/scheduled_imports.sh:
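A sketch of what such a script could contain, assuming the protected feed can be fetched with session cookies saved in a cookies.txt file. FEED_URL, COOKIE_FILE, LOG_FILE, and the import_feed function are all hypothetical names and values to adapt.

```shell
#!/bin/sh
# Sketch of scheduled_imports.sh. All values are assumptions; override them
# via the environment or edit them to match your setup.
ARCHIVEBOX_DIR="${ARCHIVEBOX_DIR:-/home/ArchiveBox/archivebox}"
FEED_URL="${FEED_URL:-https://getpocket.com/users/USERNAME/feed/all}"
COOKIE_FILE="${COOKIE_FILE:-$ARCHIVEBOX_DIR/cookies.txt}"
LOG_FILE="${LOG_FILE:-/var/log/ArchiveBox.log}"

# The function body is a subshell, so the cd does not leak to the caller.
import_feed() (
    cd "$ARCHIVEBOX_DIR" || exit 1
    # Fetch the feed with the saved cookies and pipe it to ArchiveBox on stdin.
    # --fail makes curl emit no body on HTTP errors, so an error page is
    # never imported as if it were the feed.
    curl --fail --silent --show-error --location --cookie "$COOKIE_FILE" "$FEED_URL" \
        | archivebox add >> "$LOG_FILE" 2>&1
)

import_feed || echo "feed import failed" >&2
```

Any preprocessing of the feed (filtering, rewriting URLs) can be inserted between curl and archivebox add in the pipeline.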
Then create a cronjob telling your system to run the script on your chosen regular interval (e.g. every 12 hours):
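A sketch of a 12-hourly crontab entry for the script path given above. The snippet only composes and prints the line; paste it into the file opened by crontab -e.

```shell
#!/bin/sh
SCRIPT='/home/ArchiveBox/archivebox/bin/scheduled_imports.sh'

# Cron fields: minute hour day-of-month month day-of-week.
# '0 */12 * * *' fires at minute 0 of every 12th hour (00:00 and 12:00).
CRON_LINE="0 */12 * * * $SCRIPT"

# Print the entry for copying into your crontab.
printf '%s\n' "$CRON_LINE"
```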