Roadmap
Last updated
Last updated
▶️ Comment here to discuss the contribution roadmap: Official Roadmap Discussion.
(this is not set in stone, just a rough estimate)
v0.7: Schema improvements
move config loading logic into settings.py
move all the extractors into "plugin" style folders that register their own config
right now, the paths of the extractor output are scattered all over the codebase, e.g. output.pdf
(should be moved to constants at the top of the plugin config file)
make out_dir, link_dir, extractor_dir, naming consistent across codebase
remove timestamps
as primary keys in favor of hashes, UUIDs, or some other slug https://github.com/ArchiveBox/ArchiveBox/issues/74
create a migration system for folder layout independent of the index (mv
is atomic at the FS level, so we just need a transaction.atomic(): move(oldpath, newpath); snap.data_dir = newpath; snap.save()
)
make Tag
a real model ManyToMany
with Snapshots
allow multiple Snapshots of the same site over time + CLI / UI to manage those, + migration from old style #2020-01-01
hack to proper versioned snapshots
upgrade from Django 3 to Django 5 https://github.com/ArchiveBox/ArchiveBox/issues/988
v0.8: Security
Add CSRF/CSP/XSS protection to rendered archive pages
Provide secure reverse proxy in front of archivebox server in docker-compose.yml
Create UX flow for users to setup session cookies / auth for archiving private sites
cookies for wget, curl, etc low-level commands
localstorage, cookies, indexedb setup for chrome archiving methods
v0.9: Performance
setup huey, break up archiving process into tasks on a queue that a worker pool executes
setup pyppeteer2 to wrap chrome so that it's not open/closed during each extractor
v1.0: Full headless browser control
run user-scripts / extensions in the context of the page during archiving
community userscripts for unrolling twitter threads, reddit threads, youtube comment sections, etc.
pywb-based headless browser session recording and warc replay
archive proxy support
support sending upstream requests through an external proxy
support for exposing a proxy that archives all downstream traffic
...
v2.0 Federated or distributed archiving + paid hosted service offering
ZFS / merkel tree for storing archive output subresource hashes
DHT for assigning merkel tree hash:file shards to nodes
tag system for tagging certain hashes with human-readable names, e.g. title, url, tags, filetype etc.
distributed tag lookup system
✅ release pip
, apt
, pkg
, and brew
packaged distributions for installing ArchiveBox
✅ add an optional web GUI for managing sources, adding new links, and viewing the archive
✅ switch to django + sqlite db with migrations system & json/html export for managing archive schema changes and persistence
modularize internals to allow importing individual components
switch to sha256 of URL as unique link ID
support storing multiple snapshots of pages over time
support custom user puppeteer scripts to run while archiving (e.g. for expanding reddit threads, scrolling thread on twitter, etc)
support named collections of archived content with different user access permissions
support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system
support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
✅ body text extraction to markdown (using fathom readability and mercury)
featured image / thumbnail extraction
auto-tagging links based on important/frequent keywords in extracted text (like pocket)
automatic article summary paragraphs from extracted text with nlp summarization library
✅ full-text search of extracted text with elasticsearch/elasticlunr/ag sonic and ripgrep
✅ download closed-caption subtitles from Youtube and other video sites (TODO: submit the subtitle files to the full-text search index)
try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
And more in the issues list...
IMPORTANT: Please don't work on any of these major long-term tasks without contacting me first, work is already in progress for many of these, and I may have to reject your PR if it doesn't align with the existing work!
To see how this spec has been scheduled / implemented / released so far, read these pull requests:
✅ v0.1.x pre-git-history (~2017)
✅ v0.2.x (~2018/12)
✅ v0.3.x (~2019/03)
✅ v0.4.x (~2019/04)
✅ v0.5.x (~2020/11)
✅ v0.6.x (~2021/03)
🏖️ sabbatical / coding hiatus during 2022
✅ v0.7.x (~2023/11)
🛠 v0.8.x (~2024/05)
📅 v0.9.x up next...
https://github.com/ArchiveBox/ArchiveBox/issues/1358
https://github.com/ArchiveBox/ArchiveBox/issues/1273
https://github.com/ArchiveBox/ArchiveBox/issues/988
https://github.com/ArchiveBox/ArchiveBox/issues/930
gallery-dl
: https://github.com/ArchiveBox/ArchiveBox/issues/564
forum-dl
: https://github.com/ArchiveBox/ArchiveBox/issues/1368
scihub-dl
: https://github.com/ArchiveBox/ArchiveBox/issues/720
cad-dl
: https://github.com/ArchiveBox/ArchiveBox/issues/668
aria2
: https://github.com/ArchiveBox/ArchiveBox/issues/1355
podcast-archiver
: https://github.com/ArchiveBox/ArchiveBox/issues/1357
bdfr
: https://github.com/ArchiveBox/ArchiveBox/issues/778
cutycapt
screenshots: https://github.com/ArchiveBox/ArchiveBox/issues/253
sourcemap downloader: https://github.com/ArchiveBox/ArchiveBox/issues/1291
ArchiveBox Developer Documentation: Contributing a New Extractor
And others we're considering for the future:
https://github.com/instaloader/instaloader (instagram downloader)
https://github.com/althonos/InstaLooter (stale)
Telegram
https://github.com/iyear/tdl (telegram downloader)
TikTok
https://github.com/charmparticle/tiktokget (tiktok downloader using yt-dlp)
https://github.com/TerminalWarlord/TikTok-Downloader-Bot
https://github.com/n0l3r/tiktok-downloader
https://github.com/hansputera/tiktok-dl
https://github.com/naseif/tiktok-scraper
https://github.com/irevenko/tiktik
https://github.com/samirelanduk/tiktok-save
https://github.com/Dinoosauro/tiktok-to-ytdlp
https://github.com/krypton-byte/tiktok-downloader
https://github.com/HoloArchivists/twspace-dl (stale, twitter spaces archiver)
https://github.com/soimort/you-get ⭐️
https://github.com/lay295/TwitchDownloader
https://github.com/ihabunek/twitch-dl
https://github.com/iawia002/lux (generic video/audio downloader)
https://github.com/wukko/cobalt (generic video/audio downloader)
https://github.com/jaysonlong/webvideo-downloader (Bilibili, iQIYI, Tencent Video, MGTV and WeTV)
https://github.com/spaam/svtplay-dl (comedy central, twitch, HBO, etc. video downloader)
https://github.com/aajanki/yle-dl (Yle Areena Finnish broadcasting video downloader)
https://github.com/WHTJEON/widevine-dl (encrypted widevine video downloader)
https://github.com/nathom/streamrip (Qobuz, Tidal, Deezer and SoundCloud)
https://github.com/0xHJK/music-dl
https://github.com/guanguans/music-dl
https://github.com/CharlesPikachu/musicdl
https://github.com/iheanyi/bandcamp-dl
https://github.com/spotDL/spotify-downloader
https://github.com/Shabinder/SpotiFlyer
https://github.com/SathyaBhat/spotify-dl / https://github.com/SwapnilSoni1999/spotify-dl / https://github.com/dhruv-ahuja/spoti-dl
https://github.com/vitiko98/qobuz-dl (Qobuz music downloader)
https://github.com/akhilrex/podgrab (stale)
https://github.com/yaronzz/Tidal-Media-Downloader-PRO (stale)
https://github.com/flyingrub/scdl (stale)
https://github.com/ravishi/rdio-dl (stale, Rdio song downloader)
https://github.com/carlosflorencio/laracasts-downloader (stale?)
https://github.com/mikf/gallery-dl ⭐️
https://github.com/Bionus/imgbrd-grabber (generic image board downloader like gallery-dl)
https://github.com/Xonshiz/comic-dl (comic, anime, manga, etc. downloader)
https://github.com/justfoolingaround/animdl (anime downloader)
https://github.com/metafates/mangal (manga downloader)
https://github.com/boredazfcuk/docker-icloudpd (iCloud Photos downloader)
https://github.com/Oshan96/monkey-dl (stale? anime downloader)
https://github.com/QianyanTech/Image-Downloader (stale?)
https://github.com/Xonshiz/anime-dl (stale?)
https://github.com/mikwielgus/forum-dl ⭐️
https://github.com/AndyTheFactory/newspaper4k ⭐️
https://github.com/AAndyProgram/SCrawler (Twitter, Reddit, Instagram, Threads, Facebook, Pinterest, nsfw sites downloader)
https://github.com/extractus/article-extractor
https://github.com/shadowmoose/RedditDownloader (stale?)
https://github.com/aliparlakci/bulk-downloader-for-reddit (stale?)
https://github.com/coursera-dl/coursera-dl
https://github.com/rand-net/khan-dl
https://github.com/C0D3D3V/Moodle-DL
https://github.com/r0oth3x49/acloud-dl
https://github.com/Puyodead1/udemy-downloader
https://github.com/PyJun/Mooc_Downloader (stale)
https://github.com/yann0917/dedao-dl (stale, MOOC course downloader)
https://github.com/coursera-dl/edx-dl (stale?)
https://github.com/SigureMo/mooc-dl (stale?)
https://github.com/calvinhobbes23/Skillshare-DL (stale)
https://github.com/r0oth3x49/lynda-dl (stale, Lynda.com course downloader)
https://github.com/hartator/wayback-machine-downloader
https://github.com/MiniGlome/Archive.org-Downloader
https://github.com/ArchiveTeam/grab-site
https://github.com/oduwsdl/archivenow
https://github.com/wabarc/warcraft
https://github.com/sul-dlss/wasapi-downloader
https://github.com/KellyStathis/warc_downloader
https://github.com/internetarchive/heritrix3
https://github.com/AhmadIbrahiim/Website-downloader (wget wrapper)
https://github.com/igrigorik/gharchive.org (stale? Github downloader)
https://github.com/KurtBestor/Hitomi-Downloader
https://github.com/nilaoda/BBDown
https://github.com/biliup/biliup
https://github.com/yutto-dev/bilili
https://github.com/nICEnnnnnnnLee/BilibiliDown
https://github.com/matlink/gplaycli (Google Play store Android app downloader)
https://github.com/AlphaSlayer1964/kemono-dl (Patreon, gumroad, etc. archiver)
https://github.com/manga-download/hakuneko
https://github.com/cancerian0684/dli-downloader (Digital Library of India ebook downloader)
https://github.com/tusharbabbar/gaana-dl (gaana.com bollywood song downloader)
https://github.com/rebane2001/matterport-dl (stale? virtual house tour downloader)