Roadmap

▶️ Comment here to discuss the contribution roadmap: Official Roadmap Discussion.

Planned Specification

(this is not set in stone, just a rough estimate)

`v0.7: Schema improvements`

move config loading logic into settings.py
move all the extractors into "plugin" style folders that register their own config
extractor output paths (e.g. output.pdf) are currently scattered all over the codebase; move them to constants at the top of each plugin's config file
make out_dir, link_dir, and extractor_dir naming consistent across the codebase
remove timestamps as primary keys in favor of hashes, UUIDs, or some other slug https://github.com/ArchiveBox/ArchiveBox/issues/74
create a migration system for folder layout independent of the index (mv is atomic within a single filesystem, so we just need: with transaction.atomic(): move(oldpath, newpath); snap.data_dir = newpath; snap.save())
make Tag a real model ManyToMany with Snapshots
allow multiple Snapshots of the same site over time + CLI / UI to manage those, + migration from old style #2020-01-01 hack to proper versioned snapshots
upgrade from Django 3 to Django 5 https://github.com/ArchiveBox/ArchiveBox/issues/988
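
The folder-layout migration item above pairs a filesystem move with a database transaction. A minimal sketch of that idea follows, using the standard library's sqlite3 as a stand-in for the Django ORM so it runs on its own. The `snapshot` table, its `data_dir` column, and the `move_snapshot_dir` helper are hypothetical names for illustration, not ArchiveBox's actual schema or API.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def move_snapshot_dir(db: sqlite3.Connection, snap_id: int, new_path: Path) -> None:
    """Move a snapshot folder and update its index row, undoing the move if the update fails."""
    row = db.execute("SELECT data_dir FROM snapshot WHERE id = ?", (snap_id,)).fetchone()
    if row is None:
        raise KeyError(snap_id)
    old_path = Path(row[0])
    new_path.parent.mkdir(parents=True, exist_ok=True)

    # The rename is atomic only when both paths live on the same filesystem.
    os.rename(old_path, new_path)
    try:
        with db:  # commits on success, rolls back if the UPDATE raises
            db.execute(
                "UPDATE snapshot SET data_dir = ? WHERE id = ?",
                (str(new_path), snap_id),
            )
    except Exception:
        os.rename(new_path, old_path)  # put the folder back so disk and index agree
        raise


# Demo with a throwaway directory and an in-memory index (hypothetical layout).
root = Path(tempfile.mkdtemp())
old_dir = root / "archive" / "1577836800"
old_dir.mkdir(parents=True)
(old_dir / "index.json").write_text("{}")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY, data_dir TEXT)")
db.execute("INSERT INTO snapshot (id, data_dir) VALUES (1, ?)", (str(old_dir),))
db.commit()

new_dir = root / "snapshots" / "abc123"
move_snapshot_dir(db, 1, new_dir)
```

In a Django codebase the `with db:` block would presumably become `with transaction.atomic():` around `snap.save()`, as the roadmap item suggests.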

`v0.8: Security`

Add CSRF/CSP/XSS protection to rendered archive pages
Provide secure reverse proxy in front of archivebox server in docker-compose.yml
Create UX flow for users to set up session cookies / auth for archiving private sites
- cookies for wget, curl, and other low-level commands
- localStorage, cookies, and IndexedDB setup for chrome archiving methods

`v0.9: Performance`

set up huey and break up the archiving process into tasks on a queue that a worker pool executes
set up pyppeteer2 to wrap chrome so that it's not opened and closed for each extractor
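
The queue-and-worker-pool idea above can be sketched with the standard library's `concurrent.futures` standing in for huey, so the example runs without extra dependencies. The extractor names and the `run_extractor` and `archive_urls` functions are placeholders, not ArchiveBox's real task definitions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical extractor names, for illustration only.
EXTRACTORS = ["title", "wget", "screenshot", "pdf"]


def run_extractor(url: str, extractor: str) -> tuple[str, str, str]:
    """Placeholder task: a real extractor would shell out to wget, chrome, etc."""
    return (url, extractor, "succeeded")


def archive_urls(urls: list[str], workers: int = 4) -> list[tuple[str, str, str]]:
    """Queue one task per (url, extractor) pair and let a worker pool drain the queue."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_extractor, url, name)
            for url in urls
            for name in EXTRACTORS
        ]
        # Collect results in completion order, not submission order.
        for future in as_completed(futures):
            results.append(future.result())
    return results


results = archive_urls(["https://example.com", "https://example.org"])
```

Splitting work per (url, extractor) pair rather than per URL means one slow extractor does not block the others for the same page. A persistent queue such as huey would additionally survive restarts, which this in-memory sketch does not.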

`v1.0: Full headless browser control`

run user-scripts / extensions in the context of the page during archiving
community userscripts for unrolling twitter threads, reddit threads, youtube comment sections, etc.
pywb-based headless browser session recording and warc replay
archive proxy support
- support sending upstream requests through an external proxy
- support for exposing a proxy that archives all downstream traffic

...

`v2.0: Federated or distributed archiving + paid hosted service offering`

ZFS / Merkle tree for storing archive output subresource hashes
DHT for assigning Merkle tree hash:file shards to nodes
tag system for tagging certain hashes with human-readable names, e.g. title, url, tags, filetype etc.
distributed tag lookup system
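
As a rough illustration of the Merkle tree item above, the sketch below computes a single root hash over the output files of one snapshot. The file names, the path-plus-contents leaf encoding, and the choice to duplicate the last node on odd levels are all assumptions made for the example; the roadmap does not specify any of them.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Return the Merkle root of pre-hashed leaves, duplicating the last node on odd levels."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        # Hash each adjacent pair to build the next level up.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical archive outputs for one snapshot: {relative path: file contents}.
outputs = {
    "index.html": b"<html></html>",
    "output.pdf": b"%PDF-1.4",
    "screenshot.png": b"\x89PNG",
}
# Sort by path so the root does not depend on dict ordering.
leaves = [sha256(path.encode() + b"\0" + body) for path, body in sorted(outputs.items())]
root = merkle_root(leaves)
```

Because any change to a file changes its leaf hash and therefore the root, two nodes can compare a single 32-byte value to check whether they hold the same snapshot contents.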

Major long-term changes

✅ release pip, apt, pkg, and brew packaged distributions for installing ArchiveBox
✅ add an optional web GUI for managing sources, adding new links, and viewing the archive
✅ switch to django + sqlite db with migrations system & json/html export for managing archive schema changes and persistence
modularize internals to allow importing individual components
switch to sha256 of URL as unique link ID
support storing multiple snapshots of pages over time
support custom user puppeteer scripts to run while archiving (e.g. for expanding reddit threads, scrolling thread on twitter, etc)
support named collections of archived content with different user access permissions
support sharing archived assets via DHT + torrent / ipfs / ZeroNet / other sharing system
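
For the "sha256 of URL as unique link ID" item above, one possible shape is sketched below. The `link_id` name and the normalization step (lowercasing scheme and host, dropping the fragment) are assumptions; the roadmap only says the ID would be a SHA-256 of the URL.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def link_id(url: str) -> str:
    """Derive a stable hex ID from a URL: normalize lightly, then SHA-256."""
    parts = urlsplit(url.strip())
    # Simplification: lowercasing netloc assumes the URL carries no credentials.
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


example_id = link_id("https://example.com/page?x=1#section")
```

Normalizing before hashing keeps trivially different spellings of the same URL from producing different IDs. How aggressive that normalization should be is a design decision the roadmap leaves open.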

Smaller planned features

support pushing pages to multiple 3rd party services using ArchiveNow instead of just archive.org
✅ body text extraction to markdown (using ~~fathom~~ readability and mercury)
featured image / thumbnail extraction
auto-tagging links based on important/frequent keywords in extracted text (like pocket)
automatic article summary paragraphs from extracted text with nlp summarization library
✅ full-text search of extracted text with ~~elasticsearch/elasticlunr/ag~~ sonic and ripgrep
✅ download closed-caption subtitles from Youtube and other video sites (TODO: submit the subtitle files to the full-text search index)
try pulling dead sites from archive.org and other sources if original is down (https://github.com/hartator/wayback-machine-downloader)
And more in the issues list...

IMPORTANT: Please don't work on any of these major long-term tasks without contacting me first. Work is already in progress on many of them, and I may have to reject your PR if it doesn't align with the existing work!

Past Releases

To see how this spec has been scheduled / implemented / released so far, read these pull requests:

✅ v0.1.x pre-git-history (~2017)
✅ v0.2.x (~2018/12)
✅ v0.3.x (~2019/03)
✅ v0.4.x (~2019/04)
✅ v0.5.x (~2020/11)
✅ v0.6.x (~2021/03)
🏖️ sabbatical / coding hiatus during 2022
✅ v0.7.x (~2023/11)
🛠 v0.8.x (~2024/05)
📅 v0.9.x up next...

UI / UX Improvements Planned

https://github.com/ArchiveBox/ArchiveBox/issues/1358
https://github.com/ArchiveBox/ArchiveBox/issues/1273
https://github.com/ArchiveBox/ArchiveBox/issues/988
https://github.com/ArchiveBox/ArchiveBox/issues/930

New Extractors Planned

gallery-dl: https://github.com/ArchiveBox/ArchiveBox/issues/564
forum-dl: https://github.com/ArchiveBox/ArchiveBox/issues/1368
scihub-dl: https://github.com/ArchiveBox/ArchiveBox/issues/720
cad-dl: https://github.com/ArchiveBox/ArchiveBox/issues/668
aria2: https://github.com/ArchiveBox/ArchiveBox/issues/1355
podcast-archiver: https://github.com/ArchiveBox/ArchiveBox/issues/1357
bdfr: https://github.com/ArchiveBox/ArchiveBox/issues/778
cutycapt screenshots: https://github.com/ArchiveBox/ArchiveBox/issues/253
sourcemap downloader: https://github.com/ArchiveBox/ArchiveBox/issues/1291

ArchiveBox Developer Documentation: Contributing a New Extractor

And others we're considering for the future:

Social Media

Instagram
- https://github.com/instaloader/instaloader (instagram downloader)
- https://github.com/althonos/InstaLooter (stale)
Telegram
- https://github.com/iyear/tdl (telegram downloader)
TikTok
- https://github.com/charmparticle/tiktokget (tiktok downloader using yt-dlp)
- https://github.com/TerminalWarlord/TikTok-Downloader-Bot
- https://github.com/n0l3r/tiktok-downloader
- https://github.com/hansputera/tiktok-dl
- https://github.com/naseif/tiktok-scraper
- https://github.com/irevenko/tiktik
- https://github.com/samirelanduk/tiktok-save
- https://github.com/Dinoosauro/tiktok-to-ytdlp
- https://github.com/krypton-byte/tiktok-downloader
Twitter
- https://github.com/HoloArchivists/twspace-dl (stale, twitter spaces archiver)

Video/Streams

https://github.com/soimort/you-get ⭐️
https://github.com/lay295/TwitchDownloader
https://github.com/ihabunek/twitch-dl
https://github.com/iawia002/lux (generic video/audio downloader)
https://github.com/wukko/cobalt (generic video/audio downloader)
https://github.com/jaysonlong/webvideo-downloader (Bilibili, iQIYI, Tencent Video, MGTV and WeTV)
https://github.com/spaam/svtplay-dl (comedy central, twitch, HBO, etc. video downloader)
https://github.com/aajanki/yle-dl (Yle Areena Finnish broadcasting video downloader)
https://github.com/WHTJEON/widevine-dl (encrypted widevine video downloader)

Audio/Music

https://github.com/nathom/streamrip (Qobuz, Tidal, Deezer and SoundCloud)
https://github.com/0xHJK/music-dl
https://github.com/guanguans/music-dl
https://github.com/CharlesPikachu/musicdl
https://github.com/iheanyi/bandcamp-dl
https://github.com/spotDL/spotify-downloader
https://github.com/Shabinder/SpotiFlyer
https://github.com/SathyaBhat/spotify-dl / https://github.com/SwapnilSoni1999/spotify-dl / https://github.com/dhruv-ahuja/spoti-dl
https://github.com/vitiko98/qobuz-dl (Qobuz music downloader)
https://github.com/akhilrex/podgrab (stale)
https://github.com/yaronzz/Tidal-Media-Downloader-PRO (stale)
https://github.com/flyingrub/scdl (stale)
https://github.com/ravishi/rdio-dl (stale, Rdio song downloader)
https://github.com/carlosflorencio/laracasts-downloader (stale?)

Photos/Images/Comics

https://github.com/mikf/gallery-dl ⭐️
https://github.com/Bionus/imgbrd-grabber (generic image board downloader like gallery-dl)
https://github.com/Xonshiz/comic-dl (comic, anime, manga, etc. downloader)
https://github.com/justfoolingaround/animdl (anime downloader)
https://github.com/metafates/mangal (manga downloader)
https://github.com/boredazfcuk/docker-icloudpd (iCloud Photos downloader)
https://github.com/Oshan96/monkey-dl (stale? anime downloader)
https://github.com/QianyanTech/Image-Downloader (stale?)
https://github.com/Xonshiz/anime-dl (stale?)

Text/Forums

https://github.com/mikwielgus/forum-dl ⭐️
https://github.com/AndyTheFactory/newspaper4k ⭐️
https://github.com/AAndyProgram/SCrawler (Twitter, Reddit, Instagram, Threads, Facebook, Pinterest, nsfw sites downloader)
https://github.com/extractus/article-extractor
https://github.com/shadowmoose/RedditDownloader (stale?)
https://github.com/aliparlakci/bulk-downloader-for-reddit (stale?)

MOOC/Educational Content

https://github.com/coursera-dl/coursera-dl
https://github.com/rand-net/khan-dl
https://github.com/C0D3D3V/Moodle-DL
https://github.com/r0oth3x49/acloud-dl
https://github.com/Puyodead1/udemy-downloader
https://github.com/PyJun/Mooc_Downloader (stale)
https://github.com/yann0917/dedao-dl (stale, MOOC course downloader)
https://github.com/coursera-dl/edx-dl (stale?)
https://github.com/SigureMo/mooc-dl (stale?)
https://github.com/calvinhobbes23/Skillshare-DL (stale)
https://github.com/r0oth3x49/lynda-dl (stale, Lynda.com course downloader)

Re-Archiving / WARC Creation

https://github.com/hartator/wayback-machine-downloader
https://github.com/MiniGlome/Archive.org-Downloader
https://github.com/ArchiveTeam/grab-site
https://github.com/oduwsdl/archivenow
https://github.com/wabarc/warcraft
https://github.com/sul-dlss/wasapi-downloader
https://github.com/KellyStathis/warc_downloader
https://github.com/internetarchive/heritrix3
https://github.com/AhmadIbrahiim/Website-downloader (wget wrapper)
https://github.com/igrigorik/gharchive.org (stale? Github downloader)

Other

https://github.com/KurtBestor/Hitomi-Downloader
https://github.com/nilaoda/BBDown
https://github.com/biliup/biliup
https://github.com/yutto-dev/bilili
https://github.com/nICEnnnnnnnLee/BilibiliDown
https://github.com/matlink/gplaycli (Google Play store Android app downloader)
https://github.com/AlphaSlayer1964/kemono-dl (Patreon, gumroad, etc. archiver)
https://github.com/manga-download/hakuneko
https://github.com/cancerian0684/dli-downloader (Digital Library of India ebook downloader)
https://github.com/tusharbabbar/gaana-dl (gaana.com bollywood song downloader)
https://github.com/rebane2001/matterport-dl (stale? virtual house tour downloader)
