2 repositories on SrcLog
News crawling with StormCrawler - stores content as WARC
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.