leechcrawler
Incremental crawling capabilities for Apache Tika. Crawl content out of, e.g., file systems, http(s) sources (web crawling), imap(s) servers, or your own arbitrary data sources. LeechCrawler offers additional Tika parsers providing these crawling capabilities.
How to download and set up leechcrawler
Open a terminal and run the command:
git clone https://github.com/DFKI/leechcrawler.git
git clone creates a local copy (clone) of the leechcrawler repository.
You pass git clone a repository URL; it supports several network protocols and the corresponding URL formats.
Alternatively, you can download leechcrawler as a zip file: https://github.com/DFKI/leechcrawler/archive/master.zip
Or simply clone leechcrawler with SSH
git@github.com:DFKI/leechcrawler.git
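The download options above differ only in the URL used. As a minimal sketch, the shell helper below prints the URL for each option. The function name leech_url is hypothetical (it is not part of leechcrawler or git), and the SSH form assumes GitHub's standard git@github.com: prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of leechcrawler or git):
# print the leechcrawler URL for the requested access method.
leech_url() {
    case "$1" in
        https) echo "https://github.com/DFKI/leechcrawler.git" ;;
        ssh)   echo "git@github.com:DFKI/leechcrawler.git" ;;  # assumes GitHub's standard SSH prefix
        zip)   echo "https://github.com/DFKI/leechcrawler/archive/master.zip" ;;
        *)     echo "usage: leech_url https|ssh|zip" >&2; return 1 ;;
    esac
}
```

For example, git clone "$(leech_url ssh)" would clone over SSH, provided an SSH key is registered with your GitHub account.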
If you have problems with leechcrawler
You can open an issue on the leechcrawler issue tracker: https://github.com/DFKI/leechcrawler/issues
Repositories similar to leechcrawler
Alternatives and analogs to leechcrawler:
scrapy, Sasila, colly, headless-chrome-crawler, Lulu, gopa, newspaper, isp-data-pollution, webster, cdp4j, spidy, stopstalk-deployment, N2H4, memorious, easy-scraping-tutorial, antch, pomp, Harvester, diffbot-php-client, talospider, corpuscrawler, Python-Crawling-Tutorial, learn.scrapinghub.com, crawling-projects, dig-etl-engine, crawlkit, scrapy-selenium, spidyquotes, zcrawl, podcastcrawler