WebCrawler
This Python program is a bot that explores the pages of a website (crawling), extracts the hyperlinks from each page, and stores them for later use. Each hyperlink is tested with an HTTP request to obtain the response code (200, 403, 404, 500, etc.) and to extract internal and external links from the page content (scraping).
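As a rough illustration of this crawl-and-check loop, here is a minimal sketch assuming the third-party requests and beautifulsoup4 packages; the names used (crawl, max_pages, and so on) are illustrative and not taken from the repository itself.

import requests
from collections import deque
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    """Visit pages starting from start_url, record each link's HTTP
    status code, and classify links as internal or external."""
    start_host = urlparse(start_url).netloc
    to_visit = deque([start_url])
    seen = set()
    results = {}  # url -> (status_code, "internal" or "external")

    while to_visit and len(seen) < max_pages:
        url = to_visit.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        kind = "internal" if urlparse(url).netloc == start_host else "external"
        results[url] = (response.status_code, kind)
        # Only scrape further links out of internal pages that returned HTML.
        if kind == "internal" and "text/html" in response.headers.get("Content-Type", ""):
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                to_visit.append(urljoin(url, anchor["href"]))
    return results

if __name__ == "__main__":
    for link, (status, kind) in crawl("https://example.com").items():
        print(status, kind, link)

A breadth-first queue like to_visit is a common choice for this kind of bot because it checks the pages closest to the start URL before wandering deeper into the site.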
How to download and set up WebCrawler
Open a terminal and run the command:
git clone https://github.com/thomasgottvalles/WebCrawler.git
git clone creates a local copy (clone) of the WebCrawler repository.
You pass git clone a repository URL; it supports several different network protocols and corresponding URL formats.
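By default, git clone names the new directory after the repository, so after cloning you can enter it and confirm where the clone came from (git remote -v is a standard git command that lists the configured remote URLs):

cd WebCrawler
git remote -v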
You can also download WebCrawler as a zip file: https://github.com/thomasgottvalles/WebCrawler/archive/master.zip
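Assuming curl and unzip are available on your system, the zip route looks like this (GitHub archives of the master branch unpack into a WebCrawler-master directory):

curl -L -o WebCrawler.zip https://github.com/thomasgottvalles/WebCrawler/archive/master.zip
unzip WebCrawler.zip
cd WebCrawler-master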
Or simply clone WebCrawler over SSH:
git clone git@github.com:thomasgottvalles/WebCrawler.git
If you have problems with WebCrawler
You can open an issue on the WebCrawler issue tracker here: https://github.com/thomasgottvalles/WebCrawler/issues
Similar to WebCrawler repositories
Here are some WebCrawler alternatives and analogs:
scrapy, Sasila, colly, headless-chrome-crawler, Lulu, crawler, newspaper, isp-data-pollution, webster, cdp4j, spidy, stopstalk-deployment, N2H4, memorious, easy-scraping-tutorial, antch, pomp, Harvester, diffbot-php-client, talospider, corpuscrawler, Python-Crawling-Tutorial, learn.scrapinghub.com, crawling-projects, dig-etl-engine, crawlkit, scrapy-selenium, spidyquotes, zcrawl, podcastcrawler