Most popular crawling repositories and open source projects

stopstalk-deployment stopstalk Python

Stop stalking and start StopStalking :wink:

318 100 318

memorious alephdata Python

Lightweight web scraping toolkit for documents and structured data.

315 64 315

crawler infinilabs Go

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

314 81 314

scrapper amerkurev Python

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.

314 52 314

Sasila da2vin Python

A flexible, friendly crawler framework

297 69 297

Instagram-Bot mustafadalga Python

An Instagram bot developed using the Selenium Framework

284 84 284

antch antchfx Go

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

265 40 265

laravel roach-php PHP

Laravel adapter for Roach, the complete web scraping toolkit for PHP.

264 30 264

Infect mishakorzik Shell

Create your virus in Termux!

263 33 263

facebook-data-extraction 18520339 Python

Experience for effectively fetching Facebook data by Querying Graph API with Account-based Token and Operating undetectable scraping Bots to extract C...

226 62 226

N2H4 forkonlp R

A tool for collecting Naver news

221 77 221

spidercreator carlosplanchon Python

Automated web scraping spider generation using Browser Use and LLMs. Streamline the creation of Playwright-based spiders with minimal manual coding. I...

217 23 217

corpuscrawler google Python

Crawler for linguistic corpora

214 52 214

Grawler A3h1nt PHP

Grawler is a tool written in PHP which comes with a web interface that automates the task of using Google dorks, scrapes the results, and stores them...

211 56 211

estela bitmakerla TypeScript

estela, an elastic web scraping cluster 🕸

196 18 196

SpideyX RevoltSecurities Python

SpideyX is a multipurpose web penetration testing tool with asynchronous, concurrent performance and multiple modes and configurations.

192 31 192

DotnetCrawler mehmetozkaya C#

DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output based on dotnet core. This library des...

181 64 181

Squidwarc N0taN3rd JavaScript

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

175 25 175

courlan adbar Python

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

167 13 167

massivedl dimkouv Go

Download a large list of files concurrently

166 11 166

crawler trandoshan-io Go

Go process used to crawl websites

150 21 150

sasori karthikuj JavaScript

Sasori is a dynamic web crawler powered by Puppeteer, designed for lightning-fast endpoint discovery.

146 17 146

cdp4j webfolderio Java

cdp4j - Chrome DevTools Protocol for Java

144 43 144

proxifier rookmoot Go

A fast, modern and intelligent proxy rotator perfect for crawling and scraping public data.

143 17 143

double-agent unblocked-web TypeScript

A test suite of common scraper detection techniques. See how detectable your scraper stack is.

139 10 139

wget-lua ArchiveTeam C

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

135 18 135

sitemapper seantomburke TypeScript

Parse through any sitemap in Node.js

133 81 133

pdf-crawler SimFin Python

SimFin's open source PDF crawler

130 46 130

scraply alash3al Go

Scraply, a simple DOM scraper to fetch information from any HTML-based website

129 13 129

LinkedIn-Skills-Crawler varadchoudhari Python

A simple Python script to crawl the complete list of LinkedIn skills

124 109 124

bots-zoo antoinevastel JavaScript

116 28 116

aioscpy ihandmine Python

An asyncio + aiolibs crawler that imitates the Scrapy framework

115 10 115

jkcrawler topiccrawler Python

A JK crawler written with Scrapy; images are sourced from Bilibili, Tumblr, Instagram, Weibo, and Twitter

114 28 114

warc-parquet maxcountryman Rust

🗄️ A simple CLI for converting WARC to Parquet.

114 1 114

goClone shurco Go

🌱 goClone - clone websites in seconds

113 9 113

abx-dl ArchiveBox Python

⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless...

111 5 111

qcrawl crawlcore Python

qcrawl - fast async web crawling & scraping framework for Python.

109 5 109

burp-dom-scanner fcavallarin Java

Burp Suite's extension to scan and crawl Single Page Applications

107 17 107

dig-etl-engine usc-isi-i2

Download DIG to run on your laptop or server.

105 37 105

devdocs-to-llm alexfazio Jupyter Notebook

Turn any developer documentation into a GPT

102 16 102

AyugeSpiderTools shengchenyang Python

Lets Scrapy developers skip writing modules for common scenarios such as item, pipeline, and middleware, freeing up developers' hands.

98 16 98

bathyscaphe creekorful Go

Fast, highly configurable, cloud native dark web crawler.

95 21 95

feedsearch-crawler DBeath Python

Crawl sites for RSS, Atom, and JSON feeds.

92 15 92

robots.txt jonasjacek

Simple robots.txt template. Keeps unwanted robots out (disallow). Whitelists (allow) legitimate user-agents. Useful for all websites.

89 38 89

ARGUS datawizard1337 Python

ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different website...

89 25 89