Topic

crawling

Repositories (1350)

stopstalk-deployment
stopstalk-deployment stopstalk Python

Stop stalking and start StopStalking :wink:

318
memorious
memorious alephdata Python

Lightweight web scraping toolkit for documents and structured data.

315
crawler
crawler infinilabs Go

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

314
scrapper
scrapper amerkurev Python

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.

314
Sasila
Sasila da2vin Python

A flexible, friendly crawler framework

297
Instagram-Bot
Instagram-Bot mustafadalga Python

An Instagram bot developed using the Selenium Framework

284
antch
antch antchfx Go

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

265
laravel
laravel roach-php PHP

Laravel adapter for Roach, the complete web scraping toolkit for PHP.

264
Infect
Infect mishakorzik Shell

Create your virus in Termux!

263
facebook-data-extraction
facebook-data-extraction 18520339 Python

Experience for effectively fetching Facebook data by Querying Graph API with Account-based Token and Operating undetectable scraping Bots to extract C...

226
N2H4
N2H4 forkonlp R

A tool for collecting Naver news

221
spidercreator
spidercreator carlosplanchon Python

Automated web scraping spider generation using Browser Use and LLMs. Streamline the creation of Playwright-based spiders with minimal manual coding. I...

217
corpuscrawler
corpuscrawler google Python

Crawler for linguistic corpora

214
Grawler
Grawler A3h1nt PHP

Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them...

211
estela
estela bitmakerla TypeScript

estela, an elastic web scraping cluster 🕸

196
SpideyX
SpideyX RevoltSecurities Python

SpideyX, a multipurpose web penetration testing tool with asynchronous concurrent performance and multiple modes and configurations.

192
DotnetCrawler
DotnetCrawler mehmetozkaya C#

DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output based on dotnet core. This library des...

181
Squidwarc
Squidwarc N0taN3rd JavaScript

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

175
courlan
courlan adbar Python

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

167
massivedl
massivedl dimkouv Go

Download a large list of files concurrently

166
crawler
crawler trandoshan-io Go

Go process used to crawl websites

150
sasori
sasori karthikuj JavaScript

Sasori is a dynamic web crawler powered by Puppeteer, designed for lightning-fast endpoint discovery.

146
cdp4j
cdp4j webfolderio Java

cdp4j - Chrome DevTools Protocol for Java

144
proxifier
proxifier rookmoot Go

A fast, modern and intelligent proxy rotator perfect for crawling and scraping public data.

143
double-agent
double-agent unblocked-web TypeScript

A test suite of common scraper detection techniques. See how detectable your scraper stack is.

139
wget-lua
wget-lua ArchiveTeam C

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

135
sitemapper
sitemapper seantomburke TypeScript

Parse through any sitemap in Node.js

133
pdf-crawler
pdf-crawler SimFin Python

SimFin's open source PDF crawler

130
scraply
scraply alash3al Go

Scraply, a simple DOM scraper to fetch information from any HTML-based website

129
LinkedIn-Skills-Crawler
LinkedIn-Skills-Crawler varadchoudhari Python

A simple Python script to crawl the complete list of LinkedIn skills

124
bots-zoo
bots-zoo antoinevastel JavaScript

116
aioscpy
aioscpy ihandmine Python

An asyncio + aiolibs crawler that imitates the Scrapy framework

115
jkcrawler
jkcrawler topiccrawler Python

A JK crawler written with Scrapy; images are sourced from Bilibili, Tumblr, Instagram, Weibo, and Twitter

114
warc-parquet
warc-parquet maxcountryman Rust

🗄️ A simple CLI for converting WARC to Parquet.

114
goClone
goClone shurco Go

🌱 goClone - clone websites in seconds

113
abx-dl
abx-dl ArchiveBox Python

⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless...

111
qcrawl
qcrawl crawlcore Python

qcrawl - fast async web crawling & scraping framework for Python.

109
burp-dom-scanner
burp-dom-scanner fcavallarin Java

Burp Suite's extension to scan and crawl Single Page Applications

107
dig-etl-engine
dig-etl-engine usc-isi-i2

Download DIG to run on your laptop or server.

105
devdocs-to-llm
devdocs-to-llm alexfazio Jupyter Notebook

Turn any developer documentation into a GPT

102
AyugeSpiderTools
AyugeSpiderTools shengchenyang Python

Lets Scrapy developers skip writing modules for common scenarios such as item, pipeline, and middleware, freeing up the developer's hands.

98
bathyscaphe
bathyscaphe creekorful Go

Fast, highly configurable, cloud native dark web crawler.

95
feedsearch-crawler
feedsearch-crawler DBeath Python

Crawl sites for RSS, Atom, and JSON feeds.

92
robots.txt
robots.txt jonasjacek

Simple robots.txt template. Keeps unwanted robots out (disallow). Whitelists (allow) legitimate user-agents. Useful for all websites.

89
ARGUS
ARGUS datawizard1337 Python

ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different website...

89
arachnid
arachnid watzon Crystal

Powerful web scraping framework for Crystal

78
tech-seo-crawler
tech-seo-crawler jroakes Python

Build a small, 3-domain internet using GitHub Pages and Wikipedia and construct a crawler to crawl, render, and index.

75
Harvester
Harvester TransparencyToolkit JavaScript

Web crawling and document processing through a usable interface.

72
rag-web-browser
rag-web-browser apify TypeScript

RAG Web Browser is an Apify Actor to feed your LLM applications and RAG pipelines with up-to-date text content scraped from the web.

71
web-languages
web-languages commoncrawl

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code

68