Topic

crawling

Repositories (1230)

scrapy
scrapy scrapy Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

61.2k
Scrapling
Scrapling D4Vinci Python

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

35k
colly
colly gocolly Go

Elegant Scraper and Crawler Framework for Golang

24.5k
crawlee
crawlee apify TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs,...

22.7k
newspaper
newspaper codelucas Python

newspaper3k is a news, full-text, and article metadata extraction library for Python 3.

15k
crawlee-python
crawlee-python apify Python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, P...

8.7k
awesome-web-scraping
awesome-web-scraping lorien Makefile

List of libraries, tools and APIs for web scraping and data processing.

7.1k
rod
rod go-rod Go

A Chrome DevTools Protocol driver for web automation and scraping.

6.2k
ferret
ferret MontFerret Go

Declarative web scraping

5.8k
headless-chrome-crawler
headless-chrome-crawler yujiosaka JavaScript

Distributed crawler powered by Headless Chrome

5.6k
hakrawler
hakrawler hakluke Go

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

4.8k
puppeteer-sharp
puppeteer-sharp hardkoded C#

Headless Chrome .NET API

3.7k
nutch
nutch apache Java

Apache Nutch is an extensible and scalable web crawler

3k
ai.robots.txt
ai.robots.txt ai-robots-txt Python

A list of AI agents and robots to block.

2.9k
cariddi
cariddi edoardottt Go

Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more

2.7k
awesome-puppeteer
awesome-puppeteer transitive-bullshit

A curated list of awesome puppeteer resources.

2.5k
grab
grab lorien Python

Web Scraping Framework

2.5k
skycaiji
skycaiji zorlan PHP

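SkyCaiji (蓝天采集器) is a free, open-source crawler system that collects data with nothing more than point-and-click rule editing. It can run locally, on a virtual host, or on a cloud server, can collect almost every type of web page, and integrates seamlessly with all kinds of CMS site-building...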

2k
holiday-cn
holiday-cn NateScarlet Python

📅🇨🇳 Chinese statutory holiday data, scraped automatically every day from State Council announcements

1.6k
core
core roach-php PHP

The complete web scraping toolkit for PHP.

1.4k
mlscraper
mlscraper lorey Python

🤖 Scrape data from HTML websites automatically by just providing examples

1.4k
bhban_rpa
bhban_rpa needleworm Python

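Example code for the book <Work Automation: Finishing Six Months of Work in a Single Day (생능출판사, 2020)>. The examples are written for people who have never learned Python, and range from Excel to...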

1.1k
crawly
crawly elixir-crawly Elixir

Crawly, a high-level web crawling & scraping framework for Elixir.

1k
scrapy-selenium
scrapy-selenium clemfromspace Python

Scrapy middleware to handle javascript pages using selenium

945
rebrowser-patches
rebrowser-patches rebrowser JavaScript

Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy...

901
scrapyrt
scrapyrt scrapinghub Python

HTTP API for Scrapy spiders

864
browsertrix-crawler
browsertrix-crawler webrecorder TypeScript

Run a high-fidelity browser-based web archiving crawler in a single Docker container

831
Lulu
Lulu iawia002 Python

[Unmaintained] A simple and clean video/music/image downloader 👾

807
easy-scraping-tutorial
easy-scraping-tutorial MorvanZhou Jupyter Notebook

Simple but useful Python web scraping tutorial code.

802
AdminHack
AdminHack mishakorzik Shell

today we will hack the admin panel of the site.

799
siteone-crawler
siteone-crawler janreges Rust

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers,...

709
dataflowkit
dataflowkit slotix Go

Extract structured data from websites. Website scraping.

688
linkedin-profile-scraper-api
linkedin-profile-scraper-api josephlimtech TypeScript

🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON.

680
Craw4LLM
Craw4LLM cxcscmu Python

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

633
isp-data-pollution
isp-data-pollution essandess Python

ISP Data Pollution to Protect Private Browsing History with Obfuscation

608
LinkedInDumper
LinkedInDumper l4rm4nd Python

Python 3 script to dump/scrape/extract company employees from LinkedIn API

580
scrapfly-scrapers
scrapfly-scrapers scrapfly Python

Scalable Python web scraping scripts for 40+ popular domains

569
spidermon
spidermon scrapinghub Python

Scrapy extension for monitoring spider execution.

546
webster
webster zhuyingda JavaScript

a reliable high-level web crawling & scraping framework for Node.js.

540
crawljax
crawljax crawljax Java

Crawljax

526
seonaut
seonaut StJudeWasHere Go

Open source SEO audit tool.

407
second-order
second-order mhmdiaa Go

Second-order subdomain takeover scanner

405
WarcDB
WarcDB Florents-Tselai Python

WarcDB: Web crawl data as SQLite databases.

404
webpalm
webpalm XORbit01 Go

🕸️ Crawl in the web network

371
crawler
crawler crwlrsoft PHP

Library for Rapid (Web) Crawler and Scraper Development

366
spidy
spidy rivermont Python

The simple, easy to use command line web crawler.

349
telegram-crawler
telegram-crawler MarshalX Python

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

346
stopstalk-deployment
stopstalk-deployment stopstalk Python

Stop stalking and start StopStalking 😉

317
memorious
memorious alephdata Python

Lightweight web scraping toolkit for documents and structured data.

313
crawler
crawler infinilabs Go

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

309