>_

WEBSITE_SCRAPER

v1.0.0

[SYS] Comprehensive website content extraction system: Sitemap parsing + Text processing

$ cat description.txt

Advanced website content extraction workflow that automatically discovers and processes sitemaps to extract text content from every page on a website. It handles both plain and indexed sitemap formats, cleans and normalizes HTML content, and stores the extracted text in a structured database for knowledge retrieval and AI applications.

CORE_FEATURES:

Intelligent Discovery

> Automatic sitemap detection
> Robots.txt parsing
> Multi-level sitemap support

Content Extraction

> Clean text extraction
> Navigation removal
> Format normalization

Data Processing

> URL structure mapping
> Batch processing
> Supabase integration

Content Storage

> Structured database storage
> URL and content mapping
> RAG-ready format
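Supabase exposes inserts over its PostgREST-backed REST endpoint (`/rest/v1/<table>` with `apikey` and `Authorization` headers). A hedged sketch of shaping and posting rows, assuming a hypothetical `pages` table with `url` and `content` columns (the actual workflow uses n8n's Supabase node instead):

```python
import json
import urllib.request

def build_row(url: str, content: str) -> dict:
    """Map one extracted page to a row for the assumed `pages` schema."""
    return {"url": url, "content": content}

def store_rows(supabase_url: str, api_key: str, rows: list[dict]) -> urllib.request.Request:
    """Build a bulk-insert request against Supabase's REST endpoint.
    The caller sends it with urllib.request.urlopen(req)."""
    return urllib.request.Request(
        f"{supabase_url}/rest/v1/pages",
        data=json.dumps(rows).encode(),
        headers={
            "apikey": api_key,
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Prefer": "return=minimal",  # skip echoing inserted rows back
        },
        method="POST",
    )
```

Storing one row per URL keeps the url-to-content mapping explicit, which is the shape RAG retrievers typically expect.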

$ system_requirements
    
MODELS: none required
STORAGE: supabase
SERVICES: none required
OUTPUT: supabase table entries
PRICING: supabase - free tier        
EST. PER RUN COST: free
>_

PROCESS_FLOW:

[INPUT]      -> Website URL
[DISCOVERY]  -> Robots.txt + Sitemap Detection
[EXTRACTION] -> URL Collection from Sitemaps
[PROCESSING] -> Page Content Retrieval
  ├── [CLEANING]   -> HTML Tag Removal
  └── [FORMATTING] -> Text Normalization
[STORAGE]    -> Supabase Database Integration
[OUTPUT]     -> Structured Content Database
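The stages of this flow can be wired together as a simple pipeline. A sketch with every stage injected as a callable (all stage names here are hypothetical placeholders, not n8n node names), which also keeps the skeleton testable without network access:

```python
def run_pipeline(start_url, discover, fetch, clean, store, batch_size=10):
    """Discovery -> extraction -> cleaning -> storage, in batches.

    discover(start_url) -> list of page URLs   ([DISCOVERY]/[EXTRACTION])
    fetch(url)          -> raw HTML            ([PROCESSING])
    clean(html)         -> plain text          ([CLEANING]/[FORMATTING])
    store(rows)         -> persists a batch    ([STORAGE])
    Returns the number of rows stored.
    """
    urls = discover(start_url)
    stored = 0
    for i in range(0, len(urls), batch_size):
        rows = [{"url": u, "content": clean(fetch(u))} for u in urls[i:i + batch_size]]
        store(rows)
        stored += len(rows)
    return stored
```

Batching the store step keeps database writes grouped, which matters once a sitemap yields thousands of URLs.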

AUTOMATION_BENEFITS:

> Create a private knowledge base from any website
> Automate content extraction for AI training datasets
> Build RAG systems with domain-specific content
> Monitor website content changes over time
> Generate searchable content archives without manual processing
PRICE: €79

* Compatible with all n8n installations v1.0.0+