RAG_DATA_PREP

v1.0.0

[SYS] Advanced data preparation workflow for RAG (Retrieval Augmented Generation) systems

$ cat description.txt
Comprehensive n8n workflow for preparing and enriching data for vector database ingestion. Processes text from various sources, including URLs, YouTube videos, and multiple file formats. Automatically chunks, cleans, and tags content, then generates FAQs to improve retrieval performance in RAG systems.

CORE_FEATURES:

Multi-Source Processing

> URL content extraction
> YouTube transcript retrieval
> PDF, CSV, audio, text processing

Intelligent Tagging

> Content categorization
> Automated metadata generation
> Semantic classification

FAQ Generation

> Automatic question creation
> Content-based answers
> Query optimization

Vector DB Ready

> Optimal chunk sizing
> JSON structure formatting
> Enhanced retrieval preparation
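As an illustration of the chunk-sizing step, here is a minimal sketch that splits text on blank lines and greedily packs whole paragraphs into chunks under a size budget. It is an assumption about how such a step could work, not the template's actual logic; `chunk_paragraphs` and `max_chars` are hypothetical names.

```python
import re


def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept intact as its own chunk.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Paragraphs are never split mid-sentence in this sketch, which keeps each chunk readable on its own at the cost of uneven chunk lengths.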

EXAMPLE_OUTPUT:

Workflow Input:

{
  "Source URL": "https://docs.n8n.io/integrations/builtin/cluster-nodes/root-nodes/n8n-nodes-langchain.aifunctions-ai-agent/",
  "Content Format": "website",
  "Chunking Strategy": "semantic paragraphs",
  "Generate FAQs": "yes"
}

RESULTING RAG DATA

AI Agents Documentation (RAG-Ready)

ℹ️
This document contains processed content about various AI Agents in n8n, cleaned and prepared for Retrieval Augmented Generation (RAG) systems. The content has been split into semantic chunks with tags and metadata for improved retrieval.

Document Metadata

Source: n8n Documentation
Chunks: 12 semantic paragraphs
FAQs Generated: 13
Embedded Tags: #AIAgent, #PlanExecuteAgent, #ReActAgent, #SQLAgent, #VectorDatabase, #Prompting

Content Chunks

Chunk #1: Plan and Execute Agent

Tags: #AIAgent #PlanExecuteAgent

Plan and Execute Agent node
The Plan and Execute Agent is like the ReAct agent but with a focus on planning. It first creates a high-level plan to solve the given task and then executes the plan step by step. This agent is most useful for tasks that require a structured approach and careful planning.

Chunk #2: ReAct AI Agent

Tags: #AIAgent #ReActAgent

ReAct AI Agent node
The ReAct Agent node implements ReAct logic. ReAct (reasoning and acting) brings together the reasoning powers of chain-of-thought prompting and action plan generation.
The ReAct Agent reasons about a given task, determines the necessary actions, and then executes them. It follows the cycle of reasoning and acting until it completes the task. The ReAct agent can break down complex tasks into smaller sub-tasks, prioritise them, and execute them one after the other.

                                ⋮ 

                                10 more chunks available (hidden for brevity)

Generated FAQs

Q: What is a vector database?

A: A vector database stores mathematical representations of information. Use with embeddings and retrievers to create a database that your AI can access when answering questions.

Q: What is the Plan and Execute Agent?

A: The Plan and Execute Agent is like the ReAct agent but with a focus on planning. It first creates a high-level plan to solve the given task and then executes the plan step by step. This agent is most useful for tasks that require a structured approach and careful planning.

Q: What is the SQL AI Agent?

A: The SQL Agent uses a SQL database as a data source. It can understand natural language questions, convert them into SQL queries, execute the queries, and present the results in a user-friendly format. This agent is valuable for building natural language interfaces to databases.

                                ⋮ 

                                10 more FAQs available (hidden for brevity)
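For readers who want to post-process FAQ output in the `Q:` / `A:` layout shown above, the following is a minimal parsing sketch. It assumes each question and each answer sits on a single line; `parse_faqs` is a hypothetical helper, not part of the template.

```python
def parse_faqs(text: str) -> list[dict]:
    """Pair each 'Q:' line with the next 'A:' line that follows it."""
    faqs: list[dict] = []
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            faqs.append({"question": question, "answer": line[2:].strip()})
            question = None  # wait for the next question
    return faqs


sample = """Q: What is a vector database?

A: A vector database stores mathematical representations of information.
"""
faqs = parse_faqs(sample)
```

Each resulting pair can then be stored as its own retrievable item alongside the content chunks.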

Vector Database Ready Format

{
  "id": "chunk_001",
  "text": "Plan and Execute Agent node\nThe Plan and Execute Agent is like the ReAct agent but with a focus on planning...",
  "metadata": {
    "source": "n8n-docs",
    "url": "https://docs.n8n.io/integrations/builtin/cluster-nodes/root-nodes/n8n-nodes-langchain.aifunctions-ai-agent/",
    "tags": ["AIAgent", "PlanExecuteAgent"],
    "chunk_type": "semantic_paragraph"
  },
  "embedding": [0.023, -0.112, 0.043, ...] // Vector representation (768 dimensions)
}

💡
Each content chunk is processed into this format before being stored in your vector database of choice. This structure enables efficient semantic search and relevance scoring for RAG applications.
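A minimal sketch of assembling one record in the shape shown above from a chunk and its tag line. The function name `to_vector_record` and its parameters are hypothetical, and the `embedding` field is deliberately left out because it would come from whichever embedding model you use.

```python
import json


def to_vector_record(index: int, text: str, tag_line: str, source: str, url: str) -> dict:
    """Build one record matching the sample format (without the embedding)."""
    # "#AIAgent #PlanExecuteAgent" -> ["AIAgent", "PlanExecuteAgent"]
    tags = [t.lstrip("#") for t in tag_line.split() if t.startswith("#")]
    return {
        "id": f"chunk_{index:03d}",  # zero-padded, e.g. chunk_001
        "text": text,
        "metadata": {
            "source": source,
            "url": url,
            "tags": tags,
            "chunk_type": "semantic_paragraph",
        },
    }


record = to_vector_record(
    1,
    "Plan and Execute Agent node\nThe Plan and Execute Agent is like the ReAct agent but with a focus on planning...",
    "#AIAgent #PlanExecuteAgent",
    "n8n-docs",
    "https://docs.n8n.io/integrations/builtin/cluster-nodes/root-nodes/n8n-nodes-langchain.aifunctions-ai-agent/",
)
print(json.dumps(record, indent=2))
```

Keeping tags as a plain list in `metadata` makes it possible to filter by tag in vector stores that support metadata filtering.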

This is an example of RAG-prepared content created with our template

$ system_requirements

MODELS: gemini 2.5, openai whisper
STORAGE: google drive, google sheets
SERVICES: potentially rapidapi youtube v2
OUTPUT: google sheet rows of file links
PRICING: gemini - per token,
         whisper - per minute,
         google drive - free,
         google sheets - free,
         rapidapi youtube v2 api - free tier
EST. PER RUN COST: €0.01

AUTOMATION_BENEFITS:

> Process diverse content types with one workflow
> Improve RAG retrieval accuracy with cleaned, tagged chunks
> Generate FAQs to enhance knowledge coverage
> Eliminate manual data preparation tasks
> Consistent formatting for all content sources

€129

* Compatible with all n8n installations v1.0.0+

* Superflowz is a subsidiary of CARDUME ESBELTO UNIP. LDA. Your purchase will be from, and your receipt will list, CARDUME ESBELTO.