You have ten thousand help articles in your database. A customer types a question into your search bar. Your LIKE '%keyword%' query returns forty seven results, sorted by date, and not a single one actually answers what they asked. The customer leaves. The support ticket arrives. The cycle repeats.
This is the reality for most Laravel applications that rely on traditional search. Full text search is better, sure, but it still matches words, not meaning. When someone asks "how do I reset my billing cycle without losing credits," they are not looking for every article that contains the word "billing." They want the one paragraph, buried in page twelve of your documentation, that explains exactly how to do it.
What if your Laravel app could find that paragraph, hand it to an LLM, and generate a grounded, accurate answer in under two seconds? That is exactly what Retrieval Augmented Generation does. And now, you can do it with a single line:
$answer = Rag::from(HelpArticle::class)->ask('How do I reset my billing cycle without losing credits?');
No Python sidecar. No external service. Just your Laravel app, your data, and a fluent API that feels like it belongs in the framework.
What Is RAG, Actually?
The acronym stands for Retrieval Augmented Generation, but the concept is simpler than the name suggests. Instead of asking a language model to make up an answer from its training data, you give it the right context first and then ask the question.
Think of it like this: imagine you have a brilliant colleague who has never read your company's documentation. If you ask them a product question off the top of their head, they will give you a confident sounding answer that might be completely wrong. But if you hand them the five most relevant pages from your docs and then ask the same question, they will give you something accurate, specific, and grounded in reality.
That is RAG in a nutshell. It works in three steps:
Embed your documents as vectors. Vectors are arrays of numbers that capture the semantic meaning of text. Two sentences that talk about the same concept will have similar vectors, even if they use completely different words. "Cancel my subscription" and "I want to stop paying" end up close together in vector space, while "The weather is nice today" lands somewhere entirely different.
Retrieve the most relevant chunks when a question comes in. Instead of matching keywords, you compare the vector of the question against the vectors of your stored content. The closest matches are your context, the pieces of your documentation that are most likely to contain the answer.
Generate a response using that context. You take the retrieved chunks, bundle them into a prompt, and send the whole thing to an LLM. The model now has both the question and the evidence, so it generates an answer that is grounded in your actual data rather than hallucinated from training weights.
This is why RAG has become the standard architecture for building AI features on top of proprietary data. It gives you the fluency of a large language model with the accuracy of your own documentation. No fine tuning, no model training, no six figure GPU bills. Just retrieval and generation, working together.
But not all RAG implementations are equal. A naive approach that embeds entire documents and retrieves the top three matches will get you maybe 60% of the way there. Production systems need smarter chunking, hybrid search that combines semantic and keyword matching, reranking to surface the best results, and sometimes even agentic retrieval loops that refine queries when the first pass is not sufficient. That spectrum from naive to advanced is exactly what we built laravel-rag to cover.
The Gap in the Laravel Ecosystem
If you work in Python, you have LangChain, LlamaIndex, Haystack, and a dozen other frameworks for building RAG pipelines. The ecosystem is mature, well documented, and battle tested. If you work in Laravel? Until recently, you were on your own.
The options were not great. You could spin up a Python microservice alongside your Laravel app, adding deployment complexity, a new language to maintain, and an HTTP boundary between your data and your AI logic. You could write raw pgvector queries by hand, managing embeddings manually, building your own chunking, and reinventing retrieval logic that other ecosystems have already solved. Or you could pay for a hosted solution like Pinecone or Weaviate Cloud, sending your proprietary data to yet another third party service.
None of these options felt right for the Laravel ecosystem. Laravel developers are used to a certain level of craft. Migrations that read like English. Eloquent models that abstract complexity without hiding it. Facades that let you do powerful things in a single expressive call. Queue workers that handle background jobs with a simple dispatch(). Debugbar panels that show you exactly what happened during a request.
We needed a RAG pipeline that respected all of that. Not a port of a Python library with PHP syntax bolted on top, but something designed from the ground up to feel native to Laravel. Something where adding a vector column to your migration is as natural as adding a string column. Where embedding your documents is a trait you add to a model. Where querying your knowledge base reads like Eloquent, because it is Eloquent.
That is why we built laravel-rag. A complete, production ready RAG pipeline that lives inside your Laravel application and speaks the same language your codebase already speaks.
How the Pipeline Works
The package is organized around two pipelines: one for getting data in, and one for getting answers out. Let us walk through each stage.
Ingesting Documents
Before you can ask questions, you need to get your content into vector storage. The Ingest facade gives you a fluent interface for this:
use Moneo\Rag\Facades\Ingest;
// From a file
Ingest::file('docs/user-guide.pdf')
->chunk(strategy: 'markdown', size: 500, overlap: 50)
->storeIn(Document::class)
->run();
// From raw text
Ingest::text($article->body)
->chunk(strategy: 'sentence')
->withMetadata(['source' => 'blog', 'author' => $article->author])
->storeIn(Document::class)
->run();
// Async via queue
Ingest::file('docs/large-export.csv')
->chunk(strategy: 'character', size: 1000)
->storeIn(Document::class)
->dispatch();
Chunking is the first critical decision in any RAG pipeline. If your chunks are too large, you waste context window tokens on irrelevant text. If they are too small, you lose the surrounding meaning. The package ships with four strategies:
Character splits text at fixed character boundaries. Simple and predictable. Good for uniform content like log entries or structured records.
Sentence respects natural language boundaries. It will never cut a sentence in half. Best for prose, articles, and help documentation.
Markdown understands document structure. It splits on headers, preserving sections as coherent units. Ideal for technical documentation, READMEs, and knowledge bases.
Semantic uses the embedding model itself to find natural meaning boundaries. It compares adjacent text segments and splits where the semantic similarity drops. This is the most intelligent strategy, and the most expensive, since it requires embedding calls during chunking.
Embedding and Caching
Once your text is chunked, each chunk gets embedded into a vector. The package uses Prism PHP as its embedding layer, which means you can swap providers by changing a single environment variable:
RAG_EMBEDDING_DRIVER=openai
RAG_EMBEDDING_MODEL=text-embedding-3-small
RAG_EMBEDDING_DIMENSIONS=1536
Switch to Ollama for local development, Azure for enterprise compliance, or Anthropic for your own preference. The rest of your code stays exactly the same.
But here is where it gets interesting. Embedding API calls cost money, and if your content overlaps between chunks or gets reindexed during updates, you are paying to embed the same text multiple times. The package includes a built in embedding cache that hashes each text chunk with SHA256, stores the resulting vector, and returns the cached version on subsequent calls. In our production workloads, this reduces embedding API costs by 60 to 80 percent. Every cached entry is HMAC signed, so tampered or corrupted cache entries are automatically detected and evicted.
Vector Storage
The embedded vectors need to live somewhere that supports similarity search. The package ships with two drivers:
pgvector for production. If you are already running PostgreSQL (and most Laravel apps are), adding vector support is a single extension install. The package provides Blueprint macros that feel completely native to Laravel migrations:
Schema::create('documents', function (Blueprint $table) {
$table->id();
$table->text('content');
$table->vector('embedding', 1536);
$table->vectorIndex('embedding', method: 'hnsw', distance: 'cosine');
$table->timestamps();
});
sqlite vec for local development. No PostgreSQL required. Just install the package and start building. When you deploy to production, flip the driver to pgvector and everything works.
Both drivers implement the same VectorStoreContract, which means community drivers for Qdrant, Weaviate, Pinecone, or Milvus can slot in without changing your application code.
Retrieval: Finding the Right Context
This is where the magic happens. When a user asks a question, the package needs to find the most relevant chunks from your vector store. There are several strategies, and you can layer them:
Similarity search is the baseline. Embed the question, compare it against stored vectors, return the closest matches. Fast and effective for straightforward queries.
Hybrid search combines semantic similarity with traditional full text keyword matching using Reciprocal Rank Fusion (RRF). This catches cases where the exact terminology matters, not just the meaning. A query about "HNSW indexing" benefits from keyword matching even when the semantic search already finds relevant results about vector indexes.
LLM reranking takes a broader initial retrieval set, say twenty chunks, and asks a language model to score each one for relevance to the original question. The top scoring chunks become your final context. This dramatically improves precision.
Agentic retrieval is the most advanced strategy. It runs an iterative loop: retrieve context, evaluate whether it is sufficient to answer the question, and if not, automatically generate a refined query and retrieve again. Up to three iterations by default.
Here is what the fluent API looks like in practice:
use Moneo\Rag\Facades\Rag;
// Simple similarity search
$result = Rag::from(Document::class)
->limit(5)
->ask('What is our refund policy?');
// Hybrid search with reranking
$result = Rag::from(Document::class)
->hybrid(semanticWeight: 0.7, fulltextWeight: 0.3)
->limit(20)
->rerank(topK: 5)
->ask('How does HNSW indexing affect query performance?');
// Agentic retrieval for complex questions
$result = Rag::from(Document::class)
->agentic(maxSteps: 3)
->ask('Compare our enterprise and startup pricing tiers across all product lines');
Generating the Answer
The retrieved chunks are assembled into a context window and sent to the LLM along with the original question. The model generates a grounded answer, and the package wraps everything in a RagResult object:
$result = Rag::from(Document::class)
->askWithSources('What are the system requirements?');
echo $result->answer;
foreach ($result->sources() as $source) {
echo $source['content'] . ' (score: ' . $source['score'] . ')';
}
For real time user interfaces, streaming is built in:
$stream = Rag::from(Document::class)->stream('Explain our deployment process');
Quick Start
Getting up and running takes about five minutes. Here is the complete walkthrough.
Install the package:
composer require moneo/laravel-rag
Publish the configuration and migrations:
php artisan vendor:publish --tag=rag-config
php artisan vendor:publish --tag=rag-migrations
php artisan migrate
Configure your environment variables:
RAG_VECTOR_STORE=pgvector
RAG_EMBEDDING_DRIVER=openai
RAG_EMBEDDING_MODEL=text-embedding-3-small
RAG_EMBEDDING_DIMENSIONS=1536
RAG_LLM_PROVIDER=openai
RAG_LLM_MODEL=gpt-4o
RAG_EMBEDDING_CACHE=true
Create your model migration with a vector column:
Schema::create('documents', function (Blueprint $table) {
$table->id();
$table->string('title');
$table->text('content');
$table->json('metadata')->nullable();
$table->vector('embedding', 1536);
$table->vectorIndex('embedding', method: 'hnsw', distance: 'cosine');
$table->timestamps();
});
Add the traits to your model:
namespace App\Models;
use Illuminate\Database\Eloquent\Model;
use Moneo\Rag\Concerns\HasVectorSearch;
use Moneo\Rag\Concerns\AutoEmbeds;
class Document extends Model
{
use HasVectorSearch, AutoEmbeds;
protected string $embedSource = 'content';
protected string $vectorColumn = 'embedding';
}
The HasVectorSearch trait adds semantic and hybrid search capabilities to your model. The AutoEmbeds trait automatically generates embeddings whenever the model is created or updated, so you never have to think about keeping vectors in sync.
Ingest some content:
Ingest::text('Laravel is a web application framework with expressive, elegant syntax...')
->chunk(strategy: 'sentence')
->storeIn(Document::class)
->run();
Ask a question:
$result = Rag::from(Document::class)->ask('What is Laravel?');
echo $result->answer;
That is it. You now have a working RAG pipeline in your Laravel application. Everything from embedding to retrieval to generation happens inside your app, using your data, with no external dependencies beyond the LLM API.
Beyond the Basics
Once your pipeline is running, the package has a deep feature set for production use cases.
Conversation Memory
Most RAG implementations treat every question as independent. But real users ask follow up questions. "What is our refund policy?" followed by "What about for enterprise customers?" The second question only makes sense in the context of the first.
RagThread provides persistent conversation memory with automatic context summarization:
use Moneo\Rag\Memory\RagThread;
$thread = RagThread::create(['model' => Document::class]);
$first = $thread->ask('What is our refund policy?');
// "Our standard refund policy allows returns within 30 days..."
$second = $thread->ask('What about for enterprise customers?');
// "Enterprise customers have an extended 90-day refund window..."
The thread tracks message history, manages token counts, and automatically summarizes older messages to keep the context window manageable. All of it is backed by your database, so conversations persist across sessions.
RAG Evaluation
You cannot improve what you do not measure. The package includes a first of its kind Laravel native evaluation framework with three built in metrics:
Faithfulness measures whether the generated answer is actually supported by the retrieved context. Did the LLM stick to the evidence, or did it hallucinate?
Relevancy checks whether the retrieved chunks are actually relevant to the question. If your retrieval is pulling in noise, this metric will catch it.
Context Recall compares the retrieved context against a known expected answer. Did the retrieval find the information needed to produce the correct response?
use Moneo\Rag\Facades\RagEval;
$report = RagEval::suite()
->using(Rag::from(Document::class))
->add(
question: 'What is pgvector?',
expected: 'A PostgreSQL extension for vector similarity search'
)
->add(
question: 'How do I create an HNSW index?',
expected: 'Use the vectorIndex Blueprint method with hnsw'
)
->run();
You can run evaluations from the command line and integrate them into your CI pipeline:
php artisan rag:eval --suite=core --fail-below=0.8
MCP Server
If you use Claude Desktop, Cursor, or Windsurf, you can expose your RAG pipeline as MCP (Model Context Protocol) tools. This lets AI assistants query your application's knowledge base directly:
php artisan rag:mcp-serve
Your RAG pipeline becomes a tool that any MCP compatible client can discover and call. Ask Claude a question about your product, and it queries your actual documentation through your Laravel app.
Drop In UI Components
For teams that want a chat interface without building one from scratch, the package includes a Livewire component:
<livewire:rag-chat :model="App\Models\Document" placeholder="Ask anything about our products..." />
One line in your Blade template gives you a streaming chat interface with source attribution and conversation threading. There is also a Filament plugin for admin panels, giving you document management, embedding inspection, and interactive query testing in your back office.
Security Hardening
RAG pipelines are a new attack surface. Users can craft inputs designed to manipulate LLM behavior, a technique known as prompt injection. The package includes an InputSanitiser that detects and strips over forty known injection patterns before any text reaches the language model. Vector inputs are validated for correct dimensions, NaN values, and infinity. Cache entries are HMAC signed to prevent tampering. Log entries hash sensitive text fields instead of storing them raw.
Developer Experience
We built the tooling that we wished existed when we started working with RAG in Laravel.
Cost estimation before you commit to indexing a large dataset:
php artisan rag:estimate --model="App\Models\Document"
This scans your table, calculates average token counts per record, and estimates the total API cost for embedding everything. Run it before you index a hundred thousand documents, not after.
Interactive testing lets you try queries from the command line:
php artisan rag:test "What is pgvector?" --model="App\Models\Document" --rerank
Batch indexing with progress bars and chunking control:
php artisan rag:index --model="App\Models\Document" --chunk=markdown
Debugbar collector shows RAG operations directly in Laravel Debugbar: embedding calls, cache hits, retrieval timing, and generation metrics. Telescope watcher traces the same operations for async and queued workloads. Every operation flows through structured logging with privacy by default, text content is hashed in logs so you never accidentally store user queries in plaintext.
Wrapping Up
laravel-rag is MIT licensed and open source. It supports PHP 8.2 and above, works with Laravel 11, 12, and 13, and uses Prism PHP as its LLM layer so you are never locked into a single provider.
The gap between traditional search and AI powered answers is smaller than most teams think. If you have a Laravel application with data that users need to query intelligently, whether that is help documentation, product catalogs, legal contracts, or internal knowledge bases, you can have a working RAG pipeline in production by the end of the day.
We have been running this in production across multiple client projects at Moneo, and the results have been consistently strong: faster support resolution, fewer tickets, and users who actually find what they are looking for.
The source code is on GitHub. Issues, pull requests, and community drivers are welcome. If you build something with it, we would love to hear about it.
Moneo as Your Enterprise Partner
We collaborate closely with enterprise teams to design, deliver, and operate systems built for the long run.
Start PartnershipHow We Can Help
Software Consulting
We help you make the right technical decisions, improve processes, and build smarter with advice from engineers who have seen it all.
Learn moreGEO & AI Search Optimization
Make your brand visible in AI-generated answers. Optimize for ChatGPT, Perplexity, Gemini and other AI search engines.
Learn moreLong-term Collaboration
Work with a team that feels like your own. Our engineers join long-term to support, scale, and grow your product.
Learn more
GEO & AI Search Optimization for Shopify
LLMRank is a Shopify app that makes product catalogs discoverable by ChatGPT, Perplexity, Google AI Overviews, and every AI system that reads the web through a language model. Here is how it works.
The Web Is Being Tokenized. Serve Markdown.
We built a Laravel package that unifies Cloudflare's three Markdown conversion services under one elegant API. Convert URLs, files, and raw HTML to Markdown. Make your Laravel app agent ready with a single middleware.
Quiet Failures in Usage Metering
Usage metering rarely crashes loudly. It drifts silently. We built Laravel Usage Limiter to make metering atomic, idempotent, and auditable under real production pressure.