feat(agent): Pretty decent first pass at agent mode

Willie Zutz 2025-06-09 23:00:25 -06:00
parent 3f1f437d4f
commit b4e2585856
18 changed files with 2380 additions and 5193 deletions


@@ -1,19 +1,7 @@
# GitHub Copilot Instructions for Perplexica
This file provides context and guidance for GitHub Copilot when working with the Perplexica codebase.
## Project Overview
# Project Overview
Perplexica is an open-source AI-powered search engine that uses advanced machine learning to provide intelligent search results. It combines web search capabilities with LLM-based processing to understand and answer user questions, similar to Perplexity AI but fully open source.
## Key Components
- **Frontend**: Next.js application with React components (in `/src/components` and `/src/app`)
- **Backend Logic**: Node.js backend with API routes (in `/src/app/api`) and library code (in `/src/lib`)
- **Search Engine**: Uses SearXNG as a metadata search engine
- **LLM Integration**: Supports multiple models including OpenAI, Anthropic, Groq, Ollama (local models)
- **Database**: SQLite database managed with Drizzle ORM
## Architecture
The system works through these main steps:
@@ -29,20 +17,24 @@ The system works through these main steps:
- **Frontend**: React, Next.js, Tailwind CSS
- **Backend**: Node.js
- **Database**: SQLite with Drizzle ORM
- **AI/ML**: LangChain for orchestration, various LLM providers
- **AI/ML**: LangChain for orchestration, various LLM providers including OpenAI, Anthropic, Groq, Ollama (local models)
- **Search**: SearXNG integration
- **Embedding Models**: For re-ranking search results
## Project Structure
- `/src/app`: Next.js app directory with page components and API routes
- `/src/app/api`: API endpoints for search and LLM interactions
- `/src/components`: Reusable UI components
- `/src/lib`: Backend functionality
- `/lib/search`: Search functionality and meta search agent
- `/lib/db`: Database schema and operations
- `/lib/providers`: LLM and embedding model integrations
- `/lib/prompts`: Prompt templates for LLMs
- `/lib/chains`: LangChain chains for various operations
- `lib/search`: Search functionality and meta search agent
- `lib/db`: Database schema and operations
- `lib/providers`: LLM and embedding model integrations
- `lib/prompts`: Prompt templates for LLMs
- `lib/chains`: LangChain chains for various operations
- `lib/agents`: LangGraph agents for advanced processing
- `lib/tools`: LangGraph tools for use by agents
- `lib/utils`: Utility functions and types including web content retrieval and processing
## Focus Modes
@@ -84,6 +76,8 @@ When working on this codebase, you might need to:
- Update database schema in `/src/lib/db/schema.ts`
- Create new prompt templates in `/src/lib/prompts`
- Build new chains in `/src/lib/chains`
- Implement new LangGraph agents in `/src/lib/agents`
- Create new tools for LangGraph agents in `/src/lib/tools`
## AI Behavior
@@ -92,3 +86,5 @@ When working on this codebase, you might need to:
- If you don't know the answer, ask for clarification
- Do not add additional packages or dependencies unless explicitly requested
- Only make changes to the code that are relevant to the task at hand
- Do not create new files to test changes
- Do not run the application unless asked


@@ -219,9 +219,9 @@ This fork adds several enhancements to the original Perplexica project:
- Added BASE_URL config to support reverse proxy deployments
- Added autocomplete functionality proxied to SearxNG
- ✅ Enhanced Reddit focus mode to work around SearxNG limitations
- ✅ Adds Quality mode that uses the full content of web pages to answer queries
- ✅ Enhanced Balance mode that uses a headless web browser to retrieve web content and use relevant excerpts to enhance responses
- ✅ Adds Agent mode that uses the full content of web pages and an agentic workflow to accurately answer complex, multi-part queries
- See the [README.md](docs/architecture/README.md) in the docs architecture directory for more info
- Enhanced Balanced mode, which uses relevant excerpts of web content to answer queries
### AI Functionality


@@ -61,7 +61,7 @@ The API accepts a JSON object in the request body, where you define the focus mo
- `speed`: Prioritize speed and get the quickest possible answer. Minimum effort retrieving web content. - Only uses SearXNG result previews.
- `balanced`: Find the right balance between speed and accuracy. Medium effort retrieving web content. - Uses web scraping technologies to retrieve partial content from full web pages.
- `quality`: Get the most thorough and accurate answer. High effort retrieving web content. Requires a good AI model. May take a long time. - Uses web scraping technologies to retrieve and summarize full web content.
- `agent`: Use an agentic workflow to answer complex multi-part questions. This mode requires a model that is trained for tool use.
- **`query`** (string, required): The search query or question.
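A request using the new mode might look like the following sketch. The `/api/search` endpoint path and the `webSearch` focus-mode value are assumptions for illustration, not taken from this diff:

```typescript
// Hypothetical request body for the search API. The '/api/search' path and
// the 'webSearch' focus mode are assumptions, not confirmed by this commit.
const searchRequest = {
  focusMode: 'webSearch',
  optimizationMode: 'agent', // one of: 'speed' | 'balanced' | 'quality' | 'agent'
  query: 'Compare the 2023 and 2024 F1 constructor standings',
};

const payload = JSON.stringify(searchRequest);

// Sending it could look like:
// await fetch('/api/search', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: payload,
// });
```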


@@ -8,13 +8,11 @@ Perplexica's architecture consists of the following key components:
4. **LLMs (Large Language Models)**: Utilized by agents and chains for tasks like understanding content, writing responses, and citing sources. Examples include Claude, GPTs, etc.
5. **Embedding Models**: To improve the accuracy of search results, embedding models re-rank the results using similarity search algorithms such as cosine similarity and dot product distance.
6. **Web Content**
- In Quality mode the application uses Crawlee, Playwright, and Chromium to load web content into a real full browser
- This significantly increases the size of the docker image and also means it can only run on x64 architectures
- The docker build has been updated to restrict images to linux/amd64 architecture
- In Balanced mode, the application uses jsdom and Mozilla's Readability to retrieve and rank relevant segments of web content
- This approach is less successful than Quality as it doesn't use a full web browser and can't load dynamic content
- It is also more prone to being blocked by ads or scraping detection
- Because it only uses segments of web content, it can be less accurate than Quality mode
- In Agent mode, the application uses an agentic workflow to answer complex multi-part questions
- The agent can use reasoning steps to provide comprehensive answers to complex questions
- Agent mode is experimental and may consume lots of tokens and take a long time to produce responses
- In Balanced mode, the application retrieves web content using Playwright and Mozilla Readability to extract relevant segments of web content
- Because it only uses segments of web content, it can be less accurate than Agent mode
- In Speed mode, the application only uses the preview content returned by SearXNG
- This content is provided by the search engines and contains minimal context from the actual web page
- This mode is the least accurate and is often prone to hallucination

package-lock.json (generated, 4027 lines changed): file diff suppressed because it is too large.


@@ -15,12 +15,13 @@
"@headlessui/react": "^2.2.0",
"@iarna/toml": "^2.2.5",
"@icons-pack/react-simple-icons": "^12.3.0",
"@langchain/anthropic": "^0.3.15",
"@langchain/community": "^0.3.36",
"@langchain/core": "^0.3.42",
"@langchain/google-genai": "^0.1.12",
"@langchain/anthropic": "^0.3.21",
"@langchain/community": "^0.3.45",
"@langchain/core": "^0.3.57",
"@langchain/google-genai": "^0.2.10",
"@langchain/langgraph": "^0.3.1",
"@langchain/ollama": "^0.2.0",
"@langchain/openai": "^0.0.25",
"@langchain/openai": "^0.5.12",
"@langchain/textsplitters": "^0.1.0",
"@mozilla/readability": "^0.6.0",
"@tailwindcss/typography": "^0.5.12",
@@ -28,10 +29,10 @@
"@xenova/transformers": "^2.17.2",
"axios": "^1.8.3",
"better-sqlite3": "^11.9.1",
"cheerio": "^1.1.0",
"clsx": "^2.1.0",
"compute-cosine-similarity": "^1.1.0",
"compute-dot": "^1.1.0",
"crawlee": "^3.13.5",
"drizzle-orm": "^0.40.1",
"html-to-text": "^9.0.5",
"jsdom": "^26.1.0",
@@ -41,7 +42,7 @@
"next": "^15.2.2",
"next-themes": "^0.3.0",
"pdf-parse": "^1.1.1",
"playwright": "*",
"playwright": "^1.52.0",
"react": "^18",
"react-dom": "^18",
"react-syntax-highlighter": "^15.6.1",


@@ -43,7 +43,7 @@ type EmbeddingModel = {
type Body = {
message: Message;
optimizationMode: 'speed' | 'balanced' | 'quality';
optimizationMode: 'speed' | 'balanced' | 'agent';
focusMode: string;
history: Array<[string, string]>;
files: Array<string>;


@@ -30,7 +30,7 @@ interface embeddingModel {
}
interface ChatRequestBody {
optimizationMode: 'speed' | 'balanced';
optimizationMode: 'speed' | 'balanced' | 'agent';
focusMode: string;
chatModel?: chatModel;
embeddingModel?: embeddingModel;
@@ -128,7 +128,9 @@ export const POST = async (req: Request) => {
const abortController = new AbortController();
const { signal } = abortController;
const promptData = await getSystemPrompts(body.selectedSystemPromptIds || []);
const promptData = await getSystemPrompts(
body.selectedSystemPromptIds || [],
);
const emitter = await searchHandler.searchAndAnswer(
body.query,
@@ -139,7 +141,7 @@
[],
promptData.systemInstructions,
signal,
promptData.personaInstructions
promptData.personaInstructions,
);
if (!body.stream) {


@@ -1,4 +1,4 @@
import { ChevronDown, Minimize2, Sliders, Star, Zap } from 'lucide-react';
import { ChevronDown, Minimize2, Sliders, Star, Zap, Bot } from 'lucide-react';
import { cn } from '@/lib/utils';
import {
Popover,
@@ -22,17 +22,24 @@ const OptimizationModes = [
'Find the right balance between speed and accuracy. Medium effort retrieving web content.',
icon: <Sliders size={20} className="text-[#4CAF50]" />,
},
// {
// key: 'quality',
// title: 'Quality',
// description:
// 'Get the most thorough and accurate answer. High effort retrieving web content. Requires a good AI model. May take a long time.',
// icon: (
// <Star
// size={16}
// className="text-[#2196F3] dark:text-[#BBDEFB] fill-[#BBDEFB] dark:fill-[#2196F3]"
// />
// ),
// },
{
key: 'quality',
title: 'Quality',
key: 'agent',
title: 'Agent (Experimental)',
description:
'Get the most thorough and accurate answer. High effort retrieving web content. Requires a good AI model. May take a long time.',
icon: (
<Star
size={16}
className="text-[#2196F3] dark:text-[#BBDEFB] fill-[#BBDEFB] dark:fill-[#2196F3]"
/>
),
'Use an agentic workflow to answer complex multi-part questions. This mode may take longer and is experimental. It uses large prompts and may not work with all models. Best with at least an 8b model that supports 32k context or more.',
icon: <Bot size={20} className="text-[#9C27B0]" />,
},
];
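A consumer of this list would typically resolve the selected mode by its `key`. A minimal sketch, with the array re-declared here without the JSX icons purely for illustration:

```typescript
type OptimizationMode = { key: string; title: string; description: string };

// Icon-free stand-in for the OptimizationModes array above, for illustration only.
const modes: OptimizationMode[] = [
  { key: 'speed', title: 'Speed', description: 'Fastest, preview-only answers.' },
  { key: 'balanced', title: 'Balanced', description: 'Medium effort retrieving web content.' },
  { key: 'agent', title: 'Agent (Experimental)', description: 'Agentic workflow for complex questions.' },
];

// Fall back to 'balanced' when an unknown key (e.g. the retired 'quality') is requested.
function resolveMode(key: string): OptimizationMode {
  return modes.find((m) => m.key === key) ?? modes.find((m) => m.key === 'balanced')!;
}
```

The fallback choice is an assumption; the UI may instead hide retired keys entirely.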

src/instrumentation.ts (new file, 6 lines)

@@ -0,0 +1,6 @@
export async function register() {
if (process.env.NEXT_RUNTIME === 'nodejs') {
// Import error suppression when the server starts
await import('./lib/utils/errorSuppression');
}
}
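The imported `errorSuppression` module is not shown in this diff. One plausible shape, purely an assumption, is a filter installed over the error log that drops known-noisy messages at server start-up:

```typescript
// Hypothetical sketch of what ./lib/utils/errorSuppression might do.
// The pattern list and the factory name are invented for illustration.
type Sink = (message: string) => void;

const suppressedPatterns = [/AbortError/, /ECONNRESET/];

function makeSuppressingLogger(sink: Sink): Sink {
  return (message: string) => {
    // Drop messages matching a known-noise pattern; forward everything else.
    if (suppressedPatterns.some((p) => p.test(message))) return;
    sink(message);
  };
}

// Installing it over console.error at start-up could look like:
// const original = console.error.bind(console);
// console.error = (...args) => makeSuppressingLogger(original)(args.map(String).join(' '));
```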


@@ -6,7 +6,7 @@ const prompts = {
webSearchResponsePrompt,
webSearchRetrieverPrompt,
localResearchPrompt,
chatPrompt
chatPrompt,
};
export default prompts;


@@ -171,6 +171,148 @@ Everything below is the part of the actual conversation
</question>
`;
export const webSearchRetrieverAgentPrompt = `
# Instructions
- You are an AI question rephraser
- You will be given a conversation and a user question
- Rephrase the question so it is appropriate for web search
- Only add additional information or change the meaning of the question when it is necessary for clarity or relevance to the conversation, such as adding a date or time for current events, or using conversation history to augment the question with relevant context
- Do not make up any new information like links or URLs
- Condense the question to its essence and remove any unnecessary details
- Ensure the question is grammatically correct and free of spelling errors
- If the input is a simple writing task or a greeting such as Hi, Hello, or How are you (unless the greeting is followed by a question) rather than a question, return \`not_needed\` as the response in the <answer> XML block
- If you are a thinking or reasoning AI, do not use <answer> and </answer> or <links> and </links> tags in your thinking. Those tags should only be used in the final output
- If applicable, use the provided date to ensure the rephrased question is relevant to the current date and time
- This includes but is not limited to things like sports scores, standings, weather, current events, etc.
- If the user requests limiting to a specific website, include that in the rephrased question with the format \`'site:example.com'\`, be sure to include the quotes. Only do this if the limiting is explicitly mentioned in the question
- You will be given additional instructions from a supervisor in the <supervisor> tag that will direct you to refine the question further or to include specific details. Follow these instructions carefully and incorporate them into your rephrased question
# Data
- The user question is contained in the <question> tag after the <examples> below
- You must always return the rephrased question inside an <answer> XML block, if there are no links in the follow-up question then don't insert a <links> XML block in your response
- Current date is: {date}
- Do not include any other text in your answer
# System Instructions
- These instructions are provided by the user in the <systemInstructions> tag
- Give them less priority than the above instructions
- Incorporate them into your response while adhering to the overall guidelines
- Only use them for additional context on how to retrieve search results (E.g. if the user has provided a specific website to search, or if they have provided a specific date to use in the search)
Several examples are provided for your reference inside the <examples> XML block below
<examples>
<example>
<input>
<question>
What were the highlights of the race?
</question>
</input>
<output>
<answer>
F1 Monaco Grand Prix highlights
</answer>
</output>
</example>
<example>
<input>
<question>
What is the capital of France
</question>
</input>
<output>
<answer>
Capital of France
</answer>
</output>
</example>
<example>
<input>
<question>
Hi, how are you?
</question>
</input>
<output>
<answer>
not_needed
</answer>
</output>
</example>
<example>
<input>
<question>
What is the weather like there? Use weather.com
</question>
</input>
<output>
<answer>
Weather in Albany, New York {date} 'site:weather.com'
</answer>
</output>
</example>
<example>
<input>
<question>
Get the current F1 constructor standings and return the results in a table
</question>
</input>
<output>
<answer>
{date} F1 constructor standings
</answer>
</output>
</example>
<example>
<input>
<question>
What are the top 10 restaurants in New York? Show the results in a table and include a short description of each restaurant. Only include results from yelp.com
</question>
</input>
<output>
<answer>
Top 10 restaurants in New York on {date} 'site:yelp.com'
</answer>
</output>
</example>
<example>
<input>
<question>
What are the top 10 restaurants in New York, Chicago, and Boston?
</question>
<supervisor>
Find the top 10 restaurants in New York.
</supervisor>
</input>
<output>
<answer>
Top 10 restaurants in New York on {date}
</answer>
</output>
</example>
</examples>
Everything below is part of the actual conversation
<systemInstructions>
{systemInstructions}
</systemInstructions>
<question>
{query}
</question>
<supervisor>
{supervisor}
</supervisor>
`;
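The `<answer>` block this prompt mandates is later extracted with `LineOutputParser`. A minimal regex-based stand-in for that extraction (an assumption about the real parser, which lives in `../outputParsers/lineOutputParser`) looks like:

```typescript
// Minimal stand-in for LineOutputParser({ key: 'answer' }); the real
// implementation may differ.
function parseTag(text: string, key: string): string {
  const match = text.match(new RegExp(`<${key}>([\\s\\S]*?)</${key}>`));
  return match ? match[1].trim() : '';
}

const modelOutput = `
<answer>
F1 Monaco Grand Prix highlights
</answer>
`;

const query = parseTag(modelOutput, 'answer');
// A 'not_needed' answer means no web search should be performed.
const searchNeeded = query !== 'not_needed';
```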
export const webSearchResponsePrompt = `
You are Perplexica, an AI model skilled in web search and crafting detailed, engaging, and well-structured answers. You excel at summarizing web pages and extracting relevant information to create professional, blog-style responses


@@ -0,0 +1,517 @@
import { Embeddings } from '@langchain/core/embeddings';
import { BaseChatModel } from '@langchain/core/language_models/chat_models';
import {
BaseMessage,
HumanMessage,
SystemMessage,
} from '@langchain/core/messages';
import { ChatPromptTemplate, PromptTemplate } from '@langchain/core/prompts';
import {
Annotation,
Command,
END,
MemorySaver,
START,
StateGraph,
} from '@langchain/langgraph';
import { EventEmitter } from 'events';
import { Document } from 'langchain/document';
import LineOutputParser from '../outputParsers/lineOutputParser';
import { webSearchRetrieverAgentPrompt } from '../prompts/webSearch';
import { searchSearxng } from '../searxng';
import { formatDateForLLM } from '../utils';
import { getModelName } from '../utils/modelUtils';
import { summarizeWebContent } from '../utils/summarizeWebContent';
/**
* State interface for the agent supervisor workflow
*/
export const AgentState = Annotation.Root({
messages: Annotation<BaseMessage[]>({
reducer: (x, y) => x.concat(y),
default: () => [],
}),
query: Annotation<string>({
reducer: (x, y) => y ?? x,
default: () => '',
}),
relevantDocuments: Annotation<Document[]>({
reducer: (x, y) => x.concat(y),
default: () => [],
}),
bannedUrls: Annotation<string[]>({
reducer: (x, y) => x.concat(y),
default: () => [],
}),
searchInstructionHistory: Annotation<string[]>({
reducer: (x, y) => x.concat(y),
default: () => [],
}),
searchInstructions: Annotation<string>({
reducer: (x, y) => y ?? x,
default: () => '',
}),
next: Annotation<string>({
reducer: (x, y) => y ?? x ?? END,
default: () => END,
}),
analysis: Annotation<string>({
reducer: (x, y) => y ?? x,
default: () => '',
}),
});
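The reducers above define how LangGraph merges channel updates between steps: array channels accumulate, while scalar channels keep the latest non-null value. A dependency-free sketch of the two reducer styles:

```typescript
// The two reducer styles used by AgentState, shown without LangGraph.
const concatReducer = (x: string[], y: string[]) => x.concat(y);
const lastValueReducer = (x: string, y?: string) => y ?? x;

// Simulate two workflow steps updating an accumulating channel...
let docs: string[] = [];
docs = concatReducer(docs, ['summary of page A']);
docs = concatReducer(docs, ['summary of page B']);

// ...and a latest-value channel, where a missing update keeps the prior value.
let instructions = 'initial question';
instructions = lastValueReducer(instructions, 'narrow the search to New York');
instructions = lastValueReducer(instructions, undefined);
```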
/**
* Agent Search class implementing LangGraph Supervisor pattern
*/
export class AgentSearch {
private llm: BaseChatModel;
private embeddings: Embeddings;
private checkpointer: MemorySaver;
private emitter: EventEmitter;
private systemInstructions: string;
private personaInstructions: string;
private signal: AbortSignal;
constructor(
llm: BaseChatModel,
embeddings: Embeddings,
emitter: EventEmitter,
systemInstructions: string = '',
personaInstructions: string = '',
signal: AbortSignal,
) {
this.llm = llm;
this.embeddings = embeddings;
this.checkpointer = new MemorySaver();
this.emitter = emitter;
this.systemInstructions = systemInstructions;
this.personaInstructions = personaInstructions;
this.signal = signal;
}
/**
* Web search agent node
*/
private async webSearchAgent(
state: typeof AgentState.State,
): Promise<Command> {
const template = PromptTemplate.fromTemplate(webSearchRetrieverAgentPrompt);
const prompt = await template.format({
systemInstructions: this.systemInstructions,
query: state.query,
date: formatDateForLLM(new Date()),
supervisor: state.searchInstructions,
});
const searchQueryResult = await this.llm.invoke(
[...state.messages, prompt],
{ signal: this.signal },
);
// Parse the response to extract the search query using LineOutputParser
const lineOutputParser = new LineOutputParser({ key: 'answer' });
const searchQuery = await lineOutputParser.parse(
searchQueryResult.content as string,
);
try {
console.log(`Performing web search for query: "${searchQuery}"`);
const searchResults = await searchSearxng(searchQuery, {
language: 'en',
engines: [],
});
let bannedUrls = state.bannedUrls || [];
let attemptedUrlCount = 0;
// Summarize the top 2 search results
let documents: Document[] = [];
for (const result of searchResults.results) {
if (bannedUrls.includes(result.url)) {
console.log(`Skipping banned URL: ${result.url}`);
continue; // Skip banned URLs
}
if (attemptedUrlCount >= 5) {
console.warn(
'Too many attempts to summarize URLs, stopping further attempts.',
);
break; // Limit the number of attempts to summarize URLs
}
attemptedUrlCount++;
bannedUrls.push(result.url); // Add to banned URLs to avoid duplicates
if (documents.length >= 1) {
break; // Limit to top 1 document
}
const summary = await summarizeWebContent(
result.url,
state.query,
this.llm,
this.systemInstructions,
this.signal,
);
if (summary) {
documents.push(summary);
console.log(
`Summarized content from ${result.url} to ${summary.pageContent.length} characters. Content: ${summary.pageContent}`,
);
} else {
console.warn(`No relevant content found for URL: ${result.url}`);
}
}
if (documents.length === 0) {
return new Command({
goto: 'analyzer',
update: {
messages: [new SystemMessage('No relevant documents found.')],
},
});
}
const responseMessage = `Web search completed. Found ${documents.length} results that are relevant to the query.`;
console.log(responseMessage);
return new Command({
goto: 'analyzer',
update: {
messages: [new SystemMessage(responseMessage)],
relevantDocuments: documents,
bannedUrls: bannedUrls,
},
});
} catch (error) {
console.error('Web search error:', error);
const errorMessage = new SystemMessage(
`Web search failed: ${error instanceof Error ? error.message : 'Unknown error'}`,
);
return new Command({
goto: END,
update: {
messages: [errorMessage],
},
});
}
}
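The loop above enforces three limits: skip URLs already tried in earlier rounds, cap summarization attempts at five per search, and keep at most one summarized document. That selection policy can be isolated as a pure function for clarity (a sketch; the function and parameter names are invented):

```typescript
// Pure sketch of the URL-selection policy used in webSearchAgent.
// The summarizer is passed in so the policy itself stays testable.
function summarizeTopResult(
  resultUrls: string[],
  bannedUrls: string[],
  trySummarize: (url: string) => string | null,
  maxAttempts = 5,
): { documents: string[]; banned: string[] } {
  const banned = [...bannedUrls];
  const documents: string[] = [];
  let attempts = 0;
  for (const url of resultUrls) {
    if (banned.includes(url)) continue; // skip URLs from earlier rounds
    if (attempts >= maxAttempts) break; // cap total attempts per search
    attempts++;
    banned.push(url); // never revisit, even on failure
    if (documents.length >= 1) break; // keep at most one document
    const summary = trySummarize(url);
    if (summary) documents.push(summary);
  }
  return { documents, banned };
}
```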
private async analyzer(state: typeof AgentState.State): Promise<Command> {
try {
console.log(
`Analyzing ${state.relevantDocuments.length} documents for relevance...`,
);
const analysisPromptTemplate = `You are an expert content analyzer. Your task is to analyze the provided document and determine if we have enough relevant information to fully answer the user's query. If the content is not sufficient, you will suggest a more specific search query to gather additional information.
# Instructions
- Carefully analyze the content of the context provided and determine if it contains sufficient information to answer the user's query
- The content should completely address the query, providing detailed explanations, relevant facts, and necessary context
- Use the content provided in the \`context\` tag, as well as the historical context of the conversation, to make your determination
- If the context provides conflicting information, explain the discrepancies and what additional information is needed to resolve them
- Today's date is ${formatDateForLLM(new Date())}
# Output Format
- If the content is sufficient, respond with "good_content" in an <answer> XML tag
- If the content is not sufficient, respond with "need_more_info" in an <answer> XML tag and provide a detailed question that would help gather more specific information to answer the query in a <question> XML tag
- This question will be used to generate a web search query to gather more information and should be specific, actionable, and focused on the gaps in the current content
- This step will be repeated until sufficient information is gathered to answer the query. Do not try to answer the entire query at once
- It should be concise and avoid pleasantries or unnecessary details
- Break down the query into a smaller, more focused question that can be answered with a web search
- For example, if the query is asking about specific information from multiple locations, break the query into one smaller query for a single location
- If the query asks about a complex topic, break it down into a single smaller question; the remaining parts can be addressed in later iterations
- Avoid asking for general information or vague details; focus on specific, actionable questions that can lead to concrete answers
- Avoid giving the same guidance more than once, and avoid repeating the same question multiple times
- Respond with your answer in an <answer> XML tag
- If you need more information, provide a detailed question in a <question> XML tag
# Refinement History
- The following questions have been asked to refine the search
${state.searchInstructionHistory.map((question) => ` - ${question}`).join('\n')}
# System Instructions
- The system instructions provided to you are:
{systemInstructions}
# Example Output
- If the content is sufficient:
<answer>good_content</answer>
- If the content is not sufficient:
<answer>need_more_info</answer>
<question>A question that would help gather more specific information to answer the query?</question>
# Context
<context>
{context}
</context>
`;
const analysisPrompt = await ChatPromptTemplate.fromTemplate(
analysisPromptTemplate,
).format({
systemInstructions: this.systemInstructions,
context: state.relevantDocuments
.map((doc) => doc.pageContent)
.join('\n\n'),
});
const response = await this.llm.invoke(
[...state.messages, new SystemMessage(analysisPrompt)],
{ signal: this.signal },
);
// Parse the response to extract the analysis result
const analysisOutputParser = new LineOutputParser({ key: 'answer' });
const moreInfoOutputParser = new LineOutputParser({ key: 'question' });
const analysisResult = await analysisOutputParser.parse(
response.content as string,
);
const moreInfoQuestion = await moreInfoOutputParser.parse(
response.content as string,
);
console.log('Analysis result:', analysisResult);
console.log('More info question:', moreInfoQuestion);
if (analysisResult.startsWith('need_more_info')) {
return new Command({
goto: 'web_search',
update: {
messages: [
new SystemMessage(
`The following question can help refine the search: ${moreInfoQuestion}`,
),
],
searchInstructions: moreInfoQuestion,
searchInstructionHistory: [moreInfoQuestion],
},
});
}
return new Command({
goto: 'synthesizer',
update: {
messages: [
new SystemMessage(
`Analysis completed. We have sufficient information to answer the query.`,
),
],
},
});
} catch (error) {
console.error('Analysis error:', error);
const errorMessage = new SystemMessage(
`Analysis failed: ${error instanceof Error ? error.message : 'Unknown error'}`,
);
return new Command({
goto: END,
update: {
messages: [errorMessage],
},
});
}
}
/**
* Synthesizer agent node that combines information to answer the query
*/
private async synthesizerAgent(
state: typeof AgentState.State,
): Promise<Command> {
try {
const synthesisPrompt = `You are an expert information synthesizer. Based on the search results and analysis provided, create a comprehensive, well-structured answer to the user's query.
## Response Instructions
Your task is to provide answers that are:
- **Informative and relevant**: Thoroughly address the user's query using the given context
- **Well-structured**: Include clear headings and subheadings, and use a professional tone to present information concisely and logically
- **Engaging and detailed**: Write responses that read like a high-quality blog post, including extra details and relevant insights
- **Cited and credible**: Use inline citations with [number] notation to refer to the context source(s) for each fact or detail included
- **Explanatory and Comprehensive**: Strive to explain the topic in depth, offering detailed analysis, insights, and clarifications wherever applicable
### Formatting Instructions
- **Structure**: Use a well-organized format with proper headings (e.g., "## Example heading 1" or "## Example heading 2"). Present information in paragraphs or concise bullet points where appropriate
- **Tone and Style**: Maintain a neutral, journalistic tone with engaging narrative flow. Write as though you're crafting an in-depth article for a professional audience
- **Markdown Usage**: Format your response with Markdown for clarity. Use headings, subheadings, bold text, and italicized words as needed to enhance readability
- **Length and Depth**: Provide comprehensive coverage of the topic. Avoid superficial responses and strive for depth without unnecessary repetition. Expand on technical or complex topics to make them easier to understand for a general audience
- **No main heading/title**: Start your response directly with the introduction unless asked to provide a specific title
- **Conclusion or Summary**: Include a concluding paragraph that synthesizes the provided information or suggests potential next steps, where appropriate
### Persona Instructions
- Additional user specified persona instructions are provided in the <personaInstructions> tag
### Citation Requirements
- Cite every single fact, statement, or sentence using [number] notation corresponding to the source from the provided \`context\`
- Integrate citations naturally at the end of sentences or clauses as appropriate. For example, "The Eiffel Tower is one of the most visited landmarks in the world[1]."
- Ensure that **every sentence in your response includes at least one citation**, even when information is inferred or connected to general knowledge available in the provided context
- Use multiple sources for a single detail if applicable, such as, "Paris is a cultural hub, attracting millions of visitors annually[1][2]."
- Always prioritize credibility and accuracy by linking all statements back to their respective context sources
- Avoid citing unsupported assumptions or personal interpretations; if no source supports a statement, clearly indicate the limitation
### Example Output
- Begin with a brief introduction summarizing the event or query topic
- Follow with detailed sections under clear headings, covering all aspects of the query if possible
- Provide explanations or historical context as needed to enhance understanding
- End with a conclusion or overall perspective if relevant
<personaInstructions>
${this.personaInstructions}
</personaInstructions>
User Query: ${state.query}
Available Information:
${state.relevantDocuments
.map(
(doc, index) =>
`<${index + 1}>\n
<title>${doc.metadata.title}</title>\n
${doc.metadata.url?.toLowerCase().includes('file') ? '' : '\n<url>' + doc.metadata.url + '</url>\n'}
<content>\n${doc.pageContent}\n</content>\n
</${index + 1}>`,
)
.join('\n')}
`;
// Stream the response in real-time using LLM streaming capabilities
let fullResponse = '';
// Emit the sources as a data response
this.emitter.emit(
'data',
JSON.stringify({
type: 'sources',
data: state.relevantDocuments,
searchQuery: '',
searchUrl: '',
}),
);
const stream = await this.llm.stream(
[new SystemMessage(synthesisPrompt), new HumanMessage(state.query)],
{ signal: this.signal },
);
for await (const chunk of stream) {
if (this.signal.aborted) {
break;
}
const content = chunk.content;
if (typeof content === 'string' && content.length > 0) {
fullResponse += content;
// Emit each chunk as a data response in real-time
this.emitter.emit(
'data',
JSON.stringify({
type: 'response',
data: content,
}),
);
}
}
// Emit model stats and end signal after streaming is complete
const modelName = getModelName(this.llm);
this.emitter.emit(
'stats',
JSON.stringify({
type: 'modelStats',
data: { modelName },
}),
);
this.emitter.emit('end');
// Create the final response message with the complete content
const response = new SystemMessage(fullResponse);
return new Command({
goto: END,
update: {
messages: [response],
},
});
} catch (error) {
console.error('Synthesis error:', error);
const errorMessage = new SystemMessage(
`Failed to synthesize answer: ${error instanceof Error ? error.message : 'Unknown error'}`,
);
return new Command({
goto: END,
update: {
messages: [errorMessage],
},
});
}
}
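On the consuming side, the `sources`, `response`, and `end` events emitted above can be reassembled into a final answer. A minimal sketch using Node's `EventEmitter`; the payload shapes mirror the `JSON.stringify` calls in `synthesizerAgent`, while the collector itself is invented for illustration:

```typescript
import { EventEmitter } from 'events';

// Sketch of a consumer for the 'data'/'end' events emitted by synthesizerAgent.
function attachCollector(emitter: EventEmitter) {
  const state = { answer: '', sources: [] as unknown[], finished: false };
  emitter.on('data', (raw: string) => {
    const parsed = JSON.parse(raw);
    if (parsed.type === 'response') state.answer += parsed.data; // streamed chunk
    if (parsed.type === 'sources') state.sources = parsed.data; // cited documents
  });
  emitter.on('end', () => {
    state.finished = true;
  });
  return state;
}

// EventEmitter dispatches synchronously, so the demo below completes inline.
const demo = new EventEmitter();
const collected = attachCollector(demo);
demo.emit('data', JSON.stringify({ type: 'sources', data: [{ url: 'https://example.com' }] }));
demo.emit('data', JSON.stringify({ type: 'response', data: 'Hello ' }));
demo.emit('data', JSON.stringify({ type: 'response', data: 'world' }));
demo.emit('end');
```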
/**
* Create and compile the agent workflow graph
*/
private createWorkflow() {
const workflow = new StateGraph(AgentState)
// .addNode('supervisor', this.supervisor.bind(this), {
// ends: ['web_search', 'analyzer', 'synthesizer', END],
// })
.addNode('web_search', this.webSearchAgent.bind(this), {
ends: ['analyzer'],
})
.addNode('analyzer', this.analyzer.bind(this), {
ends: ['web_search', 'synthesizer'],
})
// .addNode("url_analyzer", this.urlAnalyzerAgent.bind(this), {
// ends: ["supervisor"],
// })
.addNode('synthesizer', this.synthesizerAgent.bind(this), {
ends: [END],
})
.addEdge(START, 'analyzer');
return workflow.compile({ checkpointer: this.checkpointer });
}
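The compiled graph is a loop: the run starts at `analyzer` (which, with no documents yet, routes to `web_search`), `web_search` always returns to `analyzer`, and `synthesizer` ends the run. That control flow can be sketched as a plain state machine, with no LangGraph required; `needMoreInfo` stands in for the LLM's `need_more_info` decision:

```typescript
// Dependency-free sketch of the analyzer/web_search/synthesizer loop.
function runSketchWorkflow(
  needMoreInfo: (round: number) => boolean,
  recursionLimit = 15,
): string[] {
  const visited: string[] = [];
  let node = 'analyzer';
  let round = 0;
  while (node !== 'END' && visited.length < recursionLimit) {
    visited.push(node);
    if (node === 'analyzer') {
      node = needMoreInfo(round) ? 'web_search' : 'synthesizer';
      round++;
    } else if (node === 'web_search') {
      node = 'analyzer'; // web_search always hands back to analyzer
    } else {
      node = 'END'; // synthesizer ends the run
    }
  }
  return visited;
}
```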
/**
* Execute the agent search workflow
*/
async searchAndAnswer(query: string, history: BaseMessage[] = []) {
const workflow = this.createWorkflow();
try {
const initialState = {
messages: [...history, new HumanMessage(query)],
query,
};
const result = await workflow.invoke(initialState, {
configurable: { thread_id: `agent_search_${Date.now()}` },
recursionLimit: 15,
});
return result;
} catch (error) {
console.error('Agent workflow error:', error);
// Fallback to a simple response
const fallbackResponse = await this.llm.invoke(
[
new SystemMessage(
"You are a helpful assistant. The advanced agent workflow failed, so please provide a basic response to the user's query based on your knowledge.",
),
new HumanMessage(query),
],
{ signal: this.signal },
);
return {
messages: [...history, new HumanMessage(query), fallbackResponse],
query,
searchResults: [],
next: END,
analysis: '',
};
}
}
}
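The synthesizer above streams its answer as JSON-encoded `'data'` events on a shared `EventEmitter`, followed by a `'stats'` payload and an `'end'` signal. A minimal sketch of how a consumer might reassemble that stream, using plain Node `events` and the payload shape shown in the code (the producer side here is simulated, not the real agent):

```typescript
import { EventEmitter } from 'events';

// Reassemble the streamed answer from 'data' events, mirroring the
// { type: 'response', data: chunk } payloads emitted by the synthesizer.
const emitter = new EventEmitter();
let fullResponse = '';
let ended = false;

emitter.on('data', (raw: string) => {
  const parsed = JSON.parse(raw);
  if (parsed.type === 'response') {
    fullResponse += parsed.data;
  }
});
emitter.on('end', () => {
  ended = true;
});

// Simulate the producer side: two chunks, then the end signal.
emitter.emit('data', JSON.stringify({ type: 'response', data: 'Hello, ' }));
emitter.emit('data', JSON.stringify({ type: 'response', data: 'world' }));
emitter.emit('end');
```

Because the events are emitted synchronously here, `fullResponse` holds the complete answer as soon as `emit('end')` returns.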


@ -21,15 +21,12 @@ import path from 'node:path';
import LineOutputParser from '../outputParsers/lineOutputParser';
import LineListOutputParser from '../outputParsers/listLineOutputParser';
import { searchSearxng } from '../searxng';
import { formatDateForLLM } from '../utils';
import computeSimilarity from '../utils/computeSimilarity';
import {
getDocumentsFromLinks,
getWebContent,
getWebContentLite,
} from '../utils/documents';
import { getDocumentsFromLinks, getWebContent } from '../utils/documents';
import formatChatHistoryAsString from '../utils/formatHistory';
import { getModelName } from '../utils/modelUtils';
import { formatDateForLLM } from '../utils';
import { AgentSearch } from './agentSearch';
export interface MetaSearchAgentType {
searchAndAnswer: (
@ -37,7 +34,7 @@ export interface MetaSearchAgentType {
history: BaseMessage[],
llm: BaseChatModel,
embeddings: Embeddings,
optimizationMode: 'speed' | 'balanced' | 'quality',
optimizationMode: 'speed' | 'balanced' | 'agent',
fileIds: string[],
systemInstructions: string,
signal: AbortSignal,
@ -306,7 +303,7 @@ class MetaSearchAgent implements MetaSearchAgentType {
llm: BaseChatModel,
fileIds: string[],
embeddings: Embeddings,
optimizationMode: 'speed' | 'balanced' | 'quality',
optimizationMode: 'speed' | 'balanced' | 'agent',
systemInstructions: string,
signal: AbortSignal,
emitter: eventEmitter,
@ -392,9 +389,7 @@ class MetaSearchAgent implements MetaSearchAgentType {
.pipe(this.processDocs),
}),
ChatPromptTemplate.fromMessages([
[
'system', this.config.responsePrompt,
],
['system', this.config.responsePrompt],
new MessagesPlaceholder('chat_history'),
['user', '{query}'],
]),
@ -405,128 +400,12 @@ class MetaSearchAgent implements MetaSearchAgentType {
});
}
private async checkIfEnoughInformation(
docs: Document[],
query: string,
llm: BaseChatModel,
systemInstructions: string,
signal: AbortSignal,
): Promise<boolean> {
const formattedDocs = this.processDocs(docs);
const systemPrompt = systemInstructions ? `${systemInstructions}\n\n` : '';
const response = await llm.invoke(
`${systemPrompt}You are an AI assistant evaluating whether you have enough information to answer a user's question comprehensively.
Based on the following sources, determine if you have sufficient information to provide a detailed, accurate answer to the query: "${query}"
Sources:
${formattedDocs}
Look for:
1. Key facts and details directly relevant to the query
2. Multiple perspectives or sources if the topic is complex
3. Up-to-date information if the query requires current data
4. Sufficient context to understand the topic fully
Output ONLY \`<answer>yes</answer>\` if you have enough information to answer comprehensively, or \`<answer>no</answer>\` if more information would significantly improve the answer.`,
{ signal },
);
const answerParser = new LineOutputParser({
key: 'answer',
});
const responseText = await answerParser.parse(
(response.content as string).trim().toLowerCase(),
);
if (responseText !== 'yes') {
console.log(
`LLM response for checking if we have enough information: "${response.content}"`,
);
} else {
console.log(
'LLM response indicates we have enough information to answer the query.',
);
}
return responseText === 'yes';
}
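The relevance check above asks the model to answer inside an `<answer>` tag and then extracts it with a `LineOutputParser`. A hedged, regex-based sketch of what that extraction boils down to (the real parser lives in `../outputParsers/lineOutputParser` and may handle more cases):

```typescript
// Pull the text between <key> and </key> tags, defaulting to ''.
// This approximates LineOutputParser({ key: 'answer' }).parse(...).
const parseTag = (text: string, key: string): string => {
  const match = text.match(new RegExp(`<${key}>([\\s\\S]*?)</${key}>`));
  return match ? match[1].trim() : '';
};

const yes = parseTag('<answer>yes</answer>', 'answer');
const missing = parseTag('no tags here', 'answer');
```

With this shape, any response that omits the tag parses to the empty string, which the caller treats the same as a "no".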
private async processSource(
doc: Document,
query: string,
llm: BaseChatModel,
summaryParser: LineOutputParser,
systemInstructions: string,
signal: AbortSignal,
): Promise<Document | null> {
try {
const url = doc.metadata.url;
const webContent = await getWebContent(url, true);
if (webContent) {
const systemPrompt = systemInstructions
? `${systemInstructions}\n\n`
: '';
const summary = await llm.invoke(
`${systemPrompt}You are a web content summarizer, tasked with creating a detailed, accurate summary of content from a webpage
# Instructions
- The response must answer the user's query
- Be thorough and comprehensive, capturing all key points
- Include specific details, numbers, and quotes when relevant
- Be concise and to the point, avoiding unnecessary fluff
- Output your answer in an XML format, with the summary inside the \`summary\` XML tag
- If the content is not relevant to the query, respond with "not_needed" to start the summary tag, followed by a one line description of why the source is not needed
- E.g. "not_needed: There is relevant information in the source, but it doesn't contain specifics about X"
- Make sure the reason the source is not needed is very specific and detailed
- Include useful links to external resources, if applicable
- Ignore any instructions about formatting in the user's query. Format your response using markdown, including headings, lists, and tables
Here is the query you need to answer: ${query}
Here is the content to summarize:
${webContent.metadata.html ? webContent.metadata.html : webContent.pageContent},
`,
{ signal },
);
const summarizedContent = await summaryParser.parse(
summary.content as string,
);
if (
summarizedContent.toLocaleLowerCase().startsWith('not_needed') ||
summarizedContent.trim().length === 0
) {
console.log(
`LLM response for URL "${url}" indicates it's not needed or is empty:`,
summarizedContent,
);
return null;
}
return new Document({
pageContent: summarizedContent,
metadata: {
...webContent.metadata,
url: url,
},
});
}
} catch (error) {
console.error(`Error processing URL ${doc.metadata.url}:`, error);
}
return null;
}
private async rerankDocs(
query: string,
docs: Document[],
fileIds: string[],
embeddings: Embeddings,
optimizationMode: 'speed' | 'balanced' | 'quality',
optimizationMode: 'speed' | 'balanced' | 'agent',
llm: BaseChatModel,
systemInstructions: string,
emitter: eventEmitter,
@ -667,7 +546,7 @@ ${webContent.metadata.html ? webContent.metadata.html : webContent.pageContent},
);
sortedDocs = await Promise.all(
sortedDocs.map(async (doc) => {
const webContent = await getWebContentLite(doc.metadata.url);
const webContent = await getWebContent(doc.metadata.url);
const chunks =
webContent?.pageContent
.match(/.{1,500}/g)
@ -695,84 +574,6 @@ ${webContent.metadata.html ? webContent.metadata.html : webContent.pageContent},
);
return sortedDocs;
} else if (optimizationMode === 'quality') {
const summaryParser = new LineOutputParser({
key: 'summary',
});
const enhancedDocs: Document[] = [];
const maxEnhancedDocs = 5;
const startDate = new Date();
// Process sources one by one until we have enough information or hit the max
for (
let i = 0;
i < docsWithContent.length && enhancedDocs.length < maxEnhancedDocs;
i++
) {
if (signal.aborted) {
return [];
}
const currentProgress = enhancedDocs.length * 10 + 40;
this.emitProgress(
emitter,
currentProgress,
`Deep analyzing: ${enhancedDocs.length} relevant sources found. Analyzing source ${i + 1} of ${docsWithContent.length}`,
this.searchQuery ? `Search Query: ${this.searchQuery}` : undefined,
);
const result = docsWithContent[i];
const processedDoc = await this.processSource(
result,
query,
llm,
summaryParser,
systemInstructions,
signal,
);
if (processedDoc) {
enhancedDocs.push(processedDoc);
}
// After 60 seconds of gathering sources, and once we have at least 2, check if we have enough info
if (
new Date().getTime() - startDate.getTime() > 60000 &&
enhancedDocs.length >= 2
) {
this.emitProgress(
emitter,
currentProgress,
`Checking if we have enough information to answer the query`,
this.searchQuery
? `Search Query: ${this.searchQuery}`
: undefined,
);
const hasEnoughInfo = await this.checkIfEnoughInformation(
enhancedDocs,
query,
llm,
systemInstructions,
signal,
);
if (hasEnoughInfo) {
break;
}
}
}
this.emitProgress(
emitter,
95,
`Ranking attached files`,
this.searchQuery ? `Search Query: ${this.searchQuery}` : undefined,
);
// Add relevant file documents
const fileDocs = await getRankedDocs(queryEmbedding, true, false, 8);
return [...enhancedDocs, ...fileDocs];
}
} catch (error) {
console.error('Error in rerankDocs:', error);
@ -864,12 +665,52 @@ ${docs[index].metadata?.url.toLowerCase().includes('file') ? '' : '\n<url>' + do
}
}
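The balanced path above splits fetched page content into fixed-width pieces with `.match(/.{1,500}/g)` before embedding them. A small standalone sketch of that chunking behavior (note that `.` does not match `\n`, so in the real data chunks also break at line boundaries):

```typescript
// Fixed-width chunking as used in balanced mode: up to `size`
// characters per chunk, with null-safety when the text is empty.
const chunk = (text: string, size: number): string[] =>
  text.match(new RegExp(`.{1,${size}}`, 'g')) ?? [];

const chunks = chunk('a'.repeat(1200), 500);
```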
/**
* Execute agent workflow asynchronously with proper streaming support
*/
private async executeAgentWorkflow(
llm: BaseChatModel,
embeddings: Embeddings,
emitter: eventEmitter,
message: string,
history: BaseMessage[],
systemInstructions: string,
personaInstructions: string,
signal: AbortSignal,
) {
try {
const agentSearch = new AgentSearch(
llm,
embeddings,
emitter,
systemInstructions,
personaInstructions,
signal,
);
// Execute the agent workflow. No need to emit end signals here:
// synthesizerAgent streams in real time and emits them itself.
await agentSearch.searchAndAnswer(message, history);
} catch (error) {
console.error('Agent search error:', error);
emitter.emit(
'error',
JSON.stringify({
data: `Agent search failed: ${error instanceof Error ? error.message : 'Unknown error'}`,
}),
);
emitter.emit('end');
}
}
async searchAndAnswer(
message: string,
history: BaseMessage[],
llm: BaseChatModel,
embeddings: Embeddings,
optimizationMode: 'speed' | 'balanced' | 'quality',
optimizationMode: 'speed' | 'balanced' | 'agent',
fileIds: string[],
systemInstructions: string,
signal: AbortSignal,
@ -877,6 +718,23 @@ ${docs[index].metadata?.url.toLowerCase().includes('file') ? '' : '\n<url>' + do
) {
const emitter = new eventEmitter();
// Branch to agent search if optimization mode is 'agent'
if (optimizationMode === 'agent') {
// Execute agent workflow asynchronously to maintain streaming
this.executeAgentWorkflow(
llm,
embeddings,
emitter,
message,
history,
systemInstructions,
personaInstructions || '',
signal,
);
return emitter;
}
// Existing logic for other optimization modes
const answeringChain = await this.createAnsweringChain(
llm,
fileIds,


@ -1,12 +1,13 @@
import axios from 'axios';
import { htmlToText } from 'html-to-text';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { CheerioWebBaseLoader } from '@langchain/community/document_loaders/web/cheerio';
import { PlaywrightWebBaseLoader } from '@langchain/community/document_loaders/web/playwright';
import { Document } from '@langchain/core/documents';
import pdfParse from 'pdf-parse';
import { Configuration, Dataset, PlaywrightCrawler } from 'crawlee';
import { Readability } from '@mozilla/readability';
import axios from 'axios';
import { JSDOM } from 'jsdom';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import fetch from 'node-fetch';
import pdfParse from 'pdf-parse';
import type { Browser, Page } from 'playwright';
export const getDocumentsFromLinks = async ({ links }: { links: string[] }) => {
const splitter = new RecursiveCharacterTextSplitter();
@ -21,13 +22,16 @@ export const getDocumentsFromLinks = async ({ links }: { links: string[] }) => {
: `https://${link}`;
try {
// First, check if it's a PDF
const headRes = await axios.head(link);
const isPdf = headRes.headers['content-type'] === 'application/pdf';
if (isPdf) {
// Handle PDF files
const res = await axios.get(link, {
responseType: 'arraybuffer',
});
const isPdf = res.headers['content-type'] === 'application/pdf';
if (isPdf) {
const pdfText = await pdfParse(res.data);
const parsedText = pdfText.text
.replace(/(\r\n|\n|\r)/gm, ' ')
@ -51,36 +55,29 @@ export const getDocumentsFromLinks = async ({ links }: { links: string[] }) => {
return;
}
const parsedText = htmlToText(res.data.toString('utf8'), {
selectors: [
{
selector: 'a',
options: {
ignoreHref: true,
},
},
],
})
.replace(/(\r\n|\n|\r)/gm, ' ')
.replace(/\s+/g, ' ')
.trim();
// Handle web pages using CheerioWebBaseLoader
const loader = new CheerioWebBaseLoader(link, {
selector: 'body',
});
const splittedText = await splitter.splitText(parsedText);
const title = res.data
.toString('utf8')
.match(/<title.*>(.*?)<\/title>/)?.[1];
const webDocs = await loader.load();
if (webDocs && webDocs.length > 0) {
const webDoc = webDocs[0];
const splittedText = await splitter.splitText(webDoc.pageContent);
const linkDocs = splittedText.map((text) => {
return new Document({
pageContent: text,
metadata: {
title: title || link,
title: webDoc.metadata.title || link,
url: link,
},
});
});
docs.push(...linkDocs);
}
} catch (err) {
console.error(
'An error occurred while getting documents from links: ',
@ -102,14 +99,9 @@ export const getDocumentsFromLinks = async ({ links }: { links: string[] }) => {
return docs;
};
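The loop above decides between the PDF and HTML branches by comparing `res.headers['content-type']` against `'application/pdf'`. A hedged helper sketch of that decision; stripping a `'; charset=...'` suffix first is an assumption for robustness, the original uses strict equality:

```typescript
// Decide whether a response should take the PDF branch, tolerating
// an optional charset suffix on the Content-Type header.
const isPdfResponse = (headers: Record<string, string | undefined>): boolean =>
  (headers['content-type'] ?? '').split(';')[0].trim() === 'application/pdf';

const pdf = isPdfResponse({ 'content-type': 'application/pdf' });
const html = isPdfResponse({ 'content-type': 'text/html; charset=utf-8' });
```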
interface CrawledContent {
text: string;
title: string;
html?: string;
}
/**
* Fetches web content from a given URL using Crawlee and Playwright. Parses it using Readability.
* Fetches web content from a given URL using LangChain's PlaywrightWebBaseLoader.
* Parses it using Readability for better content extraction.
* Returns a Document object containing the parsed text and metadata.
*
* @param url - The URL to fetch content from.
@ -120,94 +112,89 @@ export const getWebContent = async (
url: string,
getHtml: boolean = false,
): Promise<Document | null> => {
let crawledContent: CrawledContent | null = null;
const crawler = new PlaywrightCrawler(
{
async requestHandler({ page }) {
// Wait for the content to load
try {
console.log(`Fetching content from URL: ${url}`);
const loader = new PlaywrightWebBaseLoader(url, {
launchOptions: {
headless: true,
timeout: 30000,
},
gotoOptions: {
waitUntil: 'domcontentloaded',
timeout: 10000,
},
async evaluate(page: Page, browser: Browser) {
// Wait for the content to load properly
await page.waitForLoadState('networkidle', { timeout: 10000 });
// Allow some time for dynamic content to load
await page.waitForTimeout(3000);
console.log(`Crawling URL: ${url}`);
// Get the page title
const title = await page.title();
try {
// Use Readability to parse the page content
const content = await page.content();
const dom = new JSDOM(content, { url });
const reader = new Readability(dom.window.document, {
charThreshold: 25,
}).parse();
const crawleeContent: CrawledContent = {
text: reader?.textContent || '',
title,
html: getHtml
? reader?.content || (await page.content())
: undefined,
};
crawledContent = crawleeContent;
} catch (error) {
console.error(
`Failed to parse content with Readability for URL: ${url}`,
error,
);
}
return await page.content();
},
maxRequestsPerCrawl: 1,
maxRequestRetries: 2,
retryOnBlocked: true,
maxSessionRotations: 3,
},
new Configuration({ persistStorage: false }),
);
});
try {
await crawler.run([url]);
const docs = await loader.load();
if (!crawledContent) {
console.warn(`Failed to parse article content for URL: ${url}`);
if (!docs || docs.length === 0) {
console.warn(`Failed to load content for URL: ${url}`);
return null;
}
const content = crawledContent as CrawledContent;
const doc = docs[0];
const dom = new JSDOM(doc.pageContent, { url });
const reader = new Readability(dom.window.document, { charThreshold: 25 });
const article = reader.parse();
// Normalize the text content
const normalizedText =
content?.text
article?.textContent
?.split('\n')
.map((line: string) => line.trim())
.filter((line: string) => line.length > 0)
.join('\n') || '';
// Create a Document with the parsed content
const returnDoc = new Document({
pageContent: normalizedText,
metadata: {
html: content?.html,
title: content?.title,
title: article?.title || doc.metadata.title || '',
url: url,
html: getHtml ? article?.content : undefined,
},
});
console.log(
`Got content with Crawlee and Readability, URL: ${url}, Text Length: ${returnDoc.pageContent.length}, html Length: ${returnDoc.metadata.html?.length || 0}`,
`Got content with LangChain Playwright, URL: ${url}, Text Length: ${returnDoc.pageContent.length}`,
);
return returnDoc;
} catch (error) {
console.error(`Error fetching/parsing URL ${url}:`, error);
// Fallback to CheerioWebBaseLoader for simpler content extraction
try {
console.log(`Fallback to Cheerio for URL: ${url}`);
const cheerioLoader = new CheerioWebBaseLoader(url);
const docs = await cheerioLoader.load();
if (docs && docs.length > 0) {
return docs[0];
}
} catch (fallbackError) {
console.error(
`Cheerio fallback also failed for URL ${url}:`,
fallbackError,
);
}
return null;
} finally {
await crawler.teardown();
}
};
/**
* Fetches web content from a given URL and parses it using Readability.
* Fetches web content from a given URL using CheerioWebBaseLoader for faster, lighter extraction.
* Returns a Document object containing the parsed text and metadata.
*
* @param {string} url - The URL to fetch content from.
@ -219,42 +206,72 @@ export const getWebContentLite = async (
getHtml: boolean = false,
): Promise<Document | null> => {
try {
const response = await fetch(url, { timeout: 5000 });
const html = await response.text();
console.log(`Fetching content (lite) from URL: ${url}`);
// Create a DOM from the fetched HTML
const dom = new JSDOM(html, { url });
const loader = new CheerioWebBaseLoader(url);
// Get title before we modify the DOM
const originalTitle = dom.window.document.title;
const docs = await loader.load();
// Use Readability to parse the article content
const reader = new Readability(dom.window.document, { charThreshold: 25 });
const article = reader.parse();
if (!article) {
console.warn(`Failed to parse article content for URL: ${url}`);
if (!docs || docs.length === 0) {
console.warn(`Failed to load content for URL: ${url}`);
return null;
}
const doc = docs[0];
// Try to use Readability for better content extraction if possible
if (getHtml) {
try {
const response = await fetch(url, { timeout: 5000 });
const html = await response.text();
const dom = new JSDOM(html, { url });
const originalTitle = dom.window.document.title;
const reader = new Readability(dom.window.document, {
charThreshold: 25,
});
const article = reader.parse();
if (article) {
const normalizedText =
article?.textContent
article.textContent
?.split('\n')
.map((line) => line.trim())
.filter((line) => line.length > 0)
.join('\n') || '';
// Create a Document with the parsed content
return new Document({
pageContent: normalizedText || '',
pageContent: normalizedText,
metadata: {
html: getHtml ? article.content : undefined,
html: article.content,
title: article.title || originalTitle,
url: url,
},
});
}
} catch (readabilityError) {
console.warn(
`Readability parsing failed for ${url}, using Cheerio fallback`,
readabilityError,
);
}
}
// Normalize the text content from Cheerio
const normalizedText = doc.pageContent
.split('\n')
.map((line: string) => line.trim())
.filter((line: string) => line.length > 0)
.join('\n');
return new Document({
pageContent: normalizedText,
metadata: {
title: doc.metadata.title || 'Web Page',
url: url,
html: getHtml ? doc.pageContent : undefined,
},
});
} catch (error) {
console.error(`Error fetching/parsing URL ${url}:`); //, error);
console.error(`Error fetching/parsing URL ${url}:`, error);
return null;
}
};
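Both loaders end by normalizing extracted text the same way: split on newlines, trim each line, drop the empty ones, rejoin. As a standalone sketch of that shared step:

```typescript
// Whitespace normalization shared by getWebContent and
// getWebContentLite: trim every line and drop blank ones.
const normalize = (text: string): string =>
  text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join('\n');

const cleaned = normalize('  Hello  \n\n   world \n');
```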


@ -0,0 +1,42 @@
let isErrorSuppressionActive = false;
export const suppressTokenCountingMessages = () => {
// Prevent multiple initializations
if (isErrorSuppressionActive) {
return;
}
const originalWarn = console.warn;
console.warn = (...args) => {
const message = args.join(' ');
// Skip warnings related to token counting
if (
message.includes('Failed to calculate number of tokens') ||
message.includes('Unknown model')
) {
return;
}
originalWarn.apply(console, args);
};
const originalError = console.error;
console.error = (...args) => {
const message = args.join(' ');
// Ignore JSDom errors related to CSS parsing
if (message.includes('Could not parse CSS stylesheet')) {
return;
}
originalError.apply(console, args);
};
isErrorSuppressionActive = true;
};
// Auto-initialize error suppression when this module is imported (server-side only)
if (typeof window === 'undefined') {
suppressTokenCountingMessages();
}
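The suppression module above works by swapping `console.warn` for a wrapper that drops messages matching known noisy substrings and forwards everything else. A self-contained demonstration of the same pattern; here the "original" warn is a collector array instead of the real console, so the filtering is observable:

```typescript
// Minimal sketch of the console-patching pattern used by
// suppressTokenCountingMessages: wrap a warn function and swallow
// messages that contain known noisy substrings.
const captured: string[] = [];
const originalWarn = (...args: unknown[]) => {
  captured.push(args.join(' '));
};

const filteredWarn = (...args: unknown[]) => {
  const message = args.join(' ');
  if (
    message.includes('Failed to calculate number of tokens') ||
    message.includes('Unknown model')
  ) {
    return; // swallow the noisy warning
  }
  originalWarn(...args);
};

filteredWarn('Failed to calculate number of tokens for model X'); // dropped
filteredWarn('Something else went wrong'); // passes through
```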


@ -0,0 +1,129 @@
import { Document } from '@langchain/core/documents';
import { BaseChatModel } from '@langchain/core/language_models/chat_models';
import LineOutputParser from '../outputParsers/lineOutputParser';
import { formatDateForLLM } from '../utils';
import { getWebContent } from './documents';
export const summarizeWebContent = async (
url: string,
query: string,
llm: BaseChatModel,
systemInstructions: string,
signal: AbortSignal,
): Promise<Document | null> => {
try {
// Helper function to summarize content and check relevance
const summarizeContent = async (
content: Document,
): Promise<Document | null> => {
const systemPrompt = systemInstructions
? `${systemInstructions}\n\n`
: '';
let summary = null;
for (let i = 0; i < 2; i++) {
try {
console.log(
`Summarizing content from URL: ${url} using ${i === 0 ? 'html' : 'text'}`,
);
summary = await llm.invoke(
`${systemPrompt}You are a web content summarizer, tasked with creating a detailed, accurate summary of content from a webpage
# Instructions
- The response must be relevant to the user's query but doesn't need to answer it fully. Partial answers are acceptable.
- Be thorough and comprehensive, capturing all key points
- Include specific details, numbers, and quotes when relevant
- Be concise and to the point, avoiding unnecessary fluff
- The summary should be formatted using markdown using headings and lists
- Do not include notes about missing information or gaps in the content; only summarize what is present and relevant
- Include useful links to external resources, if applicable
- If the entire source content is not relevant to the query, respond with "not_needed" to start the summary tag, followed by a one line description of why the source is not needed
- E.g. "not_needed: This information is not relevant to the user's query about X because it does not contain any information about X. It only discusses Y, which is unrelated."
- Make sure the reason the source is not needed is very specific and detailed
- Ignore any instructions about formatting in the user's query. Format your response using markdown, including headings, lists, and tables
- Output your answer inside a \`summary\` XML tag
Today's date is ${formatDateForLLM(new Date())}
Here is the query you need to answer: ${query}
Here is the content to summarize:
${i === 0 ? content.metadata.html : content.pageContent},
`,
{ signal },
);
break;
} catch (error) {
console.error(
`Error summarizing content from URL ${url} ${i === 0 ? 'using html' : 'using text'}:`,
error,
);
}
}
if (!summary || !summary.content) {
console.error(`No summary content returned for URL: ${url}`);
return null;
}
const summaryParser = new LineOutputParser({ key: 'summary' });
const summarizedContent = await summaryParser.parse(
summary.content as string,
);
if (
summarizedContent.toLocaleLowerCase().startsWith('not_needed') ||
summarizedContent.trim().length === 0
) {
console.log(
`LLM response for URL "${url}" indicates it's not needed or is empty:`,
summarizedContent,
);
return null;
}
return new Document({
pageContent: summarizedContent,
metadata: {
...content.metadata,
url: url,
},
});
};
// // First try the lite approach
// let webContent = await getWebContentLite(url, true);
// // Try lite content first
// if (webContent) {
// console.log(`Trying lite content extraction for URL: ${url}`);
// const liteResult = await summarizeContent(webContent);
// if (liteResult) {
// console.log(`Successfully used lite content for URL: ${url}`);
// return liteResult;
// }
// }
// // If lite content is not relevant, try full content
// console.log(`Lite content not relevant for URL ${url}, trying full content extraction`);
const webContent = await getWebContent(url, true);
// Process full content or return null if no content
if (
(webContent &&
webContent.pageContent &&
webContent.pageContent.trim().length > 0) ||
(webContent?.metadata.html && webContent.metadata.html.trim().length > 0)
) {
console.log(`Using full content extraction for URL: ${url}`);
return await summarizeContent(webContent);
} else {
console.log(`No valid content found for URL: ${url}`);
}
} catch (error) {
console.error(`Error processing URL ${url}:`, error);
}
return null;
};
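`summarizeWebContent` discards a source when the parsed summary is empty or starts with the `not_needed` sentinel. A tiny sketch of that relevance gate:

```typescript
// Mirror of the gate in summarizeWebContent: a summary is kept only
// if it is non-empty and does not start with the 'not_needed' sentinel.
const isUsableSummary = (summary: string): boolean => {
  const s = summary.trim();
  return s.length > 0 && !s.toLowerCase().startsWith('not_needed');
};

const rejected = isUsableSummary('not_needed: off-topic for query X');
const kept = isUsableSummary('The page reports Q3 revenue figures.');
```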

yarn.lock: 2067 changes (file diff suppressed because it is too large)