Secure, Multilingual Conversations with Sensitive Documents
Industry
Legal & Compliance, Banking, Education
Technologies
Retrieval-Augmented Generation (RAG), Milvus, vLLM, AWS S3 / MinIO
Client challenge
Organizations struggled to efficiently process, extract, and query information from a diverse range of unstructured documents and online sources. The lack of a unified system hindered quick access to critical data and the ability to generate precise, cited answers, especially for complex formats like PDFs, multi-sheet Excel files and scanned documents.
Solution
We delivered a Retrieval-Augmented Generation (RAG) platform built on a microservice architecture: uploaded files are routed through format-specific processing pipelines, stored in object storage (AWS S3 or MinIO), and indexed in a Milvus vector database, while offline LLM serving with vLLM keeps inference high-throughput and fully private. Users ask natural language questions across their documents and receive precise answers with page-level citations.
Benefits
- Offline deployment options ensure enhanced data privacy and control.
- Accurate Q&A with page-level and exact location citations.
- Support for complex document types, including Excel and scanned images.
- Multilingual web interface for broader user accessibility.
Organizations face significant challenges in transforming vast, unstructured data into actionable intelligence. The need for secure, efficient, and accurate information retrieval from internal knowledge bases is paramount for informed decision-making, regulatory compliance, and operational efficiency. This project addressed the critical demand for a system that can intelligently process diverse data formats while maintaining strict data privacy and providing verifiable insights.
Background
The client operates in an environment rich with diverse digital content, including PDFs, Word documents, Excel spreadsheets, Markdown, plain text files, and online resources. Extracting specific, accurate information from this fragmented data landscape was a significant hurdle. Traditional search methods often failed to provide precise answers or context, leading to time-consuming manual data retrieval.
A key challenge involved processing complex document structures, such as multi-sheet Excel files, where data needed to be extracted, structured, and made queryable. Similarly, scanned documents and images required advanced processing to extract text and integrate it into the knowledge base. The client also prioritized data privacy, necessitating solutions that could operate securely, potentially on-premises, without compromising sensitive information.
Stakeholders, from individual users needing quick answers to administrators managing user access and data, required a system that was intuitive, scalable, and reliable. The existing infrastructure lacked the capability to handle simultaneous processing tasks efficiently or to offer granular control over user features and document limits.
Executive Summary
Our solution implemented a sophisticated RAG pipeline to transform diverse unstructured data into a queryable knowledge base. Leveraging a microservice architecture, files are processed through dedicated pipelines, stored in object storage (AWS S3 or MinIO), and indexed in a Milvus vector database. To ensure scalability, low latency, and complete data privacy, we incorporated offline, production-grade LLM deployment using vLLM, enabling high-throughput inference without relying on external APIs. This system empowers users to ask natural language questions and receive precise, cited answers, significantly enhancing information access and data utilization across the organization.
Key Business Challenges
- Difficulty in extracting structured data from complex documents like multi-sheet Excel files, leading to manual effort and potential errors in analysis.
- Lack of a unified system to query information across diverse document formats and online resources, hindering efficient knowledge retrieval and decision-making.
- Inability to provide precise, cited answers to user queries, impacting trust in information and requiring manual verification processes.
- Concerns regarding data privacy and security, necessitating options for on-premises deployment and offline functionality to meet compliance requirements.
Solution Overview
The solution is built on a microservice architecture with dedicated data processing pipelines. When a user uploads a document, it is stored in object storage, and its metadata—including file name, type, and user association—is recorded in a central database for tracking. A processing job is then created and pushed to a message queue (e.g., RabbitMQ, Apache Kafka, or AWS SQS), which distributes tasks across processing services. Based on resource availability, jobs are dynamically scheduled and executed, enabling scalable, fault-tolerant, and parallel processing of files while ensuring high availability and efficient resource utilization.
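To make this flow concrete, the sketch below shows one plausible shape of the upload-and-dispatch step in Python, assuming boto3 against an S3-compatible store (AWS S3 or MinIO) and a RabbitMQ broker via pika; the bucket, queue, endpoints, and metadata fields are illustrative rather than the production configuration.

```python
# Illustrative upload-and-dispatch flow: store the file, record metadata, enqueue a job.
# Assumes an S3-compatible object store (boto3) and a RabbitMQ broker (pika);
# all names (bucket, queue, endpoints) are placeholders.
import json
import uuid

import boto3
import pika


def save_metadata(**fields) -> None:
    """Placeholder for the metadata insert; the real system writes to a central database."""
    print("metadata recorded:", fields)


def handle_upload(file_path: str, file_name: str, file_type: str, user_id: str) -> str:
    """Store the raw file, record its metadata, and enqueue a processing job."""
    doc_id = str(uuid.uuid4())

    # 1. Persist the raw file in object storage (MinIO exposes the same S3 API).
    s3 = boto3.client("s3", endpoint_url="http://minio:9000")
    s3.upload_file(file_path, "documents", f"{user_id}/{doc_id}/{file_name}")

    # 2. Record metadata (file name, type, owning user) for tracking.
    save_metadata(doc_id=doc_id, name=file_name, type=file_type, user=user_id)

    # 3. Push a processing job onto the queue so an available worker can pick it up.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="doc-processing", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="doc-processing",
        body=json.dumps({"doc_id": doc_id, "file_type": file_type, "user_id": user_id}),
        properties=pika.BasicProperties(delivery_mode=2),  # make the message persistent
    )
    connection.close()
    return doc_id
```

In production, multiple workers consume from the same queue, which is what allows jobs to be scheduled by resource availability and processed in parallel.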
The system supports a wide array of document formats, including PDF, DOC/DOCX, Markdown, TXT, XLS/XLSX, and images, as well as online resources such as websites and Google Docs. Each format is handled through a dedicated parsing pipeline. For instance, with Excel files, data is extracted from every sheet, converted into a structured format, and stored in the database. The processed and structured data is then embedded into a Milvus vector database for fast, semantic querying.
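As an illustration of the Excel path, the sketch below reads every sheet with pandas, flattens each row into a text chunk, embeds the chunks with a sentence-transformers model, and inserts them into a Milvus collection through pymilvus. The embedding model, collection name, and metadata fields are assumptions, not the production schema.

```python
# Illustrative multi-sheet Excel pipeline: extract, structure, embed, index in Milvus.
# Assumes a "documents" collection with a matching schema already exists.
import pandas as pd
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption
milvus = MilvusClient(uri="http://milvus:19530")    # endpoint is illustrative


def index_excel(path: str, doc_id: str) -> None:
    # Read every sheet in the workbook into a dict of DataFrames.
    sheets = pd.read_excel(path, sheet_name=None)

    chunks = []
    for sheet_name, frame in sheets.items():
        for row_idx, row in frame.iterrows():
            # Flatten each row into "column: value" text so it stays semantically searchable.
            text = "; ".join(f"{col}: {val}" for col, val in row.items() if pd.notna(val))
            if text:
                chunks.append(
                    {
                        "doc_id": doc_id,
                        "location": f"sheet '{sheet_name}', row {row_idx}",
                        "text": text,
                    }
                )
    if not chunks:
        return

    # Embed the chunks and insert them; the location field later backs the citations.
    vectors = embedder.encode([c["text"] for c in chunks])
    data = [{"vector": vec.tolist(), **meta} for vec, meta in zip(vectors, chunks)]
    milvus.insert(collection_name="documents", data=data)
```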
To ensure low latency, scalability, and complete data privacy, the platform uses offline, production-grade LLM deployment powered by vLLM, enabling high-throughput inference without relying on third-party APIs. Users can query one or multiple documents and receive precise, cited answers with page numbers and exact content locations. The system further enhances the user experience with clickable references, read-aloud functionality, and answer regeneration. It also provides multi-model support, allowing seamless integration with OpenAI, Gemini, Grok, and locally hosted models to offer flexible and diverse AI capabilities. This architecture provides the flexibility for both cloud-hosted and on-premises deployments, meeting diverse enterprise data privacy and regulatory compliance needs.
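The query path can be sketched along the same lines: embed the question, retrieve the most relevant chunks from Milvus, and generate a cited answer with a locally served model through vLLM. The model choice, field names, and prompt wording below are illustrative assumptions rather than the deployed configuration.

```python
# Illustrative query path: retrieve cited context from Milvus, answer with an offline vLLM model.
import json

from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer
from vllm import LLM, SamplingParams

embedder = SentenceTransformer("all-MiniLM-L6-v2")
milvus = MilvusClient(uri="http://milvus:19530")
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # model choice is an assumption


def answer(question: str, doc_ids: list[str]) -> str:
    # Retrieve the top matching chunks, restricted to the documents the user selected.
    hits = milvus.search(
        collection_name="documents",
        data=[embedder.encode(question).tolist()],
        filter=f"doc_id in {json.dumps(doc_ids)}",
        limit=5,
        output_fields=["doc_id", "location", "text"],
    )[0]

    # Build a prompt whose context carries the locations used for the citations.
    context = "\n".join(
        f"[{h['entity']['doc_id']} | {h['entity']['location']}] {h['entity']['text']}"
        for h in hits
    )
    prompt = (
        "Answer the question using only the context below, and cite the bracketed "
        "markers you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=512))
    return outputs[0].outputs[0].text
```

Because vLLM serves the model locally, this path never sends document content to a third-party API, which is what enables the on-premises and offline deployment options.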
Outcomes and Impact
- Natural Language Querying: Enabled natural language questions across diverse document types, including complex Excel sheets and scanned documents.
- Precise Citations: Delivered page-level citations and exact content locations with every answer, enhancing trust and verifiability of information.
- Flexible Deployment: Supported cloud, on-premises, and offline production-grade LLM deployment using vLLM to meet diverse data privacy and control requirements.
- Multi-Model Support: Integrated OpenAI, Gemini, Grok, and locally hosted models, providing flexibility and future-proofing for advanced capabilities.
- Streamlined Information Access: Reduced manual effort and improved efficiency in knowledge retrieval from private data sources.

Contact Us
Let’s Build Your Digital Success Story
With decades of expertise and hundreds of future-ready solutions delivered globally, GiganTech combines technical mastery and industry insights to turn complex challenges into growth. Partner with a team trusted by enterprises worldwide—where technology meets innovation.


