Mitigate AI Platform

Knowledge Base

Manage documents and document sources to build a knowledge base for your AI chatbot. Upload files, crawl websites, and integrate with Jira.

Build your chatbot's knowledge base by uploading documents or configuring automated document sources. Documents are processed into searchable chunks with vector embeddings, enabling the chatbot to provide accurate, context-aware responses.

Documents

Documents are individual files or web pages that make up your knowledge base. Each document is split into chunks, enriched with metadata, and vectorized for semantic search.

Supported File Types

  • Documents: PDF, DOCX, XLSX, PPTX
  • Web Content: HTML, Markdown
  • Data: CSV, Plain Text

Uploading Documents

Go to Documents

Go to Admin → Documents and click Upload Documents.

Select Files

Select one or more files and choose which Workspaces should have access.

Upload

Click Upload. Documents are automatically processed through the ingestion pipeline.

Uploaded documents go through the following processing stages:

  1. Loading — File content is extracted
  2. Chunking — Content is split into searchable segments
  3. Enrichment — Title, description, and locale are generated using AI
  4. Contextual Processing — Each chunk receives surrounding context for better retrieval
  5. Vectorization — Embeddings are generated for semantic search
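The stages above can be sketched as a small pipeline. This is a hypothetical illustration only: the function names, chunk size, and record fields are invented for the example and are not the platform's actual API.

```python
# Hypothetical sketch of the ingestion stages; names and fields are
# illustrative only, not the platform's actual implementation.

def chunk(text: str, size: int = 200) -> list[str]:
    """2. Chunking: split extracted content into fixed-size segments."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(raw_text: str) -> list[dict]:
    """Run one loaded document (stage 1 output) through the remaining stages."""
    chunks = chunk(raw_text)
    enriched = []
    for i, c in enumerate(chunks):
        enriched.append({
            "text": c,
            "title": c[:40],                         # 3. Enrichment (AI-generated in practice)
            "context": chunks[max(0, i - 1):i + 2],  # 4. Contextual processing: neighboring chunks
            "vector": None,                          # 5. Vectorization: embedding would go here
        })
    return enriched
```

The key idea is that each chunk carries its neighbors as context, so retrieval can surface a segment without losing the surrounding meaning.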

Managing Documents

The documents index provides search and filtering:

  • Search by title or source URL
  • Filter by vectorization status (vectorized or pending)
  • Sort by name, date, or file size
  • Bulk delete multiple documents at once

Each document displays its processing status, including the number of chunks created and how many have been vectorized.

Workspace Access

Documents can be shared across workspaces. For manually uploaded documents (not from a source), admins can manage workspace access from the document detail page.

Document Sources

Document sources automate the ingestion of documents from external systems. Instead of uploading files manually, configure a source to crawl and import content automatically.

Web Source

Crawl websites to import their content as documents.

Go to Document Sources

Go to Admin → Document Sources and click Add Web Source.

Configure Settings

Configure the source settings (see table below).

Save and Process

Click Save, then click Process to start the initial crawl.

Web Source Settings

  • URL — Starting URL for the crawler (must be HTTP or HTTPS)
  • Limit — Maximum number of documents to fetch (1–10,000, default: 100)
  • Max Depth — How deep the crawler follows links (0–10, default: 10)
  • Include URL Globs — URL patterns to include, semicolon-separated (e.g., https://example.com/docs/*)
  • Exclude URL Globs — URL patterns to exclude, semicolon-separated. Exclude rules take precedence over include rules.
  • JavaScript Enabled — Enable headless Chrome rendering for JavaScript-heavy websites. Disabled by default for faster crawling.
  • Ignore Selectors — CSS selectors for elements to remove from pages (e.g., nav;footer;.sidebar)
  • Header — Custom HTTP headers, semicolon-separated (see Crawling Authenticated Websites)
  • Locale — Expected language of documents (ISO 639-1 code, e.g., en, lv)
  • Periodic Crawl — Enable daily automatic re-crawling
  • Translate — Automatically translate documents to the base language
  • Description — Optional description of the source
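The exclude-over-include precedence can be illustrated with Python's standard fnmatch module. This is a sketch of the rule as documented; the crawler's actual glob matcher may differ in details.

```python
from fnmatch import fnmatch

def allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Exclude globs take precedence over include globs."""
    if any(fnmatch(url, pat) for pat in exclude):
        return False          # an exclude match always wins
    if not include:
        return True           # no include patterns: everything else passes
    return any(fnmatch(url, pat) for pat in include)

inc = ["https://example.com/docs/*"]
exc = ["https://example.com/docs/internal/*"]
```

With these patterns, https://example.com/docs/intro is crawled, while anything under /docs/internal/ is skipped even though it also matches the include glob.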

Crawling Authenticated Websites

If the website you want to crawl is behind a login or requires authentication, you can use the Header setting to pass custom HTTP headers with each request. Headers are semicolon-separated.
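A semicolon-separated header string might be split into name/value pairs as follows. This is a sketch of the expected format only; the platform's actual parser is not documented, and header values that themselves contain semicolons may need different handling.

```python
def parse_headers(raw: str) -> dict[str, str]:
    """Split a semicolon-separated header string into name/value pairs.
    (Illustrative sketch; the platform's parser may differ.)"""
    headers = {}
    for part in raw.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)  # split on the first colon only
            headers[name.strip()] = value.strip()
    return headers
```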

Option 1: Basic Authentication

If the website uses HTTP Basic Authentication, add an Authorization header with a Base64-encoded username:password value:

Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=

To generate the Base64 value, encode username:password (e.g., echo -n 'username:password' | base64).
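The same value can be produced with Python's standard base64 module, which avoids shell quoting issues:

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build an HTTP Basic Authentication header value."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Authorization: Basic {token}"

# basic_auth_header("username", "password")
# → "Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ="
```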

Option 2: Reusing a Session Cookie

If you can log into the website manually, you can copy the session cookie from your browser and pass it as a header:

Cookie: session_id=abc123def456

Session cookies typically expire after some time. You will need to update the header when the cookie expires.

Option 3: API Key or Bearer Token

If the website supports token-based access, use the appropriate header:

Authorization: Bearer your_access_token

or

X-Api-Key: your_api_key

This option typically requires involvement from the website developers to configure robot/service account access that does not expire.

Jira Source

Import Jira issues as documents. Each issue is converted to a structured document containing its summary, description, comments, status, and other metadata.

Note: Jira integration requires the Jira feature to be enabled.

Go to Document Sources

Go to Admin → Document Sources and click Add Jira Source.

Configure Settings

Configure the source settings (see table below).

Save and Process

Click Save, then click Process to start the initial import.

Jira Source Settings

  • URL — Jira instance URL
  • Chunk Size — Text chunk size for splitting issue content (1–100,000)
  • Locale — Expected language of issues (ISO 639-1 code)
  • Periodic Crawl — Enable daily automatic re-import (only fetches issues updated since last run)
  • Translate — Automatically translate issues to the base language
  • Description — Optional description of the source

Google Drive Source

Import files from Google Drive into your knowledge base. Supports Google Docs, Sheets, and Slides (automatically converted to DOCX, XLSX, and PPTX), along with all other supported file types.

Note: Google Drive integration requires a Google Drive connector with OAuth authentication.

Go to Document Sources

Go to Admin → Document Sources and click Add Google Drive Source.

Select a Connector

Select the Google Drive connector to use for authentication.

Pick Files

Click Browse Google Drive to open the file picker. Select individual files or entire folders. When a folder is selected, all files within it (including subfolders) are ingested.

Configure Settings

Configure the source settings (see table below).

Save and Process

Click Save, then click Process to start the initial import.

Google Drive Source Settings

  • Connector — Google Drive connector for OAuth authentication
  • Selected Files — Files and folders to import (selected via the file picker)
  • Chunk Size — Text chunk size for splitting document content (1–100,000)
  • Locale — Expected language of documents (ISO 639-1 code)
  • Periodic Crawl — Enable daily automatic re-import. Only changed files are reprocessed.
  • Translate — Automatically translate documents to the base language
  • Description — Optional description of the source

Google native formats (Docs, Sheets, Slides) are automatically exported to Office formats for processing. Files that have not changed since the last crawl are skipped.
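The skip-unchanged behavior can be sketched as comparing a content checksum against the one stored from the previous crawl. This is an illustration only: the platform may instead compare modification timestamps or revision IDs.

```python
import hashlib

def needs_reprocess(content: bytes, stored: dict[str, str], file_id: str) -> bool:
    """Return True only when the file's content hash differs from the stored one.
    (Hypothetical sketch of change detection; not the platform's actual logic.)"""
    digest = hashlib.sha256(content).hexdigest()
    if stored.get(file_id) == digest:
        return False            # unchanged since last crawl: skip
    stored[file_id] = digest    # remember the new version for next time
    return True
```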

SharePoint Source

Import files from Microsoft SharePoint into your knowledge base. Browse SharePoint sites, drives, and folders to select files for ingestion.

Note: SharePoint integration requires a Microsoft SharePoint connector with OAuth authentication.

Go to Document Sources

Go to Admin → Document Sources and click Add SharePoint Source.

Select a Connector

Select the SharePoint connector to use for authentication.

Pick Files

Select a SharePoint site, then click Browse to open the file picker. Select individual files or entire folders. Folders are traversed recursively during ingestion.

Configure Settings

Configure the source settings (see table below).

Save and Process

Click Save, then click Process to start the initial import.

SharePoint Source Settings

  • Connector — SharePoint connector for OAuth authentication
  • Selected Files — Files and folders to import (selected via the file picker)
  • Chunk Size — Text chunk size for splitting document content (1–100,000)
  • Locale — Expected language of documents (ISO 639-1 code)
  • Periodic Crawl — Enable daily automatic re-import. Only changed files are reprocessed.
  • Translate — Automatically translate documents to the base language
  • Description — Optional description of the source

Processing Sources

After creating a source, click Process to start the crawl. You can monitor progress from the Crawler Runs page, which shows:

  • Status — Pending, running, completed, or failed
  • Progress — Percentage of completion
  • Duration — How long the crawl took
  • Error Message — Details if the crawl failed

Periodic Crawling

When Periodic Crawl is enabled on a source, the system automatically re-crawls daily. For Jira sources, only issues updated since the last run are fetched, making subsequent crawls faster.

Workspace Access

Document sources can be assigned to specific workspaces. All documents imported from a source are automatically available to the same workspaces.

Translation

When the translation feature is enabled, documents can be automatically translated to a configured base language. This is useful when your knowledge base contains documents in multiple languages but you want consistent retrieval.

  • Enable Translate on the document source
  • Set the base translation language in Admin → Settings
  • Translated content is used alongside original content for search
  • Documents already in the target language are left unchanged
