Version: v3.3

Unstructured Knowledge Bases

Unstructured Knowledge Bases allow you to transform your documents and files into intelligent, searchable repositories in the Amorphic Cloud Platform. This powerful feature uses advanced AI technology to create, manage, and query knowledge bases from datasets and domains containing PDF, DOCX, TXT, and other file types. Whether you're building a document search system, creating a Q&A interface, or organizing enterprise knowledge, unstructured knowledge bases provide the tools to make your data more accessible and intelligent.

With KnowledgeBase, you can:

Create and manage knowledge bases with multiple data sources
Sync and index the files in your datasets and domains
Query knowledge bases using natural language
Get intelligent responses based on your data content along with citations
Track indexing metrics and sync status
Manage access permissions and resource associations

This guide will walk you through everything you need to know to leverage KnowledgeBase's capabilities effectively.

Knowledge Base Operations

Amorphic provides the following operations for Knowledge Base:

Operation	Description
Create Unstructured Knowledge Base	Creates an unstructured knowledge base in AWS Bedrock and other necessary AWS resources.
View Unstructured Knowledge Base	View the details of an existing unstructured knowledge base.
Update Unstructured Knowledge Base	Update an existing unstructured knowledge base configuration.
Add Sources	Add new data sources to an existing knowledge base.
Sync Knowledge Base and Sources	Sync data sources in a knowledge base to update indexed content.
View Sync Status	View sync status and metrics for a knowledge base.
Query Knowledge Base	Query a knowledge base using natural language.
Remove Sources	Remove data sources from a knowledge base.
Delete Knowledge Base	Delete an existing knowledge base.

Getting Started

Overview

Unstructured KnowledgeBase is an AI-powered data repository system that enables you to:

Transform your unstructured datasets and files into searchable knowledge bases
Query your data using natural language
Get intelligent responses based on your actual data content along with citations
Sync and index new or updated data
Manage multiple sources within a single knowledge base

The system integrates with AWS Bedrock to provide advanced natural language processing and retrieval capabilities, making your data more accessible and useful.

Note

All sources must be properly registered in the Amorphic platform and accessible to your user account. The system automatically handles file format detection and content extraction during the indexing process.

Key Features

Knowledge Base Management

KnowledgeBase provides comprehensive management capabilities for creating, updating, and maintaining your data repositories.

Feature	Description
Knowledge base creation	Create new knowledge bases with custom names and descriptions
Source association	Attach multiple datasets or domains to a knowledge base
Sync management	Sync and index sources with status tracking
Access control	Manage permissions and user access to knowledge bases
Metrics tracking	Monitor indexing statistics and sync performance

Note

Knowledge bases are created with unique identifiers and can contain multiple data sources
Sync operations are performed sequentially to avoid conflicts
You can choose to sync either the entire knowledge base or individual sources one at a time
All operations are logged for audit and compliance purposes
Access permissions are inherited from the underlying data sources

Natural Language Querying

KnowledgeBase leverages advanced LLMs to enable natural language interactions with your data:

Feature	Description
Natural language processing	Query your data using plain English
Context-aware responses	Get answers based on your actual data content
Chunk-based retrieval	Intelligent document chunking for better responses
Response formatting	Structured responses with source attribution

Source Synchronization

The system provides robust synchronization capabilities for keeping your knowledge bases up-to-date:

Feature	Description
Indexing	Detect and index new or modified files
Incremental sync	Only process changed content for efficiency
Status tracking	Monitor sync progress and completion status
Error handling	Retry logic with exponential backoff
Email notifications	Notify owners and editors of sync completion
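
To make the idea of incremental sync concrete, the sketch below compares two file manifests and classifies each file as new, modified, or deleted. It is only an illustration of the concept, not Amorphic's implementation: the function name `diff_manifests` and the manifest shape (file path mapped to a fingerprint) are assumptions.

```python
from typing import Dict, List


def diff_manifests(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify files by comparing two {path: fingerprint} manifests.

    Hypothetical helper: the fingerprint could be a checksum or a
    last-modified marker. The platform's real change detection may differ.
    """
    new_files = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"new": new_files, "modified": modified, "deleted": deleted}


# Manifest from the last sync versus the files present now (made-up data).
previous = {"a.pdf": "v1", "b.docx": "v1", "c.txt": "v1"}
current = {"a.pdf": "v1", "b.docx": "v2", "d.md": "v1"}

changes = diff_manifests(previous, current)
print(changes)
```

Only the files listed under `new` and `modified` would need to be re-indexed, which is why an incremental sync is cheaper than reprocessing every file.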

Create Unstructured Knowledge Base

To create an Unstructured Knowledge Base:

Navigate to the AI Services section in the left sidebar
Select Knowledge Bases from the available options
Click on + Create Knowledge Base.
Fill in the details shown in the table:

Attribute	Description
Knowledge Base Name	Give your knowledge base a unique name.
Description	Describe the knowledge base's purpose and relevant details.
Knowledge Base Type	Choose the knowledge base type: unstructured or structured.
Models	Select the Embedding Models enabled in your account. These models convert text into numerical vectors for semantic search.
Keywords	Add relevant keywords to the knowledge base.
Guardrail	Select a relevant guardrail for the knowledge base. If no guardrail is specified, the system will apply a default guardrail automatically.
Access Control	Configure access permissions for the knowledge base. By default, the creator has full access.

Citations

When you query a knowledge base through AI Studio Chats or Projects, the response includes inline citations that link back to the source documents. For full details, see Understanding Citations.

Description Best Practices

Provide a clear and detailed description that accurately reflects the content and purpose of your knowledge base sources
A well-written description helps users understand the knowledge base scope and enables agents to effectively access and utilize the information
Include key topics, data types, and intended use cases in the description for better discoverability

View Unstructured Knowledge Base

The Unstructured Knowledge Base details page presents the following information:

Tab	Component	Description
Overview	Basic Information	Knowledge Base Name: Unique identifier; Description: Purpose and content details; Created: Creator and creation date; Updated: Last modifier and modification date
	Model Information	Model: Embedding model (e.g., amazon.titan-embed-text-v2:0); Last Synced: Most recent sync timestamp; Last Synced Status: Current sync state (SUCCEEDED/FAILED/IN_PROGRESS)
	Keywords	Associated tags and owner information
	Donut Chart Metrics	Sources Attached: Connected datasets/domains; Files Scanned: Total processed files; Files Deleted: Removed files; Files Failed: Failed indexing attempts; Metadata Files Scanned: Processed metadata files; Metadata Files Modified: Updated metadata files; Modified Files Indexed: Re-indexed existing files; New Files Indexed: Successfully indexed new files
	Summary Cards	Sources Added: Total attached sources; Latest Files Processed: Recent processing count; Latest Files Indexed: Recent indexing success count; Latest Files Failed: Recent indexing failure count
Sources	Source Management	Information about connected data sources and their status
Runs	Sync Operations	Details about synchronization operations and their outcomes
Activity Logs	Timeline Events	Creation events; source addition records; sync operation history; knowledge base modifications. Each log entry includes the user who performed the action, the action description, and a timestamp.

The details page covers your knowledge base's configuration, metrics, and activity history. Its three main tabs are Overview, Sources, and Runs.

Note

The Overview tab provides the most comprehensive view of your knowledge base status and performance
Use the Test Knowledge Base button to verify your knowledge base is working correctly
Monitor the Activity Logs to track all changes and operations performed on your knowledge base
The metrics help you understand the scope and health of your indexed content

Update Unstructured Knowledge Base

To update an Unstructured Knowledge Base (for example, its description or guardrail):

Navigate to the Knowledge Base details page
Click on the Edit action button
Update the description and/or guardrail as needed
Click Save to apply the changes

Note

Only the description and guardrail can be modified after a knowledge base is created. The name and model configuration cannot be changed.

Add Sources

To add sources to your Knowledge Base:

Navigate to the Knowledge Base details page
Click the Add Source button
Select your source type (Dataset or Domain)
Configure the required fields
Click Save to attach the source

The following fields need to be configured when adding a source:

Field	Description
Source Type	Select between Dataset or Domain as the source type
Name	Select from the list of available datasets or domains based on the chosen source type
Description	Add details about the source content and purpose
Chunking Strategy	Select a chunking strategy for how your documents are split into searchable chunks. See the Chunking Strategies table below for details.
Parsing Strategy	Select how content is extracted from your files. See Parsing Strategies below.

Chunking Strategies

The chunking strategy determines how your documents are divided into smaller, searchable pieces for indexing and retrieval. Choose a strategy based on the structure and type of your documents:

Chunking Method	When to Use
No Chunking	Use when data is already optimally chunked outside of Amazon Bedrock and you plan to use it as-is with Amazon Bedrock Knowledge Bases.
Fixed Size Chunking	Ideal for documents with loose semantic connections between paragraphs, such as FAQs, data reports, statistics, newsletters, or news articles. Also suited to files containing structured data like CSVs.
Semantic Chunking	Best suited for documents with strong semantic relationships between paragraphs and texts, such as reviews, customer conversations, sales and marketing materials.
Hierarchical Chunking	Recommended for documents with clear hierarchies (headers, sections, subsections, paragraphs, etc.), such as technical manuals, research papers, and legal contracts.
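
As a rough illustration of what Fixed Size Chunking does, the sketch below splits a list of words into chunks of a fixed length with an optional overlap. It is a simplified stand-in, not the Amazon Bedrock implementation: real chunking works on tokens rather than whitespace-separated words, and the function name `fixed_size_chunks` and its parameters are assumptions made for this example.

```python
from typing import List


def fixed_size_chunks(words: List[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Split words into chunks of at most `size` items.

    Consecutive chunks share `overlap` items so that context spanning a
    chunk boundary is not lost entirely. Hypothetical helper for illustration.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")

    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = fixed_size_chunks(text.split(), size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Because the split ignores sentence and section boundaries, this approach fits loosely connected content best, which matches the guidance in the table above.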

warning

When using No Chunking, the entire document content is processed as a single chunk. Ingestion will fail if the document's total content exceeds the input token limit of the selected embedding model (e.g., 8,192 tokens for Amazon Titan Text Embeddings). This strategy should only be selected when documents are guaranteed to fall within these model-specific constraints.
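
Before choosing No Chunking, it can help to estimate whether a document is likely to fit within the embedding model's input limit. The sketch below uses the 8,192-token figure mentioned above and a rough heuristic of about four characters per token. That ratio, the safety margin, and the function names are assumptions; the model's real tokenizer will give different counts, so treat the result as a coarse pre-check only.

```python
import math

TOKEN_LIMIT = 8192       # example limit quoted in the warning above
CHARS_PER_TOKEN = 4      # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count (heuristic only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def likely_fits_without_chunking(text: str, limit: int = TOKEN_LIMIT,
                                 safety_margin: float = 0.8) -> bool:
    """Return True if the estimate stays under a conservative share of the limit."""
    return estimate_tokens(text) <= int(limit * safety_margin)


short_doc = "hello world " * 10
long_doc = "x" * 40_000

print(likely_fits_without_chunking(short_doc))
print(likely_fits_without_chunking(long_doc))
```

Documents that fail this check are better served by one of the other chunking strategies.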

Parsing Strategies

Parsing is how the system extracts content from your raw files (e.g., text from PDFs or Word documents) before chunking and indexing.

In Amorphic, only the Default parsing strategy is supported. It uses the Amazon Bedrock default parser to extract text from supported file types.

General Notes

Important limitations:

Currently in version 3.3, only the Default Parsing Strategy is supported for processing source content
Maximum 5 sources can be attached per knowledge base
If a domain is selected as a source, individual datasets from that domain cannot be added separately
This limitation helps optimize query performance across the knowledge base
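
The source limits above can be expressed as a small validation routine. The sketch below is hypothetical: the `Source` class, its fields, and `can_add_source` are names invented for this example and are not part of any Amorphic API. It only encodes the two rules stated in this guide (at most 5 sources, and no individual datasets from a domain that is already attached).

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_SOURCES = 5  # limit stated in this guide


@dataclass(frozen=True)
class Source:
    source_type: str              # "dataset" or "domain"
    name: str
    domain: Optional[str] = None  # parent domain, for datasets only


def can_add_source(existing: List[Source], candidate: Source) -> Optional[str]:
    """Return None if the candidate may be attached, otherwise a reason string."""
    if len(existing) >= MAX_SOURCES:
        return "a knowledge base can have at most 5 sources"
    attached_domains = {s.name for s in existing if s.source_type == "domain"}
    if candidate.source_type == "dataset" and candidate.domain in attached_domains:
        return "the dataset's domain is already attached as a source"
    return None


existing = [Source("domain", "finance")]
print(can_add_source(existing, Source("dataset", "invoices", domain="finance")))
print(can_add_source(existing, Source("dataset", "hr_docs", domain="people")))
```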

warning

For structured files, Fixed Size Chunking is the recommended strategy. Other chunking strategies may lead to sync failures with larger structured files.
If a knowledge base sync fails for a structured file, the file content may still be consumed through the Projects feature, because retrieval there is backed by SQL AI.

Sync Knowledge Base and Sources

Individual Source Sync

Step	Action	Details
1	Navigate	Go to the Sources tab
2	Initiate	Click `Sync` on your target source
3	Monitor	Track progress in the Runs tab
4	Review Metrics	View detailed source metrics: • Files scanned, deleted, and failed • Metadata files processed and indexed • New files indexed • Latest processing status
5	Verify	Check file status (INDEXED or FAILED)

Complete Knowledge Base Sync

Step	Action	Details
1	Initiate	Click `Sync` at knowledge base level
2	Monitor	Track progress in Runs tab
3	Review	Check metrics for all sources

Note

Important considerations:

Only one sync operation can run at a time per knowledge base
Sync operations run sequentially to prevent conflicts
A sync operation can run for a maximum of 6 hours
Sync duration depends on file count and size; large files require more processing time. For example, a sync of 10 files averaging 10 MB typically takes about 15-20 minutes
If a sync operation times out, please try syncing again
Email notifications confirm completion
Failed syncs automatically retry with exponential backoff
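
For readers unfamiliar with exponential backoff, the sketch below shows how retry delays grow after each failed attempt. The base delay, growth factor, and cap are assumptions chosen for illustration; this guide does not state the values the platform actually uses.

```python
from typing import List


def backoff_delays(retries: int, base_seconds: float = 30.0,
                   factor: float = 2.0, cap_seconds: float = 900.0) -> List[float]:
    """Return the wait time before each retry, doubling until the cap is hit.

    All default values are illustrative assumptions, not platform settings.
    """
    return [min(cap_seconds, base_seconds * factor ** attempt)
            for attempt in range(retries)]


delays = backoff_delays(6)
print(delays)  # [30.0, 60.0, 120.0, 240.0, 480.0, 900.0]
```

Spacing retries out this way gives a transient problem time to clear instead of repeating the same failing request immediately.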

View Sync Status

Monitoring Dashboard

Navigate to the Runs tab to view comprehensive sync details:

Information	Description
Source Name	Individual source or knowledge base
Execution Scope	Datasource/KnowledgeBase
Status	Current sync status
Start Time	Operation start timestamp
End Time	Operation completion timestamp
Synced By	User who initiated the sync
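
If you export or record run details for your own tracking, the fields in the table above map naturally onto a small record type. The sketch below is a hypothetical model, not an Amorphic API: the `SyncRun` class and the helper functions are invented for this example. It also reflects the rule that only one sync can run at a time per knowledge base.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SyncRun:
    source_name: str
    execution_scope: str           # "Datasource" or "KnowledgeBase"
    status: str                    # "SUCCEEDED", "FAILED", or "IN_PROGRESS"
    start_time: datetime
    end_time: Optional[datetime]   # None while the run is still in progress
    synced_by: str


def can_start_sync(runs: List[SyncRun]) -> bool:
    """A new sync is possible only if no run is currently in progress."""
    return all(run.status != "IN_PROGRESS" for run in runs)


def latest_run(runs: List[SyncRun]) -> Optional[SyncRun]:
    """Return the most recently started run, or None if there are no runs."""
    return max(runs, key=lambda run: run.start_time, default=None)


runs = [
    SyncRun("contracts", "Datasource", "SUCCEEDED",
            datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 9, 20), "alice"),
    SyncRun("legal-kb", "KnowledgeBase", "IN_PROGRESS",
            datetime(2025, 1, 2, 9, 0), None, "bob"),
]

print(can_start_sync(runs))         # False
print(latest_run(runs).synced_by)   # bob
```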

Detailed Metrics

Each sync operation provides:

Metric Type	Details Tracked
File Processing	• Files scanned • Files deleted • Failed files
Metadata Status	• Files scanned • Files modified • Files indexed
Index Updates	• New files indexed • Processing status • Latest results

Query Knowledge Base

The Knowledge Base provides an intuitive interface for querying your indexed content using natural language.

Step	Field	Description
1	Access Query Interface	Select your target knowledge base from the list. Click the `Test Knowledge Base` button in the top right. A chat interface window will appear.
2	Configure Query Scope	Choose your preferred scope: • Query the entire knowledge base • Select a specific data source • Target an individual file • Combine source and file selection
3	Select AI Model	Choose an appropriate AI model for your query. Recommended models for optimal results: • Claude-4.5-Sonnet • Other advanced models
4	Submit and Review	Enter your natural language query. Click `Submit` to process. Review the AI-generated response. Examine the source references provided as chunks below each response.

Note

Important considerations:

Queries only work on successfully indexed content
Use specific, well-formed questions for better accuracy
All responses include source references for verification
Access control ensures users only receive information from files they have permission to view

Best Practices

For optimal results:

Use an advanced model such as Claude-4-Sonnet
Craft clear and specific prompts
Review source references to validate responses
Start with broader queries, then refine as needed

Understanding Citations

When you query an unstructured knowledge base through AI Studio Chats or Projects, the system provides precise inline citations in the response. Citations allow you to trace each piece of information back to its exact location in the source PDF or DOCX files, ensuring transparency and verifiability.

info

Citations are only available when querying through AI Studio Chats or Projects. They are not available when using the "Test Knowledge Base" feature.

What are Citations?

Citations are reference markers embedded in responses that link specific statements back to their source documents. When a response is generated, it includes numbered references (e.g., [1], [2], [3]) that correspond to specific elements in your indexed documents.

Key Features:

Inline References: Citations appear as clickable numbered tags (e.g., [1], [2]) directly within the response text
Element-Level Precision: Each citation points to a specific document element (paragraph, table, image, etc.)
Page Image: An image of the source page is displayed in the right-side Citation panel
Visual Context: The cited content is highlighted with bounding boxes on the page for precise identification
Page-Level Navigation: Citations include the page number, file name, dataset name, and domain name for easy document navigation
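
If you post-process response text outside the UI, the numbered markers can be picked out with a regular expression. The sketch below assumes markers always take the form `[n]` with a plain integer, as in the examples above; the function name `citation_numbers` and the sample answer are made up for illustration.

```python
import re
from typing import List

# Matches markers such as [1] or [12]; assumes digits only inside the brackets.
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def citation_numbers(response_text: str) -> List[int]:
    """Return the distinct citation numbers in order of first appearance."""
    seen: List[int] = []
    for match in CITATION_PATTERN.finditer(response_text):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


answer = "Revenue grew 12% [1]. Costs were flat [2], as noted earlier [1]."
print(citation_numbers(answer))  # [1, 2]
```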

Citations in AI Studio Chats

When you query a knowledge base through AI Studio Chats, the response includes inline citation markers (e.g., [1], [2]). Clicking a citation opens a Citation Panel on the right side, displaying the source page image and metadata for verification.

Citations in AI Studio Projects

Projects provide the same citation experience as Chats. When you query a knowledge base from within a Project, the response includes numbered citation markers, and the Citation Panel displays the source document page with metadata for verification.

How Citations Work

Citations are automatically generated when you query a knowledge base through AI Studio Chats or Projects:

During Sync: Smart Extraction processes PDF and DOCX files, breaking them into searchable chunks while preserving information about paragraphs, tables, images, and their exact locations in the document
During Query: When you ask a question, the system finds the most relevant content from your documents
In Response: The system generates an answer and adds citation numbers (e.g., [1], [2]) wherever it references specific information from your documents
In Citation Panel: You can see the exact source page and verify the information yourself

File Type Support

Citations require Smart Extraction, which is only available for specific file types in specific regions:

Requirement	Details
Supported File Types	PDF (.pdf) and Word (.docx) only
Regional Availability	Citations are not supported in US East (Ohio) and Canada (Central) due to Smart Extraction unavailability
Cost	Smart Extraction is billed at $0.01 per page during sync operations
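
The per-page price makes it straightforward to estimate the Smart Extraction cost of a sync. The sketch below applies the $0.01 per page figure from the table to a list of page counts. The function name and the sample page counts are assumptions, and actual billing may differ, for example depending on which files are reprocessed in a given sync.

```python
from decimal import Decimal
from typing import Iterable

PRICE_PER_PAGE = Decimal("0.01")  # USD per page, from the table above


def smart_extraction_cost(page_counts: Iterable[int]) -> Decimal:
    """Estimate the cost in USD for processing documents with the given page counts."""
    return PRICE_PER_PAGE * sum(page_counts)


# Three hypothetical documents of 120, 45, and 300 pages (465 pages in total).
cost = smart_extraction_cost([120, 45, 300])
print(f"${cost}")  # $4.65
```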

Access Control and Permissions

Citations respect your knowledge base's access control settings:

Dataset-Level Access: Users only see citations from files in datasets they can access
Domain-Level Access (DLA): Users with DLA see citations from all datasets in the domain
File-Level Access (TBAC): For read-only users with tag-based access control, citations are filtered based on file-level tags
Deleted Files: Citations from deleted source files are not accessible

Remove Sources

To remove sources from a knowledge base:

Navigate to Knowledge Base: Go to the knowledge base details page
Select Remove Sources: Choose the option to remove data sources
Confirm Removal: The data sources will be detached from the knowledge base
Clean Up: Associated metadata will be cleaned up automatically

Note

Removing sources will make their content unavailable for querying
The operation cannot be undone
Associated metadata will be cleaned up automatically

Delete Knowledge Base

To delete a knowledge base:

Navigate to Knowledge Base: Go to the knowledge base details page
Select Delete Option: Click the delete button in the top right
Confirm Deletion: Review the warning message and confirm deletion
Automatic Cleanup: The system will automatically:
- Remove all associated data sources
- Delete corresponding indexed files
- Clean up related metadata

warning

This action is permanent and cannot be undone. Make sure you want to delete the knowledge base before confirming.

Note

The knowledge base must be in the Active state before it can be deleted.

Access Control and Permissions

The system implements robust access control:

Owner Access: Full control over knowledge base operations
Editor Access: Can modify and sync knowledge bases, including adding/removing sources and updating settings, but cannot delete knowledge bases
Reader Access: Can query knowledge bases
Resource-level Permissions: Inherited from underlying data sources
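
The role descriptions above can be summarized as a permission matrix. The sketch below is a hypothetical model for reasoning about the roles, not Amorphic's authorization code: the action names are invented, and the assumption that editors can also query is not stated explicitly in this guide.

```python
# Hypothetical action names; the roles themselves come from the list above.
PERMISSIONS = {
    "owner": {"query", "update", "sync", "add_source", "remove_source", "delete"},
    # Assumption: editors can query in addition to the actions listed above.
    "editor": {"query", "update", "sync", "add_source", "remove_source"},
    "reader": {"query"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the given role may perform the action in this model."""
    return action in PERMISSIONS.get(role.lower(), set())


print(is_allowed("editor", "sync"))     # True
print(is_allowed("editor", "delete"))   # False
print(is_allowed("reader", "query"))    # True
```

Resource-level permissions inherited from the underlying data sources still apply on top of these roles and are not modeled here.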

Best Practices

To get the most out of KnowledgeBase, consider these best practices:

Organize Sources
- Group related datasets and files logically
- Use descriptive names for knowledge bases
- Consider domain-based organization
Optimize Sync Operations
- Monitor sync status and address failures promptly
- Use incremental syncs when possible
Query Optimization
- Be specific in your questions for better results
- Use context from previous queries when relevant
- Review source attribution for accuracy verification
Access Management
- Regularly review and update access permissions
- Monitor usage patterns and adjust accordingly
- Implement least-privilege access principles

Improving Query Quality

Using clear and descriptive file names and metadata will significantly improve KnowledgeBase's ability to provide accurate responses. Consider adding tags and descriptions to your data sources when possible.

Supported File Types

File Type	Extension
Plain text (ASCII only)	.txt
Markdown	.md
HyperText Markup Language	.html
Microsoft Word document	.doc/.docx
Comma-separated values	.csv
Microsoft Excel spreadsheet	.xls/.xlsx
Portable Document Format	.pdf

Current Limitations

A maximum of 100 Knowledge Bases can be created per account
Individual files must not exceed the 50 MB size quota
Only S3 datasets are currently supported as data sources
Maximum 5 data sources can be attached per Knowledge Base
Knowledge Base names must be unique within your account
Sync operations run sequentially - only one sync can be active at a time
Sync operations have a maximum duration of 6 hours; if a timeout occurs, try syncing again
Knowledge base queries are limited to indexed content only
Large files may take significant time to index
Query responses are based on indexed chunks and may not include full context
Real-time updates require manual sync operations
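
A quick local pre-check can catch files that would be rejected because of their type or size before you upload and sync them. The sketch below uses the extensions from the Supported File Types table and the 50 MB quota listed above. The function name `precheck` is hypothetical, and interpreting 50 MB as 50 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".html", ".doc", ".docx", ".csv", ".xls", ".xlsx", ".pdf",
}
MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: 50 MB read as binary megabytes


def precheck(file_name: str, size_bytes: int) -> List[str]:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems: List[str] = []
    extension = Path(file_name).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds the 50 MB quota")
    return problems


print(precheck("report.PDF", 1024))                  # []
print(precheck("archive.zip", 10))                   # ['unsupported extension: .zip']
print(precheck("big.pdf", 60 * 1024 * 1024))         # ['file exceeds the 50 MB quota']
```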
