Introducing Built-in Embedding Generation
Antarys now handles the complete vector pipeline—from raw text to searchable embeddings—without external dependencies.
Vector databases have always promised a simple workflow: take your data, convert it to vectors, and query it semantically. In practice, this has required juggling multiple services—an embedding API, a vector database, and the infrastructure to connect them.
Today, we're changing that with Antarys's built-in embedding API.
The Traditional Pipeline
Building semantic search typically looks like this:
Raw Data → External Embedding API → Vector Database → Query Results
           (OpenAI, Cohere, etc.)   (Separate service)

This works, but introduces friction:
- Multiple API calls: Your data makes a round trip to an external service before reaching storage
- Network latency: Each embedding request adds milliseconds or seconds
- API costs: Pay per embedding, which adds up quickly at scale
- Dependency risk: Your search depends on external service availability
- Data privacy: Sensitive data leaves your infrastructure
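To make the friction concrete, here is a rough sketch of that two-hop pipeline. It assumes OpenAI's Python SDK for the embedding call; the second hop into whatever vector database you run needs its own client and its own round trip.

# Sketch of the traditional two-service pipeline (assumes OpenAI's Python SDK)
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = [
    "Python is great for data science",
    "JavaScript powers modern web apps"
]

# Hop 1: raw text leaves your infrastructure, billed per request
response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
vectors = [item.embedding for item in response.data]

# Hop 2: a separate vector database client upserts `vectors`
# using its own SDK, configuration, and network path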
The Antarys Approach
With built-in embedding generation, the pipeline simplifies:
Raw Data → Antarys (Embedding + Storage) → Query Results
           (Single service)

Here's what this looks like in practice:
import antarys

client = await antarys.create_client("http://localhost:8080")

# Your raw documents
documents = [
    "Python is great for data science",
    "JavaScript powers modern web apps",
    "Machine learning transforms industries"
]

# Generate embeddings directly
embeddings = await antarys.embed_documents(client, documents)

# Store them immediately
vector_ops = client.vector_operations("docs")
records = [
    {
        "id": f"doc_{i}",
        "values": embedding,
        "metadata": {"text": doc}
    }
    for i, (doc, embedding) in enumerate(zip(documents, embeddings))
]
await vector_ops.upsert(records)

# Query semantically
query_emb = await antarys.embed_query(client, "AI applications")
results = await vector_ops.query(vector=query_emb, top_k=5)

Three operations. One service. No external dependencies.
What This Enables
Faster Development
Skip the boilerplate of integrating multiple APIs. Deploy semantic search in minutes instead of hours.
Lower Latency
Embeddings generate on the same server that stores them. No network hops to external services.
Predictable Costs
No per-embedding API fees. Run as many embeddings as your hardware supports.
Data Privacy
Your text never leaves your infrastructure. Critical for regulated industries or sensitive data.
Simpler Operations
One service to deploy, monitor, and scale. No coordination between embedding and storage layers.
The Technical Foundation
Antarys uses optimized ONNX models for embedding generation:
- BGE models: State-of-the-art retrieval performance from BAAI
- Quantized weights: Smaller memory footprint without accuracy loss
- CPU optimization: Fast inference without GPU requirements
- Batched processing: Efficient throughput for large datasets
The embedding API validates model integrity on startup, ensuring corrupted downloads don't cause runtime failures.
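If you are embedding a large corpus, one simple way to lean on that batched processing is to chunk the input yourself and reuse the embed_documents call from the example above. The helper below is a sketch; the batch size of 64 is an arbitrary illustration, not a documented default.

# Sketch: chunk a large corpus and embed it batch by batch.
# The batch size is an arbitrary choice for illustration.
import antarys

async def embed_corpus(client, texts, batch_size=64):
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(await antarys.embed_documents(client, batch))
    return embeddings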
When to Use External Embedding APIs
Built-in embeddings aren't always the answer:
- Multilingual requirements: If you need 100+ languages, external APIs may have broader coverage
- Domain-specific models: Specialized models (medical, legal, code) often come from external providers
- Higher-dimensional embeddings: If you need vectors with more than 768 dimensions
- GPU constraints: External APIs handle GPU infrastructure for you
For most applications—semantic search, content recommendations, document clustering—built-in embeddings provide the best balance of simplicity, performance, and cost.
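And mixing the two is straightforward: vectors produced by an external or domain-specific model can be stored and queried exactly like the built-in ones, since upsert accepts any precomputed values. In the sketch below, get_domain_embeddings is a hypothetical stand-in for whichever external provider you call.

# Sketch: store externally generated vectors in Antarys.
# `get_domain_embeddings` is a hypothetical wrapper around an external
# embedding provider; Antarys only sees the finished vectors.
texts = ["clause 4.2 limits liability", "the warranty period is 12 months"]
external_vectors = get_domain_embeddings(texts)  # hypothetical external call

vector_ops = client.vector_operations("contracts")
await vector_ops.upsert([
    {
        "id": f"clause_{i}",
        "values": vec,
        "metadata": {"text": text}
    }
    for i, (text, vec) in enumerate(zip(texts, external_vectors))
])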
Getting Started
Enable embedding generation when starting Antarys:
# Download requirements
antarys download onnx
antarys download embedding 4
# Start server with embeddings
antarys --enable-embedding --embedding-model 4

Then use the embedding API from any client:
import { createClient, embed } from 'antarys';
const client = createClient('http://localhost:8080');
const embeddings = await embed(client, ['text 1', 'text 2']);

Future Roadmap
Current State
The embedding API generates text embeddings from raw data using CPU-optimized ONNX models. This works well for most applications, providing fast inference without GPU requirements.
Next: GPU-Accelerated Embeddings
The next update will add GPU support for parallel embedding generation. This will dramatically increase throughput for large-scale batch operations—think millions of documents instead of thousands.
Unified Memory Buffers
A major optimization coming soon: when you generate embeddings and immediately store them in a collection, Antarys will reuse the same memory buffers throughout the pipeline.
Here's how it works:
# Future API (not yet available)
embeddings = await antarys.embed_documents(
    client,
    documents,
    collection="docs",  # Specify target collection
    auto_upsert=True    # Store directly
)

When you specify a target collection, Antarys will:
- Know the data size upfront
- Allocate memory buffers once
- Generate embeddings in those buffers
- Store vectors from the same buffers
- Reuse buffers for subsequent operations
This eliminates memory copies between embedding generation and storage, cutting latency significantly for large batches.
Planned: Vector Database Events
Native pub/sub notifications for database changes:
# Planned API (design phase)
@client.on_vector_change("docs")
async def handle_change(event):
    if event.type == "upsert":
        print(f"New vectors: {event.ids}")
    elif event.type == "commit":
        print("Data persisted to disk")

This enables reactive applications that respond to database changes without polling—useful for cache invalidation, index rebuilding, or triggering downstream workflows.
Platform Support: Built-in embeddings currently support macOS and Linux (ARM64/x64). The API accepts any vectors, so you can still use external embedding providers on all platforms.