<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://aigateway.envoyproxy.io/blog</id>
    <title>Envoy AI Gateway Blog</title>
    <updated>2025-12-08T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://aigateway.envoyproxy.io/blog"/>
    <subtitle>Envoy AI Gateway Blog</subtitle>
    <icon>https://aigateway.envoyproxy.io/img/favicon.ico</icon>
    <entry>
        <title type="html"><![CDATA[The Reality and Performance of MCP Traffic Routing with Envoy AI Gateway]]></title>
        <id>https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway</id>
        <link href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway"/>
        <updated>2025-12-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[In this article, we explore how Envoy AI Gateway handles stateful MCP sessions, keeps performance competitive, and stays aligned with the broader Envoy ecosystem.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" src="https://aigateway.envoyproxy.io/assets/images/mcp-routing-feature-d7db3e6242e2cc1000f3c076af1930e0.png" width="1200" height="628" class="img_ev3q"></p>
<p>Envoy AI Gateway (AIGW) provides a production-ready bridge between AI agents and their tools by handling Model Context Protocol (MCP) traffic. As teams adopt MCP, questions about scale, performance, and architecture naturally arise.</p>
<p>This post addresses those questions by first clearing up common misunderstandings, then diving into the actual architecture, the design choices we made (and why), and how you can test and evaluate whether it is the right solution for your system.</p>
<p><strong>This post will give you the context you need to:</strong></p>
<ul>
<li class="">Evaluate MCP routing in Envoy AI Gateway with realistic expectations</li>
<li class="">Understand the design decisions we made, why we made them, and how they affect you</li>
<li class="">Learn how to configure and tune MCP routing in Envoy AI Gateway to meet your needs</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-1-common-misconceptions"><strong>Part 1: Common Misconceptions</strong><a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#part-1-common-misconceptions" class="hash-link" aria-label="Direct link to part-1-common-misconceptions" title="Direct link to part-1-common-misconceptions" translate="no">​</a></h2>
<p>Before discussing how it works, let’s address how people <em>think</em> it works. There are a few frequent misconceptions about how Envoy AI Gateway handles MCP traffic.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="misconception-1-ai-gateways-mcp-implementation-is-slow"><strong>Misconception 1:</strong> "AI Gateway's MCP implementation is slow."<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#misconception-1-ai-gateways-mcp-implementation-is-slow" class="hash-link" aria-label="Direct link to misconception-1-ai-gateways-mcp-implementation-is-slow" title="Direct link to misconception-1-ai-gateways-mcp-implementation-is-slow" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="reality">Reality<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#reality" class="hash-link" aria-label="Direct link to Reality" title="Direct link to Reality" translate="no">​</a></h4>
<p>The AIGW MCP implementation offers performance comparable to other cloud-native solutions while providing full access to all Envoy traffic-handling features.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-nuance">The Nuance<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#the-nuance" class="hash-link" aria-label="Direct link to The Nuance" title="Direct link to The Nuance" translate="no">​</a></h4>
<p>The default configuration settings balance performance, functionality, and security. If you want to further reduce latency based on your use case, you can easily adjust the configuration to prioritize raw speed in internal or low-latency environments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="misconception-2-envoy-ai-gateway-is-a-separate-project-that-ignores-core-envoy"><strong>Misconception 2:</strong> "Envoy AI Gateway is a separate project that ignores core Envoy."<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#misconception-2-envoy-ai-gateway-is-a-separate-project-that-ignores-core-envoy" class="hash-link" aria-label="Direct link to misconception-2-envoy-ai-gateway-is-a-separate-project-that-ignores-core-envoy" title="Direct link to misconception-2-envoy-ai-gateway-is-a-separate-project-that-ignores-core-envoy" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="reality-1">Reality<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#reality-1" class="hash-link" aria-label="Direct link to Reality" title="Direct link to Reality" translate="no">​</a></h4>
<p>Envoy AI Gateway is deeply integrated into the Envoy Ecosystem. It extends the Envoy Gateway control plane and leverages Envoy Proxy data-plane extensions.</p>
<p>It is not a fork or a side project running in isolation; it is designed to feed proven implementations back into Envoy core, accelerating the adoption of emerging patterns like MCP that require faster iteration than the core release cycle typically allows.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="misconception-3-envoy-is-not-able-to-handle-ai-traffic"><strong>Misconception 3:</strong> “Envoy is not able to handle AI traffic.”<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#misconception-3-envoy-is-not-able-to-handle-ai-traffic" class="hash-link" aria-label="Direct link to misconception-3-envoy-is-not-able-to-handle-ai-traffic" title="Direct link to misconception-3-envoy-is-not-able-to-handle-ai-traffic" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="reality-2">Reality<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#reality-2" class="hash-link" aria-label="Direct link to Reality" title="Direct link to Reality" translate="no">​</a></h4>
<p>The Envoy proxy's architecture is ideally suited for managing AI traffic, including use cases like MCP. Both LLM and MCP traffic ultimately rely on HTTP. As a cloud-native, battle-tested HTTP proxy utilized at scale for nearly a decade, Envoy is a proven solution.</p>
<p>Furthermore, the <a href="https://github.com/search?q=repo%3Aenvoyproxy%2Fenvoy+mcp&amp;type=pullrequests&amp;s=created&amp;o=desc" target="_blank" rel="noopener noreferrer" class="">Envoy community is actively working</a> toward making the core Envoy proxy fully MCP-compliant. This future compatibility means Envoy AI Gateway users will be able to leverage the standard Envoy proxy for their data plane needs.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-nuance-1">The Nuance<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#the-nuance-1" class="hash-link" aria-label="Direct link to The Nuance" title="Direct link to The Nuance" translate="no">​</a></h4>
<p>Currently, Envoy AI Gateway leverages Envoy Proxy extensions to fill functionality gaps and enable MCP routing via Envoy Proxy. This is possible thanks to Envoy Proxy's extensible architecture, which lets us respond rapidly to the traffic-handling challenges that the new era of AI presents.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-2-the-actual-architecture"><strong>Part 2: The Actual Architecture</strong><a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#part-2-the-actual-architecture" class="hash-link" aria-label="Direct link to part-2-the-actual-architecture" title="Direct link to part-2-the-actual-architecture" translate="no">​</a></h2>
<p>The core challenge of MCP is that it is currently a <strong>stateful protocol</strong>. A session ID must be reused across calls so the server can maintain context. When you place a gateway in the middle, a single client session might need to route to multiple upstream MCP servers (e.g., GitHub, Jira, local files), each maintaining its own independent session.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-stateless-mcp-routing-design">The "Stateless" MCP Routing Design<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#the-stateless-mcp-routing-design" class="hash-link" aria-label="Direct link to The &quot;Stateless&quot; MCP Routing Design" title="Direct link to The &quot;Stateless&quot; MCP Routing Design" translate="no">​</a></h3>
<p>Instead of maintaining persistent session mappings in a centralized session store, Envoy AI Gateway uses a <strong>token-encoding architecture</strong>.</p>
<ol>
<li class=""><strong>Initialization:</strong> When an agent initializes an MCP session, the gateway establishes upstream sessions with the backend servers (e.g., GitHub, Jira).</li>
<li class=""><strong>Secure session encoding:</strong> The gateway builds a compact description of these upstream sessions and wraps it into a safe, self-contained client session ID. This ensures the session mapping is tamper-proof and that the client cannot see the internal topology.</li>
<li class=""><strong>Routing:</strong> The client receives this session ID. On subsequent requests, the client returns it. Any gateway replica can decode it and immediately route traffic to the correct upstream servers.</li>
</ol>
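<p>To make the flow above concrete, here is a minimal, hypothetical Python sketch of the token-encoding idea. It is <em>not</em> the gateway's actual code: Envoy AI Gateway encrypts the payload so clients cannot see the internal topology, whereas this standard-library sketch only signs it with HMAC to make it tamper-evident. The secret key and session IDs are placeholders.</p>

```python
import base64
import hashlib
import hmac
import json

SECRET = b"gateway-shared-secret"  # hypothetical key shared by all gateway replicas

def encode_session(upstream_sessions: dict) -> str:
    """Pack upstream session IDs into one self-contained, tamper-evident token."""
    payload = json.dumps(upstream_sessions, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode()

def decode_session(token: str) -> dict:
    """Any replica holding SECRET can verify and unpack the token statelessly."""
    raw = base64.urlsafe_b64decode(token.encode())
    tag, payload = raw[:32], raw[32:]
    if not hmac.compare_digest(tag, hmac.new(SECRET, payload, hashlib.sha256).digest()):
        raise ValueError("tampered session token")
    return json.loads(payload)

# Initialization: the gateway opens upstream sessions, then returns one client session ID.
client_session_id = encode_session({"github": "sess-abc", "jira": "sess-xyz"})

# Routing: any replica decodes the ID and routes to the right upstream sessions.
assert decode_session(client_session_id)["github"] == "sess-abc"
```

<p>Because the token carries the whole upstream-session mapping and every replica shares the key, no replica needs a session store to route correctly.</p>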
<p><img decoding="async" loading="lazy" src="https://aigateway.envoyproxy.io/assets/images/mcp-routing-47d4c56aeb8b3335a74aba6a88e104be.png" width="1042" height="327" class="img_ev3q"></p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Tune performance to your needs</div><div class="admonitionContent_BuS1"><p>This design allows Envoy AI Gateway users to further configure and tune performance to meet their needs.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="understanding-and-tuning-performance">Understanding and Tuning Performance<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#understanding-and-tuning-performance" class="hash-link" aria-label="Direct link to Understanding and Tuning Performance" title="Direct link to Understanding and Tuning Performance" translate="no">​</a></h3>
<p>The performance overhead users may observe stems from the key‑derivation function (KDF) used during session encryption, not from MCP routing itself. Envoy AI Gateway encrypts the session ID to prevent leaking internal details to MCP clients.</p>
<p>The default settings favor stronger key derivation over raw speed, leading to two regimes:</p>
<ul>
<li class=""><strong>Default encryption settings (e.g., 100k KDF iterations):</strong> Adds on the order of <strong>tens of milliseconds</strong> of overhead per new session.</li>
<li class=""><strong>Tuned encryption settings (e.g., ~100 iterations):</strong> Overhead drops to around <strong>1–2 ms per session,</strong> comparable to other MCP gateways.</li>
</ul>
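<p>The effect of the iteration count is easy to reproduce. The following standard-library Python sketch is illustrative only; it is not the gateway's actual KDF code, and the salt and input material are placeholders. It times a PBKDF2 derivation at the two regimes described above.</p>

```python
import hashlib
import time

def derive_key(session_secret: bytes, iterations: int) -> bytes:
    # PBKDF2-HMAC-SHA256; the iteration count trades CPU cost for brute-force resistance.
    return hashlib.pbkdf2_hmac("sha256", session_secret, b"fixed-salt", iterations, dklen=32)

def time_kdf(iterations: int) -> float:
    start = time.perf_counter()
    derive_key(b"session-material", iterations)
    return time.perf_counter() - start

slow = time_kdf(100_000)  # default-like setting: tens of milliseconds on typical hardware
fast = time_kdf(100)      # tuned setting: a tiny fraction of a millisecond of KDF work
print(f"100k iterations: {slow * 1000:.1f} ms, 100 iterations: {fast * 1000:.3f} ms")
```

<p>The per-session overhead scales roughly linearly with the iteration count, which is why dropping from 100k to around 100 iterations moves a new session from tens of milliseconds into the low single digits.</p>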
<p>The following graph illustrates proxy performance across different session encryption values; you can run the benchmarks on your own hardware to explore what works best for you.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>Benchmark Setup</div><div class="admonitionContent_BuS1"><p>These benchmarks were run on a MacBook Pro 17,1 (M1) laptop with 8 cores to demonstrate the impact of session encryption on overall performance.</p></div></div>
<p>The benchmarks measure the time taken to call a simple “echo” tool directly vs. calling it via Envoy AI Gateway with different session encryption configurations.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>Analysis of Proxy Performance with Session Encryption</div><div class="admonitionContent_BuS1"><p><em>Average execution time in milliseconds (lower is better)</em>
<img decoding="async" loading="lazy" src="https://aigateway.envoyproxy.io/assets/images/mcp-routing-benchmark-chart-d70e7a84a7322e6e7198ba658ca6ed45.png" width="2934" height="1652" class="img_ev3q"></p></div></div>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>You can run these benchmarks yourself to explore how configuring encryption settings meets your performance needs.</p></div></div>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-3-how-we-arrived-at-this-design-and-why"><strong>Part 3: How We Arrived at This Design (and Why)</strong><a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#part-3-how-we-arrived-at-this-design-and-why" class="hash-link" aria-label="Direct link to part-3-how-we-arrived-at-this-design-and-why" title="Direct link to part-3-how-we-arrived-at-this-design-and-why" translate="no">​</a></h2>
<p>When designing this, we faced a binary architectural choice: <strong>Keep state in the gateway</strong> or <strong>Encode state in the client</strong>.</p>
<p><img decoding="async" loading="lazy" src="https://aigateway.envoyproxy.io/assets/images/mcp-routing-options-406b5cd9d826a5b96ed4de737bb943b1.png" width="3998" height="1692" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-alternative-we-rejected-centralized-state">The Alternative We Rejected: Centralized State<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#the-alternative-we-rejected-centralized-state" class="hash-link" aria-label="Direct link to The Alternative We Rejected: Centralized State" title="Direct link to The Alternative We Rejected: Centralized State" translate="no">​</a></h3>
<p>One approach (Option One) would have been to return a UUID to the client and store the mapping (<code>UUID -&gt; [Upstream_Session_1, Upstream_Session_2]</code>) in a shared store like Redis.</p>
<ul>
<li class=""><strong>The Problem:</strong>
<ul>
<li class=""><strong>More components to manage:</strong> This introduces an additional component and dependency. To scale the gateway, you must also scale and manage a highly available session store. If that store fails, all traffic stops.</li>
<li class=""><strong>High Availability Architecture Complexity:</strong> To make the solution multi-region and highly available, you must ensure that the data store backing the gateway is replicated across regions.</li>
</ul>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-path-we-chose-encoded-state">The Path We Chose: Encoded State<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#the-path-we-chose-encoded-state" class="hash-link" aria-label="Direct link to The Path We Chose: Encoded State" title="Direct link to The Path We Chose: Encoded State" translate="no">​</a></h3>
<p>We chose the second option—encoding the session information into the client session ID—for three specific reasons:</p>
<ol>
<li class=""><strong>Simple Horizontal Scaling:</strong> Because the session information travels with the request, you can spin up new gateway replicas instantly. Any replica can handle any request without needing to sync with a database.</li>
<li class=""><strong>Operational Simplicity:</strong> This eliminates the need to provision, secure, and operate an entire additional component.</li>
<li class=""><strong>Failure Isolation:</strong> There is no shared session store that can become a single point of failure; a replica outage does not stop traffic for clients served by other replicas.</li>
</ol>
<p>Encoding session information into the client token is the <strong>current</strong> design and will evolve to meet the needs of AI traffic routing. Any evolution will follow the principles of focusing on real‑world production use cases, maintaining consistent configuration, and aligning with Envoy core.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="addressing-the-trade-offs">Addressing the Trade-offs<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#addressing-the-trade-offs" class="hash-link" aria-label="Direct link to Addressing the Trade-offs" title="Direct link to Addressing the Trade-offs" translate="no">​</a></h4>
<p>This design choice creates a deliberate trade-off: we accept a negligible amount of compute overhead in exchange for operational simplicity to address current use cases and traffic routing needs.</p>
<p>Instead of paying the cost of managing, scaling, and querying a database for every request, the gateway simply spends a few CPU cycles processing the token. This is an efficient exchange that keeps the architecture clean and stateless.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-4-evaluation--is-it-right-for-you"><strong>Part 4: Evaluation — Is it Right for You?</strong><a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#part-4-evaluation--is-it-right-for-you" class="hash-link" aria-label="Direct link to part-4-evaluation--is-it-right-for-you" title="Direct link to part-4-evaluation--is-it-right-for-you" translate="no">​</a></h2>
<p>The best way to decide if this architecture meets your needs is to test it against your specific performance and security requirements.</p>
<ol>
<li class=""><strong>Configure an MCP route in Envoy AI Gateway:</strong> Point your agent at AIGW and <strong>experiment with encryption tuning.</strong></li>
<li class=""><strong>Run the benchmark harness</strong> that you can <a href="https://github.com/envoyproxy/ai-gateway/blob/main/tests/bench/bench_test.go" target="_blank" rel="noopener noreferrer" class="">find here</a> in the Envoy AI Gateway repo.</li>
</ol>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Clone the repo and build the latest CLI</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">git clone git@github.com:envoyproxy/ai-gateway.git</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">cd ai-gateway</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">make build.aigw</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Run the benchmarks</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">go test -timeout=15m -run='^$' -bench=. -count=10 ./tests/bench/...</span><br></span></code></pre></div></div>
<ol start="3">
<li class=""><strong>Share your experience:</strong> Open an issue or join the community channels.</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-5-the-envoy-ecosystem-alignment"><strong>Part 5: The Envoy Ecosystem Alignment</strong><a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#part-5-the-envoy-ecosystem-alignment" class="hash-link" aria-label="Direct link to part-5-the-envoy-ecosystem-alignment" title="Direct link to part-5-the-envoy-ecosystem-alignment" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="part-of-the-battle-tested-envoy-ecosystem">Part of the Battle-Tested Envoy Ecosystem<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#part-of-the-battle-tested-envoy-ecosystem" class="hash-link" aria-label="Direct link to Part of the Battle-Tested Envoy Ecosystem" title="Direct link to Part of the Battle-Tested Envoy Ecosystem" translate="no">​</a></h3>
<p>Envoy AI Gateway is part of the Envoy ecosystem, which offers key benefits:</p>
<ul>
<li class=""><strong>Battle‑tested data plane:</strong> Routing traffic through AIGW is built on Envoy Proxy’s decade of experience as a high‑performance, production‑grade data plane.</li>
<li class=""><strong>Ongoing core MCP work:</strong> AIGW can quickly adapt to new AI patterns and feed proven implementations back into Envoy core. AIGW leverages the proven Envoy Proxy extension mechanisms to rapidly address the traffic-handling needs of AI system builders.</li>
<li class=""><strong>Shared investments and community:</strong> Improvements to observability, security, and performance in Envoy generally benefit AI workloads.</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary-and-conclusion">Summary and Conclusion<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#summary-and-conclusion" class="hash-link" aria-label="Direct link to Summary and Conclusion" title="Direct link to Summary and Conclusion" translate="no">​</a></h2>
<p>Envoy AI Gateway (AIGW) addresses the challenges of routing Model Context Protocol (MCP) traffic, specifically the need to handle long-lived, stateful sessions at scale.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-design-decisions">Key Design Decisions<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#key-design-decisions" class="hash-link" aria-label="Direct link to Key Design Decisions" title="Direct link to Key Design Decisions" translate="no">​</a></h3>
<table><thead><tr><th style="text-align:left">Feature</th><th style="text-align:left">Design Choice in AIGW</th><th style="text-align:left">Benefit</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Session State Management</strong></td><td style="text-align:left">Encode session information into an encrypted client session ID (stateless gateway).</td><td style="text-align:left">Simple horizontal scaling, failure isolation, and operational simplicity (no shared session store required).</td></tr><tr><td style="text-align:left"><strong>Latency</strong></td><td style="text-align:left">Conservative, security-heavy default encryption settings introduce latency during encryption and decryption.</td><td style="text-align:left">High integrity and defense-in-depth, configurable to tune speed vs. security (tens of milliseconds by default; tunable to 1–2 ms).</td></tr><tr><td style="text-align:left"><strong>Ecosystem Alignment</strong></td><td style="text-align:left">Integrated with the battle-tested Envoy Proxy data plane, and investing in enhancing Envoy Proxy to support MCP and AI traffic.</td><td style="text-align:left">Reliability, high performance, and rapid adaptation to new AI/MCP patterns, allowing adopters of the AIGW control plane to benefit from Envoy ecosystem advancements.</td></tr><tr><td style="text-align:left"><strong>Best Practice</strong></td><td style="text-align:left">Encourage agent context, appropriate routing patterns, and tool grouping.</td><td style="text-align:left">Keeps session tokens compact, enables granular security policy enforcement, and improves LLM performance.</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://aigateway.envoyproxy.io/blog/mcp-in-envoy-ai-gateway#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Envoy AI Gateway is a production-ready solution for managing stateful MCP traffic, prioritizing horizontal scalability and operational simplicity through its session-encoding design. While the default cryptographic settings offer strong integrity and add slight latency, this overhead is explicitly configurable to meet various performance benchmarks.</p>
<p>As part of the Envoy ecosystem, AIGW is positioned to evolve rapidly, incorporating features like optional session persistence and advanced policy based on MCP metadata, ensuring it remains the reliable control point for connecting LLM agents to their tools. Teams are encouraged to test performance with the provided benchmark harness and share real-world usage feedback.</p>]]></content>
        <author>
            <name>Ignasi Barrera</name>
            <uri>https://nacx.dev</uri>
        </author>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <category label="Features" term="Features"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing Model Context Protocol Support in Envoy AI Gateway]]></title>
        <id>https://aigateway.envoyproxy.io/blog/mcp-implementation</id>
        <link href="https://aigateway.envoyproxy.io/blog/mcp-implementation"/>
        <updated>2025-10-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Envoy AI Gateway now supports Model Context Protocol (MCP), bringing enterprise-grade security, routing, and observability to AI agent tool integrations with full spec compliance, OAuth authentication, and zero-friction deployment.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Hero feature image of title." src="https://aigateway.envoyproxy.io/assets/images/mcp-blog-hero-515b7436853af8caf3727f1810e6476b.png" width="1920" height="1080" class="img_ev3q"></p>
<p>We’re excited to announce that the next release of Envoy AI Gateway will introduce first-class support for <a href="https://modelcontextprotocol.io/" target="_blank" rel="noopener noreferrer" class="">Model Context Protocol</a> (MCP), cementing Envoy AI Gateway (EAIGW) as the universal gateway for modern production AI workloads.</p>
<p>Envoy AI Gateway started in close collaboration with <a href="https://www.bloomberg.com/" target="_blank" rel="noopener noreferrer" class="">Bloomberg</a> and <a href="https://tetrate.io/" target="_blank" rel="noopener noreferrer" class="">Tetrate</a> to meet production-scale AI workload demands, combining real-world expertise and innovation from some of the industry’s largest adopters. Built upon the battle-tested <a href="https://www.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Proxy</a> data plane as the AI extension of Envoy Gateway, it is trusted for critical workloads by thousands of enterprises worldwide. EAIGW already provides unified LLM access, cost and quota enforcement, credential management, intelligent routing, resiliency, and robust observability for mission-critical AI traffic.</p>
<p>With the addition of MCP, we have brought these features to the communication between Agents and external tools, making EAIGW even more versatile for enterprise-scale AI deployments. For a deeper look at the collaborative story and technical vision, see the <a href="https://tetrate.io/blog/tetrate-bloomberg-collaborating-on-envoy-ai-gateway" target="_blank" rel="noopener noreferrer" class="">Bloomberg partnership announcement</a>, their <a href="https://www.bloomberg.com/company/press/tetrate-and-bloomberg-release-open-source-envoy-ai-gateway-built-on-cncfs-envoy-gateway-project/" target="_blank" rel="noopener noreferrer" class="">official release coverage</a>, and previous <a class="" href="https://aigateway.envoyproxy.io/blog/01-release-announcement">project announcements</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-mcp-matters-for-ai-gateways">Why MCP Matters for AI Gateways<a href="https://aigateway.envoyproxy.io/blog/mcp-implementation#why-mcp-matters-for-ai-gateways" class="hash-link" aria-label="Direct link to Why MCP Matters for AI Gateways" title="Direct link to Why MCP Matters for AI Gateways" translate="no">​</a></h2>
<p>MCP is rapidly becoming the industry’s open standard for enabling AI agents to securely and flexibly connect to external tools and data sources. As the AI ecosystem shifts from monolithic models to agentic architectures, <strong>building robust, policy-driven, and observable pathways between AI and the rest of the enterprise stack has never been more critical</strong>. Integrating MCP directly into Envoy AI Gateway means:</p>
<ul>
<li class=""><strong>Seamless interoperability</strong> between AI agents, tools, and context providers, whether they’re third-party cloud LLMs or internal enterprise services.</li>
<li class=""><strong>Consistent security and governance</strong>: The gateway can now apply fine-grained authentication, authorization, and observability over tool invocations and data access flowing through MCP.</li>
<li class=""><strong>Accelerated development</strong>: With MCP supported natively, teams can adopt the latest agent-based AI flows <strong>on their existing Envoy infrastructure without custom or glue code.</strong></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-features-in-the-first-implementation">Key Features in the First Implementation<a href="https://aigateway.envoyproxy.io/blog/mcp-implementation#key-features-in-the-first-implementation" class="hash-link" aria-label="Direct link to Key Features in the First Implementation" title="Direct link to Key Features in the First Implementation" translate="no">​</a></h2>
<p>The initial implementation targets reliable support for the latest version of the spec, covering the full spectrum of features rather than focusing only on tool calling:</p>
<table><thead><tr><th><strong>Feature</strong></th><th><strong>Description</strong></th></tr></thead><tbody><tr><td><strong>Streamable HTTP Transport</strong></td><td>Full support for MCP’s streamable HTTP transport, aligning with the <a href="https://modelcontextprotocol.io/specification/2025-06-18" target="_blank" rel="noopener noreferrer" class="">June 2025 MCP spec</a>.<br>Efficient handling of stateful sessions and multi-part JSON-RPC messaging over persistent HTTP connections.</td></tr><tr><td><strong>OAuth Authorization</strong></td><td>Native enforcement of <a href="https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization" target="_blank" rel="noopener noreferrer" class="">OAuth authentication flows</a> for AI agents and services bridging via MCP, ensuring secure tool usage at scale.<br>Backwards compatibility with the <a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/authorization" target="_blank" rel="noopener noreferrer" class="">previous version of the authorization spec</a> to maximize compatibility with existing agents.</td></tr><tr><td><strong>MCP Server multiplexing, Tool Routing, and Filtering</strong></td><td>Route tool calls and notifications to the right MCP backends, aggregating and filtering available tools based on gateway policy.<br>Dynamically aggregate, merge, and filter messages and streaming notifications from multiple MCP servers, so agents receive a unified, policy-governed interface to all available services.</td></tr><tr><td><strong>Upstream Authentication</strong></td><td>Built-in upstream authentication primitives to securely connect to external MCP servers, with support for credential injection and validation using existing Envoy Gateway patterns.</td></tr><tr><td><strong>Full MCP Spec Coverage</strong></td><td>Complete <a href="https://modelcontextprotocol.io/specification/2025-06-18" target="_blank" rel="noopener noreferrer" class="">June 2025 MCP spec</a> compliance, including support 
for tool calls, notifications, prompts, resources, and bi-directional server-to-client requests.<br>Robust session and stream management, including reconnection logic (e.g., Last-Event-ID for SSE), ensuring resilience and correctness for long-lived agent conversations.</td></tr><tr><td><strong>Zero Friction Developer and Production UX</strong></td><td>All features are supported in standalone mode, allowing developers to start the Envoy AI Gateway locally on their machine with a single command and start leveraging all MCP features. These configurations can be used as-is in production environments, as there is full compatibility between local standalone mode and Kubernetes.</td></tr><tr><td><strong>Tested in Real Life</strong></td><td>The implementation has full protocol test coverage and is tested with real-world providers like GitHub and agents like <a href="https://block.github.io/goose/" target="_blank" rel="noopener noreferrer" class="">Goose</a>, validating end-to-end functionality in the existing ecosystem.</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="under-the-hood">Under the Hood<a href="https://aigateway.envoyproxy.io/blog/mcp-implementation#under-the-hood" class="hash-link" aria-label="Direct link to Under the Hood" title="Direct link to Under the Hood" translate="no">​</a></h2>
<p>Adding MCP support meant more than just passing bytes through the stack. Our approach <strong>leverages Envoy’s architecture and existing features</strong>, introducing a lightweight MCP Proxy that handles session management, multiplexes streams, and bridges the gap between the stateful JSON-RPC protocol and the broader Envoy extension mechanisms.</p>
<p>Key design decisions included:</p>
<ul>
<li class=""><strong>Minimal Architectural Complexity</strong>: The implementation adds no additional components or complexity to Envoy AI Gateway’s existing architecture</li>
<li class=""><strong>Fully leverages Envoy's networking stack</strong>: The MCP proxy harnesses Envoy’s proven networking stack for connection management, load balancing, circuit breaking, rate-limiting, observability, etc.</li>
<li class=""><strong>Decoupled Iteration Velocity</strong>: The MCP Proxy is implemented as a lightweight Go server to keep pace with the rapidly evolving MCP specification, while still relying on Envoy for networking primitives.</li>
</ul>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>You can find more details about the architecture and design decisions in the design document of the <a href="https://github.com/envoyproxy/ai-gateway/pull/1260" target="_blank" rel="noopener noreferrer" class="">MCP contribution pull request</a>.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting Started<a href="https://aigateway.envoyproxy.io/blog/mcp-implementation#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started" translate="no">​</a></h2>
<p>One of the easiest ways to get started with the MCP features in Envoy AI Gateway is to run it on your machine in standalone mode, with no friction. In standalone mode, you can start the MCP Gateway with the same MCP servers configuration file that agents like Claude Code already use.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="use-your-existing-mcp-servers-file">Use your existing MCP servers file<a href="https://aigateway.envoyproxy.io/blog/mcp-implementation#use-your-existing-mcp-servers-file" class="hash-link" aria-label="Direct link to Use your existing MCP servers file" title="Direct link to Use your existing MCP servers file" translate="no">​</a></h3>
<p>In the following example, we’ll start Envoy AI Gateway to proxy the GitHub and Context7 MCP servers. First, define the servers in the <code>mcp-servers.json</code> file:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"mcpServers"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"context7"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"http"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://mcp.context7.com/mcp"</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"github"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"http"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://api.githubcopilot.com/mcp/readonly"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"headers"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"Authorization"</span><span class="token operator" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bearer ${GITHUB_ACCESS_TOKEN}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<p>And then start Envoy AI Gateway:</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Refer to the <a class="" href="https://aigateway.envoyproxy.io/docs/cli/aigwinstall">CLI installation instructions</a> if you haven’t installed the CLI yet.</p></div></div>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">$ aigw run --mcp-config mcp-servers.json</span><br></span></code></pre></div></div>
<p>This will start the Envoy AI Gateway locally and start serving the MCP servers at <a href="http://localhost:1975/mcp" target="_blank" rel="noopener noreferrer" class="">http://localhost:1975/mcp</a>.<br>
<strong>You can point your preferred agent (Claude, Goose, etc) to this URL as a Streamable HTTP MCP server.</strong></p>
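<p>Before wiring up an agent, you can verify the gateway is serving MCP by sending the <code>initialize</code> handshake yourself. The sketch below builds the JSON-RPC <code>initialize</code> request defined by the June 2025 MCP spec; the client name and version are placeholders, and the <code>curl</code> invocation in the comment is one way to send it to your local gateway.</p>

```python
import json


def build_initialize_request(request_id: int = 1) -> dict:
    """Build the JSON-RPC `initialize` request that starts an MCP session."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-06-18",
            "capabilities": {},
            # Placeholder client identity; real agents send their own here.
            "clientInfo": {"name": "example-client", "version": "0.0.1"},
        },
    }


payload = json.dumps(build_initialize_request())
print(payload)
# Send it to the local gateway, for example:
#   curl -X POST http://localhost:1975/mcp \
#     -H 'Content-Type: application/json' \
#     -H 'Accept: application/json, text/event-stream' \
#     -d "$PAYLOAD"
```

<p>A successful response returns the server’s protocol version and capabilities, confirming the gateway is reachable and speaking MCP.</p>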
<p>Some features, such as Tool Filtering, can also be used with the existing servers file. To do so, just add the list of tools to expose for each server (by default, all tools are exposed), as in the following example:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"mcpServers"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"context7"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"http"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://mcp.context7.com/mcp"</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"github"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"http"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://api.githubcopilot.com/mcp/readonly"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"headers"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"Authorization"</span><span class="token operator" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Bearer ${GITHUB_ACCESS_TOKEN}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"tools"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"issue_read"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"list_issues"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
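<p>Conceptually, the gateway aggregates the tool lists advertised by every configured server and drops anything outside a server’s <code>tools</code> allowlist. The actual logic lives in the Go MCP proxy; the Python sketch below only illustrates the filtering semantics, and the <code>server/tool</code> naming is a hypothetical convention for disambiguation, not necessarily what the gateway exposes.</p>

```python
def filter_tools(server_tools, allowlists):
    """Return the unified tool list an agent would see.

    server_tools: mapping of server name -> tools advertised by that server.
    allowlists:   mapping of server name -> allowed tool names; a missing
                  entry means "expose all tools" (the default).
    """
    exposed = []
    for server, tools in server_tools.items():
        allowed = allowlists.get(server)
        for tool in tools:
            if allowed is None or tool in allowed:
                # Illustrative "server/tool" naming to disambiguate servers.
                exposed.append(f"{server}/{tool}")
    return exposed


# Hypothetical catalogs; tool names are for illustration only.
catalog = {
    "context7": ["resolve-library-id", "get-library-docs"],
    "github": ["issue_read", "list_issues", "create_gist"],
}
visible = filter_tools(catalog, {"github": {"issue_read", "list_issues"}})
print(visible)
```

<p>With the allowlist above, <code>create_gist</code> never reaches the agent, while both Context7 tools remain visible because no filter was configured for that server.</p>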
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="using-the-new-mcproute-api">Using the new MCPRoute API<a href="https://aigateway.envoyproxy.io/blog/mcp-implementation#using-the-new-mcproute-api" class="hash-link" aria-label="Direct link to Using the new MCPRoute API" title="Direct link to Using the new MCPRoute API" translate="no">​</a></h3>
<p>You can also use the new <code>MCPRoute</code> API, which allows more fine-grained configuration and works both in standalone mode and in Kubernetes.</p>
<p>The following example defines a complete <code>MCPRoute</code> that shows how simple it is to configure:</p>
<ul>
<li class="">MCP Server multiplexing.</li>
<li class="">MCP Authentication using OAuth.</li>
<li class="">Tool filtering.</li>
<li class="">MCP Server upstream authentication.</li>
</ul>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigateway.envoyproxy.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> MCPRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> mcp</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">route</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">parentRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigw</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">run</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Gateway</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">backendRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> context7</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" 
style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Backend</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.envoyproxy.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"/mcp"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> github</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Backend</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.envoyproxy.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"/mcp/readonly"</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic"># Use the readonly endpoint</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token comment" style="color:#999988;font-style:italic"># Only expose certain tools</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">toolSelector</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">includeRegex</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> .</span><span class="token important">*pull_requests?.*</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> .</span><span class="token important">*issues?.*</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token comment" style="color:#999988;font-style:italic"># Configure upstream authentication</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">securityPolicy</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">apiKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">secretRef</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> github</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">access</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">token</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Configure the gateway to enforce authentication using OAuth.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">securityPolicy</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">oauth</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" 
style="color:#00a4db">issuer</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://auth-server.example.com"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">protectedResourceMetadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># The URL here must match the URL of the Envoy AI Gateway</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"http://localhost:1975/mcp"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">scopesSupported</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"profile"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"email"</span><br></span></code></pre></div></div>
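<p>The <code>toolSelector.includeRegex</code> list in the route above keeps only tools whose names match at least one pattern. The sketch below illustrates that selection with Python’s <code>re</code> module, assuming whole-name matching; the gateway’s exact matching semantics (it is implemented in Go) may differ, so treat this as an approximation of the behavior, not the implementation.</p>

```python
import re


def select_tools(tool_names, include_regex):
    """Keep tool names matching at least one includeRegex pattern."""
    patterns = [re.compile(p) for p in include_regex]
    return [t for t in tool_names if any(p.fullmatch(t) for p in patterns)]


# Hypothetical GitHub tool names, for illustration only.
github_tools = ["issue_read", "list_issues", "get_pull_request", "create_gist"]
selected = select_tools(github_tools, [r".*pull_requests?.*", r".*issues?.*"])
print(selected)
```

<p>Here <code>create_gist</code> is filtered out because it matches neither pattern, while the issue- and pull-request-related tools pass through.</p>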
<p>This file can be used in your Kubernetes cluster and in standalone mode. You can quickly try it out with:</p>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">$ aigw run mcp-route.yaml</span><br></span></code></pre></div></div>
<p>And point your local agents again to <a href="http://localhost:1975/mcp" target="_blank" rel="noopener noreferrer" class="">http://localhost:1975/mcp</a>.<br>
Once you’re happy with the configuration, you can apply it as-is on your Kubernetes cluster and production environments.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What’s Next?<a href="https://aigateway.envoyproxy.io/blog/mcp-implementation#whats-next" class="hash-link" aria-label="Direct link to What’s Next?" title="Direct link to What’s Next?" translate="no">​</a></h2>
<p>This is just the beginning. As MCP and agentic architectures advance, we’ll continue to evolve and upstream features, ensuring Envoy AI Gateway remains the <strong>universal,</strong> <strong>most reliable, policy-driven, and interoperable AI gateway available</strong> for modern production AI workloads.</p>
<p>We are proud to contribute the implementation of the MCP protocol, and look forward to continuing to enhance Envoy AI Gateway together with the community and to bringing more use cases and features to the project.</p>
<p>We’d love your feedback as you start exploring MCP in your GenAI or agent infrastructure. Join our community meetings or file an issue to help shape the roadmap!</p>]]></content>
        <author>
            <name>Ignasi Barrera</name>
            <uri>https://nacx.dev</uri>
        </author>
        <author>
            <name>Takeshi Yoneda</name>
            <uri>https://github.com/mathetake</uri>
        </author>
        <category label="News" term="News"/>
        <category label="Features" term="Features"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Enhancing AI Gateway Observability - OpenTelemetry Tracing Arrives in Envoy AI Gateway]]></title>
        <id>https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability</id>
        <link href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability"/>
        <updated>2025-08-25T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Envoy AI Gateway v0.3 introduces OpenTelemetry tracing with OpenInference conventions, providing complete visibility into LLM application behavior beyond metrics. Understand not just what happened, but why.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Hero feature image of title." src="https://aigateway.envoyproxy.io/assets/images/openinference-feature-2df5b0cd53f76aae58d4303589a968da.png" width="1920" height="1080" class="img_ev3q"></p>
<p>Aggregated metrics like latency, error rates, and throughput on their own won't reveal <em>why</em> a system's output was wrong, slow, or expensive.</p>
<p>The <strong>v0.3 release</strong> of Envoy AI Gateway brings comprehensive <strong>OpenTelemetry tracing support</strong> with <strong>OpenInference semantic conventions</strong>, extending the existing metrics foundation to provide complete visibility into LLM application behavior.</p>
<p>This enables you to improve the quality and safety of your AI-integrated applications: by understanding the full context of a request's journey, your LLM traces can inform application improvements and guardrail needs.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-observability-challenges-in-ai-applications">The Observability Challenges in AI Applications<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#the-observability-challenges-in-ai-applications" class="hash-link" aria-label="Direct link to The Observability Challenges in AI Applications" title="Direct link to The Observability Challenges in AI Applications" translate="no">​</a></h2>
<p>Traditional observability looks at request speed, request volume, and error rates. These work well for simple stateless HTTP services. But they are not enough for AI applications.</p>
<p><strong>AI applications present unique challenges:</strong></p>
<ul>
<li class="">LLM requests have complicated costs based on token usage</li>
<li class="">Responses can vary and may stream out tokens slowly</li>
<li class="">Semantic failures occur when the AI fails to understand or produce the correct answer</li>
<li class="">These issues don't show up as HTTP errors</li>
</ul>
<p>Envoy AI Gateway already addresses some of these challenges: it collects a robust set of metrics through OpenTelemetry, tracking token use, request times, and provider performance data.</p>
<p>The <strong>v0.1.3 update</strong> added GenAI-specific metrics. These include time-to-first-token, available via Prometheus endpoints.</p>
<p>However, these metrics alone can't tell you the full story or the root cause of an issue; this is where <strong>OpenTelemetry tracing with OpenInference conventions</strong> comes in.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="specifications-made-for-ai-openinference-semantic-conventions">Specifications Made for AI: OpenInference Semantic Conventions<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#specifications-made-for-ai-openinference-semantic-conventions" class="hash-link" aria-label="Direct link to Specifications Made for AI: OpenInference Semantic Conventions" title="Direct link to Specifications Made for AI: OpenInference Semantic Conventions" translate="no">​</a></h2>
<p>To give you the clearest possible insight into your traffic, Envoy AI Gateway stays close to standards. Instead of creating custom trace formats, it uses <strong><a href="https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md" target="_blank" rel="noopener noreferrer" class="">OpenInference</a>,</strong> a widely adopted, OpenTelemetry-compatible standard for AI applications supported by many frameworks, including <strong><a href="https://docs.beeai.dev/observability/custom-agent-traceability" target="_blank" rel="noopener noreferrer" class="">BeeAI</a></strong> and <strong><a href="https://huggingface.co/docs/smolagents/v1.21.1/en/tutorials/inspect_runs#inspecting-runs-with-opentelemetry" target="_blank" rel="noopener noreferrer" class="">HuggingFace SmolAgents</a>.</strong></p>
<p>OpenInference defines conventions for recording how large language model calls behave: the prompt, model settings, tokens used, and the response. Key moments, such as the time to first token during streaming, are recorded as span events, which correspond to the metrics discussed earlier.</p>
<p>Because this approach emits standard OpenTelemetry spans, it works well with common tracing systems, which typically correlate traces rather than logs. For example, you can set up Envoy AI Gateway to <strong><a href="https://www.jaegertracing.io/" target="_blank" rel="noopener noreferrer" class="">work with Jaeger</a>,</strong> much as LLM evaluation systems like Arize Phoenix consume OpenInference directly.</p>
<p><strong>Redaction controls</strong> are available from the start. They help you manage your trace data and balance it with your evaluation needs.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-all-fits-together-envoy-ai-gateway-opentelemetry-tracing-architecture">How it All Fits Together: Envoy AI Gateway OpenTelemetry Tracing Architecture<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#how-it-all-fits-together-envoy-ai-gateway-opentelemetry-tracing-architecture" class="hash-link" aria-label="Direct link to How it All Fits Together: Envoy AI Gateway OpenTelemetry Tracing Architecture" title="Direct link to How it All Fits Together: Envoy AI Gateway OpenTelemetry Tracing Architecture" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="Diagram showing how it all fits together" src="https://aigateway.envoyproxy.io/assets/images/Envoy%20AI%20gateway%20+%20Phoenix.drawio-b450b71f1161e1f81b22532a1f18ad3d.png" width="2462" height="1542" class="img_ev3q"></p>
<p>Here's an example of a simple trace that includes both application and gateway spans, shown in Arize Phoenix:</p>
<p><img decoding="async" loading="lazy" alt="Screenshot of Phoenix UI" src="https://aigateway.envoyproxy.io/assets/images/phoenix-072c54623b615f214c28eb3a4ffad22e.webp" width="2938" height="1660" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="capture-and-evaluate-your-traffic-llm-evaluation">Capture and Evaluate your Traffic: LLM Evaluation<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#capture-and-evaluate-your-traffic-llm-evaluation" class="hash-link" aria-label="Direct link to Capture and Evaluate your Traffic: LLM Evaluation" title="Direct link to Capture and Evaluate your Traffic: LLM Evaluation" translate="no">​</a></h2>
<p>Tracing data isn't only for in-the-moment troubleshooting; this data is your key to optimizing your AI system.</p>
<p><strong>OpenInference traces provide the structured data foundation necessary for comprehensive LLM evaluation:</strong></p>
<ul>
<li class="">Capture the whole interaction context, including prompts, model parameters, and outputs</li>
<li class="">Leverage evaluation frameworks to identify patterns over time</li>
<li class="">Find optimization opportunities for performance, accuracy, and/or cost</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="evaluation-patterns">Evaluation Patterns<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#evaluation-patterns" class="hash-link" aria-label="Direct link to Evaluation Patterns" title="Direct link to Evaluation Patterns" translate="no">​</a></h3>
<p>To analyze the trace data, you can leverage an <strong><a href="https://arize.com/llm-as-a-judge/" target="_blank" rel="noopener noreferrer" class="">LLM-as-a-Judge</a></strong> evaluation pattern using your production trace data.</p>
<p>Since your traces are in OpenInference format, your easiest option is to leverage a solution that can consume them. Platforms like <strong>Arize Phoenix</strong> consume OpenInference traces directly, enabling easy analysis of inference traffic captured through Envoy AI Gateway.</p>
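<p>As a rough illustration of the pattern (the prompt wording and scoring rubric here are examples of our own, not a prescribed format), an LLM-as-a-Judge evaluation boils down to turning each captured request/response pair from your traces into a grading prompt for a second model:</p>

```python
# Build a judge prompt from one captured trace's input/output pair.
# Sending this prompt to any OpenAI-compatible judge model (for example,
# one routed through the gateway itself) yields a score per interaction.
def judge_prompt(user_input: str, model_output: str) -> str:
    return (
        "You are an impartial evaluator. Rate the RESPONSE to the REQUEST "
        "for correctness and helpfulness on a 1-5 scale.\n\n"
        f"REQUEST:\n{user_input}\n\n"
        f"RESPONSE:\n{model_output}\n\n"
        "Answer as: SCORE: <1-5> REASON: <one sentence>"
    )

prompt = judge_prompt("Summarize our refund policy.", "Refunds are accepted within 30 days.")
```

<p>Aggregating these scores over time is how patterns and regressions in production traffic surface.</p>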
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="privacy-controls">Privacy Controls<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#privacy-controls" class="hash-link" aria-label="Direct link to Privacy Controls" title="Direct link to Privacy Controls" translate="no">​</a></h3>
<p>When capturing tracing data, you want to ensure you keep private data private. The gateway includes <strong>configurable privacy controls</strong> for sensitive environments:</p>
<ul>
<li class="">Selectively redact content from spans</li>
<li class="">Limit data captured in multimodal interactions</li>
<li class="">Apply custom filtering based on organizational requirements</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="telemetry-via-the-gateway-zero-application-change-integration">Telemetry via the Gateway: Zero-Application-Change Integration<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#telemetry-via-the-gateway-zero-application-change-integration" class="hash-link" aria-label="Direct link to Telemetry via the Gateway: Zero-Application-Change Integration" title="Direct link to Telemetry via the Gateway: Zero-Application-Change Integration" translate="no">​</a></h2>
<p><strong>Gateway-level tracing requires no code changes in your applications.</strong> As traffic is routed via Envoy AI Gateway, the OpenInference-compliant traces are automatically created for all requests to the OpenAI-compatible API. This happens regardless of whether the calling applications include OpenTelemetry instrumentation.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="automatic-trace-propagation">Automatic Trace Propagation<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#automatic-trace-propagation" class="hash-link" aria-label="Direct link to Automatic Trace Propagation" title="Direct link to Automatic Trace Propagation" translate="no">​</a></h3>
<p>If your calling applications are already instrumented with OpenTelemetry, the client spans will automatically join the same distributed trace as the gateway spans, providing <strong>end-to-end visibility</strong>. This trace propagation offers a complete view of the AI interactions.</p>
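<p>Under the hood, this propagation relies on the W3C Trace Context <code>traceparent</code> header that OpenTelemetry SDKs attach to outgoing requests. A minimal sketch of what travels on the wire, independent of any particular SDK:</p>

```python
import re
import secrets

def make_traceparent() -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"  # trailing 01 marks the trace as sampled

# An instrumented client sends this header with its LLM request; the
# gateway reads it and parents its own spans under the same trace id.
header = make_traceparent()
assert re.fullmatch(r"00-[0-9a-f]{32}-[0-9a-f]{16}-01", header)
```

<p>When the header is absent, the gateway simply starts a new trace of its own, which is why uninstrumented clients still get full gateway spans.</p>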
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="seamless-integration">Seamless Integration<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#seamless-integration" class="hash-link" aria-label="Direct link to Seamless Integration" title="Direct link to Seamless Integration" translate="no">​</a></h3>
<p>Because it integrates with your current tools, you and your team can begin using this new feature without having to learn and adopt new tooling or instrumentation patterns.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="follow-the-development-lifecycle-deployment-flexibility">Follow the Development Lifecycle: Deployment Flexibility<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#follow-the-development-lifecycle-deployment-flexibility" class="hash-link" aria-label="Direct link to Follow the Development Lifecycle: Deployment Flexibility" title="Direct link to Follow the Development Lifecycle: Deployment Flexibility" translate="no">​</a></h2>
<p>As you move from local development through dev, test, and staging to production, you can capture and trace traffic in the same way.</p>
<p><strong>The tracing capability works consistently across deployment modes:</strong></p>
<ul>
<li class=""><strong>Local development:</strong> Standalone CLI tool</li>
<li class=""><strong>Production:</strong> Kubernetes Gateway</li>
<li class=""><strong>All environments:</strong> Full tracing support for OpenAI-compatible requests, including streaming responses and multimodal inputs</li>
</ul>
<p>This deployment consistency simplifies your observability integration throughout your development lifecycle. Teams can establish observability patterns during local development that integrate seamlessly into production environments without architectural changes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="looking-ahead-ai-applications-evolve-fast-and-infrastructure-and-observability-with-it">Looking Ahead: AI Applications Evolve Fast, and Infrastructure and Observability with it<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#looking-ahead-ai-applications-evolve-fast-and-infrastructure-and-observability-with-it" class="hash-link" aria-label="Direct link to Looking Ahead: AI Applications Evolve Fast, and Infrastructure and Observability with it" title="Direct link to Looking Ahead: AI Applications Evolve Fast, and Infrastructure and Observability with it" translate="no">​</a></h2>
<p>The new tracing capability is available with the <strong>v0.3 Envoy AI Gateway release</strong>. For complete configuration details and integration examples, see the <strong><a class="" href="https://aigateway.envoyproxy.io/docs/capabilities/observability/tracing">tracing documentation</a></strong> to get started.</p>
<p>As AI infrastructure continues to evolve, comprehensive observability becomes essential for managing operational complexity and ensuring application quality. <strong>OpenTelemetry tracing with OpenInference conventions</strong> provides the foundation teams need to build reliable, observable AI systems.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved">Get Involved<a href="https://aigateway.envoyproxy.io/blog/openinference-for-ai-observability#get-involved" class="hash-link" aria-label="Direct link to Get Involved" title="Direct link to Get Involved" translate="no">​</a></h3>
<p>Want to get involved in building Envoy AI Gateway and further improve the observability capabilities?</p>
<ul>
<li class=""><strong>Raise an issue</strong> on our GitHub repository</li>
<li class=""><strong>Join us on Slack</strong> in the <code>#envoy-ai-gateway</code> channel</li>
<li class=""><strong>Join our weekly community meetings</strong></li>
</ul>]]></content>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <author>
            <name>Adrian Cole</name>
            <uri>https://github.com/codefromthecrypt</uri>
        </author>
        <category label="Features" term="Features"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing the Envoy AI Gateway v0.3 Release]]></title>
        <id>https://aigateway.envoyproxy.io/blog/v0.3-release-announcement</id>
        <link href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement"/>
        <updated>2025-08-22T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Envoy AI Gateway v0.3 brings intelligent inference routing with Endpoint Picker, production-ready Google Vertex AI support, and enterprise observability with OpenInference tracing—transforming AI infrastructure from static to intelligent.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Envoy AI Gateway v0.3.0 Release" src="https://aigateway.envoyproxy.io/assets/images/v0.3-hero-44d6cb88ddd80007a24e56cdfd564823.png" width="1920" height="1080" class="img_ev3q"></p>
<p>The <strong>Envoy AI Gateway v0.3</strong> release introduces <strong>intelligent inference routing</strong> through Endpoint Picker (EPP) integration, expands our provider ecosystem with <strong>Google Vertex AI Production Support</strong> as well as a <strong>Native Anthropic API</strong>, and delivers <strong>Enterprise-Grade Observability</strong> with OpenInference tracing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-big-shifts-in-v03"><strong>The Big Shifts in v0.3</strong><a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#the-big-shifts-in-v03" class="hash-link" aria-label="Direct link to the-big-shifts-in-v03" title="Direct link to the-big-shifts-in-v03" translate="no">​</a></h2>
<p>Envoy AI Gateway v0.3 isn't just another feature release; it's a <strong>fundamental shift</strong> toward intelligent, production-ready AI infrastructure. This release addresses three critical challenges that have been holding back AI adoption in enterprise environments:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-from-static-to-intelligent-routing">1. From Static to Intelligent Routing<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#1-from-static-to-intelligent-routing" class="hash-link" aria-label="Direct link to 1. From Static to Intelligent Routing" title="Direct link to 1. From Static to Intelligent Routing" translate="no">​</a></h3>
<p>Traditional load balancers treat AI inference endpoints like web servers, but AI workloads are fundamentally different. With <strong>Endpoint Picker integration</strong>, Envoy AI Gateway now makes intelligent routing decisions based on real-time AI-specific metrics like KV-cache usage, queue depth, and LoRA adapter information.</p>
<p><strong>What this means for you:</strong></p>
<table><thead><tr><th>Benefit</th><th>Description</th></tr></thead><tbody><tr><td><strong>Latency reduction</strong></td><td>Optimal endpoint selection based on real-time AI metrics</td></tr><tr><td><strong>Automatic resource optimization</strong></td><td>Intelligent resource allocation across your inference infrastructure</td></tr><tr><td><strong>Zero manual intervention</strong></td><td>Automated endpoint management without operational overhead</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-expanded-provider-ecosystem">2. Expanded Provider Ecosystem<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#2-expanded-provider-ecosystem" class="hash-link" aria-label="Direct link to 2. Expanded Provider Ecosystem" title="Direct link to 2. Expanded Provider Ecosystem" translate="no">​</a></h3>
<p>We've moved beyond experimental integrations to deliver <strong>production-grade support</strong> for the AI providers that matter most to enterprises.</p>
<p><strong>Google Vertex AI</strong> is now supported with complete streaming capabilities for Gemini models. <strong>Anthropic on Vertex AI</strong> moves from experimental to production-ready with multi-tool support and configurable API versions.</p>
<p><strong>What this means for you:</strong></p>
<table><thead><tr><th>Benefit</th><th>Description</th></tr></thead><tbody><tr><td><strong>Unified OpenAI-compatible API</strong></td><td>Single interface across Google, Anthropic, AWS, and more providers</td></tr><tr><td><strong>Enterprise-grade reliability</strong></td><td>Production-ready stability for mission-critical AI workloads</td></tr><tr><td><strong>Provider flexibility</strong></td><td>Switch between providers without architectural changes or vendor lock-in</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-enterprise-observability-for-ai">3. Enterprise Observability for AI<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#3-enterprise-observability-for-ai" class="hash-link" aria-label="Direct link to 3. Enterprise Observability for AI" title="Direct link to 3. Enterprise Observability for AI" translate="no">​</a></h3>
<p>AI workloads require specialized observability that traditional monitoring tools can't provide. v0.3 delivers comprehensive AI-specific monitoring across four key areas.</p>
<p><strong>What this means for you:</strong></p>
<table><thead><tr><th>Observability Feature</th><th>Description</th></tr></thead><tbody><tr><td><strong>OpenInference tracing</strong></td><td>Complete request lifecycle visibility and evaluation system compatibility</td></tr><tr><td><strong>Configurable metrics labels</strong></td><td>Granular monitoring based on request headers for custom filtering</td></tr><tr><td><strong>Embeddings metrics support</strong></td><td>Comprehensive token usage tracking for accurate cost attribution</td></tr><tr><td><strong>Enhanced GenAI metrics</strong></td><td>Improved accuracy with OpenTelemetry semantic conventions</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="notable-new-features-in-v03"><strong>Notable New Features in v0.3</strong><a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#notable-new-features-in-v03" class="hash-link" aria-label="Direct link to notable-new-features-in-v03" title="Direct link to notable-new-features-in-v03" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="endpoint-picker-provider-the-future-of-ai-load-balancing">Endpoint Picker Provider: The Future of AI Load Balancing<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#endpoint-picker-provider-the-future-of-ai-load-balancing" class="hash-link" aria-label="Direct link to Endpoint Picker Provider: The Future of AI Load Balancing" title="Direct link to Endpoint Picker Provider: The Future of AI Load Balancing" translate="no">​</a></h3>
<p>A highlight of v0.3 is our integration with the <a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Gateway API Inference Extension</a>, which allows intelligent endpoint selection that understands AI workloads.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># AIGatewayRoute with InferencePool</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigateway.envoyproxy.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AIGatewayRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> intelligent</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">routing</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">rules</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">matches</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> x</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">ai</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">eg</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">backendRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference.networking.x</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferencePool</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pool</span><br></span></code></pre></div></div>
<p>This isn't just about load balancing; it's about <strong>intelligent infrastructure</strong> that adapts to your AI workloads in real-time.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="google-vertex-ai-enterprise-ai-at-scale">Google Vertex AI: Enterprise AI at Scale<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#google-vertex-ai-enterprise-ai-at-scale" class="hash-link" aria-label="Direct link to Google Vertex AI: Enterprise AI at Scale" title="Direct link to Google Vertex AI: Enterprise AI at Scale" translate="no">​</a></h3>
<p>Google Vertex AI support moves to production-ready status with:</p>
<ul>
<li class=""><strong>GCP Vertex AI Authentication</strong> with Service Account Key or Workload Identity Federation.</li>
<li class=""><strong>Complete Gemini Support</strong> with OpenAI API compatibility for function calls, multimodal inputs, reasoning, and streaming.</li>
<li class=""><strong>Complete Anthropic on Vertex AI Support</strong> with OpenAI API compatibility for function calls, multimodal inputs, extended thinking, and streaming.</li>
<li class=""><strong>Native Anthropic API</strong> via GCP Vertex AI to unlock use cases like Claude Code.</li>
<li class=""><strong>Enterprise-grade reliability</strong> for mission-critical deployments.</li>
</ul>
<p>This brings the power of Google's AI platform into your unified AI infrastructure, managed through a single, consistent API.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="comprehensive-ai-observability">Comprehensive AI Observability<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#comprehensive-ai-observability" class="hash-link" aria-label="Direct link to Comprehensive AI Observability" title="Direct link to Comprehensive AI Observability" translate="no">​</a></h3>
<p>Traditional observability tools fall short when monitoring AI workloads. v0.3 delivers four significant observability enhancements:</p>
<table><thead><tr><th>Enhancement</th><th>Feature</th><th>Benefit</th></tr></thead><tbody><tr><td><strong>OpenInference Tracing Integration</strong></td><td>Complete LLM request tracing with timing and token information</td><td>Deep visibility into AI request lifecycle</td></tr><tr><td><strong>OpenInference Tracing Integration</strong></td><td>Evaluation system compatibility with tools like Arize Phoenix</td><td>Seamless integration with AI evaluation workflows</td></tr><tr><td><strong>OpenInference Tracing Integration</strong></td><td>Full chat completion request/response data capture</td><td>Complete audit trail for debugging and analysis</td></tr><tr><td><strong>Configurable Metrics Labels</strong></td><td>Custom labeling based on HTTP request headers</td><td>Flexible monitoring and alerting setup</td></tr><tr><td><strong>Configurable Metrics Labels</strong></td><td>Granular monitoring by user ID, API version, or application context</td><td>Enhanced filtering and segmentation</td></tr><tr><td><strong>Configurable Metrics Labels</strong></td><td>Enhanced filtering and alerting capabilities</td><td>More targeted monitoring and alerts</td></tr><tr><td><strong>Embeddings Metrics Support</strong></td><td>Comprehensive token usage tracking for both chat and embeddings APIs</td><td>Better cost control and usage insights</td></tr><tr><td><strong>Embeddings Metrics Support</strong></td><td>Accurate cost attribution across different operation types</td><td>Precise cost allocation and budgeting</td></tr><tr><td><strong>Embeddings Metrics Support</strong></td><td>OpenTelemetry semantic conventions compliance</td><td>Standardized observability integration</td></tr><tr><td><strong>Enhanced GenAI Metrics</strong></td><td>Improved error handling and attribute mapping</td><td>More reliable performance monitoring</td></tr><tr><td><strong>Enhanced GenAI Metrics</strong></td><td>More accurate token latency measurements</td><td>Better performance analysis 
data</td></tr><tr><td><strong>Enhanced GenAI Metrics</strong></td><td>Better performance analysis data</td><td>Improved optimization insights</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="model-name-virtualization-abstraction-for-flexibility">Model Name Virtualization: Abstraction for Flexibility<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#model-name-virtualization-abstraction-for-flexibility" class="hash-link" aria-label="Direct link to Model Name Virtualization: Abstraction for Flexibility" title="Direct link to Model Name Virtualization: Abstraction for Flexibility" translate="no">​</a></h3>
<p>The new <code>modelNameOverride</code> field enables powerful model abstraction:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">backendRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> openai</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">backend</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">modelNameOverride</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gpt-4"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> anthropic</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">backend</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">modelNameOverride</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"claude-3"</span><br></span></code></pre></div></div>
<p>By abstracting away the model name, application developers can use standardized model names while the gateway handles provider-specific routing. This is useful, for example, for A/B testing, gradual migrations, guarding against provider lock-in, and multi-provider strategies.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="unified-llm-and-non-llm-apis">Unified LLM and non-LLM APIs<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#unified-llm-and-non-llm-apis" class="hash-link" aria-label="Direct link to Unified LLM and non-LLM APIs" title="Direct link to Unified LLM and non-LLM APIs" translate="no">​</a></h2>
<p>This release enhances Gateway resource management by allowing both standard HTTPRoute and AIGatewayRoute resources to be attached to the same Gateway object.</p>
<p>This provides a unified routing configuration that supports both AI and non-AI traffic within a single gateway infrastructure, simplifying deployment and management.</p>
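<p>As a sketch of what this unified setup can look like (resource names here are illustrative, not from a real deployment), a standard HTTPRoute for non-AI traffic and an AIGatewayRoute for LLM traffic both attach to the same Gateway:</p>
<pre><code class="language-yaml"># Hypothetical non-AI route attached to a shared Gateway
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-traffic
spec:
  parentRefs:
    - name: shared-gateway   # same Gateway as the AI route below
  rules:
    - backendRefs:
        - name: web-backend
          port: 8080
---
# AI route attached to the same Gateway
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-traffic
spec:
  targetRefs:
    - name: shared-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: gpt-4
      backendRefs:
        - name: openai-backend
</code></pre>
<p>Both routes are served by one gateway deployment, so AI and non-AI traffic share the same entry point, TLS configuration, and operational tooling.</p>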
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-impact-and-momentum"><strong>Community Impact and Momentum</strong><a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#community-impact-and-momentum" class="hash-link" aria-label="Direct link to community-impact-and-momentum" title="Direct link to community-impact-and-momentum" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="growing-community">Growing Community<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#growing-community" class="hash-link" aria-label="Direct link to Growing Community" title="Direct link to Growing Community" translate="no">​</a></h3>
<p>The v0.3 release represents the collaborative effort of our <strong>rapidly expanding community</strong>:</p>
<ul>
<li class=""><strong>Contributors from Tetrate, Bloomberg, Tencent, Google, and Nutanix</strong></li>
<li class=""><strong>Independent developers</strong> driving innovation</li>
<li class=""><strong>Enterprise adopters</strong> providing real-world feedback</li>
</ul>
<p>This diversity of perspectives has shaped v0.3 into a release that serves both <strong>bleeding-edge innovators</strong> and <strong>enterprise production needs</strong>.</p>
<p><a href="https://www.star-history.com/#envoyproxy/ai-gateway&amp;Date" target="_blank" rel="noopener noreferrer" class=""><img decoding="async" loading="lazy" src="https://api.star-history.com/svg?repos=envoyproxy/ai-gateway&amp;type=Date" alt="Star History Chart" class="img_ev3q"></a></p>
<p><a href="https://github.com/envoyproxy/ai-gateway" target="_blank" rel="noopener noreferrer" class="">Visit the project on GitHub</a> and star the repo to show your support.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="standards-leadership">Standards Leadership<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#standards-leadership" class="hash-link" aria-label="Direct link to Standards Leadership" title="Direct link to Standards Leadership" translate="no">​</a></h3>
<p>Our integration with the <strong>Gateway API Inference Extension</strong> demonstrates our commitment to <strong>open standards</strong> and <strong>vendor-neutral solutions</strong>. By building on proven Gateway API patterns, we're ensuring that Envoy AI Gateway remains interoperable and future-proof.</p>
<p>Enabling tracing through <strong>OpenInference Tracing Integration</strong> further demonstrates our community's commitment to industry standards, collaboration, and ecosystem integration.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-this-release-enables">What This Release Enables<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#what-this-release-enables" class="hash-link" aria-label="Direct link to What This Release Enables" title="Direct link to What This Release Enables" translate="no">​</a></h2>
<table><thead><tr><th>Benefit</th><th>Impact</th></tr></thead><tbody><tr><td>Simplified model deployment with intelligent routing</td><td>Faster development cycles</td></tr><tr><td>Performance optimization through real-time metrics</td><td>Better model performance</td></tr><tr><td>Cost control with token-based rate limiting</td><td>More predictable operating costs</td></tr><tr><td>Multi-model support in a single infrastructure</td><td>Reduced complexity and maintenance</td></tr><tr><td>Unified AI infrastructure supporting diverse workloads</td><td>Scalable, future-proof architecture</td></tr><tr><td>Standards-based architecture for long-term sustainability</td><td>Vendor-neutral, interoperable solutions</td></tr><tr><td>Vendor flexibility without architectural changes</td><td>Reduced lock-in risk</td></tr><tr><td>Enterprise observability for production confidence</td><td>Production-ready monitoring</td></tr><tr><td>Reduced operational complexity through automation</td><td>Lower operational overhead</td></tr><tr><td>Improved reliability with intelligent failover</td><td>Higher system reliability</td></tr><tr><td>Better resource utilization across infrastructure</td><td>Optimized infrastructure costs</td></tr><tr><td>Streamlined monitoring with AI-specific telemetry</td><td>Simplified troubleshooting</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-involved-join-the-ai-infrastructure-revolution">Get Involved: Join the AI Infrastructure Revolution<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#get-involved-join-the-ai-infrastructure-revolution" class="hash-link" aria-label="Direct link to Get Involved: Join the AI Infrastructure Revolution" title="Direct link to Get Involved: Join the AI Infrastructure Revolution" translate="no">​</a></h2>
<p>The future of AI infrastructure is <strong>open, collaborative, and community-driven</strong>. Here's how you can be part of it:</p>
<table><thead><tr><th>Action</th><th>Resource</th><th>Description</th></tr></thead><tbody><tr><td><strong>🚀 Try v0.3 Today</strong></td><td><a href="https://github.com/envoyproxy/ai-gateway/releases/tag/v0.3.0" target="_blank" rel="noopener noreferrer" class="">Download the release</a></td><td>Get the latest release and start exploring</td></tr><tr><td></td><td><a href="https://aigateway.envoyproxy.io/docs/getting-started/" target="_blank" rel="noopener noreferrer" class="">Follow our getting started guide</a></td><td>Step-by-step setup instructions</td></tr><tr><td></td><td><a href="https://github.com/envoyproxy/ai-gateway/tree/main/examples" target="_blank" rel="noopener noreferrer" class="">Explore the examples</a></td><td>Real-world configuration examples</td></tr><tr><td><strong>💬 Join the Community</strong></td><td><a href="https://docs.google.com/document/d/10e1sfsF-3G3Du5nBHGmLjXw5GVMqqCvFDqp_O65B0_w/edit?tab=t.0#heading=h.6nxfjwmrm5g6" target="_blank" rel="noopener noreferrer" class="">Weekly Community Meetings</a></td><td>Add to your calendar</td></tr><tr><td></td><td><a href="https://envoyproxy.slack.com/channels/envoy-ai-gateway" target="_blank" rel="noopener noreferrer" class="">Slack Channel #envoy-ai-gateway</a></td><td>Join the conversation on Envoy Slack</td></tr><tr><td></td><td><a href="https://github.com/envoyproxy/ai-gateway/discussions" target="_blank" rel="noopener noreferrer" class="">GitHub Discussions</a></td><td>Share experiences and ask questions</td></tr><tr><td><strong>🛠️ Contribute to the Future</strong></td><td><a href="https://github.com/envoyproxy/ai-gateway/issues/new?template=bug_report.md" target="_blank" rel="noopener noreferrer" class="">Report Issues</a></td><td>Help us improve by reporting bugs</td></tr><tr><td></td><td><a href="https://github.com/envoyproxy/ai-gateway/issues/new?template=feature_request.md" target="_blank" rel="noopener noreferrer" class="">Request Features</a></td><td>Tell us what you need for future 
releases</td></tr><tr><td></td><td><a href="https://github.com/envoyproxy/ai-gateway/blob/main/CONTRIBUTING.md" target="_blank" rel="noopener noreferrer" class="">Submit Code</a></td><td>Contribute to the next release</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="acknowledgments-the-power-of-open-source">Acknowledgments: The Power of Open Source<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#acknowledgments-the-power-of-open-source" class="hash-link" aria-label="Direct link to Acknowledgments: The Power of Open Source" title="Direct link to Acknowledgments: The Power of Open Source" translate="no">​</a></h2>
<p>v0.3 wouldn't exist without our incredible community. Special recognition goes to:</p>
<ul>
<li class=""><strong>Enterprise contributors</strong> who provided production feedback and requirements</li>
<li class=""><strong>Open source maintainers</strong> from the Gateway API and CNCF communities</li>
<li class=""><strong>Individual developers</strong> who contributed code, documentation, and ideas</li>
<li class=""><strong>Early adopters</strong> who tested pre-releases and reported issues</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-started-today">Get Started Today<a href="https://aigateway.envoyproxy.io/blog/v0.3-release-announcement#get-started-today" class="hash-link" aria-label="Direct link to Get Started Today" title="Direct link to Get Started Today" translate="no">​</a></h2>
<p>Ready to experience the future of AI infrastructure?</p>
<p><strong>Get started with Envoy AI Gateway v0.3</strong> and see how intelligent inference routing, expanded provider support, and enterprise observability can transform your AI deployments.</p>
<p>The future of AI infrastructure is <strong>open, intelligent, and community-driven</strong>. Join us in building it.</p>
<p>🚀 <a href="https://aigateway.envoyproxy.io/docs/getting-started/" target="_blank" rel="noopener noreferrer" class=""><strong>Get Started with v0.3 →</strong></a></p>
<hr>
<p><em>Envoy AI Gateway v0.3 is available now. For detailed release notes, API changes, and upgrade guidance, visit our <a href="https://aigateway.envoyproxy.io/release-notes/v0.3" target="_blank" rel="noopener noreferrer" class="">release notes page</a>.</em></p>]]></content>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <author>
            <name>Xunzhuo (Bit) Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="News" term="News"/>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Envoy AI Gateway Introduces Endpoint Picker Support]]></title>
        <id>https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing</id>
        <link href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing"/>
        <updated>2025-07-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Envoy AI Gateway introduces Endpoint Picker Provider support for intelligent inference routing based on real-time AI metrics like KV-cache usage and queue depth—moving beyond traditional load balancing to optimize AI workload performance.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Reference Architecture for Envoy AI Gateway" src="https://aigateway.envoyproxy.io/assets/images/epp-blog-feature-69f00db96205813b56a1cdf4163fe2cb.png" width="1920" height="1080" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="introduction">Introduction<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#introduction" class="hash-link" aria-label="Direct link to Introduction" title="Direct link to Introduction" translate="no">​</a></h2>
<p><strong>Envoy AI Gateway now supports Endpoint Picker Provider (EPP) integration as per the</strong> <a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Gateway API Inference Extension</a>.</p>
<p>This feature enables dynamic routing for AI inference workloads through intelligent endpoint selection based on real-time metrics, including KV-cache usage, queued requests, and LoRA adapter information.</p>
<p>When running AI inference at scale, this means your system can automatically select the optimal inference endpoint for each request, thereby optimizing resource utilization.</p>
<p><img decoding="async" loading="lazy" alt="An overview of Endpoint Picker together with Envoy AI Gateway" src="https://aigateway.envoyproxy.io/assets/images/epp-blog-overview-968581c2cd4664b84a19d2050ee33d60.png" width="1121" height="619" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-traditional-load-balancing-falls-short-for-ai-workloads">The Problem: Traditional Load Balancing Falls Short for AI Workloads<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#the-problem-traditional-load-balancing-falls-short-for-ai-workloads" class="hash-link" aria-label="Direct link to The Problem: Traditional Load Balancing Falls Short for AI Workloads" title="Direct link to The Problem: Traditional Load Balancing Falls Short for AI Workloads" translate="no">​</a></h2>
<p>LLM inference workloads have a different set of characteristics from traditional API servers, so informing the load balancer of the optimal upstream target for a request requires a new set of signals on which to base those decisions.</p>
<p>Some of the unique characteristics of AI inference workloads:</p>
<ul>
<li class=""><strong>Variable processing times</strong> based on model complexity and input size</li>
<li class=""><strong>Different resource requirements</strong> for different models and configurations</li>
<li class=""><strong>Real-time performance metrics</strong> that change constantly (KV-cache usage, queue depth, etc.)</li>
<li class=""><strong>Specialized hardware requirements</strong> (GPUs, TPUs, different model variants)</li>
</ul>
<p>Without taking these into account when routing, you’ll end up with:</p>
<ul>
<li class="">Overloaded endpoints while others sit idle</li>
<li class="">Increased latency due to poor endpoint selection</li>
<li class="">Resource waste from suboptimal load distribution</li>
<li class="">Manual intervention required for optimal performance</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-solution-endpoint-picker-provider-integration">The Solution: Endpoint Picker Provider Integration<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#the-solution-endpoint-picker-provider-integration" class="hash-link" aria-label="Direct link to The Solution: Endpoint Picker Provider Integration" title="Direct link to The Solution: Endpoint Picker Provider Integration" translate="no">​</a></h2>
<p>Envoy AI Gateway's new EPP support addresses these challenges by giving users the option to integrate an Endpoint Picker. The endpoint picker takes these signals into account and tells the gateway which upstream target is best placed to serve the request.</p>
<p><img decoding="async" loading="lazy" alt="An overview of Endpoint Picker together with Envoy AI Gateway" src="https://aigateway.envoyproxy.io/assets/images/epp-blog-step-by-step-d2146e4b4593923781b4ce5d1c850e0c.png" width="3483" height="2916" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-intelligent-endpoint-selection">1. Intelligent Endpoint Selection<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#1-intelligent-endpoint-selection" class="hash-link" aria-label="Direct link to 1. Intelligent Endpoint Selection" title="Direct link to 1. Intelligent Endpoint Selection" translate="no">​</a></h3>
<p>The Endpoint Picker (EPP) automatically routes requests to the most suitable backend by analyzing real-time metrics from your inference endpoints, such as:</p>
<ul>
<li class="">Current KV-cache usage</li>
<li class="">Number of queued inference requests</li>
<li class="">LoRA adapter information</li>
<li class="">Endpoint health and performance metrics</li>
</ul>
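<p>To give a concrete flavor of how an EPP is wired in, the sketch below follows the Gateway API Inference Extension's v1alpha2 schema; the field names (<code>selector</code>, <code>targetPortNumber</code>, <code>extensionRef</code>) and all resource names are illustrative and may differ in the extension version you deploy. The InferencePool selects the model-serving pods and references the Endpoint Picker service that makes the routing decision:</p>
<pre><code class="language-yaml"># Sketch of an InferencePool wired to an Endpoint Picker
# (v1alpha2-style fields; check your installed CRD version)
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
spec:
  selector:
    app: vllm-llama3-8b-instruct   # pods serving this model
  targetPortNumber: 8000           # port the model server listens on
  extensionRef:
    name: vllm-llama3-8b-instruct-epp   # the Endpoint Picker (EPP) service
</code></pre>
<p>The EPP service named in <code>extensionRef</code> scrapes the pool's pods for the metrics above and returns the chosen endpoint to the gateway per request.</p>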
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-dynamic-load-balancing">2. Dynamic Load Balancing<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#2-dynamic-load-balancing" class="hash-link" aria-label="Direct link to 2. Dynamic Load Balancing" title="Direct link to 2. Dynamic Load Balancing" translate="no">​</a></h3>
<p>Unlike static load balancers, the EPP continuously receives endpoint status, which affects routing decisions in real time. This dynamic decision-making ensures optimal resource utilization across your entire inference infrastructure.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-automatic-failover">3. Automatic Failover<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#3-automatic-failover" class="hash-link" aria-label="Direct link to 3. Automatic Failover" title="Direct link to 3. Automatic Failover" translate="no">​</a></h3>
<p>When an endpoint becomes unavailable or performance degrades, the EPP automatically routes traffic to healthy alternatives, ensuring high availability for your AI services.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-extensible-architecture">4. Extensible Architecture<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#4-extensible-architecture" class="hash-link" aria-label="Direct link to 4. Extensible Architecture" title="Direct link to 4. Extensible Architecture" translate="no">​</a></h3>
<p>The EPP architecture supports custom endpoint picker providers, allowing you to implement domain-specific routing logic tailored to your unique requirements.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-works-two-integration-approaches">How It Works: Two Integration Approaches<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#how-it-works-two-integration-approaches" class="hash-link" aria-label="Direct link to How It Works: Two Integration Approaches" title="Direct link to How It Works: Two Integration Approaches" translate="no">​</a></h2>
<p>Envoy AI Gateway supports EPP integration through two powerful approaches:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="httproute--inferencepool">HTTPRoute + InferencePool<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#httproute--inferencepool" class="hash-link" aria-label="Direct link to HTTPRoute + InferencePool" title="Direct link to HTTPRoute + InferencePool" translate="no">​</a></h3>
<p>For simpler inference workloads, you can use the standard Gateway API HTTPRoute with an InferencePool backend.</p>
<p><img decoding="async" loading="lazy" alt="Showing the relationship between HTTPRoute and InferencePool Kubernetes Resources" src="https://aigateway.envoyproxy.io/assets/images/epp-blog-http-route-2b2a33fc73ddba7583f3f9e023a8ee24.png" width="2628" height="1932" class="img_ev3q"></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.networking.k8s.io/v1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HTTPRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pool</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">with</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">httproute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">parentRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Gateway</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pool</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">with</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">httproute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  
</span><span class="token key atrule" style="color:#00a4db">rules</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">backendRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferencePool</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">weight</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">matches</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PathPrefix</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> /</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">timeouts</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">request</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> 60s</span><br></span></code></pre></div></div>
<p>This approach provides:</p>
<ul>
<li class="">Basic intelligent routing</li>
<li class="">Automatic load balancing</li>
<li class="">Simple configuration</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="aigatewayroute--inferencepool">AIGatewayRoute + InferencePool<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#aigatewayroute--inferencepool" class="hash-link" aria-label="Direct link to AIGatewayRoute + InferencePool" title="Direct link to AIGatewayRoute + InferencePool" translate="no">​</a></h3>
<p>For advanced AI-specific features, use Envoy AI Gateway's custom AIGatewayRoute.</p>
<p><img decoding="async" loading="lazy" alt="Showing the relationship between AIGatewayRoute and InferencePool Kubernetes Resources" src="https://aigateway.envoyproxy.io/assets/images/epp-blog-aigwroute-ee122e0b5632a473bccfcfbec5c80aec.png" width="3645" height="2005" class="img_ev3q"></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigateway.envoyproxy.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AIGatewayRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pool</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">with</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">aigwroute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">targetRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pool</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">with</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">aigwroute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Gateway</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">rules</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">matches</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Exact</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> x</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">ai</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">eg</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token 
plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">backendRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferencePool</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">matches</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Exact</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> x</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">ai</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">eg</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> mistral</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">backendRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferencePool</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> mistral</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">matches</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Exact</span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> x</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">ai</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">eg</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> some</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">cool</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">hosted</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">backendRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> envoy</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">ai</span><span class="token punctuation" style="color:#393A34">-</span><span 
class="token plain">gateway</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">basic</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">testupstream</span><br></span></code></pre></div></div>
<p>This enhanced approach adds:</p>
<ul>
<li class=""><strong>Multi-model routing</strong> based on the <code>modelName</code> field in the request body</li>
<li class=""><strong>Token-based rate limiting</strong> for cost control with self-hosted models</li>
<li class=""><strong>Advanced LLM observability</strong> across all routed models</li>
</ul>
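<p>As a rough sketch of the token-based rate limiting piece, the gateway can emit per-request token counts as dynamic metadata, which an Envoy Gateway <code>BackendTrafficPolicy</code> then counts against a global limit. The policy name, gateway name, and limit values below are placeholders, and the exact field names should be checked against the current Envoy AI Gateway usage-based rate limiting docs:</p>
<pre><code class="language-yaml"># In the AIGatewayRoute spec: record input-token usage under a metadata key
llmRequestCosts:
  - metadataKey: llm_input_token
    type: InputToken
---
# Placeholder policy: cap input tokens per client per hour
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-token-limit
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway
  rateLimit:
    type: Global
    global:
      rules:
        - limit:
            requests: 100000   # interpreted as tokens via the cost below
            unit: Hour
          cost:
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_input_token
</code></pre>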
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="real-world-benefits">Real-World Benefits<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#real-world-benefits" class="hash-link" aria-label="Direct link to Real-World Benefits" title="Direct link to Real-World Benefits" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-aiml-engineers">For AI/ML Engineers<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#for-aiml-engineers" class="hash-link" aria-label="Direct link to For AI/ML Engineers" title="Direct link to For AI/ML Engineers" translate="no">​</a></h3>
<ul>
<li class=""><strong>Lower latency</strong>: Requests automatically routed to optimal endpoints</li>
<li class=""><strong>Better throughput</strong>: Intelligent distribution prevents bottlenecks</li>
<li class=""><strong>Cost optimization</strong>: Efficient resource usage reduces infrastructure costs</li>
<li class=""><strong>Enhanced observability</strong>: Real-time metrics and performance insights</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-platform-teams">For Platform Teams<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#for-platform-teams" class="hash-link" aria-label="Direct link to For Platform Teams" title="Direct link to For Platform Teams" translate="no">​</a></h3>
<ul>
<li class=""><strong>Standards compliance</strong>: Built on the Gateway API Inference Extension</li>
<li class=""><strong>Vendor flexibility</strong>: Support for multiple EPP implementations</li>
<li class=""><strong>Future-proof architecture</strong>: Extensible design for evolving requirements</li>
<li class=""><strong>Kubernetes-native</strong>: Seamless integration with existing infrastructure</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="for-devops-teams">For DevOps Teams<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#for-devops-teams" class="hash-link" aria-label="Direct link to For DevOps Teams" title="Direct link to For DevOps Teams" translate="no">​</a></h3>
<ul>
<li class=""><strong>Reduced operational overhead</strong>: No more manual endpoint management</li>
<li class=""><strong>Improved reliability</strong>: Automatic failover and health monitoring</li>
<li class=""><strong>Better resource utilization</strong>: Dynamic load balancing maximizes efficiency</li>
<li class=""><strong>Simplified scaling</strong>: Add new endpoints without configuration changes</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting Started<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started" translate="no">​</a></h2>
<p>Setting up EPP support in Envoy AI Gateway is straightforward:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-install-inferencepool-crds">1. Install InferencePool CRDs<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#1-install-inferencepool-crds" class="hash-link" aria-label="Direct link to 1. Install InferencePool CRDs" title="Direct link to 1. Install InferencePool CRDs" translate="no">​</a></h3>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Install Gateway API Inference Extension</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.5.1/manifests.yaml</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-deploy-your-inference-backends">2. Deploy Your Inference Backends<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#2-deploy-your-inference-backends" class="hash-link" aria-label="Direct link to 2. Deploy Your Inference Backends" title="Direct link to 2. Deploy Your Inference Backends" translate="no">​</a></h3>
<div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Deploy sample vLLM backend</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v0.5.1/config/manifests/vllm/sim-deployment.yaml</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-configure-inferenceobjective-and-inferencepool">3. Configure InferenceObjective and InferencePool<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#3-configure-inferenceobjective-and-inferencepool" class="hash-link" aria-label="Direct link to 3. Configure InferenceObjective and InferencePool" title="Direct link to 3. Configure InferenceObjective and InferencePool" translate="no">​</a></h3>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference.networking.k8s.io/v1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceObjective</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> base</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">modelName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">criticality</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Critical</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">poolRef</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> inference.networking.k8s.io/v1</span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferencePool</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">targetPortNumber</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">8000</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">selector</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key 
atrule" style="color:#00a4db">app</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">extensionRef</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">epp</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-create-your-route">4. Create Your Route<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#4-create-your-route" class="hash-link" aria-label="Direct link to 4. Create Your Route" title="Direct link to 4. Create Your Route" translate="no">​</a></h3>
<p>Choose between HTTPRoute and AIGatewayRoute based on your needs, and you're ready to go!</p>
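<p>For reference, the plain HTTPRoute variant can look roughly like this. The route and gateway names are placeholders; the <code>InferencePool</code> reference matches the pool configured in step 3:</p>
<pre><code class="language-yaml">apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route   # placeholder name
  namespace: default
spec:
  parentRefs:
    - name: my-gateway    # placeholder: your Gateway
  rules:
    - backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
</code></pre>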
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What's Next<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>The introduction of EPP support represents a significant milestone in Envoy AI Gateway's evolution. This feature enables intelligent routing for AI workloads, making it easier than ever to deploy and manage production AI inference services.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="upcoming-enhancements">Upcoming Enhancements<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#upcoming-enhancements" class="hash-link" aria-label="Direct link to Upcoming Enhancements" title="Direct link to Upcoming Enhancements" translate="no">​</a></h3>
<ul>
<li class=""><strong>Upstream Conformance Test</strong>: Integrate with <strong>Gateway API Inference Extension</strong> Conformance Tests.</li>
<li class=""><strong>Fallback Support</strong>: Endpoint picker fallback based on the Envoy HostOverride LbPolicy, keeping requests flowing even when the Endpoint Picker is unavailable.</li>
<li class=""><strong>Internal Managed EPP</strong>: An Envoy AI Gateway-managed Endpoint Picker that simplifies the management of your AI inference endpoints.</li>
<li class=""><strong>Enhance End-to-End (E2E) Testing</strong>: Add more E2E tests for InferencePool support.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-contributions">Community Contributions<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#community-contributions" class="hash-link" aria-label="Direct link to Community Contributions" title="Direct link to Community Contributions" translate="no">​</a></h3>
<p>We're excited to see how the community leverages this capability. Whether you're building custom EPP implementations, contributing to the Gateway API Inference Extension, or sharing your deployment experiences, we'd love to hear from you.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Endpoint Picker Provider support enables Envoy AI Gateway to serve not only as an Egress AI Gateway but also as an Ingress AI Gateway for AI inference workloads. By automatically selecting optimal endpoints based on real-time metrics, this feature improves performance and maximizes resource utilization for hosted inference systems.</p>
<p>Whether you're running a small AI service or a large-scale inference platform, EPP support provides the intelligent routing capabilities you need to deliver reliable, high-performance AI services to your users.</p>
<p>Ready to get started with Envoy AI Gateway? Check out our <a href="https://aigateway.envoyproxy.io/docs/capabilities/inference/" target="_blank" rel="noopener noreferrer" class="">documentation</a> for guides and examples, and join our <a href="https://github.com/envoyproxy/ai-gateway/discussions" target="_blank" rel="noopener noreferrer" class="">community discussions</a> to share your experiences, raise issues, request features, and learn from others.</p>
<hr>
<p><em>Envoy AI Gateway continues to evolve as the premier solution for routing and managing AI workloads. Stay tuned for more exciting features and capabilities as we work to simplify AI deployment and improve its reliability and efficiency.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="resources">Resources<a href="https://aigateway.envoyproxy.io/blog/endpoint-picker-for-inference-routing#resources" class="hash-link" aria-label="Direct link to Resources" title="Direct link to Resources" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://aigateway.envoyproxy.io/docs/capabilities/inference/">Inference Optimization Documentation</a></li>
<li class=""><a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Gateway API Inference Extension</a></li>
<li class=""><a class="" href="https://aigateway.envoyproxy.io/docs/capabilities/inference/aigatewayroute-inferencepool/">AIGatewayRoute + InferencePool Guide</a></li>
<li class=""><a class="" href="https://aigateway.envoyproxy.io/docs/capabilities/inference/httproute-inferencepool/">HTTPRoute + InferencePool Guide</a></li>
<li class=""><a href="https://github.com/envoyproxy/ai-gateway" target="_blank" rel="noopener noreferrer" class="">GitHub Repository</a></li>
</ul>]]></content>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <author>
            <name>Xunzhuo (Bit) Liu</name>
            <uri>https://github.com/Xunzhuo</uri>
        </author>
        <category label="News" term="News"/>
        <category label="Features" term="Features"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[A Reference Architecture for Adopters of Envoy AI Gateway]]></title>
        <id>https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture</id>
        <link href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture"/>
        <updated>2025-07-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Explore how to build a scalable GenAI platform using a two-tier architecture with Envoy AI Gateway. Centralize access to external and self-hosted LLMs through a unified API while managing credentials, costs, and governance.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Reference Architecture for Envoy AI Gateway" src="https://aigateway.envoyproxy.io/assets/images/ref-arch-blog-hero-0fc3e3c70a1020aeb7fc6df45f87f776.png" width="1920" height="1080" class="img_ev3q"></p>
<h1>Building a Scalable, Flexible, Cloud-Native GenAI Platform with Open Source Solutions</h1>
<p>AI workloads are complex, and unmanaged complexity kills velocity. Your architecture is the key to mastering it.</p>
<p>As generative AI (GenAI) becomes foundational to modern software products, developers face a chaotic new reality, juggling different APIs from various providers while also attempting to deploy self-hosted open-source models. This leads to credential sprawl, inconsistent security policies, runaway costs, and an infrastructure that is difficult to scale and govern.</p>
<p><strong>Your architecture doesn’t have to be this complex.</strong></p>
<p>Platform engineering teams need a secure, scalable way to serve both internal and external LLMs to their users. That’s where Envoy AI Gateway comes in.</p>
<p>This reference architecture outlines how to build a flexible GenAI platform using the open source solutions Envoy AI Gateway, KServe, and complementary tools. Whether you're self-hosting models or integrating with hosted model providers such as OpenAI and Anthropic, this architecture enables a unified, governable interface for LLM traffic.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="core-architecture-two-tier-gateway-design">Core Architecture: Two-Tier Gateway Design<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#core-architecture-two-tier-gateway-design" class="hash-link" aria-label="Direct link to Core Architecture: Two-Tier Gateway Design" title="Direct link to Core Architecture: Two-Tier Gateway Design" translate="no">​</a></h2>
<p>The foundation of this platform is a Two-Tier Gateway Architecture:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="tier-one-gateway">Tier One Gateway<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#tier-one-gateway" class="hash-link" aria-label="Direct link to Tier One Gateway" title="Direct link to Tier One Gateway" translate="no">​</a></h3>
<p>Deployed in a centralized Gateway Cluster, this tier serves as the main entry point for client application API traffic.</p>
<p><strong>Its Job:</strong> To route traffic to external LLM providers (e.g., OpenAI, Anthropic, Bedrock, Vertex) or to the appropriate internal gateway to access an internal model-serving cluster.</p>
<p><strong>Why It Matters:</strong> This gateway provides a unified API for all application developers. They don't need to know or care if a model is hosted by a third party or in-house. It centralizes coarse-grained policies, such as authentication, top-level routing, and global rate limiting, simplifying the developer experience and providing a single control point for platform-wide governance.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="tier-two-gateway">Tier Two Gateway<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#tier-two-gateway" class="hash-link" aria-label="Direct link to Tier Two Gateway" title="Direct link to Tier Two Gateway" translate="no">​</a></h3>
<p>Deployed as part of a self-hosted model-serving cluster.</p>
<p><strong>Its Job:</strong> To handle internal traffic routing, load balancing, and policy enforcement specific to self-hosted models running on platforms like KServe.</p>
<p><strong>Why It Matters:</strong> This empowers platform teams. They can manage model versions, conduct releases, and apply specific security rules for self-hosted models without needing to make changes to the primary, customer-facing gateway. This separation ensures that internal operational changes don't impact external clients.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="design-benefits">Design Benefits<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#design-benefits" class="hash-link" aria-label="Direct link to Design Benefits" title="Direct link to Design Benefits" translate="no">​</a></h3>
<p>This design cleanly separates external access from internal implementation, giving teams the autonomy they need to move fast without breaking things, and provides:</p>
<ul>
<li class="">Centralized credential management</li>
<li class="">Unified API access</li>
<li class="">Cost tracking and traffic governance</li>
</ul>
<p><img decoding="async" loading="lazy" src="https://aigateway.envoyproxy.io/assets/images/aigw-ref.drawio-bff3f4209ccc3b0bbd7b57e98dd3d092.png" width="1398" height="1182" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="routing-and-traffic-management">Routing and Traffic Management<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#routing-and-traffic-management" class="hash-link" aria-label="Direct link to Routing and Traffic Management" title="Direct link to Routing and Traffic Management" translate="no">​</a></h3>
<p>Envoy AI Gateway provides a consistent client interface and abstracts the complexity of accessing diverse GenAI backends. It supports:</p>
<table><thead><tr><th>Feature</th><th>Problem</th><th>Solution</th></tr></thead><tbody><tr><td><strong>Upstream Authentication with Credential Injection</strong></td><td>Application developers must manage, store, and rotate API keys for multiple external LLM providers. This is a security risk and an operational burden that slows down development.</td><td>The gateway injects the correct credentials per provider. Developers can make requests to the gateway using a single, internal authentication token, and the gateway handles attaching the appropriate third-party API key. <strong>This decouples applications from external secrets, dramatically improving security and simplifying code.</strong></td></tr><tr><td><strong>Token-Based Rate Limiting and Cost Optimization</strong></td><td>A single buggy script or a new feature can lead to unexpected spikes in usage, resulting in huge bills from LLM providers or overwhelming your self-hosted infrastructure.</td><td>The gateway acts as an intelligent financial and operational circuit breaker. You can enforce policies based on token usage, requests, or estimated cost. <strong>This prevents abuse and ensures that you stay within budget, providing critical protection for the business.</strong></td></tr><tr><td><strong>Observability Hooks for Usage Patterns and Latency</strong></td><td>When you use multiple model providers, it's nearly impossible to get a clear picture of usage. Which teams are spending the most? Which models have the highest latency? How many tokens is our new feature consuming?</td><td>By routing all traffic through a single point, the gateway provides a unified source of truth and built-in support for metrics, logs, and traces tailored for GenAI workloads. <strong>This enables you to accurately track costs, identify performance bottlenecks, and understand usage patterns across your entire GenAI stack.</strong></td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Clients simply hit a single endpoint and let the Gateway handle routing to the appropriate backend—self-hosted or third-party.</p></div></div>
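The single-endpoint idea in the tip above can be sketched in a few lines. This is an illustrative sketch only: the gateway URL is hypothetical and `build_request` is not an Envoy AI Gateway API; it simply shows that with a unified OpenAI-style interface, only the `model` field changes between a third-party and a self-hosted backend.

```python
import json

# Hypothetical endpoint; the real path depends on your gateway deployment.
GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Build one OpenAI-style chat payload; only `model` selects the backend."""
    return {
        "url": GATEWAY_URL,  # same endpoint for every provider
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# Self-hosted and third-party models share the same client code path;
# the gateway decides where each request is routed.
external = build_request("gpt-4o", "Hello")
internal = build_request("llama-3-8b-self-hosted", "Hello")
```

The client never holds provider credentials or provider-specific URLs; routing and credential injection happen in the gateway.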
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="self-hosted-model-serving-with-kserve">Self-Hosted Model Serving with KServe<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#self-hosted-model-serving-with-kserve" class="hash-link" aria-label="Direct link to Self-Hosted Model Serving with KServe" title="Direct link to Self-Hosted Model Serving with KServe" translate="no">​</a></h2>
<p>While many organizations start with external providers, self-hosting models offers advantages in cost, privacy, and customization. If you are self-hosting models, KServe is a powerful addition.</p>
<p>For many data scientists and ML engineers, turning a trained model into a production-ready API is a major hurdle that requires deep expertise in Kubernetes, networking, and infrastructure. KServe bridges this gap and eliminates the complexity. KServe is a model serving platform that automates the heavy lifting, allowing an engineer to simply provide a configuration file for their model while KServe builds the scalable, resilient API endpoint.</p>
<p><strong>A few highlights of what KServe provides:</strong></p>
<ul>
<li class="">Autoscaling (including token-based autoscaling for LLMs and scale-to-zero for GPUs)</li>
<li class="">Multi-node inference (via vLLM)</li>
<li class="">Support for OpenAI-compatible APIs and advanced runtime integrations (vLLM)</li>
<li class="">Built-in support for model and prompt caching</li>
</ul>
<p>Check out the <a href="https://kserve.github.io/website/latest/" target="_blank" rel="noopener noreferrer" class="">KServe Documentation</a> to learn more about all the capabilities.</p>
<p>Use tools like Backstage to simplify model deployment and namespace management for teams.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>Self-hosting is optional. Many adopters will initially rely on external providers and add internal hosting as requirements evolve.</p></div></div>
<p><img decoding="async" loading="lazy" src="https://aigateway.envoyproxy.io/assets/images/kserve-ref-fe5d5a1149c6f6b253a91a899975b7a7.png" width="3176" height="1786" class="img_ev3q"></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-control-and-optimization-for-production-readiness">Observability, Control, and Optimization for Production Readiness<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#observability-control-and-optimization-for-production-readiness" class="hash-link" aria-label="Direct link to Observability, Control, and Optimization for Production Readiness" title="Direct link to Observability, Control, and Optimization for Production Readiness" translate="no">​</a></h2>
<p>In a production GenAI platform environment, observability, control, and optimization become must-haves. Envoy AI Gateway and KServe offer integrations that cater to these needs.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability">Observability<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#observability" class="hash-link" aria-label="Direct link to Observability" title="Direct link to Observability" translate="no">​</a></h3>
<p>Visibility into your system enables you to identify bottlenecks, manage costs effectively, and ensure model reliability.</p>
<ul>
<li class=""><strong>Unified Metrics and Tracing:</strong>
<ul>
<li class="">Envoy AI Gateway integrates seamlessly with <strong>OpenTelemetry</strong>, adhering to the GenAI Semantic Conventions. This delivers a unified view of requests, latency, token usage, and errors across all external and internal models.</li>
<li class="">Leverage specialized telemetry, like <strong>OpenLLMetry</strong>, for granular insights into LLM-specific metrics, ensuring you capture essential details like prompt and completion lengths, token throughput, and model-specific performance.</li>
</ul>
</li>
<li class=""><strong>Centralized Logging:</strong>
<ul>
<li class="">Centralize logging across providers through Envoy AI Gateway, facilitating easier debugging, auditing, and compliance.</li>
</ul>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="control">Control<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#control" class="hash-link" aria-label="Direct link to Control" title="Direct link to Control" translate="no">​</a></h3>
<p>Enforcing policies and controls ensures your platform remains secure, stable, and cost-effective:</p>
<ul>
<li class=""><strong>Policy Enforcement and Guardrails:</strong>
<ul>
<li class="">Set usage-based guardrails directly in Envoy AI Gateway to prevent cost overruns, enforce compliance, and safeguard against prompt misuse or model hallucination.</li>
<li class="">Implement safety checks and output validation rules that enable your team to control quality and compliance centrally, rather than embedding these checks individually within applications.</li>
</ul>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimization">Optimization<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#optimization" class="hash-link" aria-label="Direct link to Optimization" title="Direct link to Optimization" translate="no">​</a></h3>
<p>Improving the efficiency and responsiveness of your models enhances the user experience and lowers operating costs:</p>
<ul>
<li class=""><strong>Caching Strategies:</strong>
<ul>
<li class="">KServe's <strong>model caching</strong> further optimizes inference, significantly lowering response times and improving model utilization.</li>
</ul>
</li>
<li class=""><strong>Disaggregated Serving:</strong>
<ul>
<li class="">Optimize hardware resources by leveraging KServe’s support for <strong>disaggregated inference</strong>, which enables the separate management of compute-intensive inference stages and memory-intensive operations, thereby maximizing hardware efficiency and reducing costs.</li>
</ul>
</li>
</ul>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>Be sure to review the documentation for each project to learn more about preparing your setup for production.</p></div></div>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="pluggable-and-flexible">Pluggable and Flexible<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#pluggable-and-flexible" class="hash-link" aria-label="Direct link to Pluggable and Flexible" title="Direct link to Pluggable and Flexible" translate="no">​</a></h2>
<p>This is not a rigid, all-or-nothing platform. It’s a set of foundational components you can adopt and extend to fit your unique environment. This architecture works whether you're fully committed to Kubernetes or using a hybrid cloud.</p>
<p><strong>You can:</strong></p>
<ul>
<li class="">Start with externally hosted LLMs or self-hosted inference</li>
<li class="">Use Envoy AI Gateway with any compatible provider via a unified API</li>
<li class="">Add your own authorization logic and/or custom functionality via Envoy’s extension filters</li>
<li class="">Deploy your Gateways in different clusters and cloud providers, giving you flexibility of choice in hosting environments</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary">Summary<a href="https://aigateway.envoyproxy.io/blog/envoy-ai-gateway-reference-architecture#summary" class="hash-link" aria-label="Direct link to Summary" title="Direct link to Summary" translate="no">​</a></h2>
<p>The Envoy AI Gateway reference architecture gives platform teams a guide to:</p>
<ul>
<li class="">Centralize access to GenAI models with a unified API</li>
<li class="">Enforce consistent traffic and security policies across model providers</li>
<li class="">Support both internal and external LLM usage without refactoring client applications</li>
<li class="">Scale safely and cost-effectively without reinventing core infrastructure</li>
</ul>
<p>Envoy AI Gateway sits at the heart of this design as an intelligent, extensible control point for all your GenAI traffic.</p>]]></content>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <author>
            <name>Alexa Griffith</name>
            <uri>https://github.com/alexagriffith</uri>
        </author>
        <category label="Reference" term="Reference"/>
        <category label="Adopters" term="Adopters"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing the first Envoy AI Gateway Release – A Community Milestone!]]></title>
        <id>https://aigateway.envoyproxy.io/blog/01-release-announcement</id>
        <link href="https://aigateway.envoyproxy.io/blog/01-release-announcement"/>
        <updated>2025-02-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[On February 25, 2025, Envoy AI Gateway v0.1 was released: the first AI gateway built on CNCF Envoy Gateway, delivering unified LLM API access, upstream authorization, and token-based rate limiting to simplify enterprise AI adoption.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Announcing the first Envoy AI Gateway Release" src="https://aigateway.envoyproxy.io/assets/images/0.1-release-image-2f1b083f2c0de6576ea917e25150bd95.png" width="1280" height="640" class="img_ev3q"></p>
<p>Today, we're excited to announce the <strong>0.1 release</strong> of the <strong>Envoy AI Gateway,</strong> the first AI gateway built on <strong>CNCF's <a href="https://gateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Gateway</a></strong> and backed by a thriving, growing community.</p>
<p>The journey to the <strong>Envoy AI Gateway</strong> started with a simple but powerful vision: make it easier for enterprises to integrate and scale AI in their applications.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-we-are-now">Where We Are Now<a href="https://aigateway.envoyproxy.io/blog/01-release-announcement#where-we-are-now" class="hash-link" aria-label="Direct link to Where We Are Now" title="Direct link to Where We Are Now" translate="no">​</a></h2>
<p>The <strong>Envoy AI Gateway</strong> is now available on GitHub and ready for developers to deploy and explore. It enables enterprises to integrate AI services through a unified API while managing authorization, cost control, and scalability with built-in features:</p>
<ul>
<li class="">✅ <strong>Unified API</strong> for seamless integration with multiple LLM providers (starting with <strong>AWS Bedrock</strong> and <strong>OpenAI</strong>).</li>
<li class="">✅ <strong>Upstream Authorization</strong> to simplify authentication across AI providers.</li>
<li class="">✅ <strong>Usage Rate Limiting</strong> based on word tokens to control costs and ensure operational efficiency.</li>
</ul>
<p>With this release, we're making AI adoption in cloud-native environments more straightforward and accessible to organizations of all sizes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-power-of-community">The Power of Community<a href="https://aigateway.envoyproxy.io/blog/01-release-announcement#the-power-of-community" class="hash-link" aria-label="Direct link to The Power of Community" title="Direct link to The Power of Community" translate="no">​</a></h2>
<p>This milestone wouldn't be possible without the incredible contributions and participation from across the industry.</p>
<p><strong>A shout out to our community members from</strong> Tetrate, Bloomberg, WSO2, RedHat, Google, and our independent contributors who have joined discussions, provided feedback, and helped shape the roadmap.</p>
<p>🐱 Even <strong>the cat Mellow</strong> has attended community meetings, proving that AI isn't just for humans!</p>
<p>The momentum behind <strong>Envoy AI Gateway</strong> speaks to the need for an open, <strong>collaborative</strong> approach to GenAI infrastructure. Our contributors' and early adopters' excitement and expertise drive <strong>innovation forward.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-were-headed">Where We're Headed<a href="https://aigateway.envoyproxy.io/blog/01-release-announcement#where-were-headed" class="hash-link" aria-label="Direct link to Where We're Headed" title="Direct link to Where We're Headed" translate="no">​</a></h2>
<p>We're just getting started! Here's a sneak peek at what's next:</p>
<ul>
<li class=""><strong>Google Gemini 2.0 integration</strong> out-of-the-box.</li>
<li class=""><strong>Provider and Model Fallback Logic</strong> to ensure continued service availability.</li>
<li class=""><strong>Prompt Templating</strong> for consistent AI interactions.</li>
<li class=""><strong>Semantic Caching</strong> to optimize response efficiency and reduce costs.</li>
</ul>
<p>The <strong>roadmap is community-driven</strong>, and we'd love for <strong>more contributors</strong> to help shape the future of AI infrastructure!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="join-the-movement">Join the Movement<a href="https://aigateway.envoyproxy.io/blog/01-release-announcement#join-the-movement" class="hash-link" aria-label="Direct link to Join the Movement" title="Direct link to Join the Movement" translate="no">​</a></h2>
<p>Want to be part of the journey? Here's how you can get involved:</p>
<ul>
<li class=""><strong>Download &amp; Try Envoy AI Gateway</strong> → <a href="https://github.com/envoyproxy/ai-gateway/releases/tag/v0.1.5" target="_blank" rel="noopener noreferrer" class="">GitHub Repo</a></li>
<li class=""><strong>Join the Conversation</strong> → <a href="https://docs.google.com/document/d/10e1sfsF-3G3Du5nBHGmLjXw5GVMqqCvFDqp_O65B0_w/edit?tab=t.0#heading=h.6nxfjwmrm5g6" target="_blank" rel="noopener noreferrer" class="">Attend our community meetings</a> (Mellow might show up too!)</li>
<li class=""><strong>Join us on Slack</strong> → <a href="https://communityinviter.com/apps/envoyproxy/envoy?email=test" target="_blank" rel="noopener noreferrer" class="">Register for Envoy Slack</a> and join #envoy-ai-gateway</li>
<li class=""><strong>Contribute</strong> → Raise issues, suggest improvements, and submit PRs on GitHub</li>
<li class=""><strong>Meet Us In Person</strong> → <a href="https://www.eventbrite.com/e/hands-on-workshop-deploy-configure-and-useenvoy-ai-gateway-tickets-1255461000649?aff=oddtdtcreator" target="_blank" rel="noopener noreferrer" class="">Register for the Envoy AI Gateway Workshop in London on March 31st</a></li>
</ul>
<p>The future of GenAI infrastructure <strong>is open and collaborative,</strong> and we're excited to work with you to build it!</p>
<p>🚀 <strong>Onward to 1.0!</strong></p>]]></content>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Takeshi Yoneda</name>
            <uri>https://github.com/mathetake</uri>
        </author>
        <author>
            <name>Aaron Choo</name>
            <uri>https://github.com/aabchoo</uri>
        </author>
        <author>
            <name>Yao Weng</name>
            <uri>https://github.com/wengyao04</uri>
        </author>
        <category label="News" term="News"/>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[End User Keynote at KubeCon 2024]]></title>
        <id>https://aigateway.envoyproxy.io/blog/kubecon-end-user-keynote-2024</id>
        <link href="https://aigateway.envoyproxy.io/blog/kubecon-end-user-keynote-2024"/>
        <updated>2024-11-14T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Watch the KubeCon 2024 End User Keynote featuring Envoy AI Gateway—learn how Bloomberg and Tetrate are centralizing enterprise AI workflows with unified model access, usage limiting, and upstream authorization at scale.]]></summary>
        <content type="html"><![CDATA[<p>At KubeCon North America 2024, Alexa Griffith had the opportunity to present the End User Keynote on <strong>Centralizing &amp; Simplifying Enterprise AI Workflows with Envoy AI Gateway</strong>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/do1viOk8nok?si=a1nn7sbummXaYPXl" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-presentation">About the presentation<a href="https://aigateway.envoyproxy.io/blog/kubecon-end-user-keynote-2024#about-the-presentation" class="hash-link" aria-label="Direct link to About the presentation" title="Direct link to About the presentation" translate="no">​</a></h2>
<p>As Generative AI reshapes the industry, the demands on AI platforms have rapidly evolved. Organizations now require centralized infrastructure to manage and optimize access to self-trained, open source, and commercial AI models at scale.</p>
<p>In this talk, we introduce the Envoy AI Gateway, a collaborative open source effort led by engineers from <a href="https://www.bloomberg.com/company/what-we-do/engineering-cto/" target="_blank" rel="noopener noreferrer" class="">Bloomberg</a> and <a href="https://tetrate.io/" target="_blank" rel="noopener noreferrer" class="">Tetrate</a>.</p>
<p>Learn how the Envoy AI Gateway, which is built atop Envoy Gateway and Envoy Proxy, provides a unified, scalable solution for model access, usage limiting, and upstream authorization.</p>]]></content>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <author>
            <name>Alexa Griffith</name>
            <uri>https://github.com/alexagriffith</uri>
        </author>
        <category label="News" term="News"/>
        <category label="Presentations" term="Presentations"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Introducing Envoy AI Gateway]]></title>
        <id>https://aigateway.envoyproxy.io/blog/introducing-envoy-ai-gateway</id>
        <link href="https://aigateway.envoyproxy.io/blog/introducing-envoy-ai-gateway"/>
        <updated>2024-10-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Open collaboration to bring AI Gateway features to the Envoy community]]></summary>
        <content type="html"><![CDATA[<p><strong>The industry is embracing Generative AI functionality, and we need to evolve how we handle traffic on an industry-wide scale. Keeping AI traffic handling features exclusive to enterprise licenses is counterproductive to the industry’s needs. This approach limits incentives to a single commercial entity and its customers. Even single-company open-source initiatives do not promote open multi-company collaboration.</strong></p>
<p>A shared challenge like this presents an opportunity for open collaboration to build the necessary features. We believe bringing together different use cases and requirements through open collaboration will lead to better solutions and accelerate innovation. The industry will benefit from diverse expertise and experiences by openly collaborating on software across companies and industries.</p>
<p>That is why Tetrate and Bloomberg have started an open collaboration to bring critical features to this new era of GenAI integration: collaborating openly in the Envoy community to bring AI traffic handling features to Envoy, via Envoy Gateway and Envoy Proxy.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-we-need-ai-traffic-handling-features">Why we need AI traffic handling features<a href="https://aigateway.envoyproxy.io/blog/introducing-envoy-ai-gateway#why-we-need-ai-traffic-handling-features" class="hash-link" aria-label="Direct link to Why we need AI traffic handling features" title="Direct link to Why we need AI traffic handling features" translate="no">​</a></h2>
<p>What makes traffic to LLM models different from traditional API traffic?</p>
<p>On the surface it appears similar. Traffic comes from a client app that is making an API request, and this request has to get to the provider that hosts the LLM model.</p>
<p>However, it is different. Managing LLM traffic from multiple apps, to multiple LLM providers, introduces new and different challenges where traditional API Gateway features fall short.</p>
<p>For example, traditional rate limiting based on the number of requests doesn’t work for controlling usage of LLM providers, as they’re computationally complex services. To measure usage, LLM providers tokenize the words in the request and response messages and count the number of tokens used. This count gives a good approximation of the computational complexity and cost of serving the request.</p>
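The counting idea described above can be sketched as a small budget tracker. This is an illustrative sketch only: the whitespace split is a crude stand-in for a real model-specific tokenizer, and `TokenBudget` is a hypothetical name, not an Envoy AI Gateway API.

```python
# Illustrative sketch only: these names are not Envoy AI Gateway APIs.

def count_tokens(text: str) -> int:
    # Crude whitespace approximation; real providers use model-specific
    # tokenizers (e.g. BPE), which produce different counts.
    return len(text.split())

class TokenBudget:
    """Charges each request by token count instead of counting requests."""

    def __init__(self, limit: int):
        self.limit = limit  # total tokens a client may consume
        self.used = 0

    def allow(self, prompt: str, completion: str = "") -> bool:
        cost = count_tokens(prompt) + count_tokens(completion)
        if self.used + cost > self.limit:
            return False  # over budget: reject before reaching the provider
        self.used += cost
        return True
```

A request-count limiter would treat a 5-token prompt and a 5,000-token prompt identically; charging by tokens tracks the actual cost of serving each request.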
<p>Beyond controlling usage of LLMs, there are many more challenges relating to ease of integration and high-availability architectures. It’s no longer enough to optimize for quality of service alone; adopters must consider usage costs in real time. As adopters of GenAI look for gateway solutions to handle these challenges, they often find the necessary features locked behind enterprise licenses.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="three-key-mvp-features">Three key MVP features<a href="https://aigateway.envoyproxy.io/blog/introducing-envoy-ai-gateway#three-key-mvp-features" class="hash-link" aria-label="Direct link to Three key MVP features" title="Direct link to Three key MVP features" translate="no">​</a></h2>
<p>Now, let’s look at how handling AI traffic poses new challenges for Gateways. We discussed several features with our collaborators at Bloomberg and together settled on three key features for the MVP:</p>
<ul>
<li class=""><strong>Usage Limiting</strong> – to control LLM usage based on word tokens</li>
<li class=""><strong>Unified API</strong> – to simplify client integration with multiple LLM providers</li>
<li class=""><strong>Upstream Authorization</strong> – to configure authorization for multiple upstream LLM providers</li>
</ul>
<p>What other features are you looking for? Get in touch with us to share your use case and help define the future of Envoy AI Gateway.</p>
<p>We are really excited about these features being part of Envoy. They will benefit those integrating with LLM providers and, ultimately, also Gateway users for general API request traffic.</p>
<p>When it comes to AI Gateway features, we have chosen to collaborate and build within the CNCF Envoy project because we believe multi-company, open-source projects benefit the entire industry by enabling innovation without creating single vendor risk.</p>]]></content>
        <author>
            <name>Erica Hughberg</name>
            <uri>https://github.com/missBerg</uri>
        </author>
        <category label="News" term="News"/>
    </entry>
</feed>