Observability Stack: Monitoring Your AI Gateway
Introduction
Observability is crucial for any production AI Gateway deployment. Enterprise AgentGateway emits comprehensive OpenTelemetry-compatible metrics, logs, and traces out of the box. In this guide, we'll deploy a complete observability stack consisting of Grafana, Prometheus, and Tempo (with Loki as an optional log-aggregation add-on) to collect, store, and visualize this rich telemetry data.
This setup will give you real-time visibility into your AI Gateway’s performance, cost metrics, token usage, streaming performance, and more.
What You’ll Learn
- Deploy Tempo for distributed tracing
- Install Prometheus and Grafana for metrics and visualization
- Configure AgentGateway-specific monitoring
- Set up the official AgentGateway Grafana dashboard
- Access and interpret observability data
Prerequisites
- kind cluster with Enterprise AgentGateway installed (from Part 1)
- kubectl configured to work with your cluster
- Helm 3.x installed
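Before moving on, it's worth sanity-checking these prerequisites. A quick sketch, assuming the cluster and namespace names from Part 1:
# Verify the kind cluster, AgentGateway pods, and Helm version
kind get clusters
kubectl get pods -n enterprise-agentgateway
helm version --short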
Architecture Overview
Our observability stack will include:
- Grafana: Visualization and dashboards
- Prometheus: Metrics collection and storage
- Tempo: Distributed tracing backend
- Loki: Log aggregation (optional)
- Pre-configured Datasources and Dashboards: Grafana wired to Prometheus and Tempo, with the official AgentGateway dashboard imported automatically
Deploy Monitoring Stack
Create Monitoring Namespace
kubectl create namespace monitoring
Install Prometheus
First, let’s deploy Prometheus to collect metrics:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --set server.service.type=NodePort \
  --set server.service.nodePort=30090 \
  --set server.persistentVolume.enabled=false \
  --set alertmanager.enabled=false \
  --set kube-state-metrics.enabled=true \
  --set nodeExporter.enabled=false \
  --set pushgateway.enabled=false
Install Tempo
Deploy Tempo for distributed tracing:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Create Tempo configuration
cat <<EOF > tempo-values.yaml
tempo:
  retention: 24h
  receivers:
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  storage:
    trace:
      backend: local
      local:
        path: /var/tempo/traces
  compactor:
    retention: 24h
  metrics_generator:
    enabled: true
    registry:
      external_labels:
        source: tempo
service:
  type: NodePort
EOF
helm install tempo grafana/tempo \
  --namespace monitoring \
  --values tempo-values.yaml
Install Grafana
Deploy Grafana with pre-configured datasources:
# Create Grafana configuration
cat <<EOF > grafana-values.yaml
service:
  type: NodePort
  nodePort: 30080
persistence:
  enabled: false
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-server:80
        access: proxy
        isDefault: true
      - name: Tempo
        type: tempo
        url: http://tempo:3100
        access: proxy
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
dashboards:
  default:
    agentgateway-genai:
      gnetId: 21703
      revision: 1
      datasource: Prometheus
adminPassword: admin
EOF
helm install grafana grafana/grafana \
  --namespace monitoring \
  --values grafana-values.yaml
Configure AgentGateway for Observability
Update AgentGateway Configuration
We need to configure AgentGateway to send traces to our Tempo instance:
kubectl apply -f- <<'EOF'
---
apiVersion: enterpriseagentgateway.solo.io/v1alpha1
kind: EnterpriseAgentgatewayParameters
metadata:
  name: agentgateway-params
  namespace: enterprise-agentgateway
spec:
  # Enable shared extensions
  sharedExtensions:
    extauth:
      enabled: true
      deployment:
        spec:
          replicas: 1
    ratelimiter:
      enabled: true
      deployment:
        spec:
          replicas: 1
    extCache:
      enabled: true
      deployment:
        spec:
          replicas: 1
  # Enhanced observability configuration
  rawConfig:
    config:
      logging:
        level: info
        fields:
          add:
            jwt: 'jwt'
            request.body: json(request.body)
            response.body: json(response.body)
            request.body.modelId: json(request.body).modelId
            llm.provider: 'llm.provider'
            llm.model: 'llm.requestModel'
            llm.tokens.input: 'llm.inputTokens'
            llm.tokens.output: 'llm.outputTokens'
            llm.cost.total: 'llm.totalCost'
        format: json
      tracing:
        randomSampling: 'true'
        collector:
          endpoint: tempo.monitoring.svc.cluster.local:4317
        fields:
          add:
            # GenAI semantic conventions
            gen_ai.operation.name: '"chat"'
            gen_ai.system: "llm.provider"
            gen_ai.prompt: 'llm.prompt'
            gen_ai.completion: 'llm.completion.map(c, {"role":"assistant", "content": c})'
            gen_ai.request.model: "llm.requestModel"
            gen_ai.response.model: "llm.responseModel"
            gen_ai.usage.completion_tokens: "llm.outputTokens"
            gen_ai.usage.prompt_tokens: "llm.inputTokens"
            gen_ai.request: 'flatten(llm.params)'
            # Additional context
            jwt: 'jwt'
            response.body: 'json(response.body)'
            llm.cost.total: 'llm.totalCost'
            llm.cost.input: 'llm.inputCost'
            llm.cost.output: 'llm.outputCost'
      metrics:
        enabled: true
        prefix: "agentgateway"
        tags:
          add:
            provider: 'llm.provider'
            model: 'llm.requestModel'
            user: 'jwt.sub // "anonymous"'
  # Service configuration for monitoring
  service:
    spec:
      type: NodePort
      ports:
        - name: http
          port: 8080
          targetPort: 8080
          nodePort: 30081  # 30080 is already taken by Grafana
          protocol: TCP
        - name: metrics
          port: 9091
          targetPort: 9091
          nodePort: 30091
          protocol: TCP
  # Deployment configuration
  deployment:
    spec:
      replicas: 1
      template:
        spec:
          containers:
            - name: agentgateway
              resources:
                requests:
                  cpu: 200m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 256Mi
EOF
Verify Monitoring Stack
Check All Pods are Running
kubectl get pods -n monitoring
Expected output:
NAME                                 READY   STATUS    RESTARTS   AGE
grafana-7b8c4c4d4c-xyz12             1/1     Running   0          3m
prometheus-server-5f8b8b7d7d-abc34   1/1     Running   0          5m
tempo-0                              1/1     Running   0          4m
Verify AgentGateway Configuration
kubectl get pods -n enterprise-agentgateway
kubectl logs deploy/agentgateway -n enterprise-agentgateway --tail=20
Look for log entries indicating successful startup and trace collection.
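Exact log wording varies by release, so rather than looking for a specific message, filtering for tracing-related keywords is a reasonable first pass:
# Keyword filter for trace-export activity (patterns are a guess, not exact log text)
kubectl logs deploy/agentgateway -n enterprise-agentgateway | grep -iE 'trace|otlp|tempo'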
Access Monitoring Interfaces
Grafana Dashboard
Access Grafana at: http://localhost:30080
- Username: admin
- Password: admin
Prometheus UI
Access Prometheus at: http://localhost:30090
AgentGateway Metrics
Access AgentGateway metrics directly: http://localhost:30091/metrics
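To see which metric families the gateway actually exports under the agentgateway prefix configured earlier, a quick grep over the scrape endpoint helps:
# List exported metric names (strip labels, dedupe)
curl -s http://localhost:30091/metrics | grep '^agentgateway' | cut -d'{' -f1 | sort -u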
AgentGateway GenAI Dashboard
The Grafana deployment automatically imports the official AgentGateway dashboard (ID: 21703). This dashboard provides:
Key Metrics Panels
Request Rate and Latency
- Requests per second by provider
- P95, P99 latency percentiles
- Error rates and status codes
Token Usage and Costs
- Input/output token consumption
- Cost tracking per provider
- Token efficiency metrics
Model Performance
- Response time by model
- Token generation rates
- Streaming performance
Provider Health
- Provider availability
- Error rates by provider
- Failover events
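To reproduce panels like these by hand in Explore, queries along the following lines are a reasonable starting point. The metric names are assumptions based on the agentgateway prefix and tags configured earlier; verify them against the /metrics output:
# Requests per second by provider
sum by (provider) (rate(agentgateway_requests_total[5m]))
# P95 latency across all requests
histogram_quantile(0.95, sum by (le) (rate(agentgateway_request_duration_seconds_bucket[5m])))
# Error rate as a fraction of all requests
sum(rate(agentgateway_requests_total{status=~"5.."}[5m])) / sum(rate(agentgateway_requests_total[5m]))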
Using the Dashboard
- Navigate to Dashboards > Browse in Grafana
- Open AgentGateway GenAI Dashboard
- Set time range (e.g., Last 1 hour)
- Select providers/models using dropdown filters
Testing Your Monitoring Setup
Let’s create some test traffic to see data flowing through our observability stack:
Deploy Test Mock Backend
kubectl apply -f- <<'EOF'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mock-openai
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mock-openai
  template:
    metadata:
      labels:
        app: mock-openai
    spec:
      containers:
        - name: wiremock
          image: wiremock/wiremock:3.3.1
          ports:
            - containerPort: 8080
          args:
            - --port=8080
            - --verbose
          volumeMounts:
            - name: wiremock-data
              mountPath: /home/wiremock/mappings
          env:
            - name: JAVA_OPTS
              value: "-Dfile.encoding=UTF-8"
      volumes:
        - name: wiremock-data
          configMap:
            name: mock-openai-config
---
apiVersion: v1
kind: Service
metadata:
  name: mock-openai
  namespace: default
spec:
  selector:
    app: mock-openai
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: mock-openai-config
  namespace: default
data:
  mappings.json: |
    {
      "mappings": [
        {
          "request": {
            "method": "POST",
            "url": "/v1/chat/completions"
          },
          "response": {
            "status": 200,
            "headers": {
              "Content-Type": "application/json"
            },
            "jsonBody": {
              "id": "chatcmpl-test-{{randomValue type='ALPHANUMERIC' length=10}}",
              "object": "chat.completion",
              "created": "{{now format='epoch'}}",
              "model": "gpt-4",
              "choices": [
                {
                  "index": 0,
                  "message": {
                    "role": "assistant",
                    "content": "Hello! This is a mock response from the test OpenAI backend. Current time: {{now}}."
                  },
                  "finish_reason": "stop"
                }
              ],
              "usage": {
                "prompt_tokens": 25,
                "completion_tokens": 15,
                "total_tokens": 40
              }
            },
            "transformers": ["response-template"],
            "fixedDelayMilliseconds": 100
          }
        }
      ]
    }
EOF
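Before wiring up the route, it's worth confirming the mock responds on its own. A throwaway curl pod is one way to do that (curlimages/curl is just a convenient image choice):
# Call the mock directly from inside the cluster
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -X POST http://mock-openai.default.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "ping"}]}'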
Configure AgentGateway Route
kubectl apply -f- <<'EOF'
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: mock-openai-backend
  namespace: default
spec:
  openai:
    connectionString: http://mock-openai.default.svc.cluster.local
    authToken:
      secretRef:
        name: mock-openai-secret
        key: api-key
    model:
      default: gpt-4
      mapping:
        "gpt-4": gpt-4
        "gpt-3.5-turbo": gpt-3.5-turbo
---
apiVersion: v1
kind: Secret
metadata:
  name: mock-openai-secret
  namespace: default
type: Opaque
stringData:
  api-key: "test-api-key"
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mock-openai-route
  namespace: default
spec:
  parentRefs:
    - name: agentgateway
      namespace: enterprise-agentgateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /openai
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplaceFullPath
              replaceFullPath: /v1/chat/completions
      backendRefs:
        - name: mock-openai-backend
          group: agentgateway.dev
          kind: AgentgatewayBackend
          port: 80
EOF
Generate Test Traffic
# Send test requests to generate observability data
for i in {1..10}; do
  curl -X POST http://localhost:8080/openai \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer test-token" \
    -d '{
      "model": "gpt-4",
      "messages": [
        {"role": "user", "content": "Test message '${i}' for observability demo"}
      ],
      "max_tokens": 50
    }' && echo
  sleep 2
done
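Once the loop finishes, you can confirm Prometheus ingested the traffic by querying its HTTP API directly (same assumed metric name as in the alert rules later in this guide):
# Query the current request rate via the Prometheus API
curl -s -G 'http://localhost:30090/api/v1/query' \
  --data-urlencode 'query=sum(rate(agentgateway_requests_total[5m]))'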
Viewing Observability Data
In Grafana
- Dashboard Overview: Navigate to the AgentGateway dashboard
- Request Metrics: See request rates, response times
- Token Usage: Monitor input/output tokens and costs
- Error Analysis: Check error rates and types
In Tempo (Distributed Tracing)
- Go to Explore in Grafana
- Select Tempo datasource
- Search for traces by service name: agentgateway
- Analyze trace spans showing request flow
Key Traces to Look For
- HTTP Request Span: Gateway ingress
- LLM Provider Span: Backend communication
- Auth Span: Authentication processing
- Rate Limit Span: Rate limiting decisions
Sample Tempo Query
{ resource.service.name = "agentgateway" }
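TraceQL can also narrow the search, for example by span duration or by the GenAI attributes set in the tracing configuration earlier:
# Only traces slower than 2 seconds
{ resource.service.name = "agentgateway" && duration > 2s }
# Traces for a specific model, via the gen_ai attributes added above
{ span.gen_ai.request.model = "gpt-4" }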
Advanced Monitoring Configuration
Custom Metrics
Add custom metrics to track specific business KPIs:
rawConfig:
  config:
    metrics:
      enabled: true
      customMetrics:
        - name: "user_requests_total"
          type: "counter"
          help: "Total requests per user"
          labels:
            - user: 'jwt.sub // "anonymous"'
            - model: 'llm.requestModel'
        - name: "token_cost_dollars"
          type: "histogram"
          help: "Cost in dollars per request"
          buckets: [0.001, 0.01, 0.1, 1.0, 10.0]
          value: 'llm.totalCost'
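Assuming these metrics are exported under the agentgateway prefix, PromQL like the following could consume them; a sketch, so verify the final metric names against /metrics:
# Top five users by request volume over the last hour
topk(5, sum by (user) (increase(agentgateway_user_requests_total[1h])))
# Median per-request cost from the cost histogram
histogram_quantile(0.5, sum by (le) (rate(agentgateway_token_cost_dollars_bucket[5m])))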
Alert Rules
Create Prometheus alert rules for critical conditions:
groups:
  - name: agentgateway.rules
    rules:
      - alert: HighErrorRate
        expr: rate(agentgateway_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(agentgateway_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High latency detected"
      - alert: HighCost
        expr: increase(agentgateway_token_cost_dollars_sum[1h]) > 50
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "High hourly cost detected"
Monitoring Best Practices
Resource Planning
Monitor these key metrics for capacity planning:
- Request Rate: Track requests/second trends
- Token Velocity: Monitor tokens/minute per model
- Cost Burn Rate: Track $/hour consumption
- Provider Latency: Monitor P95/P99 response times
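These translate directly into PromQL once the metric names are pinned down; for instance (names assumed, as before):
# Cost burn rate in dollars per hour, from the cost histogram's sum
sum(increase(agentgateway_token_cost_dollars_sum[1h]))
# Request-rate trend by provider over the last 30 minutes
sum by (provider) (rate(agentgateway_requests_total[30m]))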
Performance Optimization
Use observability data to optimize:
- Route Configuration: Based on latency patterns
- Caching Strategies: Based on request patterns
- Rate Limiting: Based on usage distribution
- Provider Selection: Based on cost/performance
Cost Management
Track and alert on:
- Daily/Monthly spend per provider
- Cost per user/team
- Token efficiency (output/input ratio)
- Most expensive models/users
Troubleshooting
Missing Traces
If traces aren’t appearing in Tempo:
# Check Tempo is receiving traces
kubectl logs -l app.kubernetes.io/name=tempo -n monitoring
# Verify AgentGateway trace configuration
kubectl get enterpriseagentgatewayparameters agentgateway-params -n enterprise-agentgateway -o yaml
# Check connectivity
kubectl exec -n enterprise-agentgateway deployment/agentgateway -- nc -zv tempo.monitoring.svc.cluster.local 4317
Missing Metrics
If metrics aren’t showing in Prometheus:
# Check Prometheus targets via the HTTP API (the UI page is at /targets)
curl http://localhost:30090/api/v1/targets
# Verify AgentGateway metrics endpoint
curl http://localhost:30091/metrics
# Check Prometheus configuration
kubectl get configmap prometheus-server -n monitoring -o yaml
Dashboard Issues
If the dashboard isn’t loading:
# Check Grafana pod logs
kubectl logs -l app.kubernetes.io/name=grafana -n monitoring
# Verify datasource connectivity
kubectl exec -n monitoring deployment/grafana -- nc -zv prometheus-server 80
kubectl exec -n monitoring deployment/grafana -- nc -zv tempo 3100
Cleanup
To remove the monitoring stack:
helm uninstall grafana -n monitoring
helm uninstall prometheus -n monitoring
helm uninstall tempo -n monitoring
kubectl delete namespace monitoring
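The Helm uninstalls don't touch the test resources created earlier in the default namespace; remove those separately (run kubectl api-resources if the CRD resource name differs in your version):
# Remove the mock backend, route, and supporting objects
kubectl delete -n default \
  httproute/mock-openai-route \
  agentgatewaybackend/mock-openai-backend \
  deployment/mock-openai \
  service/mock-openai \
  configmap/mock-openai-config \
  secret/mock-openai-secret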
Next Steps
With comprehensive observability in place, you’re ready to:
- Set up development environments with mock providers
- Configure real AI provider integrations
- Implement advanced routing strategies
- Add security and rate limiting policies
In our next blog post, we’ll create a mock OpenAI environment for cost-free development and testing.
Key Takeaways
- Complete visibility into AI Gateway performance and costs
- Real-time monitoring with Grafana dashboards
- Distributed tracing for request flow analysis
- Cost tracking and optimization insights
- Production-ready monitoring stack for kind clusters
Your AgentGateway is now fully instrumented and ready for production workloads with comprehensive observability!