Observability

ToolHive provides observability for your MCP server interactions through built-in OpenTelemetry instrumentation. You get visibility into how your MCP servers perform, including detailed traces, metrics, and error tracking.

How telemetry works

ToolHive automatically instruments your MCP server interactions without requiring changes to your servers. When you enable telemetry, ToolHive captures detailed information about every request, tool call, and server interaction.

ToolHive's telemetry captures rich, protocol-aware information because it understands MCP operations. You get detailed traces showing tool calls, resource access, and prompt operations rather than generic HTTP requests.

Distributed tracing

Distributed tracing shows you the complete journey of each request through your MCP servers. ToolHive creates comprehensive traces that provide end-to-end visibility across the proxy-container boundary.

Trace structure

Here's what a trace looks like when a client calls a tool in the GitHub MCP server (some fields omitted for brevity):

Span: tools/call create_issue (150ms)
├── service.name: thv-github
├── service.version: v0.1.9
├── http.request.method: POST
├── http.request.body.size: 256
├── http.response.status_code: 202
├── http.response.body.size: 1024
├── url.full: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
├── url.path: /messages
├── url.scheme: https
├── server.address: localhost:14972
├── user_agent.original: claude-code/1.0.53
├── mcp.method.name: tools/call
├── mcp.server.name: github
├── mcp.session.id: abc123
├── rpc.system.name: jsonrpc
├── jsonrpc.protocol.version: 2.0
├── jsonrpc.request.id: 5
├── gen_ai.tool.name: create_issue
├── gen_ai.operation.name: execute_tool
├── gen_ai.tool.call.arguments: owner=stacklok, repo=toolhive, pullNumber=1131
├── network.transport: tcp
└── network.protocol.name: http

MCP-specific traces

ToolHive automatically captures traces for all MCP operations, including:

  • Tool calls (tools/call) - When AI assistants use tools
  • Resource access (resources/read) - When servers read files or data
  • Prompt operations (prompts/get) - When servers retrieve prompts
  • Connection events (initialize) - When clients connect to servers

Trace attributes

Each trace includes detailed context across several layers:

Service information

service.name: thv-github
service.version: v0.1.9
host.name: my-machine

HTTP layer

http.request.method: POST
http.request.body.size: 256
http.response.status_code: 202
http.response.body.size: 1024
url.full: /messages?session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
url.path: /messages
url.scheme: https
url.query: session_id=b1d22d07-b35f-4260-9c0c-b872f92f64b1
server.address: localhost:14972
user_agent.original: claude-code/1.0.53

Network layer

network.transport: tcp
network.protocol.name: http
network.protocol.version: 1.1
client.address: 127.0.0.1
client.port: 52431

MCP protocol details

Details about the MCP operation being performed (some fields are specific to each operation):

mcp.method.name: tools/call
mcp.server.name: github
mcp.session.id: abc123
mcp.protocol.version: 2025-03-26
mcp.is_batch: false
rpc.system.name: jsonrpc
jsonrpc.protocol.version: 2.0
jsonrpc.request.id: 123

Method-specific attributes

  • tools/call traces include:
    • gen_ai.tool.name - The name of the tool being called
    • gen_ai.operation.name - Set to execute_tool
    • gen_ai.tool.call.arguments - Sanitized tool arguments (sensitive values redacted)
  • resources/read traces include:
    • mcp.resource.uri - The URI of the resource being accessed
  • prompts/get traces include:
    • gen_ai.prompt.name - The name of the prompt being retrieved
  • initialize traces include:
    • mcp.protocol.version - The MCP protocol version negotiated

Legacy attribute names

By default, ToolHive emits both the new OpenTelemetry semantic convention attribute names shown above and legacy attribute names (e.g., http.method, mcp.method, mcp.tool.name) for backward compatibility with existing dashboards. You can control this with the --otel-use-legacy-attributes flag.
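
As a sketch of what dual emission means in practice, the snippet below duplicates span attributes under their legacy aliases. The mapping covers only the legacy names mentioned above and is illustrative; the full set ToolHive emits may differ.

```python
# Partial, illustrative mapping from current semantic convention names to
# the legacy aliases mentioned in the text above.
LEGACY_NAMES = {
    "http.request.method": "http.method",
    "mcp.method.name": "mcp.method",
    "gen_ai.tool.name": "mcp.tool.name",
}

def with_legacy_attributes(attributes: dict) -> dict:
    """Return a copy of span attributes with legacy aliases added."""
    result = dict(attributes)
    for new_name, legacy_name in LEGACY_NAMES.items():
        if new_name in result:
            result[legacy_name] = result[new_name]
    return result

span_attrs = with_legacy_attributes({
    "http.request.method": "POST",
    "mcp.method.name": "tools/call",
    "gen_ai.tool.name": "create_issue",
})
print(span_attrs["mcp.method"])  # tools/call
```

A dashboard that queries the legacy mcp.method attribute keeps working because both names carry the same value.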

Metrics collection

ToolHive automatically collects metrics about your MCP server usage and performance. These metrics help you understand usage patterns and performance characteristics, and identify potential issues.

Metric labels

All metrics include consistent labels for filtering and aggregation:

  • server - MCP server name (e.g., fetch, github)
  • transport - Backend transport type (stdio, sse, or streamable-http)
  • method - HTTP method (POST, GET)
  • mcp_method - MCP protocol method (e.g., tools/call, resources/read)
  • status - Request outcome (success or error)
  • status_code - HTTP status code (200, 400, 500)
  • tool - Tool name for tool-specific metrics

Key metrics

Example metrics from the Prometheus /metrics endpoint are shown below (some fields are omitted for brevity):

Request metrics

# HELP toolhive_mcp_requests_total Total number of MCP requests
# TYPE toolhive_mcp_requests_total counter
toolhive_mcp_requests_total{mcp_method="tools/list",method="POST",server="github",status="success",status_code="202",transport="stdio"} 2

# HELP toolhive_mcp_request_duration_seconds Duration of MCP requests in seconds
# TYPE toolhive_mcp_request_duration_seconds histogram
toolhive_mcp_request_duration_seconds_bucket{mcp_method="tools/list",method="POST",server="github",status="success",status_code="202",transport="stdio",le="10000"} 2
toolhive_mcp_request_duration_seconds_bucket{mcp_method="tools/list",method="POST",server="github",status="success",status_code="202",transport="stdio",le="+Inf"} 2
toolhive_mcp_request_duration_seconds_sum{mcp_method="tools/list",method="POST",server="github",status="success",status_code="202",transport="stdio"} 0.000219416
toolhive_mcp_request_duration_seconds_count{mcp_method="tools/list",method="POST",server="github",status="success",status_code="202",transport="stdio"} 2
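
The _sum and _count series of the histogram can be combined to derive a mean request duration. A minimal sketch using the example values above:

```python
# Derive the mean request duration from the histogram's _sum and _count
# samples (values copied from the example output above).
duration_sum = 0.000219416   # toolhive_mcp_request_duration_seconds_sum
request_count = 2            # toolhive_mcp_request_duration_seconds_count

mean_seconds = duration_sum / request_count
print(f"mean request duration: {mean_seconds * 1000:.3f} ms")  # 0.110 ms
```

In PromQL, the equivalent over a time window is rate(toolhive_mcp_request_duration_seconds_sum[5m]) / rate(toolhive_mcp_request_duration_seconds_count[5m]).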

Connection metrics

# HELP toolhive_mcp_active_connections Number of active MCP connections
# TYPE toolhive_mcp_active_connections gauge
toolhive_mcp_active_connections{connection_type="sse",server="github",transport="stdio"} 3

Tool-specific metrics

# HELP toolhive_mcp_tool_calls_total Total number of MCP tool calls
# TYPE toolhive_mcp_tool_calls_total counter
toolhive_mcp_tool_calls_total{server="github",status="success",tool="get_file_contents"} 15
toolhive_mcp_tool_calls_total{server="github",status="success",tool="list_pull_requests"} 4
toolhive_mcp_tool_calls_total{server="github",status="success",tool="search_issues"} 2

MCP semantic convention metrics

In addition to the ToolHive-prefixed metrics above, ToolHive emits metrics that follow the OpenTelemetry MCP semantic conventions:

Metric                          Type       Description
mcp.server.operation.duration   Histogram  Duration of MCP server operations
mcp.client.operation.duration   Histogram  Duration of MCP client operations (vMCP)

These metrics use the same labels as the ToolHive-prefixed metrics and are compatible with dashboards built for the OpenTelemetry MCP semantic conventions.

vMCP metrics

When using Virtual MCP Server (vMCP), additional metrics are available for monitoring backend operations, workflow executions, and optimizer performance. For details, see the vMCP telemetry guide.

Trace context propagation

ToolHive supports two methods of trace context propagation:

  • HTTP headers: Standard W3C Trace Context (traceparent and tracestate headers) and W3C Baggage propagation
  • MCP _meta field: Trace context embedded in MCP request parameters via the params._meta field, following the MCP specification

When both are present, the MCP _meta trace context takes priority. This enables proper trace correlation across MCP server boundaries, even when MCP clients inject trace context into the request payload rather than HTTP headers.

Export options

ToolHive supports multiple export formats to integrate with your existing observability infrastructure.

OTLP export

ToolHive supports OpenTelemetry Protocol (OTLP) export for both traces and metrics to any compatible backend, either directly or via a collector application.

The OpenTelemetry ecosystem includes a wide range of observability backends, from open source tools like Jaeger to commercial platforms like Splunk, Datadog, New Relic, and Honeycomb.

Prometheus export

ToolHive can expose Prometheus-style metrics at a /metrics endpoint, enabling:

  • Direct scraping by Prometheus servers
  • Service discovery in Kubernetes environments
  • Integration with existing Prometheus-based monitoring stacks

Dual export

Both OTLP and Prometheus can be enabled simultaneously, allowing you to:

  • Send traces to specialized tracing backends
  • Expose metrics for Prometheus scraping
  • Maintain compatibility with existing monitoring infrastructure

Data sanitization

ToolHive automatically protects sensitive information in traces:

  • Sensitive arguments: Tool arguments containing passwords, tokens, or keys are redacted
  • Sensitive key detection: Arguments with keys containing patterns like "password", "token", "secret", "key", "auth", or "credential" are redacted
  • Argument truncation: Long arguments are truncated to prevent excessive trace size

For example, a tool call with sensitive arguments:

gen_ai.tool.call.arguments: password=secret123, api_key=abc456, title=Bug report

ToolHive sanitizes this in the trace as:

gen_ai.tool.call.arguments: password=[REDACTED], api_key=[REDACTED], title=Bug report
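
The redaction and truncation rules described above can be sketched as a small function. The pattern list comes from this page; the truncation limit is illustrative:

```python
# Sketch of the sanitization rules above: arguments whose keys contain a
# sensitive pattern are replaced with [REDACTED]; long values are truncated.
SENSITIVE_PATTERNS = ("password", "token", "secret", "key", "auth", "credential")
MAX_VALUE_LENGTH = 100  # illustrative limit, not ToolHive's actual value

def sanitize_arguments(arguments: dict) -> str:
    parts = []
    for key, value in arguments.items():
        if any(pattern in key.lower() for pattern in SENSITIVE_PATTERNS):
            value = "[REDACTED]"
        else:
            value = str(value)
            if len(value) > MAX_VALUE_LENGTH:
                value = value[:MAX_VALUE_LENGTH] + "..."
        parts.append(f"{key}={value}")
    return ", ".join(parts)

print(sanitize_arguments({
    "password": "secret123",
    "api_key": "abc456",
    "title": "Bug report",
}))
# password=[REDACTED], api_key=[REDACTED], title=Bug report
```

Matching on key substrings means api_key and auth_token are caught as well, at the cost of occasionally redacting harmless keys like keyboard.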

Monitoring examples

These examples show how ToolHive's observability works in practice.

Tool call monitoring

When a client calls the create_issue tool:

Request:

{
  "jsonrpc": "2.0",
  "id": "req_456",
  "method": "tools/call",
  "params": {
    "name": "create_issue",
    "arguments": {
      "title": "Bug report",
      "body": "Found an issue with the API"
    }
  }
}

Generated trace:

Span: tools/call create_issue
├── mcp.method.name: tools/call
├── jsonrpc.request.id: req_456
├── gen_ai.tool.name: create_issue
├── gen_ai.tool.call.arguments: title=Bug report, body=Found an issue with...
├── mcp.server.name: github
├── network.transport: tcp
├── http.request.method: POST
├── http.response.status_code: 200
└── duration: 850ms

Generated metrics:

toolhive_mcp_requests_total{mcp_method="tools/call",server="github",status="success"} 1
toolhive_mcp_request_duration_seconds_sum{mcp_method="tools/call",server="github"} 0.85
toolhive_mcp_tool_calls_total{server="github",tool="create_issue",status="success"} 1

Error tracking

Failed requests generate error traces and metrics:

Error trace:

Span: tools/call invalid_tool
├── mcp.method.name: tools/call
├── gen_ai.tool.name: invalid_tool
├── http.response.status_code: 400
├── span.status: ERROR
├── span.status_message: Tool not found
└── duration: 12ms

Error metrics:

toolhive_mcp_requests_total{mcp_method="tools/call",server="github",status="error",status_code="400"} 1
toolhive_mcp_tool_calls_total{server="github",tool="invalid_tool",status="error"} 1

Key performance indicators

Monitor these key metrics for optimal MCP server performance:

  1. Request rate: rate(toolhive_mcp_requests_total[5m])
  2. Error rate: rate(toolhive_mcp_requests_total{status="error"}[5m])
  3. Response time: histogram_quantile(0.95, toolhive_mcp_request_duration_seconds_bucket)
  4. Active connections: toolhive_mcp_active_connections

Setting up dashboards and alerts

This section shows practical examples of integrating ToolHive's observability data with common monitoring tools.

Prometheus integration

Configure Prometheus to scrape ToolHive metrics:

prometheus.yml
scrape_configs:
  - job_name: 'toolhive-mcp-proxy'
    static_configs:
      - targets:
          - 'localhost:43832' # Example MCP server
          - 'localhost:51712' # Example MCP server
    scrape_interval: 15s
    metrics_path: /metrics

Grafana dashboard queries

Example queries for monitoring dashboards:

# Request rate by server
sum(rate(toolhive_mcp_requests_total[5m])) by (server)

# Error rate percentage
sum(rate(toolhive_mcp_requests_total{status="error"}[5m])) by (server) /
sum(rate(toolhive_mcp_requests_total[5m])) by (server) * 100

# Response time percentiles
histogram_quantile(0.95, sum(rate(toolhive_mcp_request_duration_seconds_bucket[5m])) by (le, server))

# Tool usage distribution
sum(rate(toolhive_mcp_tool_calls_total[5m])) by (tool, server)

# Active connections
toolhive_mcp_active_connections

Alerting rules

Example Prometheus alerting rules:

alerting-rules.yml
groups:
  - name: toolhive-mcp-proxy
    rules:
      - alert: HighErrorRate
        expr: rate(toolhive_mcp_requests_total{status="error"}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: 'High error rate in MCP proxy'
          description: 'Error rate is {{ $value }} errors per second'

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, toolhive_mcp_request_duration_seconds_bucket) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High response time in MCP proxy'
          description: '95th percentile response time is {{ $value }}s'

      - alert: ProxyDown
        expr: up{job="toolhive-mcp-proxy"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'MCP proxy is down'
          description: 'ToolHive MCP proxy has been down for more than 1 minute'

Recommendations

Production deployment

  1. Use appropriate sampling rates (1-10% for high-traffic systems)
  2. Configure authentication for OTLP endpoints
  3. Use HTTPS transport in production
  4. Monitor telemetry overhead with metrics
  5. Set up alerting on key performance indicators

Development and testing

  1. Use 100% sampling for complete visibility
  2. Enable local backends (Jaeger, Prometheus)
  3. Test with realistic workloads to validate metrics
  4. Verify trace correlation across service boundaries

Cost optimization

  1. Tune sampling rates based on traffic patterns
  2. Use head-based sampling for consistent trace collection
  3. Monitor backend costs and adjust accordingly
  4. Filter out health check requests if not needed
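
Head-based sampling means the keep/drop decision is derived once from the trace ID, so every span of a trace gets the same decision. A minimal sketch of the technique; the hashing scheme here is illustrative, not ToolHive's internals:

```python
# Illustrative head-based sampler: map the hex trace ID deterministically
# onto buckets 0..9999 and keep the trace when its bucket falls below the
# configured rate. The same trace ID always yields the same decision.
import random

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    return int(trace_id, 16) % 10_000 < sample_rate * 10_000

# With a 10% rate, roughly one in ten traces is kept overall.
random.seed(0)
trace_ids = [f"{random.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(keep_trace(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision is a pure function of the trace ID, every service that sees the same ID samples consistently, which keeps traces complete instead of partially collected.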

Next steps

Now that you understand how ToolHive's observability works, you can:

  1. Choose a monitoring backend that fits your needs and budget
  2. Follow the tutorial to set up a local observability stack with OpenTelemetry, Jaeger, Prometheus, and Grafana
  3. Enable telemetry when running your servers
  4. Set up basic dashboards to track request rates, error rates, and response times
  5. Configure alerts for critical issues

The telemetry system works automatically once enabled, providing immediate insights into your MCP server performance and usage patterns.