
OpenTelemetry in Practice: Building an Observability Platform with a Single Docker Compose Command

Why OpenTelemetry?

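In a microservice architecture, a single request may pass through 5 to 10 services. When something breaks, the traditional approach is to dig through logs, query databases, and guess at the call path, which is extremely inefficient.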

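OpenTelemetry (OTel for short) is an observability framework and standard incubated by the CNCF. It provides a unified API and SDK, so a single set of instrumentation code collects traces, metrics, and logs, which can then be exported to any compatible backend.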

This article uses Docker Compose to stand up a complete observability platform. Going from zero to a working setup takes about 10 minutes.

Architecture Overview

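App (OTel SDK) → OTel Collector → Prometheus (metrics)
                                → Jaeger (traces)
                                → Loki (logs; not deployed in this demo, where logs go to the Collector console)
                    Grafana (unified visualization)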

I. Project Structure

Create the following directory structure:

otel-demo/
├── docker-compose.yml
├── otel-config.yml          # OTel Collector configuration
├── prometheus.yml           # Prometheus configuration
├── grafana/
│   └── provisioning/
│       └── datasources/
│           └── datasource.yml
└── app/
    ├── Dockerfile
    ├── requirements.txt
    └── main.py              # sample application

II. Docker Compose Orchestration

# docker-compose.yml
version: "3.8"

services:
  # OTel Collector: the single ingestion point
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    command: ["--config=/etc/otelcol/config.yml"]
    volumes:
      - ./otel-config.yml:/etc/otelcol/config.yml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # the Collector's own metrics

  # Prometheus: stores metrics
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  # Jaeger: stores and displays traces
  jaeger:
    image: jaegertracing/all-in-one:1.58
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # Jaeger UI
      - "14250:14250"

  # Grafana: unified visualization
  grafana:
    image: grafana/grafana:11.1.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    ports:
      - "3000:3000"

  # Sample application
  app:
    build: ./app
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_SERVICE_NAME=demo-app
    ports:
      - "8080:8080"

III. OTel Collector Configuration

# otel-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  # Export traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Expose metrics on an endpoint that Prometheus scrapes
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: demo

  # Export logs to the console (development environment)
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
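
Every name listed under a pipeline must match a component defined in the top-level receivers, processors, or exporters sections. The Collector checks this itself at startup, but a quick pre-flight check can catch typos before the containers are launched. The following is a minimal sketch under stated assumptions: the function name find_undefined_components is hypothetical, and CONFIG is a hand-written dict mirroring otel-config.yml so that the example needs no YAML parser.

```python
# Hypothetical pre-flight check: every component referenced by a pipeline
# must be defined in the matching top-level section.
# CONFIG mirrors otel-config.yml above as a plain dict (component settings omitted).
CONFIG = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}},
    "exporters": {"otlp/jaeger": {}, "prometheus": {}, "debug": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp/jaeger"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["prometheus"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["debug"],
            },
        }
    },
}


def find_undefined_components(config):
    """Return one message per pipeline reference that has no definition."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind, {})
            for component in pipeline.get(kind, []):
                if component not in defined:
                    # kind[:-1] turns "exporters" into "exporter", and so on
                    problems.append(
                        f"{pipeline_name}: {kind[:-1]} '{component}' is not defined"
                    )
    return problems


if __name__ == "__main__":
    print(find_undefined_components(CONFIG) or "all pipeline references resolve")
```

In a real project the dict would be loaded from the YAML file instead of being written by hand.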

IV. Sample Application (Python)

# app/main.py
from flask import Flask, jsonify
import time
import random
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Initialize the tracer
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("demo-app")

# Initialize the meter
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=5000,
)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter("demo-app")

# Custom metrics
request_counter = meter.create_counter(
    name="http_requests_total",
    description="Total HTTP requests",
    unit="1",
)
request_duration = meter.create_histogram(
    name="http_request_duration_seconds",
    description="Request duration",
    unit="s",
)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()


@app.route("/")
def index():
    with tracer.start_as_current_span("handle_index") as span:
        span.set_attribute("user.id", "demo-user")
        # Simulate business logic
        _simulate_work()
        request_counter.add(1, {"method": "GET", "path": "/"})
        return jsonify({"message": "Hello OpenTelemetry!"})


@app.route("/slow")
def slow_endpoint():
    with tracer.start_as_current_span("handle_slow") as span:
        delay = random.uniform(0.5, 2.0)
        span.set_attribute("delay.seconds", delay)
        time.sleep(delay)
        request_counter.add(1, {"method": "GET", "path": "/slow"})
        request_duration.record(delay, {"path": "/slow"})
        return jsonify({"delay": delay})


@app.route("/error")
def error_endpoint():
    with tracer.start_as_current_span("handle_error") as span:
        try:
            raise ValueError("Something went wrong!")
        except ValueError as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            request_counter.add(1, {"method": "GET", "path": "/error"})
            raise


def _simulate_work():
    with tracer.start_as_current_span("simulate_work"):
        time.sleep(random.uniform(0.01, 0.1))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

V. Application Dependencies and Dockerfile

# app/requirements.txt
flask==3.0.3
requests==2.32.3  # imported by opentelemetry-instrumentation-requests
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-exporter-otlp-proto-grpc==1.25.0
opentelemetry-instrumentation-flask==0.46b0
opentelemetry-instrumentation-requests==0.46b0

# app/Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8080
CMD ["python", "main.py"]

VI. Prometheus and Grafana Configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]

# grafana/provisioning/datasources/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    url: http://jaeger:16686
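
Once the stack is up and the app has served a few requests, the Prometheus side can be sanity-checked from the host through the Prometheus HTTP API (/api/v1/query), without opening any UI. The following is a minimal sketch. The helper names build_query_url and instant_query are hypothetical, and the metric name demo_http_requests_total used in the usage note is an assumption, based on the demo namespace set in the Collector's prometheus exporter.

```python
import json
import urllib.parse
import urllib.request

# Host port published for Prometheus in docker-compose.yml
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL of an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def instant_query(expr, base_url=PROMETHEUS_URL):
    """Run an instant query and return the list of matching series."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]
```

For example, instant_query("demo_http_requests_total") should return one series per label combination once the first scrape has happened. With a 5-second export interval and a 15-second scrape interval, that can take around 20 seconds after the first request.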

VII. Startup and Verification

# Start all services
docker-compose up -d

# Follow the logs to confirm a successful startup
docker-compose logs -f otel-collector

# Generate some requests
curl http://localhost:8080/
curl http://localhost:8080/slow
curl http://localhost:8080/error   # returns HTTP 500 by design

# Open each component's UI
# Grafana: http://localhost:3000 (admin/admin123)
# Jaeger:  http://localhost:16686
# Prometheus: http://localhost:9090
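
A few curl calls produce only a handful of traces. To get enough data for the Jaeger and Prometheus views to be interesting, a small load generator can help. The following is a minimal sketch using only the standard library. The helper names fetch_status and generate_load are hypothetical, and the opener parameter exists only so the function can be exercised without a running service.

```python
import collections
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8080"   # host port published for the sample app
PATHS = ["/", "/slow", "/error"]     # the three routes defined in main.py


def fetch_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with opener(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; /error is expected to give 500
        return err.code
    except urllib.error.URLError:
        return None


def generate_load(rounds, base_url=BASE_URL, opener=urllib.request.urlopen):
    """Hit every path `rounds` times and count (path, status) pairs."""
    counts = collections.Counter()
    for _ in range(rounds):
        for path in PATHS:
            counts[(path, fetch_status(base_url + path, opener))] += 1
    return counts
```

Calling generate_load(rounds=20) against the running stack sends 60 requests in sequence. Because /slow sleeps for 0.5 to 2 seconds per call, expect the run to take on the order of half a minute.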

VIII. Practical Tips

1. Instrumenting an existing service (minimal changes)

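With auto-instrumentation you only need to add two lines (one import and one call), assuming the tracer provider and exporter are already configured as in main.py:

from opentelemetry.instrumentation.flask import FlaskInstrumentor
# The existing code needs no changes
FlaskInstrumentor().instrument_app(app)  # add this line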

Python libraries with auto-instrumentation support include Flask, Django, FastAPI, requests, SQLAlchemy, Redis, and gRPC.

2. Automatic correlation between spans

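Spans within the same process inherit the active context automatically, so there is no need to pass it around by hand. For cross-process calls, the context is propagated automatically through HTTP headers:

# Automatically injects the traceparent header into outgoing requests
RequestsInstrumentor().instrument()  # handled automatically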
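
To make the propagation less of a black box, it helps to look at what the traceparent header carries. In the W3C Trace Context format it has four dash-separated hex fields: version, trace ID, parent span ID, and trace flags. The sketch below builds and parses such a header with the standard library only. The function names are hypothetical, and it handles only version 00. In the real setup the instrumentation libraries do this for you.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the specification
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


if __name__ == "__main__":
    header = build_traceparent(secrets.token_hex(16), secrets.token_hex(8))
    print(header)
    print(parse_traceparent(header))
```

The downstream service reads the trace ID from this header and uses the parent span ID as the parent of its own first span, which is how spans from different processes end up in one trace.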

3. Sampling strategy (essential in production)

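# Add to otel-config.yml
processors:
  tail_sampling:
    decision_wait: 10s   # wait for a trace's spans before deciding
    policies:            # a trace is kept if any policy matches
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline   # keep 10% of everything else
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]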

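The goal: keep 100% of error and slow requests while sampling only 10% of normal requests, which saves storage without losing the important data.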
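
The intended decision rule can be written out as a few lines of Python. This is a simplified model for illustration only, not the Collector's actual algorithm: the function name keep_trace is hypothetical, and the baseline bucket is derived from the trace ID here merely to make the decision deterministic.

```python
def keep_trace(trace_id, has_error, duration_ms,
               latency_threshold_ms=1000, baseline_percentage=10):
    """Simplified model: keep a trace if any one of the three rules matches."""
    if has_error:
        return True                      # rule 1: always keep errors
    if duration_ms > latency_threshold_ms:
        return True                      # rule 2: always keep slow requests
    # Rule 3: keep a fixed share of the rest. Deriving the bucket from the
    # trace ID gives the same answer for the same trace every time.
    bucket = int(trace_id[-8:], 16) % 100
    return bucket < baseline_percentage


if __name__ == "__main__":
    # Hypothetical traces: (trace ID, has_error, duration in ms)
    samples = [
        ("0" * 24 + "00000005", False, 40),    # normal, falls in the baseline share
        ("0" * 24 + "00000063", False, 40),    # normal, falls outside it
        ("0" * 24 + "00000063", True, 40),     # error
        ("0" * 24 + "00000063", False, 1500),  # slow
    ]
    for trace_id, has_error, duration_ms in samples:
        print(trace_id[-8:], has_error, duration_ms,
              keep_trace(trace_id, has_error, duration_ms))
```

The important property is that the three rules are alternatives: a head-based 10% sampler placed in front of them would discard most error and slow traces before the other rules ever saw them.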

Summary

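OpenTelemetry's greatest value is standardization: you don't need to write different integration code for each monitoring backend. Switching backends only requires changing the Collector configuration, and the application code stays untouched.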

Component       Role                            Default port
OTel Collector  Data collection and forwarding  4317 (gRPC) / 4318 (HTTP)
Prometheus      Metric storage and querying     9090
Jaeger          Trace storage and querying      16686 (UI)
Grafana         Unified dashboards              3000

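The entire stack is open source, with no vendor lock-in. From monolith to microservices and from development to production, this observability stack should serve you well for a long time.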