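Why OpenTelemetry?
In a microservice architecture, a single request may pass through 5 to 10 services. When something breaks, the traditional approach is to dig through logs, query databases, and guess at the call path, which is slow and error-prone.
OpenTelemetry (OTel for short) is a CNCF incubating project that defines a vendor-neutral observability standard. It provides a unified API and SDK, so one set of instrumentation collects Traces, Metrics, and Logs, which can then be exported to any compatible backend.
This article uses Docker Compose to stand up a complete observability stack with one command; going from zero to a working setup takes about 10 minutes.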
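Architecture overview
App (OTel SDK) → OTel Collector → Prometheus (metrics)
                                → Jaeger (traces)
                                → Loki (logs; not deployed in this demo, where logs go to the Collector's debug exporter)
Grafana (unified visualization, reading from the backends above)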
I. Project structure
Create the following directory layout:
otel-demo/
├── docker-compose.yml
├── otel-config.yml          # OTel Collector configuration
├── prometheus.yml           # Prometheus configuration
├── grafana/
│   └── provisioning/
│       └── datasources/
│           └── datasource.yml
└── app/
    ├── Dockerfile
    ├── requirements.txt
    └── main.py              # Sample application
II. Docker Compose orchestration
# docker-compose.yml
version: "3.8"
services:
  # OTel Collector: the single ingestion point
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    command: ["--config=/etc/otelcol/config.yml"]
    volumes:
      - ./otel-config.yml:/etc/otelcol/config.yml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Collector's own metrics
  # Prometheus: stores metrics
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  # Jaeger: stores and displays traces
  jaeger:
    image: jaegertracing/all-in-one:1.58
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"   # Jaeger UI
      - "14250:14250"
  # Grafana: unified visualization
  grafana:
    image: grafana/grafana:11.1.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin123   # demo only
    ports:
      - "3000:3000"
  # Sample application
  app:
    build: ./app
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_SERVICE_NAME=demo-app
    ports:
      - "8080:8080"
III. OTel Collector configuration
# otel-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
exporters:
  # Traces are forwarded to Jaeger over OTLP
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Metrics are exposed on :8889 for Prometheus to scrape
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: demo
  # Logs are printed to the Collector console (development only)
  debug:
    verbosity: basic
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
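A common mistake in a Collector config is naming a component in `service.pipelines` that was never declared under `receivers`, `processors`, or `exporters`; the Collector refuses to start in that case. The following is a minimal, stdlib-only sketch of that consistency check. The config is inlined as a Python dict that mirrors otel-config.yml, and `find_undefined_components` is a hypothetical helper name, not part of any OTel tooling.

```python
# Hypothetical helper: checks that every component named in a pipeline
# is declared in the matching top-level section of the Collector config.
CONFIG = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}},
    "exporters": {"otlp/jaeger": {}, "prometheus": {}, "debug": {}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["otlp/jaeger"]},
            "metrics": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["prometheus"]},
            "logs": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["debug"]},
        }
    },
}


def find_undefined_components(config):
    """Return (pipeline, section, name) for each reference with no declaration."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            declared = config.get(section, {})
            for name in pipeline.get(section, []):
                if name not in declared:
                    problems.append((pipeline_name, section, name))
    return problems


if __name__ == "__main__":
    print(find_undefined_components(CONFIG))
```

An empty list means every pipeline entry resolves to a declared component. The same check is worth repeating whenever a processor or exporter is added to a pipeline.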
IV. Sample application (Python)
# app/main.py
import random
import time

from flask import Flask, jsonify
from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize the tracer; the service name is read from OTEL_SERVICE_NAME
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("demo-app")

# Initialize the meter; metrics are exported every 5 seconds
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=5000,
)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter("demo-app")

# Custom metrics
request_counter = meter.create_counter(
    name="http_requests_total",
    description="Total HTTP requests",
    unit="1",
)
request_duration = meter.create_histogram(
    name="http_request_duration_seconds",
    description="Request duration",
    unit="s",
)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # server-side spans for every route
RequestsInstrumentor().instrument()      # client-side spans for outgoing requests


@app.route("/")
def index():
    with tracer.start_as_current_span("handle_index") as span:
        span.set_attribute("user.id", "demo-user")
        # Simulate business logic
        _simulate_work()
        request_counter.add(1, {"method": "GET", "path": "/"})
        return jsonify({"message": "Hello OpenTelemetry!"})


@app.route("/slow")
def slow_endpoint():
    with tracer.start_as_current_span("handle_slow") as span:
        delay = random.uniform(0.5, 2.0)
        span.set_attribute("delay.seconds", delay)
        time.sleep(delay)
        request_counter.add(1, {"method": "GET", "path": "/slow"})
        request_duration.record(delay, {"path": "/slow"})
        return jsonify({"delay": delay})


@app.route("/error")
def error_endpoint():
    with tracer.start_as_current_span("handle_error") as span:
        try:
            raise ValueError("Something went wrong!")
        except ValueError as e:
            # Attach the exception to the span and mark the span as failed
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            request_counter.add(1, {"method": "GET", "path": "/error"})
            raise  # Flask turns the unhandled exception into an HTTP 500


def _simulate_work():
    # Child span: nests under whichever span is current in the caller
    with tracer.start_as_current_span("simulate_work"):
        time.sleep(random.uniform(0.01, 0.1))


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
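One detail worth checking in main.py is the `request_duration` histogram, which records values in seconds. Assuming the SDK falls back to the default explicit-bucket boundaries from the OpenTelemetry metrics specification (listed below), every `/slow` delay between 0.5 and 2.0 lands in the same bucket, so percentiles computed from it would be coarse. The sketch below reproduces the bucketing with the standard library only; `SECONDS_BOUNDS` is a hypothetical alternative chosen for illustration, and `bucket_label` and `distribution` are illustrative helper names.

```python
from bisect import bisect_left
from collections import Counter

# Default explicit-bucket boundaries from the OpenTelemetry metrics SDK specification
DEFAULT_BOUNDS = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]
# Hypothetical boundaries better suited to values measured in seconds
SECONDS_BOUNDS = [0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 5.0]


def bucket_label(value, bounds):
    """Return the upper-inclusive bucket that a value falls into."""
    i = bisect_left(bounds, value)
    lower = "-inf" if i == 0 else bounds[i - 1]
    upper = "+inf" if i == len(bounds) else bounds[i]
    return f"({lower}, {upper}]"


def distribution(values, bounds):
    """Count how many values fall into each bucket."""
    return Counter(bucket_label(v, bounds) for v in values)


delays = [0.5, 0.8, 1.2, 1.7, 2.0]  # sample /slow delays in seconds

if __name__ == "__main__":
    print("default bounds:", dict(distribution(delays, DEFAULT_BOUNDS)))
    print("seconds bounds:", dict(distribution(delays, SECONDS_BOUNDS)))
```

If finer resolution matters, two options are recording the duration in milliseconds or overriding the boundaries through an SDK View; either would be a change to the demo rather than something the original code does.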
V. Application dependencies and Dockerfile
# app/requirements.txt
flask==3.0.3
requests==2.32.3   # needed by opentelemetry-instrumentation-requests at import time
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-exporter-otlp-proto-grpc==1.25.0
opentelemetry-instrumentation-flask==0.46b0
opentelemetry-instrumentation-requests==0.46b0

# app/Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8080
CMD ["python", "main.py"]
VI. Prometheus and Grafana configuration
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]

# grafana/provisioning/datasources/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    url: http://jaeger:16686
VII. Start and verify
# Start all services (Compose v2 syntax; with the legacy v1 binary use docker-compose)
docker compose up -d
# Follow the Collector logs to confirm it started
docker compose logs -f otel-collector
# Generate some requests
curl http://localhost:8080/
curl http://localhost:8080/slow
curl http://localhost:8080/error   # returns HTTP 500 by design
# Open each component's UI
# Grafana:    http://localhost:3000 (admin/admin123)
# Jaeger:     http://localhost:16686
# Prometheus: http://localhost:9090
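The manual checks can also be scripted. The following is a rough sketch using only the Python standard library: it hits the three demo endpoints, asks Jaeger which services it has seen, and queries Prometheus for the request counter. Two things here are assumptions rather than facts from the article: the metric name `demo_http_requests_total` (the Collector's `demo` namespace prefixed to `http_requests_total`) and the `/api/services` path, which is the internal endpoint the Jaeger UI itself calls rather than a documented stable API.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

APP = "http://localhost:8080"
PROMETHEUS = "http://localhost:9090"
JAEGER = "http://localhost:16686"
# Assumed metric name: the Collector's "demo" namespace + http_requests_total
METRIC = "demo_http_requests_total"


def prometheus_query_url(base, promql):
    """Build a URL for Prometheus's instant-query HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def jaeger_services_url(base):
    """Build the URL the Jaeger UI uses to list known services (internal API)."""
    return f"{base}/api/services"


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def hit(path):
    """Call the demo app; /error answers with HTTP 500, which is expected."""
    try:
        with urllib.request.urlopen(APP + path, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code


def main():
    for path in ("/", "/slow", "/error"):
        print(path, "->", hit(path))
    services = fetch_json(jaeger_services_url(JAEGER)).get("data") or []
    print("Jaeger services:", services)
    result = fetch_json(prometheus_query_url(PROMETHEUS, METRIC))
    print("Prometheus series:", len(result["data"]["result"]))


if __name__ == "__main__":
    try:
        main()
    except (OSError, ValueError) as exc:  # connection refused, timeout, bad JSON
        print("Stack not reachable:", exc)
```

Metrics will not show up immediately: the app exports every 5 seconds and Prometheus scrapes every 15 seconds, so an empty Prometheus result on the first run usually just means it is worth waiting and trying again.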
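VIII. Practical tips
1. Instrument an existing service with minimal changes
With auto-instrumentation, an already configured service needs only one import and one call:
from opentelemetry.instrumentation.flask import FlaskInstrumentor
# Existing application code stays unchanged
FlaskInstrumentor().instrument_app(app)  # add this line
Python libraries with auto-instrumentation support include Flask, Django, FastAPI, requests, SQLAlchemy, Redis, and gRPC, among others.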
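Conceptually, auto-instrumentation wraps a library's entry points so that each call is timed and recorded without touching the handler body. The sketch below imitates that idea with the standard library only. `instrument`, `RECORDED_SPANS`, and the record layout are illustrative assumptions; this is not how `FlaskInstrumentor` is actually implemented.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for an exporter; real spans go to the Collector


def instrument(func):
    """Wrap func so every call appends a span-like record, even on failure."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "OK"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "ERROR"
            raise
        finally:
            RECORDED_SPANS.append({
                "name": func.__name__,
                "status": status,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


def index():  # existing handler, body unchanged
    return {"message": "Hello OpenTelemetry!"}


index = instrument(index)  # the only added line

if __name__ == "__main__":
    index()
    print(RECORDED_SPANS[0]["name"], RECORDED_SPANS[0]["status"])
```

The point of the design is that the handler never learns it is being observed, which is why instrumenting an existing service can be done without editing its business logic.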
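2. Spans link up automatically
Within a single process, a new span inherits the current Context and becomes a child of the active span, so nothing has to be passed by hand. Across processes, the trace context travels in HTTP headers:
# Injects the W3C traceparent header into outgoing calls made with the requests library
RequestsInstrumentor().instrument()  # handled automatically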
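To make the propagation concrete, here is a small stdlib-only sketch of the `traceparent` header from the W3C Trace Context format (version `00`): four hyphen-separated fields holding the version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags whose lowest bit means "sampled". The function names are illustrative, and starting a new sampled trace when the incoming header is missing or malformed is a simplifying assumption of this sketch.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags (lowercase hex)
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def make_traceparent(trace_id, span_id, sampled=True):
    """Format the header value an instrumented client would send."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_headers(incoming_value):
    """Continue an incoming trace: same trace ID, fresh span ID for this hop."""
    parsed = parse_traceparent(incoming_value)
    if parsed is None:
        # Assumption of this sketch: start a new, sampled trace
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    return {"traceparent": make_traceparent(trace_id, secrets.token_hex(8), sampled)}


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_headers(incoming))
```

Because every hop keeps the trace ID and only replaces the span ID, the backend can stitch spans from different services into one trace.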
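3. Sampling strategy (essential in production)
# Add to otel-config.yml
processors:
  tail_sampling:
    decision_wait: 10s   # how long to buffer a trace before deciding
    policies:            # a trace is kept if ANY policy matches
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}   # keep 10% of everything else
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]
Key point: errors and slow requests are kept at 100% while ordinary requests are sampled at 10%, which saves storage without losing the traces that matter. The 10% baseline is written as a tail_sampling policy rather than a separate probabilistic_sampler processor, because a sampler placed earlier in the pipeline would drop 90% of error and slow traces before tail_sampling ever saw them.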
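The decision rule behind "keep all errors and slow requests, sample the rest at 10%" can be sketched in a few lines of standard-library Python. This is only an illustration of the logic: the function names are hypothetical, and hashing the trace ID with SHA-256 is an assumption made here to get a deterministic bucket, not the algorithm the Collector uses.

```python
import hashlib

SLOW_THRESHOLD_MS = 1000
BASELINE_PERCENT = 10


def in_baseline_sample(trace_id, percent=BASELINE_PERCENT):
    """Map the trace ID to a stable bucket in [0, 100) and keep the lowest ones."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def keep_trace(trace_id, has_error, duration_ms):
    """Keep a finished trace if ANY rule matches: error, slow, or baseline."""
    if has_error:
        return True
    if duration_ms >= SLOW_THRESHOLD_MS:
        return True
    return in_baseline_sample(trace_id)


if __name__ == "__main__":
    ids = [format(i, "032x") for i in range(10000)]
    kept = sum(keep_trace(t, False, 20) for t in ids)
    print(f"kept {kept} of {len(ids)} ordinary traces")
```

Deriving the baseline decision from the trace ID, instead of calling a random number generator, means the same trace always gets the same answer, which matters when several components have to agree on whether to keep it.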
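Summary
The biggest value of OpenTelemetry is standardization: you do not write separate integration code for each monitoring backend. Switching backends means changing only the Collector configuration, with no change to application code.
| Component | Role | Default port |
| --- | --- | --- |
| OTel Collector | Collects and forwards telemetry | 4317 (gRPC) / 4318 (HTTP) |
| Prometheus | Metrics storage and querying | 9090 |
| Jaeger | Trace storage and querying | 16686 (UI) |
| Grafana | Unified dashboards | 3000 |
The whole stack is open source, with no vendor lock-in. From monolith to microservices and from development to production, this observability stack should serve for a long time.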