Server, Client and Internal Middlewares with OpenTelemetry in Golang

Peter Gillich
Published in Dev Genius
Oct 23, 2023 · 10 min read


Middleware is a modern design pattern for isolating reusable code from business logic. The most common middleware in Go is server middleware, which is supported by web frameworks. Client and internal (goroutine) middlewares are also possible, but with less support. The typical (but not only) use of middlewares is observability, which is (or will be) covered by the vendor- and tool-agnostic OpenTelemetry. This article presents example implementations for using the OpenTelemetry SDK in several middlewares.

OpenTelemetry concept

The example code for the middlewares can be found at https://github.com/pgillich/opentracing-example/tree/middleware .

OpenTelemetry SDK

The scope of OpenTelemetry is traces, metrics and logs. OpenTelemetry is a collection of vendor- and tool-agnostic specifications, APIs, SDKs, and tools. Today (October 2023) only traces and metrics are supported for Go:

https://opentelemetry.io/docs/instrumentation/go/ (2023 October)

Logging is also discussed in this article, but the current implementation uses the log/slog package.

See more info about the usage of the Go OTel SDK here: https://opentelemetry.io/docs/instrumentation/go/manual/ .

Traces

OpenTelemetry trace info should be stored in the request-scoped context.Context. This context must be passed along the whole call chain.
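For illustration, a minimal sketch of this propagation (the tracer name and the processOrder/fetchItems functions are made up for the example):

import (
    "context"

    "go.opentelemetry.io/otel"
)

// processOrder is a hypothetical business function: it starts a child span
// and passes the same request-scoped context to every callee.
func processOrder(ctx context.Context, orderID string) error {
    ctx, span := otel.Tracer("example").Start(ctx, "processOrder")
    defer span.End()

    return fetchItems(ctx, orderID) // the same ctx travels down the call chain
}

// fetchItems starts its own child span from the received context.
func fetchItems(ctx context.Context, orderID string) error {
    _, span := otel.Tracer("example").Start(ctx, "fetchItems")
    defer span.End()

    return nil
}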

Usage of OTel SDK traces is described in my earlier article: https://faun.pub/multi-hop-tracing-with-opentelemetry-in-golang-792df5feb37c . The current example is an updated version of it (in a different branch).

The Span exporter in the example code is Jaeger (which is deprecated, see: OpenTelemetry-Go Jaeger Exporter), but a vendor-agnostic OTLP example is described in my earlier article: https://blog.devgenius.io/trace-log-correlation-with-grafana-tempo-311d499ce63a .

The Trace State has some limitations, which must be taken into account during planning, for example (see the sketch after the list):

  • Control and special characters are not allowed, so escaping (or replacing them with a valid character) is needed.
  • The maximum length of a State value is 256 characters. Truncation should be done at UTF-8 and escaping boundaries.
  • The maximum number of fields is 10.
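A minimal sketch of preparing a Trace State value under these constraints, using replacement instead of escaping (the sanitizeStateValue function is illustrative, not part of the OTel SDK):

import "strings"

// sanitizeStateValue replaces characters that are not allowed in a Trace
// State value and truncates the result to the 256-character limit.
// Because replacement (instead of escaping) produces plain ASCII, the
// truncation cannot cut a multi-byte UTF-8 sequence or an escape sequence.
func sanitizeStateValue(value string) string {
    const maxLen = 256

    sanitized := strings.Map(func(r rune) rune {
        // Only printable ASCII is allowed; ',' and '=' are delimiters.
        if r < 0x20 || r > 0x7e || r == ',' || r == '=' {
            return '_'
        }
        return r
    }, value)

    if len(sanitized) > maxLen {
        sanitized = sanitized[:maxLen]
    }

    return sanitized
}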

Metrics

The OpenTelemetry metrics SDK is an adapter that expects a metric provider. In the example code it’s Prometheus, which is described at https://pkg.go.dev/go.opentelemetry.io/otel/exporters/prometheus . The two specifications differ, which must be taken into account during planning, for example:

  • The metric name is suffixed with the specified unit, which is automatically mapped to Prometheus unit names, for example s to seconds.
  • A counter name is suffixed with _total.
  • The OTel scope name and version are added as Prometheus labels (otel_scope_name, otel_scope_version).

Example of the generated Prometheus counter name from example_job_duration, when the unit is s: example_job_duration_seconds_total .

Example for setting up the Prometheus metric provider can be found at https://github.com/pgillich/opentracing-example/blob/middleware/internal/middleware/metrics.go . The relevant code part is in the GetMeter function:

exporter, err := prometheus.New()
if err != nil {
    log.Error("unable to instantiate prometheus exporter", logger.KeyError, err)
    panic(err)
}
provider := metric.NewMeterProvider(metric.WithReader(exporter))
meter = provider.Meter(
    "github.com/pgillich/opentracing-example/internal/middleware",
    metric_api.WithInstrumentationVersion("0.1"),
)
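Using the meter above, the naming rules can be observed directly; a minimal sketch (the metric name and attribute are only illustrative, metric_api is go.opentelemetry.io/otel/metric and attribute is go.opentelemetry.io/otel/attribute):

// The unit "s" is mapped to the Prometheus unit name and the counter
// suffix is appended, so the exposed name becomes
// example_job_duration_seconds_total.
counter, err := meter.Float64Counter("example_job_duration",
    metric_api.WithUnit("s"),
    metric_api.WithDescription("Example job duration"),
)
if err != nil {
    panic(err)
}
// In real code the request-scoped ctx should be used instead of Background.
counter.Add(context.Background(), 0.002,
    metric_api.WithAttributes(attribute.String("job_type", "example")),
)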

The instruments (the roots of the metrics) should be initialized only once, so a registry should be created for them, where the key is the metric name:

// InstrumentReg stores the already registered instruments
//
//nolint:structcheck // generics
type InstrumentReg[T any, O any] struct {
    instruments   map[string]T
    mu            sync.Mutex
    newInstrument func(name string, options ...O) (T, error)
}

// GetInstrument registers a new instrument,
// otherwise returns the already created.
func (r *InstrumentReg[T, O]) GetInstrument(
    name string, options ...O,
) (T, error) {
    (...)
}

func Int64CounterGetInstrument(
    name string, options ...metric_api.Int64CounterOption,
) (metric_api.Int64Counter, error) {
    return regInt64Counter.GetInstrument(name, options...)
}

func Float64CounterGetInstrument(
    name string, options ...metric_api.Float64CounterOption,
) (metric_api.Float64Counter, error) {
    return regFloat64Counter.GetInstrument(name, options...)
}
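A possible implementation of GetInstrument, as a sketch (the repository code may differ in details such as error handling):

// GetInstrument returns the instrument registered under name, creating and
// storing it on first use. The mutex makes it safe for concurrent use.
func (r *InstrumentReg[T, O]) GetInstrument(
    name string, options ...O,
) (T, error) {
    r.mu.Lock()
    defer r.mu.Unlock()

    if instrument, has := r.instruments[name]; has {
        return instrument, nil
    }

    instrument, err := r.newInstrument(name, options...)
    if err != nil {
        return instrument, err
    }
    r.instruments[name] = instrument

    return instrument, nil
}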

Example metrics can be found at https://github.com/pgillich/opentracing-example/blob/middleware/doc/example_metrics.txt .

The cardinality rule is important to keep Prometheus performant. Be careful not to add dynamic values to labels, for example a URL query string or a path containing an ID.
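For example, with Chi the matched route pattern can be used as the label value instead of the raw path; a minimal sketch (the pathLabel helper is made up, and it assumes the request was routed by a Chi router):

import (
    "net/http"

    "github.com/go-chi/chi/v5"
)

// pathLabel returns a low-cardinality label value: the registered route
// pattern (for example "/users/{id}") instead of the raw path ("/users/42").
func pathLabel(r *http.Request) string {
    if rctx := chi.RouteContext(r.Context()); rctx != nil {
        if pattern := rctx.RoutePattern(); pattern != "" {
            return pattern
        }
    }
    return r.URL.Path
}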

Correlation

Using the same or similar metric attributes (Prometheus labels) is important to correlate related metrics about the same event. Let’s see the metrics of the same request/response pair from the two nodes (client and server):

# Client node: frontend
http_out_total{
error="",
host="127.0.0.1:45237",method="GET",
otel_scope_name="github.com/pgillich/opentracing-example/internal/middleware",
otel_scope_version="0.1",
path="/ping",service="frontend",status="200",target_service="backend",
url="http://127.0.0.1:45237/ping?x=a"
} 1

# Server node: backend
http_in_total{
host="127.0.0.1:45237",method="GET",
otel_scope_name="github.com/pgillich/opentracing-example/internal/middleware",
otel_scope_version="0.1",
path="/ping",path_pattern="/ping",service="backend",
status="200",url="/ping?x=a"
} 1

There are a few labels that are missing from one of the metrics:

  • error can be seen only on the client side (http.Client.Transport supports it, which is useful for network troubleshooting).
  • target_service can be seen only on the client side, which is useful to know the logical name of the target service (service on the server side), because the host name is confusing in Kubernetes.
  • path_pattern can be seen only on the server side, which is useful if the path contains identifiers.

There are a few labels that have different values:

  • url does not contain the scheme and the host name on the server side, because of the Go http.Server implementation.

There are a few labels that are parts of the whole request URL: host, path, path_pattern . These additional labels are redundant, but they make metric grouping easier and do not worsen cardinality. The service and target_service labels have a similar relationship.

Logs

The OTel SDK does not support logs for Go (yet), see: https://github.com/open-telemetry/opentelemetry-go/#project-status (in October 2023: under development).

The example code uses the log/slog package, introduced in Go 1.21.

The otel.SetLogger function, which is used for internal logging, expects a logr.Logger . Since log/slog is a new package, the adapter for go-logr/logr has not been finished yet, see: https://github.com/go-logr/logr/pull/196 .


Example logs can be found at https://github.com/pgillich/opentracing-example/blob/middleware/doc/example_logs.log and https://github.com/pgillich/opentracing-example/blob/middleware/doc/example_logs.json .

Correlation

The slog package concept is similar to zerolog: the key-value pairs of a record are collected into an attribute list that is appended to a string, so they cannot be overwritten by a subsequent Logger.With call (unlike sirupsen/logrus, which keeps the key-value pairs in a map). The list of key-value pairs of a log line therefore looks like a history of key-value settings, for example (see spanID 4 times):

time=2023-10-23T12:09:23.942+02:00 level=INFO msg=OUT_REQ 
logger=frontend service=frontend instance=frontend inMethod=GET inUrl=/proxy
traceID=8e190b15e1baf020fb2951427632a31f spanID=d5abfcfd7534307a
spanID=67f60e44e2f062e3 spanID=23652919e579d86a job_type=example
job_name="GET http://127.0.0.1:45237/ping?x=a"
job_id="GET http://127.0.0.1:45237/ping?x=a #0" outMethod=GET
outUrl="http://127.0.0.1:45237/ping?x=a" spanID=8c97dfce7c214d7c

Since the spanID appears several times in one log line, the latest occurrence is the relevant one, which can be a challenge for log processors. If the output is JSON, the situation is similar: spanID appears several times, which can be problematic for JSON parsers, and only the latest occurrence may be used:

{
"time":"2023-10-24T22:30:01.103088404+02:00","level":"INFO","msg":"OUT_REQ",
"logger":"frontend","service":"frontend","instance":"frontend",
"inMethod":"GET","inUrl":"/proxy","traceID":"442f364bdfcc8af6b0ae2482ade9dbff",
"spanID":"e0547bd9232afa25","spanID":"e73c8042b01e2edb",
"spanID":"76d9ff649bf2e6b3","job_type":"example",
"job_name":"GET http://127.0.0.1:34131/ping?x=a",
"job_id":"GET http://127.0.0.1:34131/ping?x=a #0",
"outMethod":"GET","outUrl":"http://127.0.0.1:34131/ping?x=a",
"spanID":"cf389c7b710f376c"
}
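The behavior can be reproduced with the standard library alone; a minimal sketch:

import (
    "log/slog"
    "os"
)

func main() {
    log := slog.New(slog.NewTextHandler(os.Stdout, nil))
    // Each With call appends to the handler's preformatted attributes,
    // so the same key appears multiple times in the output.
    log = log.With("spanID", "d5abfcfd7534307a")
    log = log.With("spanID", "67f60e44e2f062e3")
    log.Info("OUT_REQ", "spanID", "8c97dfce7c214d7c")
    // Output (time omitted):
    // level=INFO msg=OUT_REQ spanID=d5abfcfd7534307a spanID=67f60e44e2f062e3 spanID=8c97dfce7c214d7c
}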

A possible solution for this issue is using a sirupsen/logrus handler for log/slog, for example: https://github.com/samber/slog-logrus .

Another key should hold the parent spanID, to correlate the log lines, for example: spanParentID.

Middlewares

New spans are created/started by the three middlewares; see an example with the Span IDs of the log item above:

Spans created by middlewares

Where:

  • The incoming Span ID is: d5abfcfd7534307a
  • The server middleware starts a new span: 67f60e44e2f062e3
  • The internal middleware starts a new span: 23652919e579d86a
  • The client middleware starts a new span, which is sent to the next hop: 8c97dfce7c214d7c

The new Span IDs are logged by the middlewares, in order to correlate the log messages to traces.

The server, internal and client middlewares are glued together indirectly by the request-scoped context.Context. This context must be passed along the whole call chain. For example, the server middleware adds a key-value attribute pair to the context logger, which is used by the outgoing client middleware to log the incoming request info (incoming HTTP request URL, Trace ID).
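A minimal sketch of such an attribute addition (the withSpanLogger helper is made up; the repository uses its own logger utilities):

import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// withSpanLogger reads the current span from the request-scoped context and
// returns a logger carrying its IDs, so later middlewares and the business
// code automatically log correlated trace and span IDs.
func withSpanLogger(ctx context.Context, log *slog.Logger) *slog.Logger {
    spanCtx := trace.SpanContextFromContext(ctx)

    return log.With(
        "traceID", spanCtx.TraceID().String(),
        "spanID", spanCtx.SpanID().String(),
    )
}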

Trace span, metric and log handling are done by middlewares, so only instrumenting and propagating the request-scoped context.Context is needed to use OpenTelemetry and other features, instead of spreading these concerns around the whole source code.

Server middlewares

This kind of middleware is supported natively by web frameworks. The example code uses Chi, which is simple and does not have any 3rd-party dependencies. Example code to register Chi middlewares:

r := chi.NewRouter()
r.Use(chi_middleware.Recoverer)
r.Use(mw_server.ChiLoggerBaseMiddleware(s.log))
r.Use(mw_server.ChiTracerMiddleware(tr, s.config.Instance))
r.Use(mw_server.ChiLoggerMiddleware(slog.LevelInfo, slog.LevelInfo))
r.Use(mw_server.ChiMetricMiddleware(middleware.GetMeter(s.log),
    "http_in", "HTTP in response", map[string]string{
        "service": "frontend",
    }, s.log,
))

The mw_server middlewares are own-developed and can be found at https://github.com/pgillich/opentracing-example/tree/middleware/internal/middleware/server . The ChiMetricMiddleware expects the metric name and additional metric attributes (Prometheus labels).
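The shape of such a middleware is the standard net/http one; a minimal hand-rolled sketch (not the repository implementation) that logs every request with its duration:

import (
    "log/slog"
    "net/http"
    "time"
)

// SimpleLoggerMiddleware returns a Chi-compatible middleware that logs
// every request after the wrapped handler has finished.
func SimpleLoggerMiddleware(log *slog.Logger) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            started := time.Now()
            next.ServeHTTP(w, r)
            log.Info("IN_HTTP",
                "inMethod", r.Method,
                "inUrl", r.URL.String(),
                "inDuration", time.Since(started).Seconds(),
            )
        })
    }
}

Registering it works the same way as above, for example r.Use(SimpleLoggerMiddleware(s.log)).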

Client middlewares

http.Client.Transport supports client middlewares, and the OTel SDK uses this possibility. It can be extended with the Chain of Responsibility design pattern, for example:

req, err := http.NewRequestWithContext(
    ctx, http.MethodGet, beURL, http.NoBody,
)
if err != nil {
    return "", errors.Wrap(err, "unable to send request")
}

httpClient := &http.Client{Transport: otelhttp.NewTransport(
    mw_client.NewMetricTransport(
        mw_client.NewLogTransport(
            http.DefaultTransport,
            slog.LevelInfo,
            slog.LevelInfo,
        ),
        middleware.GetMeter(log),
        "http_out", "HTTP out response", map[string]string{
            "service":        "frontend",
            "target_service": "backend",
        },
        middleware.FirstErr,
    ),
    otelhttp.WithPropagators(otel.GetTextMapPropagator()),
    otelhttp.WithSpanOptions(trace.WithAttributes(
        attribute.String(tracing.SpanKeyComponent, tracing.SpanKeyComponentValue),
    )),
)}

resp, err := httpClient.Do(req)

The call chain for each outgoing HTTP request is:

  • otelhttp.Transport
  • mw_client.MetricTransport
  • mw_client.LogTransport
  • http.DefaultTransport (http.Transport)

The mw_client.MetricTransport can be configured by the metric name, additional metric attributes (Prometheus labels) and error formatter.

The mw_client.LogTransport can be configured by the log levels of request and response log messages.
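Each element of this chain is an http.RoundTripper that wraps the next one; a minimal sketch of the pattern (not the repository’s LogTransport implementation):

import (
    "log/slog"
    "net/http"
    "time"
)

// logTransport wraps another RoundTripper and logs every request/response pair.
type logTransport struct {
    next http.RoundTripper
    log  *slog.Logger
}

func (t *logTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    started := time.Now()
    resp, err := t.next.RoundTrip(req)

    attrs := []any{
        "outMethod", req.Method,
        "outUrl", req.URL.String(),
        "outDuration", time.Since(started).Seconds(),
    }
    if err != nil {
        t.log.Error("OUT_RESP", append(attrs, "error", err)...)
        return resp, err
    }
    t.log.Info("OUT_RESP", append(attrs, "outStatusCode", resp.StatusCode)...)

    return resp, err
}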

The client middleware source code can be found at https://github.com/pgillich/opentracing-example/tree/middleware/internal/middleware/client .

Internal middlewares

Using internal middlewares is rare, but powerful if background processes or parallel outgoing requests are started by an incoming request (or a trigger). Several middlewares can be used to decorate goroutines, instead of implementing these features for each goroutine:

  • Capturing and handling Go panics
  • Rate limiting (for example: max number of outgoing parallel requests)
  • New Span to identify parallel jobs
  • Identify parallel jobs in logs
  • Metrics (number of starts, duration)

There is no idiomatic way to implement this kind of middleware, so a tiny framework should be implemented, where the middlewares can be registered and called.

The function to be decorated by the internal middlewares should have the following signature:

type InternalMiddlewareFn func(ctx context.Context) (interface{}, error)

If the decorated function needs more parameters, they can be added with a function generator or a closure.

Each internal middleware should have the following signature:

type InternalMiddleware func(next InternalMiddlewareFn) InternalMiddlewareFn

If the middleware needs more parameters, they can be added with a function generator, for example:
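A hypothetical Timeout middleware generated this way (a sketch that only relies on the two types above):

import (
    "context"
    "time"
)

// Timeout returns an InternalMiddleware that cancels the decorated
// function's context after the given duration. The timeout value is
// captured by the generator, keeping the middleware signature unchanged.
func Timeout(timeout time.Duration) InternalMiddleware {
    return func(next InternalMiddlewareFn) InternalMiddlewareFn {
        return func(ctx context.Context) (interface{}, error) {
            ctx, cancel := context.WithTimeout(ctx, timeout)
            defer cancel()

            return next(ctx)
        }
    }
}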

Finally, a function is needed to build the chain and call the decorated function:

func InternalMiddlewareChain(mws ...InternalMiddleware) InternalMiddleware {
    return func(next InternalMiddlewareFn) InternalMiddlewareFn {
        fn := next
        for mw := len(mws) - 1; mw >= 0; mw-- {
            fn = mws[mw](fn)
        }

        return fn
    }
}

That’s all that is needed for an internal middleware framework.

The implemented internal middlewares can be found at https://github.com/pgillich/opentracing-example/tree/middleware/internal/middleware/inner .

Example usage of the internal middlewares to decorate a goroutine in a simple way:

sem := semaphore.NewWeighted(s.config.MaxReq)
type bodyRespT struct {
    Body string
    Err  error
}
bodyRespCh := make(chan bodyRespT)
for be, beURL := range inBodies {
    go func(be int, beURL string) {
        (...)

        retVal, bodyResp.Err = mw_inner.InternalMiddlewareChain(
            mw_inner.TryCatch(),
            mw_inner.SemAcquire(sem),
            mw_inner.Span(tr, jobID),
            mw_inner.Logger(map[string]string{
                "job_type": "example",
                "job_name": jobName,
                "job_id":   jobID,
            }, slog.LevelInfo, slog.LevelDebug),
            mw_inner.Metrics(ctx, meter, "example_job", "Example job",
                map[string]string{
                    "job_type": "example",
                    "job_name": jobName,
                }, middleware.FirstErr),
            mw_inner.TryCatch(),
        )(func(ctx context.Context) (interface{}, error) {
            return s.sendToBackend(ctx, beURL)
        })(ctx)

        (...)
        bodyRespCh <- bodyResp
    }(be, beURL)
}

Where:

  • A for loop creates a goroutine for each item.
  • The decorated function is s.sendToBackend(ctx, beURL) , which is encapsulated by an anonymous function.
  • The mw_inner.InternalMiddlewareChain receives the list of middlewares and finally calls the decorated function.
  • The last mw_inner.TryCatch captures a Go panic of the decorated function and propagates it as a simple error, which can be handled by the log, metric and Span middlewares. The first mw_inner.TryCatch captures Go panics of the middlewares themselves.
  • The mw_inner.SemAcquire implements simple rate limiting using the x/sync/semaphore package. It should be registered before any observability middleware (in order to report the real start and duration).
  • The mw_inner.Span creates/starts a new Span for each goroutine.
  • The mw_inner.Logger logs the beginning and end of the goroutine. Log levels and additional log key-value attributes can be configured.
  • The mw_inner.Metrics records metrics for the attempted calls and their duration. Additional metric attributes (Prometheus labels) and an error formatter can be configured.

The compact code above is enough to decorate any goroutine in the source code without changing it. Of course, the request-scoped context.Context must be propagated.

Correlation

Some attributes are the same for the metric and log middlewares, which helps correlate logs and metrics; an example about the same event:

example_job_total{
error="",
job_name="GET http://127.0.0.1:45237/ping?x=a",job_type="example",
otel_scope_name="github.com/pgillich/opentracing-example/internal/middleware",
otel_scope_version="0.1"
} 1

time=2023-10-23T12:09:23.941+02:00 level=INFO msg=GO_BEGIN
logger=frontend service=frontend instance=frontend
inMethod=GET inUrl=/proxy traceID=8e190b15e1baf020fb2951427632a31f
spanID=d5abfcfd7534307a spanID=67f60e44e2f062e3 spanID=23652919e579d86a
job_type=example job_name="GET http://127.0.0.1:45237/ping?x=a"
job_id="GET http://127.0.0.1:45237/ping?x=a #0"

The job_type and job_name are the same, but the metric-name-related info (example_job_total) is missing from the log.

Correlation

Tracing, logging and metering information is good, but not enough, because these kinds of information should be correlated.

The trace-to-log correlation is explained in my earlier articles (see the References section below).

The challenges of metric-to-metric and log-to-log correlation were explained above.

An example of the metric-to-log correlation challenge (the same case as the metric-to-metric correlation example above):

# Client node: frontend, outgoing request/response metric
http_out_total{
error="",
host="127.0.0.1:45237",method="GET",
otel_scope_name="github.com/pgillich/opentracing-example/internal/middleware",
otel_scope_version="0.1",
path="/ping",service="frontend",status="200",target_service="backend",
url="http://127.0.0.1:45237/ping?x=a"
} 1

# Client node: frontend, outgoing request/response duration metric
http_out_duration_seconds_total{
error="",
host="127.0.0.1:45237",method="GET",
otel_scope_name="github.com/pgillich/opentracing-example/internal/middleware",
otel_scope_version="0.1",
path="/ping",service="frontend",status="200",target_service="backend",
url="http://127.0.0.1:45237/ping?x=a"
} 0.002012822

# Client node: frontend, outgoing request log
time=2023-10-23T12:09:23.942+02:00 level=INFO msg=OUT_REQ
logger=frontend service=frontend instance=frontend
inMethod=GET inUrl=/proxy
traceID=8e190b15e1baf020fb2951427632a31f spanID=d5abfcfd7534307a
spanID=67f60e44e2f062e3 spanID=23652919e579d86a
job_type=example job_name="GET http://127.0.0.1:45237/ping?x=a"
job_id="GET http://127.0.0.1:45237/ping?x=a #0"
outMethod=GET outUrl="http://127.0.0.1:45237/ping?x=a"
spanID=8c97dfce7c214d7c

# Client node: frontend, outgoing response log
time=2023-10-23T12:09:23.943+02:00 level=INFO msg=OUT_RESP
logger=frontend service=frontend instance=frontend
inMethod=GET inUrl=/proxy
traceID=8e190b15e1baf020fb2951427632a31f spanID=d5abfcfd7534307a
spanID=67f60e44e2f062e3 spanID=23652919e579d86a
job_type=example job_name="GET http://127.0.0.1:45237/ping?x=a"
job_id="GET http://127.0.0.1:45237/ping?x=a #0"
outMethod=GET outUrl="http://127.0.0.1:45237/ping?x=a"
spanID=8c97dfce7c214d7c
outStatusCode=200 outReqContentLength=0 outRespContentLength=21
outDuration=0.001

# Server node: backend, incoming request/response metric
http_in_total{
host="127.0.0.1:45237",method="GET",
otel_scope_name="github.com/pgillich/opentracing-example/internal/middleware",
otel_scope_version="0.1",
path="/ping",path_pattern="/ping",service="backend",
status="200",url="/ping?x=a"
} 1

# Server node: backend, incoming request/response duration metric
http_in_duration_seconds_total{
host="127.0.0.1:45237",method="GET",
otel_scope_name="github.com/pgillich/opentracing-example/internal/middleware",
otel_scope_version="0.1",path="/ping",path_pattern="/ping",service="backend",
status="200",url="/ping?x=a"
} 5.9e-06

# Server node: backend, incoming request log
time=2023-10-23T12:09:23.943+02:00 level=INFO msg=IN_HTTP
logger=backend service=backend instance=backend-1
inMethod=GET inUrl="/ping?x=a"
traceID=8e190b15e1baf020fb2951427632a31f spanID=8c97dfce7c214d7c
spanID=4ff8ccda4ba4ef96

# Server node: backend, incoming response log
time=2023-10-23T12:09:23.943+02:00 level=INFO msg=IN_HTTP_RESP
logger=backend service=backend instance=backend-1
inMethod=GET inUrl="/ping?x=a"
traceID=8e190b15e1baf020fb2951427632a31f spanID=8c97dfce7c214d7c
spanID=4ff8ccda4ba4ef96 inPathPattern=/ping
inStatusCode=200 inReqContentLength=0 inRespContentLength=""
inDuration=0.000

The keys for correlation can be found in the metric labels and log attributes. The keys are different, so it is hard to match the same information, for example:

  • http_out → OUT_REQ, OUT_RESP
  • http_in → IN_HTTP, IN_HTTP_RESP
  • method → outMethod, inMethod
  • url → outUrl, inUrl
  • status → outStatusCode, inStatusCode
  • duration → outDuration, inDuration

It is easier to match the same information if the keys are the same, for example job_type, job_name and job_id for the goroutine.

Summary

Using middlewares is a powerful way to introduce new features into an existing codebase, and not only for observability (for example: capturing Go panics, rate limiting).

The OpenTelemetry SDK is a good option for vendor- and tool-agnostic observability features.

References

Own articles

Own source code
