# OpenTelemetry Bridge

This document includes design / developer / maintenance notes for the
Node.js APM Agent *OpenTelemetry Bridge*.

Spec: https://github.com/elastic/apm/blob/main/specs/agents/tracing-api-otel.md

## Maintenance

- We should release a new agent version with an updated "@opentelemetry/api"
  dependency relatively soon after any new *minor* release. Otherwise a user
  upgrading their "@opentelemetry/api" dep to "1.x+1", e.g. "1.2.0", will find
  that the OTel Bridge which uses version "1.x", e.g. "1.1.0" or lower, does
  not work.

  The reason is that the OTel Bridge registers global providers (e.g.
  `otel.trace.setGlobalTracerProvider`) with its version of the OTel API. When
  user code attempts to *get* a tracer with **its version** of the OTel API, the
  [OTel API compatibility logic](https://github.com/open-telemetry/opentelemetry-js-api/blob/v1.1.0/src/internal/semver.ts#L24-L33)
  decides that using a v1.1.x Tracer with a v1.2.0 Tracer API is not compatible
  and falls back to a noop implementation.


## Development / Debugging

When doing development on, or debugging the OTel Bridge, it might be helpful to
enable logging of (almost) every `@opentelemetry/api` call into the bridge.
This is done by setting this in `lib/opentelemetry-bridge/setup.js`.

    const LOG_OTEL_API_CALLS = true

It looks like this:

```
% cd test/opentelemetry-bridge/fixtures
% ELASTIC_APM_OPENTELEMETRY_BRIDGE_ENABLED=true node -r ../../../start.js start-span.js
otelapi: OTelTracerProvider.getTracer(...)
otelapi: OTelContextManager.active()
otelapi: OTelTracer.startSpan(name=mySpan, options={}, context=OTelBridgeRunContext<>)
otelapi: OTelContextManager.active()
otelapi: OTelBridgeRunContext.getValue(Symbol(OpenTelemetry Context Key SPAN))
otelapi: OTelSpan<Transaction<52260136515317aa, "mySpan">>.end(endTime=undefined)
```

Together with the agent's usual debug logging, this can help show how the bridge
is working.

```
% ELASTIC_APM_OPENTELEMETRY_BRIDGE_ENABLED=true \
    ELASTIC_APM_LOG_LEVEL=debug \
    node -r ../../../start.js start-span.js | ecslog
...
```


## Naming

In general, the following variable/class/file naming is used:

- A class that implements an OTel interface is prefixed with "OTel". For
  example `class OTelSpan` implements OTel `interface Span`.
- A class that bridges between an OTel interface and an object in the APM
  agent is prefixed with `OTelBridge`. For example `OTelBridgeRunContext`
  bridges between an OTel `interface Context` and the APM agent's `RunContext`,
  i.e. it implements both interfaces/APIs.
- A variable that holds an OpenTelemetry object is prefixed with `otel`, or
  `...OTel...` if it in the middle of the var name. Some examples:
  - `otelSpanOptions` holds an OTel `SpanOptions` instance
  - `parentOTelSpanContext`
  - `epochMsFromOTelTimeInput()` converts from an OTel `TimeInput` to a number
    of milliseconds since the Unix epoch


## Design Overview

The OpenTelemetry API is, currently, [these four interfaces](https://github.com/open-telemetry/opentelemetry-js-api/tree/main/src/api/):

- `otel.context.*` - API for managing Context, i.e. what the APM agent calls
  "run context". More below.
- `otel.trace.*` - API for manipulating spans, and getting a `Tracer` to
  create spans. More below.
- `otel.diag.*` - This is used to hook into internal OpenTelemetry diagnostics,
  i.e. internal logging. There is very little `otel.diag` usage in
  `@opentelemetry/api`, more in the SDK. The APM agent hooks up `otel.diag`
  logging to its own logger **if `logLevel=trace`**.
- `otel.propagation.*` - Used for abstracting trace-context propagation
  (reading/writing "traceparent" et al headers) and Baggage handling. This
  isn't touched by the OTel Bridge, and shouldn't be necessary until either
  the bridge supports Baggage or TextMapPropagator implementations like
  `W3CTraceContextPropagator`. The APM agent implements its own internally.

In `Agent#start()`, if the `opentelemetryBridgeEnabled` config is true, then
a global [`ContextManager`](./OTelContextManager.js) and a global [`TracerProvider`](./OTelTracerProvider.js) are registered, which "enables" the bridge.

From the OTel Bridge spec:

> In order to avoid potentially complex and tedious synchronization issues
> between OTel and our existing agent implementations, the bridge implementation
> SHOULD provide an abstraction to have a single "active context" storage.

For this bridge, the agent's `RunContext` class was extended to support the
small [`interface Context`](https://github.com/open-telemetry/opentelemetry-js-api/blob/v1.1.0/src/context/types.ts#L17-L41)
API and the agent's run context managers were updated to allow passing in a
subclass of `RunContext` to use. So the "single active context storage" is
instances of [`OTelBridgeRunContext`](./OTelBridgeRunContext.js) in the agent's
usual run context managers.

The way the "active span" is tracked by the OTel API is to call
`context.setValue(SPAN_KEY, span)`. The `OTelBridgeRunContext` class translates
calls using `SPAN_KEY` into the API that the agent's RunContext class uses.
Roughly this:

- `context.setValue(SPAN_KEY, span)` -> `this.enterSpan(span)`
- `context.getValue(SPAN_KEY)` -> `return new OTelSpan(this.currSpan())`

Otherwise the `*RunContextManager` classes in the agent map very well to the
OpenTelemetry `ContextManager` interface: the [`OTelContextManager`](./OTelContextManager.js)
implementation is very straightforward.

The `@opentelemetry/api` supports two ways to create objects that are internally
implemented and do not call the registered global providers.

1. `otel.trace.wrapSpanContext(...)` supports creating a `NonRecordingSpan` (a
   class that isn't exported) instance that implements `interface Span`. [This
   test fixture](../../test/opentelemetry-bridge/fixtures/nonrecordingspan-parent.js)
   shows a use case. The bridge wraps this in an `OTelBridgeNonRecordingSpan`
   that implements both OTel `interface Span` and the agent's Transaction API.
2. `otel.ROOT_CONTEXT` is a singleton object (an internal `BaseContext` class
   instance) that implements `interface Context` but is not created via any
   bridge API. That means bridge code cannot rely on a given `context` argument
   being an instance of its `OTelBridgeRunContext` class.
   [This test fixtures](../../test/opentelemetry-bridge/fixtures/using-root-context.js)
   shows an example.

The trickiest part of the bridge is handling these two cases, especially at
the top of `startSpan` in [`OTelTracer`](./OTelTracer.js)


## Limitations / Differences with OpenTelemetry SDK

- The OpenTelemetry SDK defines [SpanLimits](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#span-limits).
  This OpenTelemetry Bridge differs as follows:
  - Attribute count is not limited. The OTel SDK defaults to a limit of 128.
    (To implement this, start at `maybeSetOTelAttr` in "OTelSpan.js".)
  - Attribute value strings are truncated at 1024 bytes. The OpenTelemetry SDK
    uses `AttributeValueLengthLimit (Default=Infinity)`.
    (We could consider using the configurable `longFieldMaxLength` for the
    attribute value truncation limit, if there is a need.)
  - Span events are not currently supported by this bridge.
  - Span link *attributes* are not supported by the bridge (Elastic APM
    supports span links, but not span link attributes).

- The OpenTelemetry Bridge spec says APM agents
  ["MAY"](https://github.com/elastic/apm/blob/main/specs/agents/tracing-api-otel.md#attributes-mapping)
  report OTel span attributes as spad and transaction *labels* if the upstream
  APM Server is less than version 7.16. This implementation opts *not* to do
  that. The OTel spec allows a larger range of types for span attributes values
  than is allowed for "tags" (aka labels) in the APM Server intake API, so some
  further filtering of attributes would be required.

- There is a semantic difference between this OTel Bridge and the OpenTelemetry
  SDK with `span.end()` that could impact parent/child relationships of spans.
  This demonstrates the different:

    ```js
    const otel = require('@opentelemetry/api')
    const tracer = otel.trace.getTracer()
    tracer.startActiveSpan('s1', s1 => {
      tracer.startActiveSpan('s2', s2 => {
        s2.end()
      })
      s1.end()
      tracer.startActiveSpan('s3', s3 => {
        s3.end()
      })
    })
    ```

  With the OTel SDK that will yield:

    ```
    span s1
    `- span s2
    `- span s3
    ```

  With the Elastic APM agent:

    ```
    transaction s1
    `- span s2
    transaction s3
    ```

  In current Elastic APM semantics, when a span is ended (e.g. `s1` above) it is
  *no longer the current/active span in that async context*. This is historical
  and allows a stack of current spans in sync code, e.g.:

    ```js
    const t1 = apm.startTransaction('t1')
    const s2 = apm.startSpan('s2')
    const s3 = apm.startSpan('s3') // s3 is a child of s2
    s3.end() // s3 is no longer active (popped off the stack)
    const s4 = apm.startSpan('s4') // s4 is a child of s2
    s4.end()
    s2.end()
    t1.end()
    ```

  This semantic difference is not expected to be common, because it is expected
  that typically OTel API user code will end a span only at the end of its
  function:

    ```js
    tracer.startActiveSpan('mySpan', mySpan => {
      // ...
      mySpan.end() // .end() only at end of function block
    })
    ```

  Note that active span context *is* properly maintained when a new async task
  is created (e.g. with `setTimeout`, etc.), so the following code produces
  the expected trace:

    ```js
    tracer.startActiveSpan('s1', s1 => {
    setImmediate(() => {
      tracer.startActiveSpan('s2', s2 => {
        s2.end()
      })
      setTimeout(() => {  // s1 is bound as the active span in this async task.
        tracer.startActiveSpan('s3', s3 => {
          s3.end()
        })
      }, 100)
      s1.end()
    })
    ```

  If this *does* turn out to be a common issue, the OTel semantics for span.end()
  can likely be accommodated.



