A Practitioner's Guide to Wide Events

Adopting Wide Event-style instrumentation has been one of the highest-leverage changes I’ve made in my engineering career. The feedback loop on all my changes tightened and debugging systems became so much easier. Systems that were scary to work on suddenly seemed a lot more manageable.

Lately there have been a lot of good blog posts on what “Wide Events” mean and why they are important. Here are some of my recent favorites:

The tl;dr is that for each unit-of-work in your system (usually, but not always an HTTP request / response) you emit one “event” with all of the information you can collect about that work. “Event” is an over-loaded term in telemetry so replace that with “log line” or “span” if you like. They are all effectively the same thing.

While this idea has gained new momentum lately, it is not a new idea. Brandur Leach wrote about “Canonical Log Lines” both on his own blog in 2016 and as used by Stripe in 2019. And AWS has recommended it as a best-practice for ages.

Okay… I think I get the idea… but how do I do “wide events”?

This is where I find a lot of developers get tripped up. The idea sounds good in theory, and we should totally try that one day! But I have this stack of features to ship, that bug that’s been keeping me up at night, and like 30 new AI tools that came out yesterday to learn about. And like… where do you even start? What data should I add?

Like anything in software, there are a lot of options for how to approach this, but I’ll talk through one approach that has worked for me.

We’ll cover how to approach this in tooling and code, an extensive list of attributes to add, and I’ll respond to some frequent objections that come up when discussing this approach.

For this post we’ll focus on web services, but you would apply a similar approach to any workload.

Choose your tools

We will need some way to instrument your code (traces or structured log lines) and somewhere to send the instrumentation to in order to query and visualize it.

This approach is best paired with a tool that lets you query your data in quick iterations. I like Honeycomb for this, but any Observability tool backed by a modern OLAP database is likely going to work in a pinch.

Honeycomb and DataDog built their own columnar OLAP data stores, though now with the availability of ClickHouse, InfluxDB IOx, Apache Pinot, and DuckDB there are new Observability tools popping up all the time.

If you aren’t constrained, I highly recommend defaulting to using OpenTelemetry and Honeycomb. Your life will be easier.

However even if you are stuck in a corporate environment with a strong allergy to technology built after 2010 you can leverage log search tools like ElasticSearch in a pinch. Stripe’s blog post goes over how to use Splunk for this.

In any tool you want to focus on getting proficient at 3 core techniques in order to sift through your events. The faster you are able to apply these, iterate, and ask questions of your data, the better you’ll be able to debug issues and see what your system is really doing. When observability folks refer to “slicing and dicing” data, this is what they are generally referring to. I’ll represent queries using a made-up SQL dialect, but you should be able to find equivalents in your tool’s query language.

Visualizing

Existing in a human body comes with its fair share of downsides, but the human visual cortex is really, really good at recognizing patterns. Give it a fighting chance by getting really good at summoning visualizations of the data your system is emitting. COUNT, COUNT_DISTINCT, HEATMAP, P90, MAX, MIN, Histogram. Learn to leverage whatever graphs your tool makes available to you. Practice it. Get fast.

A Honeycomb screenshot of heatmap

A Splunk screenshot of histogram

Grouping

With each new annotation that we add to our wide events, we create another dimension along which we can slice our data. GROUP BY allows us to look along that dimension and see if the values along that dimension match our expectations.

GROUP BY instance.id
GROUP BY client.OS, client.version

Filtering

Once we’ve narrowed in one dimension that is interesting, we usually want to dig further into that data. Filtering down so that we’re only looking at data from one endpoint, or from one IP address, or sent by the iOS app, or only from users with a specific feature flag turned on allows us to narrow our focus to a very specific segment of traffic.

WHERE http.route = "/user/account"
WHERE http.route != "/health"
WHERE http.user_agent_header contains "Android"

Write a middleware to help you

If you are using an OpenTelemetry SDK it is already creating a wrapping span around the request and response. You can access it by asking for the active span at any point during the processing of the request.

let span = opentelemetry.trace.getActiveSpan();
span.setAttributes({
  "user_agent.original": c.req.header("User-Agent"),
});

However if anyone wraps any of your code in a child span the “active span” will change to be that new wrapping span! There is no first-class way of addressing this original “main” span in OpenTelemetry. However, we can work around this by saving a reference to this specific span in the context so we can always have access to the “main” wrapping span.

// create a reference to store the span on the opentelemetry context object
const MAIN_SPAN_CONTEXT_KEY = createContextKey("main_span_context_key");

function mainSpanMiddleware(req, res, next) {
  // pull the active span created by the http instrumentation
  let span = trace.getActiveSpan();

  // get the current context
  let ctx = context.active();

  // set any attributes we always want on the main span
  span.setAttribute("main", true);

  // OpenTelemetry context is immutable, so to modify it we create
  // a new version with our span added
  let newCtx = ctx.setValue(MAIN_SPAN_CONTEXT_KEY, span);

  // set that new context as active for the duration of the request
  context.with(newCtx, () => {
    next();
  });
}

// create another function that allows you to annotate this saved span easily
function setMainSpanAttributes(attributes) {
  let mainSpan = context.active().getValue(MAIN_SPAN_CONTEXT_KEY);
  if (mainSpan) {
    mainSpan.setAttributes(attributes);
  }
}

Now our annotation code can look a little simpler, and we can always know that we’re setting these attributes on the wrapping span.

setMainSpanAttributes({
  "user.id": "123",
  "user.type": "enterprise",
  "user.auth_method": "oauth",
});

You can play around with a minimal running example here.

At Heroku we had internal OpenTelemetry Distributions that set this up for you automatically and added as many automatic annotations as possible to these spans.

If you are not using OpenTelemetry here’s a gist that might help you get started. My previous post may help you put this logic together.

What do I add to this “main” span?

We need to add attributes about the request, and there are likely far more of these than you would expect. It’s easy to come up with a dozen or so, but in a well-instrumented code base there will be hundreds of attributes.

Note that while this is a long list, it is definitely not exhaustive. OpenTelemetry defines sets of attribute names as Semantic Conventions that can also be used for inspiration. I have tried to follow these in my naming where possible.

A convention to filter out everything else

Traces contain lots of spans, so it’s helpful to have a convention for identifying and searching for these “wide events”. root and canon were floated as options, but I’ve landed on calling them main spans.

AttributeExamplesDescription
maintruePresent only for spans designated as a “wide event”, usually wrapping a request / response, or a background job

This convention allows you to quickly figure out “what does the traffic to this service look like?” with a single query:

SELECT
  COUNT(*)
WHERE
  main = true
GROUP BY http.route

Graph of traffic grouped by route over a week. There is an anomally.

Service metadata

Of course we need to add some information about the service we’re running. Consider adding additional metadata about which team owns the system, or which Slack channel the owning team hangs out in, though note that this can be tedious to update if your workplace experiences frequent re-orgs. Tying these to a service catalog like Backstage is left as an exercise to the reader.

AttributeExamplesDescription
service.nameapi
shoppingcart
What is the name of this service?
service.environmentproduction
staging
development
Where is this service running?
service.teamweb-services
dev-ex
Which team owns this service. Useful for knowing who to page in during incidents.
service.slack_channelweb-services
dev-ex
If I discover an issue with this service, where should I reach out?

How many services does each team run?

SELECT
  COUNT_DISTINCT(service.name)
WHERE
  service.environment = "production"
GROUP BY service.team

Ever look at the load on a system and then wonder “Is that appropriate for the machine this is running on?”, and now you have to look through other tools or config files to get that information. Throw that context on the wide event so that it’s available when you need it.

AttributeExamplesDescription
instance.id656993bd-40e1-4c76-baff-0e50e158c6ebAn ID that maps to this one instance of the service
instance.memory_mb12336How much RAM is available to this service?
instance.cpu_count4
8
196
How many cores are available to this service?
instance.typem6i.xlargeDoes your vendor have a name for this type of instance?

What are the services with the most memory that we run? What instance types do they use?

SELECT
  service.name,
  instance.memory_mb,
  instance.type
ORDER BY instance.memory_mb DESC
GROUP BY service.name, instance.type
LIMIT 10

However you’re orchestrating your systems make sure that all of the relevant information is added. I’ve included some examples from the Kubernetes semantic conventions for inspiration.

AttributeExamplesDescription
container.ida3bf90e006b2An ID used to identify Docker containers
container.namenginx-proxy
wordpress-app
Container name used by container runtime
k8s.cluster.nameapi-clusterName of the kubernetes cluster your service is running in
k8s.pod.namenginx-2723453542-065rxName of the kubernetes pod your service is running in
cloud.availability_zoneus-east-1cAZ where you’re running your service
cloud.regionus-east-1Region where you’re running your service

But even if you’re using a Platform-as-a-Service you can still pull out a lot of useful information!

AttributeExamplesDescription
heroku.dynoweb.1
worker.3
The env var DYNO that is set on your app at runtime
heroku.dyno_typeweb
worker
The first part of the DYNO env var before the .. Separating this makes it easier to query
heroku.dyno_index1
3
The second part of the DYNO env var after the .. Separating this makes it easier to query
heroku.dyno_sizeperformance-mThe selected dyno size
heroku.spacemy-private-spaceThe name of the private space that your are deployed into
heroku.regionvirginia
oregon
Which region is this app located in?

How many dynos are we running? What dyno types are they? For which services?

SELECT
  COUNT_DISTINCT(heroku.dyno_index)
GROUP BY service.name, heroku.dyno_type, instance.type

Build info

Inevitably some of the first questions asked in any incident are “Did something just go out?” or “What changed?“. Instead of jumping to your deployment tool or looking through GitHub repositories, add that data to your telemetry.

Threading this data from your build system through to your production system so that it’s available at runtime can be a non-trivial amount of glue code, but having this information easily available during incidents is invaluable.

AttributeExamplesDescription
service.versionv123
9731945429d3d083eb78666c565c61bcef39a48f
However you track your version, ex: a version string or a hash of the built image
service.build.idacd8bb57-fb9f-4b2d-a750-4315e99dac64If your build system gives you an ID, this context allows you to audit the build if something goes wrong
service.build.git_hash6f6466b0e693470729b669f3745358df29f97e8dThe git SHA of the deployed commit so you can know exactly which code was running
service.build.pull_request_urlhttps://github.com/your-company/api-service/pull/121The url of the pull request that was merged that triggered the deploy
service.build.diff_urlhttps://github.com/your-company/api-service/compare/c9d9380..05e5736A url that compares the previously deployed commit against the newly deployed commit
service.build.deployment.at2024-10-14T19:47:38ZTimestamp when the deployment process started
service.build.deployment.user[email protected]Which authenticated user kicked off the build? Could be a bot
service.build.deployment.triggermerge-to-main
slack-bot
api-request
config-change
What triggered the deploy? Extremely valuable context during an deploy-triggered incident
service.build.deployment.age_minutes1
10230
How old is this deploy? Shortcuts the frequent incident question “Did something just go out?”

Won’t this be a lot of repetitive data? These values do not change except between deploys! See Frequent Objections

What systems have recently been deployed?

SELECT
  service.name,
  MIN(service.build.deployment.age_minutes) as age
WHERE
  service.build.deployment.age_minutes < 20
GROUP BY service.name
ORDER BY age ASC
LIMIT 10

What’s up with the spike of 500s when we did the last deploy?

SELECT
  COUNT(*)
WHERE
  service.name = "api-service" AND
  main = true
GROUP BY http.status_code, service.version

Graph showing requests grouped by http status code and version. There is a spike of 500s correlating to v1 shutting down.

HTTP

You should get most of these from your tracing library instrumentation, but there are usually more you can add if, for example, your organization uses non-standard headers. Don’t settle for only what OpenTelemetry gives you by default!

AttributeExamplesDescription
server.addressexample.com
localhost
Name of the HTTP server that received the request
url.path/checkout
/account/123/features
URI path after the domain
url.schemehttp, httpsURI scheme
url.queryq=test, ref=####URI query component
http.request.id79104EXAMPLEB723Platform request id: ex: x-request-id, x-amz-request-id
http.request.methodGET
PUT
POST
OPTIONS
HTTP request method
http.request.body_size3495Size of the request payload body in bytes
http.request.header.content-typeapplication/jsonValue of a specific request header, “content-type” in this case, but there are many more. Pick out any that are important for your service
http.response.status_code200
404
500
HTTP response status code
http.response.body_size200
404
500
Size of the response payload body in bytes
http.request.header.content-typetext/htmlValue of a specific response header, “content-type” in this case, but there are many more. Pick out any that are important for your service
SELECT
  HEATMAP(http.response.body_size),
WHERE
  main = true AND
  service.name = "api-service"

A heatmap of response sizes. Most are within a fixed band, but there are sharp outliers that warrant more investigation.

User-Agent headers contain a wealth of info. Don’t rely on regex queries to try and make sense of them down the road. Parse them into structured data from the beginning.

AttributeExamplesDescription
user_agent.originalMozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.3The value of the HTTP User-Agent header
user_agent.devicecomputer
tablet
phone
Device type derived from the User-Agent header
user_agent.OSWindows
MacOS
OS derived from the User-Agent header
user_agent.browserChrome
Safari
Firefox
Browser derived from the User-Agent header
user_agent.browser_version129
18.0
Browser version derived from the User-Agent header

What browsers are my users using?

SELECT
  COUNT(*)
GROUP BY user_agent.browser, user_agent.browser_version

If you have any custom user agents or headers used as a convention within your org parse that out too.

AttributeExamplesDescription
user_agent.serviceapi-gateway
auth-service
If you have a distributed architecture, have each service send a custom User-Agent header with its name and version
user_agent.service_versionv123
6f6466b0e693470729b669f3745358df29f97e8d
If you have a distributed architecture, have each service send a custom User-Agent header with its name and version
user_agent.appiOS
android
If a request is coming from a mobile app, make sure it includes which app and its version
user_agent.app_versionv123
6f6466b0e693470729b669f3745358df29f97e8d
If a request is coming from a mobile app, make sure it includes which app and its version

Route info

We’re not done with HTTP attributes yet! One of the most important bits is the API endpoint that the request matched. OpenTelemetry SDKs will usually give this to you automagically but not always. Consider extracting the route parameters and query parameters as additional attributes.

AttributeExamplesDescription
http.route/team/{team_id}/user/{user_id}The route pattern that the url path is matched against
http.route.param.team_id14739
team-name-slug
The extracted segment of the url path as it is parsed for each parameter
http.route.query.sort_dirascThe query parameters that are relevant to the response of your service. Ex: ?sort_dir=asc&...
SELECT
  P99(duration_ms)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY http.route

A chart of p99&#x27;s broken down by route. There is a spike on only some of them. We should break down by version now to check if this was caused by a deploy.

User and customer info

Once you get the basics down, this is the most important piece of metadata that you can add. No automagic SDK will be able to encode the particulars of your user model.

It’s common for a single user or account to be responsible for a 10%+ of a business’ revenue, and frequently their usage patterns look significantly different than the average user. They probably have more users, store more data, and hit limits and edge-cases that will never show up for the user paying $10 / month. Be sure you can separate their traffic from others.

AttributeExamplesDescription
user.id2147483647
[email protected]
The primary ID for a user. If this is an email and you’re using a vendor, consider your org’s policy on putting PII in external services.
user.typefree
premium
enterprise
vip
How does the business see this type of user? Individual accounts are sometimes responsible for 10%+ of a business’ income. Make sure you can separate their traffic from others!
user.auth_methodtoken
basic-auth
jwt
sso-github
How did this user authenticate into your system?
user.team.id5387
web-services
If you have a team construct, which one does this user belong to?
user.org.id278
enterprise-name
If this user is part of an organization with an enterprise contract, track that!
user.age_days0
637
Not the user’s literal age, but how long ago was this account created? Is this an issue experienced by someone new to your app, or only once they’ve saved a lot of data?
user.assumedtrueHave an internal way of assuming a user’s identity for debugging? Be sure to track this
user.assumed_by[email protected]And track which actual user is assuming the user’s identity
SELECT
  P99(duration_ms)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY user.type

Rate limits

Whatever your rate limiting strategy, make sure the current rate limit info gets added too. Can you quickly find examples of users that are being rate-limited by your service?

AttributeExamplesDescription
ratelimit.limit200000You might not now, but you will likely have users with different rate limits in the future, note down what the actual limit is for this request
ratelimit.remaining130000What is the budget remaining for this user?
ratelimit.used70000How many requests have been used in the current rate window
ratelimit.reset_at2024-10-14T19:47:38ZWhen will the rate limit be reset next? if applicable

This user has a support ticket open about being rate-limited. Let’s see what they were doing

SELECT
  COUNT(*)
WHERE
  main = true AND
  service.name = "api-service" AND
  user.id = 5838
GROUP BY http.route

A graph of one users activity. There is a big spike hitting the same route a lot at the end. This gives us a starting point for investigation

What routes are users who have burned most of their rate limit hitting? Does this activity look suspicious?

SELECT
  COUNT(*)
WHERE
  main = true AND
  service.name = "api-service" AND
  ratelimit.remaining < 100
GROUP BY http.route

Caching

For every code path where we could shortcut with a cache response, add whether or not it was successful

AttributeExamplesDescription
cache.session_infotrue
false
Was the session info cached or did it need to be re-fetched?
cache.feature_flagstrue
false
Were the feature flags cached for this user or did they need to be re-fetched?

Localization info

What localization options has the user chosen? This can be a frequent source of bugs

AttributeExamplesDescription
localization.language_dirrtl, ltrWhich direction is text laid out in their language?
localization.countrymexico, ukWhich country are they from?
localization.currencyUSD, CADWhich currency have they chosen to work with?

Uptime

Tracking how long the service has been running when it serves a request can help you visualize several classes of bugs:

  • Issues that show up on a reboot
  • Memory leaks that only start to show up when the service has been running for a long time
  • Frequent crashes / restarts if you have automatically restart the service on failure

I recommend also either adding the log10 of the uptime or having some way of visualizing this. When graphed this emphasizes the important first few minutes of a service without being squished into the bottom of the graph by instances with several days or more of uptime.

AttributeExamplesDescription
uptime_sec1533How long has this instance of your app been running? Can be useful to visualize to see restarts
uptime_sec_log_103.185Grows sub-linearly which allows you to visualize long-running services and brand new ones on the same graph
SELECT
  HEATMAP(uptime_sec),
  HEATMAP(uptime_sec_log_10)
WHERE
  main = true AND
  service.name = "api-service"

Heatmaps of uptime when a service enters a crash loop. It&#x27;s far easier to distinguish in log scale

Metrics

This one might be a bit controversial, but I’ve found it helpful to tag spans with context about what the system was experiencing while processing the request. We fetch this information every ~10 seconds, cache it, and add it to every main span produced during that time.

Capturing metrics in this way is not mathematically sound. Since you only get data when traffic is flowing, you can’t calculate a P90 for cpu load that would stand up to any rigorous scrutiny, but that’s actually fine in practice. It’s close enough to get some quick signal while you’re debugging without switching to a different tool, especially if you can avoid calculations and visualize with a heatmap.

I wouldn’t recommend setting alerts on this data though. Plain ol’ metrics are great for that.

Jessica Kerr recently wrote about this approach on the Honeycomb Blog.

AttributeExamplesDescription
metrics.memory_mb153
2593
How much memory is being used by the system at the time its service this request
metrics.cpu_load0.57
5.89
CPU load of the system service this request. Given as # of active cores
metrics.gc_count5390Last observed number of garbage collections. Could be cumulative (total since service started) or delta (ex: number in the last minute)
metrics.gc_pause_time_ms14
325
Time spent in garbage collections. Could also be cumulative or delta. Pick one and document which
metrics.go_routines_count3
3000
Number of go routines running
metrics.event_loop_latency_ms0
340
Cumulative time spent waiting on the next event loop tick. An important metric for Node apps

Are these requests getting slow because we’re running out of memory or CPU?

SELECT
  HEATMAP(duration_ms),
  HEATMAP(metrics.memory_mb),
  HEATMAP(metrics.cpu_load)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY instance.id

An example showing using the metrics data tagged on the span to get context for whats happening with the system

Async request summaries

When using a tracing system async requests should get their own spans, but it can still be useful to roll up some stats to identify outliers and quickly find interesting traces.

AttributeExamplesDescription
stats.http_requests_count1
140
How many http requests were triggered during the processing of this request?
stats.http_requests_duration_ms849Cumulative time spent in these http requests
stats.postgres_query_count7
742
How many Postgres queries were triggered during the processing of this request?
stats.postgres_query_duration_ms1254Cumulative time spent in these Postgres queries
stats.redis_query_count3
240
How many redis queries were triggered during the processing of this request?
stats.redis_query_duration_ms43Cumulative time spent in these redis queries
stats.twilio_calls_count1
4
How many calls to this vendors api were triggered during the processing of this request?
stats.twilio_calls_duration_ms2153Cumulative time spent in these vendor calls

Surely my service makes a reasonable number of calls to the database… right?

SELECT
  HEATMAP(stats.postgres_query_count)
WHERE
  main = true AND
  service.name = "api-service"

A heatmap of db queries per request. There is a bi-modal distribution but also some outliers that make a lot of requests

Instead of adding this explicitly, couldn’t we aggregate this by querying the whole trace? See Frequent Objections

Sampling

Once you start collecting fine-grained telemetry from your systems at a significant scale you run head-on into the problem of sampling. Running systems can produce a lot of data! Engineers frequently want to store and query all of it. Exact answers always! Make it fast! Also cheap! But it’s trade-offs all the way down. Telemetry data is fundamentally different from the transaction data you’re storing for your users, and you should think about it differently.

Luckily you only really need a statistically significant subset of the full dataset. Even sampling 1 out of every 1000 requests can provide a suprisingly detailed picture of the overall traffic patterns in a system.

Sampling is a suprisingly deep topic. Keep it simple if you’re starting and do uniform random head sampling, but track your sample rate per-span so you can be ready for more sophisticated approaches down-the-line.

Good tooling will weight your calculations with a per-span, so you don’t have to mentally multiple the COUNT call by the sample_rate to get an accurate answer. Here are some relevant articles:

AttributeExamplesDescription
sample_rate1
500
N where 1 in N events will be sampled and stored and the rest dropped. If you’re sampling 1% of requests, the sample_rate would be 100

Timings

I find it super useful to break up the work that gets done to respond to a request into a handful of important chunks and track how long each segment took on the main span.

Wait, isn’t that what child spans are for?

Wrapping absolutely everything in its own span is the most common failure mode I see when engineers first get access to tracing tools. You have to design the structure of your data for the way you want to query it.

Child spans are helpful for waterfall visualization for a single request, but can be difficult to query and visualize across all of your requests. Putting that information on a single span make it easier to query and also helps with tools like Honeycomb’s BubbleUp which can then immediately tell you that that that group of requests was slow because authentication took 10 seconds for some reason.

AttributeExamplesDescription
auth.duration_ms52.2
0.2
How long did we spend performing authentication during this request?
payload_parse.duration_ms22.1
0.1
Identify the core workloads of the service and add timings for them

Errors

If you encounter an error and need to fail the operation, tag the span with the error information: type, stacktrace, etc.

One approach that I have found super-valuable is tagging each location where we throw an error with a unique slug describing the error. If this string is unique within your codebase, it is easily found with a quick search. This allows someone to jump straight from a spike in errors on a dashboard to the exact line of code that throwing the error. It also provides a convenient low-cardinality field to GROUP BY.

You’re unlikely to be able to wrap all possible errors, but any time a failed request doesn’t have an exception.slug that is a good sign that you have places in your code where your error handling could be improved. It’s now really easy to find examples of requests that failed in ways you didn’t anticipate.

if isNotRecoverable(err) {
  // note the use of a plain string, not a variable, not dynamically generated
  // consider enforcing this with custom lint rules
  setErrorAttributes(err, "err-stripe-call-failed-exhausted-retries");
  throw err;
}
AttributeExamplesDescription
errortrue
false
Special field for whether the request failed or not
exception.messageCan't convert 'int' object to str
undefined is not a function
The exception message encoded in the exception
exception.typeIOError
java.net.ConnectException
The programmatic type of the exception
exception.stacktraceReferenceError: user is not defined
at myFunction (/path/to/file.js:12:2)
...
Capture the stack trace if its available to help pin-point where the error is being thrown
exception.expectedtrue, falseIs this an expected exception like a bot trying to hit a url that doesn’t exist? Allows filtering out of exceptions we can’t prevent but don’t need to worry about
exception.slugauth-error
invalid-route
github-api-unavailable
Create a unique grepp-able slug-value to identify the code location of an error if its predictable during development time

Which of our enterprise users hit the most errors last week? And which one?

SELECT
  COUNT_DISTINCT(user.id)
WHERE
  main = true AND
  service.name = "api-service" AND
  user.type = "enterprise"
GROUP BY exception.slug

Show me traces where we likely need to improve our error handling

SELECT
  trace.trace_id
WHERE
  main = true AND
  service.name = "api-service" AND
  error = true AND
  exception.slug = NULL
GROUP BY trace.trace_id

Feature flags

Fine-grained feature flags are a developer super power that allows you to test code changes in production with only a fraction of your users or traffic. Adding the flag information per-request allows you to compare how the new code is working as you opt more of your traffic into the new code path. Coupled with the broad visibility you can get with wide events, and this can make even tricky migrations vastly more manageable and allow you to ship code with confidence.

Note that semantic conventions differ here and suggest adding feature flag information as events on the span. I would suggest following that standard since it will ultimately have the best support from vendors if it’s moved to stable, but especially in the mean time, I’m also putting this info on the main span.

AttributeExamplesDescription
feature_flag.auth_v2true
false
The value of a particular feature flag for this request
feature_flag.double_write_to_new_dbtrue
false
The value of a particular feature flag for this request

What errors are the users in the new authentication flows hitting? How does it compare to the control group?

SELECT
  COUNT
WHERE
  main = true AND
  service.name = "api-service" AND
GROUP BY feature_flag.auth_v2, exception.slug

Versions of important things

Runtimes, frameworks, and any major libraries you are using can be really helpful context.

AttributeExamplesDescription
go.versiongo1.23.2What version of your language runtime are you using?
rails.version7.2.1.1Pick out any core libraries like web frameworks and track their version too
postgres.version16.4If you can add the versions of any datastores you’re using, even better

A security issue with Rails just got announced. What versions of the framework are our services using?

SELECT
  COUNT_DISTINCT(service.name)
WHERE
  service.environment = "production"
GROUP BY rails.version

Our memory usage seems higher that it used to be. Didn’t we upgrade the runtime recently? Does that correlate?

SELECT
  HEATMAP(metrics.memory_mb)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY go.version

Your specific application

Now we go off the map and get to the really valuable stuff. Your app likely does something unique or works in a particular domain. You might need to really care about which professional credentials a Dentist using your app has, or which particular storage warehouse a package is in, or which chip is in the embedded tracking device installed in the cat that your app exists to track.

No framework is going to be able to understand what parts of your domain are important to track and automate this for you, you have to do that.

AttributeExamplesDescription
asset_upload.s3_bucket_paths3://bucket-name/path/to/asset.jpgIf you upload something, add context about where
email_vendor.transaction_id62449c60-b51e-4d5c-8464-49217d91c441If you interact with a vendor, track whatever transaction ID they give you in case you need to follow up with them
vcs_integration.vendorgithub
gitlab
bitbucket
If there are 3-4 types that something might fall into, be sure to add that context. Ex: If 2% of requests start failing because bitbucket is experiencing issues, this will help identify the source of the issue immediately.
process_submission.queue_length153
1
Any time you interact with a queue, see if you can get the current length during submission

Things to note

You should probably add the thing

If you find yourself asking “Am I ever really going to need this bit of data?”, default to throwing the attribute on. The marginal cost of each extra attribute is very small. If the data volume does start to grow, prefer wider, more context-rich events and a higher sample rate vs smaller events with a lower sample rate.

Heatmaps are your friend

Honeycomb’s heatmaps are amazing at helping you find outliers, seeing multi-modal distributions, and getting a feel for your data. I wish more tooling supported them. I am not sure I can build software without them any more.

Embrace the feedback loop

When you are modifying code, make a change to the telemetry so that you can see the impact of the new code running. Once the code is released, check to make sure that you see the outcome you expected. Don’t hesitate to add specific fields for one release and them remove them after.

Tighter feedback loops are like going faster on a bicycle. They make for more stable systems and let you move faster with confidence.

Semantic conventions and naming consistency

I’ve tried to embrace semantic conventions in my naming, but would not be surprised if I’ve made multiple errors. Naming is hard!

It’s also hard to get consistency right within an organization or even across multiple systems owned by the same team. I would recommend trying to use semantic conventions as a guide, but do prioritize getting data out of your system in some form and getting some early wins over exacting adherence to an evolving specification. Once this data has proven its value within your organization, then you will have the leverage to spend engineering cycles on making things consistent.

In the long run semantic conventions should allow Observability vendors to build new value and understanding on top of the telemetry you emit, but this effort is only just getting started.

Frequent Objections

Does this really work??

I have done this for dozens of production systems. Every single time the data has been invaluable for digging in and understanding what the system is actually doing, and we’ve found something surprising, even for the engineers who had worked on the system for many years.

Things like:

  • Oh, actually 90% of the traffic of this system comes from one user
  • Wait, one of our worker processes is actually running a month old version of the code somehow?
  • This API endpoint usually has payloads of 1-2kb, but there is an edge case affecting one user where it’s 40+MB. This causes their page loads to be several minutes longer than the p99.
  • After instrumenting the authentication middleware, around 20% of requests still didn’t have user info. There was a whole second authentication system for a different class of users that hadn’t been touched in years.
  • This endpoint that we’d like to deprecate accepts data in the form of A, B, and C, but none of our traffic ever even uses C. We can just drop support for that now.

I don’t like it. This feels wrong

For anyone feeling that way now, I ask you to give it five minutes.

I find that when a log line wraps around your terminal window multiple times, most developers have a pretty visceral negative reaction.

This feels right:

[2024-09-18 22:48:32.990] Request started http_path=/v1/charges request_id=req_123
[2024-09-18 22:48:32.991] User authenticated auth_type=api_key key_id=mk_123 user_id=usr_123
[2024-09-18 22:48:32.992] Rate limiting ran rate_allowed=true rate_quota=100 rate_remaining=99
[2024-09-18 22:48:32.998] Charge created charge_id=ch_123 permissions_used=account_write request_id=req_123
[2024-09-18 22:48:32.999] Request finished http_status=200 request_id=req_123

But this feels wrong:

[2024-10-20T14:43:36.851Z] duration_ms=1266.1819686777117 main=true http.ip_address=92.21.101.252 instance.id=api-1 instance.memory_mb=12336
instance.cpu_count=4 instance.type=t3.small http.request.method=GET http.request.path=/api/categories/substantia-trado
http.route=/api/categories/:slug http.request.body.size=293364 http.request.header.content_type=application/xml
user_agent.original="Mozilla/5.0 (X11; Linux i686 AppleWebKit/535.1.2 (KHTML, like Gecko) Chrome/39.0.826.0 Safari/535.1.2" user_agent.device=phone
user_agent.os=Windows user_agent.browser=Edge user_agent.browser_version=3.0 url.scheme=https url.host=api-service.com service.name=api-service
service.version=1.0.0 build.id=1234567890 go.version=go1.23.2 rails.version=7.2.1.1 service.environment=production service.team=api-team
service.slack_channel=#api-alerts service.build.deployment.at=2024-10-14T19:47:38Z
service.build.diff_url=https://github.com/your-company/api-service/compare/c9d9380..05e5736
service.build.pull_request_url=https://github.com/your-company/api-service/pull/123
service.build.git_hash=05e5736 service.build.deployment.user=keanu.reeves@your-company.com
service.build.deployment.trigger=manual container.id=1234567890 container.name=api-service-1234567890 cloud.availability_zone=us-east-1
cloud.region=us-east-1 k8s.pod.name=api-service-1234567890 k8s.cluster.name=api-service-cluster feature_flag.auth_v2=true
http.response.status_code=401 user.id=Samanta27@gmail.com user.type=vip user.auth_method=sso-google user.team_id=team-1

You are structuring data so that it can be read efficiently by machines, not humans. Our systems emit too much data to waste precious human lifetimes using our eyeballs to scan lines of text looking for patterns to jump out. Let the robots help.

This seems like a lot of work

If you want to implement everything I’ve talked about in this post that would a ton of work. However, even implementing the easiest subset is going to provide a lot of value. Not doing this results in so much more work building a mental model of your system, trying to debug by thinking through the code and hoping your mental model matches reality.

A lot of this logic can be put into shared libraries within your org, though getting them adopted, keeping them updated and in-sync, and getting engineers used to these tools presents a whole different set of challenges.

Many of these things could be surfaced to you by opinionated platforms or frameworks. I would love to see things move in this direction.

Isn’t this a lot of data? Won’t it cost a lot??

Hacker News comment: This isn&#x27;t an unknown idea outside of Meta, it&#x27;s just really expensive, especially if you&#x27;re using a vendor and not building your own tooling. Prohibitively so, even with sampling.

First, you should compare this to your current log volume per request. I have seen many systems where this approach would reduce overall log volume.

However storing this data for every request against your system could be too expensive at scale. That’s where sampling comes in. Sampling gives you the controls to determine what you want to spend vs the value you receive from storing and making that data available to query.

Realtime OLAP systems are also getting cheaper all the time. Once upon a time Scuba held all data in memory to make these types of questions quick to answer. Now most OLAP systems are evolving to columnar files stored on cloud object storage with queries handled by ephemeral compute which is many orders of magnitude cheaper.

In the next section I’ll show just how much cheaper.

Repeated data

Many of these fields will be the same for every request. Isn’t that really inefficient?

This is where our intuitions can lie to us. Let’s look at a concrete example.

I wrote a script [1] to generate a newline-delimited JSON file with a lot of the above fields and at least somewhat reasonable fake values.

Let’s say our service is serving 1000 req/s all day and sampling 1% of that traffic. Rounding to a whole number, that’s about a million events. Generating a million example wide events results in a 1.6GB file.

http_logs.ndjson     1607.61 MB

But we repeat the keys on every single line. Even just turning it into a CSV cuts the size by more than 50%.

http_logs.csv         674.72 MB

Gzipping the file shows an amazing amount of compression, hinting that this isn’t as much data as we might think.

http_logs.ndjson.gz   101.67 MB

Column store formats like parquet and Duckdb’s native format can do even better.

http_logs.parquet      88.83 MB
http_logs.duckdb       80.01 MB

They store all of the data for a specific column contiguously, which lends itself to different compression approachs. In the simplest case, if the column is always the same value, it can store that fact only once. Values that are the same across an entire row group are incredibly cheap.

DuckDB diagram showing how a constant value along a whole column gets compressed

If there are 2-3 different values, it can use dictionary-encoding to bit-pack these values really tightly. This also speeds up queries against this column.

DuckDB diagram showing how values get compressed using dictionary encoding

DuckDB has a great writeup on this which goes into much more detail. All of the data remains available and is easily (and quickly!) queryable.

This is hardly “big data”. Storing this on Cloudflare’s R2 for a month would cost US$ 0.0012. You could keep 60 days of rentention for US$ 0.072 / month.

❯ duckdb http_logs.duckdb
D SELECT COUNT(*) FROM http_logs;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      1000000 │
└──────────────┘
Run Time (s): real 0.002 user 0.002350 sys 0.000946
D SELECT SUM(duration_ms) FROM http_logs;
┌───────────────────┐
│ sum(duration_ms)  │
│      double       │
├───────────────────┤
│ 999938387.7714149 │
└───────────────────┘
Run Time (s): real 0.003 user 0.008020 sys 0.000415

There are even in-memory and transport formats to help reduce size in memory and on the wire. OpenTelemetry is adopting arrow for its payloads for this reason.

I found this podcast on the FDAP stack particularly helpful in understanding this space.

Couldn’t we JOIN data from multiple spans together to get this information? Query the whole trace at once?

This is certainly possible. Honeycomb has started allowing you to filter on fields on other spans in the same trace. However I’d qualify this as very advanced. You want to make the right thing the easiest thing, and if you make it harder to ask questions, people will simply ask fewer questions. There are already a million things competing for our attention. Keep it simple. Make it fast.

Does this mean I don’t need metrics?

You should probably still generate high-level metrics, though I bet you will need far fewer.

Metrics are great when you know you want an exact answer to a very specific question that you know ahead of time. Questions like “How many requests did a serve yesterday?” or “What was my CPU usage like last month?”


[1] Well… mostly Cursor wrote it