<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Home</title><description>Jeremy Morrell&apos;s homepage</description><link>https://jeremymorrell.dev/</link><item><title>Reading a zip file from a Cloudflare Worker</title><link>https://jeremymorrell.dev/sketches/cloudflare-read-zip-file/</link><guid isPermaLink="true">https://jeremymorrell.dev/sketches/cloudflare-read-zip-file/</guid><description>Experiments in reducing dependencies via LLMs</description><pubDate>Tue, 30 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;tl;dr - Example repo: https://github.com/jmorrell/cloudflare-read-zip-file-example&lt;/p&gt;
&lt;p&gt;I needed to add some import / export functionality to a project, and ran into the problem of
working with zip files on Cloudflare workers.&lt;/p&gt;
&lt;p&gt;The standard library for working with Zip files seems to be &lt;a href=&quot;https://github.com/gildas-lormeau/zip.js&quot;&gt;&lt;code&gt;zip.js&lt;/code&gt;&lt;/a&gt;
but this is quite a large dependency. While that might not matter for a long-running Node app, on Workers your
code must be downloaded and parsed on every cold start, so avoiding big dependencies where possible is
usually worth it.&lt;/p&gt;
&lt;p&gt;For creating the zip files I found &lt;a href=&quot;https://www.npmjs.com/package/littlezipper&quot;&gt;&lt;code&gt;littlezipper&lt;/code&gt;&lt;/a&gt;, which has
worked well; however, I did not find a similarly small alternative for reading zip files.&lt;/p&gt;
&lt;p&gt;However, this is exactly the kind of small, well-defined problem that LLMs are genuinely good at solving:
Zip is a widely used, well-documented file format, JavaScript has a standard set of APIs for stream manipulation, and
both should be well represented in the training data of a modern LLM.&lt;/p&gt;
&lt;p&gt;With some prompting, Claude came up with a &lt;a href=&quot;https://github.com/jmorrell/cloudflare-read-zip-file-example/blob/c7c874065e8d1ec7f6158d2b648120b1ab502f56/src/zip.ts&quot;&gt;&lt;code&gt;ZipReader&lt;/code&gt;&lt;/a&gt; implementation and a &lt;a href=&quot;https://github.com/jmorrell/cloudflare-read-zip-file-example/blob/c7c874065e8d1ec7f6158d2b648120b1ab502f56/test/zip.test.ts&quot;&gt;set of tests&lt;/a&gt; that should satisfy my use-case.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const arrayBuffer = await file.arrayBuffer();
const zipReader = new ZipReader(arrayBuffer);

console.log(`Processing zip file: ${file.name}`);
console.log(`Total entries: ${zipReader.getFiles().length}`);

for (const fileInfo of zipReader.getFiles()) {
  console.log(fileInfo.filename);
}
&lt;/code&gt;&lt;/pre&gt;
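&lt;p&gt;For a flavor of what the parsing involves, here is a minimal illustrative sketch (my own, not the generated &lt;code&gt;ZipReader&lt;/code&gt; code): every zip file ends with an End of Central Directory (EOCD) record, and scanning backwards for its signature recovers the entry count and the location of the central directory:&lt;/p&gt;

```javascript
// Illustrative sketch: locate the End of Central Directory (EOCD) record
// at the tail of a zip file and read its header fields. The record is at
// least 22 bytes and is followed by a variable-length comment, so we scan
// backwards for the 0x06054b50 signature.
function parseEOCD(arrayBuffer) {
  const view = new DataView(arrayBuffer);
  for (let i = arrayBuffer.byteLength - 22; i >= 0; i--) {
    if (view.getUint32(i, true) === 0x06054b50) {
      return {
        entryCount: view.getUint16(i + 10, true),      // total entries
        centralDirSize: view.getUint32(i + 12, true),  // bytes
        centralDirOffset: view.getUint32(i + 16, true) // from start of file
      };
    }
  }
  throw new Error("Not a zip file: EOCD record not found");
}
```

&lt;p&gt;From there, a reader walks the central directory entries to get each file&apos;s name, compression method, and data offset.&lt;/p&gt;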
&lt;p&gt;Both my implementation and &lt;code&gt;littlezipper&lt;/code&gt; are pretty limited in that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They only support the &quot;Deflate&quot; compression method, which seems to be the most common&lt;/li&gt;
&lt;li&gt;Both require holding all of the file contents in memory, which will be a problem for larger files given Workers&apos; 128 MB memory limit&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A better approach might be to leverage R2 (or &lt;a href=&quot;https://blog.cloudflare.com/nodejs-workers-2025/#a-new-virtual-file-system-and-the-node-fs-module&quot;&gt;the new &lt;code&gt;node:fs&lt;/code&gt; support&lt;/a&gt;) for storage and HTTP range requests so that only part of the file is held in memory at any given time, but that will have to wait for another day. &lt;a href=&quot;https://github.com/rinsuki/async-zip-reader/tree/master&quot;&gt;&lt;code&gt;rinsuki/async-zip-reader&lt;/code&gt;&lt;/a&gt; seems to implement this approach, but it brings in its own dependencies, and I have not tried it out.&lt;/p&gt;
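&lt;p&gt;As a sketch of how such a ranged-read reader might start (hypothetical code I have not built): the EOCD record sits in the last 22 bytes plus an up-to-64 KB comment, so the first request only needs the file&apos;s tail:&lt;/p&gt;

```javascript
// Hypothetical first step of a ranged-read zip reader: compute the byte
// range that is guaranteed to contain the EOCD record (22 fixed bytes
// plus an up-to-65535-byte comment), so only the tail of the file needs
// to be fetched rather than the whole archive.
function eocdTailRange(fileSize) {
  const MAX_EOCD_SPAN = 22 + 0xffff;
  const length = Math.min(fileSize, MAX_EOCD_SPAN);
  return { offset: fileSize - length, length };
}
```

&lt;p&gt;The resulting &lt;code&gt;{ offset, length }&lt;/code&gt; maps onto R2&apos;s ranged &lt;code&gt;get&lt;/code&gt; options or an HTTP &lt;code&gt;Range&lt;/code&gt; header; a second ranged read would then fetch the central directory itself.&lt;/p&gt;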
</content:encoded></item><item><title>Resolving &quot;FOREIGN KEY constraint failed&quot; with Cloudflare SQLite</title><link>https://jeremymorrell.dev/sketches/cloudflare-sqlite-foreign-key-constraint/</link><guid isPermaLink="true">https://jeremymorrell.dev/sketches/cloudflare-sqlite-foreign-key-constraint/</guid><description>You likely need &quot;PRAGMA defer_foreign_keys = on&quot;</description><pubDate>Sun, 21 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;tl;dr - You likely need &lt;code&gt;PRAGMA defer_foreign_keys = on;&lt;/code&gt; and &lt;a href=&quot;https://developers.cloudflare.com/d1/sql-api/foreign-keys/&quot;&gt;these docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While writing a SQL migration for a Durable Object I kept hitting this frustrating
error:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FOREIGN KEY constraint failed: SQLITE_CONSTRAINT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I had something similar to the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE A (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT
);

CREATE TABLE B (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  id2 INTEGER,
  book TEXT,
  FOREIGN KEY(id2) REFERENCES A(id)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I wanted to add a &lt;code&gt;CHECK&lt;/code&gt; constraint to a column in table &lt;code&gt;A&lt;/code&gt; which can be done by renaming and copying:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE new_A (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL CHECK (name IN (&apos;image&apos;, &apos;html&apos;, &apos;pdf&apos;))
);

INSERT INTO new_A
SELECT
  id,
  name
FROM A;

DROP TABLE A;

ALTER TABLE new_A RENAME TO A;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the new constraint was already enforced by the app, so no existing rows violate it; copying data into a newly constrained table like this would not be safe in general.&lt;/p&gt;
&lt;p&gt;However, this won&apos;t work: you&apos;ll get the &lt;code&gt;FOREIGN KEY constraint failed&lt;/code&gt; error, because rows in &lt;code&gt;B&lt;/code&gt; still reference the rows being dropped from &lt;code&gt;A&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://sqlite.org/foreignkeys.html#fk_schemacommands&quot;&gt;SQLite docs&lt;/a&gt; will lead you to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PRAGMA foreign_keys = false;

-- Do the migration

PRAGMA foreign_keys = true;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, this will not work either, though, frustratingly, it does work if you dig into the &lt;code&gt;.wrangler&lt;/code&gt; directory, find the SQLite database
file, and apply the migration to it directly.&lt;/p&gt;
&lt;p&gt;The solution is in the &lt;a href=&quot;https://developers.cloudflare.com/d1/sql-api/foreign-keys/&quot;&gt;Cloudflare docs&lt;/a&gt;.
The doc is for D1 but this applies to SQLite in Durable Objects as well.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;D1&apos;s foreign key enforcement is equivalent to SQLite&apos;s PRAGMA foreign_keys = on directive. Because D1 runs every query inside an implicit transaction, user queries cannot change this during a query or migration.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;D1 allows you to call PRAGMA defer_foreign_keys = on or off, which allows you to violate foreign key constraints temporarily (until the end of the current transaction).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We can now fix our migration!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PRAGMA defer_foreign_keys = on;

CREATE TABLE new_A (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL CHECK (name IN (&apos;image&apos;, &apos;html&apos;, &apos;pdf&apos;))
);

INSERT INTO new_A
SELECT
  id,
  name
FROM A;

DROP TABLE A;
ALTER TABLE new_A RENAME TO A;

PRAGMA defer_foreign_keys = off;
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Automatically documenting a Durable Object&apos;s SQLite schema</title><link>https://jeremymorrell.dev/sketches/documenting-durable-object-sqlite-schema/</link><guid isPermaLink="true">https://jeremymorrell.dev/sketches/documenting-durable-object-sqlite-schema/</guid><description>Keeping mutations in an array keeps things simple, but can be difficult to reason about</description><pubDate>Sat, 20 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;One of the first things you run into trying to build with &lt;a href=&quot;https://blog.cloudflare.com/sqlite-in-durable-objects/&quot;&gt;SQLite in Durable Objects&lt;/a&gt;
is handling SQL migrations.&lt;/p&gt;
&lt;p&gt;I&apos;ve been using &lt;a href=&quot;https://orm.drizzle.team/docs/connect-cloudflare-do&quot;&gt;Drizzle&lt;/a&gt; to manage my Durable Object schemas. It has worked well, but it felt
a bit heavier than what I needed for my current project. Cloudflare has recently released the &lt;a href=&quot;https://github.com/cloudflare/actors&quot;&gt;&lt;code&gt;@cloudflare/actors&lt;/code&gt;&lt;/a&gt;
library which has a much simpler approach (originally based on &lt;a href=&quot;https://www.lambrospetrou.com/&quot;&gt;Lambros&apos;&lt;/a&gt; &lt;a href=&quot;https://github.com/lambrospetrou/durable-utils?tab=readme-ov-file#sqlite-schema-migrations&quot;&gt;&lt;code&gt;durable-utils&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;With this library you can define your migrations in-line and run them yourself before you access your SQL tables.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { Storage } from &quot;@cloudflare/actors/storage&quot;;

export class ChatRoom extends DurableObject&amp;lt;Env&amp;gt; {
  storage: Storage;

  constructor(ctx: DurableObjectState, env: Env) {
    super(ctx, env);
    this.storage = new Storage(ctx.storage);
    this.storage.migrations = [
      {
        idMonotonicInc: 1,
        description: &quot;Create users table&quot;,
        sql: &quot;CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY)&quot;,
      },
    ];
  }
  async fetch(request: Request): Promise&amp;lt;Response&amp;gt; {
    // Run migrations before executing SQL query
    await this.storage.runMigrations();

    // Query with SQL template
    let userId = new URL(request.url).searchParams.get(&quot;userId&quot;);
    const query = this.storage.sql`SELECT * FROM users WHERE id = ${userId};`;
    return new Response(`${JSON.stringify(query)}`);
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I appreciate the simplicity of this approach. However, once you have more than a few migrations, it can be difficult to
keep track of what your current schema even &lt;em&gt;is&lt;/em&gt;. I don&apos;t want to replay SQL statements in my head to work out how to
query what I need.&lt;/p&gt;
&lt;p&gt;I worked around this by extracting the migrations to a separate file, which I can import in the Durable Object directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export default [
  {
    idMonotonicInc: 1,
    description: &quot;Create exports table&quot;,
    sql: `
      CREATE TABLE IF NOT EXISTS example (
        id TEXT PRIMARY KEY,
        created_at INTEGER NOT NULL DEFAULT (unixepoch()),
        name TEXT NOT NULL
      );
    `,
  },
  {
    idMonotonicInc: 2,
    description: &quot;Add description column to example table&quot;,
    sql: `ALTER TABLE example ADD COLUMN description TEXT NOT NULL DEFAULT &apos;default&apos;;`,
  },
  {
    idMonotonicInc: 3,
    description: &quot;Add age column to example table&quot;,
    sql: `ALTER TABLE example ADD COLUMN age INTEGER NOT NULL DEFAULT 0;`,
  },
];
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;import { Storage } from &quot;@cloudflare/actors/storage&quot;;
import migrations from &quot;./example-migrations&quot;;

export class Example extends DurableObject {
  storage: Storage;
  env: Env;

  constructor(ctx: DurableObjectState, env: Env) {
    super(ctx, env);
    this.env = env;
    this.storage = new Storage(ctx.storage);
    this.storage.migrations = migrations;
  }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, since the database in the Durable Object is &quot;just&quot; SQLite, and our migrations module exports a plain JavaScript array,
we can write a script that runs the migrations against an empty SQLite database and dumps the resulting schema. Any time we add a new migration,
we can re-run the script and check in the output as documentation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3 from &quot;sqlite3&quot;;
import { open } from &quot;sqlite&quot;;
import fs from &quot;node:fs&quot;;
import sqlFormatter from &quot;@sqltools/formatter&quot;;

async function getSchema(migrationsPath, schemaWritePath) {
  let migrations = await import(migrationsPath).then((m) =&amp;gt; m.default);

  let dbPath = `/tmp/test-${Math.random().toString(36).slice(2)}.db`;

  let db = await open({
    filename: dbPath,
    driver: sqlite3.Database,
  });

  for (const migration of migrations) {
    await db.exec(migration.sql);
  }

  let schema = await db.all(`
    SELECT sql 
    FROM sqlite_master 
    ORDER BY type DESC, name
  `);

  let out = `-- This file was generated by a script
-- DO NOT EDIT THIS FILE MANUALLY
`;

  for (let item of schema) {
    if (item.sql) {
      out += &quot;\n&quot; + sqlFormatter.format(item.sql) + &quot;;\n&quot;;
    }
  }

  await db.close();

  fs.rmSync(dbPath);

  fs.writeFileSync(schemaWritePath, out);
}

// A list of each of the Durable Objects with migrations to document
let DurableObjects = [
  {
    migrations: &quot;path/to/example-migrations.ts&quot;,
    schema: &quot;path/to/example-schema.sql&quot;,
  },
];

for (let durableObject of DurableObjects) {
  await getSchema(durableObject.migrations, durableObject.schema);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now instead of trying to play back all of the migration SQL commands in my head, I have a nice up-to-date schema to reference while I&apos;m
working on my logic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- This file was generated by a script
-- DO NOT EDIT THIS FILE MANUALLY

CREATE TABLE example (
  id TEXT PRIMARY KEY,
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  name TEXT NOT NULL,
  description TEXT NOT NULL DEFAULT &apos;default&apos;,
  age INTEGER NOT NULL DEFAULT 0
);
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Generating Image Placeholders on Cloudflare Workers</title><link>https://jeremymorrell.dev/sketches/lqip-images-on-cloudflare-workers/</link><guid isPermaLink="true">https://jeremymorrell.dev/sketches/lqip-images-on-cloudflare-workers/</guid><description>Using the Cloudflare Images binding to explore a few image placeholder algorithms.</description><pubDate>Thu, 24 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I was recently blown away by &lt;a href=&quot;https://leanrada.com/notes/css-only-lqip/&quot;&gt;Lean Rada&apos;s excellent blog post on CSS-only blurry image placeholders&lt;/a&gt;
(seriously, it&apos;s some very creative work!) and I wanted to try generating image placeholders for
a Cloudflare Workers app that I&apos;m working on.&lt;/p&gt;
&lt;p&gt;If you&apos;re in a hurry, the code can be found in &lt;a href=&quot;https://github.com/jmorrell/low-quality-image-placeholders-on-cloudflare-workers&quot;&gt;this GitHub repo&lt;/a&gt; along with a little demo app.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./cherry-blossoms.jpg&quot; alt=&quot;Example output for compressing a photo of cherry blossoms&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A lot of the existing docs on generating these image placeholders rely on modules like &lt;a href=&quot;https://sharp.pixelplumbing.com/&quot;&gt;Sharp&lt;/a&gt;
which don&apos;t run on Cloudflare Workers. You could also compile a native module into WASM and use that, but that&apos;s
quite a lot of work to sort out compilers and config!&lt;/p&gt;
&lt;p&gt;Luckily you can ditch all of these and just leverage the platform&apos;s &lt;a href=&quot;https://developers.cloudflare.com/images/&quot;&gt;Image bindings&lt;/a&gt;
directly.&lt;/p&gt;
&lt;p&gt;The Image bindings let you &lt;a href=&quot;https://developers.cloudflare.com/images/transform-images/transform-via-workers/&quot;&gt;transform, resize, filter, rotate, add watermarks, etc.&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const response = (
  await env.IMAGES.input(stream)
    .transform({ rotate: 90 })
    .transform({ width: 128 })
    .transform({ blur: 20 })
    .output({ format: &quot;image/png&quot; })
).response();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There were two things missing from the docs &lt;a href=&quot;#footnote-1&quot; id=&quot;footnote-ref-1&quot; data-footnote-ref=&quot;true&quot; aria-describedby=&quot;footnote-label&quot;&gt;&lt;code&gt;[1]&lt;/code&gt;&lt;/a&gt; that I discovered while looking at the &lt;a href=&quot;https://workers-types.pages.dev/#ImageOutputOptions&quot;&gt;type definitions&lt;/a&gt;, and both help us out here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can request an array of &lt;a href=&quot;https://workers-types.pages.dev/#ImageOutputOptions&quot;&gt;&lt;code&gt;rgb&lt;/code&gt; or &lt;code&gt;rgba&lt;/code&gt; pixel values&lt;/a&gt; from the Workers binding directly. No need to write code to parse an image file format directly&lt;/li&gt;
&lt;li&gt;There is a &quot;fit&quot; for resizing images that ignores the existing aspect ratio called &lt;a href=&quot;https://workers-types.pages.dev/#BasicImageTransformations.fit&quot;&gt;&quot;squeeze&quot;&lt;/a&gt; that is currently missing from the docs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another issue was that there is no way to get the resized image dimensions from the output. When requesting &lt;code&gt;rgb&lt;/code&gt; output you
receive a flat array with three values per pixel. Unlike an image file format, this encodes no information about the width
or height of the image. I work around this by calculating the expected dimensions from the original&apos;s dimensions.&lt;/p&gt;
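&lt;p&gt;The workaround itself is a small calculation (sketched here with a hypothetical helper, not the repo&apos;s exact &lt;code&gt;getResizedDimensions&lt;/code&gt;): when the transform preserves the aspect ratio, the output width and height fall out of the original ratio and the pixel count of the returned array:&lt;/p&gt;

```javascript
// Illustrative sketch: recover the output dimensions of a flat rgb buffer.
// We know width / height equals the original aspect ratio and
// width * height equals pixelCount, so height is sqrt(pixelCount / ratio).
function inferResizedDimensions(originalWidth, originalHeight, pixelCount) {
  const aspectRatio = originalWidth / originalHeight;
  const height = Math.round(Math.sqrt(pixelCount / aspectRatio));
  const width = Math.round(pixelCount / height);
  return { width, height };
}
```

&lt;p&gt;Rounding matters here: the resize may not land on exact integer dimensions, so the result should be treated as a best-effort estimate.&lt;/p&gt;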
&lt;p&gt;I implemented 4 different image placeholder algorithms:&lt;/p&gt;
&lt;h3&gt;Dominant Color (easy)&lt;/h3&gt;
&lt;p&gt;The simplest approach I could think of was to squish the whole image down into one pixel, and let the image software
decide what the color should be. This works surprisingly well given the lack of sophistication! The whole implementation fits
in just a few lines:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async function getDominantColor(image: ReadableStream): Promise&amp;lt;string&amp;gt; {
  let rgbImage = await env.IMAGES.input(image)
    .transform({ width: 1, height: 1, fit: &quot;cover&quot; })
    .output({ format: &quot;rgb&quot; });
  let rgbImageBuffer = await rgbImage.response().arrayBuffer();
  let pixelData = new Uint8Array(rgbImageBuffer);

  let r = pixelData[0];
  let g = pixelData[1];
  let b = pixelData[2];

  return `#${r.toString(16).padStart(2, &quot;0&quot;)}${g.toString(16).padStart(2, &quot;0&quot;)}${b.toString(16).padStart(2, &quot;0&quot;)}`;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dominant Color from a Palette&lt;/h3&gt;
&lt;p&gt;Averaging all of the colors together sometimes creates a muddy-looking output. This approach resizes the image to be
easy to work with in the Workers environment and uses the &lt;a href=&quot;https://github.com/lokesh/quantize&quot;&gt;underlying Modified Median Cut Quantization (MMCQ)
algorithm&lt;/a&gt; from the &lt;a href=&quot;https://github.com/lokesh/color-thief&quot;&gt;color-thief&lt;/a&gt;
library to group colors together. If the image has a lot of a single color, such as text on a white background,
it will tend to pick out this main color instead of averaging it with the rest of the colors in the image.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async function getDominantColorFromPalette(image: ReadableStream): Promise&amp;lt;string&amp;gt; {
  let rgbImage = await env.IMAGES.input(image)
    .transform({ width: 200, height: 200, fit: &quot;cover&quot; })
    .output({ format: &quot;rgb&quot; });
  let rgbImageBuffer = await rgbImage.response().arrayBuffer();
  let pixelData = new Uint8Array(rgbImageBuffer);

  // get a representative color palette from the image
  let palette = quantize(pixelData, 5);
  // get the most prominent color
  let dominantColor = palette[0];
  let r = dominantColor[0];
  let g = dominantColor[1];
  let b = dominantColor[2];

  return `#${r.toString(16).padStart(2, &quot;0&quot;)}${g.toString(16).padStart(2, &quot;0&quot;)}${b.toString(16).padStart(2, &quot;0&quot;)}`;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Blurhash&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://blurha.sh/&quot;&gt;Blurhash&lt;/a&gt; is a clever approach that compresses an image into blurry gradients and packs that data into a small string like &lt;code&gt;LEHV6nWB2yk8pyo0adR*.7kCMdnj&lt;/code&gt;. The &lt;a href=&quot;https://github.com/woltapp/blurhash/tree/master/TypeScript&quot;&gt;TypeScript library&lt;/a&gt; works great on Workers,
though note that it expects an &lt;code&gt;rgba&lt;/code&gt; array as input.&lt;/p&gt;
&lt;p&gt;I did have to jump through some hoops to get the correct dimensions of the new image. &lt;a href=&quot;https://github.com/jmorrell/low-quality-image-placeholders-on-cloudflare-workers/blob/main/src/index.ts&quot;&gt;See the full code here&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async function getBlurhash(image: ReadableStream, aspectRatioInfo: AspectRatioInfo): Promise&amp;lt;string&amp;gt; {
  let resizedImage = await env.IMAGES.input(image)
    .transform({ width: RESIZE_DIMENSION, height: RESIZE_DIMENSION, fit: &quot;contain&quot; })
    .output({ format: &quot;rgba&quot; });
  let resizedImageBuffer = await resizedImage.response().arrayBuffer();
  let pixelDataClamped = new Uint8ClampedArray(resizedImageBuffer);

  let { width: resizedWidth, height: resizedHeight } = getResizedDimensions(
    aspectRatioInfo,
    RESIZE_DIMENSION,
    pixelDataClamped.length / 4
  );
  return encodeBlurhash(pixelDataClamped, resizedWidth, resizedHeight, 4, 4);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CSS Blobhash&lt;/h3&gt;
&lt;p&gt;I cribbed from &lt;a href=&quot;https://github.com/Kalabasa/leanrada.com/blob/7b6739c7c30c66c771fcbc9e1dc8942e628c5024/main/scripts/update/lqip.mjs#L118-L159&quot;&gt;Lean&apos;s implementation&lt;/a&gt;
and lightly converted it for the Images binding. &lt;a href=&quot;https://github.com/jmorrell/low-quality-image-placeholders-on-cloudflare-workers/blob/main/src/css-blob-hash.ts&quot;&gt;See the full implementation here&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async function getCSSBlobHash(image: ReadableStream): Promise&amp;lt;number&amp;gt; {
  let resizedImage = await env.IMAGES.input(image)
    .transform({ width: 3, height: 2, fit: &quot;squeeze&quot; })
    .output({ format: &quot;rgb&quot; });
  let resizedImageBuffer = await resizedImage.response().arrayBuffer();
  let pixelDataClamped = new Uint8ClampedArray(resizedImageBuffer);

  return encodeCSSBlobHash(pixelDataClamped);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;Playing around with the output, I was quite surprised to find that picking the dominant color from a palette consistently
struck the best balance. The blurry approaches are often very pleasing for photographs, but my use-case will include a lot
of images designed for social media feeds, and those feel a little off when blurred. Sometimes keeping it simple is the
best approach.&lt;/p&gt;
&lt;p&gt;I&apos;ll close out with some screenshots of examples. Feel free to try the generator on your own images: https://github.com/jmorrell/low-quality-image-placeholders-on-cloudflare-workers&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;blurhash-social-image.jpg&quot; alt=&quot;Example output Blurhash&apos;s social image&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;catpuccin-social-image.jpg&quot; alt=&quot;Example output for the catpuccin color theme&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;cloudflare-social-image.jpg&quot; alt=&quot;Example output for a Cloudflare blog post header&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;half-dome-yosemite.jpg&quot; alt=&quot;Example output for a beautiful shot of Half Dome at Yosemite&quot; /&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href=&quot;#footnote-ref-1&quot; id=&quot;footnote-1&quot;&gt;&lt;code&gt;[1]&lt;/code&gt;&lt;/a&gt; I&apos;ve filed tickets internally to get these issues fixed.&lt;/p&gt;
</content:encoded></item><item><title>On OpenTelemetry and the value of Standards</title><link>https://jeremymorrell.dev/blog/opentelemetry-and-the-value-of-standards/</link><guid isPermaLink="true">https://jeremymorrell.dev/blog/opentelemetry-and-the-value-of-standards/</guid><description>OpenTelemetry is not perfect, but the value of having one shared standard for instrumentation and telemetry is huge</description><pubDate>Sun, 15 Dec 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;./bolt-diagram.jpg&quot; alt=&quot;Engineering diagram illustrating various parameters on hex-head bolts&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Physical standards are amazingly foundational things that we frequently take for granted.
I can buy a 1/4-20 bolt today and fit it into machines built many decades ago. There is a variety
of tooling readily available to work with threaded bolts: wrenches, impact drivers,
torque wrenches of all kinds, thread gauges, taps and dies, threaded inserts.&lt;/p&gt;
&lt;p&gt;Importantly, this is all astoundingly cheap. An individual bolt might just be a few pennies, barely more than the
cost of raw materials. &lt;a href=&quot;https://ourworldindata.org/learning-curve&quot;&gt;Wright&apos;s Law&lt;/a&gt; states that for every cumulative
doubling of units produced, costs tend to fall by a constant percentage. The more we make of something, the cheaper
it is per-unit.&lt;/p&gt;
&lt;p&gt;The volumes and competition generated by a shared standard push us further up the &lt;a href=&quot;https://en.wikipedia.org/wiki/Experience_curve_effects&quot;&gt;experience curve&lt;/a&gt;
than would be possible if each company produced their own bespoke product. We can build bigger, faster, more
specialized tools and amortize their cost over more usage.&lt;/p&gt;
&lt;p&gt;Standards enable investment.&lt;/p&gt;
&lt;p&gt;While we operate under different constraints, the same is broadly true in software. One-off solutions
can be eye-wateringly expensive, but standards make for cheap and plentiful tooling.&lt;/p&gt;
&lt;h3&gt;Today&apos;s tools&lt;/h3&gt;
&lt;p&gt;When it comes to understanding and introspecting our systems, our industry has been a bit of a wild west
for a long time.&lt;/p&gt;
&lt;p&gt;Mostly we rely on vendors. Vendors build SDKs that bend over backward to automagically wrap libraries
and generate mystery payloads that get shipped off to proprietary services. This usually exists as an opaque
layer on top of our software. We get trained that instrumentation is someone else&apos;s responsibility, too
complicated for us mere mortals.&lt;/p&gt;
&lt;p&gt;This can be great for the handful of vendors that manage to win a majority of the market, but it&apos;s a pretty bad
experience for everyone else. Additionally it draws hard limits on what&apos;s possible.&lt;/p&gt;
&lt;p&gt;You&apos;re never going to get Django to build in Datadog support. AWS is never going to integrate New Relic
into its platform. And if they did, this would only entrench that handful of vendors even more.&lt;/p&gt;
&lt;p&gt;Without standards in this area all investment in the tooling comes from vendors who all end
up re-building very similar things. It&apos;s hard or economically treacherous for others to build on top
of their proprietary stacks. The lack of standards prevents broader investment.&lt;/p&gt;
&lt;p&gt;We&apos;re trapped in a local maximum. Open Standards provide a way out and, hopefully, a better experience.&lt;/p&gt;
&lt;h2&gt;Today&apos;s experience&lt;/h2&gt;
&lt;p&gt;When I deploy a web app built on a common stack today (say: Rails and Postgres) whether on a cloud, PaaS,
or my own hardware I get precious little information about what my software is doing. I may get some
basic &lt;a href=&quot;https://devcenter.heroku.com/articles/metrics&quot;&gt;HTTP, memory, CPU metrics&lt;/a&gt;, but if I want more
fine-grained data I&apos;ll need to go elsewhere and start from scratch.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./heroku-metrics-http.png&quot; alt=&quot;Heroku&apos;s HTTP response time graph&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This is in sharp contrast to the detailed information about each HTTP request or database query we get in an
APM tool like New Relic. Want to see the slowest requests on the &lt;code&gt;/checkout&lt;/code&gt; endpoint filtered by region
and device? New Relic likely has you covered. Need to see your slowest database queries? It&apos;s only a couple
clicks away!&lt;/p&gt;
&lt;p&gt;Fair play to New Relic, it took &lt;strong&gt;a lot&lt;/strong&gt; of exceptional engineering to make this look seamless!
What if the next company didn&apos;t have to build all of this from scratch? What if we could expect more
from our tools in the first place? Rails can easily capture all of this information. Why do we need
a vendor to write fancy code to wrap it and surface it to us? Wouldn&apos;t the instrumentation likely
be better if the Rails developers designed the instrumentation for their own code?&lt;/p&gt;
&lt;p&gt;Prior to OpenTelemetry, there was no clear mechanism for doing this. Imagine I&apos;m writing a database client library
and want to communicate to my user&apos;s system an observation like &lt;code&gt;query X took 1.2ms, returning 152kb of data across 200 rows&lt;/code&gt;. Without a standard way of emitting this data, I don&apos;t have many good options. I could
write log messages to &lt;code&gt;stdout&lt;/code&gt;, but they might not be formatted the way the user needs. I might build a plugin
system so the user could bring their own logic, requiring more work from the user and making my own library more
complicated in the process.&lt;/p&gt;
&lt;h2&gt;A vision for the future&lt;/h2&gt;
&lt;p&gt;What if I could build instrumentation directly into my code and the user could configure
what they wanted to do with it? It might look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Database
  def execute(sql)
    in_span(&quot;sql_query&quot;) do |span|
      results = run_query(sql) # execute the query
      span.set_attributes(
        &quot;query&quot; =&amp;gt; obfuscated_query,
        &quot;results_size_kb&quot; =&amp;gt; results_size,
        &quot;results_row_count&quot; =&amp;gt; results.length,
      )
      results
    end
  end
end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That already exists! It&apos;s the &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-ruby/tree/main/api&quot;&gt;OpenTelemetry API&lt;/a&gt;.
It&apos;s not even directly tied to the &lt;a href=&quot;https://opentelemetry.io/docs/specs/otlp/&quot;&gt;OpenTelemetry OTLP format&lt;/a&gt;, and you
could use this instrumentation to generate a completely different trace format or even just log lines if you wanted.&lt;/p&gt;
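&lt;p&gt;To make the separation concrete, here is a toy sketch in JavaScript (my own illustration, not the real OpenTelemetry API surface): the library calls a span helper, and the application configures what happens to the resulting data:&lt;/p&gt;

```javascript
// Toy sketch of the API/SDK split: library code calls inSpan and records
// attributes; the application plugs in whatever handler it wants, whether
// that emits OTLP, a custom trace format, or plain log lines.
let spanHandler = function () {}; // no-op by default, like an unconfigured SDK

function configureSpanHandler(handler) {
  spanHandler = handler;
}

function inSpan(name, fn) {
  const span = {
    name,
    attributes: {},
    setAttributes(attrs) {
      Object.assign(this.attributes, attrs);
    },
  };
  const start = Date.now();
  try {
    // run the instrumented work, passing the span so it can record data
    return fn(span);
  } finally {
    span.durationMs = Date.now() - start;
    spanHandler(span); // the application decides what this does
  }
}
```

&lt;p&gt;An application that just wants log lines could call &lt;code&gt;configureSpanHandler((span) =&gt; console.log(JSON.stringify(span)))&lt;/code&gt;, while another could batch spans into OTLP; the library code is unchanged either way.&lt;/p&gt;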
&lt;p&gt;&lt;a href=&quot;https://medium.com/opentelemetry/opentelemetry-specification-v1-0-0-tracing-edition-72dd08936978&quot;&gt;&lt;img src=&quot;./opentelemetry-api.jpeg&quot; alt=&quot;Diagram showing how the API and SDK work together&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let&apos;s go a bit farther and imagine a better future where all of our dependencies come with re-usable instrumentation,
and where the proliferation of a single standard means our IDEs, platforms, languages, and frameworks can all work seamlessly together.&lt;/p&gt;
&lt;p&gt;When building my service locally, I have &lt;a href=&quot;https://github.com/CtrlSpice/otel-desktop-viewer&quot;&gt;local tools to view the telemetry my service is emitting&lt;/a&gt;,
perhaps even built into my IDE. When I deploy to a PaaS service, it automatically collects and surfaces traces, metrics, and
logs right in the dashboard. &lt;a href=&quot;https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/&quot;&gt;Maybe the platform automatically pulls in more metadata for me&lt;/a&gt; so I can easily view how my requests are performing as new versions roll out,
or how my latest changes affected response times in Germany, or notice a spike in errors on the &lt;code&gt;/account/:id/details&lt;/code&gt; endpoint and jump to an exemplary trace.&lt;/p&gt;
&lt;p&gt;A few button clicks later and now this data is flowing to my preferred Observability vendor as well as our own data
warehouse for deeper analysis and long-term storage.&lt;/p&gt;
&lt;p&gt;I haven&apos;t had to write a single line of code to enable this, or stand up any custom infrastructure, and I don&apos;t even
really need to know how it&apos;s happening. Like my IDE debugger or the Chrome Dev Tools, it&apos;s just another tool that I
expect to work. However, once I discover that I do need some additional data, there&apos;s a standard and easy way of extending
the data my framework and libraries are creating for me.&lt;/p&gt;
&lt;p&gt;OpenTelemetry is usually sold as a way of avoiding lock-in for Observability vendors
&amp;lt;a href=&quot;#footnote-1&quot; id=&quot;footnote-ref-1&quot; data-footnote-ref=&quot;true&quot; aria-describedby=&quot;footnote-label&quot;&amp;gt;&amp;lt;code&amp;gt;[1]&amp;lt;/code&amp;gt;&amp;lt;/a&amp;gt;,
but that&apos;s just a start. If the standard is successful it&apos;ll also be built into libraries, frameworks, tooling, IDEs,
languages, and platforms. For many users it&apos;s likely the standard
&lt;a href=&quot;https://en.wikipedia.org/wiki/Ephemeralization&quot;&gt;will mostly disappear into the background&lt;/a&gt; and become an expected part of the system.&lt;/p&gt;
&lt;p&gt;We&apos;re still &lt;em&gt;very early&lt;/em&gt; on this journey as an industry with OpenTelemetry.&lt;/p&gt;
&lt;p&gt;&amp;lt;blockquote class=&quot;bluesky-embed&quot; data-bluesky-uri=&quot;at://did:plc:gttrfs4hfmrclyxvwkwcgpj7/app.bsky.feed.post/3ldca564b3s2f&quot; data-bluesky-cid=&quot;bafyreibfxbrfdfhvmdc2shhsxgpjcf3i2tltigcqaeimox3igzrkvk7ua4&quot;&amp;gt;&amp;lt;p lang=&quot;en&quot;&amp;gt;what I really want to see are framework-level “batteries included” sdk reimplementations on top of the OTel APIs.&lt;/p&gt;
&lt;p&gt;spiritually similar to how rust has coalesced around the tracing crate with pluggable exporters&amp;lt;/p&amp;gt;— austin 🎄 (&amp;lt;a href=&quot;https://bsky.app/profile/did:plc:gttrfs4hfmrclyxvwkwcgpj7?ref_src=embed&quot;&amp;gt;@aparker.io&amp;lt;/a&amp;gt;) &amp;lt;a href=&quot;https://bsky.app/profile/did:plc:gttrfs4hfmrclyxvwkwcgpj7/post/3ldca564b3s2f?ref_src=embed&quot;&amp;gt;December 14, 2024 at 1:21 PM&amp;lt;/a&amp;gt;&amp;lt;/blockquote&amp;gt;&amp;lt;script async src=&quot;https://embed.bsky.app/static/embed.js&quot; charset=&quot;utf-8&quot;&amp;gt;&amp;lt;/script&amp;gt;&lt;/p&gt;
&lt;h2&gt;But is OpenTelemetry the &lt;em&gt;right&lt;/em&gt; standard?&lt;/h2&gt;
&lt;p&gt;How do we know that OpenTelemetry is the &lt;strong&gt;right standard&lt;/strong&gt; to bet on? Don&apos;t some people have problems with it? Isn&apos;t it &quot;design by committee&quot;?&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://news.ycombinator.com/item?id=37296594&quot;&gt;&lt;img src=&quot;hn-comment.png&quot; alt=&quot;OpenTelemetry is a marketing-driven project, designed by committee, implemented naively and inefficiently, and guided by the primary goal of allowing Fortune X00 CTOs to tick off some boxes on their strategy roadmap documents.&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://cra.mr/the-problem-with-otel/&quot;&gt;David Cramer had a popular post about his issues with OpenTelemetry&lt;/a&gt;, and &lt;a href=&quot;https://bsky.app/profile/isburmistrov.bsky.social&quot;&gt;Ivan Burmistrov&lt;/a&gt; brought up a number of points in &lt;a href=&quot;https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics&quot;&gt;All you need is Wide Events, not “Metrics, Logs and Traces”&lt;/a&gt;. &lt;a href=&quot;https://bsky.app/profile/hazelweakly.me&quot;&gt;Hazel Weakly&lt;/a&gt; recently dove into one of OpenTelemetry&apos;s biggest flaws: &lt;a href=&quot;https://thenewstack.io/opentelemetry-challenges-handling-long-running-spans/&quot;&gt;lack of a standard for long-running spans&lt;/a&gt;. I generally agree with most of their points! &amp;lt;a href=&quot;#footnote-2&quot; id=&quot;footnote-ref-2&quot; data-footnote-ref=&quot;true&quot; aria-describedby=&quot;footnote-label&quot;&amp;gt;&amp;lt;code&amp;gt;[2]&amp;lt;/code&amp;gt;&amp;lt;/a&amp;gt;&lt;/p&gt;
&lt;p&gt;If your main experience with OpenTelemetry today looks like &lt;a href=&quot;https://signoz.io/blog/opentelemetry-ruby&quot;&gt;a more tedious, worse version of adding an APM vendor&lt;/a&gt;,
I don&apos;t fault you for taking this view. I&apos;ve even cautioned people away from adopting it today if they have a small team
or are under a lot of pressure. (&lt;a href=&quot;http://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events&quot;&gt;My wide events guide can help though!&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&amp;lt;blockquote class=&quot;bluesky-embed&quot; data-bluesky-uri=&quot;at://did:plc:qbim5usypxqjsxb27zjxn733/app.bsky.feed.post/3lasw3fary62j&quot; data-bluesky-cid=&quot;bafyreigvnfzc5fhetkp7pagqyjxq6jkhes3grbxdud5enaeilux7t5jl3u&quot;&amp;gt;&amp;lt;p lang=&quot;&quot;&amp;gt;OpenTelemetry is a really important project, but if you’re in a place where you are struggling to understand your systems, trying to adopt OTel as a way out is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;likely not going to solve your problem&lt;/li&gt;
&lt;li&gt;going to give you heaps more complexity to navigate&amp;lt;/p&amp;gt;— Jeremy Morrell (&amp;lt;a href=&quot;https://bsky.app/profile/did:plc:qbim5usypxqjsxb27zjxn733?ref_src=embed&quot;&amp;gt;@jeremymorrell.dev&amp;lt;/a&amp;gt;) &amp;lt;a href=&quot;https://bsky.app/profile/did:plc:qbim5usypxqjsxb27zjxn733/post/3lasw3fary62j?ref_src=embed&quot;&amp;gt;April 20, 2024&amp;lt;/a&amp;gt;&amp;lt;/blockquote&amp;gt;&amp;lt;script async src=&quot;https://embed.bsky.app/static/embed.js&quot; charset=&quot;utf-8&quot;&amp;gt;&amp;lt;/script&amp;gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;OpenTelemetry is complicated and endlessly extensible. That comes as a necessary byproduct of supporting so many different stakeholders. What looks like unnecessary bloat to you frequently turns out to be core to someone else&apos;s adoption. That complexity &lt;em&gt;enables&lt;/em&gt; everyone to get on the same page.&lt;/p&gt;
&lt;p&gt;It&apos;s not perfect, but it is succeeding at the main thing a standard needs: wide adoption. OpenTelemetry has shown major uptake by vendors and is starting to be adopted by end-users in enterprise. &lt;a href=&quot;https://www.cncf.io/blog/2024/11/15/gain-insights-into-cloud-native-applications-with-the-opentelemetry-certified-associate-otca/&quot;&gt;OpenTelemetry is the 2nd most active CNCF project after Kubernetes&lt;/a&gt;. There is clear momentum and no obvious competitors.&lt;/p&gt;
&lt;p&gt;The benefits from having a single standard, and what we could build on top of that foundation, far outweigh the drawbacks in complexity. We should be very thankful that &lt;a href=&quot;https://medium.com/opentracing/a-roadmap-to-convergence-b074e5815289&quot;&gt;the creators of OpenCensus and OpenTracing decided to collaborate rather than compete&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;OpenTelemetry is not a monolith&lt;/h2&gt;
&lt;p&gt;OpenTelemetry can be the foundation of this better future, but it&apos;s not a certainty yet. It&apos;s &lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/&quot;&gt;a big, sprawling
specification&lt;/a&gt;, and it can be helpful to break down how each part contributes
to the whole.&lt;/p&gt;
&lt;p&gt;Caveat: Most of my experience is with the Ruby, Go, and JavaScript ecosystems. Things may look different in other languages.&lt;/p&gt;
&lt;h4&gt;&lt;a href=&quot;https://opentelemetry.io/docs/specs/otlp/&quot;&gt;OTLP&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;If a vendor supports receiving OpenTelemetry, that usually means they support receiving OTLP-formatted data. All
of my interactions with the OTLP format so far (mostly tracing and metrics) have been good ones. &lt;a href=&quot;https://jeremymorrell.dev/blog/minimal-js-tracing/&quot;&gt;It&apos;s not very hard to generate from scratch&lt;/a&gt;, and there is good library support for both generating and parsing this data.&lt;/p&gt;
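&lt;p&gt;To make that concrete, here is a sketch of roughly what a single span looks like in the OTLP/JSON encoding. The field names follow the OTLP spec, but the IDs, timestamps, and attribute values here are placeholders; consult the spec for the exact encoding rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;resourceSpans&quot;: [{
    &quot;resource&quot;: {
      &quot;attributes&quot;: [{ &quot;key&quot;: &quot;service.name&quot;, &quot;value&quot;: { &quot;stringValue&quot;: &quot;checkout&quot; } }]
    },
    &quot;scopeSpans&quot;: [{
      &quot;scope&quot;: { &quot;name&quot;: &quot;my-instrumentation&quot; },
      &quot;spans&quot;: [{
        &quot;traceId&quot;: &quot;5b8efff798038103d269b633813fc60c&quot;,
        &quot;spanId&quot;: &quot;eee19b7ec3c1b174&quot;,
        &quot;name&quot;: &quot;GET /users/:id&quot;,
        &quot;kind&quot;: 2,
        &quot;startTimeUnixNano&quot;: &quot;1544712660000000000&quot;,
        &quot;endTimeUnixNano&quot;: &quot;1544712661000000000&quot;,
        &quot;attributes&quot;: [{ &quot;key&quot;: &quot;http.request.method&quot;, &quot;value&quot;: { &quot;stringValue&quot;: &quot;GET&quot; } }]
      }]
    }]
  }]
}
&lt;/code&gt;&lt;/pre&gt;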
&lt;p&gt;There are a &lt;a href=&quot;https://github.com/CtrlSpice/otel-desktop-viewer&quot;&gt;growing&lt;/a&gt; &lt;a href=&quot;https://learn.microsoft.com/en-us/dotnet/core/diagnostics/observability-otlp-example&quot;&gt;number&lt;/a&gt; of &lt;a href=&quot;https://github.com/ymtdzzz/otel-tui&quot;&gt;tools&lt;/a&gt;
to visualize and work with this data locally, and I hope this trend continues. No one should
be debugging traces and metrics by squinting at serialized data structures dumped to their terminal.&amp;lt;a href=&quot;#footnote-3&quot; id=&quot;footnote-ref-3&quot; data-footnote-ref=&quot;true&quot; aria-describedby=&quot;footnote-label&quot;&amp;gt;&amp;lt;code&amp;gt;[3]&amp;lt;/code&amp;gt;&amp;lt;/a&amp;gt;&lt;/p&gt;
&lt;p&gt;My most pressing request for OTLP is &lt;a href=&quot;https://www.cncf.io/blog/2024/06/14/why-embrace-created-span-snapshots-for-mobile-observability-with-opentelemetry/&quot;&gt;an official way of representing unfinished spans similar to the span snapshots that Embrace is using in their mobile tooling&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;&lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/overview/#opentelemetry-client-architecture&quot;&gt;API&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The split between the API and SDK in OpenTelemetry is not well-understood, but is one of the
most interesting ideas in the project.&lt;/p&gt;
&lt;p&gt;The API is a small set of interfaces for creating traces, metrics, and logs. It has no implementation, just
a hook that the end user can use to register their desired behavior. The OpenTelemetry SDKs build on these
interfaces, but nothing is stopping others from doing the same thing. If library authors embed them into
their code, then we can avoid needing fancy instrumentation libraries.&lt;/p&gt;
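&lt;p&gt;A toy sketch of the idea (this is not the real OpenTelemetry API, just its shape): the &quot;API&quot; ships a no-op default plus a registration hook, so library code can instrument unconditionally and pay almost nothing when no SDK is installed.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// The &quot;API&quot; package: a no-op default and a hook to swap in a real implementation
const noopSpan = { setAttribute() {}, end() {} };
let provider = { startSpan: () =&amp;gt; noopSpan };

const api = {
  setTracerProvider(p) { provider = p; },
  startSpan(name) { return provider.startSpan(name); },
};

// Library code depends only on the tiny API and instruments unconditionally
function handleRequest() {
  let span = api.startSpan(&quot;handleRequest&quot;);
  span.setAttribute(&quot;http.route&quot;, &quot;/users&quot;);
  // ... actual work ...
  span.end();
}

// An &quot;SDK&quot; (or any alternative implementation) registers itself at startup
api.setTracerProvider({
  startSpan(name) { /* create, record, and export a real span */ return noopSpan; },
});
&lt;/code&gt;&lt;/pre&gt;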
&lt;p&gt;I mainly have experience with the
&lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/trace/api/&quot;&gt;tracing&lt;/a&gt; and
&lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/context/&quot;&gt;context&lt;/a&gt; APIs, and find them fairly straightforward
to use, if not always the most idiomatic.&lt;/p&gt;
&lt;p&gt;Ultimately I think that these APIs are good candidates for getting folded into the language itself.
&lt;a href=&quot;https://docs.rs/tracing/latest/tracing/&quot;&gt;Rust&lt;/a&gt; is leading the way.&lt;/p&gt;
&lt;h4&gt;&lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/trace/sdk/&quot;&gt;SDK&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;If you have added OpenTelemetry to your service, the library you installed was the SDK. It&apos;s a full-featured,
configurable toolkit for instrumenting your code and emitting telemetry. You can think of it like the
Sentry or New Relic libraries, though a better analogy might be a toolkit you could use to build those
libraries.&lt;/p&gt;
&lt;p&gt;I&apos;ve had very mixed experiences with the OpenTelemetry SDKs.&lt;/p&gt;
&lt;p&gt;They are amazingly extensible and battle-hardened bits of software. Even when handed really weird
requirements, I was always able to find a way to get them to do what my company needed them to do. I&apos;ve only
experienced one or two production issues even after years of using them at scale. The maintainers
should be incredibly proud of their work.&lt;/p&gt;
&lt;p&gt;However, it&apos;s not all sunshine and rainbows. The documentation is not always the best, and I frequently
find myself digging into their code to figure out how something works. The SDKs introduce many concepts
to the user and require lots of configuration even on the simplest happy path. When compared to a
&lt;a href=&quot;https://github.com/getsentry/sentry-ruby?tab=readme-ov-file#getting-started&quot;&gt;polished, opinionated vendor experience&lt;/a&gt;
it&apos;s clear that the SDKs don&apos;t measure up.&lt;/p&gt;
&lt;p&gt;I&apos;d like to see a better default experience. When I &lt;code&gt;printf&lt;/code&gt; to &lt;code&gt;stdout&lt;/code&gt; I see the data immediately and
without configuration in my terminal. I can easily pipe this to a file. It shows up in my IDE. My deployment
platform likely has &lt;a href=&quot;https://devcenter.heroku.com/articles/log-drains&quot;&gt;first-class support&lt;/a&gt; for &lt;a href=&quot;https://render.com/docs/log-streams&quot;&gt;collecting and sending this data to a vendor of my choice&lt;/a&gt;. I want that same ease-of-use for OTLP streams.&lt;/p&gt;
&lt;p&gt;I fear that we are moving away from this vision and towards &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-configuration/&quot;&gt;even more layers of configuration&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;&lt;a href=&quot;https://opentelemetry.io/docs/collector/&quot;&gt;Collector&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The Collector is configurable glue, a Swiss Army knife that can morph into whatever your organization needs. There&apos;s
probably a plugin that converts from whatever you have to OTLP and back again. If there isn&apos;t, writing a custom one isn&apos;t
too difficult. It allows an organization to gradually adopt OpenTelemetry in their existing systems and makes building
centralized tooling to handle telemetry streams more tractable.&lt;/p&gt;
&lt;p&gt;I think the Collector is a large reason for the success organizations are having in adopting OpenTelemetry.
People love deploying collectors! Perhaps too much. It&apos;s easy to turn around and realize that you have more
than a dozen in production, all somehow justified.&lt;/p&gt;
&lt;p&gt;In a future where OpenTelemetry has become a standard I think there will be less need for the Collector-as-glue, but I still
see it living on at the heart of many toolchains.&lt;/p&gt;
&lt;h4&gt;&lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/overview/#contrib-packages&quot;&gt;Contrib&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The SDK provides core functionality, and the contrib collections provide plugins and adapters. This is where you find
OpenTelemetry&apos;s &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-ruby-contrib/tree/main/instrumentation/rails&quot;&gt;Rails instrumentation&lt;/a&gt;
or &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/node/opentelemetry-instrumentation-pg&quot;&gt;Node&apos;s Postgres client instrumentation&lt;/a&gt;
or &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/jaegerreceiver&quot;&gt;plugins for the collector to receive legacy formats&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The contrib instrumentations represent gobsmacking amounts of engineering effort. This code can be really difficult
to write and maintain, especially as the underlying library evolves and changes. Navigating breaking changes and
which version is supported by which contrib version can be a real pain. Despite that, I&apos;ve mostly had good experiences
with these!&lt;/p&gt;
&lt;p&gt;However, I would like to see a future where they largely aren&apos;t necessary because vendor-neutral instrumentation has been
added directly to the libraries or other systems themselves.&lt;/p&gt;
&lt;h4&gt;&lt;a href=&quot;https://opentelemetry.io/docs/concepts/semantic-conventions/&quot;&gt;Semantic Conventions&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Semantic Conventions is the idea that every time you are instrumenting a similar thing, such as an HTTP service or a job queue, the
telemetry emitted should look the same, have the same shape as telemetry from other HTTP services or job queues. The status code
for an HTTP response should always be &lt;code&gt;http.response.status_code&lt;/code&gt;, not &lt;code&gt;http.status_code&lt;/code&gt;, not &lt;code&gt;http_status_code&lt;/code&gt;, not &lt;code&gt;statusCode&lt;/code&gt; or &lt;code&gt;status-code&lt;/code&gt;.&lt;/p&gt;
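&lt;p&gt;As an illustration, a single HTTP server span following the conventions might carry attributes like these (names drawn from the HTTP semantic conventions; the full set is larger and still evolving):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http.request.method: &quot;GET&quot;
http.route: &quot;/users/:id&quot;
http.response.status_code: 200
url.path: &quot;/users/123&quot;
server.address: &quot;api.example.com&quot;
user_agent.original: &quot;Mozilla/5.0 ...&quot;
&lt;/code&gt;&lt;/pre&gt;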
&lt;p&gt;While technically the simplest, in my opinion Semantic Conventions is one of the most ambitious OpenTelemetry specs. If broadly adopted it will allow
tooling to recognize patterns and surface much better information for users. However, trying to name so many things across so many domains is an
incredibly difficult thing to achieve, let alone getting buy-in across so many implementations.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&amp;lt;a href=&quot;#footnote-ref-1&quot; id=&quot;footnote-1&quot;&amp;gt;&amp;lt;code&amp;gt;[1]&amp;lt;/code&amp;gt;&amp;lt;/a&amp;gt; I suspect vendor neutrality is over-hyped relative to its actual merits. From experience migrating an org with a non-trivial number of systems, alerts, dashboards, and integrations with other systems, and hundreds of engineers who need to learn how to use the new vendor, double-writing telemetry into the new system is not the hard part.&lt;/p&gt;
&lt;p&gt;While it does provide some insurance against a vendor charging you extortionate rates, for most organizations migrating vendors could cost on the order of millions of dollars in salary and months of time in opportunity cost that could have been spent improving the core product. And that&apos;s if the organization has the engineering maturity to organize this kind of migration in the first place; many do not.&lt;/p&gt;
&lt;p&gt;Choose your observability vendors with the same care with which you choose your cloud provider, whether you use OpenTelemetry or not.&lt;/p&gt;
&lt;p&gt;&amp;lt;a href=&quot;#footnote-ref-2&quot; id=&quot;footnote-2&quot;&amp;gt;&amp;lt;code&amp;gt;[2]&amp;lt;/code&amp;gt;&amp;lt;/a&amp;gt; In particular, I generally agree with Ivan Burmistrov&apos;s points in &lt;a href=&quot;https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics&quot;&gt;All you need is Wide Events, not “Metrics, Logs and Traces”&lt;/a&gt;. A span is an event. A set of metrics can be an event. A log is an event. An event attached to a span is... another event.&lt;/p&gt;
&lt;p&gt;OpenTelemetry creates a lot of complexity that seems unnecessary from this perspective, but I suspect a simpler event-based model would never have been adopted by any traditional vendor. Without wide adoption, it would very likely fail as a standard.&lt;/p&gt;
&lt;p&gt;&amp;lt;a href=&quot;#footnote-ref-3&quot; id=&quot;footnote-3&quot;&amp;gt;&amp;lt;code&amp;gt;[3]&amp;lt;/code&amp;gt;&amp;lt;/a&amp;gt; I&apos;d like it if
the OpenTelemetry project itself &lt;a href=&quot;https://github.com/open-telemetry/community/issues/1515&quot;&gt;provided more tooling here&lt;/a&gt;, though those discussions &lt;a href=&quot;https://github.com/open-telemetry/oteps/pull/230&quot;&gt;seem to have fizzled out&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And please &lt;a href=&quot;https://www.jaegertracing.io/docs/1.6/getting-started/&quot;&gt;do not make me run a heavyweight production tool in Docker to visualize my telemetry&lt;/a&gt;.
Give me native apps, CLI tools, IDE plugins. Treat this as a first-class concern, because it shapes so much of the users&apos; experience.&lt;/p&gt;
</content:encoded></item><item><title>A Practitioner&apos;s Guide to Wide Events</title><link>https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/</link><guid isPermaLink="true">https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/</guid><description>The existing articles on Wide Events define the concept well but leave the implementation details to the reader.</description><pubDate>Tue, 22 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Adopting Wide Event-style instrumentation has been one of the highest-leverage changes I&apos;ve made
in my engineering career. The feedback loop on all my changes tightened and debugging systems
became so much easier. Systems that were scary to work on suddenly seemed a lot more manageable.&lt;/p&gt;
&lt;p&gt;Lately there have been a lot of good blog posts on what &quot;Wide Events&quot; mean and why they are
important. Here are some of my recent favorites:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics&quot;&gt;All you need is Wide Events, not “Metrics, Logs and Traces”&lt;/a&gt; by &lt;a href=&quot;https://bsky.app/profile/isburmistrov.bsky.social&quot;&gt;Ivan Burmistrov&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://boristane.com/blog/observability-wide-events-101/&quot;&gt;Observability wide events 101&lt;/a&gt; by
&lt;a href=&quot;https://twitter.com/boristane&quot;&gt;Boris Tane&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://charity.wtf/2024/08/07/is-it-time-to-version-observability-signs-point-to-yes/&quot;&gt;Is it time to version Observability? (Signs point to yes)&lt;/a&gt; by &lt;a href=&quot;https://bsky.app/profile/mipsytipsy.bsky.social&quot;&gt;Charity Majors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tl;dr is that for each unit of work in your system (usually, but not always, an HTTP request / response)
you emit one &quot;event&quot; with all of the information you can collect about that work. &quot;Event&quot; is an overloaded
term in telemetry, so replace it with &quot;log line&quot; or &quot;span&quot; if you like. &lt;a href=&quot;https://jeremymorrell.dev/blog/minimal-js-tracing/&quot;&gt;They are all effectively the same
thing&lt;/a&gt;.&lt;/p&gt;
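&lt;p&gt;As a sketch, a single wide event for one request might look something like this (the field names here are illustrative, and a well-instrumented service would have many more):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;timestamp&quot;: &quot;2024-10-22T14:03:07Z&quot;,
  &quot;duration_ms&quot;: 184,
  &quot;http.request.method&quot;: &quot;POST&quot;,
  &quot;http.route&quot;: &quot;/user/account&quot;,
  &quot;http.response.status_code&quot;: 200,
  &quot;user.id&quot;: &quot;123&quot;,
  &quot;user.type&quot;: &quot;enterprise&quot;,
  &quot;deploy.version&quot;: &quot;2024-10-22.3&quot;,
  &quot;feature_flag.new_checkout&quot;: true,
  &quot;db.query_count&quot;: 7,
  &quot;error&quot;: false
}
&lt;/code&gt;&lt;/pre&gt;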
&lt;p&gt;&lt;a href=&quot;https://bsky.app/profile/mipsytipsy.bsky.social&quot;&gt;Charity Majors&lt;/a&gt; has been promoting this approach lately under the
name &lt;a href=&quot;https://www.honeycomb.io/blog/one-key-difference-observability1dot0-2dot0&quot;&gt;&quot;Observability 2.0&quot;&lt;/a&gt;, creating some
new momentum around the concept. However, it is &lt;em&gt;not&lt;/em&gt; a new idea. &lt;a href=&quot;https://twitter.com/brandur&quot;&gt;Brandur Leach&lt;/a&gt; wrote
about &quot;Canonical Log Lines&quot; both on &lt;a href=&quot;https://brandur.org/canonical-log-lines&quot;&gt;his own blog in 2016&lt;/a&gt; and
&lt;a href=&quot;https://stripe.com/blog/canonical-log-lines&quot;&gt;as used by Stripe in 2019&lt;/a&gt;. And &lt;a href=&quot;https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility/#Request_log_best_practices&quot;&gt;AWS has recommended it as a best-practice for ages&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Okay... I think I get the idea... but how do I do &quot;wide events&quot;?&lt;/h2&gt;
&lt;p&gt;This is where I find a lot of developers get tripped up. The idea sounds good in theory,
and we should totally try that one day! But I have this stack of features to ship, that
bug that&apos;s been keeping me up at night, and 30 new AI tools that came out
yesterday to learn about. And like... where do you even start? What data should I add?&lt;/p&gt;
&lt;p&gt;Like anything in software, there are a lot of options for how to approach this, but I&apos;ll talk
through one approach that has worked for me.&lt;/p&gt;
&lt;p&gt;We&apos;ll cover how to approach this in tooling and code, an &lt;strong&gt;extensive&lt;/strong&gt; list of attributes to add,
and I&apos;ll respond to some frequent objections that come up when discussing this approach.&lt;/p&gt;
&lt;p&gt;For this post we&apos;ll focus on web services, but you would apply a similar approach to any workload.&lt;/p&gt;
&lt;h2&gt;Choose your tools&lt;/h2&gt;
&lt;p&gt;We will need some way to instrument our code (traces or structured log lines) and somewhere to
send that telemetry so we can query and visualize it.&lt;/p&gt;
&lt;p&gt;This approach is best paired with a tool that lets you query your data in quick iterations.
I like &lt;a href=&quot;https://www.honeycomb.io/&quot;&gt;Honeycomb&lt;/a&gt; for this, but any Observability tool backed by
a modern OLAP database is likely going to work in a pinch.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.honeycomb.io/&quot;&gt;Honeycomb&lt;/a&gt; has &lt;a href=&quot;https://www.honeycomb.io/resources/why-we-built-our-own-distributed-column-store&quot;&gt;Retriever&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datadoghq.com/&quot;&gt;DataDog&lt;/a&gt; has &lt;a href=&quot;https://www.datadoghq.com/blog/engineering/introducing-husky/&quot;&gt;Husky&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://newrelic.com/&quot;&gt;New Relic&lt;/a&gt; has &lt;a href=&quot;https://docs.newrelic.com/docs/data-apis/get-started/nrdb-horsepower-under-hood/&quot;&gt;NRDB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://baselime.io/&quot;&gt;Baselime&lt;/a&gt; uses &lt;a href=&quot;https://boristane.com/talks/observability-with-clickhouse/&quot;&gt;ClickHouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://signoz.io/&quot;&gt;SigNoz&lt;/a&gt; uses &lt;a href=&quot;https://clickhouse.com/blog/signoz-observability-solution-with-clickhouse-and-open-telemetry&quot;&gt;ClickHouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Honeycomb, New Relic, and DataDog built their own columnar &lt;a href=&quot;https://aws.amazon.com/compare/the-difference-between-olap-and-oltp/&quot;&gt;OLAP&lt;/a&gt; data stores,
though now with the availability of &lt;a href=&quot;https://clickhouse.com/&quot;&gt;ClickHouse&lt;/a&gt;, &lt;a href=&quot;https://www.influxdata.com/blog/influxdb-engine/&quot;&gt;InfluxDB IOx&lt;/a&gt;,
&lt;a href=&quot;https://pinot.apache.org/&quot;&gt;Apache Pinot&lt;/a&gt;, and &lt;a href=&quot;https://duckdb.org/&quot;&gt;DuckDB&lt;/a&gt; there are new Observability tools popping up all the time.&lt;/p&gt;
&lt;p&gt;If you aren&apos;t constrained, I &lt;strong&gt;highly recommend&lt;/strong&gt; defaulting to using &lt;a href=&quot;https://opentelemetry.io/&quot;&gt;OpenTelemetry&lt;/a&gt;
and &lt;a href=&quot;https://www.honeycomb.io/&quot;&gt;Honeycomb&lt;/a&gt;. Your life will be easier.&lt;/p&gt;
&lt;p&gt;However, even if you are stuck in a corporate environment with a strong allergy to technology built after 2010, you
can leverage log search tools like Elasticsearch in a pinch. &lt;a href=&quot;https://stripe.com/blog/canonical-log-lines&quot;&gt;Stripe&lt;/a&gt;&apos;s
blog post goes over how to use Splunk for this.&lt;/p&gt;
&lt;p&gt;In any tool, you want to focus on getting proficient at three core techniques in order to sift through your events.
The faster you are able to apply these, iterate, and ask questions of your data, the better you&apos;ll be able to
debug issues and see what your system is really doing. When observability folks talk about &quot;slicing and dicing&quot;
data, these techniques are what they mean. I&apos;ll represent queries using a made-up SQL dialect, but
you should be able to find equivalents in your tool&apos;s query language.&lt;/p&gt;
&lt;h4&gt;Visualizing&lt;/h4&gt;
&lt;p&gt;Existing in a human body comes with its fair share of downsides, but the human visual cortex is really, really
good at recognizing patterns. Give it a fighting chance by getting really good at summoning visualizations
of the data your system is emitting. &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;COUNT_DISTINCT&lt;/code&gt;, &lt;code&gt;HEATMAP&lt;/code&gt;, &lt;code&gt;P90&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, Histogram.
Learn to leverage whatever graphs your tool makes available to you. Practice it. Get fast.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./heatmaps.png&quot; alt=&quot;A Honeycomb screenshot of heatmap&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./splunk-histogram.png&quot; alt=&quot;A Splunk screenshot of histogram&quot; /&gt;&lt;/p&gt;
&lt;h4&gt;Grouping&lt;/h4&gt;
&lt;p&gt;With each new annotation that we add to our wide events, we create another dimension along which we can
slice our data. &lt;code&gt;GROUP BY&lt;/code&gt; lets us look along that dimension and see whether the values
match our expectations.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GROUP BY instance.id
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;GROUP BY client.OS, client.version
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Filtering&lt;/h4&gt;
&lt;p&gt;Once we&apos;ve narrowed in on one dimension that is interesting, we usually want to dig further into
that data. Filtering down so that we&apos;re only looking at data from one endpoint, or from one IP address,
or sent by the iOS app, or only from users with a specific feature flag turned on allows us to narrow our
focus to a very specific segment of traffic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;WHERE http.route = &quot;/user/account&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;WHERE http.route != &quot;/health&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;WHERE http.user_agent_header contains &quot;Android&quot;
&lt;/code&gt;&lt;/pre&gt;
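&lt;p&gt;In practice you&apos;ll combine all three techniques in a single query. In the same made-up dialect (attribute names illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT HEATMAP(duration_ms)
WHERE main = true
  AND http.route = &quot;/user/account&quot;
GROUP BY deploy.version
&lt;/code&gt;&lt;/pre&gt;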
&lt;h2&gt;Write a middleware to help you&lt;/h2&gt;
&lt;p&gt;If you are using an OpenTelemetry SDK, it is already creating a wrapping span around the request and
response. You can access it by asking for the active span at any point during the processing of
the request.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// c here is your web framework&apos;s request context
let span = opentelemetry.trace.getActiveSpan();
span.setAttributes({
  &quot;user_agent.original&quot;: c.req.header(&quot;User-Agent&quot;),
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, if anyone wraps any of your code in a child span, the &quot;active span&quot; will change to be that
new wrapping span! There is no first-class way of addressing the original &quot;main&quot; span in OpenTelemetry,
but we can work around this by saving a reference to this specific span in the &lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/context/&quot;&gt;context&lt;/a&gt;
so we always have access to the &quot;main&quot; wrapping span.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// assumes: const { trace, context, createContextKey } = require(&quot;@opentelemetry/api&quot;);
// create a reference to store the span on the opentelemetry context object
const MAIN_SPAN_CONTEXT_KEY = createContextKey(&quot;main_span_context_key&quot;);

function mainSpanMiddleware(req, res, next) {
  // pull the active span created by the http instrumentation
  let span = trace.getActiveSpan();

  // get the current context
  let ctx = context.active();

  // set any attributes we always want on the main span
  span.setAttribute(&quot;main&quot;, true);

  // OpenTelemetry context is immutable, so to modify it we create
  // a new version with our span added
  let newCtx = ctx.setValue(MAIN_SPAN_CONTEXT_KEY, span);

  // set that new context as active for the duration of the request
  context.with(newCtx, () =&amp;gt; {
    next();
  });
}

// create another function that allows you to annotate this saved span easily
function setMainSpanAttributes(attributes) {
  let mainSpan = context.active().getValue(MAIN_SPAN_CONTEXT_KEY);
  if (mainSpan) {
    mainSpan.setAttributes(attributes);
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now our annotation code can look a little simpler, and we can be sure that we&apos;re setting these
attributes on the wrapping span.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;setMainSpanAttributes({
  &quot;user.id&quot;: &quot;123&quot;,
  &quot;user.type&quot;: &quot;enterprise&quot;,
  &quot;user.auth_method&quot;: &quot;oauth&quot;,
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can play around with a minimal running example &lt;a href=&quot;https://github.com/jmorrell/a-practitioners-guide-to-wide-events/tree/main/opentelemetry-js-example&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At Heroku we had internal &lt;a href=&quot;https://opentelemetry.io/docs/concepts/distributions/&quot;&gt;OpenTelemetry Distributions&lt;/a&gt;
that set this up for you automatically and added as many automatic annotations as possible to these spans.&lt;/p&gt;
&lt;p&gt;If you are not using OpenTelemetry,
&lt;a href=&quot;https://gist.github.com/jmorrell/76a9ee631370e073d6e2616dc1f67feb&quot;&gt;here&apos;s a gist that might help you get started&lt;/a&gt;.
&lt;a href=&quot;https://jeremymorrell.dev/blog/minimal-js-tracing/&quot;&gt;My previous post&lt;/a&gt; may help you put this logic together.&lt;/p&gt;
&lt;h2&gt;What do I add to this &quot;main&quot; span?&lt;/h2&gt;
&lt;p&gt;&amp;lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-dnt=&quot;true&quot; data-theme=&quot;light&quot;&amp;gt;&amp;lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&amp;gt;And how many dimensions do you plan to emit and pack into your wide events?&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;MANY. Hundreds! The more you have, the better you can detect and correlate rare conditions with precision.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;As you adjust to the joys of debugging with rich context, you will itch for it everywhere. ☺️&amp;lt;/p&amp;gt;— Charity Majors (@mipsytipsy) &amp;lt;a href=&quot;https://twitter.com/mipsytipsy/status/1744579558962336138&quot;&amp;gt;January 9, 2024&amp;lt;/a&amp;gt;&amp;lt;/blockquote&amp;gt; &amp;lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&amp;gt;&amp;lt;/script&amp;gt;&lt;/p&gt;
&lt;p&gt;We need to add attributes about the request, and there are likely far more of these than you would expect.
It&apos;s easy to come up with a dozen or so, but in a well-instrumented code base there will be hundreds of attributes.&lt;/p&gt;
&lt;p&gt;Note that while this is a long list, it is definitely not exhaustive. OpenTelemetry defines sets of attribute names as
&lt;a href=&quot;https://opentelemetry.io/docs/specs/semconv/&quot;&gt;Semantic Conventions&lt;/a&gt; that can also be used for inspiration. I have tried
to follow these in my naming where possible.&lt;/p&gt;
&lt;h3&gt;A convention to filter out everything else&lt;/h3&gt;
&lt;p&gt;Traces contain lots of spans, so it&apos;s helpful to have a convention for identifying and searching for these &quot;wide events&quot;.
&lt;code&gt;root&lt;/code&gt; and &lt;code&gt;canon&lt;/code&gt; were floated as options, but I&apos;ve landed on calling them &lt;code&gt;main&lt;/code&gt; spans.&lt;/p&gt;
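As a sketch, the convention is nothing more than a boolean attribute stamped onto the attribute bag of exactly one span per request (the helper name here is hypothetical):

```javascript
// Hypothetical helper: stamp the wide event's attributes with the
// `main` marker so these spans can be filtered with `WHERE main = true`.
function buildMainSpanAttributes(attributes) {
  return { main: true, ...attributes };
}
```

For example, `buildMainSpanAttributes({ "http.route": "/checkout" })` yields an attribute bag with `main: true` alongside everything else.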
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;main&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Present only for spans designated as a &quot;wide event&quot;, usually wrapping a request / response, or a background job&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This convention allows you to quickly figure out &quot;what does the traffic to this service look like?&quot; with a single query:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT(*)
WHERE
  main = true
GROUP BY http.route
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./traffic-by-route.png&quot; alt=&quot;Graph of traffic grouped by route over a week. There is an anomaly.&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Service metadata&lt;/h3&gt;
&lt;p&gt;Of course we need to add some information about the service we&apos;re running. Consider adding additional metadata about
which team owns the system, or which Slack channel the owning team hangs out in, though note that this can be
tedious to update if your workplace experiences frequent re-orgs. Tying these to a service catalog like &lt;a href=&quot;https://backstage.io/&quot;&gt;Backstage&lt;/a&gt;
is left as an exercise for the reader.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;shoppingcart&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What is the name of this service?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.environment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;production&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;staging&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;development&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Where is this service running?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.team&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;web-services&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;dev-ex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which team owns this service? Useful for knowing who to page during incidents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.slack_channel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;web-services&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;dev-ex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If I discover an issue with this service, where should I reach out?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;How many services does each team run?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT_DISTINCT(service.name)
WHERE
  service.environment = &quot;production&quot;
GROUP BY service.team
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ever look at the load on a system and wonder, &quot;Is that appropriate for the machine this is running on?&quot; Normally you would have to dig through other tools or config files to find out. Throw that context onto the wide
event so that it&apos;s available when you need it.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instance.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;656993bd-40e1-4c76-baff-0e50e158c6eb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;An ID that maps to this one instance of the service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instance.memory_mb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;12336&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How much RAM is available to this service?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instance.cpu_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;8&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;196&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many cores are available to this service?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instance.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;m6i.xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Does your vendor have a name for this type of instance?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;What are the services with the most memory that we run? What instance types do they use?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  service.name,
  MAX(instance.memory_mb) AS memory_mb,
  instance.type
GROUP BY service.name, instance.type
ORDER BY memory_mb DESC
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However you&apos;re orchestrating your systems, make sure that all of the relevant information is added. I&apos;ve included some
examples from &lt;a href=&quot;https://opentelemetry.io/docs/specs/semconv/resource/k8s/&quot;&gt;the Kubernetes semantic conventions&lt;/a&gt; for inspiration.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;container.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a3bf90e006b2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;An ID used to identify Docker containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;container.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nginx-proxy&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;wordpress-app&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Container name used by container runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;k8s.cluster.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api-cluster&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Name of the Kubernetes cluster your service is running in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;k8s.pod.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nginx-2723453542-065rx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Name of the Kubernetes pod your service is running in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cloud.availability_zone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us-east-1c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AZ where you&apos;re running your service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cloud.region&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;us-east-1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Region where you&apos;re running your service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;But even if you&apos;re using a Platform-as-a-Service, you can still pull out a lot of useful information!&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;heroku.dyno&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;web.1&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;worker.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The env var &lt;code&gt;DYNO&lt;/code&gt; that is set on your app at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;heroku.dyno_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;web&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;worker&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The first part of the &lt;code&gt;DYNO&lt;/code&gt; env var before the &lt;code&gt;.&lt;/code&gt;. Separating this makes it easier to query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;heroku.dyno_index&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The second part of the &lt;code&gt;DYNO&lt;/code&gt; env var after the &lt;code&gt;.&lt;/code&gt;. Separating this makes it easier to query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;heroku.dyno_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;performance-m&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The selected dyno size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;heroku.space&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;my-private-space&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The name of the private space that you are deployed into&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;heroku.region&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;virginia&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;oregon&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which region is this app located in?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
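Splitting the `DYNO` env var into its two parts is a one-liner; a sketch:

```javascript
// Split Heroku's DYNO env var (e.g. "web.1") into the type and index
// attributes from the table above so each can be queried on its own.
function herokuDynoAttributes(dyno) {
  const [type, index] = dyno.split(".");
  return {
    "heroku.dyno": dyno,
    "heroku.dyno_type": type,
    "heroku.dyno_index": Number(index),
  };
}
```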
&lt;blockquote&gt;
&lt;p&gt;How many dynos are we running? What dyno types are they? For which services?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT_DISTINCT(heroku.dyno_index)
GROUP BY service.name, heroku.dyno_type, instance.type
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Build info&lt;/h3&gt;
&lt;p&gt;Inevitably some of the first questions asked in any incident are &quot;Did something just go out?&quot; or &quot;What changed?&quot;.
Instead of jumping to your deployment tool or looking through GitHub repositories, add that data to your telemetry.&lt;/p&gt;
&lt;p&gt;Threading this data from your build system through to your production system so that it&apos;s available at runtime can
be a non-trivial amount of glue code, but having this information easily available during incidents is invaluable.&lt;/p&gt;
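One common approach (a sketch, not the only way): bake the build metadata into environment variables at image build time, then compute the derived fields at runtime. The env var names here are assumptions, not a standard.

```javascript
// Read build metadata baked in at image build time and derive the
// deployment age. `env` is e.g. process.env; `now` is injectable so
// the function stays deterministic in tests.
function buildAttributes(env, now = Date.now()) {
  const deployedAtMs = new Date(env.DEPLOYED_AT).getTime();
  return {
    "service.version": env.SERVICE_VERSION,
    "service.build.git_hash": env.GIT_HASH,
    "service.build.deployment.at": env.DEPLOYED_AT,
    "service.build.deployment.age_minutes": Math.floor(
      (now - deployedAtMs) / 60000
    ),
  };
}
```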
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;v123&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;9731945429d3d083eb78666c565c61bcef39a48f&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;However you track your version, ex: a version string or a hash of the built image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;acd8bb57-fb9f-4b2d-a750-4315e99dac64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If your build system gives you an ID, this context allows you to audit the build if something goes wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.git_hash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6f6466b0e693470729b669f3745358df29f97e8d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The git SHA of the deployed commit so you can know exactly which code was running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.pull_request_url&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://github.com/your-company/api-service/pull/121&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The url of the pull request that was merged that triggered the deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.diff_url&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://github.com/your-company/api-service/compare/c9d9380..05e5736&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A url that compares the previously deployed commit against the newly deployed commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.deployment.at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-10-14T19:47:38Z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Timestamp when the deployment process started&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.deployment.user&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;keanu.reeves@your-company.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which authenticated user kicked off the build? Could be a bot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.deployment.trigger&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;merge-to-main&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;slack-bot&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;api-request&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;config-change&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What triggered the deploy? Extremely valuable context during a deploy-triggered incident&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;service.build.deployment.age_minutes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;10230&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How old is this deploy? Shortcuts the frequent incident question &quot;Did something just go out?&quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Won&apos;t this be a lot of repetitive data?&lt;/strong&gt; These values do not change except between deploys! See &lt;a href=&quot;#frequent-objections&quot;&gt;Frequent Objections&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What systems have recently been deployed?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  service.name,
  MIN(service.build.deployment.age_minutes) as age
WHERE
  service.build.deployment.age_minutes &amp;lt; 20
GROUP BY service.name
ORDER BY age ASC
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;What&apos;s up with the spike of 500s when we did the last deploy?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT(*)
WHERE
  service.name = &quot;api-service&quot; AND
  main = true
GROUP BY http.status_code, service.version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./group-by-http-and-status.png&quot; alt=&quot;Graph showing requests grouped by http status code and version. There is a spike of 500s correlating to v1 shutting down.&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;HTTP&lt;/h3&gt;
&lt;p&gt;You should get most of these from your tracing library instrumentation, but there are usually more you can add if, for example,
your organization uses non-standard headers. Don&apos;t settle for only what OpenTelemetry gives you by default!&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;server.address&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;example.com&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;localhost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Name of the HTTP server that received the request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;url.path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/checkout&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;/account/123/features&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;URI path after the domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;url.scheme&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http&lt;/code&gt;, &lt;code&gt;https&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;URI scheme&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;url.query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;q=test&lt;/code&gt;, &lt;code&gt;ref=####&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;URI query component&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.request.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;79104EXAMPLEB723&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Platform request id: ex: &lt;code&gt;x-request-id&lt;/code&gt;, &lt;code&gt;x-amz-request-id&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.request.method&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;PUT&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;POST&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;OPTIONS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP request method&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.request.body_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3495&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Size of the request payload body in bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.request.header.content-type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;application/json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Value of a specific request header, &quot;content-type&quot; in this case, but there are many more. Pick out any that are important for your service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.response.status_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;404&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;500&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP response status code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.response.body_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1284&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;2202009&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Size of the response payload body in bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.response.header.content-type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;text/html&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Value of a specific response header, &quot;content-type&quot; in this case, but there are many more. Pick out any that are important for your service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  HEATMAP(http.response.body_size)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./http-body-size-annotated.png&quot; alt=&quot;A heatmap of response sizes. Most are within a fixed band, but there are sharp outliers that warrant more investigation.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;User-Agent&lt;/code&gt; headers contain a wealth of info. Don&apos;t rely on regex queries to try and make sense of them down the road. Parse them
into structured data from the beginning.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.original&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The value of the HTTP &lt;code&gt;User-Agent&lt;/code&gt; header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.device&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;computer&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;tablet&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;phone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Device type derived from the &lt;code&gt;User-Agent&lt;/code&gt; header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.OS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Windows&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;MacOS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OS derived from the &lt;code&gt;User-Agent&lt;/code&gt; header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.browser&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Chrome&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;Safari&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;Firefox&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser derived from the &lt;code&gt;User-Agent&lt;/code&gt; header&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.browser_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;129&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;18.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Browser version derived from the &lt;code&gt;User-Agent&lt;/code&gt; header&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
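In production you should reach for a maintained parser (`ua-parser-js` is a popular choice); the toy sketch below only illustrates flattening a parsed `User-Agent` into the attribute shape above.

```javascript
// Toy User-Agent parsing: good enough to show the attribute shape,
// not good enough for production use.
function userAgentAttributes(ua) {
  const browser = /(Chrome|Firefox|Safari)\/([\d.]+)/.exec(ua);
  const os = /Windows|Mac OS X|Linux|Android/.exec(ua);
  return {
    "user_agent.original": ua,
    "user_agent.browser": browser ? browser[1] : "unknown",
    "user_agent.browser_version": browser ? browser[2] : "unknown",
    "user_agent.OS": os ? os[0] : "unknown",
  };
}
```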
&lt;blockquote&gt;
&lt;p&gt;What browsers are my users using?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT(*)
GROUP BY user_agent.browser, user_agent.browser_version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you have any custom user agents or headers used as a convention within your org, parse those out too.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api-gateway&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;auth-service&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If you have a distributed architecture, have each service send a custom &lt;code&gt;User-Agent&lt;/code&gt; header with its name and version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.service_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;v123&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;6f6466b0e693470729b669f3745358df29f97e8d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If you have a distributed architecture, have each service send a custom &lt;code&gt;User-Agent&lt;/code&gt; header with its name and version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.app&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;iOS&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;android&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If a request is coming from a mobile app, make sure it includes which app and its version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user_agent.app_version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;v123&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;6f6466b0e693470729b669f3745358df29f97e8d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If a request is coming from a mobile app, make sure it includes which app and its version&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Route info&lt;/h3&gt;
&lt;p&gt;We&apos;re not done with HTTP attributes yet! One of the most important bits is the API endpoint
that the request matched. OpenTelemetry SDKs will &lt;em&gt;usually&lt;/em&gt; give this to you automagically,
but not always. Consider extracting the route parameters and query parameters as additional attributes.&lt;/p&gt;
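A sketch of flattening the matched route and its extracted parameters into the attributes below, assuming an Express-style params object:

```javascript
// Flatten the matched route pattern and its extracted params into
// queryable attributes. `params` is e.g. Express's req.params.
function routeAttributes(routePattern, params) {
  const attrs = { "http.route": routePattern };
  for (const [name, value] of Object.entries(params)) {
    attrs[`http.route.param.${name}`] = value;
  }
  return attrs;
}
```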
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.route&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/team/{team_id}/user/{user_id}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The route pattern that the url path is matched against&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.route.param.team_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;14739&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;team-name-slug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The extracted segment of the url path as it is parsed for each parameter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;http.route.query.sort_dir&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;asc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The query parameters that are relevant to the response of your service. Ex: &lt;code&gt;?sort_dir=asc&amp;amp;...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  P99(duration_ms)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
GROUP BY http.route
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./p99-duration-annotated.png&quot; alt=&quot;A chart of p99&apos;s broken down by route. There is a spike on only some of them. We should break down by version now to check if this was caused by a deploy.&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;User and customer info&lt;/h3&gt;
&lt;p&gt;Once you get the basics down, this is &lt;strong&gt;the most important&lt;/strong&gt; piece of metadata that you can add. No automagic SDK
will be able to encode the particulars of your user model.&lt;/p&gt;
&lt;p&gt;It&apos;s common for a single user or account to be responsible for 10%+ of a business&apos;s revenue, and frequently their
usage patterns look significantly different than the average user. They probably have more users, store more data,
and hit limits and edge-cases that will never show up for the user paying $10 / month. Be sure you can separate
their traffic from others.&lt;/p&gt;
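Most of the attributes below come straight off your session or user record; `user.age_days` is the one derived field. A sketch, with the account creation timestamp as the only input:

```javascript
// Derive user.age_days from the account creation timestamp.
// `now` is injectable so this stays deterministic in tests.
function userAgeDays(createdAt, now = Date.now()) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor((now - new Date(createdAt).getTime()) / msPerDay);
}
```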
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2147483647&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;user@example.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The primary ID for a user. If this is an email and you&apos;re using a vendor, consider your org&apos;s policy on putting PII in external services.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;free&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;premium&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;enterprise&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;vip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How does the business see this type of user? Individual accounts are sometimes responsible for 10%+ of a business&apos; income. Make sure you can separate their traffic from others!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.auth_method&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;token&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;basic-auth&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;jwt&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;sso-github&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How did this user authenticate into your system?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.team.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5387&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;web-services&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If you have a team construct, which one does this user belong to?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.org.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;278&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;enterprise-name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If this user is part of an organization with an enterprise contract, track that!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.age_days&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;637&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Not the user&apos;s literal age, but how long ago was this account created? Is this an issue experienced by someone new to your app, or only once they&apos;ve saved a lot of data?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.assumed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Have an internal way of assuming a user&apos;s identity for debugging? Be sure to track this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.assumed_by&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;engineer-3@your-company.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;And track which actual user is assuming the user&apos;s identity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  P99(duration_ms)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
GROUP BY user.type
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rate limits&lt;/h3&gt;
&lt;p&gt;Whatever your rate limiting strategy, make sure the current rate limit info gets added too. Can you quickly find
examples of users that are being rate-limited by your service?&lt;/p&gt;
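A sketch of the attributes below, assuming your limiter exposes the current window&apos;s numbers:

```javascript
// Record the rate limit state for this request. `resetAt` is an ISO
// timestamp if your limiter uses fixed windows; omit it otherwise.
function rateLimitAttributes(limit, used, resetAt) {
  return {
    "ratelimit.limit": limit,
    "ratelimit.used": used,
    "ratelimit.remaining": limit - used,
    "ratelimit.reset_at": resetAt,
  };
}
```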
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ratelimit.limit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;You might not yet, but you will likely have users with different rate limits in the future. Note down what the actual limit is for this request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ratelimit.remaining&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;130000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What is the budget remaining for this user?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ratelimit.used&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;70000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many requests have been used in the current rate window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ratelimit.reset_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2024-10-14T19:47:38Z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When will the rate limit be reset next, if applicable?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;This user has a support ticket open about being rate-limited. Let&apos;s see what they were doing&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT(*)
WHERE
  main = true AND
  service.name = &quot;api-service&quot; AND
  user.id = 5838
GROUP BY http.route
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./rate-limit-investigation-annotated.png&quot; alt=&quot;A graph of one users activity. There is a big spike hitting the same route a lot at the end. This gives us a starting point for investigation&quot; /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What routes are users who have burned most of their rate limit hitting? Does this activity look suspicious?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT(*)
WHERE
  main = true AND
  service.name = &quot;api-service&quot; AND
  ratelimit.remaining &amp;lt; 100
GROUP BY http.route
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Caching&lt;/h3&gt;
&lt;p&gt;For every code path where we could take a shortcut with a cached response, record whether or not the cache hit.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cache.session_info&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Was the session info cached or did it need to be re-fetched?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cache.feature_flags&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt; &amp;lt;br&amp;gt; &lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Were the feature flags cached for this user or did they need to be re-fetched?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Localization info&lt;/h3&gt;
&lt;p&gt;What localization options has the user chosen? This can be a frequent source of bugs.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.language_dir&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rtl&lt;/code&gt;, &lt;code&gt;ltr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which direction is text laid out in their language?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.country&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mexico&lt;/code&gt;, &lt;code&gt;uk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which country are they from?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.currency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;USD&lt;/code&gt;, &lt;code&gt;CAD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which currency have they chosen to work with?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Uptime&lt;/h3&gt;
&lt;p&gt;Tracking how long the service has been running when it serves a request can help you visualize several classes of bugs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Issues that show up on a reboot&lt;/li&gt;
&lt;li&gt;Memory leaks that only start to show up when the service has been running for a long time&lt;/li&gt;
&lt;li&gt;Frequent crashes / restarts if you automatically restart the service on failure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also recommend adding the &lt;code&gt;log10&lt;/code&gt; of the uptime, or having some other way of visualizing it on a log scale. When graphed, this emphasizes
the important first few minutes of a service&apos;s life without being squished into the bottom of the graph by instances with several days
or more of uptime.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;uptime_sec&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1533&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How long has this instance of your app been running? Visualizing this makes restarts easy to spot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;uptime_sec_log_10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3.185&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Grows sub-linearly which allows you to visualize long-running services and brand new ones on the same graph&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  HEATMAP(uptime_sec),
  HEATMAP(uptime_sec_log_10)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./uptime.png&quot; alt=&quot;Heatmaps of uptime when a service enters a crash loop. It&apos;s far easier to distinguish in log scale&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Metrics&lt;/h3&gt;
&lt;p&gt;This one might be a bit controversial, but I&apos;ve found it helpful to tag spans with context about what the system
was experiencing while processing the request. We fetch this information every ~10 seconds, cache it, and add it
to every main span produced during that time.&lt;/p&gt;
&lt;p&gt;Capturing metrics in this way is not mathematically sound. Since you only get data when traffic is flowing, you
can&apos;t calculate a &lt;code&gt;P90&lt;/code&gt; for cpu load that would stand up to any rigorous scrutiny, but that&apos;s actually fine
in practice. It&apos;s close enough to get some quick signal while you&apos;re debugging without switching to a different tool,
especially if you can avoid calculations and visualize with a heatmap.&lt;/p&gt;
&lt;p&gt;I wouldn&apos;t recommend setting alerts on this data though. Plain ol&apos; metrics are great for that.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://jessitron.com/&quot;&gt;Jessica Kerr&lt;/a&gt; recently wrote about this approach on the &lt;a href=&quot;https://www.honeycomb.io/blog/get-infinite-custom-metrics-for-free&quot;&gt;Honeycomb Blog&lt;/a&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.memory_mb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;153&lt;/code&gt; &lt;br&gt; &lt;code&gt;2593&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How much memory is being used by the system at the time it is servicing this request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.cpu_load&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.57&lt;/code&gt; &lt;br&gt; &lt;code&gt;5.89&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU load of the system servicing this request, given as the number of active cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.gc_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;5390&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Last observed number of garbage collections. Could be cumulative (total since service started) or delta (ex: number in the last minute)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.gc_pause_time_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;14&lt;/code&gt; &lt;br&gt; &lt;code&gt;325&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time spent in garbage collections. Could also be cumulative or delta. Pick one and document which&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.go_routines_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt; &lt;br&gt; &lt;code&gt;3000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of goroutines running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.event_loop_latency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0&lt;/code&gt; &lt;br&gt; &lt;code&gt;340&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cumulative time spent waiting on the next event loop tick. An important metric for Node apps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Are these requests getting slow because we&apos;re running out of memory or CPU?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  HEATMAP(duration_ms),
  HEATMAP(metrics.memory_mb),
  HEATMAP(metrics.cpu_load)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
GROUP BY instance.id
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./metrics.jpg&quot; alt=&quot;An example showing using the metrics data tagged on the span to get context for whats happening with the system&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Async request summaries&lt;/h3&gt;
&lt;p&gt;When using a tracing system async requests should get their own spans, but it can still be useful to roll up
some stats to identify outliers and quickly find interesting traces.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.http_requests_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt; &lt;br&gt; &lt;code&gt;140&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many http requests were triggered during the processing of this request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.http_requests_duration_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;849&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cumulative time spent in these http requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.postgres_query_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;7&lt;/code&gt; &lt;br&gt; &lt;code&gt;742&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many Postgres queries were triggered during the processing of this request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.postgres_query_duration_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1254&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cumulative time spent in these Postgres queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.redis_query_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt; &lt;br&gt; &lt;code&gt;240&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many redis queries were triggered during the processing of this request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.redis_query_duration_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;43&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cumulative time spent in these redis queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.twilio_calls_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt; &lt;br&gt; &lt;code&gt;4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many calls to this vendor&apos;s API were triggered during the processing of this request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats.twilio_calls_duration_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2153&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cumulative time spent in these vendor calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Surely my service makes a reasonable number of calls to the database... right?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  HEATMAP(stats.postgres_query_count)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;./postgres-queries.png&quot; alt=&quot;A heatmap of db queries per request. There is a bi-modal distribution but also some outliers that make a lot of requests&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instead of adding this explicitly, couldn&apos;t we aggregate this by querying the whole trace?&lt;/strong&gt; See &lt;a href=&quot;#frequent-objections&quot;&gt;Frequent Objections&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Sampling&lt;/h3&gt;
&lt;p&gt;Once you start collecting fine-grained telemetry from your systems at a significant scale you run head-on
into the problem of sampling. Running systems can produce a lot of data! Engineers frequently want to store
and query all of it. Exact answers always! Make it fast! Also cheap! But it&apos;s trade-offs all the way down.
Telemetry data is fundamentally different from the transaction data you&apos;re storing for your users, and you
should think about it differently.&lt;/p&gt;
&lt;p&gt;Luckily you only really need a statistically significant subset of the full dataset. Even sampling 1 out of every
1000 requests can provide a surprisingly detailed picture of the overall traffic patterns in a system.&lt;/p&gt;
&lt;p&gt;Sampling is a surprisingly deep topic. Keep it simple if you&apos;re starting out and use uniform random head sampling,
but track your sample rate per-span so you&apos;re ready for more sophisticated approaches down the line.&lt;/p&gt;
&lt;p&gt;Good tooling will weight your calculations by the per-span sample rate, so you don&apos;t have to mentally multiply the &lt;code&gt;COUNT&lt;/code&gt;
call by the &lt;code&gt;sample_rate&lt;/code&gt; to get an accurate answer. Here are some relevant articles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://research.facebook.com/file/2964294030497318/scuba-diving-into-data-at-facebook.pdf&quot;&gt;I was first introduced to this idea in the Scuba paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.honeycomb.io/manage-data-volume/sample/sampled-data-in-honeycomb/&quot;&gt;Honeycomb supports per-event sample rates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.cloudflare.com/explaining-cloudflares-abr-analytics/&quot;&gt;Cloudflare&apos;s Analytics Engine will automatically sample for you based on volume&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sample_rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt; &lt;br&gt; &lt;code&gt;500&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;N&lt;/code&gt; where 1 in &lt;code&gt;N&lt;/code&gt; events will be sampled and stored and the rest dropped. If you&apos;re sampling &lt;code&gt;1%&lt;/code&gt; of requests, the &lt;code&gt;sample_rate&lt;/code&gt; would be &lt;code&gt;100&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
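&lt;p&gt;To make that weighting concrete (a hypothetical example, not tied to any particular vendor): if your store holds 120 events that each carry &lt;code&gt;sample_rate = 100&lt;/code&gt;, a weighted count should report roughly 12,000 requests. In pseudocode, the tooling is effectively computing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;estimated_count = SUM(sample_rate)   -- not COUNT(*)
&lt;/code&gt;&lt;/pre&gt;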
&lt;h3&gt;Timings&lt;/h3&gt;
&lt;p&gt;I find it super useful to break up the work that gets done to respond to a request into a handful of important chunks
and track how long each segment took on the main span.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wait, isn&apos;t that what child spans are for?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Wrapping absolutely everything in its own span is the most common failure mode I see when engineers first get access
to tracing tools. You have to design the structure of your data for the way you want to query it.&lt;/p&gt;
&lt;p&gt;Child spans are helpful for waterfall visualization for a single request, but can be difficult to query and visualize
across &lt;em&gt;all&lt;/em&gt; of your requests. Putting that information on a single span makes it easier to query and also helps with
tools like &lt;a href=&quot;https://www.honeycomb.io/bubbleup&quot;&gt;Honeycomb&apos;s BubbleUp&lt;/a&gt;, which can then immediately tell you that
a group of requests was slow because authentication took 10 seconds for some reason.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;auth.duration_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;52.2&lt;/code&gt; &lt;br&gt; &lt;code&gt;0.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How long did we spend performing authentication during this request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload_parse.duration_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;22.1&lt;/code&gt; &lt;br&gt; &lt;code&gt;0.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Identify the core workloads of the service and add timings for them&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
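&lt;p&gt;Following the pattern from the other sections, a heatmap query over these timing attributes (names from the table above) shows at a glance which segment is eating the time:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Which part of handling the request got slower after the last deploy?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  HEATMAP(duration_ms),
  HEATMAP(auth.duration_ms),
  HEATMAP(payload_parse.duration_ms)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
&lt;/code&gt;&lt;/pre&gt;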
&lt;h3&gt;Errors&lt;/h3&gt;
&lt;p&gt;If you encounter an error and need to fail the operation, tag the span with the error information: type, stacktrace, etc.&lt;/p&gt;
&lt;p&gt;One approach that I have found super-valuable is tagging each location where we throw an error with a unique slug describing
the error. If this string is unique within your codebase, it is easily found with a quick search. This allows someone
to jump straight from a spike in errors on a dashboard to the exact line of code that is throwing the error. It also provides
a convenient low-cardinality field to &lt;code&gt;GROUP BY&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You&apos;re unlikely to be able to wrap all possible errors, but any time a failed request doesn&apos;t have an &lt;code&gt;exception.slug&lt;/code&gt;
that is a good sign that you have places in your code where your error handling could be improved. It&apos;s now really easy
to find examples of requests that failed in ways you didn&apos;t anticipate.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if isNotRecoverable(err) {
  // note the use of a plain string, not a variable, not dynamically generated
  // consider enforcing this with custom lint rules
  setErrorAttributes(err, &quot;err-stripe-call-failed-exhausted-retries&quot;);
  throw err;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt; &lt;br&gt; &lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Special field for whether the request failed or not&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exception.message&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Can&apos;t convert &apos;int&apos; object to str&lt;/code&gt; &lt;br&gt; &lt;code&gt;undefined is not a function&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The exception message encoded in the exception&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exception.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;IOError&lt;/code&gt; &lt;br&gt; &lt;code&gt;java.net.ConnectException&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The programmatic type of the exception&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exception.stacktrace&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReferenceError: user is not defined&lt;/code&gt;&lt;br&gt;&lt;code&gt;at myFunction (/path/to/file.js:12:2)&lt;/code&gt;&lt;br&gt;&lt;code&gt;...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Capture the stack trace if it&apos;s available to help pinpoint where the error is being thrown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exception.expected&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;, &lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is this an expected exception like a bot trying to hit a url that doesn&apos;t exist? Allows filtering out of exceptions we can&apos;t prevent but don&apos;t need to worry about&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exception.slug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;auth-error&lt;/code&gt; &lt;br&gt; &lt;code&gt;invalid-route&lt;/code&gt; &lt;br&gt; &lt;code&gt;github-api-unavailable&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Create a unique, greppable slug to identify the code location of an error if it&apos;s predictable at development time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Which errors did our enterprise users hit the most last week?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT_DISTINCT(user.id)
WHERE
  main = true AND
  service.name = &quot;api-service&quot; AND
  user.type = &quot;enterprise&quot;
GROUP BY exception.slug
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Show me traces where we likely need to improve our error handling&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  trace.trace_id
WHERE
  main = true AND
  service.name = &quot;api-service&quot; AND
  error = true AND
  exception.slug IS NULL
GROUP BY trace.trace_id
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Feature flags&lt;/h3&gt;
&lt;p&gt;Fine-grained feature flags are a developer super power that allows you to test code changes in production with only a fraction of
your users or traffic. Adding the flag information per-request allows you to compare how the new code is working as you opt more
of your traffic into the new code path. Coupled with the broad visibility you get from wide events, this can make even tricky
migrations vastly more manageable and allow you to ship code with confidence.&lt;/p&gt;
&lt;p&gt;Note that semantic conventions differ here and suggest adding feature flag information as
&lt;a href=&quot;https://opentelemetry.io/docs/specs/semconv/feature-flags/feature-flags-spans/&quot;&gt;events on the span&lt;/a&gt;.
I would suggest following that standard, since it will ultimately have the best support from vendors if it&apos;s moved to stable,
but in the meantime I&apos;m also putting this info on the main span.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;feature_flag.auth_v2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt; &lt;br&gt; &lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The value of a particular feature flag for this request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;feature_flag.double_write_to_new_db&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt; &lt;br&gt; &lt;code&gt;false&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The value of a particular feature flag for this request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;What errors are the users in the new authentication flows hitting? How does it compare to the control group?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
GROUP BY feature_flag.auth_v2, exception.slug
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Versions of important things&lt;/h3&gt;
&lt;p&gt;Runtimes, frameworks, and any major libraries you are using can be really helpful context.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;go.version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;go1.23.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What version of your language runtime are you using?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rails.version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;7.2.1.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pick out any core libraries like web frameworks and track their version too&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;postgres.version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If you can add the versions of any datastores you&apos;re using, even better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;A security issue with Rails just got announced. What versions of the framework are our services using?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT_DISTINCT(service.name)
WHERE
  service.environment = &quot;production&quot;
GROUP BY rails.version
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Our memory usage seems higher than it used to be. Didn&apos;t we upgrade the runtime recently? Does that correlate?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  HEATMAP(metrics.memory_mb)
WHERE
  main = true AND
  service.name = &quot;api-service&quot;
GROUP BY go.version
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Your specific application&lt;/h3&gt;
&lt;p&gt;Now we go off the map and get to the really valuable stuff. Your app likely does something unique or works in a particular
domain. You might need to &lt;em&gt;really&lt;/em&gt; care about which professional credentials a Dentist using your app has, or which
particular storage warehouse a package is in, or which chip is in the embedded tracking device installed in the cat that
your app exists to track.&lt;/p&gt;
&lt;p&gt;No framework is going to understand which parts of your domain are important to track and automate this for you;
you have to do that yourself.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;asset_upload.s3_bucket_path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://bucket-name/path/to/asset.jpg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If you upload something, add context about where&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;email_vendor.transaction_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;62449c60-b51e-4d5c-8464-49217d91c441&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If you interact with a vendor, track whatever transaction ID they give you in case you need to follow up with them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vcs_integration.vendor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;github&lt;/code&gt; &lt;br&gt; &lt;code&gt;gitlab&lt;/code&gt; &lt;br&gt; &lt;code&gt;bitbucket&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If there are 3-4 types that something might fall into, be sure to add that context. Ex: If 2% of requests start failing because bitbucket is experiencing issues, this will help identify the source of the issue immediately.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;process_submission.queue_length&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;153&lt;/code&gt; &lt;br&gt; &lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any time you interact with a queue, see if you can get the current length during submission&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
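&lt;p&gt;The same query patterns apply to these domain-specific attributes. For example, using the hypothetical &lt;code&gt;vcs_integration.vendor&lt;/code&gt; attribute from the table above:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Is this spike in failures isolated to one of our VCS integrations?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;SELECT
  COUNT
WHERE
  main = true AND
  service.name = &quot;api-service&quot; AND
  error = true
GROUP BY vcs_integration.vendor
&lt;/code&gt;&lt;/pre&gt;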
&lt;h2&gt;Things to note&lt;/h2&gt;
&lt;h3&gt;You should probably add the thing&lt;/h3&gt;
&lt;p&gt;If you find yourself asking &quot;Am I ever really going to need this bit of data?&quot;, default to throwing the attribute on.
The marginal cost of each extra attribute is very small. If the data volume does start to grow, prefer wider,
more context-rich events and a higher sample rate vs smaller events with a lower sample rate.&lt;/p&gt;
&lt;h3&gt;Heatmaps are your friend&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://www.honeycomb.io/blog/heatmaps-are-the-new-hotness&quot;&gt;Honeycomb&apos;s heatmaps&lt;/a&gt; are amazing at helping you find outliers, seeing multi-modal distributions,
and getting a feel for your data. I wish more tooling supported them. I am not sure I can build software without them any more.&lt;/p&gt;
&lt;h3&gt;Embrace the feedback loop&lt;/h3&gt;
&lt;p&gt;When you are modifying code, make a change to the telemetry so that you can see the impact of the new code running. Once
the code is released, check to make sure that you see the outcome you expected. Don&apos;t hesitate to add specific fields
for one release and then remove them after.&lt;/p&gt;
&lt;p&gt;Tighter feedback loops are like going faster on a bicycle. They make for more stable systems and let you move faster
with confidence.&lt;/p&gt;
&lt;h3&gt;Semantic conventions and naming consistency&lt;/h3&gt;
&lt;p&gt;I&apos;ve tried to embrace &lt;a href=&quot;https://opentelemetry.io/docs/specs/semconv/general/trace/&quot;&gt;semantic conventions&lt;/a&gt; in my naming, but would not
be surprised if I&apos;ve made multiple errors. Naming is hard!&lt;/p&gt;
&lt;p&gt;It&apos;s also hard to get consistency right within an organization or even across multiple systems owned by the same team. I would recommend
trying to use semantic conventions as a guide, but do prioritize getting data out of your system in some form and getting some early wins
over exacting adherence to an evolving specification. Once this data has proven its value within your organization, then you will have the
leverage to spend engineering cycles on making things consistent.&lt;/p&gt;
&lt;p&gt;In the long run semantic conventions should allow Observability vendors to build new value and understanding on top of the telemetry you emit,
but this effort is only just getting started.&lt;/p&gt;
&lt;h2&gt;Frequent Objections&lt;/h2&gt;
&lt;h3&gt;Does this really work??&lt;/h3&gt;
&lt;p&gt;I have done this for dozens of production systems. Every single time the data has been invaluable for digging in and understanding what the
system is &lt;em&gt;actually&lt;/em&gt; doing, and we&apos;ve found something surprising, even for the engineers who had worked on the system for many years.&lt;/p&gt;
&lt;p&gt;Things like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Oh, actually 90% of the traffic of this system comes from one user&lt;/li&gt;
&lt;li&gt;Wait, one of our worker processes is actually running a month old version of the code somehow?&lt;/li&gt;
&lt;li&gt;This API endpoint usually has payloads of 1-2kb, but there is an edge case affecting one user where it&apos;s 40+MB. This causes their page loads to be &lt;strong&gt;several minutes longer than the p99&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;After instrumenting the authentication middleware, around 20% of requests still didn&apos;t have user info. There was a whole second authentication system for a different class of users that hadn&apos;t been touched in years.&lt;/li&gt;
&lt;li&gt;This endpoint that we&apos;d like to deprecate accepts data in the form of &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt;, but none of our traffic ever even uses &lt;code&gt;C&lt;/code&gt;. We can just drop support for that now.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;I don&apos;t like it. This feels wrong&lt;/h3&gt;
&lt;p&gt;For anyone feeling that way now, I ask you to &lt;a href=&quot;https://signalvnoise.com/posts/3124-give-it-five-minutes&quot;&gt;give it five minutes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I find that when a log line wraps around the terminal window multiple times, most developers have a pretty visceral negative reaction.&lt;/p&gt;
&lt;p&gt;This &lt;em&gt;feels&lt;/em&gt; right:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[2024-09-18 22:48:32.990] Request started http_path=/v1/charges request_id=req_123
[2024-09-18 22:48:32.991] User authenticated auth_type=api_key key_id=mk_123 user_id=usr_123
[2024-09-18 22:48:32.992] Rate limiting ran rate_allowed=true rate_quota=100 rate_remaining=99
[2024-09-18 22:48:32.998] Charge created charge_id=ch_123 permissions_used=account_write request_id=req_123
[2024-09-18 22:48:32.999] Request finished http_status=200 request_id=req_123
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But this &lt;em&gt;feels&lt;/em&gt; wrong:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[2024-10-20T14:43:36.851Z] duration_ms=1266.1819686777117 main=true http.ip_address=92.21.101.252 instance.id=api-1 instance.memory_mb=12336
instance.cpu_count=4 instance.type=t3.small http.request.method=GET http.request.path=/api/categories/substantia-trado
http.route=/api/categories/:slug http.request.body.size=293364 http.request.header.content_type=application/xml
user_agent.original=&quot;Mozilla/5.0 (X11; Linux i686 AppleWebKit/535.1.2 (KHTML, like Gecko) Chrome/39.0.826.0 Safari/535.1.2&quot; user_agent.device=phone
user_agent.os=Windows user_agent.browser=Edge user_agent.browser_version=3.0 url.scheme=https url.host=api-service.com service.name=api-service
service.version=1.0.0 build.id=1234567890 go.version=go1.23.2 rails.version=7.2.1.1 service.environment=production service.team=api-team
service.slack_channel=#api-alerts service.build.deployment.at=2024-10-14T19:47:38Z
service.build.diff_url=https://github.com/your-company/api-service/compare/c9d9380..05e5736
service.build.pull_request_url=https://github.com/your-company/api-service/pull/123
service.build.git_hash=05e5736 service.build.deployment.user=keanu.reeves@your-company.com
service.build.deployment.trigger=manual container.id=1234567890 container.name=api-service-1234567890 cloud.availability_zone=us-east-1
cloud.region=us-east-1 k8s.pod.name=api-service-1234567890 k8s.cluster.name=api-service-cluster feature_flag.auth_v2=true
http.response.status_code=401 user.id=Samanta27@gmail.com user.type=vip user.auth_method=sso-google user.team_id=team-1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;You are structuring data so that it can be read efficiently by machines, not humans.&lt;/strong&gt; Our systems emit too much data to waste
precious human lifetimes using our eyeballs to scan lines of text looking for patterns to jump out. Let the robots help.&lt;/p&gt;
&lt;h3&gt;This seems like a lot of work&lt;/h3&gt;
&lt;p&gt;If you want to implement everything I&apos;ve talked about in this post that would be a &lt;em&gt;ton&lt;/em&gt; of work. However, even implementing
the easiest subset is going to provide a lot of value. Not doing this results in so much &lt;em&gt;more work&lt;/em&gt; building a mental model
of your system, trying to debug by thinking through the code and hoping your mental model matches reality.&lt;/p&gt;
&lt;p&gt;A lot of this logic can be put into shared libraries within your org, though getting them adopted, keeping them updated and
in-sync, and getting engineers used to these tools presents a whole different set of challenges.&lt;/p&gt;
&lt;p&gt;Many of these things could be surfaced to you by opinionated platforms or frameworks. I would love to see things move in
this direction.&lt;/p&gt;
&lt;h3&gt;Isn&apos;t this a lot of data? Won&apos;t it cost a lot??&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://news.ycombinator.com/item?id=39531022&quot;&gt;&lt;img src=&quot;./hn-comment-on-cost.png&quot; alt=&quot;Hacker News comment: This isn&apos;t an unknown idea outside of Meta, it&apos;s just really expensive, especially if you&apos;re using a vendor and not building your own tooling. Prohibitively so, even with sampling.&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;First, you should compare this to your current log volume per request. I have seen many systems where this
approach would &lt;em&gt;reduce&lt;/em&gt; overall log volume.&lt;/p&gt;
&lt;p&gt;However storing this data for every request against your system could be too expensive at scale. That&apos;s
where sampling comes in. Sampling gives you the controls to determine what you want to spend vs the value
you receive from storing and making that data available to query.&lt;/p&gt;
&lt;p&gt;Realtime OLAP systems are also getting cheaper all the time. Once upon a time Scuba held all data in memory
to make these types of questions quick to answer. Now most OLAP systems are evolving to columnar files stored
on cloud object storage, with queries handled by ephemeral compute, which is many orders of magnitude cheaper.&lt;/p&gt;
&lt;p&gt;In the next section I&apos;ll show just how much cheaper.&lt;/p&gt;
&lt;h3&gt;Repeated data&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;Many of these fields will be the same for every request. Isn&apos;t that really inefficient?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is where our intuitions can lie to us. Let&apos;s look at a concrete example.&lt;/p&gt;
&lt;p&gt;I &lt;a href=&quot;https://github.com/jmorrell/a-practitioners-guide-to-wide-events/blob/main/column-storage-compression/index.js&quot;&gt;wrote a script&lt;/a&gt;
&lt;a href=&quot;#footnote-1&quot; id=&quot;footnote-ref-1&quot; data-footnote-ref=&quot;true&quot; aria-describedby=&quot;footnote-label&quot;&gt;&lt;code&gt;[1]&lt;/code&gt;&lt;/a&gt;
to generate a newline-delimited JSON file with a lot of the above fields and at least somewhat reasonable fake values.&lt;/p&gt;
&lt;p&gt;Let&apos;s say our service is serving &lt;code&gt;1000&lt;/code&gt; req/s all day and sampling 1% of that traffic. That works out to roughly a
million events a day. Generating a million example wide events results in a &lt;code&gt;1.6GB&lt;/code&gt; file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http_logs.ndjson     1607.61 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But we repeat the keys on every single line. Even just turning it into a CSV cuts the size by
more than 50%.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http_logs.csv         674.72 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Gzipping the file shows an amazing amount of compression, hinting that this isn&apos;t
as much data as we might think.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http_logs.ndjson.gz   101.67 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Column store formats like &lt;code&gt;parquet&lt;/code&gt; and Duckdb&apos;s native format can do even better.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http_logs.parquet      88.83 MB
http_logs.duckdb       80.01 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They store all of the data for a specific column contiguously, which lends itself to different
compression approaches. In the simplest case, if the column is always the same value, it can store that fact only once.
Values that are the same across an entire &lt;a href=&quot;https://cloudsqale.com/2020/05/29/how-parquet-files-are-written-row-groups-pages-required-memory-and-flush-operations/&quot;&gt;row group&lt;/a&gt;
are incredibly cheap.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./constant.png&quot; alt=&quot;DuckDB diagram showing how a constant value along a whole column gets compressed&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If there are 2-3 different values, it can use dictionary-encoding to bit-pack these values really tightly. This also
speeds up queries against this column.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./dictionary.png&quot; alt=&quot;DuckDB diagram showing how values get compressed using dictionary encoding&quot; /&gt;&lt;/p&gt;
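&lt;p&gt;To make the idea concrete, here&apos;s a toy sketch of dictionary encoding (real engines bit-pack the indexes; this just shows the shape of the transformation):&lt;/p&gt;

```javascript
// Store each distinct value once, then represent the column as small
// integer indexes into that dictionary.
function dictionaryEncode(column) {
  let dictionary = [];
  let indexes = column.map((value) => {
    let i = dictionary.indexOf(value);
    if (i === -1) {
      dictionary.push(value);
      i = dictionary.length - 1;
    }
    return i;
  });
  return { dictionary, indexes };
}

let encoded = dictionaryEncode(["GET", "GET", "POST", "GET", "POST"]);
// encoded.dictionary is ["GET", "POST"]; encoded.indexes is [0, 0, 1, 0, 1]
```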
&lt;p&gt;&lt;a href=&quot;https://duckdb.org/2022/10/28/lightweight-compression.html&quot;&gt;DuckDB has a great writeup on this&lt;/a&gt; which goes into much more detail.
All of the data remains available and is easily (and quickly!) queryable.&lt;/p&gt;
&lt;p&gt;This is hardly &quot;big data&quot;. Storing this on &lt;a href=&quot;https://developers.cloudflare.com/r2/pricing/&quot;&gt;Cloudflare&apos;s R2&lt;/a&gt; for a month would cost &lt;code&gt;US$ 0.0012&lt;/code&gt;.
You could keep 60 days of retention for &lt;code&gt;US$ 0.072&lt;/code&gt; / month.&lt;/p&gt;
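&lt;p&gt;The arithmetic behind those numbers, assuming R2&apos;s published storage price of US$ 0.015 per GB-month:&lt;/p&gt;

```javascript
// One day of sampled events compressed into DuckDB's format is ~0.08 GB
let dayFileGB = 0.08;
let pricePerGBMonth = 0.015; // R2 storage price in USD

let oneDayStoredForAMonth = dayFileGB * pricePerGBMonth; // ≈ $0.0012
let sixtyDaysRetention = 60 * dayFileGB * pricePerGBMonth; // ≈ $0.072 / month
```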
&lt;pre&gt;&lt;code&gt;❯ duckdb http_logs.duckdb
D SELECT COUNT(*) FROM http_logs;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      1000000 │
└──────────────┘
Run Time (s): real 0.002 user 0.002350 sys 0.000946
D SELECT SUM(duration_ms) FROM http_logs;
┌───────────────────┐
│ sum(duration_ms)  │
│      double       │
├───────────────────┤
│ 999938387.7714149 │
└───────────────────┘
Run Time (s): real 0.003 user 0.008020 sys 0.000415
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are even &lt;a href=&quot;https://arrow.apache.org/&quot;&gt;in-memory&lt;/a&gt; and &lt;a href=&quot;https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/&quot;&gt;transport formats&lt;/a&gt; to help
reduce size in memory and on the wire. &lt;a href=&quot;https://opentelemetry.io/blog/2023/otel-arrow&quot;&gt;OpenTelemetry is adopting arrow for its payloads&lt;/a&gt; for this reason.&lt;/p&gt;
&lt;p&gt;I found &lt;a href=&quot;https://www.youtube.com/watch?v=dlO1cKnfWAI&quot;&gt;this podcast on the FDAP stack particularly helpful in understanding this space&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Couldn&apos;t we &lt;code&gt;JOIN&lt;/code&gt; data from multiple spans together to get this information? Query the whole trace at once?&lt;/h3&gt;
&lt;p&gt;This is certainly possible. &lt;a href=&quot;https://docs.honeycomb.io/investigate/query/build/#clauses&quot;&gt;Honeycomb has started allowing you to filter on fields on other spans in the same trace&lt;/a&gt;.
However I&apos;d qualify this as very advanced. You want to make the right thing the easiest thing, and if you make it
harder to ask questions, people will simply ask fewer questions. There are already a million things competing for our
attention. Keep it simple. Make it fast.&lt;/p&gt;
&lt;h3&gt;Does this mean I don&apos;t need metrics?&lt;/h3&gt;
&lt;p&gt;You should probably still generate high-level metrics, though I bet you will need far fewer.&lt;/p&gt;
&lt;p&gt;Metrics are great when you know you want an exact answer to a very specific question that you know ahead of time.
Questions like &quot;How many requests did I serve yesterday?&quot; or &quot;What was my CPU usage like last month?&quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href=&quot;#footnote-ref-1&quot; id=&quot;footnote-1&quot;&gt;&lt;code&gt;[1]&lt;/code&gt;&lt;/a&gt; Well... mostly &lt;a href=&quot;https://www.cursor.com/&quot;&gt;Cursor&lt;/a&gt; wrote it&lt;/p&gt;
</content:encoded></item><item><title>OpenTelemetry Tracing in 200 lines of code</title><link>https://jeremymorrell.dev/blog/minimal-js-tracing/</link><guid isPermaLink="true">https://jeremymorrell.dev/blog/minimal-js-tracing/</guid><description>Distributed Tracing is scary and complicated... right?</description><pubDate>Wed, 11 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Developers tend to treat tracing as deep magic, and OpenTelemetry is no exception. OpenTelemetry may be even
&lt;em&gt;more&lt;/em&gt; mysterious given how many concepts you are exposed to even with
&lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-js/blob/9c30124e764e08bd6ccf8dbfbe426a8531c20352/examples/basic-tracer-node/index.js&quot;&gt;beginning examples&lt;/a&gt;.
&lt;a href=&quot;#footnote-1&quot; id=&quot;footnote-ref-1&quot; data-footnote-ref=&quot;true&quot; aria-describedby=&quot;footnote-label&quot;&gt;&lt;code&gt;[1]&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It also doesn&apos;t help that as part of building a mature, battle-tested tracing library, the code itself tends to become
more and more cryptic over time, contorting itself to avoid edge-cases, work across many environments, and optimize
code paths to avoid impacting performance of its host applications. &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-js-contrib/blob/12adb4354f09ade438cd96340bdfd1f715b5fed3/plugins/node/opentelemetry-instrumentation-express/src/instrumentation.ts#L153-L327&quot;&gt;Auto-instrumentation is particularly prone to
inscrutability&lt;/a&gt; as it seeks to auto-magically wrap code that may never have been intended to be wrapped or extended.&lt;/p&gt;
&lt;p&gt;It&apos;s no wonder then that most developers approach tracing libraries as unknowable black boxes. We add them to our
applications, cross our fingers, and hope they give us useful information when the pager goes off at 2am.&lt;/p&gt;
&lt;p&gt;They are likely a lot simpler than you expect! Once you peel back the layers, I find a useful mental model of tracing
looks like &quot;fancy logging&quot; combined with &quot;context propagation&quot; a.k.a &quot;passing some IDs around&quot;.&lt;/p&gt;
&lt;h3&gt;Logs&lt;/h3&gt;
&lt;p&gt;Developers tend to be very comfortable with logs. We start off with &quot;Hello world!&quot;, and they stay with us forever after.
We reach for them, adding &lt;code&gt;console.log(&quot;potato&quot;)&lt;/code&gt; to see if our code is even being run (even though the debugger is
like &lt;em&gt;right there&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Logs are useful! Though hopefully someone comes along and convinces you that your logs should always be structured as sets of key / value pairs.&lt;/p&gt;
&lt;p&gt;&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;If you make one New Year’s resolution related to modernizing your infrastructure and preparing for the brave new world of distributed systems, I suggest this:&lt;br&gt;&lt;br&gt;Structure your logs.&lt;br&gt;&lt;br&gt;You won’t be sorry. You will eventually be so, so glad you did.&lt;br&gt;&lt;br&gt;Just do it.&lt;/p&gt;— Charity Majors (@mipsytipsy) &lt;a href=&quot;https://twitter.com/mipsytipsy/status/951272678345687040?ref_src=twsrc%5Etfw&quot;&gt;January 11, 2018&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;It&apos;s nice to have some consistency in your logs: make sure each one has a consistently-formatted timestamp or includes a field like
&quot;name&quot; so you can easily find them when searching. You probably have found yourself writing a helper function like this in your projects:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;log(&quot;user-authenticated&quot;, { userId, remainingRateLimit });

// ...

function log(name, attributes = {}) {
  console.log(
    JSON.stringify({
      name,
      timestamp: new Date().getTime(),
      ...attributes,
    })
  );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;{ &quot;timestamp&quot;: 1725845013447, &quot;name&quot;: &quot;user-authenticated&quot;, &quot;userId&quot;: &quot;1234&quot;, &quot;remainingRateLimit&quot;: 100 }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You might also have written something to help track how long some sub-task takes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;logTiming(&quot;query-user-info&quot;, () =&amp;gt; {
  db.fetchUserInfo();
});

// ....

function logTiming(name, lambda, attributes = {}) {
  let startTime = new Date().getTime();

  // run some subtask
  lambda();

  let durationMs = new Date().getTime() - startTime;
  log(name, { durationMs, ...attributes });
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;{ &quot;timestamp&quot;: 1725845013447, &quot;name&quot;: &quot;query-user-info&quot;, &quot;durationMs&quot;: 12 }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If so, congrats! You&apos;re halfway to re-inventing trace spans.&lt;/p&gt;
&lt;h2&gt;Spans are ✨fancy✨ log lines&lt;/h2&gt;
&lt;p&gt;A trace is built up of spans. The example below shows a single trace with 4 different spans:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./trace-example.png&quot; alt=&quot;A Honeycomb screenshot of a trace waterfall&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can think of a span as a collection of key / value pairs, much like a log line, with a few required fields.
Spans must have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a name&lt;/li&gt;
&lt;li&gt;a timestamp&lt;/li&gt;
&lt;li&gt;a duration&lt;/li&gt;
&lt;li&gt;a set of IDs: trace ID, span ID, and a parent span ID. More about these later&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All other fields can be added as additional keys and values in an &lt;code&gt;attributes&lt;/code&gt; map.&lt;/p&gt;
&lt;p&gt;In code this might look something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let span = new Span(&quot;api-req&quot;);
let resp = await api(&quot;get-user-limits&quot;);
span.setAttributes({ responseCode: resp.code });
span.end();
console.log(span);

// ...

class Span {
  constructor(name, context = {}, attributes = new Map()) {
    this.startTime = new Date().getTime();
    this.traceID = context.traceID ?? crypto.randomBytes(16).toString(&quot;hex&quot;);
    this.parentSpanID = context.spanID ?? undefined;
    this.spanID = crypto.randomBytes(8).toString(&quot;hex&quot;);
    this.name = name;
    this.attributes = attributes;
  }

  setAttributes(keyValues) {
    for (let [key, value] of Object.entries(keyValues)) {
      this.attributes.set(key, value);
    }
  }

  getContext() {
    return { traceID: this.traceID, spanID: this.spanID };
  }

  end() {
    this.durationMs = new Date().getTime() - this.startTime;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This output would be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Span {
  startTime: 1722436476271,
  traceID: &apos;cfd3fd1ad40f008fea72e06901ff722b&apos;,
  parentSpanID: undefined,
  spanID: &apos;6b65f0c5db08556d&apos;,
  name: &apos;api-req&apos;,
  attributes: Map(1) { &apos;responseCode&apos; =&amp;gt; 200 },
  durationMs: 3903
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which we could format as an equivalent log line:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;startTime&quot;: 1722436476271,
  &quot;traceID&quot;: &quot;cfd3fd1ad40f008fea72e06901ff722b&quot;,
  &quot;spanID&quot;: &quot;6b65f0c5db08556d&quot;,
  &quot;name&quot;: &quot;api-req&quot;,
  &quot;responseCode&quot;: 200,
  &quot;durationMs&quot;: 3903
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Traces are collections of spans&lt;/h2&gt;
&lt;p&gt;If you want to see all of the logs from a particular request, you have likely done something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// generate a request id or inherit one from your platform
let requestID = req.headers[&quot;X-REQUEST-ID&quot;];
// ...
log(&quot;api-request-start&quot;, { requestID });
let resp = await apiRequest();
log(&quot;api-request-end&quot;, { requestID });
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which would allow you to search for a particular request id to see what happened while that particular request was executed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{ &quot;timestamp&quot;: 1722436476271, &quot;requestID&quot;: &quot;1234&quot;, &quot;name&quot;: &quot;fetch-user-permissions&quot; }
{ &quot;timestamp&quot;: 1722436476321, &quot;requestID&quot;: &quot;1234&quot;, &quot;name&quot;: &quot;api-request-start&quot; }
{ &quot;timestamp&quot;: 1722436476345, &quot;requestID&quot;: &quot;1234&quot;, &quot;name&quot;: &quot;api-request-end&quot; }
{ &quot;timestamp&quot;: 1722436476431, &quot;requestID&quot;: &quot;1234&quot;, &quot;name&quot;: &quot;update-db-record&quot; }
{ &quot;timestamp&quot;: 1722436476462, &quot;requestID&quot;: &quot;1234&quot;, &quot;name&quot;: &quot;create-email-job&quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tracing execution like this can get you surprisingly far! But we can do better.&lt;/p&gt;
&lt;p&gt;Trace spans have 3 different IDs that make up the Trace Context. The first two are really simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Span ID&lt;/strong&gt;: a random ID for each span&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trace ID&lt;/strong&gt;: a random ID that multiple spans can share&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The last one is a little more complicated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parent Span ID&lt;/strong&gt;: the Span ID that was the active span when this span was created&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Parent Span ID is what allows a system to create a &lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_acyclic_graph&quot;&gt;DAG&lt;/a&gt;
out of each trace once it has received each individual span. When rendered as a tree this results in
the waterfall view we all know and love, but it&apos;s important to remember that this is only one possible
visualization of the data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./trace-example.png&quot; alt=&quot;A Honeycomb screenshot of a trace waterfall&quot; /&gt;&lt;/p&gt;
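&lt;p&gt;A sketch of what a backend does with those IDs: given a flat list of spans, group them by &lt;code&gt;parentSpanID&lt;/code&gt; and recurse to rebuild the tree (the function name here is illustrative):&lt;/p&gt;

```javascript
// Rebuild a trace tree from a flat list of spans using their parent IDs.
function buildTraceTree(spans) {
  let byParent = new Map();
  for (let span of spans) {
    let key = span.parentSpanID ?? "root";
    if (!byParent.has(key)) byParent.set(key, []);
    byParent.get(key).push(span);
  }
  // attach children recursively, starting from the spans with no parent
  let attach = (span) => ({
    ...span,
    children: (byParent.get(span.spanID) ?? []).map(attach),
  });
  return (byParent.get("root") ?? []).map(attach);
}
```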
&lt;h2&gt;Existing in the context&lt;/h2&gt;
&lt;p&gt;Our context only really needs two values: the trace ID, and the ID of the current span. When
we create a new span, we can inherit the &lt;code&gt;traceID&lt;/code&gt; if one exists, record the current &lt;code&gt;spanID&lt;/code&gt; as our parent,
generate a new &lt;code&gt;spanID&lt;/code&gt;, and set the new context.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let context = {
  traceID: &quot;cfd3fd1ad40f008fea72e06901ff722b&quot;,
  spanID: &quot;6b65f0c5db08556d&quot;,
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We need some way of passing this context around our application. We could do this manually:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let span = new Span(&quot;api-req&quot;, oldContext);
let newContext = span.getContext(); // pass this along to any child operations
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Indeed, this is &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-go/blob/b37e8a9860f03b78baf2c3ca0edcbc6c7f8fd969/example/namedtracer/main.go#L65&quot;&gt;how it is done in the official Go SDK&lt;/a&gt;; however, in most languages
this is done implicitly and handled automatically by the library. In Ruby or Python you can use a thread-local
variable, but in Node you would use &lt;a href=&quot;https://nodejs.org/api/async_context.html#class-asynclocalstorage&quot;&gt;AsyncLocalStorage&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At this point it helps to wrap our span creation in a helper function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { AsyncLocalStorage } from &quot;node:async_hooks&quot;;

let asyncLocalStorage = new AsyncLocalStorage();
// start with an empty context
asyncLocalStorage.enterWith({ traceID: undefined, spanID: undefined });

async function startSpan(name, lambda) {
  let ctx = asyncLocalStorage.getStore();
  let span = new Span(name, ctx, new Map());
  await asyncLocalStorage.run(span.getContext(), lambda, span);
  span.end();
  console.log(span);
}

startSpan(&quot;parent&quot;, async (parentSpan) =&amp;gt; {
  parentSpan.setAttributes({ outerSpan: true });
  startSpan(&quot;child&quot;, async (childSpan) =&amp;gt; {
    childSpan.setAttributes({ outerSpan: false });
  });
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And with that we have the core of our tracing library done! 🎉&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Span {
  startTime: 1725850276756,
  traceID: &apos;b8d002c2f6ae1291e0bd29c9791c9756&apos;,
  parentSpanID: &apos;50f527cbf40230c3&apos;,
  name: &apos;child&apos;,
  attributes: Map(1) { &apos;outerSpan&apos; =&amp;gt; false },
  spanID: &apos;8037a93b6ed25c3a&apos;,
  durationMs: 11.087375000000002
}
Span {
  startTime: 1725850276744,
  traceID: &apos;b8d002c2f6ae1291e0bd29c9791c9756&apos;,
  parentSpanID: undefined,
  name: &apos;parent&apos;,
  attributes: Map(1) { &apos;outerSpan&apos; =&amp;gt; true },
  spanID: &apos;50f527cbf40230c3&apos;,
  durationMs: 26.076625
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our tracing library is perfectly usable even at &lt;a href=&quot;https://github.com/jmorrell/minimal-nodejs-otel-tracer/blob/5ff45161aa2d6c0c6d0a41139803d54a2f88829d/simple_tracer.js&quot;&gt;under 60 LoC&lt;/a&gt;
but it has two big drawbacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We have to manually add it everywhere&lt;/li&gt;
&lt;li&gt;It is restricted to a single system: we have no mechanism for passing the trace
context between any two systems in our larger service.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&apos;s fix that!&lt;/p&gt;
&lt;h2&gt;Going distributed&lt;/h2&gt;
&lt;p&gt;Distributed tracing sounds scary and intimidating, but it generally means that tracing
context can be passed around between systems so that you can track what operation kicked off
what other operation, and that all of this data gets reported to the same place at the end.&lt;/p&gt;
&lt;p&gt;Whenever we make an HTTP call to another system, we need to pass along the Trace ID and the
Current Span ID. We could add these two fields to all of our HTTP payloads manually but
there&apos;s a &lt;a href=&quot;https://www.w3.org/TR/trace-context/&quot;&gt;W3C standard&lt;/a&gt; for how to encode this into
an HTTP header so that it gets sent as metadata for every request. The &lt;code&gt;traceparent&lt;/code&gt; header
looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is a full specification to digest and implement here, but for our purposes today we can ignore
most of this and think of it as conforming to this format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;00-{ Trace ID }-{ Parent Span ID }-01
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which allows us to parse and serialize our trace context with some simple functions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let getTraceParent = (ctx) =&amp;gt; `00-${ctx.traceID}-${ctx.spanID}-01`;

let parseTraceParent = (header) =&amp;gt; ({
  traceID: header.split(&quot;-&quot;)[1],
  spanID: header.split(&quot;-&quot;)[2],
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We need to be sure to add it to our outgoing requests, and parse it on any incoming requests.
Instrumentation helps with that.&lt;/p&gt;
&lt;h2&gt;Wrap stuff in other stuff&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Instrumentation&lt;/em&gt; is a fancy term for &quot;wrap some other code in our code so we can track stuff&quot;. Real
tracing libraries sometimes go to extreme lengths to make wrapping built-in or popular libraries
happen behind the scenes when you configure the library. We&apos;re not going to be doing that.&lt;/p&gt;
&lt;p&gt;Instead we&apos;ll provide some &lt;a href=&quot;https://hono.dev/docs/guides/middleware&quot;&gt;middleware&lt;/a&gt; for the
&lt;a href=&quot;https://hono.dev/&quot;&gt;Hono&lt;/a&gt; web framework that the user can manually add.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async function honoMiddleware(c, next) {
  // check the incoming request for the traceparent header
  // if it exists parse and inherit the trace context
  let context = EMPTY_CONTEXT;
  if (c.req.header(&quot;traceparent&quot;)) {
    context = parseTraceParent(c.req.header(&quot;traceparent&quot;));
  }

  // set the context and wrap the whole req / res operation in a span
  await setContext(context, async () =&amp;gt; {
    await startSpan(`${c.req.method} ${c.req.path}`, async (span) =&amp;gt; {
      // Before we let our app handle the request, let&apos;s pull some info about
      // it off and add it to our trace.
      span.setAttributes({
        &quot;http.request.method&quot;: c.req.method,
        &quot;http.request.path&quot;: c.req.path,
      });

      // let our app handle the request
      await next();

      // Pull information about how our app responded
      span.setAttributes({
        &quot;http.response.status_code&quot;: c.res.status,
      });
    });
  });
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also need to handle making outgoing HTTP calls and making sure we attach the &lt;code&gt;traceparent&lt;/code&gt;
header. There is no built-in concept of middleware for the built-in &lt;code&gt;fetch&lt;/code&gt; command, so instead
we&apos;re going to wrap it like a javascript burrito. We&apos;ll have to do this ourselves, but it&apos;s not
so bad, &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-js/blob/9c30124e764e08bd6ccf8dbfbe426a8531c20352/experimental/packages/opentelemetry-instrumentation-fetch/src/fetch.ts#L304&quot;&gt;especially compared to what doing it for real looks like&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// pass the original function into our wrapping logic
function patchFetch(originalFetch) {
  // return a function with the same signature, but that executes our logic too
  return async function patchedFetch(resource, options = {}) {
    // generate and add the traceparent header
    let ctx = getContext();

    if (!options.headers) {
      options.headers = {};
    }
    options.headers[&quot;traceparent&quot;] = getTraceParent(ctx);

    // run the underlying fetch function, but wrap it in a span and
    // pull out some info while we&apos;re at it
    let resp;
    await startSpan(&quot;fetch&quot;, async (span) =&amp;gt; {
      span.setAttributes({ &quot;http.url&quot;: resource });
      resp = await originalFetch(resource, options);
      span.setAttributes({ &quot;http.response.status_code&quot;: resp.status });
    });

    // pass along fetch&apos;s response. It&apos;s like we were never here
    return resp;
  };
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/jmorrell/minimal-nodejs-otel-tracer/blob/5ff45161aa2d6c0c6d0a41139803d54a2f88829d/index.js&quot;&gt;Here&apos;s&lt;/a&gt; a quick little app
to show off our tracer in action.&lt;/p&gt;
&lt;h2&gt;Let&apos;s send this to Honeycomb&lt;/h2&gt;
&lt;p&gt;We can log out our spans to the terminal, but that&apos;s not a great user experience. Before we look
at OpenTelemetry, I think it&apos;s instructive to look at &lt;a href=&quot;https://docs.honeycomb.io/api/tag/Events#operation/createEvents&quot;&gt;Honeycomb&apos;s Events API&lt;/a&gt;.
Before &lt;a href=&quot;https://honeycomb.io/&quot;&gt;Honeycomb&lt;/a&gt; went all-in on OpenTelemetry they had a much simpler
just-send-us-JSON approach. They no longer recommend it, but we can still use this API today
for our toy project.&lt;/p&gt;
&lt;p&gt;You can see the full exporter code &lt;a href=&quot;https://github.com/jmorrell/minimal-nodejs-otel-tracer/blob/5ff45161aa2d6c0c6d0a41139803d54a2f88829d/honeycomb-exporter.js&quot;&gt;here&lt;/a&gt;,
but the logic for building the payload is the interesting bit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// literally put all of the data together in one big json blob... like a log line!
// and then POST it to their API
function spanToHoneycombJSON(span) {
  return {
    ...Object.fromEntries(globalAttributes),
    ...Object.fromEntries(span.attributes),
    name: span.name,
    trace_id: span.traceID,
    span_id: span.spanID,
    parent_span_id: span.parentSpanID,
    start_time: span.startTime,
    duration_ms: span.durationMs,
  };
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we&apos;re not using a standard format you do have to tell Honeycomb which field maps to Trace ID, Span ID,
etc in the dataset configuration, but this is all the data needed to build up the trace waterfall!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./honeycomb-events-trace.png&quot; alt=&quot;Honeycomb trace waterfall view&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Where&apos;s the OpenTelemetry?&lt;/h2&gt;
&lt;p&gt;So we have our &lt;s&gt;fancy logs&lt;/s&gt; tracing set up, and instrumentation and context propagation
are actually pretty simple, but OpenTelemetry is a big, complicated standard! Have you
read through &lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/trace/sdk/&quot;&gt;the specification&lt;/a&gt;!?&lt;/p&gt;
&lt;p&gt;And... that&apos;s not wrong. OpenTelemetry is a big project, however we only need one small
part of it for our purposes. When you install an OpenTelemetry SDK for your language
it emits data in the &lt;a href=&quot;https://opentelemetry.io/docs/specs/otlp/&quot;&gt;OpenTelemetry Protocol (OTLP)&lt;/a&gt;.
Every OpenTelemetry SDK in every language emits OTLP. The &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-collector&quot;&gt;OpenTelemetry
Collector&lt;/a&gt; is a collection of
tools for receiving OTLP, transforming OTLP, and translating other formats to OTLP.
You could say OTLP is kind of a big deal.&lt;/p&gt;
&lt;p&gt;OTLP has &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-proto/blob/6f589125b0b7d708c9b0f32916378182ac1e123d/opentelemetry/proto/trace/v1/trace.proto#L86&quot;&gt;its own protobuf specification&lt;/a&gt;
so you can efficiently compress telemetry data into a binary message that will be the same
across any platform, OS, or CPU architecture. We could generate a JavaScript module to
parse and emit these protobuf messages from the &lt;code&gt;.proto&lt;/code&gt; files, but that sounds like
too much work.&lt;/p&gt;
&lt;p&gt;Protobuf also defines a &lt;a href=&quot;https://opentelemetry.io/docs/specs/otlp/#json-protobuf-encoding&quot;&gt;JSON mapping&lt;/a&gt; as part
of the spec, and the specification repo has &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-proto/blob/6f589125b0b7d708c9b0f32916378182ac1e123d/examples/trace.json&quot;&gt;a handy example for us to start from&lt;/a&gt;, so let&apos;s keep it simple instead.&lt;/p&gt;
&lt;p&gt;Generating this payload is a bit more complicated than Honeycomb&apos;s old events format. There are some
new terms like &quot;resource&quot; and &quot;scope&quot;, and we have to massage the attributes a bit. However, if you
squint, you can see that it&apos;s all the same data. &lt;a href=&quot;https://github.com/jmorrell/minimal-nodejs-otel-tracer/blob/5ff45161aa2d6c0c6d0a41139803d54a2f88829d/tracer.js#L107-L179&quot;&gt;Full version here&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function spanToOTLP(span) {
  return {
    resourceSpans: [
      {
        resource: {
          attributes: toAttributes(Object.fromEntries(globalAttributes)),
        },
        scopeSpans: [
          {
            scope: {
              name: &quot;minimal-tracer&quot;,
              version: &quot;0.0.1&quot;,
              attributes: [],
            },
            spans: [
              {
                traceId: span.traceID,
                spanId: span.spanID,
                parentSpanId: span.parentSpanID,
                name: span.name,
                startTimeUnixNano: span.startTime * Math.pow(10, 6),
                endTimeUnixNano:
                  (span.startTime + span.durationMs) * Math.pow(10, 6),
                kind: 2,
                attributes: toAttributes(Object.fromEntries(span.attributes)),
              },
            ],
          },
        ],
      },
    ],
  };
}
&lt;/code&gt;&lt;/pre&gt;
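&lt;p&gt;The &lt;code&gt;toAttributes&lt;/code&gt; helper referenced above turns a plain object into OTLP&apos;s typed key / value list. A minimal sketch (the full spec supports more value types than shown here):&lt;/p&gt;

```javascript
// OTLP attributes are a list of { key, value } pairs where each value is
// tagged with its type instead of being a bare JSON value.
function toAttributes(obj) {
  return Object.entries(obj).map(([key, value]) => {
    if (typeof value === "boolean") return { key, value: { boolValue: value } };
    if (typeof value === "number" && Number.isInteger(value)) {
      // the JSON encoding represents 64-bit integers as decimal strings
      return { key, value: { intValue: String(value) } };
    }
    if (typeof value === "number") return { key, value: { doubleValue: value } };
    return { key, value: { stringValue: String(value) } };
  });
}
```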
&lt;p&gt;However the power of standards means that we&apos;re no longer limited to just one vendor. We can now send this to any service that accepts OpenTelemetry!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://honeycomb.io/&quot;&gt;Honeycomb&lt;/a&gt; still works, of course.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./honeycomb-otel-trace.png&quot; alt=&quot;A Honeycomb screenshot of a trace waterfall&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://baselime.io/&quot;&gt;Baselime&lt;/a&gt; too!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./baselime-trace.png&quot; alt=&quot;A Baselime screenshot of a trace waterfall&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We can visualize our telemetry locally using &lt;a href=&quot;https://github.com/CtrlSpice/otel-desktop-viewer&quot;&gt;otel-desktop-viewer&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./otel-desktop-viewer-trace.png&quot; alt=&quot;OTel desktop viewer trace waterfall&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We can even render our data in the terminal with &lt;a href=&quot;https://github.com/ymtdzzz/otel-tui&quot;&gt;&lt;code&gt;otel-tui&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./otel-tui-trace.png&quot; alt=&quot;A trace rendered in the terminal&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;That&apos;s it!?&lt;/h2&gt;
&lt;p&gt;That&apos;s it! In &lt;a href=&quot;https://github.com/jmorrell/minimal-nodejs-otel-tracer/blob/ad35d40e3255715b7fb69bcd367f66f1e5cd2a57/tracer.js&quot;&gt;181 lines of code&lt;/a&gt; we&apos;ve implemented tracing, trace context propagation, instrumentation, and exported it in a standard format that any vendor or tool should be able
to accept. Thanks to the magic of 🌈standards🌈.&lt;/p&gt;
&lt;h2&gt;Is this... you know... legal?&lt;/h2&gt;
&lt;p&gt;If you were paying attention, there are a lot of &lt;a href=&quot;https://opentelemetry.io/docs/specs/otel/&quot;&gt;OpenTelemetry specifications&lt;/a&gt; around
how an OpenTelemetry SDK should be structured and behave. Our little library does almost none of that. This is just for
learning but one could imagine a simplified non-compliant SDK such as &lt;a href=&quot;https://github.com/evanderkoogh/otel-cf-workers&quot;&gt;this one for Cloudflare
Workers&lt;/a&gt;. It emits OTLP but doesn&apos;t conform to all of the SDK specifications.
How should we think about this?&lt;/p&gt;
&lt;p&gt;At the &lt;a href=&quot;https://events.linuxfoundation.org/open-telemetry-community-day/&quot;&gt;OTel Community Day&lt;/a&gt; last June, OpenTelemetry cofounder
&lt;a href=&quot;https://twitter.com/tedsuo&quot;&gt;Ted Young&lt;/a&gt; was asked a similar question. I wrote down his answer as best as I could. Lightly paraphrased:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We don’t care about the spec. We care that the black box participates in tracing and emits OTLP and semantic conventions. The true spec is the data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I take that to mean that while official SDKs are expected to follow the spec, it&apos;s reasonable for other implementations
to deviate as long as an outside observer cannot tell the difference from the behavior. Our little library doesn&apos;t quite
pass due to the shortcuts taken in parsing the &lt;code&gt;traceparent&lt;/code&gt; header, but it would not take much more code to fix this.&lt;/p&gt;
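&lt;p&gt;For reference, a stricter parse of the &lt;code&gt;traceparent&lt;/code&gt; header per the W3C Trace Context spec might look something like this. This is my own sketch, not the code from the post&apos;s tracer:&lt;/p&gt;

```javascript
// Parse a W3C traceparent header: version-traceid-spanid-flags,
// all lowercase hex. Returns null for malformed input instead of throwing.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header ?? "");
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // version 0xff is forbidden; all-zero trace or span ids are invalid
  if (version === "ff") return null;
  if (traceId === "0".repeat(32) || spanId === "0".repeat(16)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```

&lt;p&gt;The all-zero-id and &lt;code&gt;ff&lt;/code&gt;-version checks are exactly the kind of edge case the spec requires but a quick implementation tends to skip.&lt;/p&gt;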
&lt;p&gt;If OpenTelemetry continues to be successful, I expect support for OTLP, the ability to emit and receive it, to get built into everything:
your IDE tooling, the platform you deploy on, the framework on which you are building, even hardware. Some day your internet-connected
dryer will almost certainly speak OTLP, if it doesn&apos;t already.&lt;/p&gt;
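&lt;p&gt;To make &quot;speaks OTLP&quot; concrete: over OTLP/HTTP with JSON encoding, exporting a span is just a POST to a collector&apos;s &lt;code&gt;/v1/traces&lt;/code&gt; endpoint with a nested payload. A minimal sketch of that shape (the field names come from the OTLP spec; the span object I&apos;m converting is a made-up stand-in, not the post&apos;s internal representation):&lt;/p&gt;

```javascript
// Build a minimal OTLP/HTTP JSON payload for a single span.
// In OTLP/JSON, trace and span ids are hex strings, and the
// unix-epoch-nanosecond timestamps are encoded as strings.
function toOtlpPayload(span) {
  return {
    resourceSpans: [{
      resource: {
        attributes: [{ key: "service.name", value: { stringValue: span.serviceName } }],
      },
      scopeSpans: [{
        scope: { name: "minimal-tracer" },
        spans: [{
          traceId: span.traceId,   // 32 hex chars
          spanId: span.spanId,     // 16 hex chars
          name: span.name,
          kind: 1,                 // SPAN_KIND_INTERNAL
          startTimeUnixNano: String(span.startTimeUnixNano),
          endTimeUnixNano: String(span.endTimeUnixNano),
          attributes: [],
          status: {},
        }],
      }],
    }],
  };
}

// Sending it is a single fetch to the collector's default OTLP/HTTP port:
// await fetch("http://localhost:4318/v1/traces", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(toOtlpPayload(span)),
// });
```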
&lt;h2&gt;Hold on... why is &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-js&quot;&gt;opentelemetry-js&lt;/a&gt; so much bigger?&lt;/h2&gt;
&lt;p&gt;If we can build a functional tracer in under 200 lines, why does the official JavaScript SDK have so much more code?&lt;/p&gt;
&lt;p&gt;It might help to go over a non-exhaustive list of things the official SDK handles that our little learning library doesn&apos;t:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Buffer and batch outgoing telemetry data in a more efficient format. Don&apos;t send one span per HTTP request in production. Your vendor will want to have words.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-js?tab=readme-ov-file#supported-runtimes&quot;&gt;Work in both the browser and in Node environments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Gracefully handle errors; you wrap this library around your core functionality, so it cannot afford to take your app down with it&lt;/li&gt;
&lt;li&gt;Be incredibly configurable. Need to do something that isn&apos;t bog standard? You can probably make it happen&lt;/li&gt;
&lt;li&gt;Automagically wrap common libraries in robust, battle-tested instrumentation&lt;/li&gt;
&lt;li&gt;Stay fast when used in your code&apos;s tight loops&lt;/li&gt;
&lt;li&gt;Conform to &lt;a href=&quot;https://opentelemetry.io/docs/concepts/semantic-conventions/&quot;&gt;semantic conventions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support two whole other types of telemetry: metrics and logging&lt;/li&gt;
&lt;/ul&gt;
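&lt;p&gt;The first bullet is worth sketching. Instead of one HTTP request per span, a production exporter queues finished spans and flushes them when a batch fills up or a timer fires, whichever comes first. A toy version (the names and defaults here are mine, not the official SDK&apos;s):&lt;/p&gt;

```javascript
// Toy batch processor: buffer finished spans and flush them as one
// export call when the batch is full or the delay elapses.
class BatchProcessor {
  constructor(exportFn, { maxBatchSize = 512, delayMs = 5000 } = {}) {
    this.exportFn = exportFn;
    this.maxBatchSize = maxBatchSize;
    this.delayMs = delayMs;
    this.buffer = [];
    this.timer = null;
  }

  onEnd(span) {
    this.buffer.push(span);
    if (this.buffer.length >= this.maxBatchSize) {
      this.flush();
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.delayMs);
      this.timer.unref?.(); // don't keep the process alive just for telemetry
    }
  }

  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.exportFn(batch); // one request for the whole batch
  }
}
```

&lt;p&gt;Even this toy hints at the real design problems: what happens to buffered spans on shutdown, what happens when &lt;code&gt;exportFn&lt;/code&gt; fails, and how big the buffer is allowed to grow. The official SDK answers all of those.&lt;/p&gt;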
&lt;p&gt;And more! I hope this post has given you a better mental model of the core work happening underneath all of the production hardening and
extensibility that builds up on these libraries when they have to operate in the real world. A library that can reliably
work across most places we deploy JavaScript and meet a very broad range of user needs is highly non-trivial.&lt;/p&gt;
&lt;p&gt;But tracing? We know that&apos;s just ✨fancy logging✨ right?&lt;/p&gt;
&lt;p&gt;All code for this post can be found &lt;a href=&quot;https://github.com/jmorrell/minimal-nodejs-otel-tracer&quot;&gt;at &lt;code&gt;jmorrell/minimal-nodejs-otel-tracer&lt;/code&gt; on Github&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href=&quot;#footnote-ref-1&quot; id=&quot;footnote-1&quot;&gt;&lt;code&gt;[1]&lt;/code&gt;&lt;/a&gt; Count the concepts in this &lt;a href=&quot;https://github.com/open-telemetry/opentelemetry-js/blob/9c30124e764e08bd6ccf8dbfbe426a8531c20352/examples/basic-tracer-node/index.js&quot;&gt;simple example&lt;/a&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;Tracer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TracerProvider&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SpanExporter&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SpanProcessor&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Resource&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Attribute&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Event&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Span&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Context&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Both an API &lt;strong&gt;and&lt;/strong&gt; an SDK&lt;/li&gt;
&lt;li&gt;Semantic Conventions&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is intimidating to even senior developers. OpenTelemetry knows this learning curve is a problem and has spun up &lt;a href=&quot;https://github.com/open-telemetry/community/blob/ed572bd319edf1092e8e21808307f2eb0424a631/projects/developer_experience.md&quot;&gt;a new SIG&lt;/a&gt;
to work on improving the developer experience.&lt;/p&gt;
</content:encoded></item></channel></rss>