Infrastructure
How modern observability works — OpenTelemetry, the traces/metrics/logs trio, and controlling telemetry cost at scale.
Observability is the ability to ask arbitrary questions about a running system from its outputs, without shipping new code to answer them. The classic framing is three signals — metrics (aggregate numbers over time), logs (discrete event records), and traces (the path of a single request across services) — increasingly unified by correlating IDs so you can pivot from a slow metric to the exact trace to the relevant logs.
The defining shift of the last few years is OpenTelemetry (OTel) becoming the vendor-neutral standard for generating and shipping that telemetry, ending lock-in to a single agent. Notifire tracks the OTel project's maturation, the rise of high-cardinality wide-event analysis, eBPF-based zero-instrumentation collection, and — the pressure point for most teams in 2026 — telemetry cost control through sampling, aggregation, and tiered storage.
AI
Coralogix has secured $200 million in a new funding round. The company is betting on the growing need for tools that monitor, troubleshoot, and ensure the reliability of AI systems as they are deployed into production environments, highlighting the emerging market for AI observability.
Neeraj Dhiman
Infra
JetBrains released Toolbox App 3.5, a significant update for developers. The new version introduces OpenTelemetry metrics for better monitoring of remote development connections, adds interface zooming for accessibility, and includes several reliability improvements to enhance the overall user experience.
Ashish Kale
AI
A new Datadog report finds nearly 70% of companies now use three or more AI models, a significant shift towards multi-model strategies. This approach allows teams to select the best model for specific tasks, optimizing for factors like cost, latency, and operational risk across different workloads.
Neeraj Dhiman
Infra
ClickHouse has announced major updates to its observability platform, ClickStack. The new releases include ClickStack Cloud in private preview, AI-powered Notebooks in beta, and a new MCP server. These changes aim to simplify setup, improve investigation, and enhance the platform's composability for developers and IT teams.
Ashish Kale
Infra
A CNCF blog post shares a real-world story about the 'integration tax' of cloud-native tools. An on-call engineer faced blank dashboards because Prometheus wasn't correctly configured to monitor Cilium, highlighting how complex integrations can cause serious production issues for engineering teams.
Ashish Kale
Data
ClickHouse announced several major updates at its Open House 2026 event. Key developments include deeper integration with Postgres, new data ingestion tools called ClickPipes and ClickHouse Agents, and a partnership with Langfuse for LLM observability. The updates aim to simplify real-time data analytics.
Taranpreet Singh
Infra
ClickHouse has introduced ClickStack Cloud, a new serverless observability platform. It's a fully managed service built on the ClickHouse database, designed to handle logs, metrics, and traces. The platform uses a managed endpoint for OpenTelemetry data, allowing teams to analyze systems without managing infrastructure.
Ashish Kale
Data
Elastic has released version 9.4.2 of the Elastic Stack. This is a security-focused update that addresses potential vulnerabilities found in previous versions. All users are strongly encouraged to upgrade their deployments to this latest version to ensure their systems remain secure and protected.
Taranpreet Singh
Infra
AI agent frameworks like CrewAI and AutoGen are moving from demos to production environments for tasks like incident response. This shift is creating a critical new challenge: a lack of established tools and practices for monitoring and observing these complex, multi-step AI systems in real-world applications.
Ashish Kale
Infra
The open-source tracing tool Jaeger now supports ClickHouse as a storage backend. The integration is designed for large-scale telemetry. In one test, it achieved an 8.6x compression ratio on 10 million spans, helping teams store observability data more cheaply.
Ashish Kale
Security
Grafana Labs confirmed a security breach limited to its GitHub environment, exposing public and private source code. The company stated that its investigation found no evidence of customer production systems being compromised. The incident was linked to a supply chain attack involving a TanStack npm package.
Neeraj Dhiman
Security
Grafana Labs has disclosed a security incident where attackers used a stolen GitHub access token to access its environment. The breach resulted in the unauthorized download of some of its source code. Grafana is investigating but states no customer data was compromised.
Neeraj Dhiman
Infra
Grafana has disclosed a security incident where an unauthorized party gained access to its GitHub environment using a stolen token. The attacker was able to download the company's codebase. Grafana's investigation found no evidence that customer data or systems were affected by the breach.
Ashish Kale
Monitoring tracks known failure modes — dashboards and alerts for metrics you decided in advance to watch. Observability lets you investigate unknown problems after the fact by querying rich telemetry, including questions you never anticipated. Monitoring tells you that something is wrong; observability helps you figure out why, especially for novel failures in distributed systems.
The three signals are metrics (numeric measurements aggregated over time, e.g. request rate or p99 latency), logs (timestamped records of discrete events), and traces (the end-to-end path of a single request as it flows through multiple services). Many teams now treat them less as separate pillars and more as correlated views, linked by trace and span IDs so you can jump between them during an investigation.
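To make the idea of correlated views concrete, here is a minimal sketch of pivoting from traces to logs through a shared trace ID. The `Span` and `LogRecord` types, their field names, and the sample data are all hypothetical and exist only for illustration; they are not the OpenTelemetry data model. Summing span durations per trace is a deliberate simplification, since real spans nest and overlap.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    span_id: str
    service: str
    duration_ms: float


@dataclass
class LogRecord:
    trace_id: str
    span_id: str
    message: str


def slowest_trace(spans):
    """Return the trace_id whose spans have the largest summed duration."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.trace_id] += span.duration_ms
    return max(totals, key=totals.get)


def logs_for_trace(logs, trace_id):
    """Pivot from a trace to the log records that carry the same trace_id."""
    return [log for log in logs if log.trace_id == trace_id]


# Hypothetical sample telemetry: two requests, one of them slow.
spans = [
    Span("t1", "a", "gateway", 12.0),
    Span("t1", "b", "checkout", 30.0),
    Span("t2", "c", "gateway", 15.0),
    Span("t2", "d", "checkout", 900.0),
]
logs = [
    LogRecord("t1", "b", "order placed"),
    LogRecord("t2", "d", "payment provider timeout, retrying"),
]

suspect = slowest_trace(spans)
related = logs_for_trace(logs, suspect)
```

The pivot works only because every signal carries the same identifiers; a log line emitted without a trace ID cannot be joined back to the request that produced it.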
OpenTelemetry (OTel) is a CNCF project providing vendor-neutral APIs, SDKs, and a collector for generating and exporting traces, metrics, and logs. Instrumenting with OTel means your telemetry isn't tied to one vendor's agent — you can switch or mix backends (Datadog, Grafana, Honeycomb, Prometheus) without re-instrumenting. It has become the de facto standard for application telemetry.
The main levers are sampling (keep a representative or error-biased subset of traces rather than all of them), aggregation and metric cardinality limits, dropping low-value logs at the collector, and tiered storage that keeps recent data hot and archives older data cheaply. With telemetry volume often growing faster than the systems it observes, cost governance at the OTel collector has become a first-class engineering concern.
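The error-biased sampling lever can be sketched as follows. This is an illustrative assumption about how such a policy might look, not the OTel collector's implementation: the function name `keep_trace`, the `base_rate` parameter, and the synthetic trace list are all hypothetical. The decision is derived from a hash of the trace ID so that independent processes reach the same verdict for the same trace.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, base_rate: float = 0.1) -> bool:
    """Decide whether to keep a trace.

    Traces with errors are always kept. The rest are kept at roughly
    base_rate, using a hash of the trace ID so the decision is
    deterministic for a given trace.
    """
    if has_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate


# Hypothetical workload: 1000 traces, every 50th one contains an error.
traces = [(f"trace-{i}", i % 50 == 0) for i in range(1000)]
kept = [(tid, err) for tid, err in traces if keep_trace(tid, err)]
```

With these settings every error trace survives while only about a tenth of the healthy ones do, which is the trade the paragraph above describes: lower volume without losing the traces most likely to matter in an investigation.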