Creating self-healing software systems via effective usage of telemetry data and AI agents

Modern software systems operate in complex, dynamic environments where failures are inevitable. Traditional monitoring and manual incident response are no longer sufficient to ensure resilience or customer satisfaction. This talk explores how to design and implement self-healing software systems by combining telemetry data with an AI-driven agentic approach. We’ll start by examining how high-quality telemetry forms the foundation for detecting anomalies and predicting failures. Next, we’ll show how modern GenAI (LLMs) can transform this telemetry into actionable insights for AI agents that interpret data, pinpoint root causes, and apply automated fixes. Through a practical, real-world example, you’ll see how telemetry and AI work together to create adaptive feedback loops that continuously improve system reliability, while freeing engineers from repetitive operational tasks.
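As a rough illustration of the feedback loop described above (detect an anomaly in telemetry, diagnose it, apply a fix), here is a minimal sketch. It is not the talk's implementation. Every name in it (`Sample`, `is_anomalous`, `diagnose`, `heal`, and the action names) is hypothetical. A simple z-score check stands in for anomaly detection, and a rule-based function stands in for the LLM-backed agent that would interpret the data and choose a remediation.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, Dict, List, Optional


@dataclass
class Sample:
    """One telemetry data point (hypothetical shape)."""
    service: str
    latency_ms: float


def is_anomalous(history: List[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest value if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def diagnose(sample: Sample) -> str:
    """Stand-in for the AI agent: maps an anomalous sample to a remediation name.

    A real system would pass telemetry context to an LLM here; this rule is an assumption.
    """
    if sample.latency_ms > 1000:
        return "restart_service"
    return "scale_out"


def heal(
    history: List[float],
    sample: Sample,
    actions: Dict[str, Callable[[str], None]],
) -> Optional[str]:
    """Run one detect -> diagnose -> remediate cycle; return the action taken, if any."""
    if not is_anomalous(history, sample.latency_ms):
        return None
    action = diagnose(sample)
    handler = actions.get(action)
    if handler is None:
        return None  # unknown remediation: leave it to a human
    handler(sample.service)
    return action
```

In this sketch, `actions` is a plain allow-list of remediations, so the agent can only trigger fixes that engineers have registered in advance.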

Related Talks

You've Got the Cloud All Wrong, Let's Fix That!

Most people think the cloud is just files floating in the sky. Spoiler: it's not. In this episode, I’m breaking down what “the cloud” really is, why everything you’ve been told is probably wrong, and why it's the engine behind everything from Netflix to AI to your online checkout. This is the kickoff to WTH is the Cloud?!, a fun series that makes cloud technology make sense.

WTH is Chaos Engineering?! A Quick Look at Breaking Things on Purpose

Ever wonder why some teams intentionally break their own systems? Welcome to the world of chaos engineering, a practice that's not just for Netflix-scale infrastructure, but for any team that wants to build resilient, reliable applications.

In this session, we'll demystify chaos engineering and explain why intentionally breaking things is actually the smart move. You'll learn:

- What chaos engineering really is (in plain English, no buzzwords)
- Why waiting for production failures is a terrible strategy
- How to start experimenting with controlled failure locally, before it happens in the wild
- Real-world examples of chaos experiments that catch bugs you'd never find in traditional testing
- Tools and techniques to get started without blowing up your infrastructure

Through practical demos using LocalStack's cloud emulation and chaos engineering tools, we'll simulate failures like network latency, service outages, and resource exhaustion right from your laptop.

If you've ever said "it worked on my machine" only to watch it crash in production, this talk is for you. Let's break things intentionally so they don't break unexpectedly.

Wait, Building in the Cloud Is This Complicated?

Testing in the cloud = slow builds, fragile staging, surprise bills. Let’s talk about how developers are flipping the script and using local cloud environments to test smarter, faster, and cheaper, without breaking production. Bonus: You’ll learn how LocalStack lets you simulate AWS on your machine. Game changer.

Launch yourself in the world of local cloud development