How AI would have handled a real incident at Port
We had a real incident where three teams investigated the same problem in parallel. Here's what happened, and how we're using Port to move toward autonomous incident resolution.
A couple of weeks ago we had an incident. Three different teams got paged, each one investigated separately, and it took some time to figure out they were all looking at the same problem.
We went through the Slack threads afterward and mapped out what happened. Then we got together as a team and asked: what would this have looked like if an agent handled the incident end-to-end?
This article is that exercise. The real incident, what went wrong in how we responded, and what we’re building to make sure it goes differently next time.
The incident
On a Friday evening, a customer created 1.7 million automation runs in 90 minutes. Kafka offset lag spiked across multiple services.
PagerDuty fired alerts for three of our teams: Actions, Spotlight, and Builder. Three on-call engineers acknowledged in three separate Slack channels without knowing about each other.
The Actions team found the offset lag first. Their on-call engineer restarted the service worker through a self-service action in Port. It didn’t help. He then dug into our automation run counts, narrowed the spike to three orgs, and traced them all back to one customer. Next he tried to scale out, but the config override action didn’t exist in Port for this service, so he opened a manual PR, waited for DevOps approval, and hit a failing Argo diff that needed a third person.
The Builder team got a PagerDuty alert at the same time but started talking about it in a different Slack thread. Their engineer also restarted workers. Also checked New Relic. Also checked Kafka UI. And got the same false hope: “lag seems to decrease as restart rolls out.” It didn’t.
43 minutes in, someone from the third team, Spotlight, who had been dealing with the same incident on their own, finally asked: “do you think our alert is related?” That was the first time anyone connected the dots. Three teams had been investigating the same root cause in parallel, trying the same restart that didn’t work, wasting the same ten minutes twice.
By the time the threads merged, 77 minutes had passed. The remediation plan was two words: scale out. Getting there took over an hour of human coordination across three Slack threads, New Relic, Kafka UI, PagerDuty, GitHub, and Argo.
We’re now building toward resolving incidents like this in a fraction of that time.
Our first step: A triage agent
We’re building toward an agent that handles the full incident lifecycle. But the first piece is already live: an agent that triages the incident in Port and posts a message to on-call engineers in Slack within minutes.
Two things make it work: the context lake and the skill file.
The Context Lake
The Port Context Lake gives the agent everything it needs to know about how our SDLC behaves: our microservices, deployment history, CI/CD jobs, ownership, dependencies, vulnerabilities, team structure, and much more. When the agent finds a failing service, it already knows who owns it, what was recently deployed, which services depend on it, and which customers are affected, because everything in Port is connected.
During the Kafka incident, an engineer spent time tracing three org IDs back to one customer. The context lake already holds those relationships. The agent would have simply read them instead of piecing them together across different tools.
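As a rough illustration of why that lookup is cheap, here is a toy sketch of walking entity relationships from org IDs up to their customer. The entity names and structure are hypothetical, not Port's actual data model:

```python
# Toy entity graph, loosely shaped like the relationships the agent
# would read from the context lake. All names here are hypothetical.
entities = {
    "org-101": {"type": "org", "customer": "acme"},
    "org-102": {"type": "org", "customer": "acme"},
    "org-103": {"type": "org", "customer": "acme"},
    "acme": {"type": "customer", "owner_team": "Actions"},
}

def customers_for_orgs(org_ids):
    """Resolve a set of org IDs to the customers they belong to."""
    return {entities[org]["customer"] for org in org_ids}

# The three orgs behind the automation-run spike resolve to one customer:
print(customers_for_orgs(["org-101", "org-102", "org-103"]))  # {'acme'}
```

With the relationships already materialized, the trace an engineer did by hand collapses into a single lookup.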
Equipped with those two things, the triage agent reads the alerts, traces them to a shared root cause through the context lake, and posts a single summary to the on-call engineers in Slack.
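A minimal sketch of what that Slack summary could look like. The wording and field names are illustrative, not the agent's actual output format; Slack incoming webhooks accept a JSON body with a "text" field, so posting this is a single HTTP POST:

```python
import json

def build_triage_message(service, alerting_teams, root_cause, customer):
    """Assemble a Slack payload like the one the triage agent would post.

    Structure and wording are hypothetical, not Port's actual format.
    """
    lines = [
        f":rotating_light: Correlated incident on *{service}*",
        f"Teams alerted: {', '.join(alerting_teams)} (same root cause)",
        f"Root cause: {root_cause}",
        f"Affected customer: {customer}",
    ]
    # A Slack incoming webhook accepts this dict serialized as JSON.
    return {"text": "\n".join(lines)}

payload = build_triage_message(
    service="automation-runner",
    alerting_teams=["Actions", "Spotlight", "Builder"],
    root_cause="1.7M automation runs in 90 minutes from one customer",
    customer="<customer name>",
)
print(json.dumps(payload, indent=2))
```

One message naming all three teams is exactly what was missing for the first 43 minutes of the real incident.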
Why Port
The agent does the investigation. But when it's time to act, like restarting a service, scaling out workers, or rolling back a deployment, a human approves it through Port. That matters because every approval flows through Port, so you can measure:

- how much time the agent saved per incident
- how often engineers accepted its recommendations vs. overrode them
- how engineers felt the agent performed

Put those on a dashboard and you know exactly what autonomous incident resolution is worth and where it still needs work.
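A small sketch of how those three numbers fall out of the approval records. The record shape is hypothetical; in practice it would come from Port's log of agent recommendations and human approvals:

```python
from statistics import mean

# Hypothetical per-incident records (shape is illustrative only).
incidents = [
    {"minutes_saved": 40, "accepted": True,  "sentiment": 4},
    {"minutes_saved": 25, "accepted": True,  "sentiment": 5},
    {"minutes_saved": 0,  "accepted": False, "sentiment": 2},
]

# bools count as 0/1, so the mean of "accepted" is the acceptance rate
acceptance_rate = mean(i["accepted"] for i in incidents)
avg_minutes_saved = mean(i["minutes_saved"] for i in incidents)
avg_sentiment = mean(i["sentiment"] for i in incidents)

print(f"accepted {acceptance_rate:.0%} of recommendations, "
      f"saved {avg_minutes_saved:.0f} min/incident on average, "
      f"sentiment {avg_sentiment:.1f}/5")
```

Three aggregates, one dashboard, and a concrete answer to "is the agent earning its place in the rotation?"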
What’s next
The triage agent is just the first step. We're building Port to heal incidents autonomously. So let's revisit the Kafka incident and see how it would have gone differently.
The incident management future we see
Incident management is moving from reactive to autonomous. We already know agents can do it. The question is how you get there without losing control.
The companies moving fastest are building agents that pursue multiple hypotheses in parallel, correlate across systems, and suggest remediations. They’re participants in the on-call rotation that work across your code, infra, and telemetry.
We think the foundation for that is context. An agent that doesn’t know your architecture, your ownership model, your dependencies, and your playbooks is just a faster way to get the wrong answer. The context lake and the skill file are what turn a generic agent into one that investigates the way your team does.
If you want to build something similar to what we described, we’re publishing the skill file, the workflow definition, and a guide for how we do incident management at Port.