Degraded diagram rendering

Incident Report for Capable

Postmortem

Summary

On 17 June 2026, between approximately 06:30 and 15:00 UTC, our diagram rendering service experienced elevated error rates. Some diagrams failed to load and displayed an error or appeared blank. The issue was identified and resolved the same day, and rendering is now fully operational. No customer data was affected, and only diagram rendering was impacted; all other functionality continued to work normally.

Impact

  • What: A growing share of diagram render requests (PlantUML, Mermaid, and BPMN diagrams) returned errors. Affected diagrams showed an error state or did not display. Early in the window, refreshing or reopening the page often resolved the problem; as the error rate rose, retries became less likely to succeed.
  • Who: Users actively viewing or generating diagrams during the window. Existing content was never lost; only the live rendering of diagrams was affected.
  • Scope: Limited to the diagram rendering feature. Editing, storage, and all other parts of the product were unaffected.

Timeline (UTC)

  • 06:30 - Diagram rendering began returning intermittent errors. Many requests still succeeded.
  • 06:30 – 14:30 - Error rate gradually increased over the morning as the underlying capacity issue worsened.
  • 14:30 - Error rate peaked, with most diagram requests failing.
  • 15:00 - We identified the cause and applied a fix. Rendering recovered shortly after and returned to normal.

Root cause

The diagram rendering service ran short of capacity on its underlying compute, which caused it to become progressively less responsive over the course of the morning. As capacity headroom ran out, an increasing proportion of requests timed out.

Automated scaling was in place to add capacity when needed, but the affected host continued to report itself as healthy despite being unable to serve requests. Because of this, additional capacity was not brought online automatically.
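
The gap between "reports healthy" and "can serve requests" can be illustrated with a minimal sketch. This is not the service's actual code: the function names, the probe diagram, and the timeout value are all hypothetical, chosen only to contrast a shallow check that passes whenever the process is running with a deeper check that passes only if a tiny diagram renders within a deadline.

```python
import concurrent.futures
import time
from typing import Callable

# Hypothetical probe input: any tiny diagram the renderer accepts would do.
PROBE_SOURCE = "graph TD; A-->B"


def shallow_health_check() -> bool:
    """Passes whenever the process is up, even if renders are timing out."""
    return True


def deep_health_check(
    render: Callable[[str], bytes],
    probe_source: str = PROBE_SOURCE,
    timeout_s: float = 2.0,  # assumed deadline, not a value from the report
) -> bool:
    """Passes only if a probe render returns non-empty output in time."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(render, probe_source)
        output = future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timed-out probe and a render that raised an error.
        return False
    finally:
        # Do not block the health endpoint waiting on a stuck render.
        executor.shutdown(wait=False)
    return bool(output)


# Stand-in renderers for demonstration only.
def responsive_render(source: str) -> bytes:
    return b"<svg/>"


def starved_render(source: str) -> bytes:
    time.sleep(0.3)  # simulates a host too starved to answer promptly
    return b"<svg/>"


if __name__ == "__main__":
    print(shallow_health_check())                            # True either way
    print(deep_health_check(responsive_render, timeout_s=1.0))
    print(deep_health_check(starved_render, timeout_s=0.05))
```

With a check of the second kind, a host that can no longer render would stop reporting healthy, which gives a scaling or replacement mechanism a signal to act on. The trade-off is that a deep check consumes a small amount of rendering capacity itself, so the probe should be cheap and its interval modest.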

Resolution

We manually increased the resources allocated to the diagram rendering service, which immediately restored normal operation. We then addressed the underlying scaling issue to ensure the autoscaler correctly detects host-level failures going forward.

Posted Jun 17, 2026 - 15:53 UTC

Resolved

The incident has been resolved. An auto-scaling misconfiguration caused resource starvation in our diagram rendering system. We have scaled the cluster appropriately and added automated monitoring to detect this condition in future.
Posted Jun 17, 2026 - 15:45 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 17, 2026 - 15:00 UTC

Investigating

We are currently investigating this issue.
Posted Jun 17, 2026 - 14:00 UTC
This incident affected: Capable Apps.