2026-06-11 14:15
Agent-driven end-to-end (E2E) testing offers a new approach to validating software functionality. Recent experiments with over 200 agentic workflows reveal that these tests focus on achieving specific goals rather than following strict sequences. This flexibility allows agents to adapt their methods for reaching outcomes, though it raises reliability and cost concerns. While agent-driven tests can be more costly and time-consuming, advancements in large language...
Source: Slack Engineering
Sergii Gorbachov
2026-05-28 14:15
🚀 In early 2023, Slack tackled the challenge of implementing Large Language Models (LLMs) at enterprise scale. Over three years, they developed a multi-cloud architecture to enhance security and performance. Initially, they used AWS SageMaker but faced issues with scaling latency and hardware availability. To address these, they introduced On-Demand Capacity Reservations. By mid-2024, Slack migrated to Amazon Bedrock, gaining operational simplicity and immediate access to new models,...
Source: Slack Engineering
Shaurya Kethireddy
2026-04-13 17:17
In complex, long-running agentic systems, maintaining alignment among agents is crucial. This article discusses the challenges and mechanisms designed to enhance productivity over extended periods. It highlights a structured process for AI agents in security investigations, orchestrated by a Director, with roles for Experts and a Critic. Different phases of investigation allow for iterative improvements. To manage context effectively, three channels are utilized: the Director's Journal, the...
Source: Slack Engineering
Dominic Marks
2026-03-31 17:00
🚀 Slack has transitioned to support HTTP/3 but faced challenges with client-side observability due to legacy tools. Existing monitoring solutions could not probe the new HTTP/3 endpoints effectively, which limited visibility and the accuracy of metrics. 💡 An intern, Sebastian Feliciano, developed QUIC support for Prometheus' Blackbox Exporter, using the quic-go client. His open-source contribution enhances monitoring for the entire Prometheus community. #HTTP3 #OpenSource #Prometheus...
Source: Slack Engineering
Carlo Preciado
2026-03-19 19:00
At Slack, we recognized that notifications are vital for team communication but can be overwhelming. 💬 We aimed to redesign our notification system to enhance clarity and ease of use. Research revealed that notification overload is a common frustration, particularly as users join more channels. Our findings showed that inconsistencies in settings and user preferences led to confusion. We identified four main issues: conflicting models across devices, tightly coupled notification types,...
Source: Slack Engineering
Frances Coronel
2025-12-01 16:00
🔒 Slack's Security Engineering team is enhancing its security processes by integrating AI agents. Their security event ingestion pipeline manages billions of daily events, focusing on alert reviews during on-call shifts. The team is refining a prototype for AI-driven investigations to improve efficiency and decision-making. This post is the first in a series detailing their design choices and learnings. Stay tuned for more insights! 🚀 #CyberSecurity #AI #TechInnovation #Slack #SecurityAgents
Source: Slack Engineering
Dominic Marks
2025-11-06 16:00
Source: Slack Engineering
David Reed (he/his)
2025-10-23 18:17
🚀 Last year, we discussed the evolution of our Chef infrastructure and its transition from a single stack to a multi-stack model. At Slack, service reliability is key. We explored moving to Chef Policyfiles but opted to enhance our existing EC2 framework instead. This approach minimizes disruption while improving deployment safety. We split our production Chef environment into multiple isolated buckets, increasing resilience and allowing independent updates. This strategy helps mitigate risks...
Source: Slack Engineering
Archie Gunasekara
2025-10-07 16:33
📉 In mid-2023, Slack launched the Deploy Safety Program to enhance reliability and reduce customer impact from changes. By January 2025, customer impact hours dropped by 90%. 🔍 Analysis revealed that most customer incidents were linked to Slack-induced changes. Key goals include reducing deployment impact time and improving detection processes. 📊 A new Deploy Safety metric tracks customer impact from significant incidents to better measure reliability. #Slack #DeploySafety...
Source: Slack Engineering
Sam Bailey
2025-09-04 10:00
In response to evolving cyber threats, Slack has introduced Anomaly Event Response (AER), a proactive security measure. 🌐 AER utilizes real-time monitoring and advanced analytics to quickly identify and respond to suspicious activities on the platform, reducing detection-to-response time from hours to minutes. ⏱️ This system helps prevent potential data breaches without the need for additional security tools. Slack also provides comprehensive audit logs to enhance security for Enterprise...
Source: Slack Engineering
Nathan Lehotsky
2025-09-04 10:00
Cyberattacks are becoming more sophisticated, making rapid breach detection and response essential. Traditional methods often respond too late, giving attackers an advantage. To combat this, Slack has introduced Anomaly Event Response (AER). This proactive defense mechanism uses real-time monitoring and advanced analytics to identify threats and respond automatically, reducing detection-to-response time to minutes. 🚀🔍 AER helps prevent data breaches without needing extra tools or human...
Source: Slack Engineering
Nathan Lehotsky