Photo of Omar El-Sayed judging at {Tech: Europe}

AI agents are moving from demos into real workflows, and the work changes when that happens. Early progress in the field rewarded model capability: whether a system could understand and respond convincingly. In production environments, teams care just as much about behaviour: what an agent does when it is uncertain, how it reacts when tools fail, and whether its decisions can be examined after the fact.

That shift is changing what companies look for in builders. Agents no longer live in controlled environments. They sit inside real businesses, touch customer data, trigger downstream actions, and interact with users who interrupt, contradict themselves, or change goals mid-conversation. A system can sound impressive yet still be unsafe to deploy if it behaves unpredictably once connected to real workflows.

Trust as an Engineering Discipline

This is why trust is increasingly treated as an engineering problem rather than a branding exercise. Frameworks such as the NIST AI Risk Management Framework emphasise that trustworthiness must be built into systems throughout their lifecycle, not assumed from benchmark performance alone. When agents operate in settings where mistakes carry operational or compliance consequences, discipline matters as much as capability.

Frontier companies building production voice agents are responding by reshaping their teams. Rather than hiring for narrow specialisms alone, they are prioritising people who can own system behaviour end-to-end. That includes defining what an agent is allowed to do, how it fails safely, and how its actions can be explained and audited when something goes wrong.
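Owning behaviour end-to-end means writing those boundaries down somewhere inspectable. As a rough illustration only, using assumed names such as AgentPolicy and AuditRecord rather than any real implementation, an agent's scope and audit trail might be expressed as plainly as this:

```python
# Illustrative only: hypothetical structures for an agent's scope and audit
# trail, not any company's actual implementation.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AgentPolicy:
    """What the agent may do, and its default when it is unsure."""
    allowed_actions: frozenset        # e.g. {"lookup_order", "send_confirmation"}
    needs_confirmation: frozenset     # the subset that carries real-world risk
    on_uncertainty: str = "escalate_to_human"   # fail-safe default

@dataclass
class AuditRecord:
    """One entry in the trail used to explain a decision after the fact."""
    timestamp: datetime
    action: str
    inputs: dict
    outcome: str
    rationale: str

def log_action(action: str, inputs: dict, outcome: str, rationale: str) -> AuditRecord:
    """Capture enough context that the decision can be examined later."""
    return AuditRecord(datetime.now(timezone.utc), action, inputs, outcome, rationale)
```

The point of structures this explicit is not sophistication; it is that the boundaries and the evidence exist independently of the model, so they can be reviewed when something goes wrong.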

Bland AI is an example of this shift. A leading voice AI infrastructure company whose agents handle real-time enterprise phone calls with latency measured in milliseconds, it builds systems designed to operate inside live customer workflows, where deployment discipline and operational judgment are at a premium. Its end-to-end platform powers real-time conversational pipelines for enterprises including Better.com, Samsara, and the Cleveland Cavaliers, and the company has raised $65m to date, underscoring strong institutional conviction in its approach to voice AI at scale.

How Diverse Backgrounds Shape Agent Design

Within Bland's deployment work, specialists such as Omar El-Sayed represent this operator-builder profile. His background spans legal training, strategy, operations, and performance, alongside hands-on experience building and evaluating governed AI systems in high-trust environments, including regulated workflows where system behaviour must remain defensible under review. That mix shapes how he approaches agent design and operations.

Legal training pushes towards clear boundaries and defensible decision-making. Operational experience sharpens attention to failure modes and incident response. Performance training adds a feel for narrative and intent: how people explain what they want, change direction mid-sentence, and react when a system misunderstands them. Strategy work builds comfort with trade-offs and second-order effects. In voice agents, those instincts translate into safer permissioning, clearer escalation paths, and more disciplined auditability.

The goal is not human likeness for its own sake but disciplined autonomy: a system that stays predictable when real-world conditions get messy. In practice, this starts with permissions. Risk-bearing actions are not triggered on inferred intent alone; they require explicit confirmation. When inputs are unclear or contradictory, the system is designed to fall back to safe states rather than guess. Escalation is treated as a normal path, with thresholds defined in advance. These controls mirror the same constraints-first approach Omar applied in earlier governed deployments, where uncertainty had to be handled conservatively.
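That constraints-first pattern can be made concrete in very little code. The sketch below is illustrative only: the action names, confidence floor, clarification limit, and return values are assumptions for the example, not Bland's API.

```python
# Illustrative sketch of a constraints-first decision gate. Action names,
# thresholds, and return values are hypothetical, not a real implementation.
RISK_BEARING_ACTIONS = {"issue_refund", "cancel_account"}   # assumed examples
CONFIDENCE_FLOOR = 0.75        # fallback threshold, defined in advance
MAX_CLARIFICATIONS = 2         # escalation threshold, defined in advance

def next_step(action: str, confidence: float,
              clarifications_so_far: int, user_confirmed: bool) -> str:
    """Decide whether to act, confirm, fall back to a safe state, or escalate."""
    if clarifications_so_far >= MAX_CLARIFICATIONS:
        return "escalate_to_human"               # escalation is a normal path
    if confidence < CONFIDENCE_FLOOR:
        return "ask_clarifying_question"         # fall back rather than guess
    if action in RISK_BEARING_ACTIONS and not user_confirmed:
        return "request_explicit_confirmation"   # inferred intent is not enough
    return "execute"
```

Keeping the gate this small is the design choice: every branch is decided before deployment, so behaviour under uncertainty is a policy, not an improvisation.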

Testing for Reality, Not Perfect Conditions

Omar's approach also shapes how systems are tested. Instead of optimising for clean transcripts, deployments are stress-tested against the scenarios that tend to break real systems: interruptions, misheard entities, partial data from backend tools, timeouts mid-flow, and multi-step requests where users change objectives. Testing is structured around known failure modes with clear pass/fail criteria, rather than subjective transcript review. Consistency matters more than cleverness. If similar inputs produce materially different outcomes, the system is not considered ready for production.
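A simplified sketch of that testing posture, assuming hypothetical scenario names and a stand-in run_agent function rather than any real test suite, might look like this:

```python
# Illustrative only: scenario checks with pass/fail criteria, plus a
# consistency check across paraphrased inputs. run_agent() is a stand-in.
from collections import Counter
from typing import Callable

def check_scenarios(run_agent: Callable[[str], str],
                    scenarios: dict[str, str]) -> list[str]:
    """Each scenario maps an input transcript to the outcome it must produce."""
    failures = []
    for transcript, expected in scenarios.items():
        actual = run_agent(transcript)
        if actual != expected:
            failures.append(f"{transcript!r}: expected {expected}, got {actual}")
    return failures

def check_consistency(run_agent: Callable[[str], str],
                      paraphrases: list[str]) -> bool:
    """Similar inputs must lead to the same outcome; otherwise, not production-ready."""
    outcomes = Counter(run_agent(p) for p in paraphrases)
    return len(outcomes) == 1
```

The scenario checks encode the pass/fail criteria; the consistency check encodes the rule that similar inputs must not produce materially different outcomes.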

He applies the same lens when evaluating other teams' work. Omar has served as a final-round judge at large international hackathons for Lovable and {Tech: Europe}, where speed and surface-level polish are easy to optimise for. In those settings, his focus is on whether a product holds up across multiple dimensions at once, including reliability under interruption, sensible escalation design, disciplined tool use, and traceability that allows teams to explain what happened after the fact.

This way of thinking is increasingly reflected in how the United States has approached AI assurance as systems move into real deployment. Elizabeth Kelly, the former Director of the U.S. AI Safety Institute, has emphasised that safety must be grounded in evaluation, risk controls, and accountability, not assumed from impressive performance in ideal conditions. In practice, that means designing systems with explicit boundaries, escalation paths, and auditability so behaviour can be examined when something goes wrong.

As AI agents take on more responsibility inside real organisations, the field is widening. The teams that ship durable systems are increasingly those that combine model capability with governance discipline and operational ownership. In production, success is defined less by how human an agent sounds and more by how reliably it behaves under far-from-ideal conditions. Omar's work sits squarely in that operational layer, where reliability and accountability determine whether agents actually ship.