The concept of the AI co-clinician is based on what the researchers call "triadic care": a triad of patient, physician, and AI, in which AI agents assist patients throughout their treatment journey while the physician retains clinical authority and control. The goal is an AI system that functions as a collaborative member of the medical team, supporting patients under clinical supervision.

For the clinician-side evaluation, the researchers worked with academic physicians to adapt the NOHARM framework, testing the system against two types of errors: providing incorrect information ("errors of commission") and failing to deliver critical information ("errors of omission").
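The two error types can be illustrated with a minimal sketch. All names here are hypothetical, and real NOHARM-style grading is done by physicians, not string matching; the sketch only shows the distinction: a response errs by commission if it contains a known-false claim, and by omission if it misses a critical fact.

```python
def evaluate_response(response: str, critical_facts: list[str],
                      false_claims: list[str]) -> dict:
    """Toy classifier for the two NOHARM-style error types (illustrative only)."""
    text = response.lower()
    # Errors of commission: incorrect claims that appear in the response.
    commission = [c for c in false_claims if c.lower() in text]
    # Errors of omission: critical facts the response fails to deliver.
    omission = [f for f in critical_facts if f.lower() not in text]
    return {"errors_of_commission": commission, "errors_of_omission": omission}

result = evaluate_response(
    response="Take ibuprofen with food. Safe during pregnancy.",
    critical_facts=["with food", "avoid in third trimester"],
    false_claims=["safe during pregnancy"],
)
print(result)
```

In the study itself, academic physicians performed this grading in a blinded setting; the sketch simply makes the commission/omission distinction concrete.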

In a blinded comparison using 98 realistic primary care queries, physicians consistently preferred the AI co-clinician's responses over leading evidence synthesis tools. Against an existing clinical AI system, the preference was 67 to 26; against GPT-5.4-thinking-with-search, it was 63 to 30. In the objective analysis, the system recorded one critical error across the 98 cases.

In a blinded comparison with 98 realistic primary care queries, physicians clearly preferred the AI co-clinician over an existing clinical AI agent (67 to 26) and GPT-5.4-thinking-with-search (63 to 30). | Image: Google DeepMind

The advantage was especially pronounced on medication-related questions. The RxQA benchmark comprises 600 questions on drug compounds, interactions, and dosages — derived from national drug registries in two countries and verified by licensed pharmacists. These questions are challenging for general practitioners: with reference tools, they achieved only 61.3% correct answers; without any aids, just 48.3%.

The AI co-clinician scored 73.3%, narrowly ahead of GPT-5.4-thinking-with-search at 72.7%. The gap widened when questions were posed in an open-ended format rather than multiple choice, the way physicians actually look things up in practice: here, the AI co-clinician reached a quality score of 95.0%, compared to 90.9% for OpenAI's model.

Multimodal Telemedicine: AI With Eyes, Ears, and Voice

Beyond text-based support, Google DeepMind is exploring how the AI co-clinician can be deployed with real-time audio and video in telemedicine scenarios. In collaboration with physicians at Harvard and Stanford, the researchers conducted a randomized simulation study involving 20 synthetic clinical scenarios, 10 physicians acting as patients, and a total of 120 hypothetical telemedicine encounters.

The AI co-clinician demonstrated capabilities that go beyond pure text systems — correcting a patient's inhaler technique, for example, and guiding shoulder examinations to identify a rotator cuff injury.

For patient-facing telemedicine conversations, the AI co-clinician uses a dual-agent architecture: a "Planner" module continuously monitors the conversation and checks whether the "Talker" agent remains within safe clinical boundaries. For physician-facing use, the system prioritizes clinically grounded evidence and performs verification and citation checks during information retrieval.
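The paper does not publish implementation details, but the dual-agent pattern can be sketched as follows. All names ("Talker", "Planner", the boundary list) are assumptions for illustration: the Talker drafts each patient-facing reply, and the Planner reviews the draft against safe clinical boundaries before anything is sent, escalating to the physician when a boundary is crossed.

```python
# Assumed clinical boundaries the Planner enforces (illustrative, not from the paper).
UNSAFE_TOPICS = {"definitive diagnosis", "prescription change"}

def talker(conversation: list[str]) -> str:
    """Draft a patient-facing reply (stand-in for the conversational model)."""
    return f"Draft reply to: {conversation[-1]}"

def planner(draft: str) -> str:
    """Monitor the Talker's output; veto drafts that cross a clinical boundary."""
    if any(topic in draft.lower() for topic in UNSAFE_TOPICS):
        return "I need to hand this question to your physician."
    return draft

def respond(conversation: list[str]) -> str:
    # Every Talker draft passes through the Planner before reaching the patient.
    return planner(talker(conversation))

print(respond(["My inhaler doesn't seem to help."]))
```

The design choice is separation of concerns: the agent that converses fluently is never the final authority on what is safe to say, mirroring how the physician-facing mode layers verification and citation checks over retrieval.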

Experienced Physicians Still Come Out Ahead

The study evaluated over 140 aspects of consultation quality across seven domains: triage, medical history taking, clinical reasoning, communication and counseling, management steps, recognition of red flags, and physical examinations. The findings are sobering for anyone who sees AI as a replacement for doctors: experienced physicians outperformed the AI system overall, particularly in identifying red flags and guiding critical physical examinations.

At the same time, the AI co-clinician reached a comparable or better level than general practitioners in 68 of the 140 evaluated areas. OpenAI's GPT-realtime performed significantly worse than both across all seven domains. The researchers conclude that such systems are currently best suited as supportive tools for physicians — not as replacements for clinical judgment.

In simulated telemedicine consultations, general practitioners (orange) outperformed Google's AI co-clinician (blue) across all seven evaluated domains. The gap was largest in red flag recognition and physical examinations. OpenAI's GPT-realtime (grey) trailed both in every category. | Image: Google DeepMind

Whether and when the research initiative will become a marketable product remains open. The results demonstrate notable advances in AI-assisted evidence synthesis and telemedicine consultation — while also making clear that the gap to experienced physicians persists, particularly in safety-critical areas such as red flag recognition. "We are still at the very beginning, but the promise is clear," says DeepMind researcher Alan Karthikesalingam.