Why XR’s Next Step Is Not a Better Headset

For more than a decade, Extended Reality has been described as the next phase of human-computer interaction. Keyboards gave way to mice, mice gave way to touchscreens, and touchscreens will give way to displays that follow our eyes and hands. That is the promise. Everyday adoption is something else.

After working on Almer (now part of RealWear) at SIPBB since the first 3D-printed prototype in 2021, and now contributing to the XR5.0 EU project, we have come to a different reading of why XR has stalled at the threshold. The bottleneck is not hardware. Weight has dropped, optics have improved, batteries last longer. The bottleneck is interaction itself.

Where XR has already worked. XR has produced its clearest wins in tightly bounded training and simulation. XR5.0 Pilot 3 targets a 30 % reduction in training time and cost. Pilot 5 targets a 20 % reduction in assembly time and a 50 % increase in the engagement of lower-skilled technicians. Pilot 4 replays the actions of skilled technicians for the Wing Anti-Ice Valve repair so trainees can rehearse the procedure without grounding an aircraft. These are concrete applications with measurable goals. They also share a pattern: for now they are tightly scoped, run on dedicated training devices, and live in controlled environments. Outside that envelope, XR rarely becomes the primary interface for daily work.

The interaction problem. Every successful computing paradigm has had a stable interaction grammar. The mouse pointer and the touchscreen tap each took decades to settle and then carried across millions of applications without retraining. XR has converged on nothing equivalent. A user crossing from VR to MR to a head-worn smart-glasses device meets a different input set each time: controllers here, gaze plus pinch there, hand tracking elsewhere, voice on a fourth device. For an industrial worker whose hands stay on the task, most of those options are not even available. In the first Almer prototypes, the choice was constrained to voice. There was no controller in hand, no surface to touch, and no spare bandwidth for learning a gesture set. Voice worked because there was nothing else, and because language models had finally become tolerant enough of free-form requests that workers could speak as they would to a colleague rather than memorise commands.

AI changes the role of XR. This is where AI rewrites the script. Large language models, multimodal context understanding, and persistent memory together collapse the fragmented input channels into a single interaction layer that sits above the device. The worker speaks, looks, or moves their head. The model fuses these signals with what it sees through the device’s camera. The output is whichever modality fits: a spatial overlay, a spoken answer, a triggered hardware action. In XR5.0 Pilot 6, Innov’s AI module holds the troubleshooting knowledge for a specific piece of industrial equipment and emits step-by-step instructions, which the Oculavis assistance app renders on a RealWear headset. SUPSI’s middleware orchestrates the traffic between the components. The whole stack is tested against a real machine, the LNS Barfeeder Express 126, installed in the SSF Demonstration Factory. It behaves as one assistant rather than four bolted-together tools.
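
To make the idea of a single interaction layer more concrete, here is a minimal sketch in Python. It is not the XR5.0 or Almer implementation: every name in it (WorkerSignals, Response, respond) is hypothetical, and a few hand-written rules stand in for the multimodal model. It only illustrates the shape of the contract: one bundle of fused inputs goes in, and one response in whichever modality fits comes out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class OutputModality(Enum):
    SPATIAL_OVERLAY = "spatial_overlay"
    SPOKEN_ANSWER = "spoken_answer"
    HARDWARE_ACTION = "hardware_action"


@dataclass
class WorkerSignals:
    """Hypothetical bundle of the inputs captured in one interaction turn."""
    utterance: Optional[str] = None        # transcribed speech, if any
    head_gesture: Optional[str] = None     # e.g. "nod" or "shake"
    camera_labels: Tuple[str, ...] = ()    # objects a vision model reports in view


@dataclass
class Response:
    modality: OutputModality
    payload: str


def respond(signals: WorkerSignals) -> Response:
    """Toy rule-based stand-in for the model that fuses the signals."""
    text = (signals.utterance or "").lower()

    # A "where" question about something the camera sees becomes an overlay.
    if text.startswith("where"):
        for label in signals.camera_labels:
            if label in text:
                return Response(OutputModality.SPATIAL_OVERLAY, f"highlight:{label}")

    # A request for a photo triggers a device action instead of an answer.
    if "photo" in text:
        return Response(OutputModality.HARDWARE_ACTION, "capture_photo")

    # A silent nod is treated as confirmation of the current step.
    if not text and signals.head_gesture == "nod":
        return Response(OutputModality.SPOKEN_ANSWER, "Step confirmed.")

    # Anything else falls back to a spoken clarification request.
    return Response(OutputModality.SPOKEN_ANSWER, "Could you rephrase that?")
```

In this sketch, the caller never chooses an input channel or an output channel. It hands over whatever signals exist, and the layer above the device decides how to answer.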

The hard parts are not the model. When we look at what actually broke during these years, the binding constraints were operational, not algorithmic. A device is useful only if it is charged and ready to grab. That is why we built a docking station that doubles as a charging cradle and gives the device a fixed, visible home in the workspace. By fitting into the existing workflow instead of asking workers to remember a separate routine, the station also makes the glasses easier to adopt at the organisational level. The headband form factor that solved compatibility with prescription glasses and hard hats made weight distribution a dominant concern, which constrains every future feature, AI included, before it exists. Around the device itself we have been building the surrounding system: AriOs as the on-device operating system that abstracts hardware peripherals and lets the same headset play tutor, co-pilot, or remote-expert proxy by configuration; and Ari Cloud as the enterprise-integration layer that connects to the customer’s documentation, asset records, and ticketing.

Figure 1. Evolution of the Almer/RealWear XR ecosystem from joint SIPBB–RealWear research projects to industrial deployment, including XR hardware, Ari-OS, and Ari-Cloud.

What to watch. LLM hallucination in safety-critical guidance is a real risk, and the literature is right to flag it; we manage it by scoping the model to the equipment’s documentation and keeping an explicit fallback to a human expert. Privacy concerns around biometric capture, and the risk of prompt injection, are mostly latent in a voice plus head-movement device that does not track gaze, but they would become live the moment the form factor changes. The risk profile shifts with the device class as much as with the model.
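
The scoping-plus-fallback pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pilot's actual mechanism: the names (Passage, answer_or_escalate, EXAMPLE_MANUAL), the keyword-overlap scoring, and the threshold are all hypothetical, and the manual passages are invented placeholders rather than real equipment documentation. A real system would more likely use retrieval over embeddings, but the control flow is the point: answer only from the documentation, and otherwise hand over to a person.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Passage:
    source: str   # where the passage comes from in the documentation
    text: str


# Invented placeholder passages, not taken from any real manual.
EXAMPLE_MANUAL = [
    Passage("manual, section A", "If the bar does not advance, check the pusher for obstruction."),
    Passage("manual, section B", "Before opening the cover, switch off the main power supply."),
]


def _tokens(sentence: str) -> Set[str]:
    """Lower-cased words longer than three characters, punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in sentence.split())
    return {w for w in words if len(w) > 3}


def answer_or_escalate(question: str, manual: List[Passage], min_overlap: int = 2) -> dict:
    """Answer only from the manual; otherwise route to a human expert."""
    q = _tokens(question)
    best, best_score = None, 0
    for passage in manual:
        score = len(q & _tokens(passage.text))
        if score > best_score:
            best, best_score = passage, score

    # Too little support in the documentation: do not guess, escalate.
    if best is None or best_score < min_overlap:
        return {"action": "escalate_to_expert", "answer": None, "source": None}

    # Return the documented text verbatim, with its source for traceability.
    return {"action": "answer", "answer": best.text, "source": best.source}
```

With the placeholder manual above, a question such as "The bar does not advance, what should I check?" returns the first passage along with its source, while "How do I update the firmware?" finds no support and is escalated.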

The point. The next jump in XR is not a smaller, lighter headset that runs a bigger model. It is an integration spine that connects the worker, the device, the AI, and the surrounding enterprise systems into one assistant, one workflow, and one stable interaction layer. That is the part of the puzzle worth building.