Maxidonkey

POC: Delphi VCL + WebView2 component for OpenAI Realtime (WebRTC, voice & text)


Hello everyone,

 

Following up on my previous post about EdgeAudio, and after a suggestion from Kas Ob. about using WebRTC with WebView2, I've published a new project on GitHub: Edge-OpenAI-Realtime.

 

This project provides a VCL component that implements WebRTC through WebView2 and leverages OpenAI’s Realtime VAD.  
It supports the full set of Realtime APIs (as of September 2025), including functions and remote MCP tools.
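
For readers unfamiliar with the WebRTC flavour of the Realtime API, here is a minimal sketch of the browser-side handshake that a bridge like this has to drive inside the WebView2 page. It assumes an ephemeral API key injected from the Delphi host (ephemeralKey below); it is not the component's actual bridge code, just the general shape of the exchange.

```typescript
// Minimal browser-side sketch (runs inside the WebView2 page), assuming the
// Delphi host injects an ephemeral API key. Not the component's actual code.
async function connectRealtime(ephemeralKey: string, model = "gpt-realtime") {
  const pc = new RTCPeerConnection();

  // Play the model's audio answer through a hidden <audio> element.
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

  // Capture the microphone and send it to the model.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((t) => pc.addTrack(t, mic));

  // Data channel carrying Realtime events (text, function calls, VAD events).
  const dc = pc.createDataChannel("oai-events");
  dc.onmessage = (e) => console.log("realtime event:", JSON.parse(e.data));

  // Standard SDP offer/answer exchange with the Realtime endpoint.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(`https://api.openai.com/v1/realtime?model=${model}`, {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return { pc, dc };
}
```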

 

To accompany the code, I wrote a white paper (included in the README) that details the architecture and runtime sequence.  

 

A demo ZIP archive is also available in the samples folder, so you can quickly try out the component once installed.

 

I’d be glad to hear your feedback or answer any questions!

 

 

[Screenshot attachment: Sample1.png]


Hi, 

 

I'm sorry I can't test the samples, as Edge support isn't there for older Delphi versions, and I'm not planning on installing CE since I don't see a real reason (at least for myself, for now) to do it. Anyway...

 

The code is neat and looks very clean, and, most importantly, the bridge is a piece of art. Nice!

 

And I (and many here, I think) would like to hear your findings and experience on:

1) Using audio with WebRTC and its audio processing: what is the CPU load, and does VAD on Delphi with Edge perform as advertised? I am very familiar with Jitsi (https://en.wikipedia.org/wiki/Jitsi), have been a user for years, have recommended it to many people, and many of them run their own servers. It has always astonished me with its performance; even on an old Android device, inside a browser, it was fast and responsive. So what is your experience with it, and how does this compare to your EdgeAudio lib?

 

2) What is the real problem (struggle may not be the right word, but whatever stopped you) with enabling AEC and AGC? Jitsi performed better than native Skype on the same old mobile device with the same connection. On desktop, if you are in the middle of a conversation while audio is playing and you suddenly move the microphone much closer to, or further from, the speakers, a loopback happens (some distortion and maybe echo) for a second or a fraction of a second, and then it corrects itself. To my recollection, VAD/AEC/AGC are enabled by default, so why only VAD?

 

Thank you for this lib and for sharing!


Hello Kas,

 

Thank you very much for your comments and your kind words about the bridge. That’s greatly appreciated!

 

CPU load and smoothness

  • In my tests (i7-7700), CPU usage stays very low, between 3 and 5%, with no spikes, even during longer sessions (I tested up to 20 minutes continuously). Sessions can last anywhere from 10 to 7,200 minutes. The experience remains smooth throughout.

Comparison with Jitsi

I don’t have hands-on experience with Jitsi, so I can’t provide a direct comparison. What I can share are my observations with OpenAI Realtime:

 

Response latency is nearly instantaneous, but it mainly depends on two factors:

1- Model: OpenAI currently offers three realtime models (gpt-realtime, gpt-4o-realtime, and gpt-4o-mini-realtime). In practice, responsiveness feels almost identical across them; the real difference is in token cost.

 

2- Turn detection (VAD):

  • semantic_vad (default): uses semantic estimation to guess when the user has finished speaking. This provides a very natural conversational flow and supports “talk over”, though in noisy environments, it may misjudge the end of speech.
  • server_vad: works like in EdgeAudio: speech activity is detected and the turn is cut off after a configurable silence threshold. In noisy environments this can be preferable, since you can tune silence duration, threshold, and prefix padding. Latency is practically zero, but all captured sound (including unwanted noise) is processed.

For reference, OpenAI’s own realtime apps (on the website or via the Windows/Android/iOS clients) use semantic_vad. The responsiveness you experience there is identical to what the Edge-OpenAI-Realtime component provides.
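
To make the two modes concrete, here is a sketch of switching turn detection on an open session over the data channel. The field names follow the Realtime API's session.update event; the values are illustrative, not the component's defaults.

```typescript
// Sketch only: switch an open session to server_vad and tune it. `dc` is the
// Realtime data channel; values are illustrative, not the component's defaults.
function useServerVad(dc: RTCDataChannel): void {
  dc.send(JSON.stringify({
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",        // or "semantic_vad" for semantic turn detection
        threshold: 0.6,            // higher = less sensitive to background noise
        prefix_padding_ms: 300,    // audio kept before the detected speech start
        silence_duration_ms: 700,  // silence required before the turn is closed
      },
    },
  }));
}
```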

 

AEC / AGC

For this POC, I focused on Realtime + VAD (semantic_vad and server_vad), because VAD is explicitly part of the OpenAI Realtime API interface.

 

By contrast, AEC/AGC are not exposed in the Realtime API. They are generally handled at the native WebRTC audio stack level (Chromium/Edge). To control or fine-tune them, one would need to bypass WebView2 and implement a native audio capture/playback chain instead of relying on getUserMedia in JS.
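
For reference, the only knobs getUserMedia itself offers are coarse on/off constraints for the Chromium audio processing; there is no JS-level fine-tuning beyond these flags, and how the Edge/WebView2 build honours them is not guaranteed. A minimal sketch:

```typescript
// Coarse on/off switches getUserMedia exposes for the built-in processing.
// No JS-level fine-tuning exists beyond these flags, and the Edge/WebView2
// build decides how (or whether) to honour them.
async function captureMic(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,  // AEC
      autoGainControl: true,   // AGC
      noiseSuppression: true,  // noise suppression
    },
  });
}
```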

 

That was clearly outside the scope of this experiment. My priority was to validate the WebRTC + DataChannel pipeline and conversational latency.

