Bitmaps to Video for Mediafoundation

tuyen · June 29

On 6/26/2025 at 4:18 AM, Kas Ob. said:


AudioSampleDuration := 1.0 / 48000;   // 20.833 microsecond (microsecond = 1/1000000 of second)
VideoFps := 30.0;
VideoFrameDuration := 1.0 / VideoFps; // 0.033333 ms

30 fps is not 1 frame per 0.0333333 ms.

It's 1 frame per 33.33333 ms, or 0.033333 seconds.
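For anyone who wants to double-check the units, here is a minimal Python sketch of the conversion. The function name `frame_duration` is made up for illustration; the only outside fact it relies on is that Media Foundation expresses sample times and durations in 100-nanosecond units.

```python
from fractions import Fraction

def frame_duration(fps):
    """Duration of one frame as (seconds, milliseconds, 100-ns ticks)."""
    seconds = Fraction(1) / Fraction(fps)        # exact rational arithmetic
    milliseconds = seconds * 1000
    ticks_100ns = round(seconds * 10_000_000)    # nearest whole 100-ns unit
    return float(seconds), float(milliseconds), ticks_100ns

secs, ms, ticks = frame_duration(30)
print(f"{secs:.6f} s = {ms:.3f} ms = {ticks} units of 100 ns")
```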

Edited June 29 by tuyen

Renate Schaaf · 2025-07-06T17:07:02Z

There is a new version at https://github.com/rmesch/Bitmaps2Video-for-Media-Foundation.

New stuff: Some rewrite of audio, making sure that gaps at the beginning of a stream are filled with silence. 2 optimized frame-rates for audio-synching, see below. Most importantly:

One can now run @Kas Ob.'s frame analysis from within the demo by enabling the hidden tab "Analysis". I just made the lines a bit shorter, since the rest only repeated the same values for every file I tested, as far as I could see. The file ffprobe.exe needs to be in the same directory as DemoWMF.exe. ffprobe is part of ffmpeg-git-essentials.7z on https://www.gyan.dev/ffmpeg/builds/.

I spent a good amount of time trying to figure out what I can and what I cannot control about audio-synching, tracing into the relevant code and running the analysis. My findings on audio-synching follow (beware, it's long):

The math assumes an audio-sample-rate of 48000 Hz, and all times are in seconds.

Audio-blockalign is always 4 Bytes for what I do.

There are at least 2 different meanings of "sample":

PCMSample: as in samples per second. ByteSize: Channels*BitsPerSample/8 = 2*16/8 = 4 Bytes. Time: 1/48000 s

IMFSample: Chunk of audio returned by IMFSourceReader.ReadSample. It contains a buffer holding a certain amount of uncompressed PCMsamples, and info like timestamp, duration, flags ...

The size of these samples varies a lot with the type of input. Some observed values:

.mp3-file 1:
Buffersize = 96 768 Bytes, Duration = 0.504 s (96768 Bytes = 96768/4 PCMSamples = 96768/4/48000 s, OK)
.mp3-file 2:
Buffersize = 35 108 Bytes, Duration = 0.1828532 s (35108/4/48000 = 0.182854166.., not OK)
.wmv-file:
Buffersize = 17 832 Bytes, Duration = 0.092875 s (17832/4/48000 = 0.092875, OK)
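The checks in parentheses can be reproduced with a few lines of Python. This is only a sketch using the three observed value pairs; the names and the tolerance of one 100-ns unit are my own assumptions, not something taken from the demo code.

```python
BLOCK_ALIGN = 4        # bytes per PCMSample: 2 channels * 16 bits / 8
SAMPLE_RATE = 48000    # PCMSamples per second

# (label, buffer size in bytes, reported duration in seconds) as observed above
observed = [
    ("mp3-file 1", 96768, 0.504),
    ("mp3-file 2", 35108, 0.1828532),
    ("wmv-file", 17832, 0.092875),
]

def expected_duration(buffer_bytes):
    """Duration implied by the buffer size alone."""
    return buffer_bytes / BLOCK_ALIGN / SAMPLE_RATE

def check(buffer_bytes, reported, tol=1e-7):
    """Return (block-aligned?, reported - expected, within tolerance?)."""
    aligned = buffer_bytes % BLOCK_ALIGN == 0
    diff = reported - expected_duration(buffer_bytes)
    return aligned, diff, abs(diff) <= tol   # tol is an assumed threshold

for name, size, duration in observed:
    aligned, diff, ok = check(size, duration)
    print(f"{name}: aligned={aligned} diff={diff:+.7f} s ok={ok}")
```

All three buffers are block-aligned. Only mp3-file 2 fails the duration check; its reported duration is roughly one microsecond shorter than the buffer size implies.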

Except for the first sample read, the values don't differ from sample to sample. Those are the samples I can write to the sinkwriter for encoding. Breaking them up seems like a bad idea. I have to trust MF to handle the writing correctly. The buffers seem to always be block-aligned. I've added some redundant variables in TBitmapEncoderWMF.WriteAudio so these values can be examined in the debugger.

A related quantity is the audio-frame. Like the video-stream, the audio-stream of a compressed video consists of frames. 1 audio-frame contains the compressed equivalent of 1024 PCMSamples. So:

AudioFrameDuration = 1024/48000 s = 0.021333 s
AudioFrameRate = 48000/1024 = 46.875 audio-frames per second

I can only control the writing of the video by feeding the IMFSamples of video and audio to the sinkwriter in good order. The samples I write to the sinkwriter are collected in a "Leaky-Bucket"-buffer. The encoder pulls out what it needs to write the next chunk of video. It hopefully waits until there are enough samples to write something meaningful. Problems arise if the bucket overflows. There need to be enough video- and audio-samples to correctly write both streams.

So here is the workflow, roughly (can be checked by stepping into TBitmapEncoderWMF.WriteOneFrame):

Check if the audio-time written so far is less than the timestamp of the next video-frame.
Yes: Pull audio-samples out of the sourcereader and write them to the sinkwriter until audio-time >= video-timestamp.
Looking at the durations above, one sample might already achieve this.
Write the next video-frame
Repeat

In the case of mp3-file 1 the reading and writing of 1 audio-sample would be followed by the writing of several video-samples.
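The loop described above can be simulated without any Media Foundation calls. The following Python sketch only models the bookkeeping (how much audio time has been written versus the next video timestamp). It is not the code of TBitmapEncoderWMF.WriteOneFrame, and the function name `interleave` is hypothetical.

```python
def interleave(audio_sample_duration, fps, video_frames):
    """Return the write order as a list of 'A' (audio IMFSample) and 'V' (video frame)."""
    order = []
    audio_time = 0.0                     # audio time written so far, in seconds
    for frame in range(video_frames):
        video_timestamp = frame / fps
        # Pull audio until it has caught up with the next video timestamp.
        while audio_time < video_timestamp:
            order.append("A")
            audio_time += audio_sample_duration
        order.append("V")
    return order

# mp3-file 1 delivers IMFSamples of 0.504 s; video runs at 30 fps.
print("".join(interleave(0.504, 30, 20)))
```

With these inputs, each audio-sample is followed by a run of about 15 video-frames before the next audio-sample is needed. With the much shorter samples of the .wmv-file (0.092875 s) the runs shrink to 2 or 3 video-frames.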

The encoder now breaks the bucket-buffer up into frames, compresses them and writes them to file. It does that following its own rules, which I have no control over. Frame-analysis can show the result:

A group of video-frames is followed by a group of audio-frames, which should cover the same time-interval as the video-frames. In the output I have seen so far, the audio-frame-period is always 15 audio-frames. For video-framerate 30, the video-frame-period is 9 or 10 frames. Why doesn't it make the audio- and video-periods smaller? No idea. My guess is that this is the amount of data players can handle nowadays, and that these periods are a compromise between optimal phase-locking of the audio- and video-periods and the buffer-size the player can handle. Theoretically, at framerate 30, 16 video-frames should phase-lock with 25 audio-frames.
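The 16:25 figure follows from reducing a fraction: n video-frames and m audio-frames cover the same time when n/fps = m*1024/48000. A small Python sketch of that calculation (the function name is made up for illustration):

```python
from fractions import Fraction

def smallest_phase_lock(video_fps, sample_rate=48000, samples_per_audio_frame=1024):
    """Smallest (video_frames, audio_frames) pair that spans exactly the same time."""
    # n / fps == m * spf / rate   <=>   n / m == fps * spf / rate
    ratio = Fraction(video_fps) * samples_per_audio_frame / sample_rate
    return ratio.numerator, ratio.denominator   # Fraction is always fully reduced

n, m = smallest_phase_lock(30)
print(n, "video-frames lock with", m, "audio-frames")
```

For comparison, the observed 15 audio-frames cover 15*1024/48000 = 0.32 s, which lies between 9 video-frames (0.3 s) and 10 video-frames (0.3333 s). That may be why the video-period alternates between 9 and 10.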

Here is one of those video-audio-groups. Video-framerate is 30.

video stream_index=0 key_frame=0 pts=39000 pts_time=1.300000 duration_time=0.033333
video stream_index=0 key_frame=0 pts=40000 pts_time=1.333333 duration_time=0.033333
video stream_index=0 key_frame=0 pts=41000 pts_time=1.366667 duration_time=0.033333
video stream_index=0 key_frame=0 pts=42000 pts_time=1.400000 duration_time=0.033333
video stream_index=0 key_frame=0 pts=43000 pts_time=1.433333 duration_time=0.033333
video stream_index=0 key_frame=0 pts=44000 pts_time=1.466667 duration_time=0.033333
video stream_index=0 key_frame=0 pts=45000 pts_time=1.500000 duration_time=0.033333
video stream_index=0 key_frame=0 pts=46000 pts_time=1.533333 duration_time=0.033333
video stream_index=0 key_frame=0 pts=47000 pts_time=1.566667 duration_time=0.033333
video stream_index=0 key_frame=0 pts=48000 pts_time=1.600000 duration_time=0.033333

audio stream_index=1 key_frame=1 pts=62992 pts_time=1.312333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=64016 pts_time=1.333667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=65040 pts_time=1.355000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=66064 pts_time=1.376333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=67088 pts_time=1.397667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=68112 pts_time=1.419000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=69136 pts_time=1.440333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=70160 pts_time=1.461667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=71184 pts_time=1.483000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=72208 pts_time=1.504333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=73232 pts_time=1.525667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=74256 pts_time=1.547000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=75280 pts_time=1.568333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=76304 pts_time=1.589667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=77328 pts_time=1.611000 duration_time=0.021333

pts stands for "presentation time stamp"; pts_time is the value of interest.
Video-time-interval: from 1.300000 to 1.600000+0.033333=1.633333
Audio-time-interval: from 1.312333 to 1.611000+0.021333=1.632333

By timestamp, the audio group starts about 12.3 ms after the video group and ends 1 ms before it. The audio pts should be multiples of 1024, but they aren't, hmm. Consecutive pts still differ by 1024, but they are phase-shifted. Phase-shift is 62992 mod 1024 = 528 (or -496).
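The phase-shift can be checked against all 15 audio pts of the group listed above. A short Python sketch (variable and function names are my own):

```python
AUDIO_FRAME_SAMPLES = 1024

# Audio pts values copied from the ffprobe listing above (time base 1/48000 s).
audio_pts = [62992, 64016, 65040, 66064, 67088, 68112, 69136, 70160,
             71184, 72208, 73232, 74256, 75280, 76304, 77328]

def phase_shift(pts, frame=AUDIO_FRAME_SAMPLES):
    """Offset of pts from the previous and from the next multiple of the frame size."""
    shift = pts % frame
    return shift, shift - frame

steps = {b - a for a, b in zip(audio_pts, audio_pts[1:])}
shifts = {phase_shift(p) for p in audio_pts}
print("pts steps:", steps)
print("phase shifts:", shifts)
```

Every step is 1024 and every pts has the same offset, so the whole audio-stream is shifted by a constant 528 PCMSamples, which is 528/48000 = 0.011 s.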

A group from a bit further along:

Video: from 8.066667 to 8.366667+0.033333=8.400000
Audio: from 8.053667 to 8.352333+0.021333=8.373666
pts-phase-shift: still 528 (-496)

Audio is lagging behind: its group starts 13 ms before the video group and ends about 26 ms before it.
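To put numbers on both groups, the start and end offsets between the audio and video intervals can be computed directly from the pts_time values quoted above. A minimal Python sketch; the helper names are hypothetical, and a positive offset means the audio timestamp is later than the video timestamp.

```python
def group_interval(first_pts_time, last_pts_time, duration_time):
    """Time interval covered by a group: start of first frame to end of last frame."""
    return first_pts_time, last_pts_time + duration_time

def offsets_ms(video, audio):
    """(start offset, end offset) of audio relative to video, in milliseconds."""
    return (audio[0] - video[0]) * 1000, (audio[1] - video[1]) * 1000

groups = {
    "group at 1.3 s": (group_interval(1.300000, 1.600000, 0.033333),
                       group_interval(1.312333, 1.611000, 0.021333)),
    "group at 8.07 s": (group_interval(8.066667, 8.366667, 0.033333),
                        group_interval(8.053667, 8.352333, 0.021333)),
}

for name, (video, audio) in groups.items():
    start, end = offsets_ms(video, audio)
    print(f"{name}: start {start:+.3f} ms, end {end:+.3f} ms")
```

For the first group this gives about +12.333 ms at the start and -1.000 ms at the end; for the later group about -13.000 ms and -26.334 ms.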

To really see what is happening I will have to implement better statistics than just looking at things 🙂

One further test: I tried to phase-lock audio and video optimally:

Let f be the video-frame-rate. The audio-frame-rate is 48000/1024, so for a 1:1 lock f = 48000/1024 = 46.875. I've added this frame-rate to the demo.

Result: Perfect sync for the first audio-video group. In the middle of the second group the pts-phase-shift is again 528, and audio lags behind. For the rest of the groups the lag doesn't get bigger, it is always corrected to some degree. But the file should have identical audio and video timestamps in the first place!

There is another new frame-rate, which is the result of trying to phase-lock 2 video-frames to 3 audio-frames: 2/f = 3*1024/48000 results in f = 2*48000/3/1024 = 31.25.
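Both new frame-rates come out of the same formula: p video-frames lock with q audio-frames when p/f = q*1024/48000. A small Python sketch (the function name `locked_frame_rate` is made up for illustration):

```python
from fractions import Fraction

def locked_frame_rate(video_frames, audio_frames,
                      sample_rate=48000, samples_per_audio_frame=1024):
    """Video frame rate f for which video_frames / f == audio_frames * spf / rate."""
    return Fraction(video_frames * sample_rate,
                    audio_frames * samples_per_audio_frame)

rate_1_to_1 = locked_frame_rate(1, 1)   # 1 video-frame per audio-frame
rate_2_to_3 = locked_frame_rate(2, 3)   # 2 video-frames per 3 audio-frames
print(float(rate_1_to_1), float(rate_2_to_3))   # 46.875 31.25
```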

I will try to find out what causes the phase-shift in audio by parsing the ffprobe-output a bit more (sigh). Maybe generate a log-file for the samples written, too. (Sigh). No, so far it's still fun.

For those who made it this far: Thanks for your patience.

Renate

Edited 13 hours ago by Renate Schaaf
Trying to get rid of the strikethrough
