Renate Schaaf

Bitmaps to Video for Media Foundation

On 6/26/2025 at 4:18 AM, Kas Ob. said:

AudioSampleDuration := 1.0 / 48000;   // 20.833 microsecond (microsecond = 1/1000000 of second)
VideoFps := 30.0;
VideoFrameDuration := 1.0 / VideoFps; // 0.033333 ms

30 fps is not 1 frame per 0.0333333 ms.

It's 1 frame per 33.33333 ms, or 0.033333 seconds.
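For reference, the quoted snippet with corrected comments would read:

AudioSampleDuration := 1.0 / 48000;   // 20.833 microseconds (1 microsecond = 1/1000000 s)
VideoFps := 30.0;
VideoFrameDuration := 1.0 / VideoFps; // 0.033333 s, i.e. 33.333 ms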

tuyen

There is a new version at https://github.com/rmesch/Bitmaps2Video-for-Media-Foundation.

 

New stuff: a partial rewrite of the audio code, making sure that gaps at the beginning of a stream are filled with silence, and two optimized frame rates for audio syncing, see below. Most importantly:

One can now run @Kas Ob.'s frame analysis from within the demo by enabling the hidden tab "Analysis". I just made the lines a bit shorter, since the rest just repeated the same values in all my tests, as far as I could see. The file ffprobe.exe needs to be in the same directory as DemoWMF.exe; ffprobe is part of ffmpeg-git-essentials.7z at https://www.gyan.dev/ffmpeg/builds/.

 

I spent a good amount of time trying to figure out what I can and cannot control about audio syncing, tracing into the relevant code and running the analysis. Results of the audio re-synching follow (beware, it's long):


The math below assumes an audio sample rate of 48000 Hz; the time units are all seconds.

Audio block-align is always 4 bytes for what I do.

 

There are at least 2 different meanings of "sample":

PCMSample: as in samples per second. ByteSize: Channels*BitsPerSample/8 = 2*16/8 = 4 Bytes. Time: 1/48000 s

 

IMFSample: a chunk of audio returned by IMFSourceReader.ReadSample. It contains a buffer holding a certain number of uncompressed PCM samples, plus info like timestamp, duration, flags ...

The size of these samples varies a lot with the type of input. Some observed values:

 

.mp3-file 1:
Buffersize = 96768 bytes    Duration = 0.504     (96768 bytes = 96768/4 PCM samples = 96768/4/48000 s, OK)
.mp3-file 2:
Buffersize = 35108 bytes    Duration = 0.1828532 (35108/4/48000 = 0.18285416..., not OK)
.wmv-file:
Buffersize = 17832 bytes    Duration = 0.092875  (17832/4/48000 = 0.092875, OK)

 

Except for the first sample read, the values don't differ from sample to sample. These are the samples I can write to the sinkwriter for encoding. Breaking them up seems like a bad idea; I have to trust MF to handle the writing correctly. The buffers always seem to be block-aligned. I've added some redundant variables in TBitmapEncoderWMF.WriteAudio so these values can be examined in the debugger.
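As a plausibility check, the arithmetic can be put into a few lines of Delphi. This is just a sketch, not part of TBitmapEncoderWMF; the constants are the values stated above:

const
  SampleRate = 48000; // PCM samples per second
  BlockAlign = 4;     // bytes per PCM sample: 2 channels * 16 bits / 8

// Expected duration in seconds of an IMFSample buffer of BufferSize bytes.
function ExpectedDuration(BufferSize: Cardinal): Double;
begin
  Result := BufferSize / BlockAlign / SampleRate;
end;

// ExpectedDuration(96768) = 0.504        -> reported 0.504     (OK)
// ExpectedDuration(35108) = 0.18285416.. -> reported 0.1828532 (not OK)
// ExpectedDuration(17832) = 0.092875     -> reported 0.092875  (OK)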

 

A related quantity is audio frames. Like the video stream, the audio stream of a compressed video consists of audio frames. One audio frame contains the compressed equivalent of 1024 PCM samples. So:

AudioFrameDuration = 1024/48000 s ≈ 0.021333 s   AudioFrameRate = 48000/1024 = 46.875 frames/s

 

I can only control the writing of the video by feeding the IMFSamples of video and audio to the sinkwriter in good order. The samples I write to the sinkwriter are collected in a "leaky bucket" buffer. The encoder pulls out what it needs to write the next chunk of video, and hopefully waits until there are enough samples to write something meaningful. Problems arise if the bucket overflows. There need to be enough video and audio samples to correctly write both streams.

 

So here is the workflow, roughly (it can be checked by stepping into TBitmapEncoderWMF.WriteOneFrame); a sketch in code follows below:

  Check whether the audio time written so far is less than the timestamp of the next video frame.
     Yes: Pull audio samples out of the sourcereader and write them to the sinkwriter until audio time >= video timestamp.
          Looking at the durations above, one sample might already achieve this.
     Write the next video frame.
  Repeat.

In the case of mp3-file 1, the reading and writing of one audio sample would be followed by the writing of several video frames.
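Put into code, the loop might look roughly like this sketch. It is not the actual TBitmapEncoderWMF code: ReadAudioSample, WriteAudioSample and WriteVideoFrame are hypothetical stand-ins for the sourcereader/sinkwriter calls, and times are in seconds for clarity.

// Sketch only. ReadAudioSample pulls the next IMFSample and returns its
// duration in seconds (0 when the stream is exhausted); WriteAudioSample
// and WriteVideoFrame hand the data to the sinkwriter.
procedure WriteInterleaved(FrameCount: Integer; VideoFrameDuration: Double);
var
  AudioTime, VideoTime, SampleDuration: Double;
  i: Integer;
begin
  AudioTime := 0.0;
  VideoTime := 0.0;
  for i := 1 to FrameCount do
  begin
    // Top up audio until it reaches the next video timestamp.
    // One IMFSample can cover several video frames (see mp3-file 1 above).
    while AudioTime < VideoTime do
    begin
      SampleDuration := ReadAudioSample;
      if SampleDuration <= 0.0 then
        Break; // audio exhausted
      WriteAudioSample;
      AudioTime := AudioTime + SampleDuration;
    end;
    WriteVideoFrame(VideoTime);
    VideoTime := VideoTime + VideoFrameDuration;
  end;
end;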

 

The encoder now breaks the bucket buffer up into frames, compresses them and writes them to the file. It does that following its own rules, which I have no control over. Frame analysis can show the result:

A group of video frames is followed by a group of audio frames, which should cover the same time interval as the video frames. In the output I have seen so far, the audio-frame period is always 15 audio frames. For video framerate 30, the video-frame period is 9 or 10 frames. Why doesn't it make the audio and video periods smaller? No idea. I guess that's the amount of info players can handle nowadays, and these periods are a compromise between optimal phase-locking of the audio and video periods and the buffer size a player can handle. Theoretically, at framerate 30, 16 video frames should phase-lock with 25 audio frames: 16/30 s = 25·1024/48000 s = 0.5333... s.

 

Here is one of those video-audio groups. The video framerate is 30.

video stream_index=0 key_frame=0 pts=39000 pts_time=1.300000  duration_time=0.033333
video stream_index=0 key_frame=0 pts=40000 pts_time=1.333333  duration_time=0.033333
video stream_index=0 key_frame=0 pts=41000 pts_time=1.366667  duration_time=0.033333
video stream_index=0 key_frame=0 pts=42000 pts_time=1.400000  duration_time=0.033333
video stream_index=0 key_frame=0 pts=43000 pts_time=1.433333  duration_time=0.033333
video stream_index=0 key_frame=0 pts=44000 pts_time=1.466667  duration_time=0.033333
video stream_index=0 key_frame=0 pts=45000 pts_time=1.500000  duration_time=0.033333
video stream_index=0 key_frame=0 pts=46000 pts_time=1.533333  duration_time=0.033333
video stream_index=0 key_frame=0 pts=47000 pts_time=1.566667  duration_time=0.033333
video stream_index=0 key_frame=0 pts=48000 pts_time=1.600000  duration_time=0.033333

audio stream_index=1 key_frame=1 pts=62992 pts_time=1.312333  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=64016 pts_time=1.333667  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=65040 pts_time=1.355000  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=66064 pts_time=1.376333  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=67088 pts_time=1.397667  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=68112 pts_time=1.419000  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=69136 pts_time=1.440333  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=70160 pts_time=1.461667  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=71184 pts_time=1.483000  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=72208 pts_time=1.504333  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=73232 pts_time=1.525667  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=74256 pts_time=1.547000  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=75280 pts_time=1.568333  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=76304 pts_time=1.589667  duration_time=0.021333
audio stream_index=1 key_frame=1 pts=77328 pts_time=1.611000  duration_time=0.021333

"pts" stands for "presentation time stamp"; pts_time is the value of interest.
Video time interval: from 1.300000 to 1.600000 + 0.033333 = 1.633333
Audio time interval: from 1.312333 to 1.611000 + 0.021333 = 1.632333

Audio is a bit ahead at the beginning and a tiny bit behind at the end. The audio pts should be multiples of 1024, but they aren't, hmm. The difference between consecutive pts is still 1024, but they are phase-shifted. The phase-shift is 62992 mod 1024 = 528 (or -496).

 

The interval from a bit further ahead:

Video: From 8.066667 to 8.366667+0.033333=8.400000
Audio: From 8.053667 to 8.352333+0.021333=8.373666  pts-phase-shift: still 528 (-496)

Audio is lagging behind.

 

To really see what is happening I will have to implement better statistics than just looking at things 🙂

 

One further test: I tried to phase-lock audio and video optimally:

VideoFrameRate f = AudioFrameRate = 48000/1024 = 46.875. I've added this frame rate to the demo.

Result: perfect sync for the first audio-video group. In the middle of the second group, the pts phase-shift is again 528, and audio lags behind. For the rest of the groups the lag doesn't get bigger; it is always corrected to some degree. But the file should have identical audio and video timestamps in the first place!

 

There is another new frame rate, which is the result of trying to phase-lock 2 video frames to 3 audio frames: 2/f = 3·1024/48000 gives f = 2·48000/(3·1024) = 31.25. (In general, m video frames lock to n audio frames at f = m·48000/(n·1024).)

 

I will try to find out what causes the phase-shift in the audio by parsing the ffprobe output a bit more (sigh). Maybe also generate a log file for the samples written. (Sigh.) No, so far it's still fun. :classic_biggrin:
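A minimal sketch of such a parse, assuming the ffprobe output has been saved line by line to a text file in the format of the listing above (the procedure name is mine):

uses
  System.Classes, System.SysUtils, System.StrUtils;

// Print the pts phase (pts mod 1024) of every audio frame in a saved
// ffprobe dump with lines like the listing above.
procedure ReportAudioPhase(const FileName: string);
var
  Lines: TStringList;
  Line, Token: string;
  pts: Int64;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(FileName);
    for Line in Lines do
      if StartsText('audio', Line) then
        for Token in Line.Split([' ']) do
          if Token.StartsWith('pts=') then
          begin
            pts := StrToInt64(Token.Substring(4)); // text after 'pts='
            Writeln(Format('pts=%d  phase=%d', [pts, pts mod 1024]));
          end;
  finally
    Lines.Free;
  end;
end;

For the listing above, every audio line would report phase=528.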

For those who made it this far: thanks for your patience.

Renate  


I think I solved the audio syncing... kind of.

 

First observation: audio and video are perfectly synced if the audio comes from a .wav file. You can check this using the optimal frame rates 46.875 or 31.25. So for optimal syncing, compressed audio should be converted to .wav first. I have added a routine to uTransformer.pas which does this, and in the demo there are some checkboxes to try it out.

 

Second observation: for compressed input, the phase-shift in the audio happens exactly at the boundaries of the IMFSamples read in. So this is what I think happens: the encoder doesn't like the buffer size of these samples and throws away some bytes at the end.

This causes a gap in the audio stream and a phase-shift in the timing. I have a notorious video where you can actually hear these gaps after re-encoding. If I transform the audio to .wav first, the gaps are gone. One could try to safekeep the thrown-away bytes and pad them onto the beginning of the next sample, fixing up the timestamps... Is that what you were suggesting, @Kas Ob.? Well, I don't think I could do it anyway :).
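For the record, the byte-carry idea might look like this sketch: pure buffer bookkeeping, no Media Foundation calls, and the timestamps would still need fixing up separately.

uses
  System.SysUtils; // TBytes

const
  BlockAlign = 4; // bytes per PCM sample, as above

// Join the carried-over tail of the previous buffer with the new data,
// return the largest block-aligned prefix for writing, and keep the new
// misaligned tail in Carry for the next call.
function TakeAligned(var Carry: TBytes; const Data: TBytes): TBytes;
var
  Total, Keep: Integer;
  Joined: TBytes;
begin
  Total := Length(Carry) + Length(Data);
  SetLength(Joined, Total);
  if Length(Carry) > 0 then
    Move(Carry[0], Joined[0], Length(Carry));
  if Length(Data) > 0 then
    Move(Data[0], Joined[Length(Carry)], Length(Data));
  Keep := Total - Total mod BlockAlign; // largest block-aligned length
  Result := Copy(Joined, 0, Keep);
  Carry := Copy(Joined, Keep, Total - Keep);
end;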

 

So right now, first transforming the audio input to .wav is the best I can come up with. For what I use this for it's fine, because I mix all the audio into one big .wav before encoding anyway.

 

Renate

13 hours ago, Renate Schaaf said:

I think I solved the audio syncing... kind of.

In 10 tests I did, it is synced: the difference is 4 ms at the beginning, 4 ms in the middle, and still 4 ms at the end. That is very accurate, considering that the acceptable desync between audio and video is constrained and small:

https://video.stackexchange.com/questions/25064/by-how-much-can-video-and-audio-be-out-of-sync

 

What is still perplexing me:

1) Why are the frames grouped? I added something like OutputDebugString(PChar('Audio: ' + IntToStr(AudioSampleDuration))); before SafeRelease, and the same for video. The debug output clearly shows the frames interleaved one by one, beautiful interleaving! Yet in the resulting video the frames are grouped. So it might be something to do with WMF and its codec, or some missing setting somewhere; in other words, your code is doing it right.

 

2) The duration at 1000: I am not talking about the timestamp, but the relation of the numerator to the video frames is 1000. I tried to tweak things and it didn't change; I even used the recommended 10m instead of the 1m you are using, and it still didn't change. So, like the above, this might be a setting, or a bit/frame/packet limitation specific to this very codec. One test video is 60gps with a duration of 200; the output is 1000 at 30fps, while it should be 400.

 

14 hours ago, Renate Schaaf said:

This causes a gap in the audio stream and a phase-shift in the timing. ... One could try to safekeep the thrown-away bytes and pad them onto the beginning of the next sample, fixing up the timestamps... Is that what you were suggesting, @Kas Ob.?

Yes, in a way. If there is a gap, the audio is distorted and the final video is bad or of low quality. So yes, decoding the audio into PCM from some exotic audio format and then using a more standard audio codec from WMF is the best way to keep the quality.

 

Anyway, here is a nice answer on SO leading to a very beautiful SDK; you might find it very useful:

https://stackoverflow.com/questions/41326231/network-media-sink-in-microsoft-media-foundation

https://www.codeproject.com/Articles/1017223/CaptureManager-SDK-Capturing-Recording-and-Streami#twentythirddemoprogram

 

Now, you might ask why I keep looking at this drift between audio and video.

The answer: a long time ago I wanted to know how media players could read huge chunks of data from a slow HDD, decode them, and render them. Everything about that is irrelevant here except one behavior you can watch: players like WMP and VLC do a strange thing. They read the header of the video, load huge buffers from the beginning, then seek to the end of the file and load another huge chunk. From the end they try to see how much the streams have drifted, and only after that do they play. Those players have seen it all, so they do their own resyncing tricks: when the video/audio streams are desynced and it is possible, they adjust and cover it up (fix it).

Why is this relevant here, if all modern players in use do this and fix things? Because it fails when you stream the video: there is no way to seek to the end, so the player plays what it gets, whether over WebRTC, RTMP, RTSP...

Think of video conferencing, webcams, or security cams received by a server that encodes and saves the videos while allowing the user to monitor one or more cams online. Audio and video syncing is important there, and player tricks will not help.

 

Anyway, nice work, and thank you; you did a really nice job.

52 minutes ago, Kas Ob. said:

yet in the resulting video the frames are grouped, so it might be something to do with WMF and its codec, or some missing setting somewhere

I've been wondering about the same; maybe the setup isn't right. But I find it so hard even to figure out which settings you can specify.

 

56 minutes ago, Kas Ob. said:

2) The duration at 1000: ... I even used the recommended 10m instead of the 1m you are using ... one test video is 60gps with a duration of 200; the output is 1000 at 30fps, while it should be 400.

You lost me here. What 10m, and what's gps?

 

58 minutes ago, Kas Ob. said:

Anyway, here is a nice answer on SO leading to a very beautiful SDK; you might find it very useful

Indeed, that looks interesting. Thanks, your links have already helped me a lot to understand things better.

 

Renate

Share this post


Link to post
25 minutes ago, Renate Schaaf said:
You lost me here. What 10m, and what's gps?

Honestly, I lost myself reading that: fps, not gps (stupid autocorrect and clumsy fingers). And 10m = 10000000 vs 1m = 1000000, as the denominator for the rate at setup.
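For context, and this is my reading rather than something stated in the thread: Media Foundation expresses sample times and durations in 100-ns units, i.e. 10,000,000 ticks per second, which is presumably where the 10m figure comes from. In those units:

program TimeBase;
{$APPTYPE CONSOLE}
const
  TicksPerSecond = 10000000; // MF's 100-ns timebase: "10m" ticks per second
begin
  // One video frame at 30 fps:
  Writeln(TicksPerSecond div 30);                  // 333333 ticks
  // One audio frame (1024 PCM samples at 48000 Hz):
  Writeln(Int64(1024) * TicksPerSecond div 48000); // 213333 ticks = 0.0213333 s
end.

The 213333 ticks match the duration_time=0.021333 of the audio frames in the listing further up.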

