tuyen 1 Posted June 29 (edited)

On 6/26/2025 at 4:18 AM, Kas Ob. said:

AudioSampleDuration := 1.0 / 48000; // 20.833 microseconds (1 microsecond = 1/1000000 of a second)
VideoFps := 30.0;
VideoFrameDuration := 1.0 / VideoFps; // 0.033333 ms

30 fps is not 1 frame per 0.033333 ms. It's 1 frame per 33.33333 ms, or 0.033333 seconds.

Edited June 29 by tuyen
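The correction is simple unit arithmetic; a quick check (plain Python, not from the thread):

```python
# Frame/sample durations at the rates discussed in the thread.
audio_sample_duration_s = 1.0 / 48000   # seconds per PCM sample
video_frame_duration_s = 1.0 / 30.0     # seconds per video frame

# Convert to the units used in the code comments above.
print(audio_sample_duration_s * 1e6)    # ~20.833 microseconds per audio sample
print(video_frame_duration_s * 1e3)     # ~33.333 milliseconds per frame, not 0.0333 ms
```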
Renate Schaaf 74 Posted Sunday at 05:07 PM (edited)

There is a new version at https://github.com/rmesch/Bitmaps2Video-for-Media-Foundation. New stuff:

Some rewrite of audio, making sure that gaps at the beginning of a stream are filled with silence.

2 optimized frame-rates for audio-synching, see below.

Most importantly: One can now run @Kas Ob.'s frame analysis from within the demo, if one enables the hidden tab "Analysis". I just made the lines a bit shorter, as the rest was just repeating the same values for all I tested, as far as I could see. The file ffprobe.exe needs to be in the same directory as DemoWMF.exe. ffprobe is part of ffmpeg-git-essentials.7z on https://www.gyan.dev/ffmpeg/builds/.

I spent a good amount of time trying to figure out what I can and what I cannot control about audio-synching, tracing into the relevant code and running the analysis. Results of the audio-resynching follow (beware, it's long):

The math is for an audio-sample-rate of 48000, and the time units are all seconds. Audio-blockalign is always 4 bytes for what I do.

There are at least 2 different meanings of "sample":

PCMSample: as in samples per second. ByteSize: Channels*BitsPerSample/8 = 2*16/8 = 4 bytes. Time: 1/48000 s.

IMFSample: Chunk of audio returned by IMFSourceReader.ReadSample. It contains a buffer holding a certain amount of uncompressed PCM-samples, and info like timestamp, duration, flags ... The size of these samples varies a lot with the type of input. Some observed values:

.mp3-file 1: Buffersize = 96 768 bytes, Duration = 0.504 (96768 bytes = 96768/4 PCM-samples = 96768/4/48000 s, OK)
.mp3-file 2: Buffersize = 35 108 bytes, Duration = 0.1828532 (35108/4/48000 = 0.182854166..., not OK)
.wmv-file: Buffersize = 17 832 bytes, Duration = 0.092875 (17832/4/48000 = 0.092875, OK)

Except for the first sample read, the values don't differ from sample to sample. Those are the samples I can write to the sinkwriter for encoding. Breaking them up seems like a bad idea.
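The buffer-duration arithmetic above can be sanity-checked with a few lines (plain Python, not part of the repo), reproducing the three observed cases:

```python
# Duration of an uncompressed PCM buffer: bytes / block_align / sample_rate,
# where block_align = channels * bits_per_sample / 8 = 2 * 16 / 8 = 4 bytes.
SAMPLE_RATE = 48000
BLOCK_ALIGN = 4

def buffer_duration_s(buffer_bytes: int) -> float:
    return buffer_bytes / BLOCK_ALIGN / SAMPLE_RATE

print(buffer_duration_s(96768))  # 0.504       -> matches the reported duration
print(buffer_duration_s(35108))  # 0.18285416..-> reported 0.1828532, slightly off
print(buffer_duration_s(17832))  # 0.092875    -> matches the reported duration
```

The mp3-file-2 mismatch (0.1828532 reported vs 0.182854166... computed) is exactly the kind of inconsistency the later posts trace back to the encoder.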
I have to trust MF to handle the writing correctly. The buffers seem to always be block-aligned. I've added some redundant variables in TBitmapEncoderWMF.WriteAudio so these values can be examined in the debugger.

A related quantity are audio-frames. Similarly to the video-stream, the audio-stream of a compressed video consists of audio-frames. 1 audio-frame contains the compressed equivalent of 1024 PCM-samples. So:

AudioFrameDuration = 1024/48000
AudioFrameRate = 48000/1024

I can only control the writing of the video by feeding the IMFSamples of video and audio to the sinkwriter in good order. The samples I write to the sinkwriter are collected in a "leaky bucket" buffer. The encoder pulls out what it needs to write the next chunk of video. It hopefully waits until there are enough samples to write something meaningful. Problems arise if the bucket overflows. There need to be enough video- and audio-samples to correctly write both streams. So here is the workflow, roughly (it can be checked by stepping into TBitmapEncoderWMF.WriteOneFrame):

Check if the audio-time written so far is less than the timestamp of the next video-frame.
Yes: Pull audio-samples out of the sourcereader and write them to the sinkwriter until audio-time >= video-timestamp. Looking at the durations above, one sample might already achieve this.
Write the next video-frame.
Repeat.

In the case of mp3-file 1, the reading and writing of 1 audio-sample would be followed by the writing of several video-samples. The encoder now breaks the bucket-buffer up into frames, compresses them and writes them to file. It does that following its own rules, which I have no control over. Frame-analysis can show the result: A group of video-frames is followed by a group of audio-frames, which should cover the same time-interval as the video-frames. In the output I have seen so far, the audio-frame-period is always 15 audio-frames. For video-framerate 30, the video-frame-period is 9 or 10 frames.
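The interleaving workflow above can be sketched as a loop. This is a simplified Python model of the ordering logic only (the real code is Delphi in TBitmapEncoderWMF.WriteOneFrame; the audio duration here is taken from the mp3-file-1 example purely for illustration):

```python
# Simplified model of the muxing order: keep the written audio-time at or ahead
# of the next video timestamp, then emit the video frame.
VIDEO_FRAME_DURATION = 1.0 / 30.0
AUDIO_SAMPLE_DURATION = 0.504   # e.g. the .mp3-file-1 IMFSample duration

def interleave(n_video_frames: int) -> list:
    order = []
    audio_time = 0.0
    video_time = 0.0
    for _ in range(n_video_frames):
        # Pull audio until it has caught up with the next video timestamp.
        while audio_time < video_time:
            order.append('A')
            audio_time += AUDIO_SAMPLE_DURATION
        order.append('V')
        video_time += VIDEO_FRAME_DURATION
    return order

# One audio-sample is followed by a run of video frames, as described above.
print(''.join(interleave(32)))
```

With these durations, each 'A' covers roughly 15 video frames, which matches the "1 audio-sample followed by several video-samples" observation.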
Why doesn't it make the audio- and video-periods smaller? No idea. My guess is that this is the amount of info the players can handle nowadays, and these periods are a compromise between optimal phase-locking of audio- and video-periods and the buffer-size the player can handle. Theoretically, at framerate 30, 16 video-frames should phase-lock with 25 audio-frames (16/30 = 25*1024/48000).

Here is one of those video-audio-groups. Video-framerate is 30.

video stream_index=0 key_frame=0 pts=39000 pts_time=1.300000 duration_time=0.033333
video stream_index=0 key_frame=0 pts=40000 pts_time=1.333333 duration_time=0.033333
video stream_index=0 key_frame=0 pts=41000 pts_time=1.366667 duration_time=0.033333
video stream_index=0 key_frame=0 pts=42000 pts_time=1.400000 duration_time=0.033333
video stream_index=0 key_frame=0 pts=43000 pts_time=1.433333 duration_time=0.033333
video stream_index=0 key_frame=0 pts=44000 pts_time=1.466667 duration_time=0.033333
video stream_index=0 key_frame=0 pts=45000 pts_time=1.500000 duration_time=0.033333
video stream_index=0 key_frame=0 pts=46000 pts_time=1.533333 duration_time=0.033333
video stream_index=0 key_frame=0 pts=47000 pts_time=1.566667 duration_time=0.033333
video stream_index=0 key_frame=0 pts=48000 pts_time=1.600000 duration_time=0.033333
audio stream_index=1 key_frame=1 pts=62992 pts_time=1.312333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=64016 pts_time=1.333667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=65040 pts_time=1.355000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=66064 pts_time=1.376333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=67088 pts_time=1.397667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=68112 pts_time=1.419000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=69136 pts_time=1.440333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=70160 pts_time=1.461667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=71184 pts_time=1.483000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=72208 pts_time=1.504333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=73232 pts_time=1.525667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=74256 pts_time=1.547000 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=75280 pts_time=1.568333 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=76304 pts_time=1.589667 duration_time=0.021333
audio stream_index=1 key_frame=1 pts=77328 pts_time=1.611000 duration_time=0.021333

pts stands for "presentation time stamp", and pts_time is of interest.

Video-time-interval: from 1.300000 to 1.600000+0.033333=1.633333
Audio-time-interval: from 1.312333 to 1.611000+0.021333=1.632333

Audio is a bit ahead at the beginning and a tiny bit behind at the end. The pts should be multiples of 1024, but they aren't, hmm. The difference is still 1024, but they are phase-shifted. The phase-shift is 62992 mod 1024 = 528 (or -496).

The interval from a bit further ahead:

Video: From 8.066667 to 8.366667+0.033333=8.400000
Audio: From 8.053667 to 8.352333+0.021333=8.373666
pts-phase-shift: still 528 (-496)

Audio is lagging behind. To really see what is happening I will have to implement better statistics than just looking at things 🙂

One further test: I tried to phase-lock audio and video optimally: VideoFrameRate = AudioFrameRate, so f = 48000/1024 = 46.875. I've added this frame-rate to the demo. Result: Perfect sync for the first audio-video group. In the middle of the second group the pts-phase-shift is again 528, and audio lags behind. For the rest of the groups the lag doesn't get bigger, it is always corrected to some degree. But the file should have identical audio and video timestamps in the first place!

There is another new frame-rate, which is the result of trying to phase-lock 2 video-frames to 3 audio-frames.
2/f = 3*1024/48000 results in f = 2*48000/3/1024 = 31.25.

I will try to find out what causes the phase-shift in audio by parsing the ffprobe-output a bit more (sigh). Maybe generate a log-file for the samples written, too. (Sigh). No, so far it's still fun.

For those who made it up to here: Thanks for your patience.

Renate

Edited Sunday at 05:26 PM by Renate Schaaf Trying to get rid of the strikethrough
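The phase-lock and phase-shift arithmetic in the post above can be verified with a few lines (plain Python, not from the repo):

```python
# One AAC audio-frame holds 1024 PCM samples at 48000 Hz.
AUDIO_FRAME_RATE = 48000 / 1024          # = 46.875 audio-frames per second

# The two optimal frame-rates derived in the post:
print(AUDIO_FRAME_RATE)                  # 46.875: 1 video-frame per audio-frame
print(2 * 48000 / 3 / 1024)              # 31.25:  2 video-frames per 3 audio-frames

# At framerate 30, 16 video-frames should phase-lock with 25 audio-frames:
assert abs(16 / 30 - 25 * 1024 / 48000) < 1e-12

# Observed audio pts are offset from multiples of 1024:
print(62992 % 1024)                      # 528, the reported phase-shift
```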
Renate Schaaf 74 Posted Tuesday at 05:55 PM

I think I solved the audio-syncing ... kind of.

First observation: Audio and video are perfectly synced if the audio comes from a .wav-file. You can check this using the optimal frame-rates 46.875 or 31.25. So for optimal synching, compressed audio should be converted to .wav first. I have added a routine in uTransformer.pas which does this. In the demo there are some checkboxes to try this out.

Second observation: For compressed input, the phase-shift in audio happens exactly at the boundaries of the IMFSamples read in. So this is what I think happens: The encoder doesn't like the buffer-size of these samples and throws away some bytes at the end. This causes a gap in the audio-stream and a phase-shift in the timing. I have a notorious video where you can actually hear these gaps after re-encoding. If I transform the audio to .wav first, the gaps are gone.

One could try to safekeep the thrown-away bytes and pad them to the beginning of the next sample, fixing up the time-stamps... Is that what you were suggesting, @Kas Ob.? Well, I don't think I could do it anyway :). So right now, first transforming audio-input to .wav is the best I can come up with. For what I use this for it's fine, because I mix all the audio into one big .wav before encoding.

Renate
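The "safekeep the thrown-away bytes" idea amounts to re-chunking the decoded byte stream so every buffer handed to the encoder has a size it accepts, carrying leftovers into the next buffer instead of letting them be dropped. A minimal sketch of that carry-over re-buffering (illustrative Python; the real code would operate on IMFSample buffers in Delphi, and the preferred chunk size here is a made-up assumption):

```python
# Re-chunk incoming buffers into fixed-size, block-aligned output buffers,
# carrying leftover bytes over to the next chunk so no bytes are dropped.
PREFERRED_CHUNK = 4096  # hypothetical block-aligned size the encoder accepts

def rechunk(buffers):
    carry = b''
    for buf in buffers:
        carry += buf
        while len(carry) >= PREFERRED_CHUNK:
            yield carry[:PREFERRED_CHUNK]
            carry = carry[PREFERRED_CHUNK:]
    if carry:  # flush the remaining tail at end of stream
        yield carry

# Two odd-sized buffers (the mp3-file-2 size) come out as uniform chunks.
chunks = list(rechunk([bytes(35108), bytes(35108)]))
print(sum(len(c) for c in chunks))  # 70216, same total as the input
```

Fixing up the timestamps on top of this (each output buffer's start time derives from the running byte count, not from the source sample's timestamp) is the part the post considers too fiddly, which makes the decode-to-.wav route the pragmatic choice.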
Kas Ob. 147 Posted 22 hours ago

13 hours ago, Renate Schaaf said:
I think I solved the audio-syncing ... kind of.

In 10 tests I did, it is synced, and the difference is 4 ms at the beginning, 4 ms in the middle and still 4 ms at the end. That is very accurate, considering that the acceptable desyncing between audio and video is constrained and small: https://video.stackexchange.com/questions/25064/by-how-much-can-video-and-audio-be-out-of-sync

What is still perplexing me is:

1) Why are the frames grouped? I added something like "OutputDebugString(PChar('Audio: '+IntToStr(AudioSampleDuration)));" before SafeRelease, and the same for video. The debug output clearly shows the frames interleaved one by one, beautiful interleaving, yet the resulting video frames are grouped. So it might be something to do with WMF and its codec, or some settings missing somewhere. In other words, your code is doing it right.

2) The duration is 1000, and I am not talking about the timestamp but the relevancy of the nominator and video frames is 1000. I tried to tweak things and it didn't change, I even used the recommended 10m instead of the 1m you are using, still it didn't change. So this also might be, like above, a setting or a constrained bit/frame/packet limitation specific to this very codec. One test video is 60gps with 200 duration; the output is 1000 at 30fps, while it should be 400.

14 hours ago, Renate Schaaf said:
This causes a gap in the audio-stream and a phase-shift in the timing. I have a notorious video where you can actually hear these gaps after re-encoding. If I transform the audio to .wav first, the gaps are gone. One could try to safekeep the thrown-away bytes and pad them to the beginning of the next sample, fixing up the time-stamps... Is that what you were suggesting, @Kas Ob.? Well, I don't think I could do it anyway :).
Yes, in some way. See, if there is a gap then the audio is distorted and the final video is bad or low quality. So yes, decoding the audio into PCM from some exotic audio format, then using a more standard audio codec from WMF, will be the best thing to keep the quality.

Anyway, here is a nice answer on SO leading to a very beautiful SDK, you might find it very useful:
https://stackoverflow.com/questions/41326231/network-media-sink-in-microsoft-media-foundation
https://www.codeproject.com/Articles/1017223/CaptureManager-SDK-Capturing-Recording-and-Streami#twentythirddemoprogram

Now, why do I keep looking at this drifting between audio and video, you might ask. The answer: a long time ago I wanted to know how media players could read huge chunks of data from a slow HDD, decode them, then render them. Everything about that is irrelevant here except one behavior you can watch: players like WMP and VLC do a strange thing. They read the header of the video, then load huge buffers from the beginning, then seek to the end of the file and again load a huge chunk. From the end they try to see how much the streams have drifted; only after that do they play. Those players have seen it all, so they do resyncing tricks on their own: when the video/audio streams are desynced and it is possible, they adjust and cover it (fix it).

Why is this relevant here, if all modern players do this and fix things? Because it fails when you stream that video: there is no way to seek to the end, so the player will play what it gets, be it WebRTC, RTMP, RTSP... Think video conferencing, or a webcam, or security cams being received by a server that encodes and saves the videos while allowing the user to monitor one or more cams online. Audio and video syncing is important here, and player tricks will not help.

Anyway, nice, and thank you, you did a really nice job.
Renate Schaaf 74 Posted 21 hours ago

52 minutes ago, Kas Ob. said:
yet the resulting video frames are grouped, so it might be something to do with WMF and its codec, or some settings missing somewhere

I've been wondering about the same, maybe the setup isn't right. But I find it so hard to even figure out what settings you can specify.

56 minutes ago, Kas Ob. said:
2) The duration is 1000, and I am not talking about the timestamp but the relevancy of the nominator and video frames is 1000. I tried to tweak things and it didn't change, I even used the recommended 10m instead of the 1m you are using, still it didn't change. So this also might be, like above, a setting or a constrained bit/frame/packet limitation specific to this very codec. One test video is 60gps with 200 duration; the output is 1000 at 30fps, while it should be 400.

You lost me here. What is 10m, and what's gps?

58 minutes ago, Kas Ob. said:
Anyway, here is a nice answer on SO leading to a very beautiful SDK, you might find it very useful

Indeed, that looks interesting. Thanks, your links have already helped me a lot to understand better.

Renate
Kas Ob. 147 Posted 21 hours ago

25 minutes ago, Renate Schaaf said:

1 hour ago, Kas Ob. said:
2) The duration is 1000 ... I even used the recommended 10m instead of the 1m you are using ... One test video is 60gps with 200 duration; the output is 1000 at 30fps, while it should be 400.

You lost me here. What is 10m, and what's gps?

Honestly, I lost myself reading that. fps, not gps (stupid auto-correct and clumsy fingers), and 10m = 10000000 vs 1m = 1000000, as the denominator for the rate at setup.
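For context on the 10m vs 1m point: Media Foundation expresses sample times and durations in 100-nanosecond units, i.e. 10,000,000 ticks per second, which is presumably where the recommended 10m comes from. A quick illustration of what frame durations look like in those units (plain Python, not from the repo):

```python
# Media Foundation timestamps/durations are in 100-ns units: 10,000,000 per second.
MF_UNITS_PER_SECOND = 10_000_000

def frame_duration_mf_units(fps: float) -> int:
    return round(MF_UNITS_PER_SECOND / fps)

print(frame_duration_mf_units(30))      # 333333 units = 33.3333 ms per frame
print(frame_duration_mf_units(46.875))  # 213333 units, for the 48000/1024 rate
```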