In 10 tests i did, it is synced and difference is at the beginning is 4 ms and in the middle 4ms and at the end still 4ms, that is very accurate considering the acceptable desyncing between audio and video is constrained and small;
https://video.stackexchange.com/questions/25064/by-how-much-can-video-and-audio-be-out-of-sync
What is still perplexing me is;
1) why the frames are grouped, so i added something like this "OutputDebugString(PChar('Audio: '+IntToStr(AudioSampleDuration)));" before SafeRelease, same for video, the debug output is clearly showing an interleaved frames one by one ! beautiful interleaving, yet the result video frames are grouped, so it might be something has to do with WMF and its codec or missing some some settings somewhere, in other words you code is doing it right.
2) the duration at 1000 and i am not talking about the timestamp but the relevancy of nominator and video frames is 1000, i tried to tweak things and it didn't change, even used the recommended 10m instead of 1m you are using, still didn't change, so this also might be like above a setting or a constrained bit/frame/packet limitation specific to this very codec, one test video is 60gps with 200 duration, the output is 1000 at 30fps, while it should be 400.
Yes in some way, see if there is gap then the audio is distorted and the final video is bad or low quality, so yes decoding the audio into PCM from some exotic audio format, then use more standard audio codec from WMF will be the best thing to keep the quality.
Anyway, here a nice answer on SO leading to very beautiful SDK, you might find it very useful
https://stackoverflow.com/questions/41326231/network-media-sink-in-microsoft-media-foundation
https://www.codeproject.com/Articles/1017223/CaptureManager-SDK-Capturing-Recording-and-Streami#twentythirddemoprogram
Now, why i keep looking at this drifting in audio and video you might ask,
the answer is long time ago i wanted to know how those media players could read from slow HDD huge chunks of data and decode them then render them, everything is irrelevant here except one behavior you can watch, they like WMP and VLC do strange thing, they read the header of the video, then load huge buffers form the beginning then seek to the end of that file then again load huge chunk, from the end they try to see how much the streams drifted, only after that they play, those players saw it all, so they do tricks of resyncing at there own, when the video/audio stream are desynced and it is possible then adjust and cover it (fix it)
Why is this is relevant here if all modern and used players doing this and fix things, because this will fail when you stream that video there is no way to seek to the end, so the player will play what he get, being WebRTC, RTMP, RTSP...
Think video conference or WebCam or even security cams being received by server that will encoded and save the videos while allowing the user to monitor one or more cam online, audio and video syncing is important here, and players tricks will not help.
Anyway, nice and thank you, you did really nice job.