pyscripter

Revisiting TThreadedQueue and TMonitor


I just ran the stress test with Darian's default settings (thread count = 30000, timeout = 20) on Hyper-V guests running Server 2016, for 10 minutes each: once on a guest with 1 dedicated core, and once on a guest with 6 of the 8 cores of a Xeon CPU. Both worked fine, and the latter was almost twice as fast (if not more) as my own device.

 

10 hours ago, Kas Ob. said:

@Darian Miller Thank you, 

 

I am trying to make sense of my observations. It is morning now and I turned on my PC less than an hour ago. The test worked fine for about 10 minutes with 256 threads and a 2 ms timeout, but failed immediately at 1 ms. Yesterday, before shutting down and after the PC had been stressed all day, 16 ms was failing and 20 ms looked stable, while yesterday morning 1 ms worked for more than 10 minutes without failing.

 

Windows 10 Build 1803 (17134.706), i7-2600K.

 

A new Quality Portal issue could be raised to ensure that the timeout values of the queue are >= 16ms.  Anything less is apparently not achievable based on earlier comments in this thread.

 

 

5 hours ago, Primož Gabrijelčič said:

Did some testing on that recently, and most of the time the wait functions work as expected, while on some virtual machines they return early. The test cases were running on a few Windows boxes and on two VMWare Fusion/Mac VMs. It worked OK on all but one: one Fusion/Mac combo consistently returned early from the wait.

 

I've seen multiple cases of a timeout of 200ms returning in 160ms when running within VMWare.  My first stress test included a fudge factor for returning too early just for that reason.
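
A minimal sketch of how such an early return could be detected, assuming a plain TEvent that is never signalled as a stand-in for the queue's timed wait. The helper name MeasureWaitMs and the 5 ms tolerance are hypothetical and not taken from the original stress test.

```pascal
program WaitEarlyCheck;
{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.SyncObjs,
  System.Diagnostics;

const
  TimeoutMs = 200; // timeout under test, the value mentioned in the post
  FudgeMs   = 5;   // hypothetical tolerance for timer rounding

// Waits on an event that is never signalled and returns how long the
// wait actually took, in milliseconds.
function MeasureWaitMs(Event: TEvent; Timeout: Cardinal): Int64;
var
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  Event.WaitFor(Timeout); // expected to time out
  Result := SW.ElapsedMilliseconds;
end;

var
  Event: TEvent;
  I: Integer;
  Elapsed: Int64;
begin
  // Manual-reset event, initially not signalled, unnamed
  Event := TEvent.Create(nil, True, False, '');
  try
    for I := 1 to 10 do
    begin
      Elapsed := MeasureWaitMs(Event, TimeoutMs);
      if Elapsed < TimeoutMs - FudgeMs then
        Writeln(Format('Run %d: returned EARLY after %d ms', [I, Elapsed]))
      else
        Writeln(Format('Run %d: waited %d ms', [I, Elapsed]));
    end;
  finally
    Event.Free;
  end;
end.
```

On a machine where waits behave normally every run should report roughly 200 ms or more; the 160 ms case described above would show up as an EARLY line.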

 


Thanks a lot!
The changes proposed here will be in 10.4 Update 1, with some modifications, mostly due to the new cross-platform AtomicCmpExchange128 for 64-bit platforms.

1 hour ago, Darian Miller said:

A new Quality Portal issue could be raised to ensure that the timeout values of the queue are >= 16ms.  Anything less is apparently not achievable based on earlier comments in this thread.

I disagree.

TMonitor is cross platform and the ~16ms timeout granularity is a Windows limitation, which can (but very seldom should) be modified with timeBeginPeriod.

It would be better to document the current behavior on Windows - and maybe also why it's there.
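
A rough sketch of how the granularity could be observed, assuming Windows and using Sleep(1) as a simple stand-in for a timed wait; it is not a measurement of TMonitor itself. The helper AverageSleepMs is hypothetical, while timeBeginPeriod and timeEndPeriod come from Winapi.MMSystem. Results will vary with the Windows version and with whatever else on the system has already requested a finer timer resolution.

```pascal
program TimerGranularity;
{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Diagnostics,
  Winapi.Windows,
  Winapi.MMSystem;

// Average wall-clock duration, in milliseconds, of Sleep(1) over Runs calls.
function AverageSleepMs(Runs: Integer): Double;
var
  SW: TStopwatch;
  I: Integer;
begin
  SW := TStopwatch.StartNew;
  for I := 1 to Runs do
    Sleep(1);
  Result := SW.ElapsedMilliseconds / Runs;
end;

begin
  Writeln(Format('Default resolution: Sleep(1) takes about %.2f ms',
    [AverageSleepMs(50)]));

  // Request a 1 ms timer resolution. This affects power usage and
  // scheduling, which is why it should rarely be done.
  if timeBeginPeriod(1) = TIMERR_NOERROR then
  try
    Writeln(Format('After timeBeginPeriod(1): about %.2f ms',
      [AverageSleepMs(50)]));
  finally
    // Every timeBeginPeriod call must be matched by timeEndPeriod
    timeEndPeriod(1);
  end;
end.
```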

Just now, Anders Melander said:

I disagree.

TMonitor is cross platform and the ~16ms timeout granularity is a Windows limitation, which can (but very seldom should) be modified with timeBeginPeriod.

It would be better to document the current behavior on Windows - and maybe also why it's there.

I agree!  I had switched to QP to type up a new issue and thought the same thing.  When I closed that window, this window beeped with your reply - must be good karma!

 

2 hours ago, Darian Miller said:

I've seen multiple cases of a timeout of 200ms returning in 160ms when running within VMWare.  My first stress test included a fudge factor for returning too early just for that reason.

The following code might give a hint about what is going on when running Delphi-compiled code on a VM.

program SimpleTiming;
{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils,
  Winapi.Windows; // QueryPerformanceCounter

procedure ASMPad4One;
asm
end;

procedure ASMPad4Two;
asm
end;

procedure ASMPad4Three;
asm
end;

procedure ASMPad4Four;
asm
end;

procedure ASMProc1;
var
  Temp1, Temp2, Temp3, Temp4: Cardinal;
asm
        mov     eax, 10000000

@Loop:
        mov     Temp1, ebx
        mov     Temp2, esi
        mov     edx, eax
        AND     edx, 3
        cmp     edx, 0
        jnz     @Skip
        XOR     Temp3, eax
        XOR     Temp4, edx

@Skip:
        dec     eax
        cmp     eax, 0
        jnz     @Loop
end;

procedure ASMProc2;
var
  Temp1, Temp2, Temp3, Temp4: Cardinal;
asm
        mov     eax, 10000000

@Loop:
        mov     Temp1, ebx
        mov     Temp2, esi
        mov     edx, eax
        AND     edx, 3
        cmp     edx, 0
        jnz     @Skip
        XOR     Temp3, eax
        XOR     Temp4, edx

@Skip:
        dec     eax
        cmp     eax, 0
        jnz     @Loop
end;

var
  Count1, Count2: Int64;

procedure Benchmark2;
var
  QS1, QF1: Int64;
  QS2, QF2: Int64;
begin
  QueryPerformanceCounter(QS1);
  ASMProc1;
  QueryPerformanceCounter(QF1);

  QueryPerformanceCounter(QS2);
  ASMProc2;
  QueryPerformanceCounter(QF2);

  Inc(Count1, QF1 - QS1);
  Inc(Count2, QF2 - QS2);
  Writeln('AsmProc1: ' + IntToStr(QF1 - QS1) + #9 + 'AsmProc2: ' + IntToStr(QF2 - QS2));
end;

const
  LOOP_COUNT = 10;

var
  i: Integer;

begin
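  // Force linking of the empty padding procedures. Each one that is linked
  // in shifts the addresses of ASMProc1/ASMProc2, so commenting or
  // uncommenting these calls changes the alignment of the two loops.
  ASMPad4One;
  ASMPad4Two;
  //ASMPad4Three;
  //ASMPad4Four;

  // Warm-up runs before the timed benchmark
  ASMProc1;
  ASMProc2;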


  Writeln('ASMProc1 address : '+IntToHex(NativeUInt(Addr(ASMProc1))));
  Writeln('ASMProc2 address : '+IntToHex(NativeUInt(Addr(ASMProc2))));
  Count1 := 0;
  Count2 := 0;
  for I := 1 to LOOP_COUNT do
    Benchmark2;
  Writeln('Average 1 : ' + IntToStr(Count1 div LOOP_COUNT) + '  Average 2 : ' +
    IntToStr(Count2 div LOOP_COUNT));
  Readln;
end.

 

Now notes on that code:

1) The code looks 32-bit only, but it is perfectly safe to run on 64-bit, and the behaviour is somewhat similar.

2) To tweak the result, comment or uncomment the calls to the ASMPad4(X) procedures.

3) The result with the code above on my device is "Average 1 : 36195  Average 2 : 42693" (comment/uncomment to make them the same or reversed). Losing or gaining around 1/5 of the speed over 12 assembly instructions is no joke! It can be even more serious, especially at the MM level, in System.Move, or in the intrinsic functions for managed types.

4) Yes, it is the same assembly, and the huge loss is due only to 2 loops whose jump addresses are not aligned!

5) You may need to add a similar padding function between ASMProc1 and ASMProc2 to induce even stranger behaviour.

6) Not long ago I was trying to write this demo for Arnaud and Pierre, as I think they might be more interested in it than anyone else, but I only got as far as this result about aligned conditional jumps and got bored before demonstrating the effect of using cmp and jmpX sequentially on a VM (or an emulated CPU). That behaviour needs different and slightly more complex assembly around them, such as a few mov instructions with 3 registers instead of the ones above with one register and the stack. Anyway, this is not the point here, although it is still very important how much performance a compiled application can lose on a VM; we are in the cloud era, after all.

7) Using $CODEALIGN, which I only recently heard about, might fix it. I didn't try; I still trust FPC for my performance-critical code.
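
To see where a routine actually lands, its address can be reduced modulo 16. The sketch below is only an illustration: ProcA, ProcB and OffsetIn16 are hypothetical names, and it reports the alignment of the routine entry points, not of the loop labels inside an asm block, which is what the notes above are really about.

```pascal
program AlignCheck;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Two empty placeholder routines whose addresses are inspected below.
procedure ProcA;
begin
end;

procedure ProcB;
begin
end;

// Offset of Address inside its 16-byte block; 0 means 16-byte aligned.
function OffsetIn16(Address: Pointer): NativeUInt;
begin
  Result := NativeUInt(Address) and 15;
end;

begin
  // Call both routines so the linker keeps them
  ProcA;
  ProcB;

  Writeln('ProcA offset in 16-byte block: ', OffsetIn16(@ProcA));
  Writeln('ProcB offset in 16-byte block: ', OffsetIn16(@ProcB));
  Readln;
end.
```

Adding or removing routines before ProcA should change the reported offsets, in the same way the ASMPad4(X) calls shift ASMProc1 and ASMProc2.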

 

Now, after all of that, let's get to the point:

 

59 minutes ago, Dmitry Arefiev said:

The changes proposed here will be in 10.4 Update 1, with some modifications, mostly due to the new cross-platform AtomicCmpExchange128 for 64-bit platforms.

Please, if you are going to walk, then walk the walk all the way to the finish line.

AtomicCmpExchange128, like the other AtomicCmpExchangeXX functions, depends on THE loop: there is always a loop and a branch, so pretty please align that bit of it! This applies to every CPU out there (including ARM and CPUs on VMs), so do your research or hire someone who knows.
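
For readers unfamiliar with the pattern, this is the loop-and-branch shape in question, sketched in plain Pascal with the 32-bit AtomicCmpExchange intrinsic. AtomicAddCapped is a hypothetical helper made up for illustration; the actual RTL code under discussion is assembly and is not shown in this thread.

```pascal
program CasLoop;
{$APPTYPE CONSOLE}

// Hypothetical helper: atomically adds Delta to Target without letting it
// exceed Cap, and returns the value that was stored.
function AtomicAddCapped(var Target: Integer; Delta, Cap: Integer): Integer;
var
  OldValue, NewValue: Integer;
begin
  repeat
    OldValue := Target;
    NewValue := OldValue + Delta;
    if NewValue > Cap then
      NewValue := Cap;
    // The exchange only succeeds if Target still holds OldValue;
    // otherwise another thread got in first and the loop retries.
  until AtomicCmpExchange(Target, NewValue, OldValue) = OldValue;
  Result := NewValue;
end;

var
  Counter: Integer = 0;

begin
  Writeln(AtomicAddCapped(Counter, 5, 8)); // prints 5
  Writeln(AtomicAddCapped(Counter, 5, 8)); // prints 8 (capped)
end.
```

The backward branch of that repeat/until is the jump whose alignment is being asked for.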

Embarcadero gave up on having the compiler generate more efficient, faster and shorter native code. That is OK. Assembly was then used as a workaround, so the asm covering many parts of the RTL should be redone the right way: aligned, optimized and extra tested. Definitely it should be done by a professional, not a hobbyist.

The point of having an RTL is ....., let's say it is not beautiful code. No one cares how the code in the RTL looks, as long as it is fast and reliable, and that is the finish line.

 

It is 2020, and intrinsic functions are essential for any high-performance application. I can't list everything I can think of that Delphi desperately needs, but consider the most popular ones, from AtomicXX to low-level bit manipulation, as in any other compiler. This is also a finish line.

 

My post is already very long, so I am stopping here.

2 hours ago, Kas Ob. said:

Definitely it should be done by a professional, not a hobbyist

....i think you should have a little faith in this "hobbyist".

54 minutes ago, ConstantGardener said:

....i think you should have a little faith in this "hobbyist".

A hobbyist might do a perfect job; professionals just do it. We are talking about a specific part that will save trillions of CPU cycles in countless apps, so I prefer this part to be done right and only once.

1 minute ago, Kas Ob. said:

A hobbyist might do a perfect job; professionals just do it

I'm afraid your experience of "professionals" differs from mine. Supposedly the original implementation was written by "professionals".

 

Quote

https://www.merriam-webster.com/dictionary/professional

professional adjective
pro·fes·sion·al | \ prə-ˈfesh-nəl
a : participating for gain or livelihood in an activity or field of endeavor often engaged in by amateurs
b : having a particular profession as a permanent career
c : engaged in by persons receiving financial return

 

11 minutes ago, Kas Ob. said:

A hobbyist might do a perfect job; professionals just do it. We are talking about a specific part that will save trillions of CPU cycles in countless apps, so I prefer this part to be done right and only once.

... that was not my point. i think dmitry can handle something. ...some infos

 

5 hours ago, Dmitry Arefiev said:

Thanks a lot!
The changes proposed here will be in 10.4 Update 1, with some modifications, mostly due to the new cross-platform AtomicCmpExchange128 for 64-bit platforms.

This is worth a certain degree of celebration. I cannot remember Embarcadero/Inprise/Borland etc. ever being so responsive. It also demonstrates the power of the collective efforts of the community to make Delphi better.


@Anders Melander This is the last time I will address you. You like trolling and I do not find it funny. If you failed to get what I wrote there, then first, I am not a native English speaker, and second, I don't care whether you get the idea or not.

 

1 minute ago, ConstantGardener said:

... that was not my point. i think dmitry can handle something. ...some infos

OK, I am lost now. I wasn't talking about a specific person or group; I am pointing out the fact that 10-15 years ago Delphi adopted FastMM4 and some of FastCode. What has evolved since then? Nothing on the RTL side. Before you start listing the enhancements for me, please get the point I am making: the compiler is still the same, while the RTL still has inefficient assembly!

That should have been addressed years ago. Now there is a chance for Delphi to become a 2020 application generator, but it should definitely be done by experts who do it right; this should not be guesswork.

 

As for the criticism in the posts I wrote here: some of you felt offended by my words, as if they offended something you like. Be assured that I like it and love it more than you do; my words come from the fear of losing this love.

 

Here is a blog that I follow, and I suggest everyone read this post (even if you feel like a god who knows it all):

http://www.cs.uni.edu/~wallingf/blog/archives/monthly/2020-05.html#e2020-05-18T16_10_46.htm


 

7 minutes ago, pyscripter said:

I cannot remember Embarcadero/Inprise/Borland etc. ever being so responsive.

 

I can remember. It's every time Dmitry.

 

5 hours ago, Dmitry Arefiev said:

Thanks a lot!
The changes proposed here will be in 10.4 Update 1, with some modifications, mostly due to the new cross-platform AtomicCmpExchange128 for 64-bit platforms.

 

This is fantastic!  Thank you.


Now I get it! Dmitry is the godfather of AnyDAC, one of the two most valued jewels in the Delphi collection; both were, without doubt, the right decisions made by the guys steering the wheel.

 

On the other hand, about two years ago (I might be wrong about the dates), Danijel Tkalcec offered his RTC SDK in full for sale in his private, licensed RTC forum. He made a full disclosure of RTC's financial situation and of the licenses sold over a period of some year(s) (I also can't remember for sure), and he preferred that one of his users buy it. Meanwhile, as I remember, Embarcadero was happy acquiring a navigator with bookmarks. Bookmarks!! CnWizards has had them for years; I didn't even know that Delphi came without bookmarks.

 

Can the Nexus Quality Suite be included with Delphi/RAD Studio? Sure. Is it the best tool out there? No, but the developers at Nexus are capable of making it better, as long as there is interest and their time is paid for. Wouldn't this be better than marketing Deleaker?

 
