Worker thread queue performance

snowdev · April 7

I have an application that continuously receives websocket data and processes it in the background on worker threads. Some of the data is critical and must be processed as quickly as possible.

Once I receive the websocket data, I fill a TObject descendant with the information and push it to the corresponding worker thread, which processes and frees the object. I use Delphi 12.

I would like to reduce as much overhead as possible in this flow. Besides the business rules, I believe there is overhead in my worker thread consumer implementation, especially because I use TObject descendants to transport the data. There is also TObject cloning when one worker thread communicates with another: each worker thread owns the lifetime of the objects in its queue, so I need to send a copy to each worker thread.

I decided to write a benchmark to check for myself the differences between the threading queue approaches I know of and find out which one performs best. The benchmark I built is attached. Currently I use the CustomQueueObject.pas queue model in production.

The benchmark contains 4 examples of consumer thread queues:

- A thread with TQueue<TObject>, TSemaphore and TCriticalSection;
- A thread with TQueue<Pointer>, TSemaphore and TCriticalSection;
- A thread with the internal TThread queue processing TObject descendants, without events and sync objects;
- A thread with the internal TThread queue processing Pointers, without events and sync objects.
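For readers without the attachment, here is a minimal sketch of the first pattern in the list (TQueue plus TSemaphore plus TCriticalSection). It is not the code from the zip: the names TWorkItem and TQueueWorker, the nil shutdown sentinel, and the summing "work" are all assumptions made for illustration.

```pascal
program LockedQueueSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs, System.Generics.Collections;

type
  // Hypothetical payload standing in for the real TObject descendant.
  TWorkItem = class
  public
    Value: Integer;
    constructor Create(AValue: Integer);
  end;

  // Hypothetical consumer thread: it owns the queue and frees every item it pops.
  TQueueWorker = class(TThread)
  private
    FQueue: TQueue<TWorkItem>;
    FLock: TCriticalSection;
    FAvailable: TSemaphore;           // counts the items waiting in FQueue
    FSum: Int64;
  protected
    procedure Execute; override;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Push(AItem: TWorkItem); // nil acts as the shutdown sentinel
    property Sum: Int64 read FSum;
  end;

constructor TWorkItem.Create(AValue: Integer);
begin
  inherited Create;
  Value := AValue;
end;

constructor TQueueWorker.Create;
begin
  FQueue := TQueue<TWorkItem>.Create;
  FLock := TCriticalSection.Create;
  FAvailable := TSemaphore.Create(nil, 0, MaxInt, '');
  inherited Create(False);
end;

destructor TQueueWorker.Destroy;
begin
  inherited;                          // caller has already stopped the thread
  FAvailable.Free;
  FLock.Free;
  FQueue.Free;
end;

procedure TQueueWorker.Push(AItem: TWorkItem);
begin
  FLock.Acquire;
  try
    FQueue.Enqueue(AItem);
  finally
    FLock.Release;
  end;
  FAvailable.Release;                 // wake the consumer
end;

procedure TQueueWorker.Execute;
var
  Item: TWorkItem;
begin
  while True do
  begin
    FAvailable.Acquire;               // blocks until Push signals an item
    FLock.Acquire;
    try
      Item := FQueue.Dequeue;
    finally
      FLock.Release;
    end;
    if Item = nil then
      Break;                          // shutdown sentinel
    try
      Inc(FSum, Item.Value);          // stand-in for the real processing
    finally
      Item.Free;
    end;
  end;
end;

var
  Worker: TQueueWorker;
  I: Integer;
  Total: Int64;
begin
  Worker := TQueueWorker.Create;
  try
    for I := 1 to 1000 do
      Worker.Push(TWorkItem.Create(I));
    Worker.Push(nil);                 // stop after everything queued is drained
    Worker.WaitFor;
    Total := Worker.Sum;
  finally
    Worker.Free;
  end;
  Writeln('Sum of processed values: ', Total);
end.
```

Every push and pop here pays for one critical section enter/leave plus a semaphore call, and every item costs one heap allocation and one free. Those are the overheads the question is about.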

After running tons of tests, to my surprise the fastest consumer queue is the one in CustomQueueObject.pas (which I already use), even with more creation/destruction and access control (sync object)... which brought me here to ask more experienced developers whether I am doing something wrong (going by my examples, since they are my baseline). Mainly in the Pointer examples, as I rarely use pointers, but I am willing to change if that is better.

In my mind the internal thread queue would be much more efficient for the reasons given above... I also thought that working with Pointers could greatly improve the overall performance of the workflow, since I would have a single reference that is freed only at the end.

Could I be measuring the performance incorrectly?

Thanks in advance!

MultithreadingQueueBenchmark.zip

Edited April 7 by snowdev

chmichael · April 7

Why don't you use TMonitor?

Anders Melander · April 7

4 hours ago, snowdev said:

Could I be measuring the performance incorrectly?

Yes.

- For one you are running all the tests concurrently which means that you will be penalizing the tests that start later because they will be competing for CPU against the tests that are already executing. Execute each test and wait for it to finish before you start the next test.
- You also seem to have massive memory leaks which probably means that some of the tests have an unfair advantage because they don't consume time releasing their resources.
- If you are using thread pools (I'm not sure that you are (if not, you should be)) then you should ensure that the thread pool has been spun up before you start the test. Otherwise you will penalize the first threads with the startup overhead.
- Instead of just looking at the time from start to end and then guessing about why it is fast/slow/whatever, profile your code so you can see exactly where the bottlenecks are. Do this for each individual algorithm in turn.
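A minimal sketch of the first points, assuming a harness that runs one candidate at a time: each run only returns once its consumer thread has finished, a warm-up run is discarded, and leak reporting is switched on. CandidateA, ItemCount, Runs and the summing loop are placeholders, not anything from the attached benchmark.

```pascal
program SequentialBenchSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Diagnostics;

const
  ItemCount = 1000000;   // hypothetical workload size
  Runs = 5;              // hypothetical number of timed runs

var
  Checksum: Int64;       // keeps the placeholder work observable

// Placeholder for "push ItemCount items through queue X and wait until the
// consumer has processed and released all of them".
procedure CandidateA;
var
  Worker: TThread;
begin
  Worker := TThread.CreateAnonymousThread(
    procedure
    var
      I: Integer;
      Sum: Int64;
    begin
      Sum := 0;
      for I := 1 to ItemCount do
        Inc(Sum, I);
      Checksum := Sum;
    end);
  Worker.FreeOnTerminate := False;  // we WaitFor and Free it ourselves
  Worker.Start;
  Worker.WaitFor;                   // the run is not over until the consumer is done
  Worker.Free;
end;

// Runs the candidate strictly on its own and returns the best time in ticks.
function BestOf(const ACandidate: TProc): Int64;
var
  Run: Integer;
  SW: TStopwatch;
begin
  ACandidate();                     // warm-up run, not timed
  Result := High(Int64);
  for Run := 1 to Runs do
  begin
    SW := TStopwatch.StartNew;
    ACandidate();
    SW.Stop;
    if SW.ElapsedTicks < Result then
      Result := SW.ElapsedTicks;
  end;
end;

var
  Best: Int64;
begin
  ReportMemoryLeaksOnShutdown := True;
  Best := BestOf(CandidateA);
  Writeln('CandidateA, best of ', Runs, ' runs: ', Best, ' ticks');
  // Time the next candidate here, only after the previous one has returned.
end.
```

A stopwatch harness like this only gives the total time per candidate. It does not replace a profiler for finding where that time is spent.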

Apart from that, for something as simple as this, you don't need locking and you definitely don't need to use the Windows message queue as a work queue.

Use a simple lock-free fifo queue instead.

You could even use a fixed size lock-free ring buffer (just an array of records with two integer values as in/out indices). The fixed size buffer and the records would eliminate the allocation overhead of the queue itself. You should probably also try to eliminate the use of string and replace it with fixed size buffer if possible.
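A minimal sketch of such a fixed size ring buffer, under stated assumptions: it is only safe with exactly one producer thread and one consumer thread, the TWorkItem fields are invented, and the indices are published through TInterlocked calls as a conservative way to get full memory barriers. One slot is always left empty to tell "full" from "empty".

```pascal
program SpscRingSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

const
  RingSize = 1024;        // must be a power of two; usable capacity is RingSize - 1
  ItemCount = 100000;     // hypothetical number of items for the demo

type
  // Hypothetical work packet: a plain record, no heap allocation per item.
  TWorkItem = record
    Id: Integer;
    Price: Double;
    Symbol: array[0..15] of Char;
  end;

  TSpscRing = class
  private
    FItems: array[0..RingSize - 1] of TWorkItem;
    FHead: Integer;       // next slot to write; modified by the producer only
    FTail: Integer;       // next slot to read; modified by the consumer only
  public
    function TryPush(const AItem: TWorkItem): Boolean;
    function TryPop(out AItem: TWorkItem): Boolean;
  end;

function TSpscRing.TryPush(const AItem: TWorkItem): Boolean;
var
  Head, Next: Integer;
begin
  Head := FHead;
  Next := (Head + 1) and (RingSize - 1);
  if Next = TInterlocked.CompareExchange(FTail, 0, 0) then  // atomic read of FTail
    Exit(False);                                            // buffer is full
  FItems[Head] := AItem;
  TInterlocked.Exchange(FHead, Next);   // publish only after the slot is written
  Result := True;
end;

function TSpscRing.TryPop(out AItem: TWorkItem): Boolean;
var
  Tail: Integer;
begin
  Tail := FTail;
  if Tail = TInterlocked.CompareExchange(FHead, 0, 0) then  // atomic read of FHead
    Exit(False);                                            // buffer is empty
  AItem := FItems[Tail];
  TInterlocked.Exchange(FTail, (Tail + 1) and (RingSize - 1));
  Result := True;
end;

var
  Ring: TSpscRing;
  Producer: TThread;
  Item: TWorkItem;
  Received, Total: Int64;
begin
  Ring := TSpscRing.Create;
  try
    Producer := TThread.CreateAnonymousThread(
      procedure
      var
        I: Integer;
        Outgoing: TWorkItem;
      begin
        Outgoing := Default(TWorkItem);
        for I := 1 to ItemCount do
        begin
          Outgoing.Id := I;
          while not Ring.TryPush(Outgoing) do
            TThread.Yield;              // buffer full: let the consumer run
        end;
      end);
    Producer.FreeOnTerminate := False;
    Producer.Start;

    Received := 0;
    Total := 0;
    while Received < ItemCount do       // the main thread is the single consumer
      if Ring.TryPop(Item) then
      begin
        Inc(Total, Item.Id);
        Inc(Received);
      end
      else
        TThread.Yield;                  // buffer empty: let the producer run

    Producer.WaitFor;
    Producer.Free;
    Writeln('Sum of received ids: ', Total);
  finally
    Ring.Free;
  end;
end.
```

The demo spins with TThread.Yield when the buffer is full or empty. A real consumer would more likely block on an event when there is nothing to do, which brings back some synchronization cost but only on the idle path.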

RDP1974 · April 7

please can you test with TThreadedQueue? (with latest 12.3)

snowdev · April 7

5 hours ago, chmichael said:

Why don't you use TMonitor?

I searched the internet when I started the project and found some posts about TMonitor performance. I also found a blog post about this by gabr42 (creator of OmniThreadLibrary) and just decided to use TCriticalSection.

4 hours ago, Anders Melander said:

- For one you are running all the tests concurrently which means that you will be penalizing the tests that start later because they will be competing for CPU against the tests that are already executing. Execute each test and wait for it to finish before you start the next test.

That makes sense. I tested it this way and got similar results, and the simple FIFO queue wins (working with objects or pointers).

4 hours ago, Anders Melander said:

- You also seem to have massive memory leaks which probably means that some of the tests have an unfair advantage because they don't consume time releasing their resources.

Not exactly. Running the tests with ReportMemoryLeaksOnShutdown didn’t report any leaks… every queue variant releases its resources.

4 hours ago, Anders Melander said:

- If you are using thread pools (I'm not sure that you are (if not, you should be)) then you should ensure that the thread pool has been spun up before you start the test. Otherwise you will penalize the first threads with the startup overhead.

I’ll take a look into that; I usually don’t use them. That is why I didn’t include one in the example… every thread is started at app initialization.

4 hours ago, Anders Melander said:

- Instead of just looking at the time from start to end and then guessing about why it is fast/slow/whatever, profile your code so you can see exactly where the bottlenecks are. Do this for each individual algorithm in turn.

Thanks for the tip. I don’t know of a profiling library for Delphi, but I’ll measure them with stopwatches.

4 hours ago, Anders Melander said:

Apart from that, for something as simple as this, you don't need locking and you definitely don't need to use the Windows message queue as a work queue.

Use a simple lock-free fifo queue instead.

You could even use a fixed size lock-free ring buffer (just an array of records with two integer values as in/out indices). The fixed size buffer and the records would eliminate the allocation overhead of the queue itself. You should probably also try to eliminate the use of string and replace it with fixed size buffer if possible.

I only use locking because I don’t know whether there could be a deadlock when another thread is pushing while the worker is popping, so I do it just in case. Are you saying that this scenario isn’t possible?

About the Windows message queue: it seems as slow as a simple FIFO as well, though I thought I would continue using this approach.

In the next few days I’ll build a ring-buffer-like approach and compare its performance to TQueue, which internally uses an array of T, by the way.

As for strings, I could switch to PWideChar as well; I use string for convenience.

3 hours ago, RDP1974 said:

please can you test with TThreadedQueue? (with latest 12.3)

Almost same performance as TQueue.

Thanks for the reply.

Edited April 7 by snowdev
Gramatic

Kas Ob. · April 7

1 hour ago, snowdev said:

5 hours ago, Anders Melander said:

- You also seem to have massive memory leaks which probably means that some of the tests have an unfair advantage because they don't consume time releasing their resources.

Not exactly. Running the tests with ReportMemoryLeaksOnShutdown didn’t report any leaks… every queue variant releases its resources.

I did not run the code, and my test would be irrelevant on an old IDE, but I browsed the code and I can see one thing to point out here. Anders might be right, and you should investigate whether you really are not generating leaks, or whether you are hitting something worse.

See, there is "FreeOnTerminate := True;" in the constructor in a few places, and yet there is an explicit call to Destroy. This is a bad mix; leaks and double frees will most likely be there, unless TThread and the RTL have changed a lot since XE8, in which case just ignore this post.
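A minimal sketch of the owner-managed alternative, assuming the thread is meant to be destroyed explicitly: FreeOnTerminate stays False, and the owner calls Terminate, WaitFor and Free in that order. TOwnedWorker and its Iterations counter are hypothetical names.

```pascal
program ThreadLifetimeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

type
  // Hypothetical worker whose lifetime is controlled by whoever created it.
  TOwnedWorker = class(TThread)
  protected
    procedure Execute; override;
  public
    Iterations: Integer;
    constructor Create;
  end;

constructor TOwnedWorker.Create;
begin
  inherited Create(False);
  // Pick one model. With False, the owner calls Free. With True, the thread
  // frees itself and nobody may touch the reference after it has started.
  FreeOnTerminate := False;
end;

procedure TOwnedWorker.Execute;
begin
  while not Terminated do
  begin
    Inc(Iterations);   // stand-in for popping and processing a work item
    Sleep(1);
  end;
end;

var
  Worker: TOwnedWorker;
  WasFinished: Boolean;
begin
  ReportMemoryLeaksOnShutdown := True;
  Worker := TOwnedWorker.Create;
  try
    Sleep(50);
    Worker.Terminate;  // ask the loop in Execute to exit
    Worker.WaitFor;    // wait until Execute has actually returned
    WasFinished := Worker.Finished;
  finally
    Worker.Free;       // safe: the thread is done and does not free itself
  end;
  Writeln('Finished before Free: ', WasFinished);
end.
```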

Anders Melander · April 7

34 minutes ago, snowdev said:

Not exactly. Running the tests with ReportMemoryLeaksOnShutdown didn’t report any leaks… every queue variant releases its resources.

I didn't investigate but I got a lot of leaks reported when exiting the application when running in the debugger.

36 minutes ago, snowdev said:

I’ll take a look into that; I usually don’t use them. That is why I didn’t include one in the example… every thread is started at app initialization.

Okay.

It's expensive to start a thread but if you are launching the threads at application startup then it doesn't matter. If you create them on-demand then I would use TTask instead. The first task will take the worst of the pool initialization hit.

39 minutes ago, snowdev said:

I don’t know of a profiling library for Delphi, but I’ll measure them with stopwatches.

https://en.delphipraxis.net/search/?q=profiling

41 minutes ago, snowdev said:

I only use locking because I don’t know whether there could be a deadlock when another thread is pushing while the worker is popping, so I do it just in case.

If you use a lock-free structure then you don't need locking. Hence the "free" in the name 🙂

And FTR, the term deadlock means a cycle where two threads each have some resource locked and each is waiting for the other to release their resource. I think what you meant was race condition; Two threads modifying the same resource at the same time.
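To make the race condition half of that distinction concrete, here is a small sketch with two threads updating the same counters: a plain Inc, which is a non-atomic read-modify-write and can lose updates, next to TInterlocked.Increment, which cannot. The counter names and the iteration count are arbitrary.

```pascal
program RaceConditionSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

const
  PerThread = 1000000;   // arbitrary iteration count per thread

var
  UnsafeCounter: Integer;   // updated with plain Inc
  SafeCounter: Integer;     // updated atomically
  Threads: array[0..1] of TThread;
  I: Integer;
begin
  for I := Low(Threads) to High(Threads) do
  begin
    Threads[I] := TThread.CreateAnonymousThread(
      procedure
      var
        N: Integer;
      begin
        for N := 1 to PerThread do
        begin
          Inc(UnsafeCounter);                  // racy: two threads can overwrite each other
          TInterlocked.Increment(SafeCounter); // atomic read-modify-write
        end;
      end);
    Threads[I].FreeOnTerminate := False;
    Threads[I].Start;
  end;

  for I := Low(Threads) to High(Threads) do
  begin
    Threads[I].WaitFor;
    Threads[I].Free;
  end;

  Writeln('Expected:      ', 2 * PerThread);
  Writeln('Plain Inc:     ', UnsafeCounter, ' (can come out lower on a multi-core machine)');
  Writeln('TInterlocked:  ', SafeCounter);
end.
```

Nothing in this program can deadlock, because no thread ever waits for a resource held by another. The only hazard is the lost update on UnsafeCounter.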

54 minutes ago, snowdev said:

As for strings, I could switch to PWideChar as well; I use string for convenience.

PWideChar is supposedly a pointer to a WideString? In that case, please don't. WideString is only for use in COM and it's horribly slow.

No, what I meant was that instead of using dynamic strings (which are relatively slow because they must be allocated, sized, resized, freed, etc.) use a static array of chars: Buffer: array[0..BufferSize-1] of Char. You will waste some bytes but it's fast.
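A minimal sketch of that idea, assuming a work packet that carries one short piece of text: the record holds a fixed Char array plus a length, so copying or queueing it never touches the heap. TPacket, BufferSize = 32, SetText and GetText are illustrative names, and text longer than the buffer is silently truncated here.

```pascal
program FixedBufferSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  BufferSize = 32;   // hypothetical maximum text length, in characters

type
  // Hypothetical work packet with no managed (heap-allocated) fields.
  TPacket = record
    Len: Integer;
    Buffer: array[0..BufferSize - 1] of Char;
    procedure SetText(const AText: string);
    function GetText: string;
  end;

procedure TPacket.SetText(const AText: string);
begin
  Len := Length(AText);
  if Len > BufferSize then
    Len := BufferSize;                        // truncate instead of overflowing
  if Len > 0 then
    Move(PChar(AText)^, Buffer[0], Len * SizeOf(Char));
end;

function TPacket.GetText: string;
begin
  // Only needed at the edges; the hot path can read Buffer and Len directly.
  SetString(Result, PChar(@Buffer[0]), Len);
end;

var
  Original, Copy: TPacket;
begin
  Original.SetText('PETR4');
  Copy := Original;                           // plain memory copy, no reference counting
  Writeln('Text: ', Copy.GetText);
  Writeln('Record size in bytes: ', SizeOf(TPacket));
end.
```

The price is the wasted bytes mentioned above: every packet occupies the full buffer whether the text is 2 characters or 32.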

snowdev · April 7

15 minutes ago, Anders Melander said:

I didn't investigate but I got a lot of leaks reported when exiting the application when running in the debugger.

That’s a bit weird. I am also running in debug and didn’t get any leaks.

16 minutes ago, Anders Melander said:

If you use a lock-free structure then you don't need locking. Hence the "free" in the name 🙂

And FTR, the term deadlock means a cycle where two threads each have some resource locked and each is waiting for the other to release their resource. I think what you meant was race condition; Two threads modifying the same resource at the same time.

Yeah, I forgot the correct term. Anyway, I am already planning some changes:

- Replace the TCriticalSection locking with gabr42’s version of TLightweightMREW (discussion here: https://www.thedelphigeek.com/2021/02/readers-writ-47358-48721-45511-46172.html?m=1)

Another change would be:

- Switch from single worker thread processing to N worker thread processing: I have a worker that loops over an array and does whatever is necessary based on each item’s settings, so on every notification the thread retrieves the data from this array…

Instead, I’ll create something like a thread pool (or a simple TThread list) and pre-create the worker threads with the pre-defined parameters. Then I’ll just signal them when necessary, and the necessary information will already be there.

Probable final change:

- Switch from TQueue to something like a ring buffer.

35 minutes ago, Anders Melander said:

PWideChar is supposedly a pointer to a WideString? In that case, please don't. WideString is only for use in COM and it's horribly slow.

No, what I meant was that instead of using dynamic strings (which are relatively slow because they must be allocated, sized, resized, freed, etc.) use a static array of chars: Buffer: array[0..BufferSize-1] of Char. You will waste some bytes but it's fast.

Thanks for educating me. The external API that I consume returns PWideChar, which is a pain to work with in certain circumstances… I’ll evaluate changing to an array of char.

Thanks again Anders.

Anders Melander · April 7

5 minutes ago, snowdev said:

- Replace the TCriticalSection locking with gabr42’s version of TLightweightMREW

Do you really need reentrant locks? If not then just use the standard version. There's no need to complicate things further.

Also MREW only makes sense if you have more readers than writers. For single writer & single reader there's no reason for it.

As I read it you are considering using a pool of readers, in which case MREW might very well make sense (or "just" use a lock-free queue).

Btw, if you don't need to process the work packets "in order" then things become a little easier since a lock-free stack is often simpler to implement than a queue.

9 minutes ago, snowdev said:

The external API that I consume returns PWideChar, which is a pain to work with in certain circumstances…

A PWideChar doesn't necessarily mean that the source string is a WideString. It could very well just be a pointer to a regular Unicode string.

It's just not common to explicitly use PWideChar anymore since it is the same as PChar on unicode Delphi. Regardless, the message was more that you should avoid WideString unless you have a reason to use it. You can't really do anything about what your external lib uses internally.

Good luck. You have many hours of debugging ahead of you 🙂

Stefan Glienke · April 7

Be careful with the term reader and assuming that you can use mrew - in a consumer/producer pattern, the "reader" (i.e. consumer) is mutating a data structure (i.e. it pops an item)

RDP1974 · April 7

Imagine a pool with 100 TThreads, each with a FIFO queue receiving messages, where each TThread also sends messages simultaneously to all the others:
would a tthreadqueue without global locking, as in the CRT, then be the fastest solution?

Tommi Prami · April 8

Hello,

Depending on your CPU, you might need to set Affinity mask for the threads.

If you have a new Intel CPU with P and E cores, you get wildly different results depending on which cores the threads are running on.
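For a benchmark run on Windows, pinning can be sketched with the Win32 SetThreadAffinityMask call. The mask value below (bit 0, meaning logical processor 0) is only an assumption for illustration: which logical processors are P cores or E cores differs per machine and has to be looked up separately.

```pascal
program AffinitySketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

const
  // Hypothetical choice: bit 0 set = allow logical processor 0 only.
  BenchmarkMask: DWORD_PTR = 1;

var
  Previous: DWORD_PTR;
begin
  // Restrict the calling thread. The call returns the previous mask, or 0 on failure.
  Previous := SetThreadAffinityMask(GetCurrentThread, BenchmarkMask);
  if Previous = 0 then
    RaiseLastOSError;
  try
    Writeln('Previous affinity mask: $', IntToHex(Previous, 2 * SizeOf(Previous)));
    // ... run the timed section here ...
  finally
    // Hand scheduling back to Windows once the measurement is done.
    SetThreadAffinityMask(GetCurrentThread, Previous);
  end;
end.
```

For worker threads, the same call would be made at the top of each thread's Execute, or from outside with the thread's Handle.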

I made a simple unit to get affinities: https://github.com/TommiPrami/Delphi.ProcessAffinity.Utils

If you use it and find bugs, please make a pull request.

-Tee-

Anders Melander · April 8

6 hours ago, Tommi Prami said:

Depending on your CPU, you might need to set Affinity mask for the threads.

For benchmarking, sure. But otherwise I would think it would be better to let Windows manage that.

6 hours ago, Tommi Prami said:

I made a simple unit to get affinities: https://github.com/TommiPrami/Delphi.ProcessAffinity.Utils

That code sure does look a lot like this one...

https://github.com/graphics32/graphics32/blob/4fbc8d2a3083e42a00ca776eaa52af7cab2de34a/Source/GR32_System.pas#L399

RDP1974 · April 8

Imagine a pool with 100 TThreads, each with a FIFO queue receiving messages, where each TThread also sends messages simultaneously to all the others:
would a tthreadqueue without global locking, as in the CRT, then be the fastest solution?

As far as I have researched, the Spring4D lock-free queue seems to be the fastest solution (but I cannot find it in the source).

finally -> OmniThreadLibrary -> TOmniBaseQueue -> Dynamically allocated, O(1) enqueue and dequeue, threadsafe, microlocking queue
or TOmniMessageQueue (ring buffer)

also I have found a ring buffer from https://blog.grijjy.com/2017/01/12/expand-your-collections-collection-part-2-a-generic-ring-buffer/

Could you please suggest the best code or libraries for implementing producer-consumer between threads? Thanks.

Btw, if I have time I will build a DLL for tbb::concurrent_queue.

Edited April 9 by RDP1974

Tommi Prami · April 9

17 hours ago, Anders Melander said:

That code sure does look a lot like this one...

https://github.com/graphics32/graphics32/blob/4fbc8d2a3083e42a00ca776eaa52af7cab2de34a/Source/GR32_System.pas#L399

It is mostly from there, I think. I thought I had made that clear in the code or on GitHub; let me check. I think I found a few starting points when searching for a solution to the problem.

If you’re referring to what I said earlier, that I wrote it: I honestly no longer remembered where it originally came from. I’ll make sure to give credit to the project. In other words, I had no intention of taking credit for something that isn’t mine.

-Tee-

Edited April 9 by Tommi Prami

Tommi Prami · April 9

17 hours ago, Anders Melander said:

For benchmarking, sure. But otherwise I would think it would be better to let Windows manage that.

At least in my case, when I ran 7-Zip processes in parallel, Windows did a piss-poor job of allocating those processes to performance cores.

Anyone's mileage might vary, for sure.

Tommi Prami · April 9

15 hours ago, RDP1974 said:

As far as I have researched, the Spring4D lock-free queue seems to be the fastest solution (but I cannot find it in the source).

Are there lock-free or thread-safe containers in Spring4D...

Asking for a friend 😉

-Tee-

Stefan Glienke · April 9

I don't know what research RDP did (probably asking some GenAI), but Spring4d does not contain thread-safe collections - for those needs, refer to libraries such as OTL or protect them by primitives in your own code according to your use-cases.

I don't step into that territory because you cannot simply make general-purpose collections thread-safe. It already starts with simple things like: how do you protect a list where one thread adds/removes items and another iterates over it?

It then requires a different API, and it's complex to design a general-purpose thread-safe collection library because everyone has their use cases, which you cannot simply combine.

Edited April 9 by Stefan Glienke

RDP1974 · April 9

I asked good old ChatGPT

Edited April 9 by RDP1974

Anders Melander · April 9

2 hours ago, Tommi Prami said:

I had no intention of taking credit for something that isn’t mine.

Yet you did.

If you couldn't remember where you got the code from then you simply shouldn't have posted it without proper attribution.

As it is now the code is nearly identical so it's not just "inspired" or "influenced". That's not a problem in itself, it's open source after all, but you have to at least keep the original license which is "MPL 1.1 or LGPL 2.1 with linking exception". I've created an issue at your repo to get that fixed.

snowdev · April 10

Just an update.

I tested a ring buffer solution (I’ve used this one for benchmarking: https://github.com/MHumm/CircularBuffer)

I didn’t go into the implementation details to check whether it is well optimized or not; I just tested it.

My results were the same as TQueue (with critical section) for pushing items onto the queue, with a slight difference for consuming (around 5 ms, give or take). I tested both with a TObject descendant and with a record (not pointers), and got better results with the latter.

I’ll do more tests, but for now it is safer to just switch the current solution to a faster read/write lock… at least until I am able to write my own ring-buffer-based queue and understand 100% of what is going on, for maintainability.

Tommi Prami · April 10

19 hours ago, Anders Melander said:

Yet you did.

If you couldn't remember where you got the code from then you simply shouldn't have posted it without proper attribution.

People tend to forget things, right?

As I recall, ChatGPT gave virtually identical code as well, but that does not change the fact that I messed up...

(EDIT: What I tried to say with the ChatGPT thing was that I think (if I recall correctly) I made the ChatGPT implementation first, checked the API calls it made, did a Google search, found the Graphics32 implementation and went with that.

I think I had no intention of making the repository of that extracted version public, as I was just testing the effect of using only the performance cores out of curiosity, but at some point I thought that maybe someone else would like to use it too. At that point I most likely did not even think about the origin of the code any more. I should have, for sure.

I would not like to hijack this thread any further with this. I hope all is good now that the license is changed and a reference to the origin is added; open an issue on the project if something else needs fixing.)

Edited April 10 by Tommi Prami

Stefan Glienke · April 10

41 minutes ago, Tommi Prami said:

As I recall ChatGPT gave virtually identical code also, but does not change the fact that I messed up...

This just underlines that GenAI and copyright is a delicate thing.

snowdev · April 10

@Anders Melander you mentioned profiling. I searched and found https://www.delphitools.info/samplingprofiler/ , but realized that it only works for main thread functions, which doesn’t help me a lot since I handle almost everything in threads.

I also found your solution at https://github.com/andersmelander/map2pdb and saw that (at least) the initial releases had no support for 64-bit executables... I tried to find something about it in the post discussion, but I am not sure whether it supports 64-bit applications; I also searched the repo but didn’t find any mention.

Does your solution work for 64-bit applications?

Anders Melander · April 10

8 minutes ago, snowdev said:

Does your solution work for 64-bit applications?

Yes it does - but map2pdb just produces the pdb files required by the profilers.

Profilers that work with pdb files include Intel VTune and AMD μProf.

I believe μProf works with both Intel and AMD processors while VTune only works with Intel processors but use the one that matches your processor to get the most precise results. I use VTune myself.

Ask if you need instructions on how to get started. The process is very easy once you know how to do it but it can be challenging to get to that point 🙂
