snowdev, posted 14 hours ago (edited)

I have an application that continuously receives websocket data and processes it in the background on worker threads. Some data is critical and must be processed as quickly as possible. Once I receive the websocket data, I fill a TObject descendant with the information and push it to the corresponding worker thread, which processes and frees the object. I use Delphi 12.

I would like to reduce as much overhead as possible in this flow. Besides the business rules, I believe there is overhead in my worker thread consumer implementation, especially because I use TObject descendants to transport the data. There is also TObject cloning when one worker thread communicates with another: each worker thread owns the lifetime of the objects in its queue, so I need to send a copy to each worker thread.

I decided to write a benchmark to compare the threading queue approaches I know of and see which one performs best; the benchmark I built is attached. Currently I use the CustomQueueObject.pas queue model in production.

The benchmark contains 4 examples of threaded consumer queues:
- A thread with TQueue<TObject>, TSemaphore and TCriticalSection;
- A thread with TQueue<Pointer>, TSemaphore and TCriticalSection;
- A thread using the internal TThread queue to process TObject descendants, without events and sync objects;
- A thread using the internal TThread queue to process Pointers, without events and sync objects.

After many tests, to my surprise the fastest consumer queue is the one in CustomQueueObject.pas (which I already use), even though it has more creation/destruction and access control (sync objects)... That brought me here to ask more experienced developers whether I am doing something wrong (the examples reflect my actual code base). This applies mainly to the Pointer examples, as I rarely use pointers, but I am willing to change if it is better.

In my mind the internal thread queue would be much more efficient for the reasons given above... I also thought that working with Pointers could greatly improve the overall performance of the workflow, since I would have a single reference that is freed only at the end. Could I be measuring the performance incorrectly?

Thanks in advance!

Attachment: MultithreadingQueueBenchmark.zip
Anders Melander, posted 9 hours ago

4 hours ago, snowdev said: Could I be measuring the performance incorrectly?

Yes.

For one, you are running all the tests concurrently, which means that you will be penalizing the tests that start later because they will be competing for CPU against the tests that are already executing. Execute each test and wait for it to finish before you start the next test.

You also seem to have massive memory leaks, which probably means that some of the tests have an unfair advantage because they don't spend time releasing their resources.

If you are using thread pools (I'm not sure that you are; if not, you should be) then you should ensure that the thread pool has been spun up before you start the test. Otherwise you will penalize the first threads with the startup overhead.

Instead of just looking at the time from start to end and then guessing about why it is fast/slow/whatever, profile your code so you can see exactly where the bottlenecks are. Do this for each individual algorithm in turn.

Apart from that, for something as simple as this, you don't need locking and you definitely don't need to use the Windows message queue as a work queue. Use a simple lock-free FIFO queue instead. You could even use a fixed-size lock-free ring buffer (just an array of records with two integer values as in/out indices). The fixed-size buffer and the records would eliminate the allocation overhead of the queue itself.

You should probably also try to eliminate the use of string and replace it with a fixed-size buffer if possible.
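A minimal sketch of the fixed-size ring buffer described above (an array of records plus two integer indices), for illustration only. It assumes exactly one producer thread and one consumer thread, and it assumes `MemoryBarrier` from the System unit is available in your RTL. The names `TWorkItem`, `TSpscRing`, `TryPush`, `TryPop` and the capacity are hypothetical and not taken from the attached benchmark. The demo at the bottom is single-threaded just to show the calls.

```pascal
program RingBufferSketch;

{$APPTYPE CONSOLE}

const
  BufferSize = 1024;            // hypothetical capacity; must be a power of two
  BufferMask = BufferSize - 1;

type
  // Hypothetical work item: a plain record, so no heap allocation per message.
  TWorkItem = record
    Id: Integer;
    Value: Double;
  end;

  // Single-producer/single-consumer ring buffer.
  // Only valid when one thread calls TryPush and one other thread calls TryPop.
  TSpscRing = record
  private
    FItems: array[0..BufferSize - 1] of TWorkItem;
    FHead: Integer; // next slot to write; modified only by the producer
    FTail: Integer; // next slot to read; modified only by the consumer
  public
    procedure Init;
    function TryPush(const Item: TWorkItem): Boolean;
    function TryPop(out Item: TWorkItem): Boolean;
  end;

procedure TSpscRing.Init;
begin
  FHead := 0;
  FTail := 0;
end;

function TSpscRing.TryPush(const Item: TWorkItem): Boolean;
var
  Next: Integer;
begin
  Next := (FHead + 1) and BufferMask;
  if Next = FTail then
    Exit(False);          // full: one slot is deliberately left unused
  FItems[FHead] := Item;
  MemoryBarrier;          // make the item visible before the new head
  FHead := Next;
  Result := True;
end;

function TSpscRing.TryPop(out Item: TWorkItem): Boolean;
begin
  if FTail = FHead then
    Exit(False);          // empty
  MemoryBarrier;          // read the item only after observing the head
  Item := FItems[FTail];
  MemoryBarrier;          // finish reading before releasing the slot
  FTail := (FTail + 1) and BufferMask;
  Result := True;
end;

var
  Ring: TSpscRing;
  Item: TWorkItem;
  I: Integer;
begin
  Ring.Init;
  for I := 1 to 3 do
  begin
    Item.Id := I;
    Item.Value := I * 1.5;
    if not Ring.TryPush(Item) then
      Writeln('queue full, item ', I, ' not queued');
  end;
  while Ring.TryPop(Item) do
    Writeln('popped item ', Item.Id);
end.
```

With several producers or several consumers this sketch is no longer safe; that case needs interlocked index updates or a different structure. A real worker would also need a way to sleep when the buffer is empty instead of spinning.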
RDP1974, posted 8 hours ago

Could you please test with TThreadedQueue? (with the latest 12.3)
snowdev, posted 5 hours ago (edited)

5 hours ago, chmichael said: Why you don't use TMonitor ?

I searched the internet when I started the project and found some posts about TMonitor performance. I also found a blog post about it by gabr42 (the OmniThreadLibrary creator) and simply decided to use TCriticalSection.

4 hours ago, Anders Melander said: - For one you are running all the tests concurrently which means that you will be penalizing the tests that start later because they will be competing for CPU against the test that are already executing. Execute each test and wait for it to finish before you start the next test.

That makes sense. I tested this way and got similar results, and the simple FIFO queue wins (working with objects or pointers).

4 hours ago, Anders Melander said: - You also seem to have massive memory leaks which probably means that some of the test have an unfair advantage because they don't consume time releasing their resources.

Not exactly. ReportMemoryLeaksOnShutdown did not report any leak when running the tests… every queue type releases its resources.

4 hours ago, Anders Melander said: - If you are using thread pools (I'm not sure that your are (if not, you should be)) then you should ensure that the thread pool has been spun up before you start the test. Otherwise you will penalize the first threads with the startup overhead.

I'll take a look into that; I usually don't use them. That is why I didn't include one in the given example… every thread is started at application initialization.

4 hours ago, Anders Melander said: - Instead of just looking at the time from start to end and then guessing about why it is fast/slow/whatever, profile your code so you can see exactly where the bottlenecks are. Do this for each individual algorithm in turn.

Thanks for the tip. I don't know of a profiling library for Delphi, but I'll measure them with stopwatches.

4 hours ago, Anders Melander said: Apart from that, for something as simple as this, you don't need locking and you definitely don't need to use the Windows message queue as a work queue. Use a simple lock-free fifo queue instead. You could even use a fixed size lock-free ring buffer (just an array of records with two integer values as in/out indices). The fixed size buffer and the records would eliminate the allocation overhead of the queue itself. You should probably also try to eliminate the use of string and replace it with fixed size buffer if possible.

I only use locking because I don't know whether there could be a deadlock when another thread is pushing while the worker is popping, so I do it just in case. Are you saying that this scenario isn't really possible?

About the Windows message queue: it seems to be as slow as a simple FIFO as well, though I had thought about continuing to use this approach.

In the next few days I'll build a ring-buffer-like approach and compare its performance to TQueue, which internally uses an array of T, by the way.

About strings: I could switch to PWideChar as well; I use string for convenience.

3 hours ago, RDP1974 said: please can you test with TThreadedQueue? (with latest 12.3)

Almost the same performance as TQueue.

Thanks for the reply.
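The "run each test on its own and time it" advice can be sketched roughly as follows, using `TStopwatch` from System.Diagnostics. This is only an illustration: `RunBench`, `TBenchProc` and `DummyWorkload` are hypothetical names, and the dummy loop merely stands in for "push N items and block until the worker has processed and freed all of them".

```pascal
program SequentialBenchSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

type
  TBenchProc = procedure; // hypothetical: one self-contained benchmark run

// Runs one candidate to completion and reports its wall-clock time.
// Work must not return until every queued item has been processed and freed,
// otherwise cleanup cost leaks into the next measurement.
procedure RunBench(const Name: string; Work: TBenchProc);
var
  Watch: TStopwatch;
begin
  Watch := TStopwatch.StartNew;
  Work;
  Watch.Stop;
  Writeln(Format('%-12s %d ms', [Name, Watch.ElapsedMilliseconds]));
end;

// Placeholder workload; replace with a real queue test that blocks until done.
procedure DummyWorkload;
var
  I: Integer;
  Sum: Int64;
begin
  Sum := 0;
  for I := 1 to 10000000 do
    Inc(Sum, I);
  if Sum = 0 then
    Writeln('unexpected'); // uses Sum so the loop is not discarded
end;

begin
  // Candidates run strictly one after another, never concurrently.
  RunBench('warm-up', DummyWorkload);
  RunBench('candidate A', DummyWorkload);
  RunBench('candidate B', DummyWorkload);
end.
```

A stopwatch only gives totals per run; it does not show where inside a run the time goes, which is what a profiler is for.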
Kas Ob., posted 4 hours ago

1 hour ago, snowdev said: Not exactly, using ReportMemoryLeaksOnShutdown didnt take any leak running the tests… every queue format release their resources.

I did not run the code, and my test would be irrelevant on an old IDE, but I browsed the code and there is one thing to point out: Anders might be right, and you should investigate whether you really are not generating leaks, or whether you are hitting something worse.

There is "FreeOnTerminate := True;" in the constructor in a few places, and yet there is an explicit call to Destroy. This is a bad mix; leaks and double frees will most likely be there, unless TThread and the RTL have changed a lot since XE8, in which case just ignore this post.
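To illustrate the concern about mixing FreeOnTerminate with an explicit Destroy, here is a small sketch of the "creator owns the thread" model, where FreeOnTerminate stays False and the creator frees the thread exactly once. `TWorker` is a hypothetical class, not one from the attached benchmark.

```pascal
program ThreadOwnershipSketch;

{$APPTYPE CONSOLE}

uses
  System.Classes, System.SysUtils;

type
  // Hypothetical worker that idles until it is asked to terminate.
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
begin
  while not Terminated do
    TThread.Sleep(10);
end;

var
  Worker: TWorker;
begin
  // The creator owns the thread object and is the only one that frees it.
  Worker := TWorker.Create(True);   // created suspended
  Worker.FreeOnTerminate := False;  // the default, stated here for clarity
  Worker.Start;

  Sleep(50);                        // let the worker run briefly

  Worker.Terminate;                 // ask Execute to leave its loop
  Worker.WaitFor;                   // block until Execute has returned
  Worker.Free;                      // single, explicit release
  Writeln('worker stopped and freed by its owner');

  // The alternative model sets FreeOnTerminate := True before Start. In that
  // case the creator must not call Free, Destroy or WaitFor on the reference
  // afterwards, because the object may already have destroyed itself.
end.
```

Picking one of the two models per thread class avoids the leak/double-free ambiguity described above.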
Anders Melander, posted 4 hours ago

34 minutes ago, snowdev said: Not exactly, using ReportMemoryLeaksOnShutdown didnt take any leak running the tests… every queue format release their resources.

I didn't investigate, but I got a lot of leaks reported when exiting the application while running in the debugger.

36 minutes ago, snowdev said: I'll take a look into that, usually dont. This reason I dont included in the given example… every thread became up on the app initialization.

Okay. It's expensive to start a thread, but if you are launching the threads at application startup then it doesn't matter. If you create them on demand then I would use TTask instead. The first task will take the worst of the pool initialization hit.

39 minutes ago, snowdev said: I dont know a profiling lib for Delphi, but I'll measure them with stopwatches.

https://en.delphipraxis.net/search/?q=profiling

41 minutes ago, snowdev said: I just use locking because I dont know if there could have a deadlock when other thread is pushing and the worker is popping, so I do it just in case.

If you use a lock-free structure then you don't need locking. Hence the "free" in the name 🙂

And FTR, the term deadlock means a cycle where two threads each have some resource locked and each is waiting for the other to release its resource. I think what you meant was a race condition: two threads modifying the same resource at the same time.

54 minutes ago, snowdev said: About strings I could switch to PWideChar aswell, I use string for ease.

PWideChar is supposedly a pointer to a WideString? In that case, please don't. WideString is only for use in COM and it's horribly slow.

No, what I meant was that instead of using dynamic strings (which are relatively slow because they must be allocated, sized, resized, freed, etc.) you use a static array of chars: Buffer: array[0..BufferSize-1] of Char. You will waste some bytes but it's fast.
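A rough sketch of the static char array idea: the text lives inside the record itself, so copying a message involves no string allocation or reference counting. `TMessageRec`, `SetText` and `MaxTextLen` are hypothetical names, and the truncation policy (silently cutting off text that does not fit) is an assumption you may not want in production.

```pascal
program FixedBufferSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  MaxTextLen = 63; // hypothetical maximum number of characters kept

type
  // Hypothetical message record with an embedded, fixed-size text buffer.
  TMessageRec = record
    Len: Cardinal;
    Text: array[0..MaxTextLen] of Char; // one extra slot for the #0 terminator
  end;

// Copies a null-terminated source (e.g. a PWideChar handed over by an external
// API) into the record, truncating to MaxTextLen characters.
procedure SetText(var Rec: TMessageRec; Source: PChar);
begin
  StrLCopy(Rec.Text, Source, MaxTextLen); // always writes a terminating #0
  Rec.Len := StrLen(Rec.Text);
end;

var
  Rec, Copy: TMessageRec;
begin
  SetText(Rec, 'price update');
  Copy := Rec; // plain memory copy of the whole record, no heap involved
  Writeln(PChar(@Copy.Text[0]), ' (', Copy.Len, ' chars)');
end.
```

The trade-off is the one already mentioned: every record carries the full buffer even when the text is short.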
snowdev, posted 3 hours ago

15 minutes ago, Anders Melander said: I didn't investigate but I got a lot of leaks reported when existing the application when running in the debugger.

That's a bit weird. I am also running in debug and didn't get any leaks.

16 minutes ago, Anders Melander said: If you use a lock-free structure then you don't need locking. Hence the "free" in the name 🙂 And FTR, the term deadlock means a cycle where two threads each have some resource locked and each is waiting for the other to release their resource. I think what you meant was race condition; Two threads modifying the same resource at the same time.

Yes, I forgot the correct term. Anyway, I am already planning some changes:

- Replace the TCriticalSection locking with gabr42's version of TLightweightMREW (discussion here: https://www.thedelphigeek.com/2021/02/readers-writ-47358-48721-45511-46172.html?m=1).
- Switch from single worker thread processing to N worker thread processing. Currently I have a worker that loops over an array and does what is necessary based on each item's settings, so on every notification the thread retrieves the data from this array. Instead, I'll create something like a thread pool (or a simple TThread list) and pre-create the worker threads with predefined parameters; then I'll just signal them when necessary, and the information they need will already be there.
- Probably, as a final change, switch from TQueue to a ring-buffer-like structure.

35 minutes ago, Anders Melander said: PWideChar is supposedly a pointer to a WideString? In that case, please don't. WideString is only for use in COM and it's horribly slow. No, what I meant was that instead of using dynamic strings (which are relatively slow because they must be allocated, sized, resized, freed, etc.) use a static array of chars: Buffer: array[0..BufferSize-1] of Char. You will waste some bytes but it's fast.

Thanks for educating me. The external API that I consume returns PWideChar, and it is a pain to work with in certain circumstances… I'll evaluate changing to an array of Char.

Thanks again, Anders.
Anders Melander, posted 3 hours ago

5 minutes ago, snowdev said: -Change TCriticalSection locking for TLightweigtMREW gabr42 version

Do you really need reentrant locks? If not, then just use the standard version. There's no need to complicate things further.

Also, MREW only makes sense if you have more readers than writers. For a single writer and a single reader there's no reason for it. As I read it, you are considering using a pool of readers, in which case MREW might very well make sense (or "just" use a lock-free queue).

Btw, if you don't need to process the work packets "in order" then things become a little easier, since a lock-free stack is often simpler to implement than a queue.

9 minutes ago, snowdev said: The external API which I consume returns PWideChar and is a pain to work in certain circumstances…

A PWideChar doesn't necessarily mean that the source string is a WideString. It could very well just be a pointer to a regular Unicode string. It's just not common to explicitly use PWideChar anymore, since it is the same as PChar on Unicode Delphi.

Regardless, the message was more that you should avoid WideString unless you have a reason to use it. You can't really do anything about what your external lib uses internally.

Good luck. You have many hours of debugging ahead of you 🙂
Stefan Glienke, posted 3 hours ago

Be careful with the term "reader" and with assuming that you can use MREW: in a consumer/producer pattern, the "reader" (i.e. the consumer) is mutating a data structure (i.e. it pops an item).
RDP1974, posted 10 minutes ago

Suppose a pool of 100 TThreads, each with a FIFO queue receiving messages, and each thread also sending messages simultaneously to all the others: would a thread queue without global locking, as CRT, then be the faster solution?