Everything posted by Arnaud Bouchez

  1. All those tests on localhost on Windows are not very representative. If you want something fast and scalable, use a Linux server, and don't benchmark over the loopback interface, which is heavily short-circuited by the OS itself. Changing the MM in mORMot tests never yields a 10x improvement, because the framework tries to avoid heap allocation as much as possible. @Edwin Yip @Stefan Glienke If you don't make any memory allocation, then you have the best performance. Our THttpAsyncServer event-driven HTTP server tries to minimize memory allocation, and we get very high numbers. https://github.com/synopse/mORMot2/blob/master/src/net/mormot.net.async.pas

     If I understand correctly, the performance went from 353 to 4869 requests per second with ab. I need to emphasize that ab is not a good benchmarking tool for high-performance numbers. You need to use something more scalable, like wrk. With a mORMot 2 HTTP server on Linux, a wrk benchmark reports requests per second much higher than those, even with the default FPC memory manager. And if we use the FastMM4-based mORMot MM (which is tuned for multithreading), we reach more than 100K per second. On my old Core i5 7200U laptop:

       abouchez@aaa:~/$ wrk -c 100 -d 15s -t 4 http://localhost:8080/plaintext
       Running 15s test @ http://localhost:8080/plaintext
         4 threads and 100 connections
         Thread Stats   Avg      Stdev     Max   +/- Stdev
           Latency     1.41ms    3.74ms  45.25ms   93.57%
           Req/Sec    30.84k     6.58k   48.49k    65.72%
         1845696 requests in 15.09s, 288.67MB read
       Requests/sec: 122341.58
       Transfer/sec:     19.13MB

     Server code is available at https://github.com/synopse/mORMot2/tree/master/ex/techempower-bench

     If I run the test with ab, I get:

       $ ab -c 100 -n 10000 http://localhost:8080/plaintext
       This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
       Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
       Licensed to The Apache Software Foundation, http://www.apache.org/

       Server Software:        mORMot2
       Server Hostname:        localhost
       Server Port:            8080

       Document Path:          /plaintext
       Document Length:        13 bytes

       Concurrency Level:      100
       Time taken for tests:   0.616 seconds
       Complete requests:      10000
       Failed requests:        0
       Total transferred:      1590000 bytes
       HTML transferred:       130000 bytes
       Requests per second:    16245.71 [#/sec] (mean)
       Time per request:       6.155 [ms] (mean)
       Time per request:       0.062 [ms] (mean, across all concurrent requests)
       Transfer rate:          2522.53 [Kbytes/sec] received

     As you can see, ab is not very good at scaling over multiple threads, especially because by default it does NOT keep the connections alive. If you add the -k switch, you get keep-alive connections, which is closer to the actual use of a server, I guess:

       $ ab -k -c 100 -n 100000 http://localhost:8080/plaintext
       This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
       Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
       Licensed to The Apache Software Foundation, http://www.apache.org/

       Server Software:        mORMot2
       Server Hostname:        localhost
       Server Port:            8080

       Document Path:          /plaintext
       Document Length:        13 bytes

       Concurrency Level:      100
       Time taken for tests:   1.284 seconds
       Complete requests:      100000
       Failed requests:        0
       Keep-Alive requests:    100000
       Total transferred:      16400000 bytes
       HTML transferred:       1300000 bytes
       Requests per second:    77879.68 [#/sec] (mean)
       Time per request:       1.284 [ms] (mean)
       Time per request:       0.013 [ms] (mean, across all concurrent requests)
       Transfer rate:          12472.92 [Kbytes/sec] received

     Still, even with keep-alive, ab achieves only about two thirds of the requests per second that wrk reaches on the very same server. Just forget about ab.
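     For reference, here is a minimal sketch of such a server, with the constructor and callback signatures approximated from the mORMot 2 sources - the real, tuned program lives in the techempower-bench folder linked above:

       program MinimalPlainTextServer;

       uses
         sysutils,
         mormot.net.http,   // THttpServerRequestAbstract
         mormot.net.async;  // THttpAsyncServer

       type
         TSimpleServer = class
           function DoRequest(Ctxt: THttpServerRequestAbstract): cardinal;
         end;

       function TSimpleServer.DoRequest(Ctxt: THttpServerRequestAbstract): cardinal;
       begin
         // no heap allocation beyond the two constant string assignments
         Ctxt.OutContentType := 'text/plain';
         Ctxt.OutContent := 'Hello, World!';
         result := 200; // HTTP_SUCCESS
       end;

       var
         handler: TSimpleServer;
         server: THttpAsyncServer;
       begin
         handler := TSimpleServer.Create;
         // port, OnStart/OnStop callbacks, process name, thread pool count
         server := THttpAsyncServer.Create('8080', nil, nil, '', 16);
         try
           server.OnRequest := handler.DoRequest;
           server.WaitStarted; // raises an exception if the bind failed
           readln;             // serve until [Enter] is pressed
         finally
           server.Free;
           handler.Free;
         end;
       end.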
     And when I close my test server, I have the following stats:

       {
         "ApiVersion": "Debian Linux 5.10.0 epoll",
         "ServerName": "mORMot2 (Linux)",
         "ProcessName": "8080",
         "SockPort": "8080",
         "ServerKeepAliveTimeOut": 300000,
         "HeadersDefaultBufferSize": 2048,
         "HeadersMaximumSize": 65535,
         "Async": {
           "ThreadPoolCount": 16,
           "ConnectionHigh": 100,
           "Clients": {
             "ReadCount": 1784548,
             "WriteCount": 1627649,
             "ReadBytes": 95843130,
             "WriteBytes": 267528514,
             "Total": 10313
           },
           "Server": {
             "Server": "0.0.0.0",
             "Port": "8080",
             "RawSocket": 5,
             "TimeOut": 10000
           },
           "Accepted": 10313,
           "MaxConnections": 7777777,
           "MaxPending": 100000
         }
       }

       Flags: SERVER assumulthrd lockless erms debug repmemleak
       Small:  blocks=14K size=997KB (part of Medium arena)
       Medium: 10MB/21MB peak=21MB current=8 alloc=17 free=9 sleep=0
       Large:  0B/0B peak=0B current=0 alloc=0 free=0 sleep=0
       Small Blocks since beginning: 503K/46MB (as small=43/46 tiny=56/56)
         64=306K 48=104K 32=39K 96=11K 160=9K 192=5K 320=5K 448=5K
         2176=5K 256=4K 80=1K 144=1K 128=888 1056=609 112=469 1152=450
       Small Blocks current: 14K/997KB
         64=10K 48=3K 352=194 32=172 112=128 128=85 96=83 80=76
         16=38 176=23 192=16 576=14 144=12 880=10 272=10 160=9

     The last block is the detailed report of our FastMM4 fork for FPC, written in x86_64 asm: https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas

     The peak MM consumption was 21MB of memory. Compare it with what WebBroker consumes in a similar test. In particular, "sleep=0" indicates that there was NO contention at all during the whole server run - just by adding some enhancements to the original FastMM4 code, like a thread-safe round-robin of small-block arenas. This run had memory-leak reporting enabled (none reported here) and debug/stats mode, so you could save a few per cent by disabling those features.

     To conclude, it seems that it is the WebBroker technology which is fundamentally broken in terms of performance, not the MM itself. I would also look at how many MB of memory the processes consume: I suspect MSHeap consumes more than FastMM4. Intel TBB was a nightmare in that respect - not usable on production servers, from my tests. (Sorry if I was a bit long, but it is a subject I like very much.)
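     P.S. To try this MM in your own FPC project, the only change needed is at the top of the program file. A hedged sketch, with the conditional names as used by the mORMot 2 sources (double-check them against mormot.core.fpcx64mm.pas):

       program MyServer;

       // define FPC_X64MM (and e.g. FPCMM_SERVER) as project conditionals
       uses
         {$ifdef FPC_X64MM}
         mormot.core.fpcx64mm, // must come first, before any heap allocation
         {$endif FPC_X64MM}
         sysutils;

       begin
         // the whole program now allocates through the x86_64 asm MM;
         // the "Flags: ..." line above reflects such compile-time options,
         // e.g. leak reporting and debug stats, which can be disabled
       end.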
  2. Components4developers???

     Perhaps they are under attack by Russian hackers...
  3. Is Move the fastest way to copy memory?

     L1 cache access time makes a huge difference. http://blog.skoups.com/?p=592 You could retrieve the L1 cache size, then work on buffers of about 90% of this size (always keep some space for the stack, lookup tables and such). Then, if you work in the API buffer directly, a non-temporal move to the result buffer may help a little. During your process, if you use lookup tables, ensure they don't pollute the cache. But profiling is the key, for sure: guesses are wrong most of the time...
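     A minimal sketch of the chunking part of that idea (all names hypothetical, and the cache size hard-coded where a real program would query the CPU):

       const
         L1CacheSize = 32 * 1024;              // typical L1 data cache per core
         ChunkSize = (L1CacheSize * 9) div 10; // ~90%, leaving some headroom

       // placeholder for the actual work done on each cache-hot chunk
       procedure ProcessChunk(p: PByte; n: PtrInt);
       var
         i: PtrInt;
       begin
         for i := 0 to n - 1 do
           p[i] := p[i] xor $ff; // dummy transformation
       end;

       procedure ProcessBuffer(src: PByte; size: PtrInt);
       var
         chunk: PtrInt;
       begin
         while size > 0 do
         begin
           chunk := size;
           if chunk > ChunkSize then
             chunk := ChunkSize;
           ProcessChunk(src, chunk); // the data stays in L1 for the whole pass
           inc(src, chunk);
           dec(size, chunk);
         end;
       end;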
  4. Is Move the fastest way to copy memory?

     That's what I wrote: an alternate Move() is unlikely to make a huge difference. When working on buffers, cache locality is a key performance factor. Working on smaller blocks, which fit in the CPU cache (a few dozen KB for L1, a few MB for L2/L3 usually), could be faster than one big Move followed by one big processing pass. But perhaps your CPU already has a big enough cache (bigger than your picture), so it won't help. About the buffers, couldn't you use a ring of them, so that you don't move data at all?
  5. How make benchmark?

     It will depend on the database used behind FireDAC or Zeos, and the standard used (ODBC/OleDB/Direct...). I would say that both are well tuned - just ensure you get the latest version of Zeos, which has been much more maintained and refined than FireDAC over the last years. Note that FireDAC has some aggressive settings, e.g. for SQLite3 it changes the default safe write settings into faster access. The main interest of Zeos is that the ZDBC low-level layer does not use a TDataSet, so it is (much) faster if you retrieve a single object. You can see those two behaviors in Michal's numbers above, for instance. Also note that mORMot has a direct DB layer, not based on TDataSet, which may be used with FireDAC or Zeos, or with its own direct ODBC/OleDB/Oracle/PostgreSQL/SQLite3 data access. See https://synopse.info/files/html/Synopse mORMot Framework SAD 1.18.html#TITL_27 Note that its ORM is built on top of this unique DB layer, and adds some unique features like multi-insert SQL generation, so a mORMot TRestBatch is usually much faster than direct naive INSERTs within a transaction - see the batch sketch below. You can reach 1 million inserts per second with SQLite3 with mORMot 2 - https://blog.synopse.info/?post/2022/02/15/mORMot-2-ORM-Performance
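     A hedged sketch of such a batch insert (API shapes from the mORMot 2 documentation; TOrmSample and the Rest variable are hypothetical):

       var
         batch: TRestBatch;
         rec: TOrmSample;
         i: integer;
       begin
         // commit automatically every 10000 rows, one transaction each
         batch := TRestBatch.Create(Rest.Orm, TOrmSample, 10000);
         try
           rec := TOrmSample.Create;
           try
             for i := 1 to 100000 do
             begin
               rec.Name := 'Sample ' + IntToStr(i);
               batch.Add(rec, {SendData=}true);
             end;
           finally
             rec.Free;
           end;
           // generates multi-INSERT SQL, sent in a single round trip
           Rest.Orm.BatchSend(batch);
         finally
           batch.Free;
         end;
       end;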
  6. Update framework question

     I would stick with a static JSON resource, if it is 20KB of data once zipped. Don't use HEAD for it. With a simple GET and proper ETag caching, the HTTP server returns 304 on GET if the resource was not modified: just a single request, only returning the data when it changed. Everything stays at HTTP server level, so it would be simple and effective.
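     On the wire, the exchange looks like this (header values illustrative): the first request downloads the data and its ETag, every further request merely validates it.

       GET /data.json HTTP/1.1

       HTTP/1.1 200 OK
       ETag: "686897696a7c8"
       Content-Length: 20480
       ...20KB zipped JSON body...

       GET /data.json HTTP/1.1
       If-None-Match: "686897696a7c8"

       HTTP/1.1 304 Not Modified
       (no body: the client keeps its cached copy)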
  7. Is Move the fastest way to copy memory?

     Don't expect anything magic from mORMot MoveFast(). Perhaps a few per cent, one way or the other. On Win32 - which is your target, IIRC - the Delphi RTL uses x87 registers. On this platform, MoveFast() uses SSE2 registers for small sizes, so it is likely to be slightly faster, and will leverage the ERMSB move (i.e. rep movsb) on newer CPUs which support it. To be fair, mORMot asm is more optimized for x86_64 than for i386, because x86_64 is the target platform for the server side, which is the one needing more optimization. But I would just try all FastCode variants - some can be very verbose, but "may" be better. What I would do in your case is try not to move any data at all. Isn't it possible to pre-allocate a set of buffers, then just consume them in a circular way, passing them from the acquisition to the processing methods as pointers, with no copy? The fastest move() is... when there is no move... 🙂
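     A minimal sketch of that "no move" pattern (all names hypothetical; assumes a single producer, and that each slot is fully processed before it is reused):

       const
         RingCount = 8;          // power of two, so "and" could replace "mod"
         BufferSize = 64 * 1024;

       type
         TBufferRing = record
           Buffers: array[0..RingCount - 1] of array[0..BufferSize - 1] of byte;
           Next: integer;
           function Acquire: pointer;
         end;

       // return the next pre-allocated slot: the acquisition code fills it,
       // then hands the very same pointer to the processing code - no copy
       function TBufferRing.Acquire: pointer;
       begin
         result := @Buffers[Next];
         Next := (Next + 1) mod RingCount;
       end;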
  8. Locked SQlite

     See https://www.sqlite.org/lockingv3.html By default, FireDAC opens SQLite3 databases in "exclusive" mode, meaning that only a single connection is allowed. It is much better for performance, but it "locks" the file against any other connection. So, as @joaodanet2018 wrote, change the LockingMode in the FDConnection, or just close the application currently using it.
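     For instance, with the standard FireDAC SQLite connection parameters (values illustrative):

       FDConnection1.Params.DriverID := 'SQLite';
       FDConnection1.Params.Database := 'c:\data\test.db3';
       FDConnection1.Params.Values['LockingMode'] := 'Normal'; // default is 'Exclusive'
       FDConnection1.Open;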
  9. Docx (RTF) to PDF convert

    I really recommend https://www.trichview.com/
  10. Where are you located? (it makes a difference for your potential work status, even remotely) Do you have some code to show? (e.g. on GitHub or anywhere else)
  11. Note: if you read the file from start to end, memory-mapped files are not faster than reading the file into memory. The page faults make it slower than a regular single FileRead() call. For huge files on Win32, which can't be loaded in memory at once, you may use temporary chunks (e.g. 128MB). And if you really load it once and don't want to pollute the OS disk cache, consider using the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows. This is what we do with mORMot's FileOpenSequentialRead(). https://devblogs.microsoft.com/oldnewthing/20120120-00/?p=8493
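     A minimal Win32 sketch of that flag (plain Windows API, similar in spirit to what FileOpenSequentialRead() does):

       uses
         Windows;

       function OpenSequentialRead(const FileName: string): THandle;
       begin
         // FILE_FLAG_SEQUENTIAL_SCAN hints the OS cache manager to read
         // ahead aggressively and recycle pages behind the reader
         result := CreateFile(PChar(FileName), GENERIC_READ,
           FILE_SHARE_READ or FILE_SHARE_WRITE, nil, OPEN_EXISTING,
           FILE_FLAG_SEQUENTIAL_SCAN, 0);
       end;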
  12. Yes, delete() is as bad as copy(), David is right! The idea is to keep the input string untouched, then append the output to a new output string, preallocated once with the maximum potential size. Then call SetLength() once at the end, which is likely to just shrink the length in place, without reallocating the content, thanks to the heap manager.
  13. You could do it with NO copy() call at all. Just write a small state machine and read the input one char at a time.
  14. The main trick is to avoid memory allocation, i.e. temporary string allocations. For instance, the fewer copy() calls, the better. Try to rewrite your code to allocate one single output string per input string. Just parse the input string from left to right, applying the quotes or dates processing on the fly. Then you could also avoid any per-line allocation, and parse the whole input buffer at once. Parsing 100,000 lines could be done much quicker, if properly written - I guess around 500MB/s is easy to reach. For instance, within mORMot, we parse and convert 900MB/s of JSON in pure pascal, including string unquoting. See the sketch below.
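     A minimal sketch of this single-allocation pattern (the job is hypothetical: doubling every quote, as SQL/CSV escaping would):

       function ProcessLine(const input: string): string;
       var
         i, n: PtrInt;
       begin
         SetLength(result, length(input) * 2); // preallocate worst case, once
         n := 0;
         for i := 1 to length(input) do
         begin
           inc(n);
           result[n] := input[i]; // append on the fly: no copy(), no delete()
           if input[i] = '"' then
           begin
             inc(n);
             result[n] := '"';
           end;
         end;
         SetLength(result, n); // single final adjustment, shrinks in place
       end;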
  15. This is a bit tricky (a COM object), but it works fine on Win32. You have the source code at https://github.com/synopse/SynProject We embedded the COM object as a resource within the main exe, and it is uncompressed and registered for the current user.
  16. ShortString has a big advantage: it can be allocated on the stack. So it is perfect for small ASCII text, e.g. numbers-to-text conversion, or when logging information in plain English. Using str() over a shortstring is for instance faster than using IntToStr() and a temporary string (or AnsiString) in a multi-threaded program, because the heap is not involved. Of course, we could use a static array of AnsiChar, but ShortString has the advantage of encoding its length, so it is safer and faster than #0-terminated strings. So on mobile platforms, we could end up creating a new record type, re-inventing the wheel, whereas the ShortString type is in fact still supported and generated by the compiler, and even used by the RTL at its lowest system level. ShortString has been deprecated... and hidden. It could even be restored/unhidden by some tricks like https://www.idefixpack.de/blog/2016/05/system-bytestrings-for-10-1-berlin Why? Because some people at Embarcadero thought it was confusing, and that the language should be "cleaned up" - I translate that as "made more C#/Java like", with a single string type. This was the very same reason they hid RawByteString and AnsiString... More a marketing strategy than a technical decision IMHO. I prefer the FPC way, more "conservative" about preserving backward compatibility. It is worth noting that the FPC compiler source code itself uses a lot of shortstring internally, so it will never be deprecated on FPC for sure. 😉
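     For instance, a heap-free conversion using the standard str() intrinsic (the logging context is hypothetical):

       procedure LogValue(value: integer);
       var
         tmp: shortstring; // up to 256 bytes, purely on the stack
       begin
         str(value, tmp);  // integer-to-text without touching the heap
         // ... append tmp to the log buffer ...
       end;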
  17. About TGUID type...

     Using a local TGUID constant seems the best solution. It is clean and fast (no string/hex conversion involved, just a copy of the TGUID record bytes). No need for anything else, nor for a feature request I guess, because it would be purely cosmetic.
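     That is (GUID value illustrative - the compiler converts the literal at compile time, so no runtime parsing is involved):

       const
         GUID_MYINTF: TGUID = '{C9F5261A-32B0-44A4-A406-5E1FF7A852F5}';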
  18. Why compiler allows this difference in declaration?

     IIRC it was almost mandatory to work with Ole Automation and Word/Excel. Named parameters are the usual way of calling the Word/Excel Ole API, from Office macros or Visual Basic, so Delphi had to support this too. And it is easy to implement named parameters over OLE, because in its ABI, parameters are... indeed named. Whereas in a regular native function ABI, parameters are not named, but passed in order in registers or on the stack. So implementing named parameters for native Delphi code would have been feasible, but would require more magic, especially for the "unnamed" parameters. It was requested several times in the past, especially by people coming from a Python background, since named parameters can make calls more explicit. So the question is more for Embarcadero people. 😉 You can emulate this by using a record or a class to pass the values, but it is not as elegant.
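     This still works today on late-bound OleVariant calls (a sketch assuming Word is installed; the file name is illustrative):

       uses
         ComObj;

       var
         app: OleVariant;
       begin
         app := CreateOleObject('Word.Application');
         // named parameters, resolved at runtime through IDispatch:
         app.Documents.Open(FileName := 'c:\test.doc', ReadOnly := True);
       end;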
  19. Yes, I have seen the handcrafted IMT. But I am not convinced the "sub rcx, xxx; jmp xxx" code block incurs a noticeable performance penalty - perhaps only in irrelevant micro-benchmarks. The CPU lock involved in calling a Delphi virtual method through an interface has a cost for sure - https://www.idefixpack.de/blog/2016/05/whats-wrong-with-virtual-methods-called-through-an-interface - but not a sub + jmp. Also, the enumerator instance included in the list itself seems a premature optimization to me, if we want to be cross-platform, as we do. Calling GetThreadID on a non-Windows platform has a real cost if it goes through the pthread library. And the resulting code complexity makes me wonder if it is really worth it in practice. Better switch to a better memory manager, or just use a record and rely on inlining + stack allocation of an unmanaged enumerator. Since you implemented both of those optimizations: OK, just continue to use them. But I won't go that way with mORMot. Whereas I still find your interface + pre-compiled folded classes an effective and nice trick (even if I just disable byte and word stubs for dictionaries: if you have a dictionary, the hash table will be bigger than the byte/word data itself - so just use integers instead of byte/word for dictionaries in end-user code; but I still have byte/word specializations for IList<>, which has no hash by default, but may add one on demand).
  20. Nice timings. Yes, you are right: in mORMot we only target basic iteration in "for .. in .. do" statements, with none of the further composition and flexibility available with IEnumerable<T>. The difference for small loops is huge (almost 10 times) and for big loops is still relevant (2 times) when records are used. I guess mORMot pointer-based records could be slightly faster than RTL index-based values, especially when managed types are involved. In practice, I find "for .. in .. do" to be the main place for iterations. So to my understanding, records are the way to go for mORMot. Then nothing prevents another method from returning a complex and fluid IEnumerable<T>. We just don't go that way in mORMot yet.
  21. I discovered that using a record as TEnumerator makes the code faster, and allows some kind of inlining, with no memory allocation. My TSynEnumerator<T> record only uses 3 pointers on the stack, with no temporary allocation. I prefer using pointers here to avoid any temporary storage, e.g. for managed types. And the GetCurrent and MoveNext functions have no branch and inline very aggressively. And it works with Delphi 2010! 😉 A record sounds like a much more efficient idea here than a class + interface, as @Stefan Glienke did in Spring4D. Stefan, do you have any reason to prefer interfaces for the enumerator instead of a good old record? From my findings, performance is better with a record, especially thanks to inlining - which is perfect on FPC, whereas Delphi still calls MoveNext and doesn't inline it. It also avoids a try...finally in most simple functions, and any heap allocation. Please check https://github.com/synopse/mORMot2/commit/17b7a2753bb54057ad0b6d03bd757e370d80dce2
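     The pattern boils down to this kind of code - a simplified, non-generic sketch with hypothetical names, pointer-based like TSynEnumerator<T>:

       type
         TIntArrayEnumerator = record
         private
           FCurrent, FAfter: PInteger;
           function GetCurrent: integer; inline;
         public
           function MoveNext: boolean; inline;
           property Current: integer read GetCurrent;
         end;

         TIntArray = record
           Values: array of integer;
           function GetEnumerator: TIntArrayEnumerator;
         end;

       function TIntArrayEnumerator.GetCurrent: integer;
       begin
         result := FCurrent^; // no branch: inlines into the loop body
       end;

       function TIntArrayEnumerator.MoveNext: boolean;
       begin
         inc(FCurrent); // a single pointer increment + compare
         result := PtrUInt(FCurrent) < PtrUInt(FAfter);
       end;

       function TIntArray.GetEnumerator: TIntArrayEnumerator;
       begin
         result.FCurrent := pointer(Values);
         result.FAfter := result.FCurrent;
         inc(result.FAfter, length(Values));
         dec(result.FCurrent); // so the first MoveNext lands on Values[0]
       end;

     A "for v in arr do" loop over such a record compiles down to a tight pointer walk, with no heap allocation and no try..finally block.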
  22. I just found out that Delphi has big troubles with static arrays like THash128 = array[0..15] of byte. If I remove those types, then the Delphi compiler does not emit "F2084 Internal Error" any more... So I will stick with the same kind of types as you did (byte/word/integer/int64/string/interface/variant) and keep the bigger ordinals (THash128/TGUID) using regular (bloated) generics code on Delphi (they work with no problem on FPC 3.2). Yes, I am getting a glimpse of what you endure, and I thank you very much for having found such nice tricks in Spring4D. The interface + folded types pattern is awesome. 😄
  23. Is it only me, or do awful and undocumented problems like "F2084 Internal Error: AV004513DE-R00024F47-0" occur whenever you work with generics in Delphi? @Stefan Glienke How did you manage to circumvent the Delphi compiler limitations? On XE7 for instance, as soon as I use intrinsics I encounter those problems - and the worst is that they are random: sometimes the compilation succeeds, other times there is the Internal Error, sometimes on Win32, sometimes on Win64... 😞 I got rid of as much "inlined" code as possible, since I discovered that generics do not like calling "inline" code. In comparison, FPC seems much more stable. The Lazarus IDE is somewhat lost within generics code (code navigation is not effective) - but at least it doesn't crash, and you can work as you expect. FPC 3.2 generics support is somewhat mature: I can see by disassembling the .o files that intrinsics are properly handled by this compiler. Which is good news, since I rely heavily on them for folding base generics specializations using interfaces, as you did with Spring4D.
  24. 64 bit compiler problem

    My guess is that the default data alignment may have changed between Delphi 10.2 and 10.3, so the static arrays don't have the same size. You may try to use "packed" for all the internal structures of the array.
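     For instance (sizes assume a 64-bit compiler with default 8-byte alignment; TMyData is hypothetical):

       type
         TMyData = packed record
           Flag: byte;    // without "packed", 7 padding bytes would follow
           Value: int64;
         end;
       // SizeOf(TMyData) = 9 when packed, 16 with default alignment -
       // so static arrays of it keep the same size across compiler versions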
  25. @Stefan Glienke If you can, please take a look at a new mORMot 2 unit: https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.collections.pas In contrast to generics.collections, this unit uses interfaces as variable holders, and leverages them to reduce the generated code as much as possible, as the Spring4D 2.0 framework does, but for both Delphi and FPC. Most of the unit is in fact about embedding some core collection types into mormot.core.collections.dcu, to reduce the user units and executable size, for Delphi XE7+ and FPC 3.2+. Thanks a lot for the ideas! It also publishes TDynArray and TSynDictionary high-level features like JSON/binary serialization or thread safety, with generics strong typing. More TDynArray features (like sorting and search) and also TDynArrayHasher features (an optional hash table included in ISynList<T>) are coming.
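     Typical usage looks like this - a hedged sketch from the unit's interface section (the factory and method names should be double-checked against the current sources):

       uses
         mormot.core.collections;

       var
         list: IList<integer>;
         i: integer;
       begin
         list := Collections.NewList<integer>;
         for i := 1 to 10 do
           list.Add(i);
         // the interface is reference-counted: no try..finally/Free needed,
         // and JSON/binary serialization is published at this level too
       end;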