RDP1974

Members
  • Content Count: 235
  • Days Won: 1

Everything posted by RDP1974

  1. RDP1974

    Experience/opinions on FastMM5

    I'm studying and implementing Elixir/Phoenix/Erlang on FreeBSD/Linux. It's simply incredible! From HTTP MVC with routes/controllers/ORM to WebSocket channels, with linear scalability up to millions of sockets per single server, microsecond latency and fault tolerance (it's a VM with a user-level scheduler and signaling)... all within a handful of lines. (A benchmark shows 100,000 reqs/sec from an MVC/PostgreSQL ORM JSON render on a single server; furthermore, you can change the code inside the VM while it is running, so you can update pieces of the running app without shutting it down.) https://www.phoenixframework.org/ https://elixir-lang.org/
  2. RDP1974

    Experience/opinions on FastMM5

    "They" should move if they want to jump on the bandwagon of parallel computing (IMHO, within 5 years it will be the de facto standard, with dozens or hundreds of CPU cores as the norm) -> it is hard to beat Elixir, Erlang, Go or those functional languages that offer built-in horizontal and vertical scalability (userland scheduler with lightweight fibers, kernel threads, multiprocessing over hardware CPU cores, machine clustering... without modifying a line of code) 🙂
  3. RDP1974

    Experience/opinions on FastMM5

    Hi, take a look at https://github.com/RDP1974/Delphi64: there I have patched "key" RTL functions with the SIMD-enhanced versions from the Intel libraries: https://github.com/RDP1974/Delphi64/blob/master/RDPSimd64.pas (Move, FillChar, Pos). So I did a TBB allocator wrapper, a SIMD RTL patch, and an Intel version of Zlib for HTTP deflate (5x faster than gzip).

    Results are outstanding, tested by coders at "famous" companies:
    - A test with Indy, the built-in Delphi TCP library, on an i7 CPU, shows an improvement from 6934.29 ops/sec to 23097.68 ops/sec
    - Another test with WebBroker HTTP compression, on an i7 CPU, shows an improvement from 147 pages/sec to 722 pages/sec
    - Another test with a DMVC web API, on an i9 CPU and Windows 2016, simulating 10000 requests and 100 users with apachebench, shows an improvement from 111 reqs/sec to 6448 reqs/sec
    - Another test, an ISAPI on an i9 CPU and Windows 2016, doing in sequence DB query -> dataset of 1500 lines x 10 rows -> serialize to JSON string -> shrink it with deflate, serves 2000 HTTP reqs/sec, properly loading all the CPU cores

    As far as I have read the TBB code, it seems that the speed is obtained using per-thread TLS (threadvar): when an app thread asks for memory, the allocator provides an already prepared zone (acting as a cache) (I'm not sure of this). If you wish, feel free to test my lib and see if the behavior can be reproduced. As far as I have seen, a fast Move, FillChar and Pos (used in a lot of classes) plus a lock-free allocator (without branch jumps etc.) should be enough to get a Win64 speedup. (Anyway I agree with you, we should do real-case benchmarks.) Thank you.
  4. RDP1974

    Experience/opinions on FastMM5

    Hello @Pierre le Riche, thank you for this great piece of code (FastMM5). I have a suggestion to make it quicker: in my TBB wrapper I replaced FillChar (which is very slow under Delphi64) with a SIMD version (Intel IPP, AVX-512, etc.).

    Further, you are pre-allocating pieces of virtual memory. Perhaps you could build a quick hash- or binary-tree-based cache of blocks already filled with zeroes, maybe maintained by a background thread with minimal priority. Then, when the MM calls Alloc, the FillChar is not needed, because the block is already zero-filled. IMHO in a multithreaded stress test this will boost performance! I don't mind the virtual allocated RAM being bigger; the Windows kernel uses only what is "really used" (hard for me to explain :-)).

    Further, as far as I have read about those new allocators, they pre-allocate RAM in a TLS cache, dispatching to a thread pool (of course with a big RAM allocation (virtual, so who cares?), but it avoids race conditions and global locking).

    (Please forgive me if this info is useless.) Kind regards, Roberto
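The per-thread cache of pre-zeroed blocks suggested in this post can be illustrated with a toy sketch. This is not how FastMM5 or Intel TBB are implemented; the class `ThreadLocalBlockCache` and its parameters are hypothetical names for illustration, and the blocks are re-zeroed on release rather than by a low-priority background thread, which is a simplification of the idea described above.

```python
import threading


class ThreadLocalBlockCache:
    """Toy per-thread cache of pre-zeroed, fixed-size blocks (illustration only)."""

    def __init__(self, block_size=256, prefill=8):
        self.block_size = block_size
        self.prefill = prefill
        # threading.local gives every thread its own attribute namespace,
        # so each thread gets a private free list and needs no lock.
        self._local = threading.local()

    def _free_list(self):
        free = getattr(self._local, "free", None)
        if free is None:
            # First use in this thread: prepare some zeroed blocks up front.
            free = [bytearray(self.block_size) for _ in range(self.prefill)]
            self._local.free = free
        return free

    def alloc(self):
        free = self._free_list()
        if free:
            return free.pop()  # already zeroed, no fill needed here
        return bytearray(self.block_size)  # cache empty: fall back to a fresh block

    def release(self, block):
        # Simplification: zero on release instead of in a background thread.
        block[:] = bytes(self.block_size)
        self._free_list().append(block)


if __name__ == "__main__":
    cache = ThreadLocalBlockCache(block_size=64, prefill=2)
    block = cache.alloc()
    block[0] = 0xFF
    cache.release(block)
    print(cache.alloc()[0])  # the recycled block comes back zeroed
```

The point of the design is that the hot path (`alloc`) touches only thread-private state, so threads never contend on a global lock; the cost is extra reserved memory per thread.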
  5. Work scenarios can differ; a thread pool using the heap will benefit a lot from TBB+IPP. But, memory aside, I wish Embarcadero would update Delphi and its linker to accommodate modern C libraries ($TLS). Kind regards
  6. RDP1974

    Experience/opinions on FastMM5

    See this post FastMM5 still 5x slower than the best C allocators
  7. I did a test of your console bench, using FastMM4, FastMM5, and the optimized Intel Delphi64 TBB (feel free to use it).

    The result on VMware, 8 vCPU, i9 5 GHz, Windows Server 2016: FastMM5 is 4x faster than FastMM4; Intel TBB is 5x faster than FastMM5 and 18x faster than FastMM4.

    This new generation of allocators based on a TLS cache is faster and used in production (I see game engines such as Unreal using TBB by default). Visual Studio C and C++ have an option to optimize using TBB and IPP. Furthermore, these allocators are better suited for memory error discovery and tested for 24/7/365 use.

    In my humble opinion Delphi should license TBB from Intel (it's a free OSS license) and port it to Clang, rewriting the missing $TLS API runtime. The WINAPI headers' dependency on msvcrt should be avoided by using the C++Builder winapi 7.0 repository. This should be used on Win32, Win64, Android, Linux, iOS, OSX. Another cool, free C allocator is Microsoft's mimalloc. (IMHO 64-bit Delphi can have a nice place in cloud and distributed web apps; with a modern allocator it can compete with Rust, Erlang, Go.)

    C:\Exes>FastMM5ConsoleTest_F4
    Parallel For used : 1479456 ticks
    Parallel For used : 1593960 ticks
    Parallel For used : 1492162 ticks
    Parallel For used : 1516575 ticks
    Parallel For used : 1504889 ticks
    Parallel For used : 1616684 ticks
    Parallel For used : 1694674 ticks
    Parallel For used : 1659002 ticks
    Parallel For used : 1509797 ticks
    Parallel For used : 1623232 ticks
    Parallel For used : 1549025 ticks
    Parallel For used : 1768947 ticks
    Parallel For used : 1860454 ticks
    Parallel For used : 1813156 ticks
    Parallel For used : 2014587 ticks
    Parallel For used : 1896651 ticks
    Parallel For used : 1918023 ticks
    Parallel For used : 1869937 ticks
    Parallel For used : 1832852 ticks
    Parallel For used : 1855156 ticks
    Done. Press ENTER to exit

    C:\Exes>FastMM5ConsoleTest_F5 (FastMM_SetOptimizationStrategy(mmosOptimizeForSpeed))
    Parallel For used : 429409 ticks
    Parallel For used : 428977 ticks
    Parallel For used : 439715 ticks
    Parallel For used : 431561 ticks
    Parallel For used : 441682 ticks
    Parallel For used : 448713 ticks
    Parallel For used : 457904 ticks
    Parallel For used : 451374 ticks
    Parallel For used : 420869 ticks
    Parallel For used : 433840 ticks
    Parallel For used : 428119 ticks
    Parallel For used : 426678 ticks
    Parallel For used : 431399 ticks
    Parallel For used : 432025 ticks
    Parallel For used : 429793 ticks
    Parallel For used : 420178 ticks
    Parallel For used : 422983 ticks
    Parallel For used : 433726 ticks
    Parallel For used : 426557 ticks
    Parallel For used : 418806 ticks
    Done. Press ENTER to exit

    C:\Exes>FastMM5ConsoleTest_Intel
    Parallel For used : 85910 ticks
    Parallel For used : 82550 ticks
    Parallel For used : 84917 ticks
    Parallel For used : 81707 ticks
    Parallel For used : 81077 ticks
    Parallel For used : 80789 ticks
    Parallel For used : 81069 ticks
    Parallel For used : 81506 ticks
    Parallel For used : 85098 ticks
    Parallel For used : 84156 ticks
    Parallel For used : 84978 ticks
    Parallel For used : 81699 ticks
    Parallel For used : 84017 ticks
    Parallel For used : 79480 ticks
    Parallel For used : 80324 ticks
    Parallel For used : 80736 ticks
    Parallel For used : 83380 ticks
    Parallel For used : 84887 ticks
    Parallel For used : 78052 ticks
    Parallel For used : 82792 ticks
    Done. Press ENTER to exit
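For readers who want to check the speedup figures against the raw tick counts, here is a small sketch that averages the 20 runs of each executable and prints the ratios. The tick values are copied from the console output in this post; the variable names and the choice of the arithmetic mean as the summary statistic are assumptions of this sketch, not something the post specifies.

```python
from statistics import mean

# Tick counts copied from the three console runs above (lower is faster).
fastmm4 = [
    1479456, 1593960, 1492162, 1516575, 1504889, 1616684, 1694674,
    1659002, 1509797, 1623232, 1549025, 1768947, 1860454, 1813156,
    2014587, 1896651, 1918023, 1869937, 1832852, 1855156,
]
fastmm5 = [
    429409, 428977, 439715, 431561, 441682, 448713, 457904,
    451374, 420869, 433840, 428119, 426678, 431399, 432025,
    429793, 420178, 422983, 433726, 426557, 418806,
]
intel_tbb = [
    85910, 82550, 84917, 81707, 81077, 80789, 81069,
    81506, 85098, 84156, 84978, 81699, 84017, 79480,
    80324, 80736, 83380, 84887, 78052, 82792,
]


def speedup(slow, fast):
    """How many times faster `fast` ran than `slow`, using mean tick counts."""
    return mean(slow) / mean(fast)


if __name__ == "__main__":
    print(f"FastMM5 vs FastMM4:   {speedup(fastmm4, fastmm5):.2f}x")
    print(f"Intel TBB vs FastMM5: {speedup(fastmm5, intel_tbb):.2f}x")
    print(f"Intel TBB vs FastMM4: {speedup(fastmm4, intel_tbb):.2f}x")
```

The printed ratios can be compared with the rounded 4x / 5x / 18x figures quoted in the post; using a different statistic (for example the best run of each) would shift them slightly.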
  8. RDP1974

    borderless with aero shadow

    I know, but I need VCL 🙂
  9. RDP1974

    borderless with aero shadow

    No, the canvas is inside the external frame. The solution in plain API is here: https://stackoverflow.com/questions/22165258/how-to-create-window-without-border-and-with-shadow-like-github-app/44489430#44489430
    - Create the window with the WS_CAPTION style
    - Call the DwmExtendFrameIntoClientArea DWM API, passing a 1-pixel top margin
    - Handle the WM_NCCALCSIZE message: do not forward the call to DefWindowProc while processing this message, just return 0
    (https://stackoverflow.com/questions/43818022/borderless-window-with-drop-shadow)
  10. RDP1974

    borderless with aero shadow

    Thank you, but the problem is the 1px frame in the color of the theme title. I have read a C++ example that I will try in Delphi; it needs a return value from the paint API, where VCL uses a procedure without a return value 😕
  11. Hello, I did a good benchmark to test the Delphi Linux compiler. Summary:
    - server: i9 8-core with Debian 10 and MySQL 8
    - server: i9 8-core with ClearLinux and Apache
    - server: i9 8-core with Windows 2016 and IIS 10
    - client: i7 with apachebench

    The WebBroker app on Windows or ClearLinux connects to the Debian MySQL server using pooled FireDAC connections. I don't want to benchmark different RDBMSs, only the IIS-ISAPI_webbroker and Apache-mod_webbroker layers. Multiple queries are run against a set returning thousands of lines x ten columns; then the dataset is serialized to a REST string using the DMVC serializers.

    The results, using apachebench (ab -n 1000 -c 10 -k -r), are:
    - Default Delphi 64bit IIS 10: 143 reqs/sec
    - Default Delphi 64bit Linux Apache: 554 reqs/sec
    - Delphi 64bit IIS 10 with RDP Intel TBB and Intel IPP libs: 567 reqs/sec
    (You can download those libs from my site.)

    With a small text output instead of the DB, both Windows and Linux sustain 10000 reqs/sec.

    So the Linux compiler performs very well and is very reliable. Under Apache I had errors when raising the number of concurrent users; this needs manual tuning in the Apache config files (IIS is self-tuning). Congratulations Emba!

    ------- WINDOWS IIS ISAPI Default
    Server Software:        Microsoft-IIS/10.0
    Server Hostname:        /
    Server Port:            80
    Document Path:          /
    Document Length:        162716 bytes
    Concurrency Level:      100
    Time taken for tests:   6.974 seconds
    Complete requests:      1000
    Failed requests:        0
    Keep-Alive requests:    1000
    Total transferred:      162952000 bytes
    HTML transferred:       162716000 bytes
    Requests per second:    143.38 [#/sec] (mean)
    Time per request:       697.430 [ms] (mean)
    Time per request:       6.974 [ms] (mean, across all concurrent requests)
    Transfer rate:          22817.02 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.5      0      12
    Processing:    47  657 330.7    511    2396
    Waiting:        8  655 331.0    508    2396
    Total:         47  657 330.7    511    2396

    Percentage of the requests served within a certain time (ms)
      50%    511
      66%    578
      75%    825
      80%    969
      90%   1203
      95%   1291
      98%   1500
      99%   1732
     100%   2396 (longest request)

    ------- WINDOWS IIS ISAPI with Intel TBB IPP
    Server Software:        Microsoft-IIS/10.0
    Server Hostname:        /
    Server Port:            80
    Document Path:          /
    Document Length:        162716 bytes
    Concurrency Level:      100
    Time taken for tests:   1.762 seconds
    Complete requests:      1000
    Failed requests:        0
    Keep-Alive requests:    1000
    Total transferred:      162952000 bytes
    HTML transferred:       162716000 bytes
    Requests per second:    567.56 [#/sec] (mean)
    Time per request:       176.192 [ms] (mean)
    Time per request:       1.762 [ms] (mean, across all concurrent requests)
    Transfer rate:          90317.94 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.8      0       8
    Processing:    23  159  64.5    148     387
    Waiting:        8  157  64.7    145     383
    Total:         23  159  64.3    148     387

    Percentage of the requests served within a certain time (ms)
      50%    148
      66%    153
      75%    160
      80%    160
      90%    266
      95%    312
      98%    355
      99%    363
     100%    387 (longest request)

    ------ APACHE MOD CLEARLINUX
    Server Software:        Apache/2.4.41
    Server Hostname:        /
    Server Port:            80
    Document Path:          /
    Document Length:        162778 bytes
    Concurrency Level:      10
    Time taken for tests:   1.804 seconds
    Complete requests:      1000
    Failed requests:        0
    Keep-Alive requests:    996
    Total transferred:      162992068 bytes
    HTML transferred:       162778000 bytes
    Requests per second:    554.27 [#/sec] (mean)
    Time per request:       18.042 [ms] (mean)
    Time per request:       1.804 [ms] (mean, across all concurrent requests)
    Transfer rate:          88224.32 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.1      0       2
    Processing:    11   18   3.0     17      35
    Waiting:        9   16   2.9     16      33
    Total:         11   18   3.0     17      35

    Percentage of the requests served within a certain time (ms)
      50%     17
      66%     19
      75%     20
      80%     20
      90%     22
      95%     24
      98%     26
      99%     27
     100%     35 (longest request)
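When reading these reports it helps to see how the summary lines relate to each other. The sketch below recomputes "Requests per second" and the two "Time per request" lines from the request count, concurrency level and total time, following the definitions apachebench uses for those fields. The function name `ab_metrics` is a hypothetical helper for this illustration; the inputs are the values printed in the three reports, where the two IIS runs show a concurrency level of 100 and the Apache run shows 10.

```python
def ab_metrics(requests, concurrency, seconds):
    """Recompute ab's summary lines from its basic counters.

    Returns (requests per second,
             mean time per request in ms,
             mean time per request in ms across all concurrent requests).
    """
    rps = requests / seconds
    per_request_ms = concurrency * seconds * 1000 / requests
    per_request_across_ms = seconds * 1000 / requests
    return rps, per_request_ms, per_request_across_ms


if __name__ == "__main__":
    runs = {
        "IIS ISAPI default": (1000, 100, 6.974),
        "IIS ISAPI + TBB/IPP": (1000, 100, 1.762),
        "Apache mod, ClearLinux": (1000, 10, 1.804),
    }
    for name, (n, c, t) in runs.items():
        rps, per_req, across = ab_metrics(n, c, t)
        print(f"{name}: {rps:.2f} req/s, {per_req:.1f} ms, {across:.3f} ms")
```

Small differences from the printed reports are expected, because ab computes these values from the unrounded elapsed time while the report shows it rounded to milliseconds.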
  12. When you have DoS/DDoS protection in Apache, for example through the qos_module, you will see a lot of failed requests in the output of the command. This happens because the protection is indeed working: as mentioned, the ab tool basically floods your server with requests, so a lot of requests from the same IP will automatically be blocked by the Apache module. Indeed I see that the performance of a Delphi Apache module or Indy web application, with FireDAC and data middleware manipulation, is brilliant under Linux. I'm waiting for the compiler optimization before redoing the benchmark.
  13. On the other hand, for the SciMark benchmark the LLVM compiler needs a complete optimization overhaul: https://quality.embarcadero.com/browse/RSP-28006
  14. The ab failed requests against Apache seem related to running it inside a virtual machine; on real hardware the problem doesn't exist.
  15. BTW, I also did a benchmark of an Indy-based custom httpd (SOAP, WebBroker), and the Linux version (ClearLinux) is 3x faster than Windows patched with the Intel Performance Libraries.
  16. https://quality.embarcadero.com/browse/RSP-27918 open
  17. Later I'll publish a blog full of tips on how to create highly scalable Indy, WebBroker, SOAP, FireDAC, Windows and Linux servers, plus a lot of utilities with WebBroker CRUD/REST helpers.
  18. https://news.netcraft.com/archives/category/web-server-survey/ Currently nginx is first, so WebBroker output as an nginx module would be welcome.
  19. ab -n 1000 -c 10 -k -r
    - IIS ISAPI + Intel TBB IPP = 496 r/s
    - ClearLinux Apache = 552 r/s

    But if I raise the number of concurrent users, IIS has zero troubles, while Apache starts to produce some failed requests and then stops with a "not enough space" error. (It needs to be tuned.) I have not tested zlib; Windows users can use the Intel version, Linux users can use the Cloudflare fork. OpenSSL AES-NI should be similar to the Windows Crypto API, but again, I don't want to test different OSes, only Delphi code. Let's go with FastCGI support in WebBroker!
  20. OK, I was unable to tune Apache for massive load testing, even after changing the settings in the conf files. It seems that a limit of 150 concurrent users is set somewhere (or probably a SYN flood protection). Anyway, the performance until the limit is reached is great. We need Embarcadero to add FastCGI to WebBroker, so it can bind to Nginx, Lighttpd and other modern, non-blocking-IO (and largely scalable) httpds. I'm asking for this on Quality Central. Let's see. Kind regards.
  21. Wait, I made a mistake with apachebench. I will correct the results tomorrow.
  22. Hello, I usually deploy a custom Intel TBB memory manager and IPP with Delphi 64-bit server apps, with great satisfaction (see the results at https://github.com/RDP1974/Sea-Delphi-RTL-IIS-Filter). I have found FastMM4 to be slow, and it gives me many fatal errors under multithreaded stress tests (especially the latest AVX-512 fork).

    Many people have asked me for a 32-bit version of the Intel TBB malloc, and many others have asked me for embedded code instead of an external DLL to distribute. The points are:
    - it is impossible to build static objects of the Intel libs without relying on MSVCRT redistribution; furthermore, the Delphi linker cannot manage the objs (see below)
    - I have tried the 32-bit TBB DLL, but it doesn't work, giving fatal errors at runtime
    - I have tried Clang-compiled builds of two other good allocators as static objects, but Delphi32 can't link them due to architecture limits ($ThreadLocalStorage functions not managed)

    The allocators I have tried, after reviewing dozens, are:
    - https://github.com/mjansson/rpmalloc
    - https://github.com/microsoft/mimalloc

    Now the nasty :) question: would anybody like to join me in writing a native Delphi Pascal version of rpmalloc (it seems the easier and cleaner one, IMHO)? I have very little spare time, but I think we can do it. Any opinions? Regards. (Sorry for my poor English.) Roberto Della Pasqua www.dellapasqua.com
  23. Hi, I'm using FPC with ARM Linux. It is very nice, but:
    - the community release is updated so slowly that it seems stopped
    - the source code quality of the RTL and the whole class library is a lot better and more polished in Delphi
    - RTTI at runtime?
    - I did a test and Delphi 64 was twice as fast as FPC in low-level loops, sets, arrays and collections
    - the database layer in Delphi is high quality, IMHO
  24. We really need a new model for the MM. FastMM4 is bloated; it would be cool to have a lock-free allocator using threadvar and/or the TLS API, also with a small preallocated thread pool, so as to compete with other high-performance languages. As far as I have tested, to make Delphi perfect again we need:
    - a TLS lock-free allocator
    - SIMD FillChar, Move, Pos
    I ask Mr.Allen of Grizly :-P Regards.
  25. btw: Iceland should be a nice place where to put tier4-datacenters 😄