RDP1974

Members
  • Content Count

    46
Community Reputation

13 Good

  1. RDP1974

    Experience/opinions on FastMM5

    Your talent is fantastic, and so is your code, but let me say a word about "TBB unusable": it is the default optimization option across the Visual Studio C compiler and in the main game engines... TBB and IPP are also used in Oracle Database, Adobe, Autodesk...
  2. RDP1974

    Experience/opinions on FastMM5

    I'm studying and implementing Elixir/Phoenix/Erlang on FreeBSD/Linux. It's simply incredible! From HTTP MVC with routes/controllers/ORM to websocket channels, with linear scalability up to millions of sockets on a single server, microsecond latency, and fault tolerance (it's a VM with a user-level scheduler and signaling)... all within a handful of lines. A benchmark shows 100,000 reqs/sec from an MVC/PostgreSQL ORM JSON render on a single server. You can also change the code inside the VM while it is running, so you can update pieces of the running app without shutting it down. https://www.phoenixframework.org/ https://elixir-lang.org/
  3. RDP1974

    Experience/opinions on FastMM5

    "They" should move if they want to jump on the bandwagon of parallel computing (IMHO, within 5 years dozens or hundreds of CPU cores will be the de facto standard). It's hard to beat Elixir, Erlang, Go, or those functional languages that offer built-in horizontal and vertical scalability (userland scheduler with lightweight fibers, kernel threads, multiprocessing over hardware CPU cores, machine clustering... without modifying a line of code) 🙂
  4. RDP1974

    Experience/opinions on FastMM5

    Hi, have a look at https://github.com/RDP1974/Delphi64. There I have patched "key" RTL functions (move, fillchar, pos) with the SIMD-enhanced versions from the Intel libraries: https://github.com/RDP1974/Delphi64/blob/master/RDPSimd64.pas

    So I did a TBB allocator wrapper, a SIMD RTL patch, and an Intel version of Zlib for HTTP deflate (5x faster than gzip). The results are outstanding, tested by "famous" company coders:
      • A test with Indy, the built-in Delphi TCP library, on an i7 CPU, shows an improvement from 6934.29 ops/sec to 23097.68 ops/sec.
      • Another test with WebBroker HTTP compression, on an i7 CPU, shows an improvement from 147 pages/sec to 722 pages/sec.
      • Another test with a DMVC web API, on an i9 CPU and Windows 2016, simulating 10000 requests and 100 users with ApacheBench, shows an improvement from 111 reqs/sec to 6448 reqs/sec.
      • Another test, an ISAPI on an i9 CPU and Windows 2016, doing in sequence DB query -> dataset of 1500 lines x 10 rows -> serialize to JSON string -> shrink it with deflate, serves 2000 HTTP reqs/sec, correctly loading all the CPU cores.

    As far as I have read the TBB code, the speed seems to come from per-thread TLS (threadvar): when an app thread asks for memory, the allocator hands out an already prepared zone that acts as a cache (I'm not sure of this). If you wish, feel free to test my lib and see whether the behavior can be reproduced. As far as I have seen, a fast move, fillchar, and pos (used in a lot of classes) plus a lock-free allocator (without branch jumps etc.) should be enough to get a Win64 speedup. (Anyway, I agree with you, we should do real-case benchmarks.) Thank you.
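The per-thread cache idea described in this post can be illustrated with a minimal sketch. This is not how TBB is implemented; it only shows the general pattern under simple assumptions: a fixed block size, a per-thread free list held in thread-local storage, and a shared lock that is taken only when a thread's cache is empty. The names `ThreadCachedAllocator`, `BLOCK_SIZE`, `refill`, and `global_hits` are hypothetical and exist only for this illustration.

```python
import threading

BLOCK_SIZE = 64  # hypothetical fixed block size, chosen only for this sketch


class ThreadCachedAllocator:
    """Toy allocator: each thread keeps its own free list (a TLS cache)."""

    def __init__(self, refill=8):
        self._local = threading.local()       # per-thread storage
        self._refill = refill                 # blocks fetched per slow-path trip
        self._global_lock = threading.Lock()  # stands in for the shared heap lock
        self.global_hits = 0                  # how often the slow path was taken

    def _cache(self):
        # Lazily create this thread's private free list.
        cache = getattr(self._local, "free", None)
        if cache is None:
            cache = self._local.free = []
        return cache

    def alloc(self):
        cache = self._cache()
        if not cache:
            # Slow path: only here do threads contend on the shared lock.
            with self._global_lock:
                self.global_hits += 1
                cache.extend(bytearray(BLOCK_SIZE) for _ in range(self._refill))
        # Fast path: pop from the thread's own list, no locking needed.
        return cache.pop()

    def free(self, block):
        # Return the block to the freeing thread's cache.
        self._cache().append(block)
```

With `refill=4`, a thread that allocates four blocks takes the lock once; freeing and reallocating them afterwards never touches the lock again. A real allocator also has to handle blocks freed by a different thread than the one that allocated them, which this sketch ignores.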
  5. RDP1974

    Experience/opinions on FastMM5

    Hello @Pierre le Riche, thank you for this great piece of code (FastMM5). I have a suggestion to make it quicker: in my TBB wrapper I replaced FillChar (which is very slow under 64-bit Delphi) with a SIMD version (Intel IPP, AVX-512 etc...).

    Further, you are pre-allocating pieces of virtual memory. Perhaps you could add a quick hash or binary-tree based cache of blocks already filled with zeroes, maybe maintained by a background thread with minimal priority. Then, when the MM calls Alloc, the fillchar is not needed because the block is already zeroed. IMHO this will boost performance in multithreaded stress tests! I don't mind the virtually allocated RAM being bigger; the Windows kernel commits only what is "really used" (hard for me to explain :-))

    Further, as far as I have read about those new allocators, they pre-allocate RAM in a TLS cache and dispatch it to a thread pool (with a big RAM allocation of course, but it's virtual, so who cares?) in order to avoid race conditions and global locking. (I'm sorry if this info is useless.) Kind regards, Roberto
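A minimal sketch of the pre-zeroed block cache suggested in this post, under stated assumptions: a single block size, a bounded queue as the cache instead of a hash or tree, and a daemon thread as the background filler. Python cannot set thread priority, so "minimal priority" is not modelled. `ZeroedBlockPool`, `alloc_zeroed`, and `depth` are hypothetical names, not part of FastMM5 or TBB.

```python
import queue
import threading


class ZeroedBlockPool:
    """Toy pool: a background thread keeps a queue of zero-filled blocks ready."""

    def __init__(self, block_size, depth=16):
        self.block_size = block_size
        self._ready = queue.Queue(maxsize=depth)  # bounded cache of zeroed blocks
        # Daemon thread, so it never keeps the process alive on exit.
        threading.Thread(target=self._refill, daemon=True).start()

    def _refill(self):
        while True:
            # bytearray(n) is zero-filled; put() blocks while the cache is full,
            # so the filler only works when the cache has room.
            self._ready.put(bytearray(self.block_size))

    def alloc_zeroed(self):
        try:
            # Fast path: the zeroing was already done off the caller's thread.
            return self._ready.get_nowait()
        except queue.Empty:
            # Fallback: cache drained, zero a block on the caller's thread.
            return bytearray(self.block_size)
```

The caller always receives a zeroed block either way; the cache only moves the cost of zeroing off the allocating thread when it can. Recycling freed blocks would require zeroing them again before they re-enter the queue, which is left out here.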
  6. Work scenarios can differ; a thread pool using the heap will benefit a lot from TBB+IPP. But, memory apart, I wish Embarcadero would update Delphi and the linker to accommodate the modern C libraries ($TLS). Kind regards
  7. RDP1974

    Experience/opinions on FastMM5

    See this post FastMM5 still 5x slower than the best C allocators
  8. I ran your console benchmark with FastMM4, FastMM5, and my Intel-optimized Delphi64 TBB wrapper (feel free to use it).

    The result on VMware, 8 vCPUs, i9 at 5 GHz, Windows Server 2016: FastMM5 is 4x faster than FastMM4; Intel TBB is 5x faster than FastMM5 and 18x faster than FastMM4.

    This new generation of allocators based on a TLS cache is faster and is used in production (I see game engines such as Unreal using TBB by default). Visual Studio C and C++ have an option to optimize using TBB and IPP. They are also better suited for discovering memory errors and are tested for 24/7/365 use.

    In my humble opinion Delphi should license TBB from Intel (it has a free OSS license) and port it to CLANG, rewriting the missing $TLS API runtime. The msvcrt dependency on the WINAPI headers should be avoided by using the C++Builder winapi 7.0 repository. This should be used on Win32, Win64, Android, Linux, iOS, OSX. Another cool free C allocator is mimalloc from Microsoft. (IMHO 64-bit Delphi can have a nice place in cloud and distributed web apps; with a modern allocator it can compete with Rust, Erlang, Go.)

    C:\Exes>FastMM5ConsoleTest_F4
    Parallel For used : 1479456 ticks
    Parallel For used : 1593960 ticks
    Parallel For used : 1492162 ticks
    Parallel For used : 1516575 ticks
    Parallel For used : 1504889 ticks
    Parallel For used : 1616684 ticks
    Parallel For used : 1694674 ticks
    Parallel For used : 1659002 ticks
    Parallel For used : 1509797 ticks
    Parallel For used : 1623232 ticks
    Parallel For used : 1549025 ticks
    Parallel For used : 1768947 ticks
    Parallel For used : 1860454 ticks
    Parallel For used : 1813156 ticks
    Parallel For used : 2014587 ticks
    Parallel For used : 1896651 ticks
    Parallel For used : 1918023 ticks
    Parallel For used : 1869937 ticks
    Parallel For used : 1832852 ticks
    Parallel For used : 1855156 ticks
    Done.
    Press ENTER to exit

    C:\Exes>FastMM5ConsoleTest_F5 (FastMM_SetOptimizationStrategy(mmosOptimizeForSpeed))
    Parallel For used : 429409 ticks
    Parallel For used : 428977 ticks
    Parallel For used : 439715 ticks
    Parallel For used : 431561 ticks
    Parallel For used : 441682 ticks
    Parallel For used : 448713 ticks
    Parallel For used : 457904 ticks
    Parallel For used : 451374 ticks
    Parallel For used : 420869 ticks
    Parallel For used : 433840 ticks
    Parallel For used : 428119 ticks
    Parallel For used : 426678 ticks
    Parallel For used : 431399 ticks
    Parallel For used : 432025 ticks
    Parallel For used : 429793 ticks
    Parallel For used : 420178 ticks
    Parallel For used : 422983 ticks
    Parallel For used : 433726 ticks
    Parallel For used : 426557 ticks
    Parallel For used : 418806 ticks
    Done.
    Press ENTER to exit

    C:\Exes>FastMM5ConsoleTest_Intel
    Parallel For used : 85910 ticks
    Parallel For used : 82550 ticks
    Parallel For used : 84917 ticks
    Parallel For used : 81707 ticks
    Parallel For used : 81077 ticks
    Parallel For used : 80789 ticks
    Parallel For used : 81069 ticks
    Parallel For used : 81506 ticks
    Parallel For used : 85098 ticks
    Parallel For used : 84156 ticks
    Parallel For used : 84978 ticks
    Parallel For used : 81699 ticks
    Parallel For used : 84017 ticks
    Parallel For used : 79480 ticks
    Parallel For used : 80324 ticks
    Parallel For used : 80736 ticks
    Parallel For used : 83380 ticks
    Parallel For used : 84887 ticks
    Parallel For used : 78052 ticks
    Parallel For used : 82792 ticks
    Done.
    Press ENTER to exit
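The speedup figures quoted in this post can be checked against the tick counts it lists. The sketch below averages the 20 runs of each executable and divides the means. The variable names are illustrative and the numbers are copied from the output in the post. On these figures the ratios come out at about 4x (FastMM5 vs FastMM4), a bit over 5x (Intel TBB vs FastMM5), and roughly 20x (Intel TBB vs FastMM4), so the quoted 18x is, if anything, on the conservative side.

```python
from statistics import mean

# Tick counts copied from the three benchmark runs in the post (lower is faster).
fastmm4 = [1479456, 1593960, 1492162, 1516575, 1504889, 1616684, 1694674,
           1659002, 1509797, 1623232, 1549025, 1768947, 1860454, 1813156,
           2014587, 1896651, 1918023, 1869937, 1832852, 1855156]
fastmm5 = [429409, 428977, 439715, 431561, 441682, 448713, 457904, 451374,
           420869, 433840, 428119, 426678, 431399, 432025, 429793, 420178,
           422983, 433726, 426557, 418806]
intel_tbb = [85910, 82550, 84917, 81707, 81077, 80789, 81069, 81506, 85098,
             84156, 84978, 81699, 84017, 79480, 80324, 80736, 83380, 84887,
             78052, 82792]


def speedup(slow, fast):
    """Ratio of mean tick counts: how many times faster `fast` is than `slow`."""
    return mean(slow) / mean(fast)


mm5_vs_mm4 = speedup(fastmm4, fastmm5)
tbb_vs_mm5 = speedup(fastmm5, intel_tbb)
tbb_vs_mm4 = speedup(fastmm4, intel_tbb)

print(f"FastMM5 vs FastMM4:   {mm5_vs_mm4:.1f}x")
print(f"Intel TBB vs FastMM5: {tbb_vs_mm5:.1f}x")
print(f"Intel TBB vs FastMM4: {tbb_vs_mm4:.1f}x")
```

Means are used here for simplicity; the FastMM4 run drifts upward over the 20 iterations, so a median or a per-iteration comparison would give slightly different ratios.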
  9. RDP1974

    borderless with aero shadow

    I know, but I need VCL 🙂
  10. RDP1974

    borderless with aero shadow

    No, the canvas is inside the external frame. The solution in plain API is here: https://stackoverflow.com/questions/22165258/how-to-create-window-without-border-and-with-shadow-like-github-app/44489430#44489430
      • Create the window with the WS_CAPTION style.
      • Call the DwmExtendFrameIntoClientArea DWM API, passing a 1-pixel top margin.
      • Handle the WM_NCCALCSIZE message: do not forward the call to DefWindowProc while processing this message, just return 0.
    (https://stackoverflow.com/questions/43818022/borderless-window-with-drop-shadow)
  11. RDP1974

    borderless with aero shadow

    Thank you, but the problem is the 1px frame in the color of the theme title. I have read a C++ example that I will try in Delphi; it needs a return value from the paint API, where the VCL uses a procedure without a return value 😕
  12. When you have DoS/DDoS protection in Apache, for example through the qos_module, you will see a lot of failed requests in the output of the command. This happens because the protection is indeed working: as mentioned, the ab tool basically floods your server with requests, so a lot of requests from the same IP will automatically be blocked by the Apache module. Indeed, I see that the performance of a Delphi Apache module or an Indy web application, with FireDAC and data middleware manipulation, is brilliant under Linux. I am waiting for the compiler optimization to redo the benchmark.
  13. On the other hand, the SciMark benchmark shows that the LLVM compiler needs a complete optimization overhaul: https://quality.embarcadero.com/browse/RSP-28006
  14. The ab failed requests over Apache seem related to running it inside a virtual machine; on real hardware the problem doesn't exist.
  15. BTW, I also benchmarked an Indy-based custom httpd (SOAP, WebBroker), and the Linux version (Clear Linux) is 3x faster than Windows patched with the Intel Performance Libraries.