dummzeuch 1505 Posted March 22, 2022 (edited) All functions for copying / moving the contents of memory from one buffer to another seem to end up calling System.Move (e.g. Windows.CopyMemory and Windows.MoveMemory). Is that really the fastest way to copy large amounts of data? Edit: This is about the 32 bit compiler for Windows. In my case these are Mono8 bitmaps with a resolution up to 5120x5120 = about 26 MB, or BGR8 bitmaps with HD resolution (1920x1080 * 3 bytes = about 6 MB), or even BayerRG16 or Mono16 bitmaps with up to 5120x5120 * 2 bytes = about 52 MB. These come in a buffer delivered from a camera API which must be returned as fast as possible, so I need to copy these pictures to another buffer for further processing and do it fast. The buffers don't overlap in that case. Move has to take possible overlap into consideration, so there is some small amount of optimization possible. I think it might be possible to make it faster using some MMX/SSE instructions, but I am not familiar with that. I tried to google for this, but Google kept showing results for the C function CopyMem, as usual ignoring the "Delphi" part of my query (even when put in quotes). Edited March 22, 2022 by dummzeuch
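For reference, the buffer sizes quoted above can be checked with a few lines of arithmetic. The sketch below is in C rather than Delphi, and `frame_bytes` and `copy_frame` are hypothetical helper names, not part of any camera API. It also shows the overlap point: C's `memcpy` is allowed to assume the two ranges do not overlap, whereas `memmove` (like Move) has to handle overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed for one frame: width * height * bytes per pixel. */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Copy one frame from the API buffer to an internal buffer.
 * Assumption: the caller guarantees the two ranges do not overlap,
 * which is what allows memcpy instead of memmove. */
static void copy_frame(void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    /* Debug-only check of the non-overlap assumption. */
    assert(d + n <= s || s + n <= d);
    memcpy(dst, src, n);
}
```

With these helpers, `frame_bytes(5120, 5120, 1)` is 26,214,400 bytes (about 26 MB), `frame_bytes(1920, 1080, 3)` is 6,220,800 bytes (about 6 MB) and `frame_bytes(5120, 5120, 2)` is 52,428,800 bytes (about 52 MB).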
David Heffernan 2345 Posted March 22, 2022 I think Arnaud's synopse library has a bunch of more optimised mem copy routines
dummzeuch 1505 Posted March 22, 2022 (edited) 1 hour ago, David Heffernan said: I think Arnaud's synopse library has a bunch of more optimised mem copy routines Found them, thanks. Just in case anybody else wants to look at them:
* for 32 bit: https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.asmx86.inc
* for 64 bit: https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.asmx64.inc
Edited March 22, 2022 by dummzeuch
Mark- 29 Posted March 22, 2022 56 minutes ago, dummzeuch said: Found them, thanks. Are you planning any testing/bench marking to verify they are a better option?
dummzeuch 1505 Posted March 22, 2022 50 minutes ago, Mark- said: Are you planning any testing/bench marking to verify they are a better option? I will definitely do that before using it. So far I have just had a look.
Mark- 29 Posted March 22, 2022 Just now, dummzeuch said: I will definitely do that before using it. So far I have just had a look. Please publish the results, I am most interested.
Arnaud Bouchez 407 Posted March 22, 2022 Don't expect anything magic by using mORMot MoveFast(). Perhaps a few percent more or less. On Win32, which is your target, IIRC the Delphi RTL uses x87 registers. On this platform, MoveFast() uses SSE2 registers for small sizes, so is likely to be slightly faster, and will leverage ERMSB move (i.e. rep movsb) on newer CPUs which support it. To be fair, mORMot asm is more optimized for x86_64 than for i386, because x86_64 is the target platform for the server side, which is the one needing more optimization. But I would just try all FastCode variants: some can be very verbose, but "may" be better. What I would do in your case, is trying to not move any data at all. Isn't it possible that you pre-allocate a set of buffers, then just consume them in a circular way, passing them from the acquisition to the processing methods as pointers, with no copy? The fastest move() is ... when there is no move... 🙂
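The "set of pre-allocated buffers consumed in a circular way" idea could be sketched roughly as follows. This is a minimal single-threaded illustration in C, not Delphi and not mORMot code. All names (`frame_ring`, `ring_acquire` and so on) and the slot count are hypothetical, and a real acquisition/processing pair running on different threads would need synchronization that is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_SLOTS 4 /* hypothetical number of frames in flight */

typedef struct {
    unsigned char *slot[RING_SLOTS]; /* pre-allocated frame buffers */
    size_t head;                     /* next slot the acquisition side fills */
    size_t tail;                     /* next slot the processing side reads */
    size_t count;                    /* number of filled slots */
} frame_ring;

/* Allocate all buffers once, up front. Returns 1 on success, 0 on failure. */
static int ring_init(frame_ring *r, size_t frame_size)
{
    r->head = r->tail = r->count = 0;
    for (size_t i = 0; i < RING_SLOTS; i++) {
        r->slot[i] = malloc(frame_size);
        if (r->slot[i] == NULL) {
            while (i--) free(r->slot[i]);
            return 0;
        }
    }
    return 1;
}

/* Acquisition side: pointer to an empty buffer, or NULL if the ring is full. */
static unsigned char *ring_acquire(frame_ring *r)
{
    return r->count == RING_SLOTS ? NULL : r->slot[r->head];
}

/* Acquisition side: mark the acquired buffer as filled. */
static void ring_commit(frame_ring *r)
{
    assert(r->count < RING_SLOTS);
    r->head = (r->head + 1) % RING_SLOTS;
    r->count++;
}

/* Processing side: pointer to the oldest filled buffer, or NULL if none. */
static unsigned char *ring_peek(frame_ring *r)
{
    return r->count == 0 ? NULL : r->slot[r->tail];
}

/* Processing side: hand the buffer back for reuse. */
static void ring_release(frame_ring *r)
{
    assert(r->count > 0);
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->count--;
}

static void ring_free(frame_ring *r)
{
    for (size_t i = 0; i < RING_SLOTS; i++) free(r->slot[i]);
}
```

Only pointers travel between the two sides; the pixel data stays where it was written. This only removes the copy if the acquisition API can be told to write into buffers the application owns.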
Attila Kovacs 629 Posted March 22, 2022 1 minute ago, Arnaud Bouchez said: just consume them in a circular way that's how it should work actually
dummzeuch 1505 Posted March 22, 2022 4 hours ago, Arnaud Bouchez said: What I would do in your case, is trying to not move any data at all. Isn't it possible that you pre-allocate a set of buffers, then just consume them in a circular way, passing them from the acquisition to the processing methods as pointers, with no copy? The fastest move() is ... when there is no move... 🙂 I have already done that. There is now one move operation left per picture and that's between the buffers used by the API and the buffers used internally by my code. I found no way to avoid this.
Fr0sT.Brutal 900 Posted March 23, 2022 FastMM has optimized Move routines as well. While a generic Move is pretty fast, it can't squeeze out the maximum because it has to be generic. The best performance could be achieved with specially prepared memory blocks: aligned, non-overlapping, not locked. FastMM has some specific Move routines optimized for specific blocks.
Stefan Glienke 2002 Posted March 23, 2022 While it cannot squeeze out the maximum, it is far, far away from performing anywhere near "well" for larger amounts, simply because a) the x86 implementation only moves 8 bytes at once in a loop using FILD and FISTP, which is slower than an SSE loop using instructions that have been available since around 2000, and b) Win64 uses a rather terrible purepascal implementation which also moves at most 8 bytes at a time.
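To illustrate what "an SSE loop" means here, this is a minimal sketch in C with SSE2 intrinsics that moves 16 bytes per iteration instead of 8. It is neither the RTL code nor the mORMot or FastCode implementation; `copy_sse2` is a hypothetical name, and the sketch assumes non-overlapping buffers. Where SSE2 is not available at compile time it simply falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Copy n bytes, 16 at a time where possible.
 * Assumption: dst and src do not overlap. */
static void copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));
#ifdef HAVE_SSE2
    while (n >= 16) {
        /* Unaligned load and store, so no alignment requirement on either side. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    /* Remaining tail bytes (or everything, without SSE2). */
    memcpy(d, s, n);
}
```

Whether such a loop actually beats the library routine on a given CPU is something only a measurement can tell; optimized implementations typically add alignment handling, loop unrolling and size-dependent strategies on top of this.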
Fr0sT.Brutal 900 Posted March 29, 2022 @Stefan Glienke I must have expressed a bit unclearly. By "Move" I meant any implementation not only the RTL one. Share this post Link to post
dummzeuch 1505 Posted May 24, 2022 On 3/22/2022 at 2:44 PM, Arnaud Bouchez said: Don't expect anything magic by using mORMot MoveFast(). Perhaps a few percent more or less. I finally got around to trying it with a 32 bit application and found no measurable improvement. Maybe I am doing something wrong, as there are several IFDEFs in the code whose purpose I don't understand. Or maybe the move operation is simply not that important in the overall performance of the program. I did not do any specific timing for the move operation itself as I think it's pointless if it doesn't improve the performance of the program noticeably.
Anders Melander 1783 Posted May 24, 2022 7 hours ago, dummzeuch said: Or maybe the move operation is simply not that important in the overall performance of the program. Umm... You mean to say you haven't profiled it?
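Short of running a real profiler, the copy can at least be timed in isolation to see how it compares with the time for one work unit. A rough sketch in C follows; `time_copy` is a hypothetical helper, and the standard `clock()` is coarse, so for serious numbers a high-resolution timer and many repetitions would be the better choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Written after each copy so the compiler cannot discard the work. */
static volatile unsigned char sink;

/* Average time in seconds for one copy of n bytes, over reps repetitions. */
static double time_copy(void *dst, const void *src, size_t n, int reps)
{
    assert(reps > 0);

    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, n);
        if (n > 0)
            sink = ((const unsigned char *)dst)[n - 1];
    }
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

If the result is tiny compared with the time spent processing one picture, a faster copy routine cannot make a noticeable difference to the program as a whole.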
dummzeuch 1505 Posted May 25, 2022 7 hours ago, Anders Melander said: You mean to say you haven't profiled it? No, not that part. I have simply replaced some part that I thought might contribute a bit to the overall performance and was easy to replace. I timed the result and found that it didn't make any difference. It took me about 30 minutes, so not much effort was wasted. Through other means (improved algorithm and multithreading) I have already reduced the overall time for one work unit of the program to 1/3 compared to the original code. The program which I used for this test is not the one I mentioned in the original question. That one already went into "production" about a month ago and the performance is "good enough".
Arnaud Bouchez 407 Posted May 25, 2022 That's what I wrote: it is unlikely alternate Move() would make a huge difference. When working on buffers, cache locality is key to performance. Working on smaller buffers, which fit in L1 cache (a few MB usually) could be faster than two big Move / Process. But perhaps your CPU has already good enough cache (bigger than your picture), so it won't help. About the buffers, couldn't you use a ring of them, so that you don't move data?
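The "smaller buffers instead of two big Move / Process passes" idea can be sketched as a loop that copies one chunk and processes it right away, while it is presumably still in cache. This is an illustration in C, not code from the thread. `copy_and_process` and `invert_pixels` are hypothetical names, the per-chunk operation is just a stand-in, and the chunk size is a tuning parameter to be chosen from the actual cache size of the target CPU (L1 data caches are commonly tens of KB per core, with the larger cache levels reaching into the MB range).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-chunk processing callback (hypothetical signature). */
typedef void (*chunk_fn)(unsigned char *chunk, size_t n);

/* Copy src to dst chunk by chunk and process each chunk right after
 * copying it, instead of one big copy followed by one big processing pass. */
static void copy_and_process(unsigned char *dst, const unsigned char *src,
                             size_t total, size_t chunk, chunk_fn process)
{
    assert(chunk > 0);

    for (size_t off = 0; off < total; off += chunk) {
        size_t n = total - off < chunk ? total - off : chunk;
        memcpy(dst + off, src + off, n);
        process(dst + off, n);
    }
}

/* Stand-in processing step: invert 8-bit pixels in place. */
static void invert_pixels(unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)(255 - p[i]);
}
```

This only works when the processing step can operate on a part of the picture at a time; algorithms that need the whole frame or large neighbourhoods would need overlapping chunks or a different split.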
dummzeuch 1505 Posted May 25, 2022 5 hours ago, Arnaud Bouchez said: That's what I wrote: it is unlikely alternate Move() would make a huge difference. I understood that. I tried it anyway because it was easy to do and doesn't hurt. 5 hours ago, Arnaud Bouchez said: Working on smaller buffers, which fit in L1 cache (a few MB usually) could be faster than two big Move / Process. But perhaps your CPU has already good enough cache (bigger than your picture), so it won't help. The pictures are huge: 5120 by 2000 with Mono8 pixels, so about 10 MB each. Sometimes they get even bigger: up to 5120x5120 with RGB8 pixels. But that's very rare so far, so I don't care about performance for these at the moment. But I'm sure, since we have got this camera now, it will be used in more projects. 5 hours ago, Arnaud Bouchez said: About the buffers, couldn't you use a ring of them, so that you don't move data? I have already done that, as far as possible: On 3/22/2022 at 6:56 PM, dummzeuch said: There is now one move operation left per picture and that's between the buffers used by the API and the buffers used internally by my code. I found no way for avoiding this. As said before: This time it's a different program, but basically the same applies: I try to avoid moving this data if at all possible. Hm, maybe using smaller buffers for only parts of the picture, as you suggested above, could help here. I'll have to think about that one.
Arnaud Bouchez 407 Posted May 25, 2022 L1 cache access time makes a huge difference. http://blog.skoups.com/?p=592 You could retrieve the L1 cache size, then work on buffers of about 90% of this size (always keep some space for stack, tables and such). Then, if you work in the API buffer directly, a non-temporal move to the result buffer may help a little. During your process, if you use lookup tables, ensure they don't pollute the cache. But profiling is the key for sure. Guesses are most of the time wrong...
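Both suggestions can be sketched in C with SSE2 intrinsics. `working_chunk` derives a chunk size of roughly 90% of a given cache size, rounded down to a multiple of 64 bytes (a common cache line size, assumed here). `copy_nontemporal` writes the result with streaming stores (`_mm_stream_si128`), which bypass the cache on the destination side. Both function names are hypothetical, the cache size has to come from the OS or CPUID (not shown), and the streaming path is only taken when the destination is 16-byte aligned, otherwise the sketch falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* About 90% of the cache size, rounded down to a multiple of 64 bytes. */
static size_t working_chunk(size_t cache_bytes)
{
    size_t c = cache_bytes / 10 * 9;
    return c - c % 64;
}

/* Copy n bytes using non-temporal (streaming) stores where possible.
 * Assumption: dst and src do not overlap. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));
#ifdef HAVE_SSE2
    if (((uintptr_t)d & 15) == 0) { /* _mm_stream_si128 needs an aligned destination */
        while (n >= 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)s);
            _mm_stream_si128((__m128i *)d, v);
            s += 16;
            d += 16;
            n -= 16;
        }
        _mm_sfence(); /* make the streaming stores visible before returning */
    }
#endif
    /* Tail bytes, unaligned destination, or no SSE2 at all. */
    memcpy(d, s, n);
}
```

For a 32 KB cache, `working_chunk(32768)` gives 29,440 bytes. Streaming stores tend to pay off only when the destination is not read again soon; if the next processing step reads the result buffer immediately, regular stores may be the better choice, so this is another case for measuring.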