FPiette 391 Posted January 13, 2021

29 minutes ago, Tommi Prami said: Stream of bytes in order of RGB. What happens after this is another story altogether.

As you like. There are other existing solutions for converting RGBA to RGB, depending on which file format you want to create. An RGB stream copied to a file can't do much on its own; the reader cannot even know the width and height.
Tommi Prami 140 Posted January 13, 2021

12 minutes ago, FPiette said: As you like. There are other existing solutions for converting RGBA to RGB, depending on which file format you want to create. An RGB stream copied to a file can't do much on its own; the reader cannot even know the width and height.

Like I said earlier, this is part of a 3rd-party library/component and can't be changed.

-Tee-
FPiette 391 Posted January 13, 2021

18 minutes ago, Tommi Prami said: Like I said earlier, this is part of a 3rd-party library/component and can't be changed.

That makes no sense: you are asking to change the way the conversion from RGBA to RGB is done, so you can change that part. Sorry, but I won't participate any more in this useless conversation about an XY problem.
Tommi Prami 140 Posted January 13, 2021

39 minutes ago, FPiette said: That makes no sense: you are asking to change the way the conversion from RGBA to RGB is done, so you can change that part. Sorry, but I won't participate any more in this useless conversation about an XY problem.

Hmm, I think something is lost in translation here. The process is, and will stay, 100% the same; I am not asking for help to change the process, just to optimize what is already there. As far as I know it can't even be changed. I am just looking for a way to optimize it. I have received valuable info, so I thank everyone for that.

The method's input is a TBitmap with a 32-bit pixel format, and its pixels have to be saved into a stream with 3 bytes per pixel, in RGB order. Sorry if I am not being clear enough.
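For reference, a minimal sketch of the kind of conversion being described here (illustrative only, not the actual 3rd-party code; the routine and variable names are made up): walk a 32-bit TBitmap row by row and write 3 bytes per pixel, in R, G, B order, to a stream.

uses
  System.Classes, Vcl.Graphics, Winapi.Windows;

procedure SaveBitmapAsRGB(ABitmap: TBitmap; AStream: TStream);
var
  X, Y: Integer;
  Src: PRGBQuad;         // 32-bit pixels are stored as B, G, R, A bytes in memory
  Line: array of Byte;   // one output row, 3 bytes per pixel
begin
  Assert(ABitmap.PixelFormat = pf32bit);
  SetLength(Line, ABitmap.Width * 3);
  for Y := 0 to ABitmap.Height - 1 do
  begin
    Src := ABitmap.ScanLine[Y];  // one ScanLine call per row
    for X := 0 to ABitmap.Width - 1 do
    begin
      Line[X * 3]     := Src^.rgbRed;
      Line[X * 3 + 1] := Src^.rgbGreen;
      Line[X * 3 + 2] := Src^.rgbBlue;
      Inc(Src);                  // next 4-byte source pixel
    end;
    AStream.WriteBuffer(Line[0], Length(Line));
  end;
end;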
Tommi Prami 140 Posted January 21, 2021

Intermediate info: It is way faster than I thought. I got bitten (very badly) by FastMM4 full debug mode; it takes ages to do the memory allocation that ScanLine does. So there is room for optimizing the extra ScanLine calls out, and maybe some other stuff as well, but not as much user-perceivable room for optimization as it first seemed. It can still be made faster, though.

I have some kind of test project cooking, trying to check how far I can push it. I'll publish it later if I get something out of it.
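For anyone hitting the same problem: to the best of my knowledge the slow mode mentioned above is controlled by the FullDebugMode define in FastMM4Options.inc, which the file ships disabled with a leading dot. A sketch of the relevant switch:

// FastMM4Options.inc (sketch of the relevant switch only, not the full file).
// Removing the leading dot enables FullDebugMode, which validates and fills
// every allocation and free, and therefore slows down anything that allocates
// heavily, such as code that triggers an allocation per ScanLine call.
{.$define FullDebugMode}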
Tommi Prami 140 Posted January 21, 2021

Thanks for all the info, suggestions, and especially the skepticism so far 🙂

-Tee-
Tommi Prami 140 Posted January 25, 2021

That is not the complete story, though. The production app takes about 128641ms, while the isolated test gives:

Running test: "Reference" (RELEASE build)
Run count: 5
Min: 113,458ms, Average: 118,925ms, Max: 127,638ms

Running test: "Reference" (DEBUG build)
Run count: 5
Min: 13489,222ms, Average: 13957,058ms, Max: 14276,866ms

The debug build is slow, but not even close to as slow as what I experienced when that piece of code runs as part of the whole program rather than as an isolated small piece of code in a separate app, because there it took minutes to complete (yes, I measured it). What on earth could cause that? Memory alignment? Is the bitmap somehow laid out differently in memory (bad alignment etc.)? One obvious candidate is compiler options; I have to check those later.

-Tee-
Fr0sT.Brutal 901 Posted January 25, 2021

2 hours ago, Tommi Prami said: What on earth could cause that

Range/overflow checks, for example.
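For illustration, one way to rule these checks out for just the hot routine, without touching the project options, is to switch them off locally with compiler directives and restore the previous state afterwards. A minimal sketch (the procedure name is made up):

{$IFOPT R+}{$DEFINE RANGE_WAS_ON}{$R-}{$ENDIF}
{$IFOPT Q+}{$DEFINE OVERFLOW_WAS_ON}{$Q-}{$ENDIF}

procedure ConvertPixelsHotLoop;
begin
  // the pixel-copying loop goes here, compiled without range/overflow checks
  // regardless of the project-wide settings
end;

{$IFDEF RANGE_WAS_ON}{$R+}{$ENDIF}
{$IFDEF OVERFLOW_WAS_ON}{$Q+}{$ENDIF}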
Tommi Prami 140 Posted January 25, 2021

3 hours ago, Fr0sT.Brutal said: Range/overflow checks, for example.

I'll check those later. Thanks.

-Tee-
Fr0sT.Brutal 901 Posted January 25, 2021

Also: if you use FastMM4, in some configurations it can enable many debug actions that help hunt bugs but slow things down. When you have massive numbers of memory allocations/disposals, the difference can be significant.
Tommi Prami 140 Posted January 27, 2021

On 1/25/2021 at 1:42 PM, Fr0sT.Brutal said: Also: if you use FastMM4, in some configurations it can enable many debug actions that help hunt bugs but slow things down. When you have massive numbers of memory allocations/disposals, the difference can be significant.

I use the same FastMM in the test app, with the same settings as in the production app. And overflow and range checking did not produce as slow a situation as the production app case either. This is with overflow and range checking on:

Running test: "Reference" (DEBUG build)
Run count: 5
Min: 12707,820ms, Average: 13247,058ms, Max: 14194,733ms

The production case also uses a somewhat smaller bitmap than the test app, so this is getting weirder, and the production app case is way, way slower than this test app. The only thing I can think of (currently) is that the production bitmap's scanline addresses go bottom-up, while in the test app they go top-down. I just have to figure out how to get that situation tested.
Tommi Prami 140 Posted January 27, 2021

I also turned inlining control off in the test app; it is still about as fast as before:

Running test: "Reference" (DEBUG build)
Run count: 5
Min: 12459,948ms, Average: 13538,720ms, Max: 14904,923ms
Anders Melander 1955 Posted January 27, 2021

45 minutes ago, Tommi Prami said: I also turned inlining control off in the test app; it is still about as fast as before:

Yes, of course that didn't do anything. Why would you expect it to? I think you need to take a step back and think about what you are doing instead of just trying random stuff. Take control of the problem.

The numbers you have posted show that you are either measuring time in microseconds or using the thousand separator incorrectly. If you are measuring microseconds then stop that; numbers that small are not relevant here.

One of the first things you should have done is locate the bottleneck by profiling your code. If you don't have a profiler, or don't understand how to use one, you can emulate a sampling profiler by running the application a few times and pausing it in the debugger. Unless the slowdown is evenly distributed, there's a statistical likelihood that the call stack will show you where the application is spending the majority of its time.
Fr0sT.Brutal 901 Posted January 27, 2021

3 hours ago, Anders Melander said: One of the first things you should have done is locate the bottleneck by profiling your code. If you don't have a profiler, or don't understand how to use one, you can emulate a sampling profiler by running the application a few times and pausing it in the debugger. Unless the slowdown is evenly distributed, there's a statistical likelihood that the call stack will show you where the application is spending the majority of its time.

Another option is a kind of binary search, where one comments out most of the code until things are fast and then uncomments pieces to track down what exactly causes the slowdown. This method is especially useful when there is some piece that runs for a short time but executes very frequently. Profilers and timers won't help much there.
Anders Melander 1955 Posted January 27, 2021

3 hours ago, Fr0sT.Brutal said: This method is especially useful when there is some piece that runs for a short time but executes very frequently. Profilers and timers won't help much there.

Why wouldn't a profiler, real or not, help there? What do you think a profiler does?
Fr0sT.Brutal 901 Posted January 27, 2021

1 hour ago, Anders Melander said: Why wouldn't a profiler, real or not, help there? What do you think a profiler does?

Because timers have a resolution. If lineA executes in 1 ms and lineB in 1.5 ms, they will likely miss the difference or, even worse, add noise of their own and thus produce irrelevant results. But if these lines are repeated in a loop 10E9 times, the overall difference between them will be noticeable.
Anders Melander 1955 Posted January 27, 2021

14 minutes ago, Fr0sT.Brutal said: Because timers have a resolution. If lineA executes in 1 ms and lineB in 1.5 ms, they will likely miss the difference or, even worse, add noise of their own and thus produce irrelevant results.

You're the only one talking about timers, but even if you used a grandfather clock the method would work. If you break the application at random you have a 50% higher likelihood of hitting lineB than lineA, and each time you do this the likelihood increases. This is exactly how a sampling profiler works. Are you saying sampling profilers are a hoax?
Tommi Prami 140 Posted January 28, 2021

22 hours ago, Anders Melander said: Yes, of course that didn't do anything. Why would you expect it to?

Because RTL and VCL code is also built with inlining, I think, so it seemed like a valid thing to test.

Quote: The numbers you have posted show that you are either measuring time in microseconds or using the thousand separator incorrectly. If you are measuring microseconds then stop that; numbers that small are not relevant here.

No, milliseconds: LResultArray := LStopWatch.Elapsed.TotalMilliseconds; The number format is standard Finnish, which can be misleading to some: the decimal separator is a comma and the thousand separator is a space.

Quote: One of the first things you should have done is locate the bottleneck by profiling your code.

Will do that when I've got time for it, and I'll also publish my test code so anyone can check it if they want to. The big problem here is that I still can't reproduce in the test app the level of slowdown seen in the production app, and it is very hard to profile without an instrumenting profiler (which I don't have) without losing everything to the noise of all the other processing in the app. What I do know is that the main problem is ScanLine, but I don't know why it is sometimes very fast and yet takes ages in our production code (versus in the test app); it is at least a 10x difference. I've tested the production code with just one ScanLine call, incrementing the pointer to the next line, and it is at least 10x faster, as I stated before. I just have to reproduce that in the test app. This, sadly, is going to be a kind of marathon; I can't spend too much time on it currently. But I'll get there 🙂

-tee-
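For reference, a minimal sketch of the single-ScanLine idea described above (illustrative only, not the production code; it assumes the rows of the underlying DIB section lie a constant stride apart): derive the row stride from two ScanLine calls, then walk the rows with plain pointer arithmetic. The stride comes out negative for the usual bottom-up layout and positive for a top-down one, so the same loop covers both cases.

uses
  Vcl.Graphics, Winapi.Windows;

procedure WalkRowsWithOneStride(ABitmap: TBitmap);
var
  Row: PByte;
  Stride: NativeInt;
  X, Y: Integer;
  Pixel: PRGBQuad;
begin
  Assert(ABitmap.PixelFormat = pf32bit);
  if ABitmap.Height > 1 then
    Stride := NativeInt(ABitmap.ScanLine[1]) - NativeInt(ABitmap.ScanLine[0])
  else
    Stride := 0;
  Row := ABitmap.ScanLine[0];   // from here on, no more ScanLine calls
  for Y := 0 to ABitmap.Height - 1 do
  begin
    Pixel := PRGBQuad(Row);
    for X := 0 to ABitmap.Width - 1 do
    begin
      // ... convert Pixel^ here, e.g. write R, G, B to the output stream ...
      Inc(Pixel);
    end;
    Inc(Row, Stride);           // next row via pointer arithmetic
  end;
end;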
Tommi Prami 140 Posted February 2, 2021

Small test app: https://github.com/mWaltari/ImageStreamTest

This DOES NOT reproduce the extreme slowdown I saw in our app; it is way faster. In the production app, reducing the ScanLine calls made it at least 10x faster. But if you just want to check this out, there you are...

-Tee-
Anders Melander 1955 Posted February 2, 2021

1 hour ago, Tommi Prami said: Small test app

Your TRGB32 type is identical to TRGBQuad. Why have you declared the TRGB32Array array packed? It serves no purpose. Why do you use TWICImage but also explicitly reference the Jpeg unit?

One thing that I don't think you've mentioned, or at least I can't remember, is whether you actually need to use TBitmap at all. Where does your bitmap data originate? A disk file, a stream, something else?
Tommi Prami 140 Posted February 3, 2021

15 hours ago, Anders Melander said: Your TRGB32 type is identical to TRGBQuad. Why have you declared the TRGB32Array array packed? It serves no purpose. Why do you use TWICImage but also explicitly reference the Jpeg unit? One thing that I don't think you've mentioned, or at least I can't remember, is whether you actually need to use TBitmap at all. Where does your bitmap data originate? A disk file, a stream, something else?

1. Not my code originally, so I preserved the original code; but noted, I did not know that type existed.
2. Can be changed then; I did not realize that.
3. The Jpeg unit can probably be removed then; I have to look at that.

I thought the caption "32bit RGBA TBitmap to RGB byte stream" made it pretty obvious that the data is in a TBitmap originally. In this case the bitmaps are loaded from several sources, from the DB, and some are created in code etc., but not from a file as such.

I made a couple of changes based on your observations.
Anders Melander 1955 Posted February 3, 2021

38 minutes ago, Tommi Prami said: I thought the caption "32bit RGBA TBitmap to RGB byte stream" made it pretty obvious that the data is in a TBitmap originally. In this case the bitmaps are loaded from several sources, from the DB, and some are created in code etc., but not from a file as such.

It's obvious that you, at some point in your pipeline, have the bitmaps in a TBitmap, but it's not obvious whether this is actually necessary. If you don't have a requirement that the bitmaps be represented as a TBitmap, then you can avoid the whole GDI overhead by simply loading each bitmap into a memory stream and accessing the pixel data directly in memory. BMP is a fairly simple and well-defined format, so it's not that hard. You could also use something like Graphics32 to do this for you, since this seems to be about 32-bit bitmaps exclusively.
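A rough sketch of that idea (illustrative only; the routine name is made up, and error handling, palettes and compressed formats are ignored): load the BMP into a memory stream and locate the 32-bit pixel data from the standard headers, without creating a TBitmap or touching GDI.

uses
  System.Classes, Winapi.Windows;

procedure LocateBmpPixelsInMemory(AStream: TMemoryStream);
var
  FileHeader: PBitmapFileHeader;
  InfoHeader: PBitmapInfoHeader;
  Pixels: PByte;
  RowBytes: Integer;
begin
  FileHeader := AStream.Memory;
  InfoHeader := PBitmapInfoHeader(NativeUInt(AStream.Memory) + SizeOf(TBitmapFileHeader));
  Assert(FileHeader^.bfType = $4D42);     // 'BM' signature
  Assert(InfoHeader^.biBitCount = 32);    // this thread is about 32-bit bitmaps
  Pixels := PByte(NativeUInt(AStream.Memory) + FileHeader^.bfOffBits);
  RowBytes := InfoHeader^.biWidth * 4;    // 32 bpp rows need no padding
  // A positive biHeight means the rows are stored bottom-up, i.e. Pixels points
  // at the lowest row of the image; a negative biHeight means top-down.
  // From here the pixel data can be converted directly, RowBytes bytes per row.
end;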
Tommi Prami 140 Posted February 3, 2021

19 minutes ago, Anders Melander said: It's obvious that you, at some point in your pipeline, have the bitmaps in a TBitmap, but it's not obvious whether this is actually necessary. If you don't have a requirement that the bitmaps be represented as a TBitmap, then you can avoid the whole GDI overhead by simply loading each bitmap into a memory stream and accessing the pixel data directly in memory. BMP is a fairly simple and well-defined format, so it's not that hard. You could also use something like Graphics32 to do this for you, since this seems to be about 32-bit bitmaps exclusively.

Fair point. I said at some point that this is a 3rd-party library; this is a very small part of it, and although it is more or less obsolete, we have started to fix it ourselves. But there are some portions we have no control over, like the use of TBitmap (changing that would mean a total rewrite of a huge library).
fatihtsp 0 Posted February 6, 2021

Thanks for this interesting topic, in particular for someone like me who deals with image processing and related algorithms and is a Delphi/FPC lover. In my tests, the reference pixel-accessing code is the fastest one, and the difference on a bigger image (12000x8143 pixels) is clearer; please look at the results below (each test run twice in succession). Additionally, I've tried FastMM4 and FastMM5 with these tests, and FastMM5 clearly yielded the fastest times (I do not give the FastMM4 scores here); I've used FastMM5 in my tests.

Using original image:
-----------------------
Running test: "Reference" (RELEASE build Win64, used with Normal Image(5600x3800))
Run count: 5
Min: 78.990ms, Average: 79.810ms, Max: 80.594ms

Running test: "Reference" (RELEASE build Win64, used with Normal Image(5600x3800))
Run count: 5
Min: 80.314ms, Average: 84.316ms, Max: 91.712ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Normal Image(5600x3800))
Run count: 5
Min: 93.387ms, Average: 93.930ms, Max: 94.503ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Normal Image(5600x3800))
Run count: 5
Min: 93.118ms, Average: 93.848ms, Max: 94.309ms

Using bigger image:
-----------------------
Running test: "Reference" (RELEASE build Win64, used with Bigger Image(12000x8143))
Run count: 5
Min: 356.202ms, Average: 361.204ms, Max: 378.090ms

Running test: "Reference" (RELEASE build Win64, used with Bigger Image(12000x8143))
Run count: 5
Min: 352.916ms, Average: 367.075ms, Max: 385.400ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Bigger Image(12000x8143))
Run count: 5
Min: 422.031ms, Average: 429.115ms, Max: 438.597ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Bigger Image(12000x8143))
Run count: 5
Min: 423.584ms, Average: 426.645ms, Max: 430.645ms
Tommi Prami 140 Posted February 11, 2021

On 2/6/2021 at 5:47 PM, fatihtsp said: In my tests, the reference pixel-accessing code is the fastest one, and the difference on a bigger image (12000x8143 pixels) is clearer; please look at the results below (each test run twice in succession). Additionally, I've tried FastMM4 and FastMM5 with these tests, and FastMM5 clearly yielded the fastest times (I do not give the FastMM4 scores here); I've used FastMM5 in my tests.

The problem (to me) is that I've seen a 10x speedup by not using ScanLine as much, and it would be nice to be able to reproduce that scenario. Someone also made a simple test app and saw that the ScanLine calls alone took significant time, without the other code.