Tommi Prami

32bit RGBA TBitmap to RGB byte stream.


29 minutes ago, Tommi Prami said:

Stream of bytes in order of RGB. What happens after this is another story altogether.

As you like. There are other existing solutions for converting RGBA to RGB, depending on which file format you want to create. A raw RGB stream copied to a file is of little use on its own: the reader cannot even know the width and height.

12 minutes ago, FPiette said:

As you like. There are other existing solutions for converting RGBA to RGB, depending on which file format you want to create. A raw RGB stream copied to a file is of little use on its own: the reader cannot even know the width and height.

Like I said earlier, part of 3rd party library/Component and can't be changed.

 

-Tee-

18 minutes ago, Tommi Prami said:

Like I said earlier, part of 3rd party library/Component and can't be changed.

That makes no sense: you are asking to change the way the RGBA-to-RGB conversion is done, so you can change that part.

Sorry but I won't participate any more in this useless conversation about an XY problem.

39 minutes ago, FPiette said:

That makes no sense: you are asking to change the way the RGBA-to-RGB conversion is done, so you can change that part.

Sorry but I won't participate any more in this useless conversation about an XY problem.

Hmm, I think this is lost in translation or something like that.

The process is and will be 100% the same; I am not asking for help to change the process, just to optimize what is already there. As far as I know it can't even be changed. I am just looking for a way to optimize it. I have received valuable info, so I thank everyone for that.

The method's input is a TBitmap with a 32-bit pixel format, and its pixels have to be saved into a stream with 3 bytes per pixel, in RGB order. Sorry if I am not being clear enough.
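As a concrete illustration of that contract, a minimal sketch (not the actual library code; SaveAsRGB is an illustrative name, and it assumes the usual GDI byte order where pf32bit pixels are stored as BGRA):

```pascal
uses
  Vcl.Graphics, System.Classes;

// Sketch: write a pf32bit TBitmap into a stream as 3 bytes per pixel (R, G, B).
procedure SaveAsRGB(ABitmap: TBitmap; AStream: TStream);
var
  X, Y: Integer;
  Src: PByte;
  RGB: array[0..2] of Byte;
begin
  Assert(ABitmap.PixelFormat = pf32bit);
  for Y := 0 to ABitmap.Height - 1 do
  begin
    Src := ABitmap.ScanLine[Y];
    for X := 0 to ABitmap.Width - 1 do
    begin
      RGB[0] := Src[2]; // R (BGRA layout: B=0, G=1, R=2, A=3)
      RGB[1] := Src[1]; // G
      RGB[2] := Src[0]; // B
      AStream.WriteBuffer(RGB, SizeOf(RGB));
      Inc(Src, 4); // skip the alpha byte
    end;
  end;
end;
```

Writing three bytes per call is itself slow; buffering a whole row and issuing a single WriteBuffer per row is the usual first optimization.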

 



Intermediate info:

It is way faster than I thought. I got bitten (very badly) by FastMM4's full debug mode: the memory allocation that Scanline does takes ages under it. So there is room for optimizing the extra ScanLine calls out, and maybe some other things too.

So at first it seems there is not much user-perceivable room for optimization. It can still be made faster, though. I have a test project cooking, trying to check how far I can push it; I'll publish it later if I get something out of it.
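For reference, the "extra scanlines" optimization hinted at above is typically done by calling ScanLine only twice up front and stepping rows by the DIB pitch; a sketch under that assumption (Bmp and ProcessRow are hypothetical names):

```pascal
// Sketch: two ScanLine calls total instead of one per row. Each ScanLine
// call goes through GDI (and, under a debug memory manager, may allocate),
// so hoisting it out of the loop removes that per-row cost.
var
  Row: PByte;
  Pitch: NativeInt;
  Y: Integer;
begin
  Row := Bmp.ScanLine[0];
  if Bmp.Height > 1 then
    Pitch := NativeInt(Bmp.ScanLine[1]) - NativeInt(Row)
  else
    Pitch := 0;
  // Pitch is negative for bottom-up DIBs; stepping by it works either way.
  for Y := 0 to Bmp.Height - 1 do
  begin
    ProcessRow(Row, Bmp.Width); // hypothetical per-row handler
    Inc(Row, Pitch);
  end;
end;
```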


That is not the complete story, though.

Production app: about 128641 ms

 

Running test: "Reference" (RELEASE build)
  Run count: 5

  Min: 113,458ms, Average: 118,925ms, Max: 127,638ms

Running test: "Reference" (DEBUG build)
  Run count: 5

  Min: 13489,222ms, Average: 13957,058ms, Max: 14276,866ms

Debug mode is slow, but not even close to as slow as what I experienced when that piece of code runs as part of the whole program rather than as an isolated snippet in a separate app: there it took minutes to complete. (Yes, I measured it.)

What on earth could cause that? Memory alignment? Is the bitmap somehow laid out differently in memory (bad alignment, etc.)?

One obvious candidate is compiler options; I have to check those later.

 

-Tee-


3 hours ago, Fr0sT.Brutal said:

Range/overflow checks, for example.

I'll check those later. Thanks.

 

-Tee-
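For reference, the checks mentioned map to these compiler directives (a config fragment; they can also be toggled per build configuration under Project Options > Compiling, which is why Debug and Release often differ):

```pascal
{$R+} // range checking on (typical in Debug configurations)
{$Q+} // overflow checking on

// To exclude a hot loop from the checks locally:
{$R-,Q-}
// ... time-critical code ...
{$R+,Q+}
```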


Also: if you use FastMM4, some configurations enable many debug actions that help hunt bugs but slow things down. When you have massive numbers of memory allocations and disposals, the difference can be significant.

On 1/25/2021 at 1:42 PM, Fr0sT.Brutal said:

Also: if you use FastMM4, some configurations enable many debug actions that help hunt bugs but slow things down. When you have massive numbers of memory allocations and disposals, the difference can be significant.

I use the same FastMM in the test app, with the same settings as in the production app.

And enabling overflow and range checking did not produce a situation as slow as the production app case.

This is with overflow and range checking on:

Running test: "Reference" (DEBUG build) 
  Run count: 5
  
  Min: 12707,820ms, Average: 13247,058ms, Max: 14194,733ms

The production case also uses a somewhat smaller bitmap than the test app, so this is getting weirder: the production app case is still way, way slower than this test app.

The only thing I can currently think of is that the production bitmap's addresses go bottom-up, while in the test app they go top-down.

I just have to figure out how to get that situation tested.


I also set inlining control to Off in the test app; it is still about as fast as before:

Running test: "Reference" (DEBUG build)
  Run count: 5
  
  Min: 12459,948ms, Average: 13538,720ms, Max: 14904,923ms

45 minutes ago, Tommi Prami said:

I also set inlining control to Off in the test app; it is still about as fast as before:

Yes, of course that didn't do anything. Why would you expect it to?

I think you need to take a step back and think about what you are doing instead of just trying random stuff. Take control of the problem.

 

The numbers you have posted show that you are either measuring time in microseconds or using the thousand separator incorrectly. If you are measuring microseconds then stop that. Numbers that small are not relevant here.

 

One of the first things you should have done would be to locate the bottleneck by profiling your code. If you don't have a profiler or don't understand how to use one then you can emulate a sampling profiler by just running the application a few times and pausing it in the debugger. Unless the slowdown is evenly distributed, there's a statistical likelihood that the call stack will show you where the application is spending the majority of its time.

 

3 hours ago, Anders Melander said:

One of the first things you should have done would be to locate the bottleneck by profiling your code. If you don't have a profiler or don't understand how to use one then you can emulate a sampling profiler by just running the application a few times and pausing it in the debugger. Unless the slowdown is evenly distributed, there's a statistical likelihood that the call stack will show you where the application is spending the majority of its time.

Another option is a kind of binary search where one comments out most of the code until things are fast and then uncomments pieces to track down what exactly causes the slowdown. This method is especially useful when there is some piece which runs for a short time but executes very frequently. Profilers and timers won't help much here.

3 hours ago, Fr0sT.Brutal said:

This method is especially useful when there is some piece which runs for a short time but executes very frequently. Profilers and timers won't help much here.

Why wouldn't a profiler, real or not, help there? What do you think a profiler does?

1 hour ago, Anders Melander said:

Why wouldn't a profiler, real or not, help there? What do you think a profiler does?

Because timers have limited resolution. If lineA executes in 1 ms and lineB in 1.5 ms, they will likely miss the difference or, even worse, add noise of their own, producing irrelevant results. But if these lines are repeated in a loop 1e9 times, the overall difference between them will be noticeable.

14 minutes ago, Fr0sT.Brutal said:

Because timers have limited resolution. If lineA executes in 1 ms and lineB in 1.5 ms, they will likely miss the difference or, even worse, add noise of their own, producing irrelevant results.

You're the only one talking about timers, but even if you used a grandfather clock the method would work.

If you break the application at random you will have 50% higher likelihood of hitting lineB than lineA and each time you do this the likelihood increases. This is exactly how a sampling profiler works. Are you saying sampling profilers are a hoax?

22 hours ago, Anders Melander said:

Yes, of course that didn't do anything. Why would you expect it to?

 

 

Because RTL and VCL code is also built with inlining, I think, so it seemed a valid thing to test.

Quote

The numbers you have posted shows that you are either measuring time in microseconds or using the thousand separator incorrectly. If you are measuring microseconds then stop that. Numbers that small are not relevant here.

No, milliseconds.
LResultArray := LStopWatch.Elapsed.TotalMilliseconds;

The number format is standard Finnish, which can be misleading to some: the decimal separator is a comma and the thousands separator is a space.
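For context, TotalMilliseconds comes from System.Diagnostics.TStopwatch; a minimal usage sketch (DoWork is a placeholder workload, not the code under discussion):

```pascal
program StopwatchDemo;

uses
  System.SysUtils, System.Diagnostics;

// Placeholder workload so the measurement has something to time.
procedure DoWork;
var
  I, Sum: Integer;
begin
  Sum := 0;
  for I := 1 to 1000000 do
    Inc(Sum, I mod 7);
end;

var
  LStopWatch: TStopwatch;
begin
  LStopWatch := TStopwatch.StartNew; // high-resolution timer (QueryPerformanceCounter on Windows)
  DoWork;
  LStopWatch.Stop;
  Writeln(Format('%.3f ms', [LStopWatch.Elapsed.TotalMilliseconds]));
end.
```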

 

Quote

One of the first things you should have done would be to locate the bottleneck by profiling your code.

I will do that when I've got time. I'll also publish my test code so anyone can check it if they want to.

The big problem with profiling here is that I still can't reproduce the level of slowdown seen in the production app. And it is very hard to profile without an instrumenting profiler (which I don't have) without losing everything to the noise of all the other processing in the app.

What I do know is that the main problem is ScanLine, but I don't know why it is sometimes very fast yet takes ages in our production code (versus the test app); it is at least a 10x difference. I've tested the production code with just one ScanLine call, incrementing the pointer to the next line, and it is at least 10x faster, as I stated before.

I just have to reproduce that in the test app.

This, sadly, will be a kind of marathon; I can't spend too much time on it currently.
 

But I'll get there 🙂

 

-tee-
 

1 hour ago, Tommi Prami said:

Small Test APP

  1. Your TRGB32 type is identical to TRGBQuad.
  2. Why have you declared the TRGB32Array array packed? It serves no purpose.
  3. Why do you use TWICImage but also explicitly reference the Jpeg unit?

One thing that I don't think you've mentioned, or at least I can't remember, is if you actually need to use TBitmap at all. Where does your bitmap data originate? A disk file, a stream, something else?

15 hours ago, Anders Melander said:
  1. Your TRGB32 type is identical to TRGBQuad.
  2. Why have you declared the TRGB32Array array packed? It serves no purpose.
  3. Why do you use TWICImage but also explicitly reference the Jpeg unit?

One thing that I don't think you've mentioned, or at least I can't remember, is if you actually need to use TBitmap at all. Where does your bitmap data originate? A disk file, a stream, something else?

1. Not my code originally, so I preserved the original code; noted, though. I did not know that type existed.

2. That can be changed then; I did not realize it.

3. The Jpeg unit can probably be removed then. I have to look at that.

I thought the caption "32bit RGBA TBitmap to RGB byte stream" made it pretty obvious that the data is originally in a TBitmap. In this case the bitmap is loaded from several sources: from a DB, some are created in code, etc., but not from a file as such.

 

I made a couple of changes based on your observations.


38 minutes ago, Tommi Prami said:

I thought the caption "32bit RGBA TBitmap to RGB byte stream" made it pretty obvious that the data is originally in a TBitmap. In this case the bitmap is loaded from several sources: from a DB, some are created in code, etc., but not from a file as such.

It's obvious that you, at some point in your pipeline, have the bitmaps in a TBitmap, but it's not obvious if this is actually necessary. If you don't have a requirement that the bitmaps need to be represented as a TBitmap then you can avoid the whole GDI overhead by simply loading the bitmaps as a memory stream and accessing the pixel data directly in memory. BMP is a fairly simple and well defined format so it's not that hard. You can also use something like Graphics32 to do this for you since this seems to be about 32-bit bitmaps exclusively.
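A sketch of that approach, assuming an uncompressed 32-bit BITMAPINFOHEADER-style file already loaded into a TMemoryStream (the function name and checks are illustrative, not a complete BMP reader):

```pascal
uses
  Winapi.Windows, System.Classes;

// Sketch: locate a 32-bit BMP's pixel data directly in a TMemoryStream,
// without creating a TBitmap (no GDI handle, no extra copy).
function LocatePixels(AStream: TMemoryStream; out AWidth, AHeight: Integer): PByte;
var
  FileHdr: PBitmapFileHeader;
  InfoHdr: PBitmapInfoHeader;
begin
  FileHdr := AStream.Memory;
  Assert(FileHdr.bfType = $4D42); // 'BM' signature
  InfoHdr := PBitmapInfoHeader(PByte(AStream.Memory) + SizeOf(TBitmapFileHeader));
  Assert((InfoHdr.biBitCount = 32) and (InfoHdr.biCompression = BI_RGB));
  AWidth := InfoHdr.biWidth;
  AHeight := Abs(InfoHdr.biHeight); // positive biHeight => rows stored bottom-up
  Result := PByte(AStream.Memory) + FileHdr.bfOffBits;
  // 32-bit rows are 4-byte aligned by construction, so there is no row padding.
end;
```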

19 minutes ago, Anders Melander said:

It's obvious that you, at some point in your pipeline, have the bitmaps in a TBitmap, but it's not obvious if this is actually necessary. If you don't have a requirement that the bitmaps need to be represented as a TBitmap then you can avoid the whole GDI overhead by simply loading the bitmaps as a memory stream and accessing the pixel data directly in memory. BMP is a fairly simple and well defined format so it's not that hard. You can also use something like Graphics32 to do this for you since this seems to be about 32-bit bitmaps exclusively.

Fair point. I said at some point that this is a 3rd-party library, and this is a very small part of it. It is more or less obsolete and we have started to fix it ourselves, but there are portions we have no control over, like the use of TBitmap (changing that would mean a total rewrite of a huge library).


Thanks for this interesting topic, especially for someone like me who deals with image processing and related algorithms and is a Delphi/FPC lover. In my tests, the reference pixel-accessing code is the fastest one, and the difference on a bigger image (12000x8143 pixels) is clearer; please look at the results below (each test run twice in succession). Additionally, I've compared FastMM4 and FastMM5 on these tests, and FastMM5 clearly yielded the fastest times (I did not include the FastMM4 scores here); I've used FastMM5 in my tests.

 

Using original image:

-----------------------

Running test: "Reference" (RELEASE build Win64, used with Normal Image(5600x3800))
  Run count: 5
  
  Min: 78.990ms, Average: 79.810ms, Max: 80.594ms

Running test: "Reference" (RELEASE build Win64, used with Normal Image(5600x3800))
  Run count: 5
  
  Min: 80.314ms, Average: 84.316ms, Max: 91.712ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Normal Image(5600x3800))
  Run count: 5
  
  Min: 93.387ms, Average: 93.930ms, Max: 94.503ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Normal Image(5600x3800))
  Run count: 5
  
  Min: 93.118ms, Average: 93.848ms, Max: 94.309ms

 

Using bigger image:

-----------------------

Running test: "Reference" (RELEASE build Win64, used with Bigger Image(12000x8143))
  Run count: 5
  
  Min: 356.202ms, Average: 361.204ms, Max: 378.090ms

Running test: "Reference" (RELEASE build Win64, used with Bigger Image(12000x8143))
  Run count: 5
  
  Min: 352.916ms, Average: 367.075ms, Max: 385.400ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Bigger Image(12000x8143))
  Run count: 5
  
  Min: 422.031ms, Average: 429.115ms, Max: 438.597ms

Running test: "ReferenceWithScanlineHelper" (RELEASE build Win64, used with Bigger Image(12000x8143))
  Run count: 5
  
  Min: 423.584ms, Average: 426.645ms, Max: 430.645ms


On 2/6/2021 at 5:47 PM, fatihtsp said:

Thanks for this interesting topic, especially for someone like me who deals with image processing and related algorithms and is a Delphi/FPC lover. In my tests, the reference pixel-accessing code is the fastest one, and the difference on a bigger image (12000x8143 pixels) is clearer; please look at the results below (each test run twice in succession). Additionally, I've compared FastMM4 and FastMM5 on these tests, and FastMM5 clearly yielded the fastest times (I did not include the FastMM4 scores here); I've used FastMM5 in my tests.

The problem (to me) is that I've seen a 10x speedup without using ScanLine that much, and it would be nice to be able to reproduce that scenario. Someone also made a simple test app and saw that ScanLine alone took significant time, without any other code.

