Tommi Prami, Posted January 11, 2021
Hello,
One piece of code we use (a 3rd party component) takes bitmap data and writes the RGB values to a stream. On a large bitmap this takes quite a long time, due to the sheer number of pixels to go through. The original code was even slower because it used the Pixels[] property. Now it uses ScanLine.
There are now two different buffering strategies (collect data into a byte array and write that to the stream once in a while), but changes to the buffering can only go so far.
I was just pondering whether there could be any weird bit-fiddling tricks etc. to get that RGBA -> RGB byte triplet faster? That is the most common operation anyhow. I mean this:
    LLine := ASrcmap.Scanline[LY];
    for LX := 0 to xdim - 1 do
    begin
      bbuff[BP] := LLine[LX].R; // RGBColor^.red;
      bbuff[BP + 1] := LLine[LX].G; // RGBColor^.green;
      bbuff[BP + 2] := LLine[LX].B; // RGBColor^.blue;
      Inc(BP, 3);
    end;
Any ideas?
Tommi Prami, Posted January 11, 2021
One thing that could make this significantly faster would be to use some fast but good-enough-quality algorithm to resample the image smaller first. It might be possible to do that, or not, depending on how large a change it would be, and it sure would have to be super fast resampling. But if it is possible without changing the original bitmap, it would be cool.
-Tee-
FPiette, Posted January 11, 2021
Maybe using pointers to avoid index computation?
Fr0sT.Brutal, Posted January 11, 2021
Before doing any deep optimizations, run a benchmark to ensure that the serializing really is the source of the slowdown. Otherwise you could spend hours achieving nothing in the end.
Anders Melander, Posted January 11, 2021
I can't see how resampling would make it any faster unless your streaming implementation really sucks. Resampling would mean that you'd have to read all the pixel data, juggle it around, store it in a new buffer and then read from that buffer instead. Considerably more expensive than whatever solution you can come up with that just reads the data via ScanLine.
You haven't shown how your RGBA and RGB types are declared, but assuming the R-G-B ordering is the same and the A is the last (i.e. high) byte, then just read 4 bytes (that's a DWORD) from the source and write 3 bytes. Rinse, repeat.
If the source is ABGR and the destination is RGB (e.g. TColor) then you can rearrange the bits like this:
    function ABGR2RGB(ABGR: DWORD): TColor;
    begin
      Result := ((ABGR and $00FF0000) shr 16) or (ABGR and $0000FF00) or ((ABGR and $000000FF) shl 16);
    end;
or in assembler:
    function ABGR2RGB(ABGR: DWORD): TColor;
    asm
      mov EAX, ECX // Remove this for 32-bit
      rol EAX, 8
      xor AL, AL
      bswap EAX
    end;
Tommi Prami, Posted January 12, 2021
18 hours ago, Fr0sT.Brutal said:
    Before doing any deep optimizations, run a benchmark to ensure that the serializing really is the source of the slowdown. Otherwise you could spend hours achieving nothing in the end.
On a large bitmap this takes minutes, so I am pretty sure this is a place where all speedups are welcome.
Tommi Prami, Posted January 12, 2021 (edited)
13 hours ago, Anders Melander said:
    I can't see how resampling would make it any faster unless your streaming implementation really sucks. Resampling would mean that you'd have to read all the pixel data, juggle it around, store it in a new buffer and then read from that buffer instead. Considerably more expensive than whatever solution you can come up with that just reads the data via ScanLine.
    You haven't shown how your RGBA and RGB types are declared, but assuming the R-G-B ordering is the same and the A is the last (i.e. high) byte, then just read 4 bytes (that's a DWORD) from the source and write 3 bytes. Rinse, repeat.
    If the source is ABGR and the destination is RGB (e.g. TColor) then you can rearrange the bits like this:
        function ABGR2RGB(ABGR: DWORD): TColor;
        begin
          Result := ((ABGR and $00FF0000) shr 16) or (ABGR and $0000FF00) or ((ABGR and $000000FF) shl 16);
        end;
    or in assembler:
        function ABGR2RGB(ABGR: DWORD): TColor;
        asm
          mov EAX, ECX // Remove this for 32-bit
          rol EAX, 8
          xor AL, AL
          bswap EAX
        end;
Would that still be 4 bytes? Right?
Ah, should learn how to read first 🙂
Edited January 12, 2021 by Tommi Prami (Misunderstanding)
Tommi Prami, Posted January 12, 2021 (edited)
The input is the normal TBitmap layout with 32-bit pixels:
    TRGB32 = packed record
      B, G, R, A: Byte;
    end;
and the output should be a stream of RGB bytes, in that order.
.tee.
Edited January 12, 2021 by Tommi Prami (Typo)
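With the TRGB32 layout given above, the pointer suggestion from earlier in the thread could look roughly like the sketch below. It is only an illustration: RowToRGB is a hypothetical helper name, not part of the 3rd party component, and the sketch assumes a Delphi version that supports {$POINTERMATH ON} (Delphi 2009 or later). It walks one row with two pointers and never recomputes an array index.

```pascal
program RowToRGBSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$POINTERMATH ON}  // allow P[n] indexing on typed pointers
{$R-}              // no range checks in the hot loop

type
  // Layout from the thread: 32-bit TBitmap pixel, blue first in memory.
  TRGB32 = packed record
    B, G, R, A: Byte;
  end;
  PRGB32 = ^TRGB32;

// Hypothetical helper: copies Count pixels from Src to Dst as R, G, B bytes
// (alpha is dropped) and returns the number of bytes written.
function RowToRGB(Src: PRGB32; Dst: PByte; Count: Integer): Integer;
var
  i: Integer;
begin
  for i := 1 to Count do
  begin
    Dst[0] := Src^.R;
    Dst[1] := Src^.G;
    Dst[2] := Src^.B;
    Inc(Src);      // advances by SizeOf(TRGB32) = 4 bytes
    Inc(Dst, 3);   // advances by 3 bytes
  end;
  Result := Count * 3;
end;

var
  Pixels: array[0..1] of TRGB32;
  Buf: array[0..5] of Byte;
  Written, i: Integer;
begin
  Pixels[0].B := $10; Pixels[0].G := $20; Pixels[0].R := $30; Pixels[0].A := $FF;
  Pixels[1].B := $40; Pixels[1].G := $50; Pixels[1].R := $60; Pixels[1].A := $FF;

  Written := RowToRGB(@Pixels[0], @Buf[0], 2);

  // Prints R, G, B of pixel 0 followed by R, G, B of pixel 1:
  // 48 32 16 96 80 64
  for i := 0 to Written - 1 do
    Write(Buf[i], ' ');
  WriteLn;
end.
```

Whether this beats the indexed loop by a meaningful margin depends on the compiler and its settings, so it is worth measuring before and after.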
Tommi Prami, Posted January 12, 2021 (edited)
Thanks everyone, so far. I'll have to check on this later.
I'll stress that this is part of a 3rd party component, which we can't totally rewrite. This process sometimes takes too much time, so we would like to speed it up a bit if we can.
I was pondering whether I could define a 4-byte array and use the absolute trick to map that array onto the result of the method shown by Anders above.
I am still pretty much asleep, so every idea I get for implementing this seems to have too much code in it. I bet there is an elegant solution, possibly using pointers, which I am not too good at. But I'll have to try later.
-Tee-
Edited January 12, 2021 by Tommi Prami (Typo)
FPiette, Posted January 12, 2021
38 minutes ago, Tommi Prami said:
    On a large bitmap this takes minutes
Please define what a large bitmap is for you.
Fr0sT.Brutal, Posted January 12, 2021
1 hour ago, Tommi Prami said:
    On a large bitmap this takes minutes, so I am pretty sure this is a place where all speedups are welcome.
Are you absolutely sure? What happens if you comment out the copy, leaving only the ScanLine call?
dummzeuch, Posted January 12, 2021 (edited)
How often do you access the ScanLine property? I found that it is much faster to get the address of the first line, calculate the offset between lines and add (or subtract) the offset to get the other lines. Also, pointer incrementation is much faster than using an array with indexes. On top of that, make sure to disable range checking in the release code.
There is some code that does it in u_dzGraphicsUtils in my dzlib. If I remember correctly I blogged about it too.
Edit: Yes I did: https://blog.dummzeuch.de/2019/12/12/accessing-bitmap-pixels-with-less-scanline-calls-in-delphi/
Edited January 12, 2021 by dummzeuch
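A minimal sketch of the idea described above, not taken from dzlib or from the 3rd party component: ScanLine is called only twice, the signed distance between the first two rows is used as the stride, and every other row is reached by pointer arithmetic. BitmapToRGBStream and its per-row buffer are hypothetical names, and the sketch assumes a pf32bit bitmap and a Delphi version with unit scope names (XE2 or later).

```pascal
program ScanLineStrideSketch;
{$APPTYPE CONSOLE}
{$POINTERMATH ON}
{$R-}  // range checking off, as suggested in the thread

uses
  System.Classes, Vcl.Graphics;

type
  TRGB32 = packed record
    B, G, R, A: Byte;
  end;
  PRGB32 = ^TRGB32;

// Hypothetical helper: writes all pixels of a pf32bit bitmap to Dest as
// R, G, B bytes, top row first, using only two ScanLine calls.
procedure BitmapToRGBStream(Bmp: TBitmap; Dest: TStream);
var
  RowBuf: array of Byte;
  Line, Dst: PByte;
  Src: PRGB32;
  Stride: NativeInt;
  x, y: Integer;
begin
  Assert(Bmp.PixelFormat = pf32bit, 'sketch only handles 32-bit bitmaps');
  if (Bmp.Width = 0) or (Bmp.Height = 0) then
    Exit;

  SetLength(RowBuf, Bmp.Width * 3);
  Line := Bmp.ScanLine[0];
  // Signed byte distance between consecutive rows; it is negative for the
  // usual bottom-up DIB, which is why it is computed rather than assumed.
  if Bmp.Height > 1 then
    Stride := NativeInt(Bmp.ScanLine[1]) - NativeInt(Line)
  else
    Stride := 0;

  for y := 0 to Bmp.Height - 1 do
  begin
    Src := PRGB32(Line);
    Dst := @RowBuf[0];
    for x := 0 to Bmp.Width - 1 do
    begin
      Dst[0] := Src^.R;
      Dst[1] := Src^.G;
      Dst[2] := Src^.B;
      Inc(Src);
      Inc(Dst, 3);
    end;
    Dest.WriteBuffer(RowBuf[0], Length(RowBuf));  // one write per row
    Inc(Line, Stride);                            // next row, no ScanLine call
  end;
end;

var
  Bmp: TBitmap;
  MS: TMemoryStream;
begin
  Bmp := TBitmap.Create;
  MS := TMemoryStream.Create;
  try
    Bmp.PixelFormat := pf32bit;
    Bmp.SetSize(64, 48);
    BitmapToRGBStream(Bmp, MS);
    WriteLn(MS.Size);  // Width * Height * 3 bytes
  finally
    MS.Free;
    Bmp.Free;
  end;
end.
```

Writing one row buffer at a time is just one possible buffering choice; a larger buffer covering several rows would reduce the number of stream writes further.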
Tommi Prami, Posted January 12, 2021
2 hours ago, FPiette said:
    Please define what a large bitmap is for you.
The customer had one bigger than 5000x3000, which is way too big, but that just brought this piece of code to my attention.
Tommi Prami, Posted January 12, 2021
46 minutes ago, dummzeuch said:
    How often do you access the ScanLine property? I found that it is much faster to get the address of the first line, calculate the offset between lines and add (or subtract) the offset to get the other lines. Also, pointer incrementation is much faster than using an array with indexes. On top of that, make sure to disable range checking in the release code.
    There is some code that does it in u_dzGraphicsUtils in my dzlib. If I remember correctly I blogged about it too.
    Edit: Yes I did: https://blog.dummzeuch.de/2019/12/12/accessing-bitmap-pixels-with-less-scanline-calls-in-delphi/
Thanks, I'll have a look...
Guest, Posted January 12, 2021
7 minutes ago, Tommi Prami said:
    The customer had one bigger than 5000x3000, which is way too big, but that just brought this piece of code to my attention.
So we have 15 million pixels. Now let's assume simple, naive assembly handling this pixel by pixel in a loop, with the loop itself also in assembly. I agree with Thomas on how this should be done (I do the same): one ScanLine call per bitmap.
Naive assembly using the general instruction set can take 3 cycles per pixel at most, including the loop (not considering the memory bottleneck here, because there will be one: cache hits and misses, and fetching). Anyway, 45 million cycles might be achievable, which means that converting on a 3 GHz CPU will take less than a second, even adding the memory access overhead. The same memory overhead will be there with MMX or SIMD, but with these you can do many pixels per cycle (maybe 32+).
If you want us to have fun, then please post some small code that really pinpoints the bottleneck and tests its correctness, and let us have our fun! Of course, only if assembly is on the table.
PS: I don't quite understand the target. Is it to convert 32-bit RGBA to 24-bit RGB, or is it just for storing (or sending the image over the net) to save space?
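Spelling out the estimate above as a rough worked calculation, with the 3 cycles per pixel and the 3 GHz clock taken from the post as assumptions and memory stalls ignored:

```latex
\begin{align*}
N_{\text{pixels}} &= 5000 \times 3000 = 1.5 \times 10^{7} \\
N_{\text{cycles}} &\approx 3 \, N_{\text{pixels}} = 4.5 \times 10^{7} \\
t &\approx \frac{4.5 \times 10^{7}}{3 \times 10^{9}\ \text{Hz}}
   = 1.5 \times 10^{-2}\ \text{s} = 15\ \text{ms}
\end{align*}
```

Even if memory traffic (about 60 MB read and 45 MB written at this size) makes the real figure many times larger, a run time measured in minutes points at something other than the per-pixel copy.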
FPiette, Posted January 12, 2021
12 minutes ago, Tommi Prami said:
    3 hours ago, FPiette said:
        Please define what a large bitmap is for you.
    The customer had one bigger than 5000x3000
This is not what I call a large bitmap. 15 megapixels is a normal size for a picture. Most of today's cameras produce much larger images. My Sony A7III, which is a mid-range camera, produces 6000x4000 pixels, while a Sony A7RIV produces a 9504x6336 pixel image. I developed radiography software where images can be much larger still.
Fr0sT.Brutal, Posted January 12, 2021
I did a quick & dumb test which showed that 100 ScanLine calls on a 5000*5000 bitmap take 5 seconds (!) because the bitmap is recreated in every call. So this is the real handbrake.
Looking at TBitmap.GetScanLine, you can extract the necessary parts provided you have the pointer to the 1st row from an initial ScanLine call. The BytesPerScanline helper function is public, so this won't even be a hack.
Rollo62, Posted January 12, 2021
On 1/11/2021 at 10:55 AM, Tommi Prami said:
    One piece of code we use (a 3rd party component) takes bitmap data and writes the RGB values to a stream.
What's the purpose of copying a bitmap to a (linear) stream? This sounds as if it's for saving to disk.
Anders Melander, Posted January 12, 2021
1 hour ago, Kas Ob. said:
    If you want us to have fun, then please post some small code that really pinpoints the bottleneck and tests its correctness, and let us have our fun! Of course, only if assembly is on the table.
A bit premature, wouldn't you say? If you consider the rest of the pipeline, then using optimized assembly for this will not make any significant difference.
1 hour ago, Fr0sT.Brutal said:
    I did a quick & dumb test which showed that 100 ScanLine calls on a 5000*5000 bitmap take 5 seconds (!) because the bitmap is recreated in every call. So this is the real handbrake.
    Looking at TBitmap.GetScanLine, you can extract the necessary parts provided you have the pointer to the 1st row from an initial ScanLine call. The BytesPerScanline helper function is public, so this won't even be a hack.
Good point. I think I would just not use TBitmap for this and either create a DIB directly or use a TBitmap32 from Graphics32 (with a memory backend).
Tommi Prami, Posted January 13, 2021
19 hours ago, FPiette said:
    This is not what I call a large bitmap. 15 megapixels is a normal size for a picture. Most of today's cameras produce much larger images. My Sony A7III, which is a mid-range camera, produces 6000x4000 pixels, while a Sony A7RIV produces a 9504x6336 pixel image. I developed radiography software where images can be much larger still.
Not huge, but large. This is for a 7x3 cm logo on the print, though, so it is overkill for that. But that makes this piece of code even worse 🙂
Tommi Prami, Posted January 13, 2021
18 hours ago, Rollo62 said:
    What's the purpose of copying a bitmap to a (linear) stream? This sounds as if it's for saving to disk.
Yes, it is saved to a file...
Tommi Prami, Posted January 13, 2021
19 hours ago, Fr0sT.Brutal said:
    I did a quick & dumb test which showed that 100 ScanLine calls on a 5000*5000 bitmap take 5 seconds (!) because the bitmap is recreated in every call. So this is the real handbrake.
One should always study the code one is calling 🙂 I've always thought that it would just return a pointer to the data, offset depending on the line you access. Good to learn new things.
Fr0sT.Brutal, Posted January 13, 2021 (edited)
1 hour ago, Tommi Prami said:
    One should always study the code one is calling 🙂
That's true but not always possible. The more essential lesson is that when one encounters a slowdown, it's wise to track down what exactly the cause is. One doesn't even need timers and so on; it's enough to just comment out fragments and see what changes.
Edited January 13, 2021 by Fr0sT.Brutal
FPiette, Posted January 13, 2021
1 hour ago, Tommi Prami said:
    20 hours ago, Rollo62 said:
        What's the purpose of copying a bitmap to a (linear) stream? This sounds as if it's for saving to disk.
    Yes, it is saved to a file...
Saved to which file format? Why are you not telling us the full story? It looks to me like an XY problem.
Tommi Prami, Posted January 13, 2021
1 hour ago, FPiette said:
    Saved to which file format? Why are you not telling us the full story? It looks to me like an XY problem.
I think this is pretty clear from the caption, or at least I think it is pretty self-explanatory: a stream of bytes in RGB order. What happens after this is another story altogether.