XylemFlow (posted April 30, 2023):
FMX mostly does a very good job of antialiasing graphics. However, I notice that the quality of TCanvas.DrawBitmap is poor when the destination rect is smaller than the source rect. I'm running on Windows for now. This is especially obvious when the image contains thin lines, because the subsampling causes parts of the lines to disappear. I'm looking for a way to improve the quality without compromising too much on runtime, so ideally I'd like it to be done on the GPU. I feel that this is a job the GPU should be doing.
Below is an example with 3 different methods. The last method is my own code, which shows what I'm trying to achieve, but it is not done on the GPU and so is not as fast as I'd like. It also won't work if I want to include some rotation as well as scale using TCanvas.Matrix. I have also tried changing the HighSpeed flag in the DrawBitmap function, but it doesn't seem to make a difference (it does when upscaling an image, but not when downscaling). See the attached project code example.
Is this something that GPUs can normally do, and if so, why isn't DrawBitmap doing it? Is there an alternative that will also work on different platforms? If I reduce an image in something like Inkscape it does a much better job, although I'm not sure whether the GPU is being used for the downsampling.
Attachment: Draw_bitmap_small.zip
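The thin-line loss can be reproduced in one dimension. The following is a minimal sketch as a plain Delphi console program (no FMX; ShrinkNearest, ShrinkBox and ToText are hypothetical helpers written for illustration). It assumes that a 2:1 nearest-neighbour reduction keeps every second pixel, which is one plausible explanation for the screenshots, not something confirmed from the FMX source. A 1-pixel black line at an odd index is dropped entirely by subsampling, while box averaging keeps it as grey.

```pascal
program NearestVsBox;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Nearest-neighbour 2:1 reduction: keeps every second sample and
// discards the others.
function ShrinkNearest(const Src: TArray<Byte>): TArray<Byte>;
var
  I: Integer;
begin
  SetLength(Result, Length(Src) div 2);
  for I := 0 to High(Result) do
    Result[I] := Src[2 * I];
end;

// Box 2:1 reduction: each output sample is the mean of the two input
// samples it covers, so no input sample is ignored.
function ShrinkBox(const Src: TArray<Byte>): TArray<Byte>;
var
  I: Integer;
begin
  SetLength(Result, Length(Src) div 2);
  for I := 0 to High(Result) do
    Result[I] := (Src[2 * I] + Src[2 * I + 1]) div 2;
end;

function ToText(const A: TArray<Byte>): string;
var
  B: Byte;
begin
  Result := '';
  for B in A do
    Result := Result + IntToStr(B) + ' ';
  Result := Trim(Result);
end;

var
  Row: TArray<Byte>;
begin
  // A white row (255) containing a 1-pixel black line (0) at index 3.
  Row := TArray<Byte>.Create(255, 255, 255, 0, 255, 255, 255, 255);
  WriteLn('nearest: ', ToText(ShrinkNearest(Row)));  // nearest: 255 255 255 255
  WriteLn('box:     ', ToText(ShrinkBox(Row)));      // box:     255 127 255 255
end.
```

Whether the line survives under nearest-neighbour sampling depends only on whether it happens to sit on a sampled index, which is why thin lines flicker in and out as the scale changes.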
KodeZwerg (posted April 30, 2023):
How about using an ImageViewer control? Put your bitmap in it and set the control's BestFit property.
XylemFlow (posted May 2, 2023):
On 4/30/2023 at 6:46 PM, KodeZwerg said:
> How about using an ImageViewer control, put your bitmap in and give the control the "BestFit" property.
That doesn't do a great job either. Looking into the code, it appears to use DrawBitmap as well. However, it wouldn't help me anyway: I need to be able to render the images to a TCanvas. I'm updating the canvas while the objects are dragged around in real time, which is why I need high performance.
Rollo62 (posted May 3, 2023):
I'm not sure what you want to achieve, but you should probably look at methods that use resampling: https://stackoverflow.com/questions/11190472/transparent-image-control-with-resampling-in-delphi
I would recommend Image32.
Renate Schaaf (posted May 3, 2023):
Your handcrafted routine isn't *that* slow; just turn on compiler optimization. As for resampling, I had started to port my parallel bitmap resampler to FMX, but then I thought, hey, these guys can use Direct2D, so there won't be any demand. Now, seeing how poor the quality is for (supposedly) bilinear rescaling, I have continued working on it. A first version is usable, though only on Windows for the time being. I just have to add some demos, and I'll probably upload it later today to https://github.com/rmesch/Parallel-Bitmap-Resampler, just in case you're interested.
angusj (posted May 3, 2023):
Looking at the image in the OP, it looks like FMX is using a nearest-neighbour resampler, which is fast but poor quality. A bilinear or bicubic resampler would certainly be much better, but for downsampling specifically, a BoxDownSampling resampler would be better still.
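Box downsampling by an integer factor N simply averages each N x N block of source pixels with equal weights. The sketch below generalises the idea to any N on a tightly packed 32-bit buffer (4 bytes per pixel); BoxDownsample is a hypothetical name, and trailing rows or columns that do not fill a whole block are ignored. Averaging each channel independently is correct for premultiplied alpha; with straight alpha, the colour channels would need to be weighted by alpha.

```pascal
program BoxDownsampleDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Box downsampling by an integer factor N of a tightly packed 32-bit buffer
// (4 bytes per pixel, any channel order). Every source pixel in an N x N
// block gets the same weight 1/(N*N).
function BoxDownsample(const Src: TArray<Byte>; SrcW, SrcH, N: Integer;
  out DstW, DstH: Integer): TArray<Byte>;
var
  X, Y, BX, BY, C, Sum, Row: Integer;
begin
  DstW := SrcW div N;
  DstH := SrcH div N;
  SetLength(Result, DstW * DstH * 4);
  for Y := 0 to DstH - 1 do
    for X := 0 to DstW - 1 do
      for C := 0 to 3 do
      begin
        Sum := 0;
        for BY := 0 to N - 1 do
        begin
          Row := (Y * N + BY) * SrcW;  // index of the first pixel of this source row
          for BX := 0 to N - 1 do
            Sum := Sum + Src[(Row + X * N + BX) * 4 + C];
        end;
        // Round to nearest rather than truncate.
        Result[(Y * DstW + X) * 4 + C] := (Sum + (N * N) div 2) div (N * N);
      end;
end;

var
  Src, Dst: TArray<Byte>;
  W, H, I: Integer;
begin
  // 3x3 opaque white image (all bytes 255) with one black pixel in the middle.
  SetLength(Src, 3 * 3 * 4);
  FillChar(Src[0], Length(Src), 255);
  for I := 0 to 2 do                 // blue, green, red of pixel (1, 1)
    Src[(1 * 3 + 1) * 4 + I] := 0;
  Dst := BoxDownsample(Src, 3, 3, 3, W, H);
  WriteLn(W, 'x', H, ': B=', Dst[0], ' A=', Dst[3]);  // 1x1: B=227 A=255
end.
```

The single black pixel still contributes 1/9 of the result, so thin features fade rather than vanish.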
Anders Melander (posted May 4, 2023):
18 hours ago, angusj said:
> it looks like FMX is using a nearest neighbour resampler, which is fast but poor quality.
I think it's just a bug in their implementation. They appear to be AND'ing the pixels instead of OR'ing them. Even GDI's COLORONCOLOR or STRETCH_DELETESCANS modes, which are just about the fastest methods there are, with the worst quality, would produce a better result.
17 hours ago, angusj said:
> but for down sampling specifically, a BoxDownSampling resampler would be better still.
Possibly, but the examples on that page are cooked to show the result you want: they only really demonstrate the effect of a downsample, followed by a cubic upsample, followed by a linear downsample (you've let the browser shrink the final bitmap). A fair comparison would be to compare the unscaled, downsampled results. What the results look like when upsampled again with a cubic resampler is not relevant to the downsample quality.
[Images: original; downsampled with a box filter; downsampled with a linear filter; downsampled with a cubic filter]
XylemFlow (posted May 4, 2023):
Thanks, everyone, for those suggestions. However, I don't think anyone has suggested a way to get the GPU to handle this. My hand-written code is about as fast as the CPU can go (I also have code for downscaling by exactly a factor of 2, which is faster still). But my original question was about doing this on the GPU, because I'm dealing with animated, real-time graphics. I also need to draw these images onto a canvas at an angle, which I do by setting TCanvas.Matrix and calling TCanvas.DrawBitmap. I could write code to shrink and rotate, but that would be super slow compared to the GPU.

```pascal
uses
  FMX.Graphics; // TBitmap, TBitmapData, TMapAccess

// Shrink a bitmap by exactly a factor of 2 (2x2 box average per channel).
// ABitmapOut must already be sized to half the width/height of ABitmap.
// Both bitmaps are caller-owned and only mapped here, so they are passed
// as const rather than out.
procedure ShrinkFast(const ABitmap: TBitmap; const ABitmapOut: TBitmap);
var
  Lx, Ly, R: Integer;
  P1, P2, P3, P4, POut, PRowStart, PRowStartOut: PByte;
  W, HM, WL: Integer;
  LRowSizeOut, LRowSize: Integer;
  bdata, bdatao: TBitmapData;
begin
  if (ABitmapOut.Width = 0) or (ABitmapOut.Height = 0) then
    Exit;
  if not ABitmap.Map(TMapAccess.Read, bdata) then
    Exit;
  if not ABitmapOut.Map(TMapAccess.Write, bdatao) then
  begin
    ABitmap.Unmap(bdata);
    Exit;
  end;
  try
    W := ABitmapOut.Width;
    R := ABitmap.Width div W; // shrink ratio
    if R <> 2 then
      Exit; // the finally block still unmaps both bitmaps
    HM := ABitmapOut.Height - 1;
    WL := W - 3; // the unrolled loop runs while at least 4 output pixels remain
    PRowStart := PByte(bdata.GetScanline(0));
    LRowSize := bdata.Pitch;
    PRowStartOut := PByte(bdatao.GetScanline(0));
    LRowSizeOut := bdatao.Pitch;
    for Ly := 0 to HM do
    begin
      // P1/P3: left/right pixel in the top input row; P2/P4: the row below
      P1 := PRowStart;
      P2 := P1 + LRowSize;
      P3 := P1 + 4;
      P4 := P2 + 4;
      POut := PRowStartOut;
      Lx := 0;
      // Set each output byte to the average of the 2x2 box of input bytes.
      while Lx < WL do
      begin // loop unrolled by 4 output pixels
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // blue
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // green
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // red
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1, 5); Inc(P2, 5); Inc(P3, 5); Inc(P4, 5); Inc(POut); // alpha

        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // blue
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // green
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // red
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1, 5); Inc(P2, 5); Inc(P3, 5); Inc(P4, 5); Inc(POut); // alpha

        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // blue
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // green
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // red
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1, 5); Inc(P2, 5); Inc(P3, 5); Inc(P4, 5); Inc(POut); // alpha

        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // blue
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // green
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // red
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1, 5); Inc(P2, 5); Inc(P3, 5); Inc(P4, 5); Inc(POut); // alpha
        Inc(Lx, 4);
      end;
      while Lx < W do // remaining 0..3 output pixels
      begin
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // blue
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // green
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut); // red
        POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; Inc(P1, 5); Inc(P2, 5); Inc(P3, 5); Inc(P4, 5); Inc(POut); // alpha
        Inc(Lx);
      end;
      Inc(PRowStartOut, LRowSizeOut);
      Inc(PRowStart, LRowSize shl 1); // advance two input rows
    end;
  finally
    ABitmap.Unmap(bdata);
    ABitmapOut.Unmap(bdatao);
  end;
end;
```
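One way to reuse an exact 2x shrink such as ShrinkFast for arbitrary scales is a mipmap-style plan: halve repeatedly until the remaining reduction is less than 2x, then let DrawBitmap (with its own rotation via TCanvas.Matrix) do the last step. This is a swapped-in technique, not something proposed in the thread, and PlanHalvings is a hypothetical helper. The halved copies could be computed once per image and cached, so that only the final GPU draw happens per frame.

```pascal
program PlanHalvingsDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// For a target scale in (0, 1], returns how many exact 2x box shrinks to
// apply first, and the scale still left for the final draw call, which then
// always lies in (0.5, 1].
procedure PlanHalvings(Scale: Double; out Halvings: Integer; out Remaining: Double);
begin
  if (Scale <= 0) or (Scale > 1) then
    raise EArgumentOutOfRangeException.Create('Scale must be in (0, 1]');
  Halvings := 0;
  Remaining := Scale;
  while Remaining <= 0.5 do
  begin
    Remaining := Remaining * 2;
    Inc(Halvings);
  end;
end;

var
  N: Integer;
  R: Double;
begin
  PlanHalvings(0.1, N, R);
  WriteLn(N, ' halvings, then draw at ', R:0:2);  // 3 halvings, then draw at 0.80
end.
```

With at most a 2:1 reduction left for the final pass, a bilinear or even nearest-neighbour draw has far fewer source pixels to skip than at 10:1.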
Renate Schaaf (posted May 4, 2023):
3 minutes ago, XylemFlow said:
> However, I don't think anyone has suggested a way to get the GPU to handle this.
Well, if one could get one's hands on the Direct2D canvas and the Direct2D rendering context, one could write a descendant of the FMX TBitmap that uses the higher-quality setting possible with Direct2D. I just can't see how, but I'm a newbie.
Rollo62 (posted May 4, 2023):
17 minutes ago, XylemFlow said:
> Thanks everyone for those suggestions. However, I don't think anyone has suggested a way to get the GPU to handle this.
If your circles are that critical, maybe it's worth looking into Skia4Delphi, which is the next new hot thing in town. It is well supported and favored by Embarcadero too. It probably adds a lot of extra baggage, but it seems to have endless possibilities on the cons side :-)
angusj (posted May 4, 2023):
2 hours ago, Anders Melander said:
> Possibly, but the examples on that page are cooked to show the result you want;
LOL 🤣. The pages perhaps are "cooked", but it wasn't intentional. Anyhow, I've just done a number of follow-up tests, and I'll concede that I can't spot the difference between all 3 renderers when downsampling various images. I'm surprised, and I'll need to refresh myself on the differences between these resamplers.
Anders Melander (posted May 4, 2023):
13 minutes ago, angusj said:
> The pages perhaps are "cooked" but it wasn't intentional.
Confirmation bias, most likely. It's a common trap, and I fall into it more often than I'd like to admit. Well, I guess I just did 🙂
58 minutes ago, XylemFlow said:
> My hand written code is about as fast as the CPU can go
I doubt it. Unless you're running this on a potato, you shouldn't really need the GPU for something as simple as this. Of course, the GPU will be faster, but the CPU should be fast enough. Rotation, translation, and scaling can be done in one go with a 3x3 (well, 2x3 actually) affine transformation. You "just" need to find a library that does that (or write it yourself). Graphics32 can do it, but it doesn't support FMX. I'm guessing Image32 can too.
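The 2x3 affine idea can be sketched in a few lines: scale, rotation and translation compose into a single matrix, and a resampler maps each destination pixel back through the inverse to find where to sample the source. TAffine, Compose, Invert and the other names are hypothetical (not Graphics32, Image32 or FMX API); the sketch uses a row-vector convention, x' = M11*x + M21*y + M31 and y' = M12*x + M22*y + M32. The mapping only decides where to sample; the quality still depends on the filter used at each sample point.

```pascal
program Affine2x3;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // x' = M11*x + M21*y + M31;  y' = M12*x + M22*y + M32
  TAffine = record
    M11, M12, M21, M22, M31, M32: Double;
  end;
  TPointD = record
    X, Y: Double;
  end;

function Affine(M11, M12, M21, M22, M31, M32: Double): TAffine;
begin
  Result.M11 := M11; Result.M12 := M12;
  Result.M21 := M21; Result.M22 := M22;
  Result.M31 := M31; Result.M32 := M32;
end;

function Pt(X, Y: Double): TPointD;
begin
  Result.X := X;
  Result.Y := Y;
end;

function Scaling(SX, SY: Double): TAffine;
begin
  Result := Affine(SX, 0, 0, SY, 0, 0);
end;

function Rotation(Radians: Double): TAffine;
begin
  Result := Affine(Cos(Radians), Sin(Radians), -Sin(Radians), Cos(Radians), 0, 0);
end;

function Translation(TX, TY: Double): TAffine;
begin
  Result := Affine(1, 0, 0, 1, TX, TY);
end;

// Returns the matrix that applies A first, then B.
function Compose(const A, B: TAffine): TAffine;
begin
  Result.M11 := A.M11 * B.M11 + A.M12 * B.M21;
  Result.M12 := A.M11 * B.M12 + A.M12 * B.M22;
  Result.M21 := A.M21 * B.M11 + A.M22 * B.M21;
  Result.M22 := A.M21 * B.M12 + A.M22 * B.M22;
  Result.M31 := A.M31 * B.M11 + A.M32 * B.M21 + B.M31;
  Result.M32 := A.M31 * B.M12 + A.M32 * B.M22 + B.M32;
end;

function Invert(const M: TAffine): TAffine;
var
  Det: Double;
begin
  Det := M.M11 * M.M22 - M.M12 * M.M21;
  if Abs(Det) < 1E-12 then
    raise Exception.Create('Matrix is not invertible');
  Result.M11 := M.M22 / Det;
  Result.M12 := -M.M12 / Det;
  Result.M21 := -M.M21 / Det;
  Result.M22 := M.M11 / Det;
  Result.M31 := (M.M21 * M.M32 - M.M22 * M.M31) / Det;
  Result.M32 := (M.M12 * M.M31 - M.M11 * M.M32) / Det;
end;

function Transform(const M: TAffine; const P: TPointD): TPointD;
begin
  Result.X := M.M11 * P.X + M.M21 * P.Y + M.M31;
  Result.Y := M.M12 * P.X + M.M22 * P.Y + M.M32;
end;

var
  M, Inv: TAffine;
  P: TPointD;
begin
  // Scale to 50 %, rotate 90 degrees, then move 100 px right: one matrix.
  M := Compose(Compose(Scaling(0.5, 0.5), Rotation(Pi / 2)), Translation(100, 0));
  P := Transform(M, Pt(10, 0));
  WriteLn(Round(P.X), ', ', Round(P.Y));  // 100, 5
  // A resampler walks the destination pixels and maps each one back
  // through the inverse to find where to sample the source image.
  Inv := Invert(M);
  P := Transform(Inv, P);
  WriteLn(Round(P.X), ', ', Round(P.Y));  // 10, 0
end.
```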
XylemFlow (posted May 4, 2023):
2 hours ago, Rollo62 said:
> If your circles are that critical, maybe it's worth if you are looking into Skia4Delphi, which is the next, new hot thing in town. It is well-supported and in favor of Embarcadero too, but probably adding a lot of extra baggage too, but seems to have endless possibilities on the cons side :-)
The circles are just an example. My users could load any image and then want to animate it at various scales and angles. I've tried Skia4Delphi. One issue for me is that it doesn't use the GPU when drawing to an off-screen TBitmap, whereas TCanvasD2D does.
1 hour ago, Anders Melander said:
> Unless you're running this on a potato you shouldn't really need the GPU for something as simple as this. Of course, the GPU will be faster but the CPU should be fast enough. Rotation, translation, and scaling can be done in one go with a 3x3 (well, 2x3 actually) affine transformation. You "just" need to find a library that does that (or write it yourself). Graphics32 can do it but it doesn't support FMX. I'm guessing Image32 can too.
I've already benchmarked TCanvasD2D (using the GPU) against TCanvasGDIPlus (without the GPU) on Windows, and TCanvasD2D is significantly faster. That tells me that the GPU is making a big difference even with a fast library. I'm not running on a potato either, but my users might be (I use a potato for testing, to make sure the software works on all user setups). A previous version of my software was developed in VCL and rendered the images with scale and rotation in software, so I already have those libraries. There was a significant performance boost moving to FMX, so there's no going back. I'm doing full-screen animation at up to 30 fps, so I need to make use of any hardware boost available.
You said that you think the code is AND-ing rather than OR-ing. What makes you think that, rather than it just using nearest-neighbour subsampling? Surely the Delphi code is just sending instructions to the GPU, and the GPU is unlikely to be making an error like that.
Anders Melander (posted May 4, 2023):
54 minutes ago, XylemFlow said:
> That tells me that the GPU is making a big difference even with a fast library.
GDI+ is generally not a fast library...
1 hour ago, XylemFlow said:
> A previous version of my software was developed in VCL and rendered the images with scale and rotation in software, so I have those libraries already. There was a significant performance boost moving to FMX, so there's no going back. I'm doing full screen animation at up to 30fps so I need to make use of any hardware boost available.
Okay. I guess I'll take your word for that, since you've actually tried it and I'm only speculating, but I would really expect a significantly higher FPS (on a "reasonably" sized screen) to be possible without hardware assistance. I mean, what did we do before we had access to the GPU? Again, I'm not arguing that the GPU isn't the faster solution; I'm just surprised that it's necessary. Can you remember what bitmap size and resampler type you used when you tried this with Graphics32 (if that was what you used)?
55 minutes ago, XylemFlow said:
> You said that you think the code is AND-ing rather than OR-ing. What makes you think that rather than it just using nearest neighbour sub-sampling? Surely the Delphi code is just sending instructions to the GPU and the GPU is unlikely to be making an error like that.
Now that I think of it, that was a brain fart on my part; it's OR-ing. I was thinking that since it's dropping black pixels it must be AND-ing, but of course, since black isn't a color but rather the absence of color, it's the other way round. It's OR-ing, so white $xxFFFFFF is replacing black $xx000000.
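A toy illustration of the OR/AND distinction, for two 32-bit $AARRGGBB pixels that collapse into one output pixel. It is not FMX's actual code path, which the thread does not establish; the constants and AveragePixel are hypothetical names. OR-ing lets white win and erases a thin black line, AND-ing lets black win, and a box filter yields mid grey.

```pascal
program CombineDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  White: Cardinal = $FFFFFFFF;  // opaque white, $AARRGGBB
  Black: Cardinal = $FF000000;  // opaque black

// Per-channel mean of two 32-bit pixels (what a 2:1 box filter does).
function AveragePixel(A, B: Cardinal): Cardinal;
var
  I, Shift: Integer;
begin
  Result := 0;
  for I := 0 to 3 do
  begin
    Shift := I * 8;
    Result := Result or
      (((((A shr Shift) and $FF) + ((B shr Shift) and $FF)) div 2) shl Shift);
  end;
end;

function Hex(V: Cardinal): string;
begin
  Result := IntToHex(Int64(V), 8);
end;

begin
  // Two source pixels (white, then a 1-pixel black line) collapse into one.
  WriteLn('OR : ', Hex(White or Black));              // FFFFFFFF: the line is lost
  WriteLn('AND: ', Hex(White and Black));             // FF000000: the line wins
  WriteLn('AVG: ', Hex(AveragePixel(White, Black)));  // FF7F7F7F: grey, as with a box filter
end.
```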
Anders Melander (posted May 4, 2023):
Btw, I don't know if the following is relevant to what you're doing: https://blog.grijjy.com/2021/01/14/shader-programming/
XylemFlow (posted May 5, 2023):
21 hours ago, Anders Melander said:
> Btw, I don't know if the following is relevant to what you're doing: https://blog.grijjy.com/2021/01/14/shader-programming/
That could be very useful. I've often wondered whether I could use the 3D capabilities of FMX for my 2D graphics. I may give textures a go in my circle demo. Implementing higher-quality interpolation for downsampling should just be a matter of writing it into the pixel shader. One downside is that different shaders need to be written to support all platforms, but that's not a big issue. The main issue is combining this with the other drawing primitives I use from TCanvas, such as lines, circles and text. The shader requires a 3D component, so mixing the two to draw to a single canvas seems difficult.
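A well-known shader trick fits the pixel-shader idea: for an exact 2x reduction, one bilinear texture fetch placed at the corner shared by four texels returns their plain average, so the GPU's bilinear sampler performs the 2x2 box filter; for 4x, averaging four such fetches (one per 2x2 quadrant of each 4x4 block) gives the 4x4 box average. The shader itself would be written per platform, as noted above; the sketch below only emulates on the CPU what a linear-filtering sampler computes, assuming texel (i, j) is centred at (i + 0.5, j + 0.5), the usual GPU convention. Texel, SampleBilinear and BoxAverage2x2 are hypothetical names.

```pascal
program BilinearCornerTrick;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  W = 4;
  H = 2;
  // Single-channel 4x2 test image, row-major.
  Img: array[0..W * H - 1] of Double = (0, 255, 10, 20,
                                        255, 0, 30, 40);

function Texel(X, Y: Integer): Double;
begin
  Result := Img[Y * W + X];
end;

// Bilinear sample with texel centres at integer + 0.5. No edge clamping:
// (U, V) must lie strictly between the outermost texel centres.
function SampleBilinear(U, V: Double): Double;
var
  X0, Y0: Integer;
  FX, FY: Double;
begin
  X0 := Trunc(U - 0.5);
  Y0 := Trunc(V - 0.5);
  FX := (U - 0.5) - X0;
  FY := (V - 0.5) - Y0;
  Result := (1 - FX) * (1 - FY) * Texel(X0, Y0) +
            FX * (1 - FY) * Texel(X0 + 1, Y0) +
            (1 - FX) * FY * Texel(X0, Y0 + 1) +
            FX * FY * Texel(X0 + 1, Y0 + 1);
end;

function BoxAverage2x2(X, Y: Integer): Double;
begin
  Result := (Texel(X, Y) + Texel(X + 1, Y) + Texel(X, Y + 1) + Texel(X + 1, Y + 1)) / 4;
end;

var
  I: Integer;
begin
  // Output pixel I of the 2x-reduced image has its centre at source
  // coordinate (2*I + 1, 1): exactly the corner shared by four texels.
  for I := 0 to W div 2 - 1 do
    WriteLn('pixel ', I, ': bilinear=', SampleBilinear(2 * I + 1, 1):0:1,
      ' box=', BoxAverage2x2(2 * I, 0):0:1);
  // pixel 0: bilinear=127.5 box=127.5
  // pixel 1: bilinear=25.0 box=25.0
end.
```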
DelphiUdIT (posted May 5, 2023):
If you want to use the GPU, you can use the OpenCL standard. I use OpenCL through some computer vision libraries, but OpenCL is transparent to my code: I can enable or disable partial or full OpenCL functionality for the whole library at runtime, so a given function can run on the GPU or the CPU without the code being modified. However, with modern processors (I have been using the Intel i7 12xxx series since it came on the market), the differences in quality and performance are negligible for the vast majority of functions. Then take into account the cost of an additional graphics card (NVIDIA / Intel / AMD)... Careful use of threads and a good library could probably give better quality and performance for what you want to do (but I can't help you specifically, because I've never needed better performance or quality than standard image resizing).
Start here: Embarcadero blog on OpenCL
Bye
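The threads suggestion can be sketched with TParallel.For from the RTL's System.Threading unit (available since XE7): output rows of a 2x box shrink are independent, so they can be spread over the thread pool. The compiler directives follow the earlier advice to build pixel loops with optimization on; ShrinkParallel is a hypothetical name, and the buffer is a tightly packed 32-bit array rather than an FMX TBitmap. Whether this beats the GPU path is not claimed here.

```pascal
program ParallelShrink;

{$APPTYPE CONSOLE}
// Settings that matter for tight pixel loops (Debug builds usually have
// optimization off and may have range/overflow checking on).
{$OPTIMIZATION ON}
{$RANGECHECKS OFF}
{$OVERFLOWCHECKS OFF}

uses
  System.SysUtils, System.Threading, System.Diagnostics;

// 2x2 box shrink of a tightly packed 32-bit buffer. Each output row is
// written by exactly one task, so no locking is needed.
function ShrinkParallel(const Src: TArray<Byte>; SrcW, SrcH: Integer): TArray<Byte>;
var
  S, D: TArray<Byte>;
  DstW, Pitch: Integer;
begin
  S := Src;           // captured locals used inside the anonymous method
  DstW := SrcW div 2;
  Pitch := SrcW * 4;  // bytes per source row
  SetLength(D, DstW * (SrcH div 2) * 4);
  TParallel.For(0, SrcH div 2 - 1,
    procedure(Y: Integer)
    var
      X, C, I: Integer;
    begin
      for X := 0 to DstW - 1 do
      begin
        I := 2 * Y * Pitch + 8 * X;  // top-left byte of the 2x2 source block
        for C := 0 to 3 do
          D[(Y * DstW + X) * 4 + C] :=
            (S[I + C] + S[I + 4 + C] + S[I + Pitch + C] + S[I + Pitch + 4 + C]) shr 2;
      end;
    end);
  Result := D;
end;

var
  Src, Dst: TArray<Byte>;
  SW: TStopwatch;
begin
  SetLength(Src, 4000 * 4000 * 4);
  FillChar(Src[0], Length(Src), 200);
  SW := TStopwatch.StartNew;
  Dst := ShrinkParallel(Src, 4000, 4000);
  WriteLn('2000x2000 result in ', SW.ElapsedMilliseconds, ' ms; first byte = ', Dst[0]);
end.
```

For small bitmaps the thread-pool overhead can outweigh the gain, so it is worth timing this against a single-threaded loop on the target hardware.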
angusj (posted April 19):
On 5/4/2023 at 6:40 PM, Anders Melander said:
> A fair comparison would be to compare the unscaled, downsampled results.
On 5/4/2023 at 9:36 PM, angusj said:
> Anyhow, I've just done a number of followup tests and I'll concede that I can't spot the difference between all 3 renderers when downsampling various images.
I've just had another look at resampling, specifically downsampling, and I'm back to my starting assertion that box downsampling does produce better-quality images than general-purpose resampling algorithms. However, I will concede that, because these downsampled images are generally much smaller, it's usually difficult to spot the differences. For example:
This is the fruit image from above, resized to 1/3 of its original size using a bicubic resampler: [image]
This is the same image resized to 1/3 using a box downsampler: [image]
Yes, it's hard to spot the differences unless you compare them in a decent image editor (or just zoom in using your web browser). Yet here's a more extreme example of downsampling (scaled to 0.1 of original size) where the quality differences are very noticeable:
Bicubic kernel resampler: [image]
Box downsampler: [image]
Original image: [image]
And this does make sense once you understand the differences between these algorithms. Consider downsampling an image to 1/3 of its size (where each 3 x 3 grid of pixels merges into a single pixel): box downsampling weights every pixel in each 3 x 3 grid equally, whereas general-purpose kernel resamplers heavily weight the pixels that are closer to the middle of these 3 x 3 grids.
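The 3 x 3 argument can be checked numerically in one dimension. The sketch computes normalised weights of the source pixels around output pixel 0 at a 1/3 reduction, for a box kernel and for a tent (linear) kernel stretched to the output pixel size. Stretching the kernel by 1/scale is a common way kernel resamplers downsample; it is assumed here, not taken from Graphics32 or Image32. BoxKernel, TentKernel, Weights and PrintWeights are hypothetical names.

```pascal
program KernelWeights;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Kernel evaluated at a distance measured in OUTPUT pixels.
  TKernel = function(D: Double): Double;

// Box: weight 1 inside one output pixel's footprint, 0 outside.
function BoxKernel(D: Double): Double;
begin
  if Abs(D) < 0.5 then
    Result := 1
  else
    Result := 0;
end;

// Tent (linear) kernel with a radius of one output pixel.
function TentKernel(D: Double): Double;
begin
  Result := 1 - Abs(D);
  if Result < 0 then
    Result := 0;
end;

// Normalised weights of source pixels First..Last for output pixel 0 when
// shrinking by Scale (< 1). Distances are converted to output-pixel units,
// which stretches the kernel by 1/Scale over the source image.
function Weights(Kernel: TKernel; Scale: Double; First, Last: Integer): TArray<Double>;
var
  J: Integer;
  Center, Total: Double;
begin
  Center := 0.5 / Scale;  // centre of output pixel 0 in source coordinates
  SetLength(Result, Last - First + 1);
  Total := 0;
  for J := First to Last do
  begin
    Result[J - First] := Kernel((J + 0.5 - Center) * Scale);
    Total := Total + Result[J - First];
  end;
  for J := 0 to High(Result) do
    Result[J] := Result[J] / Total;
end;

procedure PrintWeights(const Name: string; const W: TArray<Double>);
var
  X: Double;
begin
  Write(Name, ':');
  for X in W do
    Write(' ', X:0:3);
  WriteLn;
end;

begin
  // Source pixels -1..3 around output pixel 0; pixels 0..2 form its 3-pixel block.
  PrintWeights('box ', Weights(BoxKernel, 1 / 3, -1, 3));   // box : 0.000 0.333 0.333 0.333 0.000
  PrintWeights('tent', Weights(TentKernel, 1 / 3, -1, 3));  // tent: 0.111 0.222 0.333 0.222 0.111
end.
```

With the stretched tent, the centre pixel keeps the same weight as with the box (1/3), but the block's edge pixels drop to 2/9 and pixels from the neighbouring blocks receive 1/9 each, so the result is centre-weighted and somewhat softer. In 2D the per-axis weights multiply.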
Anders Melander (posted April 20):
11 hours ago, angusj said:
> Yet here's a more extreme example of downsampling (scaled to 0.1 of original size) where the quality differences are very noticeable:
It looks to me as if there's a problem in your implementation... Here's what I get with a selection of Graphics32 kernels:
[Images: Box, Cubic, Linear, Cosine, Spline, Hermite]
Yes, there are differences, but IMO they all look good, even Spline, which shouldn't really be used for downsampling. Ignore the black line at the top of each image; it's caused by a bug in Firefox's clipboard handling of 32-bit RGBA bitmaps:
https://bugzilla.mozilla.org/show_bug.cgi?id=1866655
https://forums.getpaint.net/topic/124628-1-px-line-on-top-of-every-image-pasted-into-firefox-from-paintnet/
https://github.com/graphics32/graphics32/issues/257
angusj (posted April 20):
3 hours ago, Anders Melander said:
> It looks to me as if there's a problem in your implementation... Here's what I get with a selection of Graphics32 kernels:
If you avoid using TAffineTransformation and just use a resampler together with a renderer, then you do avoid this issue with pixelation. (In my Image32 graphics library, I use affine transformations without a renderer.)
Anders Melander (posted April 20):
23 minutes ago, angusj said:
> If you avoid using TAffineTransformation, and just use a resampler together with a renderer, then you do avoid this issue with pixelation. (In my Image32 graphics library, I use affine transformations without a renderer.)
So: a problem in your implementation, or rather a consequence of the way you have chosen to implement image resizing. Or did I misunderstand what you just wrote?
angusj (posted April 20):
2 minutes ago, Anders Melander said:
> So: A problem in your implementation - or rather a consequence of the way you have chosen to implement resizing images. Or did I misunderstand what you just wrote?
It's a problem with the Graphics32 library too if TAffineTransformation is used to do the scaling.
Anders Melander (posted April 20):
2 hours ago, angusj said:
> It's a problem with the Graphics32 library too if TAffineTransformation is used to do the scaling.
True. Luckily nobody does that 🙂
Here's the bitmap resized with TAffineTransformation.Scale(0.1, 0.1) and TKernelResampler with TCubicKernel: [image]
So pretty much as bad as yours: [image]
But anyway, I think we can conclude that the problem isn't with the cubic filter itself but rather with how it's applied.
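The conclusion that the problem lies in how the filter is applied can be illustrated by counting how many source pixels influence one output pixel at a 10:1 reduction. The sketch assumes, without verifying it against the Graphics32 or Image32 source, that the affine-transformation path evaluates the kernel at its native size in source pixels, whereas a dedicated resize stretches it by 1/scale. A tent kernel stands in for the cubic one to keep the arithmetic simple; ContributingPixels is a hypothetical helper.

```pascal
program KernelSupport;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Tent (linear) kernel, radius 1.
function Tent(D: Double): Double;
begin
  Result := 1 - Abs(D);
  if Result < 0 then
    Result := 0;
end;

// Number of source pixels with a non-zero weight for output pixel 0 when
// shrinking by Scale. If Widen is False the kernel is evaluated in source
// pixel units (radius 1 source pixel); if True it is stretched by 1/Scale.
function ContributingPixels(Scale: Double; Widen: Boolean): Integer;
var
  J, Reach: Integer;
  Center, D: Double;
begin
  Center := 0.5 / Scale;          // output pixel 0 centre, in source coordinates
  Reach := Trunc(2 / Scale) + 2;  // generous search range on either side
  Result := 0;
  for J := -Reach to Reach do
  begin
    D := J + 0.5 - Center;
    if Widen then
      D := D * Scale;
    if Tent(D) > 0 then
      Inc(Result);
  end;
end;

begin
  // A 10:1 reduction: each output pixel covers 10 source pixels per axis.
  WriteLn('kernel not widened: ', ContributingPixels(0.1, False));  // 2
  WriteLn('kernel widened:     ', ContributingPixels(0.1, True));   // 20
end.
```

In 2D the counts square: with the un-stretched kernel only 4 of the 100 source pixels under each output pixel contribute, which is consistent with the pixelated result, while the stretched kernel draws on all of them (plus a weighted border from neighbouring blocks).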