XylemFlow

Poor image quality with DrawBitmap when destination is smaller than source


FMX mostly does a very good job of antialiasing graphics; however, I notice that the quality of TCanvas.DrawBitmap is poor when the destination rect is smaller than the source rect. I'm running on Windows for now. This is especially obvious when the image contains thin lines, as the subsampling causes parts of the lines to disappear. I'm looking for a way to improve the quality without compromising too much on runtime, so ideally I'd like it to be done on the GPU. I feel that this is a job the GPU should be doing.

Below is an example with 3 different methods. The last method is my own code, which shows what I'm trying to achieve but is not done on the GPU and so is not as fast as I'd like it to be. It also won't work if I want to include some rotation as well as scale using TCanvas.Matrix. I have also tried changing the HighSpeed flag in the DrawBitmap function, but it doesn't seem to make a difference (it does when upscaling an image but not when downscaling). See the attached project code example.

 

Is this something that GPUs can normally do and if so, why isn't DrawBitmap doing it? Is there an alternative that will also work on different platforms? If I reduce an image in something like Inkscape it will do a much better job, although I'm not sure if the GPU is being used for the downsampling.

 

Rfoje.png

Draw_bitmap_small.zip

Edited by XylemFlow


How about using an ImageViewer control? Put your bitmap in and set the control's "BestFit" property.

On 4/30/2023 at 6:46 PM, KodeZwerg said:

How about using an ImageViewer control? Put your bitmap in and set the control's "BestFit" property.

That doesn't do a great job either. Looking into the code, it appears to use DrawBitmap as well. However it wouldn't help me anyway. I need to be able to render the images to a TCanvas. I'm updating the canvas for dragging the objects around in real time, which is why I need high performance.

Edited by XylemFlow


Your handcrafted routine isn't *that* slow, just turn on compiler-optimization.

As for resampling, I had started to port my parallel bitmap-resampler to FMX, but then I thought, hey, these guys can use Direct2D, there won't be a demand.

Now, seeing how poor the quality is for (supposedly) bilinear rescaling, I have continued working on it. A first version is usable on Windows only for the time being. I just have to add some demos, and I'll probably upload it later today to

https://github.com/rmesch/Parallel-Bitmap-Resampler

 

Just in case you might be interested.


Looking at the image in the OP, it looks like FMX is using a nearest neighbour resampler, which is fast but poor quality. Certainly a bilinear or bicubic resampler would be much better, but for down sampling specifically, a BoxDownSampling resampler would be better still. 
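
To make the difference concrete, here is a minimal sketch on a single row of grey values. It is not FMX's actual code; the type and function names (TByteRow, ThinLineRow, ShrinkNearest, ShrinkBox, DarkestPixel) are made up for illustration, and the nearest neighbour variant simply assumes the sample is taken near the centre of each cell.

```pascal
type
  TByteRow = array of Byte;

// Hypothetical test input: a white row (255) with one black pixel (0),
// standing in for a thin line.
function ThinLineRow(ALength, ALineIndex: Integer): TByteRow;
var
  I: Integer;
begin
  SetLength(Result, ALength);
  for I := 0 to ALength - 1 do
    Result[I] := 255;
  Result[ALineIndex] := 0;
end;

// Nearest neighbour: each output pixel copies one source pixel near the
// centre of its cell and ignores all the others.
function ShrinkNearest(const ASrc: TByteRow; AFactor: Integer): TByteRow;
var
  I: Integer;
begin
  SetLength(Result, Length(ASrc) div AFactor);
  for I := 0 to High(Result) do
    Result[I] := ASrc[I * AFactor + AFactor div 2];
end;

// Box filter: each output pixel is the rounded mean of all AFactor source
// pixels that it covers.
function ShrinkBox(const ASrc: TByteRow; AFactor: Integer): TByteRow;
var
  I, J, Sum: Integer;
begin
  SetLength(Result, Length(ASrc) div AFactor);
  for I := 0 to High(Result) do
  begin
    Sum := 0;
    for J := 0 to AFactor - 1 do
      Inc(Sum, ASrc[I * AFactor + J]);
    Result[I] := (Sum + AFactor div 2) div AFactor;
  end;
end;

// Darkest value in a row; 255 means the line has vanished completely.
function DarkestPixel(const ARow: TByteRow): Integer;
var
  I: Integer;
begin
  Result := 255;
  for I := 0 to High(ARow) do
    if ARow[I] < Result then
      Result := ARow[I];
end;
```

With ThinLineRow(12, 4) and a factor of 4, ShrinkNearest reads source indices 2, 6 and 10 and returns an all-white row, so the line is gone. ShrinkBox keeps a grey pixel (191) where the line was, which is why thin lines survive a box downsample.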

18 hours ago, angusj said:

it looks like FMX is using a nearest neighbour resampler, which is fast but poor quality.

I think it's just a bug in their implementation. They appear to be AND'ing the pixels instead of OR'ing them.

Even the GDI's COLORONCOLOR or STRETCH_DELETESCAN methods, which are just about the fastest methods there are, with the worst quality, would produce a better result.

 

17 hours ago, angusj said:

but for down sampling specifically, a BoxDownSampling resampler would be better still. 

Possibly, but the examples on that page are cooked to show the result you want; they only really demonstrate the effect of a downsample followed by a cubic upsample followed by a linear downsample (you've let the browser shrink the final bitmap).

 

A fair comparison would be to compare the unscaled, downsampled results. What the results would look like when upsampled again with a cubic resampler is not relevant to the downsample quality.

 

Original

original.png

Downsampled, box filter

box.png

Downsampled, linear filter

linear.png

Downsampled, cubic filter

cubic.png


Thanks everyone for those suggestions. However, I don't think anyone has suggested a way to get the GPU to handle this. My handwritten code is about as fast as the CPU can go (I also have code for downscaling by exactly a factor of 2, shown below, which is faster still). But my original question was about doing this on the GPU, because I'm dealing with animated real-time graphics. I also need to draw these images onto a canvas at an angle, which I do by setting TCanvas.Matrix before calling TCanvas.DrawBitmap. I could write code to do the shrink and rotate in software, but that would be super slow compared to the GPU.

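// Shrink a bitmap by a factor of 2. ABitmapOut must already be sized to half of ABitmap.
// ABitmapOut is only read and written through, never reassigned, so it is passed as const.
procedure ShrinkFast(const ABitmap : TBitmap ; const ABitmapOut : TBitmap);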
Var
  Lx, Ly, R : integer;
  P1, P2, P3, P4, POut, PRowStart, PRowStartOut : pByte;
  W, HM, WL : integer;
  LRowSizeOut, LRowSize : integer;
  bdata, bdatao : TBitmapData;
begin
  if (ABitmapOut.Width = 0) or (ABitmapOut.Height = 0) then Exit;

  ABitmap.Map(TMapAccess.Read, bdata);
  ABitmapOut.Map(TMapAccess.Write, bdatao);

  try
    W := ABitmapOut.Width;
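    R := ABitmap.Width div W; // shrink ratio
    // only an exact 2x shrink is supported, and the source needs two rows per output row
    if (R <> 2) or (ABitmap.Height < 2 * ABitmapOut.Height) then Exit;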

    HM := ABitmapOut.Height - 1;
    WL := W - 3;
    PRowStart := pByte(bdata.GetScanline(0));
    LRowSize := bdata.Pitch;
    PRowStartOut := pByte(bdatao.GetScanline(0));
    LRowSizeOut := bdatao.Pitch;
    if R = 2 then begin
      for Ly := 0 to HM do begin
        P1 := PRowStart;
        P2 := P1 + LRowSize;
        P3 := P1 + 4;
        P4 := P2 + 4;
        POut := PRowStartOut;
        Lx := 0;
        // set output pixel to the average of the 2X2 box of input pixels
        while Lx < WL do begin // loop unrolled by 4
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // blue
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // green
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // red
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // alpha
          Inc(P1,5); Inc(P2,5); Inc(P3,5); Inc(P4,5); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // blue
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // green
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // red
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // alpha
          Inc(P1,5); Inc(P2,5); Inc(P3,5); Inc(P4,5); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // blue
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // green
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // red
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // alpha
          Inc(P1,5); Inc(P2,5); Inc(P3,5); Inc(P4,5); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // blue
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // green
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // red
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // alpha
          Inc(P1,5); Inc(P2,5); Inc(P3,5); Inc(P4,5); Inc(POut);
          Inc(Lx, 4);
        end;
        while Lx < W do begin
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // blue
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // green
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // red
          Inc(P1); Inc(P2); Inc(P3); Inc(P4); Inc(POut);
          POut^ := (P1^ + P2^ + P3^ + P4^) shr 2; // alpha
          Inc(P1,5); Inc(P2,5); Inc(P3,5); Inc(P4,5); Inc(POut);
          Inc(Lx);
        end;
        Inc(PRowStartOut, LRowSizeOut);
        Inc(PRowStart, LRowSize shl 1);
      end;
    end;

  finally
    ABitmap.Unmap(bdata);
    ABitmapOut.Unmap(bdatao);
  end;
end;

 

Edited by XylemFlow

3 minutes ago, XylemFlow said:

However, I don't think anyone has suggested a way to get the GPU to handle this.

Well, if one could get one's hands on the Direct2D canvas and its rendering context, one could write a descendant of the FMX TBitmap which uses the higher-quality setting possible with Direct2D. I just can't see how, but I'm a newbie.

17 minutes ago, XylemFlow said:

Thanks everyone for those suggestions. However, I don't think anyone has suggested a way to get the GPU to handle this.

If your circles are that critical, maybe it's worth looking into Skia4Delphi, which is the new hot thing in town.

It is well supported and favored by Embarcadero too. It probably adds a lot of extra baggage, but on the plus side it seems to have endless possibilities :-)

2 hours ago, Anders Melander said:

Possibly, but the examples on that page are cooked to show the result you want;

LOL 🤣.

The pages perhaps are "cooked" but it wasn't intentional.

 

Anyhow, I've just done a number of followup tests and I'll concede that I can't spot the difference between all 3 renderers when downsampling various images.

I'm surprised and I'll need to refresh myself on the differences between these resamplers.

 

13 minutes ago, angusj said:

The pages perhaps are "cooked" but it wasn't intentional.

Confirmation bias, most likely. It's a common trap that I find myself in more often than I'd like to admit. Well, I guess I just did 🙂

 

58 minutes ago, XylemFlow said:

My handwritten code is about as fast as the CPU can go

I doubt it.

Unless you're running this on a potato you shouldn't really need the GPU for something as simple as this. Of course, the GPU will be faster but the CPU should be fast enough.

Rotation, translation, and scaling can be done in one go with a 3x3 (well, 2x3 actually) affine transformation. You "just" need to find a library that does that (or write it yourself). Graphics32 can do it but it doesn't support FMX. I'm guessing Image32 can too.
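
As a rough sketch of what "in one go" means: scale, rotation and translation collapse into six numbers, and a software resampler would typically walk the destination pixels and use the inverse matrix to find where each one comes from in the source. The record and function names below (TAffine, MakeAffine, InvertAffine, MapX, MapY) are hypothetical and not taken from Graphics32, Image32 or FMX.

```pascal
type
  // 2x3 affine matrix: X' = A*X + C*Y + E,  Y' = B*X + D*Y + F
  TAffine = record
    A, B, C, D, E, F: Double;
  end;

// Uniform scale, then rotation by AAngle (radians), then translation,
// all folded into a single matrix.
function MakeAffine(AScale, AAngle, ADx, ADy: Double): TAffine;
begin
  Result.A := AScale * Cos(AAngle);
  Result.B := AScale * Sin(AAngle);
  Result.C := -AScale * Sin(AAngle);
  Result.D := AScale * Cos(AAngle);
  Result.E := ADx;
  Result.F := ADy;
end;

// Inverse transform, used to map a destination pixel back to the source.
// Assumes a non-zero scale, so the determinant is non-zero.
function InvertAffine(const M: TAffine): TAffine;
var
  Det: Double;
begin
  Det := M.A * M.D - M.B * M.C;
  Result.A := M.D / Det;
  Result.B := -M.B / Det;
  Result.C := -M.C / Det;
  Result.D := M.A / Det;
  Result.E := (M.C * M.F - M.D * M.E) / Det;
  Result.F := (M.B * M.E - M.A * M.F) / Det;
end;

// Transformed X coordinate of the point (X, Y).
function MapX(const M: TAffine; X, Y: Double): Double;
begin
  Result := M.A * X + M.C * Y + M.E;
end;

// Transformed Y coordinate of the point (X, Y).
function MapY(const M: TAffine; X, Y: Double): Double;
begin
  Result := M.B * X + M.D * Y + M.F;
end;
```

For example, MakeAffine(0.5, Pi / 2, 10, 20) maps the point (4, 0) to (10, 22): halved to (2, 0), rotated a quarter turn to (0, 2), then shifted. Running (10, 22) through InvertAffine of the same matrix gives (4, 0) back. The matrix only says where to sample; the quality still depends on how many source pixels are averaged around that position.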

2 hours ago, Rollo62 said:

If your circles are that critical, maybe it's worth looking into Skia4Delphi, which is the new hot thing in town.

It is well supported and favored by Embarcadero too. It probably adds a lot of extra baggage, but on the plus side it seems to have endless possibilities :-)

The circles are just an example. My users could load any image and then want to animate it at various scales and angles.
I've tried Skia4Delphi. One issue for me is that it doesn't use the GPU when drawing to an off-screen TBitmap, whereas TCanvasD2D does.

 

1 hour ago, Anders Melander said:

Unless you're running this on a potato you shouldn't really need the GPU for something as simple as this. Of course, the GPU will be faster but the CPU should be fast enough.

Rotation, translation, and scaling can be done in one go with a 3x3 (well, 2x3 actually) affine transformation. You "just" need to find a library that does that (or write it yourself). Graphics32 can do it but it doesn't support FMX. I'm guessing Image32 can too.

I've already benchmarked TCanvasD2D (using GPU) against TCanvasGDIPlus (without GPU) on Windows and TCanvasD2D is significantly faster. That tells me that the GPU is making a big difference even with a fast library. I'm not running on a potato either, but my users might be (I use a potato for testing to make sure that it will work for all user setups). A previous version of my software was developed in VCL and rendered the images with scale and rotation in software, so I have those libraries already. There was a significant performance boost moving to FMX, so there's no going back. I'm doing full screen animation at up to 30fps so I need to make use of any hardware boost available.

You said that you think the code is AND-ing rather than OR-ing. What makes you think that rather than it just using nearest neighbour sub-sampling? Surely the Delphi code is just sending instructions to the GPU and the GPU is unlikely to be making an error like that.

54 minutes ago, XylemFlow said:

That tells me that the GPU is making a big difference even with a fast library.

GDI+ is generally not a fast library...

 

1 hour ago, XylemFlow said:

A previous version of my software was developed in VCL and rendered the images with scale and rotation in software, so I have those libraries already. There was a significant performance boost moving to FMX, so there's no going back. I'm doing full screen animation at up to 30fps so I need to make use of any hardware boost available.

Okay. I guess I'll take your word on that since you've actually tried it and I'm only speculating, but I would really expect a significantly higher FPS (on a "reasonably" sized screen) to be possible without hardware assist. I mean, what did we do before we got access to the GPU? Again, I'm not arguing that the GPU isn't the faster solution. I'm just surprised that it's necessary.

 

Can you remember what bitmap size and resampler type you used when you tried this with Graphics32 (if that was what you used)?

 

 

55 minutes ago, XylemFlow said:

You said that you think the code is AND-ing rather than OR-ing. What makes you think that rather than it just using nearest neighbour sub-sampling? Surely the Delphi code is just sending instructions to the GPU and the GPU is unlikely to be making an error like that.

Now that I think of it, that was a brain fart on my part; it's OR-ing.

I was thinking that since it's dropping black pixels it must be AND-ing but of course, since black isn't a color but rather the absence of color, it's the other way round. It's OR-ing so white $xxFFFFFF is replacing black $xx000000.

21 hours ago, Anders Melander said:

Btw, I don't know if the following is relevant to what you're doing:

https://blog.grijjy.com/2021/01/14/shader-programming/

 

That could be very useful. I've often wondered whether I could use the 3D capabilities of FMX for my 2D graphics. I may give textures a go in my circle demo. Implementing higher-quality interpolation for downsampling should just be a matter of writing it into the pixel shader. One downside is that different shaders need to be written to support all platforms, but that's not a big issue. The main issue is combining this with the other drawing primitives that I use from TCanvas, such as lines, circles and text. The shader requires a 3D component, so mixing the two to draw to a single canvas seems difficult.

Edited by XylemFlow


If you want to use the GPU, you can use the OpenCL standard.

I use OpenCL through some computer vision libraries, but OpenCL is transparent to my code.
I can enable or disable OpenCL, partially or fully, for the whole library at runtime, so a given function can run on the GPU or the CPU without the code being modified.

However, with modern processors (I have been using the Intel i7 12xxx series since it came on the market) the differences in terms of quality and performance are negligible for the vast majority of functions. Then there is the cost of an additional graphics card (NVIDIA / Intel / AMD) to take into account...

Careful use of threads and a good library could probably get you better quality and performance for what you want to do (but I can't help you specifically, because I've never needed better performance or quality than standard image resizing gives).

Start from here:

Embarcadero blog on OpenCL

 

Bye

On 5/4/2023 at 6:40 PM, Anders Melander said:

A fair comparison would be to compare the unscaled, downsampled results.

On 5/4/2023 at 9:36 PM, angusj said:

Anyhow, I've just done a number of followup tests and I'll concede that I can't spot the difference between all 3 renderers when downsampling various images.

 

I've just had another look at resampling and specifically downsampling, and I'm back to my starting assertion that box downsampling does produce better quality images than general purpose resampling algorithms. However, I will concede that, because these downsampled images are generally much smaller, it's usually difficult to spot these differences.

 

For example:

This is the fruit image from above, resized to 1/3 of its original size using a bicubic resampler:

fruit_bcr.png

This is the same fruit image resized to 1/3 of its original size using a box downsampler:

fruit_bds.png

 

Yes, it's hard to spot the differences unless you compare them with a decent image editor (or just zoom in using your web browser).


Yet here's a more extreme example of downsampling (scaled to 0.1 of original size) where the quality differences are very noticeable:

 

Bicubic kernel resampler:

text3_bcr.png

Box downsampler:

text3_bds.png

Original image:

text3.png

 

And this does make sense when you understand the differences between these algorithms.

Consider downsampling an image to 1/3 its size (where each 3 x 3 grid of pixels will merge into a single pixel) ...

box downsampling will weight every pixel equally in each 3 x 3 grid;

whereas general purpose kernel resamplers will heavily weight pixels that are closer to the middle of these 3 x 3 grids.
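
A one-dimensional sketch of that difference, under stated assumptions: a tent (linear) kernel stands in for the bicubic one, it is stretched by the shrink factor and centred on the middle pixel of the cell, and the shrink factor is odd. The names TWeights, BoxWeights, TentWeights and WeightAt are made up for illustration and are not from Image32 or Graphics32.

```pascal
type
  TWeights = array of Double;

// Box filter: each of the AScale source pixels in a cell gets the same weight.
function BoxWeights(AScale: Integer): TWeights;
var
  I: Integer;
begin
  SetLength(Result, AScale);
  for I := 0 to AScale - 1 do
    Result[I] := 1.0 / AScale;
end;

// Tent kernel stretched by AScale: the weight falls off linearly with the
// distance from the centre pixel and reaches zero AScale pixels away, so it
// covers 2 * AScale - 1 source pixels.
function TentWeights(AScale: Integer): TWeights;
var
  I, Offset: Integer;
  Sum: Double;
begin
  SetLength(Result, 2 * AScale - 1);
  Sum := 0;
  for I := 0 to High(Result) do
  begin
    Offset := I - (AScale - 1); // runs from -(AScale - 1) to +(AScale - 1)
    Result[I] := 1.0 - Abs(Offset) / AScale;
    Sum := Sum + Result[I];
  end;
  // normalise so that the weights sum to 1
  for I := 0 to High(Result) do
    Result[I] := Result[I] / Sum;
end;

// Convenience accessor for a single weight.
function WeightAt(const AWeights: TWeights; AIndex: Integer): Double;
begin
  Result := AWeights[AIndex];
end;
```

For a shrink factor of 3, BoxWeights gives 1/3, 1/3, 1/3, while TentWeights gives 1/9, 2/9, 1/3, 2/9, 1/9. The tent kernel favours the centre pixel but also reaches one pixel into each neighbouring cell, so every source pixel still contributes somewhere as long as the kernel is stretched to match the shrink factor.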

 

Edited by angusj

11 hours ago, angusj said:

Yet here's a more extreme example of downsampling (scaled to 0.1 of original size) where the quality differences are very noticeable:

It looks to me as if there's a problem in your implementation... Here's what I get with a selection of Graphics32 kernels:

Box

image.png

Cubic

image.png

Linear

image.png

Cosine

image.png

Spline

image.png

Hermite

image.png

 

Yes, there are differences but IMO they all look good. Even Spline which shouldn't really be used for down-sampling.

 

Ignore the black line at the top of each image; it's caused by a bug in Firefox's clipboard handling of 32-bit RGBA bitmaps.

3 hours ago, Anders Melander said:

It looks to me as if there's a problem in your implementation... Here's what I get with a selection of Graphics32 kernels:

If you avoid using TAffineTransformation, and just use a resampler together with a renderer, then you do avoid this issue with pixelation.

(In my Image32 graphics library, I use affine transformations without a renderer.)

Edited by angusj

23 minutes ago, angusj said:

If you avoid using TAffineTransformation, and just use a resampler together with a renderer, then you do avoid this issue with pixelation.

(In my Image32 graphics library, I use affine transformations without a renderer.)

So: A problem in your implementation - or rather a consequence of the way you have chosen to implement resizing images. Or did I misunderstand what you just wrote?

2 minutes ago, Anders Melander said:

So: A problem in your implementation - or rather a consequence of the way you have chosen to implement resizing images. Or did I misunderstand what you just wrote?

It's a problem with the Graphics32 library too if TAffineTransformation is used to do the scaling.

2 hours ago, angusj said:

It's a problem with the Graphics32 library too if TAffineTransformation is used to do the scaling.

True. Luckily nobody does that 🙂

 

Here's the bitmap resized with TAffineTransformation.Scale(0.1, 0.1) and TKernelResampler with TCubicKernel:

xxx.png

So pretty much as bad as yours:

bicubic.png

 

But anyway, I think we can conclude that the problem isn't with the cubic filter itself but more with how it's applied.
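
One plausible mechanism for "how it's applied" is the footprint of the kernel. This is an assumption; nothing in this thread confirms what either library does internally. If the kernel keeps its native radius while the image is shrunk 10x, each output pixel only sees a couple of source pixels and the rest are skipped. If the radius is stretched by the shrink factor, every source pixel contributes. The sketch below uses a tent kernel in place of the cubic one, and the names ThinLineRow and TentSample are hypothetical.

```pascal
type
  TByteRow = array of Byte;

// Hypothetical test input: a white row (255) with one black pixel (0).
function ThinLineRow(ALength, ALineIndex: Integer): TByteRow;
var
  I: Integer;
begin
  SetLength(Result, ALength);
  for I := 0 to ALength - 1 do
    Result[I] := 255;
  Result[ALineIndex] := 0;
end;

// Tent-weighted mean of the source pixels lying within ARadius of ACentre.
// Pixel I is taken to have its centre at I + 0.5.
function TentSample(const ASrc: TByteRow; ACentre, ARadius: Double): Integer;
var
  I: Integer;
  W, Sum, WSum: Double;
begin
  Sum := 0;
  WSum := 0;
  for I := 0 to High(ASrc) do
  begin
    W := 1.0 - Abs((I + 0.5) - ACentre) / ARadius;
    if W > 0 then
    begin
      Sum := Sum + W * ASrc[I];
      WSum := WSum + W;
    end;
  end;
  if WSum > 0 then
    Result := Round(Sum / WSum)
  else
    Result := 0;
end;
```

Shrinking the 10-pixel row ThinLineRow(10, 8) to a single pixel centred at 5.0: with a radius of 1.0 only pixels 4 and 5 are read and the result is 255, so the black pixel at index 8 is lost. With a radius of 10.0 the result is 233, so the line leaves a trace. The kernel shape is the same in both cases; only the footprint differs.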
