Renate Schaaf

Parallel Resampling of (VCL-) Bitmaps


55 minutes ago, Anders Melander said:

Please verify that the comments I've added in the source are correct

Correct and very clear.

 

I like the introduction of the MappingTablePrecision... constants.


I might have introduced a bug in GR32_Resamplers: as it stands, the left bound of the source rectangle is ignored. The fix is simple:

 

Line 1778 needs to be

 

SourceColor := @Src.Bits[ClusterY[0].Pos * Src.Width+SrcRect.Left];  //+SrcRect.Left was missing!

and line 1806:

        SourceColor := @Src.Bits[ClusterY[Y].Pos * Src.Width+SrcRect.Left];//+SrcRect.Left was missing!

Hope you read this, Anders. If I don't hear from you, I'll create an issue on GitHub.

 

Edit: I definitely introduced it by changing the order of the loops; I checked against an old version. Instead of

+SrcRect.Left

one should probably use

+MapXLoPos

 

Renate

Edited by Renate Schaaf

Hi Anders,

Just tried the new version of Graphics32 and found that the downscaling with Box looks as cr***y as before we changed the radius to 0.5, which IS the logically correct value, since the box function has support [-0.5, 0.5]. I can't see right now what goes wrong with the upscaling; it must be something different.
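For reference, a support of [-0.5, 0.5] means the box kernel spans exactly one source pixel, which is why radius 0.5 is the logically correct value: upscaling degenerates to nearest neighbour and downscaling to plain averaging. A sketch of how the contributor weights fall out of that (hypothetical helper, not the GR32/uScale code):

```python
import math

def box(x):
    # box kernel with support [-0.5, 0.5) (half-open to avoid double counting)
    return 1.0 if -0.5 <= x < 0.5 else 0.0

def box_weights(dst_x, scale):
    # normalized weights of the source pixels contributing to one
    # destination sample; the kernel widens by 1/scale when downscaling
    center = (dst_x + 0.5) / scale - 0.5
    r = 0.5 / min(scale, 1.0)
    lo, hi = math.floor(center - r), math.ceil(center + r)
    w = {i: box((i - center) / (2.0 * r)) for i in range(lo, hi + 1)}
    total = sum(w.values())
    return {i: v / total for i, v in w.items() if v > 0.0}
```

Downscaling by 2 gives each destination pixel equal weights of 0.5 on two source pixels; upscaling gives a single weight of 1.0, i.e. nearest neighbour.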

Anyway, I don't see any problems with upscaling in my code; I just tried it with a factor of 20.

 

You did a lot of work on Graphics32; I will have a closer look.

 

Renate

6 hours ago, FreeDelphiPascal said:

Hi. Do you have something similar but for parallel jpeg decoding?

Sorry, no, but it sounds like a good idea. Naively, that is; I have no idea how parallelizable JPEG decoding is 🙂

1 hour ago, Renate Schaaf said:

I have no idea how parallelizable JPEG decoding is

Most modern JPEGs require sequential decompression due to the compression algorithms used (decompression of a block is based on the result of the previous block), so there's not much to parallelize.

 

JPEGs with lots of restart markers in the compression stream (a restart marker means that the result of the previous blocks isn't needed) would benefit from parallelization, but it is my understanding that those have become very rare, as the problem they were meant to solve (data corruption during download via modem) no longer exists.
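For the curious: restart markers are the two-byte sequences FF D0 through FF D7 (RST0..RST7). A naive scan can show whether a file contains any (a sketch, not a real JPEG parser; it can misfire on stray marker-like bytes):

```python
def has_restart_markers(data: bytes) -> bool:
    # look for FF D0 .. FF D7 (RST0..RST7) byte pairs anywhere in the stream
    return any(data[i] == 0xFF and 0xD0 <= data[i + 1] <= 0xD7
               for i in range(len(data) - 1))
```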


Hi Anders,

Thanks for explaining. I had a feeling that the compression is too "global" for parallelization. But ...

From what I have read in the meantime, it seems that parts of the decompression could be done in parallel.

This link is about compression, but couldn't it apply to decompression too? (Not that I know anything about it 🙂)

https://stackoverflow.com/questions/61850421/how-to-perform-jpeg-encoding-of-a-big-rgb-image-in-parallel

Anyway, there are research papers which claim that they got a speedup from doing the decoding partly in parallel.

3 minutes ago, Renate Schaaf said:

This link is about compression, but couldn't it apply to decompression too? (Not that I know anything about it 🙂)

https://stackoverflow.com/questions/61850421/how-to-perform-jpeg-encoding-of-a-big-rgb-image-in-parallel

Yes, there will of course always be some parts that can be parallelized, but the problem is that the expensive part, the Huffman decoding, cannot be.

 

6 minutes ago, Renate Schaaf said:

Anyway, there are research papers which claim that they got a speedup from doing the decoding partly in parallel.

I'm guessing they used "cooked" jpegs because there's really not much magic that can be done here.

 

I think the effort is better spent on using SSE, AVX, or the GPU to decode - which is also what I believe most high-performance decoders do.

9 minutes ago, Anders Melander said:

I'm guessing they used "cooked" jpegs because there's really not much magic that can be done here.

OK, I'll stop thinking about it. Time to get some sleep 🙂


Sorry, my question was maybe not very clear. I am talking about decoding multiple JPEG files in parallel,
maybe in a pool of threads equal to the number of cores...

39 minutes ago, FreeDelphiPascal said:

My question was maybe not very clear.

Oh, you think? :classic_dry:

 

39 minutes ago, FreeDelphiPascal said:

I am talking about decoding multiple JPG files in parallel.
Maybe in a pool of threads equal to the number of cores... 

Yes, of course you can do that.

You don't need a special library to decode a jpeg in a thread.
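Indeed, that is plain data parallelism: each file is independent, so a worker pool scales with the core count. A minimal sketch of the shape (Python here for brevity; decode_file is a placeholder for whatever decoder you actually call — with a native decoder that releases the GIL, threads are fine, otherwise use processes):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def decode_file(path):
    # placeholder: call your actual JPEG decoder here and return the pixels
    with open(path, "rb") as f:
        return f.read()

def decode_all(paths):
    # one worker per core; each file is decoded independently
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(decode_file, paths))
```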

22 hours ago, chmichael said:

Just curious, has anyone tried Skia for resampling?

I did a quick test with the demo of the FMX version of my resampler, just doing "Enable Skia" on the project.

In the demo I compare my results to TCanvas.DrawBitmap with HighSpeed set to false.

I see that the Skia canvas is being used, and that HighSpeed=False results in the Skia resampling being set to

SkSamplingOptionsHigh  : TSkSamplingOptions = (UseCubic: True; Cubic: (B: 1 / 3; C: 1 / 3); Filter: TSkFilterMode.Nearest; Mipmap: TSkMipmapMode.None);

So, some form of cubic resampling, if I read that right.

 

Result:

Timing is slightly slower than native FMX drawing, but still a lot faster than my parallel resampling.

I see no improvement in quality over plain FMX, which supposedly uses bilinear resampling with this setting.

Here are two results. (How do you make this browser use the original pixel size? This is scaled!)

[Images: SkiaCubic.png, SkiaCubic2.png]

This doesn't look very cubic to me. As a comparison, here are the results of my resampler using the bicubic filter:

[Images: uScaleFMXBicubic.png, uScaleFMXBicubic2.png]

 

I might not have used Skia to its best advantage.

 

Renate


I just uploaded a new version to https://github.com/rmesch/Parallel-Bitmap-Resampler

 

Newest addition: a parallel unsharp mask using Gaussian blur. It can be used to sharpen or blur images.

A dedicated VCL demo, "Sharpen.dproj", is included. For FMX the effect can be seen in the thumbnail-viewer demo (ThreadsInThreadsFMX.dproj).
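For anyone unfamiliar with the technique, the unsharp-mask principle is: blur the image, then push each pixel away from its blurred value. A toy 1-D sketch (Python, illustrative only, not the library's code):

```python
import math

def gauss_kernel(radius, sigma):
    # sampled Gaussian, normalized so the weights sum to 1
    w = [math.exp(-(k * k) / (2.0 * sigma * sigma))
         for k in range(-radius, radius + 1)]
    s = sum(w)
    return [x / s for x in w]

def blur(row, kernel):
    r = len(kernel) // 2
    out = []
    for i in range(len(row)):
        # clamp indices at the edges (replicate border pixels)
        acc = sum(kernel[k + r] * row[min(max(i + k, 0), len(row) - 1)]
                  for k in range(-r, r + 1))
        out.append(acc)
    return out

def unsharp(row, radius=2, sigma=1.0, amount=1.0):
    # push each pixel away from its blurred value, then clamp to 0..255
    blurred = blur(row, gauss_kernel(radius, sigma))
    return [min(255.0, max(0.0, p + amount * (p - b)))
            for p, b in zip(row, blurred)]
```

Flat areas are untouched (pixel equals its blurred value), while edges get pushed apart, which is what reads as sharpening.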

 

This is for the "modern" version, 10.4 and up.

 

I haven't ported the unsharp mask to the legacy version (Delphi 2006 and up) yet; it requires more work, but I plan on doing it.

 

Renate

43 minutes ago, Renate Schaaf said:

Newest addition: a parallel unsharp-mask using Gaussian blur. Can be used to sharpen or blur images.

Have you benchmarked this against some of the existing Gaussian blur implementations?

 

It's a bit difficult to decode the algorithm you use due to the lack of comments in the source, but it appears you are just applying a Gaussian kernel (with some additional logic), and that approach is usually quite slow.

 

I have a benchmark suite that compares the performance and fidelity of 8 different implementations. I'll try to find time to integrate your implementation into it.

 

With regard to the ratio between Radius and Sigma, it's my understanding that:

Ratio = 1 / FWHM (Full Width at Half Maximum)
      = 1 / (2 * Sqrt(2 * Ln(2)))
      = 0.424660891294479

But you have a ratio of 0.5

Have I misunderstood something?

43 minutes ago, Anders Melander said:

But you have a ratio of 0.5

I took sigma = 0.2*Radius, but it's easy to change that to something more common. I just took a value for which the integral is very close to 1. With respect to other implementations, I'm ready to learn. I just implemented it as accurately as I could think of without being overly slow. Performance is quite satisfying to me, but I bet with your input it'll get faster 🙂

9 hours ago, Anders Melander said:

I have a benchmark suite that compares the performance and fidelity of 8 different implementations. I'll try to find time to integrate your implementation into it.

Hi Anders,

It's great that you're thinking of it, but hold off on that for a bit. I noticed that I compute the weights in a horrendously stupid way. The weights are mostly identical; it's not like when you resample, dumb me. Taking care of that reduces memory usage by a lot, and the subsequent application of the weights becomes much faster.

I've also changed the sigma-to-radius ratio a bit according to your suggestion. I find it hard to make results look nice with the cutoff at half the max value, so I changed it to 10^-2 times the max value. But this still allows for smaller radii, and it becomes again a bit faster.

So, before you do anything, I would like to finish these changes and also comment the code a bit more. (Forces me to really understand what I'm doing 🙂)
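The point about the weights can be made concrete: for a pure blur, the weight of a source pixel depends only on its offset from the destination pixel, so one kernel can be computed once and reused for every pixel, whereas in resampling each destination pixel has its own sub-pixel phase and therefore its own weight set. A sketch (hypothetical names, not the repo's code):

```python
import math

def blur_weight(j, i, sigma):
    # weight of source pixel i in the blurred value of pixel j
    return math.exp(-((i - j) ** 2) / (2.0 * sigma * sigma))

# the weight depends only on the offset i - j,
# so a single precomputed kernel serves every pixel of the image
kernel = [blur_weight(0, k, 1.5) for k in range(-3, 4)]
```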

Edited by Renate Schaaf


New version at https://github.com/rmesch/Parallel-Bitmap-Resampler:

 

It has more efficient code for the unsharp mask, and I added more comments in the code to explain what I'm doing.

 

Procedures with explaining comments:

uScaleCommon.Gauss

uScaleCommon.MakeGaussContributors

uScaleCommon.ProcessRowUnsharp

 

and see type TUnsharpParameters in uScale.pas.

 

Would it be a good idea to overload the UnsharpMask procedure to take sigma instead of radius as a parameter? It might be easier for comparison with other implementations.

On 10/3/2023 at 1:23 AM, Anders Melander said:

Have you benchmarked this against some of the existing Gaussian blur implementations?

OK, I plugged my unsharp mask into the Blurs example of GR32. Doing so made me aware of the need to do gamma correction when you mix colors. So I implemented that, but see below.

Also, I finally included options to properly handle the alpha-channel for the sharpen/blur.

The repo at GitHub has been updated with these changes.

 

Results:

  Quality: My results seem a tad brighter; otherwise I could see no difference between Gaussian and Unsharp.

  Performance:
    Unthreaded routine: for radii up to 8, Unsharp is on par with FastGaussian; after that, FastGaussian is the clear winner.
    Threaded routine: always the fastest.

 

If anybody is interested, I am attaching the test project. It of course requires GR32 to be installed. It also requires Delphi 10.3 or higher, I guess.

 

Gamma-correction:

  I did it via an 8-bit table, the same as GR32. This seems very imprecise to me, but I wouldn't know how to make it any more precise other than operating with floats, no thanks.
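The precision problem with an 8-bit table can be made concrete with a small sketch (Python, plain power gamma; hypothetical values, not the GR32 tables):

```python
GAMMA = 2.2  # plain power gamma for illustration; GR32 uses its own tables

# 8-bit lookup tables: gamma-encoded value -> linear and back
to_linear = [round(255 * (i / 255) ** GAMMA) for i in range(256)]
from_linear = [round(255 * (i / 255) ** (1.0 / GAMMA)) for i in range(256)]

# round-tripping through 8 bits collapses many dark values onto the
# same table entry, which is where banding can creep in
roundtrip = [from_linear[to_linear[i]] for i in range(256)]
```

With this particular gamma, the inputs 0..14 all land on linear 0, so distinct dark shades become indistinguishable once the blur mixes them in linear space.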

  Sadly, this can produce visible banding in some images, no matter which blur is used. Here is an example (for uploading, all images have been compressed, but the effect is about the same):

[Image: OriginalR.jpg]

Original, a cutout from a picture taken with my digital camera.

[Image: GaussianR_40.jpg]

Result of Gaussian with Radius = 40 and Gamma = 1.6

 

When gamma-correction is used for sharpening, bright edge-artifacts are reduced, but dark edge-artifacts are enhanced. My conclusion right now would be to not use gamma-correction.

But if anybody has an idea for how to implement it better, I'm all ears.

 

Thanks,

Renate

BlurTest.zip

10 minutes ago, Renate Schaaf said:

Performance: Unthreaded routine: For radii up to 8 Unsharp is on par with FastGaussian, after that FastGaussian is the clear winner.

By "FastGaussian" I guess you mean the FastBlur routine?

FastBlur is actually a box blur and not a true Gaussian blur. This is just fine for some setups, but not so great for others.

 

Also, performance is, as you've discovered, not the only important metric when comparing blurs. Fidelity can also be important. It completely depends on what the blur is used for. Some algorithms are fast but suffer from signal loss or produce artifacts. Some are precise but slow. And then there are some that do it all well 🙂

 

The parameters below are [Width, Height, Radius]:

[Images: four benchmark charts]

Case in point, BoxBlur32 above is consistently the fastest but also has the worst quality and doesn't handle Alpha at all.

 

45 minutes ago, Renate Schaaf said:

But if anybody has an idea for how to implement it better, I'm all ears.

Use floats and implement it with SSE. That's what I did 🙂

