
Renate Schaaf


Posts posted by Renate Schaaf


  1. I've managed to translate the alternative computation of weights into Graphics32. It was actually quite easy :). The idea is to compute the integral for the convolution with the filter via the midpoint rule. Before, I used the exact antiderivatives, which led to constant underestimation of peaks and valleys in the bitmap function, and thus to a loss of detail. Now pixels not lying totally within the support of the filter get their weight reduced, leading to fewer artefacts, while the peaks are estimated better, so contrast and detail are preserved better (the math is written out below for readability):
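
    In formulas (s = OldScale, W = FilterWidth after scaling, f = Filter, x_0 = J - Center): the exact weight of source pixel J would be the integral below, and the midpoint rule approximates it by one function value times the length of the clipped interval:

    $$w_J = \int_{x_1}^{x_2} f(s\,x)\,s\,\mathrm{d}x \;\approx\; f(s\,x_3)\,s\,(x_2-x_1),\qquad [x_1,x_2]=\bigl[x_0-\tfrac12,\;x_0+\tfrac12\bigr]\cap[-W,W],\quad x_3=\tfrac12(x_1+x_2)$$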

     

    //Precision of weights,
    //Totals Cb,Cg,Cr,Ca in Resample need to be unscaled by Prec * Prec
    const Prec = $800;
    
    function BuildMappingTableNew(DstLo, DstHi: Integer; ClipLo, ClipHi: Integer;
      SrcLo, SrcHi: Integer; Kernel: TCustomKernel): TMappingTable;
    var
     ...
    begin
      ...
      else if Scale < 1 then
      begin
        OldScale := Scale;
        Scale := 1 / Scale;
        FilterWidth := FilterWidth * Scale;
        for I := 0 to ClipW - 1 do
        begin
          if FullEdge then
            Center := SrcLo - 0.5 + (I - DstLo + ClipLo + 0.5) * Scale
          else
            Center := SrcLo + (I - DstLo + ClipLo) * Scale;
          Left := Floor(Center - FilterWidth);
          Right := Ceil(Center + FilterWidth);
          Count := -Prec;
          for J := Left to Right do
          begin
                       
            // changed part:
            x0 := J - Center; // old weight: Filter(x0*OldScale)*OldScale
            x1 := max(x0 - 0.5, -FilterWidth);
            x2 := min(x0 + 0.5, FilterWidth); // intersect symmetric interval of length 1 about x0 with support of scaled filter
            x3 := 0.5 * (x2 + x1); // new center
            Weight := Round(Prec * Filter(x3 * OldScale) * OldScale * (x2 - x1));  // intersection with support entered into the weight
                       
            if Weight <> 0 then
            begin
              Inc(Count, Weight);
              K := Length(Result[I]);
              SetLength(Result[I], K + 1);
              Result[I][K].Pos := Constrain(J, SrcLo, SrcHi - 1);
              Result[I][K].Weight := Weight;
            end;
          end;
    ...

    At first the results came out too dark and the contrast was exaggerated. By increasing the precision of the weights and using my own way of rounding the averaged result into bytes, this no longer seems to be the case:

     

            if RangeCheck then
            begin
              C.B := min((max(Cb, 0) + $1FFFFF) shr 22, 255);  //unscale and round
              C.G := min((max(Cg, 0) + $1FFFFF) shr 22, 255);
              C.R := min((max(Cr, 0) + $1FFFFF) shr 22, 255);
              C.A := min((max(Ca, 0) + $1FFFFF) shr 22, 255);
            end
            else
            begin
              C.B := (Cb + $1FFFFF) shr 22;
              C.G := (Cg + $1FFFFF) shr 22;
              C.R := (Cr + $1FFFFF) shr 22;
              C.A := (Ca + $1FFFFF) shr 22;
            end;
    
            // Combine it with the background
            case CombineOp of
              dmOpaque:
                DstLine[I] := C.ARGB;
    ...
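
    For reference, the rounding constant: with Prec = $800, each channel total carries the squared weight scale $800 * $800 = $400000 = 2^22 (one factor from the horizontal and one from the vertical weights), so the integer shift

    $$(C + 2^{21}-1)\;\mathrm{shr}\;22 \;\approx\; \operatorname{round}\Bigl(\frac{C}{2^{22}}\Bigr)$$

    unscales and rounds in one step.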

    The changed file uScalingProcsGR32.pas is attached.

    If you are interested in a test, here is a short video; the zooms and pans have been done with the new Lanczos. The second picture is one of the most notorious in my collection.

    uScalingProcsGR32.zip


  2. 4 hours ago, Anders Melander said:

    With optimization enabled, the performance of Resample() now matches, and in some cases even exceeds, your algorithm

    Confirmed. Nice.

    4 hours ago, Anders Melander said:

    The performance of Resample() with the 64-bit compiler is horrible.

    I've always been disappointed in the performance of 64-bit code. No idea why mine is faster.

     

    Meanwhile I have come up with an alternative way to compute the weights, which seems to decrease artefacts while keeping the brilliance. So far I haven't been able to translate it into Graphics32; the filters there all live on intervals of different sizes, whereas mine all live on [-1,1], and at this time of day my aging brain can't deal with the math. Maybe tomorrow.

     


  3. I'm too stupid to create a pull request on GitHub; when I read the help for it, I don't understand the first thing about it.

    Somehow I can post changes to my own repository using GitHub Desktop, but I don't really understand what it's doing :).

    So here is the changed GR32_Resamplers.pas.

     

    7 hours ago, Anders Melander said:

    The quality could probably be improved to match WIC but the question is if it's worth the trouble or if the current quality is "good enough"...?

    I personally think the quality is good enough; on ultra-high DPI I can hardly see the difference between a low-quality and a high-quality filter for "normal" images. Glyphs are another story.

    For me the filters kick in when animating pictures with zooms and pans; that's when the artifacts really show up.

    Theoretically the quality could be improved by a higher precision of the weights, which currently run from -256 to 256. Up to $800 should be possible, which I have done for the non-premultiply modes. But again, I have a hard time seeing a difference.

    Also, again theoretically, the algorithm using antiderivatives of filters should yield better results (except there isn't one in closed form for the Lanczos). But though I can see fewer artifacts, they decrease detail, as you have seen. I've probably made some mistake in the implementation. It could be the same kind of mistake you can make when computing numerical derivatives: Small divided by Small.
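
    To make "Small divided by Small" concrete: in my MakeContributors (the unit is further down this page), a weight is the scaled difference of two nearby values of the antiderivative F,

    $$w_j = \operatorname{round}\bigl(2048\,(F(x_1+\delta)-F(x_1))\bigr),\qquad \delta = \frac{1}{rr},$$

    and for strong downscaling rr is large, so delta is small: two nearly equal numbers are subtracted before scaling and rounding, the same cancellation that plagues numerical derivatives.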

     

    Time to hit the sack.

     

    Renate

    GR32_Resamplers.zip


  4. I have been able to make the GR32 resampling as fast as mine unthreaded by making some simple changes to the procedure GR32_Resamplers.Resample (in the implementation part):

     

    • changing the order of the X- and Y-loops in the filling of the horizontal buffer, avoiding jumps in the bitmap memory,

    • using pointers to walk along the arrays (see the sketch below),

    • turning on compiler optimization for the procedure (biggest improvement).
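
    For illustration, here is a minimal sketch of the pointer idea (just the access pattern, not the actual GR32_Resamplers code) on a pf32bit bitmap:

    uses Winapi.Windows, Vcl.Graphics; // PRGBQuad, TBitmap
    
    // Walk each row with a pointer instead of indexed pixel access;
    // the inner loop is a plain pointer increment through contiguous memory.
    procedure InvertColors(const Bmp: TBitmap);
    var
      x, y: Integer;
      p: PRGBQuad;
    begin
      Bmp.PixelFormat := pf32bit;
      for y := 0 to Bmp.Height - 1 do
      begin
        p := Bmp.ScanLine[y]; // start of one contiguous row
        for x := 0 to Bmp.Width - 1 do
        begin
          p.rgbBlue := 255 - p.rgbBlue;
          p.rgbGreen := 255 - p.rgbGreen;
          p.rgbRed := 255 - p.rgbRed;
          inc(p); // next pixel, no address recomputation
        end;
      end;
    end;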

     

    If you want to see for yourself, in the attachment are 3 changed .pas-files that need to overwrite the corresponding ones in the Algorithms-folder under Bitmap Scaling.

     

    Renate

    Bitmap Scaling-Diff.zip


  5. Here is a new version with lots of changes (thanks, Anders):

     

    I have changed the algorithm for the filters except box from continuous- to discrete-space convolution, because the result didn't seem to look as good as expected; maybe rounding errors, further investigation needed. But I just noticed that for my videos I need to go back to continuous space, because that gives much smoother zooms and pans.
    Anyway, now the algorithm itself is more or less identical to Graphics32, so you have a completely fair comparison. I also followed Anders' idea of including a stopwatch in the signature of the test procedures, which can be turned on only when it matters.

    More changes:

    Alpha:
    I've put some alpha-shenanigans into the loaded bitmaps, so you can see how the alpha-channel is handled.

     

    There is now a TAlphaCombineMode = (amIndependent, amPreMultiply, amIgnore) doing the following:

     

    amIndependent:
    The behavior as before: all channels are sampled independently of one another.
    This is the behavior of GR32 for drawmode dmOpaque.

     

    amPreMultiply:
    RGB is multiplied by alpha prior to resampling; after that, nonzero alpha is "unmultiplied" (sketched below).
    I had to sacrifice some precision for the weights and the alpha-multiplication in order to stay within integer range.
    This is the behavior of WICImage with Source.AlphaFormat=afDefined (it seems to have made the same sacrifices).
    GR32 with drawmode dmBlend does pre-multiply, but doesn't unmultiply; rather, it "post-multiplies".

     

    amIgnore:
    RGB is resampled, alpha is set to 255. Faster for images not needing an alpha-channel.
    This is the behavior of WICImage with Source.AlphaFormat=afIgnored.
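
    A sketch of what amPreMultiply does per pixel (hypothetical helper names, simplified to 8 bits; the unit actually does the multiplication on the scaled-integer cache values for better precision):

    uses Winapi.Windows, System.Math; // MulDiv, Min
    
    procedure PreMultiplyPixel(var P: TRGBQuad);
    begin
      P.rgbBlue := MulDiv(P.rgbBlue, P.rgbReserved, 255);
      P.rgbGreen := MulDiv(P.rgbGreen, P.rgbReserved, 255);
      P.rgbRed := MulDiv(P.rgbRed, P.rgbReserved, 255);
    end;
    
    procedure UnPreMultiplyPixel(var P: TRGBQuad);
    begin
      if P.rgbReserved = 0 then
        Exit; // alpha 0: the RGB information is gone for good
      P.rgbBlue := Min(255, MulDiv(P.rgbBlue, 255, P.rgbReserved));
      P.rgbGreen := Min(255, MulDiv(P.rgbGreen, 255, P.rgbReserved));
      P.rgbRed := Min(255, MulDiv(P.rgbRed, 255, P.rgbReserved));
    end;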

     

    To prevent apples from being compared to oranges I have included a notice when a certain algorithm does not fully support the chosen mode.

    To avoid code repetition I had to introduce a number of procedural variables, which slow things down a little bit, but it's still nice and fast.

     

    Threads:
    The <= 16 threads hardly consume any CPU time while waiting, so for the time being I want to keep them.
    There is a TTask version included in the source; it has worked so far, but TTask behaves somewhat erratically timing-wise, and I don't understand what it's doing.

     

    I'll try to write an (unthreaded) TKernelResampler descendant for Graphics32; maybe I can speed it up, let's see.

     

    Renate

    Bitmap Scaling-New.zip


  6. Now I got the pre-multiplication in place, but before I do an update, I have a question for Anders:

     

    The pre-multiplication completely erases any RGB info stored at transparent pixels, so those can't be unmultiplied. I can see that this is desirable if the image contains a mask and needs to be rescaled.

    The RGB part of the scaled image looks something like this; btw, the result for WICImage looks the same.

    [Image: the RGB part of the pre-multiplied result]

    What puzzles me, though, is the fact that Graphics32 still keeps the transparent parts of the RGB in place, so it must have some magic way to compute 0/0, which I would like to learn ...

     

    Looking further at the source code, I cannot see any pre-multiplication being done:

              C := Src.Bits[X + ClusterY[Y].Pos * Src.Width];
              ClustYW := ClusterY[Y].Weight;
              Inc(Ca, Integer(C shr 24) * ClustYW);
              Inc(Cr, Integer(C and $00FF0000) shr 16 * ClustYW);
              Inc(Cg, Integer(C and $0000FF00) shr 8 * ClustYW);
              Inc(Cb, Integer(C and $000000FF) * ClustYW);

    The bytes of the channels are just packed into one Cardinal; the bytes are extracted and multiplied by the weights. Or am I just dense, or has this already been done to Src.Bits?

     

    Have a nice Sunday,

    Renate


  7. 50 minutes ago, Anders Melander said:

    The premultiply/unpremultiply really messes with the quality since you're operating on 8-bit values

    My cache arrays are floats scaled to integer, so I can do the multiply while filling the array, and the unmultiply on the integers. Let's see.

     

    52 minutes ago, Anders Melander said:

    The memory usage can be lessened by processing one channel at a time.

    That's a great idea.

     

    53 minutes ago, Anders Melander said:

    Of course it's wanted. Improvements are always welcome

    I'll see whether I can plug it in, but the source of Graphics32 is a bit intimidating. But then, being retired, I have all day to understand it.


  8. 4 hours ago, Anders Melander said:

    As a Graphics32 contributor I would have preferred though that you'd spent the (probably considerable) effort improving Graphics32 🙂

    I can still do that if it's wanted and it still makes sense (see below).

     

    4 hours ago, Anders Melander said:

    You cannot process each RGBA channel independently. Instead you need to operate on premultiplied RGBA values. For example you wouldn't want an RGBA value of $00FFFFFF to have any influence on neighboring pixels. So: 1) Premultiply, 2) Resample, 3) Unpremultiply.

    Oh, I need to fix that ASAP 😊. And include the better GR32 routine.

     

    Thanks for your feedback,

     

    Renate

     

     


  9. I have made my parallel resampling of bitmaps as fast as I can get it now. I would find it interesting to know how the algorithm performs on other systems, and it would be super to get suggestions for improvement.

    The procedures can be found in the unit uScale under Algorithms in the attached zip.

    I have tested against Windows StretchBlt/Halftone, WICImage and Graphics32. On my PC (AMD Ryzen 7 1700, 8-core) I see a substantial improvement in speed.

    The threads are based on TThread rather than TTask or TParallel, because I had failures using the latter two, whereas the old-fashioned threads haven't ever failed me in gazillions of runs.

     

    If you want to compile and run the comparison project, you need to include the source folder of Graphics32 (latest version) in your search path. For convenience it is included in the zip under Algorithms. I couldn't find a way to divvy out only the units needed for resampling.

    The test against Graphics32 might be slightly unfair, because the bitmaps are assigned to TBitmap32 before the processing.

     

    Right now, the procedure itself cannot be run in concurrent threads, because the array of TThreads is a global variable; I need to change the design (ideas welcome).

    There might still be room for improvement by minimizing the cache misses, but my head can't handle it at the moment.

     

    Hope you find it useful.

     

    Renate

     

    Bitmap Scaling.zip


  10. 30 minutes ago, Attila Kovacs said:

    After the "Free" the reference is a "dangling pointer".

    That's why I don't use it :). 

    33 minutes ago, Attila Kovacs said:

    Try to use explicit Free's and you don't have to keep the implementation details in your head.

    Good advice.

     

    34 minutes ago, Attila Kovacs said:

    Or keep it there and follow the rules.

    So you are saying it is OK (other than maintenance headaches) to keep it there, on condition that one follows the rules? I'm saying this because I remember reactions like "Shriek" when somebody posted code like this.


  11. type
      TDoSomething = class
        Danger: string;
        constructor Create;
        procedure DoSomething;
      end;
    
    implementation
      ...
    constructor TDoSomething.Create;
    begin
      Danger := 'Danger!';
    end;
    
    procedure TDoSomething.DoSomething;
    begin
      ShowMessage('Doing something');
      Sleep(1000);
      ShowMessage('Done doing something');
      Free; //<-----
    end;
    
    procedure TForm2.Button1Click(Sender: TObject);
    begin
      var
        DoSomething: TDoSomething := TDoSomething.Create;
      DoSomething.DoSomething;
      // ShowMessage(DoSomething.Danger);
    end;
    
    initialization
      ReportMemoryLeaksOnShutdown := True;
    end.

    I'm asking because I just noticed that I was doing something similar in my code 😊, only it was a bit more hidden. It has worked without complaint for quite a while, so I was wondering why. My guess is it's because the Free is the very last thing I ask the object to do. In the destructor the object is still alive; after that the memory is freed, and as long as I don't do anything like the commented ShowMessage, that should be safe, or no?

    I am cleaning this up, of course, but I didn't want to miss the chance of learning something; I don't understand all that well what's going on in the Delphi source at this point.

     

    Renate


  12. Hi, Anders,

     

    Thanks a lot for yet another great tool. Totally straightforward, fast and hassle-free to use.

     

    I have 3 little issues with it, which I can work around, but maybe you'd like to know (newest compiled version 1.3.8055.21506):

     

    1. On my 4K monitor scaled to 200% the ribbon text is barely readable. Minimizing the ribbon displays the text correctly when folded out.

    2. Translating a rather long string using the text editor inserts line breaks at the end of every displayed line. Translation Manager then rightly complains about mismatching line breaks. I tried to find a setting which turns this off, but couldn't. Using Notepad for the translation and pasting the translated string into the in-place editor cuts off the string.

    3. The font of the text editor is rather large on my display, adding to the problem in 2.

     

    Thanks again,

     

    Renate


  13. Meanwhile I've followed your suggestions and rewritten the algorithm so it writes one line at a time. Sadly, I couldn't give SuperDuper a chance to shine, as I would need to transpose memory in chunks of 16 bytes instead of 4. Still, it got a little faster, and since both unthreaded routines are faster than Graphics32, I don't think I'm doing so badly.

    For the threaded version I used

    On 4/12/2021 at 5:23 PM, Fr0sT.Brutal said:

    [quoted code]

    with a minimum of 2 and a maximum of 8 threads. It is rather fast when it works.

    Sadly, the failures of the tasks version became, if anything, more frequent, and a TParallel.For version hung indefinitely a few times. I failed to completely recover from the timeouts; after a failure the next call would likely fail again.

     

    So, finally I just set up 8 old-fashioned threads which wake up as needed, do a chunk of the work and signal that they're done (sketched below). I didn't even bother to put in checks for timeouts. It's the fastest yet and has never failed so far. I guess that is a strong indication that the culprit was TTask. If you are curious, here is a 720x540 video that got written at 40 FPS, a new record for me 🙂. Once I do similar optimizations for the transitions, the writing speed might be acceptable.
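
    The skeleton of such a worker, roughly as I use it (names are illustrative and termination is simplified; TEvent.Create gives a manual-reset event):

    uses Winapi.Windows, System.Classes, System.SysUtils, System.SyncObjs;
    
    type
      TResampleWorker = class(TThread)
      private
        FWakeup, FDone: TEvent;
        FWork: TProc;
      protected
        procedure Execute; override;
      public
        constructor Create;
        procedure RunWork(const AWork: TProc); // hand over one chunk
        procedure WaitDone;                    // block until the chunk is finished
      end;
    
    constructor TResampleWorker.Create;
    begin
      inherited Create(False);
      FWakeup := TEvent.Create; // manual-reset, initially unsignaled
      FDone := TEvent.Create;
    end;
    
    procedure TResampleWorker.Execute;
    begin
      while not Terminated do
      begin
        FWakeup.WaitFor(INFINITE); // sleep until there is work
        FWakeup.ResetEvent;
        if Terminated then
          Break;
        if Assigned(FWork) then
          FWork();
        FDone.SetEvent; // tell the caller this chunk is finished
      end;
    end;
    
    procedure TResampleWorker.RunWork(const AWork: TProc);
    begin
      FWork := AWork;
      FDone.ResetEvent;
      FWakeup.SetEvent;
    end;
    
    procedure TResampleWorker.WaitDone;
    begin
      FDone.WaitFor(INFINITE);
    end;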

     

    Thanks again to all!


  14. 53 minutes ago, Anders Melander said:

    I can't see why the intermediate buffer would need to be a bitmap; It's just a chunk of memory.

    W and Height are the size in bytes of the memory chunk? Do you transpose byte by byte? I guess not, that wouldn't make any sense for BGRA records. Well, I need to play around with it. Pity I don't read ASM. Is this for 32-bit only?

     

    Thanks 🙂

     

     


  15. 5 hours ago, Dalija Prasnikar said:

    In that case AnonymousThread approach can be good enough

    It isn't; the creation of the threads takes too long, and in the end the procedure takes about twice the time of the unthreaded one. I would really hate to have to use the approach of waking up existing threads to do the resampling ...

     

    5 hours ago, Dalija Prasnikar said:

    and that using no timeout would just hang indefinitely (it might be worth trying, though)

    I introduced the timeout because the first time this happened, I had to kill the application.

     

    Am I right to assume, then, that there isn't anything obviously wrong in my approach which would cause the timeout?

    5 hours ago, Anders Melander said:

    While this isn't related to your threading problem, it seems you are processing the bitmap by column instead of by row.

    This is true for the target, and to a degree for the source, too. But the intermediate Columnx is in fact a "row" the way it's allocated. You are right, though, and maybe one can get away without the transposing. Also, if the intermediate storage is a TBitmap, the algorithms wouldn't be "symmetric" anymore; the transpose of the target wouldn't be the same as the target of the transpose, but that's probably just a nitpick.

    Anyway, if you'd care to share the code for the transposing, I'd be glad.

     

    5 hours ago, Anders Melander said:

    Another thing to be aware of when multiple threads read or write to the same memory is that if two threads read and write to two different locations, but those two locations are within the same cache line, then you will generally get a decrease in performance as the cores fight over the cache line. 

     

    2 hours ago, M.Joos said:

    Indeed, this is then called "false sharing"

    I didn't know that. Thanks. Another point in favor of doing complete rows first, or of having the threads compute chunks of full rows.

     

    Thank you all, I have a lot to do.


  16. 37 minutes ago, Dalija Prasnikar said:

    But, if you are resampling video frames, that suggests that you are resizing same sized frames to same output size. In that case, calculating contributor weights for each frame again is huge waste. Calculate that once, and then just reuse it.

    Good idea, but I'm doing zoom-pans; the fixed resizing I leave to the video encoder.

    I want to be reasonably sure that this runs on any system, not just mine ... Time to do some more stress tests 🙂


  17. I agree that it isn't very nice to just dump the source code here. So here is the explanation:

     

    In the routines, never mind SourceRect; imagine it to be the full rect of the source bitmap.
    The Radius isn't important either.

    Contributors:

    For pixel p(x,y) in the target, ContribsX[x] is a record that defines the range of x-values of pixels in the source which contribute to the value of p(x,y).
    The record also contains the weights which have to be applied before summing up.
    The further away from the center of the range, the lower the weight (simplified).
    The range and the weights are independent of y, so the array ContribsX is computed in advance.

    ContribsY[y] does the same for the source pixels in y-direction.

    The way the contributors are computed isn't important here; it's done outside of the tasks.

     

    Pseudo-Code for computing p(x,y): (ps is a source pixel)
    (Imagine the weights to be floats, and results being rounded)

    p(x,y) := 0;
    for i:=0 to ContribsX[x].high do
      for j:= 0 to ContribsY[y].high do
         p(x,y) := p(x,y) + ContribsX[x].Weights[i] * ContribsY[y].Weights[j] * ps(ContribsX[x].min + i, ContribsY[y].min + j);

    Since this isn't very efficient and can't be divided well into tasks, the averaging in x- and y-direction is done separately.

    Here's where the routine ProcessColumn comes in; it's the most important procedure, and it writes one column of the target bitmap's memory in 2 steps:

    [Diagram: the two resampling steps of ProcessColumn]

    Pseudo-code for ProcessColumn (x is given; compute the column at x):

    Step 1: resample to intermediate array ColumnX:

    for ys:= 0 to OldHeight-1 do
    begin
      ColumnX[ys] := 0;
      for i:=0 to ContribsX[x].high do
       ColumnX[ys] := ColumnX[ys] + ContribsX[x].Weights[i]*ps(ContribsX[x].min +i, ys);
    end;

    Step 2: resample ColumnX to the target column:

    for y := 0 to NewHeight-1 do
    begin
      p(x,y):=0;
      for j := 0 to ContribsY[y].high do
        p(x,y) := p(x,y) + ContribsY[y].Weights[j]*ColumnX[ContribsY[y].min + j];
    end;

    The tasks just write 4 different chunks of columns, but they all read from the same source bitmap memory and from the same contributor arrays.

    [Diagram: the four column chunks, one per task]

    The original source code is obfuscated by (at least) 2 things:

    • Values that should be float have been scaled to integers

    • Pointers are used to walk along the arrays

     

    I'm afraid that's still too long. ...Make it shorter. ...Again, half the size.

     

    Regards, Renate


  18. 13 minutes ago, Anders Melander said:

    Instead of just posting your source and letting us figure out what you're doing, it would be nice if you instead described exactly what you're doing. I.e. what does the overall job do (e.g. it resamples a bitmap), how does it do that (describe the algorithm), how are you dividing the job, what do your individual tasks do, etc. Describe it as if we didn't have the source. This is basically also what your source comments should do.

    I'll try, but it ain't easy. More than 10 years ago I spent days reading your code before I understood what made it tick 🙂


  19. I've tried to divide the work for resizing bitmaps into 4 parallel tasks. The result is, on average, a speed improvement of 60%; the bicubic resampler gets close to StretchBlt/Halftone. I'm using the routine for writing videos, where it is called at least once for every video frame. So, in my tests I must have called it at least 100 000 times. The problem is that 3 times there was a timeout in TTask.WaitForAll.

    I'm not using any synchronization, because

     

    • The tasks are completely independent of each other, i.e. none alters anything that the others use.

    • They write to memory chunks which do not overlap.

    • They don't use any VCL code, not even VCL.Graphics.

    • The memory they read does overlap; I thought that was OK, but after reading up a bit, I get the impression that I am wrong here.

     

    From what I gather, I have a "race condition" caused by "non-atomic reads" (this is all very new to me, sorry if I'm talking rubbish).

    The threads are (among other things) all reading from two contributor arrays, which contain records of varying size, so could that be it?

    What can I do to prevent the error?

    Maybe I should make copies of the arrays for each task?

     

    Here is the unit; it only uses a bicubic filter to make things a little simpler:

    unit uResample;
    
    interface
    
    uses WinApi.Windows, System.Types, VCL.Graphics, System.SysUtils,
      System.Classes, System.Math;
    
    type
      TZoomProcedure = procedure(const Source, Target: TBitmap; SourceRect: TRectF;
        Radius: single);
    
      // "bicubic" resampler based on ideas by Anders Melander, Mike Lischke
      // and Eric Grange. It supports float values for filter radii and float-valued
      // zoom rectangles. Source, Target should be pf32bit. Target must be set to the
      // correct size.
    procedure ZoomResample(const Source, Target: TBitmap; SourceRect: TRectF;
      Radius: single);
    
    // Same as above divided into 4 parallel tasks
    procedure ZoomResampleParallel(const Source, Target: TBitmap;
      SourceRect: TRectF; Radius: single);
    
    implementation
    
    uses WinApi.MMSystem, VCL.Dialogs, System.SyncObjs, System.Threading;
    
    type
    
      TContributor = record
        min, High: integer;
        // Min: start source pixel
        // High+1: number of source pixels to contribute to the result
        Weights: array of integer; // doubles scaled by $800
      end;
    
      TContribArray = array of TContributor;
    
      TBGRAInt = record
        b, g, r, a: integer;
      end;
    
      PBGRAInt = ^TBGRAInt;
    
      TBGRAIntArray = array of TBGRAInt;
    
    const
      beta = 0.525; // f(beta)=0
      beta2 = beta * beta;
      alpha = 105 / (16 - 112 * beta2);
      aa = 1 / 7 * alpha;
      bb = -1 / 5 * alpha * (2 + beta2);
      cc = 1 / 3 * alpha * (1 + 2 * beta2);
      dd = -alpha * beta2;
      // constants computed with Maple for polynomial fit
    
    function AntiFilter(X: double): double; inline;
    // Antiderivative of a filter function similar to bicubic, but I like it better
    begin
      if X < -1 then
        Result := -0.5
      else if X < 1 then
        Result := aa * X * X * X * X * X * X * X + bb * X * X * X * X * X + cc * X *
          X * X + dd * X
      else
        Result := 0.5;
    end;
    
    procedure MakeContributors(r: single; SourceSize, TargetSize: integer;
      SourceStart, SourceFloatwidth: double; var Contribs: TContribArray);
    // r: Filterradius
    var
      xCenter, scale, rr: double;
      X, j: integer;
      x1, x2, delta: double;
      TrueMin, TrueMax, Mx: integer;
    begin
      Assert(TargetSize > 0);
      if SourceFloatwidth = 0 then
        SourceFloatwidth := SourceSize;
      scale := SourceFloatwidth / TargetSize;
      SetLength(Contribs, TargetSize);
    
      if scale > 1 then
        // downsampling
        rr := r * scale
      else
        rr := r;
      delta := 1 / rr;
    
      for X := 0 to TargetSize - 1 do
      begin
        xCenter := (X + 0.5) * scale;
        TrueMin := Ceil(xCenter - rr + SourceStart - 1);
        TrueMax := Floor(xCenter + rr + SourceStart);
        Contribs[X].min := min(max(TrueMin, 0), SourceSize - 1);
        // make sure not to read in negative pixel locations
        Mx := max(min(TrueMax, SourceSize - 1), 0);
        // make sure not to read past w-1 in the source
        Contribs[X].High := Mx - Contribs[X].min;
        Assert(Contribs[X].High >= 0);
        // High=Number of contributing pixels minus 1
        SetLength(Contribs[X].Weights, Contribs[X].High + 1);
        with Contribs[X] do
        begin
          x1 := delta * (min - SourceStart - xCenter);
          for j := 0 to High do
          begin
            x2 := x1 + delta;
            Weights[j] := round($800 * (AntiFilter(x2) - AntiFilter(x1)));
            x1 := x2;
          end;
          for j := TrueMin - min to -1 do
          begin
            // assume the first pixel to be repeated
            x1 := delta * (min + j - SourceStart - xCenter);
            x2 := x1 + delta;
            Weights[0] := Weights[0] +
              round($800 * (AntiFilter(x2) - AntiFilter(x1)));
          end;
          for j := High + 1 to TrueMax - min do
          begin
            // assume the last pixel to be repeated
            x1 := delta * (min + j - SourceStart - xCenter);
            x2 := x1 + delta;
            Weights[High] := Weights[High] +
              round($800 * (AntiFilter(x2) - AntiFilter(x1)));
          end;
        end; { with Contribs[x] }
      end; { for x }
    end;
    
    // This writes one column in the target
    procedure ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight: integer;
      rStart, rTStart: PByte; runstart: PBGRAInt;
      const ContribsX, ContribsY: TContribArray); inline;
    var
      ps, pT: PRGBQuad;
      rs, rT: PByte;
      y, xs, ys: integer;
      highx, highy, minx, miny: integer;
      WeightxStart, Weightx, Weighty: PInteger;
      Total: TBGRAInt;
      run: PBGRAInt;
    begin
      rs := rStart;
      highx := ContribsX[X].High;
      minx := ContribsX[X].min;
      inc(rs, 4 * minx);
      WeightxStart := @ContribsX[X].Weights[0];
      run := runstart;
      // runstart points to Columnx[0]
      // Columnx stores the result of the resampling in x-direction
      for y := ymin to ymax do
      begin
        // For each source line at y
        // Sum up weighted color values at source pixels ContribsX[x].Min+xs
        // 0<=xs<=ContribsX[x].High
        // and store the results in Columnx[y-ymin] etc.
        ps := PRGBQuad(rs);
        Weightx := WeightxStart;
        FillChar(run^, SizeOf(run^), 0);
        for xs := 0 to highx do
        begin
          inc(run.b, Weightx^ * ps.rgbBlue);
          inc(run.g, Weightx^ * ps.rgbGreen);
          inc(run.r, Weightx^ * ps.rgbRed);
          inc(run.a, Weightx^ * ps.rgbReserved);
          inc(Weightx);
          inc(ps);
        end;
        inc(run);
        dec(rs, Sbps);
      end;
      // Average in y-direction:
      // For each target line y sum up weighted colors in
      // Columnx[ys+ContribsY[y].Min-ymin], 0<=ys<=ContribsY[y].High
      // Store result in Total
      // round and assign to TargetPixel pt at [x,y]
      rT := rTStart;
      for y := 0 to NewHeight - 1 do
      begin
        pT := PRGBQuad(rT);
        inc(pT, X);
        highy := ContribsY[y].High;
        miny := ContribsY[y].min - ymin;
        Weighty := @ContribsY[y].Weights[0];
        run := runstart;
        inc(run, miny);
        FillChar(Total, SizeOf(Total), 0);
        for ys := 0 to highy do
        begin
          inc(Total.b, Weighty^ * run.b);
          inc(Total.g, Weighty^ * run.g);
          inc(Total.r, Weighty^ * run.r);
          inc(Total.a, Weighty^ * run.a);
          inc(Weighty);
          inc(run);
        end;
        Total.r := max(Total.r, 0);
        Total.g := max(Total.g, 0);
        Total.b := max(Total.b, 0);
        Total.a := max(Total.a, 0);
        // results could be negative, filter has negative values
        Total.r := (Total.r + $1FFFFF) shr 22; // "round" the result
        Total.g := (Total.g + $1FFFFF) shr 22;
        Total.b := (Total.b + $1FFFFF) shr 22;
        Total.a := (Total.a + $1FFFFF) shr 22;
        pT.rgbBlue := min(Total.b, 255);
        pT.rgbGreen := min(Total.g, 255);
        pT.rgbRed := min(Total.r, 255);
        pT.rgbReserved := min(Total.a, 255);
        dec(rT, Tbps);
      end; // for y
    end;
    
    procedure ZoomResampleParallel(const Source, Target: TBitmap;
      SourceRect: TRectF; Radius: single);
    var
      ContribsX, ContribsY: TContribArray;
    
      OldWidth, OldHeight: integer;
      NewWidth, NewHeight: integer;
    
      Sbps, Tbps: integer;
      rStart, rTStart: PByte; // Row start in Source, Target
      ymin, ymax, i: integer;
      tasks: array [0 .. 3] of ITask;
      Chunk: integer;
    begin
      Source.PixelFormat := pf32bit;
      Target.PixelFormat := pf32bit; // for safety
      // Target needs to have been set to correct size
      NewWidth := Target.Width;
      NewHeight := Target.Height;
      OldWidth := Source.Width;
      OldHeight := Source.Height;
    
      Tbps := ((NewWidth * 32 + 31) and not 31) div 8;
      // BytesPerScanline Target
      Sbps := ((OldWidth * 32 + 31) and not 31) div 8;
      // BytesPerScanline Source
    
      MakeContributors(Radius, OldWidth, NewWidth, SourceRect.Left,
        SourceRect.Right - SourceRect.Left, ContribsX);
      MakeContributors(Radius, OldHeight, NewHeight, SourceRect.Top,
        SourceRect.Bottom - SourceRect.Top, ContribsY);
      ymin := ContribsY[0].min;
      ymax := ContribsY[NewHeight - 1].High + ContribsY[NewHeight - 1].min;
    
      rStart := Source.ScanLine[ymin];
      rTStart := Target.ScanLine[0];
    
      Chunk := NewWidth div 4;
    
      // stupid repetition!
      tasks[0] := TTask.create(
        procedure
        begin
          var
            Columnx: TBGRAIntArray; // Stores averaged x-Contributors
          var
            X, xmin, xmax: integer;
          var
            runstart: PBGRAInt;
          SetLength(Columnx, ymax - ymin + 1);
          runstart := @Columnx[0];
          xmin := 0;
          xmax := Chunk - 1;
          for X := xmin to xmax do
            ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
              runstart, ContribsX, ContribsY);
        end);
    
      tasks[1] := TTask.create(
        procedure
        begin
          var
            Columnx: TBGRAIntArray;
          var
            X, xmin, xmax: integer;
          var
            runstart: PBGRAInt;
          SetLength(Columnx, ymax - ymin + 1);
          runstart := @Columnx[0];
          xmin := Chunk;
          xmax := 2 * Chunk - 1;
          for X := xmin to xmax do
            ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
              runstart, ContribsX, ContribsY);
        end);
    
      tasks[2] := TTask.create(
        procedure
        begin
          var
            Columnx: TBGRAIntArray;
          var
            X, xmin, xmax: integer;
          var
            runstart: PBGRAInt;
          SetLength(Columnx, ymax - ymin + 1);
          runstart := @Columnx[0];
          xmin := 2 * Chunk;
          xmax := 3 * Chunk - 1;
          for X := xmin to xmax do
            ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
              runstart, ContribsX, ContribsY);
        end);
    
      tasks[3] := TTask.create(
        procedure
        begin
          var
            Columnx: TBGRAIntArray;
          var
            X, xmin, xmax: integer;
          var
            runstart: PBGRAInt;
          SetLength(Columnx, ymax - ymin + 1);
          runstart := @Columnx[0];
          xmin := 3 * Chunk;
          xmax := NewWidth - 1;
          for X := xmin to xmax do
            ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
              runstart, ContribsX, ContribsY);
        end);
    
      for i := 0 to 3 do
        tasks[i].start;
      if not TTask.WaitForAll(tasks, 10000) then
        raise Exception.Create('Timeout for parallel resampling threads');
    end;
    
    procedure ZoomResample(const Source, Target: TBitmap; SourceRect: TRectF;
    Radius: single);
    var
      ContribsX, ContribsY: TContribArray;
    
      OldWidth, OldHeight: integer;
      NewWidth, NewHeight: integer;
      // Target needs to be set to correct size
    
      Sbps, Tbps: integer;
      rStart, rTStart: PByte; // Row start in Source, Target
      Columnx: TBGRAIntArray; // cache array, records for better memory layout
      X, ymin, ymax: integer;
      runstart: PBGRAInt;
    begin
      Source.PixelFormat := pf32bit;
      Target.PixelFormat := pf32bit; // for safety
      NewWidth := Target.Width;
      NewHeight := Target.Height;
      OldWidth := Source.Width;
      OldHeight := Source.Height;
    
      Tbps := ((NewWidth * 32 + 31) and not 31) div 8;
      // BytesPerScanline Target
      Sbps := ((OldWidth * 32 + 31) and not 31) div 8;
      // BytesPerScanline Source
    
      MakeContributors(Radius, OldWidth, NewWidth, SourceRect.Left,
        SourceRect.Right - SourceRect.Left, ContribsX);
      MakeContributors(Radius, OldHeight, NewHeight, SourceRect.Top,
        SourceRect.Bottom - SourceRect.Top, ContribsY);
      ymin := ContribsY[0].min;
      ymax := ContribsY[NewHeight - 1].High + ContribsY[NewHeight - 1].min;
    
      rStart := Source.ScanLine[ymin];
      rTStart := Target.ScanLine[0];
    
      SetLength(Columnx, ymax - ymin + 1);
    
      runstart := @Columnx[0];
    
      // Compute colors for each target column at x
      for X := 0 to NewWidth - 1 do
      begin
        ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
          runstart, ContribsX, ContribsY);
      end; // for x
    end;
    
    end.
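
    Regarding the "stupid repetition": the four task bodies differ only in xmin and xmax. A sketch of how they could be collapsed; MakeTask would be a nested function inside ZoomResampleParallel, so the anonymous method captures ymin, ymax etc. from the outer scope, while xmin and xmax arrive as value parameters, so each task gets its own copy instead of a shared loop variable:

      function MakeTask(xmin, xmax: integer): ITask;
      begin
        Result := TTask.create(
          procedure
          var
            Columnx: TBGRAIntArray;
            X: integer;
            runstart: PBGRAInt;
          begin
            SetLength(Columnx, ymax - ymin + 1);
            runstart := @Columnx[0];
            for X := xmin to xmax do
              ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart,
                rTStart, runstart, ContribsX, ContribsY);
          end);
      end;
    
      // ... and in the body of ZoomResampleParallel:
      for i := 0 to 2 do
        tasks[i] := MakeTask(i * Chunk, (i + 1) * Chunk - 1);
      tasks[3] := MakeTask(3 * Chunk, NewWidth - 1);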

     


  20. 16 minutes ago, Anders Melander said:

    I guess I'll take a look at it when my VTune trial expires.

    That would be great. Here are some relevant lines of the log file; I doubt they are of any use, though:

     

    2021.04.08	14:23:10.909	#DEBUG	#PeFile::Open	#PeFile.cpp(111)	#Executable (PE) C:\Windows\SysWOW64\ntdll.dll opened
    2021.04.08	14:23:10.137	#DEBUG	#PeFile::InitializeSymbolEngine	#PeFile.cpp(699)	#Executable (PE) C:\Windows\SysWOW64\ntdll.dll started Symbol Engine initialization
    2021.04.08	14:23:10.752	#DEBUG	#PeFile::InitializeSymbolEngine	#PeFile.cpp(751)	#Executable (PE) C:\Windows\SysWOW64\ntdll.dll initialized Symbol Engine: PDB
    2021.04.08	14:23:10.662	#DEBUG	#PeFile::Open	#PeFile.cpp(111)	#Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe opened
    2021.04.08	14:23:10.945	#DEBUG	#PeFile::InitializeSymbolEngine	#PeFile.cpp(699)	#Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe started Symbol Engine initialization
    2021.04.08	14:23:10.113	#DEBUG	#PeFile::InitializeSymbolEngine	#PeFile.cpp(758)	#Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe failed to initialize Symbol Engine: PDB
    2021.04.08	14:23:10.740	#DEBUG	#PeFile::InitializeSymbolEngine	#PeFile.cpp(778)	#Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe initialized Symbol Engine: COFF

     
