Jump to content
Renate Schaaf

Parallel Algorithm for resampling bitmaps - very infrequent fails

Recommended Posts

I've tried to divide the work for resizing bitmaps into 4 parallel tasks. The result is, on average, a speed improvement by 60%, the bicubic resampler gets close to StretchBlt/Halftone. I'm using the routine for writing videos, where it is called at least once for every video frame. So, in my tests I must have called it at least 100 000 times. Problem is, that 3 times there was a timeout in TTask.WaitForAll.

I'm not using any synchronization, because

 

The tasks are completely independent of each other, i.e. none alters anything that the other uses.

They write to memory chunks, which do not overlap.

They don't use any VCL-code, not even VCL.Graphics.

The memory they read does overlap, but I thought that was OK, but after reading up a bit, I get the impression that I am wrong here.

 

From what I gather, I have a "race condition" caused by "not atomic reads" (this is all very new to me, sorry, if I'm talking rubbish).

The threads are (among other things) all reading from two contributor-arrays, which contain records of varying size, so could that be it?

What can I do to prevent the error?

Maybe I should make copies of the arrays for each task?

 

Here is the unit, it only uses a bicubic filter to make things a little simpler:

unit uResample;

interface

uses WinApi.Windows, System.Types, VCL.Graphics, System.SysUtils,
  System.Classes, System.Math;

type
  TZoomProcedure = procedure(const Source, Target: TBitmap; SourceRect: TRectF;
    Radius: single);

  // "bicubic" resampler based on ideas by Anders Melander, Mike Lischke
  // and Eric Grange. It supports float values for filter radii and float-valued
  // zoom rectangles. Source, Target should be pf32bit. Target must be set to the
  // correct size.
procedure ZoomResample(const Source, Target: TBitmap; SourceRect: TRectF;
  Radius: single);

// Same as above divided into 4 parallel tasks
procedure ZoomResampleParallel(const Source, Target: TBitmap;
  SourceRect: TRectF; Radius: single);

implementation

uses WinApi.MMSystem, VCL.Dialogs, System.SyncObjs, System.Threading;

type

  TContributor = record
    min, High: integer;
    // Min: start source pixel
    // High+1: number of source pixels to contribute to the result
    Weights: array of integer; // doubles scaled by $800
  end;

  TContribArray = array of TContributor;

  TBGRAInt = record
    b, g, r, a: integer;
  end;

  PBGRAInt = ^TBGRAInt;

  TBGRAIntArray = array of TBGRAInt;

const
  beta = 0.525; // f(beta)=0
  beta2 = beta * beta;
  alpha = 105 / (16 - 112 * beta2);
  aa = 1 / 7 * alpha;
  bb = -1 / 5 * alpha * (2 + beta2);
  cc = 1 / 3 * alpha * (1 + 2 * beta2);
  dd = -alpha * beta2;
  // constants computed with maple for polynomial fit

function AntiFilter(X: double): double; inline;
// Antiderivative of a filter function similar to bicubic, but I like it better
begin
  if X < -1 then
    Result := -0.5
  else if X < 1 then
    Result := aa * X * X * X * X * X * X * X + bb * X * X * X * X * X + cc * X *
      X * X + dd * X
  else
    Result := 0.5;
end;

procedure MakeContributors(r: single; SourceSize, TargetSize: integer;
  SourceStart, SourceFloatwidth: double; var Contribs: TContribArray);
// r: Filterradius
var
  xCenter, scale, rr: double;
  X, j: integer;
  x1, x2, delta: double;
  TrueMin, TrueMax, Mx: integer;
begin
  Assert(TargetSize > 0);
  if SourceFloatwidth = 0 then
    SourceFloatwidth := SourceSize;
  scale := SourceFloatwidth / TargetSize;
  SetLength(Contribs, TargetSize);

  if scale > 1 then
    // downsampling
    rr := r * scale
  else
    rr := r;
  delta := 1 / rr;

  for X := 0 to TargetSize - 1 do
  begin
    xCenter := (X + 0.5) * scale;
    TrueMin := Ceil(xCenter - rr + SourceStart - 1);
    TrueMax := Floor(xCenter + rr + SourceStart);
    Contribs[X].min := min(max(TrueMin, 0), SourceSize - 1);
    // make sure not to read in negative pixel locations
    Mx := max(min(TrueMax, SourceSize - 1), 0);
    // make sure not to read past w-1 in the source
    Contribs[X].High := Mx - Contribs[X].min;
    Assert(Contribs[X].High >= 0);
    // High=Number of contributing pixels minus 1
    SetLength(Contribs[X].Weights, Contribs[X].High + 1);
    with Contribs[X] do
    begin
      x1 := delta * (min - SourceStart - xCenter);
      for j := 0 to High do
      begin
        x2 := x1 + delta;
        Weights[j] := round($800 * (AntiFilter(x2) - AntiFilter(x1)));
        x1 := x2;
      end;
      for j := TrueMin - min to -1 do
      begin
        // assume the first pixel to be repeated
        x1 := delta * (min + j - SourceStart - xCenter);
        x2 := x1 + delta;
        Weights[0] := Weights[0] +
          round($800 * (AntiFilter(x2) - AntiFilter(x1)));
      end;
      for j := High + 1 to TrueMax - min do
      begin
        // assume the last pixel to be repeated
        x1 := delta * (min + j - SourceStart - xCenter);
        x2 := x1 + delta;
        Weights[High] := Weights[High] +
          round($800 * (AntiFilter(x2) - AntiFilter(x1)));
      end;
    end; { with Contribs[x] }
  end; { for x }
end;

// This writes one column in the target
procedure ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight: integer;
  rStart, rTStart: PByte; runstart: PBGRAInt;
  const ContribsX, ContribsY: TContribArray); inline;
var
  ps, pT: PRGBQuad;
  rs, rT: PByte;
  y, xs, ys: integer;
  highx, highy, minx, miny: integer;
  WeightxStart, Weightx, Weighty: PInteger;
  Total: TBGRAInt;
  run: PBGRAInt;
begin
  rs := rStart;
  highx := ContribsX[X].High;
  minx := ContribsX[X].min;
  inc(rs, 4 * minx);
  WeightxStart := @ContribsX[X].Weights[0];
  run := runstart;
  // runstart points to Columnx[0]
  // Colmnx stores the result of the resampling in x-direction
  for y := ymin to ymax do
  begin
    // For each source line at y
    // Sum up weighted color values at source pixels ContribsX[x].Min+xs
    // 0<=xs<=ContribsX[x].High
    // and store the results in Columnx[y-ymin] etc.
    ps := PRGBQuad(rs);
    Weightx := WeightxStart;
    FillChar(run^, SizeOf(run^), 0);
    for xs := 0 to highx do
    begin
      inc(run.b, Weightx^ * ps.rgbBlue);
      inc(run.g, Weightx^ * ps.rgbGreen);
      inc(run.r, Weightx^ * ps.rgbRed);
      inc(run.a, Weightx^ * ps.rgbReserved);
      inc(Weightx);
      inc(ps);
    end;
    inc(run);
    dec(rs, Sbps);
  end;
  // Average in y-direction:
  // For each target line y sum up weighted colors in
  // Columnx[ys+ContribsY[y].Min-ymin], 0<=ys<=ContribsY[y].High
  // Store result in Total
  // round and assign to TargetPixel pt at [x,y]
  rT := rTStart;
  for y := 0 to NewHeight - 1 do
  begin
    pT := PRGBQuad(rT);
    inc(pT, X);
    highy := ContribsY[y].High;
    miny := ContribsY[y].min - ymin;
    Weighty := @ContribsY[y].Weights[0];
    run := runstart;
    inc(run, miny);
    FillChar(Total, SizeOf(Total), 0);
    for ys := 0 to highy do
    begin
      inc(Total.b, Weighty^ * run.b);
      inc(Total.g, Weighty^ * run.g);
      inc(Total.r, Weighty^ * run.r);
      inc(Total.a, Weighty^ * run.a);
      inc(Weighty);
      inc(run);
    end;
    Total.r := max(Total.r, 0);
    Total.g := max(Total.g, 0);
    Total.b := max(Total.b, 0);
    Total.a := max(Total.a, 0);
    // results could be negative, filter has negative values
    Total.r := (Total.r + $1FFFFF) shr 22; // "round" the result
    Total.g := (Total.g + $1FFFFF) shr 22;
    Total.b := (Total.b + $1FFFFF) shr 22;
    Total.a := (Total.a + $1FFFFF) shr 22;
    pT.rgbBlue := min(Total.b, 255);
    pT.rgbGreen := min(Total.g, 255);
    pT.rgbRed := min(Total.r, 255);
    pT.rgbReserved := min(Total.a, 255);
    dec(rT, Tbps);
  end; // for y
end;

procedure ZoomResampleParallel(const Source, Target: TBitmap;
  SourceRect: TRectF; Radius: single);
var
  ContribsX, ContribsY: TContribArray;

  OldWidth, OldHeight: integer;
  NewWidth, NewHeight: integer;

  Sbps, Tbps: integer;
  rStart, rTStart: PByte; // Row start in Source, Target
  ymin, ymax, i: integer;
  tasks: array [0 .. 3] of ITask;
  Chunk: integer;
begin
  Source.PixelFormat := pf32bit;
  Target.PixelFormat := pf32bit; // for safety
  // Target needs to have been set to correct size
  NewWidth := Target.Width;
  NewHeight := Target.Height;
  OldWidth := Source.Width;
  OldHeight := Source.Height;

  Tbps := ((NewWidth * 32 + 31) and not 31) div 8;
  // BytesPerScanline Target
  Sbps := ((OldWidth * 32 + 31) and not 31) div 8;
  // BytesPerScanline Source

  MakeContributors(Radius, OldWidth, NewWidth, SourceRect.Left,
    SourceRect.Right - SourceRect.Left, ContribsX);
  MakeContributors(Radius, OldHeight, NewHeight, SourceRect.Top,
    SourceRect.Bottom - SourceRect.Top, ContribsY);
  ymin := ContribsY[0].min;
  ymax := ContribsY[NewHeight - 1].High + ContribsY[NewHeight - 1].min;

  rStart := Source.ScanLine[ymin];
  rTStart := Target.ScanLine[0];

  Chunk := NewWidth div 4;

  // stupid repetion!
  tasks[0] := TTask.create(
    procedure
    begin
      var
        Columnx: TBGRAIntArray; // Stores averaged x-Contributors
      var
        X, xmin, xmax: integer;
      var
        runstart: PBGRAInt;
      SetLength(Columnx, ymax - ymin + 1);
      runstart := @Columnx[0];
      xmin := 0;
      xmax := Chunk - 1;
      for X := xmin to xmax do
        ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
          runstart, ContribsX, ContribsY);
    end);

  tasks[1] := TTask.create(
    procedure
    begin
      var
        Columnx: TBGRAIntArray;
      var
        X, xmin, xmax: integer;
      var
        runstart: PBGRAInt;
      SetLength(Columnx, ymax - ymin + 1);
      runstart := @Columnx[0];
      xmin := Chunk;
      xmax := 2 * Chunk - 1;
      for X := xmin to xmax do
        ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
          runstart, ContribsX, ContribsY);
    end);

  tasks[2] := TTask.create(
    procedure
    begin
      var
        Columnx: TBGRAIntArray;
      var
        X, xmin, xmax: integer;
      var
        runstart: PBGRAInt;
      SetLength(Columnx, ymax - ymin + 1);
      runstart := @Columnx[0];
      xmin := 2 * Chunk;
      xmax := 3 * Chunk - 1;
      for X := xmin to xmax do
        ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
          runstart, ContribsX, ContribsY);
    end);

  tasks[3] := TTask.create(
    procedure
    begin
      var
        Columnx: TBGRAIntArray;
      var
        X, xmin, xmax: integer;
      var
        runstart: PBGRAInt;
      SetLength(Columnx, ymax - ymin + 1);
      runstart := @Columnx[0];
      xmin := 3 * Chunk;
      xmax := NewWidth - 1;
      for X := xmin to xmax do
        ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
          runstart, ContribsX, ContribsY);
    end);

  for i := 0 to 3 do
    tasks[i].start;
  if not TTask.WaitForAll(tasks, 10000) then
    Raise exception.create('Time out for parallel resampling threads');
end;

procedure ZoomResample(const Source, Target: TBitmap; SourceRect: TRectF;
Radius: single);
var
  ContribsX, ContribsY: TContribArray;

  OldWidth, OldHeight: integer;
  NewWidth, NewHeight: integer;
  // Target needs to be set to correct size

  Sbps, Tbps: integer;
  rStart, rTStart: PByte; // Row start in Source, Target
  Columnx: TBGRAIntArray; // cache array, records for better memory layout
  X, ymin, ymax: integer;
  runstart: PBGRAInt;
begin
  Source.PixelFormat := pf32bit;
  Target.PixelFormat := pf32bit; // for safety
  NewWidth := Target.Width;
  NewHeight := Target.Height;
  OldWidth := Source.Width;
  OldHeight := Source.Height;

  Tbps := ((NewWidth * 32 + 31) and not 31) div 8;
  // BytesPerScanline Target
  Sbps := ((OldWidth * 32 + 31) and not 31) div 8;
  // BytesPerScanline Source

  MakeContributors(Radius, OldWidth, NewWidth, SourceRect.Left,
    SourceRect.Right - SourceRect.Left, ContribsX);
  MakeContributors(Radius, OldHeight, NewHeight, SourceRect.Top,
    SourceRect.Bottom - SourceRect.Top, ContribsY);
  ymin := ContribsY[0].min;
  ymax := ContribsY[NewHeight - 1].High + ContribsY[NewHeight - 1].min;

  rStart := Source.ScanLine[ymin];
  rTStart := Target.ScanLine[0];

  SetLength(Columnx, ymax - ymin + 1);

  runstart := @Columnx[0];

  // Compute colors for each target column at x
  for X := 0 to NewWidth - 1 do
  begin
    ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
      runstart, ContribsX, ContribsY);
  end; // for x
end;

end.

 

Share this post


Link to post

Instead of just posting your source and let us figure out what you're doing, it would be nice if you instead described exactly what your doing. I.e. what does the overall job do (e.g. it resamples a bitmap), how does it do that (describe the algorithm), how are you dividing the job, what does your individual tasks do, etc. Describe it as if we didn't have the source. This is basically also what your source comments should do.

Share this post


Link to post
13 minutes ago, Anders Melander said:

Instead of just posting your source and let us figure out what you're doing, it would be nice if you instead described exactly what your doing. I.e. what does the overall job do (e.g. it resamples a bitmap), how does it do that (describe the algorithm), how are you dividing the job, what does your individual tasks do, etc. Describe it as if we didn't have the source. This is basically also what your source comments should do.

I'll try but it ain't easy. More than 10 years ago I spent days reading your code, before I understood what made it tick 🙂

Share this post


Link to post

I agree that it isn't very nice to just dump the source code here. So here is the explanation:

 

In the routines, nevermind SourceRect, imagine it to be the full rect of the source bitmap.
The Radius isn't important either.

Contributors:

For pixel(x,y) in the target, ContribsX[x] is a record, that defines the range of x-Values of pixels in the source, which contribute to the value of p(x,y).
The record also contains the weights which have to be applied before summing up.
The further away from the center of the range, the lower the weight (simplified).
The range and the weights are independent of y, so the array ContribsX is computed in advance.

ContribsY[y] does the same for the source pixels in y-direction.

The way the contributors are computed isn't important here, it's done outside of the tasks.

 

Pseudo-Code for computing p(x,y): (ps is a source pixel)
(Imagine the weights to be floats, and results being rounded)

p(x,y) := 0;
for i:=0 to ContribsX[x].high do
  for j:= 0 to ContribsY[y].high do
     p(x,y) := p(x,y) + ContribsX[x].Weights[i] * ContribsY[y].Weights[j] * ps(ContribsX[x].min + i, ContribsY[y].min + j);

Since this isn't very efficient and can't well be divided into tasks, the averaging in x- and y-direction is done separately.

Here's where the routine ProcessColumn comes in, it's the most important procedure, and it writes one column of the target bitmap's memory in 2 steps:

Resample.thumb.png.81c18cb67996130463688c997bf59e92.png

PseudoCode for ProcessColumn:  (x is given, Compute Column at x)

Step 1: resample to intermediate array ColumnX:

for ys:= 0 to OldHeight-1 do
begin
  ColumnX[ys] := 0;
  for i:=0 to ContribsX[x].high do
   ColumnX[ys] := ColumnX[ys] + ContribsX[x].Weights[i]*ps(ContribsX[x].min +i, ys);
end;

Step 2: resample ColumnX to the target column:

for y := 0 to NewHeight-1 do
begin
  p(x,y):=0;
  for j := 0 to ContribsY[y].high do
    p(x,y) := p(x,y) + ContribsY[y].Weights[j]*ColumnX[ContribsY[y].min + j];
end;

The tasks just write 4 different chunks of colums, but they all read from the same source-bitmap-memory and from the same contributor-arrays.

tasks.png.e5772c5811206d0dd3f68ca7d0c4d4d0.png

The original source code is obfuscated by (at least:) 2 things:

   Values that should be float have been scaled to integers
   Pointers are used to walk along the arrays

 

I'm afraid that's still too long.  ..Make it shorter ...Again, half the size

 

Gruß, Renate

Edited by Renate Schaaf

Share this post


Link to post
1 hour ago, Vandrovnik said:

TThread.CreateAnonymousThread.

Thanks, I will try this. I've also tried to change to TParallel.for, which hasn't bugged out so far, but is more erratic timing wise. Problem is, that the error occurs so very rarely, it's hard to say whether or not it's fixed. I wish I had more insight into what could go wrong.

Thx

Renate

Share this post


Link to post

Without going into the why you are getting timeout, could be the issue @Vandrovnik mentioned, could be something else. Depends on your machine, number of cores, system activity too. 

 

But, if you are resampling video frames, that suggests that you are resizing same sized frames to same output size. In that case, calculating contributor weights for each frame again is huge waste. Calculate that once, and then just reuse it.

Share this post


Link to post
37 minutes ago, Dalija Prasnikar said:

But, if you are resampling video frames, that suggests that you are resizing same sized frames to same output size. In that case, calculating contributor weights for each frame again is huge waste. Calculate that once, and then just reuse it.

Good idea, but I'm doing zoom-pans, the fixed resizing I leave to the video-encoder to do.

I want to be reasonably sure that this runs on any system, not just mine .. Time to do some more stress tests 🙂

Share this post


Link to post
22 minutes ago, Renate Schaaf said:

Good idea, but I'm doing zoom-pans, the fixed resizing I leave to the video-encoder to do.

I want to be reasonably sure that this runs on any system, not just mine .. Time to do some more stress tests 🙂

Since, your timeout is quite large, I suppose your requirement is not real-time resizing. In that case AnonymousThread approach can be good enough. I would also suggest using TTask wait without timeout, but 10 seconds is quite long time period, so it is quite possible there are other problems with TTask, and that using no timeout would just hang indefinitely (it might be worth trying, though). If the AnonymousThread approach is not good enough (creating and starting thread takes time - not much, but still) then creating reusable dedicated threads that would take care of resampling could be an option. But that is the most complicated approach, because you would need to communicate back and forth with such long running threads and pass data in and out as it finishes processing pieces of data. It is possible that there are libraries that have something you can reuse - the first place I would look for is OmniThreadLibrary https://github.com/gabr42/OmniThreadLibrary

Share this post


Link to post

While this isn't related to your threading problem, it seems you are processing the bitmap by column instead of by row. This is very bad for performance since each row of each column will start with a cache miss.

I think you will find that if you process all rows, transpose (so columns becomes rows), process all rows, transpose again (rows back to columns), the performance will be significantly better. I have a fast 32-bit (i.e. RGBA) blocked transpose if you need one.

 

Another thing to be aware of when multiple threads read or write to the same memory is that if two threads read and write to two different locations, but those two locations are within the same cache line, then you will generally get a decrease in performance as the cores fight over the cache line.

  • Like 2

Share this post


Link to post
3 hours ago, Anders Melander said:

While this isn't related to your threading problem, it seems you are processing the bitmap by column instead of by row. This is very bad for performance since each row of each column will start with a cache miss.

I think you will find that if you process all rows, transpose (so columns becomes rows), process all rows, transpose again (rows back to columns), the performance will be significantly better. I have a fast 32-bit (i.e. RGBA) blocked transpose if you need one.

 

Another thing to be aware of when multiple threads read or write to the same memory is that if two threads read and write to two different locations, but those two locations are within the same cache line, then you will generally get a decrease in performance as the cores fight over the cache line.

Indeed, this is then called "false sharing" and is very well explained from Microsoft's C++ Herb Sutter with an overview here: https://herbsutter.com/2009/05/15/effective-concurrency-eliminate-false-sharing/

Share this post


Link to post
5 hours ago, Dalija Prasnikar said:

In that case AnonymousThread approach can be good enough

It isn't, the creation of the threads takes too long, in the end the procedure takes about twice the time of the unthreaded one. I would really hate to have to use the approach with waking up existing threads to do the resampling ...

 

5 hours ago, Dalija Prasnikar said:

and that using no timeout would just hang indefinitely (it might be worth trying, though)

I introduced the timeout, because the first time this happened, I had to kill the application.

 

I am right to assume then, that there isn't anything obviously wrong in my approach which would cause the timeout?

5 hours ago, Anders Melander said:

While this isn't related to your threading problem, it seems you are processing the bitmap by column instead of by row.

This is true for the target, and to a degree of the source, too. But the intermediate Columnx is in fact a "row" the way it's allocated. But you are right; and maybe one can get away without the transposing. Also, if the intermediate storage is a TBitmap, the algorithms wouldn't be "symmetric" anymore, the transpose of the target wouldn't be the same as the target of the transpose, but that's probably just nitpick.

Anyway, if you'd care to share the code for the transposing, I'd be glad.

 

5 hours ago, Anders Melander said:

Another thing to be aware of when multiple threads read or write to the same memory is that if two threads read and write to two different locations, but those two locations are within the same cache line, then you will generally get a decrease in performance as the cores fight over the cache line. 

 

2 hours ago, M.Joos said:

Indeed, this is then called "false sharing"

I didn't know that. Thanks. Another point in favor of doing complete rows first, or to have the threads compute chunks of full rows.

 

Thank you all, I have a lot to do.

Share this post


Link to post
31 minutes ago, Renate Schaaf said:

Another point in favor of doing complete rows first, or to have the threads compute chunks of full rows.

IOW, optimal distribution would be not
 

AB
CD

where each letter is a fragment of an image to process by a thread but

Quote

 

A

B

C

D

 

?

Share this post


Link to post
18 minutes ago, Renate Schaaf said:

It isn't, the creation of the threads takes too long, in the end the procedure takes about twice the time of the unthreaded one. I would really hate to have to use the approach with waking up existing threads to do the resampling ... 

Yes, it's relatively costly to create a thread but if you use a thread pool then the threads will only have to be created once.

 

21 minutes ago, Renate Schaaf said:

maybe one can get away without the transposing. Also, if the intermediate storage is a TBitmap, the algorithms wouldn't be "symmetric" anymore, the transpose of the target wouldn't be the same as the target of the transpose, but that's probably just nitpick.

I don't think I follow you. I can't see why the intermediate buffer would need to be a bitmap; It's just a chunk of memory.

Also the transpose if faster than you'd think. After all it's much faster to do two row-by-row passes and two transpositions, than one row-by-row pass and one column-by-column pass. One might then think that it would be smart to do the transposition in place while doing the row-by-row pass, after all you already have the value that needs to be transposed, but that isn't so as writing at the transposed location will flush the cache.

 

Anyway, here's the aptly named SuperDuperTranspose32 (I also have a FastTranspose (MMX) and a SuperTranspose :classic_smile: ). I've been using it in an IIR gaussian blur filter. Zuuuuper fast.

// MatrixTranspose by AW
// http://masm32.com/board/index.php?topic=6140.msg65145#msg65145
// 4x4 matrix transpose by Siekmanski
// http://masm32.com/board/index.php?topic=6127.msg65026#msg65026
// Ported to Delphi by Anders Melander
procedure SuperDuperTranspose32(Src, Dst: Pointer; W, Height: cardinal); register;
type
  dword = cardinal;
  // Parameters:
  // EAX <- Source
  // EDX <- Destination
  // ECX <- Width
  // Stack[0] <- Height
  // Preserves: EDI, ESI, EBX
var
  Source, Destination: Pointer;
  Width: dword;
  X4x4Required: dword;
  Y4x4Required: dword;
  remainderX: dword;
  remainderY: dword;
  destRowSize: dword;
  sourceRowSize: dword;
  savedDest: dword;
asm
  push edi
  push esi
  push ebx

  mov Destination, Dst
  mov Source, Src
  mov Width, W

  // How many cols % 4?
  mov eax, Width
  mov ebx, 4
  mov edx, 0
  div ebx
  mov X4x4Required, eax
  mov remainderX, edx

  // How many rows %4?
  mov eax, Height
  mov ebx, 4
  mov edx, 0
  div ebx
  mov Y4x4Required, eax
  mov remainderY, edx

  mov eax, Height
  shl eax, 2
  mov destRowSize, eax

  mov eax, Width
  shl eax, 2
  mov sourceRowSize, eax

  mov ebx, 0

  @@loop1outer:
  cmp ebx, Y4x4Required // while ebx<Y4x4Required // Height % 4
  jae @@loop1outer_exit

    // find starting point for source
    mov eax, ebx
    mul sourceRowSize
    shl eax, 2

    mov esi, Source
    add esi, eax
    mov ecx, esi // save
    // find starting point for destination
    mov eax, ebx
    shl eax, 4
    mov edi, Destination
    add edi, eax
    mov savedDest, edi // save
    push ebx

    mov ebx,0

    @@loop1inner:
    cmp ebx, X4x4Required// while ebx<X4x4Required
    jae @@loop1inner_exit

      mov eax, ebx
      shl eax, 4
      mov esi, ecx
      add esi, eax
      movups xmm0, [esi]
      add esi, sourceRowSize
      movups xmm1, [esi]
      add esi, sourceRowSize
      movups xmm2, [esi]
      add esi, sourceRowSize
      movups xmm3, [esi]

      movaps      xmm4,xmm0
      movaps      xmm5,xmm2
      unpcklps    xmm4,xmm1
      unpcklps    xmm5,xmm3
      unpckhps    xmm0,xmm1
      unpckhps    xmm2,xmm3
      movaps      xmm1,xmm4
      movaps      xmm6,xmm0
      movlhps     xmm4,xmm5
      movlhps     xmm6,xmm2
      movhlps     xmm5,xmm1
      movhlps     xmm2,xmm0

      mov eax, destRowSize
      shl eax, 2
      mul ebx
      mov edi, savedDest
      add edi, eax

      movups [edi], xmm4
      add edi, destRowSize
      movups [edi], xmm5
      add edi, destRowSize
      movups [edi], xmm6
      add edi, destRowSize
      movups [edi], xmm2

      inc ebx

      jmp @@loop1inner
    @@loop1inner_exit:
    pop ebx

    inc ebx

    jmp @@loop1outer
  @@loop1outer_exit:

  // deal with Height not multiple of 4
  cmp remainderX, 1 // .if remainderX >=1
  jb @@no_extra_x
    mov eax, X4x4Required
    shl eax, 4
    mov esi, Source
    add esi, eax

    mov eax, X4x4Required
    shl eax, 2
    mul destRowSize
    mov edi, Destination
    add edi, eax

    mov edx, 0

    @@extra_x:
    cmp edx, remainderX // while edx < remainderX
    jae @@extra_x_exit

      mov ecx, 0
      mov eax, 0

      @@extra_x_y:
      cmp ecx, Height // while ecx < Height
      jae @@extra_x_y_exit

        mov ebx, dword ptr [esi+eax]
        mov dword ptr [edi+4*ecx], ebx
        add eax, sourceRowSize
        inc ecx

        jmp @@extra_x_y
      @@extra_x_y_exit:

      add esi, 4
      add edi, destRowSize
      inc edx

      jmp @@extra_x
    @@extra_x_exit:

  @@no_extra_x:

  // deal with columns not multiple of 4
  cmp remainderY, 1 // if remainderY >=1
  jb @@no_extra_y

    mov eax, Y4x4Required
    shl eax, 2
    mul sourceRowSize
    mov esi, Source
    add esi, eax

    mov eax, Y4x4Required
    shl eax, 4
    mov edi, Destination
    add edi, eax

    mov edx,0

    @@extra_y:
    cmp edx, remainderY // while edx < remainderY
    jae @@extra_y_exit

      mov ecx, 0
      mov eax, 0

      @@extra_y_x:
      cmp ecx, Width // while ecx < Width
      jae @@extra_y_x_exit

        mov ebx, dword ptr [esi+4*ecx]
        mov dword ptr [edi+eax], ebx
        add eax, destRowSize
        inc ecx

        jmp @@extra_y_x
      @@extra_y_x_exit:

      add esi, sourceRowSize
      add edi, 4
      inc edx

      jmp @@extra_y
    @@extra_y_exit:

  @@no_extra_y:

  pop ebx
  pop esi
  pop edi
end;

 

  • Thanks 1

Share this post


Link to post
53 minutes ago, Anders Melander said:

I can't see why the intermediate buffer would need to be a bitmap; It's just a chunk of memory.

W and Height are the size in byte of the memory chunk? Do you transpose byte by byte, I guess not, wouldn't make any sense for bgra-records. Well, I need to play around with it. Pity I don't read ASM. Is this for 32bit only?

 

Thanks 🙂

 

 

Share this post


Link to post
20 minutes ago, Renate Schaaf said:

W and Height are the size in byte of the memory chunk?

It's the size in dwords (i.e. 32-bit RGBA).

AFAIR it should work with the 64-bit compiler too.

Share this post


Link to post

 

3 hours ago, Renate Schaaf said:

It isn't, the creation of the threads takes too long, in the end the procedure takes about twice the time of the unthreaded one. I would really hate to have to use the approach with waking up existing threads to do the resampling ...

 

I introduced the timeout, because the first time this happened, I had to kill the application.

OK, long timeout was a bit confusing and I could not determine how long each operation should last. 

 

Quote

I am right to assume then, that there isn't anything obviously wrong in my approach which would cause the timeout?

As far as I can see there are no obvious reasons why your approach fails with timeout. It is a bit longish code, so I am not excluding possibility of not catching error in code and possibility of some memory corruption, but that would commonly show much sooner. 

 

Using TTask and reusing threads through its thread pool is definitely faster and more correct approach comparing to creating anonymous threads. Anonymous threads was "fallback" suggestion in case your overall time spent in calculation is long enough so that creating thread overhead pays off.

 

If the TTask is really the issue here (which is also hard to say) another option would be trying some other library. One thing you probably can confirm, by using anonymous threads is that TTask is the culprit here (even though you can never say that with 100% certainty). 

Edited by Dalija Prasnikar
  • Thanks 1

Share this post


Link to post

Meanwhile I've followed your suggestions and rewrote the algorithm so it writes one line at a time. Sadly, I couldn't give SuperDuper a chance to shine, as I would need to transpose memory in chunks of 16 Bytes instead of 4. Still, it got a little faster, and since both unthreaded routines are faster than Graphics32, I don't think I'm doing so badly.

For the threaded version I used

On 4/12/2021 at 5:23 PM, Fr0sT.Brutal said:

A

B

C

D

with a minimum of 2 and a maximum of 8 threads. Is rather fast if it works.

Sadly the fails for the tasks version became, if anything, more frequent, and a Tparallel.for version hung indefinitely a few times. I failed to completely recover from the timeouts, after a fail the next call would likely fail again.

 

So, finally I just set up 8 oldfashioned threads which wake up as needed, do a chunk of the work as needed and  signal that they're done.  I didn't even bother to put in checks for timeouts. It's the fastest yet and has never failed so far. Guess that is a strong indication that the culprit was TTask. If you are curious, here is a 720x540 video that got written at 40 FPS, a new record for me 🙂. Once I do similar optimizations for the transitions, the writing speed might be acceptable.

 

Thanks again to all!

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×