
Renate Schaaf

Everything posted by Renate Schaaf

  1. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    No, that would be the trapezoidal rule, and that is just as bad as using the antiderivatives. Midpoint rule: the integral of f(x) from x1 to x2 is approximately f(0.5*(x1+x2))*(x2-x1). The multiplications by OldScale transform this from the scale of the source (pixel width 1) to the scale of the destination ("pixel width" NewWidth/OldWidth). If you want to know why the integral is a good way of thinking about the algorithm, there's a little article in the doc folder; I couldn't explain it any better here.

    I'm not sure what you're asking, could you do some explaining back? For the pre-multiply mode I'm using precision $100, for the others $800. Is that what you were asking?

    BTW, the check x2 > x1 isn't necessary, Filter(x3) would be zero if it fails.
  2. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    I was hoping for you to untwiddle this 🙂

    Meanwhile I found the reason for the box kernel not being up to par, it's here:

      function TBoxKernel.GetWidth: TFloat;
      begin
        Result := 1; //must be 0.5!
      end;

    I also spotted a mistake in my code. The interval [x0-0.5, x0+0.5] could be completely outside of the support of the filter; in this case a false non-zero weight would be generated. So a check for x2 > x1 needs to be added:

      ...
      for J := Left to Right do
      begin
        x0 := J - Center;
        // previous weight: Filter(x0*OldScale)*OldScale
        x1 := max(x0 - 0.5, -FilterWidth);
        x2 := min(x0 + 0.5, FilterWidth);
        // intersect symmetric interval of length 1 about x0
        // with support of scaled filter
        if (x2 > x1) then
        begin
          x3 := 0.5 * (x2 + x1); // new center
          Weight := Round(Prec * Filter(x3 * OldScale) * OldScale * (x2 - x1));
          // intersection with support entered into the weight
          if Weight <> 0 then
          begin
            Inc(Count, Weight);
            K := Length(Result[I]);
            SetLength(Result[I], K + 1);
            Result[I][K].Pos := Constrain(J, SrcLo, SrcHi - 1);
            Result[I][K].Weight := Weight;
          end;
        end;
      end;

    The same change is needed at the analogous place for the case Scale > 1. The code for the two cases could be unified, but it's easier to understand as it is.
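    The clamping and the emptiness check are language-agnostic; here is a small Python illustration (the function name is mine) of how intersecting the pixel's unit interval with the filter support can come up empty:

```python
def clamped_interval(x0, filter_width):
    """Intersect the unit interval [x0-0.5, x0+0.5] centered at the
    pixel offset x0 with the filter support [-filter_width, filter_width]."""
    x1 = max(x0 - 0.5, -filter_width)
    x2 = min(x0 + 0.5, filter_width)
    return x1, x2

# Pixel interval entirely outside the support: intersection is empty
# (x2 <= x1), so no weight must be generated.
x1, x2 = clamped_interval(3.0, 2.0)
print(x2 > x1)  # False

# Pixel interval straddling the edge of the support: a reduced,
# non-empty interval remains.
x1, x2 = clamped_interval(1.8, 2.0)
print(x2 > x1)  # True
```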
  3. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    I've managed to translate the alternative computation of weights into Graphics32. It was actually quite easy :). The idea is to compute the integral for the convolution with the filter via the midpoint rule. Before, I used the exact antiderivatives, leading to constant underestimation of peaks and valleys in the bitmap function, and thus to a loss of detail. Now pixels not lying totally within the support of the filter get their weight reduced, leading to fewer artifacts, but the peaks are better estimated, so contrast and detail are better preserved (the math is spelled out for readability):

      //Precision of weights,
      //Totals Cb,Cg,Cr,Ca in Resample need to be unscaled by Prec * Prec
      const
        Prec = $800;

      function BuildMappingTableNew(DstLo, DstHi: Integer;
        ClipLo, ClipHi: Integer; SrcLo, SrcHi: Integer;
        Kernel: TCustomKernel): TMappingTable;
      var
        ...
      begin
        ...
        else if Scale < 1 then
        begin
          OldScale := Scale;
          Scale := 1 / Scale;
          FilterWidth := FilterWidth * Scale;
          for I := 0 to ClipW - 1 do
          begin
            if FullEdge then
              Center := SrcLo - 0.5 + (I - DstLo + ClipLo + 0.5) * Scale
            else
              Center := SrcLo + (I - DstLo + ClipLo) * Scale;
            Left := Floor(Center - FilterWidth);
            Right := Ceil(Center + FilterWidth);
            Count := -Prec;
            for J := Left to Right do
            begin
              //changed part
              x0 := J - Center;
              // old weight: Filter(x0*OldScale)*OldScale
              x1 := max(x0 - 0.5, -FilterWidth);
              x2 := min(x0 + 0.5, FilterWidth);
              // intersect symmetric interval of length 1 about x0
              // with support of scaled filter
              x3 := 0.5 * (x2 + x1); // new center
              Weight := Round(Prec * Filter(x3 * OldScale) * OldScale * (x2 - x1));
              // intersection with support entered into the weight
              if Weight <> 0 then
              begin
                Inc(Count, Weight);
                K := Length(Result[I]);
                SetLength(Result[I], K + 1);
                Result[I][K].Pos := Constrain(J, SrcLo, SrcHi - 1);
                Result[I][K].Weight := Weight;
              end;
            end;
            ...

    At first the results were getting too dark and contrast was increased. By increasing the precision of the weights and using my own way of rounding the averaged result into bytes, this no longer seems to be the case:

      if RangeCheck then
      begin
        C.B := min((max(Cb, 0) + $1FFFFF) shr 22, 255); //unscale and round
        C.G := min((max(Cg, 0) + $1FFFFF) shr 22, 255);
        C.R := min((max(Cr, 0) + $1FFFFF) shr 22, 255);
        C.A := min((max(Ca, 0) + $1FFFFF) shr 22, 255);
      end
      else
      begin
        C.B := (Cb + $1FFFFF) shr 22;
        C.G := (Cg + $1FFFFF) shr 22;
        C.R := (Cr + $1FFFFF) shr 22;
        C.A := (Ca + $1FFFFF) shr 22;
      end;
      // Combine it with the background
      case CombineOp of
        dmOpaque:
          DstLine[I] := C.ARGB;
        ...

    The changed file uScalingProcsGR32.pas is attached. If you are interested in a test, here is a short video; zooms and pans have been done with the new Lanczos. The second picture is one of the most notorious in my collection. uScalingProcsGR32.zip
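    The shr 22 comes from the two weight passes each carrying a factor of Prec = $800, so channel totals are scaled by $800 * $800 = 2^22. A Python sketch of that unscaling step (illustrative only, not the attached Delphi; the helper name is mine):

```python
PREC = 0x800  # weight precision per direction; totals carry PREC*PREC = 2**22

def unscale(total):
    """Round a 2**22-scaled channel total back to a byte, clamped to 0..255.
    Adding 0x1FFFFF (= 2**21 - 1) before the shift rounds instead of truncating."""
    return min((max(total, 0) + 0x1FFFFF) >> 22, 255)

# A channel value of 200 whose weights sum exactly to PREC in both directions:
print(unscale(200 * PREC * PREC))  # 200
# Negative overshoot from the filter's negative lobes is clamped to 0:
print(unscale(-12345))  # 0
# Positive overshoot is clamped to 255:
print(unscale(300 * PREC * PREC))  # 255
```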
  4. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    Confirmed. Nice. I've always been disappointed in the performance of 64-bit code; no idea why mine is faster. Meanwhile I have come up with an alternative way to compute the weights, which seems to decrease artifacts while keeping the brilliance. So far I could not translate it into Graphics32; the filters there all live on intervals of different sizes, whereas mine all live on [-1,1], and at this time of the day my aging brain can't deal with the math. Maybe tomorrow.
  5. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    I'm too stupid to create a pull request on GitHub; when I read the help for it, I don't understand the first thing about it. Somehow I can post changes to my own repository using GitHub Desktop, but I don't really understand what it's doing :). So here is the changed GR32_Resamplers.pas.

    I personally think the quality is good enough; on ultra-high DPI I can hardly see the difference between a low-quality and a high-quality filter for "normal" images, glyphs are another story. For me the filters kick in when animating pictures by zooms and pans, the artifacts really show up then. Theoretically the quality could be improved by a higher precision of the weights, which currently run from -256 to 256. Up to $800 should be possible, which I have done for the non-premultiply modes. But again, I have a hard time seeing a difference.

    Also, again theoretically, the algorithm using antiderivatives of filters should yield better results (except there isn't one in closed form for the Lanczos). But though it shows fewer artifacts, it decreases detail, as you have seen. I've probably made some mistake in the implementation. It could be the same kind of mistake you can make when computing numerical derivatives: Small divided by Small. Time to hit the sack. Renate GR32_Resamplers.zip
  6. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    I have been able to make the GR32 resampling as fast as mine unthreaded, by making some simple changes to the procedure GR32_Resamplers.Resample (in the implementation part):

    - changing the order of the X- and Y-loops in the filling of the horizontal buffer, avoiding jumps in the bitmap memory,
    - using pointers to walk along the arrays,
    - turning on compiler optimization for the procedure (biggest improvement).

    If you want to see for yourself, the attachment contains 3 changed .pas files that need to overwrite the corresponding ones in the Algorithms folder under Bitmap Scaling. Renate Bitmap Scaling-Diff.zip
  7. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    Here is a new version with lots of changes (thanks, Anders):

    I have changed the algorithm for all filters except box from continuous- to discrete-space convolution, because the result didn't seem to look as good as expected, maybe rounding errors, further investigation needed. But I just noticed that for my videos I need to go back to continuous space, because that gives much smoother zooms and pans. Anyway, now the algorithm itself is more or less identical to Graphics32, so you have a completely fair comparison. I also followed Anders' idea of including a stopwatch in the signature of the test procedures that can be turned on only if it matters.

    More changes:

    Alpha: I've put some alpha-shenanigans into the loaded bitmaps, so you can see how the alpha channel is handled. There now is a TAlphaCombineMode = (amIndependent, amPreMultiply, amIgnore) doing the following:

    amIndependent: The behavior as before, all channels are sampled independently of one another. This is the behavior of GR32 for drawmode dmOpaque.

    amPreMultiply: RGB is multiplied by alpha prior to resampling, after that nonzero alpha is "unmultiplied". I had to sacrifice some precision for the weights and the alpha multiplication in order to stay within integer range. This is the behavior of WICImage with Source.AlphaFormat = afDefined (it seems to have made the same sacrifices). GR32 with drawmode dmBlend does pre-multiply, but doesn't unmultiply, or rather it "post-multiplies".

    amIgnore: RGB is resampled, alpha is set to 255. Faster for images not needing an alpha channel. This is the behavior of WICImage with Source.AlphaFormat = afIgnored.

    To prevent apples from being compared to oranges I have included a notice when a certain algorithm does not fully support the chosen mode. To avoid code repetition I had to introduce a number of procedural variables, which slow things down a little bit, but it's still nice and fast.

    Threads: The <= 16 threads hardly consume any CPU time while waiting, so for the time being I want to keep them. There is a TTask version included in the source; it has worked so far, but TTask behaves somewhat erratically timing-wise, and I don't understand what it's doing.

    I'll try to write a TKernelResample descendant (unthreaded) for Graphics32, maybe I can speed it up, let's see. Renate Bitmap Scaling-New.zip
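    For illustration, the amPreMultiply round trip described above can be sketched with integer math in Python. This is an assumption-laden sketch (function names are mine, and the real code trades precision differently, as the post notes):

```python
def premultiply(r, g, b, a):
    """Scale RGB by alpha/255 before resampling (first half of amPreMultiply)."""
    return (r * a // 255, g * a // 255, b * a // 255, a)

def unmultiply(r, g, b, a):
    """Undo the premultiplication after resampling.
    At a == 0 the RGB information is gone for good (the 0/0 case),
    so fully transparent pixels come back black."""
    if a == 0:
        return (0, 0, 0, 0)
    return (min(r * 255 // a, 255), min(g * 255 // a, 255),
            min(b * 255 // a, 255), a)

# Opaque pixels survive the round trip exactly:
print(unmultiply(*premultiply(100, 150, 200, 255)))  # (100, 150, 200, 255)
# Fully transparent pixels lose their RGB entirely:
print(unmultiply(*premultiply(100, 150, 200, 0)))    # (0, 0, 0, 0)
```

    At partial alpha the integer division loses a little precision in both directions, which is the sacrifice mentioned above.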
  8. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    Now I've got the pre-multiplication in place, but before I do an update, I have a question for Anders: The pre-multiplication completely erases any RGB info stored at transparent pixels, so those can't be unmultiplied. I can see that this is desirable if the image contains a mask and needs to be rescaled. The RGB part of the scaled image looks something like this, and by the way, the result for WICImage looks the same. What puzzles me though is the fact that Graphics32 still keeps the transparent parts of the RGB in place, so it must have some magic way to compute 0/0, which I would like to learn ...

    Looking further at the source code, I cannot see any pre-multiplication being done:

      C := Src.Bits[X + ClusterY[Y].Pos * Src.Width];
      ClustYW := ClusterY[Y].Weight;
      Inc(Ca, Integer(C shr 24) * ClustYW);
      Inc(Cr, Integer(C and $00FF0000) shr 16 * ClustYW);
      Inc(Cg, Integer(C and $0000FF00) shr 8 * ClustYW);
      Inc(Cb, Integer(C and $000000FF) * ClustYW);

    The bytes of the channels are just lumped into one cardinal, extracted and multiplied by the weights. Or am I just dense, or has this already been done to Src.Bits? Have a nice Sunday, Renate
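    What that GR32 loop does, stripped of pointers, is plain channel extraction by masking and shifting, then a weighted accumulation of the straight (non-premultiplied) values. A Python sketch of just that arithmetic (names are mine):

```python
def split_argb(c):
    """Unpack a 32-bit ARGB value into channel bytes, as the GR32 loop does."""
    a = c >> 24
    r = (c & 0x00FF0000) >> 16
    g = (c & 0x0000FF00) >> 8
    b = c & 0x000000FF
    return a, r, g, b

def accumulate(pixels, weights):
    """Sum weighted channels of straight ARGB pixels; no premultiplication
    happens anywhere in this path."""
    ca = cr = cg = cb = 0
    for c, w in zip(pixels, weights):
        a, r, g, b = split_argb(c)
        ca += a * w
        cr += r * w
        cg += g * w
        cb += b * w
    return ca, cr, cg, cb

print(split_argb(0x80FF0000))  # (128, 255, 0, 0)
```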
  9. Consider this class, which calls Free on itself at the end of a method:

       type
         TDoSomething = class
           Danger: string;
           constructor create;
           procedure DoSomething;
         end;

       implementation

       ...

       constructor TDoSomething.create;
       begin
         Danger := 'Danger!';
       end;

       procedure TDoSomething.DoSomething;
       begin
         ShowMessage('Doing something');
         Sleep(1000);
         ShowMessage('Done doing something');
         Free; //<-----
       end;

       procedure TForm2.Button1Click(Sender: TObject);
       begin
         var DoSomething: TDoSomething := TDoSomething.create;
         DoSomething.DoSomething;
         // ShowMessage(DoSomething.Danger);
       end;

       initialization

       ReportMemoryLeaksOnShutDown := true;

       end.

     I'm asking because I just noticed that I was doing something similar in my code, only it was a bit more hidden. It has worked without complaint for quite a while, so I was wondering why. My guess is, it's because the Free is the very last thing I ask the class to do. In the destructor the class is still alive; after that the memory is freed, and as long as I don't do anything like the commented-out ShowMessage, that should be safe, or no? I am cleaning this up, of course, but I didn't want to miss the chance of learning something; I don't understand what's going on in the Delphi source at this point all that well. Renate
  10. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    My cache arrays are floats scaled to integer, so I can do the multiply while filling the array and the unmultiply on the integers. Let's see. That's a great idea. I'll see whether I can plug it in, but the source of Graphics32 is a bit intimidating. But then, being retired, I have all day to understand it.
  11. Renate Schaaf

    Parallel Resampling of (VCL-) Bitmaps

    I can still do that if it's wanted and it still makes sense (see below). Oh, I need to fix that ASAP. And include the better GR32 routine. Thanks for your feedback, Renate
  12. Thanks, I didn't know that. Otherwise my question (starting with "why") has, as far as I am concerned, been answered by Attila, and I wouldn't want to use any more bandwidth for it.
  13. I'm doing that, unless I'm too dim to notice. I just wanted to understand why it doesn't blow up in my face .. oh well.
  14. That's why I don't use it :). Good advice. So you are saying it is OK (other than maintenance headaches) to keep it there on condition that one follows the rules? I'm saying this because I remember reactions like "Shriek" when somebody posted code like this.
  15. Renate Schaaf

    ANN: Better Translation Manager released

    Hi Anders, thanks a lot for yet another great tool. Totally straightforward, fast and hassle-free to use. I have 3 little issues with it, which I can work around, but maybe you'd like to know (newest compiled version 1.3.8055.21506):

    1. On my 4K monitor scaled to 200% the ribbon text is barely readable. Minimizing the ribbon displays the text correctly when folded out.

    2. Translating a rather long string using the text editor inserts line breaks at the end of every displayed line. Translation Manager then rightly complains about mismatching line breaks. I tried to find a setting which turns this off, but couldn't. Using Notepad for the translation and pasting the translated string into the in-place editor cuts off the string.

    3. The font of the text editor is rather large on my display, adding to the problem in 2.

    Thanks again, Renate
  16. Renate Schaaf

    MAP2PDB - Profiling with VTune

    Confirmed: Works with uProf now. Great Job. I get the same info as with VTune: hotspot timings, processor use, stack graph.
  17. I've tried to divide the work for resizing bitmaps into 4 parallel tasks. The result is, on average, a speed improvement of 60%; the bicubic resampler gets close to StretchBlt/Halftone. I'm using the routine for writing videos, where it is called at least once for every video frame, so in my tests I must have called it at least 100 000 times. The problem is that 3 times there was a timeout in TTask.WaitForAll. I'm not using any synchronization, because:

    - The tasks are completely independent of each other, i.e. none alters anything that another uses.
    - They write to memory chunks which do not overlap.
    - They don't use any VCL code, not even VCL.Graphics.
    - The memory they read does overlap, but I thought that was OK.

    After reading up a bit, however, I get the impression that I am wrong here. From what I gather, I have a "race condition" caused by "non-atomic reads" (this is all very new to me, sorry if I'm talking rubbish). The threads are (among other things) all reading from two contributor arrays, which contain records of varying size, so could that be it? What can I do to prevent the error? Maybe I should make copies of the arrays for each task?

    Here is the unit; it only uses a bicubic filter to make things a little simpler:

      unit uResample;

      interface

      uses
        WinApi.Windows, System.Types, VCL.Graphics, System.SysUtils,
        System.Classes, System.Math;

      type
        TZoomProcedure = procedure(const Source, Target: TBitmap;
          SourceRect: TRectF; Radius: single);

      // "bicubic" resampler based on ideas by Anders Melander, Mike Lischke
      // and Eric Grange. It supports float values for filter radii and
      // float-valued zoom rectangles. Source, Target should be pf32bit.
      // Target must be set to the correct size.
      procedure ZoomResample(const Source, Target: TBitmap;
        SourceRect: TRectF; Radius: single);

      // Same as above divided into 4 parallel tasks
      procedure ZoomResampleParallel(const Source, Target: TBitmap;
        SourceRect: TRectF; Radius: single);

      implementation

      uses
        WinApi.MMSystem, VCL.Dialogs, System.SyncObjs, System.Threading;

      type
        TContributor = record
          min, High: integer;
          // min: start source pixel
          // High+1: number of source pixels to contribute to the result
          Weights: array of integer; // doubles scaled by $800
        end;

        TContribArray = array of TContributor;

        TBGRAInt = record
          b, g, r, a: integer;
        end;

        PBGRAInt = ^TBGRAInt;

        TBGRAIntArray = array of TBGRAInt;

      const
        beta = 0.525; // f(beta)=0
        beta2 = beta * beta;
        alpha = 105 / (16 - 112 * beta2);
        aa = 1 / 7 * alpha;
        bb = -1 / 5 * alpha * (2 + beta2);
        cc = 1 / 3 * alpha * (1 + 2 * beta2);
        dd = -alpha * beta2;
        // constants computed with Maple for polynomial fit

      function AntiFilter(X: double): double; inline;
      // Antiderivative of a filter function similar to bicubic,
      // but I like it better
      begin
        if X < -1 then
          Result := -0.5
        else if X < 1 then
          Result := aa * X * X * X * X * X * X * X + bb * X * X * X * X * X +
            cc * X * X * X + dd * X
        else
          Result := 0.5;
      end;

      procedure MakeContributors(r: single; SourceSize, TargetSize: integer;
        SourceStart, SourceFloatwidth: double; var Contribs: TContribArray);
      // r: filter radius
      var
        xCenter, scale, rr: double;
        X, j: integer;
        x1, x2, delta: double;
        TrueMin, TrueMax, Mx: integer;
      begin
        Assert(TargetSize > 0);
        if SourceFloatwidth = 0 then
          SourceFloatwidth := SourceSize;
        scale := SourceFloatwidth / TargetSize;
        SetLength(Contribs, TargetSize);
        if scale > 1 then // downsampling
          rr := r * scale
        else
          rr := r;
        delta := 1 / rr;
        for X := 0 to TargetSize - 1 do
        begin
          xCenter := (X + 0.5) * scale;
          TrueMin := Ceil(xCenter - rr + SourceStart - 1);
          TrueMax := Floor(xCenter + rr + SourceStart);
          // make sure not to read in negative pixel locations
          Contribs[X].min := min(max(TrueMin, 0), SourceSize - 1);
          // make sure not to read past w-1 in the source
          Mx := max(min(TrueMax, SourceSize - 1), 0);
          Contribs[X].High := Mx - Contribs[X].min;
          // High = number of contributing pixels minus 1
          Assert(Contribs[X].High >= 0);
          SetLength(Contribs[X].Weights, Contribs[X].High + 1);
          with Contribs[X] do
          begin
            x1 := delta * (min - SourceStart - xCenter);
            for j := 0 to High do
            begin
              x2 := x1 + delta;
              Weights[j] := round($800 * (AntiFilter(x2) - AntiFilter(x1)));
              x1 := x2;
            end;
            // assume the first pixel to be repeated
            for j := TrueMin - min to -1 do
            begin
              x1 := delta * (min + j - SourceStart - xCenter);
              x2 := x1 + delta;
              Weights[0] := Weights[0] +
                round($800 * (AntiFilter(x2) - AntiFilter(x1)));
            end;
            // assume the last pixel to be repeated
            for j := High + 1 to TrueMax - min do
            begin
              x1 := delta * (min + j - SourceStart - xCenter);
              x2 := x1 + delta;
              Weights[High] := Weights[High] +
                round($800 * (AntiFilter(x2) - AntiFilter(x1)));
            end;
          end; { with Contribs[X] }
        end; { for X }
      end;

      // This writes one column in the target
      procedure ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight: integer;
        rStart, rTStart: PByte; runstart: PBGRAInt;
        const ContribsX, ContribsY: TContribArray); inline;
      var
        ps, pT: PRGBQuad;
        rs, rT: PByte;
        y, xs, ys: integer;
        highx, highy, minx, miny: integer;
        WeightxStart, Weightx, Weighty: PInteger;
        Total: TBGRAInt;
        run: PBGRAInt;
      begin
        rs := rStart;
        highx := ContribsX[X].High;
        minx := ContribsX[X].min;
        inc(rs, 4 * minx);
        WeightxStart := @ContribsX[X].Weights[0];
        // runstart points to Columnx[0];
        // Columnx stores the result of the resampling in x-direction
        run := runstart;
        for y := ymin to ymax do
        begin
          // For each source line at y sum up weighted color values at
          // source pixels ContribsX[X].min+xs, 0<=xs<=ContribsX[X].High,
          // and store the results in Columnx[y-ymin] etc.
          ps := PRGBQuad(rs);
          Weightx := WeightxStart;
          FillChar(run^, SizeOf(run^), 0);
          for xs := 0 to highx do
          begin
            inc(run.b, Weightx^ * ps.rgbBlue);
            inc(run.g, Weightx^ * ps.rgbGreen);
            inc(run.r, Weightx^ * ps.rgbRed);
            inc(run.a, Weightx^ * ps.rgbReserved);
            inc(Weightx);
            inc(ps);
          end;
          inc(run);
          dec(rs, Sbps);
        end;
        // Average in y-direction:
        // For each target line y sum up weighted colors in
        // Columnx[ys+ContribsY[y].min-ymin], 0<=ys<=ContribsY[y].High.
        // Store the result in Total, round and assign to the target
        // pixel pT at [X,y].
        rT := rTStart;
        for y := 0 to NewHeight - 1 do
        begin
          pT := PRGBQuad(rT);
          inc(pT, X);
          highy := ContribsY[y].High;
          miny := ContribsY[y].min - ymin;
          Weighty := @ContribsY[y].Weights[0];
          run := runstart;
          inc(run, miny);
          FillChar(Total, SizeOf(Total), 0);
          for ys := 0 to highy do
          begin
            inc(Total.b, Weighty^ * run.b);
            inc(Total.g, Weighty^ * run.g);
            inc(Total.r, Weighty^ * run.r);
            inc(Total.a, Weighty^ * run.a);
            inc(Weighty);
            inc(run);
          end;
          // results could be negative, the filter has negative values
          Total.r := max(Total.r, 0);
          Total.g := max(Total.g, 0);
          Total.b := max(Total.b, 0);
          Total.a := max(Total.a, 0);
          // "round" the result
          Total.r := (Total.r + $1FFFFF) shr 22;
          Total.g := (Total.g + $1FFFFF) shr 22;
          Total.b := (Total.b + $1FFFFF) shr 22;
          Total.a := (Total.a + $1FFFFF) shr 22;
          pT.rgbBlue := min(Total.b, 255);
          pT.rgbGreen := min(Total.g, 255);
          pT.rgbRed := min(Total.r, 255);
          pT.rgbReserved := min(Total.a, 255);
          dec(rT, Tbps);
        end; // for y
      end;

      procedure ZoomResampleParallel(const Source, Target: TBitmap;
        SourceRect: TRectF; Radius: single);
      var
        ContribsX, ContribsY: TContribArray;
        OldWidth, OldHeight: integer;
        NewWidth, NewHeight: integer;
        Sbps, Tbps: integer;
        rStart, rTStart: PByte; // row start in Source, Target
        ymin, ymax, i: integer;
        tasks: array [0 .. 3] of ITask;
        Chunk: integer;
      begin
        Source.PixelFormat := pf32bit;
        Target.PixelFormat := pf32bit; // for safety
        // Target needs to have been set to the correct size
        NewWidth := Target.Width;
        NewHeight := Target.Height;
        OldWidth := Source.Width;
        OldHeight := Source.Height;
        Tbps := ((NewWidth * 32 + 31) and not 31) div 8; // bytes per scanline Target
        Sbps := ((OldWidth * 32 + 31) and not 31) div 8; // bytes per scanline Source
        MakeContributors(Radius, OldWidth, NewWidth, SourceRect.Left,
          SourceRect.Right - SourceRect.Left, ContribsX);
        MakeContributors(Radius, OldHeight, NewHeight, SourceRect.Top,
          SourceRect.Bottom - SourceRect.Top, ContribsY);
        ymin := ContribsY[0].min;
        ymax := ContribsY[NewHeight - 1].High + ContribsY[NewHeight - 1].min;
        rStart := Source.ScanLine[ymin];
        rTStart := Target.ScanLine[0];
        Chunk := NewWidth div 4;
        // stupid repetition!
        tasks[0] := TTask.create(
          procedure
          begin
            var Columnx: TBGRAIntArray; // stores averaged x-contributors
            var X, xmin, xmax: integer;
            var runstart: PBGRAInt;
            SetLength(Columnx, ymax - ymin + 1);
            runstart := @Columnx[0];
            xmin := 0;
            xmax := Chunk - 1;
            for X := xmin to xmax do
              ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart,
                rTStart, runstart, ContribsX, ContribsY);
          end);
        tasks[1] := TTask.create(
          procedure
          begin
            var Columnx: TBGRAIntArray;
            var X, xmin, xmax: integer;
            var runstart: PBGRAInt;
            SetLength(Columnx, ymax - ymin + 1);
            runstart := @Columnx[0];
            xmin := Chunk;
            xmax := 2 * Chunk - 1;
            for X := xmin to xmax do
              ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart,
                rTStart, runstart, ContribsX, ContribsY);
          end);
        tasks[2] := TTask.create(
          procedure
          begin
            var Columnx: TBGRAIntArray;
            var X, xmin, xmax: integer;
            var runstart: PBGRAInt;
            SetLength(Columnx, ymax - ymin + 1);
            runstart := @Columnx[0];
            xmin := 2 * Chunk;
            xmax := 3 * Chunk - 1;
            for X := xmin to xmax do
              ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart,
                rTStart, runstart, ContribsX, ContribsY);
          end);
        tasks[3] := TTask.create(
          procedure
          begin
            var Columnx: TBGRAIntArray;
            var X, xmin, xmax: integer;
            var runstart: PBGRAInt;
            SetLength(Columnx, ymax - ymin + 1);
            runstart := @Columnx[0];
            xmin := 3 * Chunk;
            xmax := NewWidth - 1;
            for X := xmin to xmax do
              ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart,
                rTStart, runstart, ContribsX, ContribsY);
          end);
        for i := 0 to 3 do
          tasks[i].start;
        if not TTask.WaitForAll(tasks, 10000) then
          raise exception.create('Time out for parallel resampling threads');
      end;

      procedure ZoomResample(const Source, Target: TBitmap;
        SourceRect: TRectF; Radius: single);
      var
        ContribsX, ContribsY: TContribArray;
        OldWidth, OldHeight: integer;
        NewWidth, NewHeight: integer; // Target needs to be set to correct size
        Sbps, Tbps: integer;
        rStart, rTStart: PByte; // row start in Source, Target
        Columnx: TBGRAIntArray; // cache array, records for better memory layout
        X, ymin, ymax: integer;
        runstart: PBGRAInt;
      begin
        Source.PixelFormat := pf32bit;
        Target.PixelFormat := pf32bit; // for safety
        NewWidth := Target.Width;
        NewHeight := Target.Height;
        OldWidth := Source.Width;
        OldHeight := Source.Height;
        Tbps := ((NewWidth * 32 + 31) and not 31) div 8; // bytes per scanline Target
        Sbps := ((OldWidth * 32 + 31) and not 31) div 8; // bytes per scanline Source
        MakeContributors(Radius, OldWidth, NewWidth, SourceRect.Left,
          SourceRect.Right - SourceRect.Left, ContribsX);
        MakeContributors(Radius, OldHeight, NewHeight, SourceRect.Top,
          SourceRect.Bottom - SourceRect.Top, ContribsY);
        ymin := ContribsY[0].min;
        ymax := ContribsY[NewHeight - 1].High + ContribsY[NewHeight - 1].min;
        rStart := Source.ScanLine[ymin];
        rTStart := Target.ScanLine[0];
        SetLength(Columnx, ymax - ymin + 1);
        runstart := @Columnx[0];
        // Compute colors for each target column at X
        for X := 0 to NewWidth - 1 do
          ProcessColumn(X, ymin, ymax, Sbps, Tbps, NewHeight, rStart, rTStart,
            runstart, ContribsX, ContribsY);
      end;

      end.
  18. Meanwhile I've followed your suggestions and rewrote the algorithm so that it writes one line at a time. Sadly, I couldn't give SuperDuper a chance to shine, as I would need to transpose memory in chunks of 16 bytes instead of 4. Still, it got a little faster, and since both unthreaded routines are faster than Graphics32, I don't think I'm doing so badly. For the threaded version I used a minimum of 2 and a maximum of 8 threads. It is rather fast when it works. Sadly the failures of the tasks version became, if anything, more frequent, and a TParallel.For version hung indefinitely a few times. I failed to completely recover from the timeouts; after a failure the next call would likely fail again. So, finally I just set up 8 old-fashioned threads which wake up as needed, do a chunk of the work and signal that they're done. I didn't even bother to put in checks for timeouts. It's the fastest yet and has never failed so far. I guess that is a strong indication that the culprit was TTask. If you are curious, here is a 720x540 video that got written at 40 FPS, a new record for me 🙂. Once I do similar optimizations for the transitions, the writing speed might be acceptable. Thanks again to all!
  19. W and Height are the size in bytes of the memory chunk? Do you transpose byte by byte? I guess not, that wouldn't make any sense for BGRA records. Well, I need to play around with it. Pity I don't read ASM. Is this for 32-bit only? Thanks 🙂
  20. It isn't; the creation of the threads takes too long, and in the end the procedure takes about twice the time of the unthreaded one. I would really hate to have to use the approach of waking up existing threads to do the resampling ... I introduced the timeout because the first time this happened, I had to kill the application. Am I right to assume, then, that there isn't anything obviously wrong in my approach which would cause the timeout? This is true for the target, and to a degree for the source, too. But the intermediate Columnx is in fact a "row" the way it's allocated. But you are right, and maybe one can get away without the transposing. Also, if the intermediate storage is a TBitmap, the algorithm wouldn't be "symmetric" anymore; the transpose of the target wouldn't be the same as the target of the transpose, but that's probably just nitpicking. Anyway, if you'd care to share the code for the transposing, I'd be glad. I didn't know that, thanks. Another point in favor of doing complete rows first, or of having the threads compute chunks of full rows. Thank you all, I have a lot to do.
  21. Good idea, but I'm doing zoom-pans, the fixed resizing I leave to the video-encoder to do. I want to be reasonably sure that this runs on any system, not just mine .. Time to do some more stress tests 🙂
  22. Thanks, I will try this. I've also tried to change to TParallel.for, which hasn't bugged out so far, but is more erratic timing wise. Problem is, that the error occurs so very rarely, it's hard to say whether or not it's fixed. I wish I had more insight into what could go wrong. Thx Renate
  23. I agree that it isn't very nice to just dump the source code here, so here is the explanation. In the routines, never mind SourceRect, imagine it to be the full rect of the source bitmap. The Radius isn't important either.

    Contributors: For pixel p(x,y) in the target, ContribsX[x] is a record that defines the range of x-values of pixels in the source which contribute to the value of p(x,y). The record also contains the weights which have to be applied before summing up: the further away from the center of the range, the lower the weight (simplified). The range and the weights are independent of y, so the array ContribsX is computed in advance. ContribsY[y] does the same for the source pixels in y-direction. The way the contributors are computed isn't important here; it's done outside of the tasks.

    Pseudo-code for computing p(x,y), where ps is a source pixel (imagine the weights to be floats, and results being rounded):

      p(x,y) := 0;
      for i := 0 to ContribsX[x].high do
        for j := 0 to ContribsY[y].high do
          p(x,y) := p(x,y) + ContribsX[x].Weights[i] * ContribsY[y].Weights[j] *
            ps(ContribsX[x].min + i, ContribsY[y].min + j);

    Since this isn't very efficient and can't easily be divided into tasks, the averaging in x- and y-direction is done separately. Here's where the routine ProcessColumn comes in; it's the most important procedure, and it writes one column of the target bitmap's memory in 2 steps.

    Pseudo-code for ProcessColumn (x is given, compute the column at x):

      // Step 1: resample to the intermediate array ColumnX
      for ys := 0 to OldHeight - 1 do
      begin
        ColumnX[ys] := 0;
        for i := 0 to ContribsX[x].high do
          ColumnX[ys] := ColumnX[ys] +
            ContribsX[x].Weights[i] * ps(ContribsX[x].min + i, ys);
      end;

      // Step 2: resample ColumnX to the target column
      for y := 0 to NewHeight - 1 do
      begin
        p(x,y) := 0;
        for j := 0 to ContribsY[y].high do
          p(x,y) := p(x,y) + ContribsY[y].Weights[j] * ColumnX[ContribsY[y].min + j];
      end;

    The tasks just write 4 different chunks of columns, but they all read from the same source bitmap memory and from the same contributor arrays. The original source code is obscured by (at least) 2 things: values that should be float have been scaled to integers, and pointers are used to walk along the arrays. I'm afraid that's still too long ... make it shorter ... again, half the size. Regards, Renate
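    The two-pass scheme above can be sketched as a tiny, pointer-free Python program. This is an illustrative reimplementation of the separable idea only (names and the list-of-tuples contributor encoding are mine), not the Delphi unit:

```python
def resample_1d(row, contribs):
    """Apply contributors to one row of samples.
    Each contributor is (min, weights): sum weights[i] * row[min + i]."""
    out = []
    for cmin, weights in contribs:
        out.append(sum(w * row[cmin + i] for i, w in enumerate(weights)))
    return out

def resample_2d(img, contribs_x, contribs_y):
    """Two-pass separable resampling: x-direction first (step 1),
    then y-direction on the intermediate result (step 2)."""
    tmp = [resample_1d(row, contribs_x) for row in img]   # step 1, per source row
    cols = list(zip(*tmp))                                # transpose
    out_cols = [resample_1d(list(col), contribs_y) for col in cols]  # step 2
    return [list(row) for row in zip(*out_cols)]          # transpose back

# 2x2 -> 1x1 box average (all four source pixels weighted 0.25 overall):
img = [[0, 4], [8, 12]]
cx = [(0, [0.5, 0.5])]
cy = [(0, [0.5, 0.5])]
print(resample_2d(img, cx, cy))  # [[6.0]]
```

    Because the contributor arrays are only read, each worker can process its own chunk of output columns exactly as the four tasks in the posted unit do.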
  24. I'll try but it ain't easy. More than 10 years ago I spent days reading your code, before I understood what made it tick 🙂
  25. Renate Schaaf

    MAP2PDB - Profiling with VTune

    That would be great. Here are some relevant lines of the log file, I doubt they are of any use, though:

      2021.04.08 14:23:10.909 #DEBUG #PeFile::Open #PeFile.cpp(111) #Executable (PE) C:\Windows\SysWOW64\ntdll.dll opened
      2021.04.08 14:23:10.137 #DEBUG #PeFile::InitializeSymbolEngine #PeFile.cpp(699) #Executable (PE) C:\Windows\SysWOW64\ntdll.dll started Symbol Engine initialization
      2021.04.08 14:23:10.752 #DEBUG #PeFile::InitializeSymbolEngine #PeFile.cpp(751) #Executable (PE) C:\Windows\SysWOW64\ntdll.dll initialized Symbol Engine: PDB
      2021.04.08 14:23:10.662 #DEBUG #PeFile::Open #PeFile.cpp(111) #Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe opened
      2021.04.08 14:23:10.945 #DEBUG #PeFile::InitializeSymbolEngine #PeFile.cpp(699) #Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe started Symbol Engine initialization
      2021.04.08 14:23:10.113 #DEBUG #PeFile::InitializeSymbolEngine #PeFile.cpp(758) #Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe failed to initialize Symbol Engine: PDB
      2021.04.08 14:23:10.740 #DEBUG #PeFile::InitializeSymbolEngine #PeFile.cpp(778) #Executable (PE) D:\DelphiSource\DelphiRio\mystuffR\uProfTest\Win32\Debug\uProfTest.exe initialized Symbol Engine: COFF