Anders Melander

Members
  • Content Count

    2265
  • Days Won

    117

Everything posted by Anders Melander

  1. Anders Melander

    MAP2PDB - Profiling with VTune

    Works for me so there was probably something wrong with the pdb at that time. I've tried both with a small and a very large application. On the positive side uProf resolved a lot faster than VTune but I'm a bit surprised about how basic the uProf feature set is and I can't really imagine what I would use it for. Also, it has pie charts... WTF?
  2. Anders Melander

    Delphi 5 Printing

    TPrinter/TCanvas is GDI printing. It's likely that there are bugs in TPrinter in Delphi 5 that have since been fixed. I seem to recall that there were quite a lot of them. Buffer overflows and whatnot. What has happened is probably that your application has been using GDI in a way that was invalid but was worked around by Windows and now they've stopped working around it.
  3. Anders Melander

    Is a "bare-minimum" EurekaLog possible?

    One thing to be aware of with EurekaLog is that it, in my experience, makes the link stage unbearably slow for large projects. This alone has made me replace it with madExcept in a few projects. I have my small grievances with madExcept too though. In particular the fact that it pumps the message queue, for no good reason, when processing silent exceptions.
  4. Anders Melander

    Is a "bare-minimum" EurekaLog possible?

    I agree. I believe it's been discussed with Mathias several times but for some reason he's not seen the light.
  5. Anders Melander

    Determining why Delphi App Hangs

    You're in good company; we've all been there.
  6. Unfortunately inline vars are not always equivalent to "with". I'm currently working on a project that has, um.. let's be polite and say, "liberal" use of record arrays. So for example:

         type
           TFoo = record
             // Lots of stuff here
           end;

           TBar = record
             Foo: array of TFoo;
             // Even more stuff
           end;

           TFooBar = array of TBar;

         var
           FooBar: TFooBar;
         begin
           with FooBar[i].Foo[j] do
             WhatEver := 42;

           var Foo := FooBar[i].Foo[j];
           Foo.WhatEver := 42; // Nope.
         end;

     Using an inline var to access an inner record will create a copy of the record while using "with" will use a reference to the record. Only way around that is to use record pointers but the code is horrible enough as it is.
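A minimal sketch of the difference between the copy and the record-pointer workaround mentioned above. The PFoo type, the Integer field and the fixed indices are assumptions made for illustration; inline variables require Delphi 10.3 or later.

```pascal
program InlineVarCopySketch;

{$APPTYPE CONSOLE}

type
  TFoo = record
    WhatEver: Integer; // assumed field type, for illustration only
  end;
  PFoo = ^TFoo;        // hypothetical pointer type, not part of the original post

  TBar = record
    Foo: array of TFoo;
  end;

  TFooBar = array of TBar;

var
  FooBar: TFooBar;
begin
  SetLength(FooBar, 1);
  SetLength(FooBar[0].Foo, 1); // new elements are zero-initialized

  // Inline var with a record value: FooCopy is a separate record,
  // so the array element is left untouched.
  var FooCopy := FooBar[0].Foo[0];
  FooCopy.WhatEver := 42;
  Writeln(FooBar[0].Foo[0].WhatEver); // still 0

  // Inline var with a record pointer: FooRef aliases the array element,
  // which is what "with" does implicitly.
  var FooRef: PFoo := @FooBar[0].Foo[0];
  FooRef^.WhatEver := 42;
  Writeln(FooBar[0].Foo[0].WhatEver); // now 42
end.
```

The pointer is only valid as long as the dynamic array is not resized, which is one more reason the workaround does not make the code any prettier.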
  7. It's the size in dwords (i.e. 32-bit RGBA). AFAIR it should work with the 64-bit compiler too.
  8. Yes, it's relatively costly to create a thread but if you use a thread pool then the threads will only have to be created once.

     I don't think I follow you. I can't see why the intermediate buffer would need to be a bitmap; it's just a chunk of memory. Also the transpose is faster than you'd think. After all it's much faster to do two row-by-row passes and two transpositions, than one row-by-row pass and one column-by-column pass. One might then think that it would be smart to do the transposition in place while doing the row-by-row pass, after all you already have the value that needs to be transposed, but that isn't so as writing at the transposed location will flush the cache.

     Anyway, here's the aptly named SuperDuperTranspose32 (I also have a FastTranspose (MMX) and a SuperTranspose). I've been using it in an IIR gaussian blur filter. Zuuuuper fast.

         // MatrixTranspose by AW
         // http://masm32.com/board/index.php?topic=6140.msg65145#msg65145
         // 4x4 matrix transpose by Siekmanski
         // http://masm32.com/board/index.php?topic=6127.msg65026#msg65026
         // Ported to Delphi by Anders Melander
         procedure SuperDuperTranspose32(Src, Dst: Pointer; W, Height: cardinal); register;
         type
           dword = cardinal;
         // Parameters:
         // EAX <- Source
         // EDX <- Destination
         // ECX <- Width
         // Stack[0] <- Height
         // Preserves: EDI, ESI, EBX
         var
           Source, Destination: Pointer;
           Width: dword;
           X4x4Required: dword;
           Y4x4Required: dword;
           remainderX: dword;
           remainderY: dword;
           destRowSize: dword;
           sourceRowSize: dword;
           savedDest: dword;
         asm
           push edi
           push esi
           push ebx

           mov Destination, Dst
           mov Source, Src
           mov Width, W

           // How many cols % 4?
           mov eax, Width
           mov ebx, 4
           mov edx, 0
           div ebx
           mov X4x4Required, eax
           mov remainderX, edx

           // How many rows %4?
           mov eax, Height
           mov ebx, 4
           mov edx, 0
           div ebx
           mov Y4x4Required, eax
           mov remainderY, edx

           mov eax, Height
           shl eax, 2
           mov destRowSize, eax

           mov eax, Width
           shl eax, 2
           mov sourceRowSize, eax

           mov ebx, 0
         @@loop1outer:
           cmp ebx, Y4x4Required // while ebx<Y4x4Required // Height % 4
           jae @@loop1outer_exit

           // find starting point for source
           mov eax, ebx
           mul sourceRowSize
           shl eax, 2
           mov esi, Source
           add esi, eax
           mov ecx, esi // save

           // find starting point for destination
           mov eax, ebx
           shl eax, 4
           mov edi, Destination
           add edi, eax
           mov savedDest, edi // save

           push ebx
           mov ebx,0
         @@loop1inner:
           cmp ebx, X4x4Required// while ebx<X4x4Required
           jae @@loop1inner_exit

           mov eax, ebx
           shl eax, 4
           mov esi, ecx
           add esi, eax

           movups xmm0, [esi]
           add esi, sourceRowSize
           movups xmm1, [esi]
           add esi, sourceRowSize
           movups xmm2, [esi]
           add esi, sourceRowSize
           movups xmm3, [esi]

           movaps xmm4,xmm0
           movaps xmm5,xmm2
           unpcklps xmm4,xmm1
           unpcklps xmm5,xmm3
           unpckhps xmm0,xmm1
           unpckhps xmm2,xmm3
           movaps xmm1,xmm4
           movaps xmm6,xmm0
           movlhps xmm4,xmm5
           movlhps xmm6,xmm2
           movhlps xmm5,xmm1
           movhlps xmm2,xmm0

           mov eax, destRowSize
           shl eax, 2
           mul ebx
           mov edi, savedDest
           add edi, eax

           movups [edi], xmm4
           add edi, destRowSize
           movups [edi], xmm5
           add edi, destRowSize
           movups [edi], xmm6
           add edi, destRowSize
           movups [edi], xmm2

           inc ebx
           jmp @@loop1inner
         @@loop1inner_exit:
           pop ebx
           inc ebx
           jmp @@loop1outer
         @@loop1outer_exit:

           // deal with Height not multiple of 4
           cmp remainderX, 1 // .if remainderX >=1
           jb @@no_extra_x

           mov eax, X4x4Required
           shl eax, 4
           mov esi, Source
           add esi, eax

           mov eax, X4x4Required
           shl eax, 2
           mul destRowSize
           mov edi, Destination
           add edi, eax

           mov edx, 0
         @@extra_x:
           cmp edx, remainderX // while edx < remainderX
           jae @@extra_x_exit

           mov ecx, 0
           mov eax, 0
         @@extra_x_y:
           cmp ecx, Height // while ecx < Height
           jae @@extra_x_y_exit

           mov ebx, dword ptr [esi+eax]
           mov dword ptr [edi+4*ecx], ebx
           add eax, sourceRowSize
           inc ecx
           jmp @@extra_x_y
         @@extra_x_y_exit:
           add esi, 4
           add edi, destRowSize
           inc edx
           jmp @@extra_x
         @@extra_x_exit:
         @@no_extra_x:

           // deal with columns not multiple of 4
           cmp remainderY, 1 // if remainderY >=1
           jb @@no_extra_y

           mov eax, Y4x4Required
           shl eax, 2
           mul sourceRowSize
           mov esi, Source
           add esi, eax

           mov eax, Y4x4Required
           shl eax, 4
           mov edi, Destination
           add edi, eax

           mov edx,0
         @@extra_y:
           cmp edx, remainderY // while edx < remainderY
           jae @@extra_y_exit

           mov ecx, 0
           mov eax, 0
         @@extra_y_x:
           cmp ecx, Width // while ecx < Width
           jae @@extra_y_x_exit

           mov ebx, dword ptr [esi+4*ecx]
           mov dword ptr [edi+eax], ebx
           add eax, destRowSize
           inc ecx
           jmp @@extra_y_x
         @@extra_y_x_exit:
           add esi, sourceRowSize
           add edi, 4
           inc edx
           jmp @@extra_y
         @@extra_y_exit:
         @@no_extra_y:

           pop ebx
           pop esi
           pop edi
         end;
  9. While this isn't related to your threading problem, it seems you are processing the bitmap by column instead of by row. This is very bad for performance since each row of each column will start with a cache miss. I think you will find that if you process all rows, transpose (so columns become rows), process all rows, transpose again (rows back to columns), the performance will be significantly better. I have a fast 32-bit (i.e. RGBA) blocked transpose if you need one. Another thing to be aware of when multiple threads read or write to the same memory is that if two threads read and write to two different locations, but those two locations are within the same cache line, then you will generally get a decrease in performance as the cores fight over the cache line.
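To illustrate what a blocked transpose of 32-bit pixels looks like, here is a plain Pascal sketch. It is not the SSE routine referred to above; the name BlockedTranspose32, the tile size and the test dimensions are assumptions chosen for the example.

```pascal
program BlockedTransposeSketch;

{$APPTYPE CONSOLE}

const
  BlockSize = 8; // hypothetical tile edge in pixels; tune for the target cache

// Transposes a Width x Height matrix of 32-bit values one tile at a time, so
// that both the reads and the writes stay within a small area of memory.
procedure BlockedTranspose32(const Src: array of Cardinal;
  var Dst: array of Cardinal; Width, Height: Integer);
var
  BlockX, BlockY, X, Y, MaxX, MaxY: Integer;
begin
  BlockY := 0;
  while BlockY < Height do
  begin
    MaxY := BlockY + BlockSize;
    if MaxY > Height then
      MaxY := Height;
    BlockX := 0;
    while BlockX < Width do
    begin
      MaxX := BlockX + BlockSize;
      if MaxX > Width then
        MaxX := Width;
      for Y := BlockY to MaxY - 1 do
        for X := BlockX to MaxX - 1 do
          Dst[X * Height + Y] := Src[Y * Width + X]; // source row -> destination column
      Inc(BlockX, BlockSize);
    end;
    Inc(BlockY, BlockSize);
  end;
end;

const
  W = 10; // deliberately not a multiple of BlockSize
  H = 7;
var
  Src, Dst: array of Cardinal;
  X, Y: Integer;
  Ok: Boolean;
begin
  SetLength(Src, W * H);
  SetLength(Dst, W * H);
  for X := 0 to W * H - 1 do
    Src[X] := Cardinal(X);

  BlockedTranspose32(Src, Dst, W, H);

  // Verify that every source pixel landed at its transposed position.
  Ok := True;
  for Y := 0 to H - 1 do
    for X := 0 to W - 1 do
      if Dst[X * H + Y] <> Src[Y * W + X] then
        Ok := False;
  Writeln('Transpose correct: ', Ok);
end.
```

With a routine like this, a column pass can be expressed as: transpose, run the row pass on the transposed buffer, transpose back.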
  10. Instead of just posting your source and letting us figure out what you're doing, it would be nice if you instead described exactly what you're doing. I.e. what does the overall job do (e.g. it resamples a bitmap), how does it do that (describe the algorithm), how are you dividing the job, what do your individual tasks do, etc. Describe it as if we didn't have the source. This is basically also what your source comments should do.
  11. Anders Melander

    MAP2PDB - Profiling with VTune

    New version (2.5) uploaded.

    Changes since last upload:
      • Include/exclude modules/units from pdb. This helps keep the size of the pdb down and thus reduces the symbol resolve time in VTune.
      • You no longer need to link your projects with debug info. map2pdb will reuse the existing debug section in the exe/dll/bpl if there is one. Otherwise it will create a new one.

    https://bitbucket.org/anders_melander/map2pdb/downloads/

    What's next:
      • Refactoring of the logging code. The current logging is basically just some functions that call WriteLn. This should be replaced with a pluggable log framework so the whole logging mechanism can be replaced. The end goal is to enable integration of the map2pdb core into other projects.
      • A jdbg reader. Embarcadero does not supply map files for the RTL/VCL runtime packages. Instead they ship jdbg files that can be read with the JEDI debug functions. The jdbg files are built from map files so supposedly they contain much, if not all, of the information we need. The task here is to write a reader for the jdbg file format so we can produce pdb files from them.
      • Figure out why VTune is so slow. A never ending task it seems.
  12. Anders Melander

    Build managed dll in Delphi

    Ah Yes, that too. Not insignificant but once you know, you know.
  13. Anders Melander

    Build managed dll in Delphi

    While I completely agree with David's considerations it should be noted that the startup cost of an in-process COM server is considerably smaller than that of an out-of-process COM server. For an in-process server it's basically just the cost of setting up the COM apartment. For an out-of-process server there's also the cost of launching the process. In the cases where I have chosen to use COM for interop with .NET it has mostly been because of the convenience of not having to worry about marshalling and because it's so easy to just wrap a Delphi object in an interface/TAutoIntfObject and pass that on to .NET.
  14. Anders Melander

    MAP2PDB - Profiling with VTune

    Too bad. It looks very nice. I guess I'll take a look at it when my VTune trial expires.
  15. Anders Melander

    MAP2PDB - Profiling with VTune

    Any clues as to what goes wrong?
  16. Anders Melander

    Build managed dll in Delphi

    Yes, I think it is. It's been a while since I did it though so I could be wrong. Maybe start by reading the help http://docwiki.embarcadero.com/RADStudio/Sydney/en/Developing_Interoperable_Applications_Using_COM
  17. Anders Melander

    MAP2PDB - Profiling with VTune

    Let's start with the map file. Zip it and PM it to me.
  18. Anders Melander

    Build managed dll in Delphi

    Just write your DLL as an in-process COM server (or out-of-process if that suits your need better). .NET can use that directly.
  19. Anders Melander

    MAP2PDB - Profiling with VTune

    I don't know about C++Builder and map files but in any case there would be too many differences, caused by all the other stuff that VS outputs, to make that feasible. What I've done previously, when I had to figure out why something didn't work, was to use a hex editor to compare the pdb of VTune's matrix example with the output from "map2pdb -test". They are both sufficiently small. At this time I can pretty much parse pdb just by looking at the hex 🙂 I think the easiest way forward would be to just examine the key suspects (address table, hash tables) of matrix.pdb in a hex editor and verify that they're ordered and structured like we expect them to be. I'm using @mael's HxD editor so it would help if that supported structures. tap.tap... 🙂 Another way would be to write a drop-in replacement for the msdia140.dll in-process COM server, which is what VTune uses to access the PDB data (I considered doing that at one point since it would completely eliminate the need to write pdb). That would tell us exactly what API methods VTune is using and how.
  20. Anders Melander

    MAP2PDB - Profiling with VTune

    Yes.
  21. Anders Melander

    MAP2PDB - Profiling with VTune

    With an exclusion list that removed most of the VCL and RTL as well as DevExpress, TeeChart, Indy and Firedac I managed to reduce the size of my pdb from 200Mb to 35Mb. VTune now loads my project in "just" 5 minutes... It's still struggling though. Everything is incredibly slow. I get the impression that Intel has never tried profiling VTune with VTune.... or maybe they tried and gave up because it was too slow. Here's my command line: map2pdb -v -bind "TurboFooProPlus.map" -exclude:dx*;cx*;system*;winapi*;vcl*;data*;firedac*;soap*;web*;id*
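For readers wondering how a semicolon-separated wildcard list like the one above can be evaluated, here is a small sketch. It is not map2pdb's actual implementation; IsExcluded is a hypothetical helper, and lower-casing both sides to get case-insensitive matching is an assumption about the desired behaviour. MatchesMask comes from the RTL's System.Masks unit.

```pascal
program ExcludeFilterSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Masks;

// Hypothetical helper: returns True if AUnitName matches any of the
// wildcard masks in the semicolon-separated AExcludeList.
function IsExcluded(const AUnitName, AExcludeList: string): Boolean;
var
  Mask: string;
begin
  Result := False;
  for Mask in AExcludeList.Split([';']) do
    // Lower-case both sides so the comparison does not depend on case.
    if (Mask <> '') and MatchesMask(LowerCase(AUnitName), LowerCase(Mask)) then
      Exit(True);
end;

begin
  Writeln(IsExcluded('dxCore', 'dx*;cx*;system*'));     // TRUE
  Writeln(IsExcluded('MyApp.Main', 'dx*;cx*;system*')); // FALSE
end.
```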
  22. Anders Melander

    MAP2PDB - Profiling with VTune

    New version with @Stefan Glienke's improvements uploaded. https://bitbucket.org/anders_melander/map2pdb/downloads/

    Notable changes:
      • VTune now attributes samples to the correct source lines within a block.
      • Removing RTTI from the exe has reduced the size by 15%.
      • The pdb is now being built in a temporary memory stream which is significantly faster due to reduced I/O.
  23. Anders Melander

    MAP2PDB - Profiling with VTune

    I too have that problem. So far I haven't been able to profile our main project because the symbol resolution step takes longer (hours) than I'm willing to wait. It's probably caused by a bug in map2pdb (e.g. some hash table is incorrect) but for now my plan is to add a white/black list option to include and exclude units from the pdb. So for example if I specify the switch -exclude:dx* then any unit that starts with "dx" (i.e. DevExpress) will be excluded from the pdb.

    It's hard to spot but there's one at the bottom of the download page: https://software.intel.com/content/www/us/en/develop/articles/oneapi-standalone-components.html#vtune

    I'm on Windows 7 so I had to find an older version of VTune. VTune 2019 was the last one to support Windows 7. Unfortunately it's a trial and it expires in a week...
  24. Anders Melander

    StockSharp, anybody worked with this?

    N0n53N53. /733t d00d3
  25. Anders Melander

    MAP2PDB - Profiling with VTune

    Fixed the problem where some symbols (in particular generics) either resolved to the calling unit or to a wrong line number in the implementing unit. Thanks to @Stefan Glienke for pestering me about this until it got solved. Source committed and binary uploaded. And now... Pizza! 🍕