
Arnaud Bouchez

Members
  • Content Count

    100
  • Joined

  • Last visited

  • Days Won

    5

Arnaud Bouchez last won the day on February 1

Arnaud Bouchez had the most liked content!

Community Reputation

112 Excellent

4 Followers


  1. Arnaud Bouchez

    Random Access Violation?

    As David wrote, try to make a minimal reproducible example. Just a project with ODBC access, running a SELECT query. I thought it may have been some problem with FPU exceptions, which happen with third-party libraries, but they usually occur in your Delphi code, not in library code....
  2. Arnaud Bouchez

    Shift-F9 dead

    Maybe another program is capturing Shift-F9?
  3. Arnaud Bouchez

    'stdcall' for 32-bit vs. 64-bit?

    From the code point of view, there is a single calling convention in Delphi for x86_64. But in practice, on Win64, there may be several calling conventions, e.g. __vectorcall: https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_vectorcall This is not yet supported by Delphi - and I doubt it will be any day soon, since we have such poor support for vectorization in our compiler. Let's hope a vectorcall; attribute will be implemented someday in Delphi! 🙂 It is supported by FPC - which seems more advanced in terms of optimization and vectorization: https://wiki.freepascal.org/FPC_New_Features_3.2#Support_for_Microsoft.27s_vectorcall_calling_convention

    Last note: for regular calls, there is a single calling convention per ABI on 64-bit. But the ABI itself (i.e. how parameters are passed and how the stack is handled) is not the same on Windows and POSIX. The Windows 64-bit ABI differs from the Linux/POSIX/SystemV 64-bit ABI on x86_64. Check https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions It won't affect regular pascal code. But if you mess with assembly, ensure you use the right registers and stack alignment!
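    To illustrate the register difference between the two 64-bit ABIs, here is a minimal sketch for FPC only (it relies on the FPC-specific CPUX86_64 define and nostackframe modifier). AddInts is a hypothetical helper written for this example, not something taken from a library.

```pascal
program AbiDemo;
{$ifdef FPC}{$mode objfpc}{$asmmode intel}{$endif}

{$ifdef CPUX86_64} // FPC define for x86_64 targets
// Hypothetical helper: adds two integers, reading the arguments from the
// registers mandated by each 64-bit ABI.
function AddInts(a, b: NativeInt): NativeInt; assembler; nostackframe;
asm
  {$ifdef MSWINDOWS}
  lea rax, [rcx + rdx] // Win64 ABI: first two integer arguments in rcx, rdx
  {$else}
  lea rax, [rdi + rsi] // System V ABI: first two integer arguments in rdi, rsi
  {$endif}
end;
{$else}
// Plain pascal fallback for other CPUs: the compiler picks the registers
function AddInts(a, b: NativeInt): NativeInt;
begin
  result := a + b;
end;
{$endif}

begin
  writeln(AddInts(2, 3));
end.
```

    The pascal fallback shows why regular code is not affected: only the hand-written asm body has to know where the arguments live.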
  4. Arnaud Bouchez

    language updates in 10.4?

    What do you mean? That you can put a string to a TNullableInteger? At least under FPC they are type safe: you are required to call NullableInteger(1234) to fill a TNullableInteger variable - you can't write aNullableInteger := 'toto' - or even aNullableInteger := 1234. Delphi is more relaxed about implicit conversions, so under Delphi it is indeed not typesafe, since you can write aNullableInteger := 'toto'.
  5. Arnaud Bouchez

    language updates in 10.4?

    The main change about the language would be the full ARC removal. The memory model is really part of the language, to my understanding. It is just as vital as working with "class" itself. Pure unfair FUD trolling remark: managed records have been available in FPC trunk for a few months, and I guess EMB doesn't like to be behind an Open Source compiler. 😉 We implemented Nullable types using variants, and integrated support in our ORM. It has the advantage of working since Delphi 6, with low overhead, and good integration with the pascal language. See http://blog.synopse.info/post/2015/09/25/ORM-TNullable*-fields-for-NULL-storage - back from 2015!
  6. We use our https://github.com/synopse/mORMot/blob/master/SynLog.pas Open Source logging framework on the server side, sometimes with dozens of threads writing into the same log file. It has very high performance, so logging doesn't slow down the process, and you can do very useful forensics and analysis if needed. Having all threads logging in the same log file is at the same time a nightmare and a blessing. It may be a nightmare since all operations are interleaved and difficult to identify. It is a blessing with the right tool, able to filter for one or several threads, then find out what is really occurring: in this case, a single log file is better than several. For instance, one of our https://livemon.com servers generates TB of logs - see https://leela1.livemon.net/metrics/counters - and we can still handle it. This particular instance has been running for more than 8 months without being restarted, with hundreds of simultaneous connections, logging incoming data every second... 🙂 We defined our simple and very usable log viewer tool - see http://blog.synopse.info/post/2011/08/20/Enhanced-Log-viewer and https://synopse.info/files/html/Synopse mORMot Framework SAD 1.18.html#TITL_103 This is the key to getting something usable from heavily multithreaded logs. So if you can, rather use a single log file per process, with proper thread identification and filtering.
  7. @David Heffernan Yes, the fastest heap is the one not used - I tend to allocate a lot of temporary small buffers (e.g. for number to text conversion) from the stack instead of using a temporary string. See http://blog.synopse.info/post/2011/05/20/How-to-write-fast-multi-thread-Delphi-applications
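    As a minimal sketch of that idea (not the mORMot implementation), a number can be converted to text inside a small buffer owned by the caller, so no temporary string is allocated on the heap. TDigitBuffer, UInt64ToBuffer and PrintNumber are hypothetical names chosen for this example.

```pascal
program StackBufferDemo;
{$ifdef FPC}{$mode objfpc}{$endif}

type
  // 20 digits are enough for any 64-bit unsigned value, plus a #0 terminator
  TDigitBuffer = array[0..20] of AnsiChar;

// Writes the decimal digits of Value at the end of Temp and returns a pointer
// to the first digit. No heap allocation is involved.
function UInt64ToBuffer(Value: UInt64; var Temp: TDigitBuffer): PAnsiChar;
begin
  result := @Temp[High(Temp)];
  result^ := #0;
  repeat
    dec(result);
    result^ := AnsiChar(48 + Byte(Value mod 10)); // 48 = ord('0')
    Value := Value div 10;
  until Value = 0;
end;

procedure PrintNumber(Value: UInt64);
var
  tmp: TDigitBuffer; // local variable, so it lives on the stack
begin
  writeln(UInt64ToBuffer(Value, tmp));
end;

begin
  PrintNumber(0);
  PrintNumber(1234567890);
end.
```

    The returned pointer is only valid while the buffer is in scope, which is the price to pay for skipping the heap.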
  8. @Tommi Prami I guess you found out about Lemire in our latest commits - see e.g. https://github.com/synopse/mORMot/commit/a91dfbe2e63761d724adef0703140e717f5b2f00 🙂 @Stefan Glienke It is to be used with a prime size - as with our code - which also reduces memory consumption, since power of 2 tables are far from optimal in this regard (doubling the number of slots can become problematic). With a prime, it actually enhances the distribution, even with a weak hash function, especially compared to anding with a power of 2. Lemire reduction is as fast as anding with a power of 2, since a multiplication is done in 1 cycle on modern CPUs. Note that Delphi Win32 is not so good at compiling the 64-bit multiplication involved in Lemire's reduction, whereas FPC has no problem using the i386 mul opcode - which already gives 64-bit results.
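    For readers who have not met it, the Lemire reduction maps a 32-bit hash to a slot index by keeping the upper 32 bits of a 64-bit product, instead of computing a modulo. The following is a minimal sketch under that definition. LemireReduce is a hypothetical name, and the code is not copied from the commit linked above.

```pascal
program LemireDemo;
{$ifdef FPC}{$mode objfpc}{$endif}
{$Q-}{$R-}

// Maps a 32-bit hash into the range 0..SlotCount-1 without any division:
// the result is the upper 32 bits of the 64-bit product Hash * SlotCount.
function LemireReduce(Hash, SlotCount: cardinal): cardinal;
begin
  result := (UInt64(Hash) * SlotCount) shr 32;
end;

begin
  // 11 slots, i.e. a prime table size
  writeln(LemireReduce(0, 11));         // lowest hash    -> slot 0
  writeln(LemireReduce($80000000, 11)); // mid-range hash -> slot 5
  writeln(LemireReduce($ffffffff, 11)); // highest hash   -> slot 10
end.
```

    Since the slot depends on the high bits of the hash, any SlotCount works, which is what allows a prime size instead of a power of 2.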
  9. @David Heffernan You just store the ThreadID within the memory block information, or you use a per-convention identification of the memory buffer. Cross-thread deallocations also usually require a ThreadEnded-like event handler, which doesn't exist on Delphi IIRC - but does exist on FPC - so you need to hack TThread.

    @RDP1974 Last time I checked, FastMM4 (trunk or AVX2 fork) doesn't work well with Linux (at least under FPC). Under Delphi + Linux, FastMM4 is not used at all - it just calls libc free/malloc IIRC. I am not convinced the slowness comes from the libc heap - which is very good from our tests - but rather from how Delphi/Linux is not optimized (yet). Other MMs like BrainMM or our ScaleMM are not designed for Linux.

    We also tried a lot of allocators on Linux - see https://github.com/synopse/mORMot/blob/master/SynFPCCMemAligned.pas - in the context of highly multi-threaded servers. In a nutshell, see https://github.com/synopse/mORMot/blob/master/SynFPCCMemAligned.pas#L57 for some numbers. The big issue with those C-based allocators, which is not listed in those comments, apart from consuming a lot of RAM, is that they stop the executable as soon as some GPF occurs: e.g. a double free will raise a SIGABRT! So they are NOT usable in production unless you use them with Valgrind and proper debugging.

    We fell back to using the FPC default heap, which is a bit slower, consumes a lot of RAM (since it has a per-thread heap for smaller blocks) but is very stable. It is written in plain pascal. And the main idea about performance is to avoid as much memory allocation as possible - which is what we tried with mORMot from the ground up: for instance, we define most of the temp strings on the stack, not on the heap.

    I don't think that re-writing a C allocator in pascal would be much faster. It is very likely to be slower. Only a pure asm version may have some noticeable benefits - just like FastMM4.

    And, personally, I wouldn't invest in Delphi for Linux for server processes: FPC is so much more stable, faster and better maintained... for free!
  10. Arnaud Bouchez

    Reading large UTF8 encoded file in chunks

    When reading several MB of buffers, there is no need to read back. Just read the buffer line by line, from the beginning. Use a fast function like our BufferLineLength() above to compute the line length. Then search within the line buffer. If you can keep the buffer smaller than your CPU L3 cache, it may have some benefit. Going that way, the CPU will give you the best performance, for several reasons: 1. the whole line is very likely to remain in L1 cache, so searching for the line feed, then searching for any pattern, will be achieved at full core speed. 2. there will be automatic prefetching from main RAM into L1/L2 cache when reading ahead in a single direction. If your disk is fast enough (NVMe), you can fill buffers in separate threads (use number of CPU cores - 1), then search in parallel from several files (one core per file - it would be more difficult to properly search the same file on multiple cores). If you don't allocate any memory during the process (do not use string), parallel search would scale linearly. Always do proper timing of your search speed - also taking into account the OS disk cache, which is likely to be used during testing, but not with real "cold" files.
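    The single-direction scan described above can be sketched as follows. This is an illustration only: LineLength is a plain pascal stand-in for a fast line-length function such as BufferLineLength(), and LineContains is a deliberately naive search, so all three routine names are assumptions made for this example.

```pascal
program LineScanDemo;
{$ifdef FPC}{$mode objfpc}{$endif}

// Number of bytes before the first #10/#13, or up to PEnd if there is none
function LineLength(P, PEnd: PAnsiChar): NativeInt;
begin
  result := 0;
  while (P + result < PEnd) and not (P[result] in [#10, #13]) do
    inc(result);
end;

// Naive byte-wise search of Pattern within the Len bytes at P
function LineContains(P: PAnsiChar; Len: NativeInt;
  const Pattern: AnsiString): boolean;
var
  i, j, n: NativeInt;
begin
  n := length(Pattern);
  for i := 0 to Len - n do
  begin
    j := 0;
    while (j < n) and (P[i + j] = Pattern[j + 1]) do
      inc(j);
    if j = n then
      exit(true);
  end;
  result := false;
end;

// Walks the buffer once, front to back, and counts the matching lines
function CountMatchingLines(P, PEnd: PAnsiChar;
  const Pattern: AnsiString): integer;
var
  len: NativeInt;
begin
  result := 0;
  while P < PEnd do
  begin
    len := LineLength(P, PEnd);
    if LineContains(P, len, Pattern) then
      inc(result);
    inc(P, len);
    // skip the line terminator: #13#10, #10 or a lone #13
    if (P < PEnd) and (P^ = #13) then
      inc(P);
    if (P < PEnd) and (P^ = #10) then
      inc(P);
  end;
end;

const
  Sample: AnsiString = 'GET /a 200'#13#10'GET /b 404'#10'GET /c 404';
var
  p: PAnsiChar;
begin
  p := PAnsiChar(Sample);
  writeln(CountMatchingLines(p, p + length(Sample), '404'));
end.
```

    Nothing is allocated inside the loop: each line is only a pointer and a length into the buffer that was already read.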
  11. Arnaud Bouchez

    Reading large UTF8 encoded file in chunks

    There is a very fast line feed search, using proper x86_64 SSE assembly, checking 16 bytes per loop iteration, in our SynCommons.pas:

      function BufferLineLength(Text, TextEnd: PUTF8Char): PtrInt;
      {$ifdef CPUX64}
      {$ifdef FPC} nostackframe; assembler; asm {$else} asm .noframe {$endif}
              {$ifdef MSWINDOWS} // Win64 ABI to System-V ABI
              push    rsi
              push    rdi
              mov     rdi, rcx
              mov     rsi, rdx
              {$endif}
              mov     r8, rsi
              sub     r8, rdi // rdi=Text, rsi=TextEnd, r8=TextLen
              jz      @fail
              mov     ecx, edi
              movdqa  xmm0, [rip + @for10]
              movdqa  xmm1, [rip + @for13]
              and     rdi, -16 // check first aligned 16 bytes
              and     ecx, 15 // lower 4 bits indicate misalignment
              movdqa  xmm2, [rdi]
              movdqa  xmm3, xmm2
              pcmpeqb xmm2, xmm0
              pcmpeqb xmm3, xmm1
              por     xmm3, xmm2
              pmovmskb eax, xmm3
              shr     eax, cl // shift out unaligned bytes
              test    eax, eax
              jz      @main
              bsf     eax, eax
              add     rax, rcx
              add     rax, rdi
              sub     rax, rsi
              jae     @fail // don't exceed TextEnd
              add     rax, r8 // rax = TextFound - TextEnd + (TextEnd - Text) = offset
              {$ifdef MSWINDOWS}
              pop     rdi
              pop     rsi
              {$endif}
              ret
      @main:  add     rdi, 16
              sub     rdi, rsi
              jae     @fail
              jmp     @by16
      {$ifdef FPC} align 16 {$else} .align 16 {$endif}
      @for10: dq      $0a0a0a0a0a0a0a0a
              dq      $0a0a0a0a0a0a0a0a
      @for13: dq      $0d0d0d0d0d0d0d0d
              dq      $0d0d0d0d0d0d0d0d
      @by16:  movdqa  xmm2, [rdi + rsi] // check 16 bytes per loop
              movdqa  xmm3, xmm2
              pcmpeqb xmm2, xmm0
              pcmpeqb xmm3, xmm1
              por     xmm3, xmm2
              pmovmskb eax, xmm3
              test    eax, eax
              jnz     @found
              add     rdi, 16
              jnc     @by16
      @fail:  mov     rax, r8 // returns TextLen if no CR/LF found
              {$ifdef MSWINDOWS}
              pop     rdi
              pop     rsi
              {$endif}
              ret
      @found: bsf     eax, eax
              add     rax, rdi
              jc      @fail
              add     rax, r8
              {$ifdef MSWINDOWS}
              pop     rdi
              pop     rsi
              {$endif}
      end;
      {$else} {$ifdef FPC}inline;{$endif}
      var c: cardinal;
      begin
        result := 0;
        dec(PtrInt(TextEnd),PtrInt(Text)); // compute TextLen
        if TextEnd<>nil then
          repeat
            c := ord(Text[result]);
            if c>13 then begin
              inc(result);
              if result>=PtrInt(PtrUInt(TextEnd)) then
                break;
              continue;
            end;
            if (c=10) or (c=13) then
              break;
            inc(result);
            if result>=PtrInt(PtrUInt(TextEnd)) then
              break;
          until false;
      end;
      {$endif CPUX64}

    It will be faster than any UTF-8 decoding for sure. I already hear some people say: "hey, this is premature optimization! the disk is the bottleneck!". But in 2020, my 1TB SSD reads at more than 3GB/s - https://www.sabrent.com/rocket These are real numbers on my laptop. So searching at GB/s speed does make sense. We use similar techniques at https://www.livemon.com/features/log-management With optimized compression, and distributed search, we reach TB/s brute force speed.
  12. Arnaud Bouchez

    Reading large UTF8 encoded file in chunks

    We use similar techniques in our SynCommons.pas unit. See for instance lines 17380 and following:

      // some constants used for UTF-8 conversion, including surrogates
      const
        UTF16_HISURROGATE_MIN = $d800;
        UTF16_HISURROGATE_MAX = $dbff;
        UTF16_LOSURROGATE_MIN = $dc00;
        UTF16_LOSURROGATE_MAX = $dfff;
        UTF8_EXTRABYTES: array[$80..$ff] of byte = (
          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
          1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
          2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
          3,3,3,3,3,3,3,3,4,4,4,4,5,5,0,0);
        UTF8_EXTRA: array[0..6] of record
          offset, minimum: cardinal;
        end = ( // http://floodyberry.wordpress.com/2007/04/14/utf-8-conversion-tricks
          (offset: $00000000; minimum: $00010000),
          (offset: $00003080; minimum: $00000080),
          (offset: $000e2080; minimum: $00000800),
          (offset: $03c82080; minimum: $00010000),
          (offset: $fa082080; minimum: $00200000),
          (offset: $82082080; minimum: $04000000),
          (offset: $00000000; minimum: $04000000));
        UTF8_EXTRA_SURROGATE = 3;
        UTF8_FIRSTBYTE: array[2..6] of byte = ($c0,$e0,$f0,$f8,$fc);

    In fact, the state machine I talked about was just about line feeds, not UTF-8. My guess was that UTF-8 decoding could be avoided during the process. If the lines are not truncated, then UTF-8 and Ansi bytes will be valid sequences. Since lines have to be taken into account when processing logs, a first scan would locate the line feeds, then process the line bytes directly, with no string/UnicodeString conversion at all.

    For fast searching within the UTF-8/Ansi memory buffer, we have some enhanced techniques, e.g. the SBNDM2 algorithm: see TMatch.PrepareContains in our SynTable.pas unit. It is much faster than Pos() or Boyer-Moore for small patterns, with branchless case-insensitivity. It reaches several GB/s of searching speed inside memory buffers. There is even a very fast expression search engine (e.g. search for '404 & mydomain.com') in TExprParserMatch. More convenient than a RegEx to me - for a fast RegEx engine, check https://github.com/BeRo1985/flre/

    Any memory allocation would greatly reduce the performance of the process.
  13. Arnaud Bouchez

    Reading large UTF8 encoded file in chunks

    I would just cut a line bigger than this size - which is very unlikely with a 2MB buffer. Or just don't cut anything, just read the buffer and use a proper simple state machine to decode the content, without allocating any string.
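    A simple state machine of that kind could look like the sketch below. It only counts line feeds, but the same state can drive any per-line processing. TLineScanner and ScanChunk are hypothetical names made up for this example. The single boolean is what lets a #13#10 pair be split across two buffers without being counted twice.

```pascal
program LineStateDemo;
{$ifdef FPC}{$mode objfpc}{$endif}

type
  // State carried from one chunk to the next
  TLineScanner = record
    LineCount: integer; // number of line feeds seen so far
    PendingCR: boolean; // the previous byte was #13
  end;

// Feeds one chunk to the scanner. #13#10, #10 and a lone #13 each count as
// one line feed, even when #13 and #10 fall in two different chunks.
procedure ScanChunk(var S: TLineScanner; P: PAnsiChar; Len: NativeInt);
var
  i: NativeInt;
begin
  for i := 0 to Len - 1 do
    case P[i] of
      #13:
        begin
          inc(S.LineCount);
          S.PendingCR := true;
        end;
      #10:
        begin
          if not S.PendingCR then
            inc(S.LineCount); // a #10 right after #13 is the same line feed
          S.PendingCR := false;
        end;
    else
      S.PendingCR := false;
    end;
end;

var
  s: TLineScanner;
begin
  s.LineCount := 0;
  s.PendingCR := false;
  ScanChunk(s, 'one'#13, 4);    // chunk ends in the middle of #13#10
  ScanChunk(s, #10'two'#10, 5); // next chunk starts with the pending #10
  writeln(s.LineCount);
end.
```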
  14. Arnaud Bouchez

    Reading large UTF8 encoded file in chunks

    @Vandrovnik I guess you didn't understand what I wrote. I proposed to read the files into a buffer (typically 2MB-32MB), chunk by chunk, searching for the line feeds in it. It will work, very efficiently, for any size of input files - even TB.

    Last trick: under Windows, check the FILE_FLAG_SEQUENTIAL_SCAN option when you open such a huge file. It tells the OS cache manager that the file will be read sequentially, so it can optimize its caching for a single pass, which makes it more efficient in your case. See the corresponding function in SynCommons.pas:

      /// overloaded function optimized for one pass file reading
      // - will use e.g. the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows, as stated
      // by http://blogs.msdn.com/b/oldnewthing/archive/2012/01/20/10258690.aspx
      // - under XP, we observed ERROR_NO_SYSTEM_RESOURCES problems with FileRead()
      // bigger than 32MB
      // - under POSIX, calls plain FileOpen(FileName,fmOpenRead or fmShareDenyNone)
      // - is used e.g. by StringFromFile() and TSynMemoryStreamMapped.Create()
      function FileOpenSequentialRead(const FileName: string): Integer;
      begin
      {$ifdef MSWINDOWS}
        result := CreateFile(pointer(FileName),GENERIC_READ,
          FILE_SHARE_READ or FILE_SHARE_WRITE,nil, // same as fmShareDenyNone
          OPEN_EXISTING,FILE_FLAG_SEQUENTIAL_SCAN,0);
      {$else}
        result := FileOpen(FileName,fmOpenRead or fmShareDenyNone);
      {$endif MSWINDOWS}
      end;
  15. Arnaud Bouchez

    Reading large UTF8 encoded file in chunks

    For decoding such log lines, I would not bother about UTF-8 decoding, just about line feed decoding, during file reading. Just read your data into a buffer (bigger than you expect, e.g. of 2MB, not 32KB), search for #13#10 or #10, then decode the UTF-8 or Ansi text in-between - only if really needed. If you don't find a line feed before the end of the buffer, copy the remaining bytes of the last line to the beginning of the buffer, then fill the rest of it from disk. Last but not least, to efficiently process huge log files which are UTF-8 or Ansi encoded, I wouldn't make any conversion to string (UnicodeString), but use raw PAnsiChar or PByteArray pointers, with no memory allocation. We have plenty of low-level search / decoding functions working directly on memory buffers (using pointers) in our Open Source libraries https://github.com/synopse/mORMot/blob/master/SynCommons.pas
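    The buffer handling described above (scan for line feeds, move the incomplete last line to the front, refill) can be sketched as follows. It is a minimal illustration over a generic TStream, not the mORMot code: ForEachLine, TOnLine and CountLine are hypothetical names, and a 16-byte buffer is used only to force several refills on a tiny input.

```pascal
program ChunkedLinesDemo;
{$ifdef FPC}{$mode objfpc}{$endif}

uses
  Classes;

type
  // Hypothetical callback: receives each line as raw bytes, with no string
  TOnLine = procedure(Line: PAnsiChar; Len: NativeInt);

// Reads Source chunk by chunk into one fixed buffer and calls OnLine for
// every line. A line longer than BufSize is delivered in several pieces.
procedure ForEachLine(Source: TStream; BufSize: NativeInt; OnLine: TOnLine);
var
  buf: array of AnsiChar;
  used, start, i, got: NativeInt;
begin
  SetLength(buf, BufSize); // the only allocation, done once
  used := 0;
  repeat
    got := Source.Read(buf[used], BufSize - used);
    inc(used, got);
    start := 0;
    for i := 0 to used - 1 do
      if buf[i] = #10 then
      begin
        if (i > start) and (buf[i - 1] = #13) then
          OnLine(@buf[start], i - start - 1) // strip the #13 of a #13#10 pair
        else
          OnLine(@buf[start], i - start);
        start := i + 1;
      end;
    if (got = 0) or ((start = 0) and (used = BufSize)) then
    begin
      // end of stream, or a line bigger than the buffer: flush what we have
      if used > start then
        OnLine(@buf[start], used - start);
      used := 0;
    end
    else
    begin
      // move the incomplete last line to the beginning of the buffer
      used := used - start;
      if used > 0 then
        Move(buf[start], buf[0], used);
    end;
  until got = 0;
end;

var
  LineCount: integer = 0;

procedure CountLine(Line: PAnsiChar; Len: NativeInt);
begin
  inc(LineCount);
end;

var
  src: TStringStream;
begin
  src := TStringStream.Create('alpha'#13#10'beta'#10'gamma');
  try
    ForEachLine(src, 16, @CountLine);
    writeln(LineCount);
  finally
    src.Free;
  end;
end.
```

    For simplicity the sketch rescans the carried-over bytes after each refill. Remembering the scan position would avoid that, at the cost of one more variable.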