Clément 148 Posted January 26, 2020
Hi, I'm using Delphi Rio to read some rather large log files (some can reach 5 GB). The encoding might be Ansi or UTF-8; the log files are created by different systems and each encodes differently. I would like to use TFileStream (or TBufferedFileStream) to read the file in chunks of 32 KB, but how can I be sure that a UTF-8 character split between chunks will be decoded correctly?
David Heffernan 2345 Posted January 26, 2020
https://stackoverflow.com/questions/56724326/find-longest-utf-8-sequence-without-breaking-multi-byte-sequences
Self-synchronization is the keyword. Once you understand that, the code will emerge.
Uwe Raabe 2057 Posted January 26, 2020 (edited)
Streams don't know about encoding. You might do better with a TStreamReader. In its FillBuffer method there is code taking care of your concern (look for ExtraByteCount).
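For illustration, a minimal sketch of that TStreamReader approach (the file name, fallback encoding and per-line handling are placeholders, not from the original post): the reader performs the decoding and keeps multi-byte sequences intact across its internal buffer refills, so no split character ever reaches your code. For files without a BOM you still have to pick the fallback encoding (UTF-8 vs. Ansi) yourself.

// For illustration only - requires System.Classes and System.SysUtils
procedure ScanLogFile(const AFileName: string);
var
  Stream: TBufferedFileStream;
  Reader: TStreamReader;
  Line: string;
begin
  Stream := TBufferedFileStream.Create(AFileName, fmOpenRead or fmShareDenyNone);
  try
    // DetectBOM = True; falls back to UTF-8 when no BOM is present
    Reader := TStreamReader.Create(Stream, TEncoding.UTF8, True, 32 * 1024);
    try
      while not Reader.EndOfStream do
      begin
        Line := Reader.ReadLine;
        // ... process Line ...
      end;
    finally
      Reader.Free;
    end;
  finally
    Stream.Free;
  end;
end;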
Arnaud Bouchez 407 Posted January 27, 2020 (edited)
For decoding such log lines, I would not bother with UTF-8 decoding, only with line-feed detection, during file reading. Just read your data into a buffer (bigger than you expect, e.g. 2 MB rather than 32 KB), search for #13#10 or #10, then decode the UTF-8 or Ansi text in between - only if really needed. If you don't find a line feed before the end of the buffer, copy the bytes remaining from the last line to the beginning of the buffer, then fill the rest from disk.
Last but not least, to efficiently process huge log files which are UTF-8 or Ansi encoded, I wouldn't convert anything to string (UnicodeString), but use raw PAnsiChar or PByteArray pointers, with no memory allocation. We have plenty of low-level search/decoding functions working directly on memory buffers (using pointers) in our Open Source libraries: https://github.com/synopse/mORMot/blob/master/SynCommons.pas
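A rough sketch of that chunked scan, for illustration only (the buffer size, names and line handling are placeholders): each chunk is searched for line feeds and the unfinished tail is copied to the front of the buffer before the next read, so neither a line nor a UTF-8 sequence is ever split.

// For illustration only - requires System.Classes and System.SysUtils
procedure ScanLogInChunks(const AFileName: string);
const
  BUF_SIZE = 2 * 1024 * 1024; // 2 MB, as suggested above
var
  Stream: TFileStream;
  Buf: TBytes;
  Pending, NewBytes, Total, LineStart, I: Integer;
begin
  SetLength(Buf, BUF_SIZE);
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyNone);
  try
    Pending := 0;
    repeat
      NewBytes := Stream.Read(Buf[Pending], BUF_SIZE - Pending);
      if NewBytes <= 0 then
        Break; // end of file; Pending > 0 here means a last line without a trailing line feed
      Total := Pending + NewBytes;
      LineStart := 0;
      for I := 0 to Total - 1 do
        if Buf[I] = 10 then // #10 (a preceding #13 can simply be trimmed)
        begin
          // process the raw bytes Buf[LineStart .. I - 1] here; decode only if really needed
          LineStart := I + 1;
        end;
      Pending := Total - LineStart;
      if Pending = BUF_SIZE then
        raise Exception.Create('Line longer than the buffer - enlarge BUF_SIZE');
      if Pending > 0 then
        Move(Buf[LineStart], Buf[0], Pending); // carry the partial line to the front
    until False;
  finally
    Stream.Free;
  end;
end;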
Vandrovnik 214 Posted January 27, 2020
5 hours ago, Arnaud Bouchez said: For decoding such log lines, I would not bother with UTF-8 decoding, only with line-feed detection, during file reading. (...)
He expects files of up to 5 GB; many machines will not be able to keep that in RAM.
Arnaud Bouchez 407 Posted January 31, 2020
@Vandrovnik I guess you didn't understand what I wrote. I proposed to read the file into a buffer (typically 2 MB - 32 MB), chunk by chunk, searching for the line feeds in it. It will work, very efficiently, for any size of input file - even TB.
Last trick: under Windows, use the FILE_FLAG_SEQUENTIAL_SCAN option when you open such a huge file. It hints the OS cache manager to optimize for sequential access (aggressive read-ahead), which makes it more efficient in your case. See the corresponding function in SynCommons.pas:

/// overloaded function optimized for one pass file reading
// - will use e.g. the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows, as stated
// by http://blogs.msdn.com/b/oldnewthing/archive/2012/01/20/10258690.aspx
// - under XP, we observed ERROR_NO_SYSTEM_RESOURCES problems with FileRead()
// bigger than 32MB
// - under POSIX, calls plain FileOpen(FileName,fmOpenRead or fmShareDenyNone)
// - is used e.g. by StringFromFile() and TSynMemoryStreamMapped.Create()
function FileOpenSequentialRead(const FileName: string): Integer;
begin
  {$ifdef MSWINDOWS}
  result := CreateFile(pointer(FileName), GENERIC_READ,
    FILE_SHARE_READ or FILE_SHARE_WRITE, nil, // same as fmShareDenyNone
    OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
  {$else}
  result := FileOpen(FileName, fmOpenRead or fmShareDenyNone);
  {$endif MSWINDOWS}
end;
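For illustration only (names are placeholders): the handle returned by FileOpenSequentialRead can be wrapped in a THandleStream, so a chunked reading loop like the one sketched earlier works unchanged.

// For illustration only - requires System.Classes and System.SysUtils
procedure ProcessHugeLog(const AFileName: string);
var
  Handle: THandle;
  Stream: THandleStream;
begin
  Handle := FileOpenSequentialRead(AFileName);
  if PtrInt(Handle) > 0 then // basic validity check
  begin
    Stream := THandleStream.Create(Handle);
    try
      // ... read Stream in large chunks, as discussed above ...
    finally
      Stream.Free;
      FileClose(Handle); // THandleStream does not own the handle
    end;
  end;
end;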
Fr0sT.Brutal 900 Posted January 31, 2020
I wrote a Utf8ToWide function that can process chunks of any size. If someone needs it, I can publish it.
David Heffernan 2345 Posted January 31, 2020
@Arnaud Bouchez How do you handle a line longer than your buffer?
Lars Fosdal 1792 Posted January 31, 2020
Would it be feasible to memory-map the UTF-8 file?
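For illustration, a minimal Windows-only sketch of memory-mapping a file with the raw Win32 API (a 64-bit process is assumed, since a 5 GB file cannot be mapped in full into a 32-bit address space; error handling is elided):

// For illustration only - Windows only, requires Winapi.Windows
function MapWholeFile(const AFileName: string; out ASize: Int64): Pointer;
var
  FileHandle, MapHandle: THandle;
  SizeHi: DWORD;
begin
  Result := nil;
  ASize := 0;
  FileHandle := CreateFile(PChar(AFileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
  if FileHandle = INVALID_HANDLE_VALUE then
    Exit;
  try
    SizeHi := 0;
    ASize := Int64(GetFileSize(FileHandle, @SizeHi)) or (Int64(SizeHi) shl 32);
    MapHandle := CreateFileMapping(FileHandle, nil, PAGE_READONLY, 0, 0, nil);
    if MapHandle <> 0 then
    try
      Result := MapViewOfFile(MapHandle, FILE_MAP_READ, 0, 0, 0); // 0 = map the whole file
    finally
      CloseHandle(MapHandle); // the view keeps the mapping alive
    end;
  finally
    CloseHandle(FileHandle); // the mapping/view keeps the file open
  end;
end;

The returned pointer can then be scanned for line feeds exactly like a read buffer, with the OS paging the file in on demand; release the view with UnmapViewOfFile when done.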
Clément 148 Posted January 31, 2020 (edited)
I'm still working on that reading routine. I can't read all 5 GB into memory (even in chunks). That particular file has lines with over 64 KB of data without a CRLF - and of course my buffer was set to 64 KB! I'll go straight to 1 MB of data before walking back to the last CRLF. I'm not walking back the stream position per se: I'm using the buffer I just read to count how many characters I need to go back, and then set the stream back that many bytes. For other large files it works just fine, but there's always "that file" that messes everything up.
Guest Posted January 31, 2020 (edited)
May I suggest a rough untested routine - wrong and not working, DON'T USE; the working version is in another post:

// Will ONLY check the ending of the buffer for a valid UTF-8 character
// Will return the accepted size based on the trailing bytes (max 4 bytes)
// Will return -1 if the last 4 bytes (or fewer) in the buffer are not a valid UTF-8 char
function GetMaxUTF8String(Buffer: Pointer; Size: Integer): Integer;
var
  BBuffer: TBytes absolute Buffer;
  Trailer: Integer;
begin
  Result := Size;
  if Result = 0 then
    Exit;
  Trailer := 0;
  while (Trailer < 4) do
  begin
    if (BBuffer[Size - 1] < $80) then
      Break;
    if (Result > 0) and (BBuffer[Size - 1] shr 6 = 2) and (BBuffer[Size - 2] shr 5 = 6) then
      Break;
    if (Result > 1) and (BBuffer[Size - 1] shr 6 = 2) and (BBuffer[Size - 2] shr 6 = 2) and
       (BBuffer[Size - 3] shr 4 = $E) then
      Break;
    if (Result > 2) and (BBuffer[Size - 1] shr 6 = 2) and (BBuffer[Size - 2] shr 6 = 2) and
       (BBuffer[Size - 3] shr 6 = 2) and (BBuffer[Size - 4] shr 3 = $1E) then
      Break;
    Dec(Result);
    Inc(Trailer);
  end;
  if Trailer > 4 then
    Result := -1;
end;

@Clément Please fix it and repost it.
Guest Posted January 31, 2020
I think I missed the case where the size is very short (< 4), but I believe you can fix it and test whether it is any good.
Uwe Raabe 2057 Posted January 31, 2020
Did you give that TStreamReader approach a try?
Arnaud Bouchez 407 Posted January 31, 2020
3 hours ago, David Heffernan said: @Arnaud Bouchez How do you handle a line longer than your buffer?
I would just cut a line bigger than this size - which is very unlikely with a 2 MB buffer. Or don't cut anything at all: just read the buffer and use a proper simple state machine to decode the content, without allocating any string.
Guest Posted January 31, 2020
This works just fine! The answer in the SO question was perfect.

// Will ONLY check the ending of the buffer for a valid UTF-8 character
// Will return the accepted size based on the trailing bytes (max 4 bytes)
// Will return -1 if the last bytes in the buffer are not a valid UTF-8 char
function GetMaxUTF8String(Buffer: Pointer; Size: Integer): Integer;
var
  BB: TBytes absolute Buffer;
begin
  Result := Size;
  if Result = 0 then
    Exit;
  repeat
    Dec(Result);
    if BB[Result] and $C0 <> $80 then
      Break;
  until (Result <= 0) or (Size - Result > 4);
  if Size - Result > 4 then
    Result := -1;
end;
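One possible way to combine it with the 32 KB chunks from the original question, for illustration only (Stream is assumed to be an already opened TFileStream or TBufferedFileStream; names are placeholders): only the safe prefix of each chunk is decoded, and the trailing bytes of a split character are carried into the next chunk. Note that the routine above always holds back the last lead byte, so whatever remains at end of stream still has to be flushed.

// For illustration only - requires System.Classes and System.SysUtils
procedure DecodeInChunks(Stream: TStream);
var
  Buf: TBytes;
  Carry, NewBytes, Total, SafeLen: Integer;
  S: string;
begin
  SetLength(Buf, 32 * 1024);
  Carry := 0;
  repeat
    NewBytes := Stream.Read(Buf[Carry], Length(Buf) - Carry);
    if NewBytes <= 0 then
      Break; // end of stream
    Total := Carry + NewBytes;
    SafeLen := GetMaxUTF8String(@Buf[0], Total);
    if SafeLen < 0 then
      raise Exception.Create('Invalid UTF-8 at chunk boundary');
    S := TEncoding.UTF8.GetString(Buf, 0, SafeLen); // decode complete sequences only
    // ... process S (it may stop in the middle of a line - see the other posts) ...
    Carry := Total - SafeLen;
    if Carry > 0 then
      Move(Buf[SafeLen], Buf[0], Carry); // keep the split/held-back bytes for the next chunk
  until False;
  if Carry > 0 then
  begin
    S := TEncoding.UTF8.GetString(Buf, 0, Carry); // the file's final character(s)
    // ... process S ...
  end;
end;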
Anders Melander 1782 Posted January 31, 2020
3 hours ago, Arnaud Bouchez said: use a proper simple state machine to decode the content, without allocating any string
Like this: Flexible and Economical UTF-8 Decoder - or this: A Branchless UTF-8 Decoder
Arnaud Bouchez 407 Posted February 1, 2020
14 hours ago, Anders Melander said: Like this: Flexible and Economical UTF-8 Decoder or this: A Branchless UTF-8 Decoder
We use similar techniques in our SynCommons.pas unit. See for instance lines 17380 and following:

// some constants used for UTF-8 conversion, including surrogates
const
  UTF16_HISURROGATE_MIN = $d800;
  UTF16_HISURROGATE_MAX = $dbff;
  UTF16_LOSURROGATE_MIN = $dc00;
  UTF16_LOSURROGATE_MAX = $dfff;
  UTF8_EXTRABYTES: array[$80..$ff] of byte = (
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,4,4,4,4,5,5,0,0);
  UTF8_EXTRA: array[0..6] of record offset, minimum: cardinal; end = (
    // http://floodyberry.wordpress.com/2007/04/14/utf-8-conversion-tricks
    (offset: $00000000; minimum: $00010000),
    (offset: $00003080; minimum: $00000080),
    (offset: $000e2080; minimum: $00000800),
    (offset: $03c82080; minimum: $00010000),
    (offset: $fa082080; minimum: $00200000),
    (offset: $82082080; minimum: $04000000),
    (offset: $00000000; minimum: $04000000));
  UTF8_EXTRA_SURROGATE = 3;
  UTF8_FIRSTBYTE: array[2..6] of byte = ($c0,$e0,$f0,$f8,$fc);

In fact, the state machine I talked about was just about line feeds, not UTF-8. My guess is that UTF-8 decoding can be avoided during the process: if the lines are not truncated, the UTF-8 or Ansi bytes in between will be valid sequences. Since lines are the unit of interest when processing logs, a first scan would decode only the line feeds, then process the line bytes directly, with no string/UnicodeString conversion at all.
For fast searching within the UTF-8/Ansi memory buffer, we have some enhanced techniques, e.g. the SBNDM2 algorithm: see TMatch.PrepareContains in our SynTable.pas unit. It is much faster than Pos() or Boyer-Moore for small patterns, with branchless case-insensitivity, and it reaches several GB/s of searching speed inside memory buffers. There is even a very fast expression search engine (e.g. search for '404 & mydomain.com') in TExprParserMatch - more convenient than a RegEx to me. For a fast RegEx engine, check https://github.com/BeRo1985/flre/
Any memory allocation would reduce the process performance a lot.
Arnaud Bouchez 407 Posted February 1, 2020 (edited)
There is a very fast line feed search in our SynCommons.pas, using proper x86_64 SSE assembly that checks 16 bytes per loop iteration:

function BufferLineLength(Text, TextEnd: PUTF8Char): PtrInt;
{$ifdef CPUX64}
{$ifdef FPC} nostackframe; assembler; asm {$else} asm .noframe {$endif}
        {$ifdef MSWINDOWS} // Win64 ABI to System-V ABI
        push    rsi
        push    rdi
        mov     rdi, rcx
        mov     rsi, rdx
        {$endif}mov r8, rsi
        sub     r8, rdi // rdi=Text, rsi=TextEnd, r8=TextLen
        jz      @fail
        mov     ecx, edi
        movdqa  xmm0, [rip + @for10]
        movdqa  xmm1, [rip + @for13]
        and     rdi, -16 // check first aligned 16 bytes
        and     ecx, 15 // lower 4 bits indicate misalignment
        movdqa  xmm2, [rdi]
        movdqa  xmm3, xmm2
        pcmpeqb xmm2, xmm0
        pcmpeqb xmm3, xmm1
        por     xmm3, xmm2
        pmovmskb eax, xmm3
        shr     eax, cl // shift out unaligned bytes
        test    eax, eax
        jz      @main
        bsf     eax, eax
        add     rax, rcx
        add     rax, rdi
        sub     rax, rsi
        jae     @fail // don't exceed TextEnd
        add     rax, r8 // rax = TextFound - TextEnd + (TextEnd - Text) = offset
        {$ifdef MSWINDOWS}
        pop     rdi
        pop     rsi
        {$endif}ret
@main:  add     rdi, 16
        sub     rdi, rsi
        jae     @fail
        jmp     @by16
{$ifdef FPC} align 16 {$else} .align 16 {$endif}
@for10: dq      $0a0a0a0a0a0a0a0a
        dq      $0a0a0a0a0a0a0a0a
@for13: dq      $0d0d0d0d0d0d0d0d
        dq      $0d0d0d0d0d0d0d0d
@by16:  movdqa  xmm2, [rdi + rsi] // check 16 bytes per loop
        movdqa  xmm3, xmm2
        pcmpeqb xmm2, xmm0
        pcmpeqb xmm3, xmm1
        por     xmm3, xmm2
        pmovmskb eax, xmm3
        test    eax, eax
        jnz     @found
        add     rdi, 16
        jnc     @by16
@fail:  mov     rax, r8 // returns TextLen if no CR/LF found
        {$ifdef MSWINDOWS}
        pop     rdi
        pop     rsi
        {$endif}ret
@found: bsf     eax, eax
        add     rax, rdi
        jc      @fail
        add     rax, r8
        {$ifdef MSWINDOWS}
        pop     rdi
        pop     rsi
        {$endif}
end;
{$else}
{$ifdef FPC}inline;{$endif}
var
  c: cardinal;
begin
  result := 0;
  dec(PtrInt(TextEnd), PtrInt(Text)); // compute TextLen
  if TextEnd <> nil then
    repeat
      c := ord(Text[result]);
      if c > 13 then
      begin
        inc(result);
        if result >= PtrInt(PtrUInt(TextEnd)) then
          break;
        continue;
      end;
      if (c = 10) or (c = 13) then
        break;
      inc(result);
      if result >= PtrInt(PtrUInt(TextEnd)) then
        break;
    until false;
end;
{$endif CPUX64}

It will be faster than any UTF-8 decoding for sure.
I can already hear some people say: "hey, this is premature optimization! the disk is the bottleneck!". But in 2020, my 1 TB SSD reads at more than 3 GB/s - https://www.sabrent.com/rocket - and these are real numbers on my laptop. So searching at GB/s speed does make sense.
We use similar techniques at https://www.livemon.com/features/log-management - with optimized compression and distributed search, we reach TB/s brute-force speed.
Arnaud Bouchez 407 Posted February 1, 2020
20 hours ago, Clément said: I'm using the buffer I just read to count the number of chars I need to go back, and set the stream back that amount of bytes.
When reading buffers of several MB, there is no need to read backwards. Just read the buffer line by line, from the beginning. Use a fast function like our BufferLineLength() above to compute the line length, then search within the line buffer. If you can keep the buffer smaller than your CPU's L3 cache, it may have some benefit.
Going that way, the CPU will give you the best performance, for two reasons:
1. The whole line is very likely to remain in L1 cache, so searching the line feed and then searching any pattern is achieved at full core speed.
2. There will be automatic prefetching from main RAM into L1/L2 cache when reading ahead in a single direction.
If your disk is fast enough (NVMe), you can fill buffers in separate threads (use the number of CPU cores - 1), then search several files in parallel (one core per file - it would be more difficult to properly search the same file on multiple cores). If you don't allocate any memory during the process (do not use string), the parallel search will scale linearly.
Always do proper timing of your search speed - also taking into account the OS disk cache, which is likely to be used during testing, but not with real "cold" files.
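For illustration, a sketch of that forward-only scan (Buf/BufLen are assumed to describe the chunk just read from disk; the per-line search is a placeholder):

// For illustration only - walks a raw memory buffer line by line, with no allocation
procedure ScanLines(Buf: Pointer; BufLen: PtrInt);
var
  P, BufEnd: PAnsiChar;
  LineLen: PtrInt;
begin
  P := Buf;
  BufEnd := P;
  Inc(BufEnd, BufLen);
  while P < BufEnd do
  begin
    LineLen := BufferLineLength(PUTF8Char(P), PUTF8Char(BufEnd)); // bytes up to the next #13 or #10
    // ... search the LineLen raw bytes at P here, without any string allocation ...
    Inc(P, LineLen);
    while (P < BufEnd) and (P^ in [#10, #13]) do
      Inc(P); // skip the line terminator(s)
  end;
end;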