Clément 148 Posted January 26, 2020
Hi, I'm using Delphi Rio to read some rather large log files (some can reach 5 GB). The encoding might be Ansi or UTF-8; the log files are created by different systems and each encodes differently. I would like to use TFileStream (or TBufferedFileStream) to read the file in chunks of 32 KB, but how can I be sure that a UTF-8 character split between chunks will be decoded correctly?
David Heffernan 2345 Posted January 26, 2020
https://stackoverflow.com/questions/56724326/find-longest-utf-8-sequence-without-breaking-multi-byte-sequences
Self-synchronization is the keyword. Once you understand that, the code will emerge.
Uwe Raabe 2057 Posted January 26, 2020 (edited)
Streams don't know about encoding. You might do better with a TStreamReader. In its FillBuffer method there is code taking care of your concern (look for ExtraByteCount).
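For illustration, a minimal sketch of that TStreamReader approach (the file name, fallback encoding and per-line handling are placeholders, not from the original post): the reader performs the decoding and keeps multi-byte sequences intact across its internal buffer refills, so no split character ever reaches your code. For files without a BOM you still have to pick the fallback encoding (UTF-8 vs. Ansi) yourself.

// For illustration only - requires System.Classes and System.SysUtils
procedure ScanLogFile(const AFileName: string);
var
  Stream: TBufferedFileStream;
  Reader: TStreamReader;
  Line: string;
begin
  Stream := TBufferedFileStream.Create(AFileName, fmOpenRead or fmShareDenyNone);
  try
    // DetectBOM = True; falls back to UTF-8 when no BOM is present
    Reader := TStreamReader.Create(Stream, TEncoding.UTF8, True, 32 * 1024);
    try
      while not Reader.EndOfStream do
      begin
        Line := Reader.ReadLine;
        // ... process Line ...
      end;
    finally
      Reader.Free;
    end;
  finally
    Stream.Free;
  end;
end;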
Arnaud Bouchez 407 Posted January 27, 2020 (edited)
For decoding such log lines, I would not bother with UTF-8 decoding, only with line-feed detection, during file reading. Just read your data into a buffer (bigger than you expect, e.g. 2 MB rather than 32 KB), search for #13#10 or #10, then decode the UTF-8 or Ansi text in between - only if really needed. If you don't find a line feed before the end of the buffer, copy the bytes remaining from the last line to the beginning of the buffer, then fill the rest from disk.
Last but not least, to efficiently process huge log files which are UTF-8 or Ansi encoded, I wouldn't convert anything to string (UnicodeString), but use raw PAnsiChar or PByteArray pointers, with no memory allocation. We have plenty of low-level search/decoding functions working directly on memory buffers (using pointers) in our Open Source libraries: https://github.com/synopse/mORMot/blob/master/SynCommons.pas
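A rough sketch of that chunked scan, for illustration only (the buffer size, names and line handling are placeholders): each chunk is searched for line feeds and the unfinished tail is copied to the front of the buffer before the next read, so neither a line nor a UTF-8 sequence is ever split.

// For illustration only - requires System.Classes and System.SysUtils
procedure ScanLogInChunks(const AFileName: string);
const
  BUF_SIZE = 2 * 1024 * 1024; // 2 MB, as suggested above
var
  Stream: TFileStream;
  Buf: TBytes;
  Pending, NewBytes, Total, LineStart, I: Integer;
begin
  SetLength(Buf, BUF_SIZE);
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyNone);
  try
    Pending := 0;
    repeat
      NewBytes := Stream.Read(Buf[Pending], BUF_SIZE - Pending);
      if NewBytes <= 0 then
        Break; // end of file; Pending > 0 here means a last line without a trailing line feed
      Total := Pending + NewBytes;
      LineStart := 0;
      for I := 0 to Total - 1 do
        if Buf[I] = 10 then // #10 (a preceding #13 can simply be trimmed)
        begin
          // process the raw bytes Buf[LineStart .. I - 1] here; decode only if really needed
          LineStart := I + 1;
        end;
      Pending := Total - LineStart;
      if Pending = BUF_SIZE then
        raise Exception.Create('Line longer than the buffer - enlarge BUF_SIZE');
      if Pending > 0 then
        Move(Buf[LineStart], Buf[0], Pending); // carry the partial line to the front
    until False;
  finally
    Stream.Free;
  end;
end;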
Vandrovnik 214 Posted January 27, 2020
5 hours ago, Arnaud Bouchez said: For decoding such log lines, I would not bother with UTF-8 decoding, only with line-feed detection, during file reading. (...)
He expects files of up to 5 GB; many machines will not be able to keep that in RAM.
Arnaud Bouchez 407 Posted January 31, 2020
@Vandrovnik I guess you didn't understand what I wrote. I proposed to read the file into a buffer (typically 2 MB - 32 MB), chunk by chunk, searching for the line feeds in it. It will work, very efficiently, for any size of input file - even TB.
Last trick: under Windows, use the FILE_FLAG_SEQUENTIAL_SCAN option when you open such a huge file. It hints the OS cache manager to optimize for sequential access (aggressive read-ahead), which makes it more efficient in your case. See the corresponding function in SynCommons.pas:

/// overloaded function optimized for one pass file reading
// - will use e.g. the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows, as stated
// by http://blogs.msdn.com/b/oldnewthing/archive/2012/01/20/10258690.aspx
// - under XP, we observed ERROR_NO_SYSTEM_RESOURCES problems with FileRead()
// bigger than 32MB
// - under POSIX, calls plain FileOpen(FileName,fmOpenRead or fmShareDenyNone)
// - is used e.g. by StringFromFile() and TSynMemoryStreamMapped.Create()
function FileOpenSequentialRead(const FileName: string): Integer;
begin
  {$ifdef MSWINDOWS}
  result := CreateFile(pointer(FileName), GENERIC_READ,
    FILE_SHARE_READ or FILE_SHARE_WRITE, nil, // same as fmShareDenyNone
    OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
  {$else}
  result := FileOpen(FileName, fmOpenRead or fmShareDenyNone);
  {$endif MSWINDOWS}
end;
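For illustration only (names are placeholders): the handle returned by FileOpenSequentialRead can be wrapped in a THandleStream, so a chunked reading loop like the one sketched earlier works unchanged.

// For illustration only - requires System.Classes and System.SysUtils
procedure ProcessHugeLog(const AFileName: string);
var
  Handle: THandle;
  Stream: THandleStream;
begin
  Handle := FileOpenSequentialRead(AFileName);
  if PtrInt(Handle) > 0 then // basic validity check
  begin
    Stream := THandleStream.Create(Handle);
    try
      // ... read Stream in large chunks, as discussed above ...
    finally
      Stream.Free;
      FileClose(Handle); // THandleStream does not own the handle
    end;
  end;
end;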
Fr0sT.Brutal 900 Posted January 31, 2020
I wrote a Utf8ToWide function that can process chunks of any size. If someone needs it, I can publish it.
David Heffernan 2345 Posted January 31, 2020
@Arnaud Bouchez How do you handle a line longer than your buffer?
Lars Fosdal 1792 Posted January 31, 2020
Would it be feasible to memory-map the UTF-8 file?
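For illustration, a minimal Windows-only sketch of memory-mapping a file with the raw Win32 API (a 64-bit process is assumed, since a 5 GB file cannot be mapped in full into a 32-bit address space; error handling is elided):

// For illustration only - Windows only, requires Winapi.Windows
function MapWholeFile(const AFileName: string; out ASize: Int64): Pointer;
var
  FileHandle, MapHandle: THandle;
  SizeHi: DWORD;
begin
  Result := nil;
  ASize := 0;
  FileHandle := CreateFile(PChar(AFileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
  if FileHandle = INVALID_HANDLE_VALUE then
    Exit;
  try
    SizeHi := 0;
    ASize := Int64(GetFileSize(FileHandle, @SizeHi)) or (Int64(SizeHi) shl 32);
    MapHandle := CreateFileMapping(FileHandle, nil, PAGE_READONLY, 0, 0, nil);
    if MapHandle <> 0 then
    try
      Result := MapViewOfFile(MapHandle, FILE_MAP_READ, 0, 0, 0); // 0 = map the whole file
    finally
      CloseHandle(MapHandle); // the view keeps the mapping alive
    end;
  finally
    CloseHandle(FileHandle); // the mapping/view keeps the file open
  end;
end;

The returned pointer can then be scanned for line feeds exactly like a read buffer, with the OS paging the file in on demand; release the view with UnmapViewOfFile when done.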
Clément 148 Posted January 31, 2020 (edited)
I'm still working on that reading routine. I can't read all 5 GB into memory (even in chunks). That particular file has lines with over 64 KB of data without a CRLF - and of course my buffer was set to 64 KB! I'll go straight to 1 MB of data before walking back to the last CRLF. I'm not walking back the stream position per se: I'm using the buffer I just read to count how many characters I need to go back, and then set the stream back that many bytes. For other large files it works just fine, but there's always "that file" that messes everything up.
Guest Posted January 31, 2020 (edited)
May I suggest a rough untested routine - wrong and not working, DON'T USE; the working version is in another post:

// Will ONLY check the ending of the buffer for a valid UTF-8 character
// Will return the accepted size based on the trailing bytes (max 4 bytes)
// Will return -1 if the last 4 bytes (or fewer) in the buffer are not a valid UTF-8 char
function GetMaxUTF8String(Buffer: Pointer; Size: Integer): Integer;
var
  BBuffer: TBytes absolute Buffer;
  Trailer: Integer;
begin
  Result := Size;
  if Result = 0 then
    Exit;
  Trailer := 0;
  while (Trailer < 4) do
  begin
    if (BBuffer[Size - 1] < $80) then
      Break;
    if (Result > 0) and (BBuffer[Size - 1] shr 6 = 2) and (BBuffer[Size - 2] shr 5 = 6) then
      Break;
    if (Result > 1) and (BBuffer[Size - 1] shr 6 = 2) and (BBuffer[Size - 2] shr 6 = 2) and
       (BBuffer[Size - 3] shr 4 = $E) then
      Break;
    if (Result > 2) and (BBuffer[Size - 1] shr 6 = 2) and (BBuffer[Size - 2] shr 6 = 2) and
       (BBuffer[Size - 3] shr 6 = 2) and (BBuffer[Size - 4] shr 3 = $1E) then
      Break;
    Dec(Result);
    Inc(Trailer);
  end;
  if Trailer > 4 then
    Result := -1;
end;

@Clément Please fix it and repost it.
Guest Posted January 31, 2020
I think I missed the case where the size is very short (< 4), but I believe you can fix it and test whether it is any good.
Uwe Raabe 2057 Posted January 31, 2020
Did you give that TStreamReader approach a try?
Arnaud Bouchez 407 Posted January 31, 2020
3 hours ago, David Heffernan said: @Arnaud Bouchez How do you handle a line longer than your buffer?
I would just cut a line bigger than this size - which is very unlikely with a 2 MB buffer. Or don't cut anything at all: just read the buffer and use a proper simple state machine to decode the content, without allocating any string.
Guest Posted January 31, 2020
This works just fine! The answer in the SO question was perfect.

// Will ONLY check the ending of the buffer for a valid UTF-8 character
// Will return the accepted size based on the trailing bytes (max 4 bytes)
// Will return -1 if the last bytes in the buffer are not a valid UTF-8 char
function GetMaxUTF8String(Buffer: Pointer; Size: Integer): Integer;
var
  BB: TBytes absolute Buffer;
begin
  Result := Size;
  if Result = 0 then
    Exit;
  repeat
    Dec(Result);
    if BB[Result] and $C0 <> $80 then
      Break;
  until (Result <= 0) or (Size - Result > 4);
  if Size - Result > 4 then
    Result := -1;
end;
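One possible way to combine it with the 32 KB chunks from the original question, for illustration only (Stream is assumed to be an already opened TFileStream or TBufferedFileStream; names are placeholders): only the safe prefix of each chunk is decoded, and the trailing bytes of a split character are carried into the next chunk. Note that the routine above always holds back the last lead byte, so whatever remains at end of stream still has to be flushed.

// For illustration only - requires System.Classes and System.SysUtils
procedure DecodeInChunks(Stream: TStream);
var
  Buf: TBytes;
  Carry, NewBytes, Total, SafeLen: Integer;
  S: string;
begin
  SetLength(Buf, 32 * 1024);
  Carry := 0;
  repeat
    NewBytes := Stream.Read(Buf[Carry], Length(Buf) - Carry);
    if NewBytes <= 0 then
      Break; // end of stream
    Total := Carry + NewBytes;
    SafeLen := GetMaxUTF8String(@Buf[0], Total);
    if SafeLen < 0 then
      raise Exception.Create('Invalid UTF-8 at chunk boundary');
    S := TEncoding.UTF8.GetString(Buf, 0, SafeLen); // decode complete sequences only
    // ... process S (it may stop in the middle of a line - see the other posts) ...
    Carry := Total - SafeLen;
    if Carry > 0 then
      Move(Buf[SafeLen], Buf[0], Carry); // keep the split/held-back bytes for the next chunk
  until False;
  if Carry > 0 then
  begin
    S := TEncoding.UTF8.GetString(Buf, 0, Carry); // the file's final character(s)
    // ... process S ...
  end;
end;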
Anders Melander 1782 Posted January 31, 2020
3 hours ago, Arnaud Bouchez said: use a proper simple state machine to decode the content, without allocating any string
Like this: Flexible and Economical UTF-8 Decoder - or this: A Branchless UTF-8 Decoder
Arnaud Bouchez 407 Posted February 1, 2020
14 hours ago, Anders Melander said: Like this: Flexible and Economical UTF-8 Decoder or this: A Branchless UTF-8 Decoder
We use similar techniques in our SynCommons.pas unit. See for instance lines 17380 and following:

// some constants used for UTF-8 conversion, including surrogates
const
  UTF16_HISURROGATE_MIN = $d800;
  UTF16_HISURROGATE_MAX = $dbff;
  UTF16_LOSURROGATE_MIN = $dc00;
  UTF16_LOSURROGATE_MAX = $dfff;
  UTF8_EXTRABYTES: array[$80..$ff] of byte = (
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,4,4,4,4,5,5,0,0);
  UTF8_EXTRA: array[0..6] of record offset, minimum: cardinal; end = (
    // http://floodyberry.wordpress.com/2007/04/14/utf-8-conversion-tricks
    (offset: $00000000; minimum: $00010000),
    (offset: $00003080; minimum: $00000080),
    (offset: $000e2080; minimum: $00000800),
    (offset: $03c82080; minimum: $00010000),
    (offset: $fa082080; minimum: $00200000),
    (offset: $82082080; minimum: $04000000),
    (offset: $00000000; minimum: $04000000));
  UTF8_EXTRA_SURROGATE = 3;
  UTF8_FIRSTBYTE: array[2..6] of byte = ($c0,$e0,$f0,$f8,$fc);

In fact, the state machine I talked about was just about line feeds, not UTF-8. My guess is that UTF-8 decoding can be avoided during the process: if the lines are not truncated, the UTF-8 or Ansi bytes in between will be valid sequences. Since lines are the unit of interest when processing logs, a first scan would decode only the line feeds, then process the line bytes directly, with no string/UnicodeString conversion at all.
For fast searching within the UTF-8/Ansi memory buffer, we have some enhanced techniques, e.g. the SBNDM2 algorithm: see TMatch.PrepareContains in our SynTable.pas unit. It is much faster than Pos() or Boyer-Moore for small patterns, with branchless case-insensitivity, and it reaches several GB/s of searching speed inside memory buffers. There is even a very fast expression search engine (e.g. search for '404 & mydomain.com') in TExprParserMatch - more convenient than a RegEx to me. For a fast RegEx engine, check https://github.com/BeRo1985/flre/
Any memory allocation would reduce the process performance a lot.
Arnaud Bouchez 407 Posted February 1, 2020 (edited)
There is a very fast line feed search in our SynCommons.pas, using proper x86_64 SSE assembly that checks 16 bytes per loop iteration:

function BufferLineLength(Text, TextEnd: PUTF8Char): PtrInt;
{$ifdef CPUX64}
{$ifdef FPC} nostackframe; assembler; asm {$else} asm .noframe {$endif}
        {$ifdef MSWINDOWS} // Win64 ABI to System-V ABI
        push    rsi
        push    rdi
        mov     rdi, rcx
        mov     rsi, rdx
        {$endif}mov r8, rsi
        sub     r8, rdi // rdi=Text, rsi=TextEnd, r8=TextLen
        jz      @fail
        mov     ecx, edi
        movdqa  xmm0, [rip + @for10]
        movdqa  xmm1, [rip + @for13]
        and     rdi, -16 // check first aligned 16 bytes
        and     ecx, 15 // lower 4 bits indicate misalignment
        movdqa  xmm2, [rdi]
        movdqa  xmm3, xmm2
        pcmpeqb xmm2, xmm0
        pcmpeqb xmm3, xmm1
        por     xmm3, xmm2
        pmovmskb eax, xmm3
        shr     eax, cl // shift out unaligned bytes
        test    eax, eax
        jz      @main
        bsf     eax, eax
        add     rax, rcx
        add     rax, rdi
        sub     rax, rsi
        jae     @fail // don't exceed TextEnd
        add     rax, r8 // rax = TextFound - TextEnd + (TextEnd - Text) = offset
        {$ifdef MSWINDOWS}
        pop     rdi
        pop     rsi
        {$endif}ret
@main:  add     rdi, 16
        sub     rdi, rsi
        jae     @fail
        jmp     @by16
{$ifdef FPC} align 16 {$else} .align 16 {$endif}
@for10: dq      $0a0a0a0a0a0a0a0a
        dq      $0a0a0a0a0a0a0a0a
@for13: dq      $0d0d0d0d0d0d0d0d
        dq      $0d0d0d0d0d0d0d0d
@by16:  movdqa  xmm2, [rdi + rsi] // check 16 bytes per loop
        movdqa  xmm3, xmm2
        pcmpeqb xmm2, xmm0
        pcmpeqb xmm3, xmm1
        por     xmm3, xmm2
        pmovmskb eax, xmm3
        test    eax, eax
        jnz     @found
        add     rdi, 16
        jnc     @by16
@fail:  mov     rax, r8 // returns TextLen if no CR/LF found
        {$ifdef MSWINDOWS}
        pop     rdi
        pop     rsi
        {$endif}ret
@found: bsf     eax, eax
        add     rax, rdi
        jc      @fail
        add     rax, r8
        {$ifdef MSWINDOWS}
        pop     rdi
        pop     rsi
        {$endif}
end;
{$else}
{$ifdef FPC}inline;{$endif}
var
  c: cardinal;
begin
  result := 0;
  dec(PtrInt(TextEnd), PtrInt(Text)); // compute TextLen
  if TextEnd <> nil then
    repeat
      c := ord(Text[result]);
      if c > 13 then
      begin
        inc(result);
        if result >= PtrInt(PtrUInt(TextEnd)) then
          break;
        continue;
      end;
      if (c = 10) or (c = 13) then
        break;
      inc(result);
      if result >= PtrInt(PtrUInt(TextEnd)) then
        break;
    until false;
end;
{$endif CPUX64}

It will be faster than any UTF-8 decoding for sure.
I can already hear some people say: "hey, this is premature optimization! the disk is the bottleneck!". But in 2020, my 1 TB SSD reads at more than 3 GB/s - https://www.sabrent.com/rocket - and these are real numbers on my laptop. So searching at GB/s speed does make sense.
We use similar techniques at https://www.livemon.com/features/log-management - with optimized compression and distributed search, we reach TB/s brute-force speed.
Arnaud Bouchez 407 Posted February 1, 2020
20 hours ago, Clément said: I'm using the buffer I just read to count the number of chars I need to go back, and set the stream back that amount of bytes.
When reading buffers of several MB, there is no need to read backwards. Just read the buffer line by line, from the beginning. Use a fast function like our BufferLineLength() above to compute the line length, then search within the line buffer. If you can keep the buffer smaller than your CPU's L3 cache, it may have some benefit.
Going that way, the CPU will give you the best performance, for two reasons:
1. The whole line is very likely to remain in L1 cache, so searching the line feed and then searching any pattern is achieved at full core speed.
2. There will be automatic prefetching from main RAM into L1/L2 cache when reading ahead in a single direction.
If your disk is fast enough (NVMe), you can fill buffers in separate threads (use the number of CPU cores - 1), then search several files in parallel (one core per file - it would be more difficult to properly search the same file on multiple cores). If you don't allocate any memory during the process (do not use string), the parallel search will scale linearly.
Always do proper timing of your search speed - also taking into account the OS disk cache, which is likely to be used during testing, but not with real "cold" files.
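For illustration, a sketch of that forward-only scan (Buf/BufLen are assumed to describe the chunk just read from disk; the per-line search is a placeholder):

// For illustration only - walks a raw memory buffer line by line, with no allocation
procedure ScanLines(Buf: Pointer; BufLen: PtrInt);
var
  P, BufEnd: PAnsiChar;
  LineLen: PtrInt;
begin
  P := Buf;
  BufEnd := P;
  Inc(BufEnd, BufLen);
  while P < BufEnd do
  begin
    LineLen := BufferLineLength(PUTF8Char(P), PUTF8Char(BufEnd)); // bytes up to the next #13 or #10
    // ... search the LineLen raw bytes at P here, without any string allocation ...
    Inc(P, LineLen);
    while (P < BufEnd) and (P^ in [#10, #13]) do
      Inc(P); // skip the line terminator(s)
  end;
end;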