Jump to content
parallax

More efficient string handling

Recommended Posts

8 hours ago, Attila Kovacs said:

Not sure? That's interesting, you could save 150 secs and a bunch of coding but you are don't bother to find out?

 

double double quotes are escaped double quotes but you can go the explicit way:

 


FIELDS
    TERMINATED BY ','
    ENCLOSED BY '"'
    ESCAPED BY ''

 

Thank you for the suggestion - but I've tried the ESCAPED BY clause, and I still lose some lines when importing:

 

LOAD DATA local INFILE 'c:/programdata/mysql/mysql server 8.0/uploads/2022VAERSDATA.csv'
INTO TABLE data
CHARACTER SET latin1  FIELDS TERMINATED BY ','  ENCLOSED BY '"' ESCAPED BY ''
IGNORE 1 LINES

This inserts 35.673 lines of 39.563, so I am losing close to 10% of the data.

 

When working on this, I came across this section in the manual:

 

  • Quote

    If FIELDS ESCAPED BY is empty, a field value that contains an occurrence of FIELDS ENCLOSED BY or LINES TERMINATED BY followed by the FIELDS TERMINATED BY value causes LOAD DATA to stop reading a field or line too early. This happens because LOAD DATA cannot properly determine where the field or line value ends.

     

https://dev.mysql.com/doc/refman/8.0/en/load-data.html

 

I also noticed this suggestion at SO, but it did not work as indicated in the manual.

 

If anyone has a solution that actually works in this case, I would of course appreciate that very much.

Maybe something is escaping me?

 

Edited by parallax

Share this post


Link to post
//Thanks to Ralph of team B old post
function MMFileToString(const AFilename: string): string;
var
  hFile:    THandle;
  hFileMap: THandle;
  hiSize:   DWORD;
  loSize:   DWORD;
  text:     string;
  view:     pointer;
begin
  Result := '';
  if AFilename = '' then
    Exit;
  if not FileExists(AFilename) then
    Exit;
  // Open the file.
  hFile := CreateFile(PChar(AFilename),
                      GENERIC_READ, FILE_SHARE_READ, nil, OPEN_EXISTING,
                      FILE_ATTRIBUTE_NORMAL, 0);
  if hFile <> INVALID_HANDLE_VALUE then begin
    loSize := GetFileSize(hFile, @hiSize);
    // File was opened successfully, now map it:
    hFileMap := CreateFileMapping(hFile, nil, PAGE_READONLY, hiSize, loSize,
      'TextForString');
    if (hFileMap <> 0) then begin
      if (GetLastError() = ERROR_ALREADY_EXISTS) then begin
        MessageDlg('Mapping already exists - not created.', mtWarning, [mbOk], 0);
        CloseHandle(hFileMap)
      end
      else begin
        try
          // File mapped successfully, now
          // map a view of the file into the address space:
          view := MapViewOfFile(hFileMap, FILE_MAP_READ, 0, 0, 0);
          if (view <> nil) then begin // View mapped successfully.
            CloseHandle(hFile); // Close file handle - as long is view is open it will persist.
            SetLength(Result, loSize);
            Move(view^, Result[1], loSize);
          end
          else
            MessageDlg('Unable to map view of file. ' +
                          SysErrorMessage(GetLastError),
                             mtWarning, [mbOk], 0);
        finally
          UnmapViewOfFile(view); // Close view;
          CloseHandle(hFileMap); // Close mapping;
        end
      end
    end
    else begin
      MessageDlg('Unable to create file mapping. ' +
                   SysErrorMessage(GetLastError), mtWarning, [mbOk], 0);
    end;
  end
  else begin
    MessageDlg('Unable to open file. ' + SysErrorMessage(GetLastError),
                     mtWarning, [mbOk], 0);
  end;
end;

{ TForm1 }
//L has
procedure TForm1.btnFixDateClick(Sender: TObject);
var
  r, Cutoff: Integer;
  s: string;
begin
  for r := 0 to sl.count - 1 do
  begin
    s := SG.Cells[1,r+1];
    Cutoff := length(s)-5;
    s := '2021/' + copy(s,1,Cutoff );
    SG.Cells[1,r+1] := s;
  end;
end;

//note need to set strictDelimiter each time. Delphi too.
//for losing the "
procedure TForm1.btnmapDataSGClick(Sender: TObject);
var
  c,r,i: integer;
  et: int64;
begin
  et := GetTickCount64;
  sl := TStringList.Create;
  sl.Capacity:= 750000;
  sl.text := MMFileToString('C:\2022\2021VAERSData\2021VAERSDATA.csv');

  c := sl.Count;
  for r := 0 to c -1 do
  begin
    SG.Rows[r].StrictDelimiter:= True;
    SG.rows[r].CommaText := sl[r];
    if r mod 5000 = 0 then btnmapDataSG.caption := 'r:' + r.ToString +  ' c:' + c.tostring;
  end;
  et := GetTickCount64 - et;
  memo1.lines[0] := format('done loading in %f',[et/1000]);
end;
//view cropped data in cell with memo. Delphi NA
procedure TForm1.SGBeforeSelection(Sender: TObject; aCol, aRow: Integer);
begin
  memo1.text := SG.Cells[aCol,aRow];
end;

Above is Lazarus code I was to only load 145000 lines in D or got 'out of memory errors' with bigger files 

the L exe swells to 3500 when loading the data.  It loaded 748,xxx records in the SG in 12.78 seconds.   

 

One could cut the file up with Notepad++ or excel 

Or load a Tarray<recordwith38strings> reading the file in line by line. Making the commatext with arecord do format('%s1,...%38', [s1,..s38]); 

Share this post


Link to post
2 hours ago, Pat Foley said:

Above is Lazarus code I was to only load 145000 lines in D or got 'out of memory errors' with bigger files 

the L exe swells to 3500 when loading the data.  It loaded 748,xxx records in the SG in 12.78 seconds.   

 

One could cut the file up with Notepad++ or excel 

Or load a Tarray<recordwith38strings> reading the file in line by line. Making the commatext with arecord do format('%s1,...%38', [s1,..s38]); 

 

Thank you very much, that was a very fast mapping routine - very interesting!

Running on Lazarus it loads (or maps) 748.xxx lines in 2,39 secs., but I was only able to insert 300.000 lines into the SG, which took 4,02 secs, before getting Out of Memory, but that is no problem.

There is no need to view all the data at once in a grid anyway.

Edited by parallax

Share this post


Link to post

Hey I had D in 32 platform switching up to 64 I able to load all data into StringGrid 🙂  

You setting SG.rowcount to big Number.

 

Running the L exe the UI hung at 300000 lines but kept going. Running the D64 exe it grew in size albeit slowly. 

 

// D10.4.2 code runs 746,xxx in 42 seconds on i5 
procedure TForm4.Button4Click(Sender: TObject);
var
  FN: TextFile;
  Row: Integer;
  s: string;
  eT: int64;
begin
  Row := 0;
  eT := GetTickCount64;
  AssignFile(FN, 'C:\2022\2021VAERSData\2021VAERSDATA.csv');
  Reset(FN);
  while not Eof(FN) do
  begin
    Readln(FN, s);
    StringGrid1.Rows[Row].StrictDelimiter := True;
    StringGrid1.Rows[Row].commaText := s;
    Inc(Row);
  end;
  eT := GetTickCount64 - eT;
  caption :=Row.tostring + 'records loaded in' + eT.ToString + ' Msecs';
end;

 

 

Edited by Pat Foley
add UI comment

Share this post


Link to post
31 minutes ago, Pat Foley said:

Hey I had D in 32 platform switching up to 64 I able to load all data into StringGrid 🙂  

You setting SG.rowcount to big Number.

 

 

Yes, I can imagine that going to x64 will make it work. I just tried the MM only loading/mapping in Dx32, and was out of memory again.

 

Edit: I just tried Delphi x64, and the loading/mapping of 748.xxx lines took 1,56 secs.

But the stringlist in Delphi is unicode and all the text ends up in a single string, with delimiters not being recognized.

Edited by parallax

Share this post


Link to post

UTF8Decode did not work as expected, even though the files are UTF-8 without BOM.

But casting the output from MMFileToString using PAnsiChar, it worked fine, and I came down to 72 seconds on loading the big file.

The mapping alone took 3,8 seconds, for the 748.xxx lines. That was fast.

 

 

mapping ex.jpg

Share this post


Link to post
Guest

try this too!

image.thumb.png.738d496a3568094022fa84c9eb2f28c2.png

------------
Reading: [1990VAERSDATA.csv] to MyFileTarget.txt
Records: [570.000] <------
Time: 13541,7101 (00:00:13.5417101)
File output: MyFileTarget.txt, size: 220.343.400 (220MB)
------------
Reading: [1990VAERSDATA.csv] to FDMemTable
Records: [570.000] <------ 
Time: 13275,2741 (00:00:13.2752741)

implementation

{$R *.dfm}

uses
  System.IOUtils,
  System.Diagnostics;

const
  MyArr1990VAERSDATAFields: array [0 .. 31] of string = (                       { }  MY FAULT: NEED FIX THE STRUCTUR NAMES HERE!!!
    'VAERS_ID', 'RECVDATE', 'STATE,AGE_YRS', 'CAGE_YR', 'CAGE_MO',              { }  NOT 32 FIELDS, BUT 35.
    'SEX', 'RPT_DATE', 'SYMPTOM_TEXT', 'DIED,DATEDIED', 'L_THREAT,ER_VISIT',    { }
    'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE',          { }
    'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY', 'OTHER_MEDS', { }
    'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE', 'FORM_VERS', 'TODAYS_DATE',  { }
    'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT', 'ALLERGIES');
  //
  MyArr1990VAERSSYMPTOMSFields: array [0 .. 10] of string = (                 { }
    'VAERS_ID', 'SYMPTOM1', 'SYMPTOMVERSION1', 'SYMPTOM2', 'SYMPTOMVERSION2', { }
    'SYMPTOM3', 'SYMPTOMVERSION3', 'SYMPTOM4', 'SYMPTOMVERSION4', 'SYMPTOM5', 'SYMPTOMVERSION5');
  //
  MyArr1990VAERSVAXFields: array [0 .. 7] of string = (                                                   { }
    'VAERS_ID', 'VAX_TYPE', 'VAX_MANU', 'VAX_LOT', 'VAX_DOSE_SERIES', 'VAX_ROUTE', 'VAX_SITE', 'VAX_NAME' { }
    );
  //
  MyCSVFileName: array [0 .. 2] of string = ('1990VAERSDATA.csv', '1990VAERSSYMPTOMS.csv', '1990VAERSVAX.csv');
  //
  MyFileDir: string = 'D:\RADRX11Tests\1990VAERSData\';

var
  MyFieldSize: byte = 30;

  (*
    procedure TForm1.FDBatchMove1WriteValue(ASender: TObject; AItem: TFDBatchMoveMappingItem; var AValue: Variant);
    begin
    if VarIsNull(AValue) then
    AValue := '';
    //
    Memo1.Lines.Add(AValue);    // to "memo" is very slowly....
    end;
  *)

procedure TForm1.FormCreate(Sender: TObject);
begin
  Memo1.Lines.Clear;
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  MyFDTextField: TFDTextField;
  MyClock      : TStopwatch;
begin

  FDMemTable1.Close;
  FDBatchMoveTextReader1.DataDef.Fields.Clear;
  FDBatchMoveTextWriter1.DataDef.Fields.Clear;
  //
  FDBatchMoveTextReader1.DataDef.EndOfLine      := TFDTextEndOfLine.elDefault; // ? Win,Linux
  FDBatchMoveTextReader1.DataDef.Delimiter      := ' ';                        // ?
  FDBatchMoveTextReader1.DataDef.RecordFormat   := rfCustom;                   // ?
  FDBatchMoveTextReader1.DataDef.Separator      := ',';                        // ?
  FDBatchMoveTextReader1.DataDef.WithFieldNames := true;
  // FDBatchMoveTextReader1.DataDef.FormatSettings...
  FDBatchMoveTextReader1.Encoding := TFDEncoding.ecANSI; // ?

  FDMemTable1.FieldDefs.Clear;
  //
  FDBatchMoveTextReader1.FileName := MyFileDir + MyCSVFileName[0];
  //
  for var MyItem in MyArr1990VAERSDATAFields  do
  begin
    MyFDTextField           := FDBatchMoveTextReader1.DataDef.Fields.Add;
    MyFDTextField.FieldName := MyItem;
    MyFDTextField.DataType  := TFDTextDataType.atString;
    MyFDTextField.FieldSize := MyFieldSize;
    //
    MyFDTextField           := FDBatchMoveTextWriter1.DataDef.Fields.Add;
    MyFDTextField.FieldName := MyItem;
    MyFDTextField.DataType  := TFDTextDataType.atString;
    MyFDTextField.FieldSize := MyFieldSize;
    //
    FDMemTable1.FieldDefs.Add(MyItem, TFieldType.ftString, MyFieldSize, false);
  end;
  //
  if ToggleSwitch1.State = TToggleSwitchState.tssOn then // for tests FDMemtable or TextOuptut target
  begin
    FDBatchMove1.Writer := FDBatchMoveDataSetWriter1;
    //
    MyClock := TStopwatch.StartNew;
    Caption := FDBatchMove1.Execute.ToString;
    MyClock.Stop;
    //
    Memo1.Lines.Add('------------');
    Memo1.Lines.Add('Reading: [' + ExtractFileName(FDBatchMoveTextReader1.FileName) + '] to FDMemTable');
    Memo1.Lines.Add('Records: [' + Caption + ']');
    Memo1.Lines.Add('Time: ' + MyClock.Elapsed.TotalMilliseconds.ToString + ' (' + MyClock.Elapsed.Duration.ToString + ')');
  end
  else
  begin
    FDBatchMoveTextWriter1.DataDef  := FDBatchMoveTextReader1.DataDef; // ... ?
    FDBatchMoveTextWriter1.Encoding := FDBatchMoveTextReader1.Encoding;
    //
    FDBatchMoveTextWriter1.FileName := 'MyFileTarget.txt';
    //
    DeleteFile(FDBatchMoveTextWriter1.FileName);
    //
    FDBatchMove1.Writer := FDBatchMoveTextWriter1;
    //
    MyClock := TStopwatch.StartNew;
    Caption := FDBatchMove1.Execute.ToString;
    MyClock.Stop;
    //
    Memo1.Lines.Add('------------');
    Memo1.Lines.Add('Reading: [' + ExtractFileName(FDBatchMoveTextReader1.FileName) + '] to ' + FDBatchMoveTextWriter1.FileName);
    Memo1.Lines.Add('Records: [' + Caption + ']');
    Memo1.Lines.Add('Time: ' + MyClock.Elapsed.TotalMilliseconds.ToString + ' (' + MyClock.Elapsed.Duration.ToString + ')');
    //
    Memo1.Lines.Add('File output: ' + FDBatchMoveTextWriter1.FileName + ', size: ' + TFile.GetSize(FDBatchMoveTextWriter1.FileName).ToString);
  end;
end;

 

Edited by Guest

Share this post


Link to post
Guest

------------
Reading: [1990VAERSDATA.csv] to MyFileTarget.txt
Records: [2.280.000]
Time: 54670,0764 (00:00:54.6700764)

File output: MyFileTarget.txt, size: 881.373.600 (880MB)

*************** Start Log 18/02/2022 23:31:12 ***************

          **********  No Errors Logged **********

*************** End Log 18/02/2022 23:32:13 ***************
-------

to FDMemTable ---> OUT OF MEMORY  😂  😫

Edited by Guest

Share this post


Link to post

Not sure why you get that error.

There should be 35 fields in the data file:

 

MyArr1990VAERSDATAFields: array [0 .. 34] of string = (                       { }
    'VAERS_ID', 'RECVDATE', 'STATE', 'AGE_YRS', 'CAGE_YR', 'CAGE_MO',              { }
    'SEX', 'RPT_DATE', 'SYMPTOM_TEXT', 'DIED', 'DATEDIED', 'L_THREAT', 'ER_VISIT',    { }
    'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE',          { }
    'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY', 'OTHER_MEDS', { }
    'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE', 'FORM_VERS', 'TODAYS_DATE',  { }
    'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT', 'ALLERGIES');

 

Share this post


Link to post
12 hours ago, parallax said:

But casting the output from MMFileToString using PAnsiChar, it worked fine, and I came down to 72 seconds on loading the big file.

Not getting how that works?

 

Using streaming appears to help reduce exe growth.

 

// Lifted from Remy L DP post
procedure TForm4.Button5Click(Sender: TObject);
var
  //FS: TFileStream; //
  FS: TBufferedFileStream;  //saves a little
  Reader: TStreamReader;
  FN, data,
  s: string;
  eT, vLinesCounter: Int64;
begin
  et := GetTickCount64;
  vLinesCounter:= 0;
  FN := 'C:\2022\2021VAERSData\2021VAERSDATAutf8.csv';   //lactually just changed all (?) to (') 87xxx of the things
  FS := TBufferedFileStream.Create(FN,fmOpenRead or fmShareDenyWrite);
  try
    Reader := TStreamReader.Create(FS);
    Try
      while not Reader.EndOfStream do
      begin
        StringGrid1.rows[vLinesCounter].StrictDelimiter := True;
        StringGrid1.rows[vLinesCounter].CommaText := Reader.ReadLine;
        Inc(vLinesCounter);
        if vLinesCounter mod 20000 = 0 then caption := vLinesCounter.tostring;
      end;
    Finally
      Reader.Free
    End;
  finally
    FS.Free;
  end;
  StringGrid1.rowcount := vLinesCounter + 3;
  eT := GetTickCount64 - eT;
  Caption := vLinesCounter.tostring + ' Records loaded to SG in ' + (eT/1000).tostring + ' seconds.';
end;

// For viewing cropped text in D
procedure TForm4.StringGrid1SelectCell(Sender: TObject; ACol, ARow: Integer;
  var CanSelect: Boolean);
begin
  memo1.Lines.Text := StringGrid1.Cells[aCol, aRow];
end;

//New L code loads in 7 seconds
procedure TForm1.Button3Click(Sender: TObject);
var
  et: int64;
begin
  if OpenDialog1.Execute then
    FN := OpenDialog1.FileName
  else
    exit;
  et := GetTickCount64;
  SG.LoadFromCSVFile(FN);   
  et := GetTickCount64 - eT;
  Memo1.lines.add(eT.tostring);
end;

 

va2022dataSGdx64.png

Share this post


Link to post

note: if you read the file from start to end, Memory mapped files are not faster than reading the file in memory.

The memory faults make it slower than a regular single FileRead() call.

 

For huge files on Win32 which won't be able to load in memory, you may use temporary chunks (e.g. 128MB).
And if you really load it once and don't want to pollute the OS disk memory cache, consider using the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows.
This is what we do with mORMot's FileOpenSequentialRead(). https://devblogs.microsoft.com/oldnewthing/20120120-00/?p=8493

 

  • Like 1

Share this post


Link to post
Guest
ex. for input/output with DateType defined:

  for var MyItem in MyArr1990VAERSDATAFields do
  begin
    if MyItem = 'VAX_DATE' then
    begin
      FDBatchMoveTextReader1.DataDef.FormatSettings.ShortDateFormat := 'mm/dd/yyyy';
      MyFDTextField                                                 := FDBatchMoveTextReader1.DataDef.Fields.Add;
      MyFDTextField.FieldName                                       := MyItem;
      MyFDTextField.DataType                                        := TFDTextDataType.atDate;
      MyFDTextField.FieldSize                                       := 0;
      //
      FDMemTable1.FieldDefs.Add(MyItem, TFieldType.ftDate, 0, false)
    end
    else
    begin
      MyFDTextField           := FDBatchMoveTextReader1.DataDef.Fields.Add;
      MyFDTextField.FieldName := MyItem;
      MyFDTextField.DataType  := TFDTextDataType.atString;
      MyFDTextField.FieldSize := MyFieldSize;
      //
      FDMemTable1.FieldDefs.Add(MyItem, TFieldType.ftString, MyFieldSize, false);
    end;
  end;
  //
  FDBatchMoveTextWriter1.DataDef := FDBatchMoveTextReader1.DataDef;
procedure TForm1.FDMemTable1AfterScroll(DataSet: TDataSet);
begin
  DateTimePicker1.DateTime := DataSet.FieldByName('VAX_DATE').AsDateTime;
end;

 

Memo1.Lines.Add(FDMemTable1.FieldByName('VAX_DATE').DataType.MyText);

VAX_DATE = ftDate
------------
Reading: [1990VAERSDATA.csv] to FDMemTable
Records: [2102]
Time: 73,697 (00:00:00.0736970)
 

Fixed and works now!  1.051.000 records in 25seconds

...

  MyArr1990VAERSDATAFields: array [0 .. 34] of string = ( ...                         { }

..

  FDBatchMoveTextReader1.DataDef.Delimiter      := '"';                        // ?

...

image.png image.png image.png

Edited by Guest

Share this post


Link to post
Guest

that way, you can create many profiles for each file type. then, with this pre-defined "patterns" stay easy load it to dataset target or just to analize its data.

Share this post


Link to post

A quick look at the file (searched Bing for the 2022VAERSDATA.csv) and it looks like double quotes within a string field are escaped with a double quote.  As well only string fields that contain commas or double quotes appear to be enclosed in double quotes. Will take a more detailed look when I get a chance. 

LOAD DATA local INFILE 'c:/programdata/mysql/mysql server 8.0/uploads/2022VAERSDATA.csv'
INTO TABLE data
CHARACTER SET latin1  FIELDS TERMINATED BY ','  OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"'
IGNORE 1 LINES

Share this post


Link to post
Guest

using FDxxxxTextRead you needs just define what fields you have Format Options nothing more

Share this post


Link to post
44 minutes ago, Brian Evans said:

A quick look at the file (searched Bing for the 2022VAERSDATA.csv) and it looks like double quotes within a string field are escaped with a double quote.  As well only string fields that contain commas or double quotes appear to be enclosed in double quotes. Will take a more detailed look when I get a chance. 


LOAD DATA local INFILE 'c:/programdata/mysql/mysql server 8.0/uploads/2022VAERSDATA.csv'
INTO TABLE data
CHARACTER SET latin1  FIELDS TERMINATED BY ','  OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"'
IGNORE 1 LINES

You are correct, double quotes are only used if there is a comma within the string, and quotations within strings are double quoted, i.e. escaped with a double quote.

So your load clause should work, but it doesn't. About 10% of the rows are discarded somehow.

Share this post


Link to post
58 minutes ago, joaodanet2018 said:

using FDxxxxTextRead you needs just define what fields you have Format Options nothing more

 

I am having issues with testing your code, getting very large files. Could you please attach a test project?

Share this post


Link to post
Guest

attached (7zip v21.07 with no password) my simple project for your tests...

Done in RAD11 but nothing sophisticated, then, any other edition can open and build it.

Any problem on open it, just create a new project and add my MainForm into it.

in my tests, I used the "1990VAERSDATA.csv", copying the records into it until +1.000.000 lines, ok?

if you see the "MyFileTarget.txt" output, you'll see that the field with "long text" will be saved same that in FDMemTable these text is showed using my "FieldSize limitation". Of course, this limitation is not for output text.

 

FDConn_Reading_CSV_files_to_Parallax.zip

Edited by Guest

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×