parallax 1 Posted February 18, 2022 (edited) 8 hours ago, Attila Kovacs said: Not sure? That's interesting, you could save 150 secs and a bunch of coding but you are don't bother to find out? double double quotes are escaped double quotes but you can go the explicit way: FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '' Thank you for the suggestion - but I've tried the ESCAPED BY clause, and I still lose some lines when importing: LOAD DATA local INFILE 'c:/programdata/mysql/mysql server 8.0/uploads/2022VAERSDATA.csv' INTO TABLE data CHARACTER SET latin1 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '' IGNORE 1 LINES This inserts 35.673 lines of 39.563, so I am losing close to 10% of the data. When working on this, I came across this section in the manual: Quote If FIELDS ESCAPED BY is empty, a field value that contains an occurrence of FIELDS ENCLOSED BY or LINES TERMINATED BY followed by the FIELDS TERMINATED BY value causes LOAD DATA to stop reading a field or line too early. This happens because LOAD DATA cannot properly determine where the field or line value ends. https://dev.mysql.com/doc/refman/8.0/en/load-data.html I also noticed this suggestion at SO, but it did not work as indicated in the manual. If anyone has a solution that actually works in this case, I would of course appreciate that very much. Maybe something is escaping me? Edited February 18, 2022 by parallax Share this post Link to post
Pat Foley 51 Posted February 18, 2022 //Thanks to Ralph of team B old post function MMFileToString(const AFilename: string): string; var hFile: THandle; hFileMap: THandle; hiSize: DWORD; loSize: DWORD; text: string; view: pointer; begin Result := ''; if AFilename = '' then Exit; if not FileExists(AFilename) then Exit; // Open the file. hFile := CreateFile(PChar(AFilename), GENERIC_READ, FILE_SHARE_READ, nil, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0); if hFile <> INVALID_HANDLE_VALUE then begin loSize := GetFileSize(hFile, @hiSize); // File was opened successfully, now map it: hFileMap := CreateFileMapping(hFile, nil, PAGE_READONLY, hiSize, loSize, 'TextForString'); if (hFileMap <> 0) then begin if (GetLastError() = ERROR_ALREADY_EXISTS) then begin MessageDlg('Mapping already exists - not created.', mtWarning, [mbOk], 0); CloseHandle(hFileMap) end else begin try // File mapped successfully, now // map a view of the file into the address space: view := MapViewOfFile(hFileMap, FILE_MAP_READ, 0, 0, 0); if (view <> nil) then begin // View mapped successfully. CloseHandle(hFile); // Close file handle - as long is view is open it will persist. SetLength(Result, loSize); Move(view^, Result[1], loSize); end else MessageDlg('Unable to map view of file. ' + SysErrorMessage(GetLastError), mtWarning, [mbOk], 0); finally UnmapViewOfFile(view); // Close view; CloseHandle(hFileMap); // Close mapping; end end end else begin MessageDlg('Unable to create file mapping. ' + SysErrorMessage(GetLastError), mtWarning, [mbOk], 0); end; end else begin MessageDlg('Unable to open file. ' + SysErrorMessage(GetLastError), mtWarning, [mbOk], 0); end; end; { TForm1 } //L has procedure TForm1.btnFixDateClick(Sender: TObject); var r, Cutoff: Integer; s: string; begin for r := 0 to sl.count - 1 do begin s := SG.Cells[1,r+1]; Cutoff := length(s)-5; s := '2021/' + copy(s,1,Cutoff ); SG.Cells[1,r+1] := s; end; end; //note need to set strictDelimiter each time. Delphi too. //for losing the " procedure TForm1.btnmapDataSGClick(Sender: TObject); var c,r,i: integer; et: int64; begin et := GetTickCount64; sl := TStringList.Create; sl.Capacity:= 750000; sl.text := MMFileToString('C:\2022\2021VAERSData\2021VAERSDATA.csv'); c := sl.Count; for r := 0 to c -1 do begin SG.Rows[r].StrictDelimiter:= True; SG.rows[r].CommaText := sl[r]; if r mod 5000 = 0 then btnmapDataSG.caption := 'r:' + r.ToString + ' c:' + c.tostring; end; et := GetTickCount64 - et; memo1.lines[0] := format('done loading in %f',[et/1000]); end; //view cropped data in cell with memo. Delphi NA procedure TForm1.SGBeforeSelection(Sender: TObject; aCol, aRow: Integer); begin memo1.text := SG.Cells[aCol,aRow]; end; Above is Lazarus code I was to only load 145000 lines in D or got 'out of memory errors' with bigger files the L exe swells to 3500 when loading the data. It loaded 748,xxx records in the SG in 12.78 seconds. One could cut the file up with Notepad++ or excel Or load a Tarray<recordwith38strings> reading the file in line by line. Making the commatext with arecord do format('%s1,...%38', [s1,..s38]); Share this post Link to post
parallax 1 Posted February 18, 2022 (edited) 2 hours ago, Pat Foley said: Above is Lazarus code I was to only load 145000 lines in D or got 'out of memory errors' with bigger files the L exe swells to 3500 when loading the data. It loaded 748,xxx records in the SG in 12.78 seconds. One could cut the file up with Notepad++ or excel Or load a Tarray<recordwith38strings> reading the file in line by line. Making the commatext with arecord do format('%s1,...%38', [s1,..s38]); Thank you very much, that was a very fast mapping routine - very interesting! Running on Lazarus it loads (or maps) 748.xxx lines in 2,39 secs., but I was only able to insert 300.000 lines into the SG, which took 4,02 secs, before getting Out of Memory, but that is no problem. There is no need to view all the data at once in a grid anyway. Edited February 18, 2022 by parallax Share this post Link to post
Pat Foley 51 Posted February 18, 2022 (edited) Hey I had D in 32 platform switching up to 64 I able to load all data into StringGrid 🙂 You setting SG.rowcount to big Number. Running the L exe the UI hung at 300000 lines but kept going. Running the D64 exe it grew in size albeit slowly. // D10.4.2 code runs 746,xxx in 42 seconds on i5 procedure TForm4.Button4Click(Sender: TObject); var FN: TextFile; Row: Integer; s: string; eT: int64; begin Row := 0; eT := GetTickCount64; AssignFile(FN, 'C:\2022\2021VAERSData\2021VAERSDATA.csv'); Reset(FN); while not Eof(FN) do begin Readln(FN, s); StringGrid1.Rows[Row].StrictDelimiter := True; StringGrid1.Rows[Row].commaText := s; Inc(Row); end; eT := GetTickCount64 - eT; caption :=Row.tostring + 'records loaded in' + eT.ToString + ' Msecs'; end; Edited February 18, 2022 by Pat Foley add UI comment Share this post Link to post
parallax 1 Posted February 18, 2022 (edited) 31 minutes ago, Pat Foley said: Hey I had D in 32 platform switching up to 64 I able to load all data into StringGrid 🙂 You setting SG.rowcount to big Number. Yes, I can imagine that going to x64 will make it work. I just tried the MM only loading/mapping in Dx32, and was out of memory again. Edit: I just tried Delphi x64, and the loading/mapping of 748.xxx lines took 1,56 secs. But the stringlist in Delphi is unicode and all the text ends up in a single string, with delimiters not being recognized. Edited February 18, 2022 by parallax Share this post Link to post
parallax 1 Posted February 18, 2022 UTF8Decode did not work as expected, even though the files are UTF-8 without BOM. But casting the output from MMFileToString using PAnsiChar, it worked fine, and I came down to 72 seconds on loading the big file. The mapping alone took 3,8 seconds, for the 748.xxx lines. That was fast. Share this post Link to post
Guest Posted February 19, 2022 (edited) try this too! ------------ Reading: [1990VAERSDATA.csv] to MyFileTarget.txt Records: [570.000] <------ Time: 13541,7101 (00:00:13.5417101) File output: MyFileTarget.txt, size: 220.343.400 (220MB) ------------ Reading: [1990VAERSDATA.csv] to FDMemTable Records: [570.000] <------ Time: 13275,2741 (00:00:13.2752741) implementation {$R *.dfm} uses System.IOUtils, System.Diagnostics; const MyArr1990VAERSDATAFields: array [0 .. 31] of string = ( { } MY FAULT: NEED FIX THE STRUCTUR NAMES HERE!!! 'VAERS_ID', 'RECVDATE', 'STATE,AGE_YRS', 'CAGE_YR', 'CAGE_MO', { } NOT 32 FIELDS, BUT 35. 'SEX', 'RPT_DATE', 'SYMPTOM_TEXT', 'DIED,DATEDIED', 'L_THREAT,ER_VISIT', { } 'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE', { } 'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY', 'OTHER_MEDS', { } 'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE', 'FORM_VERS', 'TODAYS_DATE', { } 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT', 'ALLERGIES'); // MyArr1990VAERSSYMPTOMSFields: array [0 .. 10] of string = ( { } 'VAERS_ID', 'SYMPTOM1', 'SYMPTOMVERSION1', 'SYMPTOM2', 'SYMPTOMVERSION2', { } 'SYMPTOM3', 'SYMPTOMVERSION3', 'SYMPTOM4', 'SYMPTOMVERSION4', 'SYMPTOM5', 'SYMPTOMVERSION5'); // MyArr1990VAERSVAXFields: array [0 .. 7] of string = ( { } 'VAERS_ID', 'VAX_TYPE', 'VAX_MANU', 'VAX_LOT', 'VAX_DOSE_SERIES', 'VAX_ROUTE', 'VAX_SITE', 'VAX_NAME' { } ); // MyCSVFileName: array [0 .. 2] of string = ('1990VAERSDATA.csv', '1990VAERSSYMPTOMS.csv', '1990VAERSVAX.csv'); // MyFileDir: string = 'D:\RADRX11Tests\1990VAERSData\'; var MyFieldSize: byte = 30; (* procedure TForm1.FDBatchMove1WriteValue(ASender: TObject; AItem: TFDBatchMoveMappingItem; var AValue: Variant); begin if VarIsNull(AValue) then AValue := ''; // Memo1.Lines.Add(AValue); // to "memo" is very slowly.... end; *) procedure TForm1.FormCreate(Sender: TObject); begin Memo1.Lines.Clear; end; procedure TForm1.Button1Click(Sender: TObject); var MyFDTextField: TFDTextField; MyClock : TStopwatch; begin FDMemTable1.Close; FDBatchMoveTextReader1.DataDef.Fields.Clear; FDBatchMoveTextWriter1.DataDef.Fields.Clear; // FDBatchMoveTextReader1.DataDef.EndOfLine := TFDTextEndOfLine.elDefault; // ? Win,Linux FDBatchMoveTextReader1.DataDef.Delimiter := ' '; // ? FDBatchMoveTextReader1.DataDef.RecordFormat := rfCustom; // ? FDBatchMoveTextReader1.DataDef.Separator := ','; // ? FDBatchMoveTextReader1.DataDef.WithFieldNames := true; // FDBatchMoveTextReader1.DataDef.FormatSettings... FDBatchMoveTextReader1.Encoding := TFDEncoding.ecANSI; // ? FDMemTable1.FieldDefs.Clear; // FDBatchMoveTextReader1.FileName := MyFileDir + MyCSVFileName[0]; // for var MyItem in MyArr1990VAERSDATAFields do begin MyFDTextField := FDBatchMoveTextReader1.DataDef.Fields.Add; MyFDTextField.FieldName := MyItem; MyFDTextField.DataType := TFDTextDataType.atString; MyFDTextField.FieldSize := MyFieldSize; // MyFDTextField := FDBatchMoveTextWriter1.DataDef.Fields.Add; MyFDTextField.FieldName := MyItem; MyFDTextField.DataType := TFDTextDataType.atString; MyFDTextField.FieldSize := MyFieldSize; // FDMemTable1.FieldDefs.Add(MyItem, TFieldType.ftString, MyFieldSize, false); end; // if ToggleSwitch1.State = TToggleSwitchState.tssOn then // for tests FDMemtable or TextOuptut target begin FDBatchMove1.Writer := FDBatchMoveDataSetWriter1; // MyClock := TStopwatch.StartNew; Caption := FDBatchMove1.Execute.ToString; MyClock.Stop; // Memo1.Lines.Add('------------'); Memo1.Lines.Add('Reading: [' + ExtractFileName(FDBatchMoveTextReader1.FileName) + '] to FDMemTable'); Memo1.Lines.Add('Records: [' + Caption + ']'); Memo1.Lines.Add('Time: ' + MyClock.Elapsed.TotalMilliseconds.ToString + ' (' + MyClock.Elapsed.Duration.ToString + ')'); end else begin FDBatchMoveTextWriter1.DataDef := FDBatchMoveTextReader1.DataDef; // ... ? FDBatchMoveTextWriter1.Encoding := FDBatchMoveTextReader1.Encoding; // FDBatchMoveTextWriter1.FileName := 'MyFileTarget.txt'; // DeleteFile(FDBatchMoveTextWriter1.FileName); // FDBatchMove1.Writer := FDBatchMoveTextWriter1; // MyClock := TStopwatch.StartNew; Caption := FDBatchMove1.Execute.ToString; MyClock.Stop; // Memo1.Lines.Add('------------'); Memo1.Lines.Add('Reading: [' + ExtractFileName(FDBatchMoveTextReader1.FileName) + '] to ' + FDBatchMoveTextWriter1.FileName); Memo1.Lines.Add('Records: [' + Caption + ']'); Memo1.Lines.Add('Time: ' + MyClock.Elapsed.TotalMilliseconds.ToString + ' (' + MyClock.Elapsed.Duration.ToString + ')'); // Memo1.Lines.Add('File output: ' + FDBatchMoveTextWriter1.FileName + ', size: ' + TFile.GetSize(FDBatchMoveTextWriter1.FileName).ToString); end; end; Edited February 19, 2022 by Guest Share this post Link to post
Guest Posted February 19, 2022 (edited) ------------ Reading: [1990VAERSDATA.csv] to MyFileTarget.txt Records: [2.280.000] Time: 54670,0764 (00:00:54.6700764) File output: MyFileTarget.txt, size: 881.373.600 (880MB) *************** Start Log 18/02/2022 23:31:12 *************** ********** No Errors Logged ********** *************** End Log 18/02/2022 23:32:13 *************** ------- to FDMemTable ---> OUT OF MEMORY 😂 😫 Edited February 19, 2022 by Guest Share this post Link to post
parallax 1 Posted February 19, 2022 Not sure why you get that error. There should be 35 fields in the data file: MyArr1990VAERSDATAFields: array [0 .. 34] of string = ( { } 'VAERS_ID', 'RECVDATE', 'STATE', 'AGE_YRS', 'CAGE_YR', 'CAGE_MO', { } 'SEX', 'RPT_DATE', 'SYMPTOM_TEXT', 'DIED', 'DATEDIED', 'L_THREAT', 'ER_VISIT', { } 'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE', { } 'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY', 'OTHER_MEDS', { } 'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE', 'FORM_VERS', 'TODAYS_DATE', { } 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT', 'ALLERGIES'); Share this post Link to post
Pat Foley 51 Posted February 19, 2022 12 hours ago, parallax said: But casting the output from MMFileToString using PAnsiChar, it worked fine, and I came down to 72 seconds on loading the big file. Not getting how that works? Using streaming appears to help reduce exe growth. // Lifted from Remy L DP post procedure TForm4.Button5Click(Sender: TObject); var //FS: TFileStream; // FS: TBufferedFileStream; //saves a little Reader: TStreamReader; FN, data, s: string; eT, vLinesCounter: Int64; begin et := GetTickCount64; vLinesCounter:= 0; FN := 'C:\2022\2021VAERSData\2021VAERSDATAutf8.csv'; //lactually just changed all (?) to (') 87xxx of the things FS := TBufferedFileStream.Create(FN,fmOpenRead or fmShareDenyWrite); try Reader := TStreamReader.Create(FS); Try while not Reader.EndOfStream do begin StringGrid1.rows[vLinesCounter].StrictDelimiter := True; StringGrid1.rows[vLinesCounter].CommaText := Reader.ReadLine; Inc(vLinesCounter); if vLinesCounter mod 20000 = 0 then caption := vLinesCounter.tostring; end; Finally Reader.Free End; finally FS.Free; end; StringGrid1.rowcount := vLinesCounter + 3; eT := GetTickCount64 - eT; Caption := vLinesCounter.tostring + ' Records loaded to SG in ' + (eT/1000).tostring + ' seconds.'; end; // For viewing cropped text in D procedure TForm4.StringGrid1SelectCell(Sender: TObject; ACol, ARow: Integer; var CanSelect: Boolean); begin memo1.Lines.Text := StringGrid1.Cells[aCol, aRow]; end; //New L code loads in 7 seconds procedure TForm1.Button3Click(Sender: TObject); var et: int64; begin if OpenDialog1.Execute then FN := OpenDialog1.FileName else exit; et := GetTickCount64; SG.LoadFromCSVFile(FN); et := GetTickCount64 - eT; Memo1.lines.add(eT.tostring); end; Share this post Link to post
Arnaud Bouchez 407 Posted February 19, 2022 note: if you read the file from start to end, Memory mapped files are not faster than reading the file in memory. The memory faults make it slower than a regular single FileRead() call. For huge files on Win32 which won't be able to load in memory, you may use temporary chunks (e.g. 128MB). And if you really load it once and don't want to pollute the OS disk memory cache, consider using the FILE_FLAG_SEQUENTIAL_SCAN flag under Windows. This is what we do with mORMot's FileOpenSequentialRead(). https://devblogs.microsoft.com/oldnewthing/20120120-00/?p=8493 1 Share this post Link to post
Attila Kovacs 629 Posted February 19, 2022 I'd be surprised if FILE_FLAG_SEQUENTIAL_SCAN would give any notable boost. Share this post Link to post
Guest Posted February 19, 2022 (edited) ex. for input/output with DateType defined: for var MyItem in MyArr1990VAERSDATAFields do begin if MyItem = 'VAX_DATE' then begin FDBatchMoveTextReader1.DataDef.FormatSettings.ShortDateFormat := 'mm/dd/yyyy'; MyFDTextField := FDBatchMoveTextReader1.DataDef.Fields.Add; MyFDTextField.FieldName := MyItem; MyFDTextField.DataType := TFDTextDataType.atDate; MyFDTextField.FieldSize := 0; // FDMemTable1.FieldDefs.Add(MyItem, TFieldType.ftDate, 0, false) end else begin MyFDTextField := FDBatchMoveTextReader1.DataDef.Fields.Add; MyFDTextField.FieldName := MyItem; MyFDTextField.DataType := TFDTextDataType.atString; MyFDTextField.FieldSize := MyFieldSize; // FDMemTable1.FieldDefs.Add(MyItem, TFieldType.ftString, MyFieldSize, false); end; end; // FDBatchMoveTextWriter1.DataDef := FDBatchMoveTextReader1.DataDef; procedure TForm1.FDMemTable1AfterScroll(DataSet: TDataSet); begin DateTimePicker1.DateTime := DataSet.FieldByName('VAX_DATE').AsDateTime; end; Memo1.Lines.Add(FDMemTable1.FieldByName('VAX_DATE').DataType.MyText); VAX_DATE = ftDate ------------ Reading: [1990VAERSDATA.csv] to FDMemTable Records: [2102] Time: 73,697 (00:00:00.0736970) Fixed and works now! 1.051.000 records in 25seconds ... MyArr1990VAERSDATAFields: array [0 .. 34] of string = ( ... { } .. FDBatchMoveTextReader1.DataDef.Delimiter := '"'; // ? ... Edited February 19, 2022 by Guest Share this post Link to post
Guest Posted February 19, 2022 that way, you can create many profiles for each file type. then, with this pre-defined "patterns" stay easy load it to dataset target or just to analize its data. Share this post Link to post
Brian Evans 105 Posted February 19, 2022 A quick look at the file (searched Bing for the 2022VAERSDATA.csv) and it looks like double quotes within a string field are escaped with a double quote. As well only string fields that contain commas or double quotes appear to be enclosed in double quotes. Will take a more detailed look when I get a chance. LOAD DATA local INFILE 'c:/programdata/mysql/mysql server 8.0/uploads/2022VAERSDATA.csv' INTO TABLE data CHARACTER SET latin1 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"' IGNORE 1 LINES Share this post Link to post
Guest Posted February 19, 2022 using FDxxxxTextRead you needs just define what fields you have Format Options nothing more Share this post Link to post
parallax 1 Posted February 19, 2022 44 minutes ago, Brian Evans said: A quick look at the file (searched Bing for the 2022VAERSDATA.csv) and it looks like double quotes within a string field are escaped with a double quote. As well only string fields that contain commas or double quotes appear to be enclosed in double quotes. Will take a more detailed look when I get a chance. LOAD DATA local INFILE 'c:/programdata/mysql/mysql server 8.0/uploads/2022VAERSDATA.csv' INTO TABLE data CHARACTER SET latin1 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"' IGNORE 1 LINES You are correct, double quotes are only used if there is a comma within the string, and quotations within strings are double quoted, i.e. escaped with a double quote. So your load clause should work, but it doesn't. About 10% of the rows are discarded somehow. Share this post Link to post
parallax 1 Posted February 19, 2022 58 minutes ago, joaodanet2018 said: using FDxxxxTextRead you needs just define what fields you have Format Options nothing more I am having issues with testing your code, getting very large files. Could you please attach a test project? Share this post Link to post
Guest Posted February 19, 2022 (edited) attached (7zip v21.07 with no password) my simple project for your tests... Done in RAD11 but nothing sophisticated, then, any other edition can open and build it. Any problem on open it, just create a new project and add my MainForm into it. in my tests, I used the "1990VAERSDATA.csv", copying the records into it until +1.000.000 lines, ok? if you see the "MyFileTarget.txt" output, you'll see that the field with "long text" will be saved same that in FDMemTable these text is showed using my "FieldSize limitation". Of course, this limitation is not for output text. FDConn_Reading_CSV_files_to_Parallax.zip Edited February 19, 2022 by Guest Share this post Link to post