parallax 1 Posted February 15, 2022 (edited) Before loading a .csv file into MySQL, I have to get rid of all double quotes, as the LOAD DATA won't have any of that. But there are instances of commas within those strings having double quotes, so they have to be converted to something else. Besides, the date formats have to be changed, and I use a regEx for this. And the boolean fields. But this all takes a bit too much time, in my opinion, up to 20-30 seconds for conversion of 100.000 lines on my i7-7700. Is it possible to make this run faster? Anyone got some good suggestions? The code below is repeated for each line that is read from the .csv textfile. The small project may be found at https://github.com/parallaxprog/vaers function convertDates(s:string):string; var i:integer; d, dStr:string; begin i:=0; while (i<length(s)) do begin inc(i); d:=copy(s, i, 10); if regEx.IsMatch(d) then begin //we have a date, change format from american mm/dd/yyyy to yyyy-mm-dd format dStr:=copy(d, 7, 4)+'-'+copy(d, 1, 2)+'-'+copy(d, 4, 2); //insert it in the right place s:=copy(s, 1, i-1)+dStr+copy(s, i+10, length(s)); //skip past it inc(i, 9); end; end; result:=s; end; function commaReplace(s:string):string; var i:integer; inString:boolean; s2:string; begin //replace each comma within double quotes with semicolon, and then remove double quotes //so we can use MySQL LOAD DATA function to load the .csv directly //first remove dual double quotes s:=stringReplace(s,'""','',[rfReplaceAll]); inString:=false; for i:=1 to length(s) do begin if s[i]='"' then inString:=not inString else if inString and (s[i]=',') then s[i]:=';'; end; s2:=stringReplace(s,'"','',[rfReplaceAll]); //convert boolean Y to 1 in the right places, and convert dates result:=convertDates(stringReplace(s2,',Y,',',1,',[rfReplaceAll])); end; Edited February 15, 2022 by parallax Share this post Link to post
Arnaud Bouchez 407 Posted February 15, 2022 (edited) The main trick is to avoid memory allocation, i.e. temporary string allocations. For instance, the less copy() calls, the better. Try to rewrite your code to allocate one single output string per input string. Just parse the input string from left to write, then applying the quotes or dates processing on the fly. Then you could also avoid any input line allocation, and parse the whole input buffer at once. Parsing 100.000 lines could be done much quicker, if properly written. I guess round 500MB/s is easy to reach. For instance, within mORMot, we parse and convert 900MB/s of JSON in pure pascal, including string unquoting. Edited February 15, 2022 by Arnaud Bouchez 3 Share this post Link to post
parallax 1 Posted February 15, 2022 2 minutes ago, Arnaud Bouchez said: The main trick is to avoid memory allocation, i.e. temporary string allocations. For instance, the less copy() calls, the better. Try to rewrite your code to allocate one single output string per input string. Just parse the input string from left to write, then applying the quotes or dates processing on the fly. Thanks, I think you have a point, and I have been pondering doing it all in a single loop, but it seemed a bit complicated. May have to figure this out 🙂 Share this post Link to post
parallax 1 Posted February 15, 2022 (edited) I was able to boil it down to a single loop, but still there is a copy call as well as three stringReplace calls. function commaReplace(s:string):string; var i:integer; inString:boolean; s2, d, dStr:string; begin //replace each comma within double quotes with semicolon, and then remove double quotes //so we can use MySQL LOAD DATA function to load the .csv directly //first remove dual double quotes s:=stringReplace(s,'""','',[rfReplaceAll]); inString:=false; for i:=1 to length(s) do begin if s[i]='"' then inString:=not inString else if inString and (s[i]=',') then s[i]:=';'; if ((length(s)-i)>10) and (not inString) then begin d:=copy(s, i, 10); if regEx.IsMatch(d) then begin //we have a date, change format from american mm/dd/yyyy to yyyy-mm-dd format dStr:=copy(d, 7, 4)+'-'+copy(d, 1, 2)+'-'+copy(d, 4, 2); //insert it in the right place s:=copy(s, 1, i-1)+dStr+copy(s, i+10, length(s)); end; end; end; s2:=stringReplace(s,'"','',[rfReplaceAll]); //convert boolean Y to 1 in the right places, and convert dates //result:=convertDates(stringReplace(s2,',Y,',',1,',[rfReplaceAll])); result:=stringReplace(s2,',Y,',',1,',[rfReplaceAll]); end; Edited February 15, 2022 by parallax Share this post Link to post
Bill Meyer 337 Posted February 15, 2022 Have you used a profiler yet? There really is little point to reworking your code until you know where the time is consumed. 1 Share this post Link to post
parallax 1 Posted February 15, 2022 No, I haven't tried that - that's probably a good idea. Will look at it, thanks. Share this post Link to post
Arnaud Bouchez 407 Posted February 15, 2022 You could do it with NO copy() call at all. Just write a small state machine and read the input one char per char. Share this post Link to post
Brian Evans 105 Posted February 15, 2022 The MySQL LOAD DATA statement can certainly handle all of that not sure why you think otherwise? MySQL :: MySQL 5.7 Reference Manual :: 13.2.6 LOAD DATA Statement The [{FIELDS | COLUMNS} [TERMINATED BY 'string'] [[OPTIONALLY] ENCLOSED BY 'char'] [ESCAPED BY 'char'] can handle double quotes. The SET col_name={expr | DEFAULT} section can be used for date, Boolean or other field conversions as needed. Share this post Link to post
parallax 1 Posted February 16, 2022 5 hours ago, Brian Evans said: The MySQL LOAD DATA statement can certainly handle all of that not sure why you think otherwise? MySQL :: MySQL 5.7 Reference Manual :: 13.2.6 LOAD DATA Statement The [{FIELDS | COLUMNS} [TERMINATED BY 'string'] [[OPTIONALLY] ENCLOSED BY 'char'] [ESCAPED BY 'char'] can handle double quotes. The SET col_name={expr | DEFAULT} section can be used for date, Boolean or other field conversions as needed. Yes, I know the statement is supposed to handle quotes, but I was not able to make it work with quotes and gave up after a couple of hours. The terminated by clause seemed to have no effect whatsover. Also, there are instances of double and triple quotes, when something is quoted within a string they use double quotes: "This is a string, and then someone said ""Hello there.""" So you can risk having three double quotes in a row if the quote is at the end of the string. Share this post Link to post
parallax 1 Posted February 16, 2022 (edited) 9 hours ago, Arnaud Bouchez said: You could do it with NO copy() call at all. Just write a small state machine and read the input one char per char. Yes, this is what I came up with, and I am checking it now to see if it works ok. It reduced the extraction time from 147 to 83 seconds. function convert(s:string):string; var i:integer; nextNextQuote, nextQuote, nextComma, prevComma, yChar, comma, inString, quote:boolean; m1, m2, d1, d2:char; begin inString:=false; quote:=false; comma:=false; for i:=1 to length(s) do if i<=length(s) then begin nextComma:=false; nextQuote:=false; nextNextQuote:=false; prevComma:=comma; comma:=(s[i]=','); if i<length(s) then nextComma:=(s[succ(i)]=','); if (prevComma and (s[i]='Y') and nextComma) then s[i]:='1'; if (s[i]='"') then begin quote:=true; system.Delete(s, i, 1); end else quote:=false; if (s[i]='"') then begin nextQuote:=true; system.Delete(s, i, 1); end; if (s[i]='"') then begin nextNextQuote:=true; system.Delete(s, i, 1); end; //assume that only single or triple quotes takes us in or out of a string if (quote and (not nextQuote) and (not nextNextQuote)) or (quote and nextQuote and nextNextQuote) then inString:=not inString; if inString and (s[i]=',') then s[i]:=';'; if ((length(s)-i)>10) and (not inString) and prevComma then if (s[i+2]='/') and (s[i+5]='/') and (s[i+10]=',') then begin //pattern: ,??/??/????, is supposed to be a date //change date format from american mm/dd/yyyy to yyyy-mm-dd format m1:=s[i]; m2:=s[i+1]; d1:=s[i+3]; d2:=s[i+4]; s[i]:=s[i+6]; s[i+1]:=s[i+7]; s[i+2]:=s[i+8]; s[i+3]:=s[i+9]; s[i+4]:='-'; s[i+5]:=m1; s[i+6]:=m2; s[i+7]:='-'; s[i+8]:=d1; s[i+9]:=d2; end; end; result:=s; end; Edited February 16, 2022 by parallax Share this post Link to post
David Heffernan 2345 Posted February 16, 2022 Regarding the code in the previous post, every call to Delete performs both a copy and a heap allocation. You should be able to do this with one heap allocation and one copy. Share this post Link to post
Arnaud Bouchez 407 Posted February 16, 2022 (edited) Yes, delete() is as bad as copy(), David is right! Idea is to keep the input string untouched, then append the output to a new output string, preallocated once with a maximum potential size. Then call SetLength() once at the end, which is likely to do nothing and reallocate the content in-place, thanks to the heap manager. Edited February 16, 2022 by Arnaud Bouchez Share this post Link to post
parallax 1 Posted February 16, 2022 8 hours ago, David Heffernan said: Regarding the code in the previous post, every call to Delete performs both a copy and a heap allocation. You should be able to do this with one heap allocation and one copy. Good idea, using some logic that should be possible. I am also considering reading larger chunks of the files into an array, and looping through it - if that would be faster than reading single lines as with text files. Share this post Link to post
David Heffernan 2345 Posted February 16, 2022 (edited) 6 hours ago, parallax said: I am also considering reading larger chunks of the files into an array, and looping through it - if that would be faster than reading single lines as with text files. Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. Edited February 16, 2022 by David Heffernan Share this post Link to post
Guest Posted February 17, 2022 (edited) Are these values meaningful to what you are intending when reading a CSV file? 1990VAERSDATA.csv with 102.952 lines and these fields: VAERS ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,ER_VISIT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,NUMDAYS,LAB_DATA,V_ADMINBY,V_FUNDBY,OTHER_MEDS,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES, Arrays: 102952 => ( lines on CSV file ) Scenary: CPU: i7 4770, 16GB, HDD300GB, MSWin10_21H2_64bits File CSV: 28,4 MB (29.846.064 bytes) RAD Studio 11, Debug mode, 64bits, +/- 200lines of code (counting the comments) On process: No Threading... :>< CPU: +/- 15% usage Mem: +/- 800MBytes usage Time: 9563 ms --> 00:00:09.5638096 ---> len: 102.952 records == read-block=50bytes Same file but now using Read-Block with 512bytes Arrays: 57943 4256 --> 00:00:04.2569624 ---> len: 57943 records == read-block=512bytes Same file but now using Read-Block with 1024bytes Arrays: 29148 3938 --> 00:00:03.9388028 ---> len: 29148 records == read-block=1024bytes objects JSON = Arrays objects simulating convertion of types in each value on the line... { "xVAERS ID": "VAERS ID", "xRECVDATE": "RECVDATE", "xSTATE": "STATE", "xAGE_YRS": "AGE_YRS", "xCAGE_YR": "CAGE_YR", "xCAGE MO": "CAGE MO", .... { "x0025012": "0025012", "x07_02_1990": "07_02_1990", "xWI": "WI", "x0_2": "0_2", "xM": "M", ... Edited February 17, 2022 by Guest Share this post Link to post
Tommi Prami 130 Posted February 17, 2022 Tineas thing of all (in this context), but always pass string parameters as const, if possible, procedure Foo(const AStringValue: string); -Tee- Share this post Link to post
parallax 1 Posted February 17, 2022 (edited) 9 hours ago, David Heffernan said: Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. 4 hours ago, joaodanet2018 said: Are these values meaningful to what you are intending when reading a CSV file? 1990VAERSDATA.csv with 102.952 lines and these fields: When doing the 1990 data file with 2102 lines , the whole process takes less than 300 milliseconds with an empty table. The results below 200ms are when the table was already populated. I get the same numbers independent of mode, release or debug. I have an i7-7700HQ with 16Gb running Win10 x64. Now this is a small file, only 597,231 bytes, but when processing the 2021 file we have almost 653 million bytes/characters to check, over 1000 times more. That is the bottleneck. Edited February 17, 2022 by parallax Share this post Link to post
parallax 1 Posted February 17, 2022 9 hours ago, David Heffernan said: Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. You are right, and I should at least set up some ticks measuring around critical parts, to get an idea. Share this post Link to post
Anders Melander 1783 Posted February 17, 2022 1 minute ago, parallax said: I should at least set up some ticks measuring around critical parts, to get an idea. No need for that. Just run it in a profiler and you will know exactly which statements you should concentrate your efforts on. 1 Share this post Link to post
parallax 1 Posted February 17, 2022 30 minutes ago, Tommi Prami said: Tineas thing of all (in this context), but always pass string parameters as const, if possible, procedure Foo(const AStringValue: string); -Tee- True, and that is something I should use a lot more, obviously. Knew about it, but... I guess it's a habit that went away when computers got faster. 1 Share this post Link to post
parallax 1 Posted February 17, 2022 2 minutes ago, Anders Melander said: No need for that. Just run it in a profiler and you will know exactly which statements you should concentrate your efforts on. I am afraid that searching for a good profiler, or buying one, and finding out how to set it up and use it, will take more time than it's worth...or compared to doing some less complicated setup. Share this post Link to post
Anders Melander 1783 Posted February 17, 2022 3 minutes ago, parallax said: I am afraid that searching for a good profiler, or buying one, and finding out how to set it up and use it, will take more time than it's worth...or compared to doing some less complicated setup. It's a tool you should have in your tool chest so consider it an investment. A profiler is the only way to make sure you aren't doing premature optimization. The VTune profiler is free and with map2pdb you can use it with Delphi. 1 Share this post Link to post
parallax 1 Posted February 17, 2022 7 minutes ago, Anders Melander said: It's a tool you should have in your tool chest so consider it an investment. A profiler is the only way to make sure you aren't doing premature optimization. The VTune profiler is free and with map2pdb you can use it with Delphi. Interesting project you have there - thanks a lot for the suggestion. Of course you are right that it would be nice to use this kind of tool, in certain situations, like in this case. Share this post Link to post
David Heffernan 2345 Posted February 17, 2022 vtune with map2pdb is excellent. It will take time to set it up, but it will save you time in the long run. And you'll have a better end result. Share this post Link to post
parallax 1 Posted February 17, 2022 3 hours ago, David Heffernan said: vtune with map2pdb is excellent. It will take time to set it up, but it will save you time in the long run. And you'll have a better end result. I think you are right about that, and I've downloaded it all and installed it, and maybe I will take a look at the manual later on. Interesting stuff 🙂 Share this post Link to post