parallax

More efficient string handling


Before loading a .csv file into MySQL, I have to get rid of all double quotes, as LOAD DATA won't accept them.

But some of the double-quoted strings contain commas, so those commas have to be converted to something else first.

 

In addition, the date formats have to be changed (I use a regex for this), and so do the boolean fields.

 

But this all takes a bit too much time, in my opinion: up to 20-30 seconds to convert 100.000 lines on my i7-7700.

 

Is it possible to make this run faster? Anyone got some good suggestions?

 

The code below is repeated for each line that is read from the .csv textfile.

 

The small project may be found at https://github.com/parallaxprog/vaers

 

function convertDates(s:string):string;
var i:integer;
    d, dStr:string;
begin
  i:=0;
  while (i<length(s)) do
  begin
    inc(i);
    d:=copy(s, i, 10);
    if regEx.IsMatch(d) then
    begin
      //we have a date, change format from american mm/dd/yyyy to yyyy-mm-dd format
      dStr:=copy(d, 7, 4)+'-'+copy(d, 1, 2)+'-'+copy(d, 4, 2);
      //insert it in the right place
      s:=copy(s, 1, i-1)+dStr+copy(s, i+10, length(s));
      //skip past it
      inc(i, 9);
    end;
  end;
  result:=s;
end;

function commaReplace(s:string):string;
var i:integer;
    inString:boolean;
    s2:string;
begin
  //replace each comma within double quotes with semicolon, and then remove double quotes
  //so we can use MySQL LOAD DATA function to load the .csv directly
  //first remove dual double quotes
  s:=stringReplace(s,'""','',[rfReplaceAll]);
  inString:=false;
  for i:=1 to length(s) do
  begin
    if s[i]='"' then
      inString:=not inString
      else
      if inString and (s[i]=',') then
        s[i]:=';';
  end;
  s2:=stringReplace(s,'"','',[rfReplaceAll]);
  //convert boolean Y to 1 in the right places, and convert dates
  result:=convertDates(stringReplace(s2,',Y,',',1,',[rfReplaceAll]));
end;

 

 


Edited by parallax


The main trick is to avoid memory allocation, i.e. temporary string allocations.

For instance, the fewer copy() calls, the better.

 

Try to rewrite your code to allocate one single output string per input string.
Just parse the input string from left to right, applying the quotes and date processing on the fly.

 

Then you could also avoid any input line allocation, and parse the whole input buffer at once.


Parsing 100.000 lines could be done much quicker if properly written; I would guess around 500 MB/s is easy to reach.
For instance, within mORMot we parse and convert 900 MB/s of JSON in pure pascal, including string unquoting.
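A sketch of the idea (illustrative only: it skips the Y-to-1 conversion, and the date check is a crude positional test instead of a regex):

```pascal
// Single pass, one preallocated output string, no copy()/Delete() calls.
function ConvertLine(const s: string): string;
var
  i, o: integer;
  inString: boolean;
begin
  SetLength(Result, Length(s));   // output is never longer than the input
  o := 0;
  inString := false;
  i := 1;
  while i <= Length(s) do
  begin
    if s[i] = '"' then
      inString := not inString    // quote is dropped; "" toggles twice, so no net change
    else if inString and (s[i] = ',') then
    begin
      inc(o);
      Result[o] := ';';           // comma inside quotes -> semicolon
    end
    else if (not inString) and (i + 9 <= Length(s))
      and (s[i+2] = '/') and (s[i+5] = '/') then
    begin
      // crude mm/dd/yyyy check; rewrite as yyyy-mm-dd straight into the output
      inc(o, 10);
      Result[o-9] := s[i+6]; Result[o-8] := s[i+7];
      Result[o-7] := s[i+8]; Result[o-6] := s[i+9];
      Result[o-5] := '-';
      Result[o-4] := s[i];   Result[o-3] := s[i+1];
      Result[o-2] := '-';
      Result[o-1] := s[i+3]; Result[o]   := s[i+4];
      inc(i, 10);
      continue;
    end
    else
    begin
      inc(o);
      Result[o] := s[i];
    end;
    inc(i);
  end;
  SetLength(Result, o);           // shrink once at the end
end;
```

Note the positional test does not validate digits, so a real version should also check '0'..'9' at the other eight positions before rewriting.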

Edited by Arnaud Bouchez

2 minutes ago, Arnaud Bouchez said:

The main trick is to avoid memory allocation, i.e. temporary string allocations.

For instance, the fewer copy() calls, the better.

 

Try to rewrite your code to allocate one single output string per input string.
Just parse the input string from left to right, applying the quotes and date processing on the fly.

 

Thanks, I think you have a point, and I have been pondering doing it all in a single loop, but it seemed a bit complicated. May have to figure this out 🙂


I was able to boil it down to a single loop, but still there is a copy call as well as three stringReplace calls.

 


function commaReplace(s:string):string;
var i:integer;
    inString:boolean;
    s2, d, dStr:string;
begin
  //replace each comma within double quotes with semicolon, and then remove double quotes
  //so we can use MySQL LOAD DATA function to load the .csv directly
  //first remove dual double quotes
  s:=stringReplace(s,'""','',[rfReplaceAll]);
  inString:=false;
  for i:=1 to length(s) do
  begin
    if s[i]='"' then
      inString:=not inString
      else
      if inString and (s[i]=',') then
        s[i]:=';';

    if ((length(s)-i)>10) and (not inString) then
    begin
      d:=copy(s, i, 10);
      if regEx.IsMatch(d) then
      begin
        //we have a date, change format from american mm/dd/yyyy to yyyy-mm-dd format
        dStr:=copy(d, 7, 4)+'-'+copy(d, 1, 2)+'-'+copy(d, 4, 2);
        //insert it in the right place
        s:=copy(s, 1, i-1)+dStr+copy(s, i+10, length(s));
      end;
    end;
  end;
  s2:=stringReplace(s,'"','',[rfReplaceAll]);
  //convert boolean Y to 1 in the right places, and convert dates
  //result:=convertDates(stringReplace(s2,',Y,',',1,',[rfReplaceAll]));
  result:=stringReplace(s2,',Y,',',1,',[rfReplaceAll]);
end;

 

Edited by parallax


Have you used a profiler yet? There really is little point in reworking your code until you know where the time is consumed.



No, I haven't tried that - that's probably a good idea. Will look at it, thanks.


The MySQL LOAD DATA statement can certainly handle all of that; not sure why you think otherwise?

 

MySQL :: MySQL 5.7 Reference Manual :: 13.2.6 LOAD DATA Statement

 

The [{FIELDS | COLUMNS} [TERMINATED BY 'string'] [[OPTIONALLY] ENCLOSED BY 'char'] [ESCAPED BY 'char']] clause can handle double quotes.

The SET col_name={expr | DEFAULT} clause can be used for date, boolean or other field conversions as needed.
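Something along these lines ought to work (a sketch only; the table name, the subset of columns and the @-variable names are made up for illustration):

```sql
LOAD DATA INFILE '1990VAERSDATA.csv'
INTO TABLE vaers_data
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(vaers_id, @recvdate, state, @died /* ... remaining columns ... */)
SET recvdate = STR_TO_DATE(@recvdate, '%m/%d/%Y'),
    died     = (@died = 'Y');
```

With ENCLOSED BY '"', a doubled quote ("") inside a quoted field is read back as a single literal quote, so no preprocessing of the file should be needed for that case.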

 

5 hours ago, Brian Evans said:

The MySQL LOAD DATA statement can certainly handle all of that; not sure why you think otherwise?

 

MySQL :: MySQL 5.7 Reference Manual :: 13.2.6 LOAD DATA Statement

 

The [{FIELDS | COLUMNS} [TERMINATED BY 'string'] [[OPTIONALLY] ENCLOSED BY 'char'] [ESCAPED BY 'char']] clause can handle double quotes.

The SET col_name={expr | DEFAULT} clause can be used for date, boolean or other field conversions as needed.

 

Yes, I know the statement is supposed to handle quotes, but I was not able to make it work with quotes and gave up after a couple of hours.

The TERMINATED BY clause seemed to have no effect whatsoever.

 

Also, there are instances of double and triple quotes; when something is quoted within a string, the inner quotes are doubled:

"This is a string, and then someone said ""Hello there."""

 

So you can end up with three double quotes in a row if the inner quote is at the end of the string.

9 hours ago, Arnaud Bouchez said:

You could do it with NO copy() call at all.

 

Just write a small state machine and read the input one char per char.

Yes, this is what I came up with, and I am checking it now to see if it works ok.

It reduced the extraction time from 147 to 83 seconds.

 

 

function convert(s:string):string;
var i:integer;
    nextNextQuote, nextQuote, nextComma, prevComma, yChar, comma, inString, quote:boolean;
    m1, m2, d1, d2:char;
begin
  inString:=false;
  quote:=false;
  comma:=false;

  for i:=1 to length(s) do
  if i<=length(s) then
  begin
    nextComma:=false;
    nextQuote:=false;
    nextNextQuote:=false;
    prevComma:=comma;
    comma:=(s[i]=',');
    if i<length(s) then
      nextComma:=(s[succ(i)]=',');
    if (prevComma and (s[i]='Y') and nextComma) then
      s[i]:='1';

    if (s[i]='"') then
    begin
      quote:=true;
      system.Delete(s, i, 1);
    end else quote:=false;

    if (s[i]='"') then
    begin
      nextQuote:=true;
      system.Delete(s, i, 1);
    end;

    if (s[i]='"') then
    begin
      nextNextQuote:=true;
      system.Delete(s, i, 1);
    end;

    //assume that only single or triple quotes takes us in or out of a string
    if (quote and (not nextQuote) and (not nextNextQuote))
    or (quote and nextQuote and nextNextQuote) then
      inString:=not inString;

    if inString and (s[i]=',') then
      s[i]:=';';

    if ((length(s)-i)>10) and (not inString) and prevComma then
    if (s[i+2]='/') and (s[i+5]='/') and (s[i+10]=',') then
    begin
      //pattern: ,??/??/????, is supposed to be a date
      //change date format from american mm/dd/yyyy to yyyy-mm-dd format
      m1:=s[i];
      m2:=s[i+1];
      d1:=s[i+3];
      d2:=s[i+4];
      s[i]:=s[i+6];
      s[i+1]:=s[i+7];
      s[i+2]:=s[i+8];
      s[i+3]:=s[i+9];
      s[i+4]:='-';
      s[i+5]:=m1;
      s[i+6]:=m2;
      s[i+7]:='-';
      s[i+8]:=d1;
      s[i+9]:=d2;
    end;
  end;
  result:=s;
end;

 

Edited by parallax


Regarding the code in the previous post, every call to Delete performs both a copy and a heap allocation. You should be able to do this with one heap allocation and one copy. 


Yes, delete() is as bad as copy(), David is right!

 

The idea is to keep the input string untouched, then append the output to a new output string, preallocated once with the maximum potential size.
Then call SetLength() once at the end, which is likely to just shrink the content in place without reallocation, thanks to the heap manager.
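The pattern, reduced to its simplest form (a sketch; the quote-stripping test stands in for the real per-character processing):

```pascal
// Copy the input to a preallocated output, skipping unwanted characters.
// Exactly one allocation up front and one SetLength() at the end.
function Filter(const input: string): string;
var
  i, o: integer;
begin
  SetLength(Result, Length(input));  // worst-case size, allocated once
  o := 0;
  for i := 1 to Length(input) do
    if input[i] <> '"' then          // dropped chars are simply not copied
    begin
      inc(o);
      Result[o] := input[i];
    end;
  SetLength(Result, o);              // final shrink, usually in place
end;
```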

Edited by Arnaud Bouchez

8 hours ago, David Heffernan said:

Regarding the code in the previous post, every call to Delete performs both a copy and a heap allocation. You should be able to do this with one heap allocation and one copy. 

Good idea, using some logic that should be possible.

 

I am also considering reading larger chunks of the files into an array, and looping through it - if that would be faster than reading single lines as with text files.

6 hours ago, parallax said:

I am also considering reading larger chunks of the files into an array, and looping through it - if that would be faster than reading single lines as with text files.

Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. 
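For instance (a sketch; the line-splitting itself is left as a comment):

```pascal
// uses System.Classes, System.SysUtils
// Read the file in large chunks instead of ReadLn on a text file.
procedure ProcessFile(const fileName: string);
const
  BufSize = 1 shl 20;  // 1 MB per read
var
  fs: TFileStream;
  buf: TBytes;
  bytesRead: integer;
begin
  fs := TFileStream.Create(fileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(buf, BufSize);
    repeat
      bytesRead := fs.Read(buf[0], BufSize);
      // scan buf[0..bytesRead-1] for #10 here, carrying any partial
      // last line over to the next chunk before converting each line
    until bytesRead < BufSize;
  finally
    fs.Free;
  end;
end;
```

A simpler middle ground is TStreamReader created with an explicit buffer size, which buffers the reads while still giving you ReadLine.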

Edited by David Heffernan

Guest

Are these values meaningful to what you are intending when reading a CSV file?

1990VAERSDATA.csv  with 102.952 lines and these fields:

VAERS ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,ER_VISIT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,NUMDAYS,LAB_DATA,V_ADMINBY,V_FUNDBY,OTHER_MEDS,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES,

 

Arrays: 102952 => ( lines on CSV file )

Scenario:
CPU: i7 4770, 16GB RAM, HDD 300GB, MS Win10 21H2 64-bit
CSV file: 28.4 MB (29,846,064 bytes)
RAD Studio 11, Debug mode, 64-bit, +/- 200 lines of code (counting the comments)

On process: No Threading... :><
CPU: +/- 15% usage
Mem: +/- 800MBytes usage

Time: 9563 ms --> 00:00:09.5638096 ---> len: 102.952 records == read-block=50bytes

Same file but now using Read-Block with 512bytes
Arrays: 57943
4256 --> 00:00:04.2569624 ---> len: 57943 records == read-block=512bytes

Same file but now using Read-Block with 1024bytes
Arrays: 29148
3938 --> 00:00:03.9388028 ---> len: 29148 records == read-block=1024bytes


JSON objects = array of objects, simulating conversion of types for each value on the line...

{
    "xVAERS ID": "VAERS ID",
    "xRECVDATE": "RECVDATE",
    "xSTATE": "STATE",
    "xAGE_YRS": "AGE_YRS",
    "xCAGE_YR": "CAGE_YR",
    "xCAGE MO": "CAGE MO",
.... 
{
    "x0025012": "0025012",
    "x07_02_1990": "07_02_1990",
    "xWI": "WI",
    "x0_2": "0_2",
    "xM": "M",
...

 

Edited by Guest


Tiniest thing of all (in this context), but always pass string parameters as const if possible:

procedure Foo(const AStringValue: string);

 

-Tee-

9 hours ago, David Heffernan said:

Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. 

4 hours ago, joaodanet2018 said:

Are these values meaningful to what you are intending when reading a CSV file?

1990VAERSDATA.csv  with 102.952 lines and these fields:

 

 

 

When doing the 1990 data file with 2,102 lines, the whole process takes less than 300 milliseconds with an empty table. The results below 200 ms are when the table was already populated.

I get the same numbers regardless of mode, release or debug. I have an i7-7700HQ with 16 GB running Win10 x64.

 

 

[screenshot: 1990 ex.jpg, timing results]

 

Now this is a small file, only 597,231 bytes, but when processing the 2021 file we have almost 653 million bytes/characters to check, over 1000 times more. That is the bottleneck.

 

 

 

Edited by parallax

9 hours ago, David Heffernan said:

Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. 

You are right, and I should at least set up some ticks measuring around critical parts, to get an idea.

1 minute ago, parallax said:

I should at least set up some ticks measuring around critical parts, to get an idea.

No need for that. Just run it in a profiler and you will know exactly which statements you should concentrate your efforts on.


30 minutes ago, Tommi Prami said:

Tiniest thing of all (in this context), but always pass string parameters as const if possible:

procedure Foo(const AStringValue: string);

 

-Tee-

True, and that is something I should use a lot more, obviously. Knew about it, but... I guess it's a habit that went away when computers got faster.


2 minutes ago, Anders Melander said:

No need for that. Just run it in a profiler and you will know exactly which statements you should concentrate your efforts on.

I am afraid that searching for a good profiler, or buying one, and then figuring out how to set it up and use it, will take more time than it's worth compared to some less complicated setup.

3 minutes ago, parallax said:

I am afraid that searching for a good profiler, or buying one, and then figuring out how to set it up and use it, will take more time than it's worth compared to some less complicated setup.

It's a tool you should have in your tool chest, so consider it an investment. A profiler is the only way to make sure you aren't doing premature optimization.

 

The VTune profiler is free and with map2pdb you can use it with Delphi.


7 minutes ago, Anders Melander said:

It's a tool you should have in your tool chest so consider it an investment. A profiler is the only way to make sure you aren't doing premature optimization.

 

The VTune profiler is free and with map2pdb you can use it with Delphi.

Interesting project you have there - thanks a lot for the suggestion. Of course you are right that it would be nice to use this kind of tool, in certain situations, like in this case.

 


VTune with map2pdb is excellent. It will take time to set up, but it will save you time in the long run. And you'll have a better end result.

3 hours ago, David Heffernan said:

vtune with map2pdb is excellent. It will take time to set it up, but it will save you time in the long run. And you'll have a better end result. 

I think you are right about that, and I've downloaded it all and installed it, and maybe I will take a look at the manual later on. Interesting stuff 🙂

