parallax

More efficient string handling


Before loading a .csv file into MySQL, I have to get rid of all double quotes, as LOAD DATA won't accept them.

But some of the double-quoted strings contain commas, so those commas have to be converted to something else first.

 

In addition, the date formats have to be changed (I use a regex for this), and so do the boolean fields.

 

But this all takes a bit too much time, in my opinion: up to 20-30 seconds to convert 100.000 lines on my i7-7700.

 

Is it possible to make this run faster? Anyone got some good suggestions?

 

The code below is repeated for each line that is read from the .csv textfile.

 

The small project may be found at https://github.com/parallaxprog/vaers

 

function convertDates(s:string):string;
var i:integer;
    d, dStr:string;
begin
  i:=0;
  while (i<length(s)) do
  begin
    inc(i);
    d:=copy(s, i, 10);
    if regEx.IsMatch(d) then
    begin
      //we have a date, change format from american mm/dd/yyyy to yyyy-mm-dd format
      dStr:=copy(d, 7, 4)+'-'+copy(d, 1, 2)+'-'+copy(d, 4, 2);
      //insert it in the right place
      s:=copy(s, 1, i-1)+dStr+copy(s, i+10, length(s));
      //skip past it
      inc(i, 9);
    end;
  end;
  result:=s;
end;

function commaReplace(s:string):string;
var i:integer;
    inString:boolean;
    s2:string;
begin
  //replace each comma within double quotes with semicolon, and then remove double quotes
  //so we can use MySQL LOAD DATA function to load the .csv directly
  //first remove dual double quotes
  s:=stringReplace(s,'""','',[rfReplaceAll]);
  inString:=false;
  for i:=1 to length(s) do
  begin
    if s[i]='"' then
      inString:=not inString
      else
      if inString and (s[i]=',') then
        s[i]:=';';
  end;
  s2:=stringReplace(s,'"','',[rfReplaceAll]);
  //convert boolean Y to 1 in the right places, and convert dates
  result:=convertDates(stringReplace(s2,',Y,',',1,',[rfReplaceAll]));
end;

 

 


Edited by parallax


The main trick is to avoid memory allocation, i.e. temporary string allocations.

For instance, the fewer copy() calls, the better.

 

Try to rewrite your code to allocate one single output string per input string.
Just parse the input string from left to right, applying the quotes and date processing on the fly.

 

Then you could also avoid any input line allocation, and parse the whole input buffer at once.


Parsing 100.000 lines could be done much quicker if properly written; I would guess around 500 MB/s is easy to reach.
For instance, within mORMot we parse and convert 900 MB/s of JSON in pure pascal, including string unquoting.
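A sketch of the idea (illustrative only: it skips the Y-to-1 conversion, and the date check is a crude positional test instead of a regex):

```pascal
// Single pass, one preallocated output string, no copy()/Delete() calls.
function ConvertLine(const s: string): string;
var
  i, o: integer;
  inString: boolean;
begin
  SetLength(Result, Length(s));   // output is never longer than the input
  o := 0;
  inString := false;
  i := 1;
  while i <= Length(s) do
  begin
    if s[i] = '"' then
      inString := not inString    // quote is dropped; "" toggles twice, so no net change
    else if inString and (s[i] = ',') then
    begin
      inc(o);
      Result[o] := ';';           // comma inside quotes -> semicolon
    end
    else if (not inString) and (i + 9 <= Length(s))
      and (s[i+2] = '/') and (s[i+5] = '/') then
    begin
      // crude mm/dd/yyyy check; rewrite as yyyy-mm-dd straight into the output
      inc(o, 10);
      Result[o-9] := s[i+6]; Result[o-8] := s[i+7];
      Result[o-7] := s[i+8]; Result[o-6] := s[i+9];
      Result[o-5] := '-';
      Result[o-4] := s[i];   Result[o-3] := s[i+1];
      Result[o-2] := '-';
      Result[o-1] := s[i+3]; Result[o]   := s[i+4];
      inc(i, 10);
      continue;
    end
    else
    begin
      inc(o);
      Result[o] := s[i];
    end;
    inc(i);
  end;
  SetLength(Result, o);           // shrink once at the end
end;
```

Note the positional test does not validate digits, so a real version should also check '0'..'9' at the other eight positions before rewriting.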

Edited by Arnaud Bouchez

2 minutes ago, Arnaud Bouchez said:

The main trick is to avoid memory allocation, i.e. temporary string allocations.

For instance, the fewer copy() calls, the better.

 

Try to rewrite your code to allocate one single output string per input string.
Just parse the input string from left to right, applying the quotes and date processing on the fly.

 

Thanks, I think you have a point, and I have been pondering doing it all in a single loop, but it seemed a bit complicated. May have to figure this out 🙂


I was able to boil it down to a single loop, but still there is a copy call as well as three stringReplace calls.

 


function commaReplace(s:string):string;
var i:integer;
    inString:boolean;
    s2, d, dStr:string;
begin
  //replace each comma within double quotes with semicolon, and then remove double quotes
  //so we can use MySQL LOAD DATA function to load the .csv directly
  //first remove dual double quotes
  s:=stringReplace(s,'""','',[rfReplaceAll]);
  inString:=false;
  for i:=1 to length(s) do
  begin
    if s[i]='"' then
      inString:=not inString
      else
      if inString and (s[i]=',') then
        s[i]:=';';

    if ((length(s)-i)>10) and (not inString) then
    begin
      d:=copy(s, i, 10);
      if regEx.IsMatch(d) then
      begin
        //we have a date, change format from american mm/dd/yyyy to yyyy-mm-dd format
        dStr:=copy(d, 7, 4)+'-'+copy(d, 1, 2)+'-'+copy(d, 4, 2);
        //insert it in the right place
        s:=copy(s, 1, i-1)+dStr+copy(s, i+10, length(s));
      end;
    end;
  end;
  s2:=stringReplace(s,'"','',[rfReplaceAll]);
  //convert boolean Y to 1 in the right places, and convert dates
  //result:=convertDates(stringReplace(s2,',Y,',',1,',[rfReplaceAll]));
  result:=stringReplace(s2,',Y,',',1,',[rfReplaceAll]);
end;

 

Edited by parallax


Have you used a profiler yet? There really is little point in reworking your code until you know where the time is consumed.



No, I haven't tried that - that's probably a good idea. Will look at it, thanks.


The MySQL LOAD DATA statement can certainly handle all of that; not sure why you think otherwise?

 

MySQL :: MySQL 5.7 Reference Manual :: 13.2.6 LOAD DATA Statement

 

The [{FIELDS | COLUMNS} [TERMINATED BY 'string'] [[OPTIONALLY] ENCLOSED BY 'char'] [ESCAPED BY 'char']] clause can handle double quotes.

The SET col_name={expr | DEFAULT} clause can be used for date, boolean or other field conversions as needed.
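Something along these lines ought to work (a sketch only; the table name, the subset of columns and the @-variable names are made up for illustration):

```sql
LOAD DATA INFILE '1990VAERSDATA.csv'
INTO TABLE vaers_data
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(vaers_id, @recvdate, state, @died /* ... remaining columns ... */)
SET recvdate = STR_TO_DATE(@recvdate, '%m/%d/%Y'),
    died     = (@died = 'Y');
```

With ENCLOSED BY '"', a doubled quote ("") inside a quoted field is read back as a single literal quote, so no preprocessing of the file should be needed for that case.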

 

5 hours ago, Brian Evans said:

The MySQL LOAD DATA statement can certainly handle all of that; not sure why you think otherwise?

 

MySQL :: MySQL 5.7 Reference Manual :: 13.2.6 LOAD DATA Statement

 

The [{FIELDS | COLUMNS} [TERMINATED BY 'string'] [[OPTIONALLY] ENCLOSED BY 'char'] [ESCAPED BY 'char']] clause can handle double quotes.

The SET col_name={expr | DEFAULT} clause can be used for date, boolean or other field conversions as needed.

 

Yes, I know the statement is supposed to handle quotes, but I was not able to make it work with quotes and gave up after a couple of hours.

The TERMINATED BY clause seemed to have no effect whatsoever.

 

Also, there are instances of double and triple quotes; when something is quoted within a string, the inner quotes are doubled:

"This is a string, and then someone said ""Hello there."""

 

So you can end up with three double quotes in a row if the inner quote is at the end of the string.

9 hours ago, Arnaud Bouchez said:

You could do it with NO copy() call at all.

 

Just write a small state machine and read the input one char per char.

Yes, this is what I came up with, and I am checking it now to see if it works ok.

It reduced the extraction time from 147 to 83 seconds.

 

 

function convert(s:string):string;
var i:integer;
    nextNextQuote, nextQuote, nextComma, prevComma, yChar, comma, inString, quote:boolean;
    m1, m2, d1, d2:char;
begin
  inString:=false;
  quote:=false;
  comma:=false;

  for i:=1 to length(s) do
  if i<=length(s) then
  begin
    nextComma:=false;
    nextQuote:=false;
    nextNextQuote:=false;
    prevComma:=comma;
    comma:=(s[i]=',');
    if i<length(s) then
      nextComma:=(s[succ(i)]=',');
    if (prevComma and (s[i]='Y') and nextComma) then
      s[i]:='1';

    if (s[i]='"') then
    begin
      quote:=true;
      system.Delete(s, i, 1);
    end else quote:=false;

    if (s[i]='"') then
    begin
      nextQuote:=true;
      system.Delete(s, i, 1);
    end;

    if (s[i]='"') then
    begin
      nextNextQuote:=true;
      system.Delete(s, i, 1);
    end;

    //assume that only single or triple quotes takes us in or out of a string
    if (quote and (not nextQuote) and (not nextNextQuote))
    or (quote and nextQuote and nextNextQuote) then
      inString:=not inString;

    if inString and (s[i]=',') then
      s[i]:=';';

    if ((length(s)-i)>10) and (not inString) and prevComma then
    if (s[i+2]='/') and (s[i+5]='/') and (s[i+10]=',') then
    begin
      //pattern: ,??/??/????, is supposed to be a date
      //change date format from american mm/dd/yyyy to yyyy-mm-dd format
      m1:=s[i];
      m2:=s[i+1];
      d1:=s[i+3];
      d2:=s[i+4];
      s[i]:=s[i+6];
      s[i+1]:=s[i+7];
      s[i+2]:=s[i+8];
      s[i+3]:=s[i+9];
      s[i+4]:='-';
      s[i+5]:=m1;
      s[i+6]:=m2;
      s[i+7]:='-';
      s[i+8]:=d1;
      s[i+9]:=d2;
    end;
  end;
  result:=s;
end;

 

Edited by parallax


Regarding the code in the previous post, every call to Delete performs both a copy and a heap allocation. You should be able to do this with one heap allocation and one copy. 


Yes, delete() is as bad as copy(), David is right!

 

The idea is to keep the input string untouched, then append the output to a new output string, preallocated once with the maximum potential size.
Then call SetLength() once at the end, which is likely to just shrink the content in place without reallocation, thanks to the heap manager.
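The pattern, reduced to its simplest form (a sketch; the quote-stripping test stands in for the real per-character processing):

```pascal
// Copy the input to a preallocated output, skipping unwanted characters.
// Exactly one allocation up front and one SetLength() at the end.
function Filter(const input: string): string;
var
  i, o: integer;
begin
  SetLength(Result, Length(input));  // worst-case size, allocated once
  o := 0;
  for i := 1 to Length(input) do
    if input[i] <> '"' then          // dropped chars are simply not copied
    begin
      inc(o);
      Result[o] := input[i];
    end;
  SetLength(Result, o);              // final shrink, usually in place
end;
```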

Edited by Arnaud Bouchez

8 hours ago, David Heffernan said:

Regarding the code in the previous post, every call to Delete performs both a copy and a heap allocation. You should be able to do this with one heap allocation and one copy. 

Good idea, using some logic that should be possible.

 

I am also considering reading larger chunks of the files into an array, and looping through it - if that would be faster than reading single lines as with text files.

6 hours ago, parallax said:

I am also considering reading larger chunks of the files into an array, and looping through it - if that would be faster than reading single lines as with text files.

Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. 
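For instance (a sketch; the line-splitting itself is left as a comment):

```pascal
// uses System.Classes, System.SysUtils
// Read the file in large chunks instead of ReadLn on a text file.
procedure ProcessFile(const fileName: string);
const
  BufSize = 1 shl 20;  // 1 MB per read
var
  fs: TFileStream;
  buf: TBytes;
  bytesRead: integer;
begin
  fs := TFileStream.Create(fileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(buf, BufSize);
    repeat
      bytesRead := fs.Read(buf[0], BufSize);
      // scan buf[0..bytesRead-1] for #10 here, carrying any partial
      // last line over to the next chunk before converting each line
    until bytesRead < BufSize;
  finally
    fs.Free;
  end;
end;
```

A simpler middle ground is TStreamReader created with an explicit buffer size, which buffers the reads while still giving you ReadLine.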

Edited by David Heffernan

Guest

Are these values meaningful to what you are intending when reading a CSV file?

1990VAERSDATA.csv  with 102.952 lines and these fields:

VAERS ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,ER_VISIT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,NUMDAYS,LAB_DATA,V_ADMINBY,V_FUNDBY,OTHER_MEDS,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES,

 

Arrays: 102952 => ( lines on CSV file )

Scenario:
CPU: i7 4770, 16GB RAM, HDD 300GB, MS Win10 21H2 64-bit
CSV file: 28.4 MB (29,846,064 bytes)
RAD Studio 11, Debug mode, 64-bit, +/- 200 lines of code (counting the comments)

On process: No Threading... :><
CPU: +/- 15% usage
Mem: +/- 800MBytes usage

Time: 9563 ms --> 00:00:09.5638096 ---> len: 102.952 records == read-block=50bytes

Same file but now using Read-Block with 512bytes
Arrays: 57943
4256 --> 00:00:04.2569624 ---> len: 57943 records == read-block=512bytes

Same file but now using Read-Block with 1024bytes
Arrays: 29148
3938 --> 00:00:03.9388028 ---> len: 29148 records == read-block=1024bytes


JSON objects = array of objects, simulating conversion of types for each value on the line...

{
    "xVAERS ID": "VAERS ID",
    "xRECVDATE": "RECVDATE",
    "xSTATE": "STATE",
    "xAGE_YRS": "AGE_YRS",
    "xCAGE_YR": "CAGE_YR",
    "xCAGE MO": "CAGE MO",
.... 
{
    "x0025012": "0025012",
    "x07_02_1990": "07_02_1990",
    "xWI": "WI",
    "x0_2": "0_2",
    "xM": "M",
...

 

Edited by Guest


Tiniest thing of all (in this context), but always pass string parameters as const if possible:

procedure Foo(const AStringValue: string);

 

-Tee-

9 hours ago, David Heffernan said:

Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. 

4 hours ago, joaodanet2018 said:

Are these values meaningful to what you are intending when reading a CSV file?

1990VAERSDATA.csv  with 102.952 lines and these fields:

 

 

 

When doing the 1990 data file with 2,102 lines, the whole process takes less than 300 milliseconds with an empty table. The results below 200 ms are when the table was already populated.

I get the same numbers regardless of mode, release or debug. I have an i7-7700HQ with 16 GB running Win10 x64.

 

 

[screenshot: 1990 ex.jpg, timing results]

 

Now this is a small file, only 597,231 bytes, but when processing the 2021 file we have almost 653 million bytes/characters to check, over 1000 times more. That is the bottleneck.

 

 

 

Edited by parallax

9 hours ago, David Heffernan said:

Yes. Use a buffered line reader. But more importantly measure where your code spends its time. Don't guess. 

You are right, and I should at least set up some ticks measuring around critical parts, to get an idea.

1 minute ago, parallax said:

I should at least set up some ticks measuring around critical parts, to get an idea.

No need for that. Just run it in a profiler and you will know exactly which statements you should concentrate your efforts on.


30 minutes ago, Tommi Prami said:

Tiniest thing of all (in this context), but always pass string parameters as const if possible:

procedure Foo(const AStringValue: string);

 

-Tee-

True, and that is something I should use a lot more, obviously. Knew about it, but... I guess it's a habit that went away when computers got faster.


2 minutes ago, Anders Melander said:

No need for that. Just run it in a profiler and you will know exactly which statements you should concentrate your efforts on.

I am afraid that searching for a good profiler, or buying one, and then figuring out how to set it up and use it, will take more time than it's worth compared to some less complicated setup.

3 minutes ago, parallax said:

I am afraid that searching for a good profiler, or buying one, and then figuring out how to set it up and use it, will take more time than it's worth compared to some less complicated setup.

It's a tool you should have in your tool chest, so consider it an investment. A profiler is the only way to make sure you aren't doing premature optimization.

 

The VTune profiler is free and with map2pdb you can use it with Delphi.


7 minutes ago, Anders Melander said:

It's a tool you should have in your tool chest so consider it an investment. A profiler is the only way to make sure you aren't doing premature optimization.

 

The VTune profiler is free and with map2pdb you can use it with Delphi.

Interesting project you have there - thanks a lot for the suggestion. Of course you are right that it would be nice to use this kind of tool, in certain situations, like in this case.

 


VTune with map2pdb is excellent. It will take time to set up, but it will save you time in the long run. And you'll have a better end result.

3 hours ago, David Heffernan said:

vtune with map2pdb is excellent. It will take time to set it up, but it will save you time in the long run. And you'll have a better end result. 

I think you are right about that, and I've downloaded it all and installed it, and maybe I will take a look at the manual later on. Interesting stuff 🙂

