Anders Melander Posted February 17, 2022

7 minutes ago, parallax said: maybe I will take a look at the manual later on

What? There's a manual? 😉
parallax Posted February 17, 2022

17 minutes ago, Anders Melander said: What? There's a manual? 😉

I meant the user guide for vTune, at https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html

Running vTune quickly, without knowing much about what I'm doing, I got an interesting list of functions and how much time is spent in each. Writing to and reading from disk takes the lead, not surprisingly.

I realize that this may be very handy - thanks again for suggesting this idea, and for creating the map2pdb project 🙂
Anders Melander Posted February 17, 2022

5 minutes ago, parallax said: I meant the user guide for vTune

Yes, I know 🙂

Looking at your numbers: since WriteFile dwarfs all the string-related operations, it seems you might be able to eliminate the need to optimize the string handling just by moving your file write operations to a background thread - for example with overlapped I/O.
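For reference, here is a minimal sketch of what an overlapped write looks like against the Win32 API from Delphi. This is just the bare mechanism with illustrative names and abbreviated error handling, not a drop-in solution:

uses
  Winapi.Windows, System.SysUtils;

// Sketch: start an asynchronous write, do other work, then wait for it.
// The file must be opened with FILE_FLAG_OVERLAPPED for this to work.
procedure OverlappedWriteDemo(const FileName: string; const Data: TBytes);
var
  H: THandle;
  Ov: TOverlapped;
  Written: DWORD;
begin
  H := CreateFile(PChar(FileName), GENERIC_WRITE, 0, nil, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL or FILE_FLAG_OVERLAPPED, 0);
  if H = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  try
    // Ov.Offset/OffsetHigh select the file position (0 here); sequential
    // writes would have to advance them manually.
    FillChar(Ov, SizeOf(Ov), 0);
    Ov.hEvent := CreateEvent(nil, True, False, nil); // signalled on completion
    try
      // Kick off the write; ERROR_IO_PENDING means it runs in the background.
      if not WriteFile(H, Data[0], Length(Data), Written, @Ov) then
        if GetLastError <> ERROR_IO_PENDING then
          RaiseLastOSError;
      // ...convert the next chunk here while the disk is busy...
      // Wait for completion before the buffer may be reused.
      if not GetOverlappedResult(H, Ov, Written, True) then
        RaiseLastOSError;
    finally
      CloseHandle(Ov.hEvent);
    end;
  finally
    CloseHandle(H);
  end;
end;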
parallax Posted February 17, 2022

5 minutes ago, Anders Melander said: ...you might be able to eliminate the need to optimize the string handling just by moving your file write operations to a background thread...

That's a good idea - reading and writing at the same time. Just push the data to a global buffer after conversion, which is read by another thread that does the file write, maybe using TCriticalSection for concurrency control. Could be interesting to try that and see what happens.
Attila Kovacs Posted February 17, 2022

Why do you need WriteFile more than once? How many gigabytes is that CSV?
parallax Posted February 17, 2022

3 minutes ago, Attila Kovacs said: Why do you need WriteFile more than once? How many gigabytes is that CSV?

I don't need to do it more than once, but it would be nice to get the load function optimized for the large 2021 file, which is around 0.62 GB.

If I can get it down from about 150 seconds to 50 seconds for that file, on my system, that would be acceptable, and the excellent concurrent read/write idea may accomplish that.
Attila Kovacs Posted February 17, 2022

5 hours ago, parallax said: but when processing the 2021 file we have almost 653 million bytes/characters to check

How often do you have to import that file?
parallax Posted February 17, 2022

5 minutes ago, Attila Kovacs said: How often do you have to import that file?

It is not for me, but for the people who will install the program and use it for analyzing the data. They probably have slower hardware than mine, so it would be nice if they did not have to wait several minutes for the loading to finish.

The idea is that this is something everybody should see, because it is so shocking - I mean the sudden spike in cases in 2021. So I want it to work well.
Attila Kovacs Posted February 17, 2022

I understand now. And why would their disks write faster from a different thread? More pressure? 😉

By the way, you can tell LOAD DATA to handle the double quotes and the date format.
Anders Melander Posted February 17, 2022

36 minutes ago, parallax said: just push the data to a global buffer after conversion, which is read by another thread that does the file write, maybe using TCriticalSection for concurrency control.

You could do that, but then you would have to use a ring buffer or something like it. It's much easier if you:

Producer (probably the main thread):
1. Allocate a buffer and place it in a pool (just a list of buffers).
2. Grab a buffer from the pool; goto 1 if the pool is empty.
3. Read data, process it, and write it to the buffer.
4. Place the buffer into a FIFO queue. Goto 2.

Consumer (a dedicated thread):
1. Fetch a buffer from the FIFO queue.
2. Write the buffer to disk.
3. Place the buffer back in the buffer pool. Goto 1.

You will need some synchronization primitives to control access to the FIFO queue and to signal the writer thread when there's data in the queue. It's common to use a semaphore for this. See the sketch after this post.

11 minutes ago, Attila Kovacs said: And why would their disks write faster from a different thread?

We can't make the disk faster, but we can do something useful while we're waiting for the disk.
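A minimal Delphi sketch of that setup, assuming the RTL's TThreadedQueue<T> is used for both the buffer pool and the FIFO queue - its blocking PushItem/PopItem does the signalling that a semaphore would otherwise handle. ReadAndConvertNextChunk is a hypothetical stand-in for the existing CSV conversion step:

uses
  System.Classes, System.SysUtils, System.SyncObjs,
  System.Generics.Collections;

procedure ConvertFileThreaded(Source, Dest: TStream);
const
  BufferCount = 4; // illustrative; tune against memory and disk speed
var
  Pool, Queue: TThreadedQueue<TMemoryStream>;
  Writer: TThread;
  Buf: TMemoryStream;
  I: Integer;
begin
  Pool := TThreadedQueue<TMemoryStream>.Create(BufferCount + 1);
  Queue := TThreadedQueue<TMemoryStream>.Create(BufferCount + 1);
  try
    // Producer step 1: allocate buffers and place them in the pool.
    for I := 1 to BufferCount do
      Pool.PushItem(TMemoryStream.Create);

    // Consumer: a dedicated thread that drains the FIFO queue to disk.
    Writer := TThread.CreateAnonymousThread(
      procedure
      var
        B: TMemoryStream;
      begin
        while (Queue.PopItem(B) = TWaitResult.wrSignaled) and (B <> nil) do
        begin
          Dest.WriteBuffer(B.Memory^, B.Size); // write buffer to disk
          B.Clear;
          Pool.PushItem(B); // hand the buffer back to the pool
        end;
      end);
    Writer.FreeOnTerminate := False;
    Writer.Start;

    // Producer: grab a buffer (blocks while the pool is empty), fill it
    // with converted data and queue it for writing.
    while Pool.PopItem(Buf) = TWaitResult.wrSignaled do
    begin
      if not ReadAndConvertNextChunk(Source, Buf) then // hypothetical helper
      begin
        Buf.Free;
        Break; // end of input
      end;
      Queue.PushItem(Buf);
    end;

    Queue.PushItem(nil); // nil is the stop signal for the writer
    Writer.WaitFor;
    Writer.Free;

    // Free the buffers that ended up back in the pool.
    while (Pool.QueueSize > 0) and
      (Pool.PopItem(Buf) = TWaitResult.wrSignaled) do
      Buf.Free;
  finally
    Queue.Free;
    Pool.Free;
  end;
end;

Because the pool depth caps how far the producer can run ahead, memory stays bounded while the conversion and the disk writes overlap.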
Attila Kovacs Posted February 17, 2022

12 minutes ago, Anders Melander said: We can't make the disk faster, but we can do something useful while we're waiting for the disk.

I would not do that to myself, but I'm a lazy dog. However, I'm not sure we should ignore the original problem - LOAD DATA not wanting the double quotes, the wrong date format, etc. - because AFAIR it can be told to handle both.
parallax Posted February 17, 2022

14 minutes ago, Attila Kovacs said: ...I'm not sure we should ignore the original problem - LOAD DATA not wanting the double quotes, the wrong date format, etc. - because AFAIR it can be told to handle both.

You are right about that - if the file can be loaded without conversion, that would be ideal. And it is probably a good thing to be a bit lazy, as that motivates creating automated solutions with minimal manual work.

Regarding a direct data load, the issue is the date format and the BOOLs, and those could of course be imported as strings (assuming it worked with the quotes). I would not be able to sort on the dates then, but could of course spin through the table and convert them in a background thread... after the loading. Not really ideal.

I assumed that converting the file first would make sorting on the dates possible, and that queries later on would sort faster on integers than on strings.
parallax Posted February 17, 2022

36 minutes ago, Anders Melander said: You could do that, but then you would have to use a ring buffer or something like it. It's much easier if you...

Thank you, Anders, for a very nice setup - I will consider that. FIFO was in fact something that crossed my mind when I read your previous post presenting the idea 🙂
Attila Kovacs Posted February 17, 2022

No, I didn't say no conversion - I said LOAD DATA can be told what to do. So either I'm still missing something or you did not check the help on LOAD DATA?

LOAD DATA INFILE '/path/to/temp_test.csv'
IGNORE INTO TABLE temp_test
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n' -- or '\n'
IGNORE 1 LINES
(@c1, @c2)
SET c1 = STR_TO_DATE(@c1, '%d-%b-%y %H:%i:%s'),
    c2 = (@c2 = 'True');
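From the Delphi side, a statement like that can be sent through a plain ExecSQL call. A sketch, assuming an already-configured TFDConnection; note that LOCAL (letting the client send the file) and whether the server accepts it at all depend on the MySQL configuration:

uses
  System.SysUtils, FireDAC.Comp.Client;

// Sketch: run the LOAD DATA statement above through FireDAC.
// Table and column names are taken from the example; setup is omitted.
procedure ImportCsv(Conn: TFDConnection; const FileName: string);
begin
  Conn.ExecSQL(
    'LOAD DATA LOCAL INFILE ' + QuotedStr(FileName) +
    ' IGNORE INTO TABLE temp_test' +
    ' FIELDS TERMINATED BY '','' OPTIONALLY ENCLOSED BY ''"''' +
    ' LINES TERMINATED BY ''\r\n''' +
    ' IGNORE 1 LINES' +
    ' (@c1, @c2)' +
    ' SET c1 = STR_TO_DATE(@c1, ''%d-%b-%y %H:%i:%s''),' +
    '     c2 = (@c2 = ''True'')');
end;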
parallax Posted February 17, 2022

9 minutes ago, Attila Kovacs said: No, I didn't say no conversion - I said LOAD DATA can be told what to do...

Sure, functions may be run in the SET clauses of the load statement - I know, and I already use that. You are correct that it could have been solved quite easily; I forgot about that in my previous reply.

But I was not able to make it work with the quotes, so I did not get that far. I may have overlooked something. Maybe the doubled double quotes ("" as an escape inside a quoted field) were the problem.

I specified the delimiter and tested on very simple files, but to no avail. I checked the net and there were others who had the same issue, and it may be related to the db version - I use MySQL 8.0.23.

Since I was not able to resolve this, I went for the conversion routine.
parallax Posted February 17, 2022

Testing the LOAD DATA function on the original files without conversion now, I see that it loads without error messages, but it does not accept all the lines of the file.

Loading the 2021 file takes close to 30 seconds, but it only loads 669,008 of the 748,779 lines.

The issue may be the doubled double quotes; I'm not sure.
Guest Posted February 17, 2022

What would be the ideal time to read your data, analyze it, and save it to another target, for a CSV file with 1,000,000 lines and all the fields presented above? What time goal would you want to achieve on your infrastructure, for example?
Guest Posted February 17, 2022

Here, loading 100,000+ lines into a TMemo with LoadFromFile takes about 200 ms... a TStringList (no visual control) is quicker still.
parallax Posted February 17, 2022

7 minutes ago, joaodanet2018 said: Here, loading 100,000+ lines into a TMemo with LoadFromFile takes about 200 ms...

I was not able to load 620 MB into a string list, or a memo... a smaller file of 39,000 lines / 36 MB went fast, about 120 ms.

I have considered memory mapping, but I think it is too much to load into memory, and if I have to convert the file anyway to get rid of the doubled double quotes, then I can't escape reading and writing the whole file, with all the time that has to take. If that is the case, I guess the best one can do is to do as much as possible at the same time.
Guest Posted February 17, 2022

What if you check the values for the quotes in the TStringList events and just strip the ".." pairs - wouldn't that help? Loading just a little part of the big file each time, and saving to the target each time.
parallax Posted February 17, 2022

17 minutes ago, joaodanet2018 said: ...loading just a little part of the big file each time, and saving to the target each time

Thank you for the idea. If I have to split the file into smaller pieces, I have to read and write the whole file first, but if that is done in a more efficient way than line by line, it may work. I would have to test some block reading to see what kind of speed I can get on larger files.
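A quick way to get that number is to read the file in fixed-size blocks through a TFileStream and time it. A sketch (console app assumed; the chunk size is illustrative, and a real converter would also have to stitch together lines that straddle a block boundary):

uses
  System.Classes, System.SysUtils, System.Diagnostics;

// Sketch: measure raw block-read throughput on a large file.
procedure TimeBlockRead(const FileName: string);
const
  ChunkSize = 4 * 1024 * 1024; // 4 MB per read
var
  Stream: TFileStream;
  Buf: TBytes;
  BytesRead: Integer;
  SW: TStopwatch;
begin
  SetLength(Buf, ChunkSize);
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SW := TStopwatch.StartNew;
    repeat
      BytesRead := Stream.Read(Buf[0], ChunkSize);
      // ...scan/convert the block here...
    until BytesRead < ChunkSize;
    Writeln(Format('%d MB in %d ms',
      [Stream.Size div (1024 * 1024), SW.ElapsedMilliseconds]));
  finally
    Stream.Free;
  end;
end;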
Guest Posted February 17, 2022

Another way: you could use the FireDAC framework to access your files in an easy way with almost no code. TFDMemTable (it's great) can work with files in non-conventional db formats, but only with some pre-defined "pattern" (by default it works with XML, but any other format can be used if some "pattern" exists, like a header with fields: VAERID = integer/string, ... AGE = string/DateTime/DateStamp, ...); if necessary you can use ODBC to help.

Using its events you can control how the data is manipulated - for example removing the "..." quotes, converting strings to DateTime, etc. You see? That way, all the performance work is done by these helpers, and you can manipulate the data easily, or almost.

TFDBatchMove, TFDBatchMoveTextReader and TFDBatchMoveTextWriter (and the dataset classes) and their subclasses can read and write data in text files, for example.

"The TFDBatchMove component is optimized for performance and I would personally recommend its use whenever you need to copy data from one data source to another including this case since you can specify format, field mappings and others in a flexible way."

Remember, FDManager is the ancestor of all the FDxxxx classes; you define the rules / field mappings there and everything will use them. By default, one always exists when FireDAC is in use.
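For what it's worth, a minimal sketch of that batch-move route, assuming a target table that is already open on a live connection; the per-field definitions and mappings that a real import would need are omitted:

uses
  FireDAC.Comp.Client, FireDAC.Comp.BatchMove,
  FireDAC.Comp.BatchMove.Text, FireDAC.Comp.BatchMove.DataSet;

// Sketch: copy a CSV file into an open dataset with TFDBatchMove.
procedure CsvToTable(Target: TFDTable; const FileName: string);
var
  Move: TFDBatchMove;
  Reader: TFDBatchMoveTextReader;
  Writer: TFDBatchMoveDataSetWriter;
begin
  Move := TFDBatchMove.Create(nil);
  try
    Reader := TFDBatchMoveTextReader.Create(Move);
    Writer := TFDBatchMoveDataSetWriter.Create(Move);
    Reader.FileName := FileName;
    Reader.DataDef.Separator := ',';       // CSV field separator
    Reader.DataDef.WithFieldNames := True; // first line holds the column names
    Writer.DataSet := Target;
    Move.Reader := Reader;
    Move.Writer := Writer;
    Move.Execute;
  finally
    Move.Free; // Move owns Reader and Writer, so they are freed with it
  end;
end;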
Stano Posted February 17, 2022

You probably meant TFDBatchMoveTextReader.
Attila Kovacs Posted February 17, 2022

3 hours ago, parallax said: Loading the 2021 file takes close to 30 seconds, but it only loads 669,008 of the 748,779 lines. The issue may be the doubled double quotes; I'm not sure.

Not sure? That's interesting - you could save 150 secs and a bunch of coding, but you don't bother to find out?

Doubled double quotes are escaped double quotes, but you can go the explicit way:

FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY ''
parallax Posted February 18, 2022

8 hours ago, joaodanet2018 said: Another way: you could use the FireDAC framework to access your files in an easy way with almost no code...

Thank you for the suggestion, it looks very interesting. I have not worked with FireDAC yet, so it will take a while to get into it.