parallax

More efficient string handling


17 minutes ago, Anders Melander said:

What? There's a manual? 😉

I meant the user guide for vTune, at https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html

 

Running vTune quickly without knowing much about what I'm doing, I got an interesting list of functions and how much time is spent on each.

Writing and reading from disk takes the lead, not surprisingly.

 

I realize that this may be very handy - thanks again for suggesting this idea, and for creating the map2pdb project 🙂

 

 

[Attached screenshot: hotspots.jpg (vTune hotspots list)]

5 minutes ago, parallax said:

I meant the user guide for vTune

Yes, I know 🙂

Looking at your numbers, since WriteFile dwarfs all the string-related operations, it seems that you might be able to eliminate the need to optimize the string handling just by moving your file write operations to a background thread, for example with overlapped I/O.
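
For illustration, here is a minimal sketch of what an overlapped WriteFile call could look like in Delphi on Windows (the procedure name and parameters are placeholders, not something from this thread; a real writer would also keep Ovl.Offset/OffsetHigh up to date between writes):

uses
  Winapi.Windows, System.SysUtils;

procedure OverlappedWriteDemo(const AFileName: string; const AData: TBytes);
var
  hFile: THandle;
  Ovl: TOverlapped;
  Written: DWORD;
begin
  hFile := CreateFile(PChar(AFileName), GENERIC_WRITE, 0, nil, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL or FILE_FLAG_OVERLAPPED, 0);
  if hFile = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  try
    FillChar(Ovl, SizeOf(Ovl), 0);
    Ovl.hEvent := CreateEvent(nil, True, False, nil); // signalled when the write completes
    try
      // WriteFile returns at once; the OS performs the write in the background.
      if not WriteFile(hFile, AData[0], Length(AData), Written, @Ovl) then
        if GetLastError <> ERROR_IO_PENDING then
          RaiseLastOSError;
      // ...convert the next chunk of the CSV here while the disk is busy...
      // Then wait for the write to finish and fetch the byte count.
      if not GetOverlappedResult(hFile, Ovl, Written, True) then
        RaiseLastOSError;
    finally
      CloseHandle(Ovl.hEvent);
    end;
  finally
    CloseHandle(hFile);
  end;
end;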

5 minutes ago, Anders Melander said:

Yes, I know 🙂

Looking at your numbers, since WriteFile dwarfs all the string-related operations, it seems that you might be able to eliminate the need to optimize the string handling just by moving your file write operations to a background thread, for example with overlapped I/O.

That's a good idea, reading and writing at the same time - just push the data to a global buffer after conversion, which another thread then reads and writes to file, maybe using TCriticalSection for concurrency control.

Could be interesting to try that and see what happens.


Why do you need WriteFile more than once? How many gigabytes is that CSV?

3 minutes ago, Attila Kovacs said:

Why do you need WriteFile more than once? How many gigabytes is that CSV?

 

I don't need to do it more than once, but it's just nice to get the load function optimized for the large 2021 file, which is around 0.62 GB.

 

If I can get it down from about 150 seconds to 50 seconds for that file on my system, that would be acceptable, and the excellent concurrent read/write idea may accomplish that.

5 hours ago, parallax said:

but when processing the 2021 file we have almost 653 million bytes/characters to check

How often do you have to import that file? 

5 minutes ago, Attila Kovacs said:

How often do you have to import that file? 

 

It is not for me, but for the people who will install the program and use it for analyzing the data.

They probably have slower hardware than me, so it would be nice if they would not have to wait for several minutes for the loading to finish.

 

The idea is that this is something everybody should see, because it is so shocking - I mean the sudden spike in cases in 2021.

So I want it to work well.


I understand now. And why would their disks write faster from a different thread? More pressure? 😉 

 

By the way, you can tell LOAD DATA how to handle double quotes and such.

36 minutes ago, parallax said:

just push the data to a global buffer after conversion, which another thread then reads and writes to file, maybe using TCriticalSection for concurrency control.

You could do that but then you would have to use a ring buffer or something like it.

It's much easier if you:

 

Producer (probably main thread):

  1. Allocate a buffer and place it in a pool (just a list of buffers).
  2. Grab a buffer from the pool. Goto 1 if the pool is empty.
  3. Read data, process it and write to the buffer.
  4. Place the buffer into a FIFO queue.
  5. Goto 2.

Consumer (a dedicated thread):

  1. Fetch a buffer from the FIFO queue.
  2. Write buffer to disk.
  3. Place buffer in buffer pool.
  4. Goto 1

You will need some synchronization primitives to control access to the FIFO queue and to signal the writer thread when there's data in the queue. It's common to use a semaphore for this.
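
For illustration, a minimal sketch of the consumer side, assuming Delphi's TThreadedQueue<TBytes> is used for both the FIFO queue and the buffer pool (its blocking PopItem then plays the role of the semaphore; all names are placeholders):

uses
  System.Classes, System.SysUtils, System.SyncObjs, System.Generics.Collections;

type
  TWriterThread = class(TThread)
  private
    FQueue: TThreadedQueue<TBytes>; // FIFO queue filled by the producer
    FPool: TThreadedQueue<TBytes>;  // pool the producer takes empty buffers from
    FStream: TStream;               // destination file
  protected
    procedure Execute; override;
  public
    constructor Create(AStream: TStream; AQueue, APool: TThreadedQueue<TBytes>);
  end;

constructor TWriterThread.Create(AStream: TStream; AQueue, APool: TThreadedQueue<TBytes>);
begin
  FStream := AStream;
  FQueue := AQueue;
  FPool := APool;
  inherited Create(False);
end;

procedure TWriterThread.Execute;
var
  Buffer: TBytes;
begin
  // 1. Fetch a buffer from the FIFO queue; PopItem blocks until one is pushed.
  while (FQueue.PopItem(Buffer) = wrSignaled) and (Buffer <> nil) do
  begin
    // 2. Write the buffer to disk.
    if Length(Buffer) > 0 then
      FStream.WriteBuffer(Buffer[0], Length(Buffer));
    // 3. Place the buffer back in the pool so the producer can reuse it.
    FPool.PushItem(Buffer);
  end;
  // 4. A nil buffer, pushed by the producer when it is done, ends the loop.
end;

The producer would create both queues (e.g. TThreadedQueue<TBytes>.Create(32)), pre-fill the pool with a handful of buffers, push filled buffers with PushItem, and push nil once the last buffer has been queued.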

 

11 minutes ago, Attila Kovacs said:

And why would their disks write faster from a different thread?

We can't make the disk faster but we can do something useful while we're waiting for the disk.

12 minutes ago, Anders Melander said:

We can't make the disk faster but we can do something useful while we're waiting for the disk.

I would not do that to myself but I'm a lazy dog.

However, I'm not sure we should ignore the original problem, LOAD DATA not accepting double quotes or the wrong date format etc., because AFAIR it can be told how to handle both.

14 minutes ago, Attila Kovacs said:

I would not do that to myself but I'm a lazy dog.

However, I'm not sure we should ignore the original problem, LOAD DATA not accepting double quotes or the wrong date format etc., because AFAIR it can be told how to handle both.

 

You are right about that - if the file may be loaded without conversion, that would be ideal.

And it is probably a good thing to be a bit lazy, as that motivates one to create automated solutions with minimal manual effort.

 

Regarding a direct data load, the issue is the date format and the BOOLs, and they could of course be imported as strings (assuming it worked with quotes).

I would not be able to sort on the dates then, but could of course spin through the table and convert it in a background thread...after the loading. Not really ideal.

 

I assumed that if the file was first converted, sorting on dates would be possible, and queries on integers would be faster later on than queries on strings.

36 minutes ago, Anders Melander said:

You could do that but then you would have to use a ring buffer or something like it.

It's much easier if you:

 

Producer (probably main thread):

  1. Allocate a buffer and place it in a pool (just a list of buffers).
  2. Grab a buffer from the pool. Goto 1 if the pool is empty.
  3. Read data, process it and write to the buffer.
  4. Place the buffer into a FIFO queue.
  5. Goto 2.

Consumer (a dedicated thread):

  1. Fetch a buffer from the FIFO queue.
  2. Write buffer to disk.
  3. Place buffer in buffer pool.
  4. Goto 1

You will need some synchronization primitives to control access to the FIFO queue and to signal the writer thread when there's data in the queue. It's common to use a semaphore for this.

 

We can't make the disk faster but we can do something useful while we're waiting for the disk.

 

Thank you, Anders, for a very nice setup; I will consider it. FIFO was in fact something that crossed my mind when I read your previous post presenting the idea 🙂


No, I didn't say no conversion; I said LOAD DATA can be told what to do.

So either I'm still missing something or you did not check the help on LOAD DATA?

LOAD DATA INFILE '/path/to/temp_test.csv'
IGNORE INTO TABLE temp_test
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\r\n' -- or '\n'
  IGNORE 1 LINES
(@c1, @c2)
SET c1 = STR_TO_DATE(@c1,'%d-%b-%y %H:%i:%s'),
    c2 = (@c2 = 'True');

 

9 minutes ago, Attila Kovacs said:

No, I didn't say no conversion; I said LOAD DATA can be told what to do.

So either I'm still missing something or you did not check the help on LOAD DATA?


LOAD DATA INFILE '/path/to/temp_test.csv'
IGNORE INTO TABLE temp_test
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\r\n' -- or '\n'
  IGNORE 1 LINES
(@c1, @c2)
SET c1 = STR_TO_DATE(@c1,'%d-%b-%y %H:%i:%s'),
    c2 = (@c2 = 'True');

 

 

Sure, functions may be run in the load statement in the SET clauses - I know, and I already use that.

You are correct that it could have been solved quite easily, I forgot about that in my previous reply.

 

But I was not able to make it work with the quotes, so I did not get that far.

I may have overlooked something. Maybe it was the double double quotes issue that was the problem.

 

I specified the delimiter and did testing on very simple files, but to no avail. 

I checked the net and there were others that had the same issue, and it may be related to the DB version - I use MySQL 8.0.23.

 

Since I was not able to resolve this, I went for the conversion routine.


Testing LOAD DATA on the original files without conversion now, I see that it loads without error messages, but it does not accept all the lines of the file.

 

Loading the 2021 file takes close to 30 seconds, but it only loads 669,008 of the 748,779 lines.

 

The issue may be the double double quotes, not sure.

Guest

What would be the ideal time to read your data, analyze it, and save it to another target, for a CSV file with 1,000,000 lines and all the fields presented above?

What time goal would you aim for on your own infrastructure, for example?

Guest

Here, 100,000+ lines using TMemo LoadFromFile takes about 200 ms... a TStringList (non-visual) is quicker.

7 minutes ago, joaodanet2018 said:

Here, 100,000+ lines using TMemo LoadFromFile takes about 200 ms... a TStringList (non-visual) is quicker.

 

I was not able to load 620 MB into a string list, or a memo... a smaller file of 39,000 lines / 36 MB went fast, about 120 ms.

 

I have considered memory mapping, but I think it is too much to load into memory, and if I have to convert the file anyway to get rid of the double double quotes, then I can't escape reading and writing it, with all the time that it has to take. If that is the case, I guess the best one can do is to do as much as possible at the same time.

 

 

Guest

Wouldn't it help if, in the TStringList events, you check the values for quotes and just change them to have no quotes?

Load just a small part of the big file each time and save it to the target each time.

17 minutes ago, joaodanet2018 said:

Wouldn't it help if, in the TStringList events, you check the values for quotes and just change them to have no quotes?

Load just a small part of the big file each time and save it to the target each time.

 

Thank you for the idea. If I have to split the file into smaller pieces, I have to read and write the whole file first, but if that is done in a more efficient way than line by line, it may work. I would have to test some block reading to see what kind of speed I can get on larger files.
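
A minimal block-reading sketch for such a test could look like this (the 4 MB block size and the names are just assumptions to be tuned by measurement):

uses
  System.Classes, System.SysUtils;

procedure ReadInBlocks(const AFileName: string);
const
  BlockSize = 4 * 1024 * 1024;
var
  Source: TFileStream;
  Buffer: TBytes;
  BytesRead: Integer;
begin
  SetLength(Buffer, BlockSize);
  Source := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    repeat
      BytesRead := Source.Read(Buffer[0], BlockSize);
      if BytesRead > 0 then
      begin
        // process BytesRead bytes of Buffer here,
        // remembering that a CSV line can straddle two blocks
      end;
    until BytesRead = 0;
  finally
    Source.Free;
  end;
end;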

Guest

Another way: you could use the FireDAC framework to access your files in an easy way and with almost no code.

FDMemTable (it's great) can work with files in a non-conventional DB format, but with some pre-defined "pattern" (by default it works with XML, but any other can be used if some "pattern" exists, like a header with fields (VAERID = integer/string, ... AGE = string/DateTime/DateStamp, ...)); if necessary you can work with ODBC's help.

Using its events you can control how the data will be manipulated, such as removing the "..." quotes, converting strings to TDateTime, etc. You see?

That way, all the performance work is done by these helpers, and you can still manipulate the data easily, or almost.

TFDBatchMove, TFDBatchMoveTextReader, TFDBatchMoveTextWriter (and the dataset classes) and their subclasses can READ and WRITE data in text files, for example.

"The TFDBatchMove component is optimized for performance and I would personally recommend its use whenever you need to copy data from one data source to another including this case since you can specify format, field mappings and others in a flexible way."

Remember, FDManager is ancestral to all the FDxxxx classes: you define the rules/field mappings there and all of them will use it. By default one always exists when FireDAC is in use.
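
For illustration, a rough TFDBatchMove sketch (the procedure and variable names are placeholders; the CSV specifics such as separators, quoting and date formats would still have to be configured on the reader's DataDef, which is omitted here):

uses
  FireDAC.Comp.Client, FireDAC.Comp.BatchMove,
  FireDAC.Comp.BatchMove.Text, FireDAC.Comp.BatchMove.DataSet;

// Copy a CSV file into a FireDAC dataset, e.g. a TFDQuery selecting the destination table.
procedure ImportCsv(const AFileName: string; ADest: TFDQuery);
var
  BatchMove: TFDBatchMove;
  Reader: TFDBatchMoveTextReader;
  Writer: TFDBatchMoveDataSetWriter;
begin
  BatchMove := TFDBatchMove.Create(nil);
  Reader := TFDBatchMoveTextReader.Create(nil);
  Writer := TFDBatchMoveDataSetWriter.Create(nil);
  try
    Reader.FileName := AFileName; // source CSV
    Writer.DataSet := ADest;      // destination dataset
    BatchMove.Reader := Reader;
    BatchMove.Writer := Writer;
    BatchMove.Execute;            // performs the copy
  finally
    Writer.Free;
    Reader.Free;
    BatchMove.Free;
  end;
end;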

3 hours ago, parallax said:

Loading the 2021 file takes close to 30 seconds, but it only loads 669,008 of the 748,779 lines.

 

The issue may be the double double quotes, not sure.

Not sure? That's interesting; you could save 150 secs and a bunch of coding, but you don't bother to find out?

 

Double double quotes are escaped double quotes, but you can go the explicit way:

 

FIELDS
    TERMINATED BY ','
    ENCLOSED BY '"'
    ESCAPED BY ''

8 hours ago, joaodanet2018 said:

Another way: you could use the FireDAC framework to access your files in an easy way and with almost no code.

FDMemTable (it's great) can work with files in a non-conventional DB format, but with some pre-defined "pattern" (by default it works with XML, but any other can be used if some "pattern" exists, like a header with fields (VAERID = integer/string, ... AGE = string/DateTime/DateStamp, ...)); if necessary you can work with ODBC's help.

Using its events you can control how the data will be manipulated, such as removing the "..." quotes, converting strings to TDateTime, etc. You see?

That way, all the performance work is done by these helpers, and you can still manipulate the data easily, or almost.

TFDBatchMove, TFDBatchMoveTextReader, TFDBatchMoveTextWriter (and the dataset classes) and their subclasses can READ and WRITE data in text files, for example.

"The TFDBatchMove component is optimized for performance and I would personally recommend its use whenever you need to copy data from one data source to another including this case since you can specify format, field mappings and others in a flexible way."

Remember, FDManager is ancestral to all the FDxxxx classes: you define the rules/field mappings there and all of them will use it. By default one always exists when FireDAC is in use.

 

Thank you for the suggestion, it looks very interesting. I have not worked with FD yet, so it will take a while to get into it.

