Anders Melander Posted February 17, 2022

7 minutes ago, parallax said: maybe I will take a look at the manual later on

What? There's a manual? 😉
parallax Posted February 17, 2022

17 minutes ago, Anders Melander said: What? There's a manual? 😉

I meant the user guide for vTune, at https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html

Running vTune quickly, without knowing much about what I'm doing, I got an interesting list of functions and how much time is spent in each. Writing to and reading from disk takes the lead, not surprisingly.

I realize that this may be very handy - thanks again for suggesting this idea, and for creating the map2pdb project 🙂
Anders Melander Posted February 17, 2022

5 minutes ago, parallax said: I meant the user guide for vTune

Yes, I know 🙂

Looking at your numbers: since WriteFile dwarfs all the string-related operations, it seems you might be able to eliminate the need to optimize the string handling just by moving your file write operations to a background thread - for example with overlapped I/O.
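For reference, here is a minimal sketch of what an overlapped write looks like against the Win32 API from Delphi. This is just the bare mechanism with illustrative names and abbreviated error handling, not a drop-in solution:

uses
  Winapi.Windows, System.SysUtils;

// Sketch: start an asynchronous write, do other work, then wait for it.
// The file must be opened with FILE_FLAG_OVERLAPPED for this to work.
procedure OverlappedWriteDemo(const FileName: string; const Data: TBytes);
var
  H: THandle;
  Ov: TOverlapped;
  Written: DWORD;
begin
  H := CreateFile(PChar(FileName), GENERIC_WRITE, 0, nil, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL or FILE_FLAG_OVERLAPPED, 0);
  if H = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  try
    // Ov.Offset/OffsetHigh select the file position (0 here); sequential
    // writes would have to advance them manually.
    FillChar(Ov, SizeOf(Ov), 0);
    Ov.hEvent := CreateEvent(nil, True, False, nil); // signalled on completion
    try
      // Kick off the write; ERROR_IO_PENDING means it runs in the background.
      if not WriteFile(H, Data[0], Length(Data), Written, @Ov) then
        if GetLastError <> ERROR_IO_PENDING then
          RaiseLastOSError;
      // ...convert the next chunk here while the disk is busy...
      // Wait for completion before the buffer may be reused.
      if not GetOverlappedResult(H, Ov, Written, True) then
        RaiseLastOSError;
    finally
      CloseHandle(Ov.hEvent);
    end;
  finally
    CloseHandle(H);
  end;
end;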
parallax Posted February 17, 2022

5 minutes ago, Anders Melander said: ...you might be able to eliminate the need to optimize the string handling just by moving your file write operations to a background thread...

That's a good idea - reading and writing at the same time. Just push the data to a global buffer after conversion, which is read by another thread that does the file write, maybe using TCriticalSection for concurrency control. Could be interesting to try that and see what happens.
Attila Kovacs Posted February 17, 2022

Why do you need WriteFile more than once? How many gigabytes is that CSV?
parallax Posted February 17, 2022

3 minutes ago, Attila Kovacs said: Why do you need WriteFile more than once? How many gigabytes is that CSV?

I don't need to do it more than once, but it would be nice to get the load function optimized for the large 2021 file, which is around 0.62 GB.

If I can get it down from about 150 seconds to 50 seconds for that file, on my system, that would be acceptable, and the excellent concurrent read/write idea may accomplish that.
Attila Kovacs Posted February 17, 2022

5 hours ago, parallax said: but when processing the 2021 file we have almost 653 million bytes/characters to check

How often do you have to import that file?
parallax Posted February 17, 2022

5 minutes ago, Attila Kovacs said: How often do you have to import that file?

It is not for me, but for the people who will install the program and use it for analyzing the data. They probably have slower hardware than mine, so it would be nice if they did not have to wait several minutes for the loading to finish.

The idea is that this is something everybody should see, because it is so shocking - I mean the sudden spike in cases in 2021. So I want it to work well.
Attila Kovacs Posted February 17, 2022

I understand now. And why would their disks write faster from a different thread? More pressure? 😉

By the way, you can tell LOAD DATA to handle the double quotes and the date format.
Anders Melander Posted February 17, 2022

36 minutes ago, parallax said: just push the data to a global buffer after conversion, which is read by another thread that does the file write, maybe using TCriticalSection for concurrency control.

You could do that, but then you would have to use a ring buffer or something like it. It's much easier if you:

Producer (probably the main thread):
1. Allocate a buffer and place it in a pool (just a list of buffers).
2. Grab a buffer from the pool; goto 1 if the pool is empty.
3. Read data, process it, and write it to the buffer.
4. Place the buffer into a FIFO queue. Goto 2.

Consumer (a dedicated thread):
1. Fetch a buffer from the FIFO queue.
2. Write the buffer to disk.
3. Place the buffer back in the buffer pool. Goto 1.

You will need some synchronization primitives to control access to the FIFO queue and to signal the writer thread when there's data in the queue. It's common to use a semaphore for this. See the sketch after this post.

11 minutes ago, Attila Kovacs said: And why would their disks write faster from a different thread?

We can't make the disk faster, but we can do something useful while we're waiting for the disk.
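A minimal Delphi sketch of that setup, assuming the RTL's TThreadedQueue<T> is used for both the buffer pool and the FIFO queue - its blocking PushItem/PopItem does the signalling that a semaphore would otherwise handle. ReadAndConvertNextChunk is a hypothetical stand-in for the existing CSV conversion step:

uses
  System.Classes, System.SysUtils, System.SyncObjs,
  System.Generics.Collections;

procedure ConvertFileThreaded(Source, Dest: TStream);
const
  BufferCount = 4; // illustrative; tune against memory and disk speed
var
  Pool, Queue: TThreadedQueue<TMemoryStream>;
  Writer: TThread;
  Buf: TMemoryStream;
  I: Integer;
begin
  Pool := TThreadedQueue<TMemoryStream>.Create(BufferCount + 1);
  Queue := TThreadedQueue<TMemoryStream>.Create(BufferCount + 1);
  try
    // Producer step 1: allocate buffers and place them in the pool.
    for I := 1 to BufferCount do
      Pool.PushItem(TMemoryStream.Create);

    // Consumer: a dedicated thread that drains the FIFO queue to disk.
    Writer := TThread.CreateAnonymousThread(
      procedure
      var
        B: TMemoryStream;
      begin
        while (Queue.PopItem(B) = TWaitResult.wrSignaled) and (B <> nil) do
        begin
          Dest.WriteBuffer(B.Memory^, B.Size); // write buffer to disk
          B.Clear;
          Pool.PushItem(B); // hand the buffer back to the pool
        end;
      end);
    Writer.FreeOnTerminate := False;
    Writer.Start;

    // Producer: grab a buffer (blocks while the pool is empty), fill it
    // with converted data and queue it for writing.
    while Pool.PopItem(Buf) = TWaitResult.wrSignaled do
    begin
      if not ReadAndConvertNextChunk(Source, Buf) then // hypothetical helper
      begin
        Buf.Free;
        Break; // end of input
      end;
      Queue.PushItem(Buf);
    end;

    Queue.PushItem(nil); // nil is the stop signal for the writer
    Writer.WaitFor;
    Writer.Free;

    // Free the buffers that ended up back in the pool.
    while (Pool.QueueSize > 0) and
      (Pool.PopItem(Buf) = TWaitResult.wrSignaled) do
      Buf.Free;
  finally
    Queue.Free;
    Pool.Free;
  end;
end;

Because the pool depth caps how far the producer can run ahead, memory stays bounded while the conversion and the disk writes overlap.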
Attila Kovacs Posted February 17, 2022

12 minutes ago, Anders Melander said: We can't make the disk faster, but we can do something useful while we're waiting for the disk.

I would not do that to myself, but I'm a lazy dog. However, I'm not sure we should ignore the original problem - LOAD DATA not wanting the double quotes, the wrong date format, etc. - because AFAIR it can be told to handle both.
parallax Posted February 17, 2022

14 minutes ago, Attila Kovacs said: ...I'm not sure we should ignore the original problem - LOAD DATA not wanting the double quotes, the wrong date format, etc. - because AFAIR it can be told to handle both.

You are right about that - if the file can be loaded without conversion, that would be ideal. And it is probably a good thing to be a bit lazy, as that motivates creating automated solutions with minimal manual work.

Regarding a direct data load, the issue is the date format and the BOOLs, and those could of course be imported as strings (assuming it worked with the quotes). I would not be able to sort on the dates then, but could of course spin through the table and convert them in a background thread... after the loading. Not really ideal.

I assumed that converting the file first would make sorting on the dates possible, and that queries later on would sort faster on integers than on strings.
parallax Posted February 17, 2022

36 minutes ago, Anders Melander said: You could do that, but then you would have to use a ring buffer or something like it. It's much easier if you...

Thank you, Anders, for a very nice setup - I will consider that. FIFO was in fact something that crossed my mind when I read your previous post presenting the idea 🙂
Attila Kovacs Posted February 17, 2022

No, I didn't say no conversion - I said LOAD DATA can be told what to do. So either I'm still missing something or you did not check the help on LOAD DATA?

LOAD DATA INFILE '/path/to/temp_test.csv'
IGNORE INTO TABLE temp_test
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n' -- or '\n'
IGNORE 1 LINES
(@c1, @c2)
SET c1 = STR_TO_DATE(@c1, '%d-%b-%y %H:%i:%s'),
    c2 = (@c2 = 'True');
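From the Delphi side, a statement like that can be sent through a plain ExecSQL call. A sketch, assuming an already-configured TFDConnection; note that LOCAL (letting the client send the file) and whether the server accepts it at all depend on the MySQL configuration:

uses
  System.SysUtils, FireDAC.Comp.Client;

// Sketch: run the LOAD DATA statement above through FireDAC.
// Table and column names are taken from the example; setup is omitted.
procedure ImportCsv(Conn: TFDConnection; const FileName: string);
begin
  Conn.ExecSQL(
    'LOAD DATA LOCAL INFILE ' + QuotedStr(FileName) +
    ' IGNORE INTO TABLE temp_test' +
    ' FIELDS TERMINATED BY '','' OPTIONALLY ENCLOSED BY ''"''' +
    ' LINES TERMINATED BY ''\r\n''' +
    ' IGNORE 1 LINES' +
    ' (@c1, @c2)' +
    ' SET c1 = STR_TO_DATE(@c1, ''%d-%b-%y %H:%i:%s''),' +
    '     c2 = (@c2 = ''True'')');
end;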
parallax Posted February 17, 2022

9 minutes ago, Attila Kovacs said: No, I didn't say no conversion - I said LOAD DATA can be told what to do...

Sure, functions may be run in the SET clauses of the load statement - I know, and I already use that. You are correct that it could have been solved quite easily; I forgot about that in my previous reply.

But I was not able to make it work with the quotes, so I did not get that far. I may have overlooked something. Maybe the doubled double quotes ("" as an escape inside a quoted field) were the problem.

I specified the delimiter and tested on very simple files, but to no avail. I checked the net and there were others who had the same issue, and it may be related to the db version - I use MySQL 8.0.23.

Since I was not able to resolve this, I went for the conversion routine.
parallax Posted February 17, 2022

Testing the LOAD DATA function on the original files without conversion now, I see that it loads without error messages, but it does not accept all the lines of the file.

Loading the 2021 file takes close to 30 seconds, but it only loads 669,008 of the 748,779 lines.

The issue may be the doubled double quotes; I'm not sure.
Guest Posted February 17, 2022

What would be the ideal time to read your data, analyze it, and save it to another target, for a CSV file with 1,000,000 lines and all the fields presented above? What time goal would you want to achieve on your infrastructure, for example?
Guest Posted February 17, 2022

Here, loading 100,000+ lines into a TMemo with LoadFromFile takes about 200 ms... a TStringList (no visual control) is quicker still.
parallax Posted February 17, 2022

7 minutes ago, joaodanet2018 said: Here, loading 100,000+ lines into a TMemo with LoadFromFile takes about 200 ms...

I was not able to load 620 MB into a string list, or a memo... a smaller file of 39,000 lines / 36 MB went fast, about 120 ms.

I have considered memory mapping, but I think it is too much to load into memory, and if I have to convert the file anyway to get rid of the doubled double quotes, then I can't escape reading and writing the whole file, with all the time that has to take. If that is the case, I guess the best one can do is to do as much as possible at the same time.
Guest Posted February 17, 2022

What if you check the values for the quotes in the TStringList events and just strip the ".." pairs - wouldn't that help? Loading just a little part of the big file each time, and saving to the target each time.
parallax Posted February 17, 2022

17 minutes ago, joaodanet2018 said: ...loading just a little part of the big file each time, and saving to the target each time

Thank you for the idea. If I have to split the file into smaller pieces, I have to read and write the whole file first, but if that is done in a more efficient way than line by line, it may work. I would have to test some block reading to see what kind of speed I can get on larger files.
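A quick way to get that number is to read the file in fixed-size blocks through a TFileStream and time it. A sketch (console app assumed; the chunk size is illustrative, and a real converter would also have to stitch together lines that straddle a block boundary):

uses
  System.Classes, System.SysUtils, System.Diagnostics;

// Sketch: measure raw block-read throughput on a large file.
procedure TimeBlockRead(const FileName: string);
const
  ChunkSize = 4 * 1024 * 1024; // 4 MB per read
var
  Stream: TFileStream;
  Buf: TBytes;
  BytesRead: Integer;
  SW: TStopwatch;
begin
  SetLength(Buf, ChunkSize);
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SW := TStopwatch.StartNew;
    repeat
      BytesRead := Stream.Read(Buf[0], ChunkSize);
      // ...scan/convert the block here...
    until BytesRead < ChunkSize;
    Writeln(Format('%d MB in %d ms',
      [Stream.Size div (1024 * 1024), SW.ElapsedMilliseconds]));
  finally
    Stream.Free;
  end;
end;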
Guest Posted February 17, 2022

Another way: you could use the FireDAC framework to access your files in an easy way with almost no code. TFDMemTable (it's great) can work with files in non-conventional db formats, but only with some pre-defined "pattern" (by default it works with XML, but any other format can be used if some "pattern" exists, like a header with fields: VAERID = integer/string, ... AGE = string/DateTime/DateStamp, ...); if necessary you can use ODBC to help.

Using its events you can control how the data is manipulated - for example removing the "..." quotes, converting strings to DateTime, etc. You see? That way, all the performance work is done by these helpers, and you can manipulate the data easily, or almost.

TFDBatchMove, TFDBatchMoveTextReader and TFDBatchMoveTextWriter (and the dataset classes) and their subclasses can read and write data in text files, for example.

"The TFDBatchMove component is optimized for performance and I would personally recommend its use whenever you need to copy data from one data source to another including this case since you can specify format, field mappings and others in a flexible way."

Remember, FDManager is the ancestor of all the FDxxxx classes; you define the rules / field mappings there and everything will use them. By default, one always exists when FireDAC is in use.
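For what it's worth, a minimal sketch of that batch-move route, assuming a target table that is already open on a live connection; the per-field definitions and mappings that a real import would need are omitted:

uses
  FireDAC.Comp.Client, FireDAC.Comp.BatchMove,
  FireDAC.Comp.BatchMove.Text, FireDAC.Comp.BatchMove.DataSet;

// Sketch: copy a CSV file into an open dataset with TFDBatchMove.
procedure CsvToTable(Target: TFDTable; const FileName: string);
var
  Move: TFDBatchMove;
  Reader: TFDBatchMoveTextReader;
  Writer: TFDBatchMoveDataSetWriter;
begin
  Move := TFDBatchMove.Create(nil);
  try
    Reader := TFDBatchMoveTextReader.Create(Move);
    Writer := TFDBatchMoveDataSetWriter.Create(Move);
    Reader.FileName := FileName;
    Reader.DataDef.Separator := ',';       // CSV field separator
    Reader.DataDef.WithFieldNames := True; // first line holds the column names
    Writer.DataSet := Target;
    Move.Reader := Reader;
    Move.Writer := Writer;
    Move.Execute;
  finally
    Move.Free; // Move owns Reader and Writer, so they are freed with it
  end;
end;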
Stano Posted February 17, 2022

You probably meant TFDBatchMoveTextReader.
Attila Kovacs Posted February 17, 2022

3 hours ago, parallax said: Loading the 2021 file takes close to 30 seconds, but it only loads 669,008 of the 748,779 lines. The issue may be the doubled double quotes; I'm not sure.

Not sure? That's interesting - you could save 150 secs and a bunch of coding, but you don't bother to find out?

Doubled double quotes are escaped double quotes, but you can go the explicit way:

FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY ''
parallax Posted February 18, 2022

8 hours ago, joaodanet2018 said: Another way: you could use the FireDAC framework to access your files in an easy way with almost no code...

Thank you for the suggestion, it looks very interesting. I have not worked with FireDAC yet, so it will take a while to get into it.