parallax
Members-
Content Count
31 -
Joined
-
Last visited
Everything posted by parallax
-
More efficient string handling
parallax posted a topic in Algorithms, Data Structures and Class Design
Before loading a .csv file into MySQL, I have to get rid of all double quotes, as the LOAD DATA won't have any of that. But there are instances of commas within those strings having double quotes, so they have to be converted to something else. Besides, the date formats have to be changed, and I use a regEx for this. And the boolean fields. But this all takes a bit too much time, in my opinion, up to 20-30 seconds for conversion of 100.000 lines on my i7-7700. Is it possible to make this run faster? Anyone got some good suggestions? The code below is repeated for each line that is read from the .csv textfile. The small project may be found at https://github.com/parallaxprog/vaers function convertDates(s:string):string; var i:integer; d, dStr:string; begin i:=0; while (i<length(s)) do begin inc(i); d:=copy(s, i, 10); if regEx.IsMatch(d) then begin //we have a date, change format from american mm/dd/yyyy to yyyy-mm-dd format dStr:=copy(d, 7, 4)+'-'+copy(d, 1, 2)+'-'+copy(d, 4, 2); //insert it in the right place s:=copy(s, 1, i-1)+dStr+copy(s, i+10, length(s)); //skip past it inc(i, 9); end; end; result:=s; end; function commaReplace(s:string):string; var i:integer; inString:boolean; s2:string; begin //replace each comma within double quotes with semicolon, and then remove double quotes //so we can use MySQL LOAD DATA function to load the .csv directly //first remove dual double quotes s:=stringReplace(s,'""','',[rfReplaceAll]); inString:=false; for i:=1 to length(s) do begin if s[i]='"' then inString:=not inString else if inString and (s[i]=',') then s[i]:=';'; end; s2:=stringReplace(s,'"','',[rfReplaceAll]); //convert boolean Y to 1 in the right places, and convert dates result:=convertDates(stringReplace(s2,',Y,',',1,',[rfReplaceAll])); end; -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
I am having issues with testing your code, getting very large files. Could you please attach a test project? -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
You are correct, double quotes are only used if there is a comma within the string, and quotations within strings are double quoted, i.e. escaped with a double quote. So your load clause should work, but it doesn't. About 10% of the rows are discarded somehow. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Not sure why you get that error. There should be 35 fields in the data file: MyArr1990VAERSDATAFields: array [0 .. 34] of string = ( { } 'VAERS_ID', 'RECVDATE', 'STATE', 'AGE_YRS', 'CAGE_YR', 'CAGE_MO', { } 'SEX', 'RPT_DATE', 'SYMPTOM_TEXT', 'DIED', 'DATEDIED', 'L_THREAT', 'ER_VISIT', { } 'HOSPITAL', 'HOSPDAYS', 'X_STAY', 'DISABLE', 'RECOVD', 'VAX_DATE', { } 'ONSET_DATE', 'NUMDAYS', 'LAB_DATA', 'V_ADMINBY', 'V_FUNDBY', 'OTHER_MEDS', { } 'CUR_ILL', 'HISTORY', 'PRIOR_VAX', 'SPLTTYPE', 'FORM_VERS', 'TODAYS_DATE', { } 'BIRTH_DEFECT', 'OFC_VISIT', 'ER_ED_VISIT', 'ALLERGIES'); -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
UTF8Decode did not work as expected, even though the files are UTF-8 without BOM. But casting the output from MMFileToString using PAnsiChar, it worked fine, and I came down to 72 seconds on loading the big file. The mapping alone took 3,8 seconds, for the 748.xxx lines. That was fast. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Yes, I can imagine that going to x64 will make it work. I just tried the MM only loading/mapping in Dx32, and was out of memory again. Edit: I just tried Delphi x64, and the loading/mapping of 748.xxx lines took 1,56 secs. But the stringlist in Delphi is unicode and all the text ends up in a single string, with delimiters not being recognized. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Thank you very much, that was a very fast mapping routine - very interesting! Running on Lazarus it loads (or maps) 748.xxx lines in 2,39 secs., but I was only able to insert 300.000 lines into the SG, which took 4,02 secs, before getting Out of Memory, but that is no problem. There is no need to view all the data at once in a grid anyway. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Thank you for the suggestion - but I've tried the ESCAPED BY clause, and I still lose some lines when importing: LOAD DATA local INFILE 'c:/programdata/mysql/mysql server 8.0/uploads/2022VAERSDATA.csv' INTO TABLE data CHARACTER SET latin1 FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '' IGNORE 1 LINES This inserts 35.673 lines of 39.563, so I am losing close to 10% of the data. When working on this, I came across this section in the manual: https://dev.mysql.com/doc/refman/8.0/en/load-data.html I also noticed this suggestion at SO, but it did not work as indicated in the manual. If anyone has a solution that actually works in this case, I would of course appreciate that very much. Maybe something is escaping me? -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Thank you for the suggestion, it looks very interesting. I have not worked with FD yet, so it will take a while to get into it. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Thank you for the idea. If I have to split the file into smaller pieces, I have to read and write the whole file first, but if that is done in a more efficient way than line by line, it may work. I would have to test some block reading to see what kind of speed I can get on larger files. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
I was not able to load 620 Mb into a stringlist, or a memo... a smaller file of 39.000 lines/36 Mb went fast, about 120 ms. I have considered memory mapping, but I think it is too much to load into memory, and if I have to convert the file anyway to get rid of the double double quotes, then I can't escape reading and writing it, with all the time that it has to take. If that is the case, I guess the best one can do is to do as much as possible at the same time. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Testing the load data function on the original files without conversion now, I see that it loads without error messages, but it does not accept all the lines of the file. Loading the 2021 file takes close to 30 seconds, but it only loads 669.008 lines of 748.779 lines. The issue may be the double double quotes, not sure. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Sure, functions may be run in the load statement in the SET clauses - I know, and I already use that. You are correct that it could have been solved quite easily, I forgot about that in my previous reply. But I was not able to make it work with the quotes, so I did not get that far. I may have overlooked something. Maybe it was the double double (double) quotes issue that was the problem. I specified the delimiter and did testing on very simple files, but to no avail. I checked the net and there were others that had the same issue, and it may be related to the db version - I use MySQL 8.0.23 Since I was not able to resolve this, I went for the conversion routine. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Thank you, Anders, for a very nice setup, I will consider that. FIFO was in fact something that crossed my mind when I read your previous post, presenting the idea 🙂 -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
You are right about that - if the file may be loaded without conversion, that would be ideal. And it is probably a good thing to be a bit lazy, as that motivates to create automated solutions with minimal manual demand. Regarding a direct data load, the issue is the date format and the BOOLs, and they could of course be imported as strings (assuming it worked with quotes). I would not be able to sort on the dates then, but could of course spin through the table and convert it in a background thread...after the loading. Not really ideal. I assumed that the sorting on dates would be possible and on integers would be faster than if using strings, when doing queries later on, if the file was first converted. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
It is not for me, but for the people who will install the program and use it for analyzing the data. They probably have slower hardware than me, so it would be nice if they would not have to wait for several minutes for the loading to finish. The idea is that this is something everybody should see, because it is so shocking - I mean the sudden spike in cases in 2021. So I want it to work well. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
I don't need to do it more than once, but it's just nice to get the load function optimized for the large 2021 file, which is around 0,62 Gb. If I can get it down from about 150 seconds to 50 seconds, for that file, on my system, it would be acceptable, and the excellent concurrency read/write idea may accomplish that. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
That's a good idea, reading and writing at the same time - just push the data to a global buffer after conversion, which is read by another thread which does the file write, maybe using TCriticalSection for concurrency control. Could be interesting to try that and see what happens. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
I meant the user guide for vTune, at https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html Running vTune quickly without knowing much about what I'm doing, I got an interesting list of functions and how much time is spent on each. Writing and reading from disk takes the lead, not surprisingly. I realize that this may be very handy - thanks again for suggesting this idea, and for creating the map2pdb project 🙂 -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
I think you are right about that, and I've downloaded it all and installed it, and maybe I will take a look at the manual later on. Interesting stuff 🙂 -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
Interesting project you have there - thanks a lot for the suggestion. Of course you are right that it would be nice to use this kind of tool, in certain situations, like in this case. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
I am afraid that searching for a good profiler, or buying one, and finding out how to set it up and use it, will take more time than it's worth...or compared to doing some less complicated setup. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
True, and that is something I should use a lot more, obviously. Knew about it, but... I guess it's a habit that went away when computers got faster. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
You are right, and I should at least set up some ticks measuring around critical parts, to get an idea. -
More efficient string handling
parallax replied to parallax's topic in Algorithms, Data Structures and Class Design
When doing the 1990 data file with 2102 lines , the whole process takes less than 300 milliseconds with an empty table. The results below 200ms are when the table was already populated. I get the same numbers independent of mode, release or debug. I have an i7-7700HQ with 16Gb running Win10 x64. Now this is a small file, only 597,231 bytes, but when processing the 2021 file we have almost 653 million bytes/characters to check, over 1000 times more. That is the bottleneck.