dummzeuch

Speed up reading multiple text files

I've got a bunch of text files, up to a few hundred, each containing up to a few thousand lines, but some only maybe ten. Each line contains a single word, the lines are sorted, and the words are unique within the file (but not unique over all files).

I need to read those files and add them to an in-memory structure where they should be sorted and each entry should have a reference back to the file name it came from. This list should still contain the duplicates, so there could be multiple entries for "bla", one from file "foo", another from file "bar", and so on.

 

So far I use a stringlist to store the result, with a shared PChar for each file name stored in the Objects property. Reading the files is done with a temporary stringlist and LoadFromFile. I then add the strings one by one to the result using AddObject. In the last step I sort the whole list.
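In code, that approach looks roughly like this (a minimal sketch; the identifiers and the StrNew ownership handling are illustrative, not my actual code):

uses System.Classes, System.SysUtils;

procedure LoadWords(const FileNames: array of string; Results: TStringList);
var
  Temp: TStringList;
  FileName: string;
  SharedName: PChar;
  i: Integer;
begin
  Temp := TStringList.Create;
  try
    for FileName in FileNames do
    begin
      // one shared PChar per file, stored in Objects for every word
      SharedName := StrNew(PChar(FileName)); // freed elsewhere with StrDispose
      Temp.LoadFromFile(FileName);
      for i := 0 to Temp.Count - 1 do
        Results.AddObject(Temp[i], TObject(SharedName));
    end;
  finally
    Temp.Free;
  end;
  Results.Sort; // sort once at the end
end;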

 

This works, but it takes several seconds (which feels like an eternity), even when reading from an SSD. This is a problem because it happens frequently and the files are always different, so no caching is possible.

 

So, what can be done to speed this up?

I don't think that inserting into a sorted result list would be faster than adding unsorted and sorting at the end.

Would setting the Objects properties first and then using AddStrings for each file be faster?

 

I guess I should do some timing to find the bottleneck(s).

1 hour ago, dummzeuch said:

add them to an in-memory structure where they should be sorted

Can you elaborate on this a bit:

  • What is the sort used for (iterating and/or searching)?
  • How do you sort duplicates from different files?


I suspect you're running into something which had me gnashing my teeth in the past... TStringList isn't the most predictable in terms of performance, particularly when you set the Sorted property to True ahead of populating it...

 

When you set the Sorted property to True on a TStringList, the string list must (of course) maintain the ordering whenever a new string is added. To do this, it must determine where to insert the line being added, and it uses a binary search for that: see TStringList.AddObject(), which calls TStringList.Find() before potentially calling InsertItem(). This results in O(n log n) time complexity just for reading n lines of text from a single file, as opposed to O(n) without sorting.

 

This might be tolerable, but there's more. TStringList's name and methods imply that it's a list structure, when in fact it's not: the underlying data structure is a contiguous array of strings. This means the actual insert becomes increasingly expensive as the TStringList grows, because insertions are effectively O(n) even with a fairly efficient copy; see TStringList.InsertItem(). While System.Move() is a pretty quick block copy, the amount to move nevertheless keeps increasing as the number of items in the string list grows. In a normal linked list, the cost of the actual insert, once you've found the insertion point, is O(1). Then there's also the growing adding some overhead: the underlying array has to be enlarged whenever the number of lines reaches the capacity of the array and another one is inserted. (One might suspect that all this copying and growing is doing unpleasant things to CPU caches as well, which also won't help performance.)

 

So, the point of this little diatribe is that TStringList's performance behaviour can be a bit different to what one might expect, and you should therefore act accordingly.  
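A quick way to see the effect is a rough micro-benchmark along these lines (a sketch with assumed names, not measured numbers; note the input words must not already be sorted, or the sorted inserts degenerate into cheap appends at the end):

uses System.Classes, System.Diagnostics;

procedure CompareStrategies(const Words: TArray<string>);
var
  List: TStringList;
  StopWatch: TStopwatch;
  W: string;
begin
  // strategy 1: keep the list sorted while adding (Find + Move on every Add)
  List := TStringList.Create;
  try
    List.Duplicates := dupAccept;
    List.Sorted := True;
    StopWatch := TStopwatch.StartNew;
    for W in Words do
      List.Add(W);
    Writeln('sorted inserts: ', StopWatch.ElapsedMilliseconds, ' ms');
  finally
    List.Free;
  end;

  // strategy 2: add unsorted, sort once at the end
  List := TStringList.Create;
  try
    StopWatch := TStopwatch.StartNew;
    for W in Words do
      List.Add(W);
    List.Sort;
    Writeln('add then sort: ', StopWatch.ElapsedMilliseconds, ' ms');
  finally
    List.Free;
  end;
end;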

 

Given that your lists are already sorted, you should load the data into multiple string lists without duplicate checking or sorting turned on, which will avoid most of the needless overhead. Then write a routine to multi-way merge these into a final list (if you want to get fancy, maybe using a min-heap, given the varying lengths of the input). Or use something else entirely, as others have suggested. (Maybe we should write a TFastStringList, and while we're at it, make it interfaced, so I can return IFastStringList instances and have them reference counted etc... I digress.)
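For illustration, a minimal multi-way merge over already-sorted lists (names are made up; this uses a plain linear scan over the list heads, which costs O(k) per output item where the suggested min-heap would be O(log k)):

uses System.Classes, System.SysUtils;

procedure MergeSortedLists(const Lists: array of TStringList; Merged: TStringList);
var
  Pos: TArray<Integer>; // current read position in each input list
  Best, i: Integer;
begin
  SetLength(Pos, Length(Lists)); // all positions start at 0
  repeat
    // pick the input list whose current head is the smallest string
    Best := -1;
    for i := 0 to High(Lists) do
      if (Pos[i] < Lists[i].Count) and
         ((Best < 0) or (CompareStr(Lists[i][Pos[i]], Lists[Best][Pos[Best]]) < 0)) then
        Best := i;
    if Best >= 0 then
    begin
      // carry the Objects reference (the file name) along with the word
      Merged.AddObject(Lists[Best][Pos[Best]], Lists[Best].Objects[Pos[Best]]);
      Inc(Pos[Best]);
    end;
  until Best < 0; // all input lists exhausted
end;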

5 hours ago, Uwe Raabe said:

Can you elaborate on this a bit:

  • What is the sort used for (iterating and/or searching)?
  • How do you sort duplicates from different files?

It's used for (incremental) filtering, e.g. I type "bl" and get two entries with "bla", one from file foo, the other from file bar. And of course additional entries that match "bl", like "blub" or "blood":

 

filter: bl

Result:
bla -> foo
bla -> bar
blub -> spacebar
blood -> human

 

Because they are different entries. There are no duplicates within the same file, but there may be duplicates from different files. I want to display both.
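Since the merged list is sorted, that incremental filter can be served with a binary search for the first candidate followed by a linear scan (a sketch with assumed names; the comparison must match the one the list was sorted with — CompareStr assumes a case-sensitive sort, while TStringList's default CaseSensitive = False sorts with AnsiCompareText):

uses System.Classes, System.SysUtils;

// index of the first entry >= Prefix, or List.Count if there is none
function FirstCandidate(List: TStringList; const Prefix: string): Integer;
var
  Lo, Hi, Mid: Integer;
begin
  Lo := 0;
  Hi := List.Count - 1;
  Result := List.Count;
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if CompareStr(List[Mid], Prefix) < 0 then
      Lo := Mid + 1
    else
    begin
      Result := Mid;
      Hi := Mid - 1;
    end;
  end;
end;

procedure FilterByPrefix(List, Matches: TStringList; const Prefix: string);
var
  i: Integer;
begin
  i := FirstCandidate(List, Prefix);
  while (i < List.Count) and (Copy(List[i], 1, Length(Prefix)) = Prefix) do
  begin
    Matches.AddObject(List[i], List.Objects[i]); // keep the file reference
    Inc(i);
  end;
end;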

4 minutes ago, dummzeuch said:

there may be duplicates from different files. I want to display both.

How are they sorted?

1 hour ago, ByteJuggler said:

I suspect you're running into something which had me gnashing my teeth in the past... TStringList isn't the most predictable in terms of performance, particularly when you set the Sorted property to True ahead of populating it...

I'm not doing that.

 

7 hours ago, dummzeuch said:

So far I use a stringlist to store the result, with a shared PChar for each file name stored in the Objects property. Reading the files is done with a temporary stringlist and LoadFromFile. I then add the strings one by one to the result using AddObject. In the last step I sort the whole list. 

 

The temporary string lists don't have Sorted set to True; the files just happen to be sorted already (because they are in fact created with SaveToFile of a sorted string list in a different program).


A memory table with indexing and filtering (Filter := "like 'bl*'") would help. If you don't have TkbmMemTable, there are TFDMemTable and TClientDataSet, even in the Community Edition. Use BeginBatch/EndBatch to speed up the loading.
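A rough sketch of how that could look with TFDMemTable (field names and sizes are assumptions, and FireDAC filter expressions use % as the LIKE wildcard, if I remember correctly):

uses Data.DB, FireDAC.Comp.Client;

var
  MemTable: TFDMemTable;
begin
  MemTable := TFDMemTable.Create(nil);
  MemTable.FieldDefs.Add('Word', ftString, 100);
  MemTable.FieldDefs.Add('Source', ftString, 260);
  MemTable.CreateDataSet;

  MemTable.BeginBatch; // suppress per-record notifications while loading
  try
    // for each file, for each line:
    MemTable.AppendRecord(['bla', 'foo']);
  finally
    MemTable.EndBatch;
  end;

  MemTable.IndexFieldNames := 'Word';     // sorted view
  MemTable.Filter := 'Word LIKE ''bl%'''; // incremental filter
  MemTable.Filtered := True;
end;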

 

Of course, you can write low-level code and try to squeeze out the maximum speed, as advised above.

3 minutes ago, Uwe Raabe said:

How are they sorted?

By the default sort order of TStringList.Sort.


Just did some timing:

First try:
Loading 2796 files with 1000377 lines in total
loading takes 4125 ms
sorting takes 375 ms

Second try (the same files, possibly now in the file system cache):
loading takes 4204 ms
sorting takes 359 ms

Third try (the same files, possibly now in the file system cache):
loading takes 4235 ms
sorting takes 359 ms

 

Timing was done with GetTickCount.

 

So it seems that the bottleneck is not sorting but loading the files or adding the entries to the result list.


Loading 2800 files in 4 seconds doesn't seem slow to me ...

 

Sorting is damn slow, though. Maybe IntroSort from Spring4D can do it better. I know @Stefan Glienke introduced some optimizations for fast element swapping (which should help with strings).

 

How much time do you want to spend on this?

 

The fastest solution would definitely be:

  • Load each file into one large block of memory.
  • Scan the file and find all strings. Put the offset to each string into a list.
    • Replace each #13#10 with #0#0 during the process.
  • Sort this list with a custom comparison method which follows the offset into the large block of memory and uses that string for comparison.
    • You can treat strings as PChars as they will be #0 terminated.
  • Note: allocate one byte more than the file size for each block and put #0 in the last byte to ensure that the last string is properly terminated.

After you have that working, you can parallelize the loading and scanning process for additional fun.
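A sketch of the per-file scan described in the steps above (assuming single-byte/ANSI files with CRLF line endings; TWordRef and all other names are made up, and each file's buffer must stay alive for as long as the references into it are used):

uses System.Classes, System.SysUtils, System.AnsiStrings,
  System.Generics.Defaults, System.Generics.Collections;

type
  TWordRef = record
    Text: PAnsiChar;  // points into the file's buffer
    FileName: string; // shared, reference-counted
  end;

procedure ScanFile(const AFileName: string; var Buffer: TBytes;
  Refs: TList<TWordRef>);
var
  Stream: TFileStream;
  Size, i: Integer;
  Ref: TWordRef;
begin
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    Size := Integer(Stream.Size);
    SetLength(Buffer, Size + 1); // one extra byte for the trailing #0
    if Size > 0 then
      Stream.ReadBuffer(Buffer[0], Size);
    Buffer[Size] := 0; // terminate the last string
  finally
    Stream.Free;
  end;

  i := 0;
  while i < Size do
  begin
    Ref.Text := PAnsiChar(@Buffer[i]);
    Ref.FileName := AFileName;
    Refs.Add(Ref);
    while (i < Size) and (Buffer[i] <> 13) and (Buffer[i] <> 10) do
      Inc(i); // skip to the end of the line
    while (i < Size) and ((Buffer[i] = 13) or (Buffer[i] = 10)) do
    begin
      Buffer[i] := 0; // turn #13#10 into #0 terminators in place
      Inc(i);
    end;
  end;
end;

// afterwards, sort all collected references by following the pointers:
// Refs.Sort(TComparer<TWordRef>.Construct(
//   function(const L, R: TWordRef): Integer
//   begin
//     Result := System.AnsiStrings.StrComp(L.Text, R.Text);
//   end));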


It seems that I have to use something different from GetTickCount. If I try to distinguish between loading a file and inserting its entries into the result list, the resolution isn't high enough, and I end up with a total loading time of 109 ms and a total inserting time of 15 ms. Obviously those don't add up to 4 seconds.

OK, I need that high-performance timer code I've got somewhere ...

 

@Primož Gabrijelčič since loading takes 4 seconds I don't think spending any time trying to speed up sorting will help any.

3 minutes ago, dummzeuch said:

I need that high performance timer code I have got somewhere ..

System.Diagnostics.TStopwatch


Hm, I must be doing something wrong:

 

Initialize with: StopWatch := TStopWatch.Create;
Start with: StopWatch.Start;
Stop with: StopWatch.Stop;
Continue with: StopWatch.Start;
Get the total elapsed time in ms: Res := StopWatch.ElapsedMilliseconds;

 

But the times don't add up. It looks as if the total elapsed time is only the time since the last call to Start.


(This is a different set of files, so this result can't be directly compared to the above.)

 

It turns out that I wasn't doing anything wrong. It's not the loading and inserting that takes so long but building the list of files to load:

 

loaded 1649 files
found 188045 words
 

searching time 2121 ms
processing time 275 ms
loading time 83 ms
inserting time 10 ms
sorting time 462 ms
total time 2859 ms

 

Thanks for listening. I think I'll just go down into the cellar to weep. 😉


I believe this is the most common problem in our line of work: not realizing what the problem really is. 😄

