Jump to content
Mike Torrettinni

Multiple string replace - avoid calling StringReplace multiple times

Recommended Posts

58 minutes ago, Attila Kovacs said:

This would be a brand new exotic routine with unpredictable results.

No because the algo as I described is one pass, different from simply running ReplaceStr in a loop where subsequent replacements work on an altered input!

 

Replace('Jabba the Hutt lived in a Hut', ['Hutt', 'Hut'], ['Alien', 'House']) would produce the same as would Replace('Jabba the Hutt lived in a Hut', ['Hut', 'Hutt'], ['House', 'Alien']) because finding the first 'Hut' from 'Hutt' would not eat the replacement to 'House' but continue and find a better match 'Hutt' to be replaced with 'Alien' and then be done with it.

  • Like 1
  • Thanks 1

Share this post


Link to post
8 hours ago, David Heffernan said:

I guess if there are no firm rules on the expected behaviour, then that makes it easier to write faster code. 

You are welcome to try.

Share this post


Link to post
11 hours ago, Attila Kovacs said:

Btw. your example does not reflects this.

I think it does. The two searches are not disjoint, so calling StringReplace (shortest word first) would replace "Hut" by "House" and the name "Hutt" would not exist anymore in the second search pass due to this insertion.  An approach that starts by scanning the original string for all search terms would reveal two "hits" at the first letter H and then things get interesting, because it would have to decide which replacement is "better".  So it needs some rules to make that decision. One rule could be to replace the longest search term.
 

Share this post


Link to post
23 hours ago, Dalija Prasnikar said:

It seems that TStringBuilder is not quite as fast as it could be. I didn't gave it closer look, so I cannot say whether the issue is creating and destroying TStringBuilder or other TStringBuilder code.

In fact, FastMM already reserves some space for dynarrays and strings making StringBuilder nearly useless. SB would benefit if it cached added strings to some internal array and only flushed them to final buffer when needed but it's not the case currently

  • Like 1

Share this post


Link to post
5 minutes ago, Fr0sT.Brutal said:

In fact, FastMM already reserves some space for dynarrays and strings making StringBuilder nearly useless. SB would benefit if it cached added strings to some internal array and only flushed them to final buffer when needed but it's not the case currently

Again, for a heavily threaded program then thread contention becomes a bottleneck with heap allocation of FastMM. 

Share this post


Link to post
Posted (edited)
11 hours ago, Stefan Glienke said:

I brought that up in the third post in this thread already before everyone started polluting it with their attempts before actually knowing the exact requirements 😉 

 

So assuming you want a one pass replacement of multiple matches using the best match I would try something like this:

- sort replacement pairs by length of the to be replaced string descending (to get the best match first (like "hut" and "hutt" you would have "hutt" first to not replace "hut" where you could have replaced "hutt")

- walk the input string char by char and check against the array of replacements if you have a match - due to the sorting by length you will find the most fitting one if there is any

- produce output string on the fly copying the parts with no matches and the replacements

 

I did not write any code for that nor did measure anything but that would be my approach

I like this one. This is the only solid and robut approach to the most generic task - multi-replace. Every solution that modifies a source and then scans the modified value for the next entry has the flaw of processing already replaced fragments:

MultipleReplace('I like milk. I'm afraid of bees.', ['milk', 'beer'], ['bee', 'spider']) => 'I like spiderr. I'm afraid of spiders.'

Edited by Fr0sT.Brutal

Share this post


Link to post
9 minutes ago, David Heffernan said:

Again, for a heavily threaded program then thread contention becomes a bottleneck with heap allocation of FastMM. 

Well, SB does allocations by FastMM as well. The only difference is that with FastMM you can't control initially reserved space.

Heavily thread programs shouldn't use heap allocations in hot places at all or switch to mem managers that implement thread-local heaps

Share this post


Link to post
13 minutes ago, Fr0sT.Brutal said:

Heavily thread programs shouldn't use heap allocations in hot places

You can't always achieve not using heap allocation. What you can do is reduce heap allocation. Which is one of the reasons for using SB.

 

13 minutes ago, Fr0sT.Brutal said:

switch to mem managers that implement thread-local heaps

Good advice, but hard to achieve due to their paucity.

Share this post


Link to post

Sorting of replacement patterns (by length, desc) seems to me an overkill.

As @Attila Kovacs already proposed, input patterns already have their natural order in array - so this order should be respected. This will be more flexible, and fast.

 

 

Share this post


Link to post
Posted (edited)
9 hours ago, Attila Kovacs said:

and what sorting function would you use fot this? Replace('ABA', ['BA','XX'], ['AB','XX']);

It doesn't matter because 'AB' would be found first.

Edited by Stefan Glienke

Share this post


Link to post
Posted (edited)
32 minutes ago, David Heffernan said:

You can't always achieve not using heap allocation. What you can do is reduce heap allocation. Which is one of the reasons for using SB.

Probably, but not in so big numbers as it could be if MM wouldn't reserve space at all.

 

In most cases, the more specific an algo is, the more it could be optimied for speed because you can make some assumptions that reduce conditions. F.ex., generic string replace, even pretty optimized on ASM, being used for replacing one char with another would be likely beaten by dumb "for ch:=1 to Length(Src) if Src[ch] = CharFrom then Dst[ch] := CharTo else Dst[ch] := Src[ch]"

@Mike Torrettinni

if your goal is just to remove HTML entities, better utilize specific routine where you could make these assumptions:

- Result will always be shorter than source. So allocate space for result once at the beginning and truncate it to actual length in the end

- List of entities is standardized. You can hardcode them for better performance

 -- Any non-ASCII char after & and before ; should be skipped

 -- possible set of distances from & to ; is known beforehand

 -- convert input char (you already ensured it's ASCII alpha) to fixed case with simple bitwise operation to reduce ranges

 -- Use "case 'a'..'z'" instead of CharInSet

 -- some more hard optimizations that make code too specific but very fast

- Flow prediction could play big role as well. In good HTML there won't happen a & that's not an entity so "if currChar = '&' and IsEntity then ... else Continue" could outperform "if currChar = '&' and (not IsEntity) then Continue"

- The more you optimize your code, the less readable/universal/pretty it becomes 😞

Edited by Fr0sT.Brutal
if currChar check fixed
  • Like 1

Share this post


Link to post
18 minutes ago, Fr0sT.Brutal said:

if your goal is just to remove HTML entities, better utilize specific routine where you could make these assumptions:

No, I need more general purpose function, normal string is just fine.

 

Here is a real multi pattern where I need performance in my code:

 

['
', #13#10, '"'] -> ['', #8, '\_/']

[#8, '\_/'] -> [#13#10, '"']

 

['XLTT', 'XLA', 'CDR', 'TXE'] -> ['_', '_', '_', '_']

['Start Event', 'Stop Event', 'Load', 'Call', 'Open'...] <-> ['SE:', 'STE:', 'L:', 'C:', 'O:'...]

 

 

But even when performance is not really needed and multiple calls to standard StringReplace are good enough, why wouldn't i use one-pass replacement function for examples like these:

// escaping characters
['?', !', '[', ']', '(', ')'] -> ['\?', '\!', '\[', '\]', '\(', '\)']

// some html tags

['&amp;', '&lt;', '&gt;', '&quot;', '&nbsp;', '<br>', '<br />', '<br/>'] -> ['&', '<', '>', '"', ' ', #13#10, #13#10, #13#10]

['&', '<', '>', '"', ' ', #13#10] -> ['&amp;', '&lt;', '&gt;', '&quot;', '&nbsp;', '<br>']

 

 

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×