bazzer747 Posted October 10, 2022
Hi, I'm reading in a large file, scraped from a website into a text file, and am having a problem with surnames like O'Donnell and O'Brian. In the input text these names appear as 'O&#39;Donnell', with the characters &#39; in place of the apostrophe. These surnames need to match an existing table of usernames, but they won't unless I replace &#39; with an apostrophe. I'm trying this code to do it:
if AnsiContainsText( cLast, '&#39;' ) // cLast holds surname 'O&#39;Donnell'
then AnsiReplaceStr( cLast, '&#39;', '' );
So replacing &#39; with an apostrophe. This isn't working. When I debug, the first line recognises the characters &#39;, but the second line replaces nothing. Any thoughts on why this isn't working (or a better way to do this) would be appreciated.
Lars Fosdal Posted October 10, 2022
AnsiReplaceStr( cLast, '&#39;', '' );
Is that an empty string, or is the forum software playing tricks on us? It should look like
AnsiReplaceStr( cLast, '&#39;', '''' );
Lajos Juhász Posted October 10, 2022
Also, everyone failed to notice a small detail:
function AnsiReplaceStr(const AText, AFromText, AToText: string): string;
It's a function, not a procedure, so its result has to be assigned. You can try:
cLast := AnsiReplaceStr( cLast, '&#39;', '''' );
Fr0sT.Brutal Posted October 10, 2022
Also note that "&#39;" is bad style (a magic number); "&apos;" is more correct, and at some point the webmasters could switch to the latter.
bazzer747 Posted October 10, 2022
Thank you all. Yes, I missed that it was a function, so this code:
if AnsiContainsText( cLast, '&#39;' ) then cLast := AnsiReplaceStr( cLast, '&#39;', chr(39) );
now works as expected and returns O'Donovan, which matches the name in the existing table (and, of course, it will handle other similar names). Lars, that was just an empty string there; I only wanted to remove the characters at that stage.
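The fix discussed above can be sketched as a small console program. This is only an illustration under a few assumptions: the helper name NormalizeApostrophes is hypothetical (it does not appear in the thread), and handling &apos; alongside &#39; is an addition prompted by the remark that the site might switch between the two forms.

```pascal
program ApostropheEntities;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.StrUtils;

// Hypothetical helper (not from the thread): replaces the HTML character
// references for an apostrophe with a literal apostrophe.
function NormalizeApostrophes(const AText: string): string;
begin
  // AnsiReplaceStr is a function: it returns the new string and leaves
  // its arguments untouched, so the result must be assigned.
  Result := AnsiReplaceStr(AText, '&#39;', '''');
  // Assumption: the named form may also show up in the scraped text.
  Result := AnsiReplaceStr(Result, '&apos;', '''');
end;

var
  cLast: string;
begin
  cLast := 'O&#39;Donnell';
  cLast := NormalizeApostrophes(cLast);
  Writeln(cLast); // prints O'Donnell
end.
```

The AnsiContainsText check is left out here because AnsiReplaceStr simply returns the input unchanged when the search text is not found.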
Uwe Raabe Posted October 10, 2022 (edited)
There is also a built-in function to decode HTML text:
uses System.NetEncoding;
...
var S := TNetEncoding.HTML.Decode(sHtml);
Remy Lebeau Posted October 10, 2022 (edited)
11 hours ago, Uwe Raabe said:
There also is a built-in function to decode HTML text: uses System.NetEncoding; ... var S := TNetEncoding.HTML.Decode(sHtml);
However, it only supports decoding numeric entities and references to reserved characters. Since apos is not a reserved character, it will not decode '&apos;'. The documentation even says so: https://docwiki.embarcadero.com/Libraries/en/System.NetEncoding.THTMLEncoding
Quote:
THTMLEncoding only encodes reserved HTML characters: "&<>.
THTMLEncoding supports decoding any HTML numeric character reference, such as &#169; or &#xFE;, as well as the character entity references of reserved HTML characters: &quot;, &amp;, &lt;, &gt;.
Warning: Decoding character entity references of non-reserved characters, such as &apos; or &copy;, is not supported. The input data must not contain any other character entity references. Otherwise, the output data may be corrupted.
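One way to combine the two suggestions is to handle &apos; by hand and leave the numeric references to the built-in decoder. The sketch below assumes the documented behaviour quoted above, which may differ between Delphi versions. The wrapper name DecodeHtmlText is hypothetical, and it only covers &apos;; any other named entity in the input remains subject to the documentation's warning.

```pascal
program DecodeWithApos;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.NetEncoding;

// Hypothetical wrapper (not from the thread): replaces &apos; manually,
// then lets THTMLEncoding decode numeric references and the reserved
// entities (&quot; &amp; &lt; &gt;).
function DecodeHtmlText(const AHtml: string): string;
begin
  // Replace before decoding: a double-escaped '&amp;apos;' does not contain
  // the text '&apos;', so it still decodes to the literal text '&apos;'.
  Result := TNetEncoding.HTML.Decode(
    StringReplace(AHtml, '&apos;', '''', [rfReplaceAll]));
end;

begin
  Writeln(DecodeHtmlText('O&#39;Donnell'));
  Writeln(DecodeHtmlText('O&apos;Brian &amp; Sons'));
end.
```

Doing the replacement after the decode step would turn a double-escaped &amp;apos; into an apostrophe as well, which is why the order above was chosen.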
Lars Fosdal Posted October 11, 2022
21 hours ago, Fr0sT.Brutal said:
Also note that "&#39;" is bad style (magic number), more correct is "&apos;", sometime web masters could change to the latter one
Nope. https://stackoverflow.com/questions/2083754/why-shouldnt-apos-be-used-to-escape-single-quotes
Fr0sT.Brutal Posted October 11, 2022
2 minutes ago, Lars Fosdal said:
Nope
Nope on your nope 😉 That was more or less relevant when the question was asked (12 years ago), but now, by your own estimate, it's 3x legacy. IE8 in 2022 is hardly something to consider.
Lars Fosdal Posted October 11, 2022
27 minutes ago, Fr0sT.Brutal said:
Nope on your nope 😉 That was more or less actual when the question was asked (12 yr ago) but now, according to your estimations, it's 3x legacy. IE8 in 22 is hardly a something to consider
So, TNetEncoding.HTML.Decode needs to be updated to support HTML5...
Edit: Looks like a significant expansion of named entities.
HTML5: https://www.w3.org/TR/2011/WD-html5-20110525/named-character-references.html
HTML4: https://www.w3.org/TR/html4/sgml/entities.html