Darian Miller 366 Posted April 26, 2022
Every now and then I run into the task of parsing HTML. I've used MSHTML and while it works, it always seems a bit kludgy to work with. Is there an alternative that you use and could recommend? (Usually simple tasks, like "give me all the anchors listed in a UL within a particular DIV".)
Alexander Sviridenkov 360 Posted April 26, 2022
Why don't you use the parser from HTML Component Library? AFAIR you own a license. The parser supports XPath and JQuery, so searching for particular nodes is quite simple.
Darian Miller 366 Posted April 26, 2022
2 hours ago, Alexander Sviridenkov said:
> Why don't you use the parser from HTML Component Library? AFAIR you own a license. The parser supports XPath and JQuery, so searching for particular nodes is quite simple.
Ah yes... I have only used that as a cool editor, but I'll check it out. Not sure why I didn't find "DelphiHTMLComponents.com", especially since I've been a paying customer for quite a while! LOL. Some days are diamonds, some days are stones... My wife says I have "CRS" syndrome. (Can't Remember 'Stuff')
Alexander Sviridenkov 360 Posted April 27, 2022 (edited)
Sample

    uses htmldraw, htmlpars;
    ..
    var
      D: THtDocument;
      N: THtNode;
    begin
      D := THtDocument.Create;
      try
        D.Parse('<body><div id="1"><ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><ul><a href="a3">Third</a></ul></div></body>');
        // CSS-style selector: anchors inside a UL inside the DIV with id="1"
        for N in D.JQuery('div#1 ul a') do
          ShowMessage(N['href']);
        // The same query expressed as XPath
        for N in D.XPath('//div[@id="1"]/ul/a') do
          ShowMessage(N['href']);
      finally
        D.Free;
      end;
    end;

Edited April 27, 2022 by Alexander Sviridenkov
Darian Miller 366 Posted April 27, 2022
Wow, thanks for the code sample too. I have some more code you can write for me. 🙂 Does canvas choice matter in this case when just parsing? Seems like a canvas is required. (I added htcanvasgdi to the uses clause.)
Alexander Sviridenkov 360 Posted April 27, 2022
In this case canvas doesn't matter.
Edwin Yip 154 Posted April 27, 2022
Alexander's HCL is an excellent set of gems, and you can definitely rely on it. Others I've successfully used in the past:
https://github.com/ying32/htmlparser (open source, simple, but it does work, depending on your needs)
DIHtmlParser (commercial; it's actually more of a tokenizer, but powerful)
Fr0sT.Brutal 900 Posted April 28, 2022
HTMLViewer obviously also has its own parser; however, I don't know whether it can be used separately from the rendering engine.
Darian Miller 366 Posted April 28, 2022
On 4/26/2022 at 8:27 PM, Alexander Sviridenkov said:
> In this case canvas doesn't matter.
Feature requests - allow parser usage without specifying a canvas. Also allow XPath from a particular node, or chaining multiple XPath statements together... I'm currently having trouble filtering an already filtered input.
In some sample HTML docs, there are header elements that separate nearly identical lists of values, with a handful of items before/after the lists. The headers have IDs, so I'd like to select the list after the header. Perhaps it's because I'm an XPath noob, but it seems easiest to simply iterate all elements, find the header I'm interested in, and grab data until the next header shows up, as a state machine.
Vincent Parrett 763 Posted April 28, 2022
3 hours ago, Darian Miller said:
> Feature requests - allow parser usage without specifying a canvas.
^ This. I looked at using the parser in a console application, but the requirement for a canvas etc. brings in the VCL, which makes it a non-starter (can't use the VCL at all under Docker). I've had this issue with several well-known third-party libraries: the lack of layering and the tight coupling with the VCL really reduce their possible uses.
Alexander Sviridenkov 360 Posted April 28, 2022
4 hours ago, Darian Miller said:
> Feature requests - allow parser usage without specifying a canvas. Also allow XPath from a particular node, or chaining multiple XPath statements together... I'm currently having trouble filtering an already filtered input. In some sample HTML docs, there are header elements that separate nearly identical lists of values, with a handful of items before/after the lists. The headers have IDs, so I'd like to select the list after the header. Perhaps it's because I'm an XPath noob, but it seems easiest to simply iterate all elements, find the header I'm interested in, and grab data until the next header shows up, as a state machine.
The parser itself does not depend on a canvas, the VCL, or anything else. You can use the THtmlNode class (htmlpars unit); the code is almost the same: D := THtmlNode.Create; D.Parse(..). THtmlNode supports XPath. For JQuery, please use TStyledHTMLNode (htmlcss unit). Both XPath and JQuery can be called from any node, so chaining is possible.
Alexander Sviridenkov 360 Posted April 28, 2022
54 minutes ago, Vincent Parrett said:
> ^ This. I looked at using the parser in a console application, but the requirement for a canvas etc. brings in the VCL, which makes it a non-starter (can't use the VCL at all under Docker). I've had this issue with several well-known third-party libraries: the lack of layering and the tight coupling with the VCL really reduce their possible uses.
HCL is layered; the HTML and CSS parsers have no dependencies on a framework or OS. The class hierarchy is:

    THtNode (htmlpars unit, only RTL) -> THtXMLNode (htxml unit, only RTL)
      |
    THtmlNode (htmlpars unit, only RTL)
      |
    TStyledHTMLNode (htmlcss unit, only RTL)
      |
    TElement (htmldraw unit)
      |
    THtDocument (htmldraw unit)

THtDocument/TElement use VCL/FMX units for several reasons:
1. Native controls in an HTML page (edits, combos, etc.).
2. VCL themes (theme color support).
3. VCL/FMX canvas (THtDocument can draw on a VCL/FMX canvas).
But all graphics code (the canvas classes) is isolated.
Vincent Parrett 763 Posted April 28, 2022 (edited)
12 minutes ago, Alexander Sviridenkov said:
> HCL is layered; the HTML and CSS parsers have no dependencies on a framework or OS.
OK, I was using the document class to parse - didn't realise I could use the node class directly.
Edited April 28, 2022 by Vincent Parrett (typo)
Darian Miller 366 Posted April 29, 2022
17 hours ago, Alexander Sviridenkov said:
> HCL is layered; the HTML and CSS parsers have no dependencies on a framework or OS. The class hierarchy is ...
Note: the example code you provided fails because a canvas was not specified.
Darian Miller 366 Posted April 29, 2022
17 hours ago, Alexander Sviridenkov said:
> Both XPath and JQuery can be called from any node, so chaining is possible.

    <body><div id="1"><h2 id="h1"> text1 <ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><h2 id="h2"> text2 <ul><a href="a3">Third</a></ul></div></body>

I could not get chaining to work. In a basic example like this, I'd want one list of anchors for h1 and another list of anchors for h2, within a much more complex source. I tried retrieving the header by ID and searching for the anchors under it, and I didn't get any results. After a short time, I switched to another suggested solution (htmlparser) but ended up moving on to something else.
Alexander Sviridenkov 360 Posted April 29, 2022
13 minutes ago, Darian Miller said:
> Note: the example code you provided fails because a canvas was not specified.
Only when using THtDocument. With THtmlNode and TStyledHTMLNode it works without the canvas units:

    uses htmlpars;

    var
      D: THtmlNode;
      N: THtNode;
    begin
      D := THtmlNode.Create;
      try
        D.Parse('<body><div id="1"><ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><ul><a href="a3">Third</a></ul></div></body>');
        // Anchors inside a UL inside the DIV with id="1"
        for N in D.XPath('//div[@id="1"]/ul/a') do
          ShowMessage(N['href']);
      finally
        D.Free;
      end;
    end;
Alexander Sviridenkov 360 Posted April 29, 2022
10 minutes ago, Darian Miller said:
> <body><div id="1"><h2 id="h1"> text1 <ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><h2 id="h2"> text2 <ul><a href="a3">Third</a></ul></div></body>
> I could not get chaining to work. In a basic example like this, I'd want one list of anchors for h1 and another list of anchors for h2, within a much more complex source. I tried retrieving the header by ID and searching for the anchors under it, and I didn't get any results. After a short time, I switched to another suggested solution (htmlparser) but ended up moving on to something else.
Sorry, I forgot to mention that when chaining, the current node should be passed to XPath as a parameter to prevent XPath from walking up to the root.

    var
      D: THtmlNode;
      N, A: THtNode;
    begin
      D := THtmlNode.Create;
      try
        D.Parse('<body><h1 id="header1"><a href="a1">First</a><a href="a2">Second</a></h1><h2><div><a href="a3">Third</a></div></h2></body>');
        // Anchors under the header with id="header1"
        for N in D.XPath('//h1[@id="header1"]') do
          for A in N.XPath('//a', false, N) do
            ShowMessage(A['href']);
        // Anchors under each h2
        for N in D.XPath('//h2') do
          for A in N.XPath('//a', false, N) do
            ShowMessage(A['href']);
      finally
        D.Free;
      end;
    end;

The second parameter is "stop after first found node"; the third is the current root node.
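For the console/Docker case raised earlier in the thread, the chaining pattern above could be wrapped in a small helper. This is only a minimal sketch, not code from HCL: it uses nothing beyond the calls already shown in this thread (THtmlNode.Create, Parse, the three-parameter XPath, and the N['href'] accessor) and assumes they behave as described above. The program name, the PrintLinksUnder helper, and the switch from ShowMessage to Writeln are illustrative choices.

```pascal
program ChainedXPathDemo; // hypothetical program name

{$APPTYPE CONSOLE}

uses
  htmlpars; // HCL parser unit; described in this thread as RTL-only

// Hypothetical helper: writes the href of every anchor found below each
// node matched by ScopeXPath.
procedure PrintLinksUnder(Root: THtmlNode; const ScopeXPath: string);
var
  Scope, Anchor: THtNode;
begin
  for Scope in Root.XPath(ScopeXPath) do
    // Passing Scope as the third argument keeps the inner search inside
    // the matched node; False means "do not stop after the first match".
    for Anchor in Scope.XPath('//a', False, Scope) do
      Writeln(Anchor['href']);
end;

var
  D: THtmlNode;
begin
  D := THtmlNode.Create;
  try
    D.Parse('<body><h1 id="header1"><a href="a1">First</a><a href="a2">Second</a></h1>' +
            '<h2><div><a href="a3">Third</a></div></h2></body>');
    PrintLinksUnder(D, '//h1[@id="header1"]'); // anchors under one header
    PrintLinksUnder(D, '//h2');                // anchors under each h2
  finally
    D.Free;
  end;
end.
```

Keeping the scope expression as a parameter means each header can be queried separately with the same code, which is the "one list per header" result asked for above.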
Darian Miller 366 Posted April 29, 2022
2 hours ago, Alexander Sviridenkov said:
> Sorry, I forgot to mention that when chaining, the current node should be passed to XPath as a parameter to prevent XPath from walking up to the root. ... The second parameter is "stop after first found node"; the third is the current root node.
Thanks, I'll probably have time to try it again this weekend.
Darian Miller 366 Posted April 29, 2022 (edited)
On 4/26/2022 at 9:53 PM, Edwin Yip said:
> https://github.com/ying32/htmlparser (open source, simple, but it does work, depending on your needs)
About as far as I got with that tool was forking it and translating all of the Chinese strings/comments into English using Google Translate: https://github.com/radprogrammer/htmlparser
Edited April 29, 2022 by Darian Miller
Angus Robertson 577 Posted April 30, 2022
ICS includes an updated version of THTMLParser, written by Dennis Spreen 20 years ago. It's very simple and just works.
https://svn.overbyte.be/svn/ics/trunk/Source/OverbyteIcsHtmlPars.pas
Angus
Edwin Yip 154 Posted May 2, 2022
On 4/30/2022 at 3:38 AM, Darian Miller said:
> About as far as I got with that tool was forking it and translating all of the Chinese strings/comments into English using Google Translate: https://github.com/radprogrammer/htmlparser
Sounds cool!