HTML Parser alternative to MSHTML?

Darian Miller · April 26, 2022

Every now and then I run into the task of parsing HTML. I've used MSHTML and while it works, it always seems a bit kludgy to work with. Is there an alternative that you use and could recommend? (Usually simple tasks - like give me all the anchors listed in a UL within a particular DIV.)

Alexander Sviridenkov · April 26, 2022

Why don't you use the parser from HTML Component Library? AFAIR you own a license.

Parser supports XPath and JQuery so searching for particular nodes is quite simple.

Darian Miller · April 26, 2022

2 hours ago, Alexander Sviridenkov said:

Why don't you use the parser from HTML Component Library? AFAIR you own a license.

Parser supports XPath and JQuery so searching for particular nodes is quite simple.

Ah yes... I have only used that as a cool editor, but I'll check it out. Not sure why I didn't find "DelphiHTMLComponents.com", especially since I've been a paying customer for quite a while! LOL. Some days are diamonds, some days are stones...

My wife says I have "CRS" syndrome. (Can't Remember 'Stuff')

Alexander Sviridenkov · April 27, 2022

Sample

uses htmldraw, htmlpars;
..

var D: THtDocument;
    N: THtNode;
begin
  D := THtDocument.Create;
  try
    D.Parse('<body><div id="1"><ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><ul><a href="a3">Third</a></ul></div></body>');

    // Select the anchors inside div#1 with a jQuery-style selector...
    for N in D.JQuery('div#1 ul a') do
      ShowMessage(N['href']);

    // ...or select the same anchors with XPath.
    for N in D.XPath('//div[@id="1"]/ul/a') do
      ShowMessage(N['href']);

  finally
    D.Free;
  end;
end;

Edited April 27, 2022 by Alexander Sviridenkov

Darian Miller · April 27, 2022

Wow, thanks for the code sample too. I have some more code you can write for me. 🙂

Does canvas choice matter in this case when just parsing? Seems like a canvas is required. (I added htcanvasgdi to uses clause.)

Alexander Sviridenkov · April 27, 2022

In this case canvas doesn't matter.

Edwin Yip · April 27, 2022

Alexander's HCL is excellent, a real set of gems, and you can definitely rely on it.

Others I've successfully used in the past:

https://github.com/ying32/htmlparser (open source, simple but does work depending on your needs).

and

DIHtmlParser (commercial, it's actually more of a tokenizer, but powerful)

Fr0sT.Brutal · April 28, 2022

HTMLViewer obviously also has its own parser; however, I'm not aware whether it can be used separately from the rendering engine.

Darian Miller · April 28, 2022

On 4/26/2022 at 8:27 PM, Alexander Sviridenkov said:

In this case canvas doesn't matter.

Feature requests - allow parser usage without specifying a canvas. Also allow XPath from a particular node or chain multiple XPath statements together...currently having trouble filtering an already filtered input. In some sample HTML docs, there are header elements that separate nearly identical lists of values, with a handful of items before/after the lists. The headers have IDs, so I'd like to select the list after the header. Perhaps it's because I'm an XPath noob, but it seems easiest to simply iterate all elements with a state machine: find the header I'm interested in and grab data until the next header shows up.
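The state-machine idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use any HTML parser API, and TFlatElement, MakeElement and AnchorsAfterHeader are hypothetical names invented here. It assumes some parser has already flattened the document into a list of elements in document order, and it uses only the Delphi RTL.

```delphi
program HeaderSections;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Generics.Collections;

type
  // Hypothetical flattened view of a parsed document: one record per
  // element, in document order. A real parser would supply these values.
  TFlatElement = record
    Tag: string;   // element name, e.g. 'h2' or 'a'
    Id: string;    // id attribute, empty when absent
    Href: string;  // href attribute, empty when absent
  end;

function MakeElement(const ATag, AId, AHref: string): TFlatElement;
begin
  Result.Tag := ATag;
  Result.Id := AId;
  Result.Href := AHref;
end;

// Collects the href of every anchor that appears after the header with
// the given id and before the next header with the same tag.
function AnchorsAfterHeader(const Elements: TArray<TFlatElement>;
  const HeaderTag, HeaderId: string): TArray<string>;
var
  E: TFlatElement;
  Collecting: Boolean;
  Found: TList<string>;
begin
  Collecting := False;
  Found := TList<string>.Create;
  try
    for E in Elements do
    begin
      if SameText(E.Tag, HeaderTag) then
        // Every header switches the state: on for the wanted id, off otherwise.
        Collecting := (E.Id = HeaderId)
      else if Collecting and SameText(E.Tag, 'a') then
        Found.Add(E.Href);
    end;
    Result := Found.ToArray;
  finally
    Found.Free;
  end;
end;

var
  Doc: TArray<TFlatElement>;
  Href: string;
begin
  // Hand-built stand-in for a document with two headed lists.
  Doc := [
    MakeElement('h2', 'h1', ''),
    MakeElement('a', '', 'a1'),
    MakeElement('a', '', 'a2'),
    MakeElement('h2', 'h2', ''),
    MakeElement('a', '', 'a3')
  ];
  for Href in AnchorsAfterHeader(Doc, 'h2', 'h1') do
    Writeln(Href);
end.
```

With the sample data above, only the anchors between the first and second header are collected. The trade-off is that the loop ignores nesting entirely, so it only suits documents where the headers and lists are siblings in a predictable order.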

Vincent Parrett · April 28, 2022

3 hours ago, Darian Miller said:

Feature requests - allow parser usage without specifying a canvas.

^ This. I looked at using the parser in a console application, but the requirement for a canvas etc. brings in the VCL, which makes it a non-starter (can't use the VCL at all under Docker). I've had this issue with several well-known third-party libraries - the lack of layering and the tight coupling with the VCL in libraries really reduces their possible uses.

Alexander Sviridenkov · April 28, 2022

4 hours ago, Darian Miller said:

Feature requests - allow parser usage without specifying a canvas. Also allow XPath from a particular node or chain multiple XPath statements together...currently having trouble filtering an already filtered input. In some sample HTML docs, there are header elements that separate nearly identical lists of values, with a handful of items before/after the lists. The headers have IDs, so I'd like to select the list after the header. Perhaps it's because I'm an XPath noob, but it seems easiest to simply iterate all elements with a state machine: find the header I'm interested in and grab data until the next header shows up.

The parser itself does not depend on a canvas, the VCL, or anything else. You can use the THtmlNode class (htmlpars unit; the code is almost the same: D := THtmlNode.Create; D.Parse(..)).

THtmlNode supports XPath. For JQuery please use TStyledHTMLNode (htmlcss unit).

Both XPath and JQuery can be called from any node so chaining is possible.

Alexander Sviridenkov · April 28, 2022

54 minutes ago, Vincent Parrett said:

^ This. I looked at using the parser in a console application, but the requirement for a canvas etc. brings in the VCL, which makes it a non-starter (can't use the VCL at all under Docker). I've had this issue with several well-known third-party libraries - the lack of layering and the tight coupling with the VCL in libraries really reduces their possible uses.

HCL is layered; the HTML and CSS parsers have no dependencies on framework or OS. The class hierarchy is:

THtNode (htmlpars unit, only RTL) -> THtXMLNode (htxml unit, only RTL)
  |
THtmlNode (htmlpars unit, only RTL)
  |
TStyledHTMLNode (htmlcss unit, only RTL)
  |
TElement (htmldraw unit)
  |
THtDocument (htmldraw unit)

THtDocument/TElement use VCL/FMX units for several reasons:

1. Native controls in the HTML page (edits, combos, etc.)

2. VCL themes (theme colors support)

3. VCL/FMX canvas (THtDocument can draw on a VCL/FMX canvas)

But all graphics code (canvas classes) is isolated.

Vincent Parrett · April 28, 2022

12 minutes ago, Alexander Sviridenkov said:

HCL is layered; the HTML and CSS parsers have no dependencies on framework or OS

Ok, I was using the document class to parse - didn't realise I could use the node class directly.

Edited April 28, 2022 by Vincent Parrett
typo

Darian Miller · April 29, 2022

17 hours ago, Alexander Sviridenkov said:

HCL is layered; the HTML and CSS parsers have no dependencies on framework or OS. The class hierarchy is:

...

Note: the example code you provided fails as a canvas was not specified.

Darian Miller · April 29, 2022

17 hours ago, Alexander Sviridenkov said:

Both XPath and JQuery can be called from any node so chaining is possible.

<body><div id="1"><h2 id="h1"> text1 <ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><h2 id="h2"> text2 <ul><a href="a3">Third</a></ul></div></body>

I could not get chaining to work. On a basic example, I'd want a list of anchors for h1 and another list of anchors for h2 within a much more complex source. I tried retrieving the header by id and searching for anchors under it, but I didn't get any results. After a short time, I switched to another suggested solution (htmlparser) but ended up moving on to something else.

Alexander Sviridenkov · April 29, 2022

13 minutes ago, Darian Miller said:

Note: the example code you provided fails as a canvas was not specified.

Only when using THtDocument. THtmlNode and TStyledHTMLNode work without canvas units.

uses htmlpars;

var D: THtmlNode;
    N: THtNode;
begin
  D := THtmlNode.Create;
  try
    D.Parse('<body><div id="1"><ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><ul><a href="a3">Third</a></ul></div></body>');
    for N in D.XPath('//div[@id="1"]/ul/a') do
      ShowMessage(N['href']);
  finally
    D.Free;
  end;
end;

Alexander Sviridenkov · April 29, 2022

10 minutes ago, Darian Miller said:

<body><div id="1"><h2 id="h1"> text1 <ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><h2 id="h2"> text2 <ul><a href="a3">Third</a></ul></div></body>

I could not get chaining to work. On a basic example, I'd want a list of anchors for h1 and another list of anchors for h2 within a much more complex source. I tried retrieving the header by id and searching for anchors under it, but I didn't get any results. After a short time, I switched to another suggested solution (htmlparser) but ended up moving on to something else.

Sorry, I forgot to mention that when chaining, the current node should be passed to XPath as a parameter to prevent XPath from walking up to the root.

uses htmlpars;

var D: THtmlNode;
    N, A: THtNode;
begin
  D := THtmlNode.Create;
  try
    D.Parse('<body><h1 id="header1"><a href="a1">First</a><a href="a2">Second</a></h1><h2><div><a href="a3">Third</a></div></h2></body>');
    // Anchors under the h1 with id "header1"; N is passed as the root of the inner query.
    for N in D.XPath('//h1[@id="header1"]') do
      for A in N.XPath('//a', false, N) do
        ShowMessage(A['href']);
    // Anchors under every h2, again scoped to the current node.
    for N in D.XPath('//h2') do
      for A in N.XPath('//a', false, N) do
        ShowMessage(A['href']);
  finally
    D.Free;
  end;
end;

The second parameter is "stop after first found node"; the third is the current root node.

Darian Miller · April 29, 2022

2 hours ago, Alexander Sviridenkov said:

Sorry, I forgot to mention that when chaining, the current node should be passed to XPath as a parameter to prevent XPath from walking up to the root.

...

The second parameter is "stop after first found node"; the third is the current root node.

Thanks, I'll probably have time to try it again this weekend.

Darian Miller · April 29, 2022

On 4/26/2022 at 9:53 PM, Edwin Yip said:

https://github.com/ying32/htmlparser (open source, simple but does work depending on your needs).

About as far as I got with that tool was forking it and translating all of the Chinese strings/comments into English using Google Translate:

https://github.com/radprogrammer/htmlparser

Edited April 29, 2022 by Darian Miller

Angus Robertson · April 30, 2022

ICS includes an updated version of THTMLParser, written by Dennis Spreen some 20 years ago; it's very simple and just works.

https://svn.overbyte.be/svn/ics/trunk/Source/OverbyteIcsHtmlPars.pas

Angus

Edwin Yip · May 2, 2022

On 4/30/2022 at 3:38 AM, Darian Miller said:

About as far as I got with that tool was forking it and translating all of the Chinese strings/comments into English using Google Translate:

https://github.com/radprogrammer/htmlparser

Sounds cool!
