Darian Miller

HTML Parser alternative to MSHTML?

Every now and then I run into the task of parsing HTML.  I've used MSHTML and while it works, it always seems a bit kludgy to work with.  Is there an alternative that you use and could recommend?  (Usually simple tasks - like give me all the anchors listed in a UL within a particular DIV.)

 

 

Why don't you use the parser from HTML Component Library? AFAIR you own a license.

Parser supports XPath and JQuery so searching for particular nodes is quite simple.

2 hours ago, Alexander Sviridenkov said:

Why don't you use the parser from HTML Component Library? AFAIR you own a license.

Parser supports XPath and JQuery so searching for particular nodes is quite simple.

 

Ah yes... I have only used that as a cool editor, but I'll check it out.   Not sure why I didn't find "DelphiHTMLComponents.com", especially since I've been a paying customer for quite a while!   LOL.   Some days are diamonds, some days are stones...

My wife says I have "CRS" syndrome. (Can't Remember 'Stuff') 

 

 

Posted (edited)

Sample

uses htmldraw, htmlpars;
..

var D: THtDocument;
    N: THtNode;
begin
  D := THtDocument.Create;
  try
    D.Parse('<body><div id="1"><ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><ul><a href="a3">Third</a></ul></div></body>');

    // Select the anchors with a jQuery-style selector
    for N in D.JQuery('div#1 ul a') do
      ShowMessage(N['href']);

    // Select the same anchors with XPath
    for N in D.XPath('//div[@id="1"]/ul/a') do
      ShowMessage(N['href']);

  finally
    D.Free;
  end;
end;

 

Edited by Alexander Sviridenkov

Wow, thanks for the code sample too.  I have some more code you can write for me.  🙂

Does canvas choice matter in this case when just parsing?  Seems like a canvas is required.  (I added htcanvasgdi to uses clause.)

 


Alexander's HCL is an excellent set of gems, and you can definitely rely on it.

Others I've successfully used in the past:

 

https://github.com/ying32/htmlparser (open source, simple but does work depending on your needs).

and 

DIHtmlParser (commercial, it's actually more of a tokenizer, but powerful)


HTMLViewer also has its own parser; however, I don't know whether it can be used separately from the rendering engine.

On 4/26/2022 at 8:27 PM, Alexander Sviridenkov said:

In this case canvas doesn't matter.

Feature requests - allow parser usage without specifying a canvas.  Also allow XPath from a particular node, or chaining multiple XPath statements together... I'm currently having trouble filtering an already filtered input.  In some sample HTML docs, there are header elements that separate nearly identical lists of values, with a handful of items before/after the lists. The headers have IDs, so I'd like to select the list after the header.  Perhaps it's because I'm an XPath noob, but it seems easiest to simply iterate all elements, find the header I'm interested in, and grab data in a state machine until the next header shows up.
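
The state-machine idea described above can be sketched without any particular parser. This is only a minimal illustration, not HCL code: TFlatElement, the Elements array, and PrintAnchorsAfterHeader are hypothetical names, and the array stands in for the elements a parser would yield in document order (tag names are assumed to be lowercase).

```delphi
program HeaderSections;

{$APPTYPE CONSOLE}

type
  // Hypothetical flattened element; real code would fill this by
  // walking the parser's node tree in document order.
  TFlatElement = record
    Tag: string;   // element name, assumed lowercase
    Id: string;    // id attribute, empty if none
    Href: string;  // href attribute, empty if none
  end;

const
  // Stand-in for: header h1, its list, header h2, its list.
  Elements: array[0..6] of TFlatElement = (
    (Tag: 'h2'; Id: 'h1'; Href: ''),
    (Tag: 'ul'; Id: '';   Href: ''),
    (Tag: 'a';  Id: '';   Href: 'a1'),
    (Tag: 'a';  Id: '';   Href: 'a2'),
    (Tag: 'h2'; Id: 'h2'; Href: ''),
    (Tag: 'ul'; Id: '';   Href: ''),
    (Tag: 'a';  Id: '';   Href: 'a3')
  );

// Prints the href of every anchor that appears after the header with
// the given id and before the next header.
procedure PrintAnchorsAfterHeader(const HeaderId: string);
var
  E: TFlatElement;
  InSection: Boolean;
begin
  InSection := False;
  for E in Elements do
  begin
    if E.Tag = 'h2' then
    begin
      if InSection then
        Break;                        // the next header ends the section
      InSection := (E.Id = HeaderId); // enter the section on the wanted header
    end
    else if InSection and (E.Tag = 'a') then
      Writeln(E.Href);
  end;
end;

begin
  PrintAnchorsAfterHeader('h1');
  PrintAnchorsAfterHeader('h2');
end.
```

With the sample data, the first call lists a1 and a2 and the second lists a3. The only state is the InSection flag, which is why this approach stays simple even when the surrounding markup is messy.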

 

 

3 hours ago, Darian Miller said:

Feature requests - allow parser usage without specifying a canvas.

^ This. I looked at using the parser in a console application, but the requirement for a canvas etc. brings in the VCL, which makes it a non-starter (can't use the VCL at all under Docker). I've had this issue with several well-known third-party libraries - the lack of layering and the tight coupling with the VCL really reduce their possible uses.

4 hours ago, Darian Miller said:

Feature requests - allow parser usage without specifying a canvas.  Also allow XPath from a particular node, or chaining multiple XPath statements together... I'm currently having trouble filtering an already filtered input.  In some sample HTML docs, there are header elements that separate nearly identical lists of values, with a handful of items before/after the lists. The headers have IDs, so I'd like to select the list after the header.  Perhaps it's because I'm an XPath noob, but it seems easiest to simply iterate all elements, find the header I'm interested in, and grab data in a state machine until the next header shows up.

 

 

The parser itself does not depend on a canvas, the VCL, or anything else. You can use the THtmlNode class (htmlpars unit; the code is almost the same: D := THtmlNode.Create; D.Parse(..)).

THtmlNode supports XPath. For JQuery please use TStyledHTMLNode (htmlcss unit).

Both XPath and JQuery can be called from any node so chaining is possible.

 

54 minutes ago, Vincent Parrett said:

^ This. I looked at using the parser in a console application, but the requirement for a canvas etc. brings in the VCL, which makes it a non-starter (can't use the VCL at all under Docker). I've had this issue with several well-known third-party libraries - the lack of layering and the tight coupling with the VCL really reduce their possible uses.

HCL is layered; the HTML and CSS parsers have no dependencies on framework or OS.  The class hierarchy is:

THtNode (htmlpars unit, only RTL)  -> THtXMLNode (htxml unit, only RTL)
      |
THtmlNode (htmlpars unit, only RTL)
      |
TStyledHTMLNode (htmlcss unit, only RTL)
      |
TElement (htmldraw unit)
      |
THtDocument (htmldraw unit)

 

THtDocument/TElement use VCL/FMX units for several reasons:

1. Native controls in an HTML page (edits, combos, etc.).
2. VCL themes (theme color support).
3. VCL/FMX canvas (THtDocument can draw on a VCL/FMX canvas).

But all graphics code (the canvas classes) is isolated.

Posted (edited)
12 minutes ago, Alexander Sviridenkov said:

HCL is layered; the HTML and CSS parsers have no dependencies on framework or OS

Ok, I was using the document class to parse - didn't realise I could use the node class directly. 

Edited by Vincent Parrett
typo

17 hours ago, Alexander Sviridenkov said:

HCL is layered; the HTML and CSS parsers have no dependencies on framework or OS.  The class hierarchy is:

...

Note: the example code you provided fails as a canvas was not specified.

17 hours ago, Alexander Sviridenkov said:

Both XPath and JQuery can be called from any node so chaining is possible.

 

<body><div id="1"><h2 id="h1"> text1 <ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><h2 id="h2"> text2 <ul><a href="a3">Third</a></ul></div></body>

I could not get chaining to work.  In a basic example, I'd want a list of anchors for h1 and another list of anchors for h2 within a much more complex source.  I tried retrieving the header by id and searching for anchors under it, but I didn't get any results.  After a short time, I switched to another suggested solution (htmlparser) but ended up moving on to something else.

 

13 minutes ago, Darian Miller said:

Note: the example code you provided fails as a canvas was not specified.

Only when using THtDocument. With THtmlNode and TStyledHTMLNode it works without canvas units.

 

uses htmlpars;

var D: THtmlNode;
    N: THtNode;
begin
  D := THtmlNode.Create;
  try
    D.Parse('<body><div id="1"><ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><ul><a href="a3">Third</a></ul></div></body>');
    // Anchors inside the list of the div with id "1"
    for N in D.XPath('//div[@id="1"]/ul/a') do
      ShowMessage(N['href']);
  finally
    D.Free;
  end;
end;

 

10 minutes ago, Darian Miller said:

<body><div id="1"><h2 id="h1"> text1 <ul><a href="a1">First</a><a href="a2">Second</a></ul></div><div><h2 id="h2"> text2 <ul><a href="a3">Third</a></ul></div></body>

I could not get chaining to work.  In a basic example, I'd want a list of anchors for h1 and another list of anchors for h2 within a much more complex source.  I tried retrieving the header by id and searching for anchors under it, but I didn't get any results.  After a short time, I switched to another suggested solution (htmlparser) but ended up moving on to something else.

 

 

Sorry, I forgot to mention that when chaining, the current node should be passed to XPath as a parameter to prevent XPath from walking up to the root.

 

var D: THtmlNode;
    N, A: THtNode;
begin
  D := THtmlNode.Create;
  try
    D.Parse('<body><h1 id="header1"><a href="a1">First</a><a href="a2">Second</a></h1><h2><div><a href="a3">Third</a></div></h2></body>');
    // Anchors under the header with id "header1"
    for N in D.XPath('//h1[@id="header1"]') do
      for A in N.XPath('//a', false, N) do  // N as root keeps the search inside this node
        ShowMessage(A['href']);
    // Anchors under every h2
    for N in D.XPath('//h2') do
      for A in N.XPath('//a', false, N) do
        ShowMessage(A['href']);
  finally
    D.Free;
  end;
end;

The second parameter is "stop after first found node"; the third is the current root node.
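
As a hedged sketch of the second parameter, the variation below passes True so that the inner search should stop at the first anchor under the matched header. It reuses only the calls shown in this thread (THtmlNode.Create, Parse, the three-argument XPath, and the ['href'] accessor). The procedure name ShowFirstAnchor is hypothetical, and the effect of True is assumed from the description above rather than checked against the library.

```delphi
program XPathFirstMatch;

{$APPTYPE CONSOLE}

uses
  htmlpars;  // HCL parser unit, as used in the samples in this thread

// Hypothetical helper: prints the href of the first anchor under the
// header with id "header1".
procedure ShowFirstAnchor;
var
  D: THtmlNode;
  N, A: THtNode;
begin
  D := THtmlNode.Create;
  try
    D.Parse('<body><h1 id="header1"><a href="a1">First</a><a href="a2">Second</a></h1></body>');
    for N in D.XPath('//h1[@id="header1"]') do
      // True  = stop after the first found node (assumed behavior)
      // N     = current root, so the search stays inside this header
      for A in N.XPath('//a', True, N) do
        Writeln(A['href']);
  finally
    D.Free;
  end;
end;

begin
  ShowFirstAnchor;
end.
```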

2 hours ago, Alexander Sviridenkov said:

 

Sorry, I forgot to mention that when chaining, the current node should be passed to XPath as a parameter to prevent XPath from walking up to the root.

...

Second parameter is "stop after first found node", third is current root node.

 

Thanks, I'll probably have time to try it again this weekend.  

Posted (edited)
On 4/26/2022 at 9:53 PM, Edwin Yip said:

https://github.com/ying32/htmlparser (open source, simple but does work depending on your needs).

 

 

About as far as I got with that tool was forking it and translating all of the Chinese strings/comments into English using Google Translate:

 

https://github.com/radprogrammer/htmlparser

 

Edited by Darian Miller
