David Schwartz

TWebBrowser + dynamic JS question


My earlier question isn't getting me what I'm looking for, so I thought I'd rephrase it.

 

How does one obtain the final user-visible data from TWebBrowser after all the dynamic JS code that's loading up result data on-the-fly has been executed? (I'm using VCL in this case; a solution in FMX would be good to see as well, if it's different.)

 

I need to be able to get the visual contents of a Google search results page -- but I think it could be just about ANY kind of web page that uses dynamic JS to fill data fields when the page is loaded. The well-documented methods I've found thus far all return a bunch of JS code, not the final results they produce. I want what the user sees, not what the browser initially gets from the server.

 

I wonder if the ultimate end result of this paradigm shift is that web page queries will return a super simple HTML document containing one line in the HEAD or BODY that triggers a JS query like "<script>loadpage(sessionID,...);</script>", with everything the user sees rendered through recursive JS calls on a field-by-field basis (most if not all running in parallel).


Are you certain that JS is enabled and triggered on page load?

Have you tried the embedded Chrome browser to see if that makes any difference?
https://github.com/salvadordf/CEF4Delphi

I think your last sentence is already in effect for a large number of sites, as it makes it harder to crawl content tags and references if they are generated dynamically by JS.

42 minutes ago, Lars Fosdal said:

Are you certain that JS is enabled and triggered on page load?
 

What I can say with certainty is that every documented method I've used for extracting content from some of these web pages returns big chunks of JS code that's not readable to humans, even as the web browsers I'm extracting it from do, in fact, display a human-readable web page. As far as how or when that happens, I don't know exactly. That's why I'm asking here.

 

TWebBrowser controls display these pages correctly, or people would not be using said controls, because among other things, Google searches would display as nonsense.

 

My question is, what Delphi code is needed to extract the ultimately rendered HTML that exists inside of said controls and is made visible to users?

 

BTW, I'm sure this sort of page encoding does, indeed, make it harder to crawl and scrape using existing tools. I think that's part of its intended use in many cases.


Shouldn't this be possible with just TWebBrowser?

 

Look at https://stackoverflow.com/a/22518562/1037511

 

Quote

As far as I can see, this is the actual state of the document, including any changes that are made by Javascript.

 

So it seems like outerHTML should give you the rendered HTML. Make sure to read it after the document has finished loading.

 

Posted (edited)
2 hours ago, rvk said:

Shouldn't this be possible with just TWebBrowser?

 

Look at https://stackoverflow.com/a/22518562/1037511

 

 

So it seems like outerHTML should give you the rendered HTML. Make sure to read it after the document has finished loading.

 

Try it on a Google search or on some news sites, and let me know what you find.

Edited by David Schwartz

Posted (edited)
On 8/13/2019 at 3:48 PM, David Schwartz said:

Try it on a Google search or on some news sites, and let me know what you find.

Why should I try it????

But alright... I tried and it works.

 

With this source file

Test
<div id=content>
Some other content
</div>

<script type = "text/JavaScript">
document.getElementById("content").innerHTML = "Changed content";
</script>

I ran this Delphi code:

var
  d: OleVariant;
begin
  // Late-bound access to the live DOM, read after the page has loaded
  d := WebBrowser1.Document;
  Memo1.Lines.Text := d.documentElement.outerHTML;
end;

and got this output:

<HTML><HEAD></HEAD>
<BODY>Test 
<DIV id=content>Changed content</DIV>
<SCRIPT type=text/JavaScript>
document.getElementById("content").innerHTML = "Changed content";
</SCRIPT>
</BODY></HTML>

Note the "Changed content" between the divs of id=content !!

 

The JavaScript itself obviously stays there because it's part of the source, but the content is shown the way it appears to the end user after the JavaScript has run. The DOM even adds the missing HTML, HEAD and BODY tags.

 

 

 

 

Edited by rvk

2 hours ago, rvk said:

Why should I try it????

But alright... I tried and it works.

 

Thanks, you created a trivial example to prove YOUR case, not what I'm dealing with. BRAVO!

 

Meanwhile, did you try it with a Google query, or from one of the news sites? If you did, you'd see what I'm pointing at.

 

A guy on SO boldly asserted, "Just strip out the SCRIPT tags and parse the BODY." Yeah, well, what do you do when a site returns <HTML><HEAD>...</HEAD></HTML> with NO BODY TAG? It's just a big chunk of <SCRIPT>...</SCRIPT> tags in the HEAD.

 

But, hey, your example works fine. Unfortunately, that's not what Google or these sites are returning.

 

I understand very clearly how this is supposed to work IN THEORY.

 

In PRACTICE ... you're going to have to actually run one of these (really simple) queries and see what I'm seeing, or you're going to keep implying that I'm hallucinating.


Run a Google query for something simple like "dentists in baltimore" and see what your approach yields. Because that's what I'm dealing with and what I'm asking about.

 

Clearly, nobody who's responded to my questions has bothered to look at what these sites are actually serving up these days!

 

Posted (edited)

To save you the time of adding a line of code to run a query, here's what outerHTML gives you based on what Google returns for "dentist in baltimore".

 

Feel free to strip out the javascript and show me the text that resembles what the user sees when this query is run directly on google.com.

 

<html lang="en" itemtype="http://schema.org/SearchResultsPage" itemscope=""><head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>dentist in baltimore - Google Search</title><script nonce="Xp+ZL1qXiueWiJzqZqCD8g==">(function(){window.google={kEI:'Wk5UXdCiNs_7-gTLw6jgBQ',kEXPI:'31',authuser:0,kscs:'c9c918f0_Wk5UXdCiNs_7-gTLw6jgBQ',kGL:'US',kBL:'d31E'};google.sn='web';google.kHL='en';google.jsfs='Ffpdje';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.getLEI=function(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b};google.https=function(){return"https:"==window.location.protocol};google.ml=function(){return null};google.time=function(){return(new Date).getTime()};google.log=function(a,b,e,c,g){if(a=google.logUrl(a,b,e,c,g)){b=new Image;var d=google.lc,f=google.li;d[f]=b;b.onerror=b.onload=b.onabort=function(){delete d[f]};google.vel&&google.vel.lu&&google.vel.lu(a);b.src=a;google.li=f+1}};google.logUrl=function(a,b,e,c,g){var d="",f=google.ls||"";e||-1!=b.search("&ei=")||(d="&ei="+google.getEI(c),-1==b.search("&lei=")&&(c=google.getLEI(c))&&(d+="&lei="+c));c="";!e&&google.cshid&&-1==b.search("&cshid=")&&"slh"!=a&&(c="&cshid="+google.cshid);a=e||"/"+(g||"gen_204")+"?atyp=i&ct="+a+"&cad="+b+d+f+"&zx="+google.time()+c;/^http:/i.test(a)&&google.https()&&(google.ml(Error("a"),!1,{src:a,glmm:1}),a="");return a};}).call(this);(function(){google.y={};google.x=function(a,b){if(a)var c=a.id;else{do 
c=Math.random();while(google.y[c])}google.y[c]=[a,b];return!1};google.lm=[];google.plm=function(a){google.lm.push.apply(google.lm,a)};google.lq=[];google.load=function(a,b,c){google.lq.push([[a],b,c])};google.loadAll=function(a,b){google.lq.push([a,b])};}).call(this);google.f={};google.arwt=function(a){a.href=document.getElementById(a.id.substring(1)).href;return!0};(function(){var g=this||self,k=Date.now||function(){return+new Date};var t={};var v=function(b,d){if(null===d)return!1;if("contains"in b&&1==d.nodeType)return b.contains(d);if("compareDocumentPosition"in b)return b==d||!!(b.compareDocumentPosition(d)&16);for(;d&&b!=d;)d=d.parentNode;return d==b};var w=function(b,d){return function(a){a||(a=window.event);return d.call(b,a)}},B=function(b){b=b.target||b.srcElement;!b.getAttribute&&b.parentNode&&(b=b.parentNode);return b},C="undefined"!=typeof navigator&&/Macintosh/.test(navigator.userAgent),D="undefined"!=typeof navigator&&!/Opera/.test(navigator.userAgent)&&/WebKit/.test(navigator.userAgent),E={A:1,INPUT:1,TEXTAREA:1,SELECT:1,BUTTON:1},aa=function(){this._mouseEventsPrevented=!0},F={A:13,BUTTON:0,CHECKBOX:32,COMBOBOX:13,GRIDCELL:13,LINK:13,LISTBOX:13,MENU:0,MENUBAR:0,MENUITEM:0,MENUITEMCHECKBOX:0,MENUITEMRADIO:0,OPTION:0,RADIO:32,RADIOGROUP:32,RESET:0,SUBMIT:0,SWITCH:32,TAB:0,TREE:13,TREEITEM:13},G={CHECKBOX:!0,OPTION:!0,RADIO:!0},H={COLOR:!0,DATE:!0,DATETIME:!0,"DATETIME-LOCAL":!0,EMAIL:!0,MONTH:!0,NUMBER:!0,PASSWORD:!0,RANGE:!0,SEARCH:!0,TEL:!0,TEXT:!0,TEXTAREA:!0,TIME:!0,URL:!0,WEEK:!0},ba={A:!0,AREA:!0,BUTTON:!0,DIALOG:!0,IMG:!0,INPUT:!0,LINK:!0,MENU:!0,OPTGROUP:!0,OPTION:!0,PROGRESS:!0,SELECT:!0,TEXTAREA:!0};var I=function(){this.h=this.a=null},K=function(b,d){var a=J;a.a=b;a.h=d;return a};I.prototype.g=function(){var b=this.a;this.a&&this.a!=this.h?this.a=this.a.__owner||this.a.parentNode:this.a=null;return b};var L=function(){this.i=[];this.a=0;this.h=null;this.j=!1};L.prototype.g=function(){if(this.j)return J.g();if(this.a!=this.i.length){var 
b=this.i[this.a];this.a++;b!=this.h&&b&&b.__owner&&(this.j=!0,K(b.__owner,this.h));return b}return null};var J=new I,M=new L;var P=function(){this.o=[];this.a=[];this.g=[];this.j={};this.h=null;this.i=[];O(this,"_custom")},ca="undefined"!=typeof navigator&&/iPhone|iPad|iPod/.test(navigator.userAgent),Q=String.prototype.trim?function(b){return b.trim()}:function(b){return b.replace(/^\s+/,"").replace(/\s+$/,"")},da=/\s*;\s*/,ia=function(b,d){return function(a){var c=d;if("_custom"==c){c=a.detail;if(!c||!c._type)return;c=c._type}var e;if("click"==c&&(C&&a.metaKey||!C&&a.ctrlKey||2==a.which||null==a.which&&4==a.button||a.shiftKey))c=
"clickmod";else{var f=a.which||a.keyCode||a.key;D&&3==f&&(f=13);if(13!=f&&32!=f)f=!1;else{var l=B(a),h=(l.getAttribute("role")||l.type||l.tagName).toUpperCase(),m;(m="keydown"!=a.type||!("getAttribute"in l&&!((l.getAttribute("type")||l.tagName).toUpperCase()in H)&&"BUTTON"!=l.tagName.toUpperCase()&&!l.isContentEditable)||a.ctrlKey||a.shiftKey||a.altKey||a.metaKey||(l.getAttribute("type")||l.tagName).toUpperCase()in G&&32==f)||((m=l.tagName in E)||(m=l.getAttributeNode("tabindex"),m=null!=m&&m.specified),m=!(m&&!l.disabled));m?f=!1:(l="INPUT"!=l.tagName.toUpperCase()||l.type,m=!(h in F)&&13==f,f=(0==F[h]%f||m)&&!!l)}f&&(c="clickkey")}h=a.srcElement||a.target;f=R(c,a,h,"",null);a.path?(M.i=a.path,M.a=0,M.h=this,M.j=!1,l=M):l=K(h,this);for(;m=l.g();){m=e=m;var n=c;var p=m.__jsaction;if(!p){var u;p=null;"getAttribute"in m&&(p=m.getAttribute("jsaction"));if(u=p){p=t[u];if(!p){p={};for(var x=u.split(da),ea=x?x.length:0,y=0;y<ea;y++){var r=x[y];if(r){var z=r.indexOf(":"),N=-1!=z,fa=N?Q(r.substr(0,z)):"click";r=N?Q(r.substr(z+1)):r;p[fa]=r}}t[u]=p}m.__jsaction=p}else p=ha,m.__jsaction=p}m=p;"clickkey"==n?n="click":"click"!=n||m.click||(n="clickonly");n={m:n,action:m[n]||"",event:null,s:!1};f=R(n.m,n.event||a,h,n.action||"",e,f.timeStamp);if(n.s||n.action)break}f&&"touchend"==f.eventType&&(f.event._preventMouseEvents=aa);if(n&&n.action){if(h="clickkey"==c)h=B(a),h=(h.type||h.tagName).toUpperCase(),(h=32==(a.which||a.keyCode||a.key)&&"CHECKBOX"!=h)||(h=B(a),l=(h.getAttribute("role")||h.tagName).toUpperCase(),h="BUTTON"==l?!0:!(h.tagName.toUpperCase()in ba)||"A"==l||"SELECT"==l||(h.getAttribute("type")||h.tagName).toUpperCase()in G||(h.getAttribute("type")||h.tagName).toUpperCase()in H?!1:!0);h&&(a.preventDefault?a.preventDefault():a.returnValue=!1);if("mouseenter"==c||"mouseleave"==c)if(h=a.relatedTarget,!("mouseover"==a.type&&"mouseenter"==c||"mouseout"==a.type&&"mouseleave"==c)||h&&(h===e||v(e,h)))f.action="",f.actionElement=null;else{c={};for(var q in 
a)"function"!==typeof a[q]&&"srcElement"!==q&&
"target"!==q&&(c[q]=a[q]);c.type="mouseover"==a.type?"mouseenter":"mouseleave";c.target=c.srcElement=e;c.bubbles=!1;f.event=c;f.targetElement=e}}else f.action="",f.actionElement=null;e=f;b.h&&(q=R(e.eventType,e.event,e.targetElement,e.action,e.actionElement,e.timeStamp),"clickonly"==q.eventType&&(q.eventType="click"),b.h(q,!0));if(e.actionElement){if(b.h)"A"!=e.actionElement.tagName||"click"!=e.eventType&&"clickmod"!=e.eventType||(a.preventDefault?a.preventDefault():a.returnValue=!1),b.h(e);else{if((q=
g.document)&&!q.createEvent&&q.createEventObject)try{var A=q.createEventObject(a)}catch(la){A=a}else A=a;e.event=A;b.i.push(e)}if("touchend"==e.event.type&&e.event._mouseEventsPrevented){a=e.event;for(var ma in a);k()}}}},R=function(b,d,a,c,e,f){return{eventType:b,event:d,targetElement:a,action:c,actionElement:e,timeStamp:f||k()}},ha={},ja=function(b,d){return function(a){var c=b,e=d,f=!1;"mouseenter"==c?c="mouseover":"mouseleave"==c&&(c="mouseout");if(a.addEventListener){if("focus"==c||"blur"==c||
"error"==c||"load"==c)f=!0;a.addEventListener(c,e,f)}else a.attachEvent&&("focus"==c?c="focusin":"blur"==c&&(c="focusout"),e=w(a,e),a.attachEvent("on"+c,e));return{m:c,l:e,capture:f}}},O=function(b,d){if(!b.j.hasOwnProperty(d)){var a=ia(b,d),c=ja(d,a);b.j[d]=a;b.o.push(c);for(a=0;a<b.a.length;++a){var e=b.a[a];e.g.push(c.call(null,e.a))}"click"==d&&O(b,"keydown")}};P.prototype.l=function(b){return this.j[b]};var V=function(b,d){var a=new ka(d),c;a:{for(c=0;c<b.a.length;c++)if(S(b.a[c],d)){c=!0;break a}c=!1}if(c)return b.g.push(a),a;T(b,a);b.a.push(a);U(b);return a},U=function(b){for(var d=b.g.concat(b.a),a=[],c=[],e=0;e<b.a.length;++e){var f=b.a[e];W(f,d)?(a.push(f),X(f)):c.push(f)}for(e=0;e<b.g.length;++e)f=b.g[e],W(f,d)?a.push(f):(c.push(f),T(b,f));b.a=c;b.g=a},T=function(b,d){var a=d.a;ca&&(a.style.cursor="pointer");for(a=0;a<b.o.length;++a)d.g.push(b.o[a].call(null,d.a))},Y=function(b,d){b.h=d;b.i&&
(0<b.i.length&&d(b.i),b.i=null)},ka=function(b){this.a=b;this.g=[]},S=function(b,d){for(var a=b.a,c=d;a!=c&&c.parentNode;)c=c.parentNode;return a==c},W=function(b,d){for(var a=0;a<d.length;++a)if(d[a].a!=b.a&&S(d[a],b.a))return!0;return!1},X=function(b){for(var d=0;d<b.g.length;++d){var a=b.a,c=b.g[d];a.removeEventListener?a.removeEventListener(c.m,c.l,c.capture):a.detachEvent&&a.detachEvent("on"+c.m,c.l)}b.g=[]};var Z=new P;V(Z,window.document.documentElement);O(Z,"click");O(Z,"focus");O(Z,"focusin");O(Z,"blur");O(Z,"focusout");O(Z,"error");O(Z,"load");O(Z,"change");O(Z,"dblclick");O(Z,"input");O(Z,"keyup");O(Z,"keydown");O(Z,"keypress");O(Z,"mousedown");O(Z,"mouseenter");O(Z,"mouseleave");O(Z,"mouseout");O(Z,"mouseover");O(Z,"mouseup");O(Z,"paste");O(Z,"touchstart");O(Z,"touchend");O(Z,"touchcancel");O(Z,"speech");(function(b){google.jsad=function(d){Y(b,d)};google.jsaac=function(d){return V(b,d)};google.jsarc=function(d){X(d);for(var a=!1,c=0;c<b.a.length;++c)if(b.a[c]===d){b.a.splice(c,1);a=!0;break}if(!a)for(a=0;a<b.g.length;++a)if(b.g[a]===d){b.g.splice(a,1);break}U(b)}})(Z);window.gws_wizbind=function(b){return{trigger:function(d){var a=b.l(d.type);a||(O(b,d.type),a=b.l(d.type));var c=d.target||d.srcElement;a&&a.call(c.ownerDocument.documentElement,d)},bind:function(d){Y(b,d)}}}(Z);}).call(this);(function(){var b=[];google.jsc={xx:b,x:function(a){b.push(a)},mm:[],m:function(a){google.jsc.mm.length||(google.jsc.mm=a)}};}).call(this);(function(){google.c={};(function(){var f=window.performance;var g=function(a,b,c){a.addEventListener?a.addEventListener(b,c,!1):a.attachEvent&&a.attachEvent("on"+b,c)};google.timers={};google.startTick=function(a){google.timers[a]={t:{start:google.time()},e:{},m:{}}};google.tick=function(a,b,c){google.timers[a]||google.startTick(a);c=void 0!==c?c:google.time();b instanceof Array||(b=[b]);for(var e=0,d;d=b[e++];)google.timers[a].t[d]=c};google.c.e=function(a,b,c){google.timers[a].e[b]=c};google.c.b=function(a){var 
b=google.timers.load.m;b[a]&&google.ml(Error("a"),!1,{m:a});b[a]=!0};google.c.u=function(a){var b=google.timers.load.m;if(b[a]){b[a]=!1;for(a in b)if(b[a])return;google.csiReport()}else google.ml(Error("b"),!1,{m:a})};google.rll=function(a,b,c){var e=function(d){c(d);d=e;a.addEventListener?a.removeEventListener("load",d,!1):a.attachEvent&&a.detachEvent("onload",d);d=e;a.addEventListener?a.removeEventListener("error",d,!1):a.attachEvent&&a.detachEvent("onerror",d)};g(a,"load",e);b&&g(a,"error",e)};google.aft=function(a){a.setAttribute("data-iml",google.time())};google.startTick("load");var h=google.timers.load;a:{var k=h.t;if(f){var l=f.timing;if(l){var m=l.navigationStart,n=l.responseStart;if(n>m&&n<=k.start){k.start=n;h.wsrt=n-m;break a}}f.now&&(h.wsrt=Math.floor(f.now()))}}google.c.b("pr");google.c.b("xe");}).call(this);})();</script></head></html>

 

Edited by David Schwartz

16 minutes ago, David Schwartz said:

Clearly, nobody who's responded to my questions has bothered to look at what these sites are actually serving up these days!

 

I already told you that this is not something Google wants you to be able to do; they are very good at preventing it and are continuously changing their code.

 

Good luck

 


Not to mention, once you get it to work, you will very quickly face CAPTCHAs on every Google page for your IP.

Posted (edited)
32 minutes ago, Attila Kovacs said:

I already told you that this is not something Google wants you to be able to do; they are very good at preventing it and are continuously changing their code.

 

Good luck

 

It's not just Google. More and more sites are doing stuff like this.

 

What nobody seems to want to admit is that all the assertions people keep making about outerHTML returning what the user sees are basically bunk.

 

There is user-visible text being rendered inside of this control, and outerHTML is not providing access to it in all cases.

 

The question remains ... how can you access the content that's being rendered for viewing by the user?

 

Up until this point the stock reply is to use outerHTML. Now that we've established it doesn't work in all cases, what DOES work instead?

 

This is a technical question. I don't care about the politics.

Edited by David Schwartz

Posted (edited)
1 hour ago, David Schwartz said:

Thanks, you created a trivial example to prove YOUR case, not what I'm dealing with. BRAVO!

You might want to dial down the sarcasm. I'm trying to help you here, and with that attitude I'm feeling less and less inclined to do so.

 

With my example I retrieved your search result and got this (this is a snippet just to show you the content is there).

 

<div class="I6vAHd h5RoYd ads-creative">Dr. Lazer is a <b>Baltimore Dentist</b> Dedicated To Quality <b>Dental</b> Care. Financing Available. Patient Focused <b>Dentistry</b>. Top <b>Baltimore Dentists</b>. Advanced Training. Services: Cosmetic <b>dentistry</b>, General <b>dentistry</b>, Porcelain veneers, Teeth whitening, <b>Dental</b> implants, Cosmetic dentures, <b>Dental</b> crowns.</div><ul class="OkkX2d"><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/our-office/ed-lazer-dds/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoAHoECA8QBA" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABABGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_0RkT0b2u8xR9l7_LEQw9VUouKjzQ&adurl=&rct=j&q=">Meet Dr. 
Ed Lazer</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/for-patients/special-offers/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoAXoECA8QBQ" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABACGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_2gU3L1cLsoIxMZDy60AJIwH1-G4w&adurl=&rct=j&q=">Special Offers</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/smile-gallery/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoAnoECA8QBg" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABADGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_0Eq9nb4XnF90RRE1Ik35kT7kFBQQ&adurl=&rct=j&q=">Smile Gallery</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/request-appointment/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoA3oECA8QBw" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABAEGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_2N4qM3fV1hZgFsH22ko3zEIjVzFQ&adurl=&rct=j&q=">Schedule Appointment</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return 
google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/contact/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoBHoECA8QCA" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABAFGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_0sMvnrhd-h49qKlc3LKos0K2h67w&adurl=&rct=j&q=">Contact Us</a></li></ul></li><li class="ads-ad" data-hveid="CBAQAA" data-bg="1"><div class="ad_cclk"><a id="n1s0p2c0" style="display: none;" href="https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABAGGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_38aJbvxP_v7qoqtk0s9Byp6f8NQw&rct=j&q=&ved=2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQ0Qx6BAgQEAE&adurl="></a><a class="V0MxL r-ieStqovnU5rk" id="vn1s0p2c0" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" href="https://www.aspendental.com/dentist/md/dundalk/1401-merritt-blvd" jsl="$t t-r1glFWqNI5A;$x 0;"><h3 class="sA5rQ">Official Aspen Dental | Affordable dentistry‎</h3><br><div class="ads-visurl">

You see that text: Dr. Lazer is a <b>Baltimore Dentist</b> Dedicated To Quality <b>Dental</b> Care. Financing Available.

 

Is that what you are after? (this is from the outerHTML)

Yes, it's riddled with JavaScript and lots of tags. But it's valid HTML containing the content shown on the user's screen.

 

If it's not, please provide some text that you did expect.

 

BTW, you do need to read outerHTML in the OnDocumentComplete handler (WebBrowser1DocumentComplete) because the JavaScript needs time to run. But I assumed you already knew this.
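
As a rough illustration of the point above, here is a minimal sketch of such a handler. It assumes a VCL form class TForm1 with a TWebBrowser named WebBrowser1 and a TMemo named Memo1 (the same names used in the earlier snippet), that the method is assigned to WebBrowser1.OnDocumentComplete, and that Variants and SHDocVw are in the unit's uses clause. Pages that keep loading data asynchronously after this event fires may still need a later read.

```pascal
// Sketch only: TForm1, WebBrowser1 and Memo1 are assumed names.
// Older Delphi versions declare the URL parameter as "var" instead of "const".
procedure TForm1.WebBrowser1DocumentComplete(ASender: TObject;
  const pDisp: IDispatch; const URL: OleVariant);
var
  d: OleVariant;
begin
  // The event can fire once per frame; each call simply refreshes the
  // memo with the current state of the top-level document.
  d := WebBrowser1.Document;
  if VarIsClear(d) then
    Exit; // no document available yet
  // outerHTML reflects the live DOM, including changes made by scripts
  Memo1.Lines.Text := d.documentElement.outerHTML;
end;
```

Reading the DOM here rather than in OnNavigateComplete2 matters because the latter fires when navigation to the URL has succeeded, which can be before the page's scripts have filled in the content.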

 

 

 

Edited by rvk
58 minutes ago, rvk said:

You might want to dial down the sarcasm. I'm trying to help you here, and with that attitude I'm feeling less and less inclined to do so.

 

<snip>

 

BTW, you do need to read outerHTML in the OnDocumentComplete handler (WebBrowser1DocumentComplete) because the JavaScript needs time to run. But I assumed you already knew this.

 

Sorry, but this is the first time anybody has suggested that there could be a timing issue at work here, and your example certainly wouldn't have uncovered it.

 

I assumed that the OnNavigateComplete2 actually meant that the processing was "complete". I've never had a need to use OnDocumentComplete before, and I've never even seen it used in any examples. But now that you mention it, it makes sense.

 

So I modified my code to read outerHTML in OnDocumentComplete instead of OnNavigateComplete2, and it seems to be returning what I'm looking for. YES!

 

THANK YOU! 


I did some research about this topic.

It seems that one should use a "headless browser" with Python.

But I am trying to use Delphi.

 

So I tested outerHTML, innerHTML, etc., and I also tested another way: copy/paste.

I can get the HTML text, but it still contains some JS code and HTML tags. I cannot get plain text that the user can read at all.

49 minutes ago, pcplayer99 said:

I cannot get plain text that the user can read at all.

You can handle the TWebBrowser OnDocumentComplete event (WebBrowser1DocumentComplete) and read WebBrowser1.Document.documentElement.outerHTML there.

(as discussed above)

You can strip out any script tags and HTML tags, and you end up with the plain text.

(You should add line breaks at div and br tags, though; otherwise all the text ends up on one line.)

 

Or you might look for an HTML2TEXT function, which does all that work for you.

But the readable text is there in the .outerHTML.
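
A different route from stripping tags by hand, offered only as a hedged sketch: the MSHTML DOM exposes an innerText property on elements, and reading it on the body should return roughly the rendered text without markup. The method name ShowVisibleText is hypothetical and would need to be declared in the form class; TForm1, WebBrowser1 and Memo1 are assumed names, and Variants must be in the uses clause.

```pascal
// Sketch only: ShowVisibleText is a hypothetical helper, not part of the VCL.
// Call it once the document has completed loading.
procedure TForm1.ShowVisibleText;
var
  d: OleVariant;
begin
  d := WebBrowser1.Document;
  // Some pages have no BODY element until their scripts have run,
  // so check both the document and its body before reading.
  if VarIsClear(d) or VarIsClear(d.body) then
    Exit;
  // innerText is the text content of the element without HTML tags
  Memo1.Lines.Text := d.body.innerText;
end;
```

Calling this from the OnDocumentComplete handler discussed above keeps the timing consistent with the outerHTML approach; whether the line breaks match what you need is something to check per site.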

 
