David Schwartz 426 Posted August 12, 2019
My earlier question isn't getting me what I'm looking for, so I thought I'd rephrase it.

How does one obtain the final user-visible data from TWebBrowser after all the dynamic JS code that loads result data on the fly has executed? (I'm using VCL in this case; a solution in FMX would be good to see as well, if it's different.)

I need to be able to get the visual contents of a Google search results page -- but I think it could be just about ANY kind of web page that uses dynamic JS to fill data fields when the page is loaded. The well-documented methods I've found thus far all return a bunch of JS code, not the final results that code produces. I want what the user sees, not what the browser initially gets from the server.

I wonder if the ultimate end result of this paradigm shift is that web page queries are going to return a super simple HTML document containing one line in the HEAD or BODY that triggers a JS query like "<script>loadpage(sessionID,...);</script>", with everything the user sees rendered through recursive JS calls on a field-by-field basis (most if not all of them running in parallel).
Lars Fosdal 1792 Posted August 13, 2019
Are you certain that JS is enabled and triggered on page load? Have you tried the embedded Chrome browser to see if that makes any difference? https://github.com/salvadordf/CEF4Delphi

I think your last sentence is already in effect for a large number of sites, as it makes it harder to crawl content tags and references if they are generated dynamically by JS.
David Schwartz 426 Posted August 13, 2019
42 minutes ago, Lars Fosdal said: "Are you certain that JS is enabled and triggered on page load?"

What I can say with certainty is that every documented method I've used for extracting content from some of these web pages returns big chunks of JS code that are not readable to humans, even though the web browsers I'm extracting it from do, in fact, display a human-readable web page. As for how or when that happens, I don't know exactly. That's why I'm asking here.

TWebBrowser controls display these pages correctly, or people would not be using them, because among other things, Google searches would display as nonsense. My question is: what Delphi code is needed to extract the ultimately rendered HTML that exists inside these controls and is made visible to users?

BTW, I'm sure this sort of page encoding does, indeed, make it harder to crawl and scrape using existing tools. I think that's part of its intended use in many cases.
rvk 33 Posted August 13, 2019
Shouldn't this be possible with just TWebBrowser? Look at https://stackoverflow.com/a/22518562/1037511

Quote: "As far as I can see, this is the actual state of the document, including any changes that are made by Javascript."

So it seems like outerHTML should give you the rendered HTML. Make sure to read it after the document has finished loading.
David Schwartz 426 Posted August 13, 2019
2 hours ago, rvk said: "Shouldn't this be possible with just TWebBrowser? Look at https://stackoverflow.com/a/22518562/1037511 So its seems like OuterHtml should give you the rendered HTML. Make sure to get this after the document is completed loading."

Try it on a Google search or on some news sites, and let me know what you find.

Edited August 13, 2019 by David Schwartz
rvk 33 Posted August 14, 2019
On 8/13/2019 at 3:48 PM, David Schwartz said: "Try it on a Google search or on some news sites, and let me know what you find."

Why should I try it???? But alright... I tried and it works.

With this source file:

  Test
  <div id=content>
  Some other content
  </div>
  <script type = "text/JavaScript">
  document.getElementById("content").innerHTML = "Changed content";
  </script>

and this Delphi code:

  var
    d: OleVariant;
  begin
    d := WebBrowser1.Document;
    Memo1.Lines.Text := d.documentElement.outerHTML;
  end;

I got this:

  <HTML><HEAD></HEAD>
  <BODY>Test
  <DIV id=content>Changed content</DIV>
  <SCRIPT type=text/JavaScript>
  document.getElementById("content").innerHTML = "Changed content";
  </SCRIPT>
  </BODY></HTML>

Note the "Changed content" inside the div with id=content!! The JavaScript itself obviously stays there because it's part of the source, but the content is what the end user sees after the JavaScript has executed. The DOM even adds the missing HTML, HEAD and BODY tags.

Edited August 14, 2019 by rvk
David Schwartz 426 Posted August 14, 2019
2 hours ago, rvk said: "Why should I try it???? But alright... I tried and it works."

Thanks, you created a trivial example to prove YOUR case, not what I'm dealing with. BRAVO!

Meanwhile, did you try it with a Google query, or on one of the news sites? If you did, you'd see what I'm pointing at.

A guy on SO boldly asserted, "Just strip out the SCRIPT tags and parse the BODY." Yeah, well, what do you do when a site returns <HTML><HEAD>...</HEAD></HTML> with NO BODY TAG? It's just a big chunk of <SCRIPT>...</SCRIPT> tags in the HEADer.

But, hey, your example works fine. Unfortunately, that's not what Google or these sites are returning.

I understand very clearly how this is supposed to work IN THEORY. In PRACTICE ... you're going to have to actually run one of these (really simple) queries and see what I'm seeing, or you're going to keep implying that I'm hallucinating.

Run a Google query for something simple like "dentists in baltimore" and see what your approach yields. Because that's what I'm dealing with and what I'm asking about.

Clearly, nobody who's responded to my questions has bothered to look at what these sites are actually serving up these days!
David Schwartz 426 Posted August 14, 2019 (edited) To save you the time of adding a line of code to run a query, here's what outerHTML gives you based on what Google returns for "dentist in baltimore". Feel free to strip out the javascript and show me the text that resembles what the user sees when this query is run directly on google.com. <html lang="en" itemtype="http://schema.org/SearchResultsPage" itemscope=""><head><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>dentist in baltimore - Google Search</title><script nonce="Xp+ZL1qXiueWiJzqZqCD8g==">(function(){window.google={kEI:'Wk5UXdCiNs_7-gTLw6jgBQ',kEXPI:'31',authuser:0,kscs:'c9c918f0_Wk5UXdCiNs_7-gTLw6jgBQ',kGL:'US',kBL:'d31E'};google.sn='web';google.kHL='en';google.jsfs='Ffpdje';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.getLEI=function(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b};google.https=function(){return"https:"==window.location.protocol};google.ml=function(){return null};google.time=function(){return(new Date).getTime()};google.log=function(a,b,e,c,g){if(a=google.logUrl(a,b,e,c,g)){b=new Image;var d=google.lc,f=google.li;d[f]=b;b.onerror=b.onload=b.onabort=function(){delete d[f]};google.vel&&google.vel.lu&&google.vel.lu(a);b.src=a;google.li=f+1}};google.logUrl=function(a,b,e,c,g){var 
d="",f=google.ls||"";e||-1!=b.search("&ei=")||(d="&ei="+google.getEI(c),-1==b.search("&lei=")&&(c=google.getLEI(c))&&(d+="&lei="+c));c="";!e&&google.cshid&&-1==b.search("&cshid=")&&"slh"!=a&&(c="&cshid="+google.cshid);a=e||"/"+(g||"gen_204")+"?atyp=i&ct="+a+"&cad="+b+d+f+"&zx="+google.time()+c;/^http:/i.test(a)&&google.https()&&(google.ml(Error("a"),!1,{src:a,glmm:1}),a="");return a};}).call(this);(function(){google.y={};google.x=function(a,b){if(a)var c=a.id;else{do c=Math.random();while(google.y[c])}google.y[c]=[a,b];return!1};google.lm=[];google.plm=function(a){google.lm.push.apply(google.lm,a)};google.lq=[];google.load=function(a,b,c){google.lq.push([[a],b,c])};google.loadAll=function(a,b){google.lq.push([a,b])};}).call(this);google.f={};google.arwt=function(a){a.href=document.getElementById(a.id.substring(1)).href;return!0};(function(){var g=this||self,k=Date.now||function(){return+new Date};var t={};var v=function(b,d){if(null===d)return!1;if("contains"in b&&1==d.nodeType)return b.contains(d);if("compareDocumentPosition"in b)return b==d||!!(b.compareDocumentPosition(d)&16);for(;d&&b!=d;)d=d.parentNode;return d==b};var w=function(b,d){return function(a){a||(a=window.event);return d.call(b,a)}},B=function(b){b=b.target||b.srcElement;!b.getAttribute&&b.parentNode&&(b=b.parentNode);return b},C="undefined"!=typeof navigator&&/Macintosh/.test(navigator.userAgent),D="undefined"!=typeof 
navigator&&!/Opera/.test(navigator.userAgent)&&/WebKit/.test(navigator.userAgent),E={A:1,INPUT:1,TEXTAREA:1,SELECT:1,BUTTON:1},aa=function(){this._mouseEventsPrevented=!0},F={A:13,BUTTON:0,CHECKBOX:32,COMBOBOX:13,GRIDCELL:13,LINK:13,LISTBOX:13,MENU:0,MENUBAR:0,MENUITEM:0,MENUITEMCHECKBOX:0,MENUITEMRADIO:0,OPTION:0,RADIO:32,RADIOGROUP:32,RESET:0,SUBMIT:0,SWITCH:32,TAB:0,TREE:13,TREEITEM:13},G={CHECKBOX:!0,OPTION:!0,RADIO:!0},H={COLOR:!0,DATE:!0,DATETIME:!0,"DATETIME-LOCAL":!0,EMAIL:!0,MONTH:!0,NUMBER:!0,PASSWORD:!0,RANGE:!0,SEARCH:!0,TEL:!0,TEXT:!0,TEXTAREA:!0,TIME:!0,URL:!0,WEEK:!0},ba={A:!0,AREA:!0,BUTTON:!0,DIALOG:!0,IMG:!0,INPUT:!0,LINK:!0,MENU:!0,OPTGROUP:!0,OPTION:!0,PROGRESS:!0,SELECT:!0,TEXTAREA:!0};var I=function(){this.h=this.a=null},K=function(b,d){var a=J;a.a=b;a.h=d;return a};I.prototype.g=function(){var b=this.a;this.a&&this.a!=this.h?this.a=this.a.__owner||this.a.parentNode:this.a=null;return b};var L=function(){this.i=[];this.a=0;this.h=null;this.j=!1};L.prototype.g=function(){if(this.j)return J.g();if(this.a!=this.i.length){var b=this.i[this.a];this.a++;b!=this.h&&b&&b.__owner&&(this.j=!0,K(b.__owner,this.h));return b}return null};var J=new I,M=new L;var P=function(){this.o=[];this.a=[];this.g=[];this.j={};this.h=null;this.i=[];O(this,"_custom")},ca="undefined"!=typeof navigator&&/iPhone|iPad|iPod/.test(navigator.userAgent),Q=String.prototype.trim?function(b){return b.trim()}:function(b){return b.replace(/^\s+/,"").replace(/\s+$/,"")},da=/\s*;\s*/,ia=function(b,d){return function(a){var c=d;if("_custom"==c){c=a.detail;if(!c||!c._type)return;c=c._type}var e;if("click"==c&&(C&&a.metaKey||!C&&a.ctrlKey||2==a.which||null==a.which&&4==a.button||a.shiftKey))c= "clickmod";else{var f=a.which||a.keyCode||a.key;D&&3==f&&(f=13);if(13!=f&&32!=f)f=!1;else{var l=B(a),h=(l.getAttribute("role")||l.type||l.tagName).toUpperCase(),m;(m="keydown"!=a.type||!("getAttribute"in l&&!((l.getAttribute("type")||l.tagName).toUpperCase()in 
H)&&"BUTTON"!=l.tagName.toUpperCase()&&!l.isContentEditable)||a.ctrlKey||a.shiftKey||a.altKey||a.metaKey||(l.getAttribute("type")||l.tagName).toUpperCase()in G&&32==f)||((m=l.tagName in E)||(m=l.getAttributeNode("tabindex"),m=null!=m&&m.specified),m=!(m&&!l.disabled));m?f=!1:(l="INPUT"!=l.tagName.toUpperCase()||l.type,m=!(h in F)&&13==f,f=(0==F[h]%f||m)&&!!l)}f&&(c="clickkey")}h=a.srcElement||a.target;f=R(c,a,h,"",null);a.path?(M.i=a.path,M.a=0,M.h=this,M.j=!1,l=M):l=K(h,this);for(;m=l.g();){m=e=m;var n=c;var p=m.__jsaction;if(!p){var u;p=null;"getAttribute"in m&&(p=m.getAttribute("jsaction"));if(u=p){p=t[u];if(!p){p={};for(var x=u.split(da),ea=x?x.length:0,y=0;y<ea;y++){var r=x[y];if(r){var z=r.indexOf(":"),N=-1!=z,fa=N?Q(r.substr(0,z)):"click";r=N?Q(r.substr(z+1)):r;p[fa]=r}}t[u]=p}m.__jsaction=p}else p=ha,m.__jsaction=p}m=p;"clickkey"==n?n="click":"click"!=n||m.click||(n="clickonly");n={m:n,action:m[n]||"",event:null,s:!1};f=R(n.m,n.event||a,h,n.action||"",e,f.timeStamp);if(n.s||n.action)break}f&&"touchend"==f.eventType&&(f.event._preventMouseEvents=aa);if(n&&n.action){if(h="clickkey"==c)h=B(a),h=(h.type||h.tagName).toUpperCase(),(h=32==(a.which||a.keyCode||a.key)&&"CHECKBOX"!=h)||(h=B(a),l=(h.getAttribute("role")||h.tagName).toUpperCase(),h="BUTTON"==l?!0:!(h.tagName.toUpperCase()in ba)||"A"==l||"SELECT"==l||(h.getAttribute("type")||h.tagName).toUpperCase()in G||(h.getAttribute("type")||h.tagName).toUpperCase()in H?!1:!0);h&&(a.preventDefault?a.preventDefault():a.returnValue=!1);if("mouseenter"==c||"mouseleave"==c)if(h=a.relatedTarget,!("mouseover"==a.type&&"mouseenter"==c||"mouseout"==a.type&&"mouseleave"==c)||h&&(h===e||v(e,h)))f.action="",f.actionElement=null;else{c={};for(var q in a)"function"!==typeof a[q]&&"srcElement"!==q&& "target"!==q&&(c[q]=a[q]);c.type="mouseover"==a.type?"mouseenter":"mouseleave";c.target=c.srcElement=e;c.bubbles=!1;f.event=c;f.targetElement=e}}else 
f.action="",f.actionElement=null;e=f;b.h&&(q=R(e.eventType,e.event,e.targetElement,e.action,e.actionElement,e.timeStamp),"clickonly"==q.eventType&&(q.eventType="click"),b.h(q,!0));if(e.actionElement){if(b.h)"A"!=e.actionElement.tagName||"click"!=e.eventType&&"clickmod"!=e.eventType||(a.preventDefault?a.preventDefault():a.returnValue=!1),b.h(e);else{if((q= g.document)&&!q.createEvent&&q.createEventObject)try{var A=q.createEventObject(a)}catch(la){A=a}else A=a;e.event=A;b.i.push(e)}if("touchend"==e.event.type&&e.event._mouseEventsPrevented){a=e.event;for(var ma in a);k()}}}},R=function(b,d,a,c,e,f){return{eventType:b,event:d,targetElement:a,action:c,actionElement:e,timeStamp:f||k()}},ha={},ja=function(b,d){return function(a){var c=b,e=d,f=!1;"mouseenter"==c?c="mouseover":"mouseleave"==c&&(c="mouseout");if(a.addEventListener){if("focus"==c||"blur"==c|| "error"==c||"load"==c)f=!0;a.addEventListener(c,e,f)}else a.attachEvent&&("focus"==c?c="focusin":"blur"==c&&(c="focusout"),e=w(a,e),a.attachEvent("on"+c,e));return{m:c,l:e,capture:f}}},O=function(b,d){if(!b.j.hasOwnProperty(d)){var a=ia(b,d),c=ja(d,a);b.j[d]=a;b.o.push(c);for(a=0;a<b.a.length;++a){var e=b.a[a];e.g.push(c.call(null,e.a))}"click"==d&&O(b,"keydown")}};P.prototype.l=function(b){return this.j[b]};var V=function(b,d){var a=new ka(d),c;a:{for(c=0;c<b.a.length;c++)if(S(b.a[c],d)){c=!0;break a}c=!1}if(c)return b.g.push(a),a;T(b,a);b.a.push(a);U(b);return a},U=function(b){for(var d=b.g.concat(b.a),a=[],c=[],e=0;e<b.a.length;++e){var f=b.a[e];W(f,d)?(a.push(f),X(f)):c.push(f)}for(e=0;e<b.g.length;++e)f=b.g[e],W(f,d)?a.push(f):(c.push(f),T(b,f));b.a=c;b.g=a},T=function(b,d){var a=d.a;ca&&(a.style.cursor="pointer");for(a=0;a<b.o.length;++a)d.g.push(b.o[a].call(null,d.a))},Y=function(b,d){b.h=d;b.i&& (0<b.i.length&&d(b.i),b.i=null)},ka=function(b){this.a=b;this.g=[]},S=function(b,d){for(var a=b.a,c=d;a!=c&&c.parentNode;)c=c.parentNode;return a==c},W=function(b,d){for(var 
a=0;a<d.length;++a)if(d[a].a!=b.a&&S(d[a],b.a))return!0;return!1},X=function(b){for(var d=0;d<b.g.length;++d){var a=b.a,c=b.g[d];a.removeEventListener?a.removeEventListener(c.m,c.l,c.capture):a.detachEvent&&a.detachEvent("on"+c.m,c.l)}b.g=[]};var Z=new P;V(Z,window.document.documentElement);O(Z,"click");O(Z,"focus");O(Z,"focusin");O(Z,"blur");O(Z,"focusout");O(Z,"error");O(Z,"load");O(Z,"change");O(Z,"dblclick");O(Z,"input");O(Z,"keyup");O(Z,"keydown");O(Z,"keypress");O(Z,"mousedown");O(Z,"mouseenter");O(Z,"mouseleave");O(Z,"mouseout");O(Z,"mouseover");O(Z,"mouseup");O(Z,"paste");O(Z,"touchstart");O(Z,"touchend");O(Z,"touchcancel");O(Z,"speech");(function(b){google.jsad=function(d){Y(b,d)};google.jsaac=function(d){return V(b,d)};google.jsarc=function(d){X(d);for(var a=!1,c=0;c<b.a.length;++c)if(b.a[c]===d){b.a.splice(c,1);a=!0;break}if(!a)for(a=0;a<b.g.length;++a)if(b.g[a]===d){b.g.splice(a,1);break}U(b)}})(Z);window.gws_wizbind=function(b){return{trigger:function(d){var a=b.l(d.type);a||(O(b,d.type),a=b.l(d.type));var c=d.target||d.srcElement;a&&a.call(c.ownerDocument.documentElement,d)},bind:function(d){Y(b,d)}}}(Z);}).call(this);(function(){var b=[];google.jsc={xx:b,x:function(a){b.push(a)},mm:[],m:function(a){google.jsc.mm.length||(google.jsc.mm=a)}};}).call(this);(function(){google.c={};(function(){var f=window.performance;var g=function(a,b,c){a.addEventListener?a.addEventListener(b,c,!1):a.attachEvent&&a.attachEvent("on"+b,c)};google.timers={};google.startTick=function(a){google.timers[a]={t:{start:google.time()},e:{},m:{}}};google.tick=function(a,b,c){google.timers[a]||google.startTick(a);c=void 0!==c?c:google.time();b instanceof Array||(b=[b]);for(var e=0,d;d=b[e++];)google.timers[a].t[d]=c};google.c.e=function(a,b,c){google.timers[a].e[b]=c};google.c.b=function(a){var b=google.timers.load.m;b[a]&&google.ml(Error("a"),!1,{m:a});b[a]=!0};google.c.u=function(a){var b=google.timers.load.m;if(b[a]){b[a]=!1;for(a in b)if(b[a])return;google.csiReport()}else 
google.ml(Error("b"),!1,{m:a})};google.rll=function(a,b,c){var e=function(d){c(d);d=e;a.addEventListener?a.removeEventListener("load",d,!1):a.attachEvent&&a.detachEvent("onload",d);d=e;a.addEventListener?a.removeEventListener("error",d,!1):a.attachEvent&&a.detachEvent("onerror",d)};g(a,"load",e);b&&g(a,"error",e)};google.aft=function(a){a.setAttribute("data-iml",google.time())};google.startTick("load");var h=google.timers.load;a:{var k=h.t;if(f){var l=f.timing;if(l){var m=l.navigationStart,n=l.responseStart;if(n>m&&n<=k.start){k.start=n;h.wsrt=n-m;break a}}f.now&&(h.wsrt=Math.floor(f.now()))}}google.c.b("pr");google.c.b("xe");}).call(this);})();</script></head></html> Edited August 14, 2019 by David Schwartz Share this post Link to post
Attila Kovacs 629 Posted August 14, 2019
16 minutes ago, David Schwartz said: "Clearly, nobody who's responded to my questions has bothered to look at what these sites are actually serving up these days!"

I told you already that this is not something Google wants to be possible. They are very good at preventing it and are continuously changing their code. Good luck.
Attila Kovacs 629 Posted August 14, 2019
Not to mention that once you get it to work, you will very soon face CAPTCHAs on every Google page for your IP.
David Schwartz 426 Posted August 14, 2019
32 minutes ago, Attila Kovacs said: "I told you already that this is not something google wants to be able and they are very good at this and changing continuously their code. Good luck"

It's not just Google. More and more sites are doing stuff like this.

What nobody seems to want to admit is that all the assertions people keep making about outerHTML returning what the user sees are basically bunk. There is user-visible text being rendered inside this control, and outerHTML is not providing access to it in all cases.

The question remains ... how can you access the content that's being rendered for viewing by the user? Up until this point the stock reply has been to use outerHTML. Now that we've established it doesn't work in all cases, what DOES work instead?

This is a technical question. I don't care about the politics.

Edited August 14, 2019 by David Schwartz
rvk 33 Posted August 14, 2019 (edited) 1 hour ago, David Schwartz said: Thanks, you created a trivial example to prove YOUR case, not what I'm dealing with. BRAVO! You might want to dial down the sarcasme. I'm trying to help you here and with that attitude I'm feeling less and less inclined to do so. With my example I retrieved your search result and got this (this is a snippet just to show you the content is there). <div class="I6vAHd h5RoYd ads-creative">Dr. Lazer is a <b>Baltimore Dentist</b> Dedicated To Quality <b>Dental</b> Care. Financing Available. Patient Focused <b>Dentistry</b>. Top <b>Baltimore Dentists</b>. Advanced Training. Services: Cosmetic <b>dentistry</b>, General <b>dentistry</b>, Porcelain veneers, Teeth whitening, <b>Dental</b> implants, Cosmetic dentures, <b>Dental</b> crowns.</div><ul class="OkkX2d"><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/our-office/ed-lazer-dds/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoAHoECA8QBA" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABABGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_0RkT0b2u8xR9l7_LEQw9VUouKjzQ&adurl=&rct=j&q=">Meet Dr. 
Ed Lazer</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/for-patients/special-offers/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoAXoECA8QBQ" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABACGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_2gU3L1cLsoIxMZDy60AJIwH1-G4w&adurl=&rct=j&q=">Special Offers</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/smile-gallery/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoAnoECA8QBg" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABADGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_0Eq9nb4XnF90RRE1Ik35kT7kFBQQ&adurl=&rct=j&q=">Smile Gallery</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/request-appointment/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoA3oECA8QBw" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABAEGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_2N4qM3fV1hZgFsH22ko3zEIjVzFQ&adurl=&rct=j&q=">Schedule Appointment</a></li><li><a class="V0MxL" onmousedown="return google.f[this.getAttribute('data-mousedown')](this)" ontouchstart="return 
google.f[this.getAttribute('data-touchstart')](this)" href="https://www.cosmeticdentistbaltimore.com/contact/?TrackNum=410-753-2005" data-ved="2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQpigoBHoECA8QCA" data-touchstart="bez1fd" data-mousedown="LmvwCb" data-arwt="//www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABAFGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_0sMvnrhd-h49qKlc3LKos0K2h67w&adurl=&rct=j&q=">Contact Us</a></li></ul></li><li class="ads-ad" data-hveid="CBAQAA" data-bg="1"><div class="ad_cclk"><a id="n1s0p2c0" style="display: none;" href="https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwia64isiIPkAhVH5HcKHQCGCzoYABAGGgJlZg&ohost=www.google.com&cid=CAASE-RoDCdUQXMX66yba6ZIKqfKGC0&sig=AOD64_38aJbvxP_v7qoqtk0s9Byp6f8NQw&rct=j&q=&ved=2ahUKEwj7o4KsiIPkAhVEJlAKHYQZDhMQ0Qx6BAgQEAE&adurl="></a><a class="V0MxL r-ieStqovnU5rk" id="vn1s0p2c0" onmousedown="return google.arwt(this)" ontouchstart="return google.arwt(this)" href="https://www.aspendental.com/dentist/md/dundalk/1401-merritt-blvd" jsl="$t t-r1glFWqNI5A;$x 0;"><h3 class="sA5rQ">Official Aspen Dental | Affordable dentistry</h3><br><div class="ads-visurl"> You see that text: Dr. Lazer is a <b>Baltimore Dentist</b> Dedicated To Quality <b>Dental</b> Care. Financing Available. Is that what you are after? (this is from the outerHTML) Yes, it's riddled with javascript and lots of tags. But it's valid HTML with the content provided on screen of the user. If it's not, please provide some text that you did expect. BTW, You do need to read out outerHTML in WebBrowser1DocumentComplete() because the javascript needs time to run. But I assumed you already knew this. Edited August 14, 2019 by rvk 3 Share this post Link to post
David Schwartz 426 Posted August 14, 2019
58 minutes ago, rvk said: "You might want to dial down the sarcasm. I'm trying to help you here and with that attitude I'm feeling less and less inclined to do so. <snip> BTW, You do need to read out outerHTML in WebBrowser1DocumentComplete() because the javascript needs time to run. But I assumed you already knew this."

Sorry, but this is the first time anybody has suggested that there could be a timing issue at work here, and your example certainly wouldn't have uncovered it.

I assumed that OnNavigateComplete2 actually meant that the processing was "complete". I've never had a need to use OnDocumentComplete before, and I've never even seen it used in any examples. But now that you mention it, it makes sense.

So I modified my code to read outerHTML in OnDocumentComplete instead of OnNavigateComplete2, and it seems to be returning what I'm looking for.

YES! THANK YOU!
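For readers landing here later, the fix described in this post can be sketched roughly as follows. This is only a minimal sketch, not the poster's actual code: it assumes a VCL form with a TWebBrowser named WebBrowser1 and a TMemo named Memo1 (the names used earlier in the thread). The top-level-frame check is an addition not discussed in the thread; it is there because DocumentComplete also fires once per frame. The handler signature shown uses const URL, while older Delphi versions declare it as var URL, so it is safest to let the IDE generate the stub.

```pascal
unit Unit1;

interface

uses
  System.Classes, Vcl.Controls, Vcl.Forms, Vcl.StdCtrls, Vcl.OleCtrls, SHDocVw;

type
  TForm1 = class(TForm)
    WebBrowser1: TWebBrowser; // placed on the form at design time (assumed)
    Memo1: TMemo;             // receives the HTML snapshot (assumed)
    // Assigned to WebBrowser1.OnDocumentComplete in the Object Inspector
    procedure WebBrowser1DocumentComplete(ASender: TObject;
      const pDisp: IDispatch; const URL: OleVariant);
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.WebBrowser1DocumentComplete(ASender: TObject;
  const pDisp: IDispatch; const URL: OleVariant);
var
  Doc: OleVariant;
begin
  // The event fires for every frame; only react to the top-level document.
  // COM object identity is compared through IUnknown.
  if (pDisp as IUnknown) <> (WebBrowser1.DefaultInterface as IUnknown) then
    Exit;

  // Document is nil until something has been loaded
  if not Assigned(WebBrowser1.Document) then
    Exit;

  // Late-bound access to the live DOM, as in the example earlier in the thread
  Doc := WebBrowser1.Document;
  Memo1.Lines.Text := Doc.documentElement.outerHTML;
end;

end.
```

Heavily scripted pages may keep changing the DOM after this event has fired, so the result is best treated as a snapshot of the document at that moment.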
pcplayer99 11 Posted August 15, 2019
I did some research on this topic. It seems that we should use a "headless browser" with Python, but I am trying to use Delphi.

So I have tested outerHTML, innerHTML, etc., and I have also tested another way: copy/paste. I can get the HTML text, but it is mixed with JS code and HTML tags. I cannot get the plain text that the user reads at all.
rvk 33 Posted August 15, 2019
49 minutes ago, pcplayer99 said: "I can not get plain text that user can read at all."

You can use the WebBrowser1DocumentComplete() event handler and read out WebBrowser1.Document.documentElement.outerHTML (as discussed above).

You can strip out any script tags and HTML tags and you end up with the plain text. (You should add line breaks at div and br tags, though, otherwise all the text ends up on one line.) Or you might look for an HTML2TEXT function, which does all that work for you.

But the readable text is there in the .outerHTML.
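The stripping step described in this post could look something like the console sketch below. It is an illustration under stated assumptions rather than a tested HTML2TEXT replacement: HtmlToPlainText is a hypothetical helper name, the list of tags that get a line break is an arbitrary choice, and regular expressions are only a rough stand-in for a real HTML parser. It uses TRegEx from System.RegularExpressions and TNetEncoding.HTML.Decode from System.NetEncoding, so it needs a reasonably recent Delphi version.

```pascal
program StripTagsDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions, System.NetEncoding;

// Hypothetical helper: reduces rendered HTML to roughly the visible text.
function HtmlToPlainText(const Html: string): string;
begin
  // 1. Remove script and style elements together with their contents.
  //    roSingleLine lets "." match line breaks inside the element body.
  Result := TRegEx.Replace(Html,
    '<(script|style)\b[^>]*>.*?</\1\s*>', '',
    [roIgnoreCase, roSingleLine]);

  // 2. Turn br tags and closing block-level tags into line breaks,
  //    so the text does not end up on a single line.
  Result := TRegEx.Replace(Result,
    '<br\s*/?>|</(div|p|li|tr|h[1-6])\s*>', sLineBreak,
    [roIgnoreCase]);

  // 3. Drop every remaining tag.
  Result := TRegEx.Replace(Result, '<[^>]+>', '');

  // 4. Decode common entities such as &amp; and &lt;.
  Result := TNetEncoding.HTML.Decode(Result);
end;

begin
  // Sample input modelled on the small test page from earlier in the thread
  Writeln(HtmlToPlainText(
    '<div id=content>Changed content</div>' +
    '<script type="text/JavaScript">var x = 1;</script>'));
end.
```

In a VCL application the same function could be fed the outerHTML string read in the OnDocumentComplete handler. Hidden elements (for example those styled with display: none) would still contribute text, so the output is only an approximation of what the user sees.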
vinni 0 Posted April 17
Hello,

Is there a similar solution for FireMonkey? FMX.WebBrowser.TWebBrowser does not have a Document property.
rvk 33 Posted April 17
11 hours ago, vinni said: "Is there a similar solution for FireMonkey? FMX.WebBrowser.TWebBrowser does not have Document property."

No, unfortunately TWebBrowser.Document isn't available for FMX. https://stackoverflow.com/a/41853823/1037511
Dave Nottage 557 Posted April 17
1 hour ago, rvk said: "No, unfortunately TWebBrowser.Document isn't available for FMX."

There is no Document property, however there is a way of accessing the underlying web view (ICoreWebView2 on Windows, JWebView on Android and WKWebView on macOS/iOS), e.g. on Android:

  var
    LWebView: JWebView;
  begin
    if Supports(WebBrowser1, JWebView, LWebView) then
      // Do something with LWebView
  end;

That said, as time goes on the underlying web view (on all platforms) has had direct access to HTML removed, so it has become necessary to execute JavaScript to achieve the same thing. Unfortunately, TWebBrowser has only EvaluateJavascript, which does not support returning a result, so it's necessary to use the technique above to achieve this, as I've done in the WebBrowserExt feature in Kastri (demo here).

Using TWebBrowserExt, ExecuteJavaScript could be called with cJavaScriptGetPageContents (from the DW.JavaScript unit), thus:

  unit Unit1;

  interface

  uses
    System.SysUtils, System.Types, System.UITypes, System.Classes, System.Variants,
    FMX.Types, FMX.Controls, FMX.Forms, FMX.Graphics, FMX.Dialogs, FMX.WebBrowser,
    FMX.Controls.Presentation, FMX.StdCtrls,
    DW.WebBrowserExt;

  type
    TForm1 = class(TForm)
      WebBrowser1: TWebBrowser;
      Button1: TButton;
      procedure Button1Click(Sender: TObject);
      procedure WebBrowser1DidFinishLoad(ASender: TObject);
    private
      FWebBrowserExt: TWebBrowserExt;
    public
      constructor Create(AOwner: TComponent); override;
    end;

  var
    Form1: TForm1;

  implementation

  {$R *.fmx}

  uses
    DW.JavaScript;

  { TForm1 }

  constructor TForm1.Create(AOwner: TComponent);
  begin
    inherited;
    WebBrowser1.WindowsEngine := TWindowsEngine.EdgeOnly; // On Windows, unlikely to work with IE
    FWebBrowserExt := TWebBrowserExt.Create(WebBrowser1);
  end;

  procedure TForm1.WebBrowser1DidFinishLoad(ASender: TObject);
  begin
    FWebBrowserExt.ExecuteJavaScript(cJavaScriptGetPageContents,
      procedure(const AJavaScriptResult: string; const AErrorCode: Integer)
      begin
        // AJavaScriptResult should now contain the page contents
      end
    );
  end;

  procedure TForm1.Button1Click(Sender: TObject);
  begin
    WebBrowser1.Navigate('http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm');
  end;

  end.

2 hours ago, rvk said: "https://stackoverflow.com/a/41853823/1037511"

Not sure whether that answer was wrong in 2017, but it's certainly incomplete now, given the above.
Brandon Staggs 277 Posted April 18
On 4/17/2024 at 4:07 AM, vinni said: "Is there a similar solution for FireMonkey? FMX.WebBrowser.TWebBrowser does not have Document property."

Modern web browser design is virtually all asynchronous, so there is no way to do the kinds of things we once did synchronously through direct access to the DOM. I have thousands of lines of code that interact with the old Internet Explorer COM web browser, which I am not looking forward to completely rewriting in JavaScript some day so I can use WebView2....
vinni 0 Posted April 18
Dave, thanks for your suggestion. I gave it a try, but on my side AJavaScriptResult is always 'null' for any website.
Dave Nottage 557 Posted April 18
6 hours ago, vinni said: "Dave, thanks for your suggestion. I gave it a try, but on my side AJavaScriptResult is always 'null' for any website."

Is this on Windows? If so, are you setting the WindowsEngine property of the TWebBrowser to EdgeOnly? The result you're seeing happens if TWebBrowser uses IE.
vinni 0 Posted April 19
Yes, but setting WebBrowser1.WindowsEngine to TWindowsEngine.EdgeOnly raises an exception:

---------------------------
Application Error
---------------------------
Exception EBrowserEngineException in module Project1.exe at 0042A2AD.
Edge browser engine is unavailable.
---------------------------
OK
---------------------------

Edge is installed, and I'm using Delphi 11.3.
Dave Nottage 557 Posted April 19
2 hours ago, vinni said: "Edge browser engine is unavailable"

Please see this link.