David Schwartz

Parsing Google Search Results redux


In case anybody is interested, I've been trying to scrape a Google search result page, and it's nearly impossible without a huge amount of work. The HTML page you get back from their server looks like this:

 

<HTML><HEAD>...</HEAD></HTML>

 

There's no BODY. It's generated inside the DOM by javascript.

 

The JavaScript takes a bunch of regular HTML and CSS and appears to use a lookup function to do a global search-and-replace of nearly every human-readable word and phrase with random character strings that are generated dynamically for each page. So there are hardly any "landmarks" you can use to find anything. You can't even build a symbol table from one page and reuse it on subsequent pages, since the mappings are different from page to page.

 

It also employs deeply nested DIVs and lots of needless SPANs to make it difficult to extract content.

 

Finally, while it's clear from viewing the page that there are 10 entries in the main part of the page, some of them are hidden in ways I couldn't figure out. I searched for strings that are visible in the browser view, but they aren't found anywhere in the body of the rendered HTML. I'm guessing they're hiding inside some JavaScript functions, obfuscated to the point where they can't be read without executing the functions to extract them.

 

I suspect it won't be long before it's nearly impossible to extract anything meaningful from HTML pages, as whatever we see displayed in a browser will be generated entirely on the fly by a variety of dynamic and static methods buried inside deeply nested JavaScript functions, some of which may not even run until after the page loads. Some of it even looks like a variety of self-modifying (or self-generated) code.

 

This has some pretty crazy implications for security, and it makes it all but impossible to sniff code that could signal the presence of malware injections.

 

I've attached an unwound pretty-printed sample of a search page I was able to extract for your reading pleasure. Fun stuff, eh?

 

sample_SERP.html

Edited by David Schwartz


This is why there are embeddable web engines that process JavaScript. You need this, for instance, to parse Amazon wish lists, since those pages use JavaScript to automatically load more items as you scroll down. There are also tools such as Selenium for driving web browsers, and platforms like ScrapingHub, powered by the open source scrapy library and Splash, a headless, scriptable, JavaScript-enabled browser.
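To give a rough idea of what that looks like in practice, here's a minimal Python sketch using Selenium. It assumes the selenium package and a matching chromedriver are installed; the query URL and the use of <h3> for result titles are just my own observations, not anything Google guarantees:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

opts = Options()
opts.add_argument("--headless")          # run without a visible browser window
driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://www.google.com/search?q=delphi")   # example query
    # Wait until at least one heading has been rendered by the page's JavaScript.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "h3")))
    for h in driver.find_elements(By.TAG_NAME, "h3"):
        print(h.text)
    rendered_html = driver.page_source   # the DOM *after* the scripts have run
finally:
    driver.quit()

The point is that driver.page_source hands you the post-JavaScript DOM, which is exactly what the raw server response never contains.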


Thanks for all the side-chatter, but none of it is helpful. I was merely passing along the results of what I found. I'm not running Linux; I'm not getting captchas doing this manually, and I have no reason to believe I'll get them using similar timings with a different means of data entry; and I'm only interested in the first page of results. In summary, y'all are making stuff up that is completely irrelevant to my present needs. All I'm focusing on is scraping one page of results, and that's all that's up for discussion. But thanks for the insights.

5 hours ago, David Schwartz said:

It also employs deeply nested DIVs and lots of needless SPANs to make it difficult to extract content.

I doubt that's the reason.

 

5 hours ago, David Schwartz said:

There's no BODY. It's generated inside the DOM by javascript.

Now that sounds more like a reason 😉 Having said that, if you're using text-based parsing it'd be possible to work around the lack of a <body> tag. I started looking at doing this programmatically using TWebBrowser a few days ago. If I had more time, I'd continue with it.
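As a rough illustration of the text-based idea (sketched in Python here rather than with TWebBrowser, and assuming the requests package): you scan the raw response as plain text, so it doesn't matter whether anything sits inside a <body> tag. Whether the strings you're after actually appear in that raw text is, as David notes, a separate question.

import requests

resp = requests.get(
    "https://www.google.com/search?q=delphi",   # example query
    headers={"User-Agent": "Mozilla/5.0"},      # assumption: a browser-like UA
    timeout=10,
)
raw = resp.text

# Plain text scanning: walk the whole document, scripts included,
# pulling out anything between <h3 ...> and </h3>.
pos = raw.find("<h3")
while pos != -1:
    end = raw.find("</h3>", pos)
    if end == -1:
        break
    print(raw[pos:end + len("</h3>")])
    pos = raw.find("<h3", end)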

Edited by Dave Nottage
Formatting

Guest

This thread confuses me a bit. @David Schwartz, did you actually try a Selenium-like solution? I'm just curious; I'm not asking this as criticism.

 

Anyway - the trends pointed out above are IMHO quite hideous. But I have to wonder about accessibility here. A lot of people with accessibility needs have money, and in the US there are proper laws about accessibility (OK, OK, not hard facts; I should google some). Economically, then, it would be strange for the creators not to heed accessibility needs. If a screen reader can read the pertinent information, then your code should be able to do the same.

 

I'd try to mine the page using the accessibility API, via something like Selenium. But of course it might be just as much work, depending on how accessible that page actually is.
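Something like this is what I have in mind. A rough sketch only, assuming Python's selenium bindings; the role and aria-label selectors are guesses at what a screen reader would be handed, and the actual markup may differ:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.google.com/search?q=delphi")   # example query
    # Query the page the way assistive technology would: by ARIA role,
    # not by class names or element nesting.
    for el in driver.find_elements(By.CSS_SELECTOR, '[role="heading"]'):
        print(el.get_attribute("aria-level"), el.text)
    for a in driver.find_elements(By.CSS_SELECTOR, "a[aria-label]"):
        print(a.get_attribute("aria-label"), "->", a.get_attribute("href"))
finally:
    driver.quit()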

22 hours ago, Dany Marmur said:

This thread confuses me a bit. @David Schwartz, did you actually try a Selenium-like solution? I'm just curious; I'm not asking this as criticism.

 

No. I've exhausted the time allotted to investigate this, and there's no budget for 3rd-party commercial subscriptions. What used to be pretty easy up to about a year ago or so is much, much more difficult today. 

 

The only real HTML tags that appear useful as possible "landmarks" for identifying different sections of content are the heading tags (H1, H2, etc.).
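For instance, once you have the rendered HTML (the attached sample_SERP.html, or whatever a headless browser hands back), using the heading tags as landmarks looks roughly like this. A sketch only, assuming the BeautifulSoup (bs4) package; treating the nearest enclosing DIV as the "result" container is purely a guess:

from bs4 import BeautifulSoup

with open("sample_SERP.html", encoding="utf-8") as f:
    rendered_html = f.read()

soup = BeautifulSoup(rendered_html, "html.parser")
for h in soup.find_all(["h1", "h2", "h3"]):
    title = h.get_text(strip=True)
    # Guess: the nearest enclosing DIV holds the rest of that result's content.
    container = h.find_parent("div")
    snippet = container.get_text(" ", strip=True) if container else ""
    print(f"{h.name.upper()}: {title}")
    print(f"   {snippet[:120]}")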

 

An e-reader wouldn't have any problem rendering text-to-speech, because it doesn't need to "recognize" the sections. It just parses through the page and performs TTS as it encounters things, adapting to the various tags it finds as needed. It doesn't need to find structure or parse out pieces of content. I don't know how feasible it is to extract CSS attributes along the way that might influence how it renders things -- it's not like you can "hear colors", other than having it report the start and end of a hex code or maybe a recognizable color name, e.g., "blue". It's pretty evident that the DIVs and SPANs around the content sprinkle in quite a bit of visual stuff -- italics, bolds, colors, (sub)headings, etc. -- that would just be a ton of noise if it were all vocalized by a TTS system.

 

There are ad blocks that can be inserted between meaningful sections, and they "look" nearly the same as the real results except for one little visual tag that's presented as a CSS attribute with a random class ID.

 

If I had another week to spend on this, I might come up with something that works, but at this point, my time budget is exhausted.

Guest

Yes, well. The olde extracurricular problems are many :classic_dry:

Edited by Guest


Google Search is not part of the free package. And I'd bet it doesn't work very well. (It may parse some stuff, but after looking over the effort G has expended to embed dynamically generated, obfuscated JavaScript intended to hide random sections of results in different ways, it would need to be pretty damn smart to reverse-engineer a search page and break it up into meaningful chunks that come close to showing what's actually visible to the user.)

Edited by David Schwartz


Here Google Search is gray (Available in Enterprise) and included in that 1-year subscription from EMBT:

https://www.embarcadero.com/products/enterprise-connectors

 

And you mean that this connector doesn't go through a Google API? Why not?

 

From Google API documentation:

"Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day."

"The search results include the URL, title and text snippets that describe the result. In addition, they can contain rich snippet information, if applicable. "

https://developers.google.com/custom-search/v1/overview
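For reference, calling it is just an HTTP GET. A rough Python sketch (the key and engine ID are placeholders from the API Console and the Programmable Search Engine panel, and the query string is only an example):

import requests

API_KEY = "YOUR_API_KEY"          # placeholder: from the Google API Console
CX = "YOUR_SEARCH_ENGINE_ID"      # placeholder: from the Programmable Search Engine panel

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": CX, "q": "delphi web scraping"},
    timeout=10,
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["title"])
    print(item["link"])
    print(item.get("snippet", ""))
    print()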


Sorry, I read that wrong. I thought the ones in red were the free ones.

 

I was talking about parsing the Google SERPs, not using the API.

 

From some research I've done, the API does return "search results", but they're a subset of what you see when you run a query, and may be subject to variations based on data you may or may not have, like lat/lon location info (i.e., what does "near me" mean? An IP-based location can be far from where you really are).

 

I haven't actually played with the API yet, just basing this on numerous comments and complaints I found around the internet.

 

Like the map often found at the top of the page with the top-3 locations on it ... that's apparently "derived", along with all of the sponsored ads and other stuff. How exactly they come up with those three selections is anybody's guess. Some people care about that, some don't. (That's something I was specifically looking to identify.)

 

IOW, there's both "primary" and "derived" data displayed on SERPs you get back from Google. From what I could determine, the API only gives you the "primary" data, which is fine for most needs.

 

And the API can get rather expensive in some circumstances.

