David Schwartz Posted August 25, 2019

In case anybody is interested: I've been trying to scrape a Google search result page, and it's nearly impossible without a huge amount of work.

The HTML page you get back from their server looks like <HTML><HEAD>...</HEAD></HTML>. There's no BODY; it's generated inside the DOM by javascript. The javascript takes a bunch of regular HTML and CSS and appears to use a lookup function to do a global search-and-replace of nearly every human-readable word and phrase with random character strings generated dynamically for each page. So there are hardly any "landmarks" you can use to find anything. You can't even build a symbol table from one page and reuse it on subsequent pages, since the mappings differ from page to page. It also employs deeply nested DIVs and lots of needless SPANs that make it difficult to extract content.

Finally, while it's clear from viewing the page that there are 10 entries in the main part of the page, some of them are hidden in ways I couldn't figure out. I searched for strings that are visible in the browser view, but they aren't found anywhere in the body of the rendered HTML. I'm guessing they're buried inside javascript functions, obfuscated to the point where they can't be read without executing those functions.

I suspect it won't be long before it's nearly impossible to extract anything meaningful from HTML pages: whatever we see displayed in a browser will be generated entirely on-the-fly by a variety of dynamic and static methods buried inside deeply nested javascript functions, some of which may run even after the page loads. Some of it even looks like self-modifying (or self-generated) code. This has some pretty crazy implications for security as well, since it makes it impossible to sniff code that could signal the presence of malware injections.

I've attached an unwound, pretty-printed sample of a search page I was able to extract for your reading pleasure. Fun stuff, eh?

Attachment: sample_SERP.html
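To see the problem for yourself, here's a minimal sketch (assuming Python with the third-party requests and beautifulsoup4 packages; the query and User-Agent are placeholders, and what Google serves back varies by user agent) that fetches the raw SERP and inspects how much usable content arrives before any javascript runs:

```python
# Fetch the raw SERP and look at what the server actually sends back
# before any javascript executes.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# Placeholder query and User-Agent; Google tailors the response to the
# user agent, so your mileage may vary.
resp = requests.get(
    "https://www.google.com/search",
    params={"q": "dentist in scottsdale"},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)

soup = BeautifulSoup(resp.text, "html.parser")
body = soup.find("body")
print("status:", resp.status_code)
print("body tag present:", body is not None)
# Even when a <body> exists, the visible text is a small fraction of
# the payload; most of it is script.
print("visible text length:", len(soup.get_text(strip=True)))
print("script tags:", len(soup.find_all("script")))
```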
Attila Kovacs Posted August 25, 2019

```
apt-get install elinks    # or whatever your distro uses to install elinks
elinks "https://www.google.com/search?q=dentist+in+scottsdale" > output.file
```

but as I told you already, you will face captchas very soon
Joseph MItzen Posted August 25, 2019

This is why there are embeddable web engines that process javascript. You need this, for instance, to parse Amazon wish lists, since those pages use javascript to automatically load more items as you scroll down. There are also tools such as Selenium for driving web browsers, or platforms like ScrapingHub, powered by the open source scrapy library and the Splash headless, scriptable, javascript-enabled browser.
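As an illustration of the Selenium route, here's a sketch (assuming Python with the selenium package and a matching chromedriver on the PATH; the query is a placeholder): driving a real browser means the page's javascript actually runs, so you read the rendered DOM rather than the empty shell from the raw response.

```python
# Let a real (headless) browser execute the page's javascript, then
# read the *rendered* DOM.
# Requires: pip install selenium, plus chromedriver on the PATH.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.google.com/search?q=dentist+in+scottsdale")
    # page_source is the DOM *after* scripts have run, so the <body>
    # missing from the raw server response is now populated.
    html = driver.page_source
    print("rendered page length:", len(html))
finally:
    driver.quit()
```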
Joseph MItzen Posted August 25, 2019

2 hours ago, Attila Kovacs said:
"but as I told you already, you will face captchas very soon"

In that case, you want to check out goop.
David Schwartz Posted August 25, 2019

Thanks for all the side-chatter, but none of it is helpful. I was merely passing along the results of what I found.

I'm not running Linux; I'm not getting captchas doing this manually, and I have no reason to believe I will get them using similar timings with a different means of doing the data entry; I'm only interested in the first page of results; and in summary, y'all are making stuff up that is completely irrelevant to my present needs.

All I'm focusing on is scraping one page of results, and that's all that's up for discussion. But thanks for the insights.
Dave Nottage Posted August 25, 2019

5 hours ago, David Schwartz said:
"It also employs deeply nested DIVs and lots of needless SPANs to make it difficult to extract content."

I doubt that's the reason.

5 hours ago, David Schwartz said:
"There's no BODY. It's generated inside the DOM by javascript."

Now that sounds more like a reason 😉 Having said that, if you're using text-based parsing it'd be possible to work around the lack of a <body> tag. I started looking at doing this programmatically using TWebBrowser a few days ago; if I had more time, I'd continue with it.
Guest Posted August 26, 2019

This thread confuses me a bit. @David Schwartz, did you actually try a "selenium"-like solution? I'm just curious; I don't ask this as critique.

Anyway, the trends pointed out above are IMHO quite hideous. But I have to wonder about accessibility here. A lot of people with accessibility needs have money, and in the US there are proper laws about accessibility (OK, not hard facts; I should google some). Economically, it would be strange for the creators not to heed accessibility needs. If a screen reader can read the pertinent information, then your code should be able to do the same. I'd try to mine the page through the accessibility API via something like Selenium, though of course it might be just as much work, depending on how accessible that page actually is.
David Schwartz Posted August 27, 2019

22 hours ago, Dany Marmur said:
"This thread confuses me a bit. @David Schwartz, did you actually try a 'selenium'-like solution? I'm just curious; I don't ask this as critique."

No. I've exhausted the time allotted to investigate this, and there's no budget for 3rd-party commercial subscriptions. What used to be pretty easy up to about a year ago is much, much more difficult today.

The only real HTML tags that appear useful as possible "landmarks" for identifying different sections of content are the header tags (H1, H2, etc.).

A screen reader wouldn't have any problem rendering text-to-speech, because it doesn't need to "recognize" the sections. It just parses through the page and performs TTS as it encounters things, adapting to the various tags it finds as needed. It doesn't need to find structure or parse out pieces of content. I don't know how feasible it is to extract CSS attributes along the way that might influence how it renders things; it's not like you can "hear colors", other than having it report the start and end of a hex code or maybe a recognizable color name, e.g., "blue". It's pretty evident that the DIVs and SPANs around the content are sprinkling in quite a bit of visual stuff that would just be a ton of noise if it were all vocalized by a TTS system: italics, bolds, colors, (sub)headings, etc. There are ad blocks that can be inserted between meaningful sections, and they "look" nearly the same except for one little visual tag that's presented as a CSS attribute with a random classid.

If I had another week to spend on this, I might come up with something that works, but at this point my time budget is exhausted.
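To make the "landmarks" idea concrete, here's a minimal sketch (assuming Python with beautifulsoup4; rendered_serp.html is a placeholder for a page saved after the javascript has run, such as the attached sample) that anchors on heading tags instead of the randomized class names:

```python
# Landmark-based extraction: since class names are randomized per page,
# anchor on structural tags (h1/h2/h3) instead.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

# "rendered_serp.html" is a placeholder: a SERP saved *after* the
# javascript has run (e.g. dumped from a headless browser).
with open("rendered_serp.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for heading in soup.find_all(["h1", "h2", "h3"]):
    text = heading.get_text(strip=True)
    # Result titles often sit in a heading wrapped by (or wrapping) a
    # link; check both directions to recover the target URL, if any.
    link = heading.find_parent("a") or heading.find("a")
    href = link.get("href") if link else None
    if text:
        print(text, "->", href)
```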
Guest Posted August 27, 2019

Yes, well. The olde extracurricular problems are many.
Cristian Peța Posted October 25, 2019

I see that Google Search is part of the Enterprise Connectors from CData, which are now part of the RAD Studio Enterprise and Architect editions. Has someone tried this? https://www.embarcadero.com/products/enterprise-connectors
Lars Fosdal Posted October 25, 2019

14 minutes ago, Cristian Peța said:
"I see that Google Search is part of the Enterprise Connectors from CData, which are now part of the RAD Studio Enterprise and Architect editions. Has someone tried this? https://www.embarcadero.com/products/enterprise-connectors"

I've been wondering about those... Where do I download/install them? I can't see them in GetIt.
Cristian Peța 103 Posted October 25, 2019 https://community.idera.com/developer-tools/b/blog/posts/enterprise-connectors-now-part-of-rad-studio-enterprise-architect-edition 1 Share this post Link to post
David Schwartz Posted October 29, 2019

Google Search is not part of the free package. And I'd bet it doesn't work very well. (It may parse some stuff, but after looking over the effort Google has expended to embed dynamically generated, obfuscated javascript intended to hide random sections of the results in different ways, it would need to be pretty damn smart to reverse-engineer a search page and break it up into meaningful chunks that come close to showing what's actually visible to the user.)
Cristian Peța Posted October 29, 2019

Here Google Search is gray ("Available in Enterprise") and included in that 1-year subscription from EMBT: https://www.embarcadero.com/products/enterprise-connectors

And you mean that this connector doesn't go through a Google API? Why not? From the Google API documentation:

"Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1000 queries, up to 10k queries per day."

"The search results include the URL, title and text snippets that describe the result. In addition, they can contain rich snippet information, if applicable."

https://developers.google.com/custom-search/v1/overview
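For reference, calling that API directly is simple; here's a minimal sketch (assuming Python with requests; YOUR_API_KEY and YOUR_CSE_ID are placeholders you'd obtain from the Google API Console and a Custom Search Engine you create):

```python
# Query the Custom Search JSON API documented at the link above.
# YOUR_API_KEY and YOUR_CSE_ID are placeholders: the key comes from the
# Google API Console, the cx from a Custom Search Engine you set up.
# Requires: pip install requests
import requests

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={
        "key": "YOUR_API_KEY",
        "cx": "YOUR_CSE_ID",
        "q": "dentist in scottsdale",
    },
    timeout=10,
)
resp.raise_for_status()

# Each item carries the URL, title, and snippet mentioned in the docs.
for item in resp.json().get("items", []):
    print(item["title"], "-", item["link"])
    print(" ", item.get("snippet", ""))
```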
David Schwartz Posted October 29, 2019

Sorry, I read that wrong. I thought the ones in red were the free ones.

I was talking about parsing the Google SERPs, not using the API. From some research I've done, the API does return "search results", but it's a subset of what you see when you run a query, and it may vary based on data you may or may not have, like LAT+LON location info (i.e., what does "near me" mean? An IP-based location can be far from where you really are). I haven't actually played with the API yet; I'm basing this on numerous comments and complaints I found around the internet.

Take the map often found at the top of the page with the top-3 locations on it: that's apparently "derived", along with all of the sponsored ads and other stuff. How exactly they come up with those three selections is anybody's guess. Some people care about that, some don't. (That's something I was specifically looking to identify.)

IOW, there's both "primary" and "derived" data displayed on the SERPs you get back from Google. From what I could determine, the API only gives you the "primary" data, which is fine for most needs. And the API can get rather expensive in some circumstances.