David Schwartz

parsing Google search results


I need to parse a Google search result page. I'm getting the data into a TWebBrowser, and am using DIHTMLParser, but there's a problem.

 

I found several possible solutions, but Google puts a bunch of JavaScript into the raw HTML page that's returned, so if you parse that page, you don't get much.

 

I found someone on SO who suggested grabbing the IHTMLDocument2.body as an IHTMLElement, but it's nil. Perhaps it returns an IHTMLElementCollection in .all now, but the example code was from 2013, and I can't find any examples that go this direction in Delphi. (Does the .body element just contain what's in the BODY part of the HTML page?)

 

I'm working with D10.2.2, in case it matters.

 

And, yes, the DIHTMLParser guys have a Google parser plugin, last updated in 2018, but it doesn't work anymore.

 

How can I get the HTML elements that are actually displayed in the TWebBrowser's window after the JavaScript has finished executing?

 

I'm not finding anything in my searches that addresses this.

 

(BTW, what's visible in TWebBrowser is the same as what shows up in Chrome.)
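In other words, something along these lines (a rough, untested sketch assuming the VCL TWebBrowser over MSHTML; GetRenderedBody is just a name I made up):

```pascal
uses Vcl.Forms, SHDocVw, MSHTML, System.SysUtils;

// Hypothetical helper: wait for the browser to report the page as
// loaded, then read the live DOM instead of the raw HTML source.
function GetRenderedBody(WB: TWebBrowser): string;
var
  Doc: IHTMLDocument2;
begin
  Result := '';
  // READYSTATE_COMPLETE fires after the initial load; scripts that
  // rewrite the page afterwards may still need an extra delay
  // (e.g. a short TTimer) before body is populated.
  while WB.ReadyState <> READYSTATE_COMPLETE do
    Application.ProcessMessages;
  if Supports(WB.Document, IHTMLDocument2, Doc) and Assigned(Doc.body) then
    Result := Doc.body.innerHTML;  // rendered markup, post-JavaScript
end;
```

If body is still nil at that point, the document may not have finished its script-driven rewrite yet, which is why the extra delay is mentioned in the comment.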

 

1 hour ago, David Heffernan said:

Isn't this the wrong approach? Don't you use the REST API?

The only thing I could find was not what I was looking for. Do you happen to know which API returns the same results as running a query, and isn't limited to 100 queries per day?

2 hours ago, David Schwartz said:

And, yes, DIHTMLParser guys have a Google parser plugin that was last updated in 2018, but it doesn't work now.

Even if you get it to work, they will vary the output in a very short time, obviously.

Edited by Attila Kovacs

2 minutes ago, Attila Kovacs said:

Even if you get it to work, they will vary the output in a very short time, obviously.

I'm not so sure ... this is a structural change. The layout was the same for a decade before it changed, and then stable for nearly another decade.

 

A REST API would be better, but only if it returns enough info to effectively recreate the same results a normal query produces.

 

I don't need to interact with anything, just pull off a few bits of data from each entry. And if there's a map at the top then I need to read the links that show up below it.

 

The only API I found was for scanning URLs, not running general queries. But they've got so damn many APIs that it can be hard to find things if you don't know what they're probably called.
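For the "pull a few bits of data from each entry" part, once the rendered DOM is reachable, the walk itself is simple enough (untested sketch; CollectResultLinks is an illustrative name):

```pascal
uses MSHTML, ActiveX, System.SysUtils, System.Variants, System.Classes;

// Untested sketch: walk the rendered DOM and collect every anchor's
// href. Filtering these down to actual result entries (vs. nav links,
// the map block, etc.) is the part that depends on Google's current
// markup, which is exactly what keeps changing.
procedure CollectResultLinks(Doc: IHTMLDocument2; Links: TStrings);
var
  All: IHTMLElementCollection;
  El: IHTMLElement;
  i: Integer;
begin
  All := Doc.all;
  for i := 0 to All.length - 1 do
    if Supports(All.item(i, EmptyParam), IHTMLElement, El) and
       SameText(El.tagName, 'A') then
      Links.Add(VarToStr(El.getAttribute('href', 0)));
end;
```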


Continued searching uncovered this:

 

https://stackoverflow.com/questions/4082966/what-are-the-alternatives-now-that-the-google-web-search-api-has-been-deprecated

 

  • Note: The Google Web Search API has been officially deprecated as of November 1, 2010. It will continue to work as per our deprecation policy, but the number of requests you may make per day will be limited. Therefore, we encourage you to move to the new Custom Search API.

 

This seems to address the bigger question and tells me I'm not the only person stumped by this issue.
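For anyone landing here: a Custom Search JSON API request is just an HTTPS GET (rough, untested sketch; YOUR_API_KEY and YOUR_CX are placeholders for a real API key and search-engine ID, and the free tier is the 100-queries/day limit mentioned above):

```pascal
uses System.Net.HttpClient, System.NetEncoding;

// Sketch of a Custom Search JSON API query; returns the raw JSON
// response, which would then need to be parsed for the result entries.
function CustomSearch(const Query: string): string;
var
  Client: THTTPClient;
begin
  Client := THTTPClient.Create;
  try
    Result := Client.Get(
      'https://www.googleapis.com/customsearch/v1' +
      '?key=YOUR_API_KEY&cx=YOUR_CX' +
      '&q=' + TNetEncoding.URL.Encode(Query)).ContentAsString;
  finally
    Client.Free;
  end;
end;
```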

 

This post says at the bottom:

 

  • The search quality is much lower than normal Google search (no synonyms, "intelligence" etc.)
  • It seems that Google is even planning to shut down this service completely.

 

And some comments include these warnings:

 

  • This is why Google claims that the search results are different support.google.com/customsearch/answer/141877?hl=en Mainly: Using specified sites (does not apply here), no social or personalized or real time results – MFARID Apr 27 '14 at 17:41 
     
  • I tried it but it doesn't work now. I asked to look in the entire web for suunto ambit watch, but I got no results (I searched in the public URL that I got) – Dejell Feb 11 '15 at 20:0
     
  • WARNING: we did development using the free version, but to upgrade to the paid version (to do more than 100 searches), google forces you to turn off the "search the entire web but emphasize included sites" – Bryan Larsen Aug 11 '15 at 14:50
     
  • "On April 1, 2017, Google will discontinue sales of the Google Site Search. All new purchases and renewals must take place before this date. The product will be completely shut down by April 1, 2018." – Gajus Mar 7 '17 at 15:52

 

Also, it looks like a bunch of 3rd-party vendors have sprouted up who offer APIs for regular web searches.  SerpWow looks interesting.

 

Methinks if it were as simple as hooking into a standard Google API, there wouldn't be a lively 3rd-party market offering the same thing, or info on hacking the only existing API that comes close.

 

Edited by David Schwartz

On 8/8/2019 at 11:42 PM, Lars Fosdal said:

Although DuckDuckGo does not offer a full cover API, it seems that their DOM is easier to parse which could give you easier access to the data you need - unless you are specifically looking for Google data?
 

Yes, I'm specifically looking at Google. I want to automate a specific process, currently done manually, that looks at several things that show up on the first page of a Google search result. A lot of folks who want this info currently outsource it to people in Asia.


I've reposted this from a different perspective, because people seem more fixated on the fact that I'm trying to parse a Google page or get search results than on what the actual problem is.

 

Please respond to my new question instead.

 

 

