Dave Nottage

Querying mvnrepository

Recommended Posts

I have some code that queries mvnrepository via HTTP, specifically looking for packages for Android. It essentially boils down to this:

uses
  System.Net.HttpClient;

procedure TForm1.Button1Click(Sender: TObject);
var
  LHTTP: THTTPClient;
  LResponse: IHTTPResponse;
begin
  LHTTP := THTTPClient.Create;
  try
    LResponse := LHTTP.Get('https://mvnrepository.com/search?q=play+services+maps'); // for example
    // At this point, LResponse.StatusCode is 403 :-(
    // The same query works in regular browsers
  finally
    LHTTP.Free;
  end;
end;

Except that recently, as per the comment, it now returns a 403 result, with this content (truncated):

<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex,nofollow"><meta name="viewport" content="width=device-width,initial-scale=1"><link href="/cdn-cgi/styles/challenges.css" rel="stylesheet"></head><body class="no-js"><div class="main-wrapper" role="main"><div class="main-content"><noscript><div id="challenge-error-title"><div class="h2"><span class="icon-wrapper"><div class="heading-icon warning-icon"></div></span><span id="challenge-error-text">Enable JavaScript and cookies to continue</span></div></div></noscript></div></div>

...and is followed by a bunch of cryptic JavaScript.
 

If there were an API (such as REST) I'd use that; however, there does not appear to be one, and I haven't had a reply to my email to info@mvnrepository.com.

 

Any ideas as to how I might be able to handle it?


It works in your browser because there is some JavaScript generating / fetching the data from somewhere else, which your browser happily renders.

You’ll need TEdgeBrowser or something similar to actually render it for you and then process the visible document.


Just now, aehimself said:

You’ll need TEdgeBrowser or something similar to actually render it for you

Yeah, I had considered that, but I'd prefer to avoid it.


When you open the site in your browser you can check all the network calls made for the site to actually load. If you are lucky there will be an API call which returns the file list in a well-known format.
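On that note, mvnrepository mostly indexes artifacts from Maven Central, and Maven Central itself does expose a documented search API at search.maven.org. A minimal stdlib Python sketch - the endpoint and parameter names come from Central's search docs, not from this thread, so treat them as an assumption to verify:

```python
import json
import urllib.parse
import urllib.request

def central_search_url(query: str, rows: int = 20) -> str:
    """Build a search URL for the Maven Central search API
    (https://search.maven.org/solrsearch/select) - assumed endpoint."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return "https://search.maven.org/solrsearch/select?" + params

url = central_search_url("play services maps")
print(url)
# Uncomment to actually query (network access required):
# with urllib.request.urlopen(url) as resp:
#     for doc in json.load(resp)["response"]["docs"]:
#         print(doc["g"], doc["a"], doc["latestVersion"])
```

As I recall, the JSON response carries groupId/artifactId/latestVersion fields under response.docs, which may be enough for the Android-package lookup described above - worth checking whether it covers the same packages before building on it.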

 

Even if there is one, however, I don't know if you are allowed to query that API… the site owner will be able to tell you about the legal side.



I would look at setting the UserAgent to something that mimics a browser; servers often look at that as part of their DDoS defence.

 

Try this

 

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0
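For a quick experiment outside Delphi, here is a sketch along those lines in stdlib Python, sending that exact user-agent string (the extra Accept headers are my additions; in THTTPClient the equivalent is setting LHTTP.UserAgent). Whether this is enough depends entirely on the server's bot detection:

```python
import urllib.request

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
      "Gecko/20100101 Firefox/116.0")

def browserish_request(url: str) -> urllib.request.Request:
    """Build a request carrying browser-like headers."""
    headers = {
        "User-Agent": UA,
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.5",
    }
    return urllib.request.Request(url, headers=headers)

req = browserish_request("https://mvnrepository.com/search?q=play+services+maps")
print(req.get_header("User-agent"))
```

Note that a JavaScript challenge page will not be satisfied by headers alone, so this only helps if the block is purely UA-based.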


I just tried it in Postman and it fails there too... looks like it might be an issue with a Cloudflare challenge - hard to get around that without JavaScript.


23 hours ago, aehimself said:

You’ll need TEdgeBrowser or something similar to actually render it for you

I've decided to go down this route - it has turned out easier than I expected, especially when using ExecuteJavascript to extract the relevant parts out of the HTML.


Maybe, if you are willing to bundle with Python or the like, a headless browser could help.

https://www.zenrows.com/blog/selenium-python-web-scraping#prerequisites

But that is usually too heavyweight for a simple task, I would guess.

 

Or, if you are willing to consider online services, there are also some online web-scraper tools out there
which have a limited free tier, but those are also problematic to integrate.
https://www.scraping-bot.io/pricing-web-scraper-api/

https://www.parsehub.com/pricing

 

I'm not sure if something like HtmlComponents could handle that; I think the JS part is still missing, and bundling with JS parsers would also be a big task.

Maybe there is a full Pascal HTML5, CSS, JS engine out in the wild which I'm not yet aware of?

That would be great for my projects too 🙂

 

Edited by Rollo62

18 hours ago, Rollo62 said:

Maybe, if you are willing to bundle with Python or the like, a headless browser could help.

https://www.zenrows.com/blog/selenium-python-web-scraping#prerequisites

But that is usually too heavyweight for a simple task, I would guess.

There's no compact headless browser now that PhantomJS has died. Bundling a whole browser install + Python + Selenium libs for a task that could be solved with Chromium libs is nonsense.


Ok, I didn't know that PhantomJS was discontinued, that's a pity.

I had played around with the Pyppeteer project a while ago and it's now discontinued too, so I had assumed its successors would do well.

They recommend playwright-python as the successor, but I never tested that.

 

Not sure if Pyppeteer was based on PhantomJS; it seems not, according to this info.

As far as I knew, Pyppeteer was bundled with Chromium and Python, which would be large, but reasonable as a standalone scraper.

 

Maybe CEF4Delphi could also be usable for such a task - do you have experience with that?


Well, WebDriver libs just give you a convenient interface to the browser automation API, which is no more than a REST API. The WebDriver API is a standard, and the browser implementation can be any of them (Chrome, Opera, Firefox, Edge...). Phantom was just one of those. I have my scraper able to switch between Phantom, Chrome and Firefox and work with the same code.
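To make that concrete: under the W3C WebDriver specification the automation interface is plain HTTP+JSON against a locally running driver process, and the payloads are browser-agnostic. A small illustrative sketch - the driver address and endpoint layout follow the W3C spec, not anything from this thread:

```python
import json

DRIVER = "http://127.0.0.1:4444"  # where geckodriver/chromedriver typically listen (assumed)

def new_session_payload(browser_name: str) -> bytes:
    """W3C WebDriver 'New Session' body - the same shape for every browser."""
    caps = {"capabilities": {"alwaysMatch": {"browserName": browser_name}}}
    return json.dumps(caps).encode()

def navigate_payload(url: str) -> bytes:
    """Body for POST /session/{id}/url."""
    return json.dumps({"url": url}).encode()

# The whole protocol is just endpoints like:
#   POST {DRIVER}/session              -> returns a sessionId
#   POST {DRIVER}/session/{id}/url     -> navigate
#   GET  {DRIVER}/session/{id}/source  -> fetch rendered page source
print(new_session_payload("firefox").decode())
```

Swapping browsers is then just a different browserName in the capabilities, which is why the same scraper code can drive Phantom, Chrome or Firefox.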

 

As for CEF4Delphi, I have no experience with it, but some guys here do, and I believe they could help if you encounter any issues.


On 8/21/2023 at 7:12 AM, Rollo62 said:

Ok, I didn't know that PhantomJS was discontinued, that's a pity.

I had played around with the Pyppeteer project a while ago and it's now discontinued too, so I had assumed its successors would do well.

They recommend playwright-python as the successor, but I never tested that.

 

Not sure if Pyppeteer was based on PhantomJS; it seems not, according to this info.

As far as I knew, Pyppeteer was bundled with Chromium and Python, which would be large, but reasonable as a standalone scraper.

 

Maybe CEF4Delphi could also be usable for such a task - do you have experience with that?

 

 

Playwright and playwright-python (there are Playwright bindings for several languages) are fantastic!

 

I believe Google wrote the Puppeteer library, which automated Chrome; some of the folks behind it moved to Microsoft to create Playwright, which works with all the major HTML engines and incorporates some nifty features such as automatic waiting for elements. It's well-documented, too.

Lots of nice features, including being able to save and load context (to preserve things such as login cookies).

 

Some sample code from a program I wrote that needed to automate some actions with the Internet Archive:

 

Logging in and saving context so I never have to do it again:

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.firefox.launch()
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://archive.org/")
    page.get_by_role("link", name="Log in").click()
    page.get_by_label("Email address").fill("jgm@myself.com")
    page.get_by_label("Password").fill("**********")
    page.get_by_role("button", name="Log in").click()
    # the saved state can later be reloaded with browser.new_context(storage_state=STATE_FILE)
    context.storage_state(path=STATE_FILE)

There's an expect function with optional timeout that can be used to wait for things:

from playwright.sync_api import expect

page.goto(url)
borrow_button = page.get_by_role("button", name="Borrow for 1 hour")
expect(borrow_button).to_be_visible(timeout=60000)  # wait up to 60 seconds
borrow_button.click()
return_button = page.get_by_role("button", name="Return now")
expect(return_button).to_be_visible(timeout=60000)

And I forgot one of the coolest things - it can run visible or headless. There's a recording mode (started with "playwright codegen <url>") where the browser is visible alongside an editing window: you just click and type in the browser, and the code that replicates those actions appears in the editor. This is a super-quick way to start a project - no need to search through the HTML looking for object names; just interact with the website and the appropriate code is generated for you. Copy and paste that into your project source, tweak as appropriate, and you're good to go.
 

There are lots of other nice features, including being able to emulate mobile browsers.


On Win7 the same code returns status 200. Maybe it is something related to WinHttp settings?

If you want, you can try to mimic Firefox more closely, at least by adding the headers which it sends by default.

Edited by Dmitry Arefiev

