Dave Nottage 557 Posted August 19, 2023

I have some code that queries mvnrepository via HTTP, specifically looking for packages for Android. It essentially boils down to this:

uses
  System.Net.HttpClient;

procedure TForm1.Button1Click(Sender: TObject);
var
  LHTTP: THTTPClient;
  LResponse: IHTTPResponse;
begin
  LHTTP := THTTPClient.Create;
  try
    LResponse := LHTTP.Get('https://mvnrepository.com/search?q=play+services+maps'); // for example
    // At this point, LResponse.StatusCode is 403 :-(
    // The same query works in regular browsers
  finally
    LHTTP.Free;
  end;
end;

Except that recently, as per the comments, it now returns a 403 result, with this content (truncated):

<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex,nofollow"><meta name="viewport" content="width=device-width,initial-scale=1"><link href="/cdn-cgi/styles/challenges.css" rel="stylesheet"></head><body class="no-js"><div class="main-wrapper" role="main"><div class="main-content"><noscript><div id="challenge-error-title"><div class="h2"><span class="icon-wrapper"><div class="heading-icon warning-icon"></div></span><span id="challenge-error-text">Enable JavaScript and cookies to continue</span></div></div></noscript></div></div>

...and is followed by a bunch of cryptic JavaScript. If there were an API (such as REST), I'd use that; however, there does not appear to be one, and I haven't had a reply to my email to info@mvnrepository.com. Any ideas as to how I might be able to handle it?
aehimself 396 Posted August 19, 2023

In your browser it works because there is some JavaScript generating / fetching the data from somewhere else, which your browser happily renders. You'll need TEdgeBrowser or something similar to actually render it for you, and then process the visible document.
Dave Nottage 557 Posted August 19, 2023

Just now, aehimself said: You'll need TEdgeBrowser or something similar to actually render it for you

Yeah, I had considered that, but I'd prefer to avoid it.
aehimself 396 Posted August 19, 2023

When you open the site in your browser, you can check all the network calls made for the site to actually load. If you are lucky, there will be an API call which returns the file list in a well-known format. Even if there is one, I don't know whether you are allowed to query that API… the site owner will be able to tell you about the legal parts.
Vincent Parrett 750 Posted August 20, 2023

I would look at setting the UserAgent to something that mimics a browser; servers often look at that as part of their DDoS defence. Try this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0
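For anyone who wants to experiment with this idea outside Delphi, here is a minimal Python sketch (using only the stdlib urllib; the User-Agent string is the one quoted above). It only builds the request so the header can be inspected, it does not send anything:

```python
import urllib.request

# The Firefox-like User-Agent string suggested above
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
      "Gecko/20100101 Firefox/116.0")

# Build the request without sending it, so the headers can be inspected
req = urllib.request.Request(
    "https://mvnrepository.com/search?q=play+services+maps",
    headers={"User-Agent": UA},
)

# urllib normalizes header names to "Xxxx-yyyy" capitalization
print(req.get_header("User-agent"))
```

In Delphi the equivalent is setting the THTTPClient.UserAgent property before calling Get. As the follow-up posts show, though, a browser-like User-Agent alone is not enough to get past a Cloudflare JavaScript challenge.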
Dave Nottage 557 Posted August 20, 2023

7 minutes ago, Vincent Parrett said: Try this

Already did
Vincent Parrett 750 Posted August 20, 2023

I just tried it in Postman and it fails there too... looks like it might be an issue with a Cloudflare challenge - hard to get around that without JS.
Dave Nottage 557 Posted August 20, 2023

23 hours ago, aehimself said: You'll need TEdgeBrowser or something similar to actually render it for you

I've decided to go down this route - it has turned out easier than I expected, especially when using ExecuteJavascript to extract the relevant parts out of the HTML.
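The extraction step described here (pulling the relevant parts out of the HTML once a real browser has rendered it) can be sketched in Python with the stdlib html.parser. The LinkExtractor class and the sample markup are hypothetical, purely to illustrate the idea of harvesting link/text pairs from rendered markup:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical snippet standing in for the browser-rendered page
html = '<div><a href="/artifact/maps">play-services-maps</a></div>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # [('/artifact/maps', 'play-services-maps')]
```

With the TEdgeBrowser approach, the same filtering can instead be done on the JavaScript side (e.g. via document.querySelectorAll), so that only the bits you need come back to the Delphi code.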
Rollo62 536 Posted August 20, 2023 (edited)

Maybe, if you are willing to bundle with Python or the like, a headless browser could help. https://www.zenrows.com/blog/selenium-python-web-scraping#prerequisites But that is usually too fat for a simple task, I would guess. Or if you are willing to think about online services, maybe there are also some online web scraper tools out there which have a limited free tier, but that's also problematic to integrate. https://www.scraping-bot.io/pricing-web-scraper-api/ https://www.parsehub.com/pricing I'm not sure if something like HtmlComponents could handle that; I think the JS part is still missing, and bundling with JS parsers would also be a big task. Maybe there is a full Pascal HTML5, CSS, JS engine out in the wild which I'm not yet aware of? That would be great for my projects too 🙂 Edited August 20, 2023 by Rollo62
Fr0sT.Brutal 900 Posted August 21, 2023

18 hours ago, Rollo62 said: Maybe, if you are willing to bundle with Python or the like, a headless browser could help. https://www.zenrows.com/blog/selenium-python-web-scraping#prerequisites But that is usually too fat for a simple task, I would guess.

There's no compact headless browser now that Phantom has died. Bundling a whole browser install + Python + Selenium libs for a task that could be solved with Chromium libs is nonsense.
Rollo62 536 Posted August 21, 2023

Ok, I didn't know that Phantom.js was suspended; that's a pity. I had played around with the Pyppeteer project a while ago and it's now suspended too, so I had assumed their successors would do well too. They recommend playwright-python as a successor, but I never tested that. Not sure if Pyppeteer was based on Phantom.js; it seems not, according to this info. As far as I knew, Pyppeteer was bundled with Chromium and Python, which would be large, but reasonable as a standalone scraper. Maybe Cef4Delphi could also be usable for such a task - do you have experience with that?
Fr0sT.Brutal 900 Posted August 21, 2023

Well, WebDriver libs just give you a convenient interface to the browser automation API, which is no more than a REST API. The WebDriver API is a standard, and the browser implementation could be anything (Chrome, Opera, Firefox, Edge...). Phantom was just one of those. I have my scraper able to switch between Phantom, Chrome and Firefox and work with the same code. As for CEF4D, I have no experience with it, but some guys here have, and I believe they could help if you encounter any issues.
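To illustrate the point that WebDriver is just a REST API underneath: a session is created by POSTing a JSON capabilities object to the driver's /session endpoint. The sketch below only constructs the request (nothing is sent); the port 9515 is chromedriver's default, assumed here for illustration:

```python
import json
import urllib.request

# W3C WebDriver "New Session" payload; capabilities select the browser
payload = {"capabilities": {"alwaysMatch": {"browserName": "chrome"}}}

req = urllib.request.Request(
    "http://localhost:9515/session",  # chromedriver's default endpoint (assumed)
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.method, req.full_url)  # POST http://localhost:9515/session
```

Every Selenium/WebDriver binding, whatever the language, ultimately issues calls like this; swapping Chrome for Firefox mostly means talking to geckodriver instead of chromedriver.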
Joseph MItzen 251 Posted August 29, 2023

On 8/21/2023 at 7:12 AM, Rollo62 said: Ok, I didn't know that Phantom.js was suspended; that's a pity. I had played around with the Pyppeteer project a while ago and it's now suspended too, so I had assumed their successors would do well too. They recommend playwright-python as a successor, but never tested that. Not sure if Pyppeteer was based on Phantom.js; it seems not, according to this info. As far as I knew, Pyppeteer was bundled with Chromium and Python, which would be large, but reasonable as a standalone scraper. Maybe Cef4Delphi could also be usable for such a task - do you have experience with that?

Playwright and Playwright-Python (there are Playwright bindings for several languages) are fantastic! I believe the folks who wrote the Puppeteer library (which automates Chromium) later moved to Microsoft to create Playwright, which works with all the major HTML engines and incorporates some nifty features such as automatic waiting for elements. It's well-documented, too. Lots of nice features, including being able to save and load context (to preserve things such as login cookies).
Some sample code from a program I wrote that needed to automate some actions with the Internet Archive.

Logging in and saving context so I never have to do it again:

browser = playwright.firefox.launch()
context = browser.new_context()
page = context.new_page()
page.goto("https://archive.org/")
page.get_by_role("link", name="Log in").click()
page.get_by_label("Email address").fill("jgm@myself.com")
page.get_by_label("Password").fill("**********")
page.get_by_role("button", name="Log in").click()
context.storage_state(path=STATE_FILE)

There's an expect function with an optional timeout that can be used to wait for things:

page.goto(url)
borrow_button = page.get_by_role("button", name="Borrow for 1 hour")
expect(borrow_button).to_be_visible(timeout=60000)
borrow_button.click()
return_button = page.get_by_role("button", name="Return now")
expect(return_button).to_be_visible(timeout=60000)

And I forgot one of the coolest things - it can run visible or headless. There's a mode you can start it in so that the browser is visible, along with another editing window. Then you can just click and type in the browser window, and the code it would take to replicate those actions appears in the editing window! This is a super-quick way to start a project - no need to start searching through the HTML looking for object names, etc. Just start interacting with the website and all the appropriate code is determined and generated for you. Copy and paste that into your project source code, tweak as appropriate, and you're good to go. There are lots of other nice features, including being able to emulate mobile browsers.
Dmitry Arefiev 101 Posted August 30, 2023 (edited)

On Win7 the same code returns status 200. Maybe it's something related to WinHTTP settings? If you want, you can try to mimic Firefox more closely, at least by adding the headers it sends by default. Edited August 30, 2023 by Dmitry Arefiev
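A sketch of this suggestion in Python, again with the stdlib urllib: the header names and values below are typical of what a desktop Firefox sends with a navigation request, but treat the exact values as an assumption. Only the request object is built, nothing is sent:

```python
import urllib.request

# Headers a desktop Firefox typically sends with a navigation request (assumed values)
FIREFOX_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
                  "Gecko/20100101 Firefox/116.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

req = urllib.request.Request(
    "https://mvnrepository.com/search?q=play+services+maps",
    headers=FIREFOX_HEADERS,
)

for name, value in sorted(req.header_items()):
    print(f"{name}: {value}")
```

In Delphi the same headers can be passed via the AHeaders parameter of THTTPClient.Get. Note, though, that even with matching headers, a Cloudflare challenge that requires executing JavaScript can still return 403.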