Dave Nottage 557 Posted August 19, 2023

I have some code that queries mvnrepository via HTTP, specifically looking for packages for Android. It essentially boils down to this:

uses
  System.Net.HttpClient;

procedure TForm1.Button1Click(Sender: TObject);
var
  LHTTP: THTTPClient;
  LResponse: IHTTPResponse;
begin
  LHTTP := THTTPClient.Create;
  try
    LResponse := LHTTP.Get('https://mvnrepository.com/search?q=play+services+maps'); // for example
    // At this point, LResponse.StatusCode is 403 :-(
    // The same query works in regular browsers
  finally
    LHTTP.Free;
  end;
end;

Except that recently, as per the comments, it now returns a 403 result, with this content (truncated):

<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex,nofollow"><meta name="viewport" content="width=device-width,initial-scale=1"><link href="/cdn-cgi/styles/challenges.css" rel="stylesheet"></head><body class="no-js"><div class="main-wrapper" role="main"><div class="main-content"><noscript><div id="challenge-error-title"><div class="h2"><span class="icon-wrapper"><div class="heading-icon warning-icon"></div></span><span id="challenge-error-text">Enable JavaScript and cookies to continue</span></div></div></noscript></div></div>

...and is followed by a bunch of cryptic JavaScript. If there were an API (such as REST), I'd use that; however, there does not appear to be one, and I haven't had a reply to my email to info@mvnrepository.com. Any ideas as to how I might be able to handle it?
aehimself 396 Posted August 19, 2023

In your browser it works because there is some JavaScript generating / fetching the data from somewhere else, which your browser happily renders. You'll need TEdgeBrowser or something similar to actually render it for you, and then process the visible document.
Dave Nottage 557 Posted August 19, 2023

Just now, aehimself said: You'll need TEdgeBrowser or something similar to actually render it for you

Yeah, I had considered that, but I'd prefer to avoid it.
aehimself 396 Posted August 19, 2023

When you open the site in your browser, you can check all the network calls made for the site to actually load. If you are lucky, there will be an API call which returns the file list in a well-known format. Even if there is one, I don't know whether you are allowed to query that API… the site owner will be able to tell you about the legal parts.
Vincent Parrett 750 Posted August 20, 2023

I would look at setting the UserAgent to something that mimics a browser; servers often look at that as part of their DDoS defence. Try this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0
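For anyone who wants to experiment with this idea outside Delphi, here is a minimal Python sketch (using only the stdlib urllib; the User-Agent string is the one quoted above). It only builds the request so the header can be inspected, it does not send anything:

```python
import urllib.request

# The Firefox-like User-Agent string suggested above
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
      "Gecko/20100101 Firefox/116.0")

# Build the request without sending it, so the headers can be inspected
req = urllib.request.Request(
    "https://mvnrepository.com/search?q=play+services+maps",
    headers={"User-Agent": UA},
)

# urllib normalizes header names to "Xxxx-yyyy" capitalization
print(req.get_header("User-agent"))
```

In Delphi the equivalent is setting the THTTPClient.UserAgent property before calling Get. As the follow-up posts show, though, a browser-like User-Agent alone is not enough to get past a Cloudflare JavaScript challenge.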
Dave Nottage 557 Posted August 20, 2023

7 minutes ago, Vincent Parrett said: Try this

Already did
Vincent Parrett 750 Posted August 20, 2023

I just tried it in Postman and it fails there too... looks like it might be an issue with a Cloudflare challenge - hard to get around that without JS.
Dave Nottage 557 Posted August 20, 2023

23 hours ago, aehimself said: You'll need TEdgeBrowser or something similar to actually render it for you

I've decided to go down this route - it has turned out easier than I expected, especially when using ExecuteJavascript to extract the relevant parts out of the HTML.
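The extraction step described here (pulling the relevant parts out of the HTML once a real browser has rendered it) can be sketched in Python with the stdlib html.parser. The LinkExtractor class and the sample markup are hypothetical, purely to illustrate the idea of harvesting link/text pairs from rendered markup:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical snippet standing in for the browser-rendered page
html = '<div><a href="/artifact/maps">play-services-maps</a></div>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # [('/artifact/maps', 'play-services-maps')]
```

With the TEdgeBrowser approach, the same filtering can instead be done on the JavaScript side (e.g. via document.querySelectorAll), so that only the bits you need come back to the Delphi code.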
Rollo62 536 Posted August 20, 2023 (edited)

Maybe, if you are willing to bundle with Python or the like, a headless browser could help. https://www.zenrows.com/blog/selenium-python-web-scraping#prerequisites But that is usually too fat for a simple task, I would guess. Or if you are willing to think about online services, maybe there are also some online web scraper tools out there which have a limited free tier, but that's also problematic to integrate. https://www.scraping-bot.io/pricing-web-scraper-api/ https://www.parsehub.com/pricing I'm not sure if something like HtmlComponents could handle that; I think the JS part is still missing, and bundling with JS parsers would also be a big task. Maybe there is a full Pascal HTML5, CSS, JS engine out in the wild which I'm not yet aware of? That would be great for my projects too 🙂 Edited August 20, 2023 by Rollo62
Fr0sT.Brutal 900 Posted August 21, 2023

18 hours ago, Rollo62 said: Maybe, if you are willing to bundle with Python or the like, a headless browser could help. https://www.zenrows.com/blog/selenium-python-web-scraping#prerequisites But that is usually too fat for a simple task, I would guess.

There's no compact headless browser now that Phantom has died. Bundling a whole browser install + Python + Selenium libs for a task that could be solved with Chromium libs is nonsense.
Rollo62 536 Posted August 21, 2023

Ok, I didn't know that Phantom.js was suspended; that's a pity. I had played around with the Pyppeteer project a while ago and it's now suspended too, so I had assumed their successors would do well too. They recommend playwright-python as a successor, but I never tested that. Not sure if Pyppeteer was based on Phantom.js; it seems not, according to this info. As far as I knew, Pyppeteer was bundled with Chromium and Python, which would be large, but reasonable as a standalone scraper. Maybe Cef4Delphi could also be usable for such a task - do you have experience with that?
Fr0sT.Brutal 900 Posted August 21, 2023

Well, WebDriver libs just give you a convenient interface to the browser automation API, which is no more than a REST API. The WebDriver API is a standard, and the browser implementation could be anything (Chrome, Opera, Firefox, Edge...). Phantom was just one of those. I have my scraper able to switch between Phantom, Chrome and Firefox and work with the same code. As for CEF4D, I have no experience with it, but some guys here have, and I believe they could help if you encounter any issues.
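To illustrate the point that WebDriver is just a REST API underneath: a session is created by POSTing a JSON capabilities object to the driver's /session endpoint. The sketch below only constructs the request (nothing is sent); the port 9515 is chromedriver's default, assumed here for illustration:

```python
import json
import urllib.request

# W3C WebDriver "New Session" payload; capabilities select the browser
payload = {"capabilities": {"alwaysMatch": {"browserName": "chrome"}}}

req = urllib.request.Request(
    "http://localhost:9515/session",  # chromedriver's default endpoint (assumed)
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.method, req.full_url)  # POST http://localhost:9515/session
```

Every Selenium/WebDriver binding, whatever the language, ultimately issues calls like this; swapping Chrome for Firefox mostly means talking to geckodriver instead of chromedriver.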
Joseph MItzen 251 Posted August 29, 2023

On 8/21/2023 at 7:12 AM, Rollo62 said: Ok, I didn't know that Phantom.js was suspended; that's a pity. I had played around with the Pyppeteer project a while ago and it's now suspended too, so I had assumed their successors would do well too. They recommend playwright-python as a successor, but never tested that. Not sure if Pyppeteer was based on Phantom.js; it seems not, according to this info. As far as I knew, Pyppeteer was bundled with Chromium and Python, which would be large, but reasonable as a standalone scraper. Maybe Cef4Delphi could also be usable for such a task - do you have experience with that?

Playwright and Playwright-Python (there are Playwright bindings for several languages) are fantastic! I believe the folks who wrote the Puppeteer library (which automates Chromium) later moved to Microsoft to create Playwright, which works with all the major HTML engines and incorporates some nifty features such as automatic waiting for elements. It's well-documented, too. Lots of nice features, including being able to save and load context (to preserve things such as login cookies).
Some sample code from a program I wrote that needed to automate some actions with the Internet Archive.

Logging in and saving context so I never have to do it again:

browser = playwright.firefox.launch()
context = browser.new_context()
page = context.new_page()
page.goto("https://archive.org/")
page.get_by_role("link", name="Log in").click()
page.get_by_label("Email address").fill("jgm@myself.com")
page.get_by_label("Password").fill("**********")
page.get_by_role("button", name="Log in").click()
context.storage_state(path=STATE_FILE)

There's an expect function with an optional timeout that can be used to wait for things:

page.goto(url)
borrow_button = page.get_by_role("button", name="Borrow for 1 hour")
expect(borrow_button).to_be_visible(timeout=60000)
borrow_button.click()
return_button = page.get_by_role("button", name="Return now")
expect(return_button).to_be_visible(timeout=60000)

And I forgot one of the coolest things - it can run visible or headless. There's a mode you can start it in so that the browser is visible, along with another editing window. Then you can just click and type in the browser window, and the code it would take to replicate those actions appears in the editing window! This is a super-quick way to start a project - no need to start searching through the HTML looking for object names, etc. Just start interacting with the website and all the appropriate code is determined and generated for you. Copy and paste that into your project source code, tweak as appropriate, and you're good to go. There are lots of other nice features, including being able to emulate mobile browsers.
Dmitry Arefiev 101 Posted August 30, 2023 (edited)

On Win7 the same code returns status 200. Maybe it's something related to WinHTTP settings? If you want, you can try to mimic Firefox more closely, at least by adding the headers it sends by default. Edited August 30, 2023 by Dmitry Arefiev
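A sketch of this suggestion in Python, again with the stdlib urllib: the header names and values below are typical of what a desktop Firefox sends with a navigation request, but treat the exact values as an assumption. Only the request object is built, nothing is sent:

```python
import urllib.request

# Headers a desktop Firefox typically sends with a navigation request (assumed values)
FIREFOX_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
                  "Gecko/20100101 Firefox/116.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

req = urllib.request.Request(
    "https://mvnrepository.com/search?q=play+services+maps",
    headers=FIREFOX_HEADERS,
)

for name, value in sorted(req.header_items()):
    print(f"{name}: {value}")
```

In Delphi the same headers can be passed via the AHeaders parameter of THTTPClient.Get. Note, though, that even with matching headers, a Cloudflare challenge that requires executing JavaScript can still return 403.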