KeepTheHonesty

web scraping or web parsing for project?


Hello guys, I am running an online shop at the moment, so I need software that can download all of my competitors' data (item descriptions, prices, etc.).

I recently heard that this is called parsing, but I couldn't find much information about it on the internet. A friend of mine advised me to use a web scraping service, which, as far as I understand, is almost the same thing as parsing. Can you please tell me what the difference is between the two, and which one fits my case best? Sorry for this kind of question, but there isn't much information about either of them online (only this page, https://meta-guide.com/quora/what-is-the-difference-between-crawling-parsing-and-scraping, from which I managed to understand literally nothing).
Imagine you are explaining it to a kid :classic_biggrin:


Maybe it is simpler to imagine this way:

- downloading the content from websites = scraping

- inspecting that content and extracting something meaningful = parsing

 

So when you scrape a website, the two concepts work hand in hand: get the content, extract anything meaningful, identify any links, repeat...
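In code, that loop looks roughly like this; just a sketch in Python, assuming the third-party requests and beautifulsoup4 packages:

```python
# Sketch of the scrape-then-parse loop: download content (scraping), then
# inspect it and extract something meaningful (parsing).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_and_parse(url):
    html = requests.get(url, timeout=10).text           # scraping: get the content
    soup = BeautifulSoup(html, "html.parser")            # parsing: build a tree from it
    title = soup.title.string if soup.title else None    # extract something meaningful
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, links                                   # ...then repeat on the links

title, links = scrape_and_parse("https://example.com")
print(title, links[:5])
```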

 

Parsing isn't an undocumented subject. It is often taught in courses on compiler theory ... but it doesn't necessarily have to relate to compilers. Also, the way content can be parsed can vary:

- you can be formal and parse according to some language's grammatical rule set

- you can cheat - if you know something about the content, you could do some Pos() + Copy() to get what you want out of it

- you could even use regular expressions to extract data based on patterns you are interested in... e.g. emails or URLs are easily identifiable (see the sketch below)
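For the regular-expression case, a minimal sketch (the patterns are deliberately simplified):

```python
# Pull emails and URLs out of arbitrary text by pattern, with no formal grammar.
import re

text = "Contact sales@example.com or see https://example.com/pricing for details."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
urls = re.findall(r"https?://[^\s\"'<>]+", text)

print(emails)  # ['sales@example.com']
print(urls)    # ['https://example.com/pricing']
```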

 

On 9/19/2021 at 4:10 PM, KeepTheHonesty said:

Hello guys, I am running an online shop at the moment, so I need software that can download all of my competitors' data (item descriptions, prices, etc.).
 

Hello guy, before you do this you should know that in most cases, what you're proposing to do is going to violate copyright laws in most countries. Are you sure you want to take that risk?

 

To answer your question, "screen-scraping" used to be a means by which applications written to run on PCs were made to interact with virtual terminals attached to what were usually bigger time-sharing computers (mainframes and minicomputers). There would be a form on a screen, and the data fields were always in certain fixed positions on the screen. (This was a time when terminals were like TVs that showed text as green or white on a black background; they had 80 columns and around 24 lines.) The software would "scrape" (basically, "copy") the data out of those fixed locations and put it into its own variables. (Crawling is not really related to what you're asking.)

 

For web sites, this approach (page scraping) isn't very practical for a number of reasons, unless the material you're trying to scrape is from a report that uses fixed-width typeface on a simulated paper sheet. 

 

Rather, it's much easier to just take the raw page data, which is HTML, and "parse" it to find the fields you want. But this is not nearly as simple as it sounds, especially if CSS is involved.
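For instance, with an HTML parser the "find the fields you want" step might look like this sketch (the class names are hypothetical; every site names things differently):

```python
# Extract product fields from raw HTML with BeautifulSoup; "product-title" and
# "price" are made-up class names used only for illustration.
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="product-title">Blue Widget</h2>
  <span class="price">19.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):
    title = product.select_one(".product-title").get_text(strip=True)
    price = product.select_one(".price").get_text(strip=True)
    print(title, price)   # Blue Widget 19.99
```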

 

Parsing is a rather complex topic and is usually taught as part of a computer science course on compiler construction.

 

There are two parts to building a compiler: the first is parsing the input file; the second is emitting some other code based on the input stream, usually a bunch of machine instructions that get "fixed up" by another program called a "linker". Most compilers are "two-pass" compilers. (Some do more than two passes.) In the first pass, the parser builds what's called a "parse tree" that the second pass walks and processes, either depth-first or breadth-first. As it walks that tree, it emits code. If there's any kind of optimization going on, the tree can be adjusted first to eliminate unused code and combine parts of the tree that are semantic duplicates.
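You can see that pipeline in miniature with Python's built-in ast module: one pass parses the source into a tree, a second pass walks it depth-first and "emits" something for each node (here just a printout instead of machine code):

```python
# Pass 1: parse source text into a parse tree (an AST).
# Pass 2: walk the tree depth-first and emit something per node.
import ast

tree = ast.parse("total = price * qty + shipping")

def emit(node, depth=0):
    print("  " * depth + type(node).__name__)   # stand-in for emitting real code
    for child in ast.iter_child_nodes(node):
        emit(child, depth + 1)                   # depth-first traversal

emit(tree)
```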

 

Every programming language, and most things even down to simple math equations, need to be parsed in some way.

 

In this case, what you're emitting is not machine instructions but the content you find in and around the HTML tags. Also, be aware that CSS is often used to position content in different places on the page, and it does not have to appear in the same order as the text around it. That is, the page could lay down a bunch of boxes or tables, then the headers, then the footers, then the data -- either in columns or rows. And there can be text data with CSS tags that say to hide it, or to display it in the same color as the background so it's invisible. That can be done to scatter garbage all over the place that your parser will think is legitimate content, but a user looking at the page in their web browser won't see any of it.

 

Parsing the HTML (which is a linear encoding) won't give you any clue that the content is being generated in some apparently random pattern. So what your parser puts out could look like someone sucked up the words in a chapter of a book and just randomly spat them out across the pages. You'll have to sit there and study it, look closely at the CSS, and figure out how to unravel it all. Then the next page you get could use a different ordering and you'll be back to square one.

 

The thing is, with the increase in the use of client-side javascript and CSS to encode content and render it in a non-linear order, it's getting harder and harder to algorithmically extract content from a web page.

 

It should also be fairly simple to render the input data on the virtual page that's displayed on the screen, which is what the web browser seems to be showing you. But my experience is that's not necessarily the case. You could always take the bitmap image from the browser's canvas and process it with OCR and see if that's simpler than parsing the input stream.
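If you want to try the OCR route, a rough sketch looks like this (assumes Chrome with chromedriver, plus the selenium and pytesseract packages and a local Tesseract install):

```python
# Render the page in a headless browser, screenshot it, and OCR the image.
from selenium import webdriver
from PIL import Image
import pytesseract

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    driver.save_screenshot("page.png")     # bitmap of the rendered page
finally:
    driver.quit()

print(pytesseract.image_to_string(Image.open("page.png")))   # extracted text
```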

 

It really just depends on how much effort the vendor wants to put into making it difficult to extract their content.

 

For example, run a simple google search query and then look at the page source. Copy it all and paste it into a text editor and turn on word-wrapping. Good luck making sense of it! 

 

Repeat this by running the SAME QUERY a few times, then do a diff on each of the resulting files (if it's not obvious just looking at them) and you'll see what you're dealing with.

 

Pick some phrases in the original page and search for them in the text editor. Some you'll find, and most you won't.

 

Yes, it's fully HTML-compliant code. Yes it can be parsed with any HTML parser. But I guarantee you that a lot of the text you want is embedded inside of encoded javascript methods, and no two take the same approach. To make matters worse, they change the names of the javascript functions and parameter names from one query to the next, so you can't even build a table of common functions to look for.

 

So you'll need a javascript parser that can execute enough of it to extract the content, but not go any further.
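A common shortcut, rather than writing that parser yourself, is to let a headless browser execute the scripts and then parse the DOM it produces; roughly (again assuming selenium and chromedriver):

```python
# Let a real browser engine run the JavaScript, then parse the resulting DOM.
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    rendered_html = driver.page_source      # HTML *after* the scripts have run
finally:
    driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")
print(soup.get_text(" ", strip=True)[:200])
```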

 

A lot of it is structured like a macro language, and it uses multiple levels of "encoding" or "embedding". When you try unwinding it, you don't know how deep it goes, and if you're not fully executing the code at the same time as parsing it, you can end up "over-zooming" and miss what you're looking for. They can also bury little land-mines in the code and if you try decoding something it can get scrambled if you don't have something else loaded in the run-time environment that's totally unrelated. Or it could use a function loaded way earlier that doesn't look related that unscrambles some important bit of text that will stop the parser or run-time interpreter if it's not correct, and what you'll end up with is just a bunch of gibberish.

 

It used to be fairly easy to parse Google search result pages (SERPs), up until 2-3 years ago when they started making them quite hard. Some other sites are starting to do this now, like Amazon, eBay, and others. Why? Because it's the only way to deter people from stealing their copyrighted content! They know that YOU DON'T GIVE A RIP about THEIR COPYRIGHTS. And it's a lot easier now to use multiple layers of javascript encoding to hide content than anything else.

 

I've also seen CSS used to embed content. CSS is NOT HTML. You can parse HTML and end up with just a bunch of CSS tags and little if any content. Good luck with that as well. So now you need a CSS parser!

 

Is your head spinning yet?

Ask yourself how much you think the client is willing to pay you to figure all of that out, and realize that it's all being constantly changed by the vendor, so it's a moving target that will work one day and not the next.

 

Honestly, if you're going to risk stealing copyrighted content from other sites, hire people in China or India and pay them per-piece to copy stuff by hand from the other systems into yours. It will be a lot faster than trying to write and maintain code that parses all of this stuff. (I found there are a few companies that do this for Google SERPs and they charge a bit to get the data you want. Maybe some exist for the sites you're interested in robbing of their intellectual property?)

 

Even if it's not that complex to parse their data, pray that you don't get caught.

 

TIP: the fact that you asked your question the way you did tells me you're looking at a minimum of 6-9 months to write the software you think you need, if you can even get it working steadily, because you're going to have to learn a lot about parsing first. I suggest you hire someone with a computer science degree who already knows this stuff.

 

TIP: writing a parser can look seductively simple at first. It's not. And the obvious ways of digging into it without understanding anything about parsing will usually lead you into a dead-end alley. HTML is fairly easy to parse, and there are several components you can get that do an excellent job of it. But again, CSS and javascript are not HTML, and they'll stop you dead in your tracks if they're used to obfuscate content in any way, or even in the case of CSS to do page layout in a non-linear fashion.

 


4 hours ago, corneliusdavid said:

I would suggest looking at one of Embarcadero's latest acquisitions, ApiLayer; they have a web-scraping API. Don't know anything about it but "Turn web pages into actionable data" (scraped from their website) sounds like what you're trying to do.

That's pretty interesting... ScrapeStack lets you unwind the javascript encoding, but they say it returns a raw HTML page, which means it still has to be parsed. So CSS can still trip you up unless the headless browser used for unwinding javascript handles that for you as well. (They don't say, but ... it probably does.) So even if there's no javascript that needs to be processed, you might want to select that option anyway just to get the CSS processed.
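Calling such a service is just an HTTP request followed by your own parsing; here's a rough sketch (the endpoint and parameter names like access_key, url, and render_js are my assumptions from their marketing pages, so check the ScrapeStack docs before relying on them):

```python
# Ask a hosted scraping API to fetch (and optionally JS-render) a page,
# then parse the raw HTML it returns locally.
import requests
from bs4 import BeautifulSoup

params = {
    "access_key": "YOUR_API_KEY",            # placeholder
    "url": "https://example.com/product",
    "render_js": 1,                          # assumed name for the JS-rendering option
}
resp = requests.get("https://api.scrapestack.com/scrape", params=params, timeout=30)

soup = BeautifulSoup(resp.text, "html.parser")   # still has to be parsed
print(soup.title.string if soup.title else "(no title)")
```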

 

I played with SerpStack a bit, but then I found ScaleSerp and I like it better. Both return a JSON data packet. My only beef with it is that it has several types of queries, and while they have very similar results, it's as if different people wrote each one, because the field names used in the JSON data for the same data items often aren't the same. So the code that processes each of the different types of queries needs to be different. Luckily I'm only interested in a half-dozen of the fields returned in each of the different JSON packets. But it's just odd to see.
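The workaround I ended up with is a small mapping layer that translates each query type's field names into one common set; something like this (the field names below are made up for illustration, not ScaleSerp's actual ones):

```python
# Normalize per-query-type JSON field names into one common shape.
FIELD_MAP = {
    "web":  {"title": "title",    "url": "link"},
    "news": {"title": "headline", "url": "source_url"},
}

def normalize(result: dict, query_type: str) -> dict:
    mapping = FIELD_MAP[query_type]
    return {common: result.get(api_field) for common, api_field in mapping.items()}

print(normalize({"headline": "Example", "source_url": "https://example.com"}, "news"))
# {'title': 'Example', 'url': 'https://example.com'}
```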

 

I'm mentioning this because ... this represents the "state of the art" at this point in time. We're early-on in this technology curve.

 

(I've been playing with this code for a couple of months, and the data results have changed several times because Google changes their page layouts and encoding mechanisms a lot. I've even found some bugs in the JSON data. It's worrisome to think of publishing a product that relies on something like this when you KNOW that what you're working with is a moving target.)

 

 

On 9/22/2021 at 7:10 PM, David Schwartz said:

Honestly, if you're going to risk stealing copyrighted content from other sites

Performing open source intelligence on competitors' pricing isn't a copyright violation; depending upon how it's done it may be a violation of the site's terms of service though. Now if they're actually going to use the competitors' product descriptions and photos then yes, that would be a copyright violation as well.


On 9/22/2021 at 7:10 PM, David Schwartz said:

Is your head spinning yet?

It sounds like you hate web scraping. :classic_biggrin: Meanwhile, I've developed a new love of web scraping... it unlocks all sorts of amazing possibilities. Delphi's really not the best tool for the job, though. But if you use selenium to control a headless browser and deal with the JavaScript for you, beautifulsoup to parse a page, and perhaps scrapy if you need to do web crawling... things become a lot simpler.

 

My brother has a pre-existing medical condition that made him quite vulnerable during the COVID lockdown. Food stores were offering a service where you could order online and they would shop for you and have contactless pickup, but there were only so many time slots available each day and demand was high. The first time my brother managed to get an order in, he had to do so at a store an hour away! I wrote a program that used selenium to log into his local grocery store's website and navigate to the time-reservation page, then passed the post-JavaScript HTML to beautifulsoup to extract and parse the time table. If there were new openings, it would email my brother to let him know. This was a big help to him and saved him two-hour round trips.
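The core of it was only a few dozen lines; this sketch shows the shape of it (every URL, element ID, selector, and address below is a made-up placeholder):

```python
# Log in with selenium, grab the reservation page after its JavaScript runs,
# parse it with BeautifulSoup, and send an email if open slots are found.
import smtplib
from email.message import EmailMessage
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
try:
    driver.get("https://grocery.example.com/login")
    driver.find_element(By.ID, "email").send_keys("me@example.com")
    driver.find_element(By.ID, "password").send_keys("secret")
    driver.find_element(By.ID, "submit").click()

    driver.get("https://grocery.example.com/pickup-times")
    html = driver.page_source                  # HTML after the scripts have run
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
open_slots = [td.get_text(strip=True) for td in soup.select("table.slots td.open")]

if open_slots:
    msg = EmailMessage()
    msg["Subject"] = "Pickup slots available!"
    msg["From"] = msg["To"] = "me@example.com"
    msg.set_content("\n".join(open_slots))
    with smtplib.SMTP("localhost") as server:  # assumes a local mail server
        server.send_message(msg)
```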

 

I've scraped my Amazon wish lists and then used beautifulsoup and requests to search the online lending library at archive.org to find books I was interested in that were available to read for free through archive.org.

 

I used some web scraping to create a script that checks if a piece of open source software I use has a new version available, and if so to download, compile and install it in one go for me.

 

I'm competing in an online horse race handicapping contest, and another script takes the races for the contest and scrapes the race track's YouTube page to let me know when the video from the relevant races is up (YouTube is an ugly JSON-filled mess to scrape, unfortunately).

 

I've been on the lookout for more things to scrape too! It can take as little as a line or two of code, unless the website takes drastic measures to avoid it (*cough* Equibase *cough*), but then just a few more lines of code to bounce your page requests through TOR, resetting the connection each time to get a new exit node at a different spot in the world, will take care of that. :classic_biggrin:
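The TOR part is also only a handful of lines; a sketch, assuming a local TOR daemon (SOCKS on 9050, control port 9051) plus the requests[socks] and stem packages:

```python
# Route requests through TOR and request a new circuit (new exit node) each time.
import time
import requests
from stem import Signal
from stem.control import Controller

PROXIES = {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"}

def new_tor_identity():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()          # assumes cookie auth / no password set
        controller.signal(Signal.NEWNYM)   # ask TOR for a new circuit
    time.sleep(5)                          # give it a moment to rebuild

for _ in range(3):
    print(requests.get("https://httpbin.org/ip", proxies=PROXIES, timeout=30).text)
    new_tor_identity()
```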


Guest
6 hours ago, Joseph MItzen said:

Delphi's really not the best tool for the job, though. But if you use selenium to control a headless browser and deal with the JavaScript for you, beautifulsoup to parse a page, and perhaps scrapy if you need to do web crawling... things become a lot simpler.

Hmmm... sure, parsing in Delphi may seem a bit more convoluted, as the objects in JS are more "native".

But you now have a "toolchain" of at least three npm/something libs. That in itself is a mess IMHO.

On 9/23/2021 at 8:58 PM, Joseph MItzen said:

It sounds like you hate web scraping. :classic_biggrin: Meanwhile, I've developed a new love of web scraping... it unlocks all sorts of amazing possibilities. Delphi's really not the best tool for the job, though. But if you use selenium to control a headless browser and deal with the JavaScript for you, beautifulsoup to parse a page, and perhaps scrapy if you need to do web crawling... things become a lot simpler.

 


This is actually quite informative and very inspiring! But I don't think this is the kind of thing the OP meant when he said, "which can download all data of competitors (items description, prices etc.)". It read as if he wants to populate his own sites from data extracted from his competitors' sites.

 

I've written lots of things that extract data from HTML pages over the years, but some of the mechanisms being used to prevent that lately make it more of a hassle than fun. I'd not heard of these tools, so it's good to learn of them. Thanks!

