I'm not gonna lie; I saw it here first.


Before we get really into it, let's talk about some applications of web scraping.

Web scraping for the greater good

Frequently, data providers will not provide friendly ways of accessing their data. For example, a provider may wish to keep their content mostly for the sake of human readers, or the content may never have been intended for use by a third party. That being said, there are many times when this data is not just helpful, but necessary for third parties. You could, of course, ask directly for this data, but often the data will update too frequently and the vendor is opposed to making an API for it, or they just don't want you to have that data for your own work. Ben Bernard has a pretty good rundown of the legality of web scraping; do make sure you won't get in trouble for the data you collect.

Gimme an example

As a quick example, consider that TAMU provides information about their classes (examples: 1, 2, 3), but it's not exactly provided in an easy-to-parse way, such as JSON or XML. Instead, in most cases, they opt to dynamically generate HTML content for their viewers, since they don't expect third parties to use this data for anything.

You can read more about the TAMU course web scraping here.

Why would I do this?

You may need to classify some of your own data, or your own users, in relation to someone else's data.

So what's next?

Well, let's do some scraping with Rust, yeah?