Oh yeah, that wasn't just an example in Web Scraping.

First, let's check the legality

Before we begin, we must trundle through the rules and regulations associated with IT at TAMU in order to ensure that we can scrape. You can see my work in checking the legality of that here.

NOTE (2020-03-12): This information is no longer up to date due to policy changes at TAMU. If you wish to do this yourself, please review the newer policies.

Now, the nitty gritty

We've verified that it's okay for us to scrape this data. Now let's do it.

You can view the final code involved with this page here. I recommend following this page if you want to learn about some web scraping with Rust. I don't assume that you know anything about Rust, so you can skip some parts if you're already familiar with the syntax.

The entrypoint

We start with this page, which was found by the googlebot web crawler (because they forgot to add headers, metatags, and a robots.txt; oops). This page allows users to dynamically select course listings by picking the target semester. Unfortunately for us, that means that there's a form that we need to POST to this host.

For our purposes, we want to scrape the "Spring 2019 - College Station" listings.

First, we need to open up the developer console. I will leave this to the reader to figure out how to do for their browser.

Second, we select our term, "Spring 2019 - College Station", making sure to take note of the select element's name ("p_term") and the value of the option we picked (the second option consistently appears to be the most recent available term for College Station):

Picking our term

After we hit send, we inspect our form data using the network panel of the developer console:

Viewing the form content

A-ha. There's our form data for our entrypoint.
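For reference, the decoded body of that POST boils down to just two fields. The term code shown below is illustrative; use whatever value your second option actually carries:

```
p_term=201911
p_calling_proc=bwckschd.p_disp_dyn_sched
```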

Iterative querying

Now we need to get all the course data. A restriction on this site is that we can only select one major at a time, so let's make a way to iterate through the selections.

As in part one, we just need to pick one and hit submit. I picked "ACCT" and didn't fill out any other form data, as it's irrelevant to us. Just as before, we hit submit and then view the form data in the console.

Viewing the form content, pt.2, electric boogaloo

Oof, that's quite a lot. And it appears they're double-sending quite a bit of data. But now we know what we need to do.

Planning out the steps

So, to review:

  1. GET https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_disp_dyn_sched.
  2. Pick our options -- in our case, always the second selection. Easy enough.
  3. POST the form to https://compass-ssb.tamu.edu/pls/PROD/bwckgens.p_proc_term_date, following the conventions of the form we saw submitted in the network tab.
  4. For each subject: POST the full search form to https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_get_crse_unsec, save the returned HTML, and wait a while before the next request.

Let's make some code

It's time to begin writing that code.

Initialise your project workspace

First, we need to make our project workspace.

Initialise the cargo project with cargo new:

cargo new compass-scraping --bin

Go ahead and open this with whatever IDE you prefer. I personally use CLion with the Rust plugin but you can basically use anything you want.

Get the appropriate libraries

For this project, we'll be using two libraries: reqwest, an HTTP client, and select, a library for parsing and querying HTML documents.

In order to add these libraries to your project, open your Cargo.toml file and add the dependencies. Your dependencies section should look like this:

[dependencies]
reqwest = "0.9.5"
select = "0.4.2"

Getting started with the code

Open the src/main.rs file. You need to first mark that you'll be using other libraries by adding the appropriate extern crate statements:

extern crate reqwest;
extern crate select;

fn main() {
    println!("Hello, world!");
}

I suggest going ahead and making sure this compiles and runs with cargo run, just to pull the dependencies and make sure you can use them. This might take a while, as you are pulling and compiling quite a few dependencies.

GETing our first page

First, we need to build our reqwest client. While reqwest does allow you to make GET requests without a client, we'll be making POST requests later so we might as well make it now.

let client = reqwest::Client::builder().build().unwrap();

Compass can be ~~pretty~~ extremely slow to respond, so let's make the timeout pretty long, too.

let client = reqwest::Client::builder().timeout(Duration::from_secs(60)).build().unwrap();

Note that you will have to add the statement use std::time::Duration; below the extern crate statements in order to use the Duration type.

Let's go ahead and request it:

let mut first_page = client.get("https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_disp_dyn_sched").send().unwrap();
assert!(first_page.status().is_success()); // Assert that it's fine
println!("{}", first_page.text().unwrap()); // Print out the text of the entire page

Your src/main.rs file should now look like this:

extern crate reqwest;
extern crate select;

use std::time::Duration;

fn main() {
    let client = reqwest::Client::builder().timeout(Duration::from_secs(60)).build().unwrap();
    let mut first_page = client.get("https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_disp_dyn_sched").send().unwrap();
    assert!(first_page.status().is_success());
    println!("{}", first_page.text().unwrap());
}

Go ahead and run this. You should see the HTML for the entrypoint site.

Filtering and Scraping

It's time for our first venture into scraping.

We don't want to just print out the HTML of the entire page; we want to be able to select the second entry in the dropdown menu so we can submit the valid form. Luckily for us, that's where our second library comes in.

Go ahead and delete that println line. Below is what we'll replace it with:

if let Some(opt) = Document::from_read(first_page).unwrap()
    .find(Name("option"))
    .filter_map(|n| n.attr("value"))
    .nth(1) {
    // TODO
}

Note: you'll need to add both use select::document::Document; and use select::predicate::Name; to the use list to use this.

There's a lot to go through here. One at a time:

  1. Document::from_read(first_page).unwrap() parses the response body into a queryable document.
  2. .find(Name("option")) iterates over every option element in the document.
  3. .filter_map(|n| n.attr("value")) keeps only the nodes that have a value attribute, yielding those values.
  4. .nth(1) takes the second such value, matching our "always the second selection" rule from earlier.
  5. The if let Some(opt) = ... wrapper only runs its body if that second value actually exists.

The value stored in opt after this operation is the value we need to provide with our form to start our first POST.

Creating our first POST

Within our if let statement, we now need to perform our POST request with the form we determined earlier. Code for this is below:

let mut resp = client
    .post("https://compass-ssb.tamu.edu/pls/PROD/bwckgens.p_proc_term_date")
    .form(&[               // Create our form; see the section entitled "The entrypoint"
        ("p_term", opt),   // Inject our selection
        ("p_calling_proc", "bwckschd.p_disp_dyn_sched"),
    ])
    .send()
    .unwrap();
assert!(resp.status().is_success());
println!("{}", resp.text().unwrap());

This is very similar to the GET request, but this time, we've specified that we're POST requesting and sending a form with it. The form entries shown are following the pattern shown by our research in the first half. Make sure this runs and has expected data.

Processing the response of the POST

Similarly to last time, we want to use the document reader to parse the page again. Unlike last time, we want to iterate through each subject selection. To accomplish this, instead of using if let and .nth, we'll use .for_each. Let's replace that println! statement appropriately.

Document::from_read(resp).unwrap()
    .find(Name("select"))
    .next()
    .unwrap()
    .find(Name("option"))
    .filter_map(|n| n.attr("value"))
    .for_each(|val| {
        println!("{}", val);
    });

To explain, we want to do something (.for_each(|val| {...})) with the value (.filter_map(|n| n.attr("value"))) of each "option" node (.find(Name("option"))) within the first "select" node (.find(Name("select")).next().unwrap()). Go ahead and run your program to make sure it works. You should see a list of four-letter subject codes.

Progress so far

You should now have the following code:

extern crate reqwest;
extern crate select;

use select::document::Document;
use select::predicate::Name;
use std::time::Duration;

fn main() {
    let client = reqwest::Client::builder().timeout(Duration::from_secs(60)).build().unwrap();
    let first_page = client.get("https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_disp_dyn_sched").send().unwrap();
    assert!(first_page.status().is_success());
    if let Some(opt) = Document::from_read(first_page).unwrap()
        .find(Name("option"))
        .filter_map(|n| n.attr("value"))
        .nth(1) {
            let resp = client
                .post("https://compass-ssb.tamu.edu/pls/PROD/bwckgens.p_proc_term_date")
                .form(&[
                    ("p_term", opt),
                    ("p_calling_proc", "bwckschd.p_disp_dyn_sched"),
                ])
                .send()
                .unwrap();
            assert!(resp.status().is_success());
            Document::from_read(resp).unwrap()
                .find(Name("select"))
                .next()
                .unwrap()
                .find(Name("option"))
                .filter_map(|n| n.attr("value"))
                .for_each(|val| {
                    println!("{}", val);
                });
    }
}

Iterating through the pages

Let's pull each of the pages for each subject; we want to minimise the amount of load we put on this server, so we're just gonna save the pages for now. Actually extracting data from this is for next time.

Remember the long form that we had to send last time? I translated the form, so you don't have to! Just replace the println! statement with this:

let mut results = client
    .post("https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_get_crse_unsec")
    .form(&[
        ("term_in", opt),
        ("sel_subj", "dummy"),
        ("sel_day", "dummy"),
        ("sel_schd", "dummy"),
        ("sel_insm", "dummy"),
        ("sel_camp", "dummy"),
        ("sel_levl", "dummy"),
        ("sel_sess", "dummy"),
        ("sel_instr", "dummy"),
        ("sel_ptrm", "dummy"),
        ("sel_attr", "dummy"),
        ("sel_subj", val),
        ("sel_crse", ""),
        ("sel_title", ""),
        ("sel_schd", "%"),
        ("sel_insm", "%"),
        ("sel_from_cred", ""),
        ("sel_to_cred", ""),
        ("sel_camp", "%"),
        ("sel_levl", "%"),
        ("sel_ptrm", "%"),
        ("sel_instr", "%"),
        ("sel_attr", "%"),
        ("begin_hh", "0"),
        ("begin_mi", "0"),
        ("begin_ap", "a"),
        ("end_hh", "0"),
        ("end_mi", "0"),
        ("end_ap", "a"),
    ])
    .send()
    .unwrap();
assert!(results.status().is_success());
// TODO

Don't run this yet; it's not quite ready. We need to store this data. It wouldn't be helpful to print this all out to the command line, so instead, let's write it to a file.

Writing to a file is relatively simple in Rust; all we need to do is create a handle to it and invoke write_all. The file is closed automatically when the handle is dropped at the end of its scope.

let mut file = File::create(format!("{}.html", val)).unwrap();
file.write_all(results.text().unwrap().as_bytes()).unwrap();
println!("Scanned and saved {}.", val);

You'll need to add use std::fs::File; and use std::io::Write; to use these. This will save each page to an HTML file titled with the subject it belongs to.

Finally, as a nod to the policy about what not to do:

(b) impede normal business functions;

Let's sleep between queries so as not to impose. With normal web scrapers, you might want to increase the rate of scraping; in this case, we want the opposite. We'll just sleep the thread for 30 seconds. Append to the last segment:

thread::sleep(Duration::from_secs(30));

Remember to add use std::thread; for sleeping.

Wrapping up and running

This takes a long time to run, and it should be run in its own directory so as not to make a mess (it writes one HTML file per subject). Go ahead and run it when you're ready to scan.

The final product

extern crate reqwest;
extern crate select;

use select::document::Document;
use select::predicate::Name;
use std::fs::File;
use std::io::Write;
use std::thread;
use std::time::Duration;

fn main() {
    let client = reqwest::Client::builder().timeout(Duration::from_secs(60)).build().unwrap();
    let first_page = client.get("https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_disp_dyn_sched").send().unwrap();
    assert!(first_page.status().is_success());
    if let Some(opt) = Document::from_read(first_page).unwrap()
        .find(Name("option"))
        .filter_map(|n| n.attr("value"))
        .nth(1) {
            let resp = client
                .post("https://compass-ssb.tamu.edu/pls/PROD/bwckgens.p_proc_term_date")
                .form(&[
                    ("p_term", opt),
                    ("p_calling_proc", "bwckschd.p_disp_dyn_sched"),
                ])
                .send()
                .unwrap();
            assert!(resp.status().is_success());
            Document::from_read(resp).unwrap()
                .find(Name("select"))
                .next()
                .unwrap()
                .find(Name("option"))
                .filter_map(|n| n.attr("value"))
                .for_each(|val| {
                    let mut results = client
                        .post("https://compass-ssb.tamu.edu/pls/PROD/bwckschd.p_get_crse_unsec")
                        .form(&[
                            ("term_in", opt),
                            ("sel_subj", "dummy"),
                            ("sel_day", "dummy"),
                            ("sel_schd", "dummy"),
                            ("sel_insm", "dummy"),
                            ("sel_camp", "dummy"),
                            ("sel_levl", "dummy"),
                            ("sel_sess", "dummy"),
                            ("sel_instr", "dummy"),
                            ("sel_ptrm", "dummy"),
                            ("sel_attr", "dummy"),
                            ("sel_subj", val),
                            ("sel_crse", ""),
                            ("sel_title", ""),
                            ("sel_schd", "%"),
                            ("sel_insm", "%"),
                            ("sel_from_cred", ""),
                            ("sel_to_cred", ""),
                            ("sel_camp", "%"),
                            ("sel_levl", "%"),
                            ("sel_ptrm", "%"),
                            ("sel_instr", "%"),
                            ("sel_attr", "%"),
                            ("begin_hh", "0"),
                            ("begin_mi", "0"),
                            ("begin_ap", "a"),
                            ("end_hh", "0"),
                            ("end_mi", "0"),
                            ("end_ap", "a"),
                        ])
                        .send()
                        .unwrap();
                    assert!(results.status().is_success());
                    let mut file = File::create(format!("{}.html", val)).unwrap();
                    file.write_all(results.text().unwrap().as_bytes()).unwrap();
                    println!("Scanned and saved {}.", val);
                    thread::sleep(Duration::from_secs(30));
                });
    }
}

The code released on this page is licensed under GPLv3, as seen in its repository.