This is created in association with my venture into web scraping with Rust. We need to make sure TAMU is okay with us scraping their data before we do it.
NOTE (2020-03-12): This information is no longer up to date due to policy changes at TAMU. If you wish to do this yourself, please review the newer policies.
Where to start
First, let's look for the ToS for TAMU.
TAMU IT provides rules for the use of their services. Rule 29 provides information on the policies associated with Information Resources.
Rule 29.01.03.M0.02 outlines "Rules for Responsible Computing", which would definitely relate to what we're doing.
To be clear: web scraping is the collection of public data. We, as an entity, are accessing this public data with the intent of using it elsewhere. We have not received explicit permission to do so, so if the policies restrict us in this regard, we cannot use this data for our own services.
From 29.01.03.M0.02, section 1 (bold emphasis is mine):
Texas A&M recognizes the importance of information resources and facilities to students, faculty and staff in scholarly pursuits, professional development, service activities, personal development and every day work and class-related activities.
In this case, I am a student, and I am using their services for personal development ("development", in this context, is not software development, but likely refers to the development of character and abilities). As evidence that our scraping is valid in this context, Aggie Scheduler is a service which actively scrapes the same course data in order to simplify the registration process.
This means that, after I graduate, I will have to hand off any project using their data (at least officially) to another student or a university staff member in order to ensure that this use is still considered valid.
Rule 29.01.03.M1.02 outlines "Information Resources - Acceptable Use", which we should also look at.
This page definitely applies to us:
The intended audience for this standard administrative procedure includes ... users of University information resources.
Unhelpfully, this page mostly links to other policy listings. The policies that we should look at are probably System Policy 33.04 and System Policy 29.01.02.
System Policy 33.04
As a quick note: this is a system policy, which means this relates to TAMUS, not just TAMU.
From section 1:
System resources may not be used for personal purposes except for incidental use in accordance with this policy. The incidental use of system resources for personal purposes must not:
(a) result in additional expense to the system;
These queries will be of negligible cost to the university, as students are constantly querying this information anyway.
(b) impede normal business functions;
We should avoid querying too quickly; we should space our queries out to make sure we don't prevent other users from being able to access the resource. Furthermore, we should perform these queries outside of normal business hours to reduce load on the queried resource.
(c) be for non-approved private commercial purposes;
The information I'm collecting will not be used commercially, but only as a free-to-use service.
(d) be used for illegal activity;
(e) be used to intentionally access, create, store, or transmit obscene materials; or
(f) be used to compete unfairly with private sector entities or private consultants.
As far as I'm aware, no one's selling course data. Furthermore, we're collecting publicly available information, so this is not unfair.
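The pacing described under item (b) could be sketched in Rust roughly like this. This is only a minimal sketch under my own assumptions: the business hours (08:00 to 17:00 local time), the function names, and the `fetch` callback are all hypothetical and do not come from any TAMU policy or from Compass itself. It uses only the standard library, so getting the current local hour and making the actual HTTP request are left to whatever crates you choose.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Assumed business hours in local time; these values are my guess,
// not something stated in the TAMU policies.
const BUSINESS_START_HOUR: u32 = 8;
const BUSINESS_END_HOUR: u32 = 17;

/// Returns true when `hour` (0-23, local time) falls outside the
/// assumed business hours. Weekends are not considered here.
fn is_off_hours(hour: u32) -> bool {
    hour < BUSINESS_START_HOUR || hour >= BUSINESS_END_HOUR
}

/// How much longer to wait so that consecutive queries are at least
/// `min_delay` apart, given the time elapsed since the last query.
fn remaining_delay(min_delay: Duration, since_last: Duration) -> Duration {
    min_delay.saturating_sub(since_last)
}

/// Calls `fetch` once per item, sleeping between calls so that queries
/// are spaced at least `min_delay` apart. `fetch` is a placeholder for
/// whatever actually performs the request.
fn run_paced<T, F: FnMut(&T)>(items: &[T], min_delay: Duration, mut fetch: F) {
    let mut last: Option<Instant> = None;
    for item in items {
        if let Some(finished_at) = last {
            thread::sleep(remaining_delay(min_delay, finished_at.elapsed()));
        }
        fetch(item);
        last = Some(Instant::now());
    }
}
```

A scraper would check `is_off_hours` with the current local hour before starting a batch, then pass its list of queries to `run_paced` with a delay of a few seconds. The right delay is a judgment call; I don't know of a documented limit.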
The rest of this document either relates solely to system employees or explicitly refers to resources which are not information resources. System Policy 33.04.02 on its face may appear to relate to our scraping, but is ultimately unrelated.
System Policy 29.01.02
A closer look at this policy (entitled "Use of Licensed Software") shows that it is ultimately unlikely to be relevant to us. Instead, this policy appears to be about the pirating of their licensed software:
The unauthorized use, copying, or distribution of copyrighted software is a violation of the U.S. Copyright Act. These illegal acts are commonly referred to as "software piracy."
So it is unlikely that we will be affected by this policy.
The service we're scraping is the Compass Service and we'll be pulling data which contains minimal PII. Unhelpfully, it appears that Compass does not have any written ToS explicitly related to the use of their software externally, so we assume that the policies listed above are the limits to our work.
Why are we not talking about robots.txt?
robots.txt is for web crawling, not for web scraping. We are scraping targeted, specific information, not iterating through links, so the robots.txt file and the robots meta-tag and headers do not apply to us.