When | Why |
---|---|
Oct-24-20 | What to do with the Data |
Notes on ParseHub
ParseHub is a program you download <https://www.parsehub.com> to "scrape" data that’s laid out in a consistent pattern on webpages. In its free "trial" form you can use it on five (5) websites ("projects"), but when you're done with a project (i.e. have downloaded the data to your own computer) you can delete the project from ParseHub and then do another, so that limit doesn’t actually get in the way.
You get there by clicking on the house icon at the upper left cornner of your screen, which opens a column of menu options on the far left of your sc, white letters on a black background, and Projects is near the top.
For each page you scrape you’ll generate an Excel file (.CSV format); they will each be pasted into a consolidating Google Sheet for each state we’re working on. The Name we grab is usually just 1 field that might contain first name, last name, middle initial, suffixes, etc — we’ll convert that one field into these component separate fields in Google Sheets or Excel (using the text to Columns built-in feature), but only once all the data is pasted into the consolidating sheet (so we only have to run the converter a single time!
These notes refer to grabbing information about professors (name, email, ph#, title, department), but watch segments of the following example video first (showing as an example how to grab book data like book title and price) from Amazon:
1. The Basics (from tutorial beginning until 3:54 mark)
You scrape a page of data by
* selecting the key field, Professor name in our case
* giving that data a name
* then doing "Relative Select" commands via the "+" sign on the Select Name line to grab other data fields related to the key field
If phone# is missing or hard to grab it’s OK to just skip ph#, not critical!
If grabbing Title ever gives you trouble you can manually fill it in in the spreadsheet after the ParseHub run is finished: most people will be “Prof.” of some kind (Adjunct vs. Associate vs. whatever isn’t critical to us), you can just fill in “Prof.” for all of them first (copy a whole in one operation) and then change the few exception cases later. We don’t need departmental secretaries, business managers, etc.
Start with whatever person/record has the most well-structured data, not necessarily the first record at the top of the page. For example, if the title field has "Professor" but also office location and other junk we don't want, and you start with that record, you may be able to just select the "Professor" part and teach the software not to grab the other junk.
No worries, just grab the data as-is and we can decide at the end how or whether to extract the good data from the junk — Google Sheets (and Excel, I use both for various situations) have lots of powerful built-in commands for text manipulation that can be used individually or strung together to make cleanup fast and easy.
For example:
1. sometimes Name and Title are jumbled together and can’t be separated in ParseHubSolution— many times the two are separated by a comma; we can use the built Google Sheets SPLIT function (which can be used with any character, not just commas) to break the one field into two! Explained at https://support.google.com/docs/answer/3094136?hl=en
2. sometimes Ph# is mixed in with office location, Office hours, etc.Solution— when I did this search “Google Sheets remove all alpha characters” I found a wide variety of sites offering solutions for stripping all non-numeric characters (from one cell or from thousands at a time, it’s all the same!!). The page offering what looks like an easy: https://www.got-it.ai/solutions/excel-chat/excel-tutorial/text/strip-non-numeric-characters
IMPORTANT priniciple to apply to whatever computer tasks you face — if they’re simple but repetitive, there’s almost always some built-in feature or add-on 3rd party tool you can find to make the job go fast!
Supposedly if it makes a connection in error (e.g. you grab a phone# instead of an email address) you hit ‘Escape’ to cancel that, but it doesn't work for me.
For more info see:
https://help.parsehub.com/hc/en-us/articles/218226157-Relative-Select
2. How to Process additional pages that have the same page layout (starts around 3:54)
Click on "+" next to "Select Page" (at top) and then choose "Select" and then click on the "next page" or similar button. After you rename that Select step to "NextButton" or similar, then choose the "+" command on that line and choose a "Click" command. Since we're extracting the same stuff as in #1" above, we'll use "Go to Existing Template".
For more information:
https://help.parsehub.com/hc/en-us/articles/217752908-Click
https://help.parsehub.com/hc/en-us/articles/217735328-Click-on-the-Next-button-to-scrape-multiple-pages-pagination-
3. Digging into detail pages or other pages that don't have the same layout (starts around 5:00).
We often start with a Directory/Index page that has a few basic fields but not full information; usually we get to the full information page by clicking the Prof's name. Assuming that's the case, click on the "+" next to your initial Name field, choose "Click" and say that "NO" the key field is NOT a "Next" button, and then choose "Create a New Template" radio button (we can't use previous template since the page layout of the Profile/detail page is different from the initial page).
That will then take you to a new Template page (similar to what you had when you started #1 above), and then, from the "+" sign on the "Select Page" line, for each data field you want to capture, one at a time, you'll choose a "Select" Command (instead of a Relative Select) and then basically repeat the procedure of #1 above.
Note: at 6:25 the video (too quickly!) explains how to extract only the text component of links (e.g. email address) if that’s desired… but it’s not hard for us to delete that information manually in the spreadsheet.
4. Extracting data by finding text strings (starts around 6:39)
[October 2020 note—I’ve not done yet tried this technique, just gathered some information.]
Example in the tutorial relates to tables, but use of text strings isn’t limited to that. I texted Customer Service and got answers to two example cases:
Example 1. https://www.muhlenberg.edu/academics/polisci/faculty/
Question: Prof. names are on the right side of the page, but Political Science Home is at the top of the column and I can't figure out how to tell ParseHub not to treat it as a Person (I need to drill down to get details of each Person, and the department page has a different format).
Answer: You can use a Conditional command for this. In this case you could use:
if $selection.index>0
This will tell ParseHub to only extract/click elements that are after the first.
Example 2. https://www.bucknell.edu/academics/college-arts-sciences/academic-departments-programs/africana-studies/faculty-staff
Question: Some records have both phone number and email address and others just one, and sometimes when the software confuses a ph# with an email I can't figure out how to tell it not to treat that ph# as an email (or vice-versa).
Answer: Try this:
Select all links
if $e.text.contains("@")
extract email
if !$e.text.contains("@")
extract phone
For more information about the Conditional Command:
https://help.parsehub.com/hc/en-us/articles/217753268-Conditional
https://help.parsehub.com/hc/en-us/articles/217753368-Go-to-Template
Last updated: October 2020
Added October 24, 2020 at 12:33am
by Dan Doernberg
Title: What to do with the Data
You can either:
1. save the data as a CSV/Excel file and email it to Dan, or
2. open the Excel file and paste the data into the state-specific Google Doc. If doing this, paste the data in the email address column (assuming that's the field you grabbed first) and we can do any rearranging of data into columns at the end, that's nothing you need to worry about.
Logging in, please wait...
0 archived comments