Notes on ParseHub

ParseHub is a program you download from <https://www.parsehub.com> to "scrape" data that's laid out in a consistent pattern on webpages. In its free "trial" form you can use it on five (5) websites ("projects"), but once you're done with a project (i.e., have downloaded the data to your own computer) you can delete it from ParseHub and start another, so that limit doesn't actually get in the way.

For each page you scrape, you'll generate an Excel-compatible file (.CSV format); each file then gets pasted into a consolidating Google Sheet for the state we're working on. The Name we grab is usually just one field that might contain first name, last name, middle initial, suffixes, etc. We'll split that one field into separate component fields in Google Sheets or Excel (using the built-in Text to Columns feature), but only once all the data is pasted into the consolidating sheet (so we only have to run the converter a single time!).
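The consolidate-then-split workflow can also be scripted. Here's a minimal sketch; the file naming pattern, the "name" column, and the "Last, First" name format are all assumptions to adjust to the actual exports:

```python
import csv
import glob

# Gather every ParseHub export (assumed naming pattern) into one list of rows.
rows = []
for path in sorted(glob.glob("parsehub_export_*.csv")):
    with open(path, newline="", encoding="utf-8") as f:
        rows.extend(csv.DictReader(f))

def split_name(name):
    """Split an assumed 'Last, First M.' field into (last, rest),
    mimicking Text to Columns on a comma delimiter."""
    last, _, rest = name.partition(",")
    return last.strip(), rest.strip()

# Run the converter a single time, over the consolidated rows.
for row in rows:
    row["last_name"], row["first_name"] = split_name(row.get("name", ""))
```

If a name has no comma, the whole string lands in the "last" slot, so eyeball the results before trusting them.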

These notes refer to grabbing information about professors (name, email, ph#, title, department), but first watch segments of the following example video, which shows (as an example) how to grab book data like book title and price from Amazon:

1. The Basics (from tutorial beginning until 3:54 mark)

You scrape a page of data by

* selecting the key field, Professor name in our case
* giving that data a name
* then doing "Relative Select" commands via the "+" sign on the Select Name line to grab other data fields related to the key field

Start with whatever person/record has the most well-structured data, not necessarily the first record at the top of the page. For example, if the title field has "Professor" but also office location and other junk we don't want, and you start with that record, you may be able to just select the "Professor" part and teach the software not to grab the other junk.

Supposedly, if it makes a connection in error (e.g., it grabs a phone# instead of an email address), you can hit 'Escape' to cancel that, but it doesn't work for me.

For more info see:
https://help.parsehub.com/hc/en-us/articles/218226157-Relative-Select
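For intuition, the Select + Relative Select pattern above is roughly equivalent to the following sketch, which pulls each professor block out of a page and then grabs the related fields inside it. The HTML snippet and its class names are made up for illustration; real pages will differ:

```python
import re

# A toy directory page (made-up markup for illustration only).
PAGE = """
<div class="prof"><a class="name" href="/p/ada">Ada Lovelace</a>
  <span class="title">Professor</span>
  <a href="mailto:ada@example.edu">ada@example.edu</a></div>
<div class="prof"><a class="name" href="/p/alan">Alan Turing</a>
  <span class="title">Professor</span>
  <a href="mailto:alan@example.edu">alan@example.edu</a></div>
"""

# "Select" the key field (each prof block), then do the equivalent of a
# Relative Select: pull name, title, and email from within the same block.
records = []
for block in re.findall(r'<div class="prof">(.*?)</div>', PAGE, re.S):
    name = re.search(r'class="name"[^>]*>([^<]+)<', block).group(1)
    title = re.search(r'class="title">([^<]+)<', block).group(1)
    email = re.search(r'mailto:([^"]+)"', block).group(1)
    records.append({"name": name, "title": title, "email": email})
```

ParseHub figures out these patterns from your clicks, which is the whole point of the tool; the sketch just shows what it's doing under the hood.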

2. How to Process additional pages that have the same page layout (starts around 3:54)

Click on "+" next to "Select Page" (at top), choose "Select", and then click on the "next page" (or similar) button. After you rename that Select step to "NextButton" or similar, choose the "+" command on that line and add a "Click" command. Since we're extracting the same stuff as in #1 above, we'll use "Go to Existing Template".

For more information:
https://help.parsehub.com/hc/en-us/articles/217752908-Click
https://help.parsehub.com/hc/en-us/articles/217735328-Click-on-the-Next-button-to-scrape-multiple-pages-pagination-
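Conceptually, the Click-on-Next loop works like the sketch below: scrape a page with the same template, then follow the Next link until there isn't one. The fetch function and page data are stand-ins for real HTTP requests (assumptions for illustration):

```python
# Simulated site: each URL maps to (names on that page, next-page URL or None).
PAGES = {
    "/faculty?page=1": (["Ada Lovelace", "Alan Turing"], "/faculty?page=2"),
    "/faculty?page=2": (["Grace Hopper"], None),  # no Next button on last page
}

def fetch(url):
    """Stand-in for a real HTTP request + scrape of one page."""
    return PAGES[url]

all_names, url = [], "/faculty?page=1"
while url is not None:
    names, next_url = fetch(url)   # extract with the same template...
    all_names.extend(names)
    url = next_url                 # ...then "click" Next, if present
```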

3. Digging into detail pages or other pages that don't have the same layout (starts around 5:00).

We often start with a Directory/Index page that has a few basic fields but not the full information; usually we get to the full-information page by clicking the Prof's name. Assuming that's the case, click on the "+" next to your initial Name field, choose "Click", answer "No" when asked whether the key field is a "Next" button, then choose the "Create New Template" radio button (we can't reuse the previous template, since the layout of the Profile/detail page is different from the initial page) and click "Create New Template".

That will take you to a new template page (similar to what you had when you started #1 above). Then, from the "+" sign on the "Select Page" line, choose a "Select" command (instead of a Relative Select) for each data field you want to capture, one at a time, and basically repeat the procedure of #1 above.

Note: at 6:25 the video (too quickly!) explains how to extract only the text component of links (e.g. email address) if that’s desired… but it’s not hard for us to delete that information manually in the spreadsheet.
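The index-page-to-detail-page flow amounts to the loop sketched below: for each key field on the directory page, "click" through to its detail page (the new template) and merge what you find there. All data here is made up for illustration:

```python
# Directory page: (name, link) pairs grabbed with the first template.
DIRECTORY = [("Ada Lovelace", "/p/ada"), ("Alan Turing", "/p/alan")]

# Detail pages have a different layout, hence the separate "template".
DETAIL = {
    "/p/ada":  {"email": "ada@example.edu",  "phone": "555-0101"},
    "/p/alan": {"email": "alan@example.edu", "phone": "555-0102"},
}

records = []
for name, link in DIRECTORY:     # key field on the index page
    detail = DETAIL[link]        # "click" through to the detail template
    records.append({"name": name, **detail})
```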

4. Extracting data by finding text strings (starts around 6:39)

[October 2020 note: I've not yet tried this technique, just gathered some information.]

The example in the tutorial relates to tables, but the use of text strings isn't limited to that. I messaged Customer Service and got answers to two example cases:

Example 1. https://www.muhlenberg.edu/academics/polisci/faculty/

Question: Prof. names are on the right side of the page, but "Political Science Home" is at the top of that column, and I can't figure out how to tell ParseHub not to treat it as a Person (I need to drill down to get details of each Person, and the department page has a different format).

Answer: You can use a Conditional command for this. In this case you could use:

if $selection.index>0
This will tell ParseHub to only extract/click elements that are after the first.
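In plain terms, the condition skips the element at index 0 and keeps the rest, like this sketch (the link texts are made up):

```python
# The column mixes one header link with the professor links below it.
links = ["Political Science Home", "Jane Smith", "John Doe"]  # made-up data

# Equivalent of ParseHub's  if $selection.index > 0 : keep everything
# after the first element, dropping the "Political Science Home" header.
profs = [text for i, text in enumerate(links) if i > 0]
```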

Example 2. https://www.bucknell.edu/academics/college-arts-sciences/academic-departments-programs/africana-studies/faculty-staff

Question: Some records have both a phone number and an email address, and others have just one; when the software confuses a ph# with an email, I can't figure out how to tell it not to treat that ph# as an email (or vice versa).

Answer: Try this:
Select all links
if $e.text.contains("@")
extract email
if !$e.text.contains("@")
extract phone
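The conditional above boils down to sorting link texts by whether they contain "@", as in this sketch (sample data made up):

```python
# Classify each contact link the way the $e.text.contains("@") conditional
# does: "@" means email, anything else is treated as a phone number.
links = ["jane@example.edu", "570-555-0123", "john@example.edu"]

emails = [t for t in links if "@" in t]
phones = [t for t in links if "@" not in t]
```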

For more information about the Conditional Command:
https://help.parsehub.com/hc/en-us/articles/217753268-Conditional
https://help.parsehub.com/hc/en-us/articles/217753368-Go-to-Template

Last updated: October 2020
