Notes on ParseHub

ParseHub is a program you download from <https://www.parsehub.com> to "scrape" data that's laid out in a consistent pattern on webpages. In its free "trial" form you can use it on five (5) websites ("projects"), but once you're done with a project (i.e., have downloaded the data to your own computer) you can delete it from ParseHub and start another, so that limit doesn't actually get in the way.

For each page you scrape, you'll generate an Excel-compatible file (.CSV format); each file then gets pasted into a consolidating Google Sheet for the state we're working on. The Name we grab is usually just one field that might contain first name, last name, middle initial, suffixes, etc. We'll split that one field into separate component fields in Google Sheets or Excel (using the built-in Text to Columns feature), but only once all the data is pasted into the consolidating sheet (so we only have to run the converter a single time!).
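The consolidate-then-split workflow can also be scripted. Here's a minimal sketch; the file naming pattern, the "name" column, and the "Last, First" name format are all assumptions to adjust to the actual exports:

```python
import csv
import glob

# Gather every ParseHub export (assumed naming pattern) into one list of rows.
rows = []
for path in sorted(glob.glob("parsehub_export_*.csv")):
    with open(path, newline="", encoding="utf-8") as f:
        rows.extend(csv.DictReader(f))

def split_name(name):
    """Split an assumed 'Last, First M.' field into (last, rest),
    mimicking Text to Columns on a comma delimiter."""
    last, _, rest = name.partition(",")
    return last.strip(), rest.strip()

# Run the converter a single time, over the consolidated rows.
for row in rows:
    row["last_name"], row["first_name"] = split_name(row.get("name", ""))
```

If a name has no comma, the whole string lands in the "last" slot, so eyeball the results before trusting them.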

These notes refer to grabbing information about professors (name, email, ph#, title, department), but first watch segments of the following example video, which shows (as an example) how to grab book data like book title and price from Amazon:

1. The Basics (from tutorial beginning until 3:54 mark)

You scrape a page of data by

* selecting the key field, Professor name in our case
* giving that data a name
* then doing "Relative Select" commands via the "+" sign on the Select Name line to grab other data fields related to the key field

Start with whatever person/record has the most well-structured data, not necessarily the first record at the top of the page. For example, if the title field has "Professor" but also office location and other junk we don't want, and you start with that record, you may be able to just select the "Professor" part and teach the software not to grab the other junk.

Supposedly, if it makes a connection in error (e.g., it grabs a phone# instead of an email address), you can hit 'Escape' to cancel that, but it doesn't work for me.

For more info see:
https://help.parsehub.com/hc/en-us/articles/218226157-Relative-Select
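For intuition, the Select + Relative Select pattern above is roughly equivalent to the following sketch, which pulls each professor block out of a page and then grabs the related fields inside it. The HTML snippet and its class names are made up for illustration; real pages will differ:

```python
import re

# A toy directory page (made-up markup for illustration only).
PAGE = """
<div class="prof"><a class="name" href="/p/ada">Ada Lovelace</a>
  <span class="title">Professor</span>
  <a href="mailto:ada@example.edu">ada@example.edu</a></div>
<div class="prof"><a class="name" href="/p/alan">Alan Turing</a>
  <span class="title">Professor</span>
  <a href="mailto:alan@example.edu">alan@example.edu</a></div>
"""

# "Select" the key field (each prof block), then do the equivalent of a
# Relative Select: pull name, title, and email from within the same block.
records = []
for block in re.findall(r'<div class="prof">(.*?)</div>', PAGE, re.S):
    name = re.search(r'class="name"[^>]*>([^<]+)<', block).group(1)
    title = re.search(r'class="title">([^<]+)<', block).group(1)
    email = re.search(r'mailto:([^"]+)"', block).group(1)
    records.append({"name": name, "title": title, "email": email})
```

ParseHub figures out these patterns from your clicks, which is the whole point of the tool; the sketch just shows what it's doing under the hood.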

2. How to Process additional pages that have the same page layout (starts around 3:54)

Click on "+" next to "Select Page" (at top), choose "Select", and then click on the "next page" (or similar) button. After you rename that Select step to "NextButton" or similar, choose the "+" command on that line and add a "Click" command. Since we're extracting the same stuff as in #1 above, we'll use "Go to Existing Template".

For more information:
https://help.parsehub.com/hc/en-us/articles/217752908-Click
https://help.parsehub.com/hc/en-us/articles/217735328-Click-on-the-Next-button-to-scrape-multiple-pages-pagination-
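Conceptually, the Click-on-Next loop works like the sketch below: scrape a page with the same template, then follow the Next link until there isn't one. The fetch function and page data are stand-ins for real HTTP requests (assumptions for illustration):

```python
# Simulated site: each URL maps to (names on that page, next-page URL or None).
PAGES = {
    "/faculty?page=1": (["Ada Lovelace", "Alan Turing"], "/faculty?page=2"),
    "/faculty?page=2": (["Grace Hopper"], None),  # no Next button on last page
}

def fetch(url):
    """Stand-in for a real HTTP request + scrape of one page."""
    return PAGES[url]

all_names, url = [], "/faculty?page=1"
while url is not None:
    names, next_url = fetch(url)   # extract with the same template...
    all_names.extend(names)
    url = next_url                 # ...then "click" Next, if present
```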

3. Digging into detail pages or other pages that don't have the same layout (starts around 5:00).

We often start with a Directory/Index page that has a few basic fields but not the full information; usually we get to the full-information page by clicking the Prof's name. Assuming that's the case, click on the "+" next to your initial Name field, choose "Click", answer "No" when asked whether the key field is a "Next" button, then choose the "Create New Template" radio button (we can't reuse the previous template, since the layout of the Profile/detail page is different from the initial page) and click "Create New Template".

That will take you to a new template page (similar to what you had when you started #1 above). Then, from the "+" sign on the "Select Page" line, choose a "Select" command (instead of a Relative Select) for each data field you want to capture, one at a time, and basically repeat the procedure of #1 above.

Note: at 6:25 the video (too quickly!) explains how to extract only the text component of links (e.g. email address) if that’s desired… but it’s not hard for us to delete that information manually in the spreadsheet.
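The index-page-to-detail-page flow amounts to the loop sketched below: for each key field on the directory page, "click" through to its detail page (the new template) and merge what you find there. All data here is made up for illustration:

```python
# Directory page: (name, link) pairs grabbed with the first template.
DIRECTORY = [("Ada Lovelace", "/p/ada"), ("Alan Turing", "/p/alan")]

# Detail pages have a different layout, hence the separate "template".
DETAIL = {
    "/p/ada":  {"email": "ada@example.edu",  "phone": "555-0101"},
    "/p/alan": {"email": "alan@example.edu", "phone": "555-0102"},
}

records = []
for name, link in DIRECTORY:     # key field on the index page
    detail = DETAIL[link]        # "click" through to the detail template
    records.append({"name": name, **detail})
```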

4. Extracting data by finding text strings (starts around 6:39)

[October 2020 note: I've not yet tried this technique, just gathered some information.]

The example in the tutorial relates to tables, but the use of text strings isn't limited to that. I messaged Customer Service and got answers to two example cases:

Example 1. https://www.muhlenberg.edu/academics/polisci/faculty/

Question: Prof. names are on the right side of the page, but "Political Science Home" is at the top of that column, and I can't figure out how to tell ParseHub not to treat it as a Person (I need to drill down to get details of each Person, and the department page has a different format).

Answer: You can use a Conditional command for this. In this case you could use:

if $selection.index>0
This will tell ParseHub to only extract/click elements that are after the first.
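In plain terms, the condition skips the element at index 0 and keeps the rest, like this sketch (the link texts are made up):

```python
# The column mixes one header link with the professor links below it.
links = ["Political Science Home", "Jane Smith", "John Doe"]  # made-up data

# Equivalent of ParseHub's  if $selection.index > 0 : keep everything
# after the first element, dropping the "Political Science Home" header.
profs = [text for i, text in enumerate(links) if i > 0]
```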

Example 2. https://www.bucknell.edu/academics/college-arts-sciences/academic-departments-programs/africana-studies/faculty-staff

Question: Some records have both a phone number and an email address, and others have just one; when the software confuses a ph# with an email, I can't figure out how to tell it not to treat that ph# as an email (or vice versa).

Answer: Try this:
Select all links
if $e.text.contains("@")
extract email
if !$e.text.contains("@")
extract phone
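The conditional above boils down to sorting link texts by whether they contain "@", as in this sketch (sample data made up):

```python
# Classify each contact link the way the $e.text.contains("@") conditional
# does: "@" means email, anything else is treated as a phone number.
links = ["jane@example.edu", "570-555-0123", "john@example.edu"]

emails = [t for t in links if "@" in t]
phones = [t for t in links if "@" not in t]
```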

For more information about the Conditional Command:
https://help.parsehub.com/hc/en-us/articles/217753268-Conditional
https://help.parsehub.com/hc/en-us/articles/217753368-Go-to-Template

Last updated: October 2020
