Notes on ParseHub

ParseHub is a program you download (https://www.parsehub.com) to "scrape" data that’s laid out in a consistent pattern on web pages. In its free "trial" form you can use it on five (5) websites ("projects"), but when you're done with a project (i.e. have downloaded the data to your own computer) you can delete the project from ParseHub and then start another, so that limit doesn’t actually get in the way.

Comment by Dan D (Oct 24 2020 12:36AM): To delete projects, go to the Projects page. You get there by clicking the house icon at the upper left corner of your screen, which opens a column of menu options on the far left of your screen (white letters on a black background); Projects is near the top.

For each page you scrape you’ll generate a CSV file (which opens in Excel); each one will be pasted into a consolidating Google Sheet for the state we’re working on. The Name we grab is usually just one field that might contain first name, last name, middle initial, suffixes, etc. We’ll split that one field into separate component fields in Google Sheets or Excel (using the built-in Text to Columns feature), but only once all the data is pasted into the consolidating sheet, so we only have to run the converter a single time!
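
As a rough illustration of the consolidate-first, split-names-once idea, here is a minimal Python sketch. It is an alternative to the spreadsheet approach, not the workflow these notes actually use. The column names (`name`, `email`), the suffix list, and the function names are all assumptions made for the example; real exports will carry whatever field names you chose in ParseHub.

```python
import csv
import io

# Hypothetical suffix list; extend it as needed for real data.
SUFFIXES = {"Jr.", "Sr.", "II", "III", "IV", "Ph.D."}

def split_name(full_name):
    """Split one name field into first, middle, last, and suffix parts."""
    parts = full_name.replace(",", " ").split()
    suffix = ""
    if parts and parts[-1] in SUFFIXES:
        suffix = parts.pop()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])
    return {"first": first, "middle": middle, "last": last, "suffix": suffix}

def consolidate(csv_texts):
    """Merge several CSV exports (given as text), then split the name field once."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return [{**row, **split_name(row["name"])} for row in rows]

# Two made-up exports, one per scraped page.
page1 = "name,email\nJane Q. Doe,jdoe@example.edu\n"
page2 = "name,email\n\"John Smith, Jr.\",jsmith@example.edu\n"
merged = consolidate([page1, page2])
```

Names with unusual structure (two-word last names, titles in front) will still need a manual look, just as they would after Text to Columns.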

These notes refer to grabbing information about professors (name, email, ph#, title, department), but first watch segments of the following example video, which shows how to grab book data (such as title and price) from Amazon:

[Video 1: tutorial video]

1. The Basics (from tutorial beginning until 3:54 mark)

You scrape a page of data by:

* selecting the key field (the professor's name, in our case)
* giving that data a name
* then doing "Relative Select" commands via the "+" sign on the Select Name line to grab the other data fields related to the key field

Comment by Dan D (Oct 25 2020 11:18AM): Good practice document: https://www.unlv.edu/history/directory/north-and-latin-america

Comment by Dan D (Oct 22 2020 5:23PM, edited): We started using Email as the main (#1) field grabbed in ParseHub, then grabbing Phone and Title next (so those three are in ascending alphabetical order), and then Name *last*.

If phone# is missing or hard to grab it’s OK to just skip ph#, not critical!

If grabbing Title ever gives you trouble you can fill it in manually in the spreadsheet after the ParseHub run is finished: most people will be “Prof.” of some kind (Adjunct vs. Associate vs. whatever isn’t critical to us), so you can just fill in “Prof.” for all of them first (fill the whole column in one operation) and then change the few exception cases later. We don’t need departmental secretaries, business managers, etc.

Start with whatever person/record has the most well-structured data, not necessarily the first record at the top of the page. For example, if the title field has "Professor" but also office location and other junk we don't want, and you start with that record, you may be able to just select the "Professor" part and teach the software not to grab the other junk.

Comment by Dan D (Oct 24 2020 8:36AM): Some web pages are badly designed and you can't grab just the data you want.

No worries, just grab the data as-is and we can decide at the end how or whether to extract the good data from the junk. Google Sheets and Excel (I use both for various situations) have lots of powerful built-in commands for text manipulation that can be used individually or strung together to make cleanup fast and easy.

For example:

1. Sometimes Name and Title are jumbled together and can’t be separated in ParseHub.

Solution: many times the two are separated by a comma; we can use the built-in Google Sheets SPLIT function (which can be used with any character, not just commas) to break the one field into two! Explained at https://support.google.com/docs/answer/3094136?hl=en

2. Sometimes Ph# is mixed in with office location, office hours, etc.

Solution: when I searched for “Google Sheets remove all alpha characters” I found a wide variety of sites offering solutions for stripping all non-numeric characters (from one cell or from thousands at a time, it’s all the same!!). One page offering what looks like an easy method: https://www.got-it.ai/solutions/excel-chat/excel-tutorial/text/strip-non-numeric-characters

IMPORTANT principle to apply to whatever computer tasks you face: if they’re simple but repetitive, there’s almost always some built-in feature or third-party add-on tool you can find to make the job go fast!
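
For anyone who prefers to do this kind of cleanup outside a spreadsheet, the two fixes in this comment can be sketched in Python. This is a minimal sketch only, not the SPLIT formula or the linked Excel recipe; the function names and sample strings are made up for illustration.

```python
import re

def split_name_title(cell, separator=","):
    """Break a jumbled 'Name, Title' cell into two fields at the first separator."""
    name, _, title = cell.partition(separator)
    return name.strip(), title.strip()

def digits_only(cell):
    """Strip every non-numeric character, leaving only the digits."""
    return re.sub(r"\D", "", cell)

name, title = split_name_title("Jane Doe, Associate Professor")
phone = digits_only("Phone: (702) 555-0100, Wright Hall")
```

One caution: stripping all non-digits also keeps any digits from a room number or office hours, so cells where the junk itself contains numbers still need a manual check.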

Supposedly, if ParseHub makes a connection in error (e.g. you grab a phone# instead of an email address), you can hit ‘Escape’ to cancel it, but that doesn't work for me.

For more info see:
https://help.parsehub.com/hc/en-us/articles/218226157-Relative-Select

2. How to process additional pages that have the same page layout (starts around 3:54)

Click on "+" next to "Select Page" (at top), choose "Select", and then click on the "next page" or similar button. After you rename that Select step to "NextButton" or similar, choose the "+" command on that line and choose a "Click" command. Since we're extracting the same stuff as in #1 above, we'll use "Go to Existing Template".
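
The logic of this step (scrape the page, click "next", reuse the same template, stop when there is no next button) can be pictured with a small Python sketch. It is only a model of the idea, not how ParseHub works internally: the `PAGES` dictionary stands in for a website, and `extract_records` and `scrape_all_pages` are hypothetical names.

```python
# A stand-in for a website: each "page" has some records and an optional next page.
PAGES = {
    "page1": {"records": ["Prof. A", "Prof. B"], "next": "page2"},
    "page2": {"records": ["Prof. C"], "next": "page3"},
    "page3": {"records": ["Prof. D"], "next": None},
}

def extract_records(page):
    """The 'existing template': the same extraction applied to every page."""
    return list(page["records"])

def scrape_all_pages(start):
    """Apply the same template to each page, following 'next' until it runs out."""
    results = []
    seen = set()
    current = start
    while current is not None and current not in seen:
        seen.add(current)  # guard against a 'next' link that loops back
        page = PAGES[current]
        results.extend(extract_records(page))
        current = page["next"]
    return results

all_names = scrape_all_pages("page1")
```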

For more information:
https://help.parsehub.com/hc/en-us/articles/217752908-Click
https://help.parsehub.com/hc/en-us/articles/217735328-Click-on-the-Next-button-to-scrape-multiple-pages-pagination-

3. Digging into detail pages or other pages that don't have the same layout (starts around 5:00).

We often start with a Directory/Index page that has a few basic fields but not full information; usually we get to the full information page by clicking the Prof's name. Assuming that's the case, click on the "+" next to your initial Name field, choose "Click" and say that "NO" the key field is NOT a "Next" button, and then choose "Create a New Template" radio button (we can't use previous template since the page layout of the Profile/detail page is different from the initial page).

That will then take you to a new Template page (similar to what you had when you started #1 above), and then, from the "+" sign on the "Select Page" line, for each data field you want to capture, one at a time, you'll choose a "Select" Command (instead of a Relative Select) and then basically repeat the procedure of #1 above.
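
The two-template idea can be modeled in a few lines of Python: one function for the directory layout and a different one for the profile layout. As with any toy model, the data, function names, and field names here are invented for illustration and do not reflect ParseHub's internals.

```python
# Stand-in data: a directory page listing names, and one detail page per name.
DIRECTORY = [
    {"name": "Jane Doe", "profile": "doe"},
    {"name": "John Smith", "profile": "smith"},
]
PROFILES = {
    "doe": {"email": "jdoe@example.edu", "title": "Professor"},
    "smith": {"email": "jsmith@example.edu", "title": "Associate Professor"},
}

def directory_template(directory):
    """First template: grab the key field and the link to click on each row."""
    return [(row["name"], row["profile"]) for row in directory]

def profile_template(profile):
    """Second template: a separate plain selection for each field on the detail page."""
    return {"email": profile["email"], "title": profile["title"]}

def scrape_with_details():
    """'Click' each name, then run the second template on the page that opens."""
    records = []
    for name, link in directory_template(DIRECTORY):
        details = profile_template(PROFILES[link])
        records.append({"name": name, **details})
    return records

records = scrape_with_details()
```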

Note: at 6:25 the video (too quickly!) explains how to extract only the text component of links (e.g. the email address) if that’s desired… but it’s not hard for us to delete the unwanted part manually in the spreadsheet.

4. Extracting data by finding text strings (starts around 6:39)

[October 2020 note: I’ve not yet tried this technique, just gathered some information.]

The example in the tutorial relates to tables, but use of text strings isn’t limited to that. I texted Customer Service and got answers to two example cases:

Example 1. https://www.muhlenberg.edu/academics/polisci/faculty/

Question: Prof. names are on the right side of the page, but Political Science Home is at the top of the column and I can't figure out how to tell ParseHub not to treat it as a Person (I need to drill down to get details of each Person, and the department page has a different format).

Answer: You can use a Conditional command for this. In this case you could use:

if $selection.index>0
This will tell ParseHub to only extract/click elements that are after the first.
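
In plain programming terms, the condition skips the element at position 0 (the "Political Science Home" link in this example) and keeps the rest. A minimal Python analogy, using made-up link text and assuming, as the answer implies, that the index counts from zero:

```python
# Made-up link text in page order; the first entry is the non-person link.
links = ["Political Science Home", "Jane Doe", "John Smith"]

# Keep only elements whose index is greater than 0, i.e. everything after the first.
people = [text for index, text in enumerate(links) if index > 0]
```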

Example 2. https://www.bucknell.edu/academics/college-arts-sciences/academic-departments-programs/africana-studies/faculty-staff

Question: Some records have both phone number and email address and others just one, and sometimes when the software confuses a ph# with an email I can't figure out how to tell it not to treat that ph# as an email (or vice-versa).

Answer: Try this:
Select all links
    if $e.text.contains("@")
        extract email
    if !$e.text.contains("@")
        extract phone
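
The same branching can be written out in Python to make the logic concrete. This is an analogy only, with made-up link text; it assumes, as the answer does, that every link on a record is either an email address or a phone number.

```python
def classify_links(link_texts):
    """Sort each link's text into 'email' or 'phone' based on whether it contains '@'."""
    record = {"email": "", "phone": ""}
    for text in link_texts:
        if "@" in text:
            record["email"] = text
        else:
            record["phone"] = text
    return record

both = classify_links(["570-555-0100", "jdoe@example.edu"])
email_only = classify_links(["jsmith@example.edu"])
```

A record with only one of the two links simply leaves the other field empty, which is the behavior the question was after.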

For more information about the Conditional and Go to Template commands:
https://help.parsehub.com/hc/en-us/articles/217753268-Conditional
https://help.parsehub.com/hc/en-us/articles/217753368-Go-to-Template

Last updated: October 2020

DMU Timestamp: October 16, 2020 17:16

Added October 24, 2020 at 12:33am by Dan Doernberg
Title: What to do with the Data

You can either:

1. save the data as a CSV/Excel file and email it to Dan, or

2. open the Excel file and paste the data into the state-specific Google Sheet. If doing this, paste the data starting in the email address column (assuming that's the field you grabbed first); we can do any rearranging of data into columns at the end, so that's nothing you need to worry about.

DMU Timestamp: October 19, 2020 19:17
