Low code extracting listserv archives to tabular data using Webscraper.io

Low code extracting listserv archives to tabular data using Webscraper.io

Low code extracting listserv archives to tabular data using Webscraper.io
Photo by David Ballew / Unsplash

Mailing lists come and go, and if you've kept everything in your inbox (and were a member since the list's inception) then you can always go back to find that random answer someone gave. BUT, if you didn't sign up at the start, or have lost emails, or just want a better way to maintain historical reference information in a tabular format, maybe this approach will be helpful.

Components

  • Web Browser (Firefox recommended, although Chrome will work too)
  • Access to a mailing list that provides web-based archives (like "listserv")
  • Webscraper.io browser extension

Process

For my purpose, I wanted to retrieve the contents of a particular EDUCAUSE Community Group mailing list. It recently moved to a new platform, and they were sunsetting the old mailing list server/service.

  • First, download and install the webscraper.io browser extension
  • Open the Developer tools, and click the webscraper tab which should look something like this:
Webscraper Browser Extension in Firefox
  • Click "Create New Sitemap" and enter the appropriate details
Sitemap configuration
  • You'll then be able to create "selectors" which is the extension's name for how the tool will navigate through the website, and where it will find the data you want to extract. IE: If you have a structure like a mailing list, then you'll navigate from the mailing list home page, then into one of the "month" pages, then into one of the threads, where you'll end up being able to extract the email subject, date, sender, and message body.
  • This is what the main page looks like:
Month listing for the mailing list
  • To allow the extension to parse the mailing list month list, and then traverse into one of the month pages, this is the configuration:
Month Selector configuration
  • In plain english, this says "Look for an <li> element that contains the string "2017" which also has a hyperlink that goes somewhere."
NOTE: we parse one year at a time, due to timeouts and unexpected resource issues. If you try to run the scraper on all 10 years (for this example) it will unexpectedly timeout somewhere in the middle. Doing one year at a time was the happy medium I found.
  • Now, the extension can go from the main page, into a month page. Now we need to go one step further into each thread:
  • This says find the <span> and <a> tag on the page, and grab all of them. These are our individual email threads that look like this (below). It will traverse into each email now, where it will find the subject/sender/date etc.
  • Now we're on the individual email view, so this is where we actually grab our data.
Getting the individual fields we want
  • We grab the date, from, and subject from this page. But...we also want the message body, right? Well on the individual email page, it's displayed in an iframe, which confused the webscraper extension. So, instead we traverse one more level by selecting the "plain text" link, and grabbing the text from there. See in the screenshot above, there's 3 "selectortext" types, and 1 "selectorlink" type....that's how we grab some data from one page, and supplement it with additional data from a subsequent page.
  • The plain text selector is the easiest of the lot. Just grab everything in the <pre> tag:
email body selector configuration
  • That's it! Your "Selector Graph" view should now look like this:
Selector Graph View
  • Simple, right? Traverse from the main page -> select a month -> select a thread -> store the date, from, subject -> select the plain text version -> store the plaintext version -> resulting row of data saved is: Date, Subject, From, Body
  • You can now export the data into XLSX or CSV.
PROTIP, Use XLSX because the CSV export does NOT handle newlines or quoted strings well at all.

That's really it! If you dont want to go through all that trouble, you can import this sitemap I made already for you (just select all, copy and paste into the "import sitemap" option in the extension.

{"_id":"googleworkspace","startUrl":["http://listserv.educause.edu/scripts/wa.exe?A0=GOOGLEWORKSPACE"],"selectors":[{"id":"monthselector","parentSelectors":["_root"],"type":"SelectorLink","selector":"li:contains('2017') a","multiple":true,"delay":0},{"id":"threadselector","parentSelectors":["monthselector"],"type":"SelectorLink","selector":"span a","multiple":true,"delay":0},{"id":"email_date","parentSelectors":["threadselector"],"type":"SelectorText","selector":"tr.emphasizedgroup tr:nth-of-type(4) td:nth-of-type(3) p","multiple":false,"delay":0,"regex":""},{"id":"email_from","parentSelectors":["threadselector"],"type":"SelectorText","selector":"tr.emphasizedgroup tr:nth-of-type(2) td:nth-of-type(3) p","multiple":false,"delay":0,"regex":""},{"id":"email_subject","parentSelectors":["threadselector"],"type":"SelectorText","selector":"tr.emphasizedgroup tr:nth-of-type(1) td:nth-of-type(3) p","multiple":false,"delay":0,"regex":""},{"id":"textversionselector","parentSelectors":["threadselector"],"type":"SelectorLink","selector":"tr.emphasizedgroup tr tr a:nth-of-type(1)","multiple":false,"delay":0},{"id":"email_body_plaintext","parentSelectors":["textversionselector"],"type":"SelectorText","selector":"pre","multiple":false,"delay":0,"regex":""}]}

Data Cleanup

There's always some kind of data cleanup you have to do in any data set, and this is no exception. To make this data easier to work with, here's a few things I did:

  • Separate the sender data into separate fields for the sender display name, and sender email address (lowercased)
  • Normalize the subject line data, to remove "Re: " strings, so threads were easier to find together
  • Extract the domain from the sender email address, so we could see which domains had the most people contributing to the mailing list.

Visualizing / Exploring the Data

For this particular effort, I used Google Data Studio as a quick way to create a dashboard for people to use for exploration and visualization. It works well (and is free) when the data underneath is stored in a Google Sheet or Excel file stored in Google Drive.

Here's some quick screenshots from the dashboard I made:

Main data explorer tab in my Google Data Studio Dashboard
Time series for all messages

I also did a leaderboard page, showing the people who emailed the most, and which domains had the most senders.

Enjoy!

Hopefully this is helpful to people who want to do the same thing for their mailing lists, or for scraping from another site with similar structured navigation. I hadn't used Webscraper.io until finding it a few days ago....so I can definitely recommend it for its simplicity.