Scraping journal data

One of the data wrangling tools I've been meaning to look at is the Google Chrome extension Scraper. Jens Finnäs has a good tutorial on how to start using it. The crux is that you need to be somewhat familiar with XPath notation; you'll use it to navigate the element tree of the web page. Note that the only way to keep a copy of the scraped data is to store it as a Google Drive spreadsheet.

The Federation of Finnish Learned Societies keeps a list of Finnish scientific journals online. As an exercise, let's grab all relevant data from that page with Scraper, save it, import it to OpenRefine via Google Drive, and check against the xISSN web service by OCLC whether the information about the peer-review status of the journals is the same in both sources.

There are various ways to navigate the page with XPath. Here is one possible set of expressions.

[Screenshot: Scraper]
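To make the row and column idea concrete outside the browser, here is a minimal Python sketch. The markup in `SAMPLE_PAGE` is hypothetical and much simpler than the real journal list page, and the function name `scrape_rows` is made up for illustration. It uses the limited XPath support in the standard library, which only works on well-formed markup; a real HTML page would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real journal list page is laid out differently.
SAMPLE_PAGE = """
<html><body>
  <table>
    <tr><th>Name</th><th>ISSN</th><th>Vertaisarvioitu</th></tr>
    <tr><td>Journal A</td><td>1234-5679</td><td>on</td></tr>
    <tr><td>Journal B</td><td>2049-3630</td><td></td></tr>
  </table>
</body></html>
"""

def scrape_rows(page):
    """Return one dict per table row, using a row XPath plus one XPath per column."""
    root = ET.fromstring(page.strip())
    rows = []
    # Row selector: table rows that contain at least one data cell (skips the header row).
    for tr in root.findall(".//table/tr[td]"):
        def cell(xpath):
            # Column selector, evaluated relative to the current row.
            node = tr.find(xpath)
            return (node.text or "").strip() if node is not None else ""
        rows.append({
            "name": cell("td[1]"),
            "issn": cell("td[2]"),
            "peer_reviewed": cell("td[3]"),
        })
    return rows
```

As in Scraper, one expression picks the repeating element (the row) and the others are relative to it, one per output column.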

Click Export to Google Docs…. In the resulting spreadsheet, publish the data as CSV and copy the URL.

[Screenshot: Publish Google Docs spreadsheet]

Create a new project in OpenRefine by getting data from the URL you just copied. Then, based on the column ISSN, add a new column by querying the xISSN API. Choose JSON as the result format.

[Screenshot: Query the xISSN web service by OCLC]
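The query step boils down to building one URL per ISSN. A minimal Python sketch of that follows; note that the endpoint pattern in `XISSN_BASE` and the `method` and `format` parameters are written from memory and should be treated as assumptions to verify against OCLC's documentation, and `xissn_url` is a made-up helper name.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint pattern for the xISSN service; verify before relying on it.
XISSN_BASE = "http://xissn.worldcat.org/webservices/xid/issn/"

def xissn_url(issn):
    """Build a metadata query URL for one ISSN, asking for a JSON response."""
    params = urlencode({"method": "getMetadata", "format": "json"})
    return XISSN_BASE + quote(issn.strip()) + "?" + params
```

In OpenRefine the equivalent is a string expression that concatenates the base URL, the cell value, and the query string.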

OCLC doesn’t know about all ISSNs. For the unknown ones, the service returns

{ "stat":"unknownId" }

In addition, there seems to be at least one invalid ISSN, resulting in

{ "stat":"invalidId" }

[Screenshot: xISSN JSON result]

With the text filter invalidId on the xISSN column, you’ll see that this is indeed the only one.

[Screenshot: Text filter on invalidId]

Values can be edited on the spot; hover the cursor over the invalid cell, and the edit button appears.
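Before correcting the value by hand, it can help to see why an ISSN might be rejected. The sketch below checks the standard ISSN check digit (weights 8 down to 2 over the first seven digits, modulo 11, with X standing for 10). Whether xISSN reports invalidId for exactly this reason is an assumption; the service may apply other checks as well. The function name `is_valid_issn` is made up for illustration.

```python
import re

def is_valid_issn(issn):
    """Return True if issn has the form NNNN-NNNC and its check digit is correct."""
    match = re.fullmatch(r"(\d{4})-(\d{3})([\dXx])", issn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 8, 7, ..., 2 for the first seven digits.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected
```

For example, `is_valid_issn("0378-5955")` is true, while changing the last digit to 4 makes it false.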

From the result column, let’s parse the value of the JSON field labelled peerreview into yet another new column.

[Screenshot: Parsing JSON]
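As a rough Python equivalent of this parsing step, the sketch below pulls peerreview out of a response. The nesting it assumes (a `group` list containing a `list` of records, and `"stat":"ok"` for a successful lookup) is my recollection of the xISSN response layout, not something shown in this post, so treat it as an assumption. `peer_review_flag` is a made-up name.

```python
import json

def peer_review_flag(response_text):
    """Return the peerreview value from an xISSN-style JSON response, or None."""
    data = json.loads(response_text)
    # Responses such as {"stat":"unknownId"} carry no metadata at all.
    if data.get("stat") != "ok":
        return None
    try:
        # Assumed layout: {"stat": "ok", "group": [{"list": [{"peerreview": "Y", ...}]}]}
        return data["group"][0]["list"][0].get("peerreview")
    except (KeyError, IndexError, TypeError, AttributeError):
        return None
```

Returning None for the unknownId and invalidId cases leaves those cells empty, which is what the later comparison expects.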

In the original data, the column Vertaisarvioitu tells whether the journal is peer-reviewed or not. If it is, the value is the Finnish verb form on. Otherwise the cell is empty, meaning that the status is either not known or the journal is not peer-reviewed. xISSN has returned either Y or N. To compare values from these two sources, we need to “harmonize” them first.

[Screenshot: Translate the Finnish peer-review value]
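One way to do the harmonizing, sketched in Python: map on to Y and leave everything else empty. Keeping empty cells empty, rather than turning them into N, is an assumption on my part; it fits the fact that an empty Finnish value does not distinguish "unknown" from "not peer-reviewed". The name `harmonize_finnish` is made up.

```python
def harmonize_finnish(value):
    """Translate the Finnish peer-review marker to the xISSN vocabulary."""
    if value is not None and value.strip().lower() == "on":
        return "Y"
    # Empty or anything else: status unknown or not peer-reviewed, so keep it blank.
    return ""
```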

Now we can make a new column based on the comparison.

[Screenshot: Comparing the peerreview strings]

With this somewhat complicated GREL expression, we’ll find out whether the values are the same or not, provided there are strings to compare in both columns.

if(
  or(isBlank(cells["xISSNpr"].value),
     isBlank(cells["Vertaisarvioitu_en"].value)), 
  "Value(s) missing",
  if(cells["xISSNpr"].value == 
     cells["Vertaisarvioitu_en"].value, 
    "Same", "Different")
)

What happens here is as follows: the boolean or function returns true if the cell is blank in either one of the columns or in both. In that case, the first expression of the outer if control function is evaluated, i.e. the string Value(s) missing is returned. However, if there is a value in both columns, the second expression of the outer if function is evaluated. This is an if function too; the actual string comparison takes place there. If the strings are the same, the first expression of the inner if is evaluated, resulting in Same; otherwise we’ll get Different.
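For readers more at home in a general-purpose language, the same decision logic can be sketched in Python. Treating None and whitespace-only strings as blank is my approximation of GREL's isBlank, and the function names are made up for illustration.

```python
def is_blank(value):
    """Approximate GREL's isBlank: None or a string with no visible characters."""
    return value is None or str(value).strip() == ""

def compare_peer_review(xissn_value, finnish_value):
    """Mirror the nested if: first check for missing values, then compare."""
    if is_blank(xissn_value) or is_blank(finnish_value):
        return "Value(s) missing"
    return "Same" if xissn_value == finnish_value else "Different"
```

The early return plays the role of the outer if; the conditional expression on the last line is the inner one.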

From the text facet on the new column PrDiff we’ll see that there are six journals whose peer-review information differs between these two sources.

[Screenshot: Text facet on peer-review values]
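A text facet is essentially a count of distinct values in a column. A minimal Python sketch of that idea follows; `text_facet` is a made-up name, and any values you feed it here are illustrative rather than the actual journal data.

```python
from collections import Counter

def text_facet(values):
    """Return (value, count) pairs, most frequent first, like a text facet."""
    return Counter(values).most_common()
```

Called on the PrDiff column, the count next to Different is the number of journals to look at more closely.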

Posted by Tuija Sonkkila

About Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.