One of the data wrangling tools I’ve been meaning to look at is the Google Chrome extension Scraper. Jens Finnäs has a good tutorial on how to start using it. The one prerequisite is that you are somewhat familiar with XPath notation; you’ll use it to navigate the element tree of the web page. Note that the only way to keep a copy of the scraped data is to store it as a Google Drive spreadsheet.
The Federation of Finnish Learned Societies keeps a list of Finnish scientific journals online. As an exercise, let’s grab all relevant data from that page with Scraper, save it, import it to OpenRefine via Google Drive, and check against OCLC’s xISSN web service whether the peer-review status of the journals is the same in both sources.
There are various ways to navigate the page with XPath. Here is one possible set of expressions.
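To illustrate the kind of XPath navigation involved, here is a minimal sketch in Python. It assumes the journal list is an HTML table with one row per journal; the markup, the column order and the sample rows below are made up for the example and are not copied from the real page. Python’s standard library only supports a subset of XPath, whereas Scraper uses the browser’s full XPath engine, so the expressions you type into Scraper may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the journal list page.
SAMPLE = """
<html><body>
<table>
  <tr><th>Nimi</th><th>ISSN</th><th>Vertaisarvioitu</th></tr>
  <tr><td>Journal A</td><td>1234-5679</td><td>on</td></tr>
  <tr><td>Journal B</td><td>0000-0000</td><td></td></tr>
</table>
</body></html>
"""


def scrape_rows(markup):
    """Return one list of cell texts per data row of the table."""
    root = ET.fromstring(markup)
    rows = []
    # tr[td] keeps only rows that have <td> children, which skips the header row.
    for tr in root.findall(".//table/tr[td]"):
        rows.append([(td.text or "").strip() for td in tr.findall("td")])
    return rows


for row in scrape_rows(SAMPLE):
    print(row)
```

In Scraper the same idea is split in two: one expression selects the rows, and one relative expression per column picks the cell.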
Click Export to Google Docs…. In the resulting spreadsheet, publish the data as CSV, and copy the URL.
Create a new project in OpenRefine by getting data from the URL you just copied. Then, based on the column ISSN, add a new column by querying the xISSN API. Choose JSON as the result format.
OCLC doesn’t know about all ISSNs. For the unknown ones, the service returns
{ "stat":"unknownId" }
In addition, there seems to be at least one invalid ISSN, resulting in
{ "stat":"invalidId" }
With a text filter invalidId on the xISSN column, you’ll see that this is indeed the only one. Values can be edited on the spot; hover the cursor over the invalid cell, and the edit button comes up.
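An ISSN that is simply mistyped can often be caught before querying anything, because the last character of an ISSN is a check digit. The sketch below implements the standard modulus 11 check; the function name is my own, and whether xISSN’s invalidId is based on exactly this rule is an assumption.

```python
def issn_is_valid(issn):
    """Check the form NNNN-NNNC and the modulus 11 check digit of an ISSN."""
    s = issn.replace("-", "").strip().upper()
    if len(s) != 8:
        return False
    body, last = s[:7], s[7]
    if not (body.isascii() and body.isdigit()):
        return False
    if not (last == "X" or (last.isascii() and last.isdigit())):
        return False
    # The first seven digits are weighted 8, 7, 6, 5, 4, 3, 2.
    total = sum(int(d) * w for d, w in zip(body, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return last == expected


print(issn_is_valid("0378-5955"))
print(issn_is_valid("0378-5954"))
```

A check like this only tells that an ISSN is well formed, not that it has actually been assigned to a journal.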
From the result column, let’s parse the value of the JSON field labelled peerreview into yet another new column.
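Outside OpenRefine, the same extraction could be sketched in Python as follows. Only the stat values and the field name peerreview come from the responses discussed here; the nesting of a successful response (a group list holding a list of records) is an assumption on my part, so adjust the path to whatever the service actually returned.

```python
import json

# Assumed shape of a successful response; only "stat" and "peerreview"
# are taken from the text, the rest is a guess for illustration.
SAMPLE_OK = '{"stat":"ok","group":[{"rel":"this","list":[{"issn":"1234-5679","peerreview":"Y"}]}]}'


def peerreview_of(raw):
    """Return the peerreview value of a response, or '' if there is none."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return ""
    if not isinstance(data, dict) or data.get("stat") != "ok":
        # Covers unknownId, invalidId and anything else that is not "ok".
        return ""
    try:
        return data["group"][0]["list"][0].get("peerreview", "")
    except (KeyError, IndexError, TypeError, AttributeError):
        return ""


print(peerreview_of(SAMPLE_OK))
print(repr(peerreview_of('{ "stat":"unknownId" }')))
```

Returning an empty string for every failure case keeps the new column blank wherever xISSN had nothing to say, which is what the later comparison relies on.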
In the original data, the column Vertaisarvioitu tells whether the journal is peer-reviewed or not. If it is, the value is on, the Finnish verb form for “is”. Otherwise the cell is empty, meaning that the status is either not known or the journal is not peer-reviewed. xISSN, in turn, has returned either Y or N. To compare values from these two sources, we need to “harmonize” them first.
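One plausible way to harmonize is to map on to Y and leave everything else blank, since an empty cell cannot be told apart from “not known”. The sketch below does that in Python; the function name is hypothetical, and the choice to keep blanks blank rather than turning them into N is my assumption.

```python
def harmonize(vertaisarvioitu):
    """Map the Finnish 'on' to 'Y'; anything else stays blank."""
    value = (vertaisarvioitu or "").strip().lower()
    return "Y" if value == "on" else ""


print(harmonize("on"))
print(repr(harmonize("")))
```

Mapping blanks to N instead would count every journal with unknown status as a disagreement with xISSN, which is probably not what we want.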
Now we can make a new column based on the comparison.
With this somewhat complicated GREL expression, we’ll find out whether the values are the same or not, provided there are strings to compare.
if( or(isBlank(cells["xISSNpr"].value), isBlank(cells["Vertaisarvioitu_en"].value)), "Value(s) missing", if(cells["xISSNpr"].value == cells["Vertaisarvioitu_en"].value, "Same", "Different") )
What happens here is as follows: the boolean or function returns true if the cell is blank in either one of the columns, or in both. In that case, the first expression of the outer if control function is evaluated, i.e. the string Value(s) missing is returned. However, if there is a value in both columns, the second expression of the outer if is evaluated. This is an if function too, and the actual string comparison takes place there. If the strings are the same, the first expression of the inner if is evaluated, resulting in Same; otherwise we’ll get Different.
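For readers more at home in Python than in GREL, the same decision logic can be sketched like this. The helper is_blank only approximates GREL’s isBlank, and both function names are mine.

```python
def is_blank(value):
    """Rough stand-in for GREL's isBlank: None or an empty string."""
    return value is None or str(value).strip() == ""


def pr_diff(xissn_pr, vertaisarvioitu_en):
    """Compare the two peer-review values, given both are present."""
    if is_blank(xissn_pr) or is_blank(vertaisarvioitu_en):
        return "Value(s) missing"
    return "Same" if xissn_pr == vertaisarvioitu_en else "Different"


print(pr_diff("Y", "Y"))
print(pr_diff("N", "Y"))
print(pr_diff("", "Y"))
```

The early return plays the role of the outer if, and the conditional expression on the last line is the inner one.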
From the text facet on the new column PrDiff we’ll see that there are six journals whose peer-review info is different in these two sources.