Towards more automation (and better quality)

A few weeks ago, CSC-IT Center for Science opened the much-awaited VIRTA REST API for publications (page in Finnish). VIRTA is the Finnish higher education achievement register, a tool to be utilized in authoritative data harvesting so that the collected data will be both commensurable and of good quality.

The API is good news for a data consumer like me who likes to experiment with altmetrics applications. As strange as it sounds, it’s not necessarily that easy to get your hands on a set of DOIs from inside the University. The ultimate quality control happens when the University reports its academic output to the Ministry. And that’s precisely what VIRTA is about: the publication set there is stamped by the University as an achievement of its researchers.

VIRTA is still a work in progress, and so far only a couple of universities import their publication data there on a regular basis, either from their CRIS or otherwise. Aalto University is not among the piloting organisations, so you won’t find our 2016 publications there yet. The years covered are 2014-2015.

Get data and filter

OK, let’s download our stuff in XML (the other option is JSON). Note that access to the service is restricted by IP address, so the very first thing you have to do is to contact CSC and tell them who you are and which IP address you work from. Once the door is open, you’ll need the organisation ID of Aalto, which is 10076.

Below, I’ve changed the curl command given on the API page so that the response headers are not written to the file (no -i option), only the resulting XML.

curl -k -H "Accept: application/xml" "" -o aaltopubl.xml

My target is an R Shiny web application similar to the one I made for the 2:am conference, where you can filter data by School etc. To make my life easier during the subsequent R coding, I first produced a CSV file from the returned XML, with selected columns: DOI, publication year, title, name of the journal, number of authors, OA status and department code. At Aalto, the first two characters of the department code tell the School, which helps.
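As a minimal sketch of how the School could be derived from the department code in R: the department codes and DOIs below are made up for illustration, and the column names are my own choice, not anything dictated by VIRTA.

```r
# Made-up DOIs and department codes, for illustration only.
pubs <- data.frame(
  doi  = c("10.1000/a1", "10.1000/b2", "10.1000/c3"),
  dept = c("T3060", "E4020", "T2010"),
  stringsAsFactors = FALSE
)

# The first two characters of the department code identify the School.
pubs$school_prefix <- substr(pubs$dept, 1, 2)

# Count publications per School prefix.
table(pubs$school_prefix)
```

A lookup table from prefix to School name would be the natural next step, but that mapping is specific to Aalto and is not shown here.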

Below is the rather verbose XSLT transformation code, parse.xsl. What it says is that for every publication that has a DOI, output the above-mentioned element values in quotes (separated by semicolons), add a line feed, and continue with the next publication. Semicolons inside titles are replaced with full stops so that they don’t break the columns.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<!-- The csc prefix must also be declared on xsl:stylesheet, bound to the namespace URI used in the VIRTA XML response. -->

	<xsl:output method="text" encoding="UTF-8"/>

	<xsl:template match="/">
<xsl:for-each select="//csc:Julkaisu[csc:DOI]">
"<xsl:value-of select="csc:DOI"/>";"<xsl:value-of select="csc:JulkaisuVuosi"/>";"<xsl:value-of select="translate(csc:JulkaisunNimi, ';', '.')"/>";"<xsl:value-of select="csc:LehdenNimi"/>";"<xsl:value-of select="csc:TekijoidenLkm"/>";"<xsl:value-of select="csc:AvoinSaatavuusKoodi"/>";"<xsl:value-of select="csc:JulkaisunOrgYksikot/csc:YksikkoKoodi"/>"<xsl:text>&#10;</xsl:text>
</xsl:for-each>
	</xsl:template>
</xsl:stylesheet>

For the transformation, I use the Saxon-HE XSLT engine (the saxon9he.jar file). You can run it from the command line like this:

java -jar saxon9he.jar -s:aaltopubl.xml -xsl:parse.xsl -o:aaltopubl.csv
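Reading the resulting file into R could look roughly like this. The sketch uses two made-up records in place of aaltopubl.csv so that it runs on its own, and the column names are illustrative. Everything is read as character first, which keeps DOIs and department codes intact.

```r
# Two made-up records in the layout that parse.xsl writes:
# quoted fields, semicolon-separated, no header row.
csv_text <- paste(
  '"10.1000/xyz123";"2015";"A sample title";"Journal A";"3";"1";"T3060"',
  '"10.1000/abc456";"2014";"Another title";"Journal B";"12";"0";"E4020"',
  sep = "\n"
)

# Illustrative column names, in the order of the selected elements.
cols <- c("doi", "year", "title", "journal", "authors", "oa", "dept")

# For the real file, replace text = csv_text with
# file = "aaltopubl.csv", fileEncoding = "UTF-8".
pubs <- read.table(
  text = csv_text,
  sep = ";", quote = "\"", header = FALSE,
  col.names = cols, colClasses = "character",
  comment.char = ""   # titles may contain '#', so disable comment handling
)

# Convert the numeric columns explicitly.
pubs$year    <- as.integer(pubs$year)
pubs$authors <- as.integer(pubs$authors)

str(pubs)
```

Blank lines in the file are skipped by read.table by default, so empty lines between records do no harm.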

Right, so now I have the basic data. Next, altmetrics from Altmetric.

Query altmetrics with a (cleaned) set of DOIs

With the reliable rAltmetric R package, the procedure follows a familiar pattern: each DOI in turn is prefixed with doi: and passed to the altmetrics function, which queries the Altmetric API. The results are saved as a list. When all rows are done, the list is transformed to a data frame with the altmetric_data function.

But before that happens, the DOIs need cleaning. Upstream, there was no quality control of DOI input, so the field can (and does) contain extra characters that the rAltmetric query does not tolerate, and with good reason.

dois$doi <- gsub("", "", dois$doi)
dois$doi <- gsub("", "", dois$doi)
dois$doi <- gsub("doi:", "", dois$doi)
dois$doi <- gsub("DOI:", "", dois$doi)
dois$doi <- gsub("DOI", "", dois$doi)
dois$doi <- gsub("%", "/", dois$doi)
dois$doi <- gsub(" ", "", dois$doi)
dois$doi <- gsub("^/", "", dois$doi)
dois$doi <- gsub("", "", dois$doi)

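library(dplyr)

# Drop placeholder values, empty strings and full URLs;
# keep only values that contain a slash, as every DOI does
dois_cleaned <- dois %>%
  filter(doi != 'DOI') %>%
  filter(doi != '[doi]') %>%
  filter(doi != "") %>%
  filter(!grepl("http://", doi)) %>%
  filter(grepl("/", doi))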
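An alternative to removing unwanted pieces one by one is to extract whatever looks like a DOI and discard the rest. The following is only a sketch of that idea in base R, not what I ran: the input values are made up, and the pattern is an assumption about what a typical DOI looks like. Note that, unlike the filtering above, it also picks the DOI out of a full URL.

```r
# Messy, made-up inputs of the kind described above.
raw <- c("doi:10.1000/xyz123", " DOI 10.1000/abc456 ", "[doi]",
         "http://dx.doi.org/10.1000/def789", "")

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a
# suffix without white space. Stricter or looser patterns are possible.
doi_pattern <- "10\\.[0-9]{4,9}/[^[:space:]]+"

# regexpr gives the first match per element (-1 if there is none).
m <- regexpr(doi_pattern, raw)

# Fill in the matched substrings; values without a match stay NA.
extracted <- rep(NA_character_, length(raw))
extracted[m > 0] <- regmatches(raw, m)

extracted
```

Rows where the result is NA can then be dropped before querying.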

When all extras are removed, querying is easy.

library(rAltmetric)

# Query the Altmetric API once per DOI, then bind the results into one data frame
raw_metrics <- plyr::llply(paste0('doi:', dois_cleaned$doi), altmetrics, .progress = 'text')
metric_data <- plyr::ldply(raw_metrics, altmetric_data)
write.csv(metric_data, file = "aalto_virta_altmetrics.csv")
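With a long DOI list, one failing request should preferably not abort the whole run. Here is a hedged sketch of a wrapper that turns an error into NULL. To keep it runnable offline, it uses a stand-in function, fake_query, which is purely hypothetical; the assumption is that the real altmetrics function could be wrapped in the same way.

```r
# Stand-in for the real query function, so the sketch runs offline.
# It fails for one made-up DOI to mimic a DOI unknown to the service.
fake_query <- function(id) {
  if (id == "doi:10.1000/missing") stop("not found")
  list(doi = sub("^doi:", "", id), score = 1)
}

# Wrap any query function so that an error yields NULL instead of aborting.
safely <- function(f) {
  function(id) tryCatch(f(id), error = function(e) NULL)
}

ids <- paste0("doi:", c("10.1000/xyz123", "10.1000/missing", "10.1000/abc456"))
results <- lapply(ids, safely(fake_query))

# Drop the failures before building a data frame.
results <- Filter(Negate(is.null), results)
length(results)
```

The same safely() wrapper could be handed to plyr::llply in place of the bare function.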

Data storage

From this step onwards, the rest is Shiny R coding, along similar lines to the 2:am experiment.

One thing I decided to do differently this time, though: data storage. Because I’m aiming at more automation where, with the same DOI set, new metrics are gathered on, say, a monthly basis (to get time series), I need a sensible way to store data between runs. I could of course just upload new data by re-deploying the application to where it is hosted. Every file in the synced directory is transmitted to the remote server. However, that could result in an unnecessarily bloated application.

Other Shiny users ponder this too, of course. In his blog post Persistent data storage in Shiny apps, Dean Attali goes through several options.

First I had in mind using Google Drive, thanks to a handy new R package, googlesheets. For that, I added a date stamp to the data frame and subsetted it by School. Then, after the necessary OAuth step, gs_new registers a new Google spreadsheet, names its first sheet with ws_title, and uploads the data given in input. In the subsequent piped commands, gs_ws_new generates new sheets and populates them with the rest of the data frames.


upload <- gs_new("aalto-virta", ws_title = "ARTS", input = ARTS, trim = TRUE) %>% 
  gs_ws_new("BIZ", input = BIZ, trim = TRUE) %>% 
  gs_ws_new("CHEM", input = CHEM, trim = TRUE) %>%
  gs_ws_new("ELEC", input = ELEC, trim = TRUE) %>%
  gs_ws_new("ENG", input = ENG, trim = TRUE) %>%
  gs_ws_new("SCI", input = SCI, trim = TRUE)
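The date stamp and the per-School subsets used as input above could be produced along these lines. This is a sketch with made-up rows, and school is a hypothetical column name holding the School abbreviation.

```r
# Made-up rows; "school" is a hypothetical column holding the School name.
metrics <- data.frame(
  doi    = c("10.1000/a1", "10.1000/b2", "10.1000/c3"),
  score  = c(3.5, 0.25, 12),
  school = c("ARTS", "SCI", "SCI"),
  stringsAsFactors = FALSE
)

# Date stamp for this harvest, so that later runs can be told apart.
metrics$harvested <- as.character(Sys.Date())

# One data frame per School, collected in a named list.
by_school <- split(metrics, metrics$school)

names(by_school)
nrow(by_school$SCI)
```

Individual data frames are then available as by_school$ARTS, by_school$SCI and so on.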

Great! However, when I tried to make use of the data, I stumbled on the same HTTP 429 issue that Jennifer Bryan writes about here. It persisted even when I stripped data to just 10 rows per School. I’ll definitely return to this package in the future but for now, I had to let it be.

Next try: Dropbox. This proved more successful. The code is quite terse. Note that here too, you first have to authenticate. When that is done, data is uploaded. The dest argument refers to a Dropbox directory.


library(rdrop2)  # provides drop_upload and drop_get
write.table(aalto_virta, file = "aalto-virta-2016_04.csv", row.names = FALSE, fileEncoding = "UTF-8")
drop_upload("aalto-virta-2016_04.csv", dest = "virta")

In the application, you then download the data with drop_get when need be, and read it into the application with read.table or some such.

drop_get("virta/aalto-virta-2016_04.csv", overwrite = T)
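For the reading part, here is a small round-trip sketch without Dropbox: it writes a made-up data frame with the same write.table settings as above, into a temporary directory, and reads it back. The year_month stamp in the file name is generated from the current date, which is an assumption about how monthly files could be named.

```r
# Made-up data standing in for the aalto_virta data frame.
aalto_virta <- data.frame(
  doi   = c("10.1000/a1", "10.1000/b2"),
  title = c("A title with spaces", "Another title"),
  score = c(3.5, 12),
  stringsAsFactors = FALSE
)

# File name with a year_month stamp, in the style of aalto-virta-2016_04.csv.
stamp <- format(Sys.Date(), "%Y_%m")
fname <- file.path(tempdir(), paste0("aalto-virta-", stamp, ".csv"))

# Same settings as above: space-separated, strings quoted, header row, no row names.
write.table(aalto_virta, file = fname, row.names = FALSE, fileEncoding = "UTF-8")

# read.table splits on white space and honours quotes by default,
# which matches that output; the header has to be requested explicitly.
back <- read.table(fname, header = TRUE, stringsAsFactors = FALSE,
                   fileEncoding = "UTF-8", comment.char = "")
back
```

Despite the .csv extension, the file is space-separated, so read.csv with its default settings would not split it into columns.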

I didn’t notice any major lag in the initiation phase of the application where you see the animated Please Wait… progress bar, so that’s a good sign.

Here is the first version of the application. Now with quality control! In the next episode: time series!

Posted by Tuija Sonkkila

About Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.