Course recommendations as a graph

To browse Aalto University’s course selection, there is a fairly new and concise site. One interesting novelty is the recommendations, presented as a Related courses section at the bottom of each course page.

They allow harvesting, so I became curious about what a course network would look like. Are some courses recommended more than others? Which ones?

While working on this, I learned new things about Python, both as a harvesting platform and as a tool for constructing a directed graph in the form of a GEXF file. A nice example by Christopher Kullenberg helped a great deal here.

The final graph is visualized with the GEXF JavaScript Viewer. The network layout is ForceAtlas2 by Gephi, with default parameters. The size of a node (= course) reflects its In-Degree: the bigger the circle, the more courses there are that recommend this particular one.
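For the curious, here is a minimal sketch of that step in Python. The course codes other than MUO-C3007 are made up, and the GEXF skeleton is stripped down to the bare minimum – in practice a library such as networkx can write the file for you, with sizes and colors included.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical harvest result (only MUO-C3007 is a real course code here):
# course -> list of courses its page recommends
recs = {
    "MUO-C3007": [],
    "TU-A1100": ["MUO-C3007"],
    "ARTS-E5500": ["MUO-C3007", "TU-A1100"],
}

# In-degree = the number of courses recommending this one; drives the node size
in_degree = Counter(t for targets in recs.values() for t in targets)

# A minimal GEXF skeleton; the real file carries more attributes (color, size)
gexf = ET.Element("gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"})
graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "directed"})
nodes = ET.SubElement(graph, "nodes")
edges = ET.SubElement(graph, "edges")
for course in recs:
    ET.SubElement(nodes, "node", {"id": course, "label": course})
edge_id = 0
for source, targets in recs.items():
    for target in targets:
        ET.SubElement(edges, "edge",
                      {"id": str(edge_id), "source": source, "target": target})
        edge_id += 1

print(in_degree["MUO-C3007"])  # 2
# ET.ElementTree(gexf).write("courses.gexf") would produce the file for Gephi
```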

The color of the node reveals the School. The RGB values are taken from the Aalto University Visual Identity instructions.

Click a node, and an information panel opens up to the left.

Most of the nodes come with core metadata like title, number of credits, and description. If these are missing, it almost certainly means that the harvester didn’t find anything because my Python code was too optimistic. Although the course pages are built with similar HTML elements and attributes, there are exceptions. For example, I realized that some 50 course titles are not within an H3 element. Because the harvest took more than three hours (!), I didn’t want to bother the site with a re-run. The few nodes with a high number of In-Degree links but no course metadata I edited manually in Gephi’s Data Laboratory before exporting the data.

By default, after the Gephi process, inbound links – i.e. courses that recommend the node in question – were listed nicely and correctly in the information panel. However, outbound links – courses that this node recommends – were not. The course code was OK, but the title was, incorrectly, the title of the source node. After digging into the JavaScript code of the viewer and confirming that the label of the source node was indeed used as the title of the outbound link, it dawned on me that I could perhaps make use of the idle weight attribute of the edge (i.e. of the link/recommendation). Luckily, with only minor modifications to the JavaScript code, it worked.

I guess I could have done the needed GEXF modifications within Gephi too, but decided to brush up my dormant XSLT – once an everyday language at work due to frequent XML transformations, today an exception.

So, which courses are recommended the most?

Number one is MUO-C3007, Design traditions, a bachelor-level course on the legacy of design, provided by the School of ARTS. The course is recommended by all other Schools of Aalto except SCI, the School of Science. By hovering over the inbound links you can see how far away, network-wise, some recommendations come from. I guess the design of physical artifacts follows similar historical traditions no matter what the realm of the final product is.

The second most frequently recommended course, not far behind MUO-C3007, is A23E53015, offered by the Open University: a master’s-level evening course, How to manage and assess the power of the brand (my translation).

Which courses are the most active in recommending others, you might ask. Well, differences in ranking in this direction are hard to discern. Most courses recommend many others.

Jupyter notebooks and XSLT code are available at GitHub.

Posted by Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.
Uncategorized - Comments Off on Course recommendations as a graph

Everyday altmetrics

Show me the numbers

A while ago I was contacted by an academic person from one of the Schools of Aalto University. They wanted a quick look at how their publications from last year had performed altmetrics-wise.

We do not yet have any commercial solution that would show this. My snapshots from 2014, 2015, 2016–2017, and 2018 do not update themselves. At some point I had in mind building a database-driven application, but there’s only so much time.

Yet what we do have is our CRIS, which is as up-to-date as this type of service can be. Instructions to show the Altmetric badge are embedded in every page of the portal, so whenever there is something to show, the iconic doughnut is rendered. However, this is the grassroots view, one article at a time. What we need is to go one or more levels up the organization tree.

To build a report that bundles research output by their home of origin (department, research group, or centre of excellence) – “managed by” in CRIS parlance – I first need a boilerplate CRIS report on publications: School, department, research group, DOI, title, etc. Run it, open the result in Excel, and save as HTML. The goal here is to get the Altmetric badge to appear (or not) on every relevant row of the HTML table, next to a DOI. For this I need to add two things: a link to Altmetric’s JavaScript file, and a placeholder div element with some attributes. How to do this?

There are obviously many options, from command-line tools to programming languages. Here I show some solutions with sed, awk, XSLT, JavaScript, and Python. Thanks to the solid programming skills of our new roommate, the solution we actually delivered (in 15 minutes) was the Python one. Here is an example of what the result looked like. A quick and dirty solution, but it does the job.
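For flavour, here is a hedged sketch of what such a Python pass could look like. The embed script URL and div attributes follow Altmetric’s badge instructions, while the DOI pattern and the sample HTML are simplifications of my own:

```python
import re

# Altmetric's embed script; loading it once makes the badges render
ALTMETRIC_JS = ('<script type="text/javascript" '
                'src="https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js"></script>')

# Simplistic DOI pattern: good enough for a quick and dirty table pass
DOI_RE = re.compile(r'(10\.\d{4,9}/[^\s<"]+)')

def add_badges(html):
    # Put a badge placeholder div next to every DOI found in the table
    html = DOI_RE.sub(
        lambda m: (m.group(1) +
                   ' <div class="altmetric-embed" data-badge-type="donut" '
                   'data-doi="' + m.group(1) + '"></div>'),
        html)
    # Load the script once, just before the closing body tag
    return html.replace("</body>", ALTMETRIC_JS + "</body>")

sample = '<html><body><table><tr><td>10.1000/xyz123</td></tr></table></body></html>'
print(add_badges(sample))
```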

Forget the numbers, show me the tweets!

As useful as the metrics themselves can be, the real thing lies in the human action. Who said what about the article? In which manner? What sort of interests do these Whos have? What we need is a way to represent data from Twitter, the social media giant that has become important also in communicating about science. Altmetric kindly shows the latest tweets for free, but to see all of them you’d need a license.

While at it, I’d like to mention that a good and timely read on the What, Where, How, When and Who of academic Twitter is the Altmetric blog mini series by guest authors Stefanie Haustein, Germana Barata, Rémi Toupin and Juan Pablo Alperin.

So, let’s put our focus on Aalto University publications since 2017.

With one of our CRIS standard reports I get the listing of all DOIs. FYI, roughly 80% of all Aalto University publications since 2017 have got a DOI, which is not bad.

With this list of DOIs, I then turn to the rich source of CrossRef Event Data. The work is very easy thanks to the crevents R client by the awesome people at rOpenSci. In no time (read: a weekend-ish) I had the data ready, including the status IDs of the tweets. Feeding those to the lookup_statuses function of rtweet, I got back a whopping amount of information on each tweet. The trickiest task (for me) was understanding the data model of retweets. From then on it was fairly easy to build, on top of the tweets, a standard Shiny web application where the user can drill into the organization. Thanks to the advanced features of DT, rows can be sorted and filtered interactively by column.
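In Python terms, the lookup step could be sketched roughly as below. This blog used R, so the snippet is only illustrative; the endpoint and parameters follow the CrossRef Event Data documentation, and the sample response is made up:

```python
from urllib.parse import urlencode

# Querying CrossRef Event Data for tweets about one DOI (endpoint per their
# docs; the mailto parameter politely identifies the caller)
BASE = "https://api.eventdata.crossref.org/v1/events"

def build_query(doi, mailto="me@example.org"):
    return BASE + "?" + urlencode({"obj-id": doi, "source": "twitter",
                                   "mailto": mailto})

def tweet_ids(events_json):
    # Each event's subj_id is a tweet URL; the status ID is its last path segment
    return [e["subj_id"].rstrip("/").rsplit("/", 1)[-1]
            for e in events_json["message"]["events"]]

# A made-up response in the Event Data envelope shape, for illustration only
sample = {"message": {"events": [
    {"subj_id": "https://twitter.com/someone/statuses/123456789",
     "source_id": "twitter"},
]}}

print(build_query("10.1000/xyz123"))
print(tweet_ids(sample))  # status IDs ready for a tweet-lookup call
```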

Note that I left out those articles for which CrossRef did not return any tweet info.

To add a little metrics sugar I present, for each selected unit: the most tweeted article; the median number of tweets; the tweet with the longest life span so far; and the median life span.

Some ideas for poking around:

  • sort Time span, or adjust the slider in the filter. Span shows the time difference in days between the first and the last/latest tweet about that article. Note that a single tweet shows as 0.
  • Description is the About text of the Twitter screen name aka account. Try e.g. different occupations or substrings of them, like journalist, professor, dr., teacher; hashtags such as #health or #brain; or emojis like ⚽, 🚁 or 🇨🇷
  • Location could potentially be of interest, but tweeters need to opt in to the service, which BTW is a good thing
  • you can use multiple filters at the same time. For example, you might ask yourself: “Which professors (or accounts claiming to be one) whose tweets span over a month have the most followers?”
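The Span figure above – days between the first and the latest tweet, 0 for a single tweet – is simple enough to sketch:

```python
from datetime import datetime

def tweet_span_days(timestamps):
    """Days between the first and the latest tweet; a single tweet gives 0."""
    ts = sorted(timestamps)
    return (ts[-1] - ts[0]).days

print(tweet_span_days([datetime(2019, 3, 1), datetime(2019, 3, 30)]))  # 29
print(tweet_span_days([datetime(2019, 3, 1)]))  # 0
```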

Sometimes the search box serves better than filters. For example, to find everything about energy – whether the word appears in articles, tweets, screen names, or descriptions – use search.

CrossRef Event Data is a warmly welcomed service! Besides Twitter, other interesting data sources are e.g. Wikipedia and Newsfeed.

Note that CrossRef and Altmetric can return different results. I haven’t done any thorough comparison, but one particular article got my attention. The Altmetric badge knows that there are a lot of tweeters on this one, yet CrossRef finds only a few. It turned out that preprints (arXiv in this case) are not that well covered by CrossRef.

“Half of all tweets are retweets, and the average document [in Altmetric] has a tweet span of 81 days.” My small sample follows these patterns quite well: retweets 63%, median tweet life span 81.8 days.

R code is available at GitHub.


Open landscape

In his recent blog posting Understanding the implications of Open Citations – how far along are we?, Aaron Tay from Singapore Management University gives a compact and clear overview of open citations. As Tay rightly notes,

With the recent interest in integrating discovery of open access, as well as linked data (with a dash of machine learning and text mining) we have the beginnings of an interesting situation. A third development which was harder to forsee is the rise in Open Citation movement

Tay mentions the fresh CWTS analysis Crossref as a new source of citation data: A comparison with Web of Science and Scopus, where at least 39.7% of Web of Science references were found to match an open reference.

Also, finding OA texts is easier by the day.

In The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles Heather Piwowar​ et al. estimate that

at least 28% of the scholarly literature is OA (19M in total) and that this proportion is growing, driven particularly by growth in Gold and Hybrid. The most recent year analyzed (2015) also has the highest percentage of OA (45%). Because of this growth, and the fact that readers disproportionately access newer articles, we find that Unpaywall users encounter OA quite frequently: 47% of articles they view are OA.

Lately, Digital Science’s Dimensions has generated a lot of welcome buzz. As Tay points out, their data are at least partly based on Open Citations, but

I’m pretty sure Dimensions goes beyond it, as it is a combination of input and expertise from 6 different teams including ReadCube, Altmetric, Figshare, Symplectic, DS Consultancy and ÜberResearch and other publisher partners.

Compared to my last test with the OpenCitations Corpus and Crossref, how would Dimensions serve us, citations-wise? Also, how many Aalto University articles can be found and read as OA?

Dimensions offers an open Metrics API which is a friendly gesture. However, see their Terms of Use.

As a base set, I used Aalto University publications from 2015 to 2017. These years are well covered by the VIRTA service, and thanks to the REST API (in Finnish), getting your hands on the publication set is fairly easy. Basically you just need to parse the returned XML. Note that I could also have started with data from our local DW; publication metadata are transferred first from our CRIS into the DW and later to VIRTA on a regular basis.
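As an illustration of the parsing step – with made-up element names, since the real VIRTA schema differs (see their documentation) – the idea is simply:

```python
import xml.etree.ElementTree as ET

# A toy, VIRTA-like XML snippet; the real element names and namespaces differ,
# so treat this only as the shape of the parsing step
xml_doc = """<publications>
  <publication><doi>10.1000/abc1</doi><year>2015</year></publication>
  <publication><doi>10.1000/abc2</doi><year>2016</year></publication>
</publications>"""

root = ET.fromstring(xml_doc)
dois = [p.findtext("doi") for p in root.findall("publication")]
print(dois)  # ['10.1000/abc1', '10.1000/abc2']
```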

While at it, in addition to citations data via Dimensions and Web of Science (WoS), I mashed up some more:

  • links to possible Open Access full texts via Unpaywall with the roadoi R package by Najko Jahn
  • Mendeley reader counts, tweets, and Altmetric score by Altmetric
  • research field names from Statistics Finland
  • Aalto University department names from the local CRIS

Those of you interested in the R code might like to check it on GitHub. A mental note to myself: combining data from so many external sources desperately needs a modular solution.

All other data featured here you can query, parse and join computationally, except WoS citations, because their APIs don’t return them. See comments in the code for details. EDIT: the citation count is in fact returned; my misunderstanding. See e.g. the example code with one DOI here that uses the wosr package.

Because there is no quality control of DOI strings in VIRTA, all bad DOIs are silently skipped. Since late 2016 at our University, broken DOIs are history at last thanks to the new CRIS, but legacy data are still haunting us. To fix DOIs, I only made some quick replacements, but apparently you can in fact catch most of the correct ones.
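A sketch of such a catch-most cleaning step, using a pattern adapted from the regular expression Crossref has suggested for matching modern DOIs:

```python
import re

# A permissive pattern adapted from the regular expression Crossref has
# suggested for matching modern DOIs
DOI_RE = re.compile(r'(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)')

def clean_doi(raw):
    """Extract the first plausible DOI from a messy string, or None."""
    raw = raw.strip().replace("https://doi.org/", "").replace("http://dx.doi.org/", "")
    m = DOI_RE.search(raw)
    return m.group(1) if m else None

print(clean_doi(" https://doi.org/10.1000/xyz123 "))  # 10.1000/xyz123
print(clean_doi("not a doi"))  # None
```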

Speaking of Open Access at Aalto University in general, a viable path goes via the CRIS portal where you can filter research output with full text. Another route starts from the Aaltodoc DSpace repository.

In their API documentation, Unpaywall states that the field evidence is used only for debugging, and values are subject to change. OK. Despite this warning, though, I couldn’t help taking a quick look at which sources Unpaywall finds our OA full texts from, and what the rough split between different source types is. For this I put up a collapsible tree, a slightly modified version of the original by Mike Bostock.

When you launch the web application that includes all the mashup data of this exercise, what you’ll see after a short “Please Wait…” is a scatterplot at University level. By default, it compares citation counts from Dimensions (x axis) and from WoS (y axis) on a linear scale. The marker symbol refers to the OA availability of the full text: a triangle is OA, a circle non-OA.

FYI, in the data table you’ll find yet another OA column, labelled oa_virta. These values come from VIRTA and thus originate from Aalto University itself. See the codes of Avoin saatavuus (“open availability”; in Finnish).

Note that unlike the CWTS analysis mentioned above, I’ll not compare how many WoS items are found in Dimensions. Perhaps I should do that too… Anyway, my original set of VIRTA items is first queried against Dimensions, and all found items are then queried against WoS.

The difference between citation counts provided by Dimensions and by WoS is seemingly small. By selecting a School you’ll get a more detailed view. Take for instance the School of Chemical Engineering, and note how close the values are. The same goes for the School of Engineering, and to a lesser degree also for the School of Electrical Engineering. With other Schools, citation counts start to disperse. In the case of the School of Science, take a look at the departmental level. There, the Department of Industrial Engineering and Management has somewhat equal coverage in both sources, whereas the Helsinki Institute for Information Technology HIIT has less so. Note, though, that the axis scales are not fixed between plots.

In my test set, 38% of all articles are found as an Open Access full text via Unpaywall. At School level, the proportion is highest in the School of Science, 48%. All in all, the percentages echo the findings of Piwowar et al.

The heatmaps by department visualize citation metrics in a matrix by color. Values are arithmetic means grouped by School, department, and field. Here I took the liberty of using the English translations of the local research field names, just for illustration purposes. The actual fields behind the Dimensions metric field_citation_ratio are not known, so please be careful with conclusions.

In the web app, there are some technical caveats you should be aware of. The scatterplot lacks jitter, which means that markers sit on top of each other when datapoints have the same value. If you zoom in by selecting a rectangular area, double-clicking doesn’t always bring you back to the default level; if this happens, click the Home icon on the navigation bar instead. In the data table, horizontal scroll unfortunately doesn’t stay in sync with the fixed header row.

For the record, 56% of the articles that had citation metrics at Dimensions were also found in Web of Science. So what was missing? The plots below show, by publisher (>= 20 articles) and by School respectively, the percentage of articles not found by DOI in WoS, and missing a URL to an OA full text at Unpaywall. The darker the gray, the better the coverage; the lighter, the more values are missing.

The core code for making use of the levelplot function of the lattice R package comes from Rense Nieuwenhuis.

Missing values by publisher

Along with obvious OA publishers like PLoS and Hindawi, another noteworthy one in our case is the American Physical Society. Taking into account both of the plotted variables, the “weakest performers” are Emerald, IEEE and Inderscience Publishers.

Missing values by School

Yet, DOIs only work so well. As the CWTS analysis reminds us,

Every publication in Crossref has a DOI, but only a selection of the publications in WoS and Scopus have such an identifier. Furthermore, not all publications with a DOI in WoS and Scopus have a matching DOI in Crossref.


OpenCitations Corpus

Citations are central to scientific communication. Despite increasing doubts about their very-old-school use in measuring impact, citations are still very much alive in that part of academia too. What is nice is that in the OpenCitations Corpus (OCC) we now have a steadily growing open source of citations, among other things – something that asks for a quick try.

In their latest blog posting, OCC gives a few welcome examples of SPARQL queries that e.g. return articles that have cited a given DOI.

How many citations does OCC find to our publications? For testing purposes, I used a (cleaned) set of DOIs minted for articles published between 2013 and 2015. As explained e.g. in this slide set, the citing papers are mainly from 2016 or 2017.

For those of you interested, all the code is on GitHub. The file query.R does the querying part, whereas global.R, ui.R, and server.R are the building blocks of a small web app that shows the results.

To get a more accurate picture, I also queried Crossref. The Percentage column shows how big a share the present OCC citation count is of what Crossref returns (the median is 14.29%). Note that the comparison covers only those publications that have citations in OCC – around 20% of the original set.
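The percentage itself is trivial arithmetic; for example, with the citation tally that Crossref’s works API reports per DOI (is-referenced-by-count) as the denominator:

```python
# Sketch of the comparison: Crossref's works API reports a citation tally per
# DOI ("is-referenced-by-count"), which we can set against the OCC count

def occ_share(occ_count, crossref_count):
    """OCC citations as a percentage of the Crossref count (None if undefined)."""
    if crossref_count == 0:
        return None
    return round(100 * occ_count / crossref_count, 2)

# e.g. an article with 2 citations in OCC but 14 in Crossref
print(occ_share(2, 14))  # 14.29
```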


Hilite Pure

When you are a customer of a web-based service that allows you to tweak only some elements of the web appearance but not all of them, you begin to wonder whether you should do something about it yourself. One such service is the Pure CRIS by Elsevier, one of the central digital systems at Aalto University since last fall.

Currently, many Aalto Pure users are typing in data on activities such as visits abroad, talks, memberships, etc. Together with publications, activities are the backbone of one’s academic CV. Moreover, like all Finnish universities, Aalto has the obligation to send the Ministry an annual report on certain aspects of activities, e.g. internationality and country. If you gave your talk abroad, we want to know about it, thank you – and also where exactly it was.

The Pure GUI is a bundle of vertically lengthy, separate windows with multiple buttons and fields. Many fields are optional, whereas some are marked obligatory by the vendor. Leaving those unfilled rightly prevents one from saving the record. This is fine. However, how do we tell our users: please check also International activity and Country? In Pure parlance, these two fields are keywords on activities. Ideally, they should be obligatory too.

Elsevier does not change the core functionality of its product on a single-client basis; enhancement requests need the acceptance of the national user group to start with. And even though our friends in Finland would sign our request (I suppose they would), that doesn’t guarantee that Elsevier will add the familiar red star to these two fields and tweak their program code correspondingly. Note that Elsevier is no exception here; software vendors often work like this. And to be fair, who knows how long our Ministry will be interested in internationality. They may drop it next year.

The way we deal with this small challenge at the moment is brute force: within keyword configuration, we have wrapped the keyword name inside an HTML span element, and added a style attribute to it. What this effectively does is that wherever the keyword name is visible in the GUI, it comes with a yellow background. Unfortunately, in our case the HTML is not always rendered; sometimes the name is used verbatim, which clutters the canvas with angle brackets.

How about changing the DOM on the fly? As a proof-of-concept, I followed the instructions on Mozilla WebExtensions.

First I saved these two files, manifest.json and yellow.js, in a directory Yellowify:

{
  "manifest_version": 2,
  "name": "Yellowify",
  "version": "1.0",

  "description": "Adds a yellow background to span elements with text 'International activity' or 'Country' on webpages matching",

  "content_scripts": [
    {
      "matches": ["*://**"],
      "js": ["yellow.js"]
    }
  ]
}

// There are matching span elements also as children of an H2 element
// but we don't want to color them
var el = document.querySelectorAll('label > span');

for (var i = 0; i < el.length; i++) {
    var text = el[i].textContent;

    if (text == "International activity" || text == 'Country') {
        el[i].style.backgroundColor = "yellow";
    }
}

Then I opened about:debugging in Firefox, loaded the extension as a temporary add-on, and navigated to a new Pure activity page.

New Pure activity with highlighted fields

So far so good!

But of course this functionality is now for me personally, on this computer, in this browser – and even here only temporarily. Propagating the hilite as a semi-permanent feature to all Pure users across the whole university is not going to happen. There are too many different platforms and browsers out there.


One year makes a (small) difference

Time goes by, publications get attention, metrics accrue. Or do they? To find out what happens locally, on a small scale and in a short time frame of one year, I re-queried the Altmetric API and WoS for our items at VIRTA. The results are visible behind the Data by School tab of the aaltovirta web application – though not all of them, just the (possible) changes in tweet(er)s, Mendeley readers, and WoS citations.

While at it, I also added links to open access full texts found via the oaDOI service API. The roadoi R package was very helpful here. Currently, roadoi is about to become a member of the rOpenSci suite. I was gladly surprised to be asked to participate in an open peer review of the package. Revamping the app with oaDOI information acted, for my part, as a timely proof that roadoi does what it promises.

There are multiple ways to visualize changes over time. At first, I had in mind showing as big a picture as possible in one go, i.e. all items by School. However, most Schools publish several hundred items per year – not easily consumable unless you have a megalomaniac full-wall screen, swim within data in VR, or some such. Only the research output of the School of Arts, Design and Architecture (ARTS) was within realistic limits.

Small multiples

Below are three images showing changes in ARTS publications in citations, tweets, and Mendeley readers, respectively. Here I used the facet_wrap function of the ggplot2 R package, with much-needed extra help from Visualizing home ownership with small multiples and R by Antonio S. Chinchón. My version of his code is in this gist. Before plotting, the data is sorted by the value of the respective metric.




Note that some values have decreased during the year, mostly in Mendeley. This probably just means that at query time, data was missing.

We know that it takes some time for citations to accumulate. Still, even one year can make a difference. In this ARTS sample, Addressing sustainability in research on distributed production: an integrated literature review by Cindy Kohtala had 2 citations in spring 2016 but now already 16. At the same time, Mendeley readership doubled from 95 to 184, whereas the number of tweeters remained the same, 10. This is a familiar pattern. Twitter is more about communicating here and now.

Of all ARTS publications, only three had more tweeters now than a year ago. The increase is so small, though, that you cannot decipher it from the pictures due to scaling.

Of course it would be interesting to know at which exact times these additional tweets were sent, by whom (human or bot), and what might have triggered the sending. CrossRef Event Data is now in fresh beta. Some day we may get detailed information on the first question via their API, for free. As for now, I find the API results still a bit confusing. Based on example queries, here I’m asking for events of the first item from the above list. For reasons I don’t understand, there are no mentions of that DOI among the first 10 rows. Anyway, we need to be patient – and read the documentation more carefully.

For the record, in the whole Aalto set, the paper that saw the most increase both in tweeters (14) and Mendeley readers (153) during the year was Virtual Teams Research: 10 Years, 10 Themes, and 10 Opportunities from the School of Science (SCI). On the Altmetric detail page we can see that the newest tweets are only a few days old. The way Journal of Management acts as an amplifier for its own article archive certainly plays a role here.

What about citations? The biggest increase in them took place in the School of Electrical Engineering (ELEC). A Survey on Trust Management for Internet of Things had 35 citations last year, now 134.


Small multiples are known to be useful in detecting anomalies. In the ELEC picture above, for example, on the 3rd row there is something odd with the 7th item from the left. It turns out that there are two items in the data with the same DOI but different publication years. My process didn’t expect that to happen.

Sparklines and colored cells

The main venue for all the data in the current and previous exercises on (alt)metrics has been a table in the Shiny web application: a multi-column, horizontally scrollable beast built with DT, an R interface to the DataTables JavaScript library. Visualizing changes over time in there, inside table cells, asks for either sparklines or some kind of color scale. I tried them both.

I have only two time data points, so the resulting lines lack their familiar, playful, sparkling appearance. Basically, mine are either 45-degree slopes or flat lines. Still, they were fun to implement with the sparkline htmlwidget. Big thanks to SBista for advice!

Colors work great in heatmaps where you have a big canvas with many items. Also calendar heatmaps are fine. Again, a table is perhaps not the best platform for colors, but I think that even here they manage to tell part of the story. To be on the safe side, I used 3-class Spectral by ColorBrewer.

DT is versatile, and it comes with extensive documentation. The number of columns in the table is bordering on its maximum, so I used some helper functions and extensions to ease the burden. Hopefully the user experience is not too bad.


Policy sources and impact

For some time already, Altmetric has harvested information on how much various policy and guidance documents cite research publications. They do it mainly by text-mining the References section of PDF documents and verifying the results against CrossRef and PubMed. As a friendly gesture, Altmetric also provides this data via their public API. All of this is great news, and possibly important when we think about the growing interest in assessing impact.

How much, and in which fields, has Finnish university research output gained attention in international (here = English-language) policy sources?

Related to this, Kim Holmberg et al. have already shown evidence, in the RUSE research project Measuring the societal impact of open science, that the research profiles of Finnish universities seem to be reflected in altmetrics.

In this exercise, I narrowed the scope of publications to a single year, 2015, for practical reasons. First, it is the latest full year covered by the new VIRTA REST API (page in Finnish), the soon-to-be number one open data source of Finnish research publications. Second, I decided to go through all policy cites manually, to get some understanding of what we are talking about here.

For those interested in the gory details, here’s the R source code for getting, cleaning & querying the dataset, and for building a small interactive web app on the results. The app itself is here. Please be patient – it’s a little slow in its turns.

finunic Shiny web app

By default, the app shows all data in two dimensions: the number of tweets, and the Altmetric score. While at it, try selecting different metrics on the vertical axis. It reveals how, in this material, various altmetrics sources relate to this algorithmic, weighted score.

The size of the plot points (circles) tells about the degree of openness of the publication. The two biggest sizes refer to an OA journal or an OA repository, respectively. The smaller ones are either not OA, or the status is not known. This information comes from the publication entry proper. When an article is co-authored by multiple universities, they may have given differing information on its OA status – a sign of how unclear the terminology still is in the OA field. Note also that when you filter data by university, the colour scale changes from universities to the OA status, but unfortunately the palette isn’t fixed. Anyway, I’m sure you’ll get my point.


Out of the 4563 publications that have gathered at least some altmetrics (12.3% of the total 37K) and thus are represented here, 27 (0.6%) have been cited by some policy or guidance document. To put this in altmetric perspective, nearly 500 (11%) have been mentioned on some news site – mentions that constitute the most heavy-weight ingredient of the Altmetric score.

In their preprint How many scientific papers are mentioned in policy-related documents? An empirical investigation using Web of Science and Altmetric data, Haunschild and Bornmann found that, of papers published in 2000–2014 and indexed by WoS, only about 0.6% had received mentions in policy documents. Why so few? Their reasoning feels legit.

Possible reasons for the low percentage of papers mentioned in policy-related documents are: (1) Altmetric quite recently started to analyze policy documents and the coverage of the literature is still low (but will be extended). (2) Maybe only a small part of the literature is really policy relevant and most of the papers are only relevant for scientists. (3) Authors of policy-related documents often are not researchers themselves. Therefore, a scientific citation style should not be expected in policy-related documents in general. Thus, policy-related documents may not mention every important paper on which a policy-related document is based on. (4) There are possible barriers and low interaction levels between researchers and policy makers.

What was the corresponding percentage here, in the Finnish 2015 papers? I haven’t checked how many of them are indexed by WoS, but a very rough estimate of 50% would give us 0.1%.

EFSA and food safety

By far the most frequent policy organisation in this Finnish sample is EFSA, the European Food Safety Authority. From their How we work page:

Most of EFSA’s work is undertaken in response to requests for scientific advice from the European Commission, the European Parliament and EU Member States. We also carry out scientific work on our own initiative, in particular to examine emerging issues and new hazards and to update our assessment methods and approaches. This is known as “self-tasking”.

EFSA’s scientific advice is mostly provided by its Scientific Panels and Scientific Committee, members of which are appointed through an open selection procedure.

A few EFSA Scientific Panels have Finnish members selected for the current 2015-2018 period. They are

In addition, some Panels also had Finnish members during the previous period, which lasted until 2015.

All but one of the EFSA citations that Altmetric adds up for a given publication come from EFSA Journal, which is the publishing platform of the Scientific Opinions of the Panels. So I guess it is a matter of taste whether you, in this context, like to count these mentions as self-citations or some such. In this 2015 dataset, the only other citing source for EFSA-originated publications is GOV.UK, which refers to Scientific Opinion on lumpy skin disease (AWAH) in its various monitoring documents.

In the above-mentioned preprint, the authors note that there is no way of knowing whether the citing document indeed was a legitimate source, because the sites that Altmetric tracks also include CVs.

However, considering the small percentage of WoS publications mentioned in policy-related documents (which we found in this study), we expect that only very few mentions originate from such unintended sites [as CV’s]

In my small sample here, if we consider EFSA Journal to be an unintended source, the percentage was significant. Still, numbers don’t necessarily tell anything about the potential impact of these publications. In fact, given their specialized and instrumental character, EFSA Opinions may indeed be very impact-rich.

Reading the Abstracts of the featured Opinions is fascinating. Proportionally, they are almost equally divided between NDA and AWAH.

NDA is often asked to deliver an opinion on the scientific substantiation of a health claim related to X, where X can be L-tug lycopene; fat-free yogurts and fermented milks; glycaemic carbohydrates and maintenance of normal brain function; native chicory inulin; refined Buglossoides oil as a novel food ingredient; and other chemical substances and compounds which, if they get a scientific green light from the Panel, can eventually emerge on market shelves as functional food products. On the other hand, the span of influence of Scientific Opinion on Dietary Reference Values for vitamin A is potentially the whole human population.

Not all AWAH Opinion Abstracts are for the faint-hearted. Topics range from slaughterhouse technologies, such as electrical parameters for stunning of small ruminants and use of carbon dioxide for stunning rabbits, to global threats like African swine fever and oyster mortality.

The only publication related to the broader theme of food and nutrition, and not cited by EFSA above, comes from the University of Tampere. The impact of lipid-based nutrient supplement provision to pregnant women on newborn size in rural Malawi: a randomized controlled trial has been cited by the World Bank.

Society at large

Out of the 27 publications in this sample, two fall into the thematic area of society. They come from Aalto University and the University of Jyväskylä, respectively.

Hedge Funds and Stock Market Efficiency is cited by Australia Policy Online (APO), whereas Democracy, poverty and civil society in Mozambique has got a mention from The Royal Institute of International Affairs. What is perhaps noteworthy about the latter is that it has a single author.

Health & diseases

Similarly, there are only two papers in this theme. The first one is from the University of Tampere, and the second one is co-authored by the universities of Oulu and Jyväskylä.

The effects of unemployment and perceived job insecurity: a comparison of their association with psychological and somatic complaints, self-rated health and satisfaction is cited by the International Labour Organization (ILO), and Genome characterisation of two Ljungan virus isolates from wild bank voles (Myodes glareolus) in Sweden is mentioned by GOV.UK. However, here we have an unintended source again; three authors of this paper belong to the staff of the Animal & Plant Health Agency (APHA), and the paper is listed in Papers published by APHA Staff in 2015.

Wicked problems?

In his kick-off speech for the new academic year 2015, the Rector of the University of Helsinki, Jukka Kola, is quoted as saying that academic knowledge is in the front line of solving so-called wicked problems such as

Climate change, Euro-Russian relations, healthcare challenges, strict nationalist movements

Scientific impact on these would be really something, wouldn’t it? Yet, the first giant step for mankind is to miraculously, uniformly, honestly and openly acknowledge the problem.

From the perspective of Aalto University, had I chosen a wider year range for this exercise, I would’ve found multiple citations from policy documents to our research on climate change, notably Global sea level linked to global temperature (2009) and Climate related sea-level variations over the past two millennia (2011). For the record, here are all the papers published in 2007-2015 which have policy mentions at Altmetric (again, by DOI). The only extra is the number of WoS citations, added manually to these 24 papers.

Future work

The harvesting work of Altmetric helps a great deal, but understandably their focus is on international and UK sources. How about Finland? How much is Finnish academic research cited in local policy documents? Are there differences between publications written by consulting companies and those written by authors coming from academia itself? A few starting points for questions like these include

See you in PDF parsing sessions!

Posted by Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.
Coding - Comments Off on Policy sources and impact

Email anatomy of a project

Project is finished
When a HE project runs for more than a few years, and keeps more than a handful of people busy, it generates a substantial amount of email. Of course, IM has been among us for quite some time, but it has not substituted for older forms of digital communication. Far from it.

The CRIS project of Aalto University was officially initiated in early 2012. At that time, it was still a side note in my email traffic. It took a year until the investment decision was made. After that came the lengthy EU tender process. A turning point came in September 2015: a two-day kick-off seminar with the vendor, where practicalities started to take shape. That was also the beginning of a more regular flow of emails.

I save work emails in folders by category, a habit that often results in obscure collections whose focus escapes me later on. At this point in time, when the CRIS project has only recently – last week – been declared finished, I still remember what the folder name refers to 🙂

Note that I must have also deleted hundreds of posts: meeting requests; cronjob notifications; emails only superficially related to CRIS but sent FYI; duplicates; etc. Still, I have a hunch that roughly 80% of all those emails that were either sent to me, or where I myself was the sender, sit in the CRIS folder of my Outlook 2013.

What could the emails tell about the project? Active dates, days of the week, times of the day. Sadly, less so about the semantics. Our work language is Finnish, and although Finnish is extensively researched, corpus and other digital tools are not that easily accessible for a layman. Unfortunately (for this blog posting), almost all of the English communication with the vendor took place within their Jira issue and project tracking platform.

Toe bone connected to the foot bone

To start with, I selected all emails (4431) from the CRIS folder, used Save as…, and converted the text file to UTF-8.

Emails are seldom clear, separate entities. On the contrary, more often than not they are bundles of several reply posts. Because I most certainly already had all the older ones, I needed to get rid of all that extra. In other words, I was interested only in the text from From: to the horizontal line that denotes the beginning of previous correspondence, at least if the email was written with Outlook, the #1 in-house email client. This awk one-liner does the job.

awk '/\x5f/ { p-- } /^From:/ { p=1 } p>0' emails_utf8.txt > emails_proper.txt

Thanks to an unknown friend, labelling lines in this Mbox format was easy.

/^From/, /^$/ {
    printf "\nhead : %s", $0
}

/^$/, /^From/ {
    if ($1 ~ /^From/) next
    printf "\nbody : %s", $0
}

From the result file, I grep’ed only the non-empty lines, i.e. dropped the gaps between emails. Below you see one example email. It’s from me to Jari, the project manager.

head From: Sonkkila Tuija
head Sent: 21. syyskuuta 2016 10:04
head To: Haggren Jari
head Subject: RE: Kaksi samaa jobia ajossa
body Ok. Muuten: mikä on sun mielestä järkevintä, kun tunnus halutaan disabloida? 
body t. Tuija

Then over to RStudio.

Thigh bone connected to the hip bone

The full R code is here.

First, read in the raw, labelled data, and convert it to a data frame.

library(dplyr)  # provides data_frame()

raw <- readLines("email_proper_parsed.txt", encoding = "UTF-8")
lines <- data_frame(raw = raw)

For further work, I needed to have each of my email “observations” on one line. This proved tricky. I suspect that the job could have been easier with some command line tool. In fact, I had the nagging feeling that I was trying to re-push the data to some previous stage. Anyway, again with help, I managed to add a sequence number to the lines, showing which head and which body belonged to the same email. A novelty for me was the clever rle function from base R.

lines$email <- NA

x <- grepl("^head ", lines$raw)
spanEmail <- rle(x)$lengths[rle(x)$values == TRUE]
lines$email[x] <- rep(seq_along(spanEmail), times = spanEmail)

x <- grepl("^body ", lines$raw)
spanEmail <- rle(x)$lengths[rle(x)$values == TRUE]
lines$email[x] <- rep(seq_along(spanEmail), times = spanEmail)
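The trick is easier to see on a toy vector. rle() compresses consecutive repeats into run lengths, and rep(seq_along(...)) then turns each TRUE run into its own group number. A minimal sketch on made-up data, not the project emails:

```r
# Label each consecutive run of TRUE with its own group number,
# mirroring how head/body lines were grouped into emails above.
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

runs  <- rle(x)                                 # run lengths + values
spans <- runs$lengths[runs$values == TRUE]      # lengths of TRUE runs: 2, 1, 3

group <- rep(NA_integer_, length(x))
group[x] <- rep(seq_along(spans), times = spans)
group
# 1 1 NA 2 NA NA 3 3 3
```

The same pattern labels the head runs and the body runs separately, which is why the code above appears twice.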

Then, for each email group, the head and body text went into a new column, respectively, plus some further processing.

Hip bone connected to the back bone

Date and time.

I faced the fact that in Finland, the default OS language varies. I haven’t seen statistics, but based on my small HE sample data, Finnish and English seem to be on an equal footing. So, to work with dates and times I first had to convert Finnish months and weekdays to their English counterparts. After that I could start working with the lubridate package.
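A minimal sketch of that conversion step, using only base R; the function name is mine and the real code handles weekdays too:

```r
# Map Finnish partitive month names, as they appear in Outlook's
# "Sent:" line (e.g. "21. syyskuuta 2016"), to month numbers.
fi_months <- c(tammikuuta = 1, helmikuuta = 2, maaliskuuta = 3,
               huhtikuuta = 4, toukokuuta = 5, kesäkuuta = 6,
               heinäkuuta = 7, elokuuta = 8, syyskuuta = 9,
               lokakuuta = 10, marraskuuta = 11, joulukuuta = 12)

parse_fi_date <- function(s) {
  parts <- strsplit(s, " ")[[1]]                 # "21." "syyskuuta" "2016"
  day   <- as.integer(sub("\\.", "", parts[1]))  # strip the trailing dot
  month <- fi_months[[parts[2]]]
  as.Date(sprintf("%s-%02d-%02d", parts[3], month, day))
}

parse_fi_date("21. syyskuuta 2016")
# "2016-09-21"
```

With the dates in a standard form, lubridate can take over.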

Then, summaries for hourly, weekly, and daily visualizations. Note the use of the timezone (TZ). Without referring to it explicitly, the xts object, the cornerstone of the time series graph, is off by a day.

Although word clouds seldom reveal anything we wouldn’t be aware of anyway, I produced two of them nevertheless, here with the packages tm and wordcloud. Excluding stop words is a tedious job without a ready-made list, so I made my life easier and basically just tried to make sure that no person names were involved. BTW, signatures are a nuisance in email text mining.

While working on this, I happened to read the blog posting An overview of text mining visualisations possibilities with R on the CETA trade agreement by BNOSAC, a Belgian consultancy network. They mentioned co-occurrence statistics, and made me experiment with it too. I followed their code example, and constructed a network visualization with ggraph and ggforce.

Neck bone connected to the head bone

Finally, time to build a simple Shiny web app. This is the R code, and here is the web app.

Thanks to the nice dygraphs package, you can zoom in to the time series by selecting an area. Double-clicking reverts to the default view. The hourly and daily plots are made with highcharter.

So, how does it look?

Overall, it seems to me that the project has a healthy email history. Basically no posts in the wee hours of the day, and almost none during weekends. If you look closely, you’ll notice that in the hourly barchart there are tiny bars at nighttime. I can tell you that these emails were not sent by any of us belonging to the Aalto staff. They were sent by a paid consultant.

The year 2016 word cloud is right when it shows that the word portal was frequent. The portal was indeed a big deliverable.

The co-occurrence network graph needs some explanation. First, the title is a bit misleading; there are also other words than just nouns. Second, the two big nodes, starting from the left, translate as “as a person” and “in the publication”. From the latter node, the thickest edges point to “already”, “because”, and “are”. The context rings a bell. CRIS systems are platforms where data about people and their affiliations meets research output, both scientific and artistic. Converting our legacy publication data to this new CRIS environment was a multi-step process, and something that is not fully over yet.

Subtitles from Dem Bones.

Posted by Tuija Sonkkila

Coding , Data - Comments Off on Email anatomy of a project

Making titles to match

Research and citations

For a few months now, the Aalto University CRIS back-end has served the research-active personnel and other inclined staff. Since last week, the front-end site has been open to the public. One of the most interesting aspects of the CRIS is that it offers, in one place, a digital dossier of one’s publications, together with their metrics.

The de facto metric in science is the number of citations, made available by the two leading commercial indexing services, Web of Science and Scopus. The Pure CRIS system can talk with both via an API, and return the citation count. For the discussion though, Pure needs to know the exact topic of the talk: the unique identification number that the publication has got in these respective indexing services. In Web of Science, the ID is known as UT or WOS. In Scopus, just ID.

In pre-CRIS times, publication metadata was stored in legacy registers, where external IDs were an unknown entity. When the old metadata was transformed to the Pure data model and imported, the new records were all missing both the UT and the Scopus ID. Manually copy-pasting them all would be a daunting task, bordering on the impossible. A programmatic solution offers some remedy here: IDs can be imported to Pure with a ready-made job. For that, you also need the internal record ID of Pure. The input file of the job is a simple CSV with a pair of Pure ID and external ID per row. The challenge is how to find the matching pairs.

Getting data

Within the Web of Science GUI, you can make an advanced Boolean affiliation search. What this means is that you need to be familiar with all the historic name variants of your organization, plus known false positives that you had better exclude. I’m not copying our search here, but I can assure you: it is a long and winding string. The result set, filtered by publication years 2010-2016 (the Aalto University period so far), is over 20,000 items. Aalto University outputs roughly 4000 publications yearly.

In what format should the WoS result set be downloaded? There are several possibilities. The most used one, I assume, is tab-delimited Windows, aka Excel. There is a caveat though. If the character length of some WoS field value exceeds the Excel cell limit, the tabular data is garbled from that point onwards. This happens quite often with the Author field, where the names of all contributing authors are concatenated. The WoS plain text format doesn’t have this problem, but the output needs parsing before it’s usable. In the exported file, fields are denoted with a two-character label or tag. Fields can span multiple lines. For this exercise, I’m mostly interested in the title (TI), journal name (SO), ISSN (SN, EI), and article identifier (UT) fields.

As a CRIS (Pure) back-end user, to fetch tabular data out of the system, you have a number of options at the moment:

  • Filtering from the GUI. Quick and easy but doesn’t provide all fields, e.g. the record ID
  • Reporting module. Flexible but asks for a few rounds of trial & error, plus the Administrator role if the fields you need are not available by default, in which case you need to tweak the report configurations. The minimum requirement is the Reporter role. However, you can save your report specifications, and even import/export them in XML, which is nice
  • REST API. For my simple reporting needs, in its present state, the API would require too much effort in constructing the right query and parsing the result

Of course there is also the SQL query option, but for that, you need to be familiar with the database schema, and be granted all necessary access rights.

With the Pure Reporting module, filtering by year and finding the publication title, subtitle, ISSN and ID is relatively easy. From the output formats, I chose Excel. This and the parsed WoS data I will then read into R, and start matching.

To parse the WoS data, I chose the Unix tool awk.

But before parsing, data download. For this, you need paper and pen.

There is an export limit of 500 records at WoS, so you have to select, click and save 25 times to get the whole result set. Keeping a manual record of the operation is a must; otherwise you’ll end up exporting some items several times, or skipping others.


When all the WoS data is with you, the Bash shell script below does the work. Note that I’ve added newlines in longer commands for clarity. You may also notice that I had some trouble defining appropriate if clauses in awk to handle ISSNs. I circumvented the obstacle by brute-force parsing, and did housekeeping afterwards.


# Delete output file from previous run
rm parsed.txt

# Rename the first WoS output file
mv savedrecs.txt savedrecs\(0\).txt

# Loop over all exported files and filter relevant fields
for i in {0..25}
do
  awk 'BEGIN {ORS="|"}; # ORS = output record separator
       /^TI/,/^LA/ {print $0}; # match boundaries, to get the SO field that sits in between
       /^DT/ {print $0}; 
       /^SN/ {print $0}; 
       /^EI/ {print $0}; 
       /^UT/ {printf "%s\n",$0}' savedrecs\($i\).txt >> parsed.txt
done

# Delete labels, LA field, occasional SE field, and extra white space resulting 
# from fields with multiple lines
sed -i 's/^TI //g; 
        s/|   / /g; 
        s/|SO /|/g; 
        s/|LA [^|]*//g; 
        s/|DT /|/g; 
        s/|SN /|/g; 
        s/|UT /|/g; 
        s/|SE [^|]*//g' parsed.txt

# Select rows without a second ISSN
grep -v '|EI ' parsed.txt > noEI.txt

# Select rows with a second ISSN
grep '|EI ' parsed.txt > yesEI.txt

# Add an empty field to those without, just before the last field
sed -i 's/|WOS/||WOS/g' noEI.txt

# Concat these two files
cat noEI.txt yesEI.txt > parsed_all.csv

# Delete the label of the second ISSN
sed -i 's/|EI /|/g' parsed_all.csv

# If, before '||', which denotes a non-existing 2nd ISSN, there is no match for an ISSN 
# (so there wasn't any in data, meaning the publication is not an article), 
# add an empty field 
sed -i '/|[0-9][0-9][0-9][0-9]\-[0-9][0-9][0-9][0-9X]||/! s/||/|||/g' parsed_all.csv

After a few seconds, parsed_all.csv includes rows like the following three:

Modulation Instability and Phase-Shifted Fermi-Pasta-Ulam Recurrence|SCIENTIFIC REPORTS|Article|2045-2322||WOS:000379981100001
Real World Optimal UPFC Placement and its Impact on Reliability|RECENT ADVANCES IN ENERGY AND ENVIRONMENT SE Energy and Environmental Engineering Series|Proceedings Paper|||WOS:000276787500010
Deposition Order Controls the First Stages of a Metal-Organic Coordination Network on an Insulator Surface|JOURNAL OF PHYSICAL CHEMISTRY C|Article|1932-7447||WOS:000379990400030

Processing in R

There are several R packages for reading and writing Excel. XLConnect has served me well on Ubuntu Linux. However, on my Windows laptop at home, its Java dependencies have been a minor headache lately.


library(XLConnect)

wb <- loadWorkbook("puredata20160909.xls")
puredata <- readWorksheet(wb, sheet = "Research Output")
puredata <- puredata[, c("", "", 
                         "Journal...Journal.3", "Id.4", "Journal...ISSN...ISSN.5")]
names(puredata) <- c("title", "subtitle", "journal", "id", "issn")
# Paste title and subtitle together, if a subtitle exists
puredata$title <- ifelse(!is.na(puredata$subtitle), paste(puredata$title, puredata$subtitle), puredata$title)
puredata <- puredata[, c("title", "journal", "id", "issn")]

WoS data file import:

wosdata <- read.csv("parsed_all.csv", stringsAsFactors = F, header = F, sep = "|", quote = "", row.names = NULL)
names(wosdata) <- c("title", "journal", "type", "issn", "issn2", "ut")

Before attempting any string matching though, some harmonization is necessary: all characters to lowercase, punctuation and articles removed, etc. Without going into details, here is my clean function. The gsub clauses are intentionally kept separate, one per line, for easier reading. Also, for the punctuation I haven’t used a character class but defined the characters one by one.

clean <- function(dataset) {
  # Journal
  dataset$journal <- ifelse(!is.na(dataset$journal), tolower(dataset$journal), dataset$journal)
  # WOS
  if ( "ut" %in% names(dataset) ) {
    dataset$ut <- gsub("WOS:", "", dataset$ut)
  }
  # Title
  dataset$title <- tolower(dataset$title)
  dataset$title <- gsub(":", " ", dataset$title)
  dataset$title <- gsub(",", " ", dataset$title)
  dataset$title <- gsub("-", " ", dataset$title)
  dataset$title <- gsub("%", " ", dataset$title)
  dataset$title <- gsub('\\"', ' ', dataset$title)
  dataset$title <- gsub('\\?', ' ', dataset$title)
  dataset$title <- gsub("\\([^)]+\\)", " ", dataset$title)
  dataset$title <- gsub(" the ", " ", dataset$title)
  dataset$title <- gsub("^[Tt]he ", "", dataset$title)
  dataset$title <- gsub(" an ", " ", dataset$title)
  dataset$title <- gsub(" a ", " ", dataset$title)
  dataset$title <- gsub("^a ", "", dataset$title)
  dataset$title <- gsub(" on ", " ", dataset$title)
  dataset$title <- gsub(" of ", " ", dataset$title)
  dataset$title <- gsub(" for ", " ", dataset$title)
  dataset$title <- gsub(" in ", " ", dataset$title)
  dataset$title <- gsub(" by ", " ", dataset$title)
  dataset$title <- gsub(" non([^\\s])", " non \\1", dataset$title)
  dataset$title <- gsub("^[[:space:]]*", "" , dataset$title)
  dataset$title <- gsub("[[:space:]]*$", "" , dataset$title)
  # ISSN
  dataset$issn <- gsub("-", "" , dataset$issn)
  dataset$issn <- gsub("[[:space:]]*$", "" , dataset$issn)
  dataset$issn <- gsub("^[[:space:]]*", "" , dataset$issn)
  # Second ISSN
  if ( "issn2" %in% names(dataset) ) {
    dataset$issn2 <- gsub("-", "" , dataset$issn2)
    dataset$issn2 <- gsub("[[:space:]]*$", "" , dataset$issn2)
    dataset$issn2 <- gsub("^[[:space:]]*", "" , dataset$issn2)
  }
  dataset
}

Match by joining

The R dplyr package is indispensable for all kinds of data manipulation. One of the join types it supports (familiar from SQL) is left_join, which

returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
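On toy data the behaviour is easy to verify; the data frames below are made up for illustration:

```r
library(dplyr)

# Two tiny stand-ins for the Pure and WoS data frames
pure_df <- data.frame(title = c("paper a", "paper b"),
                      issn  = c("11112222", "33334444"),
                      stringsAsFactors = FALSE)
wos_df  <- data.frame(title = "paper a",
                      issn  = "11112222",
                      ut    = "000111222",
                      stringsAsFactors = FALSE)

joined <- left_join(pure_df, wos_df, by = c("title", "issn"))
joined$ut
# "000111222" NA  -- the unmatched row keeps its x columns; ut is NA
```

The NA rows are exactly the publications that still need the fuzzy-matching pass described below.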

So here I am saying to R: Compare my two data frames by title and ISSN, as they are given. Please.


# Join with the first ISSN
joined <- left_join(puredata, wosdata, by=c("title"="title", "issn"="issn"))
# and then with the second ISSN
joined2 <- left_join(puredata, wosdata, by=c("title"="title", "issn"="issn2"))

found <- joined[!is.na(joined$ut),]
found2 <- joined2[!is.na(joined2$ut),]

names(found) <- c("title", "journal", "id", "issn", "journal2", "type", "issn2", "ut")
names(found2) <- c("title", "journal", "id", "issn", "journal2", "type", "issn2", "ut")
allfound <- rbind(found, found2)

This way, roughly 55% of the CRIS articles published between 2010 and 2016 found a match – and the needed WOS/UT. Then I similarly matched conference publications by title alone, i.e. items without a corresponding ISSN. Here the match rate was even better, 73%, although the number of publications was not that big.

However, one typo in the title or ISSN is enough to prevent a match. Thanks to my smart coworkers, I was made aware of the Levenshtein distance. In R, this algorithm is implemented in a few packages.

Fuzzier match

I started with the adist function from the utils package, which is normally loaded automatically when you start R. The code examples from Fuzzy String Matching – a survival skill to tackle unstructured information were of great help. The initial tests with a smallish sample looked promising.
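For instance, adist() gives the generalized Levenshtein distance directly: the number of single-character insertions, deletions and substitutions needed to turn one string into the other. With one typo the distance is 1 (the titles here are made up):

```r
# "instablity" is "instability" with one character deleted,
# so the edit distance between the two titles is 1.
d <- adist("modulation instability in optical fibers",
           "modulation instablity in optical fibers")
d
# 1
```

This is exactly the kind of near-miss that exact joining throws away.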


Pretty soon it became obvious, though, that it would take too much time to process my whole data in this manner. If 1000 x 1000 rows took 15 minutes, 6000 x 12000 rows – over 70 times as many comparisons – would take many hours. What I needed was to run the R code in parallel.

My Linux machine at work runs a 64-bit Ubuntu 16.04, with 7.7 GB RAM and 2.67 GHz Intel Xeon processors, “x 8”, as the System Settings panel concludes. I’m not sure if 8 here refers to the number of cores or threads. Anyway, if only my code could be parallelized, I could let the computer or some other higher mind decide how to distribute the job between the cores in the most efficient way.

Luckily, there is the stringdist package by Mark van der Loo that does just that. By default, it parallelizes your work across all the cores you’ve got, minus one. This is common sense, because while your code is running you may want to do something else with your machine, like watch the output of the top command.

Here I am running the amatch function from the package, trying to match those article titles that were not matched in the join. The method argument value lv stands for Levenshtein distance, as in R’s native adist. maxDist controls how many edits from the first string to the second are allowed for a match. The last argument defines the output in nomatch cases.

system.time(matched <- amatch(notfound_articles$title, wosdata_articles$title, method="lv", maxDist = 5, nomatch = 0))

Look: %CPU 700!

top command output

I haven’t tried to change the default nthread argument value, so I don’t know if my run time of roughly 6 minutes could be beaten. Nevertheless, six minutes (363.748 seconds elapsed) is just fantastic.

    user  system  elapsed 
2275.980   0.648  363.748 

In the matched integer vector, I now had, for every row of the notfound_articles data frame, either a 0 or a row number from wosdata_articles. Among the first twenty rows, there are already two matches.

[1]  0  0  0  0  0 55  0  0  0  0  0  0  0  0  0  0  1  0  0  0
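The semantics of that vector are easy to check on made-up strings; the titles below are illustrative, not the real data:

```r
library(stringdist)

dirty <- c("hedge funds and stock markett efficiency",  # one typo
           "an entirely unrelated title")
clean_titles <- c("hedge funds and stock market efficiency",
                  "democracy poverty and civil society")

# For each dirty title: the index in clean_titles of the closest
# match within 5 edits, or 0 when nothing is close enough.
res <- amatch(dirty, clean_titles, method = "lv", maxDist = 5, nomatch = 0)
res
# 1 0
```

One extra character in “markett” costs one edit, well within maxDist = 5, while the unrelated title is dozens of edits away from both candidates.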

All I needed to do was to gather the IDs of the matching strings, and write them out. For the sake of quality control, I’m picking up the strings too.

N <- length(matched)
pure <- character(N) 
pureid <- character(N)
wos <- character(N)
ut <- character(N)

for (x in 1:N) { 
  pure[x] <- notfound_articles$title[x]
  pureid[x] <- notfound_articles$id[x]
  wos[x] <- ifelse(matched[x] != 0, wosdata_articles[matched[x], "title"], "NA")
  ut[x] <- ifelse(matched[x] != 0, wosdata_articles[matched[x], "ut"], "NA")
}

df <- data.frame(pure, pureid, wos, ut)
df <- df[df$wos!='NA',c("pureid", "ut")]
names(df) <- c("PureID", "ExternalID")

write.csv(df, paste0("pure_wos_fuzzy_articles", Sys.Date(), ".csv"), row.names = F)
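As a design note, the same gathering can be done without the loop: indexing with NA returns NA, so converting the 0s to NA first picks all matching rows in one go. A sketch on toy vectors (the names and values are made up, not the real data):

```r
matched <- c(0L, 2L, 0L, 1L)          # toy amatch output
ids     <- c("p1", "p2", "p3", "p4")  # toy Pure record IDs
uts     <- c("WOS:A", "WOS:B")        # toy WoS IDs

idx <- replace(matched, matched == 0, NA)   # 0 (no match) -> NA

pairs_df <- data.frame(PureID = ids, ExternalID = uts[idx],
                       stringsAsFactors = FALSE)
pairs_df <- pairs_df[!is.na(pairs_df$ExternalID), ]
pairs_df
#   PureID ExternalID
#       p2      WOS:B
#       p4      WOS:A
```

Vectorized indexing like this is usually both shorter and faster than an element-by-element loop.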

Then, the same procedure for the conference output, and I was done.

All in all, thanks to stringdist and other brilliant stuff by the R community, I could find a matching WoS ID for 67.7% of our CRIS publications from the Aalto era.

To repeat the same process for Scopus means building another parser, because the export format is different. The rest should follow similar lines.

EDIT 15.9.2016: Made corrections to the shell script.

Posted by Tuija Sonkkila

Coding - Comments Off on Making titles to match

Towards more automation (and better quality)

A few weeks ago, CSC – IT Center for Science opened the much-awaited VIRTA REST API for publications (page in Finnish). VIRTA is the Finnish higher education achievement register,

a tool to be utilized in the authoritative data harvesting in the way that the collected data will be both commensurable and of a good quality.

The API is good news for a data consumer like me who likes to experiment with altmetrics applications. As strange as it sounds, it’s not necessarily that easy to get your hands on a set of DOIs from inside the University. The ultimate quality control happens whenever the University reports its academic output to the Ministry. And that’s precisely what VIRTA is about: the publication set there is stamped by the University as an achievement of its researchers.

VIRTA is still a work in progress, and so far only a couple of universities import their publication data there on a regular basis, either from their CRIS or otherwise. Aalto University is not among the piloting organisations, so you’ll not find our 2016 publications there yet. The years covered are 2014-2015.

Get data and filter

OK, let’s download our stuff in XML (the other option is JSON). Note that the service is IP-based, so the very first thing you have to do is contact CSC, and tell them who you are and from which IP you work. When the door is open, you’ll need the organisation ID of Aalto. That’s 10076.

Below, I’ve changed the curl command given on the API page so that the header is not output to the file (no -i option), only the result XML.

curl -k -H "Accept: application/xml" "" -o aaltopubl.xml

My target is a similar kind of R Shiny web application to the one I made for the 2:am conference, where you can filter data by School etc. To make my life easier during the subsequent R coding, I first produced a CSV file from the returned XML, with selected columns: DOI, publication year, title, name of the journal, number of authors, OA status and department code. At Aalto, the first two characters of the department code tell the School, which helps.

Below is the rather verbose XSLT transformation code parse.xsl. What it says is: for every publication, output the above-mentioned element values in quotes (separated by semicolons), add a line feed, and continue with the next publication.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

	<xsl:output method="text" encoding="UTF-8"/>

	<xsl:template match="/">
<xsl:for-each select="//csc:Julkaisu[csc:DOI]">
"<xsl:value-of select="csc:DOI"/>";"<xsl:value-of select="csc:JulkaisuVuosi"/>";"<xsl:value-of select="translate(csc:JulkaisunNimi, ';', '.')"/>";"<xsl:value-of select="csc:LehdenNimi"/>";"<xsl:value-of select="csc:TekijoidenLkm"/>";"<xsl:value-of select="csc:AvoinSaatavuusKoodi"/>";"<xsl:value-of select="csc:JulkaisunOrgYksikot/csc:YksikkoKoodi"/>"<xsl:text>
</xsl:text>
</xsl:for-each>
	</xsl:template>
</xsl:stylesheet>

For the transformation, I’ll use the Saxon-HE XSLT engine. You can run it from the command line like this:

java -jar saxon9he.jar aaltopubl.xml parse.xsl -o:aaltopubl.csv

Right, so now I have the basic data. Next, altmetrics from Altmetric.

Query altmetrics with a (cleaned) set of DOIs

With the reliable rAltmetric R package, the procedure follows a familiar pattern: with every DOI in turn, prefixed with doi:, the Altmetric API is queried by the altmetrics function. Results are saved as a list. When all rows are done, the list is transformed to a data frame with the altmetric_data function.

But before that happens, DOIs need cleaning. Upstream, there was no quality control of DOI input, so the field can (and does) contain extra characters that the rAltmetric query does not tolerate, with good reason.

dois$doi <- gsub("", "", dois$doi)
dois$doi <- gsub("", "", dois$doi)
dois$doi <- gsub("doi:", "", dois$doi)
dois$doi <- gsub("DOI:", "", dois$doi)
dois$doi <- gsub("DOI", "", dois$doi)
dois$doi <- gsub("%", "/", dois$doi)
dois$doi <- gsub(" ", "", dois$doi)
dois$doi <- gsub("^/", "", dois$doi)
dois$doi <- gsub("", "", dois$doi)

dois_cleaned <- dois %>%
  filter(doi != 'DOI') %>%
  filter(doi != '[doi]') %>%
  filter(doi != "") %>%
  filter(!grepl("http://", doi)) %>%
  filter(grepl("/", doi))

When all extras are removed, querying is easy.

raw_metrics <- plyr::llply(paste0('doi:',dois_cleaned$doi), altmetrics, .progress = 'text')
metric_data <- plyr::ldply(raw_metrics, altmetric_data)
write.csv(metric_data, file = "aalto_virta_altmetrics.csv")

Data storage

From this step onwards, the rest is Shiny R coding, along similar lines to the 2:am experiment.

One thing I decided to do differently this time, though, is data storage. Because I’m aiming at more automation where, with the same DOI set, new metrics are gathered on, say, a monthly basis – to get time series – I need a sensible way to store data between runs. I could of course just upload new data by re-deploying the application to where it is hosted; every file in the synced directory is transmitted to the remote server. However, that could result in an unnecessarily bloated application.
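For those monthly runs, one small convention that helps is to put the run date into the snapshot file name; the naming pattern below is my own choice, matching the file name used later in this post:

```r
# Build a date-stamped snapshot file name, one file per monthly run
snapshot_name <- function(prefix = "aalto-virta", date = Sys.Date()) {
  paste0(prefix, "-", format(date, "%Y_%m"), ".csv")
}

snapshot_name(date = as.Date("2016-04-15"))  # "aalto-virta-2016_04.csv"
```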

Other Shiny users ponder this too, of course. In his blog post Persistent data storage in Shiny apps, Dean Attali goes through several options.

First I had in mind using Google Sheets, thanks to the handy new R package googlesheets. For that, I added a date stamp to the dataframe, and subsetted it by School. Then – after the necessary OAuth step – gs_new registers a new Google spreadsheet, ws_title names the first sheet, and input uploads the data. In subsequent piped commands, gs_ws_new creates new sheets and populates them with the rest of the dataframes.


upload <- gs_new("aalto-virta", ws_title = "ARTS", input = ARTS, trim = TRUE) %>% 
  gs_ws_new("BIZ", input = BIZ, trim = TRUE) %>% 
  gs_ws_new("CHEM", input = CHEM, trim = TRUE) %>%
  gs_ws_new("ELEC", input = ELEC, trim = TRUE) %>%
  gs_ws_new("ENG", input = ENG, trim = TRUE) %>%
  gs_ws_new("SCI", input = SCI, trim = TRUE)

Great! However, when I tried to make use of the data, I stumbled on the same HTTP 429 issue that Jennifer Bryan writes about here. It persisted even when I stripped data to just 10 rows per School. I’ll definitely return to this package in the future, but for now I had to let it be.

Next try: Dropbox. This proved more successful. The code is quite terse. Note that here, too, you first have to authenticate. When that is done, data is uploaded. The dest argument refers to a Dropbox directory.


write.table(aalto_virta, file ="aalto-virta-2016_04.csv", row.names = F, fileEncoding = "UTF-8")
drop_upload('aalto-virta-2016_04.csv', dest = "virta")

In the application, you then download data with drop_get when needed, and read it into the application with read.table or some such.

drop_get("virta/aalto-virta-2016_04.csv", overwrite = T)
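A minimal round trip of that write.table/read.table pair (with made-up data; note that read.table defaults to header = FALSE, so the header argument has to be set to get the column names back):

```r
# Write a small data frame with the same arguments as above,
# then read it back from the file
f <- tempfile(fileext = ".csv")
df <- data.frame(doi = "10.1000/xyz", score = 1.5, stringsAsFactors = FALSE)
write.table(df, file = f, row.names = FALSE, fileEncoding = "UTF-8")
back <- read.table(f, header = TRUE, fileEncoding = "UTF-8",
                   stringsAsFactors = FALSE)
back$doi  # "10.1000/xyz"
```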

I didn’t notice any major lag in the initiation phase of the application where you see the animated Please Wait… progress bar, so that’s a good sign.

Here is the first version of the application. Now with quality control! In the next episode: time series!

Posted by Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.