Hilite Pure

When you are a customer of a web-based service that lets you tweak only some elements of its appearance but not all of them, you begin to wonder whether you should do something about it yourself. One such service is Pure, the CRIS by Elsevier, one of the central digital systems at Aalto University since last fall.

Currently, many Aalto Pure users are typing in data on activities such as visits abroad, talks, memberships, etc. Together with publications, activities are the backbone of one’s academic CV. Moreover, like all Finnish universities, Aalto is obliged to send the Ministry an annual report on certain aspects of these activities, e.g. internationality and country. If you gave your talk abroad, we want to know about it, thank you, and also where exactly it took place.

The Pure GUI is a bundle of vertically lengthy, separate windows with multiple buttons and fields. Many fields are extra, whereas some are marked obligatory by the vendor; leaving those unfilled rightly prevents one from saving the record. This is fine. However, how do we tell our users to please also check International activity and Country? In Pure parlance, these two fields are keywords on activities. Ideally, they should be obligatory too.

Elsevier does not change the core functionality of its product for a single client; enhancement requests need the acceptance of the national user group to start with. And even if our friends in Finland signed our request (I suppose they would), that doesn’t guarantee that Elsevier would add the familiar red star to these two fields and tweak their program code accordingly. Note that Elsevier is no exception here; software vendors often work like this. And to be fair, who knows how long our Ministry will be interested in internationality. They may drop it next year.

The way we deal with this small challenge at the moment is brute force: within the keyword configuration, we have wrapped the keyword name inside an HTML span element and added a style attribute to it. The effect is that wherever the keyword name is visible in the GUI, it comes with a yellow background. Unfortunately, the HTML is not always rendered; in some places the name is used verbatim, which clutters the canvas with angle brackets.

How about changing the DOM on the fly? As a proof-of-concept, I followed the instructions on Mozilla WebExtensions.

First I saved these two files in a directory Yellowify:

manifest.json

{

  "manifest_version": 2,
  "name": "Yellowify",
  "version": "1.0",

  "description": "Adds a yellow background to span elements with text 'International activity' or 'Country' on webpages matching aalto.fi.",

  "content_scripts": [
    {
      "matches": ["*://*.aalto.fi/*"],
      "js": ["yellow.js"]
    }
  ]
}

yellow.js

// There are matching span elements also as children of an H2 element
// but we don't want to color them
var el = document.querySelectorAll('label > span')

for(var i = 0; i < el.length; i++)
{
   var text = el[i].textContent;

   if (text == "International activity" || text == 'Country') 
    { 
      el[i].style.backgroundColor = "yellow";
     }
}

Then I opened about:debugging in Firefox, loaded the extension as a temporary add-on by selecting a file from the Yellowify directory, and navigated to a new Pure activity page.

New Pure activity with highlighted fields

So far so good!

But of course this functionality now exists just for me personally, on this computer, in this browser, and even here only temporarily. Propagating the hilite to all Pure users across the whole university as a semi-permanent feature is not going to happen. There are too many different platforms and browsers out there.


One year makes a (small) difference

Time goes by, publications get attention, metrics accrue. Or do they? To find out what happens locally, on a small scale and in a short time frame of one year, I re-queried the Altmetric API and WoS for our items at VIRTA. The results are visible behind the Data by School tab of the aaltovirta web application – not all of them though, just (possible) changes in tweet(er)s, Mendeley readers, and WoS citations.

While at it, I also added links to open access full texts found by the oaDOI service API. The roadoi R package was very helpful in this. Currently, roadoi is about to become a member of the rOpenSci suite. I was pleasantly surprised when I was asked to participate in an open peer-review of the package. Revamping the app with oaDOI information served as timely proof, for my part, that roadoi does what it promises.

There are multiple ways to visualize changes over time. At first, I had in mind to show as big a picture as possible in one go, i.e. all items by School. However, most Schools publish several hundred items per year – not easily consumable unless you have a megalomaniac full-wall screen, swim within the data in VR, or some such. Only the research output of the School of Arts, Design and Architecture (ARTS) was within realistic limits.

Small multiples

Below are three images showing changes in ARTS publications in citations, tweets, and Mendeley readers, respectively. Here I used the facet_wrap function of the ggplot2 R package, with much-needed extra help from Visualizing home ownership with small multiples and R by Antonio S. Chinchón. My version of his code is in this gist. Before plotting, data is sorted by the value of the respective metrics.
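For illustration, here is a minimal sketch of the small-multiples idea with facet_wrap. The data frame arts and its columns (id, year, citations) are made up for this example, not the names in my actual data; the real code is in the gist mentioned above.

library(ggplot2)

# One facet per publication, one line from last year's value to this year's.
# Facets are ordered by the citation count, mirroring the sorting described above.
ggplot(arts, aes(x = year, y = citations, group = id)) +
  geom_line() +
  geom_point(size = 0.8) +
  facet_wrap(~ reorder(id, -citations), ncol = 20) +
  theme_void() +
  theme(strip.text = element_blank())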

altm_gap_cites_ARTS

altm_gap_tweets_ARTS

altm_gap_mend_ARTS

Note that some values have decreased during the year, mostly in Mendeley. What this probably just means is that at query time, data was missing.

We know that it takes some time for citations to accumulate. Still, even one year can make a difference. In this ARTS sample, Addressing sustainability in research on distributed production: an integrated literature review by Cindy Kohtala had 2 citations in spring 2016 but now already 16. At the same time, Mendeley readership doubled from 95 to 184, whereas the number of tweeters remained the same, 10. This is a familiar pattern. Twitter is more about communicating here and now.

In all ARTS publications, only three had more tweeters now than a year ago. The increase is so small though, that you cannot decipher it from the pictures due to scaling.

Of course it would be interesting to know at which exact times these additional tweets were sent, by whom (human or bot), and what might have triggered the sending. CrossRef Event Data is now in fresh beta. Some day we may get detailed answers to the first question via their API, for free. For now, I find the API results still a bit confusing. Based on example queries, here I’m asking for events of the first item from the above list: http://query.eventdata.crossref.org/events?rows=10&filter=work:10.1111/bjet.12224. For reasons I don’t understand, there are no mentions of that DOI among the first 10 rows. Anyway, we need to be patient – and read the documentation more carefully.
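For the record, the same query can be sent from R too. A small sketch with httr and jsonlite, assuming the beta endpoint above returns JSON:

library(httr)
library(jsonlite)

# Ask the CrossRef Event Data beta API for events of one DOI
resp <- GET("http://query.eventdata.crossref.org/events",
            query = list(rows = 10, filter = "work:10.1111/bjet.12224"))
events <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(events, max.level = 1)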

For the record, in the whole Aalto set, the paper that saw the most increase both in tweeters (14) and Mendeley readers (153) during the year was Virtual Teams Research: 10 Years, 10 Themes, and 10 Opportunities from the School of Science (SCI). On the Altmetric detail page we can see that the newest tweets are only a few days old. The way Journal of Management acts as an amplifier for its own article archive certainly plays a role here.

What about citations? The biggest increase in them took place in the School of Electrical Engineering (ELEC). A Survey on Trust Management for Internet of Things had 35 citations last year, now 134.

altm_gap_cites_ELEC

Small multiples are known to be useful in detecting anomalies. In the above ELEC picture for example, on the 3rd row, there is something odd with the 7th item from the left. Turns out that there are two items in the data with the same DOI, but the publication year differs. My process didn’t expect that to happen.

Sparklines and colored cells

The main venue for all the data in the current and previous exercises on (alt)metrics has been a table in the Shiny web application. A multi-column, horizontally scrollable beast built with DT, an R interface to the DataTables JavaScript library. Visualizing changes over time in there, inside table cells, calls for either sparklines or some kind of color scale. I tried both.

I have only two data points in time, so the resulting lines lack their familiar, playful, sparkling appearance. Basically, mine are either simple slopes or flat lines. Still, they were fun to implement with the sparkline htmlwidget. Big thanks to SBista for advice!
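For the curious, here is a stripped-down sketch of the pattern I mean – not the production code. The two value pairs are the citation and Mendeley figures of the Kohtala paper quoted above; spk_chr and spk_add_deps come from the sparkline package, and the drawCallback makes DT render the static htmlwidgets after the table is drawn.

library(DT)
library(sparkline)

# Two-point "sparklines" as HTML strings, one per row
tbl <- data.frame(
  title = c("Citations 2016 -> 2017", "Mendeley readers 2016 -> 2017"),
  trend = c(spk_chr(c(2, 16), type = "line"),
            spk_chr(c(95, 184), type = "line")),
  stringsAsFactors = FALSE
)

d <- datatable(tbl, escape = FALSE, options = list(
  drawCallback = htmlwidgets::JS("function() { HTMLWidgets.staticRender(); }")
))
spk_add_deps(d)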

Colors work great in heatmaps where you have a big canvas with many items. Also calendar heatmaps are fine. Again, a table is perhaps not the best platform for colors, but I think that even here they manage to tell part of the story. To be on the safe side, I used 3-class Spectral by ColorBrewer.
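A sketch of the colour-scale variant, with a hypothetical cites_change column and illustrative cut points; styleInterval needs one colour more than it has cuts, which is where the 3-class palette fits in.

library(DT)
library(RColorBrewer)

# Toy data, not the real table
tbl <- data.frame(title = c("Paper A", "Paper B", "Paper C"),
                  cites_change = c(-1, 3, 99))

pal <- brewer.pal(3, "Spectral")   # red - yellow - blue

d <- datatable(tbl)
formatStyle(d, "cites_change",
            backgroundColor = styleInterval(c(0, 5), rev(pal)))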

DT is versatile, and it comes with extensive documentation. The number of columns in the table is bordering its maximum, so I used some helper functions and extensions to ease the burden. Hopefully the user experience is not too bad.


Policy sources and impact

For some time already, Altmetric has harvested information on how much various policy and guidance documents cite research publications. They do it mainly by text-mining the References section of PDF documents and verifying the results against CrossRef and PubMed. As a friendly gesture, Altmetric also provides this data via their public API. All of this is great news, and possibly important given the growing interest in assessing impact.

How much and on which fields has Finnish university research output gained attention in international (here = English) policy sources?

Related to this, Kim Holmberg et al. have already shown, in RUSE’s research project Measuring the societal impact of open science, that the research profiles of Finnish universities seem to be reflected in altmetrics.

In this exercise, I narrowed the scope of publications to a single year, 2015, for practical reasons. First, it is the latest full year covered by the new Virta REST API (page in Finnish), the soon-to-be number one open data source of Finnish research publications. Second, I decided to go through all policy cites manually, to get some understanding of what we are talking about here.

For those interested in the gory details, here’s the R source code for getting, cleaning and querying the dataset, and for building a small interactive web app on the results. The app itself is here. Please be patient, it’s a little slow at times.

finunic Shiny web app

By default, the app shows all data in two dimensions: the number of Twitter tweets, and the Altmetric score. While at it, try selecting different metrics on the vertical axis. It reveals how, in this material, the various altmetrics sources relate to this algorithmic, weighted score.

The size of the plot points (circles) tells about the degree of openness of the publication. The two biggest sizes refer to an OA journal and an OA repository, respectively. The smaller ones mean either not OA, or status unknown. This information comes from the publication entry proper. If an article is co-authored by multiple universities, they may have given differing information on the OA status, which is a sign of how unclear the terminology in the OA field still is. Note also that when you filter data by university, the colour scale switches from universities to the OA status, but unfortunately the palette isn’t fixed. Anyway, I’m sure you’ll get my point.

Statistics

Out of the 4563 publications that have gathered at least some altmetrics (12.3% of the total 37K) and thus are represented here, 27 (0.6%) have been cited by some policy or guidance document. To put this in altmetrical perspective, nearly 500 (11%) have been mentioned on some news site; news mentions are the most heavily weighted ingredient of the Altmetric score.
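The shares above, recomputed in R just to show where the percentages come from (the counts are the ones quoted in the text):

round(4563 / 37000 * 100, 1)   # 12.3 - publications with at least some altmetrics
round(27 / 4563 * 100, 1)      # 0.6  - cited by a policy or guidance document
round(500 / 4563 * 100, 0)     # 11   - mentioned on some news site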

In their preprint How many scientific papers are mentioned in policy-related documents? An empirical investigation using Web of Science and Altmetric data (Konkiel), Haunschild and Bornmann found that, of papers published in 2000-2014 and indexed by WoS, only about 0.6% had been mentioned in policy documents. Why so few? Their reasoning feels legit.

Possible reasons for the low percentage of papers mentioned in policy-related documents are: (1) Altmetric quite recently started to analyze policy documents and the coverage of the literature is still low (but will be extended). (2) Maybe only a small part of the literature is really policy relevant and most of the papers are only relevant for scientists. (3) Authors of policy-related documents often are not researchers themselves. Therefore, a scientific citation style should not be expected in policy-related documents in general. Thus, policy-related documents may not mention every important paper on which a policy-related document is based on. (4) There are possible barriers and low interaction levels between researchers and policy makers.

What was the corresponding percentage here, among Finnish 2015 papers? I haven’t checked how many of them are indexed by WoS, but a very rough estimate of 50% would give us 0.1%.

EFSA and food safety

By far the most frequent policy organisation in this Finnish sample is EFSA, European Food Safety Authority. From their How we work:

Most of EFSA’s work is undertaken in response to requests for scientific advice from the European Commission, the European Parliament and EU Member States. We also carry out scientific work on our own initiative, in particular to examine emerging issues and new hazards and to update our assessment methods and approaches. This is known as “self-tasking”.

EFSA’s scientific advice is mostly provided by its Scientific Panels and Scientific Committee, members of which are appointed through an open selection procedure.

A few EFSA Scientific Panels have Finnish members selected for the current term of 2015-2018.

In addition, some also had Finns during the previous period, lasting until 2015.

All but one of the EFSA cites that Altmetric adds up for a given publication come from EFSA Journal, the publishing platform of the Scientific Opinions of the Panels. So I guess it is a matter of taste whether, in this context, you like to count these mentions as self-citations or some such. In this 2015 dataset, the only other citing source for EFSA-originated publications is GOV.UK, which refers to the Scientific Opinion on lumpy skin disease (AWAH) in its various monitoring documents.

In the above-mentioned preprint, the authors note that there is no way of knowing whether the citing document was indeed a legitimate source, because the sites that Altmetric tracks also include CVs.

However, considering the small percentage of WoS publications mentioned in policy-related documents (which we found in this study), we expect that only very few mentions originate from such unintended sites [as CV’s]

In my small sample here, if we consider EFSA Journal to be an unintended source, the percentage was significant. Still, numbers don’t necessarily tell anything about the potential impact of these publications. In fact, given their specialized and instrumental character, EFSA Opinions may indeed be very impact-rich.

Reading the Abstracts of the featured Opinions is fascinating. Proportionally, they are almost equally divided between NDA and AWAH.

NDA is often asked to deliver an opinion on the scientific substantiation of a health claim related to X, where X can be L-tug lycopene; fat-free yogurts and fermented milks; glycaemic carbohydrates and maintenance of normal brain function; native chicory inulin; refined Buglossoides oil as a novel food ingredient; and other chemical substances and compounds which, if they get the scientific green light from the Panel, can eventually emerge on market shelves as functional food products. On the other hand, the span of influence of the Scientific Opinion on Dietary Reference Values for vitamin A is potentially the whole human population.

Not all AWAH Opinion Abstracts are for the faint-hearted. Topics range from slaughterhouse technologies, such as electrical parameters for stunning of small ruminants and the use of carbon dioxide for stunning rabbits, to global threats like African swine fever and oyster mortality.

The only publication related to the broader theme of food and nutrition, and not cited by EFSA above, comes from the University of Tampere. The impact of lipid-based nutrient supplement provision to pregnant women on newborn size in rural Malawi: a randomized controlled trial has been cited by the World Bank.

Society at large

Out of the 27 publications in this sample, two fall in the thematic area of society. They come from Aalto University, and University of Jyväskylä, respectively.

Hedge Funds and Stock Market Efficiency is cited by Australian Policy Online (APO), whereas Democracy, poverty and civil society in Mozambique has got a mention from The Royal Institute of International Affairs. What is perhaps noteworthy about the latter is that it has a single author.

Health & diseases

Similarly, only two papers fall into this theme. The first one is from the University of Tampere, and the second one is co-authored by the universities of Oulu and Jyväskylä.

The effects of unemployment and perceived job insecurity: a comparison of their association with psychological and somatic complaints, self-rated health and satisfaction is cited by the International Labour Organization (ILO), and Genome characterisation of two Ljungan virus isolates from wild bank voles (Myodes glareolus) in Sweden is mentioned by GOV.UK. However, here we have an unintended source again: three authors of this paper belong to the staff of the Animal & Plant Health Agency (APHA), and the paper is listed in Papers published by APHA Staff in 2015.

Wicked problems?

In his kick-off speech for the new academic year 2015, the Rector of the University of Helsinki, Jukka Kola, is quoted as saying that academic knowledge is at the front line of solving so-called wicked problems such as

Climate change, Euro-Russian relations, healthcare challenges, strict nationalist movements

Scientific impact on these would be really something, wouldn’t it? Yet, the first giant step for mankind is to miraculously, uniformly, honestly and openly acknowledge the problem.

From the perspective of Aalto University, had I chosen a wider year range of publications for this exercise, I would’ve found multiple citations from policy documents to our research on climate change, notably Global sea level linked to global temperature (2009) and Climate related sea-level variations over the past two millennia (2011). For the record, here are all the papers published in 2007-2015 which have policy mentions at Altmetric (again, by DOI). The only extra is the number of WoS citations, added manually to these 24 papers.

Future work

The harvesting work of Altmetric helps a great deal, but understandably their focus is on international and UK sources. How about Finland? How much is Finnish academic research cited in local policy documents? Are there differences between publications written by consulting companies and those written by authors from academia itself? There are a few starting points for questions like these.

See you in PDF parsing sessions!


Email anatomy of a project

Project is finished
When a HE project runs for more than a few years, and keeps more than just a few people busy, it generates a substantial amount of email. Of course, IM has been with us for quite some time, but it has not replaced older forms of digital communication. Far from it.

The CRIS project of Aalto University was officially initiated in early 2012. At that time, it was still a side note in my email traffic. It took a year until the investment decision was made. After that, the lengthy EU tender process. A turning point came in September 2015: a two-day kick-off seminar with the vendor, where practicalities started to take shape. That was also the beginning of a more regular flow of emails.

I save work emails in folders by category, a habit that often results in obscure collections whose focus escapes me later on. At this point in time, when the CRIS project has only recently – last week – been declared finished, I still remember what the folder name refers to 🙂

Note that I must have also deleted hundreds of messages: meeting requests, cronjob notifications, emails only superficially related to CRIS but sent FYI, duplicates, etc. Still, I have a hunch that roughly 80% of all the emails where I was either the recipient or the sender sit in the CRIS folder of my Outlook 2013.

What could the emails tell about the project? Active dates, days of the week, times of the day. Sadly, less so about the semantics. Our work language is Finnish, and although Finnish is extensively researched, corpus and other digital tools are not that easily accessible for a layman. Unfortunately (for this blog posting), almost all of the English communication with the vendor took place within their Jira issue and project tracking platform.

Toe bone connected to the foot bone

To start with, I selected all emails (4431) from the CRIS folder, Saved as…, and converted the text file to UTF-8.

Emails are seldom clean, separate entities. More often than not they are bundles of several reply posts. Because I most certainly already had all the older ones, I needed to get rid of that extra material. In other words, I was only interested in the text from From: down to the horizontal line that denotes the beginning of previous correspondence – at least if the email was written with Outlook, the #1 in-house email client. The awk one-liner below does the job.

# The horizontal line Outlook inserts is a run of underscores (\x5f);
# print everything from a From: line down to such a line
awk '/\x5f/ { p-- } /^From:/ { p=1 } p>0' emails_utf8.txt > emails_proper.txt

Thanks to an unknown friend, labelling lines in this Mbox format was easy.

/^From/, /^$/ {
    printf "\nhead : %s", $0
    next
}

/^$/,/^From/ {
    if ($1 ~ /^From/) next
    printf "\nbody : %s", $0
}

From the result file, I grep’ed only the non-empty lines, i.e. dropped the gaps between emails. Below you see one example email. It’s from me to Jari, the project manager.

head From: Sonkkila Tuija
head Sent: 21. syyskuuta 2016 10:04
head To: Haggren Jari
head Subject: RE: Kaksi samaa jobia ajossa
body Ok. Muuten: mikä on sun mielestä järkevintä, kun tunnus halutaan disabloida? 
body t. Tuija

(Roughly: “OK. By the way: what do you think is the most sensible way to go when an account needs to be disabled? – Tuija”. The subject line translates as “Two identical jobs running”.)

Then over to RStudio.

Thigh bone connected to the hip bone

The full R code is here.

First, read in raw, labelled data, and convert to a data frame.

library(dplyr)   # data_frame() comes from dplyr

raw <- readLines("email_proper_parsed.txt", encoding = "UTF-8")
lines <- data_frame(raw = raw)

For further work, I needed to have my email “observations” on one line each. This proved tricky. I suspect that the job could have been easier with some command line tool. In fact, I had the nagging feeling that I was trying to re-push the data to some previous stage. Anyway, again with help, I managed to add a sequence number to the lines, showing which head and which body belonged to the same email. A novelty for me was the clever rle function from base R.

lines$email <- NA

# Number the consecutive runs of head lines: the nth run belongs to the nth email
x <- grepl("^head ", lines$raw)
spanEmail <- rle(x)$lengths[rle(x)$values == TRUE]
lines$email[x] <- rep(seq_along(spanEmail), times = spanEmail)

# The same for the body lines
x <- grepl("^body ", lines$raw)
spanEmail <- rle(x)$lengths[rle(x)$values == TRUE]
lines$email[x] <- rep(seq_along(spanEmail), times = spanEmail)

Then, for every email group, I collapsed the head and the body text into columns of their own, plus some further processing.
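Roughly like this (my reconstruction of the idea with dplyr and tidyr, not the code verbatim):

library(dplyr)
library(tidyr)

# Paste all head lines and all body lines of each email together,
# then spread them into head and body columns, one email per row
emails <- lines %>%
  filter(!is.na(email)) %>%
  mutate(part = ifelse(grepl("^head", raw), "head", "body"),
         text = sub("^(head|body)( :)? ", "", raw)) %>%
  group_by(email, part) %>%
  summarise(text = paste(text, collapse = " ")) %>%
  spread(part, text)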

Hip bone connected to the back bone

Date and time.

I faced the fact that in Finland, the default OS language varies. I haven’t seen statistics, but based on my small HE sample, Finnish and English seem to be on an equal footing. So, to work with dates and times I first had to convert Finnish month and weekday names to their English counterparts. After that I could start working with the lubridate package.
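The conversion itself is nothing fancy. A sketch of the idea, with the Finnish-to-English lookup table abbreviated here:

library(lubridate)

fi2en <- c("tammikuuta" = "January", "helmikuuta" = "February",
           "syyskuuta"  = "September", "joulukuuta" = "December")  # ...and the other months

sent <- "21. syyskuuta 2016 10:04"
for (m in names(fi2en)) sent <- sub(m, fi2en[[m]], sent, fixed = TRUE)
dmy_hm(sent)
# [1] "2016-09-21 10:04:00 UTC"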

Then, summaries for the hourly, weekly, and daily visualizations. Note the use of the timezone (TZ): without setting it explicitly, the xts object, the cornerstone of the time series graph, is off by a day.

Although word clouds seldom reveal anything we wouldn’t be aware of anyway, I produced two of them nevertheless, here with the packages tm and wordcloud. Excluding stop words is a tedious job without a ready-made list, so I made my life easier and basically just tried to make sure that no person names were involved. BTW, signatures are a nuisance in email text mining.
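For reference, the skeleton of such a word cloud looks roughly like this; a generic sketch only, since in reality the stop word handling was the manual part, as said.

library(tm)
library(wordcloud)

corp <- VCorpus(VectorSource(emails$body))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, c("tuija", "sonkkila"))  # person names etc., hand-picked

tdm  <- TermDocumentMatrix(corp)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)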

While working on this, I happened to read the blog posting An overview of text mining visualisations possibilities with R on the CETA trade agreement by BNOSAC, a Belgian consultancy network. They mentioned co-occurrence statistics, and made me experiment with it too. I followed their code example, and constructed a network visualization with ggraph and ggforce.

Neck bone connected to the head bone

Finally, time to build a simple Shiny web app. This is the R code, and here is the web app.

Thanks to the nice dygraphs package, you can zoom in to the time series by selecting an area. Double-clicking reverts to the default view. The hourly and daily plots are made with highcharter.

So, what does it look like?

Overall, it seems to me that the project has a healthy email history. Basically no posts in the wee hours of the day, and almost none during weekends. If you look closely, you’ll notice that in the hourly bar chart there really are tiny bars at nighttime. I can tell you that these emails were not sent by any of us on the Aalto staff. They were sent by a paid consultant.

The 2016 word cloud is right in showing that the word portal was frequent: research.aalto.fi was indeed a big deliverable.

The co-occurrence network graph needs some explanation. First, the title is a bit misleading; there are other words than just nouns. Second, the two big nodes, starting from the left, translate as “as a person” and “in the publication”. From the latter node, the thickest edges point to “already”, “because”, and “are”. The context rings a bell: CRIS systems are platforms where data about people and their affiliations meets research output, both scientific and artistic. Converting our legacy publication data to this new CRIS environment was a multi-step process, and one that is not fully over yet.

Subtitles from Dem Bones.


Making titles to match

Research and citations

For a few months now, the Aalto University CRIS back-end has served research-active personnel and other inclined staff. Last week, the front-end site was opened to the public. One of the most interesting aspects of the CRIS is that it offers, in one place, a digital dossier of one’s publications, together with their metrics.

The de facto metric in science is the number of citations, made available by the two leading commercial indexing services, Web of Science and Scopus. The Pure CRIS can talk with both via an API and return the citation count. For the conversation, though, Pure needs to know the exact topic of the talk: the unique identification number that the publication has in the respective indexing service. In Web of Science, the ID is known as UT or WOS; in Scopus, just ID.

In pre-CRIS times, publication metadata was stored in legacy registers. In these, external IDs were an unknown entity. When the old metadata was transformed to the Pure data model and imported, these new records were all missing both the UT and the Scopus ID. Manually copy-pasting them all would be a daunting task, bordering on the impossible. A programmatic solution offers some remedy here: IDs can be imported into Pure with a ready-made job. For that, you also need Pure’s internal record ID. The input file of the job is a simple CSV with one pair of Pure ID and external ID per row. The challenge is how to find the matching pairs.

Getting data

Within the Web of Science GUI, you can make an advanced Boolean affiliation search. What this means is that you need to be familiar with all the historic name variants of your organization, plus the known false positives that you’d better exclude. I’m not copying our search here, but I can assure you: it is a long and winding string. The result set, filtered by publication years 2010-2016 (the Aalto University period so far), is over 20,000 items. Aalto University outputs roughly 4000 publications yearly.

In what format should the WoS result set be downloaded? There are several possibilities. The most used one, I assume, is tab-delimited Windows aka Excel. There is a caveat though: if the character length of some WoS field value exceeds the Excel cell limit, the tabular data is garbled from that point onwards. This happens quite often with the Author field, where the names of all contributing authors are concatenated. The WoS plain text format doesn’t have this problem, but the output needs parsing before it’s usable. In the exported file, fields are denoted with a two-character label or tag, and fields can span multiple lines. For this exercise, I’m mostly interested in the title (TI), journal name (SO), ISSN (SN, EI), and article identifier (UT).

As a CRIS (Pure) back-end user, to fetch tabular data out of the system, you have a number of options at the moment:

  • Filtering from the GUI. Quick and easy, but doesn’t provide all fields, e.g. the record ID
  • Reporting module. Flexible, but calls for a few rounds of trial & error, plus the Administrator role if the fields you need are not available by default, in which case you need to tweak the report configurations. The minimum requirement is the Reporter role. However, you can save your report specifications, and even import/export them as XML, which is nice
  • REST API. For my simple reporting needs, in its present state, the API would require too much effort in constructing the right query and parsing the result

Of course there is also the SQL query option, but for that, you need to be familiar with the database schema, and be granted all necessary access rights.

With the Pure Reporting module, filtering by year and finding the publication title, subtitle, ISSN and ID is relatively easy. From the output formats, I chose Excel. This and the parsed WoS data I will then read into R, and start matching.

To parse the WoS data, I chose the Unix tool awk.

But before parsing, data download. For this, you need paper and pen.

There is an export limit of 500 records at WoS, so you have to select, click and save 25 times to get all of the 20K+ result set. Keeping a manual record of the operation is a must; otherwise you’ll end up exporting some items several times, or skipping others.

Parsing

When all the WoS data is with you, the Bash shell script below does the work. Note that I’ve added newlines to longer commands for clarity. You may also notice that I had some trouble defining appropriate if clauses in awk to handle the ISSNs. With brute-force parsing I circumvented the obstacle, and did the housekeeping afterwards.

#!/bin/bash

# Delete output file from previous run
rm parsed.txt

# Rename the first WoS output file
mv savedrecs.txt savedrecs\(0\).txt

# Loop over all exported files and filter relevant fields
for i in {0..25}
do
 awk 'BEGIN {ORS="|"}; # ORS = output record separator
      /^TI/,/^LA/ {print $0}; # match boundaries, to get the SO field that's in between there
      /^DT/ {print $0}; 
      /^SN/ {print $0}; 
      /^EI/ {print $0}; 
      /^UT/ {printf "%s\n",$0}' savedrecs\($i\).txt >> parsed.txt
done

# Delete labels, LA field, occasional SE field, and extra white space resulting 
# from fields with multiple lines
sed -i 's/^TI //g; 
        s/|   / /g; 
        s/|SO /|/g; 
        s/|LA [^|]*//g; 
        s/|DT /|/g; 
        s/|SN /|/g; 
        s/|UT /|/g; 
        s/|SE [^|]*//g' parsed.txt

# Filter out rows without a second ISSN
grep -v '|EI ' parsed.txt > noEI.txt

# Filter out rows with a second ISSN
grep '|EI ' parsed.txt > yesEI.txt

# Add an empty field to those without, just before the last field
sed -i 's/|WOS/||WOS/g' noEI.txt

# Concat these two files
cat noEI.txt yesEI.txt > parsed_all.csv

# Delete the label of the second ISSN
sed -i 's/|EI /|/g' parsed_all.csv

# If, before '||', which denotes a non-existing 2nd ISSN, there is no match for an ISSN 
# (so there wasn't any in data, meaning the publication is not an article), 
# add an empty field 
sed -i '/|[0-9][0-9][0-9][0-9]\-[0-9][0-9][0-9][0-9X]||/! s/||/|||/g' parsed_all.csv

After a few seconds, parsed_all.csv includes rows like the following three:

Modulation Instability and Phase-Shifted Fermi-Pasta-Ulam Recurrence|SCIENTIFIC REPORTS|Article|2045-2322||WOS:000379981100001
Real World Optimal UPFC Placement and its Impact on Reliability|RECENT ADVANCES IN ENERGY AND ENVIRONMENT SE Energy and Environmental Engineering Series|Proceedings Paper|||WOS:000276787500010
Deposition Order Controls the First Stages of a Metal-Organic Coordination Network on an Insulator Surface|JOURNAL OF PHYSICAL CHEMISTRY C|Article|1932-7447||WOS:000379990400030

Processing in R

There are several R packages for reading and writing Excel. XLConnect has served me well on Ubuntu Linux. However, on my Windows laptop at home, its Java dependencies have been a minor headache lately.

library(XLConnect)

wb <- loadWorkbook("puredata20160909.xls")
puredata <- readWorksheet(wb, sheet = "Research Output")
puredata <- puredata[, c("Title.of.the.contribution.in.original.language.1", "Subtitle.of.the.contribution.in.original.language.2", 
                         "Journal...Journal.3", "Id.4", "Journal...ISSN...ISSN.5")]
names(puredata) <- c("title", "subtitle", "journal", "id", "issn")
# Paste title and subtitle, if there
puredata$title <- ifelse(!is.na(puredata$subtitle), paste(puredata$title, puredata$subtitle), puredata$title)
puredata <- puredata[, c("title", "journal", "id", "issn")]

WoS data file import:

wosdata <- read.csv("parsed_all.csv", stringsAsFactors = F, header = F, sep = "|", quote = "", row.names = NULL)
names(wosdata) <- c("title", "journal", "type", "issn", "issn2", "ut")

Before attempting any string matching though, some harmonization is necessary: all characters to lowercase, remove punctuation and articles, etc. Without going into details, here is my clean function. The gsub clauses are intentionally kept separate, one per line, for easier reading. Also, I haven’t used character classes for the punctuation but defined the characters one by one.

clean <- function(dataset) {
  # Journal
  dataset$journal <- ifelse(!is.na(dataset$journal), tolower(dataset$journal), dataset$journal)
  # WOS
  if ( "ut" %in% names(dataset) ) {
    dataset$ut <- gsub("WOS:", "", dataset$ut)
  }
  # Title
  dataset$title <- tolower(dataset$title)
  dataset$title <- gsub(":", " ", dataset$title)
  dataset$title <- gsub(",", " ", dataset$title)
  dataset$title <- gsub("-", " ", dataset$title)
  dataset$title <- gsub("%", " ", dataset$title)
  dataset$title <- gsub('\\"', ' ', dataset$title)
  dataset$title <- gsub('\\?', ' ', dataset$title)
  dataset$title <- gsub("\\([^)]+\\)", " ", dataset$title)
  dataset$title <- gsub(" the ", " ", dataset$title)
  dataset$title <- gsub("^[Tt]he ", "", dataset$title)
  dataset$title <- gsub(" an ", " ", dataset$title)
  dataset$title <- gsub(" a ", " ", dataset$title)
  dataset$title <- gsub("^a ", "", dataset$title)
  dataset$title <- gsub(" on ", " ", dataset$title)
  dataset$title <- gsub(" of ", " ", dataset$title)
  dataset$title <- gsub(" for ", " ", dataset$title)
  dataset$title <- gsub(" in ", " ", dataset$title)
  dataset$title <- gsub(" by ", " ", dataset$title)
  dataset$title <- gsub(" non([^\\s])", " non \\1", dataset$title)
  dataset$title <- gsub("^[[:space:]]*", "" , dataset$title)
  dataset$title <- gsub("[[:space:]]*$", "" , dataset$title)
  # ISSN
  dataset$issn <- gsub("-", "" , dataset$issn)
  dataset$issn <- gsub("[[:space:]]*$", "" , dataset$issn)
  dataset$issn <- gsub("^[[:space:]]*", "" , dataset$issn)
  # Second ISSN
  if ( "issn2" %in% names(dataset) ) {
    dataset$issn2 <- gsub("-", "" , dataset$issn2)
    dataset$issn2 <- gsub("[[:space:]]*$", "" , dataset$issn2)
    dataset$issn2 <- gsub("^[[:space:]]*", "" , dataset$issn2)
  } 
  
  return(dataset)
}

Match by joining

The R dplyr package is indispensable for all kinds of data manipulation. One of the join types it supports (familiar from SQL) is left_join, which

returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

So here I am saying to R: Compare my two data frames by title and ISSN, as they are given. Please.

library(dplyr)

# Join with the first ISSN
joined <- left_join(puredata, wosdata, by=c("title"="title", "issn"="issn"))
# and then with the second ISSN
joined2 <- left_join(puredata, wosdata, by=c("title"="title", "issn"="issn2"))

found <- joined[!is.na(joined$ut),]
found2 <- joined2[!is.na(joined2$ut),]

names(found) <- c("title", "journal", "id", "issn", "journal2", "type", "issn2", "ut")
names(found2) <- c("title", "journal", "id", "issn", "journal2", "type", "issn2", "ut")
allfound <- rbind(found, found2)

This way, roughly 55% of the CRIS articles published between 2010 and 2016 found a match – and the needed WOS/UT. Then, similarly, I matched conference publications by title only, i.e. items without a corresponding ISSN. Here the match rate was even better, 73%, although the number of publications was not that big.

However, just one typo in the title or ISSN and no match happens. Thanks to my smart coworkers, I was made aware of the Levenshtein distance. In R, this algorithm is implemented in a few packages.

Fuzzier match

I started with the adist function from the utils package that is normally loaded automatically when you start R. The code examples from Fuzzy String Matching – a survival skill to tackle unstructured information were of great help. The initial tests with a smallish sample looked promising.
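In its simplest form, adist just counts the number of single-character edits between two strings. For example, using the first title from the parsed sample above, with one hyphen replaced by a space:

adist("modulation instability and phase-shifted fermi-pasta-ulam recurrence",
      "modulation instability and phase shifted fermi-pasta-ulam recurrence")
#      [,1]
# [1,]    1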


Pretty soon it became obvious, though, that it would take too much time to process the whole data in this manner. If 1000 x 1000 rows took 15 minutes, then 6000 x 12000 rows – about 72 times as many comparisons – would probably take many times that. What I needed was to run the R code in parallel.

My Linux machine at work runs 64-bit Ubuntu 16.04, with 7.7 GB of RAM and 2.67 GHz Intel Xeon processors, “x 8”, as the System Settings put it. I’m not sure whether 8 here refers to the number of cores or threads. Anyway, if only my code could be parallelized, I could let the computer or some other higher mind decide how to core-source the job in the most efficient way.

Luckily, there is the stringdist package by Mark van der Loo that does just that. By default, it parallelizes the work over all the cores you’ve got, minus one. This is common sense, because while your code is running you may want to do something else with your machine, like watch the output of the top command.

Here I am running the amatch function from the package, trying to match those article titles that were not matched in the join. The method argument value lv stands for Levenshtein distance, as in R’s native adist. maxDist controls how many edits from the first string to the second are allowed for a match. The last argument defines the output in no-match cases.

system.time(matched <- amatch(notfound_articles$title, wosdata_articles$title, method="lv", maxDist = 5, nomatch = 0))

Look: %CPU 700!

top command output

I haven’t tried to change the default nthread argument value, so I don’t know if my run time of roughly 6 minutes could be beaten. Nevertheless, six minutes (363.748 seconds elapsed) is just fantastic.

    user  system  elapsed 
2275.980   0.648  363.748 

In the matched integer vector I now had, for every row of the notfound_articles data frame, either a 0 or a row number from wosdata_articles. Among the first twenty rows, there are already two matches.

>matched[1:20]
[1]  0  0  0  0  0 55  0  0  0  0  0  0  0  0  0  0  1  0  0  0

All I needed to do was gather the IDs of the matching strings, and write them out. For the sake of quality control, I’m also picking up the strings themselves.

N <- length(matched)
pure <- character(N) 
pureid <- character(N)
wos <- character(N)
ut <- character(N)

for (x in 1:N) { 
  pure[x] <- notfound_articles$title[x]
  pureid[x] <- notfound_articles$id[x]
  wos[x] <- ifelse(matched[x] != 0, wosdata_articles[matched[x], "title"], "NA")
  ut[x] <- ifelse(matched[x] != 0, wosdata_articles[matched[x], "ut"], "NA")
}

df <- data.frame(pure, pureid, wos, ut)
df <- df[df$wos!='NA',c("pureid", "ut")]
names(df) <- c("PureID", "ExternalID")

write.csv(df, paste0("pure_wos_fuzzy_articles", Sys.Date(), ".csv"), row.names = F)

Then, the same procedure for the conference output, and I was done.

All in all, thanks to stringdist and other brilliant stuff by the R community, I could find a matching WoS ID for 67.7% of our CRIS publications from the Aalto era.

Repeating the same process for Scopus means building another parser, because the export format is different. The rest should follow similar lines.

EDIT 15.9.2016: Made corrections to the shell script.


Towards more automation (and better quality)

A few weeks ago, CSC – IT Center for Science opened the much-awaited VIRTA REST API for publications (page in Finnish). VIRTA is the Finnish higher education achievement register,

a tool to be utilized in the authoritative data harvesting in the way that the collected data will be both commensurable and of a good quality.

The API is good news for a data consumer like me who likes to experiment with altmetrics applications. Strange as it sounds, it’s not necessarily that easy to get your hands on a set of DOIs from inside the University. The ultimate quality control happens when the University reports its academic output to the Ministry. And that’s precisely what VIRTA is about: the publication set there is stamped by the University as an achievement of its researchers.

VIRTA is still work in progress, and so far only a couple of universities import their publication data there on a regular basis, either from their CRIS or otherwise. Aalto University is not among the piloting organisations, so you’ll not find our 2016 publications in there yet. Years covered are 2014-2015.

Get data and filter

OK, let’s download our stuff in XML (the other option is JSON). Note that access is IP-based, so the very first thing you have to do is contact CSC and tell them who you are and from which IP you work. When the door is open, you’ll need the organisation ID of Aalto. That’s 10076.

Below, I’ve changed the curl command given on the API page so that the header is not written to the file (no -i option), only the result XML.


curl -k -H "Accept: application/xml" "https://dwitjutife1.csc.fi/api/julkaisut/haku/xml?organisaatioTunnus=10076" -o aaltopubl.xml

My target is a similar kind of R Shiny web application to the one I made for the 2:am conference, where you can filter data by School etc. To make my life easier during the subsequent R coding, I first produced a CSV file from the returned XML, with selected columns: DOI, publication year, title, journal name, number of authors, OA status and department code. At Aalto, the first two characters of the department code tell the School, which helps.

Below is the rather verbose XSLT transformation parse.xsl. What it says is: for every publication, output the above-mentioned element values in quotes (separated by semicolons), add a line feed, and continue with the next publication.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:csc="urn:mace:funet.fi:julkaisut/2015/03/01">

	<xsl:output method="text" encoding="UTF-8"/>

	<xsl:template match="/">
<xsl:for-each select="//csc:Julkaisu[csc:DOI]">
"<xsl:value-of select="csc:DOI"/>";"<xsl:value-of select="csc:JulkaisuVuosi"/>";"<xsl:value-of select="translate(csc:JulkaisunNimi, ';', '.')"/>";"<xsl:value-of select="csc:LehdenNimi"/>";"<xsl:value-of select="csc:TekijoidenLkm"/>";"<xsl:value-of select="csc:AvoinSaatavuusKoodi"/>";"<xsl:value-of select="csc:JulkaisunOrgYksikot/csc:YksikkoKoodi"/>"<xsl:text>
</xsl:text></xsl:for-each>
	</xsl:template>
</xsl:stylesheet>

For the transformation, I’ll use the Saxon-HE XSLT engine here. You can run it from the command line like this:


java -jar saxon9he.jar aaltopubl.xml parse.xsl -o:aaltopubl.csv

Right, so now I have the basic data. Next, altmetrics from Altmetric.

Query altmetrics with a (cleaned) set of DOIs

With the reliable rAltmetric R package, the procedure follows a familiar pattern: each DOI in turn is prefixed with doi:, and the Altmetric API is queried with the altmetrics function. Results are saved in a list. When all rows are done, the list is transformed into a data frame with the altmetric_data function.

But before that happens, the DOIs need cleaning. Upstream, there was no quality control on DOI input, so the field can (and does) contain extra characters that the rAltmetric query does not tolerate, and with good reason.

dois$doi <- gsub("http://dx.doi.org/", "", dois$doi)
dois$doi <- gsub("dx.doi.org/", "", dois$doi)
dois$doi <- gsub("doi:", "", dois$doi)
dois$doi <- gsub("DOI:", "", dois$doi)
dois$doi <- gsub("DOI", "", dois$doi)
dois$doi <- gsub("%", "/", dois$doi)
dois$doi <- gsub(" ", "", dois$doi)
dois$doi <- gsub("^/", "", dois$doi)
dois$doi <- gsub("http://doi.org/", "", dois$doi)

library(dplyr)  # needed for filter() and the %>% pipe

dois_cleaned <- dois %>%
  filter(doi != 'DOI') %>%
  filter(doi != '[doi]') %>%
  filter(doi != "") %>%
  filter(!grepl("http://", doi)) %>%
  filter(grepl("/", doi))

When all extras are removed, querying is easy.

library(rAltmetric)  # altmetrics() and altmetric_data()

raw_metrics <- plyr::llply(paste0('doi:',dois_cleaned$doi), altmetrics, .progress = 'text')
metric_data <- plyr::ldply(raw_metrics, altmetric_data)
write.csv(metric_data, file = "aalto_virta_altmetrics.csv")

Data storage

From this step onwards, the rest is Shiny R coding, following similar lines as the 2:am experiment http://dx.doi.org/10.5281/zenodo.32108.

One thing I decided to do differently this time, though: data storage. Because I’m aiming at more automation where, with the same DOI set, new metrics are gathered on, say, a monthly basis – to get time series – I need a sensible way to store data between runs. I could of course just upload new data by re-deploying the application to where it is hosted, shinyapps.io; every file in the synced directory is transmitted to the remote server. However, that could result in an unnecessarily bloated application.

Other Shiny users ponder this too of course. In his blog posting Persistent data storage in Shiny apps, Dean Attali goes through several options.

First I had in mind using Google Drive, thanks to the handy new R package googlesheets. For that, I added a date stamp to the data frame and subsetted it by School. Then – after the necessary OAuth step – gs_new registers a new Google spreadsheet, ws_title names the first sheet, and the input argument uploads the data. In subsequent pipe commands, gs_ws_new generates new sheets, populating them with the rest of the data frames.

library(googlesheets)

upload <- gs_new("aalto-virta", ws_title = "ARTS", input = ARTS, trim = TRUE) %>% 
  gs_ws_new("BIZ", input = BIZ, trim = TRUE) %>% 
  gs_ws_new("CHEM", input = CHEM, trim = TRUE) %>%
  gs_ws_new("ELEC", input = ELEC, trim = TRUE) %>%
  gs_ws_new("ENG", input = ENG, trim = TRUE) %>%
  gs_ws_new("SCI", input = SCI, trim = TRUE)

Great! However, when I tried to make use of the data, I stumbled on the same HTTP 429 issue that Jennifer Bryan writes about here. It persisted even when I stripped the data down to just 10 rows per School. I’ll definitely return to this package in the future, but for now I had to let it be.

Next try: Dropbox. This proved more successful. The code is quite terse. Note that here too, you first have to authenticate. When that is done, data is uploaded. The dest argument refers to a Dropbox directory.

library(rdrop2)

write.table(aalto_virta, file ="aalto-virta-2016_04.csv", row.names = F, fileEncoding = "UTF-8")
drop_upload('aalto-virta-2016_04.csv', dest = "virta")

In the application, you then download the data with drop_get when needed, and read it in with read.table or some such.

drop_get("virta/aalto-virta-2016_04.csv", overwrite = T)
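Reading the file back in just mirrors the write.table call above, something like:

aalto_virta <- read.table("aalto-virta-2016_04.csv", header = TRUE,
                          stringsAsFactors = FALSE, fileEncoding = "UTF-8")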

I didn’t notice any major lag in the initiation phase of the application where you see the animated Please Wait… progress bar, so that’s a good sign.

Here is the first version of the application. Now with quality control! In the next episode: time series!


Mining REF2014 impact case studies

The British 2014 Research Excellence Framework is a landmark in recent European academic life, no matter what angle you look at it from. One of its novelties was the Impact Case Studies, an impressive collection of 6,975 items which

showcase how research undertaken in UK Higher Education Institutions (HEIs) over the past 20 years has benefited society beyond academia – whether in the UK or globally

It is a friendly gesture from REF’s side that there is a searchable interface to the studies and a REST API, and you are even allowed to download the data – all under a Creative Commons Attribution 4.0 licence.

The initial report, titled The nature, scale and beneficiaries of research impact, is coauthored by King’s College London and Digital Science. A 70+ page analysis, it is nicely sprinkled with graphs and tables. In the Headline findings, the authors make humble observations about their task:

Text-mining itself can be dangerous and dirty: dangerous, as it is possible to misinterpret information in the text; and dirty, as it involves a lot of experimentation and trying by doing. Given the size of the dataset (more than 6 million words in the ‘Details of the impact’ section of the case studies), text-mining was useful in producing an analysis of the general patterns and themes that could be described in the case studies.

Briefly on their methodologies, as explained on page 16:

  • topic modelling with Apache Mallet Toolkit Latent Dirichlet Allocation (LDA) algorithm
  • keyword-in-context (KWIC) to develop an alphabetical list of keywords displayed with their surrounding text (word) context
  • information extraction to identify references to geographic locations

What I’m about to present below is a sign of how the REF impact stories made an impact on me. I became curious about keywords and topics: how they are mined, and what results you might get. Note that my approach stays roughly at the mineral-extraction level. Still, even with only modest knowledge and an even more modest understanding of text-mining principles, one can experiment, thanks to all the code, tutorials and blogs out there by those who know what they are doing.

Some time ago I saw a press release of IBM’s acquisition of AlchemyAPI,

leading provider of scalable cognitive computing application program interface (API) services and deep learning technology

IBM has plans to integrate AlchemyAPI’s technology into the core Watson platform. This is an interesting move. Who knows, perhaps the already famous IBM Chef Watson will soon get even smarter sister implementations.

The list of AlchemyAPI calls is long, and the documentation quite good, as far as I can tell. I haven’t gone through all the calls, but the ones I’ve tried behaved as documented. The one best suited for me seemed to be Text API: Keyword / Term Extraction. Thanks to the helpful and prompt AlchemyAPI support, my newly minted API key got promoted to an academic one. With the somewhat higher daily request limit, I was able to make all the API calls I needed within one and the same day.

First, REF data.

Panel by panel, one Excel file at a time, I downloaded the files and imported them into RStudio. The only data cleanup I did at this stage was to get rid of the newline characters in the Details of the impact text blocks.

library(XLConnect)
library(XML)

# Main Panel A
docA <- loadWorkbook("CaseStudiesPanelA.xlsx")
casesA <- readWorksheet(docA, 
                           sheet = "CaseStudies", 
                           header = TRUE)
casesA$Details.of.the.impact <- gsub("\n", " ", casesA$Details.of.the.impact)

The TextGetRankedKeywords API call has several parameters. I decided to proceed with default values, with the exceptions of keywordExtractMode (strict) and Sentiment (1, i.e. return sentiment score and type). The default maximum number of keywords is 50.

I made a few attempts to send the entire Details of the impact text, but stumbled on a URI too long exception. With some trial and error, I ended up cutting the text strings down to 4000 characters each.

library(httr)

alchemy_url <- "http://access.alchemyapi.com/"
api_key <- "[API key]"
call <- "calls/text/TextGetRankedKeywords"
kwmode <- "strict"
sentiment <- "1"

url <- paste0(alchemy_url, call)

q.kw <- function(df, row, col){  
  r <- POST(url,
            query = list(apikey = api_key,
                         text = substring(df[row,col],1,4000),
                         keywordExtractMode = kwmode,
                         sentiment = sentiment))
  return(r)
}

resultlist <- vector(mode = "list", length = nrow(casesA))

system.time(
for ( i in 1:nrow(casesA) ) {  
  res <- content(q.kw(casesA, i, "Details.of.the.impact"), useInternalNodes=T)
  kw <- paste(xpathApply(res, "//text", xmlValue), 
              xpathApply(res, "//relevance", xmlValue), 
              xpathApply(res, "//score", xmlValue),
              xpathApply(res, "//type", xmlValue),
              sep = ";")
  resultlist[[i]] <- c(casesA[i, c("Case.Study.Id", "Institution", "Unit.of.Assessment", "Title")], kw) 
})

save(resultlist, file = "keywordsA.Rda")

There is a lesson to learn at this point: I found it relatively difficult to decide which data structure would be optimal for storing the query results. My pet bad habit is to transform data back and forth.

The query run time varied from 14 to 20 minutes. Within roughly one hour, I had processed all the data and saved the results as R list objects. Then, data to data frames, bind everything together, and some cleaning; a rough sketch follows below.
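Something along these lines (again a reconstruction, not the code verbatim): each list element carries the four metadata fields first and the keyword strings after them, so the keyword strings can be stacked into rows and split at the semicolons.

library(dplyr)
library(tidyr)

kw.dfA <- bind_rows(lapply(resultlist, function(x) {
  data.frame(id         = x[["Case.Study.Id"]],
             University = x[["Institution"]],
             UnitOfA    = x[["Unit.of.Assessment"]],
             Title      = x[["Title"]],
             kw         = unlist(x[-(1:4)]),   # "text;relevance;score;type"
             stringsAsFactors = FALSE)
})) %>%
  separate(kw, into = c("Keyword", "Relevance", "SentimentScore", "SentimentType"),
           sep = ";")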

Next, the web application.

The Shiny Dashboard examples are very nice, particularly the one on streaming CRAN data, so I first made a quick sketch of a similar one. However, the D3.js bubbles htmlwidget by Joe Cheng wasn’t really suitable for my values. Instead, I decided to use a ggvis graph, reflecting the chosen Unit of Assessment. The x axis shows the keyword relevance; the y axis, the sentiment score. On the same “dashboard” view, two tables list some top items from both dimensions. And finally, behind a separate tab, the data as a DT table.

The R Shiny web application is here.

While browsing the results, I noticed that some studies were missing the name of the institution.

REF2014 id XML response

From the case study 273788 (one that lacked the name) I could confirm that the Institution element is indeed sometimes empty, whereas Institutions/HEI/InstitutionName seemed a more reliable source.

The REST API to the rescue.

With one call to the API, I got the missing names. Then I just had to join them to the rest of the data.

library(dplyr)

ids <- unique(kw.df[kw.df$University == "",]$id)
idlist <- paste(ids, collapse = ",")

q <- httr::GET("http://impact.ref.ac.uk/casestudiesapi/REFAPI.svc/SearchCaseStudies", 
               query = list(ID = idlist, format = "XML"))
doc <- httr::content(q, useInternalNodes = TRUE)

library(XML)
ns <- c(ns = "http://schemas.datacontract.org/2004/07/REFAPIService")
missingunis <- paste(xpathApply(doc, "//ns:CaseStudyId", xmlValue, namespaces = ns), 
                     xpathApply(doc, "//ns:InstitutionName", xmlValue, namespaces = ns), 
                     sep = ";")

missingdf <- data.frame(matrix(unlist(missingunis), nrow = length(missingunis), byrow = TRUE))
names(missingdf) <- "id_univ"

missing <- missingdf %>%
  tidyr::extract(id_univ, c("id", "University"), "(.*);(.*)")

missing$id <- as.numeric(missing$id)

kw.df <- kw.df %>%
  left_join(missing, by = "id") %>%
  # ifelse() is needed here; a plain if () would only look at the first element
  mutate(University = ifelse(!is.na(University.y), University.y, University.x)) %>%
  select(id, University, UnitOfA, Title, Keyword, Relevance, SentimentScore, SentimentType, Unique_Keywords)

Sentiment analysis sounds lucrative, but from the little I know about it, it doesn't come easily. A classic example is medicine, where the vocabulary inherently contains words that are negative without context, such as names of diseases. Yet, it's not difficult to think of text corpora that are better suited to sentiment analysis. Say, tweets by airline passengers.

There is also the question of the bigger picture. In Caveats and limitations of analysis (pp. 17-18), authors of the report make an important note:

The sentiment in the language of the case studies is universally positive,
reflecting its purpose as part of an assessment process

In my limited corpus the shares between neutral, positive and negative are as follows, according to AlchemyAPI:

# Neutral
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'neutral',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "46.2%"
# Positive
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'positive',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "40%"
# Negative
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'negative',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "13.7%"

In the web application, behind the tab Sentiment analysis stats, you'll find more detailed statistics by Unit of Assessment. The cool new D3heatmap package by RStudio lets you interact with the heatmap: click a row or column to focus, or zoom in by drawing a rectangle.

On topic modeling, the REF report observes:

One of the most striking observations from the analysis of the REF case studies was the diverse range of contributions that UK HEIs have made to society. This is
illustrated in a heat map of 60 impact topics by the 36 Units of Assessment (UOAs) (Figure 8; page 33), and the six deep mine analyses in Chapter 4, demonstrating that such diverse impacts occur from a diverse range of study disciplines

To get a feel for LDA topic modeling – how to fit the model to a text corpus, and how to visualize the output – I followed a nicely detailed document by Carson Sievert. The only detail I changed in Carson's code was that I also excluded all-digit terms, of which there were plenty in the corpus:

# Terms consisting of digits only
delD <- regmatches(names(term.table), regexpr("^[[:digit:]]+$", names(term.table), perl=TRUE))
stop_words_digits <- c(stop_words, delD)
# Drop stop words, all-digit terms, and terms occurring fewer than 5 times
del <- names(term.table) %in% stop_words_digits | term.table < 5 
term.table <- term.table[!del]

The REF report tells how the modelling ended up with 60 topics. Based on this “ground truth”, I decided to first process the four Main Panel study files separately, and on each run set the number of topics (K) to 15. Please note that I don't really know if this makes sense. Yet, my goal here was mainly to have a look at how the LDAvis topic visualization tool works. For a quick introduction to LDAvis, I can recommend this 8 minute video.
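For orientation, the fitting step in Carson's document boils down to a call like the one below; documents and vocab come from the pre-processing steps, and the hyperparameter values here are illustrative rather than the exact ones I used:

library(lda)

# Per-panel run: K = 15 topics
K <- 15
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
                                   num.iterations = 500, alpha = 0.02, eta = 0.02,
                                   compute.log.likelihood = TRUE)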

Here are the links to the respective LDAvis visualizations: Main Panel A, B, C and D.
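Each page was generated along these lines with the LDAvis package, again following Carson's code (phi, theta, doc.length and term.frequency come from the fitted model and the corpus statistics):

library(LDAvis)

json <- createJSON(phi = phi, theta = theta,
                   doc.length = doc.length, vocab = vocab,
                   term.frequency = term.frequency)
serVis(json, out.dir = "vis_panelA", open.browser = FALSE)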

With LDAvis, I like the way you can reflect on actual terms; why some exist in several topics, what these topics might be, etc. For example, within Panel B, the term internet does not only occur in topic 7 – which is clearly about Computer Science and Informatics – but also in 15, which is semantically (and visually) further away, and seems to be related to Electrical and Electronic Engineering, Metallurgy and Materials.

For the record, I also processed all data in one go. For this I turned to the most robust academic R environment in Finland, AFAIK: CSC – IT Center for Science. My multi-processor batch job file is below; it's identical to CSC's own example, except that I reserved 8 x 5 GB of memory (or so I thought):

#!/bin/bash -l
#SBATCH -J r_multi_proc
#SBATCH -o output_%j.txt
#SBATCH -e errors_%j.txt
#SBATCH -t 06:00:00
#SBATCH --ntasks=8
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5000

module load r-env
srun -u -n 8 Rmpi --no-save < model.R

model.R is the R script where I first merge all data from the Main Panels, and then continue in the same manner as with the individual Panel data. One exception: this time, K = 60 topics.

For the sake of comparison, I started the same script on my Ubuntu Linux workstation, too. Interestingly, the run time on it (4-core 2.6 GHz processor, 8 GB memory) and on Taito at CSC was roughly the same: Ubuntu 6:39 hours, Taito 5:54. Disclaimer: there's a learning curve on Taito – as well as on R! – so I may in fact have reserved only limited or basic resources, although I thought otherwise. Or perhaps the R script needs to use specific packages for a multi-core job? Maybe the script felt the presence of all the memory and cores in there but didn't have instructions on how to make use of them? Frankly, I don't have a clue what happened under the hood.
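If I had to guess, the last explanation is the most likely one: a plain R script stays single-threaded no matter how many cores are reserved. Making use of them would need explicit parallel code, something like the sketch below, where the text objects and the tokenizing step are placeholders:

library(parallel)

# Hypothetical: spread a per-document pre-processing step over 8 cores.
# The Gibbs sampler itself would need a parallel-aware implementation.
doc.list <- mclapply(case.texts, function(x) strsplit(x, "[[:space:]]+")[[1]],
                     mc.cores = 8)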

Unlike Carson in the review example I used, or in his other paper Finding structure in xkcd comics with Latent Dirichlet Allocation, I didn't visually inspect fit$log.likelihood. I couldn't follow Carson's code at that point so I gave up, but saved the list just in case. It would've been interesting to see whether the plot would've indicated that the choice of 60 topics was on the right track.

Here is the LDAvis visualization of the 60-topic model. Some topics look a little strange; for example, topic 60 has many cryptic terms like c1, c2, etc. In the panel-wise visualizations, they sit in topic 14 of Panel C and topic 14 of Panel D. Are they remnants of URLs? Most likely, because I didn't have any special handling for them. In pre-processing, I replaced punctuation with a space, which means that URLs were split at least at slashes and full stops. The example data Carson used in his document were film reviews – seldom any URLs in those, I guess. So I realized that I should've excluded all links to start with. Anyway, it looks as if, in the case studies, URLs were less frequently used in the natural sciences, dealt with by Panels A and B.

Code of the R Shiny web application.

Posted by Tuija Sonkkila


Wikipedia outreach by field of science

Since the previous posting, the Scholarly Article Citation dataset on Figshare has been upgraded to also include DOIs. Great! I'd imagine that, unlike with PMIDs, there'd be more coverage from Aalto with DOIs.

As in all altmetrics exercises so far, I first gathered all our articles published since 2007 that have a DOI, according to Web of Science. I also saved some more data on the articles, such as the number of cites and the field(s) of science. This data I then joined with the Figshare set by the DOI field. Result: 193 articles.

Which research fields do these articles represent?

Web of Science makes use of five broad research areas. With some manual work, I made a lookup table where the various subfields are aggregated into these areas. Now I could easily add the area name to the dataset by looking at the subfield of the article. To make life easier, I picked only the first field if there was more than one. Within each area, I then calculated the average citation count, and also saved the number of articles by area (group size). With these two values, it was now possible to construct a small “network” graph: node size would reflect the article count within the area, and node color the average number of citations. But how to keep the areas as ready-made clusters in the graph, without any edges?
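(As an aside, the per-area aggregation itself was essentially a couple of lines of dplyr; the input names below are placeholders of my own.)

library(dplyr)

# Group size and average Web of Science citation count per research area
nodes.attr <- articles %>%          # 'articles' and 'cites' are hypothetical names
  group_by(agg) %>%
  summarise(Count = n(), WoSCitesAvg = mean(cites, na.rm = TRUE))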

A while ago I read about a neat trick by Clement Levallois on the Gephi forum. With the GeoLayout plugin, you can arrange nodes on the canvas based on their geocoordinates, using one of the available projections. As a bonus, the GEXF export format preserves this information in the x and y attributes of the viz:position element. This way, the JavaScript GEXF Viewer knows where to render the nodes.

The viz:position element in the GEXF export

What coordinates to use, where to get them, and how to use them? A brute force solution was good enough in my case. One friendly stackoverflower mentioned that he had US state polygons in XML. Fine. What I did was simply choose five states from different corners of the US (to avoid collisions), and name each research area after one of them. For example, Technology got Alaska. Here's the whole list from the code:

nodes.attr$state <- sapply(nodes.attr$agg, function(x) {
  if (x == 'Life Sciences & Biomedicine') "Washington" 
  else if (x == 'Physical Sciences') "Florida"
  else if (x == 'Technology') "Alaska"
  else if (x == 'Arts & Humanities') "North Dakota"
  else "Maine"
})

Then I made two new variables for latitude and longitude, and picked up coordinates from the polygon data of that state, one by one. Because polygon coordinates follow the border of the state, the shape of the Technology cluster follows the familiar, elongated shape of Alaska. A much more elegant solution of course would’ve been to choose random coordinates within each state.
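Something along these lines would do it, assuming the state borders were available as an sp SpatialPolygons object (mine came as raw XML polygons, so this is hypothetical):

library(sp)

# Hypothetical: n random points inside the Alaska polygon for the Technology nodes
n.tech <- sum(nodes.attr$state == "Alaska")
tech.pts <- spsample(alaska.poly, n = n.tech, type = "random")
tech.coords <- coordinates(tech.pts)   # longitude/latitude pairs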

One of the standard use cases of Gephi is to let it choose colors for the nodes based on the result of a community detection run. Because my data had everything pre-defined – including the communities, a.k.a. research areas – I couldn't use that feature. Instead, I used another Gephi plugin, Give color to nodes by Clement Levallois, who seems to be very active within the Gephi plugin developer community too. All I needed to do was give different hexadecimal RGB values to ranges of average citation counts. For a suitable color scheme, I went to the nice visual aid by Mike Bostock that shows all ColorBrewer schemes by Cynthia Brewer. When you click one of the schemes – I clicked RdYlGn – you get the corresponding hex values in the JavaScript console of the browser.

ColorBrewer scheme hex values in the browser's JavaScript console

From the last line showing all colors, I picked both end values and three from the middle. To my knowledge, you cannot easily add a custom legend to the GEXF Viewer layout, so I'll add it here below, raw, copied from the R code.

# ColorBrewer RdYlGn hex values mapped to ranges of average citation counts
nodes.attr$Color <- sapply(nodes.attr$WoSCitesAvg, function(x) {
  if (x <= 10) "#a50026" 
  else if (x > 10 && x <= 50) "#fdae61" 
  else if (x > 50 && x <= 100) "#ffffbf"
  else if (x > 100 && x <= 200) "#a6d96a"
  else "#006837"
})

From the data, I finally saved two files: one for the nodes (Id, Label, Type), and one for the node attributes (Id, Count, WoSCitesAvg, Latitude, Longitude, Color). Then, to Gephi.

New project, Data Laboratory, and Import Spreadsheet. First nodes, then attributes. Both as node tables. Note that in the node import, you create new nodes, whereas in attribute import, you do not. Note also that in the attribute import, you need to choose correct data types.

Importing node attributes in the Gephi Data Laboratory

First I gave the nodes their respective colors. The icon of the new color plugin sits in the left-hand vertical panel of the Graph window. Click it, and you'll get a notification that the plugin will now start to color the nodes.

Coloring nodes with the Give color to nodes plugin

Nodes get their size – the value of Count in my case – from the Ranking tab.

When you install the GeoLayout plugin, it appears in the Layout dropdown list. I tried all the projections. My goal was just to place the different research area clusters as clearly apart from each other as possible, and Winkel tripel seemed to produce what I wanted.

Finally, make node labels visible by clicking the big T icon on the horizontal panel, make the label size follow the node size (the first A icon from the left), and scale the font. A few nodes will inevitably be stacked on top of each other, so some Label Adjust from the Layout options is needed. Be aware though that it can do all too much cleaning and wipe out the GeoLayout result. To prevent this from happening, lower the speed from the default to, say, 0.1. Now you can stop the adjusting in its tracks whenever necessary.

Research area clusters after running GeoLayout

A few things left. Export to GEXF; install the Viewer; in the HTML file, point to the configuration file; in the configuration file, point to the GEXF file – and that's it.

It’s hardly surprising I think that multidisciplinary research gets attention in Wikipedia. Articles in this field have also gathered quite a lot of citations. Does academic popularity increase Wikipedia citing, too? Note though, that because the graph lacks the time dimension, you cannot say anything about the age of the articles. Citations tend to be slow in coming.

For those of you interested in the gory R code, it's available as a GitHub Gist.

Posted by Tuija Sonkkila


Cited in Wikipedia

In a recent blog posting Altmetric announced that they have added Wikipedia to their sources. Earlier this month, WikiResearch tweeted about a CC0-licensed dataset upload to Figshare, Scholarly article citations in Wikipedia. The dataset is a 35+ MB file of cites that carry a PubMed identifier, PMID.

Tweet by WikiResearch about the Figshare dataset

From a university perspective, this is excellent news. Now we are able to start posing questions such as “Have any articles from our University got any cites in Wikipedia?”

To start a small forensic investigation in the case of Aalto University, I first need a list of all PMIDs connected to our affiliation. With the R jsonlite package, and great help from How to search for articles using the PubMed API, the following query (the URL is built in parts for clarity) returns 938 PMIDs as part of the JSON response.

library(jsonlite)

pubmed.res <- fromJSON(paste0("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                              "?db=pubmed",
                              "&retmax=2000",
                              "&retmode=json",
                              "&term=aalto+university[Affiliation]"))

Note that querying the Affiliation field doesn't necessarily capture all our authors. Also, I didn't even dare to venture into the various legacy name variants of our present Schools from before Aalto University saw the light of day in 2010.

Now, with the two PMID lists at hand, and with an appropriate join function from the R dplyr package, it was easy to check which IDs appear in both.

library(dplyr)
joined <- inner_join(pubmeddata, wikidata, by = c("id" = "id"))

Turns out, there is only one article from Aalto University that has been cited with its PubMed identifier, but on three different pages.

        id page_id              page_title    rev_id  timestamp type
1 21573056 4355487        Apolipoprotein B 448851207 2011-09-07 pmid
2 21573056   51521 Low-density lipoprotein 430508516 2011-05-23 pmid
3 21573056   92512             Lipoprotein 586010949 2013-12-14 pmid

From the page_title we can see that the subject in all of them is obviously lipoproteins. For the article metadata and other interesting facts, we need to make two more API calls: one to Wikipedia and one, again, to PubMed. From Wikipedia, it'd be interesting to return the number of page watchers, if given. The size of the watcher community tells something about the commitment of the authors and readers of the page, don't you think? From PubMed, I can get the title of the article, the date of publication, etc. How quickly after publication were the cites made?

The URL stubs below make a two-part skeleton of the Wikipedia query. The Wikipedia page id needs to be plugged in between. For more API details, see this documentation.


w.url.base <- "http://en.wikipedia.org/w/api.php?action=query&pageids="
w.url.props <- "&prop=info&inprop=watchers&format=json&continue="
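Glued together with the page id in the middle, the call and the watcher count look roughly like this (the helper name is my own):

# Sketch: fetch the info properties of one page and pick the watcher count, if any
w.watchers <- function(pageid) {
  res <- jsonlite::fromJSON(paste0(w.url.base, pageid, w.url.props))
  res$query$pages[[as.character(pageid)]]$watchers
}

w.watchers(51521)   # the Low-density lipoprotein page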

PubMed has several different query modules and options. For the core metadata, I need the esummary module and the abstract result type. Again, the URL is built in parts to show them more clearly. The PMID is added to the very end.

pubmed.abs.url <- paste0("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
                         "?db=pubmed",
                         "&retmode=json",
                         "&rettype=abstract",
                         "&id=")

The article Three-dimensional cryoEM reconstruction of native LDL particles to 16Å resolution at physiological body temperature was published in PLoS One on 9th May 2011. Two weeks later, on the 23rd, it was cited on the Wikipedia page Low-density lipoprotein, in the section describing the structure of these complex particles composed of multiple proteins which transport all fat molecules (lipids) around the body within the water outside cells.

The Wikipedia edit adding the citation

At the moment, the page has 121 watchers, a fair bit more than the other citing pages.

     pubDate wikipediaDate               pageTitle watchers
1 2011-05-09    2011-09-07        Apolipoprotein B      N/A
2 2011-05-09    2011-05-23 Low-density lipoprotein      121
3 2011-05-09    2013-12-14             Lipoprotein       50

When there are fewer than 30 watchers, like in the case of Apolipoprotein B, Wikipedia uses the phrase Fewer than 30 watchers on the web page, and the API returns no value.

I also updated the experimental altmetrics web application with metrics courtesy of e.g. Altmetric – now also with Wikipedia! This version follows the same logic as the older one, only with fresh metrics.

Change either one of the axes to show Wikipedia, and you’ll notice a couple of things.

The altmetrics web application with Wikipedia selected as one of the axes

First, according to Altmetric, among Aalto articles published since 2007, the highest Wikipedia score is 3. But this is not our lipoprotein article, published Open Access (marked with a golden stroke in the app) in PLoS One, but The Chemical Structure of a Molecule Resolved by Atomic Force Microscopy, published in Science. The lipoprotein article is among those with 2 Wikipedia cites, on the far left. Why the difference?

Second, there seem to be quite a few articles cited in Wikipedia. Why weren't they in the Figshare dataset?

Altmetric is not just aggregating Wikipedia citations by PMID. Take for example the Science article. Follow the Wikipedia link in Mentioned by…, and from there, click either one of the timestamped links Added on…. You land on the diff page showing when the citation was added, by whom and how. Version control is a cool time machine. You can almost hear the author tap-tapping the text.

She doesn’t write a PMID. She writes a DOI.

It remains a minor mystery, though, why Altmetric doesn't count the citation made to Apolipoprotein B on 7th September 2011. Maybe this is just a hiccup in the new process, something that Altmetric itself anticipates in the blog posting:

Also please note that to begin with we’ve had a big load of historical data to update all at once, so some of it is still feeding through to the details pages – but give it a week or so and it should all be up to date.

For those of you interested in R, the code on querying Wikipedia and PubMed is available as a GitHub Gist.

Posted by Tuija Sonkkila


Bot

Twitter bots are something I like a lot. Up until now, I haven't ventured into making one myself, which is a bit unfortunate because earlier the process of getting read/write access for a bot was much easier. Anyway, thanks to a level-headed blog posting by Dalton Hubble, I managed to grasp the basics. Timo Koola and Duukkis, central figures in the Finnish bot scene – and friendly guys at that – gave helpful advice too.

The last push forward came from abroad. In early October, Andy Teucher announced that he had published a bot that used the twitteR package, something I had in mind too. My turn!

Tweet by Andy Teucher on his R-based Twitter bot

Technicalities aside, there was also the question of What?

My pick was to start with job-related text snippets, touting altmetrics.

This summer, Impactstory showed a fair amount of generosity. Although their profiles ceased to be free of charge, they gave away free waivers – and I was lucky to get one for the experimental profile showing a sample of Aalto University research outputs.

The JSON file, linked from the landing page, is packed with data. From the many possibilities, I decided to first focus on what is new since last week: new GitHub forks, video plays, Slideshare views, etc. The same information is visible on the Impactstory profile. It's the small flag with a + sign and a number, just after the metrics label.

Impactstory flag showing new activity since last week

It took a while to get familiar enough with the JSON and how to traverse it with R. Modelling the tweet was another thing that needed a number of test rounds. The last mile included e.g. finding out how to fit in the URL of the product. Here, Andy's GitHub repo of the Rare Bird Alert Twitter Bot came to the rescue: he had done the URL shortening with Bitly.
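A sketch of what the shortening boils down to; the endpoint reflects the v3 Bitly API of the time, and bitly.token is a placeholder for my access token:

# Sketch: shorten the product URL before composing the tweet
shorten <- function(long.url) {
  r <- httr::GET("https://api-ssl.bitly.com/v3/shorten",
                 query = list(access_token = bitly.token,
                              longUrl = long.url,
                              format = "json"))
  httr::content(r)$data$url
}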

I had nurtured the fancy idea of (somehow) making use of Unicode block elements like Alex Kerin has done. However, after some experiments in my test account I gave up. They might come in handy in showing a trend, though.

Test tweets with sparklines

Setting up a new Twitter account for the bot was of course easy, but even here there was something new to learn. Did you know that you can have multiple custom Gmail address variants within one account?

So, here is the WeeklyMetrics Twitter bot in its present shape. What is still in a test phase is running the R script successfully in batch. The cron scheduling seems to work OK, but as Eric Minikel points out at the end of his thorough posting, making the script robust probably calls for some tryCatch blocks. The Impactstory server could be temporarily down, Bitly's likewise. At the moment, my script happily assumes that all is safe and sound in the universe.
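Something like this would be the minimum; the fetching and tweeting helpers below are placeholders for my own functions:

# Sketch: if either service is down, log the error and skip this week's tweet
result <- tryCatch({
  profile <- jsonlite::fromJSON(impactstory.profile.url)   # hypothetical URL variable
  compose.and.send.tweet(profile)                          # hypothetical helper
}, error = function(e) {
  message("Skipping this run: ", conditionMessage(e))
  NULL
})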

The R code

Posted by Tuija Sonkkila
