Mining REF2014 impact case studies

The British 2014 Research Excellence Framework is a landmark in recent European academic life, no matter from what angle you look at it. One of its novelties was the Impact Case Studies, an impressive collection of 6,975 items which

showcase how research undertaken in UK Higher Education Institutions (HEIs) over the past 20 years has benefited society beyond academia – whether in the UK or globally

It is a friendly gesture on REF’s side that there is a searchable interface to the studies, a REST API, and you are even allowed to download the data. All under a Creative Commons Attribution 4.0 licence.

The initial report, titled The nature, scale and beneficiaries of research impact, is co-authored by King’s College London and Digital Science. A 70+ page analysis, it is nicely sprinkled with graphs and tables. In the Headline findings, the authors make humble observations about their task:

Text-mining itself can be dangerous and dirty: dangerous, as it is possible to misinterpret information in the text; and dirty, as it involves a lot of experimentation and trying by doing. Given the size of the dataset (more than 6 million words in the ‘Details of the impact’ section of the case studies), text-mining was useful in producing an analysis of the general patterns and themes that could be described in the case studies.

Briefly on their methodologies, as explained on page 16:

  • topic modelling with Apache Mallet Toolkit Latent Dirichlet Allocation (LDA) algorithm
  • keyword-in-context (KWIC) to develop an alphabetical list of keywords displayed with their surrounding text (word) context
  • information extraction to identify references to geographic locations

What I’m about to present below is a sign of how the REF impact stories made an impact on me. I became curious about keywords and topics: how they are mined, and what results you might get. Note that my approach stays roughly at the mineral-extraction level. Still, even with only modest knowledge and an even more modest understanding of text-mining principles, one can experiment, thanks to all the code, tutorials and blogs out there by those who know what they are doing.

Some time ago I saw a press release about IBM’s acquisition of AlchemyAPI,

leading provider of scalable cognitive computing application program interface (API) services and deep learning technology

IBM has plans to integrate AlchemyAPI’s technology into the core Watson platform. This is an interesting move. Who knows, perhaps the already famous IBM Chef Watson will soon get even smarter sister implementations.

The list of AlchemyAPI calls is long, and the documentation is quite good, as far as I can tell. I haven’t gone through all the calls, but the ones I’ve tried behaved as documented. The one best suited for me seemed to be the Text API: Keyword / Term Extraction. Thanks to the helpful and prompt AlchemyAPI support, my newly minted API key got promoted to an academic one. With the help of a somewhat higher daily request limit, I was able to make all the API calls I needed within a single day.

First, REF data.

Panel by panel, one Excel file at a time, I downloaded the files and imported them into RStudio. The only data cleanup I did at this stage was to get rid of the newline characters in the Details of the impact text blocks.


# Main Panel A (uses the XLConnect package)
library(XLConnect)

docA <- loadWorkbook("CaseStudiesPanelA.xlsx")
casesA <- readWorksheet(docA, 
                        sheet = "CaseStudies", 
                        header = TRUE)
casesA$Details.of.the.impact <- gsub("\n", " ", casesA$Details.of.the.impact)

The TextGetRankedKeywords API call has several parameters. I decided to proceed with the default values, with the exceptions of keywordExtractMode (strict) and sentiment (1, i.e. return the sentiment score and type). The default maximum number of keywords is 50.

I made a few attempts to send the entire Details of the impact text, but stumbled on a URI too long exception. With some trial and error, I ended up cutting the text strings down to 4000 characters each.


library(httr)
library(XML)

alchemy_url <- ""
api_key <- "[API key]"
call <- "calls/text/TextGetRankedKeywords"
kwmode <- "strict"
sentiment <- "1"

url <- paste0(alchemy_url, call)

# the original function name was lost in publishing; get_keywords is assumed
get_keywords <- function(df, row, col){  
  POST(url,
       query = list(apikey = api_key,
                    text = substring(df[row, col], 1, 4000),
                    keywordExtractMode = kwmode,
                    sentiment = sentiment))
}

resultlist <- vector(mode = "list", length = nrow(casesA))

for ( i in 1:nrow(casesA) ) {  
  res <- content(get_keywords(casesA, i, "Details.of.the.impact"), useInternalNodes = TRUE)
  kw <- paste(xpathApply(res, "//text", xmlValue), 
              xpathApply(res, "//relevance", xmlValue), 
              xpathApply(res, "//score", xmlValue),
              xpathApply(res, "//type", xmlValue),
              sep = ";")
  resultlist[[i]] <- c(casesA[i, c("Case.Study.Id", "Institution", "Unit.of.Assessment", "Title")], kw) 
}

save(resultlist, file = "keywordsA.Rda")

There is a lesson to learn at this point: I found it surprisingly difficult to decide which data structure would be best for storing the query results. My pet bad habit is to transform data back and forth.

The query run time varied from 14 to 20 minutes per panel. Within roughly one hour, I had processed all the data and saved the results as R list objects. Then: lists to data frames, bind all together, and some cleaning.
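For illustration, here is a minimal sketch of that list-to-data-frame step, on a toy resultlist that mimics the structure saved above. The column names and values are my assumptions for the example, not necessarily the ones I used back then.

```r
# Toy resultlist: four metadata fields followed by "text;relevance;score;type" strings
resultlist <- list(
  c(list(1001, "Uni A", "UoA 1", "Title A"),
    "nurses;0.95;0.3;positive", "policy;0.90;0.0;neutral"),
  c(list(1002, "Uni B", "UoA 2", "Title B"),
    "bridges;0.88;-0.2;negative")
)

kw.df <- do.call(rbind, lapply(resultlist, function(x) {
  # split each keyword string back into its four parts
  kw <- do.call(rbind, strsplit(unlist(x[-(1:4)]), ";", fixed = TRUE))
  data.frame(id = x[[1]], University = x[[2]], UnitOfA = x[[3]], Title = x[[4]],
             Keyword = kw[, 1],
             Relevance = as.numeric(kw[, 2]),
             SentimentScore = as.numeric(kw[, 3]),
             SentimentType = kw[, 4],
             stringsAsFactors = FALSE)
}))
```

One keyword per row makes the later filtering and plotting straightforward.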

Next, the web application.

The Shiny Dashboard examples are very nice, particularly the one on streaming CRAN data, so I first made a quick sketch of a similar one. However, the D3.js bubbles htmlwidget by Joe Cheng wasn’t really suitable for my values. Instead, I decided to use a ggvis graph, reflecting the chosen Unit of Assessment. The x axis would show the keyword relevance; the y axis, the sentiment score. On the same “dashboard” view, two tables list some top items from both dimensions. And finally, behind a separate tab, the data as a DT table.

The R Shiny web application is here.

While browsing the results, I noticed that some studies were missing the name of the institution.

REF2014 id XML response

From the case study 273788 (one that lacked name) I could confirm that indeed the Institution element is sometimes empty, whereas Institutions/HEI/InstitutionName seemed a more reliable source.

The REST API to the rescue.

With one call to the API, I got the missing names. Then I just had to join them to the rest of the data.

ids <- unique(kw.df[kw.df$University == "",]$id)
idlist <- paste(ids, collapse = ",")

q <- httr::GET("", query = list(ID = idlist, format = "XML"))
doc <- httr::content(q, useInternalNodes=T)

ns <- c(ns="")
missingunis <- paste(xpathApply(doc, "//ns:CaseStudyId", xmlValue, namespaces = ns), xpathApply(doc, "//ns:InstitutionName", xmlValue, namespaces = ns), sep = ";")

missingdf <- data.frame(matrix(unlist(missingunis), nrow=length(missingunis), byrow=T))
names(missingdf) <- "id_univ"

missing <- missingdf %>%
  tidyr::extract(id_univ, c("id", "University"), "(.*);(.*)")

missing$id <- as.numeric(missing$id)

kw.df <- kw.df %>%
  left_join(missing, by = "id") %>%
  # where a name was fetched, use it; otherwise keep the original
  mutate(University = ifelse(!is.na(University.y), University.y, University.x)) %>%
  select(id, University, UnitOfA, Title, Keyword, Relevance, SentimentScore, SentimentType, Unique_Keywords)

Sentiment analysis sounds lucrative, but from the little I know about it, it doesn’t compute easily. A classic example is medicine, where the vocabulary inherently contains words that, without context, are negative, such as names of diseases. Yet it’s not difficult to think of text corpora that are more apt for sentiment analysis. Say, tweets by airline passengers.

There is also the question of the bigger picture. In Caveats and limitations of analysis (pp. 17-18), the authors of the report make an important note:

The sentiment in the language of the case studies is universally positive, reflecting its purpose as part of an assessment process

In my limited corpus the shares between neutral, positive and negative are as follows, according to AlchemyAPI:

# Neutral
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'neutral',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "46.2%"
# Positive
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'positive',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "40%"
# Negative
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'negative',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "13.7%"

In the web application, behind the tab Sentiment analysis stats, you’ll find more detailed statistics by Unit of Assessment. The cool new D3heatmap package by RStudio lets you interact with the heatmap: click a row or column to focus, or zoom in by drawing a rectangle.
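As a sketch, the heatmap input is just a matrix of sentiment-type shares by Unit of Assessment; the numbers below are invented for illustration, not the real app data.

```r
# Shares (%) of sentiment types per Unit of Assessment; values invented
m <- matrix(c(46, 40, 14,
              50, 35, 15),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("UoA 1", "UoA 2"),
                            c("neutral", "positive", "negative")))

# d3heatmap::d3heatmap(m)   # renders the interactive widget in the app
```
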

On topic modeling, the REF report observes:

One of the most striking observations from the analysis of the REF case studies was the diverse range of contributions that UK HEIs have made to society. This is illustrated in a heat map of 60 impact topics by the 36 Units of Assessment (UOAs) (Figure 8; page 33), and the six deep mine analyses in Chapter 4, demonstrating that such diverse impacts occur from a diverse range of study disciplines

To get a feel for LDA topic modeling, how to fit the model to a text corpus, and how to visualize the output, I followed a nicely detailed document by Carson Sievert. The only detail I changed in Carson’s code was that I also excluded all-digit terms, of which there were plenty in the corpus:

delD <- regmatches(names(term.table), regexpr("^[[:digit:]]+$", names(term.table), perl=TRUE))
stop_words_digits <- c(stop_words, delD)
del <- names(term.table) %in% stop_words_digits | term.table < 5 
term.table <- term.table[!del]

The REF report tells how the modelling ended up with 60 topics. Based on this “ground truth”, I decided first to process the four Main Panel study files separately and, on each run, set the number of topics (K) to 15. Please note that I don’t really know whether this makes sense. My goal here was mainly to have a look at how the LDAvis topic visualization tool works. For a quick introduction to LDAvis, I can recommend this 8 minute video.
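To make the input format concrete: Carson’s tutorial (and the lda package behind it) expects each document as a two-row matrix of zero-based vocabulary indices and term counts. A minimal base-R sketch of that preparation, on two made-up sentences:

```r
docs <- c("impact of research on policy",
          "research impact beyond academia")

# tokenize, build the term table, prune stop words (as in the snippet above)
tokens <- lapply(docs, function(d) strsplit(tolower(d), "[[:space:][:punct:]]+")[[1]])
term.table <- sort(table(unlist(tokens)), decreasing = TRUE)
stop_words <- c("of", "on", "the")
term.table <- term.table[!(names(term.table) %in% stop_words)]
vocab <- names(term.table)

# each document: row 1 = 0-based vocabulary indices, row 2 = counts
documents <- lapply(tokens, function(tk) {
  idx <- match(tk, vocab)       # NA for pruned terms
  idx <- idx[!is.na(idx)]
  counts <- table(idx)
  rbind(as.integer(names(counts)) - 1L,
        as.integer(counts))
})
```

This `documents`/`vocab` pair is what then goes into `lda.collapsed.gibbs.sampler()` with the chosen K.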

Here are the links to the respective LDAvis visualizations: Main Panel A, B, C and D.

With LDAvis, I like the way you can reflect on actual terms: why some exist in several topics, what those topics might be, etc. For example, within Panel B, the term internet occurs not only in topic 7 – which is clearly about Computer Science and Informatics – but also in topic 15, which is semantically (and visually) further away, and seems to be related to Electrical and Electronic Engineering, Metallurgy and Materials.

For the record, I also processed all the data in one go. For this I turned to the most robust academic R environment in Finland, AFAIK: CSC – IT Center for Science. My multi-processor batch job file is below; it is identical to CSC’s own example, except that I reserved 8 x 5GB of memory (or so I thought):

#!/bin/bash -l
#SBATCH -J r_multi_proc
#SBATCH -o output_%j.txt
#SBATCH -e errors_%j.txt
#SBATCH -t 06:00:00
#SBATCH --ntasks=8
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5000

module load r-env
srun -u -n 8 Rmpi --no-save < model.R

model.R is the R script where I first merge all the data from the Main Panels, and then continue in the same manner as with the individual Panel data. One exception: this time, 60 topics (K = 60).

For the sake of comparison, I started the same script on my Ubuntu Linux workstation, too. Interestingly, the run times there (4-core 2.6GHz processor, 8GB memory) and on Taito at CSC were roughly the same: Ubuntu 6:39 hours, Taito 5:54. Disclaimer: there’s a learning curve on Taito, as well as on R, so I may in fact have reserved only limited or basic resources, although I thought otherwise. Or perhaps the R script needs specific packages for a multi-core job? Maybe the script felt the presence of all the memory and cores in there but didn’t have instructions on how to make use of them? Frankly, I don’t have a clue what happened under the hood.

Unlike Carson in the review example I followed, or in his other paper Finding structure in xkcd comics with Latent Dirichlet Allocation, I didn’t visually inspect fit$log.likelihood. I couldn’t follow Carson’s code at that point, so I gave up, but saved the list just in case. It would’ve been interesting to see whether the plot would have indicated that the choice of 60 topics was on the right track.

Here is the LDAvis visualization of the 60-topic model. Some topics look a little strange; for example, topic 60 has many cryptic terms like c1, c2, etc. In the panel-wise visualizations, they appear in topic 14 of Panel C and topic 14 of Panel D. Are they remains of URLs, or what? Most likely, because I didn’t have any special handling for them. In pre-processing, I replaced punctuation with a space, meaning that URLs were split at slashes and full stops, at least. The example data Carson used in his document were film reviews; seldom any URLs in those, I guess. So I realized that I should have excluded all links to start with. Anyway, it looks as if, in the case studies, URLs were less frequently used in the natural sciences, dealt with by Panels A and B.
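A hedged sketch of the clean-up I should have done: drop anything that looks like a link before replacing punctuation. The regex is a rough approximation for illustration, not a full URL matcher.

```r
txt <- "See http://example.org/report.pdf and www.example.com for details."

# remove URL-ish tokens first, then replace remaining punctuation with spaces
txt <- gsub("(https?://|www\\.)\\S+", " ", txt)
txt <- gsub("[[:punct:]]", " ", txt)
```

Done in this order, no slash-split URL fragments like c1 or pdf would survive into the term table.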

Code of the R Shiny web application.

Posted by Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.
Coding , Data - Comments Off on Mining REF2014 impact case studies

Wikipedia outreach by field of science

Since the previous posting, the Scholarly Article Citations dataset on Figshare has been upgraded to also include DOIs. Great! I’d imagine that, unlike with PMIDs, there’d be more coverage from Aalto with DOIs.

Like in all altmetrics exercises so far, I first gathered all our articles published since 2007 and having a DOI, according to Web of Science. I also saved some more data on the articles such as the number of cites and the field(s) of science. This data I then joined with the Figshare set by the DOI field. Result: 193 articles.

Which research fields do these articles represent?

Web of Science makes use of five broad research areas. With some manual work, I made a lookup table that aggregates the various subfields onto these areas. Now I could easily add the area name to the dataset by looking at the subfield of the article. To make life easier, I picked only the first field if there was more than one. Within each area, I then calculated the average citation count, and also saved the number of articles by area (group size). With these two values, it was now possible to construct a small “network” graph: node size would tell about the article count within the area, and node colour the average number of citations. But how to keep the areas as ready-made clusters in the graph, without any edges?
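In code, the aggregation amounts to something like this. The data and the tiny lookup table are toy examples; the real lookup covered all the WoS subfields.

```r
articles <- data.frame(
  doi      = c("10.1/a", "10.1/b", "10.1/c"),
  subfield = c("Computer Science", "Physics", "Optics"),
  cites    = c(10, 50, 30),
  stringsAsFactors = FALSE)

# lookup table: subfield -> broad research area
lookup <- c("Computer Science" = "Technology",
            "Physics"          = "Physical Sciences",
            "Optics"           = "Physical Sciences")

articles$area <- unname(lookup[articles$subfield])

# per-area node attributes: average citations and group size
nodes <- aggregate(cites ~ area, data = articles, FUN = mean)
names(nodes)[2] <- "WoSCitesAvg"
nodes$Count <- as.vector(table(articles$area)[nodes$area])
```

`Count` drives the node size and `WoSCitesAvg` the node colour in the graph.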

A while ago I read about a neat trick by Clement Levallois on the Gephi forum. With the GeoLayout plugin, you can arrange nodes on the canvas based on their geocoordinates, using one of the available projections. As a bonus, the GEXF export format preserves this information in the x and y attributes of the viz:position element. This way, the JavaScript GEXF Viewer knows where to render the nodes.


What coordinates to use, where to get them, and how to use them? A brute-force solution was good enough in my case. One friendly stackoverflower mentioned that he had US state polygons in XML. Fine. What I did was simply choose five states from different corners of the US (to avoid collisions), and name each research area after one. For example, Technology got Alaska. Here’s the whole list from the code:

nodes.attr$state <- sapply(nodes.attr$agg, function(x) {
  if (x == 'Life Sciences & Biomedicine') "Washington" 
  else if (x == 'Physical Sciences') "Florida"
  else if (x == 'Technology') "Alaska"
  else if (x == 'Arts & Humanities') "North Dakota"
  else "Maine"
})

Then I made two new variables for latitude and longitude, and picked coordinates from the polygon data of that state, one by one. Because the polygon coordinates follow the border of the state, the shape of the Technology cluster follows the familiar, elongated shape of Alaska. A much more elegant solution would of course have been to choose random coordinates within each state.
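A sketch of that more elegant alternative: random points inside each state’s bounding box. This is only an approximation (points near a concave border may fall outside the actual polygon), and the quadrilateral below is invented, not from the real XML data.

```r
# Invented, rough quadrilateral standing in for a state polygon
alaska <- data.frame(lon = c(-170, -130, -130, -170),
                     lat = c(  55,   55,   72,   72))

set.seed(42)
n <- 5
pts <- data.frame(lon = runif(n, min(alaska$lon), max(alaska$lon)),
                  lat = runif(n, min(alaska$lat), max(alaska$lat)))
```
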

One of the standard use cases of Gephi is to let it choose colours for the nodes after the result of a community detection run. Because my data had everything pre-defined, including the communities aka research areas, I couldn’t use that feature. Instead, I used another Gephi plugin, Give color to nodes, by Clement Levallois, who seems to be very active within the Gephi plugin developer community, too. All I needed to do was give different hexadecimal RGB values to ranges of average citation counts. For a suitable colour scheme, I went to the nice visual aid by Mike Bostock that shows all the ColorBrewer schemes by Cynthia Brewer. When you click one of the schemes – I clicked RdYlGn – you get the corresponding hex values in the JavaScript console of the browser.


From the last line showing all the colours, I picked both end values, and three from the middle. To my knowledge, you cannot easily add a custom legend to the GEXF Viewer layout, so I’ll add it here below, raw, copied from the R code.

nodes.attr$Color <- sapply(nodes.attr$WoSCitesAvg, function(x) {
  if (x <= 10) "#a50026" 
  else if (x > 10 && x <= 50) "#fdae61" 
  else if (x > 50 && x <= 100) "#ffffbf"
  else if (x > 100 && x <= 200) "#a6d96a"
  else "#006837"
})

From the data, I finally saved two files: one for the nodes (Id, Label, Type), and one for the node attributes (Id, Count, WoSCitesAvg, Latitude, Longitude, Color). Then, to Gephi.

New project, Data Laboratory, and Import Spreadsheet. First the nodes, then the attributes, both as node tables. Note that in the node import you create new nodes, whereas in the attribute import you do not. Note also that in the attribute import, you need to choose the correct data types.


First I gave the nodes their respective color. The icon of the new color plugin sits on the left-hand side, vertical panel of the Graph window. Click it, and you’ll get a notification that the plugin will now start to color the nodes.


Nodes get their size – the value of Count in my case – from the Ranking tab.

When you install the GeoLayout plugin, it appears in the Layout dropdown list. I tried all the projections. My goal was just to place the different research area clusters as clearly apart from each other as possible, and Winkel tripel seemed to produce what I wanted.

Finally, bring the node labels visible by clicking the big T icon on the horizontal panel, make the size follow the node size (the first A icon from the left), and scale the font. A few nodes will inevitably be stacked on top of each other, so some Label Adjust from the Layout options is needed. Be aware, though, that it can do all too much cleaning and render the GeoLayout result obsolete. To prevent this from happening, lower the speed from the default to, say, 0.1. Now you can stop the adjusting in its tracks whenever necessary.


A few things left. Export to GEXF; install the Viewer; in the HTML file, point to the configuration file; in the configuration file, point to the GEXF file – and that’s it.

It’s hardly surprising, I think, that multidisciplinary research gets attention in Wikipedia. Articles in this field have also gathered quite a lot of citations. Does academic popularity increase Wikipedia citing, too? Note, though, that because the graph lacks the time dimension, you cannot say anything about the age of the articles. Citations tend to be slow in coming.

For those of you interested in the gore R code, it’s available as a GitHub Gist.


Cited in Wikipedia

In a recent blog posting Altmetric announced that they have added Wikipedia to their sources. Earlier this month, WikiResearch tweeted about a CC0-licensed dataset upload to Figshare, Scholarly article citations in Wikipedia. The dataset is a 35+ MB file of cites that carry a PubMed identifier, PMID.


From a university perspective, this is excellent news. Now we are able to start posing questions such as: “Have any articles from our University been cited in Wikipedia?”

To start a small forensic investigation in the case of Aalto University, I first need a list of all PMIDs connected to our affiliation. With the R jsonlite package, and great help from How to search for articles using the PubMed API, the following query (broken down into separate lines for clarity) returns 938 PMIDs as part of the JSON response.

pubmed.res <- fromJSON("

Note that querying the Affiliation field doesn’t necessarily capture all our authors. Also, I didn’t even dare to venture into different legacy name variants of our present Schools before Aalto University saw the light in 2010.

Now, with the two PMID lists at hand, and with an appropriate join function from the R dplyr package, it was easy to check which ones appear in both.

joined <- inner_join(pubmeddata, wikidata, by = c("id" = "id"))

Turns out, there is only one article from Aalto University that has been cited with its PubMed identifier, but on three different pages.

        id page_id              page_title    rev_id  timestamp type
1 21573056 4355487        Apolipoprotein B 448851207 2011-09-07 pmid
2 21573056   51521 Low-density lipoprotein 430508516 2011-05-23 pmid
3 21573056   92512             Lipoprotein 586010949 2013-12-14 pmid

From the page_title we can see that the subject of all three is obviously lipoproteins. For the article metadata and other interesting facts, we need to make two more API calls: one to Wikipedia and one, again, to PubMed. From Wikipedia, it’d be interesting to return the number of page watchers, if given. The size of the watcher community tells something about the commitment of the authors and readers of the page, don’t you think? From PubMed, I can get the title of the article, the date of publication, etc. How quickly after publication were the cites made?
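That last question can be answered directly from the dates at hand: the citation timestamps are in the table above, and the article’s publication date (9 May 2011, as PubMed will tell us) completes the sketch.

```r
pubDate   <- as.Date("2011-05-09")
wikiDates <- as.Date(c("2011-09-07", "2011-05-23", "2013-12-14"))

# days from publication to each Wikipedia citation
lag_days <- as.integer(wikiDates - pubDate)
```

The shortest lag turns out to be a mere two weeks.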

The URL stubs below make a two-part skeleton of the Wikipedia query. The PMID needs to be plugged in between. For more API details, see this documentation.

w.url.base <- ""
w.url.props <- "&prop=info&inprop=watchers&format=json&continue="

PubMed has got several different query modules and options. For the core metadata, I need to use the esummary module, and the abstract result type. Again, the query here is split into several lines to show the parts more clearly. The PMID is added to the very end.

pubmed.abs.url <- "

The article Three-dimensional cryoEM reconstruction of native LDL particles to 16Å resolution at physiological body temperature was published in PLoS One on 9th May 2011. Two weeks later, on the 23rd, it was cited on the Wikipedia page Low-density lipoprotein, in the chapter describing the structure of these complex particles composed of multiple proteins which transport all fat molecules (lipids) around the body within the water outside cells.


At the moment, the page has 121 watchers, a fair bit more than the other citing pages.

   pubDate wikipediaDate               pageTitle watchers
1 2011-05-09    2011-09-07        Apolipoprotein B      N/A
2 2011-05-09    2011-05-23 Low-density lipoprotein      121
3 2011-05-09    2013-12-14             Lipoprotein       50

When there are fewer than 30 watchers, as in the case of Apolipoprotein B, Wikipedia shows the phrase Fewer than 30 watchers on the web page, and the API returns no value.
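So the parsing needs a small guard. A sketch, assuming the parsed page object is a list that simply lacks the watchers element when Wikipedia withholds the number:

```r
# Return the watcher count, or NA when the API omitted it
get_watchers <- function(page) {
  if (is.null(page$watchers)) NA_integer_ else as.integer(page$watchers)
}

get_watchers(list(title = "Low-density lipoprotein", watchers = 121))  # 121
get_watchers(list(title = "Apolipoprotein B"))                         # NA
```
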

I also updated the experimental altmetrics web application with metrics courtesy of e.g. Altmetric – now also with Wikipedia! This version follows the same logic as the older one, only with fresh metrics.

Change either one of the axes to show Wikipedia, and you’ll notice a couple of things.


First, according to Altmetric, among Aalto articles since 2007, the highest Wikipedia score is 3. But this is not our lipoprotein article, published Open Access (marked with a golden stroke in the app) in PLoS One, but The Chemical Structure of a Molecule Resolved by Atomic Force Microscopy, published in Science. The lipoprotein article is among those with 2 Wikipedia cites, in the far left. Why the difference?

Second, there seem to be quite a few articles cited in Wikipedia. Why weren’t they in the Figshare dataset?

Altmetric is not just aggregating Wikipedia citations by PMID. Take for example the Science article. Follow the Wikipedia link in Mentioned by…, and from there, click either one of the timestamped links Added on…. You land on the diff page showing when the citation was added, by whom and how. Version control is a cool time machine. You can almost hear the author tap-tapping the text.

She doesn’t write a PMID. She writes a DOI.

It remains a minor mystery, though, why Altmetric doesn’t count the citation made to Apolipoprotein B on 7th September 2011. Maybe this is just a hiccup in the new process, something that Altmetric clearly acknowledges in the blog posting:

Also please note that to begin with we’ve had a big load of historical data to update all at once, so some of it is still feeding through to the details pages – but give it a week or so and it should all be up to date.

For those of you interested in R, the code on querying Wikipedia and PubMed is available as a GitHub Gist.



Bot

Twitter bots are something I like a lot. Up until now, I haven’t ventured into making one myself, which is a bit unfortunate because, earlier, the process of getting R/W access for a bot was much easier. Anyway, thanks to a level-headed blog posting by Dalton Hubble, I managed to grasp the basics. Timo Koola and Duukkis, central figures in the Finnish bot scene, and friendly guys at that, gave helpful advice too.

The last push forward came from abroad. In early October, Andy Teucher announced that he had published a bot that used the twitteR package, something I had in mind too. My turn!

Tweet by Andy Teucher on his R-based Twitter bot

Technicalities aside, there was also the question of What?

My pick was to start with job-related text snippets, touting altmetrics.

This summer, Impactstory showed a fair amount of generosity. Although their profiles ceased to be free of charge, they gave away free waivers – and I was lucky to get one for the experimental profile showing a sample of Aalto University research outputs.

The JSON file, linked from the landing page, is packed with data. From the many possibilities, I decided to first focus on what’s new since last week: new GitHub forks, video plays, Slideshare views, etc. The same information is visible on the Impactstory profile; it’s the small flag with a + sign and a number, just after the metrics label.

Impactstory flag showing new activity since last week

It took a while to get familiar enough with the JSON, and with how to traverse it in R. Modelling the tweet was another thing that needed a number of test rounds. The last mile included e.g. finding out how to fit in the URL of the product. Here, Andy’s GitHub repo for the Rare Bird Alert Twitter Bot came to the rescue: he had done the URL shortening with Bitly.

I had nurtured the fancy idea of (somehow) making use of Unicode block elements like Alex Kerin has done. However, after some experiments in my test account, I gave up. They might come in handy for showing a trend, though.

Test tweets with sparklines

Setting up a new Twitter account for the bot was of course easy, but even here was something new to learn. Did you know that you can have multiple, custom Gmail address variants within one account?

So, here is the WeeklyMetrics Twitter bot in its present shape. What is still in a test phase is running the R script successfully in batch. The cron scheduling seems to work OK but, as Eric Minikel points out at the end of his thorough posting, making the script robust probably calls for some tryCatch blocks. The Impactstory server could be temporarily on hold, and Bitly’s likewise. At the moment, my script happily assumes that all is safe and sound in the universe.
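A minimal sketch of such a guard, with a hypothetical fetch function standing in for the Impactstory (or Bitly) request:

```r
# Wrap a flaky network call so that one failure skips the run
# instead of crashing the whole batch job
safe_call <- function(f) {
  tryCatch(f(),
           error = function(e) {
             message("Call failed, skipping: ", conditionMessage(e))
             NULL
           })
}

safe_call(function() stop("Impactstory timeout"))  # returns NULL, logs a message
```

The bot can then simply post nothing that week when `safe_call()` returns NULL.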

The R code


Journal Metrics by Scopus

Some days ago, William Gunn, Head of Academic Outreach for Mendeley, tweeted about the availability of the latest Journal Metrics.

Tweet about Journal Metrics by Scopus

The experimental web application of (alt)metrics on our publications since 2007 now includes the latest SNIP, IPP and SJR values.

The number of years shown is solely a matter of layout; the table in the Data tab already has so many columns that you need to scroll horizontally to see them all, and that’s a pain. AFAIK the values in SNIP 2012 and SNIP 2013 are identical, so I left out the previous year.

Short excerpts from the web page of respective metrics (bolding is mine):

SNIP measures contextual citation impact by weighting citations based on the total number of citations in a subject field. The impact of a single citation is given higher value in subject areas where citations are less likely, and vice versa.

The IPP measures the ratio of citations in a year to scholarly papers published in the three previous years divided by the number of scholarly papers published in those same years. The IPP is not normalized for the subject field and therefore gives a raw indication of the average number of citation a publication published in the journal will likely receive.

SJR is a measure of scientific influence of scholarly journals that accounts for both the number of citations received by a journal and the importance or prestige of the journals where such citations come from. It is a variant of the eigenvector centrality measure used in network theory. Such measures establish the importance of a node in a network based on the principle that connections to high-scoring nodes contribute more to the score of the node

While at it, I also added the possibility of filtering publications by School when all items are visible, i.e. when nothing is selected from either the Title or the Journal dimension.

Please note that the citation count in this app comes from the Web of Science (WoS), not from Scopus.


On game articles and megajournals

On a hot work day like this, you need to take a relaxing break from time to time. While at it, why not play with the altmetrics web app? For example, you might ask yourself: are there any publications on games or gaming? Well yes, indeed there are.

But before we continue, some general remarks on two extra data sets.

The colour of the circle still shows the School, but some circles have got a yellowish outer layer (that’s stroke in ggvis parlance, and the colour is officially gold). This tells you that the item – article or journal, depending on the dimension – is Open Access; more precisely, one that belongs to the core journal collection of Web of Science and has a record in the Directory of Open Access Journals (DOAJ). I mentioned this in my previous posting, but let me repeat: the data is courtesy of Lib4RI from Switzerland. Danke schön!

The other new data set is of Finnish origin, by the Publication Forum (JUFO). JUFO is now among the values you can choose for either axis. What JUFO does is give a ranking to publications. You may have heard that the Ministry of Education and Culture is revising the funding model of universities as of 2015. Here’s a quote from the proposal:

The publication ratings system devised by the publication forum would be used as a rating in the computation of funding so that the quality perspective would be strengthened over a transitional period of 2015-2016 and would be even more pronounced as of 2017. During the transitional period the rating of publications would be executed so that in Level 0 the coefficient for peer reviewed scientific articles and publications would be 1, in Level 1 it would be 1.5 and in Levels 2 and 3 it would be 3.

As far as I know, Level 0 is not used any longer, or at least it is absent from the data. A few publications have no Level at all. If you see some circles floating in the outskirts of the plotted graph, those are the ones: without any value they don’t fit on the scale.

BTW, if you wonder what JUFO stands for, it comes from the Finnish word Julkaisufoorumi.

Back to games.

Citations and JUFO ranking of publications on game

Here we have, in a JUFO ranking vs WoS citation comparison, 10 items chosen from the Title dimension (type game). But wait, why only 8 circles? That’s because some share exactly the same values, and are thus plotted on top of each other.

Articles are spread across all JUFO Levels. One Open Access item is on Level 2.

All but two have been cited, and one sits clearly higher on the vertical axis. Hover over it. From the tooltip you’ll see the first words of the title. It’s about the psychophysiology of none other than James Bond, published in the journal Emotion in 2008.

What about altmetrics? Change the vertical axis to Twitter and horizontal to, say, Altmetric score.

Altmetric score vs Twitter on articles about games

Now the picture changes a little. The most tweeted article is the one published in PLOS One, which is Open Access. To me, this seems logical; it doesn’t usually make much sense to tweet a URL that cannot be accessed by all readers.

In their posting The 3 dangers of publishing in “megajournals”–and how you can avoid them, the ImpactStory blog refers to research findings which state that Open Access can get you more readers, and also more citations. Here, we cannot really see either, for two reasons.

First, to get any idea of the number of readers, i.e. page views and PDF downloads, you need to traverse the links all the way to the publication site and find the Metrics page or some such. It’s worth doing if you have time, though, because the information there gives much more context.

Second, there is only one Open Access item in my small sample, published last year. It has zero WoS citations so far, but you’d need to be a clairvoyant to say anything more. From my small PLOS ALM visualization you can check the (somewhat outdated) status of that article, Keep Your Opponents Close: Social Context Affects EEG and fEMG Linkage in a Turn-Based Computer Game.

Altmetrics of one PLOS One article

If you’re interested in Altmetric scores, follow the links to their article landing page, and have a look at the percentiles given there, in the Score tab. There is also a link to this helpful knowledgebase article on how the score is calculated. In this game sample, it looks as if the number of tweets affected the score, don’t you think?

Talking about megajournals, do we have examples of them in this web application? It turns out there are only two: PLOS ONE and Scientific Reports.

WoS vs Twitter of the so-called megajournal articles

It looks like being Open Access doesn’t harm impact, and why would it? There is a lot of activity here, both in the number of tweets and in citations. But citations and tweets don’t go hand in hand, something that has been shown before.

Posted by Tuija Sonkkila

Coding , Data - Comments Off on On game articles and megajournals

Open Access items and altmetrics

Lib4RI, aka the libraries of the four research institutes of the ETH Domain, announced some days ago that they have compiled a list of all Open Access journals included in the Science Citation Index Expanded 2013 (SCIE). Nice work! The matching is done against the Directory of Open Access Journals (DOAJ).

I added this information to the altmetrics web application of our publications since 2007. OA items are now indicated by a golden stroke around the circle.
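Under the hood, the flagging is a simple lookup against the Lib4RI/DOAJ-derived list. A sketch, with invented column names and data for illustration:

```r
# Flag a publication as OA when its journal ISSN appears in the
# DOAJ-derived list. Titles and ISSN sets here are made up.
pubs <- data.frame(title = c("A", "B", "C"),
                   issn  = c("1932-6203", "0028-0836", "2045-2322"),
                   stringsAsFactors = FALSE)
doaj <- c("1932-6203", "2045-2322")   # ISSNs of DOAJ-listed journals

pubs$oa <- pubs$issn %in% doaj
pubs$oa
# TRUE FALSE TRUE
```

In the ggvis plot, this logical column then drives the golden stroke around the circle.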

There are a few items among the PLOS articles published in the print versions of PLOS Computational Biology, PLOS Medicine and PLOS Biology, respectively. Unlike in the Lib4RI data, their print ISSN differs from the E-ISSN. Here are the correct ones: 1553-734X, 1549-1277, 1544-9173. I noticed these small typos thanks to the visualization.
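Such typos can also be caught programmatically: the last character of an ISSN is a check digit, computed from the first seven digits with weights 8 down to 2. A small validator of my own (not part of the app):

```r
# Validate an ISSN check digit: sum the first seven digits times weights
# 8..2, take (11 - sum mod 11) mod 11, and write 10 as "X".
issn_valid <- function(issn) {
  chars  <- strsplit(gsub("-", "", toupper(issn)), "")[[1]]
  digits <- as.integer(chars[1:7])
  check  <- (11 - sum(digits * 8:2) %% 11) %% 11
  chars[8] == ifelse(check == 10, "X", as.character(check))
}

sapply(c("1553-734X", "1549-1277", "1544-9173"), issn_valid)
# TRUE TRUE TRUE
```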

Posted by Tuija Sonkkila

Coding , Data - Comments Off on Open Access items and altmetrics

Software, slides, videos, e-prints and articles

There are now three separate and partly overlapping views on the (alt)metrics of Aalto University research outputs, published roughly during the last decade. The majority are from recent years, but a few date back to the early 2000s.

The variety and incompleteness reflect the fact that, in this context, data is both dynamic and somewhat arbitrary, for a variety of reasons. So everything presented here should be taken with a grain of salt. The NISO Altmetrics Standards Project White Paper, still open for comments today, is the definitive source for those of you interested in this topic.


The first view is an application around articles, based on data from Altmetric with citations and metadata from Web of Science. For more info, see this recent posting.


The second one is an ImpactStory profile that shows a sample of research artefacts other than articles: arXiv e-prints, Figshare datasets, GitHub repositories, SlideShare slides, and videos from Vimeo and YouTube. Except for the arXiv items, the gathering process was manual: I ran a Google search for each service against our site, and checked every hit so that non-Aalto authors were excluded. Still, some misunderstandings are of course possible.

The arXiv IDs are taken from the local Tenttu database, which so far serves as the official source of Aalto publications, although the situation is a bit fuzzy due to ongoing changes in the local infrastructure.

Tenttu didn’t know of any PMIDs. There were several Google hits to reference lists, but only three seemed to be by locals. However, I couldn’t be sure, so I left them out.

Here, the GitHub repos are particularly interesting. To my knowledge, this is the first time software and code made within the University are laid out like this, as a collection. Note that this is far from all the code there is; many research teams use local storage, or their cloud repository is somewhere other than GitHub, notably Bitbucket.

There are a few items that don’t have a title. As I understand it, for some reason there is no relevant data at the other end of the link, or the URL is dead. Due to some issues with the profile at the moment, I have not managed to delete these bogus ones (also, the whole Webpage section is redundant). Remember, ImpactStory Profiles is still a new and evolving service.

Profiles are mainly for individual researchers. For them, the job is easy. Rather than giving a list of item IDs like I did, they can automatically sync their accounts with, say, GitHub, and pull in other items with their personal ORCID identifier.

Based on a Google search, there are only two Aaltoians (is there such a word?) with an ImpactStory profile: Aki Vehtari and Enrico Glerean. Of course there can be others; they just don’t mention it anywhere.

BTW, if you haven’t seen the Happy birthday charts Aki has made with Andrew Gelman, do take a look. The New York Times ran a story on it in its Science section in December last year.

Talking about altmetrics, Enrico shows a great deal of activity. For example, he has added Altmetric badges to his web page.


The third view is an interactive web visualization built on PLOS ALM data. The DOI sample is the same as in the first view, but the metrics are newer, from today, 18 July. As a proof of concept, I also added Altmetric badges at the end of each line. A caveat: they appear only after you refresh the page. I need to fix that some day.

For some more info on this PLOS view, see this posting.

EDIT 25 July

As of yesterday evening, the ImpactStory profile is working much better, thanks to them listening to my bug report and fixing things.

Now the statistics are correct, showing how many live products/items there are, how many of these have any metrics at all, and how many of those have gathered some new metrics this week. Note that all of these are links.

ImpactStory profile main page

Another nicely done service is the traditional email digest that accompanies the profile. By default, the owner of the profile – me, in this case – gets it once a week. Look how ImpactStory gives feedback and encourages!

ImpactStory email digest 1/3

ImpactStory email digest 2/3

ImpactStory email digest 3/3

Posted by Tuija Sonkkila

Coding , Data - Comments Off on Software, slides, videos, e-prints and articles

Altmetrics from different angles

As of this week, there are exactly 900 articles published by Aalto University since 2007 that are recorded in the Web of Science by Thomson Reuters, have a DOI, and have some altmetrics aggregated by Altmetric. With this data at hand, I made an updated version of the interactive web application I wrote about previously.

Now you can

  • check any two metrics against each other in Compare. Select a metric for each axis, and see the result rendered nicely (thanks to ggvis, an R wrapper for Vega), coloured by School and decorated with a tooltip
  • have a quick overview in All metrics
  • in the Data tab, sort columns, search over all items, or proceed via the URL to the article landing page at Altmetric
  • query any DOI against the Altmetric API. In other words, this tab returns live data.
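The live-query tab boils down to one HTTP GET against Altmetric’s public v1 API, which returns JSON for a DOI. A sketch of the URL construction; the fetching and parsing can then be done with e.g. jsonlite:

```r
# Build the Altmetric v1 API URL for a DOI. Fetching the live JSON is then
# one call, e.g. jsonlite::fromJSON(altmetric_url("10.1126/science.1233775")).
altmetric_url <- function(doi) {
  paste0("http://api.altmetric.com/v1/doi/", doi)
}

altmetric_url("10.1126/science.1233775")
# "http://api.altmetric.com/v1/doi/10.1126/science.1233775"
```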

Due to some present shortcomings in Compare, note that

  • if you have only one item selected, it is rendered together with a dummy one. Please ignore that.
  • if you have zero items selected – which is the case at start, and any time you switch dimensions – all 900 items are plotted. This means, of course, that you cannot decipher all the points, because most of them are piled on top of each other near the origin. Also, no Data is available then, mainly because I suspect that the free beta account provided by RStudio would have a hard time serving it.

The app doesn’t tell quite the whole story.

First, to simplify things, I’ve changed all missing values to zeros. Second, some articles are in fact coauthored by several Schools, but in the process of duplicate removal only one unique item is left, so one School gets it all. Which one gets it is more or less random.
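Both simplifications fit in a couple of lines of R. A toy example with invented DOIs and counts:

```r
# Two simplifications made in the app: NAs become zeros, and duplicate
# rows (one article credited to several Schools) collapse to the first row.
df <- data.frame(doi    = c("10.1/a", "10.1/a", "10.1/b"),
                 school = c("SCI", "ENG", "CHE"),
                 tweets = c(5, 5, NA),
                 stringsAsFactors = FALSE)

df$tweets[is.na(df$tweets)] <- 0      # missing values -> zeros
df <- df[!duplicated(df$doi), ]       # one row per article; here SCI "gets it all"
df
```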

To get you going, here are some possible use cases.

First, let’s have a look at water. Always a good topic.

With Title selected, type water repeatedly and choose as many items as you can (10). At least three Schools deal with water. The default comparison is between WoS citations and Mendeley saves. In my selection, one article stands out in this context.

WoS vs Mendeley on some water-related articles

Hover over the lone blue circle in the upper right corner, and you’ll see a few characters of its title (Directional water co), the year of publication (2010), and the name of the journal (Nature). Switch over to the All metrics tab. Here, the graphics are courtesy of rCharts, another great R wrapper, this one for NVD3.

All metrics of some water-related articles

Because of the scale and the high values of both WoS and Mendeley, the other bars are barely discernible. Opt out of WoS and Mendeley by clicking the dot next to their legend. This way, the undergrowth shoots upwards.

Smaller metrics values of selected articles

Note that not all values are conceptually equal; NrOfAuthors isn’t a score but an attribute of the article, and the Altmetric score isn’t a count in the same manner as the number of tweets, for example.

Move over to Data. Sort the articles by year. In this sample, the one with the most citations is also the oldest, which makes sense.

Data tab of some water-related articles

Continue by sorting the columns one by one. You’ll see that many metrics are all zeros (remember what I said about zeros above), with Twitter often being an exception.

Next, some physics.

In their paper Astrophysicists on Twitter: An in-depth analysis of tweeting and scientific publication behavior ($), Haustein et al. show how, in their target group, there is a general negative relation between scientific impact and Twitter activity, although there are exceptions. Is that the case at Aalto University too? Switch the dimension to Journal, and pick the ones with astro in the name (there are 4). We make a quick & dirty assumption that the articles published in them are indeed astrophysics. Let the x-axis remain WoS, but change y to Twitter.
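Before eyeballing the plot, the claimed negative relation could also be quantified with a rank correlation. A toy example with invented counts, not the app’s real data:

```r
# Spearman rank correlation between citations and tweets; a negative value
# would match the pattern Haustein et al. report. The numbers are invented.
wos     <- c(120, 45, 30, 10, 2)
twitter <- c(0, 1, 2, 3, 40)

cor(wos, twitter, method = "spearman")
# negative: the most cited article is the least tweeted in this toy sample
```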

Journals in astrophysics and their WoS vs Twitter count

One article sits quite high up on the Twitter axis whereas the rest hang low, which seems to back up the research findings. Comparing Twitter against NrOfAuthors further reveals that the exception has only one author. A notable feature of astrophysics articles is that the number of authors is relatively big, in the hundreds rather than the tens.

Journals in astrophysics, NrOfAuthors vs Twitter count

What is the affiliation of the author? The colour of the circle tells us that the School is ENG, not SCI, where the major part of astrophysics research at Aalto University is done, in the Metsähovi Radio Observatory.

Out of curiosity, let’s have a closer look at the article. Go to the Data tab, sort by the Twitter column, pick the highest one, and follow the link to Altmetric. There, peek into the Demographics tab. We get a world map of tweeters, and the type they belong to, determined – as I understand it – by what people write in their Twitter profile. It seems that the article has got most attention among the general public.

Click the title at Altmetric. This brings you to the site of the journal, ScienceDirect by Elsevier. But wait – because here we have a non-open-access journal, you might not get any further. Instead, let’s turn to Google Scholar. From this profile page, we can finally see that the author is indeed not an astrophysicist, and – if you are interested in reading the article – that it is available as an Accepted Author Manuscript version.

There are 7 different authors among the 19 articles we have been investigating. Of them, the only one with a Twitter account is the one who turned out not to be an astrophysicist. However, as Haustein et al. note, Twitter is a versatile tool.

Researchers have multidimensional lives – they might be avid runners, foodies, sport fans or enjoy stamp collecting – and their discussions on Twitter might include these various activities (Bauer, 2013). Hence, at least in the case of the astrophysicists analyzed in this paper, researchers’ activity on Twitter should not be considered as purely scientific, as Twitter is not restricted to this single type of communication.

I don’t have a clear picture of how much Twitter is used at our University, although I have a hunch: not very much. A good starting point for digging deeper into this topic could be the Aalto list maintained by Mikko Heiskala. BTW, Mikko puts out an Aalto University Weekly, worth a look.

Rewinding back to water for a moment: the 10 articles I mentioned above have 15 different authors from our University. All but two come from the cross-disciplinary Water & Development Research Group. Two of these 15 authors have a Twitter account, and one of the two accounts seems to be for private-life activities only (sport, in this case).

We have looked at a tiny sample of Aalto University researchers. Among them, the only somewhat active, open Twitter account has an About text that says

materials researcher inspired by nature, working on water-repellent surfaces and other functional materials.

A short side-step. Before I tell you who this person is, I need to ask myself: is it OK to do so? In his blog posting Twitter as Public Evidence and the Ethics of Twitter Research, Ernesto Priego summarizes recent discussion on this issue, especially geared to situations where you gather Twitter data from the streaming API or some other source where you have access to potentially millions of tweets.

Well, I’m fairly confident that Robin Ras doesn’t mind me linking to his Twitter site.

Next: homework! Take a look at the altmetrics of Switchable Static and Dynamic Self-Assembly of Magnetic Droplets on Superhydrophobic Surfaces, an article that Ras has coauthored. Select it by name from the Title, or make a Query with the DOI 10.1126/science.1233775

One final note. If you look closely, there are in fact two false positives among my water articles here. This is due to inevitable but embarrassing bugs in the algorithm that tries to match affiliations. It will be fixed. Sorry about that.

Posted by Tuija Sonkkila

Coding , Data - Comments Off on Altmetrics from different angles

Status update

Just a brief note that the new altmetrics app now also has a tab for a ggvis graph. I made it to show the comparison between the number of authors and Web of Science citations by Thomson Reuters. AFAIK you cannot make the coordinates dynamic here, as you can on the top tab, which uses the ggplot2 library.
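The crux of dynamic coordinates is that the UI hands over variable names as strings, and the server then picks those columns from the data. A base-R sketch of that step (the metric names and values here are invented):

```r
# Pick the x and y columns by name, as a Shiny select input would supply them.
metrics <- data.frame(WoS = c(10, 2, 0), Twitter = c(1, 0, 8), Mendeley = c(5, 3, 1))

pick_axes <- function(df, xvar, yvar) {
  data.frame(x = df[[xvar]], y = df[[yvar]])
}

pick_axes(metrics, "WoS", "Twitter")
```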

I struggled for a while with ggvis dynamics, but thanks to excellent advice, managed to solve the problem.

There are some cosmetic issues here and there. With only one item, the ggvis plot looks a bit odd. Also, at least with the Chrome browser, tooltips tend to float around until the graph is rendered again and some tooltip is fired. Firefox does not show tooltips in the All metrics tab, where the library is rCharts. The table on the Data tab is very wide due to the several variables/columns, etc.

Note that I’ve used one of the most basic Shiny UI layouts. The possibilities are almost limitless, though, because the interface can be coded from scratch.

Please keep in mind that the library development work of R is done by a busy community, and bugs do occur. Especially the ggvis library is very new and by no means ready.

Posted by Tuija Sonkkila

Coding - Comments Off on Status update