Mining REF2014 impact case studies

The British 2014 Research Excellence Framework is a landmark in recent European academic life, whichever angle you look at it from. One of its novelties was the Impact Case Studies, an impressive collection of 6,975 items which

showcase how research undertaken in UK Higher Education Institutions (HEIs) over the past 20 years has benefited society beyond academia – whether in the UK or globally

It is a friendly gesture on REF’s part that there is a searchable interface to the studies and a REST API, and that you are even allowed to download the data, all under a Creative Commons Attribution 4.0 licence.

The initial report, titled The nature, scale and beneficiaries of research impact, is coauthored by King’s College London and Digital Science. A 70+ page analysis, it is nicely sprinkled with graphs and tables. In Headline findings, the authors make humble observations about their task:

Text-mining itself can be dangerous and dirty: dangerous, as it is possible to misinterpret information in the text; and dirty, as it involves a lot of experimentation and trying by doing. Given the size of the dataset (more than 6 million words in the ‘Details of the impact’ section of the case studies), text-mining was useful in producing an analysis of the general patterns and themes that could be described in the case studies.

Briefly on their methodologies, as explained on page 16:

  • topic modelling with Apache Mallet Toolkit Latent Dirichlet Allocation (LDA) algorithm
  • keyword-in-context (KWIC) to develop an alphabetical list of keywords displayed with their surrounding text (word) context
  • information extraction to identify references to geographic locations
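
To give a feel for the second item, keyword-in-context can be sketched in a few lines of base R. This is only a minimal illustration and not the report's implementation; the helper name kwic, its window parameter and the sample sentence are assumptions made up for the example.

```r
# Minimal keyword-in-context sketch (hypothetical helper, base R only)
kwic <- function(text, keyword, window = 3) {
  # Lower-case the text and split on anything that is not a letter or digit
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  hits <- which(words == tolower(keyword))
  # For each hit, keep up to `window` words on either side
  vapply(hits, function(i) {
    left  <- tail(words[seq_len(i - 1)], window)
    right <- head(words[-seq_len(i)], window)
    trimws(paste(paste(left, collapse = " "),
                 paste0("[", words[i], "]"),
                 paste(right, collapse = " ")))
  }, character(1))
}

sample_text <- "Research on cancer screening improved cancer survival rates"
kwic(sample_text, "cancer", window = 2)
```

Each returned string shows one occurrence of the keyword in brackets with its neighbouring words, which is the raw material for an alphabetical KWIC list.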

What I’m about to present below is a sign of how the REF impact stories made an impact on me. I became curious about keywords and topics: how they are mined, and what results you might get. Note that my approach stays roughly at the mineral extraction level. Still, even with only modest knowledge and even more modest understanding of text-mining principles, one can experiment, thanks to all the code, tutorials and blogs out there by those who know what they are doing.

Some time ago I saw a press release of IBM’s acquisition of AlchemyAPI,

leading provider of scalable cognitive computing application program interface (API) services and deep learning technology

IBM has plans to integrate AlchemyAPI’s technology into the core Watson platform. This is an interesting move. Who knows, perhaps the already famous IBM Chef Watson will soon get even smarter sister implementations.

The list of AlchemyAPI calls is long, and the documentation is quite good, as far as I can tell. I haven’t gone through all the calls, but the ones I’ve tried behaved as documented. The one best suited to my needs seemed to be Text API: Keyword / Term Extraction. Thanks to the helpful and prompt AlchemyAPI support, my newly minted API key got promoted to an academic one. With the help of a somewhat higher daily request limit, I was able to make all the API calls I needed within a single day.

First, REF data.

Panel by panel, one Excel file at a time, I downloaded the files and imported them into RStudio. The only data cleanup I did at this stage was to remove newline characters from the Details of the impact text blocks.


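library(XLConnect)  # provides loadWorkbook() and readWorksheet()

# Main Panel A
docA <- loadWorkbook("CaseStudiesPanelA.xlsx")
casesA <- readWorksheet(docA,
                        sheet = "CaseStudies",
                        header = TRUE)
# Replace newlines inside the impact text with spaces
casesA$Details.of.the.impact <- gsub("\n", " ", casesA$Details.of.the.impact)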

The TextGetRankedKeywords API call has several parameters. I decided to proceed with default values, with the exceptions of keywordExtractMode (strict) and Sentiment (1, i.e. return sentiment score and type). The default maximum number of keywords is 50.

I made a few attempts to send the entire Details of the impact text, but stumbled on a URI too long exception. With some trial and error, I ended up cutting the text strings down to 4000 characters each.


library(httr)  # POST(), content()
library(XML)   # xpathApply(), xmlValue()

alchemy_url <- ""
api_key <- "[API key]"
call <- "calls/text/TextGetRankedKeywords"
kwmode <- "strict"
sentiment <- "1"

url <- paste0(alchemy_url, call)

# Send the first 4000 characters of one text cell to the keyword call.
# The function name query_keywords is a placeholder.
query_keywords <- function(df, row, col){
  r <- POST(url,
            query = list(apikey = api_key,
                         text = substring(df[row,col],1,4000),
                         keywordExtractMode = kwmode,
                         sentiment = sentiment))
  r
}

resultlist <- vector(mode = "list", length = nrow(casesA))

for ( i in seq_len(nrow(casesA)) ) {
  res <- content(query_keywords(casesA, i, "Details.of.the.impact"), useInternalNodes=T)
  # One "text;relevance;score;type" string per keyword
  kw <- paste(xpathApply(res, "//text", xmlValue),
              xpathApply(res, "//relevance", xmlValue),
              xpathApply(res, "//score", xmlValue),
              xpathApply(res, "//type", xmlValue),
              sep = ";")
  resultlist[[i]] <- c(casesA[i, c("Case.Study.Id", "Institution", "Unit.of.Assessment", "Title")], kw)
}

save(resultlist, file = "keywordsA.Rda")
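
The call above only ever sees the first 4000 characters of each text. One way to cover the whole text would be to split it into chunks of at most 4000 characters and send each chunk separately. A minimal base R sketch follows; split_into_chunks is a hypothetical helper that is not part of the original script, and it may cut a word in half at a chunk boundary.

```r
# Split a single string into consecutive chunks of at most `size` characters
# (hypothetical helper, not part of the original script)
split_into_chunks <- function(text, size = 4000) {
  n <- nchar(text)
  if (is.na(n) || n == 0) return(character(0))
  starts <- seq(1, n, by = size)
  substring(text, starts, pmin(starts + size - 1, n))
}

long_text <- paste(rep("a", 9000), collapse = "")
chunks <- split_into_chunks(long_text)
nchar(chunks)
```

Keywords returned for different chunks of the same case study would then need to be merged, for instance by keeping the highest relevance per keyword.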

There is a lesson to learn at this point: I found it relatively difficult to decide which data structure would be best for storing the query results. My pet bad habit is to transform data back and forth.

The query run time varied from 14 to 20 minutes. Within roughly one hour, I had processed all the data and saved the results as R list objects. Then I converted the data to data frames, bound them all together, and did some cleaning.
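
The list-to-data-frame step could look roughly like the sketch below. It assumes a simplified layout in which each list element holds an id, a title and a vector of "text;relevance;score;type" strings; the toy list, the element names and the helper to_df are all made up for illustration and do not mirror the saved objects exactly.

```r
# Toy stand-in for the saved results: one element per case study,
# keywords stored as "text;relevance;score;type" strings (assumed layout)
toy_results <- list(
  list(id = 1, Title = "Study one",
       kw = c("cancer screening;0.95;-0.31;negative",
              "public health;0.71;0.42;positive")),
  list(id = 2, Title = "Study two",
       kw = "river restoration;0.88;0.15;positive")
)

# Turn one list element into a data frame with one row per keyword
to_df <- function(item) {
  parts <- do.call(rbind, strsplit(item$kw, ";", fixed = TRUE))
  data.frame(id = item$id,
             Title = item$Title,
             Keyword = parts[, 1],
             Relevance = as.numeric(parts[, 2]),
             SentimentScore = as.numeric(parts[, 3]),
             SentimentType = parts[, 4],
             stringsAsFactors = FALSE)
}

# Bind the per-study data frames into one long table
kw_toy <- do.call(rbind, lapply(toy_results, to_df))
```

A long table with one row per keyword is convenient both for plotting and for the joins done later.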

Next, the web application.

The Shiny Dashboard examples are very nice, particularly the one on streaming CRAN data, so I first made a quick sketch of a similar one. However, the D3.js bubbles htmlwidget by Joe Cheng wasn’t really suitable for my values. Instead, I decided to use a ggvis graph reflecting the chosen Unit of Assessment. The x axis would show the keyword relevance and the y axis the sentiment score. On the same “dashboard” view, two tables list some top items from both dimensions. And finally, behind a separate tab, the data is available as a DT table.
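
The two "top items" tables boil down to filtering by the chosen Unit of Assessment and sorting by one of the two dimensions. A minimal base R sketch of that logic, with toy data and a hypothetical helper top_items (the application itself may do this differently):

```r
# Toy keyword table; the column names follow the ones used in this post
toy_kw <- data.frame(
  UnitOfA = c("Physics", "Physics", "Physics", "History"),
  Keyword = c("laser", "detector", "funding cuts", "archive"),
  Relevance = c(0.9, 0.6, 0.4, 0.8),
  SentimentScore = c(0.2, 0.7, -0.5, 0.1),
  stringsAsFactors = FALSE
)

# Return the n highest rows of one unit, ranked by the column named in `by`
top_items <- function(df, unit, by, n = 2) {
  sub <- df[df$UnitOfA == unit, ]
  head(sub[order(sub[[by]], decreasing = TRUE), ], n)
}

top_items(toy_kw, "Physics", "Relevance")
top_items(toy_kw, "Physics", "SentimentScore")
```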

The R Shiny web application is here.

While browsing the results, I noticed that some studies were missing the name of the institution.

REF2014 id XML response

From case study 273788 (one that lacked the name) I could confirm that the Institution element is indeed sometimes empty, whereas Institutions/HEI/InstitutionName seemed a more reliable source.

The REST API to the rescue.

With one call to the API, I got the missing names. Then I just had to join them to the rest of the data.

library(XML)    # xpathApply(), xmlValue()
library(dplyr)  # %>%, left_join(), mutate(), select()

# Ids of the case studies that lack an institution name
ids <- unique(kw.df[kw.df$University == "",]$id)
idlist <- paste(ids, collapse = ",")

q <- httr::GET("", query = list(ID = idlist, format = "XML"))
doc <- httr::content(q, useInternalNodes=T)

ns <- c(ns="")
missingunis <- paste(xpathApply(doc, "//ns:CaseStudyId", xmlValue, namespaces = ns), xpathApply(doc, "//ns:InstitutionName", xmlValue, namespaces = ns), sep = ";")

missingdf <- data.frame(matrix(unlist(missingunis), nrow=length(missingunis), byrow=T))
names(missingdf) <- "id_univ"

# Split the "id;name" strings into two columns
missing <- missingdf %>%
  tidyr::extract(id_univ, c("id", "University"), "(.*);(.*)")

missing$id <- as.numeric(missing$id)

# Prefer the name fetched from the API whenever there is one
kw.df <- kw.df %>%
  left_join(missing, by = "id")  %>%
  mutate(University = ifelse(!is.na(University.y), University.y, University.x)) %>%
  select(id, University, UnitOfA, Title, Keyword, Relevance, SentimentScore, SentimentType, Unique_Keywords)

Sentiment analysis sounds tempting, but from the little I know about it, it doesn’t compute easily. A classic example is medicine, where the vocabulary inherently contains words that are negative without context, such as names of diseases. Yet it’s not difficult to think of text corpora that lend themselves better to sentiment analysis. Say, tweets by airline passengers.

There is also the question of the bigger picture. In Caveats and limitations of analysis (pp. 17-18), the authors of the report make an important note:

The sentiment in the language of the case studies is universally positive, reflecting its purpose as part of an assessment process

In my limited corpus the shares between neutral, positive and negative are as follows, according to AlchemyAPI:

# Neutral
>paste0(round((nrow(kw.df[kw.df$SentimentType == 'neutral',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "46.2%"
# Positive
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'positive',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "40%"
# Negative
> paste0(round((nrow(kw.df[kw.df$SentimentType == 'negative',])) / nrow(kw.df) * 100, digits=1), "%")
[1] "13.7%"
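
The three near-identical expressions above can also be collapsed into a single table call. A minimal sketch with a made-up vector of sentiment types standing in for kw.df$SentimentType, so the numbers below are not the ones from the corpus:

```r
# Toy stand-in for kw.df$SentimentType (values are made up)
toy_types <- c("neutral", "positive", "neutral", "negative", "positive",
               "neutral", "positive", "neutral", "neutral", "positive")

# Share of each sentiment type in percent, rounded to one decimal
shares <- round(prop.table(table(toy_types)) * 100, digits = 1)
shares
```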

In the web application, behind the tab Sentiment analysis stats, you’ll find more detailed statistics by Unit of Assessment. The cool new d3heatmap package by RStudio lets you interact with the heatmap: click a row or column to focus, or zoom in by drawing a rectangle.
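
Statistics by Unit of Assessment amount to a cross-tabulation with row-wise percentages, which yields the kind of numeric matrix a heatmap is drawn from. A minimal base R sketch with toy data; how the application actually prepares its heatmap input is not shown in this post, so treat the layout as an assumption.

```r
# Toy data: one row per keyword, with its unit and sentiment type
toy_sent <- data.frame(
  UnitOfA = c("Physics", "Physics", "History", "History", "History", "Physics"),
  SentimentType = c("positive", "neutral", "negative", "neutral", "neutral", "positive"),
  stringsAsFactors = FALSE
)

# Rows are units, columns are sentiment types, cells are row percentages
by_unit <- round(
  prop.table(table(toy_sent$UnitOfA, toy_sent$SentimentType), margin = 1) * 100,
  digits = 1
)
by_unit
```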

On topic modeling, the REF report observes:

One of the most striking observations from the analysis of the REF case studies was the diverse range of contributions that UK HEIs have made to society. This is illustrated in a heat map of 60 impact topics by the 36 Units of Assessment (UOAs) (Figure 8; page 33), and the six deep mine analyses in Chapter 4, demonstrating that such diverse impacts occur from a diverse range of study disciplines

To get a feel for LDA topic modeling, how to fit the model to a text corpus, and how to visualize the output, I followed a nicely detailed document by Carson Sievert. The only detail I changed in Carson’s code was that I also excluded all-digit terms, of which there were plenty in the corpus:

delD <- regmatches(names(term.table), regexpr("^[[:digit:]]+$", names(term.table), perl=TRUE))
stop_words_digits <- c(stop_words, delD)
del <- names(term.table) %in% stop_words_digits | term.table < 5 
term.table <- term.table[!del]
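
The same filter can be tried out on a toy term table, here with grepl() marking the all-digit terms directly. The term counts and the short stop word list below are made up for the example.

```r
# Toy term frequencies (made-up values); names are the terms
toy_terms <- c(impact = 12, `2014` = 30, research = 25, `42` = 7, c1 = 9, rare = 2)
toy_stop_words <- c("the", "and")

# TRUE for terms that consist of digits only
all_digits <- grepl("^[[:digit:]]+$", names(toy_terms))

# Drop stop words, all-digit terms and terms occurring fewer than 5 times
del <- names(toy_terms) %in% toy_stop_words | all_digits | toy_terms < 5
toy_terms <- toy_terms[!del]
names(toy_terms)
```

Note that a mixed term such as c1 survives this filter, since it is not made of digits alone.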

The REF report tells how the modelling ended up with 60 topics. Based on this “ground truth”, I decided to first process the four Main Panel study files separately and, on each run, fit a topic model with 15 topics (K = 15). Please note that I don’t really know if this makes sense. Yet, my goal here was mainly to have a look at how the LDAvis topic visualization tool works. For a quick introduction to LDAvis, I can recommend this 8-minute video.

Here are the links to the respective LDAvis visualizations: Main Panel A, B, C and D.

With LDAvis, I like the way you can reflect on actual terms: why some exist in several topics, what these topics might be, etc. For example, within Panel B, the term internet occurs not only in topic 7 – which is clearly about Computer Science and Informatics – but also in topic 15, which is semantically (and visually) further away, and seems to be related to Electrical and Electronic Engineering, Metallurgy and Materials.

For the record, I also processed all the data in one go. For this I turned to the most robust academic R environment in Finland AFAIK, CSC – IT Center for Science. My multi-processor batch job file is below; it’s identical to CSC’s own example except that I reserved 8 x 5GB memory (so I thought):

#!/bin/bash -l
#SBATCH -J r_multi_proc
#SBATCH -o output_%j.txt
#SBATCH -e errors_%j.txt
#SBATCH -t 06:00:00
#SBATCH --ntasks=8
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=5000

module load r-env
srun -u -n 8 Rmpi --no-save < model.R

model.R is the R script where I first merge all the data from the Main Panels and then continue in the same manner as I did with the individual Panel data. One exception: this time, the model has 60 topics (K = 60).

For the sake of comparison, I started the same script on my Ubuntu Linux workstation, too. Interestingly, the run time on it (4-core 2.6GHz processor, 8GB memory) and on Taito at CSC was roughly the same: 6:39 hours on Ubuntu, 5:54 on Taito. Disclaimer: there’s a learning curve on Taito – as well as on R! – so I may in fact have reserved only limited or basic resources, although I thought otherwise. Or perhaps the R script needs to use specific packages for a multi-core job? Maybe the script felt the presence of all the memory and cores in there but didn’t have instructions on how to make use of them? Frankly, I don’t have a clue what happened under the hood.
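
One possible explanation, offered as an assumption rather than a diagnosis: ordinary R code generally runs on a single core unless the script explicitly distributes the work, for example with the parallel package that ships with R. A minimal sketch of what that looks like; fit_panel is a hypothetical stand-in for an expensive task such as fitting one panel's model, not the actual LDA fit.

```r
library(parallel)

# Hypothetical stand-in for an expensive, independent task
fit_panel <- function(panel) {
  paste("fitted model for panel", panel)
}

panels <- c("A", "B", "C", "D")

# Run the tasks in forked worker processes
# (Unix-alikes only; on Windows mc.cores must be 1)
fits <- mclapply(panels, fit_panel, mc.cores = 2)
unlist(fits)
```

This pattern only helps when the work splits into independent pieces, such as the four panel-wise models; a single long sampling run does not get faster just because more cores are reserved.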

Unlike Carson in the review example I used, or in his other paper Finding structure in xkcd comics with Latent Dirichlet Allocation, I didn’t visually inspect fit$log.likelihood. I couldn’t follow Carson’s code at that point, so I gave up, but saved the list just in case. It would’ve been interesting to see whether the plot indicated that the choice of 60 for the number of topics was on the right track.

Here is the LDAvis visualization of the 60-topic model. Some topics look a little strange; for example, topic 60 has many cryptic terms like c1, c2, etc. In the panel-wise visualizations, they are located in topic 14 on Panel C and topic 14 on Panel D. Are they remains of URLs or what? Most likely, because I didn’t have any special handling for them. In pre-processing, I replaced punctuation with a space, meaning e.g. that URLs were split at slashes and full stops, at least. The example data that Carson used in his document were film reviews. Seldom URLs in those, I guess. So I realized that I should’ve excluded all links to start with. Anyway, it looks as if URLs were used less frequently in the case studies of the natural sciences, dealt with by Panels A and B.
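
Excluding links before the punctuation step could be sketched as below. The regular expression is a rough assumption about what counts as a URL (anything starting with http://, https:// or www. up to the next whitespace), and strip_urls and clean_text are hypothetical helpers, not the pre-processing code used for the models above.

```r
# Remove anything that looks like a URL (rough heuristic)
strip_urls <- function(x) {
  gsub("(https?://|www\\.)[^[:space:]]+", " ", x)
}

# Strip URLs first, then replace punctuation by spaces and tidy whitespace
clean_text <- function(x) {
  x <- strip_urls(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(tolower(x))
}

clean_text("See http://example.org/c1/c2.html for details, or www.example.com.")
```

With the order reversed, the path segments c1 and c2 would survive as separate terms, which matches the cryptic terms seen in the visualization.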

Code of the R Shiny web application.

Posted by Tuija Sonkkila

About Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.