In a recent blog posting, Altmetric announced that they have added Wikipedia to their sources. Earlier this month, WikiResearch tweeted about a CC0-licensed dataset uploaded to Figshare, Scholarly article citations in Wikipedia. The dataset is a 35+ MB file of cites that carry a PubMed identifier (PMID).
From a university perspective, this is excellent news. Now we are able to start posing questions such as “Have any articles from our University got any cites in Wikipedia?”
To start a small forensic investigation in the case of Aalto University, I first need a list of all PMIDs connected to our affiliation. With the R jsonlite package, and great help from How to search for articles using the PubMed API, the following query (broken down into separate lines for clarity) returns 938 PMIDs as part of the JSON response.
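library(jsonlite)
# paste0() glues the parts into one URL, so the line breaks stay out of the query string
pubmed.res <- fromJSON(paste0(
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
  "?db=pubmed",
  "&retmax=2000",
  "&retmode=json",
  "&term=aalto+university[Affiliation]"))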
Note that querying the Affiliation field doesn’t necessarily capture all our authors. Also, I didn’t even dare to venture into different legacy name variants of our present Schools before Aalto University saw the light in 2010.
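As a minimal sketch of how the PMIDs can be pulled out of the parsed response: the helper name extract_pmids and the sample JSON below are made up for illustration, and the esearchresult and idlist element names are my assumption about the layout of the esearch JSON, so check them against a real response.

```r
library(jsonlite)

# Hypothetical helper: return the PMIDs of a parsed esearch response as a
# character vector. Assumes the layout esearchresult -> idlist.
extract_pmids <- function(res) {
  ids <- res$esearchresult$idlist
  if (is.null(ids)) character(0) else as.character(ids)
}

# Shortened, made-up response, only to show the shape (the second ID is a placeholder).
sample_json <- '{"esearchresult": {"count": "2", "idlist": ["21573056", "00000000"]}}'
sample_res <- fromJSON(sample_json)

# One-column data frame, ready to be joined on "id"
pubmeddata_demo <- data.frame(id = extract_pmids(sample_res),
                              stringsAsFactors = FALSE)
```

With a live response, extract_pmids(pubmed.res) would take the place of the sample.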
Now, with the two PMID lists at hand, and with an appropriate join function from the R dplyr package, it was easy to check which PMIDs appear in both.
joined <- inner_join(pubmeddata, wikidata, by = c("id" = "id"))
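One practical detail of a join like this is that the key column needs the same type on both sides. The sketch below uses two small made-up data frames (the _demo names, the placeholder IDs and "Some other page" are invented) and converts the integer key to character before joining.

```r
library(dplyr)

# Made-up stand-ins for the two PMID lists
pubmeddata_demo <- data.frame(id = c("21573056", "00000000"),
                              stringsAsFactors = FALSE)
wikidata_demo <- data.frame(
  id = c(21573056L, 21573056L, 99999999L),
  page_title = c("Low-density lipoprotein", "Lipoprotein", "Some other page"),
  stringsAsFactors = FALSE
)

# Align the key types: character on both sides
wikidata_demo$id <- as.character(wikidata_demo$id)

# Keep only the rows whose id occurs in both data frames
joined_demo <- inner_join(pubmeddata_demo, wikidata_demo, by = "id")
```

An article cited on several pages shows up once per citing page, which is why one PMID can produce more than one row.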
Turns out, there is only one article from Aalto University that has been cited with its PubMed identifier, but on three different pages.
        id page_id              page_title    rev_id  timestamp type
1 21573056 4355487        Apolipoprotein B 448851207 2011-09-07 pmid
2 21573056   51521 Low-density lipoprotein 430508516 2011-05-23 pmid
3 21573056   92512             Lipoprotein 586010949 2013-12-14 pmid
From the page_title we can see that the subject in all is obviously lipoproteins. For the article metadata and other interesting facts, we need to make two more API calls: one to Wikipedia and one, again, to PubMed. From Wikipedia, it’d be interesting to return the number of page watchers, if given. The size of the watcher community tells something about the commitment of the authors and readers of the page, don’t you think? From PubMed, I can get the title of the article, the date of publication etc. How quickly after publication were the cites made?
The URL stubs below make a two-part skeleton of the Wikipedia query. The page ID (page_id in the dataset) needs to be plugged in between. For more API details, see this documentation.
w.url.base <- "http://en.wikipedia.org/w/api.php?action=query&pageids="
w.url.props <- "&prop=info&inprop=watchers&format=json&continue="
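Here is a rough sketch of plugging a page ID between the two stubs and reading the watcher count from the answer. The helper names wiki_url and get_watchers are hypothetical, and the sample JSON is a trimmed, hand-written imitation of the query -> pages -> page ID -> watchers layout I would expect from the API, not a recorded response.

```r
library(jsonlite)

w.url.base <- "http://en.wikipedia.org/w/api.php?action=query&pageids="
w.url.props <- "&prop=info&inprop=watchers&format=json&continue="

# Hypothetical helper: full query URL for one page ID
wiki_url <- function(page_id) paste0(w.url.base, page_id, w.url.props)

# Hypothetical helper: watcher count from a parsed response, or NA when the
# API leaves the field out
get_watchers <- function(res, page_id) {
  w <- res$query$pages[[as.character(page_id)]]$watchers
  if (is.null(w)) NA_integer_ else as.integer(w)
}

# Hand-written sample in the assumed layout
sample_json <- '{"query": {"pages": {
  "51521":   {"pageid": 51521,   "title": "Low-density lipoprotein", "watchers": 121},
  "4355487": {"pageid": 4355487, "title": "Apolipoprotein B"}
}}}'
sample_res <- fromJSON(sample_json)

ldl_watchers  <- get_watchers(sample_res, 51521)
apob_watchers <- get_watchers(sample_res, 4355487)
```

Against the live API, fromJSON(wiki_url(51521)) would replace the sample. Returning NA for a missing field keeps pages without a reported watcher count in the result instead of dropping them.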
PubMed has several different query modules and options. For the core metadata, I need to use the esummary module, and the abstract result type. Again, the query here is split into several lines to show the parts more clearly. The PMID is added to the very end.
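# paste0() glues the parts into one URL stub; the PMID is appended after "&id="
pubmed.abs.url <- paste0(
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
  "?db=pubmed",
  "&retmode=json",
  "&rettype=abstract",
  "&id=")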
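To give an idea of how the title and publication date could be picked from the esummary answer, here is a small sketch. The helper get_summary is hypothetical, the record values are placeholders rather than real PubMed data, and the result -> PMID -> title / source / pubdate element names are my assumption about the JSON layout.

```r
library(jsonlite)

# Hypothetical helper: core metadata for one PMID from a parsed esummary
# response; NULL when the PMID is not in the response.
get_summary <- function(res, pmid) {
  rec <- res$result[[as.character(pmid)]]
  if (is.null(rec)) return(NULL)
  list(title = rec$title, journal = rec$source, pubdate = rec$pubdate)
}

# Placeholder record in the assumed layout (not real PubMed data)
sample_json <- '{"result": {
  "uids": ["00000000"],
  "00000000": {"uid": "00000000", "pubdate": "2011 May 9",
               "source": "Example Journal", "title": "Example title"}
}}'
sample_res <- fromJSON(sample_json)

article <- get_summary(sample_res, "00000000")
```

With a live call, the response would come from fromJSON() applied to the URL stub with the PMID pasted to its end.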
The article Three-dimensional cryoEM reconstruction of native LDL particles to 16Å resolution at physiological body temperature was published in PLoS One on 9th May 2011. Two weeks later, on the 23rd, it was cited on the Wikipedia page Low-density lipoprotein, in the section describing the structure of these complex particles, which are composed of multiple proteins and transport all fat molecules (lipids) around the body within the water outside cells.
At the moment, the page has 121 watchers, a fair bit more than the other citing pages.
     pubDate wikipediaDate               pageTitle watchers
1 2011-05-09    2011-09-07        Apolipoprotein B      N/A
2 2011-05-09    2011-05-23 Low-density lipoprotein      121
3 2011-05-09    2013-12-14             Lipoprotein       50
When there are fewer than 30 watchers, like in the case of Apolipoprotein B, Wikipedia uses the phrase Fewer than 30 watchers on the web page, and the API returns no value.
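The question of how quickly the cites were made can be answered with plain date arithmetic. This is a small sketch in base R using the dates from the table above; the data frame name cites and the column lagDays are mine.

```r
# Dates copied from the table above
cites <- data.frame(
  pageTitle = c("Apolipoprotein B", "Low-density lipoprotein", "Lipoprotein"),
  pubDate = as.Date(rep("2011-05-09", 3)),
  wikipediaDate = as.Date(c("2011-09-07", "2011-05-23", "2013-12-14")),
  stringsAsFactors = FALSE
)

# Subtracting two Date columns gives the difference in days
cites$lagDays <- as.integer(cites$wikipediaDate - cites$pubDate)

# Fastest citation first
cites[order(cites$lagDays), c("pageTitle", "lagDays")]
```

The lags come out as 14 days for Low-density lipoprotein, 121 days for Apolipoprotein B and 950 days for Lipoprotein.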
I also updated the experimental altmetrics web application with metrics courtesy of e.g. Altmetric – now also with Wikipedia! This version follows the same logic as the older one, only with fresh metrics.
Change either one of the axes to show Wikipedia, and you’ll notice a couple of things.
First, according to Altmetric, among Aalto articles since 2007, the highest Wikipedia score is 3. It does not belong to our lipoprotein article, published Open Access (marked with a golden stroke in the app) in PLoS One, but to The Chemical Structure of a Molecule Resolved by Atomic Force Microscopy, published in Science. The lipoprotein article is among those with 2 Wikipedia cites, on the far left. Why the difference?
Second, there seem to be quite a few articles cited in Wikipedia. Why weren’t they in the Figshare dataset?
Altmetric is not just aggregating Wikipedia citations by PMID. Take for example the Science article. Follow the Wikipedia link in Mentioned by…, and from there, click either one of the timestamped links Added on…. You land on the diff page showing when the citation was added, by whom and how. Version control is a cool time machine. You can almost hear the author tap-tapping the text.
She doesn’t write a PMID. She writes a DOI.
It remains a minor mystery, though, why Altmetric doesn’t count the citation made to Apolipoprotein B on 7th September 2011. Maybe this is just a hiccup in the new process, something that Altmetric clearly says in the blog posting:
Also please note that to begin with we’ve had a big load of historical data to update all at once, so some of it is still feeding through to the details pages – but give it a week or so and it should all be up to date.
For those of you interested in R, the code on querying Wikipedia and PubMed is available as a GitHub Gist.