In his recent blog post Understanding the implications of Open Citations – how far along are we?, Aaron Tay from Singapore Management University gives a compact and clear overview of open citations. As Tay rightly notes,
With the recent interest in integrating discovery of open access, as well as linked data (with a dash of machine learning and text mining) we have the beginnings of an interesting situation. A third development which was harder to forsee is the rise in Open Citation movement
Tay mentions the fresh CWTS analysis Crossref as a new source of citation data: A comparison with Web of Science and Scopus, in which at least 39.7% of Web of Science references were found to match an open reference.
Also, finding OA texts is easier by the day.
In The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles Heather Piwowar et al. estimate that
at least 28% of the scholarly literature is OA (19M in total) and that this proportion is growing, driven particularly by growth in Gold and Hybrid. The most recent year analyzed (2015) also has the highest percentage of OA (45%). Because of this growth, and the fact that readers disproportionately access newer articles, we find that Unpaywall users encounter OA quite frequently: 47% of articles they view are OA.
Lately, Digital Science Dimensions has generated a lot of welcome buzz. As Tay points out, their data are at least partly based on Open Citations but
I’m pretty sure Dimensions goes beyond it, as it is a combination of input and expertise from 6 different teams including ReadCube, Altmetric, Figshare, Symplectic, DS Consultancy and ÜberResearch and other publisher partners.
Compared to my last test with OpenCitations Corpus and Crossref, how would Dimensions serve us, citations-wise? Also, how many Aalto University articles can be found and read as OA?
As a base set, I used Aalto University publications from 2015 to 2017. These years are well covered by the VIRTA service, and thanks to the REST API (in Finnish), getting your hands on the publication set is fairly easy. Basically you just need to parse the returned XML. Note that I could have also used data from our local DW to start with; publication metadata are transferred first from our CRIS into the DW and later to VIRTA on a regular basis.
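As a rough illustration of the "just parse the returned XML" step, here is a minimal Python sketch (the actual code for this exercise is in R). The element names `publication`, `id`, `year` and `doi` and the sample document are assumptions made up for the example; the real VIRTA schema uses its own names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response; the real VIRTA XML has a different schema.
SAMPLE_XML = """<publications>
  <publication>
    <id>0001</id>
    <year>2016</year>
    <doi>10.1000/example.1</doi>
  </publication>
  <publication>
    <id>0002</id>
    <year>2017</year>
  </publication>
</publications>"""


def parse_publications(xml_text):
    """Return one dict per <publication>; fields missing in the XML become None."""
    root = ET.fromstring(xml_text)
    records = []
    for pub in root.iter("publication"):
        records.append({
            "id": pub.findtext("id"),
            "year": pub.findtext("year"),
            "doi": pub.findtext("doi"),  # None when the element is absent
        })
    return records


records = parse_publications(SAMPLE_XML)
```

Records without a DOI are kept here with `doi` set to `None`, so that a later step can decide whether to skip them.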
While at it, in addition to citations data via Dimensions and Web of Science (WoS), I mashed up some more:
- links to possible Open Access full texts via Unpaywall with the roadoi R package by Najko Jahn
- Mendeley reader counts, tweets, and Altmetric score by Altmetric
- research field names from Statistics Finland
- Aalto University department names from the local CRIS
Those of you interested in the R code might like to check it on GitHub. A mental note to myself: combining data from so many external sources would desperately need a modular solution.
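One way to picture a more modular mashup is to let every external source deliver a lookup table keyed by DOI, and to join them in a single generic step. The sketch below is in Python and is not the R code on GitHub; all field names and values in it are hypothetical.

```python
def merge_by_doi(base_records, *sources):
    """Join per-source lookup dicts onto the base records, keyed by lower-cased DOI.

    base_records: list of dicts with a "doi" key (records without one are skipped).
    sources: dicts mapping a lower-cased DOI to a dict of extra fields.
    """
    merged = {}
    for record in base_records:
        doi = record.get("doi")
        if not doi:
            continue
        key = doi.lower()
        row = dict(record)
        for source in sources:
            # A DOI that a source does not know simply adds no fields.
            row.update(source.get(key, {}))
        merged[key] = row
    return merged


# Hypothetical data standing in for the base set and two external sources.
base = [{"doi": "10.1000/A", "school": "SCI"}, {"doi": None, "school": "ENG"}]
dimensions = {"10.1000/a": {"dimensions_citations": 12}}
unpaywall = {"10.1000/a": {"is_oa": True}}

table = merge_by_doi(base, dimensions, unpaywall)
```

Adding a new source then means writing one function that returns such a lookup dict, without touching the join itself.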
All other data featured here can be queried, parsed and joined computationally, except WoS citations because their APIs don’t return them. See comments in the code for details. EDIT: the citation count is in fact returned, my misunderstanding. See e.g. the example code with one DOI here that uses the wosr package.
Because there is no quality control of DOI strings in VIRTA, all bad DOIs are silently skipped. At our University, broken DOIs have been history since late 2016 thanks to the new CRIS, but legacy data still haunt us. To fix DOIs, I only made some quick replacements but apparently you can in fact catch most of the correct ones.
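To give an idea of what such quick DOI cleaning can look like, here is a small Python sketch. It is not the set of replacements used in this exercise; the regular expression is a commonly used approximation of the DOI shape and will not catch every valid DOI.

```python
import re

# Approximate DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)


def clean_doi(raw):
    """Extract a bare, lower-cased DOI from a messy string, or return None."""
    if not raw:
        return None
    match = DOI_PATTERN.search(raw.strip())
    if match is None:
        return None  # a bad DOI: skipped by the caller
    # Drop trailing punctuation that often sticks to copy-pasted DOIs.
    return match.group(0).rstrip(".,;").lower()


cleaned = [clean_doi(s) for s in (
    "https://doi.org/10.1000/ABC.123",
    " doi:10.1000/xyz. ",
    "not a doi",
)]
```

Strings that yield `None` correspond to the DOIs that end up silently skipped.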
Speaking of Open Access at Aalto University in general, a viable path goes via the CRIS portal research.aalto.fi where you can filter research output with full text. Another route starts from the Aaltodoc DSpace repository.
In their API documentation, Unpaywall states that the field evidence is used only for debugging, and that its values are subject to change. OK. Despite this warning though, I couldn’t help taking a quick look at which sources Unpaywall finds our OA full texts from, and at the rough split between different source types. For this I put up a collapsible tree which is a slightly modified version of the original one by Mike Bostock.
When you launch the web application that includes all of the mashup data of this exercise, what you’ll see after a short “Please Wait…” is a scatterplot at University level. By default, it compares citation counts from Dimensions (x axis) and those from WoS (y axis) on a linear scale. The marker symbol refers to the OA availability of the full text: a triangle is an OA article, a circle a non-OA one.
FYI, in the data table you’ll find yet another OA column labelled oa_virta. These values come from VIRTA and thus originate in Aalto University itself. See the codes of Avoin saatavuus (in Finnish).
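The split by source type can be sketched as a tally that is then nested into the name/children structure a collapsible tree typically reads. This Python sketch assumes each Unpaywall response has been parsed into a dict with a `best_oa_location` object carrying `host_type` and `evidence`; the sample values are illustrative only, and the root label is my own choice.

```python
from collections import Counter, defaultdict


def evidence_tree(responses):
    """Count (host_type, evidence) pairs and nest them as name/children nodes."""
    counts = Counter()
    for response in responses:
        location = response.get("best_oa_location")
        if location:  # None when no OA full text was found
            host = location.get("host_type") or "unknown"
            evidence = location.get("evidence") or "unknown"
            counts[(host, evidence)] += 1

    by_host = defaultdict(list)
    for (host, evidence), n in sorted(counts.items()):
        by_host[host].append({"name": evidence, "size": n})
    return {
        "name": "Unpaywall",
        "children": [{"name": h, "children": c} for h, c in by_host.items()],
    }


# Illustrative responses; real evidence strings may differ and can change.
sample = [
    {"best_oa_location": {"host_type": "publisher", "evidence": "evidence A"}},
    {"best_oa_location": {"host_type": "publisher", "evidence": "evidence A"}},
    {"best_oa_location": {"host_type": "repository", "evidence": "evidence B"}},
    {"best_oa_location": None},
]
tree = evidence_tree(sample)
```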
Note that unlike the CWTS analysis mentioned above, I’ll not compare how many WoS items are found in Dimensions. Perhaps I should do that too… Anyway, my original set of VIRTA items is first queried against Dimensions, and all found ones are then queried against WoS.
The difference between citation counts provided by Dimensions and those by WoS is seemingly small. By selecting some School you’ll get a more detailed view. Take for instance School of Chemical Engineering, and note how close the values are. The same goes for School of Engineering, and to a lesser degree also for School of Electrical Engineering. With other Schools, citation counts start to disperse. In the case of School of Science, take a look at the departmental level. In there, Department of Industrial Engineering and Management has a somewhat equal coverage in both sources whereas Helsinki Institute for Information Technology HIIT is less so. Note though that the axis scales are not fixed between plots.
In my test set, 38% of all articles are found as an Open Access full text via Unpaywall. At School level, the proportion is highest in School of Science, 48%. All in all, percentages echo the findings of Piwowar et al.
The heatmaps by department visualize citation metrics in a matrix by color. Values are arithmetical means grouped by School, department, and field. Here I took the liberty of using the English translations of the local research field names. This is just for illustration purposes. The actual fields behind the Dimensions metric field_citation_ratio are not known so please be careful with conclusions.
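The grouping behind such a heatmap can be sketched as follows, in Python rather than the R of the web app. The column names `school`, `department` and `field` and the sample rows are hypothetical; only `field_citation_ratio` is a metric name mentioned above.

```python
from collections import defaultdict
from statistics import mean


def group_means(rows, keys, value):
    """Arithmetic mean of `value` per combination of `keys`, ignoring missing values."""
    groups = defaultdict(list)
    for row in rows:
        v = row.get(value)
        if v is not None:
            groups[tuple(row[k] for k in keys)].append(v)
    return {group: mean(values) for group, values in groups.items()}


# Made-up rows for illustration.
rows = [
    {"school": "SCI", "department": "D1", "field": "F1", "field_citation_ratio": 1.0},
    {"school": "SCI", "department": "D1", "field": "F1", "field_citation_ratio": 3.0},
    {"school": "SCI", "department": "D2", "field": "F1", "field_citation_ratio": None},
    {"school": "ENG", "department": "D3", "field": "F2", "field_citation_ratio": 0.5},
]
means = group_means(rows, ("school", "department", "field"), "field_citation_ratio")
```

Each key of `means` corresponds to one cell of the matrix; groups with no values at all do not appear.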
In the web app, there are some technical caveats that you should be aware of. The scatterplot lacks jitter which means that markers are on top of each other when datapoints have the same value. If you zoom in by selecting a rectangular area, double-clicking doesn’t always bring you back to the default level; if this happens, click the Home icon on the navigation bar instead. In the data table, horizontal scroll doesn’t stay in sync with the fixed header row, unfortunately.
For the record, 56% of those articles that had citation metrics at Dimensions were also found in Web of Science. So what was missing? The plots below show, by publisher (>=20 articles) and School respectively, the percentage of articles not found by DOI from WoS, and missing a URL to an OA full text at Unpaywall. The darker the gray, the better the coverage. The lighter, the more values are missing.
The core code that makes use of the levelplot function of the lattice (Trellis graphics) R package comes from Rense Nieuwenhuis.
Along with the obvious publishers like PLoS and Hindawi, another noteworthy OA one in our case is e.g. the American Physical Society. Taking into account both of the plotted variables, the “weakest performers” are Emerald, IEEE and Inderscience Publishers.
Yet, DOIs work only so well. As the CWTS analysis reminds us,
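For reference, jitter amounts to adding a small random offset to each coordinate before plotting. The Python sketch below only illustrates the idea; it is not part of the web app, and the default `amount` is an arbitrary assumption that would need tuning to the axis scale.

```python
import random


def jitter(values, amount=0.2, seed=0):
    """Return a copy of `values` with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)  # fixed seed keeps the plot reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


citations = [0, 0, 0, 5, 5]
jittered = jitter(citations)
```

The original values should stay in the data table; only the plotted positions are shifted.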
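The percentages behind these plots can be sketched as a share of missing values per group, keeping only groups with enough articles. This is a Python illustration, not the R code used here; the field names `wos_citations` and `oa_url` and the generated rows are hypothetical.

```python
from collections import defaultdict


def missing_share(rows, group_key, fields, min_articles=20):
    """Percentage of rows with a missing (None) value per field, by group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    result = {}
    for group, items in groups.items():
        if len(items) < min_articles:
            continue  # mirrors the ">=20 articles" cut-off
        result[group] = {
            f: 100.0 * sum(1 for r in items if r.get(f) is None) / len(items)
            for f in fields
        }
    return result


# Made-up rows: publisher "A" has 20 articles, publisher "B" only 3.
rows = [
    {"publisher": "A",
     "wos_citations": None if i < 5 else 1,
     "oa_url": None if i < 10 else "https://example.org/fulltext"}
    for i in range(20)
] + [{"publisher": "B", "wos_citations": None, "oa_url": None} for _ in range(3)]

shares = missing_share(rows, "publisher", ("wos_citations", "oa_url"))
```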
Every publication in Crossref has a DOI, but only a selection of the publications in WoS and Scopus have such an identifier. Furthermore, not all publications with a DOI in WoS and Scopus have a matching DOI in Crossref.