The legacy data of our publications is now open. As has been clearly shown, the number of contributors per publication has grown dramatically in recent years: the total number of authors passed 15,000 in 2007, dropped slightly in 2008, and then resumed its upward track in 2009. Coauthoring creates networks. We know that there are significant differences between fields of research. Do they surface in a coauthor network graph too, and if so, how?
The tutorial about visualizing Hollywood movies by Willem Robert van Hage is nicely detailed and informative. Authors and publications are conceptually similar to actors and movies, at least for the sake of this exercise. So, all I needed to do was collect data by querying the Aalto Open Data SPARQL endpoint, adapt the R code, write the result in GraphML, open the file in Gephi, and visualize the network.
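The middle steps of that pipeline (turning publication-author pairs into a weighted coauthor network and writing it out as GraphML) can be sketched roughly as follows. This is a minimal Python illustration, not the adapted R code used for this post; the rows, the author names, and the function names `coauthor_edges` and `to_graphml` are made up for the example, and the real query results would of course look different.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET

# Hypothetical (publication, author) rows, standing in for SPARQL results.
rows = [
    ("pub1", "Author A"),
    ("pub1", "Author B"),
    ("pub1", "Author C"),
    ("pub2", "Author A"),
    ("pub2", "Author B"),
]


def coauthor_edges(rows):
    """Count, for every pair of authors, how many publications they share."""
    authors_by_pub = {}
    for pub, author in rows:
        authors_by_pub.setdefault(pub, set()).add(author)
    weights = Counter()
    for authors in authors_by_pub.values():
        # Every pair of coauthors on one publication adds 1 to the edge weight.
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return weights


def to_graphml(weights):
    """Serialize the weighted, undirected edge list as a GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for name in sorted({n for pair in weights for n in pair}):
        ET.SubElement(graph, "node", id=name)
    for (a, b), w in sorted(weights.items()):
        edge = ET.SubElement(graph, "edge", source=a, target=b)
        ET.SubElement(edge, "data", key="weight").text = str(w)
    return ET.tostring(root, encoding="unicode")


weights = coauthor_edges(rows)
graphml_text = to_graphml(weights)
```

The resulting string can be saved to a `.graphml` file and opened in Gephi; the edge weight is simply the number of shared publications.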
I began boldly, and tried to handle the entire* Aalto publication universe at once. That was a mistake in so many ways. Both R and Gephi are memory-intensive. Still, looking back, it was perhaps not all that bad to start from the wrong end, so to speak. That way, I was forced to take a closer look at my tools, and learned a lesson (I think). This Wiki page was of great help. Once I had given Java enough memory, Gephi stopped shutting down for lack of resources. In the RStudio IDE, on the other hand, I had to remove all bigger, unused objects and trigger garbage collection manually. Otherwise RStudio refused to continue with big matrix operations. There are some special packages that are said to help when handling bigger data; to my understanding they use swapping. AFAIK you cannot feed more memory to the R environment just like that.
More focus, that’s what I needed. How? There is only limited information about publications in the open dataset; no affiliations of authors, for example. Some of the local authors surely are in the Aalto People graph where the affiliation is given, but unfortunately one cannot accurately join the two datasets by the name of the author alone.
So I ended up making a purely subjective, small selection of some of the more prominent researchers from STM science. Here they are with their core field of research interest:
- Martti Hallikainen, microwave remote sensing and small satellite technology
- Riitta Hari, neurophysiology
- Maarit Karppinen, materials research
- Riitta Smeds, collaborative innovation in networked business and service processes
- Olli Varis, global and international water issues
They all must have had a central position in many publications over the years, giving big enough networks to look at. According to the dataset, they have each contributed to 150 publications on average.
Following the example of van Hage, I decided to have a look at the modularity of their coauthor networks with the community detection algorithm Leading Eigenvector, provided by the R igraph package. Please note though that I do not know whether some other algorithm would be more apt for detecting communities in research publications.
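To give an idea of what the Leading Eigenvector method does, here is a minimal Python sketch of its first step only: a single split of a graph into two groups by the sign of the leading eigenvector of the modularity matrix. It is not the igraph implementation, which goes on to subdivide the groups further and stops when a split no longer improves modularity. The function name `leading_eigenvector_split`, the plain power iteration, and the toy graph (two triangles joined by one edge) are assumptions made for illustration.

```python
import random


def leading_eigenvector_split(n, edges, iterations=1000, seed=0):
    """Split nodes 0..n-1 into two groups by the sign of the leading
    eigenvector of the modularity matrix B = A - k k^T / (2m)."""
    adj = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] += 1.0
        adj[j][i] += 1.0
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    b = [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
         for i in range(n)]
    # Shift by a Gershgorin bound so that power iteration converges to the
    # most positive eigenvalue rather than the largest in absolute value.
    shift = max(sum(abs(v) for v in row) for row in b)
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        y = [sum(b[i][j] * x[j] for j in range(n)) + shift * x[i]
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    positive = {i for i in range(n) if x[i] > 0}
    return positive, set(range(n)) - positive


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
group_a, group_b = leading_eigenvector_split(6, toy_edges)
print(sorted(group_a), sorted(group_b))
```

On the toy graph the split separates the two triangles. A real coauthor network would use the shared-publication counts as edge weights, which this sketch handles by adding repeated edges together.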
First, as a general overview, here is a line graph showing the average number of authors per publication over time. The dotted line is the whole* Aalto University. Note, however, that because the fields of research differ, there is little sense in comparing the lines as such.
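The figure behind each point on such a line is simple: for every year, the number of distinct authors on each publication, averaged over that year's publications. A minimal Python sketch, with made-up rows and a hypothetical function name `mean_authors_per_year`, could look like this:

```python
from collections import defaultdict

# Hypothetical (publication, year, author) rows.
rows = [
    ("pub1", 2007, "Author A"),
    ("pub1", 2007, "Author B"),
    ("pub2", 2007, "Author A"),
    ("pub2", 2007, "Author C"),
    ("pub2", 2007, "Author D"),
    ("pub2", 2007, "Author E"),
    ("pub3", 2008, "Author A"),
]


def mean_authors_per_year(rows):
    """Return {year: average number of distinct authors per publication}."""
    authors = defaultdict(set)  # publication -> set of its authors
    year_of = {}                # publication -> publication year
    for pub, year, author in rows:
        authors[pub].add(author)
        year_of[pub] = year
    totals = defaultdict(lambda: [0, 0])  # year -> [author count, pub count]
    for pub, names in authors.items():
        totals[year_of[pub]][0] += len(names)
        totals[year_of[pub]][1] += 1
    return {year: s / c for year, (s, c) in sorted(totals.items())}


averages = mean_authors_per_year(rows)
```

Restricting the rows to the publications of one researcher gives one solid line; using all rows gives the dotted one.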
The Leading Eigenvector community is stored as a number in the node attribute lec_community. In Gephi, the values are visualized with different colors. The size of both the node and its label indicates the centrality of the node in the network. The wider the line (edge) between two nodes, the bigger the weight. All the rest is just visualization and does not have any particular meaning. For example, one should not jump to conclusions if some nodes seem to be more isolated than others. The layout algorithm used is ForceAtlas 2, followed in some cases by a few runs of Expansion and Label Adjust. Graphs were exported from Gephi in GEXF format.
I have not filtered any nodes or edges, nor done any real data curation. The only thing I did was manually correct some visible character encoding issues; most certainly there are some left, so bear with me. When you zoom in, you will most probably also notice some duplicate nodes, and nodes with bad data. Every variation of a name, e.g. Suematsu, H. and Suematsu, H (without the dot), generates a new node. If anything, this exercise was an eye-opener for me when it comes to the quality of data. As somebody just put it: being a data scientist is more about being a janitor than anything else. Cleaning, cleaning, cleaning.
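Had I done some cleaning, a first pass could have been a rough normalization of author names before building the network. The Python sketch below was not part of this exercise; the function name `normalize_author` and the rules (drop dots, tidy spacing around commas) are assumptions. Merging on such a key is risky too, because two different people can share a surname and an initial.

```python
import re
import unicodedata


def normalize_author(name):
    """Reduce spelling variants like 'Suematsu, H.' and 'Suematsu, H'
    to one key by dropping dots and tidying whitespace."""
    name = unicodedata.normalize("NFC", name)  # unify composed characters
    name = name.replace(".", " ")              # dots after initials
    name = re.sub(r"\s*,\s*", ", ", name)      # exactly one space after commas
    name = re.sub(r"\s+", " ", name).strip()   # collapse leftover whitespace
    return name


variants = ["Suematsu, H.", "Suematsu, H", "Suematsu,H."]
keys = {normalize_author(v) for v in variants}
```

All three variants collapse to a single key, so they would end up as one node instead of three.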
The number of communities found by Leading Eigenvector is as follows:
- Martti Hallikainen: 6
- Riitta Hari: 16
- Maarit Karppinen: 10
- Riitta Smeds: 7
- Olli Varis: 5
The range is rather big, but I don’t really dare to interpret the differences that much. Assuming that communities tell something about fields of science, which is not certain, perhaps neurophysiology, for example, is by nature intensely cross-disciplinary?
All graphs have one or two semi-central nodes, but on the graph of Olli Varis there is an interesting pattern. Besides himself, there are two other nodes that have a centrality value well over 0.70, making a powerful trio. Looking at the weights of the edges, there are strong 1:1 connections on the graphs of Martti Hallikainen and Maarit Karppinen.
Example of Gephi processing session as a screencast (2:03)
* only the tkkjulkaisee graph is used here