Email anatomy of a project

Project is finished
When a higher education (HE) project runs for more than a few years and keeps more than a handful of people busy, it generates a substantial amount of email. Of course, IM has been among us for quite some time, but it has not replaced older forms of digital communication. Far from it.

The CRIS project of Aalto University was officially initiated in early 2012. At that time, it was still a side note in my email traffic. It took a year until the investment decision was made. After that came the lengthy EU tender process. A turning point was in September 2015: a two-day kick-off seminar with the vendor, where practicalities started to take shape. That seminar was also the beginning of a more regular flow of emails.

I save work emails in folders by category, a habit that often results in obscure collections whose focus escapes me later on. At this point in time, when the CRIS project has only recently (last week) been declared finished, I still remember what the folder name refers to 🙂

Note that I must have also deleted hundreds of posts: meeting requests, cronjob notifications, emails only superficially related to CRIS but sent FYI, duplicates, etc. Still, I have a hunch that roughly 80% of all the emails that were either sent to me, or where I myself was the sender, sit in the CRIS folder of my Outlook 2013.

What could the emails tell about the project? Active dates, days of the week, times of the day. Sadly, less so about the semantics. Our work language is Finnish, and although Finnish is extensively researched, corpus and other digital tools are not that easily accessible for a layman. Unfortunately (for this blog posting), almost all of the English communication with the vendor took place within their Jira issue and project tracking platform.

Toe bone connected to the foot bone

To start with, I selected all emails (4431) from the CRIS folder, Saved as…, and converted the text file to UTF-8.

Emails are seldom clear, separate entities. On the contrary, more often than not they are bundles of several reply posts. Because I most certainly already had all the older ones, I needed to get rid of all that extra. In other words, I was interested only in the text from From: to the horizontal line of underscores (\x5f in the pattern below) that denotes the beginning of previous correspondence, at least if the email was written with Outlook, the #1 in-house email client. This awk one-liner does the job.

awk '/\x5f/ { p-- } /^From:/ { p=1 } p>0' emails_utf8.txt > emails_proper.txt

Thanks to an unknown friend, labelling lines in this Mbox format was easy.

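# Header block: from a line starting with "From" to the next empty line
/^From/, /^$/ {
    printf "\nhead : %s", $0
}

# Body: from that empty line up to, but not including, the next "From" line
/^$/,/^From/ {
    if ($1 ~ /^From/) next
    printf "\nbody : %s", $0
}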

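From the result file, I grep’ed only the non-empty lines, which dropped the gaps between emails. Below you see one example email, in Finnish. It’s from me to Jari, the project manager: a reply in a thread about two identical jobs running, asking what he thinks is the most sensible way to disable a user account.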

head From: Sonkkila Tuija
head Sent: 21. syyskuuta 2016 10:04
head To: Haggren Jari
head Subject: RE: Kaksi samaa jobia ajossa
body Ok. Muuten: mikä on sun mielestä järkevintä, kun tunnus halutaan disabloida? 
body t. Tuija

Then over to RStudio.

Thigh bone connected to the hip bone

The full R code is here.

First, read in raw, labelled data, and convert to a data frame.

library(tibble)  # tibble() replaces the deprecated data_frame()
raw <- readLines("email_proper_parsed.txt", encoding = "UTF-8")
lines <- tibble(raw = raw)

For further work, I needed to have my email “observations” on one line each. This proved tricky. I suspect that the job could have been easier with some command line tool. In fact, I had the nagging feeling that I was trying to push the data back to some previous stage. Anyway, again with help, I managed to add a sequence number to the lines, showing which head and which body belong to the same email. A novelty for me was the clever rle function from base R: it returns the lengths of runs of consecutive equal values, here the runs of head lines and of body lines.

lines$email <- NA

# Number each run of consecutive "head" lines: 1, 1, 1, 1, 2, 2, ...
x <- grepl("^head ", lines$raw)
spanEmail <- rle(x)$lengths[rle(x)$values == TRUE]
lines$email[x] <- rep(seq_along(spanEmail), times = spanEmail)

# Same for the "body" runs, so the n-th head and the n-th body share a number
x <- grepl("^body ", lines$raw)
spanEmail <- rle(x)$lengths[rle(x)$values == TRUE]
lines$email[x] <- rep(seq_along(spanEmail), times = spanEmail)
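
To see what rle does here, a minimal sketch on made-up toy lines may help. The six strings and the variable names (is_head, head_runs, and so on) are only for illustration; the logic is the same as in the snippet above.

```r
# Toy input: two emails, labelled the same way as the parsed file
raw <- c("head From: A", "head Subject: x", "body hello",
         "head From: B", "body line 1", "body line 2")

email <- rep(NA_integer_, length(raw))

# Runs of consecutive head lines; keep only the lengths of the TRUE runs
is_head <- grepl("^head ", raw)
runs <- rle(is_head)
head_runs <- runs$lengths[runs$values]
email[is_head] <- rep(seq_along(head_runs), times = head_runs)

# Same for the body lines
is_body <- grepl("^body ", raw)
runs <- rle(is_body)
body_runs <- runs$lengths[runs$values]
email[is_body] <- rep(seq_along(body_runs), times = body_runs)

print(email)
```

The numbering only lines up if every email has exactly one head run followed by one body run, which is an assumption about the input rather than something the code checks.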

Then, for every email group, I put the head text and the body text into a new column each, plus some further processing.
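
The full code is linked above; what follows is only a rough base R sketch of the idea, on toy data. The column names part and text, and the helper collapse_part, are hypothetical and not taken from the original script.

```r
# Toy version of the labelled lines, with the email number already added
lines <- data.frame(
  raw = c("head From: A", "head Subject: x", "body hello",
          "head From: B", "body line 1", "body line 2"),
  email = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Split the label ("head" or "body") from the rest of the line
lines$part <- sub(" .*$", "", lines$raw)
lines$text <- sub("^(head|body) ", "", lines$raw)

# Paste together all lines of one part, per email number
collapse_part <- function(part) {
  sel <- lines$part == part
  as.vector(tapply(lines$text[sel], lines$email[sel], paste, collapse = " "))
}

# One row per email; assumes every email has both a head and a body
emails <- data.frame(
  email = sort(unique(lines$email)),
  head = collapse_part("head"),
  body = collapse_part("body"),
  stringsAsFactors = FALSE
)
print(emails)
```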

Hip bone connected to the back bone

Date and time.

I faced the fact that in Finland, the default language of the OS varies. I haven’t seen statistics, but based on my small HE sample data, Finnish and English seem to be on equal footing. So, to work with date/time I first had to convert Finnish months and weekdays to their English counterparts. After that I could start working with the lubridate package.
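
As an illustration of the month problem, here is a small base R sketch. Note that it takes a slightly different route than the one described above: instead of translating month names to English and handing them to lubridate, it swaps the Finnish month name for its number, which avoids depending on the locale. The function name parse_fi_sent is hypothetical, and only the Finnish Sent: format shown in the example email is handled.

```r
# Finnish month names in the form Outlook uses in dates (partitive case);
# \u00e4 is the letter a with dots, written as an escape for portability
fi_months <- c("tammikuuta", "helmikuuta", "maaliskuuta", "huhtikuuta",
               "toukokuuta", "kes\u00e4kuuta", "hein\u00e4kuuta", "elokuuta",
               "syyskuuta", "lokakuuta", "marraskuuta", "joulukuuta")

parse_fi_sent <- function(x, tz = "Europe/Helsinki") {
  # Replace the month name with a two-digit month number
  for (i in seq_along(fi_months)) {
    x <- sub(fi_months[i], sprintf("%02d", i), x, fixed = TRUE)
  }
  # "21. 09 2016 10:04" can now be parsed without any locale settings
  as.POSIXct(x, format = "%d. %m %Y %H:%M", tz = tz)
}

sent <- parse_fi_sent("21. syyskuuta 2016 10:04")
print(format(sent, "%Y-%m-%d %H:%M"))
```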

Then, summaries for hourly, weekly, and daily visualizations. Note the use of timezone (TZ). Without explicitly referring to it, the xts object, the cornerstone of the time series graph, is off by a day.
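
The summaries themselves are plain counting. Below is a base R sketch on three made-up timestamps, not the xts code of the app; the variable names are illustrative. The point it shares with the real code is that the timezone is passed explicitly wherever a date is derived from a timestamp.

```r
tz <- "Europe/Helsinki"
# Made-up send times, for illustration only
sent <- as.POSIXct(c("2016-09-21 10:04", "2016-09-21 14:30",
                     "2016-09-23 10:15"), tz = tz)

# Emails per hour of the day (0-23), including hours with no email
by_hour <- table(factor(as.integer(format(sent, "%H")), levels = 0:23))

# Emails per weekday, ISO numbering: 1 = Monday, 7 = Sunday
by_wday <- table(factor(format(sent, "%u"), levels = as.character(1:7)))

# Emails per calendar day; the timezone is given explicitly
by_day <- table(as.Date(sent, tz = tz))

print(by_hour[by_hour > 0])
print(by_wday)
print(by_day)
```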

Although word clouds seldom reveal anything we wouldn’t be aware of anyway, I produced two of them nevertheless, here with the packages tm and wordcloud. Excluding stop words is a tedious job without a ready-made list, so I made my life easier and basically just tried to make sure that no person names were involved. BTW, signatures are a nuisance in email text mining.
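
Underneath a word cloud there is just a word frequency table. A bare-bones sketch in base R, without tm or wordcloud, could look like this; the three sentences and the tiny stop word list are made up for illustration.

```r
# Toy email bodies (made up)
bodies <- c("The portal is open", "Portal and publications",
            "Publications are in the portal")
# Hypothetical, very short stop word list
stop_words <- c("the", "is", "and", "are", "in")

# Lower-case, split on anything that is not a letter, drop stop words
words <- unlist(strsplit(tolower(bodies), "[^[:alpha:]]+"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

# Frequency table, most frequent word first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

Person names and signatures would have to go into the exclusion list in the same way, which is where the tedious part begins.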

While working on this, I happened to read the blog posting “An overview of text mining visualisations possibilities with R on the CETA trade agreement” by BNOSAC, a Belgian consultancy network. They mentioned co-occurrence statistics, which made me experiment with them too. I followed their code example, and constructed a network visualization with ggraph and ggforce.
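
For readers new to co-occurrence: the statistic simply counts how often two words appear in the same unit of text, here the same email. The sketch below is not the code from the BNOSAC posting, only a base R illustration on made-up word lists.

```r
# Each element holds the words of one toy email (made up)
docs <- list(c("portal", "publication", "person"),
             c("publication", "person"),
             c("portal", "publication"))

# For every email, list all unordered pairs of distinct words
pairs <- unlist(lapply(docs, function(w) {
  w <- sort(unique(w))
  if (length(w) < 2) return(character(0))
  apply(combn(w, 2), 2, paste, collapse = " + ")
}))

# Number of emails in which each pair occurs together
cooc <- sort(table(pairs), decreasing = TRUE)
print(cooc)
```

In a network graph, each word becomes a node and each pair count becomes the weight, or thickness, of the edge between two nodes.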

Neck bone connected to the head bone

Finally, time to build a simple Shiny web app. This is the R code, and here is the web app.

Thanks to the nice dygraphs package, you can zoom in to the time series by selecting an area. Double-clicking reverts to the default view. The hourly and daily plots are made with highcharter.

So, what does it look like?

Overall, it seems to me that the project has a healthy email history. Basically no posts in the wee hours, and almost none during weekends. If you look closely, you’ll notice that, in the hourly bar chart, there are indeed tiny bars at nighttime. I can tell you that these emails were not sent by any of us on the Aalto staff. They were sent by a paid consultant.

The year 2016 word cloud is right when it shows that the word portal was frequent. The portal was indeed a big deliverable.

The co-occurrence network graph needs some explanation. First, the title is a bit misleading; there are also words other than just nouns. Second, the two big nodes, starting from the left, translate as “as a person” and “in the publication”. From the latter node, the thickest edges point to “already”, “because”, and “are”. The context rings a bell. CRIS systems are platforms where data about people and their affiliations meet research output, both scientific and artistic. Converting our legacy publication data to this new CRIS environment was a multi-step process, and something that is not fully over yet.

Subtitles from Dem Bones.

Posted by Tuija Sonkkila

About Tuija Sonkkila

Data Curator at Aalto University. When out of office, in the (rain)forest with binoculars and a travel zoom.