## Blog has moved to: mining4meaning.com

New address: http://mining4meaning.com/

*(An English summary of this and the previous post can be found here.)*

After my previous post, readers sent me a lot of good ideas for Raplysaattori. One of them was to find out which individual song has the highest rhyme factor. In this post I publish a top 10 list of individual songs.

I also received many requests to compute rhyme factors for artists I had not analyzed in my earlier post. Many of them had been left out because I did not have access to their lyrics. I have now launched the site raplysaattori.fi, where anyone can compute rhyme factors for their own favorites.

**MY BLOG HAS MOVED. THIS POST CAN NOW BE FOUND AT:**

*http://mining4meaning.com/2014/08/25/rap_algoritmi/*

*(Update 7 Sep 2014: You can now compute rhyme factors yourself at raplysaattori.fi. The source code is available on GitHub.)*

*(Update 27 Aug 2014: I added some new artists to the rhyme factor comparison and updated the text accordingly. I plan to publish a website soon where you can compute the rhyme factor for any lyrics you like. I also plan to release the program's source code publicly.)*

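*“Puolet räppäreist ei tajuu rimmaamisest mitään / ennen mikkiin päästämistä pitäis kirjalliset pitää”* (roughly: “Half of the rappers don't understand a thing about rhyming / they should sit a written exam before being let near a mic”)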

So says *Cheek*, undoubtedly the best-known name in Finnish rap at the moment, in his song *Kuka muu muka*. In this post I describe how a computer can find the rhymes in lyrics automatically, and I examine whether Cheek's claim holds up by analyzing the lyrics of Finland's best-known rappers with a program I wrote. The program measures the lengths of the rhymes it detects and estimates the size of each artist's vocabulary.

Many of us like to think that we’re free to go wherever we want – at least while we’re still young and without too many commitments. In reality, however, each of us follows lots of routines from day to day, such as the pattern home -> work -> lunch -> work, and so on. But how closely do we actually stick to these routines, and how strongly do they dictate our daily lives? Could we build a mathematical model that captures the routines and quantifies how predictable our movements are?

Let me start with an easier question: rappers and physicists – what do they have in common? Apart from the fact that I’m moderately passionate about both rap and physics, there’s one obvious similarity: rappers and physicists are both very active collaborators. I’d say that rappers feature in each other’s songs much more often than artists of other genres do, and physicists, on the other hand, sometimes have several hundred co-authors on a single paper!

So both groups have wide collaboration networks, but the interesting question is how the two networks differ and what they have in common. I took a closer look at this question back in 2011, when I was doing a project assignment for a course on complex networks, and what I found was quite surprising! But before we go on to study these networks, let me say a few words about how one can get access to them. For physicists (or theoretical particle physicists, to be more precise) it was easy, since the network was readily available on the Internet. For rappers, I decided to use Wikipedia, since it has loads of articles about rappers (even though I restricted myself to Finnish rappers) and the discography sections of the articles typically list “Guest appearances”. So I picked a popular Finnish rap artist and started following the links from this artist to other artists, and so on. Doing this manually would have required a huge effort, but luckily there’s an algorithm well suited to this kind of crawling task: breadth-first search (BFS).

Wait, you’ve heard the name of that algorithm before? You’re right, it’s exactly the same algorithm I used in my previous post for generating graphics! Pretty cool that you can use such a simple algorithm for such different types of tasks, don’t you think?

In practice, we take all collaborators (neighbors) of the seed artist and add them to a queue. Then we go through the queue one artist at a time: we remove the artist from the front of the queue and add his or her neighbors to the back. We make sure that we never visit any artist more than once, which guarantees that the search terminates at some point. The resulting network can be seen below. (You can view it in full resolution here or download the PDF version of the network. The same network from 2011 can also be found here.)
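The queue-based crawl described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the actual crawl: the function name `crawl`, the `get_collaborators` callback, and the toy data in `TOY` are all made up here, and a real version would need to fetch and parse the "Guest appearances" section of each Wikipedia article instead of reading from a dictionary.

```python
from collections import deque

def crawl(seed, get_collaborators):
    """Breadth-first crawl starting from `seed`.

    `get_collaborators` is a hypothetical callback that returns the
    artists linked from a given artist's page.
    """
    visited = {seed}          # artists that have been queued at some point
    queue = deque([seed])
    edges = set()             # undirected collaboration links
    while queue:
        artist = queue.popleft()              # take the oldest entry first
        for other in get_collaborators(artist):
            edges.add(frozenset((artist, other)))
            if other not in visited:          # never queue an artist twice
                visited.add(other)
                queue.append(other)
    return visited, edges

# Toy stand-in for the real link data (placeholder names, not real artists).
TOY = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["A"],
}

artists, edges = crawl("A", lambda name: TOY.get(name, []))
print(sorted(artists))
print(len(edges))
```

Note that "E" is never reached in the toy data: the crawl only follows links outward from the seed, so an artist who lists others but is listed by nobody stays outside the network.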

Every now and then I’ve thought about starting my own blog, but only recently, when the idea of a blog focused entirely on data science crossed my mind, did I really get excited about it, as I could easily think of several projects I would want to tell others about. The timing also seemed natural, as I officially started my doctoral studies last Thursday.

I have to admit that I’m quite lazy about following other people’s blogs but, fortunately, I’ve decided to make this blog one that even I could imagine following! 😉 With this in mind, I’ve set myself three goals regarding what I should publish:

- Write only about *cool* things (as judged by me)
- Try to include at least one picture per post
- Write (mainly) popular science so that you don’t have to have a degree in computer science to get the main message in each post (please, let me know if I’m failing at this)

The first criterion should be relatively easy to meet as I’ve had an opportunity to work on many projects in my studies, work and free time that I think are really exciting!

At this point, you might be wondering: what is *data science*, anyway? Well, for starters, it’s a buzzword related to the fact that the amount of publicly available data has exploded and people have started to realize its potential to transform our society. This has led some people to talk about data as the “*new oil*”.

But to be more precise, I think Wikipedia gives a rather nice description:

*Data science* is the study of the generalizable extraction of knowledge from data.

So basically, we are trying to find meaningful patterns among a bunch of 0s and 1s. One could even say that we are *Mining for Meaning* from seemingly messy data. The term *data science* spans several different disciplines, including data mining, machine learning, statistics, visualization, etc., and this actually suits me well, as it gives me nice flexibility when thinking about which posts would be relevant for this blog.

Let’s finally get our hands dirty. In this first post, we’re actually not going to analyze any data (except for some self-generated data); instead, we’ll take a look at two fundamental graph search algorithms, namely breadth-first search (BFS) and depth-first search (DFS).

I learned these algorithms back in high school where I was taught that I could solve a maze using either of these algorithms (there are, of course, many other applications as well). The BFS would go through the maze by extending the search uniformly in all directions, while the DFS would proceed along one path as long as it could until it found the exit or hit a dead-end, after which it would backtrack to the previous intersection and continue.
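To make the difference concrete, here is a minimal sketch of both searches on a small text maze. Everything in it (the `MAZE` layout, the `find` and `solve` names) is invented for illustration. The depth-first version is the common stack-based variant rather than a recursive one, and the only difference between the two searches is which end of the frontier the next cell is taken from.

```python
from collections import deque

# '#' is a wall, 'S' the start, 'E' the exit. The outer wall keeps
# every neighbor lookup inside the grid.
MAZE = [
    "#######",
    "#S..#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###E#",
    "#######",
]

def find(maze, ch):
    """Return (row, col) of the first occurrence of `ch`."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")

def solve(maze, depth_first=False):
    """Return a list of cells from 'S' to 'E', or None if no route exists."""
    start, goal = find(maze, "S"), find(maze, "E")
    frontier = deque([start])
    parent = {start: None}   # doubles as the set of discovered cells
    while frontier:
        # DFS takes the newest cell (stack), BFS the oldest (queue).
        cell = frontier.pop() if depth_first else frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:      # walk back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if maze[nxt[0]][nxt[1]] != "#" and nxt not in parent:
                parent[nxt] = cell
                frontier.append(nxt)
    return None

bfs_path = solve(MAZE)
dfs_path = solve(MAZE, depth_first=True)
print(len(bfs_path), len(dfs_path))
```

BFS always returns a route with the fewest steps, while DFS returns whichever route it happens to complete first. In this particular maze there is only one route, so both end up with the same path.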

I wasn’t so much into maze solving, so I started thinking about what else I could do with these algorithms. One idea occurred to me: I could view an image as a graph where each pixel corresponds to a node and each node is linked to its neighboring pixels. Then I could try to color the picture pixel by pixel, using one of the two algorithms not to search for anything but simply to go through the whole image. The idea was that when we first visit a pixel, we color it by calculating the average color of the neighboring pixels that have already been colored and adding a small deviation to that average.
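The coloring rule can be sketched as follows. This is a guess at one possible implementation, not the original program: the function name `paint`, the four-pixel neighborhood, the random color of the starting pixel, and the uniform integer noise controlled by `spread` are all assumptions made for the example.

```python
import random
from collections import deque

def paint(width, height, depth_first=False, spread=12, seed=0):
    """Color a width x height grid of RGB tuples by graph traversal."""
    rng = random.Random(seed)
    image = [[None] * width for _ in range(height)]

    # Assumption: start in the middle with a random color.
    start = (height // 2, width // 2)
    image[start[0]][start[1]] = tuple(rng.randrange(256) for _ in range(3))

    frontier = deque([start])
    queued = {start}
    while frontier:
        r, c = frontier.pop() if depth_first else frontier.popleft()
        neighbors = [
            (nr, nc)
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if 0 <= nr < height and 0 <= nc < width
        ]
        if image[r][c] is None:
            # At least one neighbor is colored: the one that queued us.
            colored = [image[nr][nc] for nr, nc in neighbors
                       if image[nr][nc] is not None]
            image[r][c] = tuple(
                # Per-channel average plus a small random deviation,
                # clamped to the valid 0..255 range.
                min(255, max(0, sum(channel) // len(colored)
                             + rng.randint(-spread, spread)))
                for channel in zip(*colored)
            )
        for cell in neighbors:
            if cell not in queued:
                queued.add(cell)
                frontier.append(cell)
    return image

bfs_image = paint(64, 48)
dfs_image = paint(64, 48, depth_first=True)
```

The result is a plain list of rows, so it can be passed to any image library or written out as a PPM file for viewing. A larger `spread` makes the colors drift faster as the traversal moves away from the starting pixel.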

Given how simple the algorithm is, I was pretty awestruck when I first ran it and the output was something like this:

Then I modified the picture-generating program so that it could run several BFS and DFS instances simultaneously with different parameters, in order to get more diverse pictures. Here are two examples of what I got:

Finally, during the first year of my university studies, I put together a simple piece of software that lets me draw these pictures more interactively. Here’s a link to a video of that software animating the formation of a picture:

http://www.youtube.com/watch?v=jc7NQ3sNN58

To wrap it up: a system with very simple rules can exhibit surprisingly interesting behavior!