OpenRefine, formerly known as Google Refine and even more formerly as Freebase Gridworks, is a versatile tool for cleaning and exploring data. There is no shortage of online material, as you can see from the extensive list of external resources on the OpenRefine GitHub wiki; the School of Data tutorials are just one example.
The newly released book Using OpenRefine by Packt Publishing is a compact overview of the subject. The content is organized around the idea of a cookbook, which has become almost a standard format for learning digital skills. One of the ideas behind the concept is that as a reader you are a client of the text; usage has to be pragmatic, multi-purpose and swift. Independent chunks (recipes) make the material easier to approach.
With the book you will also get something to cook on: an old version – meaning that it is still relatively messy and needs cleaning – of the Australian Powerhouse Museum Collection dataset.
The Appendix is titled Regular Expressions and GREL (General Refine Expression Language). Familiarity with both will make you an OpenRefine power user, although simpler jobs can be accomplished even without them. In its present form, the 10-page appendix is a bit disappointing. It does not show you the OpenRefine flavour of regexp, nor does it give a coherent introduction to GREL. With fewer regexp basics and more GREL, the usability of the section would be much improved. Still, the appetite has been whetted and can be satisfied online.
The real value of this small book lies in how it shows, in a few pages and with real data, how burdensome cleaning in fact is. Sophisticated clustering algorithms and reconciliation services help only so much. As an interesting historical bonus, the Foreword is written by David Huynh, one of the pioneers who created Freebase Gridworks.
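To make the point about clustering concrete, here is a minimal Python sketch of a key-collision "fingerprint" method of the kind OpenRefine offers. It is a simplified approximation written for this review, not OpenRefine's actual code, and the sample values are invented rather than taken from the Powerhouse dataset. The function names `fingerprint` and `cluster` are likewise my own.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified fingerprint key (an approximation, not OpenRefine's exact code)."""
    # Fold accented characters to their ASCII base letters.
    ascii_value = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    # Trim, lowercase, and drop ASCII punctuation.
    cleaned = ascii_value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    # Split on whitespace, de-duplicate and sort the tokens, then rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Invented sample values (assumption: not from the book's dataset).
samples = [
    "Glass plate negative",
    "glass plate negative.",
    "Negative, glass plate",
    "Glass plate negatives",  # plural: different key
    "Glas plate negative",    # typo: different key
]
clusters = cluster(samples)
print(clusters)
```

The first three variants collapse into one cluster, while the plural and the misspelling are left behind for another method or for manual cleaning, which is exactly the kind of leftover work the book makes visible.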
This digital book review was kindly made possible by a free copy from Packt Publishing.