On “Preparing ‘Messy Data’ with OpenRefine: A Workshop”

The fourth Digital Scholars group meeting of the semester consisted of a workshop led by Dr. Richard J. Urban (School of Information, Florida State University) on the possibilities of OpenRefine as a data-management tool. The workshop focused on OpenRefine's capabilities for polishing and improving the presentation, structuring, and grouping of data on several fronts: namely, to "Remove duplicate records, separate multiple values contained in the same field, analyze the distribution of values throughout a data set and group together different representations of the same reality". Dr. Urban based the session on two tutorials: one by Michigan State University's Digital Scholarship librarian, Thomas Padilla, and another by Drs. Van Hooland, Verborgh and De Wilde, from different Belgian universities and institutes.

The workshop's main concern was the preparation of tidy data: achieving a structure that brings a variety of categories together and presents them in an accessible way. As Hadley Wickham points out in his paper "Tidy Data", data preparation is not only a first step in data analysis but a recurrent activity throughout the process as a whole; new data constantly appears and must be incorporated, and this work can take up to 80% of the time devoted to the analysis itself. Under these circumstances, a sound method and adequate tools seem indispensable, both for efficiency and for the success of the investigation.

OpenRefine, one of the so-called IDTs, or Interactive Data Transformation tools (a denomination that includes others such as Potter's Wheel ABC and Wrangler), helps with frequent errors in data sets, such as blank cells, duplicates or spelling inconsistencies, and it does so through four fundamental operations: faceting, filtering, clustering and transforming. Although OpenRefine also allows users to combine data sets with open data, reconciling them against existing knowledge bases, both this linking to external concepts and authorities and the process of named-entity recognition (NER) depend on first achieving a well-articulated, coherent data set.

Using a sample data set on comics developed and gathered by the British Library, Dr. Urban showed the audience how to perform these data-polishing tasks with OpenRefine. First, we observed how the facet function on each column identifies inconsistencies and repetitions; for instance, in the publisher column we could see that the publishing house Titan appeared in the data set under a number of variants (Titan], Titan., Titans, etc.). Through the facet function the user can rewrite these entries so that the variants are no longer counted as different publishers. The filter function adds a text filter that locates variants of a given value. After faceting, some variant spellings may persist; these are easy to identify with the filter, which displays the remaining defective, repeated records.
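Outside OpenRefine, the facet and filter steps amount to counting distinct values and narrowing them by substring. A minimal sketch in plain Python, using made-up publisher values as stand-ins for the British Library records:

```python
from collections import Counter

# Illustrative stand-in for a "publisher" column; these are NOT
# the actual British Library records, just hypothetical variants.
publishers = [
    "Titan", "Titan]", "Titan.", "Titans",
    "Titan", "Dark Horse", "Dark Horse",
]

# A text facet is essentially a count of distinct values: each
# spelling variant shows up as its own bucket.
facet = Counter(publishers)
print(facet)

# A text filter narrows the facet to values containing a substring,
# which surfaces variants the eye might miss in a long list.
filtered = {value: n for value, n in facet.items()
            if "titan" in value.lower()}
print(filtered)
```

Here the facet reveals four distinct spellings of Titan, and the case-insensitive filter isolates them while leaving other publishers out of view.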

The cluster function helps locate patterns of variation, so that the user does not have to eliminate inconsistencies one by one through facets or filters. OpenRefine displays each cluster's size and the different values it contains, and allows the user to merge them into one single denomination. For instance, the nearly 4,000 records that appear in the data set under the variant spellings of Titan mentioned above can be rewritten as "Titan" in a single operation. As one of the tutorials points out, the scope of these changes should not be a concern, given that OpenRefine keeps a record of changes and allows the user to revert to previously saved versions of the project. In the same vein, the tool offers a transformation function, with which the user can modify data, either by eliminating whitespace that could cause a category of information to be duplicated, or by means of the General Refine Expression Language (GREL). The tutorial's example focused on removing periods that could compromise an adequate registration of the categories.
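One of OpenRefine's clustering methods is "fingerprint" key collision: values are lowercased, stripped of punctuation, and reduced to their sorted unique tokens, so that variants with the same fingerprint fall into one cluster. The sketch below is my own simplified reimplementation of that idea, not OpenRefine's actual code, again using hypothetical publisher values:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Simplified fingerprint keying: lowercase, drop punctuation,
    then join the sorted unique whitespace-separated tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group values whose fingerprints collide; return only the
    groups with more than one member (the candidate clusters)."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

publishers = ["Titan", "Titan]", "titan.", "Titan Books", "Dark Horse"]
print(cluster(publishers))  # [['Titan', 'Titan]', 'titan.']]
```

Note that a variant like "Titans" has a different fingerprint and would not collide here; catching it requires the fuzzier nearest-neighbour methods OpenRefine also offers. The transform step described above corresponds in GREL to expressions such as `value.trim()` or `value.replace(".", "")`.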

Lastly, Dr. Urban left room for questions and comments in which the audience discussed their projects and the ways OpenRefine could benefit them; for instance, one question concerned how to handle spaces (a critical aspect of the tool at several levels) in languages whose writing systems do not use them, such as Japanese. Some issues and flaws of the tool were also mentioned; perhaps the most significant is that very large data sets can prevent OpenRefine from functioning properly, a limitation that calls for an upgrade.
