Linking Library Data (Part 2)

Trevor Thornton then introduced his project, which involved linking data in archives and establishing links in archival storage systems to open data systems.

The NYPL received a series of private grants to digitize a variety of data from manuscripts and archives. The project focused on several elements:

-Linking archival data to open source GUIs;

-Redesigning a web-based user interface to take advantage of linked open data;

-Establishing links between the appropriate collections and open data sources.

Their first step was to focus on the personal names that appeared in the archival descriptions. Through Library of Congress URIs* they could link LC authority records to clusters of IDs that collectively represent the name in question.

The Samuel J. Tilden Papers was the first collection Trevor’s team worked on. Interestingly, LC and Wikipedia were both considered valid access points, with correspondence files used to provide additional access points where needed. Ultimately, 1300 personal names and 100 corporate names emerged from the linking work. That done, name authority control utilities streamlined the process and distributed the work among the researchers.

With the model established, the next project was a bit more involved but went more quickly. This time they turned to the Thomas Addis Emmet collection, which had been donated to the library in 1896. The documents involve all the founding fathers, including reprints of historical documentation. One of the examples of the emergent model that Trevor showed us was a calendar to the Emmet collection, including a letter from Thomas Jefferson to John Adams describing how all the newspapers in the colonies hated each other.

Google also became an important part of the process, used to refine data, i.e., to clean up dirty data in large sets. The addition allows one to reduce large collections of dirty data values to more uniform values. In the end, they had 3000 personal names.
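To make the "dirty data" idea a little more concrete, here is a minimal sketch of one common way to collapse variant spellings of a name into a single value. This is not necessarily what Trevor's team did. It assumes a simple "fingerprint" approach (lowercase, strip punctuation, sort the words), and the function names and sample values are made up for illustration.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(name: str) -> str:
    """Reduce a name to a key that ignores case, accents, punctuation and word order."""
    # Drop accents by decomposing characters and discarding non-ASCII marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Lowercase and remove punctuation such as commas and periods.
    cleaned = ascii_name.lower().translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so "Thomas Jefferson" and "Jefferson, Thomas" collide.
    return " ".join(sorted(set(cleaned.split())))


def cluster_names(values):
    """Group values by fingerprint and choose the most frequent spelling per group."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return {key: Counter(members).most_common(1)[0][0] for key, members in groups.items()}


# Illustrative sample values only (not taken from the Emmet collection data).
raw = [
    "Jefferson, Thomas",
    "Thomas Jefferson",
    "jefferson, thomas.",
    "Jefferson, Thomas",
    "Adams, John",
]
canonical = cluster_names(raw)
cleaned = [canonical[fingerprint(value)] for value in raw]
print(cleaned)
```

A human would still want to review each cluster before accepting it, since two different people can easily share a fingerprint.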

The big lesson: discrete data can and will eventually give way to open frameworks as more and more private data supplies become available for use by those open frameworks.


*URI = Uniform Resource Identifier. A URI identifies a particular resource; a Uniform Resource Locator (URL) is the kind of URI that also gives the location from which that resource can be retrieved.

Finally, Christina Pattuelli spoke about her own linked data work on the Linked Jazz Project.

The main thrust of this project was the idea that linked open data brings disparate data published online into a single global dataspace. New links create newly navigable paths and new interpretations of data in an emergent web of relationships. The ultimate goal was to contribute to a linked open data cloud (LOD Cloud). The phrase Christina used to bring this home was “Sharing Reuse Integration”.

Legacy data allowed the cloud to grow 100 pieces, but theoretically, the only limits to such a database would be storage space, bandwidth, and maintenance (read: labor) costs.

The resulting Web of linked data made use of documents on the web, linking networks of people to networks of information, connective creativity being the pathways between each discrete item.

As the title of the project suggests, they used jazz musicians as their points of access: the statements of musicians were used as data sets. For instance, statements by Mary Lou Williams, Marian McPartland, Count Basie, and Art Williams became linked wherever one musician mentioned another by name. (Think of it as a running interactive record of mutual citation.)
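For the curious, the "mutual citation" idea can be sketched in a few lines. This is only an illustration under my own assumptions, not the Linked Jazz code: the transcript snippets are invented placeholders, and build_mention_graph is a hypothetical helper that simply looks for exact name matches.

```python
from collections import defaultdict

# Invented placeholder text, NOT real quotations from any interview.
transcripts = {
    "Mary Lou Williams": "I remember hearing Count Basie play in Kansas City.",
    "Marian McPartland": "Mary Lou Williams came by, and we talked about Count Basie.",
    "Count Basie": "We just played the blues every night.",
}


def build_mention_graph(transcripts):
    """Link each speaker to every other known musician named in their transcript."""
    names = list(transcripts)
    graph = defaultdict(set)
    for speaker, text in transcripts.items():
        lowered = text.lower()
        for name in names:
            # Skip self-mentions; match names case-insensitively.
            if name != speaker and name.lower() in lowered:
                graph[speaker].add(name)
    # Sort the mentions so the output is stable and easy to read.
    return {speaker: sorted(mentioned) for speaker, mentioned in graph.items()}


graph = build_mention_graph(transcripts)
for speaker, mentioned in graph.items():
    print(speaker, "->", ", ".join(mentioned))
```

Exact matching like this would miss nicknames and misspellings, which is presumably why the project needed the name directory and mapping tools described next.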

Once the relationships were established, they sat down to begin building an application to use as a distributed platform. It wasn’t easy. The Linked Jazz name directory had to build a controlled vocabulary of jazz artists’ names from scratch, using DBpedia as a semantic hub. A personal name mapping tool was created by extracting names from DBpedia and matching them against authority names. Integration with alternate names was achieved with a transcript analyzer, which led to the use of another tool that mapped names to authority files within a given context.
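A rough sketch of what mapping alternate names to an authority list might look like follows. Everything here is an assumption for illustration: the directory entries, the example.org URIs, the resolve function, and the use of Python's difflib for fuzzy matching are mine, not the project's actual tools.

```python
import difflib

# Hypothetical authority directory: preferred name -> placeholder URI.
AUTHORITY = {
    "Mary Lou Williams": "http://example.org/artist/mary-lou-williams",
    "Marian McPartland": "http://example.org/artist/marian-mcpartland",
    "Count Basie": "http://example.org/artist/count-basie",
}

# Illustrative alternate forms that point at a preferred name.
ALTERNATES = {
    "William Basie": "Count Basie",
    "Bill Basie": "Count Basie",
}


def resolve(name, cutoff=0.8):
    """Return the preferred authority name for a name found in a transcript, or None."""
    if name in AUTHORITY:
        return name
    if name in ALTERNATES:
        return ALTERNATES[name]
    # Fall back to fuzzy matching to catch misspellings in transcripts.
    candidates = difflib.get_close_matches(name, list(AUTHORITY), n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


for found in ["Bill Basie", "Marian McPartlend", "Thomas Jefferson"]:
    preferred = resolve(found)
    print(found, "->", preferred, AUTHORITY.get(preferred))
```

The cutoff is a judgment call: set it too low and unrelated names start matching, set it too high and ordinary typos slip through unresolved.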

The final result was the Linked Jazz Visualizer, an interactive tool that had no need for plugins or downloads.

Take a look at the final result on the Linked Jazz website.



