Building a Microbial Genetics Thesaurus

During the spring quarter of my first year at the iSchool, I took a course in building a thesaurus (LIS 537). Working in a group of three MLIS students, I helped build a thesaurus from the ground up. Given a choice of several domains, our group chose "Microbial Genetics" as our focus.

This turned out to be a very challenging domain, as none of us was an expert in either microbiology or genetics. Fortunately, we were able to find a microbial genetics expert to assist us with our domain analysis and with developing our terms and relationships. One of the advantages of building a thesaurus for a scientific domain is that the terminology is fairly static and the boundaries are well defined. Building a thesaurus for a domain so far outside our areas of expertise also helped us focus on the process of thesaurus construction rather than on the specific terms.

After establishing the boundaries of the field, we began collecting our terms. The entire group participated in this process. I read through microbial genetics textbooks and websites, as well as relevant scientific articles. I then collected all of our terms into the term collection spreadsheet given to us by our instructor, eliminating duplicates and grouping variant spellings.
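The duplicate elimination and variant grouping described above was done by hand in the spreadsheet, but the same idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (normalize, group_variants) and the sample terms are hypothetical, and the rule that spellings differing only in case, spacing, or hyphenation count as variants is an assumption, not the rule our group actually used.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Build a comparison key: case-folded, with spaces and hyphens removed."""
    return re.sub(r"[\s\-]+", "", term.casefold())


def group_variants(terms):
    """Map each comparison key to the distinct spellings seen, in first-seen order."""
    groups = defaultdict(list)
    for term in terms:
        cleaned = " ".join(term.split())  # collapse stray whitespace
        key = normalize(cleaned)
        if cleaned not in groups[key]:  # skip exact duplicates
            groups[key].append(cleaned)
    return dict(groups)


# Hypothetical sample of collected terms (not taken from our spreadsheet).
collected = ["wild type", "Wild-type", "plasmid", "wild type", "Plasmid ", "operon"]
groups = group_variants(collected)

for key, spellings in groups.items():
    print(key, "->", spellings)
```

Each group still needs a person to pick the preferred spelling; the sketch only gathers the candidates side by side.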

The next steps in building the thesaurus were to select which forms of our terms to use (plural or singular), to identify preferred terms and to create the relationships. This was an iterative process for our group. Each member of the group had primary responsibility for a task, then circulated the spreadsheet to the other group members for comment.

One of the main tasks for which I was responsible was the creation of relationships. Our users were undergraduates enrolled in their first microbial genetics course, so I decided to do a significant amount of upward posting: very specific terms were kept only as entry terms that point to a broader preferred term. This simplified the thesaurus while allowing users to get a better feel for the overall nature of the field. I did this by reading the definitions of each of the terms and making the connections between them.
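Upward posting can be pictured as a small transformation over the broader-term links. The sketch below is a hypothetical illustration, not the procedure I followed (which was manual): the function name post_upward, the broader mapping, and the three sample terms are all assumptions chosen for the example rather than entries from our thesaurus.

```python
def post_upward(broader, keep):
    """Return {non-preferred term: preferred term} for every term not in `keep`.

    `broader` maps each term to its broader term (None at the top level).
    Each dropped term is posted to its nearest ancestor that is kept.
    """
    use = {}
    for term in broader:
        if term in keep:
            continue
        ancestor = broader[term]
        while ancestor is not None and ancestor not in keep:
            ancestor = broader.get(ancestor)
        if ancestor is not None:
            use[term] = ancestor
    return use


# Hypothetical fragment of a hierarchy (illustrative terms only).
broader = {
    "mutations": None,
    "point mutations": "mutations",
    "missense mutations": "point mutations",
    "nonsense mutations": "point mutations",
}
keep = {"mutations", "point mutations"}

use = post_upward(broader, keep)
for term, preferred in use.items():
    print(f"{term} USE {preferred}")
```

The resulting pairs correspond to USE references: a student who looks up one of the narrower terms is sent to the broader preferred term instead of finding a separate entry.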

My final major contribution to the thesaurus project was the creation of the actual thesaurus document. I saved the Excel spreadsheet as a SpreadsheetML (Excel's XML format) document, then created an XSLT stylesheet to convert the XML to WordprocessingML (Word's XML format), adding "bookmarks" (Word's version of HTML anchor tags). Our final document had links between terms. Here is a sample entry from the Alphabetical Schedule:

Each of the relationship links (BT, NT, RT) goes to the term's entry in the Alphabetical Schedule. The numbered link after the term takes the user to the term's entry in the Hierarchical Schedule, which displays the faceted hierarchy of the terms. A sample hierarchical entry:

Within the context of the course, there wasn't time to test our thesaurus with a user group, but I did meet one final time with our expert to review the thesaurus. There were a couple of terms he thought should be moved to another place in the hierarchy, but overall he was very impressed with our work.

This experience provided a significant intellectual challenge, as I was pushed both to understand an entirely new domain and to synthesize information about thesaurus construction. The process required a thorough understanding of the field of microbial genetics as well as of how a thesaurus is constructed. While there are rules for how to create a thesaurus, there is also a great reliance on "editorial judgment," as Trent repeatedly told us. This is incredibly challenging for a first-time thesaurus developer. How do you know if your judgment is correct? We simply had to trust our own instincts.

Comments from Trent Hill, LIS 537 instructor:

Here's your superlative thesaurus, with my comments.  Outstanding work, some of the best I've seen in any version of this class!
Take care, Trent

Thesaurus (.doc)

   
    brolland *at* u.washington.edu