
Alternative search, part 2: Analyzing a document

Analyzing the metadata in a document is a fairly straightforward process. However, analyzing the document itself is a little messier.

Document analysis is nothing new: people have been programming computers to refine full text analysis since the days when full text first started to appear. As hard drive space became cheaper and computers became more powerful, new documents began to be stored and new ways to analyze them were developed. In Information Storage and Retrieval, Korfhage mentions several of these methods, but things have evolved quite a bit since then. Besides the methods mentioned below, specialized retrieval systems, such as the face recognition used in law enforcement, have been developed, but I will focus on a few technologies available to the general public.

Text Analysis

Text analysis in documents has come a long way since the inclusion of the first full text documents in databases. Search engines have become quite good at parsing the full text of web pages, as well as using hypertext and other measures to determine what a page is about. With the advent of more and more full electronic text, scholars have started to study ways to use text analysis on literary works. One such project is the Mellon-funded MONK project. Sites are starting to work text analysis into their search and browsing features as well.

The Willa Cather Archive offers a feature to perform in-depth text analysis on all of Cather’s books using a program called TokenX. This process is different from simply searching for a term because you can do new things, such as compare the use of words across books and show the words in context for the user. These kinds of analyses give scholars new ways to study literature.

Cather Archive text analysis powered by TokenX, search results view.

Cather Archive text analysis powered by TokenX, words in context view.

Another now common way to analyze documents is to create a word cloud of common words. Word clouds are usually made up of user entered metadata such as tags, and are less often built from entire documents. The reason is fairly obvious when one sees such a cloud: words like “a,” “an,” “the,” and “that” end up being the largest words in the cloud because they are the most common. However, a word cloud can be a useful way to browse even full text documents if the words that do not add meaning to the cloud are carefully filtered out. The website “The Mountain Meadows Massacre in public discourse” does this in one of its visualizations, offering a view of common words used in articles about the Mountain Meadows Massacre. Another site that uses this technique is a search engine called Quintura (Fig. 15). Quintura analyzes the results from a web search and creates a word cloud of corresponding terms. Users can click on words to add or subtract them from a search. This may be more intuitive for users who don’t know how to use an advanced search.
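The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not how either site actually builds its cloud: the stop word list is a tiny made-up sample, and the function name `cloud_weights` and the sample sentence are my own.

```python
import re
from collections import Counter

# Illustrative stop list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "that", "and", "of", "in", "to", "is", "was"}


def cloud_weights(text, top_n=5, stop_words=STOP_WORDS):
    """Return the top_n (word, count) pairs after dropping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(top_n)


# Made-up sample text for demonstration.
sample = (
    "The wagon train and the militia met in the meadow. "
    "The militia attacked the wagon train."
)
print(cloud_weights(sample, top_n=3))
```

Without the filter, “the” would be the largest word in this tiny cloud; with it, the counts that remain could be mapped directly to font sizes.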

“The Mountain Meadows Massacre in public discourse” word cloud. Quintura search engine.

Multimedia Document Analysis

Although text analysis has been around for a while, it is only recently that computers have been able to analyze image and sound documents. It is not that such a search is impossible. In fact, Korfhage reported that work on such analysis was already beginning in 1997, but it is extremely computationally intensive and complex. As Korfhage notes, the range of transformations something might go through is enormous: a picture of a bridge can be taken from above, from below, from the side, or from on the bridge itself. It might be a sketch or a photograph. Also, there are hundreds of types of bridges (p. 249). Asking a computer to identify a bridge in an image is still a long way off and may never happen. However, other kinds of image analysis are possible and even easy using computers.

Color is one thing that is easy enough to analyze using a computer. The computer can select areas of a picture, average the colors, and match those colors to a user provided hue. This allows for some interesting image analysis that aids both browsing and finding. One such site is called Flickr Colr Pickr. The navigation in this site is simple: choose a color from the color wheel, and the engine returns results that match the color. Another search that uses the Flickr API is called Retrievr, which allows for an even more complex query: it lets the user draw a picture to return pictures that resemble the drawing. This may work well when looking for photos of a sunset or the ocean, and less well for images of a dog. Retrievr is based on research by Chuck Jacobs, Adam Finkelstein and David Salesin, who created an algorithm which is “simple, requires very little storage overhead for the database of signatures, and is fast” (Jacobs, Finkelstein, & Salesin, 1995, p. 277).
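To make the “average the colors and match a hue” idea concrete, here is a minimal sketch. It is not the code behind Colr Pickr or Retrievr (Retrievr’s wavelet-based approach is considerably more involved). It assumes each image has already been reduced to a short list of RGB pixels, and the function names, the sample images, and the 30 degree tolerance are all invented for illustration.

```python
import colorsys


def average_color(pixels):
    """Average a list of (r, g, b) tuples with 0-255 channels."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def hue_of(rgb):
    """Hue in degrees (0-360) of an (r, g, b) color."""
    r, g, b = (c / 255 for c in rgb)
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360


def hue_distance(h1, h2):
    """Shortest distance between two hues on the color wheel."""
    d = abs(h1 - h2) % 360
    return min(d, 360 - d)


def match_by_hue(images, target_hue, tolerance=30):
    """Names of images whose average hue is within tolerance, closest first."""
    scored = []
    for name, pixels in images.items():
        d = hue_distance(hue_of(average_color(pixels)), target_hue)
        if d <= tolerance:
            scored.append((d, name))
    return [name for _d, name in sorted(scored)]


# Hypothetical images, each reduced to two sample pixels.
images = {
    "sunset": [(250, 120, 20), (240, 90, 30)],
    "ocean": [(10, 80, 200), (20, 100, 220)],
    "lawn": [(30, 160, 40), (50, 180, 60)],
}
print(match_by_hue(images, target_hue=220))  # a blue picked on the wheel
```

Averaging a whole image throws away a lot of information (a photo that is half red and half green averages to a muddy color), which is one reason this kind of search suits browsing better than finding.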

Flickr Color Fields allows searching Flickr photos by color. Retrievr matches photos to a drawing.

The above means of finding photos work well for browsing, but not as well for finding. One application that could prove very useful for finding is demonstrated by Dave Pattern (based on earlier experiments by Tim Hodson) (clarified thanks to Tim’s comment below) in an experimental site which lets you search for a book by color. Tim Hodson explained the usefulness of such a feature in a blog post. Imagine a patron asking “I heard about a book three months ago. I can’t remember who wrote it or what it was called, but it was blue” (Hodson, 2008, para. 3). Pattern goes on to describe the process for searching book covers in the same way Retrievr searches Flickr images: “The search works by comparing the hex colours of the 8×8 version of the search image with the corresponding pixels of the book covers. Each book cover then gets ranked by how well it matches the search image” (Pattern, 2007, para. 8). Etsy has yet another fun way to search for products with its Colors search. Pick a color and Etsy will show you photos of products whose colors match your request. Although this isn’t a perfect method, it is an innovative way to search products.
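Pattern’s description translates fairly directly into code. The sketch below is my own reading of that quote, not his implementation: it assumes each cover has already been shrunk to an 8×8 thumbnail stored as 64 hex colour strings, and it scores a match by summing the per-channel differences between corresponding pixels, which is one plausible measure among many. The cover names and colours are invented.

```python
def hex_to_rgb(hex_color):
    """Convert 'rrggbb' (with optional leading '#') to an (r, g, b) tuple."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def thumbnail_distance(query, cover):
    """Sum of per-pixel channel differences between two thumbnails."""
    if len(query) != len(cover):
        raise ValueError("thumbnails must have the same number of pixels")
    total = 0
    for q, c in zip(query, cover):
        total += sum(abs(a - b) for a, b in zip(hex_to_rgb(q), hex_to_rgb(c)))
    return total


def rank_covers(query, covers):
    """Return cover titles ordered from best match to worst."""
    return sorted(covers, key=lambda title: thumbnail_distance(query, covers[title]))


# Hypothetical 8x8 thumbnails, flattened to 64 hex colours each.
blue_query = ["0000ff"] * 64
covers = {
    "Mostly red cover": ["cc0000"] * 64,
    "Navy cover": ["000080"] * 64,
    "Sky blue cover": ["3366ff"] * 64,
}
print(rank_covers(blue_query, covers))
```

Because only 64 pixels per cover are compared, a search like this stays cheap even across a large catalog, at the cost of ignoring all fine detail.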

Dave Pattern’s demo of a book search by color. Etsy Colors.

One website, called like.com, uses several methods to help the user find a good result. Like.com might be one of the first applications of research performed by Wei-Ying Ma and B. S. Manjunath in 1999, promising the ability to “retrieve all images that contain regions that have the color of object A, texture of object B, shape of object C, and lie in the upper of the image” (p. 184). It not only uses existing metadata, as mentioned above, but also uses image analysis to find similar products. In the example picture, a small box is drawn around part of the product, and the engine finds products similar in style or color. The user can then refine by style, color, and other options. This kind of innovative searching is likely to become more and more common.

Like.com lets you search by drawing a box around the part of the item you like.

Though full text document analysis is exciting, things really start to get interesting when sites allow for user added metadata and use that data to provide ever better search results. That’ll be the next (and last) part in the series.

Bibliography:

Hodson, T. (2008, March 6). Colourphon: cooking up something interesting. Information Takes Over. Retrieved April 28, 2008, from http://informationtakesover.co.uk/archives/2008/03/06/colourphon-cooking-up-something-interesting/.

Jacobs, C. E., Finkelstein, A., & Salesin, D. H. (1995). Fast multiresolution image querying. Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, 277-286.

Korfhage, R. (1997). Information storage and retrieval. New York: Wiley Computer Pub.

Ma, W. Y., & Manjunath, B. S. (1999). NeTra: A toolbox for navigating large image databases. Multimedia Systems, 7, 184-198.

Pattern, D. (2007, February 1). Michael Stephens = Norman Bates?!? Self-plagiarism is style. Retrieved April 28, 2008, from http://www.daveyp.com/blog/index.php/archives/172/.

Alternative search, part 1: Using existing metadata or data

Yesterday, I mentioned three types of alternative search:

  • Search that uses existing human or computer supplied metadata to find and display information.
  • Search that analyzes a document’s contents to return a result.
  • Search that relies on user added metadata.

Today, I’m talking about the first technique: using existing metadata in new ways to facilitate finding and browsing. Most files have some metadata attached: the date the file was created, the date it was last altered, the owner’s name, categories, etc. Many scholarly projects have extra metadata associated with them, expertly researched or generated. Library online catalogs also have rich metadata for holdings. Many systems have rich metadata, but don’t use it in a way that helps users find what they are looking for.

Mapping

One example of using existing metadata to help a user find what they want is mapping technologies. Take crime statistics. Many police departments are now mapping crime data on interactive, online maps. The Lincoln City Police Department is doing just that:

Map of crime provided by the Lincoln, Nebraska Police Department.

Here we have already existing data from the police blotters, which has been accessible for quite some time. In years past, one could find police blotter information in the newspaper, and later these were moved online. While it was nice to have the information, and any citizen could scan the pages to find crime in his or her area, it was very difficult to answer the question “what crimes have taken place within a quarter mile of my house in the last 5 days?” By plotting the already existing information on a map, citizens can keep watch on crime in their area.
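The question in that paragraph is really just a filter over data the blotter already contains: a place and a time for each incident. A minimal sketch, assuming each incident has been geocoded to a latitude and longitude (the records, coordinates, and field names below are invented, and this is not how the police department’s map works):

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def nearby_recent(incidents, home, now, max_miles=0.25, max_days=5):
    """Incidents within max_miles of home during the last max_days."""
    cutoff = now - timedelta(days=max_days)
    return [
        i for i in incidents
        if i["when"] >= cutoff
        and miles_between(home[0], home[1], i["lat"], i["lon"]) <= max_miles
    ]


# Hypothetical data: a home address and three blotter entries.
now = datetime(2008, 4, 28, 12, 0)
home = (40.8136, -96.7026)
incidents = [
    {"type": "vandalism", "lat": 40.8150, "lon": -96.7026, "when": now - timedelta(days=2)},
    {"type": "burglary", "lat": 40.8300, "lon": -96.7026, "when": now - timedelta(days=1)},
    {"type": "theft", "lat": 40.8140, "lon": -96.7030, "when": now - timedelta(days=10)},
]
print([i["type"] for i in nearby_recent(incidents, home, now)])
```

Only the vandalism report is both close enough and recent enough; the burglary is more than a mile away and the theft is too old. A map does the same filtering visually, without the citizen having to phrase it as a query.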

The previous example helps the user find the answer to a specific question. Other map based systems help the user browse through materials and form new questions. This is often the case in systems for scholarly research papers. As an example, The Willa Cather Archive has a forthcoming feature which maps Willa Cather’s travels across the globe and links them both to time and to objects which include primarily letters and photos.

Screenshots of Willa Cather Archive production feature. (Feature not yet available.)

Called “Mapping a writer’s world: A Geographic Chronology of Willa Cather’s Life,” this feature allows a Willa Cather scholar to explore the archive’s collections not only through time, but through space as well. This allows the scholar to make connections that would be hard to make otherwise. The time component allows users to be led through Cather’s travels. This new view, a sort of geographic biography, brings a new perspective to Willa Cather’s life and may shatter the stereotype some have of a woman who lived her life on the plains. Similarly, map based views of documents on the site “Envisaging the west: Thomas Jefferson and the roots of Lewis and Clark” help the user find documents in space as well as time, and can help the user put documents together in new ways.

“Envisaging the West” Map view.

Etsy Geolocator.

Browsing by geography is not exclusive to scholarly works and crime maps. The commercial website Etsy, where users can buy and sell handmade goods, has several different search and browse methods, one of which lets you find items on a map. This feature may not help you find a specific item, but it can help you find items made in your own city, thereby supporting your local economy. The geography feature has the added benefit of helping sellers find local, dedicated buyers who can support a small business.

Time

Another kind of existing metadata that can be used to make finding and browsing easier is time. Most objects have some kind of time identifier: either a time stamp added by a computer or recording device (for example, most cameras automatically record the time a shot was taken in the metadata) or the date something was created, added later by a scholar.

Mechanically added time stamps are only marginally useful for historic objects, but for newer, born digital objects they can be very useful. For example, the aforementioned website Etsy provides another way of browsing items by sorting them by the most recently listed. This could be done through a simple list of items, but Etsy has added a 3-D component and an analog style clock that helps the user browse the items. A photo program called Picasa (offered by Google) sorts photos by date taken in the default view, offering the user a long list of chronologically ordered photos. This view depends not on the filename or title of a shot, but on the metadata embedded in the photo file.

Etsy Time Machine.

Many documents have a date associated with them that indicates when the item was created. For instance, Zotero, a scholarly citation management system, has a field for “date” where the date of an object can be entered. If the date is not supplied in the metadata, the user can add this information. This allows for a new way to view one’s collected resources: a timeline.

Zotero timeline, showing highlighted words. The Zotero timeline was created with the help of MIT’s SIMILE project.

The timeline interface also allows users to highlight items containing certain words, which lets the user do a quick check on their own sources to answer questions such as “did scholars stop using a certain term after the turn of the century?” This kind of question would be difficult to answer given the traditional list view. Another scholarly example of mapping existing metadata to a timeline is found in the Envisaging the West website, where documents have been mapped to a timeline and color coded. This allows the user to see at a glance where documents fall on the timeline and what types occurred when.
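The term question above boils down to grouping items by year and noting which ones contain the word. Here is a rough sketch of that idea. It is not Zotero’s or SIMILE’s code: the record format, the assumption that every date begins with a four digit year, and the sample titles are all mine.

```python
from collections import defaultdict


def highlight_by_year(items, term):
    """Map each year to (items containing term, total items)."""
    term = term.lower()
    tally = defaultdict(lambda: [0, 0])
    for item in items:
        year = int(item["date"][:4])  # assumes dates start with a four digit year
        tally[year][1] += 1
        if term in item["title"].lower():
            tally[year][0] += 1
    return {year: tuple(counts) for year, counts in sorted(tally.items())}


# Hypothetical library records.
library = [
    {"title": "Hypertext and the archive", "date": "1996-03"},
    {"title": "Hypertext fiction revisited", "date": "1998"},
    {"title": "Digital editions in practice", "date": "2003-11-02"},
    {"title": "Reading on the web", "date": "2003"},
]
print(highlight_by_year(library, "hypertext"))
```

In this made-up library the term appears in every 1990s item and in none of the 2003 items, which is the pattern the highlighted timeline would show at a glance.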

Envisaging the West timeline view.

Faceted Browsing

A final way that existing metadata might be used is to create a method for drilling down through results via faceted browsing. With this method, information about each item is extracted and offered to the viewer so they can navigate through results with ease. Faceted browsing helps both browsing and finding: the rich metadata offers new paths to follow that would otherwise stay hidden, and it can also assist in finding a specific item by breaking its aspects into categories. A few examples of these systems include a library catalog which uses extensive metadata to allow a user to navigate through items (see McMaster University Library catalog, below), or a shopping site that allows a user to set a number of controls to find exactly the item they want (see screenshots of Buzzillions and Volkswagen UK sites, below).
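At its core, faceted browsing needs two operations: counting how many items carry each metadata value (to build the sidebar) and narrowing the result set when the user picks a value. A minimal sketch, assuming records are plain dictionaries; the field names and catalog entries are invented, and a product like Endeca does far more than this:

```python
from collections import Counter


def facet_counts(items, facet_fields):
    """Count how many items carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for item in items:
        for field in facet_fields:
            if field in item:
                counts[field][item[field]] += 1
    return counts


def apply_facets(items, selected):
    """Keep only items matching every selected facet value."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in selected.items())
    ]


# Hypothetical catalog records.
catalog = [
    {"title": "Record 1", "format": "Book", "language": "English"},
    {"title": "Record 2", "format": "Book", "language": "English"},
    {"title": "Record 3", "format": "Audiobook", "language": "English"},
    {"title": "Record 4", "format": "Book", "language": "French"},
]

books = apply_facets(catalog, {"format": "Book"})
print(facet_counts(books, ["language"])["language"])
```

After the user clicks “Book,” the language facet is recounted against the narrowed set, so the sidebar always shows how many results each further click would leave.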

McMaster University catalog, powered by Endeca.

Buzzillions website, also powered by Endeca. (via Peter Morville’s Flickr Stream)

Volkswagen UK site. Users can move the sliders to control what cars are shown. (via Peter Morville’s Flickr Stream)

Tomorrow I will explore search that analyzes a document’s contents to return results.

Alternative search methods

The field of information storage and retrieval concentrates heavily on mathematical formulas for ideal retrieval, and while this is really fascinating (and way over my head) I am also interested in new methods that have been developed for information retrieval in the last five or so years. That’s not to say math is not involved in the new methods. It’s still there, but there are new methods of collecting and using metadata and analyzing materials that are surprisingly useful.

Types of Alternative Search

Alternative search technologies can be divided into a few distinct categories. There are more than I have listed, I’m sure, but these are the ones I am primarily interested in.

  • Many sites use existing human or computer supplied metadata to find and display information, but some sites are taking this approach beyond the traditional uses to create novel ways of finding information.
  • Some searches analyze a document’s contents (“document” is used in the loosest sense here, and is meant to include everything from text to sound, images, and video) to return a result. Text is traditionally used for this, but some aspects of images are very easily searched in this way. For instance, it is fairly easy to analyze a picture for an average color and search by colors nearby in the color spectrum.
  • A final method of search and retrieval is to rely on user added metadata. This form of search is becoming increasingly popular, and sites are inventing new ways to encourage users to supply their own metadata.

Two further distinctions in retrieval systems can be made: finding systems and browsing systems. Finding systems assist the user in finding a specific item, for instance, a picture of a cat. A finding system may also help answer a specific question. A browsing system helps the user find something, even if they are not exactly sure what they want. Browsing may also help the user make connections in a collection of documents, an especially useful attribute in online exhibits; in this way, browsing helps the user formulate a question rather than find an answer. A system that doesn’t work as a finding system may work wonderfully as a browsing system. One final note is that more and more systems use a combination of search techniques to find a relevant match.

Over the next few days I’ll examine a few sites that use existing metadata, the document’s content, or user supplied metadata to facilitate finding and browsing.

A few final words on Digital Humanities and Art History before I move on

Thanks to everyone who commented on my previous two posts. I’m still working these things out in my head, and am speaking from a very limited (and naive) perspective of only a handful of institutions and projects that I have seen.

One of the things I left out is that digital humanities centers are by no means the only entity that could help with digital projects or publications of art materials. This could also be accomplished through collaborations with other departments on campus (such as Computer Science)  or through a university press. I imagine that we’ll probably start to see a number of these collaborations at the same time.

The part of this that is stuck in my brain, and which I don’t have an answer for, is what one of these projects would look like. I now have an idea of what a history or literature project looks like, but not much of what an art history or especially a fine art project would look like. I have seen a few examples of art history sites, and just presenting the images as one would in a book is somehow a bit of a letdown. But I don’t know what it is that I expect to be different. As for fine art, I have seen several fine art projects on the internet, and again, I always think something is somehow missing. I’m going to ponder this and research more and come up with some links and ideas.

More Thoughts on Digital Humanities and Fine Arts

After more thought about the previous post, I think my question is:

Should digital humanities centers take it upon themselves to encourage fine art and art history faculty to create digital projects?

That would probably involve searching for funding from different venues and changing some assumptions, but I certainly think it is possible. It might mean specifically reaching out to fine art and art history faculty and demonstrating what a digital humanities center can do for them. More than just getting images on the web, it would mean a new kind of exploration for art history and fine art. Imagine an art history digital project illustrated with beautiful, high resolution zoomable (and downloadable) images that explain a concept better than static text ever could. Or a faculty artist’s web page which explores the meaning of the work in depth with (again) high resolution images interwoven with text and multimedia that brings the work alive. Better yet, imagine at least some of that content released under a license so others can reuse it, at least for educational purposes.

Ben noted in the comments of the last post that very few images that come up in a Google image search for an artist come from .edu domains. That does not surprise me—many artists and curators, especially in the academic realm, are nervous about posting images online and are stingy with high resolution images. However, what is considered high resolution has changed. I think of high resolution as above 1200×900—but many images on museum websites are around 300 pixels. Some museums sell high quality copies, but they could provide a nice big resolution and still sell the REALLY high resolution photo. Museum websites often are also stingy about letting you download images for your own use.

Ben also commented that some projects might be squashed by university lawyers. I think that is absolutely true, but that has been true for digital humanities in general. One of the great things about these centers is that they are constantly looking for materials to publish online, and will push for access for all. This is important because if we (as a society) don’t push for fair use from copyright holders, the copyright holders will take advantage and achieve ever more restrictions on use. This is true for books as well as paintings, but books, of course, are easier to deal with, because there are multiple copies. So we can go ahead and digitize that book that is clear of copyright, because it can be bought for a decent price, or our library already has a copy. With paintings, however, it’s trickier. Many museums disallow photography in all galleries, even if some galleries contain out of copyright works. This is all the more reason, I think, for digital humanities centers to step in, especially on campuses that hold works of art.

Ira Greenburg also left a great comment, saying:

Where I teach, “digital” seems to get inserted into every conversation these days – ranging in tone from vitriolic to sacrosanct. As a painter turned programmer (I still consider myself an artist), I find the debate tiresome and primarily fueled by ignorance on both sides.

I totally agree with this. I sometimes question whether digital humanities centers will continue past the next 10 or 20 years because I hope, eventually, that the facilities to create digital works, projects, and research, will be prevalent in every department on campus. Right now, though, a faculty member who wants to attempt a digital project has little support on many campuses. If they want to write a book, there’s a fairly straightforward process to follow, but a digital project requires expertise many don’t have.

Digital humanities centers are uniquely placed to reach out to fine art and art history faculty and create some unique and very exciting projects. Funding might be tough at first, but then, it was for digital humanities projects too in the beginning. I have a feeling that quite a few individual art faculty would really appreciate the help: some want to move online, but don’t know how or what the web can do for them. And if my suspicions are correct, they probably won’t get a lot of help from within their own department. (Again, depending on the institution.)

At this point I still have more questions than answers. I’ll end with a fantastic quote from Ira’s comment:

Working at the level of code, established disciplinary boundaries dissolve (and eventually the temples that house them will as well.)
