Sunday, March 25, 2012

Words by the Millions, Sorted by Software

Excerpt from an article in

The New York Times (Business)
Sunday, March 25, 2012

By ANNE EISENBERG

It just keeps growing: the vast electronic archive of books, journals and scholarly literature stored on the Web. But scientists are aiming to keep up with this trove of collective knowledge by devising computer-based tools to winnow and quantify it.

David M. Blei of Princeton University is among those who are teaching computers to sift through the digital pages of books and articles and categorize the contents by subject, even when that subject isn’t stated explicitly.

For decades, of course, librarians and many others have labeled books and documents with keywords. “But human categorization can only go so far,” said Dr. Blei, an associate professor in computer science. “We don’t have the human power to read and tag all this information.”

To cope with the information explosion, Dr. Blei and other researchers write algorithms so that computers can sift through millions of works and find their common themes by sorting related words into categories. It’s a field called probabilistic topic modeling.
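To make "sorting related words into categories" concrete, here is a minimal sketch of one widely used topic model, latent Dirichlet allocation (LDA), fitted by collapsed Gibbs sampling. The article does not say which model or software the researchers use, so this is an illustration under stated assumptions rather than their method. The function name `fit_lda`, the parameter values and the tiny sample corpus are all made up for the example.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling over tokenized documents.

    Hypothetical helper for illustration; alpha and beta are assumed
    smoothing values, not settings taken from the article.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]         # words in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times each word is assigned to topic k
    topic_total = [0] * num_topics                       # all words assigned to topic k

    # Start from a random topic for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the current assignment out of the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # Weight of topic t for this word: how much the document
                # favors t, times how much t favors the word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Made-up toy corpus: no document carries a subject label.
docs = [
    "gene dna cell genome dna".split(),
    "cell gene protein dna".split(),
    "galaxy star orbit telescope star".split(),
    "orbit galaxy star planet".split(),
]

doc_topic, topic_word = fit_lda(docs)
for k, counts in enumerate(topic_word):
    top_words = [word for word, count in counts.most_common(3) if count > 0]
    print(f"topic {k}: {top_words}")
```

The sampler never sees a subject label. Words that tend to appear in the same documents drift into the same topic, which is the sense in which a subject can be recovered even when it isn't stated explicitly. On a corpus this small the result depends on the random seed, so the printed topics should be read as a demonstration, not a finding.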

Other research tools identify shifts in language over time that could signal important cultural, scientific or historical changes. At Harvard, Erez Lieberman Aiden and Jean-Baptiste Michel, who jointly lead a group there called the Cultural Observatory, will soon inaugurate a browser that searches for such language changes in a large online repository of scientific papers known as arXiv (pronounced like “archive”).

Users will be able to type in one or two words at the site, called Bookworm-arXiv, and immediately see a graph showing the ups and downs of the phrase’s use in the archive, Dr. Michel said. (A test version is at arxiv.culturomics.org.) Users can then click on the graph and drill down to read the original papers in which the terms appear, tracing ideas back toward their roots, or to spots where scientific ideas spread from one field to another.
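The kind of lookup described here, a phrase's ups and downs over time plus a way back to the papers that use it, can be sketched in a few lines. This is not Bookworm-arXiv's implementation, which the article does not describe; the function `phrase_trend`, the choice of "share of papers per year" as the measure, and the sample records are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_trend(papers, phrase):
    """Return (trend, matches) for a one- or two-word phrase.

    trend maps each year to the share of that year's papers containing
    the phrase; matches maps each year to the titles of those papers.
    """
    target = tokenize(phrase)
    if not target:
        raise ValueError("phrase must contain at least one word")
    n = len(target)

    totals = defaultdict(int)    # papers per year
    hits = defaultdict(int)      # papers per year containing the phrase
    matches = defaultdict(list)  # titles behind each year's hits

    for year, title, text in papers:
        tokens = tokenize(text)
        totals[year] += 1
        # Slide a window of n tokens across the text looking for the phrase.
        if any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1)):
            hits[year] += 1
            matches[year].append(title)

    trend = {year: hits[year] / totals[year] for year in sorted(totals)}
    return trend, dict(matches)


# Made-up records in the form (year, title, text).
papers = [
    (2008, "Paper A", "We study dark energy and cosmic expansion."),
    (2008, "Paper B", "A note on graph coloring."),
    (2009, "Paper C", "Dark energy constraints from supernovae."),
    (2009, "Paper D", "Dark-energy models revisited."),
    (2010, "Paper E", "Topic models for text."),
]

trend, matches = phrase_trend(papers, "dark energy")
for year, share in trend.items():
    # A crude text "graph": one # per 10 percentage points.
    print(year, "#" * round(share * 10), matches.get(year, []))
```

Dividing by the number of papers in each year matters because an archive grows over time; raw counts would rise for almost any phrase. Keeping the matching titles alongside the trend is what allows the drill-down from a point on the graph to the original papers.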
