


Twitter database poses daunting task for library

In the few minutes it will take you to read this story, some 3 million new tweets will have flitted across the publishing platform Twitter and ricocheted across the Internet. The Library of Congress is busy archiving the sprawling and frenetic Twitter canon – with some key exceptions – dating back to the site’s 2006 launch. That means saving for posterity more than 170 billion tweets and counting, with an average of more than 400 million new tweets sent each day, according to Twitter.

But in the two years since the library announced this unprecedented acquisition project, few details have emerged about how its unwieldy corpus of 140-character bursts will be made available to the public.

That’s because the library hasn’t figured it out yet.

“People expect fully indexed – if not online searchable – databases, and that’s very difficult to apply to massive digital databases in real time,” said Deputy Librarian of Congress Robert Dizard Jr. “The technology for archival access has to catch up with the technology that has allowed for content creation and distribution on a massive scale. Twitter is focused on creating and distributing content; that’s the model. Our focus is on collecting that data, archiving it, stabilizing it and providing access; a very different model.”

Archiving tweets

Colorado-based data company Gnip is managing the transfer of tweets to the archive, which is populated by a fully automated system that processes tweets from across the globe. Each archived tweet comes with more than 50 fields of metadata – where the tweet originated, how many times it was retweeted, who follows the account that posted the tweet and so on – although content from links, photos and videos attached to tweets is not included. For security’s sake, there are two copies of the complete collection.

But the library hasn’t started, in any meaningful way, the daunting task of sorting or filtering its 133 terabytes of Twitter data, which it receives from Gnip in chronological bundles.

“It’s pretty raw,” Dizard said. “You often hear a reference to Twitter as a fire hose, that constant stream of tweets going around the world. What we have here is a large and growing lake. What we need is the technology that allows us to both understand and make useful that lake of information.”

At what cost?

For now, giving researchers access to the archive remains cost-prohibitive for the cash-strapped library, which has spent tens of thousands of dollars on the project so far, Dizard says. Like many federal agencies, the Library of Congress has been hit by budget cuts in recent years. Without a major overhaul to its computing infrastructure, it isn’t equipped to handle even the simplest queries.

“We know from the testing we’ve done with even small parts of the data that we are not going to be able to, on our own, provide really useful access at a cost that is reasonable for us,” Dizard said. “For even just the 2006 to 2010 portion of the archive, which is about 21 billion tweets, just to do one search could take 24 hours using our existing servers.”

Instead, the library is exploring whether it might be able to afford to pay a third party to provide public access to the archive. But for those who have immediate research interests – and many people have contacted the library, Dizard says – the wait is maddening.

Historical value?

Even after questions of access are resolved, Gnip President Chris Moody says he expects centuries to pass before the full value of the Twitter archive can be realized.

“We’re very, very early,” Moody said. “We’re 1 percent of the way into what this data will mean.”

The eventual plan is to make the collection available only within the Library of Congress reading rooms. Requiring an in-person visit to search a database of material that originated online may seem incongruous, but Dizard says it’s a condition of the deal with Twitter, which gifted the archive, so that the library won’t be “competing with the commercial sector.”

There are other limitations. The library is not archiving tweets from those who opt for the strictest privacy settings, which allow Twitter users to approve or reject each potential follower.

The library is also planning to scrub deleted tweets. Dizard, citing privacy concerns, calls that decision “one of the more significant policy questions we face.”
