Twitter and other social internet systems offer a rich and voluminous stream of data that reflects the observations, mood, and knowledge of people distributed around the world. However, specific location information is missing from nearly all messages (e.g., only roughly 1% of tweets contain a geotag), which makes it very difficult to draw conclusions about specific locales. We are using the content of tweets to infer missing locations, training on the small fraction of tweets that do have a geotag. Specifically, we parse training tweets into (word, geopoint) pairs and then fit a Gaussian mixture model (GMM) to the points associated with each distinct word in the training data. The location estimate for a tweet is then a combination of the GMMs previously learned for the words in that tweet. This goes beyond prior work by offering probabilistic, geographic location estimates (rather than a single best point or suggested locale names), which we expect will be more accurate than current techniques. We also offer more robust metrics for accuracy, precision, and calibration. We expect these techniques to impact a wide variety of social internet analysis research and applications. This talk will present work in progress, so feedback, suggestions, and discussion will be greatly appreciated.

Host: Kipton Barros, T-4 and CNLS
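
As a rough illustration of the per-word GMM idea in the abstract, the following is a minimal sketch and not the speakers' implementation. Several details are assumptions made only for this example: latitude and longitude are treated as flat planar coordinates, each mixture component is isotropic, EM is initialised deterministically, and the "combination" of word models is taken to be an equal-weight average of their densities. The function names (fit_gmm, train, tweet_density), the variance floor, and the toy tweets are all hypothetical.

```python
import math
from collections import defaultdict

MIN_VAR = 1e-3  # assumed variance floor so components cannot collapse to a point


def _gauss(p, mean, var):
    """Density of an isotropic 2-D Gaussian with variance `var` per axis."""
    d2 = (p[0] - mean[0]) ** 2 + (p[1] - mean[1]) ** 2
    return math.exp(-d2 / (2 * var)) / (2 * math.pi * var)


def fit_gmm(points, k=2, iters=50):
    """Fit a k-component isotropic 2-D Gaussian mixture to `points` with EM."""
    n = len(points)
    k = min(k, n)
    # Deterministic initialisation: means spread over the data, shared variance.
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    var0 = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points) / (2 * n)
    means = [points[int(i * n / k)] for i in range(k)]
    variances = [max(var0, MIN_VAR)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            dens = [w * _gauss(p, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an (almost) empty component unchanged
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            sq = sum(r[j] * ((p[0] - mx) ** 2 + (p[1] - my) ** 2)
                     for r, p in zip(resp, points))
            weights[j] = nj / n
            means[j] = (mx, my)
            variances[j] = max(sq / (2 * nj), MIN_VAR)
    total_w = sum(weights)
    weights = [w / total_w for w in weights]  # renormalise if a component was skipped
    return list(zip(weights, means, variances))


def train(geotagged_tweets, k=2, min_points=3):
    """Split tweets into (word, geopoint) pairs and fit one GMM per word."""
    points_by_word = defaultdict(list)
    for text, point in geotagged_tweets:
        for word in text.lower().split():
            points_by_word[word].append(point)
    return {word: fit_gmm(pts, k)
            for word, pts in points_by_word.items() if len(pts) >= min_points}


def tweet_density(models, text, point):
    """Equal-weight average of the word GMM densities at `point` (assumed rule)."""
    known = [models[w] for w in text.lower().split() if w in models]
    if not known:
        return 0.0  # no word in the tweet has a trained model
    return sum(sum(w * _gauss(point, m, v) for w, m, v in gmm)
               for gmm in known) / len(known)


# Toy training set of (text, (lat, lon)) pairs, invented for illustration.
training = [
    ("surf beach", (34.0, -118.5)),
    ("beach sunset", (34.1, -118.4)),
    ("beach day", (34.05, -118.45)),
    ("ski snow", (39.6, -106.4)),
    ("snow powder", (39.5, -106.3)),
    ("snow day", (39.55, -106.35)),
]
models = train(training)
```

Calling tweet_density(models, "beach", (34.05, -118.45)) evaluates the estimated density at one candidate point. Evaluating it over a grid of points would give the kind of probabilistic, geographic estimate the abstract describes, rather than a single best point.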