Pages

Tuesday, March 7, 2006

Zoom into the Room

Hot on the heels of ZoneTag, we're releasing another spiffy mobile prototype - a "mobile friend finder" we're calling CheckMates.

In addition to some very cool features (an incredibly intuitive mobile interface, leveraging Flickr for both the social network and a place to park my geopresence, etc.), I love the support for "private maps".  This allows for pinpointing my location within a venue.  It rocks IMHO.  Great job Chad, Karon, Sam, Ed, Jonathan, etc.

Chad does a great job describing how this work evolved.

And Edward gives yet more detail.

Sunday, March 5, 2006

Capture v. Derive

Universal Law: It is easier, cheaper, and more accurate to capture metadata upstream than to reverse engineer it downstream.

Back at Virage, we worked on the problem of indexing rich media - deriving metadata from video. We would apply all kinds of fancy (and fuzzy) technology like speech recognition, automatic scene change detection, face recognition, etc. to commercial broadcast video so that you could later perform a query like, "Find me archival footage where George Bush utters the terms 'Iraq' and 'weapons of mass destruction.'"

What was fascinating (and frustrating) about this endeavor is that we were applying a lot of computationally expensive and error-prone techniques to reverse engineer metadata that by all rights shoulda and coulda been easily married to the media further upstream. Partly this was due to the fact that the analog television signal in the US is based on a standard that is more than 50 years old. There's no convenient place to put interesting metadata (although we did some very interesting projects stuffing metadata and even entire websites into the vertical blanking interval of the signal.) Even as the industry migrates to digital formats (MPEG-2), the data in the stream is generally what is minimally needed to reconstitute the signal and nothing more. MPEG-4 and MPEG-7 at least pay homage to metadata by having representations built into the standard.

Applying speech recognition to derive a searchable transcript seems bass-ackwards since for much video of interest the protagonists are reading material that is already in digital form (whether from a teleprompter or a script.) So much metadata is needlessly thrown away in the production process.

In particular, cameras should populate the stream with all of the easy stuff, including:

  • roll
  • pitch
  • yaw
  • altitude
  • location
  • time
  • focal length
  • aperture setting
  • gain / white balance settings
  • temperature
  • barometric pressure
  • heartrate and galvanic skin response of the camera operator
  • etc.

Heartrate and galvanic skin response of the camera operator? Ok, maybe not... I'm making a point. That point is that it is relatively easy and cheap to use sensors to capture these kinds of things in the moment... but difficult (and, in the case of barometric pressure, impossible) to derive them post facto. Why would you want to know this stuff? I'll be the first to confess that I don't know... but that's not the point IMHO. It's so easy and cheap to capture these, and so expensive and error-prone to derive them, that we should simply do the former when practical.
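
To make the point concrete: capturing this stuff upstream can be as trivial as serializing the sensor readings alongside each frame. A minimal sketch - every field and file name here is my own invention, not any camera standard:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class CaptureMetadata:
    """Cheap-to-capture sensor readings, recorded the moment the shutter fires."""
    timestamp: float
    roll: float
    pitch: float
    yaw: float
    altitude_m: float
    lat: float
    lon: float
    focal_length_mm: float
    aperture_f: float

def write_sidecar(frame_id: str, meta: CaptureMetadata) -> str:
    """Persist the readings next to the frame as a JSON 'sidecar' file."""
    path = f"{frame_id}.meta.json"
    with open(path, "w") as f:
        json.dump(asdict(meta), f, indent=2)
    return path

meta = CaptureMetadata(
    timestamp=time.time(), roll=0.5, pitch=-1.2, yaw=87.0,
    altitude_m=52.0, lat=37.87, lon=-122.27,
    focal_length_mm=35.0, aperture_f=2.8,
)
write_sidecar("frame_0001", meta)
```

A real capture pipeline would embed this in the stream itself (the kind of thing MPEG-4/MPEG-7 make room for) rather than a sidecar file, but the cost at capture time is the same: trivial.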


An admittedly slightly off-point example... When the Monica Lewinsky story broke, the archival shot of her and Clinton hugging suddenly became newsworthy. Until that moment she was just one of tens of thousands of bystanders amongst thousands of hours of archival footage. Point being - you don't always know what's important at time of capture.

So segueing to today... Marc, Ellen, Mor and the rest of the team at Yahoo Research Berkeley have recently released ZoneTag. One of the things that ZoneTag does is take advantage of context. I carry around a Treo 650 with Good software installed for email, calendar, contact sync'ing. When I snap a photo the device knows a lot of context automagically, such as: who I am, time (via the clock), where I am supposed to be (via the calendar), where I actually am (via the nearest cell phone tower's ID), who I am supposed to be with (via calendar), what people / devices might be around me (via bluetooth co-presence), etc. Generally most of this valuable context is lost when I upload an image to Flickr via the email gateway. I end up with a raw JPG (in the case of the Treo even the EXIF fields are empty.)
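
What "knowing a lot of context automagically" might look like in code - a sketch only, where every function and field name is hypothetical, standing in for the phone's real APIs:

```python
from datetime import datetime

def gather_context(user_id, cell_tower_id, calendar_event, bluetooth_ids):
    """Bundle the context the device already knows the moment a photo is snapped."""
    return {
        "who": user_id,                           # device owner
        "when": datetime.now().isoformat(),       # the clock
        "where_roughly": cell_tower_id,           # nearest cell tower's ID
        "supposed_to_be_at": calendar_event,      # current calendar entry, if any
        "nearby_devices": sorted(bluetooth_ids),  # bluetooth co-presence
    }

# Hypothetical snapshot moment: the phone fills all of this in for free.
ctx = gather_context("bradley", "tower:310-26-1234",
                     "YRB staff meeting", {"treo-mor", "treo-ed"})
```

The point is that none of this costs anything at capture time; losing it at the email gateway and trying to reconstruct it later is the expensive path.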

ZoneTag lays the foundation for fixing this and leveraging this information.

It also dabbles in the next level of transformation from signal to knowledge. Knowing the ID of the closest cell phone tower gives us coarse location, but it's not in a form that's particularly useful. Something like a ZIP code, a city name, or a lat/long would be a much more conventional and useful representation. So in order to make that transformation, ZoneTag relies on people to build up the necessary look-up tables.
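
The look-up table idea can be sketched as a simple voting scheme - this is my own toy version, not ZoneTag's actual implementation:

```python
from collections import Counter, defaultdict
from typing import Optional

# tower ID -> votes for place names, contributed by users over time
votes = defaultdict(Counter)

def submit(tower_id: str, place: str) -> None:
    """A user tells us where they are while connected to a given tower."""
    votes[tower_id][place] += 1

def lookup(tower_id: str) -> Optional[str]:
    """Return the consensus place name for a tower, or None if nobody has voted."""
    if not votes[tower_id]:
        return None
    return votes[tower_id].most_common(1)[0][0]

submit("310-26-1234", "Berkeley, CA")
submit("310-26-1234", "Berkeley, CA")
submit("310-26-1234", "Oakland, CA")  # a noisy vote gets outvoted
print(lookup("310-26-1234"))  # Berkeley, CA
```

With enough users, the noisy and mischievous votes wash out - the same "wisdom of crowds" effect that shows up later in this post.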

This is subtle, but cool. Whereas I've been talking about capturing raw signal from sensors, once we add people (and especially many people) to the mix we can do more interesting things. To foreshadow the kinds of things coming...

  • If a large sample of photos coming from a particular location have the following tag sets [eiffel tower, emily], [eiffel tower, john, vacation], [eiffel tower, lisette], we can do tag-factoring across a large data set to tease out 'eiffel tower.'
  • Statistically, the tag 'sunset' tends to apply to photos taken at a particular time each day.
  • When we've got 1000s of Flickr users at an event like Live8 and we see an upload spike clustered around a specific place and time (e.g. Berlin at 7:57pm) that likely means something interesting happened at that moment (maybe Green Day took the stage.)
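
The tag-factoring idea in the first bullet can be sketched as counting how often each tag shows up across the tag sets from one location - a toy version, not the production algorithm:

```python
from collections import Counter

def factor_common_tags(tag_sets, threshold=0.8):
    """Return tags appearing in at least `threshold` of the tag sets -
    a crude way to tease out what a location is 'about'."""
    counts = Counter(tag for tags in tag_sets for tag in set(tags))
    n = len(tag_sets)
    return {tag for tag, c in counts.items() if c / n >= threshold}

# The tag sets from the bullet above, all from one spot
photos_from_one_spot = [
    {"eiffel tower", "emily"},
    {"eiffel tower", "john", "vacation"},
    {"eiffel tower", "lisette"},
]
print(factor_common_tags(photos_from_one_spot))  # {'eiffel tower'}
```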

All of the above examples lead to extrapolations that are "fuzzy." Just as my clustering example might have problems with people "eating turkey in Turkey", it's one thing to have the knowledge - it's another to know how to use it in ways that provide value back to users. This is an area where we need to tread lightly, and is worthy of another post (and probably in fact a tome to be written by someone much cleverer than me.)

Even as I remain optimistic that we'll eventually solve the generalized computer vision problem ("Computer - what's in this picture?"), I wonder how much value it will ultimately deliver. In addition to what's in the picture, I want to know if it's funny, ironic, or interesting. Much of the metadata people most care about is not likely to be algorithmically derived against the signal in isolation. Acoustic analysis of music (beats per minute, etc.) tends to be a poor predictor of taste, while collaborative filtering ("People who liked that, also liked this...") tends to work better.

Again - all of this resonates nicely with the "people plus machines" philosophy captured in the "Better Search through People" mantra. Smart sensors, cutting-edge technology, algorithms, etc. are interspersed throughout these systems, not just at one end or the other. There are plenty of worthwhile problems to spend our computrons on, without burdening the poor machines with the task of reinventing the metadata we left by the side of the road...

Thursday, March 2, 2006

Lowering Barriers to Participation

In a previous post, I mentioned our efforts around lowering barriers to participation, i.e. empowering consumers with tools that transform them into creators.  Tagging is perhaps the simplest and most direct example of how lowering a barrier to entry can drive participation.

Tagging works, in part, because it's so simple.  Rather than being forced to tag Rashi (the name of my puppy) in a hierarchical taxonomy (Animal => Mammal => Canine => Rhodesian Ridgeback => Rashi), I can just type Rashi.  The instructions for tagging on Flickr are vague; likely the less said the better.  You learn by watching and doing, making mistakes and fixing them...  sometimes tagging for oneself, sometimes for one's friends, sometimes for others.  Tagging, while initially uncomfortably unstructured (staring into that blank field it's easy to freeze up with "tagger's block"), becomes painless and thought-free.  Note that there is no spellcheck against submitted tags.  People commonly invent tags that have no meaning outside of a shared or personal context, for instance specific tags for events.

In the great taxonomy/folksonomy debate, Dewey Decimal fans generally invoke semantic ambiguity as a place where tagging will break down.  Stewart invoked these illustrative examples in his blog post that introduced the Flickr clustering feature.  For instance, the word "turkey" has several different senses - turkey the bird, turkey the food, and Turkey the country.

Forcing a user to resolve this ambiguity at data entry time would be a drag, and we'd likely see a huge dropoff in the amount of user metadata that we collect.  (Moreover, we really couldn't.  As pointed out before, tags must be allowed to take on personal meaning - "turkey" might be the name of my school's mascot, e.g. the Tarrytown Turkeys, or a pejorative term I apply to a bad snapshot...)  What Flickr can and does do is provide a post facto means of resolving this ambiguity and browsing the data: Flickr's clustery goodness.

So check out the turkey clusters.  Flickr uses the co-occurrence of tags to cluster terms.  In other words, photos with the tags "turkey" and "stuffing" tend to be about the food, "turkey" and "mosque" tend to be about the country, and "turkey" and "feather" about the bird.
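
A toy illustration of co-occurrence counting - the photos and sense labels here are mine, and Flickr's real clustering is surely more sophisticated:

```python
from collections import Counter

# A few invented photos, each a set of tags including 'turkey'
photos = [
    {"turkey", "stuffing", "thanksgiving"},
    {"turkey", "stuffing", "gravy"},
    {"turkey", "mosque", "istanbul"},
    {"turkey", "mosque", "travel"},
    {"turkey", "feather", "bird"},
]

def cooccurring(tag):
    """Count the tags that co-occur with `tag` across the corpus."""
    co = Counter()
    for tags in photos:
        if tag in tags:
            co.update(tags - {tag})
    return co

co = cooccurring("turkey")
# 'stuffing' and 'mosque' each dominate a distinct sense; clustering over
# this co-occurrence data is what separates the food, country, and bird photos.
print(co["stuffing"], co["mosque"], co["feather"])  # 2 2 1
```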

There are limitations with this approach.  Co-occurrence requires that a photo have more than a single tag.  Something tagged with just "turkey" is shit outta luck, and doesn't get to come to the clustering party.  Precision and recall tolerances within the Flickr system are very different than in a traditional information-retrieval-based system.  A lot of what we're going for here is discovery as opposed to recall; the photos that don't come to the clustering party aren't really hurting anything.  Moreover, the system doesn't really know about the semantic clusters I defined in the above paragraph: "food", "country" and "bird".  In fact I just assigned those names by looking at the results of the clusters and reverse engineering what I intuit is going on.

In fact, in addition to these tidy clusters onto which I can slap a sensible label, there are also several other clusters which aren't immediately recognizable.  One is the "sea" cluster; apparently lots of people take pictures of the sea in Turkey.  The other, which is harder to divine, seems to contain a lot of words which appear to be in Turkish.  (Reflections on multi-lingual tagging deserve their own post.)  This reverse engineering can be fun, and I'm sure there is a game in there somewhere that someone has already built.  (Lots of folks have come up with interesting Flickr games, e.g. "Guess the tag!")

Ambiguous words like "turkey" or "jaguar" (cat, car, operating system) are illustrative.  Clusters against tags like "love" (again an example Stewart invokes) are downright fascinating.  Here we have clusters corresponding to (again reverse engineering/inventing labels) symbols of love, romantic love, women (perhaps loved by men), familial love, and pets.  Pretty cool.

Another thing that's cool is that these clusters are dynamic.  The clustering shifts to accommodate words that take on new meanings.  As Caterina pointed out to me, for months Katrina was a tag mostly applied to women and girls; one day it suddenly meant something else.  The clusterbase shifts and adapts to accommodate this.

Per my first post - I'm just documenting my observations, celebrating Flickr and not breaking any new ground here.  Hooray for Stewart and Serguei and the team that actually create this stuff!  Hooray for Tom and the other pundits (like Clay and Thomas) who have already figured out most everything there is to know about tags!

The reason I'm highlighting this feature is that a few folks misunderstood the pyramid in my first post to be Yahoo's strategy...  on the contrary, it's just an empirical observation that these ratios exist, and that social software can be successful in the face of them.  We're flattening, dismantling, and disrupting this pyramid every day!

Flickr clustering speaks to our unofficial tag line, "Better search through people."  What I love about it is that it's not "human or machine", or heaven forbid "human versus machine", but "human plus machine".  We let people do what they're really good at (understanding images at a glance) and keep it nice and simple for them.  We then let machines do what they're good at, and invoke algorithms and AI to squeeze out additional value.  There's also a cool "wisdom of crowds" effect here, in that the clusters are the result of integrating a lot of data across many individuals.

Some of our folks at YRB in Berkeley will be prototyping some additional very cool "wisdom of crowds" or "collective intelligence" type stuff RSN (Real Soon Now.)  More about their work in an upcoming post.  In the meantime, get a taste of it in the ZoneTag application.  It applies many of these principles to the task of associating coarse location with cell phone tower IDs - a cheap, simple way to squeeze location out of phones before we've all got GPS.