
Sunday, December 2, 2007

Me v. Ze Frank (not so much…)


Gordon Luk has a really interesting post that I'll use as a launching pad to clarify a point I often make in public lectures... In the interest of saving you a click, see below.
This reminded me of Umair's article "Why Yahoo Didn't Build MySpace..." which basically suggests that the pyramid of participation I reference is a Yahoo "strategy." Nothing could be further from the truth. Destroying that pyramid is our strategy. The pyramid is more of a forensic, backward-looking empirical observation. The very next slide in the deck is also shown below.

[slide images from the deck]

Lesson: Of course, I take full responsibility for these misunderstandings. Gordon and Umair are brilliant guys. So as I'm dishing out soundbites, maybe I need to slow down and make sure that I'm clearer...




Gordon says:


Do you ever have posts sitting around in WordPress for months at a time, delayed for one reason or another? This is one of them, and after re-reading it, I think I’ll go ahead and post it, but remember that it’s kind of a warp back in time to October 2006.

Yahoo! Open Hack Day was a massive, massive success, and I’m glad to have been a part of it. Now that I’ve had a few days to rest and reflect upon my experiences, I want to discuss an observation of Bradley Horowitz’s that has stuck in my mind.

Bradley’s one of the foremost advocates for social search development here at Yahoo. He’s one of the brightest minds around, and always makes my head spin a little bit when I talk with him. You can check out his Keynote presentation here (warning, this was 4GB to download!). Around the end of minute five, Bradley says some really interesting stuff. First, he showed the famous grainy video clip of a monkey trained to perform martial arts kicks as the worst-case scenario of what user-filtered content could produce. Then he went on to show some beautiful photographs from Flickr’s Interestingness, as a way to demonstrate the better side of what can be efficiently extracted from collaborative participation. His point that these photos bubbled to the top because of implicit user activity is key; as he mentions, the aggregate human effort of photo moderation borne by the user community on Flickr dwarfs anything possible by simply paying employees to review and rate them.

Ze Frank, seen in this video speaking at TED, a design conference, seems to also think hard about the new culture of participation on the Internet. Ze often invites his viewership to participate with him on various flights of fancy, including making silly faces, creating short video clips, playing with flash toys and drawing tools, etc. During his TED presentation, and also at various times on The Show, Ze talked about the hold that various groups have on the perception of art, and how many people are able to participate and create in a new culture without being ostracized by an established hierarchy. He seems to hold that the “ugliness” that permeates MySpace is, in fact, a manifestation of participation outside the boundaries of hierarchical editorial control. Thus, his position seems to be that the silliness and ugliness of the huge amount of web “design” on MySpace depends heavily on perspective. At the minimum, he seemed to believe that participation culture removes barriers to experimentation, which could lead to an overthrow of traditional design aesthetics.

These perspectives seem to be at odds. On one side, Bradley appears to be advocating harvesting social participation to surface traditionally valuable content. In other words, using New Media platforms to efficiently perform the job of the Old Media publishing empires (Kung Fu Monkeys should be buried!). On the other side is Ze, who seems to be advocating not only a disruption of Old Media distribution through mass publication, but also seems to be leading a charge to disrupt traditional aesthetic values (Kung Fu Monkeys are beautiful, and should be encouraged!).

I think it’s an interesting contrast, and I worry that I’m mischaracterizing the arguments of each.

My personal viewpoint is a bit more nuanced. I believe that one day, web platforms will also be able to efficiently cluster their users based upon interests or tastes, similar to how Flickr can cluster tags to disambiguate meaning. These clusters will probably be designed not around user surveys or self-reported demographics, but instead will most likely be extracted through efficient methods of recording implicit participation information over the long term. There may well be a cluster (which I would belong to!) of folks who do enjoy Kung Fu monkeys, and there is almost definitely a cluster who find them degrading and offensive. The difference between traditional preference filtering and clustered audiences comes down to this: the former requires a great deal of potentially inaccurate user feedback about preferences, whereas the latter acts on implicit activity, and is thus more likely to produce the desired effects.

Not only would such a model be able to target clusters of preferences among users, but it would also allow users to participate in cultures in which they feel welcome from the beginning.


I responded:


My argument is not so much that Kung Fu monkeys = bad, or that they should be “buried.” But in a world where “anyone can say anything to everyone at once”, our most precious commodity becomes attention. I remember sitting at the Harvard Cyberposium Conference a few years ago when someone said… “It’s getting to the point where every moment of our life can now be digitally recorded and preserved for posterity…. [pregnant pause…] Unfortunately, one doesn’t get a second life with which to review the first one.”

Coming up with the right tools to help me get to what matters to me becomes essential. But I don’t want to get prescriptive - what matters to the fans of Kung Fu monkeys is… Kung Fu monkeys! And we should be providing tools that help that community as much as any other…

Another way of putting it… I’m disinclined to subscribe to a Flickr feed for the tag “baby”. Just not interested in seeing random babies, thank you very much. But my brother’s baby? My niece? Cutest baby ever! I want to see every picture of her that exists!

Death to the monoculture and long live the long tail! Long live low-brow humor, stupid pet tricks, and Mentos and Diet Coke! And Ze Frank…

My point is that tools like Flickr interestingness allow us to leverage aggregate attention for the benefit of each user. I love interestingness, and use it as a sort criterion for just about every search I do on Flickr… But Flickr also uses a social graph with varying coefficients (me, family, friends, contacts, public) to provide another dimension that helps direct my attention to the right babies. ;-)
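To make that concrete, here's a toy sketch of ranking with social coefficients. All the weights and field names below are made up for illustration; this is not Flickr's actual algorithm.

```python
# Toy sketch: rank photos by a global "interestingness" score scaled by a
# social-distance coefficient, so closer ties bubble to the top. All
# values here are invented for illustration.

SOCIAL_COEFFICIENTS = {
    "me": 5.0,
    "family": 4.0,
    "friends": 3.0,
    "contacts": 2.0,
    "public": 1.0,
}

def rank(photos):
    """Sort photos so that closer social ties rank higher."""
    def score(photo):
        return photo["interestingness"] * SOCIAL_COEFFICIENTS.get(
            photo["relationship"], 1.0)
    return sorted(photos, key=score, reverse=True)

photos = [
    {"title": "random baby", "relationship": "public", "interestingness": 0.9},
    {"title": "my niece",    "relationship": "family", "interestingness": 0.4},
]
print([p["title"] for p in rank(photos)])  # ['my niece', 'random baby']
```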

I think my thesis is simply that in democratizing the creation of content, we’ve created a high-class problem… There’s too much “on”… 500 channels, maybe. 500M channels? Never. The flip side of this wonderful revolution in publishing, destroying the hierarchical pyramid of participation, is that we (our industry) have a burden to provide people the means of actually getting to the content they want to see… (Perhaps sometimes, even before they know they want to see it.) This ought to keep us busy for a lifetime or so…

I think you captured my view pretty much in your closing paragraph. I’d guess Ze Frank agrees with us mostly too.

Sunday, March 5, 2006

Capture v. Derive

Universal Law: It is easier, cheaper, and more accurate to capture metadata upstream than to reverse engineer it downstream.

Back at Virage, we worked on the problem of indexing rich media - deriving metadata from video. We would apply all kinds of fancy (and fuzzy) technology like speech recognition, automatic scene change detection, face recognition, etc. to commercial broadcast video so that you could later perform a query like, "Find me archival footage where George Bush utters the terms 'Iraq' and 'weapons of mass destruction.'"

What was fascinating (and frustrating) about this endeavor is that we were applying a lot of computationally expensive and error-prone techniques to reverse engineer metadata that by all rights shoulda and coulda been easily married to the media further upstream. Partly this was because the analog television signal in the US is based on a standard that is more than 50 years old. There's no convenient place to put interesting metadata (although we did some very interesting projects stuffing metadata and even entire websites into the vertical blanking interval of the signal). Even as the industry migrates to digital formats (MPEG2), the data in the stream is generally just what is minimally needed to reconstitute the signal and nothing more. MPEG4 and MPEG7 at least pay homage to metadata by having representations built into the standard.

Applying speech recognition to derive a searchable transcript seems bass-ackwards since for much video of interest the protagonists are reading material that is already in digital form (whether from a teleprompter or a script.) So much metadata is needlessly thrown away in the production process.

In particular, cameras should populate the stream with all of the easy stuff, including:

  • roll
  • pitch
  • yaw
  • altitude
  • location
  • time
  • focal length
  • aperture setting
  • gain / white balance settings
  • temperature
  • barometric pressure
  • heart rate and galvanic skin response of the camera operator
  • etc.
Heart rate and galvanic skin response of the camera operator? Ok, maybe not... I'm making a point. That point is that it is relatively easy and cheap to use sensors to capture these kinds of things in the moment... but difficult (and, in the case of barometric pressure, impossible) to derive them post facto. Why would you want to know this stuff? I'll be the first to confess that I don't know... but that's not the point IMHO. It's so easy and cheap to capture these, and so expensive and error-prone to derive them, that we should simply do the former when practical.
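To sketch what "capture, don't derive" looks like in code: a record like the one below costs almost nothing to populate at capture time. The sensor class and field names are made-up stand-ins for whatever APIs a real camera would expose.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class FrameMetadata:
    time_utc: float
    lat: float
    lon: float
    altitude_m: float
    roll: float
    pitch: float
    yaw: float
    focal_length_mm: float
    aperture_f: float
    pressure_hpa: float  # barometric pressure: impossible to derive later

class FakeSensors:
    """Made-up stand-in for on-camera sensor APIs."""
    def read(self) -> FrameMetadata:
        return FrameMetadata(
            time_utc=time.time(),
            lat=37.7749, lon=-122.4194, altitude_m=16.0,
            roll=0.1, pitch=-2.3, yaw=181.0,
            focal_length_mm=35.0, aperture_f=2.8,
            pressure_hpa=1013.2,
        )

# Serialize alongside the frame, e.g. as an EXIF-like sidecar record.
print(json.dumps(asdict(FakeSensors().read())))
```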


An admittedly slightly off-point example... When the Monica Lewinsky story broke, the archival shot of her and Clinton hugging suddenly became newsworthy. Until that moment she was just one of tens of thousands of bystanders amongst thousands of hours of archival footage. Point being - you don't always know what's important at time of capture.

So segueing to today... Marc, Ellen, Mor and the rest of the team at Yahoo Research Berkeley have recently released ZoneTag. One of the things that ZoneTag does is take advantage of context. I carry around a Treo 650 with Good software installed for email, calendar, and contact syncing. When I snap a photo the device knows a lot of context automagically, such as: who I am, the time (via the clock), where I am supposed to be (via the calendar), where I actually am (via the nearest cell phone tower's ID), who I am supposed to be with (via the calendar), what people / devices might be around me (via Bluetooth co-presence), etc. Generally most of this valuable context is lost when I upload an image to Flickr via the email gateway. I end up with a raw JPG (in the case of the Treo, even the EXIF fields are empty).
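Here's a minimal sketch of the kind of context record the device could assemble when the shutter fires. Every helper below is a hypothetical stand-in; none of these are actual Treo or ZoneTag APIs.

```python
from datetime import datetime, timezone

def current_calendar_entry():
    # Hypothetical calendar lookup: where I'm supposed to be, and with whom.
    return {"title": "research review", "attendees": ["marc", "mor"]}

def nearest_cell_tower_id():
    # Hypothetical radio query: coarse location via tower ID.
    return "tower-4411"

def nearby_bluetooth_devices():
    # Hypothetical Bluetooth scan: who and what is co-present.
    return ["ellens-phone", "conference-room-pc"]

def photo_context(user_id):
    """Bundle everything the device already knows at capture time."""
    return {
        "who": user_id,
        "when": datetime.now(timezone.utc).isoformat(),
        "where_expected": current_calendar_entry(),
        "where_actual": nearest_cell_tower_id(),
        "who_nearby": nearby_bluetooth_devices(),
    }

print(photo_context("bradley"))
```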

ZoneTag lays the foundation for fixing this and leveraging this information.

It also dabbles in the next level of transformation from signal to knowledge. Knowing the ID of the closest cell phone tower gives us coarse location, but it's not in a form that's particularly useful. Something like a ZIP code, a city name, or a lat/long would be a much more conventional and useful representation. So in order to make that transformation, ZoneTag relies on people to build up the necessary look-up tables.
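Sketched in code, the crowd-built look-up table could work something like this (the tower IDs and place names below are invented):

```python
from collections import Counter, defaultdict

# Users label their photos with a place while the phone reports the
# nearest tower ID; the majority label per tower becomes the table entry.
reports = [
    ("tower-4411", "Berkeley, CA"),
    ("tower-4411", "Berkeley, CA"),
    ("tower-4411", "Oakland, CA"),
    ("tower-9021", "Paris"),
]

votes = defaultdict(Counter)
for tower_id, place in reports:
    votes[tower_id][place] += 1

lookup = {tower: counts.most_common(1)[0][0]
          for tower, counts in votes.items()}
print(lookup["tower-4411"])  # Berkeley, CA
```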

This is subtle, but cool. Whereas I've been talking about capturing raw signal from sensors, once we add people (and especially many people) to the mix we can do more interesting things. To foreshadow the kinds of things coming...

  • If a large sample of photos coming from a particular location have the following tag sets [eiffel tower, emily], [eiffel tower, john, vacation], [eiffel tower, lisette], we can do tag-factoring across a large data set to tease out 'eiffel tower' (see the sketch after this list).
  • Statistically, the tag 'sunset' tends to apply to photos taken at a particular time each day.
  • When we've got 1000s of Flickr users at an event like Live8 and we see an upload spike clustered around a specific place and time (e.g. Berlin at 7:57pm), that likely means something interesting happened at that moment (maybe Green Day took the stage).
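Here's a minimal sketch of the tag-factoring idea, using the sample tag sets from the first bullet:

```python
from collections import Counter
from itertools import chain

# Across photos uploaded from one location, the tag shared by most of the
# tag sets is likely the place itself.
tag_sets = [
    {"eiffel tower", "emily"},
    {"eiffel tower", "john", "vacation"},
    {"eiffel tower", "lisette"},
]

counts = Counter(chain.from_iterable(tag_sets))
tag, freq = counts.most_common(1)[0]
print(tag, freq / len(tag_sets))  # 'eiffel tower' appears in 100% of sets
```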

All of the above examples lead to extrapolations that are "fuzzy." Just as my clustering example might have problems with people "eating turkey in Turkey", it's one thing to have the knowledge - it's another to know how to use it in ways that provide value back to users. This is an area where we need to tread lightly, and it's worthy of another post (and probably in fact a tome to be written by someone much more cleverer than me).

Even as I remain optimistic that we'll eventually solve the generalized computer vision problem ("Computer - what's in this picture?"), I wonder how much value it will ultimately deliver. In addition to what's in the picture, I want to know if it's funny, ironic, or interesting. Much of the metadata people most care about is not likely to be algorithmically derived against the signal in isolation. Acoustic analysis of music (beats per minute, etc.) tends to be a poor predictor of taste, while collaborative filtering ("People who liked that, also liked this...") tends to work better.
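For contrast, the collaborative filtering idea is simple enough to sketch in a few lines (the users and songs below are invented):

```python
from collections import Counter

# "People who liked that, also liked this": recommend whatever co-occurs
# most often with a liked item across users' histories.
likes = {
    "alice": {"song_a", "song_b"},
    "bob":   {"song_a", "song_b", "song_c"},
    "carol": {"song_b", "song_c"},
}

def also_liked(item):
    counts = Counter()
    for user_items in likes.values():
        if item in user_items:
            counts.update(user_items - {item})
    return counts.most_common()

print(also_liked("song_a"))  # song_b co-occurs with song_a most often
```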

Again - all of this resonates nicely with the "people plus machines" philosophy captured in the "Better Search through People" mantra. Smart sensors, cutting-edge technology, algorithms, etc. are interspersed throughout these systems, not just at one end or the other. There are plenty of worthwhile problems to spend our computrons on, without burdening the poor machines with the task of reinventing the metadata we left by the side of the road...