
 



Science Technology

Extracting Audio From Visual Information

rtoz writes Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag (video) photographed from 15 feet away through soundproof glass.

  • Re:Not surprising (Score:5, Informative)

    by Z00L00K ( 682162 ) on Monday August 04, 2014 @09:45AM (#47599099) Homepage Journal

    To follow up, look at the Electromax Laser Listening Systems [electromax.com].

  • by BitZtream ( 692029 ) on Monday August 04, 2014 @09:53AM (#47599145)

    The sensor and optics must have been ridiculously high quality and resolution for this to work. Sensor noise alone would almost certainly rule this out for any COTS consumer package. They certainly aren't doing it with CNN footage or old CCTV surveillance tapes.

    In that case it's of no practical value, since a laser mic would be far cheaper and more discreet.

    Cool from an academic perspective that they can use DSP now, but it's just more fun with a laser mic: same principles and theories, new and less workable application.

  • Re:Not surprising (Score:5, Informative)

    by JazzHarper ( 745403 ) on Monday August 04, 2014 @10:02AM (#47599195) Journal

    There is a very significant difference: this involves detecting vibrations in images of objects in a video recording rather than the objects themselves. However, not just any video will do; it requires a very high frame rate.

  • by Anonymous Coward on Monday August 04, 2014 @10:26AM (#47599375)

    You need a good 500 fps to recover audio from video. This has not been standard, but is possible with some cameras.

  • by SydShamino ( 547793 ) on Monday August 04, 2014 @11:48AM (#47600147)

    No, you can pick up something higher than Nyquist, as long as you understand your sources of information and noise. It will alias down into the measurable range, and you can extract useful information from the alias. We have a system that operates up to 1 MHz using a 1.8 MHz ADC. When we know the signal is at 1 MHz, we extract the information at 800 kHz and use that.
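
    To make the arithmetic above concrete, here is a minimal Python sketch of the folding rule for an ideal sampler. The function name `alias_frequency` is made up for illustration, and the sketch ignores everything you have to know in practice (anti-alias filtering, noise, and which Nyquist zone the signal really sits in).

```python
def alias_frequency(f_signal_hz: float, f_sample_hz: float) -> float:
    """Apparent frequency of a pure tone after ideal sampling.

    Hypothetical helper for illustration only: it folds the tone into
    the first Nyquist zone, i.e. the range [0, f_sample_hz / 2].
    """
    folded = f_signal_hz % f_sample_hz        # wrap into [0, fs)
    return min(folded, f_sample_hz - folded)  # reflect into [0, fs/2]


# The example from the comment: a 1 MHz signal on a 1.8 MHz ADC.
print(alias_frequency(1.0e6, 1.8e6))   # 800000.0
# A tone already below Nyquist is left where it is.
print(alias_frequency(100.0, 1000.0))  # 100.0
```

    The alias lands at 800 kHz, matching the figure above. The catch is that a real 800 kHz tone would land in the same place, which is why you need to know your sources beforehand.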

    What the GGP was talking about, though, was finding resonances on the bag where unique 30-Hz-wide bands of higher frequencies were being naturally modulated down to baseband. If you had 100 points on the bag that each modulated a different frequency (30 Hz, 45 Hz, 90 Hz, ... 1500 Hz), you could extract the data from each sub-band separately and reconstruct the original signal. See http://en.wikipedia.org/wiki/F... [wikipedia.org] and assume the source isn't one 1500 Hz conversation but instead one hundred 15 Hz conversations. And also assume that is one amazing bag of chips.

  • by blincoln ( 592401 ) on Monday August 04, 2014 @11:48AM (#47600149) Homepage Journal

    For some reason, the person who posted the article or the Slashdot editors linked to a bad knock-off video that removed 3/4 of the details instead of the actual researchers' video [youtube.com]. The real video makes it clear that they can also get results from a standard DSLR 60 FPS video by taking advantage of the rolling shutter effect. There's a fidelity loss, but it's a lot better than I would have expected.

  • Re:Not surprising (Score:5, Informative)

    by doublebackslash ( 702979 ) <doublebackslash@gmail.com> on Monday August 04, 2014 @12:47PM (#47600677)

    FTFA

    In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as it was with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.

    They don't go into detail on the algorithm, but reading between the lines, it seems they are using the spatial nature of video and the fact that not every pixel is captured at exactly the same moment (let alone each line) to ferret out higher-frequency information. I have other guesses, but they are wild speculation. Either way, VERY cool.
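
    A rough illustration of that timing argument, not the researchers' algorithm: if a rolling-shutter sensor reads its rows one after another, each row is a sample taken at a slightly different time. The sketch below assumes evenly spaced row readout. `row_sample_times`, `effective_row_rate`, the 1080-row frame, and `readout_fraction` (the share of each frame period spent reading rows) are all hypothetical names and values chosen for illustration.

```python
def row_sample_times(fps, rows, n_frames, readout_fraction=1.0):
    """Timestamp (seconds) of every row in n_frames consecutive frames.

    Assumes rows are read top to bottom at a constant interval, and that
    readout occupies `readout_fraction` of each frame period (assumption).
    """
    frame_period = 1.0 / fps
    row_delay = frame_period * readout_fraction / rows
    return [
        frame * frame_period + row * row_delay
        for frame in range(n_frames)
        for row in range(rows)
    ]


def effective_row_rate(fps, rows, readout_fraction=1.0):
    """Rows sampled per second while the sensor is actually reading out."""
    return fps * rows / readout_fraction


times = row_sample_times(fps=60, rows=1080, n_frames=2)
print(len(times))                    # 2160
print(effective_row_rate(60, 1080))  # 64800.0
```

    So a 60 fps, 1080-row video would carry tens of thousands of row timestamps per second under these assumptions. With `readout_fraction` below 1 there is a gap at the end of every frame where nothing is sampled, so the timestamps are not uniformly spaced and any reconstruction would have to cope with that.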
