Extracting Audio From Visual Information

Posted by samzenpus on Monday August 04, 2014 @09:38AM from the what-the-bag-says dept.

rtoz writes: Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag (video) filmed from 15 feet away through soundproof glass.
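The summary only says the algorithm analyzes "minute vibrations" of objects in the video. As a rough illustration of how a sub-pixel motion can become a waveform, the sketch below swaps in a simple Fourier phase-shift estimator; it is not the researchers' pipeline, which the summary does not describe. It treats one scan line of a textured object as a periodic 1-D signal and reads each frame's displacement from the phase change of a single Fourier coefficient. The scan-line width, texture, frame rate, and vibration amplitude are all hypothetical.

```python
import numpy as np

WIDTH = 64          # pixels in a hypothetical 1-D scan line across the object
CYCLES = 3          # spatial frequency of the texture, in cycles per scan line
x = np.arange(WIDTH)

def frame(shift_px):
    """One scan line of a periodic texture displaced by a (sub-pixel) amount."""
    return np.cos(2 * np.pi * CYCLES * (x - shift_px) / WIDTH)

def estimate_shift(ref, cur):
    """Displacement of cur relative to ref, from the phase change at bin CYCLES.

    A shift d multiplies that Fourier coefficient by exp(-2j*pi*CYCLES*d/WIDTH),
    so the estimate is valid while |d| < WIDTH / (2 * CYCLES).
    """
    ratio = np.fft.fft(cur)[CYCLES] / np.fft.fft(ref)[CYCLES]
    return -np.angle(ratio) * WIDTH / (2 * np.pi * CYCLES)

FPS = 2200.0                                    # hypothetical high-speed camera
t = np.arange(110) / FPS                        # 50 ms of video
true_disp = 0.01 * np.sin(2 * np.pi * 440 * t)  # 1/100-pixel vibration at 440 Hz
ref = frame(0.0)
recovered = np.array([estimate_shift(ref, frame(d)) for d in true_disp])
print(np.max(np.abs(recovered - true_disp)))    # near machine precision here
```

The recovered per-frame displacement is the audio waveform up to a scale factor. On real footage, sensor noise, the absence of a clean periodic texture, and the tiny motion amplitude are what make the problem hard; this noise-free toy sidesteps all of them.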

Not surprising (Score:5, Insightful)

by Z00L00K ( 682162 ) writes: on Monday August 04, 2014 @09:42AM (#47599079) Homepage Journal

Spy agencies were already measuring the vibrations of windows and other objects 40 to 50 years ago, so I wonder whether this is simply something that has been rediscovered.

Re:Not surprising (Score:5, Insightful)

by Hamsterdan ( 815291 ) writes: on Monday August 04, 2014 @10:00AM (#47599181)

The countermeasure to laser listening was to mount the windows in a pipe *frame* and play music through the pipes. Extracting audio from an object inside the building defeats that countermeasure. This is 2014; do not expect any privacy, especially from government agencies...

Re:Now my tin-foil hat... (Score:5, Insightful)

by JackieBrown ( 987087 ) writes: on Monday August 04, 2014 @10:07AM (#47599223)

The hat is a trick!
The reason they want you to wear foil is so that the sound can bounce off it.

Re:Not surprising (Score:5, Insightful)

by timeOday ( 582209 ) writes: on Monday August 04, 2014 @10:08AM (#47599233)

Well, even a normal microphone is "just" measuring the linear displacement of a membrane over time, so clearly the important distinction is how you measure it. A laser range-finder is different from a microphone, and a video camera is different from a laser range-finder.

Re:Requires a very high speed camera (Score:5, Insightful)

by interiot ( 50685 ) writes: on Monday August 04, 2014 @10:10AM (#47599241) Homepage

30 Hz is far below the Nyquist rate [wikipedia.org] (6800 Hz, going by POTS specs), so no, that wouldn't be possible without some fundamental changes in our understanding of information theory and physics.
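The arithmetic behind this objection, sketched with two small helper functions (the names are hypothetical): the Nyquist rate is twice the highest frequency present, and a tone above half the frame rate does not vanish but shows up at a lower "alias" frequency. At 30 fps, a 440 Hz tone produces exactly the same samples as a 10 Hz tone.

```python
import numpy as np

def nyquist_rate(max_freq_hz):
    """Minimum sampling rate for perfect reconstruction of a band-limited signal."""
    return 2.0 * max_freq_hz

def alias_frequency(f_hz, fs_hz):
    """Apparent frequency, folded into [0, fs/2], of a real tone f sampled at fs."""
    r = f_hz % fs_hz
    return min(r, fs_hz - r)

print(nyquist_rate(3400))   # top of the POTS voice band -> 6800.0
fps = 30
print(fps / 2)              # highest frequency 30 fps can represent -> 15.0

n = np.arange(60)           # two seconds of frames
tone = np.cos(2 * np.pi * 440 * n / fps)
alias = np.cos(2 * np.pi * alias_frequency(440, fps) * n / fps)
print(alias_frequency(440, fps), np.allclose(tone, alias))   # 10 True
```

Because the samples are identical, nothing in the 30 fps data alone can tell the 440 Hz tone from the 10 Hz one; any recovery has to come from extra knowledge about the signal.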

Re:Requires a very high speed camera (Score:3, Insightful)

by sunderland56 ( 621843 ) writes: on Monday August 04, 2014 @10:13AM (#47599267)

reconstruct from standard 30 fps video
Dear sir: what you are asking is impossible.
Sincerely yours,
Harry Nyquist

Re:Now my tin-foil hat... (Score:5, Insightful)

by fuzzyfuzzyfungus ( 1223518 ) writes: on Monday August 04, 2014 @10:37AM (#47599495) Journal

Worse than that. If there's metal foil involved, vibration measurement should be doable with RF as well as light. Only with next-generation reduced-radar-cross-section geometries and RF-absorbent materials can a truly secure tinfoil hat be constructed.

Unfortunately, walking around with what appears to be a small F-117 attached to your head offers limited visual camouflage potential and may prove counterproductive in your attempts to avoid Their surveillance.

Re:Requires a very high speed camera (Score:4, Insightful)

by tepples ( 727027 ) writes: <tepples.gmail@com> on Monday August 04, 2014 @11:30AM (#47599961) Homepage Journal

In theory, if you can find different targets in the frame whose resonant frequencies are spaced no more than 15 Hz apart, you can read a different 15 Hz band off each target.
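A minimal sketch of the band-per-target idea (bandpass sampling), under a strong assumption: each hypothetical target passes only its own 15 Hz-wide band, with band edges at multiples of 15 Hz. At 30 fps every such band folds onto 0 to 15 Hz, upright for even-numbered bands and mirrored for odd-numbered ones, so knowing which band a target covers lets you unfold its alias back to the true frequency. Real resonances will not line up with these edges, and a target whose response straddles an edge folds two bands on top of each other.

```python
FS = 30.0          # standard video frame rate from the thread
BAND = FS / 2.0    # width of each recoverable band: 15 Hz

def observed_alias(f_hz, fs_hz=FS):
    """Frequency at which a real tone f appears after sampling at fs, in [0, fs/2]."""
    r = f_hz % fs_hz
    return min(r, fs_hz - r)

def unfold(alias_hz, band_index, band_hz=BAND):
    """Map an observed alias back into band [k*band, (k+1)*band).

    Even-numbered bands appear upright; odd-numbered bands appear mirrored.
    """
    if band_index % 2 == 0:
        return band_index * band_hz + alias_hz
    return (band_index + 1) * band_hz - alias_hz

# Hypothetical targets, each assumed to respond only inside band k.
for true_f, k in [(97.0, 6), (110.0, 7), (20.0, 1)]:
    a = observed_alias(true_f)
    print(true_f, a, unfold(a, k))
```

Each target contributes one 15 Hz slice; covering the voice band this way would need on the order of a couple hundred suitably tuned targets in the frame.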

Re:Requires a very high speed camera (Score:3, Insightful)

by Anonymous Coward writes: on Monday August 04, 2014 @11:32AM (#47599989)

Oh dear. You even linked to Wikipedia (although not to the Wikipedia page "Nyquist Rate"). Does it not occur to you that OP understands those things better than you do?
To start with you need to understand what the Nyquist rate means. Sampling is like wrapping a signal around a cylinder. Just because parts are overlaid ("aliasing") doesn't mean you can't untangle the original signal. For instance, if a single audio source contains only pure harmonics, so the frequencies are known to be N, 2N, 3N, 4N, and so on, and if you have the range of possible N down to a smallish range (e.g. you know it's a voice) and you know that higher harmonics are always smaller than lower harmonics, then you can, from a massively sub-Nyquist sampling like this, extract both N *and* all the coefficients of all the harmonics. It's just like determining the dimensions of a triangle after it's wrapped around a cylinder. No, the triangle doesn't have to fit within one revolution of the cylinder, that's just the trivial case that obviously works.
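The harmonic-unwrapping argument can be tried numerically. This sketch assumes a synthetic voice-like signal with fundamental N = 123.4 Hz and four harmonics (highest 493.6 Hz, so a Nyquist rate of about 987 Hz), sampled at only 30 Hz for 10 s. It grid-searches N over a prior range and, for each candidate, fits a cosine and sine coefficient per harmonic by least squares; the candidate with the smallest residual wins. The least-squares search is one way to do this, not necessarily what the commenter had in mind, and it does not use the "higher harmonics are smaller" constraint. The prior range matters: 30 Hz samples cannot distinguish N from N + 30 or from mirrored candidates such as 240 - N = 116.6 Hz, so the assumed range of 118 to 140 Hz is chosen to exclude them.

```python
import numpy as np

FS = 30.0                        # sub-Nyquist sampling rate (frames per second)
N_TRUE = 123.4                   # hypothetical fundamental of the "voice", Hz
AMPS = [1.0, 0.5, 0.25, 0.125]   # assumed harmonic amplitudes
PHASES = [0.3, 1.1, -0.7, 2.0]   # arbitrary harmonic phases

t = np.arange(300) / FS          # 10 s of samples

def harmonic_basis(f0, t, n_harm):
    """Columns cos(2*pi*k*f0*t), sin(2*pi*k*f0*t) for k = 1..n_harm."""
    cols = []
    for k in range(1, n_harm + 1):
        w = 2 * np.pi * k * f0 * t
        cols += [np.cos(w), np.sin(w)]
    return np.column_stack(cols)

def fit_fundamental(x, t, lo, hi, step, n_harm):
    """Grid-search f0 in [lo, hi); return best f0 and per-harmonic amplitudes."""
    best = (np.inf, None, None)
    for f0 in np.arange(lo, hi, step):
        A = harmonic_basis(f0, t, n_harm)
        coef, *_ = np.linalg.lstsq(A, x, rcond=None)
        resid = np.sum((A @ coef - x) ** 2)
        if resid < best[0]:
            # Amplitude of each harmonic from its cosine/sine coefficient pair.
            best = (resid, f0, np.hypot(coef[0::2], coef[1::2]))
    return best[1], best[2]

# Synthesize the harmonic signal at the (far too low) 30 Hz sampling rate.
x = sum(a * np.cos(2 * np.pi * (k + 1) * N_TRUE * t + p)
        for k, (a, p) in enumerate(zip(AMPS, PHASES)))

f0, amps = fit_fundamental(x, t, lo=118.0, hi=140.0, step=0.01, n_harm=4)
print(round(f0, 2), np.round(amps, 3))
```

On this noise-free data the search should land on roughly 123.4 Hz and recover amplitudes close to 1, 0.5, 0.25 and 0.125. With noise, a wider prior range, or many more harmonics, the ambiguities grow quickly, which is where constraints like decreasing harmonic amplitudes would have to do real work.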
What OP is proposing is that because different parts of the physical system have different resonances, when you look at that part of the image you are seeing a strongly filtered version of the original signal - basically a single frequency. You can measure the size of this signal using an aliased sampling - there's no problem with that whatsoever, it just works, an aliased sampling has the same energy as a non-aliased sampling, the samples are just in a different order. Then if you know different image areas have different responses, you can build up an image of the signal by patchwork. It would be a bloody hard job for a crisp packet in arbitrary configuration, but if you get to design the object you're looking at you can make this as sensitive as you like, and even use really crappy cameras to do it.
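The claim that an aliased sampling keeps the signal's energy can be checked for a single tone: the mean square of the samples is amp²/2 whether or not the tone is aliased, provided the alias does not land on 0 Hz or exactly fs/2 and the record is long enough. The sketch assumes a hypothetical image region that responds only near 97 Hz, and compares an amplitude estimate from 30 fps samples (aliased to 7 Hz) with one from properly sampled data.

```python
import numpy as np

def rms_amplitude(samples):
    """Amplitude of a single sinusoid from its RMS: amp = sqrt(2) * rms."""
    return np.sqrt(2.0 * np.mean(np.square(samples)))

def sample_tone(freq_hz, amp, phase, fs_hz, duration_s):
    """Samples of amp * cos(2*pi*f*t + phase) taken at fs for duration_s seconds."""
    n = np.arange(int(round(fs_hz * duration_s)))
    return amp * np.cos(2 * np.pi * freq_hz * n / fs_hz + phase)

# Hypothetical region whose resonance passes only a tone near 97 Hz.
slow = sample_tone(97.0, 0.8, 0.4, fs_hz=30.0, duration_s=10.0)    # aliased to 7 Hz
fast = sample_tone(97.0, 0.8, 0.4, fs_hz=1000.0, duration_s=10.0)  # above Nyquist
print(rms_amplitude(slow), rms_amplitude(fast))
```

Repeating this per region, each assumed to be tuned to a different band, gives the band-by-band patchwork of signal levels described above; recovering phase and fine timing is a much harder problem than recovering these magnitudes.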
Nyquist rate isn't the be-all and end-all people think it is, it's just a limit for *perfect* reconstruction of *arbitrary* signals. The naive approach is to restrict yourself to sub-Nyquist signals and use the easy algorithms everybody knows. The fun stuff (read: the stuff you might get paid for) involves at least flirting with the Nyquist range, or even fully embracing that aliasing is happening and figuring out the consequences from first principles. Once you do this, you can do amazing things that seem impossible to Signal Processing 101 students ... the only problem then is you get SP101 students telling you you're an idiot for thinking that's possible. Oh, well.
BTW, the standard sampling rate for telephony is 8000 Hz. Pro-tip: if you want to sound like a signal processing expert, know common sample rates.
