Researchers Ghose and Prevost have designed a deep-learning algorithm that generates an engaging soundtrack synchronized to an otherwise silent video.

Conventionally, sound effects in movies are added in post-production (after recording) in a process called Foley. A team of Texas-based researchers, however, decided to use deep learning to automate this process. They trained a neural network to recognize 12 common movie events to which Foley effects are usually added. Their system first detects the type of sound to generate, then uses a sequential network to generate that sound. Finally, to sync things up just right, another neural network aligns the generated effect with its respective frames of video.

To accomplish this, the team created a dataset of short video clips featuring the 12 distinct movie events: roughly 1,000 videos, each about 5 seconds long. Sounds such as footsteps, cutting, and clock ticks were physically performed and recorded in a studio. Other sounds, such as a horse running, gunshots, and fire, were downloaded from YouTube.
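The dataset layout described above can be sketched as a simple clip manifest. This is purely illustrative: the class names below are plausible guesses based on the examples in the text, not the paper's exact label set.

```python
# Illustrative Foley class labels; the first six come from the article's
# examples, the rest are hypothetical placeholders.
FOLEY_CLASSES = [
    "footsteps", "cutting", "clock", "horse", "gunshot", "fire",
    "rain", "typing", "door", "car", "water", "breaking",
]

def make_manifest(num_clips=1000, clip_seconds=5.0):
    """Build a list of clip records: id, class label, and duration.

    Round-robin label assignment is a stand-in for the real dataset's
    (unknown) class distribution.
    """
    return [
        {
            "clip_id": i,
            "label": FOLEY_CLASSES[i % len(FOLEY_CLASSES)],
            "duration_s": clip_seconds,
        }
        for i in range(num_clips)
    ]

manifest = make_manifest()
```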

The next step is to predict the correct class of sound. For this, the team compared two methods: a Frame Sequence Network (FSN) and a Frame Relation Network (FRN).

Frame Sequence Network

In the first approach, they interpolate between consecutive video frames to increase temporal granularity. A ResNet-50 CNN (Convolutional Neural Network) extracts features from each frame, and a Fast-Slow LSTM (a recurrent neural network) then predicts the sound class from those image features. The FRN, by contrast, attempts to capture the object's actions and detail transformations using less computation time.
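The interpolation step of this approach can be sketched as follows. This is a minimal stand-in: it doubles temporal granularity by inserting linear midpoints between consecutive frames, and omits the ResNet-50 feature extraction and LSTM, which would operate on the densified sequence.

```python
import numpy as np

def interpolate_frames(frames):
    """Insert the linear midpoint between each pair of consecutive
    frames, roughly doubling temporal granularity.

    A simple stand-in for whatever interpolation the paper uses.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        out.append((a + b) / 2.0)  # synthetic in-between frame
    out.append(frames[-1])
    return np.stack(out)

frames = np.random.rand(8, 224, 224, 3)  # 8 RGB frames at ResNet input size
dense = interpolate_frames(frames)        # 8 originals + 7 midpoints = 15
```

Each frame of `dense` would then be fed through the CNN feature extractor before classification.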

Frame Relation Network

Here, a Multi-Scale Temporal Relation Network (MS-TRN) compares features from frames that are a controlled but varying number of frames apart. Finally, an MLP (Multilayer Perceptron) combines all the features.

Then, a sound has to be generated for the predicted class. To do so, the researchers used the inverse STFT (Short-Time Fourier Transform). Using the STFT, they computed the average of all captured spectrograms belonging to each sound class. Doing so gave them a decent starting point for generating a sound. The only job left for the neural network at this point is to predict the deviation from this average for each sample step of the sound. Modern audio uses over 44,000 samples per second.

Four methods were used to evaluate the project's performance, one of which was a human-based qualitative evaluation: local college students were asked to choose the most appropriate sound, the most lifelike sound, the most accurately synchronized sound, and the sound with the least noise.

As it turns out, the synthesized sounds were preferred over the actual sound 73.71% of the time for one of the models; for the other, 65.96% of students preferred the generated sounds.

