SONY

Audio, Speech & NLP

Control sounds freely, and create uncharted acoustic experiences

High-Quality Sound Technologies

There is growing demand for audio products capable of delivering high-quality sound, so Sony is developing multiple technologies that allow users to enjoy high-resolution sound. These include DSEE Ultimate (Digital Sound Enhancement Engine), which uses AI-based technology to upscale existing music such as CDs or MP3 files to near-high-resolution sound quality, and LDAC™, an audio-coding technology for transmitting high-resolution sound wirelessly.

Image of technology for upscaling sound sources such as CDs and MP3s

Noise Canceling

We are developing advanced Noise-Canceling technology to reduce ambient noise and create a comfortable listening environment. Noise-Canceling technology combines signal-processing algorithms, LSI (Large Scale Integration) hardware, and acoustic-transducer technologies for microphones and speakers. This fusion of technologies delivers industry-leading Noise-Canceling performance with low power consumption in a compact package. To boost performance further, we are also developing application algorithms, such as optimization for individual differences, that handle variations between users. Noise-Canceling technology is now evolving to include Ambient Sound Mode, which reduces unwanted noise while extracting desired sounds from everything the microphone picks up. Our goal is the optimum listening environment regardless of location.

Overview chart of Noise Canceling
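
As a rough illustration of the kind of signal-processing algorithm involved, the sketch below uses a textbook least-mean-squares (LMS) adaptive filter to remove noise from a signal, given a reference microphone that picks up mostly noise. This is not Sony's algorithm: the function names, filter length, step size, and synthetic signals are all assumptions chosen for the example, and a real headphone system must also account for the acoustic path between speaker and ear.

```python
import math
import random


def lms_cancel(reference, primary, taps=4, mu=0.05):
    """Textbook LMS adaptive noise canceller (illustrative only).

    reference: samples from a microphone that hears mostly noise
    primary:   samples containing the wanted sound plus filtered noise
    Returns the error signal, which is the estimate of the wanted sound.
    """
    weights = [0.0] * taps   # adaptive filter coefficients
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for x, d in zip(reference, primary):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate
        # Move each weight along the negative gradient of the squared error.
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def mean_power(samples):
    return sum(s * s for s in samples) / len(samples)


# Synthetic test signals (assumed for the example).
random.seed(0)
n = 4000
noise = [random.uniform(-1.0, 1.0) for _ in range(n)]
wanted = [0.1 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
# The noise reaches the primary microphone through a short two-tap path.
leaked = [0.8 * noise[i] + (0.3 * noise[i - 1] if i else 0.0) for i in range(n)]
primary = [w + l for w, l in zip(wanted, leaked)]

cleaned = lms_cancel(noise, primary)

# Compare noise power before and after, once the filter has adapted.
tail = slice(n // 2, n)
noise_before = mean_power(leaked[tail])
noise_after = mean_power([c - w for c, w in zip(cleaned[tail], wanted[tail])])
```

Comparing `noise_after` with `noise_before` shows how much of the leaked noise the filter removes after it has adapted. The step size `mu` trades adaptation speed against residual error.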

Audio Codec

Sony previously contributed to the MPEG standardization of AAC, the audio compression method used in terrestrial digital broadcasting and music distribution services. In recent years, we have been developing and standardizing technologies related to 4K/8K broadcasting and 3D audio. By having sound reach the listener from all directions, 3D audio provides an unprecedented sense of presence. Sony is also focusing on the development of new playback technologies such as Object Audio, which is independent of the position and number of speakers. In addition, we are working on 3D audio playback for free-viewpoint video, which has recently attracted considerable attention.

Timeline of audio codec transition

Source Separation

Source separation is expected to serve a wide variety of applications, including extracting vocals from past recordings and remixing them, and extracting instrument sounds from monaural recordings and rearranging them in the latest spatial audio formats. We have been developing cutting-edge source separation technology for over 10 years, incorporating feedback from creators. We also achieved the best separation score in the Music Separation Task at the international competition SiSEC (Signal Separation Evaluation Campaign) three times in a row.

Overview image of Source Separation

Audio Signal Processing and Speech Recognition

We are developing technology that accurately recognizes users’ natural speech amid background noise and reverberation. Our focus is on improving the performance of audio signal processing and speech recognition technologies in the real world. We use deep learning to optimally integrate audio signal processing and speech recognition. This enables advanced speech recognition in unfavorable conditions, such as when there is mechanical noise from a robot. By tailoring these optimizations to specific devices and use cases, we aim to make the technology thoroughly user-friendly.

Image of Audio Signal Processing and Speech Recognition from input to output

Spoken Language Understanding / Natural Language Processing

We are developing Spoken Language Understanding technology to understand user utterances. This technology converts the text strings produced by speech recognition into machine-understandable information (semantic representation). Our models account for linguistic phenomena such as disfluencies and abbreviations, and draw on a semantic database that links spoken language with the real world. To further the understanding of natural language itself, we are developing Natural Language Processing technology that analyzes text. This process involves tokenizing, assigning parts of speech and semantic attributes, and parsing the structure. We are also developing Knowledge Information Processing technology, which is applied to language disambiguation.

Image of Spoken Language Understanding / Natural Language Processing from input to output
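
To make the idea of converting recognized text into a semantic representation more concrete, here is a minimal rule-based sketch. It is a toy under stated assumptions, not the model-based approach described above: the filler list, abbreviation table, intent names, and patterns are all hypothetical and exist only for this example.

```python
import re

# Hypothetical resources for the example.
FILLERS = {"um", "uh", "er"}                          # simple disfluencies
ABBREVIATIONS = {"vol": "volume", "tv": "television"}

# Each intent is paired with a pattern whose named groups become slots.
INTENT_PATTERNS = [
    ("set_volume",
     re.compile(r"\b(?:set|turn) (?:the )?volume (?:to )?(?P<level>\d+)\b")),
    ("play_music",
     re.compile(r"\bplay (?P<query>.+)$")),
]


def normalize(text):
    """Lowercase, drop fillers, expand abbreviations, collapse repeated words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in FILLERS]
    deduped = []
    for token in tokens:
        # Treat an immediately repeated word ("the the") as a disfluency.
        if not deduped or deduped[-1] != token:
            deduped.append(token)
    return " ".join(deduped)


def understand(text):
    """Map a recognized utterance to a small intent-and-slots dictionary."""
    cleaned = normalize(text)
    for intent, pattern in INTENT_PATTERNS:
        match = pattern.search(cleaned)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return {"intent": "unknown", "slots": {}}


volume_request = understand("Um, set the the vol to 30")
music_request = understand("uh play play some jazz")
other_request = understand("hello")
```

Here `volume_request` carries the intent `set_volume` with a `level` slot, even though the input contains a filler, a repeated word, and an abbreviation. Hand-written patterns like these do not scale to open-ended speech, which is one reason practical systems rely on learned models.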