3D environment sensing comprises several core technologies that use camera images and information from IMUs (Inertial Measurement Units) and other sensors to recognize the 3D structure of the environment and the camera's own position in 3D space.
Recently, the development of virtual and augmented reality (VR and AR) applications has been on the rise. 3D location recognition (head tracking) technology and 3D structure recognition technology are at the core of realizing such applications, with various companies developing their own technologies to expand the market. At the same time, the field of robotics also needs these technologies due to the development of autonomous drones, cars, and robots, which require machine control, route planning, and obstacle avoidance capabilities.
Development of these technologies at Sony began with the purpose of adding autonomous movement functionality to entertainment robots such as AIBO and QRIO. The technologies have made significant contributions to many of Sony’s products, for example, helping realize SmartAR, an AR application for mobile devices announced in 2011.
Additionally, dedicated sensing processors were developed in collaboration with Sony Semiconductor Solutions and installed into Sony’s prototype AR glasses, which were exhibited at the “GHOSTBUSTERS ROOKIE TRAINING” events held in 2019 in Los Angeles and Ginza. Users had a chance to experience a new form of AR entertainment which allows for unprecedented immersion by fusing an everyday environment with augmented reality.
Recently, we have been developing our existing technology further, applying it to the field of robotics and mobile objects. It was successfully implemented in the professional Airpeak S1 drone unveiled at CES 2021. By recognizing its own 3D position and the structure of its surrounding 3D environment, this technology helps the drone control its flight and detect obstacles, enabling stable navigation in environments without GPS and contributing to highly precise landings.
The signal processing technologies we have been developing include:
(1) Technology that allows devices to estimate their position based on photos taken with a camera (SLAM, VPS)
(2) Technology that enables the reconstruction of the target object’s shape based on images (3D Reconstruction)
For our technologies, we value precision, throughput, and the ability to operate reliably in any environment (robustness). But in reality, there are multiple factors that make image-based position estimation and shape reconstruction difficult, including camera-related factors such as blur, noise in dark scenes, and lens distortion, and environmental factors such as changes in the amount of sunlight and lack of texture. We are working to overcome these challenges and improve quality in order to create devices capable of functioning reliably under all sorts of conditions.
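To make one of these camera-related factors concrete, here is a minimal sketch of correcting lens distortion before any position estimation, using generic OpenCV calls; the intrinsics and distortion coefficients are placeholder values, not parameters of any actual Sony camera.

```python
import numpy as np
import cv2

# Placeholder calibration values; real systems obtain them from a calibration
# procedure (e.g. cv2.calibrateCamera with a checkerboard pattern).
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])                    # camera intrinsics
dist = np.array([-0.25, 0.07, 0.0, 0.0, 0.0])      # radial/tangential distortion

frame = np.zeros((720, 1280, 3), np.uint8)         # synthetic stand-in for a camera frame
undistorted = cv2.undistort(frame, K, dist)        # remove lens distortion
# Feature extraction and pose estimation then operate on the undistorted image.
```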
Depending on the application, even more complicated problems can arise, making it difficult to achieve unparalleled performance through sensing technologies alone. Considering the entire process from sensing to control when searching for solutions is especially important in the field of mobile objects. Mobility sensing, explained in more detail below, involves using limited resources to grasp a mobile object's surrounding environment in real time and realize autonomous obstacle avoidance.
We are developing the SLAM (Simultaneous Localization and Mapping) and VPS (Visual Positioning System) technologies, which are used for position estimation.
SLAM uses “mapping” to recognize the surrounding environment as a 3D model in real time, simultaneously performing “localization” to determine where the device itself is located within the model. Our Visual SLAM processes information on edge devices, which necessitates the development of light, low-load, and low-latency algorithms. Because Sony is capable of producing image sensors in-house, these algorithm-related challenges can be shared with the manufacturing departments, which enables approaches from the hardware side as well.
For example, we utilize ToF image sensors developed by Sony Semiconductor Solutions to make it easier to estimate the distance to the target, and we use IMU information, which assists image stabilization, to improve processing efficiency.
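As a rough sketch of the localization half of Visual SLAM, the example below estimates the relative camera motion between two consecutive frames from feature matches. It uses generic OpenCV building blocks with placeholder intrinsics and is not Sony's implementation; a real edge pipeline would additionally fuse IMU and ToF depth and maintain the map.

```python
import numpy as np
import cv2

def relative_pose(prev_gray, curr_gray, K):
    """Estimate rotation R and (unit-scale) translation t between two frames."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)

    # Match binary descriptors with a cross-checked brute-force matcher.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # A RANSAC-based essential matrix rejects outlier matches, then the
    # relative rotation and translation direction are recovered from it.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # translation scale is unknown without IMU or depth input

# Placeholder intrinsics; a real device would use calibrated values.
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])
```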
VPS is a technology aimed at position estimation within urban environments and commercial facilities. The 3D Reconstruction technology, explained in more detail below, is utilized to generate large-scale 3D maps, and a device's position is estimated within the map based on information from images. By sharing the position and environment-related information of several devices using the map, we are creating new map-based AR experiences and facilitating communication between devices.
When images are input to estimate a device's position on a map generated beforehand, those images must be matched against the map. The biggest problems stem from differences in lighting between when the map was created and when it is used.
A machine has to associate daytime and nighttime views purely from the digital signal, which makes the problem difficult to model. We are using a deep learning-based approach to achieve reliable performance even between conditions that differ greatly, such as daytime and nighttime.
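The sketch below shows, in simplified form, how a query image could be localized against a pre-built map once illumination-invariant descriptors and keypoints are available. The descriptor network itself is not described here, so its outputs appear only as function arguments; the matching and PnP steps use standard NumPy and OpenCV calls.

```python
import numpy as np
import cv2

def localize_against_map(query_desc, query_kp, map_desc, map_points, K):
    """Estimate the 6-DoF pose of a query image within a pre-built 3D map.

    query_desc : (N, D) descriptors of query-image keypoints (e.g. the output
                 of an illumination-invariant deep network -- placeholder here)
    query_kp   : (N, 2) pixel coordinates of those keypoints
    map_desc   : (M, D) descriptors stored with the map's 3D landmarks
    map_points : (M, 3) landmark positions in map coordinates
    K          : (3, 3) camera intrinsics
    """
    # Mutual nearest-neighbour matching between query and map descriptors.
    dists = np.linalg.norm(query_desc[:, None, :] - map_desc[None, :, :], axis=2)
    q2m = dists.argmin(axis=1)
    m2q = dists.argmin(axis=0)
    mutual = np.array([i for i, j in enumerate(q2m) if m2q[j] == i])

    obj = map_points[q2m[mutual]].astype(np.float32)   # matched 3D landmarks
    img = query_kp[mutual].astype(np.float32)          # matched 2D keypoints

    # PnP with RANSAC recovers rotation and translation from the 2D-3D
    # correspondences while rejecting the remaining outliers.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K.astype(np.float32), None)
    return ok, rvec, tvec
```

In this framing, day/night robustness comes from training the descriptors so that corresponding points look similar regardless of lighting; the geometric steps above stay unchanged.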
3D Reconstruction involves integrating several camera images spatiotemporally to restore and analyze 3D structures. The camera's position and orientation are estimated from the images, the distance to structures is then estimated for each pixel, and this depth is integrated with other information such as normal vectors, reliability, and texture to create a realistic 3D model. Using distance measuring devices (LiDAR, etc.) to add depth-related information also enables physical scale estimates, increases computational speed, and improves precision.
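A minimal sketch of the geometric core of this process is shown below: an estimated per-pixel depth map is back-projected into world coordinates using the estimated camera pose. The variable names and the tiny synthetic input are illustrative only; a real pipeline would also fuse normal vectors, texture, and reliability weights across many views.

```python
import numpy as np

def backproject_depth(depth, K, T_wc):
    """Lift a per-pixel depth map (z-depth in meters) into world-space points.

    depth : (H, W) estimated depth for every pixel
    K     : (3, 3) camera intrinsics
    T_wc  : (4, 4) camera-to-world pose estimated from the images
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T             # pixel -> normalized camera coordinates
    pts_cam = rays * depth.reshape(-1, 1)       # scale by the estimated depth
    pts_hom = np.hstack([pts_cam, np.ones((pts_cam.shape[0], 1))])
    return (pts_hom @ T_wc.T)[:, :3]            # transform into the world frame

# Tiny synthetic example: one 4x4 depth map viewed from the world origin.
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 1.5)                    # a flat wall 1.5 m in front of the camera
cloud = backproject_depth(depth, K, np.eye(4))
# Multiple views are fused by concatenating (and later filtering) their clouds.
```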
In addition to viewing uses such as 3D printing and free-viewpoint content, applications are expected to extend to sensing uses including collision detection in AR/VR and obstacle detection for drones and vehicles. As mentioned previously, it is also being used as map information for position estimation.
One of the applications involves implementing real-world structures into a virtual space and making them usable by humans and robots. Utilizing cloud computing will make it possible to create city-scale 3D models by inputting tens of thousands of high-definition images.
On the other hand, there are use cases that require lightweight processing on edge devices. We are also developing real-time 3D modeling technology and providing applications that allow devices with limited resources such as smartphones and head-mounted displays to recognize the shape of a room in real time.
We are aiming to realize autonomous movement for cars and drones with the development of “Mobility Sensing” technology which applies the 3D Reconstruction and Visual SLAM technologies to mobility.
Spurred by progress made in autonomous vehicles, the development of Mobility Sensing is proceeding against a backdrop of computer vision and deep learning advancements as well as evolving sensors and processors. With increases in the variety of mobile objects such as drones and autonomous robots, demand for remote operation, danger avoidance, and autonomous movement has seen a steep rise.
This technology allows a mobile object to not only recognize its surroundings, position, and distance from various objects through images, but also to immediately detect and avoid sudden obstacles and be aware of the area it can traverse.
For autonomous movement, realizing low-latency, real-time sensing is important. Thus, it is necessary to make the image recognition process lightweight, limiting its memory and power consumption according to the capacity of each device. When applying the technology to cars and drones, robustness to motor and engine vibration and to constantly changing lighting and environments is also required.
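As an illustration of the kind of lightweight check this implies, the sketch below flags an obstacle when any depth reading inside a central "flight corridor" region falls below a safety distance. The region and thresholds are arbitrary placeholders, not values from the Airpeak system.

```python
import numpy as np

def obstacle_ahead(depth, safety_distance=2.0):
    """Return True if something in the central image region (a crude stand-in
    for the flight corridor) is closer than the safety distance.

    depth : (H, W) metric depth map from stereo, ToF, or 3D reconstruction
    """
    H, W = depth.shape
    corridor = depth[H // 4: 3 * H // 4, W // 4: 3 * W // 4]
    valid = corridor[np.isfinite(corridor) & (corridor > 0)]   # drop invalid pixels
    return valid.size > 0 and float(valid.min()) < safety_distance

# Synthetic example: a mostly clear scene with one close object in the middle.
depth = np.full((480, 640), 10.0)
depth[200:280, 300:340] = 1.2          # something 1.2 m ahead
if obstacle_ahead(depth):
    print("Obstacle detected: trigger avoidance or stop command")
```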
Development of 3D environment sensing algorithms is widespread throughout the entire world, but it is not an easy task to realize a system which integrates sensors and applications on a single platform.
Sony has a significant advantage when it comes to image sensors, the core 3D environment sensing device. In addition, we excel at producing edge devices. To realize 3D environment sensing on edge devices with superior capabilities, it is very important to reflect the views of core engineers in areas ranging from device input to recognition processing.
The CMOS image sensor used for sensing applications reflects the views of various engineers in order to achieve high robustness. In addition to accepting an external signal to control shooting timing, which assumes the use of several cameras, it controls the start and end of exposure even during AE (Auto Exposure) so that exposure-center times are spaced at even intervals, which makes synchronization with IMU observations easy.
Additionally, Sony’s vision sensing processor used for recognition processing is a system which allows devices with greatly differing sampling frequencies, such as cameras and IMUs, to synchronize by controlling timing, making it much easier to realize algorithms. Also, the internal processor structures are selected and designed based on each algorithm’s processing details, and the algorithms themselves are sped up and performed on the optimal circuitry for each processor’s characteristics, ensuring efficient algorithm processing.
These detailed specifications reflect core engineers’ advice, allowing us to realize low power consumption and low latency real-time 3D environment sensing technology with superior capabilities on the Airpeak S1.
Development of 3D Reconstruction originally began at our R&D Center Stuttgart Laboratory. Although its origins differed from Visual SLAM and VPS, which originated from Japan, international collaboration between the developers of these technologies gradually increased. Currently, 3D Reconstruction is being combined with Visual SLAM and VPS to utilize it for remote operations, develop autonomous mobility in specific 3D environments, and propose new AR/VR applications.
We are also proactively collaborating with external parties. Utilizing Sony's own program for collaboration with universities, we are strengthening our initiatives with overseas universities and research institutions in Europe, North America, and elsewhere. The 3D Reconstruction project was also created as a result of a collaboration with an Italian university. In cooperation with Sony Mobile Communications, this project was commercialized in the form of the "3D Creator" app, which lets users create a 3D avatar using photos taken by their smartphone. Currently, Austrian doctoral students are working on novel algorithms for follow-up projects. Computer vision is a field with very active university research, so we will aim to exchange information and continue improving our existing technologies.
In fact, our 3D Reconstruction technology is displaying top-class performance on a worldwide scale in public benchmarks that have become the de facto standard. Applied development is underway with the expectation of uses in the creation of CG assets for virtual production, games, and measurement.
Our technology makes it possible for devices and machines to understand their environments and their own location. When they coexist in society with humans in the future, this functionality will be increasingly important, as it will help them decide what actions to take next.
Additionally, in AR/VR applications, real spaces are digitized using the 3D Reconstruction technology and combined with SLAM and VPS to provide a user experience that makes it seem as though one has jumped right into the virtual world, or as though the virtual world has fused with reality.
In this way, 3D environment sensing is a technology which could be used in a world where humans and robots move in and out of real and virtual space. A world of AI and the metaverse, which was the subject of science fiction works when we were children, is slowly becoming a reality. In this environment, we want to utilize Sony's strengths in image sensing and the entertainment industry to contribute to the society of tomorrow.
3D environment sensing is a very “hot” field at the moment due to autonomous driving and other applications. It is great fun to be able to come up with algorithms on your own in such an environment. Also, because Sony’s engineers are always aiming for commercialization, we always create prototypes during development. For example, when developing sensors for drones, we actually purchased a drone and tried controlling it with the algorithms we were developing, which helped us make improvements to the core algorithm. Being able to get a feel for how real products equipped with your algorithms actually behave is one of the perks of working at Sony.
The field of 3D environment sensing is attracting attention because it enables the creation of new types of products such as drones and AR/VR applications. I believe that Sony is one of the few companies which can develop truly epoch-making products. In addition to implementing signal processing and geometry knowledge in our development, we have recently started utilizing deep neural networks, which are used in other fields as well. That being said, they do not replace all traditional methods modeled by humans, so we need to discover the appropriate balance between old and new technologies. This too is one of the thrills of our research.
Computer vision is a very dynamic field, and advancements in machine learning are making it even harder than before to keep up. Sony has already realized various breakthroughs and created new use cases, yet challenges still remain. This is why we continue to collaborate with universities and research institutes to stay at the cutting edge of technology. Creating innovative products is not an easy task, but it is what makes working in the field worthwhile, and Sony offers a unique environment that allows you to do just that.