Google is creating a sign language interpreter for video calls


People with hearing and speech impairments have difficulty using video calling apps, not only when trying to follow what the other participants are saying but also when using sign language, because the system may not give a signing participant priority as the person who is communicating.

PoseNet is the Google technology used to detect the hand movements and gestures that a hearing- or speech-impaired user makes during a video call.

To solve these and other problems that hearing-impaired people face on video calls, a team of developers at Google Research worked with a technology called “PoseNet,” a system that estimates the positions of the user's hands and arms in real time, so that a signing user is automatically given the same priority by the system as any other participant.

Google has focused on two things. First, recognizing sign language instantly and having the system give the signer the same priority as any speaking user, to make video calls more accessible. Second, creating a design that is light, fast, simple, and easy to use and plug in, that does not complicate or slow down the system, and that lets what the user means be seen and interpreted correctly.

To do this, Google reduces the CPU workload of handling HD video input by turning each frame into a set of reference points, one for every part of the body the user expresses themselves with (eyes, nose, shoulders, hands, arms …). Movements are calculated from these points, so that what the user signs is not skipped. In addition, the system estimates the size of the person from the distance between their shoulders. In this way, it adapts to each person and can interpret any gesture or movement they make.
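The shoulder-based scaling described above can be illustrated with a minimal sketch. It is not Google's code: the landmark names, the pixel coordinates, and the choice to center on the midpoint between the shoulders are assumptions made for illustration only.

```python
import math
from typing import Dict, Tuple

# Hypothetical format: landmark name -> (x, y) pixel position in one frame.
Landmarks = Dict[str, Tuple[float, float]]


def shoulder_width(landmarks: Landmarks) -> float:
    """Distance between the two shoulder landmarks, in pixels."""
    return math.dist(landmarks["left_shoulder"], landmarks["right_shoulder"])


def normalize_landmarks(landmarks: Landmarks) -> Landmarks:
    """Express every landmark in units of shoulder width.

    Assumption: positions are also centered on the shoulder midpoint, so
    the result does not depend on where the person sits in the frame.
    """
    width = shoulder_width(landmarks)
    if width == 0:
        raise ValueError("shoulder landmarks coincide; cannot normalize")
    lx, ly = landmarks["left_shoulder"]
    rx, ry = landmarks["right_shoulder"]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    return {
        name: ((x - cx) / width, (y - cy) / width)
        for name, (x, y) in landmarks.items()
    }


# The same pose seen close to the camera and farther away (half the size,
# shifted in the frame) yields the same normalized coordinates.
near = {"left_shoulder": (100.0, 200.0), "right_shoulder": (300.0, 200.0),
        "right_wrist": (350.0, 100.0)}
far = {"left_shoulder": (60.0, 120.0), "right_shoulder": (160.0, 120.0),
       "right_wrist": (185.0, 70.0)}
near_norm = normalize_landmarks(near)
far_norm = normalize_landmarks(far)
```

With this kind of scaling, a gesture made by a person near the camera and the same gesture made farther away produce comparable numbers, which is the adaptation the paragraph above describes.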

This system was presented at the European Conference on Computer Vision 2020 (ECCV), where Google demonstrated its real-time sign language recognition model and how it works during a video call. When a user who cannot speak moves their hands or arms to say something, the video focuses on them directly without any help from the host, as shown below:

The engineers at Google used an architecture that captures the user's optical flow (the pattern of apparent motion from one frame to the next). In this way, the system knows in which frames the person is moving, comparing every two consecutive frames to recognize when they are signing. It then relies on a long short-term memory (LSTM) network, which carries information from earlier frames forward as each frame is processed. This is done automatically for as long as the user takes part by speaking in sign language. Here is a demonstration of the process explained:
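To make the frame-to-frame comparison concrete, here is a minimal sketch, again not Google's implementation. It assumes each frame has already been reduced to landmark positions scaled by shoulder width, and it uses a fixed, hypothetical threshold where the system described above uses a learned memory model.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical format: landmark name -> (x, y), already size-normalized.
Landmarks = Dict[str, Tuple[float, float]]


def frame_motion(prev: Landmarks, curr: Landmarks) -> float:
    """Mean displacement of the landmarks present in both frames."""
    shared = prev.keys() & curr.keys()
    if not shared:
        return 0.0
    return sum(math.dist(prev[n], curr[n]) for n in shared) / len(shared)


def moving_frames(frames: List[Landmarks], threshold: float = 0.05) -> List[bool]:
    """For each pair of consecutive frames, report whether motion exceeds
    the threshold. The threshold value is an illustrative assumption."""
    return [frame_motion(a, b) > threshold for a, b in zip(frames, frames[1:])]


# Three frames: the wrist is still, then moves.
frames = [
    {"right_wrist": (0.0, 0.0)},
    {"right_wrist": (0.0, 0.0)},
    {"right_wrist": (0.3, 0.4)},
]
flags = moving_frames(frames)
```

A simple threshold like this reacts to any movement, including a user shifting in their seat; that limitation is one reason a model with memory across frames is a more plausible fit for telling signing apart from other motion.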

Once Google could recognize sign language, the challenge was to pass what the system had detected to the call as if it were voice input. To do this, the team developed a component that emits an ultrasonic audio tone over a virtual audio cable, which any video calling platform can pick up.

The audio is transmitted at a frequency of 20 kHz, which is normally above the range of human hearing. Video calling platforms detect the volume of a participant's audio and give that participant priority. So, by using such a high frequency, you can “fool” the platform into believing that a user is talking, when in fact a machine has detected their movement. The following picture shows the operation of this system:
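As a rough illustration of the audio side only, the sketch below generates a 20 kHz sine tone as 16-bit PCM using the Python standard library. The 48 kHz sample rate, the amplitude, and the function names are assumptions; routing the result into a virtual audio cable depends on the operating system and is not shown.

```python
import io
import math
import struct
import wave
from typing import List


def ultrasonic_tone(duration_s: float, freq_hz: float = 20_000.0,
                    sample_rate: int = 48_000,
                    amplitude: float = 0.5) -> List[int]:
    """Return 16-bit PCM samples of a sine tone.

    The sample rate must be more than twice the tone frequency (Nyquist),
    otherwise a 20 kHz tone cannot be represented.
    """
    if freq_hz >= sample_rate / 2:
        raise ValueError("sample rate too low for this frequency")
    n_samples = int(duration_s * sample_rate)
    peak = amplitude * 32767
    return [
        int(peak * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        for n in range(n_samples)
    ]


def to_wav_bytes(samples: List[int], sample_rate: int = 48_000) -> bytes:
    """Pack mono 16-bit samples into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


tone = ultrasonic_tone(0.01)   # 10 ms of tone
wav_bytes = to_wav_bytes(tone)
```

In a setup like the one described, such a tone would be played only while signing is detected, so that the platform's volume-based speaker detection sees the signer as active.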

Google has released more accessibility features for its other products in the past few months. In Google Chrome, for example, it is already possible to create transcripts in real time, while Google Maps already indicates the areas that are accessible to people with reduced mobility.
