Digital Avatars: The Next Leap in Remote Work?

Zain Raza
Published in Dev Genius · 11 min read · Jun 3, 2021


Photo by Dylan Ferreira on Unsplash

Where Remote Work Fails

My friends, the fatigue is real — “Zoom Fatigue”, that is.

Love it or hate it, remote work is likely here to stay for a while, as it remains the safest, most affordable, and arguably most environmentally friendly way to conduct business in many parts of the globe.

However, if you’ve ever felt drained after a day of videoconference calls, you are far from alone. This past February, Dr. Jeremy Bailenson of Stanford University formally published a research study on the causes of Zoom Fatigue. In brief, Dr. Bailenson argues these causes are “excessive amounts of close-up eye gaze, cognitive load, increased self-evaluation from staring at video of oneself, and constraints on physical mobility” [1].

With the pandemic still going on, Zoom is the lifeblood of many of our careers. But is there any way we knowledge workers can save our minds from Zoom Fatigue?

How Deep Learning Can Help: Digital Avatars!

Is it possible that one day remote teams will rely upon digital avatars such as “Roberto” (the robot seen above) to communicate with one another?

But What is a Digital Avatar?

An avatar is nothing more than a representation of the user created in a computer. More formally, Professor Ducheneaut et al. describe (in their 2009 ACM paper, “Body and mind: a study of avatar personalization in three virtual worlds”) that these avatars can often be controlled and customized by the user to fit their personal appearance [2].

Here’s Why They Matter

As seen in the GIF above, users can customize the facial expression that “Roberto: The Empathetic Robot” wears, simply by changing the one they wear themselves. Unlike the avatars found in popular XR applications today, Roberto can be controlled with nothing more than a web camera, thanks to the popular facial recognition library face-api.js, originally created by Vincent Mühler in 2018 and released as open source.

So, is it possible that one day remote teams will rely upon digital avatars such as Roberto to represent team members in a fun, less cognitively exhausting manner?

For the rest of this blog, I will report on how I tried to meet the following objectives while building the “Roberto: The Empathetic Robot” project, and the results thereof.

Objectives of this Project:

  1. Demonstrate how to convey emotion via 3D models in the web browser.
  2. Test the limits of how effectively the TinyYolov2 model could identify mixtures of emotion.
  3. Produce a smooth UI/UX by balancing the trade-offs between stability and responsiveness in the Three.js scene.

Objective 1: Conveying Emotions on the Web

Animating the Robot: The Approach

To be frank, I was a novice in computer graphics when I started this project, and wasn't really sure where to begin. So I started with a simple UX goal: to make a robot — an artificial being — convey emotion the way humans naturally do.

The meanings of these 7 facial expressions are recognized as “universal” — meaning they can be identified regardless of culture, ethnicity, religion, gender, or any other demographic marker. This made it easier to design the robot’s facial expressions, since we can generalize the features it should display to be recognizable by human users. Source: David Matsumoto, Humintell, 2021.

The good news is that psychologists have already done a fair amount of research on how we as humans display our feelings. As Dr. David Matsumoto of San Francisco State University writes on his website, Humintell, there are several emotions that can be universally identified just through facial expressions [3]. This makes it easier to understand which aspects of an expression would need to be included on a 3D robot character. For example, if a robot character that only has eyes and eyebrows needs to display anger, then it would have to make the characteristic “V” shape often seen in angry humans’ eyebrows.

The next step was actually finding an expressive 3D model of a robot. Fortunately, there are lots to choose from, thanks to decades of existing R&D in computer graphics. I eventually chose this one, an open-source model originally created by Tomás Laulhé and used as an example of “skinning and morphing” on the Three.js website. Apart from being open source, I chose this model because it came in the glTF format. As reported by Lewy Blue in the DISCOVER three.js ebook, this is by far the most popular file format for loading 3D models in the browser, as the file sizes tend to be smaller and the animations load faster than with other file types like OBJ or FBX [4].

The original robot model, which is controlled via the mouse and features a clickable GUI (source: Three.js).

These short animations, aka shape keys, already came with the model I downloaded from the Three.js GitHub repository (thanks to the work of open-source contributor Don McCurdy). However, they can also be added manually to a glTF model through a third-party modeling program such as Blender.
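For the curious, below is a minimal sketch of loading such a glTF model with Three.js’s GLTFLoader and listing the shape keys it ships with. The file path and the use of the RobotExpressive.glb example file are assumptions for illustration, not the project’s exact code.

import { GLTFLoader } from 'three/examples/jsm/loaders/GLTFLoader.js';

// Minimal sketch: load a glTF robot and list its animation clips and shape keys.
// Assumes the RobotExpressive.glb file from the Three.js examples repository.
const loader = new GLTFLoader();

loader.load('models/RobotExpressive.glb', (gltf) => {
  const robot = gltf.scene;

  // Animation clips (walking, waving, etc.) bundled with the model.
  console.log('Clips:', gltf.animations.map((clip) => clip.name));

  // Find any mesh that carries morph targets (shape keys) for expressions.
  robot.traverse((node) => {
    if (node.isMesh && node.morphTargetDictionary) {
      // e.g. { Angry: 0, Surprised: 1, Sad: 2 } in the example model.
      console.log('Shape keys:', Object.keys(node.morphTargetDictionary));
    }
  });
});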

The Results

I am still early in conducting UX tests for the robot’s facial expressions, and the feedback has been mainly qualitative. When trying to gauge if users thought the robot’s expressions (in response to their own facial contortions) seemed “genuine”, I would ask questions like the following:

“What does the robot remind you of, if anything?”
“How do you feel watching the robot’s animations?”

To this, users usually responded that while they thought the robot looked cool, it did not remind them strongly of anything else. Similarly, they said that watching the robot’s animations did not elicit any strong sense of connection.

Objective 2: Displaying Mixed Feelings in Realtime

The Approach

Now that we had an expressive robot, the next step was finding a way to control its face with nothing but our own facial expressions.

This is where the face-api.js module came in handy: it offers a simple-to-understand API for utilizing the TinyYolov2 model, a powerful and lightweight neural network for conducting several computer vision tasks. In our case, we needed to perform “Face Expression Recognition”, i.e. detecting emotions through a webcam. An example of what the model outputs while doing this task is shown below, using still images from the face-api.js documentation.

This example from the face-api.js documentation shows how the TinyYolov2 model can identify the universal emotions and also provide a value between 0 and 1 representing how confident the model is in its prediction (source: face-api.js).

Using the aforementioned API, we can use the TinyYolov2 model both to identify the user’s expression captured on the webcam and to return the confidence level for that prediction (a value between 0 and 1).
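Before any of that detection can happen, the networks have to be loaded and the webcam stream started. Below is a minimal sketch of that setup rather than the project’s exact code: the '/models' path and the use of the tiny face detector (as a stand-in for the TinyYolov2 weights) are assumptions.

import * as faceapi from 'face-api.js';

// Minimal setup sketch (assumptions: weights served from '/models',
// tiny face detector used as a stand-in for the TinyYolov2 detector).
async function setupFaceTracking(videoEl) {
  // Load the face detector and the expression-recognition network.
  await faceapi.nets.tinyFaceDetector.loadFromUri('/models');
  await faceapi.nets.faceExpressionNet.loadFromUri('/models');

  // Ask the browser for a webcam stream and pipe it into a <video> element.
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  videoEl.srcObject = stream;
  await videoEl.play();
}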

This knowledge still raises the question: how will we display mixed feelings on the robot?

This just comes down to knowing how the face-api.js API works. To elaborate, we can call the following method:

.withFaceExpressions();

in order to return the confidence the model has for every possible emotion it can detect, not just the one that is most likely at any given moment. Conveniently, the model’s creators used the seven universal emotions as the potential labels for any facial expression it classifies.
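To make that concrete, here is a hedged sketch of a single detection pass, building on the setup sketch above (again, illustrative rather than the project’s actual code):

// One detection pass: find a face, then read a confidence score for
// every emotion label rather than just the top prediction.
async function readExpressions(videoEl) {
  const detection = await faceapi
    .detectSingleFace(videoEl, new faceapi.TinyFaceDetectorOptions())
    .withFaceExpressions();

  if (!detection) return null; // no face visible in this frame

  // e.g. { neutral: 0.02, happy: 0.01, sad: 0.05, angry: 0.01,
  //        fearful: 0.21, disgusted: 0.02, surprised: 0.68 }
  return detection.expressions;
}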

The probabilities the model predicts for each emotion (again, there are seven in total) also have to add up to 100%. To visualize this, the output at a given moment might look something like the pie chart below (please zoom in to see the percentages):

Source: Chrispiro, Chart Studio

Therefore, my approach to displaying mixed emotions was to use the confidence level for each predicted emotion as the degree to which the animation for that emotion would play. For example, if the model detected that the user was experiencing a combination of emotions similar to the pie chart above at a given moment, then the robot would display an expression that was mostly surprised, somewhat fearful, slightly sad, and so on.

Below, I will include the code snippet that actually implements the approach described above. To enable the robot’s expression to change in real time, the last step is just to call this function as part of a larger rendering loop (which you can find at the bottom of the GitHub link here).

Code snippet that animates the robot face, allowing mixed feelings to show based on the webcam, by using the confidence level for each predicted emotion as the degree to which the animation for that emotion plays.
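For readers who cannot view the embedded snippet, here is a rough sketch of what such a mapping could look like. It builds on the sketches above and assumes the robot’s shape keys share names with the face-api.js labels (capitalized), which is a simplification of the real code.

// Rough sketch: use each emotion's confidence as the influence (0 to 1)
// of the matching shape key on the robot's face mesh.
function animateFace(faceMesh, expressions) {
  for (const [emotion, confidence] of Object.entries(expressions)) {
    // 'surprised' -> 'Surprised', to match the assumed shape-key names.
    const label = emotion.charAt(0).toUpperCase() + emotion.slice(1);
    const index = faceMesh.morphTargetDictionary[label];
    if (index !== undefined) {
      faceMesh.morphTargetInfluences[index] = confidence;
    }
  }
}

// Called from the larger rendering loop (videoEl, faceMesh, renderer,
// scene, and camera are assumed to be set up elsewhere):
function renderLoop() {
  requestAnimationFrame(renderLoop);
  readExpressions(videoEl).then((expressions) => {
    if (expressions) animateFace(faceMesh, expressions);
  });
  renderer.render(scene, camera);
}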

The Results

Although this approach is fairly intuitive, it is not yet very robust. For example, in an unscientific survey of the app across a number of trials, I noticed that it becomes much less responsive in dim lighting conditions.

Objective 3: Producing a Smooth UI/UX

Balancing the Trade-offs Between Stability and Responsiveness

The third-largest issue we’ve tackled so far is one of performance: initially, the robot animation was so laggy that it was unrealistic to expect real people would ever find it an enjoyable experience.

Eventually, I found that using Tweening on the robot’s facial animations helped me get the app to respond fast enough to the user, as measured by frames per second (FPS), without getting too jittery. For background, “Tweening” merely refers to using JavaScript to smooth out an onscreen animation by filling in the intermediate steps. You can learn more about the formal definition of Tweening in this lecture by Terry Brooks of the University of Washington [5]. All in all, Tweening allowed me to experiment and try to find a balance between a slow but stable app and a jittery but responsive one.
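As a small illustration, here is roughly how a tween can smooth a single shape-key change using the @tweenjs/tween.js library; the easing curve and default duration here are assumptions for the sketch, not values taken from the project.

import * as TWEEN from '@tweenjs/tween.js';

// Instead of snapping a shape key to its new value, animate toward it
// over a fixed duration (in milliseconds).
function tweenExpression(faceMesh, index, targetInfluence, durationMs = 850) {
  const state = { value: faceMesh.morphTargetInfluences[index] };
  new TWEEN.Tween(state)
    .to({ value: targetInfluence }, durationMs)
    .easing(TWEEN.Easing.Quadratic.Out)
    .onUpdate(() => {
      faceMesh.morphTargetInfluences[index] = state.value;
    })
    .start();
}

// TWEEN.update() then needs to run once per frame inside the render loop.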

Taking Measurements

My goal here was to see if we could minimize the jitter in the app (defined as the number of times Roberto’s expression changes unpredictably without the user’s expression changing) while optimizing for the highest responsiveness possible (as measured by FPS).

To do this, I decided to measure the performance of the app over 11 trials. In each trial, the independent variable was the duration of the Tweening animation. I increased this value by 0.10 seconds per trial, starting at an initial value of 0.05 seconds.

For each trial, I set a timer for 5 seconds. During the trial, a global JavaScript variable called counter would start at 0 and increment each time the animation was triggered, and its value would then be logged to the console. As the timer counted down, I would look at the camera and hold my own face in the “Surprised” expression — eyebrows raised and eyes widened.

To summarize, our dependent variables are the following:

  1. “Expression Changes”: the number of times the animation was triggered over a span of 5 seconds, as I held my face steady in the “Surprised” expression. It was logged to the console, and I viewed it using Google Chrome’s Inspect tool. Again, this was measured using a global JavaScript variable called counter (a rough sketch of this harness follows after this list).
  2. Max. FPS: using the code in dat.gui.module.js (aka “dat.GUI”), I was able to measure the highest FPS achieved by the app over the course of each trial (which in total took around 10 seconds). For interested readers, “dat.GUI” is yet another open-source JavaScript module, which came included with the boilerplate code I grabbed from the Three.js repository. More info can be found here.
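Here is a rough sketch of how that counter and timer were wired up conceptually; the names are illustrative, and the actual harness in the project may differ.

// Global counter, incremented wherever the expression animation is triggered.
let counter = 0;

function onExpressionTriggered() {
  counter += 1;
}

// Run one 5-second trial, then log the number of expression changes.
function runTrial(durationMs = 5000) {
  counter = 0;
  setTimeout(() => {
    console.log(`Expression changes in ${durationMs / 1000}s:`, counter);
  }, durationMs);
}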

While I don’t have this down to a science, the results of the trials are shown below:

Source: Jupyter Notebook

As you can see, the best Tweening duration observed was about 0.85 seconds. This value corresponded to the lowest value of Expression Changes, at approximately 40, as well as the highest Max. FPS, at 11 frames per second. For a more in-depth look at the data I collected, please see my Jupyter notebook in Google Colab.

Next Steps for Improvement

Objective 1: with regard to animating the robot, there are two main suggestions for making its facial expressions appear more authentic:

  1. Adding more human-like features: such as a mouth, or the ability to move the neck up and down. This may improve the user experience by increasing the sense of realism that the robot creates.
  2. Continue testing with more kinds of facial expressions: the good news about the wide availability of modeling programs like Blender is that it is fairly straightforward to edit or add new kinds of expression animations to the robot. Therefore, we may just need to keep searching for the design that resonates most with users. Drawing from the list of 7 universal emotions expressed through the face, I made some low-fidelity designs for what different robot expressions may look like going forward:
Low-fidelity designs for different robot expressions, based on the 7 universal emotions expressed through the face. These would need to be added using a modeling program such as Blender, of course.

Objective 2: with regard to making the emotion recognition more robust, the option most likely to work is to retrain the TinyYolov2 model, or use another one, so that it works better in dim lighting or with faces seen from different camera angles.

Objective 3: while I have only run 11 trials, and it may be too early to think about optimizing performance, the app is already averaging only around 11 FPS (across those 11 trials). This is nearly 5x lower than the original Robot example on the Three.js website, which was controllable via the mouse. This suggests more work will need to be done in the future to optimize the GPU/CPU loads, because this lag will show up in other places in the app (such as when the robot is walking or waving hello while trying to detect emotions) and worsen the user experience.

Some ideas to improve here:

  1. Optimize for WebGPU: to my knowledge, the face-api.js library currently uses WebGL as the backend for the TinyYolov2 model. However, the newer WebGPU backend for TensorFlow.js has recently been released and promises to deliver better performance, so it might be worthwhile to modify the library code to support the newer backend (a brief sketch of the backend switch follows after this list).
  2. Adaptive Tweening: another idea, though one that requires more research, is to look into developing an algorithm that dynamically adjusts the load on the hardware accelerators (i.e. the GPU). This could potentially make the UX smoother across different machines, as the app could optimize how long the Tween animation takes, regardless of changing constraints in memory availability, internet connection, etc., to balance responsiveness and stability.
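For the first idea, here is a hedged sketch of what opting into the WebGPU backend looks like in TensorFlow.js, assuming the @tensorflow/tfjs-backend-webgpu package; note that face-api.js bundles its own TensorFlow.js build, so the library itself would still need to be adapted.

import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu'; // registers the experimental backend

// Switch TensorFlow.js from the default WebGL backend to WebGPU.
async function useWebGPU() {
  await tf.setBackend('webgpu');
  await tf.ready();
  console.log('Active backend:', tf.getBackend());
}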

Conclusion and Future Applications

Zoom Fatigue need not become a fact of everyday life. The use of digital avatars can potentially change the dynamics of remote work to be more fluid, less exhausting on the psyche, and just as collaborative as in-person meetings.

Below, I will share a few ideas for where this technology could lead, in the hopes that others might also be inspired to look into this area:

  1. In the Classroom: with digital avatars, we will have another way to measure how employees respond to a learning experience, e.g. a VR application that trains you to navigate a factory. This can help provide feedback to the VR creators by reporting which facial expressions the user displayed, in order to see whether the lesson was understandable.
  2. In the Home: similar to Animoji on iOS devices, digital avatars can provide another way for family members to send computer-generated messages to one another via the web browser.
  3. In the Office: lastly, digital avatars give remote team members another way to communicate in case they can’t come on camera, yet still want to express themselves visually in a meeting.

References

[1] Bailenson, Jeremy N. “Nonverbal Overload: A Theoretical Argument for the Causes of Zoom Fatigue.” Technology, Mind, and Behavior, American Psychological Association, 23 Feb. 2021.

[2] Ducheneaut, Nicolas et al. “Body and mind: a study of avatar personalization in three virtual worlds.” CHI (2009).

[3] Matsumoto, David. “The Universality of Facial Expressions of Emotion, Update 2021.” Humintell, Humintell, 3 Jan. 2021.

[4] Blue, Lewy. “Load 3D Models in GlTF Format.” DISCOVER Three.js, DISCOVER Three.js, 1 Oct. 2018.

[5] Brooks, Terry. “JavaScript Motion Tween”, INFO 343 Web Technologies, University of Washington, 2011.


Software engineer interested in A.I., and sharing stories on how tech can allow us to be more human.