Proposal for GPU resources to train deep learning models for animating 3D avatars

Summary

  • GoodGang Labs is a tech startup building an avatar communication platform for Metaverse and Web3.
  • The GPU resources, if granted, will be used to train generative deep learning models that dynamically animate 3D avatars based on camera and speech audio input.
  • GPU provision by AI Network will greatly expedite the development of cutting-edge avatar technologies.

Background

GoodGang Labs is a team of talented developers, engineers, and artists dedicated to pushing the boundaries of 3D avatar technologies. Our mission is to enable natural and free avatar communication in the Metaverse. We are focusing on deep learning technologies for generating natural gestures and facial expressions for 3D avatars that will help our users freely express themselves online.

As our models become increasingly complex and require large amounts of video, audio, and text data, the computational resources required to train and deploy the models have become a significant bottleneck. Access to AI Network’s powerful GPUs is essential for us to continue making progress and maintain a competitive edge in the rapidly evolving AI landscape. Using AI Network’s GPU resources, we wish to develop generative AIs to create natural and dynamic avatar expressions.

Relevant Links:

Scope of Work

What is the scope of this proposal?

  • Goals
    • To create a generative deep learning model that naturally and dynamically animates 3D avatars based on the user’s camera or speech audio input.
  • Features
    • Develop key generative deep learning technologies, described below:
      • Text2Avatar: Take the user’s text input and generate the avatar’s speech audio, facial expressions, and body gesture animations.
      • Speech2Avatar: Take the user’s speech audio input and generate the 3D avatar’s facial expressions and body gesture animations.
    • Iterate the model architecture to reduce the number of model parameters using knowledge distillation methods, so that the models can be used on a wider range of devices.
  • Tasks
    • Experiment with prototypes and seek open-source solutions to minimize development time.
    • Gather relevant data that is licensed for commercial use, and preprocess it.
    • Train the deep learning models.
    • Apply knowledge distillation to create smaller versions of the models while retaining high performance (a minimal sketch of this step follows this list).
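
For concreteness, below is a minimal sketch of the knowledge distillation step mentioned above, assuming PyTorch and hypothetical teacher/student models that map audio features to per-frame blendshape weights; the actual architectures, tensor shapes, and loss weighting will be determined during development.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, audio_features, target_weights,
                      optimizer, alpha=0.5):
    """One response-based distillation step: the student matches both the
    ground-truth blendshape weights and the larger teacher's outputs.
    (Models, tensor shapes, and the weighting `alpha` are placeholders.)"""
    teacher.eval()
    with torch.no_grad():
        teacher_out = teacher(audio_features)   # e.g., (batch, frames, 52)
    student_out = student(audio_features)

    loss_gt = F.mse_loss(student_out, target_weights)   # supervised term
    loss_kd = F.mse_loss(student_out, teacher_out)      # distillation term
    loss = alpha * loss_gt + (1.0 - alpha) * loss_kd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```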

Timeline

Week 1-2: Prototyping

  • Define project requirements and objectives.
  • Research existing open-source models and methods for generating time-series outputs using text and audio inputs.
  • Develop initial prototypes of the AI system, focusing on individual components (e.g., facial expression generation, body gesture generation, speech analysis).

Week 3-5: Data Collection and Preprocessing

  • Identify and curate relevant datasets for training and validating the generative AI models.
  • Collect additional data, if necessary, through partnerships, crowd-sourced initiatives, or custom recording sessions.
  • Ensure data diversity and representation to avoid biases and improve the robustness of the AI system.
  • Clean and preprocess the collected data, including text and audio standardization, audio segmentation, and feature extraction.
  • Augment the data using techniques such as noise injection, pitch shifting, and data mixing to increase the training dataset’s size and variety (a minimal augmentation sketch follows this list).
  • Split the data into training, validation, and testing sets to enable model evaluation and performance monitoring.
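
As an illustration of the augmentation step above, here is a minimal sketch assuming mono waveform arrays and the librosa library; the noise level, pitch-shift range, and mixing gain are placeholder values to be tuned on the actual dataset.

```python
from typing import Optional

import librosa
import numpy as np

def augment_clip(y: np.ndarray, sr: int,
                 other: Optional[np.ndarray] = None,
                 rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Return a randomly augmented copy of a mono audio clip using
    noise injection, pitch shifting, and optional data mixing."""
    rng = rng or np.random.default_rng()
    y = y.astype(np.float32)

    # Noise injection: add low-level Gaussian noise.
    y = y + rng.normal(0.0, 0.005, size=y.shape).astype(np.float32)

    # Pitch shifting: shift by up to +/- 2 semitones.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=float(rng.uniform(-2.0, 2.0)))

    # Data mixing: blend in a second clip at low gain, if provided.
    if other is not None:
        n = min(len(y), len(other))
        y = y[:n] + 0.2 * other[:n].astype(np.float32)

    return y
```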

Week 6-15: Model Training

  • Develop and train generative deep learning models using the preprocessed data and the secured GPU resources (an illustrative baseline model and training loop follow this list).
  • Experiment with various model architectures, hyperparameters, and training strategies to optimize performance and minimize overfitting.
  • Monitor training progress and make adjustments as needed to ensure convergence and stability of the generative models.
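
As an illustration of this phase, a minimal baseline and training loop are sketched below, assuming PyTorch, mel-spectrogram inputs, and per-frame blendshape targets; the actual generative architectures and hyperparameters will be chosen through the experiments described above.

```python
import torch
import torch.nn as nn

class SpeechToBlendshapes(nn.Module):
    """Illustrative baseline: mel-spectrogram frames -> per-frame blendshape weights."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, n_blendshapes: int = 52):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_blendshapes),
            nn.Sigmoid(),  # blendshape weights constrained to [0, 1]
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> (batch, frames, n_blendshapes)
        out, _ = self.encoder(mel)
        return self.head(out)


def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One pass over the training data with a simple per-frame MSE objective."""
    model.train()
    criterion = nn.MSELoss()
    for mel, target in loader:  # target: ground-truth blendshape weights per frame
        mel, target = mel.to(device), target.to(device)
        loss = criterion(model(mel), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```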

Week 16-20: Model Validation & Iteration

  • Evaluate the performance of the trained generative models on the validation and testing datasets using quantitative metrics (e.g., Fréchet Inception Distance, Perceptual Path Length) and qualitative assessments (e.g., alpha user feedback, internal reviews); a sketch of the Fréchet-distance computation follows this list.
  • Identify areas for improvement and iterate on the models, addressing any issues with realism, diversity, or naturalness of the generated avatar expressions.
  • Conduct pilot tests and demonstrations with users to gather feedback on the AI system’s usability, effectiveness, and impact on communication and social experiences in virtual environments.
  • Upon completion of these milestones, we will continue refining and expanding our AI system based on the feedback and insights gained during the validation phase. Additionally, we will explore potential partnerships, licensing opportunities, and integration of our technology into existing platforms and applications within the DAO ecosystem and beyond.
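
For the quantitative metrics mentioned above, a sketch of the underlying Fréchet-distance computation is shown below; in practice the inputs would be embeddings of real and generated animation samples from a pretrained feature extractor, which is assumed here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and generated
    feature embeddings (rows are samples, columns are feature dimensions)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```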

Specification

Our project is a cutting-edge effort aimed at transforming the way we communicate and interact in the digital space. Our technology leverages the power of artificial intelligence and machine learning to create realistic, expressive, and interactive avatars based on user input. Our platform offers two core components, “Speech2Avatar” and “Text2Avatar”, which work together to provide seamless and immersive experiences for users across various industries, such as gaming, social media, virtual reality, and education.

1) Speech2Avatar

Our Speech2Avatar technology is a deep learning model that analyzes the user’s voice (a voice file) without any additional hardware or sensors and infers 52 facial expression Blendshape weights and 15 lip-sync Blendshape weights in real time. This system replaces the need for animators to manually model 3D avatar lip-sync animations, enabling the creation of 3D avatars that animate according to the user’s words without additional manual work (a minimal inference sketch follows the feature list below).

Key Features of Speech2Avatar:

  • Real-time voice input analysis and generation of facial expressions and lip sync Blendshape weights
  • Elimination of manual modeling for 3D avatar lip-sync animations
  • JALI Support: Model that generates Blendshape weights in real time, compatible with the 18 3D avatar lip-sync animations used in the film and game industries
  • Viseme Support: Model that generates Blendshape weights in real time, compatible with the 15 3D avatar lip-sync animations supported by Meta Oculus
  • Easy integration with existing platforms and applications
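
To illustrate how Speech2Avatar is intended to be used, here is a minimal, hypothetical inference sketch: it loads a voice file, extracts mel-spectrogram features, and maps the model’s per-frame outputs to named blendshape weights. The model interface, feature choice, and blendshape names are assumptions for illustration only.

```python
import librosa
import numpy as np
import torch

# Hypothetical output naming; the real blendshape sets (e.g., ARKit-style facial
# targets, JALI or Oculus visemes) would replace these placeholder labels.
FACIAL_BLENDSHAPES = [f"face_{i}" for i in range(52)]
VISEME_BLENDSHAPES = [f"viseme_{i}" for i in range(15)]

def speech_to_blendshapes(model: torch.nn.Module, wav_path: str, sr: int = 16000):
    """Run a trained Speech2Avatar model on a voice file and return a list of
    per-frame dictionaries mapping blendshape names to weights."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80).T  # (frames, 80)

    with torch.no_grad():
        # Assumed model output: (frames, 52 + 15) weights per input frame.
        weights = model(torch.from_numpy(mel).float().unsqueeze(0))[0]

    frames = []
    for frame in weights.numpy():
        frames.append({
            **dict(zip(FACIAL_BLENDSHAPES, frame[:52].tolist())),
            **dict(zip(VISEME_BLENDSHAPES, frame[52:].tolist())),
        })
    return frames
```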

2) Text2Avatar

Our Text2Avatar technology enables users to create lifelike avatar animations from written text, providing a powerful tool for enhancing digital communication and storytelling. Leveraging state-of-the-art natural language processing and deep learning techniques, Text2Avatar lets the user specify the gesture style and emotion behind the text input and generates corresponding avatar animations with appropriate facial expressions, gestures, and body movements (a minimal interface sketch follows the feature list below). This technology can be utilized in various applications, such as virtual customer service, chatbots, and interactive storytelling.

Key Features of Text2Avatar:

  • Style-sensitive avatar generation from written text input
  • Appropriate avatar expressions and movements according to style and emotion inputs
  • Scalability for large-scale implementations and applications
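
The sketch below illustrates one possible request/response interface for Text2Avatar; all field names, types, and the generate() entry point are hypothetical and are only meant to show how text, gesture style, and emotion inputs could map to audio and animation outputs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Text2AvatarRequest:
    text: str            # the sentence the avatar should speak
    gesture_style: str   # e.g., "casual", "formal", "energetic" (placeholder labels)
    emotion: str         # e.g., "happy", "neutral", "sad" (placeholder labels)

@dataclass
class Text2AvatarResult:
    audio_wav: bytes                    # synthesized speech audio
    facial_frames: List[List[float]]    # per-frame facial Blendshape weights
    gesture_frames: List[List[float]]   # per-frame body pose parameters

def generate(request: Text2AvatarRequest) -> Text2AvatarResult:
    """Intended pipeline (to be implemented during model training):
    1) synthesize speech audio from `request.text` and `request.emotion`,
    2) drive facial animation from the synthesized audio (Speech2Avatar),
    3) generate body gestures conditioned on text, style, and emotion."""
    raise NotImplementedError("Filled in once the trained models are available.")
```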

By bridging the gap between human communication and digital avatars using these technologies, our platform offers an immersive and engaging experience for users across a wide range of industries. With our technology, we envision a future where digital interactions are more natural, expressive, and impactful than ever before.

Request

  • Description: 1 GPU server for training and inference
  • Resource (Support) type: A100 GPU with 1TB of storage.
  • Amount: 1
  • Date: ASAP
  • Impact: This project has the potential to positively impact the AI Network community:
    • Whitelist Invitations for BeerGang NFT Sales: AI Network members will receive exclusive whitelist access to future NFT sales, gaining the opportunity to secure valuable BeerGang NFTs that may appreciate in value.
    • Meetup Invitations: GoodGang Labs will host regular meetups for AI Network members to connect, discuss projects, and explore collaborations, fostering a strong bond and vibrant environment. By attending these meetups, AI Network members will have the opportunity to engage with our team, gain insights into our development process, and contribute to the shaping of our future projects.
    • Kiki Town Benefits: AI Network members will receive exclusive benefits while using Kiki Town, our 3D avatar communication service powered by generative AI, as a token of appreciation for their support.

Targets

  • High-Quality 3D Avatar Animation Generation: Models should generate visually accurate, natural and responsive animations.
  • Accurate Interpretation of Input Text and Speech Audio: AI models must generate contextually appropriate animations that align with the user’s intent and emotional cues.
  • User Satisfaction and Engagement: The project’s success depends on user satisfaction with the quality, responsiveness, and accuracy of AI-generated animations, as well as user engagement metrics.
  • Scalability and Adaptability: The AI models should be scalable to accommodate a growing user base and adaptable to incorporate new features and improvements based on user feedback and advancements in AI research.

Participants

Jaecheol Kim (CTO) is an expert with abundant research and development experience and plays a central role in the project. His extensive experience in technical problem-solving, product development, and technical management nearly guarantees this project’s success. Prior to co-founding GoodGang Labs, he was the CTO of SeersLab, where he developed Lollicam, a camera filter app that achieved great commercial success.

Junhwan Jang is the core deep learning engineer at GoodGang Labs. He has extensive knowledge and skills in computer vision and deep learning and was the primary developer of the deep learning model in the demo link above. He also has the valuable communication skills that help an organization bridge the gap between deep learning’s technicalities and 3D artists’ work. His prior work includes fine-tuning computer vision deep learning models for camera applications at SeersLab and developing core computer vision models for medical problem-solving at DoAI, a medical AI startup.

We also have artists, 3D modelers, and animators who help us naturally project our deep learning models’ outputs onto 3D avatars. Seoyoung Kim (Creative Director) has worked at Naver and Line, and most recently led the Facebook Messenger AR team in developing a wide range of AR filters and effects. Jeonghun Kim and Wooyong are 3D modelers who turn conceptual designs into digital 3D models. Hee Won Ahn animates 3D characters, objects, and environments to bring them to life.

Finally, Dookyung Ahn (CEO) has extensive and successful experience overseeing Instagram Stories. His prior experience includes managing development partnerships in Korea and Japan and leading product management for several products at Line. He will explore the project’s potential partnerships and licensing opportunities and set the direction for integrating the technology into future services and platforms.

Voting

The voting period will be between 2 and 7 days. Please state the voting options clearly, indicating whether they are for or against the proposal. If available, include all links to previously held surveys and/or votes (e.g., on Discord).

  • Examples: Approve/Disapprove or Yes/No
  • Something like “support the proposal but needs revision” may be an option, but will count towards the disapproval of funding the project.

Thanks for submitting the proposal. We will have two voting rounds, and the first round has just started. It will close within one week. I will get back to you once the voting round is completed.

AIN DAO members have accepted this proposal. We will contact you directly to get the information needed for resource distribution.
