[Proposal] Polyglot project demo GPU server

Title: Request for a demo server for Polyglot (Large Language Models of Well-balanced Competence in Multi-languages) models
Author: Kichang Yang, kevin-ai#4032
Date posted: 2023/02/23

Summary

Polyglot is an EleutherAI project that builds non-English large language models (LLMs) for everyone.

Background

Why another multilingual model?

Various multilingual models such as mBERT, BLOOM, and XGLM have been released, so someone might ask, “Why do we need to make multilingual models again?” Before answering, we would like to ask in return, “Why do people around the world build monolingual models in their own languages even though many multilingual models already exist?” We believe one of the most significant reasons is dissatisfaction with the non-English performance of current multilingual models. We therefore want to build multilingual models with higher non-English performance. This is why we need to make multilingual models again, and why we named them ‘Polyglot’.

Scope of Work

What is the scope of this proposal?

  • Goals: Deploy a demo server for the models we trained.
  • Features: A demo web page that runs inference on the model.
  • Tasks

Timeline

What is the timeline of the project?

The sections below are required for Snapshots*

Specification

What do we focus on to make better multilingual models?

We will focus on the following two factors to make multilingual models which show better non-English performance.

Amount of data in each language and its balance

Most multilingual models are trained on data with a fairly uneven distribution of languages. For example, BLOOM’s training data is still English-centric: English accounts for 30% of the data, whereas some languages such as Vietnamese and Indonesian make up only 1-3%. XGLM took a step toward mitigating this problem through data up-sampling, but we believe up-sampling has its limits. To resolve this problem, we will collect large multilingual datasets with hundreds of billions of tokens per language and balance them so that the model learns the various languages evenly.
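To make the balance issue concrete, here is a minimal sketch of exponent-smoothed language sampling, a common form of up-sampling in which each language is drawn with probability proportional to its token share raised to a power alpha. The function name, the token counts, and the alpha values are all hypothetical and chosen only for illustration; this is not the Polyglot data pipeline, which the proposal says will rely on collecting more data per language.

```python
def sampling_probs(token_counts, alpha):
    """Return per-language sampling probabilities p_i proportional to q_i ** alpha,
    where q_i is the language's share of all tokens."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts (in billions), chosen only for illustration.
counts = {"en": 300, "ko": 30, "vi": 20, "id": 10}

# alpha = 1.0 keeps the raw shares, smaller alpha boosts small languages,
# and alpha = 0.0 samples every language equally often.
for alpha in (1.0, 0.3, 0.0):
    probs = sampling_probs(counts, alpha)
    print(alpha, {lang: round(p, 3) for lang, p in probs.items()})
```

Note that re-weighting of this kind only changes how often existing text is seen; it does not add new text for the smaller languages.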

Language selection

Most multilingual models have learned dozens of languages, including low-resource ones. For example, XGLM learned 30 languages, and BLOOM learned 42. We, however, plan to let go of the desire to be good at too many languages at once. The number of steps over which a model can keep learning is roughly fixed, and the model converges beyond that point. So if one model takes on too many languages, the training efficiency for each language decreases. We therefore want to train the model on languages from similar language families, which enables synergy between them. In addition, we have excluded languages with few speakers because it is difficult to collect a large amount of data for them. Our project will therefore focus only on high- and middle-resource languages.

Request

Write down what kind of support you need to do this and what you would like to give back.

  • Description: One GPU server for inference.
  • Resource (Support) type: 1 x V100 GPU and more than 30GB of storage.
  • Amount: 1
  • Date: ALAP
  • Impact: The Polyglot project has the potential to greatly impact and reward our community in a number of ways:
  1. Access to Knowledge: By creating non-English large language models (LLMs) that are accessible to everyone, the Polyglot project can democratize access to knowledge and information. This can empower individuals who speak non-English languages to engage with and contribute to a broader range of topics, including science, technology, and culture.
  2. Cultural Understanding: The Polyglot project can also foster greater understanding and appreciation for diverse cultures and perspectives. By making it easier for people to communicate and share ideas across language barriers, the project can help break down cultural divides and promote mutual understanding.
  3. Business Opportunities: As the world becomes more interconnected, the ability to communicate across language barriers is becoming increasingly important. The Polyglot project can help create new business opportunities by making it easier for companies to reach customers in non-English-speaking markets and for individuals to work and collaborate across borders.

Targets

Building and releasing a Korean LLM at the 40B scale.

Participants

Introduce the team and the members’ interests, background, time commitments, etc.

Voting

The voting period will be between 2 and 7 days. Please provide voting options that clearly indicate whether they are for or against your proposal. If available, include links to all previously held surveys and/or votes (e.g. on Discord).

  • Examples: Approve/Disapprove or Yes/No
  • Something like “support the proposal but needs revision” may be an option, but will count towards the disapproval of funding the project.

*Snapshots: We use snapshot.io as the official voting platform. Once the proposal gains enough approvals, it will be promoted from Discord to Forum and finally to Snapshot. For more information on the voting process, refer to this document.

This proposal is pending because no member is currently able to proceed with it.