MoE will Power the Next Generation of Indic LLMs

The potential of MoE for building Indic LLMs is immense. In a recent podcast with AIM, CognitiveLab founder Aditya Kolavi said the company has been using the Mixture of Experts (MoE) architecture to fuse Indian languages and build multilingual LLMs.

“We have used the MoE architecture to fuse Hindi, Tamil, and Kannada, and it worked out pretty well,” he said.
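
Fusing languages along these lines is often done by seeding each expert of a shared MoE layer with the feed-forward weights of a separately fine-tuned per-language model, then training a router over them. The PyTorch sketch below illustrates that generic recipe only; it is not CognitiveLab's actual pipeline, and all names and sizes are placeholders.

```python
# Generic sketch: seed MoE experts from per-language feed-forward blocks.
# The per-language FFNs here are random stand-ins, not real fine-tuned weights.
import torch.nn as nn

def make_ffn(d_model=512, d_ff=2048):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

# Stand-ins for FFN blocks fine-tuned separately on Hindi, Tamil and Kannada data.
language_ffns = {"hindi": make_ffn(), "tamil": make_ffn(), "kannada": make_ffn()}

# One expert per language, initialised with that language's weights,
# plus a router that would be trained afterwards to pick experts per token.
experts = nn.ModuleList(make_ffn() for _ in language_ffns)
for expert, ffn in zip(experts, language_ffns.values()):
    expert.load_state_dict(ffn.state_dict())
router = nn.Linear(512, len(experts))

print(f"{len(experts)} experts fused, router scores {router.out_features} languages")
```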

Similarly, Reliance-backed TWO has released its AI model SUTRA, which uses MoE and supports more than 50 languages, including Gujarati, Hindi, and Tamil, reportedly surpassing GPT-3.5.

Ola Krutrim is also leveraging Databricks’ Lakehouse Platform to enhance its data analytics and AI capabilities while hinting at using MoE to power its Indic LLM platform. 

Beyond Indic LLMs, Mixtral-8x7B, Grok-1, DBRX and, reportedly, GPT-4 are powered by MoE. They are excellent examples of how impactful this architecture is.

How can MoE help India make better LLMs?

One challenge Indian developers face is the lack of quality Indian-language data. Although datasets are available for the 22 official Indian languages, hundreds of other actively used local languages and dialects still need representation in Indic LLMs.

MoE models are promising for machine-translation tasks where there is little data to train on. They help prevent the model from overfitting to the limited data, a common issue with small datasets.

MoE layers allow models to handle multiple languages: individual experts can learn language-specific representations while the model shares core knowledge across languages. This sharing is useful for transferring what is learned from data-rich languages like Hindi to related languages that have far less data available.
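
A minimal PyTorch sketch of such a layer is below: a router scores each token, the top-2 experts process it, and their weighted outputs are combined. The dimensions, expert count and routing details are illustrative assumptions, not the configuration of any specific Indic LLM.

```python
# Minimal top-2 gated MoE feed-forward layer. Tokens from different languages
# can be routed to different experts while the rest of the network is shared.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small feed-forward network per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # Router: scores each token against each expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = self.router(x)                # (batch, seq, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts
        out = torch.zeros_like(x)
        # Weighted sum of the top-k experts' outputs for every token.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (indices[..., k] == e)  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 16, 512)               # e.g. a Hindi and a Kannada sentence in one batch
print(MoELayer()(tokens).shape)                # torch.Size([2, 16, 512])
```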

DBRX is a great example of how you can achieve efficiency and cost-effectiveness using MoE.

“The economics are so much better for serving. They’re more than 2x better in terms of FLOPs, the floating-point operations required to do the serving,” shared Naveen Rao, the VP of generative AI at Databricks, in an exclusive interaction with AIM.
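
A rough back-of-the-envelope calculation shows why: per-token serving compute scales with the parameters that are active for a token, not the total parameter count. The figures below use DBRX's publicly reported sizes (132B total, 36B active) and the common ~2 FLOPs-per-active-parameter approximation, so treat the numbers as illustrative rather than measured.

```python
# Per-token serving compute scales with *active* parameters, not total parameters.
# DBRX sizes are the publicly reported figures; 2*N FLOPs per token is a rough rule of thumb.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dbrx_total, dbrx_active = 132e9, 36e9
print(f"Active fraction of DBRX: {dbrx_active / dbrx_total:.0%}")
print(f"A dense model of the same total size would need "
      f"~{flops_per_token(dbrx_total) / flops_per_token(dbrx_active):.1f}x the FLOPs per token")
```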

“DBRX is actually better than Llama 3 and Gemma for Indic languages,” said Ramsri Goutham Golla, the founder of Telugu LLM Labs, in an interview with AIM, referring particularly to instruction tuning. The company was recently featured at Google I/O for leveraging Gemma to create Navarasa.

In terms of energy efficiency, MoE can help train larger models with less compute, a crucial factor for developing countries like India. For example, Google’s 1.2-trillion-parameter GLaM model required only 456 megawatt-hours to train, compared to 1,287 megawatt-hours for the 175B-parameter GPT-3, while outperforming it.
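
Using only the figures cited above, the saving is easy to quantify (a quick check, not an independent measurement):

```python
# Training-energy comparison using the figures cited above:
# GLaM (1.2T-parameter MoE) vs GPT-3 (175B-parameter dense).
glam_mwh, gpt3_mwh = 456, 1287
print(f"GLaM used {glam_mwh / gpt3_mwh:.0%} of GPT-3's training energy, "
      f"i.e. roughly {gpt3_mwh / glam_mwh:.1f}x less.")
```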

MoE also helps reduce cost while scaling up models: Google’s 1.6T-parameter Switch Transformer was trained with a computational budget similar to that of a 13B-parameter dense T5 model.

Going beyond MoE

Another good example of an MoE model is Jamba, developed by AI21 Labs, which combines the strengths of the Transformer and structured state space model (SSM) architectures.

It applies MoE at every other layer, with 16 experts, and uses the top two experts for each token. “The more the MoE layers, and the more the experts in each MoE layer, the larger is the total number of model parameters,” wrote AI21 Labs in Jamba’s research paper.
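
To make that quote concrete, the sketch below counts feed-forward parameters for a Jamba-style stack in which every other layer is an MoE layer with 16 experts and top-2 routing: total parameters grow with the number of experts, while the parameters active per token stay tied to top-k. All dimensions here are made-up placeholders, not Jamba's real configuration.

```python
# Rough FFN parameter count for a Jamba-style stack: MoE at every other layer,
# 16 experts, top-2 routing. Sizes are illustrative placeholders only.
def ffn_params(d_model, d_ff):
    return 2 * d_model * d_ff                     # up- and down-projection weights

def stack_params(num_layers=32, moe_every=2, num_experts=16, top_k=2,
                 d_model=4096, d_ff=14336):
    total = active = 0
    for i in range(num_layers):
        if i % moe_every == 1:                    # MoE at every other layer
            total += num_experts * ffn_params(d_model, d_ff)
            active += top_k * ffn_params(d_model, d_ff)
        else:                                     # dense FFN layer
            total += ffn_params(d_model, d_ff)
            active += ffn_params(d_model, d_ff)
    return total, active

total, active = stack_params()
print(f"FFN params: total {total / 1e9:.1f}B, active per token {active / 1e9:.1f}B")
```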

A similar but enhanced approach to MoE is to use Recurrent Independent Mechanisms (RIMs). RIMs consist of multiple independent recurrent modules that interact sparsely, allowing for dynamic and modular computation.

They can adapt to changes in the input distribution and handle out-of-distribution generalisation better than Transformers.
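
A heavily simplified sketch of the idea follows, under the assumption that a relevance score per module decides which top-k modules update at each step; it omits the original paper's attention mechanisms and uses made-up sizes, so it is an illustration rather than reference code.

```python
# Simplified Recurrent Independent Mechanisms: several independent GRU modules
# compete for the input each step, and only the top-k most relevant ones update.
import torch
import torch.nn as nn

class SimpleRIMs(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128, num_modules=6, top_k=3):
        super().__init__()
        self.top_k = top_k
        self.cells = nn.ModuleList(nn.GRUCell(input_dim, hidden_dim) for _ in range(num_modules))
        # One relevance score per module, computed from the current input.
        self.relevance = nn.Linear(input_dim, num_modules)

    def forward(self, inputs):                     # inputs: (seq, batch, input_dim)
        batch = inputs.shape[1]
        h = [torch.zeros(batch, cell.hidden_size) for cell in self.cells]
        for x_t in inputs:
            scores = self.relevance(x_t)           # (batch, num_modules)
            active = scores.topk(self.top_k, dim=-1).indices  # modules that get updated
            for m, cell in enumerate(self.cells):
                mask = (active == m).any(dim=-1, keepdim=True).float()  # (batch, 1)
                new_h = cell(x_t, h[m])
                h[m] = mask * new_h + (1 - mask) * h[m]  # inactive modules keep their state
        return torch.cat(h, dim=-1)                # (batch, num_modules * hidden_dim)

out = SimpleRIMs()(torch.randn(10, 4, 64))
print(out.shape)                                   # torch.Size([4, 768])
```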

Another good idea is using Structured State Space (S4) models. These use a state-space representation to capture long-range dependencies more efficiently than Transformers, and their linear scaling in sequence length and constant-size recurrent state make them more scalable for longer sequences.
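
At their core, these models run a linear recurrence x_t = A·x_{t-1} + B·u_t with output y_t = C·x_t, so the state carried between steps has a fixed size regardless of sequence length. The NumPy sketch below shows only that recurrence with a diagonal A; it is not S4's actual HiPPO-based parameterisation or its convolutional training mode.

```python
# Minimal diagonal state-space recurrence: x_t = A*x_{t-1} + B*u_t, y_t = C*x_t.
# Memory per step is constant regardless of how long the sequence is.
import numpy as np

def ssm_scan(u, A, B, C):
    """Run a diagonal linear SSM over a 1-D input sequence u."""
    x = np.zeros_like(A)                 # state, same size as the diagonal of A
    ys = []
    for u_t in u:
        x = A * x + B * u_t              # constant-memory state update
        ys.append(C @ x)                 # project the state to a scalar output
    return np.array(ys)

state_dim = 16
rng = np.random.default_rng(0)
A = np.exp(-rng.uniform(0.01, 0.5, state_dim))   # stable decay factors on the diagonal
B = rng.standard_normal(state_dim)
C = rng.standard_normal(state_dim)
u = np.sin(np.linspace(0, 8 * np.pi, 200))       # toy input sequence
print(ssm_scan(u, A, B, C).shape)                # (200,)
```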

Simply put, MoE can help India build LLMs by easing hard problems like the lack of data, energy requirements and cost. While it currently seems most helpful for merging already-available LLMs, it can also be applied when fine-tuning or building future models from scratch.
