Recent advances in Large Language Models (LLMs) have transformed access to information. Building an LLM is a significant and ambitious project, and the lack of one for low-resource languages such as Sinhala and Tamil presents significant challenges. There is therefore a pressing need to develop LLMs for low-resource languages. This can be achieved through the following two approaches.

Building an LLM from scratch, which involves acquiring a substantial amount of text data as well as subtasks such as tokenization, embedding, training, and fine-tuning.
Developing a language interface to an open-source LLM using Retrieval-Augmented Generation (RAG).

In parallel with the ICTer 2024 conference, and as a follow-up to the Shared Task on Sinhala LLM at iCIIT 2024, UCSC and IIT are jointly organizing a workshop titled “Shared Task towards building a Sinhala/Tamil Large Language Model”, which includes two main tasks.

Objectives

To advance the development of a Sinhala/Tamil LLM.
To foster collaboration and innovation in the NLP community.
To create robust benchmarks for Sinhala/Tamil NLP.

Click here to register as a participant

Shared Task 01

Developing a Tokenizer and a Word Embedding Model for Sinhala/Tamil Language

This task involves developing an efficient tokenizer and a robust word embedding model specifically tailored for the Sinhala and Tamil languages.

Subtask 01: Developing a Tokenizer for Sinhala/Tamil

Tokenization is a fundamental step in natural language processing (NLP) in which text is segmented into smaller units, such as words or subwords, that language models can then process. Generic tokenizers are often designed around the linguistic rules and structures of widely spoken languages like English. Because Sinhala and Tamil have distinct scripts and linguistic characteristics, generic tokenizers are inefficient for LLM development in these languages. A tokenizer designed for Sinhala and Tamil is therefore needed. The goal of this subtask is to build an efficient tokenizer for Sinhala/Tamil.
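To make the inefficiency concrete, the sketch below counts the units in the words for "Sinhala" and "Tamil" in three ways: Unicode code points, UTF-8 bytes (the starting units of a byte-level tokenizer before any merges), and script-aware clusters that keep vowel signs attached to their base consonant. The function name `split_clusters` is hypothetical, and its rule (attach every combining mark to the preceding character) is a simplifying assumption, not a full implementation of Unicode text segmentation.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplifying assumption: any code point whose Unicode general category
    starts with "M" (Mn, Mc, Me) belongs to the preceding character.
    Conjuncts formed with ZERO WIDTH JOINER are not handled.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # vowel sign or other mark: extend the cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters


sinhala = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"  # සිංහල ("Sinhala")
tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"    # தமிழ் ("Tamil")

for word in (sinhala, tamil, "Tamil"):
    # word, code points, UTF-8 bytes, clusters
    print(word, len(word), len(word.encode("utf-8")), len(split_clusters(word)))
```

Each Sinhala or Tamil code point here occupies three bytes in UTF-8, so a byte-level tokenizer starts from 15 units for a word that a reader perceives as three written syllables, while the Latin-script word "Tamil" starts from 5.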

Objective

To develop and evaluate tokenization algorithms specifically tailored for Sinhala/Tamil, considering their unique scripts and linguistic properties.
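One commonly used intrinsic measure for comparing tokenizers is fertility, the average number of tokens produced per whitespace-separated word; lower values mean shorter sequences for the same text. The sketch below is only an illustration and is not the shared task's evaluation method. The names `fertility`, `whitespace_tokenizer`, and `char_tokenizer` are hypothetical, and the two toy tokenizers are stand-ins for real candidates.

```python
def fertility(tokenize, sentences):
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    return n_tokens / n_words if n_words else 0.0


def whitespace_tokenizer(sentence):
    # Baseline: one token per whitespace-separated word.
    return sentence.split()


def char_tokenizer(sentence):
    # Worst-case baseline: one token per non-space code point.
    return [ch for ch in sentence if not ch.isspace()]


# "தமிழ் மொழி" ("Tamil language"): two words, nine code points.
corpus = ["\u0ba4\u0bae\u0bbf\u0bb4\u0bcd \u0bae\u0bca\u0bb4\u0bbf"]

print(fertility(whitespace_tokenizer, corpus))
print(fertility(char_tokenizer, corpus))
```

A subword tokenizer trained on Sinhala/Tamil text would be expected to land between these two baselines; the same function can score any callable that maps a sentence to a list of tokens.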

Subtask 02: Developing a Sinhala/Tamil Word Embedding Model (Word2Vec)

This subtask involves developing a word embedding model that transforms the tokens produced above into dense vector representations capturing their semantic meaning, which is crucial for NLP tasks such as translation, summarization, and sentiment analysis. The model creates a numerical representation of words in a continuous vector space, where words with similar meanings are mapped to similar vectors.
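In its skip-gram form, Word2Vec learns these vectors by training a model to predict the tokens that appear near each token. The sketch below shows only the first step, turning a token sequence into (center, context) training pairs. The function name `skipgram_pairs`, the `window` parameter, and the placeholder tokens are assumptions for illustration; an actual submission would feed real Sinhala/Tamil tokens to a training library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Placeholder tokens standing in for the output of a Sinhala/Tamil tokenizer.
tokens = ["w1", "w2", "w3", "w4"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Tokens that share many contexts across a large corpus end up with similar vectors, which is how the "similar meanings, similar vectors" property arises.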

Objective

To create word and sentence embedding models for Sinhala/Tamil that capture semantic and syntactic nuances.
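Similarity between embeddings is usually measured with cosine similarity, and a simple baseline sentence embedding is the average of a sentence's word vectors. The sketch below illustrates both. The three-dimensional vectors in `toy_embeddings` are invented for the example and do not come from any trained model; `cosine` and `sentence_vector` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Invented toy vectors, for illustration only.
toy_embeddings = {
    "river": [0.9, 0.1, 0.0],
    "stream": [0.8, 0.2, 0.1],
    "bank": [0.1, 0.9, 0.2],
}

print(cosine(toy_embeddings["river"], toy_embeddings["stream"]))
print(cosine(toy_embeddings["river"], toy_embeddings["bank"]))
print(sentence_vector(["river", "bank", "unknown"], toy_embeddings))
```

Mean pooling ignores word order, so it is best treated as a baseline against which stronger sentence embedding models can be compared.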

Shared Task 02

Developing a Chatbot System with Q&A Support Using the RAG Model for Sinhala/Tamil

In the second task, participants use available open-source Large Language Models (LLMs) to develop a chatbot system that provides Q&A support using the Retrieval-Augmented Generation (RAG) model. Participants will be provided with an initial dataset and a set of questions that the chatbot should be able to answer. The chatbot systems will be evaluated separately to ensure a fair and comprehensive assessment.

Task Objectives:

Foundation: Participating teams should use existing open-source LLMs as the foundation for their chatbot systems.
RAG Model: The RAG model, which combines retrieval-based and generation-based approaches, should be used to enhance the chatbot’s ability to provide accurate and contextually relevant answers.
Vector Databases: Open-source vector databases will be utilized to efficiently store and retrieve vector representations of text data.
Accuracy and Efficiency: The chatbot should be able to answer the given questions accurately and efficiently, demonstrating the effectiveness of the RAG model in Q&A tasks.
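The retrieval half of these objectives can be sketched with a toy in-memory store. This is a minimal illustration under stated assumptions: `ToyVectorStore` is a hypothetical class, not the API of any real vector database, and `embed` uses bag-of-words counts as a stand-in for a trained embedding model. The stored sentences are paraphrased from this page purely as sample data.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (not a trained model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=2):
        # Rank every stored text by similarity to the query and keep the top k.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("the workshop is held in parallel to the conference")
store.add("each team can include up to five members")
store.add("final submissions are due in november")

print(store.search("how many members can a team include", k=1))
```

A real system would replace `embed` with a Sinhala/Tamil embedding model and the linear scan with an open-source vector database that supports approximate nearest-neighbour search.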

Subtask 01: Developing a Retrieval-Augmented Generation (RAG) model for Sinhala/Tamil

Implement and fine-tune RAG models for Sinhala/Tamil, combining retrieval mechanisms with generation capabilities.
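The way retrieval and generation are combined can be sketched as a prompt-assembly step: retrieved passages are placed in the prompt ahead of the question, and the LLM is asked to answer from them. Everything named below (`build_prompt`, `answer`, `fake_retrieve`, `echo_generate`, and the prompt wording) is a hypothetical stand-in; the sketch does not cover fine-tuning, and a real system would plug in an actual retriever and an open-source LLM.

```python
def build_prompt(question, passages):
    """Place numbered retrieved passages before the question."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, retrieve, generate, k=3):
    """Retrieve up to k passages, build the prompt, and call the generator."""
    passages = retrieve(question, k)
    return generate(build_prompt(question, passages))


# Hypothetical stand-ins so the sketch runs without a retriever or an LLM.
def fake_retrieve(question, k):
    return ["Final Submission: November 1st, 2024"][:k]


def echo_generate(prompt):
    return prompt  # returns the prompt itself so it can be inspected


print(answer("When is the final submission?", fake_retrieve, echo_generate))
```

Passing `retrieve` and `generate` as arguments keeps the two halves independent, so either can be swapped or evaluated on its own.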

Subtask 02: Sinhala/Tamil Interface development

Develop a user-friendly interface for interacting with the LLM in Sinhala/Tamil, making it accessible for end-users and developers.

NOTE

An initial dataset and a set of questions will be provided to train and test the chatbot systems.
The evaluation metrics will be disclosed in due course.

Timeline (For both Shared Tasks)

Call for registration: ~~September 06th, 2024~~
Initial briefing: ~~September 10th, 2024~~
Interim progress meeting (Online): ~~September 30th, 2024~~
Final Submission: November 1st, 2024
Shared Task competition: November 8th, 2024

Who Can Participate?

Teams from universities and industry are invited to compete!

Each team can include up to 5 members and can participate for free.
Other interested parties can attend the Shared Task competition on November 8th by registering for a fee of 5,000 rupees.

Don’t miss this exciting opportunity to learn and collaborate at our conference workshop. Register now!

UCSC – IIT Workshop on Shared Task towards building a Sinhala/Tamil Large Language Model

Organized By

Language Technology Research Lab, UCSC

Informatics Institute of Technology
