UCSC – IIT Workshop on Shared Task towards building a Sinhala/Tamil Large Language Model

Recent advances in Large Language Models (LLMs) have transformed access to information. Building an LLM is a significant and ambitious project, and the lack of one for low-resource languages such as Sinhala and Tamil presents significant challenges. Hence, there is a pressing need to develop LLMs for low-resource languages. This can be achieved through the following two approaches.

  1. Building an LLM from scratch, which involves acquiring a substantial amount of text data in addition to main subtasks such as tokenization, embedding, training, and fine-tuning.
  2. Developing a language interface to an open-source LLM using Retrieval-Augmented Generation (RAG).

In parallel with the ICTer 2024 conference, and as a follow-up to the Shared Task on Sinhala LLM at iCIIT 2024, UCSC and IIT are jointly organizing a workshop titled "Shared Task towards building a Sinhala/Tamil Large Language Model", which includes two main tasks.

Objectives

  • To advance the development of a Sinhala/Tamil LLM.
  • To foster collaboration and innovation in the NLP community.
  • To create robust benchmarks for Sinhala/Tamil NLP.

Shared Task 01

Developing a Tokenizer and a Word Embedding Model for Sinhala/Tamil Language

This task involves developing an efficient tokenizer and a robust word embedding model specifically tailored for the Sinhala and Tamil languages. 

Subtask 01: Developing a Tokenizer for Sinhala/Tamil

Tokenization is a fundamental step in natural language processing (NLP) in which text is segmented into smaller units, such as words or subwords, that can then be processed by language models. Generic tokenizers are often designed around the linguistic rules and structures of widely spoken languages like English. Because Sinhala and Tamil have unique scripts and linguistic characteristics, generic tokenizers are inefficient for LLM development in these languages. It is therefore important to have a tokenizer for each language that is suited to LLM development. The goal of this subtask is to build an efficient tokenizer for Sinhala/Tamil.

Objective

To develop and evaluate tokenization algorithms specifically tailored for Sinhala/Tamil, considering their unique scripts and linguistic properties.
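
To illustrate why script-aware tokenization matters, the following is a minimal sketch, not a reference solution for the task. It assumes that vowel signs and the virama should stay attached to their base consonant, approximates grapheme clusters using Unicode general categories, and then learns a few byte-pair-encoding (BPE) style merges over those clusters. The function names (`split_clusters`, `learn_merges`, `tokenize`) and the tiny word list are hypothetical and chosen for illustration only.

```python
import unicodedata
from collections import Counter

ZWJ = "\u200d"  # zero-width joiner, used in some Sinhala conjuncts


def split_clusters(word):
    """Split a word into approximate grapheme clusters.

    Combining marks (Unicode categories Mn, Mc, Me) and ZWJ are attached to
    the preceding character; a character right after ZWJ is attached too.
    This is an approximation, not the full Unicode segmentation algorithm.
    """
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M") or ch == ZWJ
        follows_zwj = bool(clusters) and clusters[-1].endswith(ZWJ)
        if clusters and (is_mark or follows_zwj):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(words, num_merges):
    """Learn BPE-style merges, starting from clusters instead of bytes."""
    corpus = Counter(tuple(split_clusters(w)) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[tuple(apply_merge(symbols, best))] += freq
        corpus = merged_corpus
    return merges


def tokenize(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = split_clusters(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy usage with a hypothetical three-word Tamil corpus.
words = ["தமிழ்", "தமிழ்", "தமிழர்"]
merges = learn_merges(words, num_merges=2)
print(split_clusters("தமிழ்"))
print(tokenize("தமிழ்", merges))
```

Starting from clusters rather than raw bytes or code points means a merge can never separate a vowel sign from its consonant, which is one possible way to respect the script's structure. A competitive submission would likely need proper Unicode text segmentation and a much larger corpus.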

Subtask 02: Developing a Sinhala/Tamil Word Embedding Model (Word2Vec)

This task involves developing a word embedding model that transforms the tokens produced in Subtask 01 into dense vector representations that capture their semantic meaning, which is crucial for NLP tasks such as translation, summarization, and sentiment analysis. The model creates a numerical representation of words in a continuous vector space, where words with similar meanings are mapped to similar vectors.

Objective

To create word and sentence embedding models for Sinhala/Tamil that capture semantic and syntactic nuances.
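
The idea that words with similar meanings end up with similar vectors can be sketched without any training library. The example below is not Word2Vec itself: it generates the (target, context) pairs that a skip-gram Word2Vec model would train on, then uses raw co-occurrence counts as a stand-in for learned dense vectors and compares them with cosine similarity. The function names, the window size, and the three toy Tamil sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within `window` positions of each token."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cooccurrence_vectors(sentences, window=2):
    """Count-based stand-in for embeddings: one sparse context-count vector per word."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for target, context in skipgram_pairs(tokens, window):
            vectors[target][context] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, already tokenized into words (hypothetical example sentences).
sentences = [
    ["நான்", "பால்", "குடிக்கிறேன்"],      # "I drink milk"
    ["நான்", "தேநீர்", "குடிக்கிறேன்"],    # "I drink tea"
    ["அவள்", "புத்தகம்", "படிக்கிறாள்"],  # "She reads a book"
]
vectors = cooccurrence_vectors(sentences, window=1)

# "milk" and "tea" share contexts; "milk" and "book" do not.
print(cosine(vectors["பால்"], vectors["தேநீர்"]))
print(cosine(vectors["பால்"], vectors["புத்தகம்"]))
```

A real submission would replace the count vectors with trained dense embeddings, but the same cosine comparison can still serve as a quick sanity check on the result.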

Shared Task 02

Developing a Chatbot System with Q&A Support Using the RAG Model for Sinhala/Tamil

The second task involves participants utilizing available open-source Large Language Models (LLMs) to develop a chatbot system capable of providing Q&A support using the Retrieval-Augmented Generation (RAG) model. Participants will be provided with an initial dataset and a set of questions that the chatbot should be able to answer. The evaluation of the chatbot systems will be conducted separately to ensure a fair and comprehensive assessment.

Task Objectives:

  • Foundation: Participating teams should use existing open-source LLMs as the foundation for their chatbot systems.
  • RAG Model: The RAG model, which combines retrieval-based and generation-based approaches, should be used to enhance the chatbot’s ability to provide accurate and contextually relevant answers.
  • Vector Databases: Open-source vector databases will be utilized to efficiently store and retrieve vector representations of text data.
  • Accuracy and Efficiency: The chatbot should be able to answer the given questions accurately and efficiently, demonstrating the effectiveness of the RAG model in Q&A tasks.

Subtask 01: Developing a Retrieval-Augmented Generation (RAG) model for Sinhala/Tamil

Implement and fine-tune RAG models for Sinhala/Tamil, combining retrieval mechanisms with generation capabilities.
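
The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the expected solution: the bag-of-words `embed` function stands in for a real embedding model, `InMemoryVectorStore` stands in for an open-source vector database, and `generate` is a hypothetical callable that wraps whichever open-source LLM a team chooses. The sample passages are taken from this announcement.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database: stores (passage, vector) pairs."""

    def __init__(self):
        self.items = []

    def add(self, passage):
        self.items.append((passage, embed(passage)))

    def search(self, query, k=2):
        """Return the k passages most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [passage for passage, _ in ranked[:k]]


def build_prompt(question, passages):
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question, store, generate, k=2):
    """Retrieval step followed by generation step."""
    passages = store.search(question, k)
    return generate(build_prompt(question, passages))


# Toy usage; `generate` just echoes the prompt here (no real LLM is called).
store = InMemoryVectorStore()
store.add("Each team can include up to 5 members.")
store.add("The final submission is due on November 01st, 2024.")
store.add("Other interested parties can join by registering with a fee of 5000 rupees.")
print(answer("How many members can a team include?", store, generate=lambda p: p, k=1))
```

Passing `generate` in as a parameter keeps the retrieval logic independent of any particular model, so the retriever can be tested on the provided questions before an LLM is wired in.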

Subtask 02: Sinhala/Tamil Interface development

Develop a user-friendly interface for interacting with the LLM in Sinhala/Tamil, making it accessible for end-users and developers.

NOTE

  • An initial dataset and a set of questions will be provided to train and test the chatbot systems.
  • The evaluation metrics will be disclosed in due course.

Timeline (For both Shared Tasks)

  • Call for registration: September 06th, 2024
  • Initial briefing: September 10th, 2024
  • Interim progress meeting (Online): September 30th, 2024
  • Final submission: November 01st, 2024
  • Shared Task competition: November 08th, 2024

Who Can Participate?

Teams from universities and industry are invited to compete!

  • Each team can include up to 5 members and can participate for free.
  • Other interested parties can join the Shared Task competition on November 8th, 2024 by registering for a fee of 5,000 rupees.

Don’t miss this exciting opportunity to learn and collaborate at our conference workshop. Register now!

Organized By

Language Technology Research Lab, UCSC

Informatics Institute of Technology

Young Outstanding Researcher


Young outstanding researchers are the future of the scientific community, and we want to recognize and celebrate their achievements. This competition is designed to identify the most outstanding young researchers and provide them with the recognition they deserve.

We'll be looking for researchers who have already demonstrated exceptional talent and promise in their field. Whether it's through groundbreaking research, innovative ideas, or a commitment to advancing the frontiers of knowledge, we want to see evidence that you're already making an impact in your field.

We're particularly interested in researchers who are working on cutting-edge topics or exploring new areas of inquiry. We're looking for individuals who are pushing the boundaries of what is currently known, and who have the potential to make significant contributions to their field in the years to come.

We'll also be evaluating the quality and impact of your research work. We'll be looking for evidence of rigorous methodology, innovative thinking, and the potential for real-world application. We'll be assessing the quality of your publications, the impact of your work, and your potential for continued success in the field.

But we're not just looking for exceptional researchers - we're also looking for individuals who have the potential to become leaders in their field. We'll be evaluating your leadership potential, communication skills, and ability to collaborate with others.

Overall, the Young Outstanding Researcher competition is an opportunity for you to showcase your exceptional talent, dedication, and promise as a young researcher. We encourage you to apply and show us why you're the most outstanding young researcher in your field.

Most Contributing Researcher


The Most Contributing Researcher competition recognizes researchers who have made significant contributions to their field over the course of their careers. We're looking for individuals who have dedicated themselves to advancing the frontiers of knowledge, and whose work has had a meaningful impact on their field and beyond.

But we're not just looking for exceptional researchers - we're also looking for individuals who have made meaningful contributions to their community and society as a whole.

Most Popular Researcher


This exciting competition is designed to help talented students like you showcase your research skills and connect with top professionals in academia and industry. Our focus is on selecting the most popular researcher from among the participating students, and we'll be evaluating a variety of factors to determine the winner.

One of the key factors we'll be looking at is your connection with academia and industry. We believe that strong connections and collaborative potential are essential for success in research.

We'll also be considering your communication skills, leadership abilities, and overall potential to become a successful researcher. Effective communication and collaboration are critical in research, and we'll be looking for evidence that you have these skills and can develop them further.

Overall, the Research Talent Event provides an excellent opportunity for you to showcase your research talents, connect with top professionals in academia and industry, and potentially win the title of most popular researcher. We encourage you to apply and show us what you've got.