Recent advancements in Large Language Models (LLMs) have transformed the world by facilitating access to information. Building an LLM is a significant and ambitious project. Unfortunately, the lack of LLMs for low-resource languages such as Sinhala and Tamil presents significant challenges; hence there is a pressing need to develop LLMs for these languages. This can be achieved through the following two approaches.
- Building an LLM from scratch, which includes subtasks such as tokenization, embedding, training, and fine-tuning, in addition to acquiring a substantial amount of text data.
- Developing a language interface to an open-source LLM using Retrieval-Augmented Generation (RAG).
In parallel with the ICTer 2024 conference, and as a follow-up to the Shared Task on Sinhala LLM at iCIIT 2024, UCSC and IIT are collaboratively organizing a workshop titled “Shared Task towards building a Sinhala/Tamil Large Language Model”, which includes two main tasks.
Objectives
- To advance the development of a Sinhala/Tamil LLM.
- To foster collaboration and innovation in the NLP community.
- To create robust benchmarks for Sinhala/Tamil NLP.
Shared Task 01
Developing a Tokenizer and a Word Embedding Model for Sinhala/Tamil Language
This task involves developing an efficient tokenizer and a robust word embedding model specifically tailored for the Sinhala and Tamil languages.
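One common starting point for such a tokenizer is byte-pair encoding (BPE), which learns subword units from raw text. The sketch below is a minimal, illustrative BPE trainer; the toy corpus and merge count are assumptions, and a real submission would train on a large Sinhala/Tamil corpus (and would typically use an established library rather than this hand-rolled version).

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    # Replace every occurrence of `pair` with its concatenation.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from individual characters and greedily learn merges.
    vocab = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab

# Illustrative toy corpus (an English stand-in; a real run uses Sinhala/Tamil text).
merges, vocab = train_bpe(["low", "low", "lower", "lowest"], 3)
```

The learned merge list defines the tokenizer: applying the merges in order segments new words into subword units, whose vectors can then be trained with any word-embedding method.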
Shared Task 02
Developing a Chatbot System with Q&A Support Using the RAG Model for Sinhala/Tamil
The second task requires participants to use available open-source Large Language Models (LLMs) to develop a chatbot system that provides Q&A support using Retrieval-Augmented Generation (RAG). Participants will be provided with an initial dataset and a set of questions the chatbot should be able to answer. The chatbot systems will be evaluated separately to ensure a fair and comprehensive assessment.
Task Objectives:
- Foundation: Participating teams should use existing open-source LLMs as the foundation for their chatbot systems.
- RAG Model: The RAG model, which combines retrieval-based and generation-based approaches, should be used to enhance the chatbot’s ability to provide accurate and contextually relevant answers.
- Vector Databases: Open-source vector databases will be utilized to efficiently store and retrieve vector representations of text data.
- Accuracy and Efficiency: The chatbot should be able to answer the given questions accurately and efficiently, demonstrating the effectiveness of the RAG model in Q&A tasks.
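The pipeline implied by the objectives above can be sketched as: embed the documents, store the vectors, retrieve the most similar passages for a query, and prompt the LLM with that context. The sketch below uses a toy bag-of-words embedding and an in-memory store as stand-ins; in a real system the `embed` function would be a trained embedding model, `VectorStore` an open-source vector database, and `llm` an open-source LLM (all three names here are hypothetical placeholders, not a prescribed API).

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts.
    # A real system would use a trained sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    # In-memory stand-in for an open-source vector database.
    def __init__(self):
        self.items = []

    def add(self, doc):
        self.items.append((embed(doc), doc))

    def retrieve(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(it[0], q),
                        reverse=True)
        return [doc for _, doc in ranked[:k]]

def answer(query, store, llm):
    # RAG: retrieve relevant context, then let the LLM generate
    # an answer grounded in that context.
    context = "\n".join(store.retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)
```

The key design point is the separation of retrieval from generation: the vector store supplies grounding passages, so the LLM answers from the provided dataset rather than from its parametric memory alone.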
NOTE
- An initial dataset and a set of questions will be provided to train and test the chatbot systems.
- The evaluation metrics will be disclosed in due course.
Who Can Participate?
Teams from universities and industry are invited to compete!
- Each team can include up to 5 members and can participate for free.
- Other interested parties can join (8th November) by registering for a fee of 5,000 rupees.
Don’t miss this exciting opportunity to learn and collaborate at our conference workshop. Register now!