{"id":1740,"date":"2024-08-16T06:13:03","date_gmt":"2024-08-16T06:13:03","guid":{"rendered":"https:\/\/icter.lk\/?page_id=1740"},"modified":"2024-12-04T09:10:39","modified_gmt":"2024-12-04T09:10:39","slug":"ucsc-iit-workshop-on-shared-task-towards-building-a-sinhala-tamil-large-language-model","status":"publish","type":"page","link":"https:\/\/icter.lk\/icter_2024\/workshops\/ucsc-iit-workshop-on-shared-task-towards-building-a-sinhala-tamil-large-language-model\/","title":{"rendered":"UCSC &#8211; IIT Workshop on Shared Task towards building a Sinhala\/Tamil Large Language Model"},"content":{"rendered":"<p>[vc_row][vc_column][vc_column_text]<span style=\"font-weight: 400;\">Recent advancements in Large Language Models (LLMs) have transformed the world by facilitating access to information. Building a Large Language Model (LLM) is a significant and ambitious project. Unfortunately, the lack of an LLM for low-resourced languages such as Sinhala and Tamil presents significant challenges. Hence, there is a pressing need to develop LLMs for low-resourced languages.
This can be achieved through the following two approaches.<\/span><\/p>\n<ol style=\"color: #666666!important;\">\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Building an LLM from scratch, which includes main subtasks like tokenization, embedding, training, and fine-tuning, in addition to acquiring a substantial amount of text data.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Developing a language interface to an open-source LLM using Retrieval-Augmented Generation (RAG).<\/span><\/li>\n<\/ol>\n<p>In parallel to the ICTer 2024 conference, and as a follow-up to the Shared Task on Sinhala LLM at iCIIT 2024, UCSC and IIT are collaboratively organizing a workshop titled <em>\u201c<strong>Shared Task towards building a Sinhala\/Tamil Large Language Model<\/strong>\u201d<\/em>, which includes two main tasks.<\/p>\n<h3><b>Objectives<\/b><\/h3>\n<ul style=\"color: #666666!important;\">\n<li><span style=\"font-weight: 400;\">To advance the development of a Sinhala\/Tamil LLM.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">To foster collaboration and innovation in the NLP community.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">To create robust benchmarks for Sinhala\/Tamil NLP.<\/span><\/li>\n<\/ul>\n<p>[\/vc_column_text][vc_custom_heading text=&#8221;Shared Task 01&#8243; font_container=&#8221;tag:h1|text_align:left&#8221; use_theme_fonts=&#8221;yes&#8221;][vc_column_text]<\/p>\n<h2><b>Developing a Tokenizer and a Word Embedding Model for Sinhala\/Tamil<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">This task involves developing an efficient tokenizer and a robust word embedding model specifically tailored for the Sinhala and Tamil languages.<\/span>[\/vc_column_text][vc_message message_box_style=&#8221;3d&#8221; message_box_color=&#8221;orange&#8221; icon_fontawesome=&#8221;fas fa-paper-plane&#8221;]<\/p>\n<h4><b><i>Subtask 01: 
<\/i><\/b><b>Developing a Tokenizer for Sinhala\/Tamil<\/b><\/h4>\n<p>Tokenization is a fundamental step in natural language processing (NLP) in which text is segmented into smaller units, such as words or subwords, that can then be processed by language models. Generic tokenizers are often designed around linguistic rules and structures prevalent in widely spoken languages like English. Since Sinhala and Tamil have unique linguistic characteristics, generic tokenizers are not efficient for LLM development. Therefore, both languages need dedicated tokenizers suitable for LLM development. In this task, our goal is to build an efficient tokenizer for Sinhala\/Tamil.<\/p>\n<h5><b>Objective<\/b><\/h5>\n<p><span style=\"font-weight: 400;\">To develop and evaluate tokenization algorithms specifically tailored for Sinhala\/Tamil, considering their unique scripts and linguistic properties.<\/span>[\/vc_message][vc_message message_box_style=&#8221;3d&#8221; message_box_color=&#8221;orange&#8221; icon_fontawesome=&#8221;fas fa-paper-plane&#8221;]<\/p>\n<h4><b><i>Subtask 02:<\/i><\/b><b> Developing a Sinhala\/Tamil Word Embedding Model (Word2Vec)<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">This task involves developing a word embedding model that transforms the above tokens into dense vector representations capturing their semantic meaning, which is crucial for various NLP tasks such as translation, summarization, and sentiment analysis. <\/span><span style=\"font-weight: 400;\">
This process involves creating a numerical representation of words in a continuous vector space, where words with similar meanings are mapped to similar vectors.<\/span><\/p>\n<h5><b>Objective<\/b><\/h5>\n<p><span style=\"font-weight: 400;\">To create word and sentence embedding models for Sinhala\/Tamil that capture semantic and syntactic nuances.<\/span>[\/vc_message][vc_custom_heading text=&#8221;Shared Task 02&#8243; font_container=&#8221;tag:h1|text_align:left&#8221; use_theme_fonts=&#8221;yes&#8221;][vc_column_text]<\/p>\n<h2><b>Developing a Chatbot System with Q&amp;A Support Using the RAG Model for Sinhala\/Tamil<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">In the second task, participants will use available open-source Large Language Models (LLMs) to develop a chatbot system capable of providing Q&amp;A support using the Retrieval-Augmented Generation (RAG) model. Participants will be provided with an initial dataset and a set of questions that the chatbot should be able to answer. 
The evaluation of the chatbot systems will be conducted separately to ensure a fair and comprehensive assessment.<\/span><\/p>\n<p><b>Task Objectives:<\/b><\/p>\n<ul style=\"color: #666666!important;\">\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Foundation:<\/b><span style=\"font-weight: 400;\"> Participating teams should use existing open-source LLMs as the foundation for their chatbot systems.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>RAG Model:<\/b><span style=\"font-weight: 400;\"> The RAG model, which combines retrieval-based and generation-based approaches, should be used to enhance the chatbot&#8217;s ability to provide accurate and contextually relevant answers.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Vector Databases:<\/b><span style=\"font-weight: 400;\"> Open-source vector databases will be utilized to efficiently store and retrieve vector representations of text data.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Accuracy and Efficiency:<\/b><span style=\"font-weight: 400;\"> The chatbot should be able to answer the given questions accurately and efficiently, demonstrating the effectiveness of the RAG model in Q&amp;A tasks.<\/span><\/li>\n<\/ul>\n<p>[\/vc_column_text][vc_message message_box_style=&#8221;3d&#8221; message_box_color=&#8221;orange&#8221; icon_fontawesome=&#8221;fas fa-paper-plane&#8221;]<\/p>\n<h4><b><i>Subtask 01<\/i><\/b><b>: Developing a Retrieval-Augmented Generation (RAG) model for Sinhala\/Tamil<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Implement and fine-tune RAG models for Sinhala\/Tamil, combining retrieval mechanisms with generation capabilities.<\/span>[\/vc_message][vc_message message_box_style=&#8221;3d&#8221; message_box_color=&#8221;orange&#8221; icon_fontawesome=&#8221;fas fa-paper-plane&#8221;]<\/p>\n<h4><b><i>Subtask 02<\/i><\/b><b>: Sinhala\/Tamil Interface development<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Develop a 
user-friendly interface for interacting with the LLM in Sinhala\/Tamil, making it accessible to end-users and developers.<\/span>[\/vc_message][\/vc_column][\/vc_row][vc_row][vc_column][vc_column_text]<\/p>\n<h4><b>NOTE<\/b><\/h4>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><i><span style=\"font-weight: 400;\">An initial dataset and a set of questions will be provided to train and test the chatbot systems.<\/span><\/i><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><i><span style=\"font-weight: 400;\">The evaluation metrics will be disclosed in due course.<\/span><\/i><\/li>\n<\/ul>\n<p>[\/vc_column_text][vc_message message_box_style=&#8221;3d&#8221; message_box_color=&#8221;success&#8221; icon_fontawesome=&#8221;far fa-calendar-alt&#8221;]<\/p>\n<h3><b>Timeline <\/b>(<em>For both Shared Tasks<\/em>)<\/h3>\n<ul style=\"color: #666666!important;\">\n<li><span style=\"font-weight: 400;\">Call for registration: <\/span> <del><strong>September 6th, 2024<\/strong><\/del><\/li>\n<li><span style=\"font-weight: 400;\">Initial briefing: <\/span> <del><strong>September 10th, 2024<\/strong><\/del><\/li>\n<li><span style=\"font-weight: 400;\">Interim progress meeting (Online): <\/span> <del><strong>September 30th, 2024<\/strong><\/del><\/li>\n<li><span style=\"font-weight: 400;\">Final Submission: <\/span><strong>November 1st, 2024<\/strong><\/li>\n<li><span style=\"font-weight: 400;\">Shared Task competition: <\/span> <strong>November 8th, 2024<\/strong><\/li>\n<\/ul>\n<p>[\/vc_message][vc_column_text]<\/p>\n<h3><b>Who Can Participate?<\/b><\/h3>\n<p>Teams from universities and industry are invited to compete!<\/p>\n<ul style=\"color: #666666!important;\">\n<li aria-level=\"1\">Each team can include up to 5 members and participate for free.<\/li>\n<li aria-level=\"1\">Other interested parties can join on 8th November by registering for a fee of 5,000 rupees.<\/li>\n<\/ul>\n<p>Don&#8217;t miss this exciting opportunity to 
learn and collaborate at our conference workshop. Register now!<\/p>\n<h3><b>Organized By<\/b><\/h3>\n<p>[\/vc_column_text][\/vc_column][\/vc_row][vc_row][vc_column width=&#8221;1\/2&#8243;][vc_custom_heading text=&#8221;Language Technology Research Lab, UCSC&#8221; font_container=&#8221;tag:h4|text_align:center&#8221; use_theme_fonts=&#8221;yes&#8221;][vc_row_inner][vc_column_inner width=&#8221;1\/2&#8243;][vc_single_image image=&#8221;1805&#8243; img_size=&#8221;full&#8221; alignment=&#8221;center&#8221; onclick=&#8221;custom_link&#8221; img_link_target=&#8221;_blank&#8221; link=&#8221;https:\/\/ucsc.cmb.ac.lk\/&#8221;][\/vc_column_inner][vc_column_inner width=&#8221;1\/2&#8243;][vc_single_image image=&#8221;1804&#8243; img_size=&#8221;full&#8221; alignment=&#8221;center&#8221; onclick=&#8221;custom_link&#8221; img_link_target=&#8221;_blank&#8221; link=&#8221;https:\/\/ucsc.cmb.ac.lk\/language-technology-research-laboratory\/&#8221;][\/vc_column_inner][\/vc_row_inner][\/vc_column][vc_column width=&#8221;1\/2&#8243;][vc_custom_heading text=&#8221;Informatics Institute of Technology&#8221; font_container=&#8221;tag:h4|text_align:center&#8221; use_theme_fonts=&#8221;yes&#8221;][vc_single_image image=&#8221;1776&#8243; img_size=&#8221;full&#8221; alignment=&#8221;center&#8221; onclick=&#8221;custom_link&#8221; img_link_target=&#8221;_blank&#8221; link=&#8221;https:\/\/www.iit.ac.lk\/&#8221;][\/vc_column][\/vc_row][vc_row disable_element=&#8221;yes&#8221;][vc_column][vc_column_text][vc_btn title=&#8221;Click here to register for the competition&#8221; shape=&#8221;round&#8221; color=&#8221;warning&#8221; i_icon_fontawesome=&#8221;fas fa-sign-in-alt&#8221; add_icon=&#8221;true&#8221; link=&#8221;url:https%3A%2F%2Fwww.icter.lk%2Fworkshop-registration%2Fpublic%2Fcompetition|target:_blank&#8221;][\/vc_column_text][\/vc_column][\/vc_row]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[vc_row][vc_column][vc_column_text]Recent advancements in Large Language Models 
(LLMs) have transformed the world by facilitating access to information. Building a Large Language Model (LLM) is a significant and ambitious project. Unfortunately, the lack of an LLM for low-resourced languages such as Sinhala and Tamil presents significant challenges. Hence, there is a pressing need to develop LLMs [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":400,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"pgc_sgb_lightbox_settings":"","footnotes":""},"class_list":["post-1740","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/pages\/1740","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/comments?post=1740"}],"version-history":[{"count":49,"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/pages\/1740\/revisions"}],"predecessor-version":[{"id":2103,"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/pages\/1740\/revisions\/2103"}],"up":[{"embeddable":true,"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/pages\/400"}],"wp:attachment":[{"href":"https:\/\/icter.lk\/icter_2024\/wp-json\/wp\/v2\/media?parent=1740"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}