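Reranking Configuration with a Novel Fusion Method for Turkish Datasets Using Fine-Tuned RAG Components (original title: İnce Ayarlanmış RAG Bileşenlerini Kullanarak Türkçe Veri Setleri için Yeni Bir Füzyon Yöntemi ile Yeniden Sıralama Konfigürasyonu)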

Bıkmaz, Erdoğan

Date

2025

Authors

Bıkmaz, Erdoğan

Abstract

This study addresses a gap in the multilingual capabilities of Retrieval-Augmented Generation (RAG) systems for Turkish, particularly in the medical domain. With the rise of Large Language Models (LLMs) and their widespread use, retrieval components that ground responses in external knowledge have become critical for reducing hallucinations and improving response accuracy. However, most existing retrieval components, including embedding models and rerankers, are trained predominantly on English datasets, which limits their multilingual and domain-specific capabilities. To address this, the study introduces Pubmed-RAG-TR, a Turkish medical dataset, and fine-tunes retrieval components on both Pubmed-RAG-TR and WikiRAG-TR [36], a popular Turkish RAG dataset. A novel reranker pipeline based on Reciprocal Rank Fusion (RRF) is also developed to improve context construction for LLMs. Experimental results show that fine-tuning retrieval components on domain-specific datasets significantly improves retrieval and post-retrieval quality and, in turn, the accuracy of LLM responses. The study concludes that incorporating domain-specific semantics into retrieval and reranking models can substantially improve the performance of RAG systems in multilingual contexts.
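For readers unfamiliar with the fusion step the abstract refers to, the following is a minimal sketch of standard Reciprocal Rank Fusion, not the thesis's novel pipeline, whose details are not given on this page. The function name, the document IDs, and the constant k=60 (a commonly used default) are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with standard RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not a value
    taken from the thesis.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings, e.g. one from an embedding retriever
# and one from a reranker.
embedding_ranking = ["doc_a", "doc_b", "doc_c"]
reranker_ranking = ["doc_b", "doc_c", "doc_a"]
fused = reciprocal_rank_fusion([embedding_ranking, reranker_ranking])
```

Because RRF uses only rank positions, it can combine lists whose raw scores are on different scales, which is one reason it is a common choice for merging retriever and reranker outputs.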

Keywords

Computer Engineering and Computer Science and Control, Text Retrieval, Synthetic Dataset, Reordering, Human-Artificial Intelligence Interaction

End Page

84

URI

https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=CtwiQkYvArAb95Ufpfs_vmGr1Pe7Xe79cAQMQFD_GnVQxn9jeCbO96j0PiZaAZPZ
https://hdl.handle.net/20.500.12416/15830

Collections

Doctoral Theses
