
How To Create Datasets for Finetuning From Multiple Sources! Improving Finetunes With Embeddings.

AemonAlgiz

Today, we delve into the process of setting up datasets for finetuning large language models (LLMs). Starting from the initial considerations needed before dataset construction, we work through pipeline setup questions, such as whether embeddings are needed. We discuss how to structure raw text data for finetuning, illustrated with real coding and medical-appeals scenarios.
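As a quick illustration of the kind of structure discussed above, a common way to organize raw text for finetuning is an instruction/response format, one JSON object per line (JSONL). This is a hypothetical sketch, not the exact format used in the video; the field names and example records are assumptions:

```python
import json

# Hypothetical instruction-format records; real records would come from
# your coding or medical-appeals source data.
records = [
    {
        "instruction": "Summarize the patient's appeal in one sentence.",
        "input": "The claim for an MRI was denied as not medically necessary.",
        "output": "The patient is appealing a denial of MRI coverage.",
    },
    {
        "instruction": "Explain what a graph database is.",
        "input": "",
        "output": "A graph database stores data as nodes and relationships.",
    },
]

# Serialize to JSONL: one record per line, so training tools can stream it.
jsonl = "\n".join(json.dumps(rec) for rec in records)

with open("finetune_dataset.jsonl", "w") as f:
    f.write(jsonl + "\n")
```

Most finetuning frameworks can consume a file like this directly, which is why JSONL shows up so often in dataset pipelines.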

We also explore how to leverage embeddings to provide additional context to our models, a crucial step in building more general and robust models. The video further explains how to transform books into structured datasets using LLMs, with an example of converting the book 'Twenty Thousand Leagues Under the Sea' into a question-and-answer format.
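The embedding-based context idea above can be sketched in a few lines: embed the user's question and each text chunk, retrieve the most similar chunk, and prepend it to the prompt. This is a minimal stand-in, not the video's implementation; the `embed` function here is a toy bag-of-words vector where a real pipeline would use an embedding model (see the MTEB link below for model choices):

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(w.strip("?.,!") for w in text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical document chunks; in practice these come from your source text.
chunks = [
    "The Nautilus is a submarine commanded by Captain Nemo.",
    "Medicare appeals must be filed within a set deadline.",
]
query = "Who commands the Nautilus submarine?"

# Retrieve the chunk most similar to the query...
best = max(chunks, key=lambda c: cosine(embed(query), embed(c)))
# ...and prepend it to the prompt as extra context for the model.
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
```

The design point is that the finetuned model never has to memorize the source text; retrieval supplies the relevant passage at inference time.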

In addition, we look at the process of finetuning LLMs to write in specific programming languages, showing a practical application with a Cypher query for graph databases. Lastly, we demonstrate how to enhance the performance of a medical application by using embedded information via the Superbooga extension.
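To make the code-finetuning idea concrete, a training pair for teaching a model to emit Cypher (the query language used by graph databases such as Neo4j) might look like the sketch below. The schema labels (`Person`, `Movie`, `DIRECTED`) are illustrative assumptions, not taken from the video's dataset:

```python
# Hypothetical instruction/response pair for Cypher generation.
pair = {
    "instruction": "Write a Cypher query that finds all movies directed by a given person.",
    "input": "director: Christopher Nolan",
    "output": (
        "MATCH (p:Person {name: 'Christopher Nolan'})-[:DIRECTED]->(m:Movie) "
        "RETURN m.title"
    ),
}
```

Enough pairs like this, covering the target schema and query patterns, are what let a finetuned model generalize to unseen natural-language requests.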

Whether you're interested in coding, medical applications, book conversion, or simply finetuning LLMs in general, this video provides comprehensive insights. Tune in to discover how to augment your models with advanced techniques and tools. Join us on our live stream for a deep dive into how to broaden the context in local models and results from our book training and comedy sets.

0:00 Intro
0:44 Considerations For Finetuning Datasets
2:45 Reviewing Embeddings
5:35 Finetuning With Embeddings
8:31 Creating Datasets From Raw/Books
12:08 Coding Finetuning Example
14:02 Medicare/Medicaid Appeals Example
17:01 Outro

Training datasets: https://github.com/tomasonjo/blogdat...

Massive Text Embeddings: https://huggingface.co/blog/mteb

Github Repo: https://github.com/AemonAlgiz/Datese...

#machinelearning #ArtificialIntelligence #LargeLanguageModels #FineTuning #DataPreprocessing #Embeddings
