This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resulting model is meant to follow instructions and chat in Hindi and Hinglish.

pacman100/openhathi_instruct

openhathi_instruct

This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resulting model is meant to follow instructions and chat in Hindi and Hinglish.

This is just the starter code.

Dataset

smangrul/hindi_instruct_v1

Hindi Instruct V1 Dataset

This dataset is curated by Sourab Mangrulkar. It was developed on top of the HuggingFaceH4/no_robots dataset. First, the dataset was translated using ai4bharat/indictrans2-en-indic-1B, the SoTA translation model developed by AI4Bharat. Note that the sequence length limit is 256 for both input and output sequences. Hence, I split each message into individual sentences on the full stop, translate them as a minibatch, and then stitch the translations back together.

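from datasets import load_dataset

# batch_translate, en_indic_model and en_indic_tokenizer are assumed to be defined elsewhere (not shown here).
def chunk_and_translate(content):
    # Split on full stops so each piece stays within the 256 sequence length limit.
    minibatch = content.split(".")
    # Drop empty or whitespace-only pieces.
    minibatch = [k for k in minibatch if len(k.strip())>0]
    translations = batch_translate(minibatch, "eng_Latn", "hin_Deva", en_indic_model, en_indic_tokenizer)
    # Stitch the translated sentences back into a single string.
    translated_content = " ".join(translations)
    return translated_content

def preprocess(sample):
    # Translate every turn of the conversation in place.
    for turn in sample["messages"]:
        turn["content"] = chunk_and_translate(turn["content"])

    return {"messages": sample["messages"]}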


dataset = load_dataset("HuggingFaceH4/no_robots")
dataset
dataset = dataset.map(
    preprocess,
    batched=False
)
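The split-and-stitch step can be tried without loading the translation model. The following is a minimal sketch, assuming a stand-in translator: `split_into_chunks`, `chunk_and_translate_sketch`, and `fake_translate_batch` are hypothetical names introduced here for illustration and are not part of the repository. It also shows that splitting on the full stop breaks initials such as "J. K." into separate chunks.

```python
from typing import Callable, List


def split_into_chunks(content: str) -> List[str]:
    """Split on full stops and drop empty or whitespace-only pieces."""
    return [piece for piece in content.split(".") if len(piece.strip()) > 0]


def chunk_and_translate_sketch(
    content: str, translate_batch: Callable[[List[str]], List[str]]
) -> str:
    """Translate each chunk, then join the translations with single spaces."""
    chunks = split_into_chunks(content)
    translations = translate_batch(chunks)
    return " ".join(translations)


def fake_translate_batch(sentences: List[str]) -> List[str]:
    # Hypothetical stand-in for the real model call: it only tags each chunk.
    return [f"<hi:{s.strip()}>" for s in sentences]


example = "Harry Potter was written by J. K. Rowling. It has seven books."
chunks = split_into_chunks(example)
stitched = chunk_and_translate_sketch(example, fake_translate_batch)
print(chunks)
print(stitched)
```

Because the split is purely on ".", the two-sentence example yields four chunks rather than two, and the full stops themselves are not carried into the stitched output.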

Next, inspired by the demo given by the OpenHathi team, I did the following:

Remove Coding samples

After manually going through the coding samples and trying options such as keeping user messages in Hindi while leaving assistant messages with code snippets in English, I found that this would still not work, because users often include code snippets and ask the assistant to help with debugging. As this would add noise to Hindi instruction tuning, I simply removed all the coding examples.
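Dropping the coding samples amounts to a filter on the sample category. This is a minimal sketch on toy data, assuming the category label for these samples is "Coding"; the predicate name `is_not_coding` and the toy samples are hypothetical.

```python
def is_not_coding(sample, coding_label="Coding"):
    """Return True for samples to keep, i.e. anything except coding samples."""
    # Assumption: coding samples carry the category label "Coding".
    return sample["category"] != coding_label


toy_samples = [
    {"category": "Generation", "messages": []},
    {"category": "Coding", "messages": []},
    {"category": "Summarize", "messages": []},
]

kept = [s for s in toy_samples if is_not_coding(s)]
kept_categories = [s["category"] for s in kept]
print(kept_categories)
```

With a Hugging Face `datasets` object, the same predicate could be passed to `dataset.filter(is_not_coding)`.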

Add English and Hinglish instructions that tell the model to reply in Hindi to English user messages, and add the CoT prompt "First write in English and then translate to Hindi"

With a probability of 0.2, append one of the prompts "Hindi mein jawab dena." / "Reply in Hindi." / "Summarize in Hindi." to the end of user messages in English, paired with the assistant replies in Hindi. This helps when a user writes something in English and prompts the model to reply in Hindi.

For chain-of-thought ability, the OpenHathi team used the prompt "First write in English and then translate to Hindi", which leads to better responses. To enable this, we append the prompt "First write in English and then translate to Hindi" / "Pehle English mein likhna, phir Hindi mein translate karna." to user messages in English and format the assistant message with the template {english_reply}\nTranslation of the above into Hindi:\n{translated_hindi_reply}.

hindi_reply_prompt = "Hindi mein jawab dena."
english_reply_prompt = "{prefix} in Hindi."
cot_prompt = "First write in English and then translate to Hindi"
cot_hindi_prompt = "Pehle English mein likhna, phir Hindi mein translate karna."

# bernoulli_sample(p) is assumed to return True with probability p (defined elsewhere).
def add_system_prompt(sample):
    p = 0.2
    language_p = 0.5
    suffix_hindi_prompt = bernoulli_sample(p)
    suffix_cot_prompt = bernoulli_sample(p)
    use_hindi_reply_prompt = bernoulli_sample(language_p)
    use_hindi_cot_prompt = bernoulli_sample(language_p)

    # Even indices are treated as user turns, odd indices as assistant turns.
    for i, turn in enumerate(sample["hindi_messages"]):
        if suffix_hindi_prompt:
            # Keep the English user message and ask for a Hindi reply.
            if use_hindi_reply_prompt and i%2==0:
                turn["content"] = f"{sample['messages'][i]['content']} {hindi_reply_prompt}"
            elif i%2==0:
                prefix = "Summarize" if sample["category"] == "Summarize" else "Reply"
                turn["content"] = f"{sample['messages'][i]['content']} {english_reply_prompt.format(prefix=prefix)}"
        elif suffix_cot_prompt:
            # Keep the English user message and ask for English first, then Hindi.
            if use_hindi_cot_prompt and i%2==0:
                turn["content"] = f"{sample['messages'][i]['content']} {cot_hindi_prompt}"
            elif i%2==0:
                turn["content"] = f"{sample['messages'][i]['content']} {cot_prompt}"
            else:
                # Assistant turn: English reply followed by its Hindi translation.
                turn["content"] = f"{sample['messages'][i]['content']}\nTranslation of the above into Hindi:\n{turn['content']}"

    return {"hindi_messages": sample["hindi_messages"]}

orig_dataset = orig_dataset.map(
    add_system_prompt,
    batched=False
)
orig_dataset
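The snippet above relies on a `bernoulli_sample` helper that is not shown. The following is a minimal sketch, assuming it returns True with probability p; the `rng` parameter and `estimate_branch_rates` are hypothetical additions used only to make the check reproducible. It also estimates how often each branch of the `if`/`elif` fires.

```python
import random


def bernoulli_sample(p, rng=random):
    """Return True with probability p (assumed behaviour of the missing helper)."""
    return rng.random() < p


def estimate_branch_rates(p=0.2, trials=100_000, seed=0):
    """Simulate the two independent draws and count which branch would run."""
    rng = random.Random(seed)
    reply_hits = 0
    cot_hits = 0
    for _ in range(trials):
        suffix_hindi_prompt = bernoulli_sample(p, rng)
        suffix_cot_prompt = bernoulli_sample(p, rng)
        if suffix_hindi_prompt:
            reply_hits += 1
        elif suffix_cot_prompt:
            cot_hits += 1
    return reply_hits / trials, cot_hits / trials


reply_rate, cot_rate = estimate_branch_rates()
print(f"reply-in-Hindi suffix: {reply_rate:.3f}, CoT suffix: {cot_rate:.3f}")
```

Because the CoT branch sits in an `elif`, a sample receives the CoT suffix only when the first draw fails, so its expected rate is 0.8 × 0.2 = 0.16 rather than 0.2, and about 64% of samples are left unchanged by this step.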

Example outputs:

[{'content': 'Write a funny, short story about someone who will stop at nothing to go green and reduce their carbon footprint. \n Pehle English mein likhna, phir Hindi mein translate karna.', 'role': 'user'}, {'content': 'Garry has a real green thumb, and taking care of the environment is extremely important to him. He takes every possible step to cut his carbon footprint, and it inspires and mystifies everyone around him in equal measure. Each night before he goes to bed, he cuts off the power to his house so that his electronic devices don’t draw idle power. \n\nAlmost all of his appliances are quite old, as Garry wants to get as much use as possible out of them before contributing to new pollution-generating production. The only exceptions are his new, water-efficient dishwasher and his solar-powered hot water tank. \n\nThe Homeowner’s Association hates Garry because he grows corn on his front lawn. All of his food is locally sourced as much as possible, and he commutes to work on a bike that doesn’t look like it should work. He built it himself out of recycled parts, and it seems to jump and rock like Fred Flinstone’s car. Yet, it stays in one piece and carries him faithfully to work without contributing to pollution. All in all, there’s nothing Garry won’t do to keep his carbon footprint small. \nTranslation of the above into Hindi:\nगैरी का अंगूठा असली में हरा है और पर्यावरण की देखभाल करना उसके लिए बेहद महत्वपूर्ण है। वह अपने कार्बन पदचिह्न को कम करने के लिए हर संभव कदम उठाता है, और यह उसके आसपास सभी लोगों को समान रूप से प्रेरित करता है और रहस्यमयी बनाता है। हर रात सोने से पहले वह अपने घर की बिजली काट देता है ताकि उसके इलेक्ट्रॉनिक उपकरण बेकार बिजली न खींच सकें। उनके लगभग सभी उपकरण काफी पुराने हैं, क्योंकि गैरी नए प्रदूषण पैदा करने वाले उत्पादन में योगदान देने से पहले उनका अधिक से अधिक उपयोग करना चाहते हैं। केवल अपवाद उनके नए, पानी-कुशल डिशवॉशर और उनकी सौर ऊर्जा से चलने वाली गर्म पानी की टंकी हैं। मकान मालिक संघ गैरी से नफरत करता है क्योंकि वह अपने सामने के लॉन में मकई उगाता है। उनका सारा खाना जितना संभव हो सके स्थानीय रूप से प्राप्त किया जाता है, और वह एक ऐसी बाइक पर काम करने के लिए आता है जो ऐसा नहीं लगता कि यह काम करना चाहिए। उन्होंने इसे पुनर्नवीनीकरण भागों से खुद बनाया था, और यह फ्रेड फ्लिनस्टोन की कार की तरह कूदता हुआ प्रतीत होता है। फिर भी, यह एक टुकड़े में रहता है और उसे प्रदूषण में योगदान दिए बिना काम करने के लिए ईमानदारी से ले जाता है। कुल मिलाकर, ऐसा कुछ भी नहीं है जो गैरी अपने कार्बन पदचिह्न को छोटा रखने के लिए न करे।', 'role': 'assistant'}]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[{'content': 'Help write a letter of 100 -200 words to my future self for Kyra, reflecting on her goals and aspirations. First write in English and then translate to Hindi', 'role': 'user'}, {'content': "Dear Future Self,\n\nI hope you're happy and proud of what you've achieved. As I write this, I'm excited to think about our goals and how far you've come. One goal was to be a machine learning engineer. I hope you've worked hard and become skilled in this field. Keep learning and innovating. Traveling was important to us. I hope you've seen different places and enjoyed the beauty of our world. Remember the memories and lessons. Starting a family mattered to us. If you have kids, treasure every moment. Be patient, loving, and grateful for your family.\n\nTake care of yourself. Rest, reflect, and cherish the time you spend with loved ones. Remember your dreams and celebrate what you've achieved. Your determination brought you here. I'm excited to see the person you've become, the impact you've made, and the love and joy in your life. Embrace opportunities and keep dreaming big.\n\nWith love,\nKyra\nTranslation of the above into Hindi:\nप्रिय भविष्य स्वयं, मुझे आशा है कि आपने जो हासिल किया है उससे आप खुश और गर्वित हैं। जब मैं यह लिख रहा हूँ, तो मैं अपने लक्ष्यों के बारे में सोचने के लिए उत्साहित हूं और सोचता हूँ कि आप कितनी दूर आ गए हैं। एक लक्ष्य मशीन लर्निंग इंजीनियर बनना था। मुझे आशा है कि आपने कड़ी मेहनत की होगी और इस क्षेत्र में कुशल बनेंगे। सीखते रहें और नवाचार करते रहें। हमारे लिए यात्रा करना महत्वपूर्ण था। मुझे आशा है कि आपने विभिन्न स्थानों को देखा होगा और हमारी दुनिया की सुंदरता का आनंद लिया होगा। यादों और सबक को याद रखें। परिवार शुरू करना हमारे लिए मायने रखता था। अगर आपके बच्चे हैं, तो हर पल को संजो कर रखें। अपने परिवार के लिए धैर्य, प्यार और आभारी रहें। अपना ख्याल रखें। आराम करें, सोचें और अपने प्रियजनों के साथ बिताए समय को संजो कर रखें। अपने सपनों को याद रखें और जो आपने हासिल किया है उसका जश्न मनाएं। आपका दृढ़ संकल्प आपको यहाँ ले आया है। आप जिस व्यक्ति के रूप में उभरे हैं, आपने जो प्रभाव डाला है और आपके जीवन में प्यार और खुशी को देखकर मैं उत्साहित हूं। अवसरों को अपनाएँ और बड़े सपने देखते रहें। प्यार के साथ, कायरा।", 'role': 'assistant'}]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[{'content': "Answer the following questions using the passage below:\nWho was Harry Potter written by?\nWho originally published the Harry Potter series?\n\nHarry Potter is a series of seven fantasy novels written by British author J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's conflict with Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and Muggles (non-magical people).\n\nThe series was originally published in English by Bloomsbury in the United Kingdom and Scholastic Press in the United States. All versions around the world are printed by Grafica Veneta in Italy.[1] A series of many genres, including fantasy, drama, coming-of-age fiction, and the British school story (which includes elements of mystery, thriller, adventure, horror, and romance), the world of Harry Potter explores numerous themes and includes many cultural meanings and references.[2] According to Rowling, the main theme is death.[3] Other major themes in the series include prejudice, corruption, and madness.[4] Hindi mein jawab dena.", 'role': 'user'}, {'content': 'हैरी पॉटर ब्रिटिश लेखक जे. द्वारा लिखा गया था। के. रालिंग यह श्रृंखला मूल रूप से ब्रिटेन में ब्लूम्सबरी और अमेरिका में स्कॉलास्टिक प्रेस द्वारा प्रकाशित की गई थी।', 'role': 'assistant'}]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[{'content': 'Write a summary of this article about Euripides by reducing it to less than 90 words. Change the dating system from B.C. to BCE.\n\nThe Ancient Greeks, particularly the Ancient Athenians, are given the credit for inventing theater itself. As a result of that, theater became part of the culture, as well as everything that went along with it. There are theaters located throughout Ancient Greece. Also as a result of this fact, playwrights began to spring up all over Greece. One of the most well known of these is Euripides, who is especially noted for his tragedies. He is also one of the only playwrights whose work has actually survived. Here is some more information about this famous playwright from Ancient Greece:\n\nInformation About Euripides\n\nEuripides as born in 480 B.C. in Salamis, which is an island in the Saronic Gulf in Greece. It is interesting to note that he was born on the same day as the Battle of Salamis, where the ultimate victory of the Greeks prevented an invasion by the Persian Empire. He died in 406 B.C. in the city-state of Macedonia. His mother’s name was Cleito and his father’s name was Mnesarchus. His father insisted that he should be an athlete. Although his parents valued athletics, however, he wasn’t restricted in his education. He also learned artistic pursuits, such as painting, as well as philosophy from masters such as Anaxagorus. He was also a dancer and torch bearer at the rites of Apollo Zosterius. Later on, it was clear that he had a knack for the stage and he eventually began writing the plays he was most famous for.\n\n\nHis personal life, however, wasn’t as successful as his professional one. He was actually married twice and in both instances, his wives weren’t faithful to him. He also had three sons from one of these marriages. After this happened, he moved to a cave in Salamis because he wanted to live alone. Referred to as the Cave of Euripides, after his death his followers started a shrine in his honor.\n\nWorks of Euripides\n\nThere is actually some debate as to how many plays there are that were written by Euripides that we know of. Some sources say that there are 95, others say that there are 92. One thing is certain, however, is that his impact on theater is vast and his plays are still being performed today. Of these plays, a much smaller number survived in its complete form. Only about eighteen or nineteen of his plays had been found in their entirety. Besides the fact that his plays have been entertaining people for centuries, his work has also influenced playwrights through the ages. In other words, his artistic legacy has survived the ages.\n\nEuripides was also a contemporary of two other famous playwrights. Sophocles’ first play deputed thirteen years before his, and Aeschylus’s masterwork, Oresteia, debuted three years before Euripides’ first play was released.\n\nAs you can see, Euripedes has had a big impact on theater not only in Ancient Greece, but in the whole world. Sure, theater was invented in Ancient Greece, but the works of people like Euripides is everlasting and they’re still being performed today. Summarize in Hindi.', 'role': 'user'}, {'content': 'प्राचीन यूनानी नाटककार यूरिपिड्स का जन्म 480 ईसा पूर्व में ग्रीक द्वीप सलामिस में हुआ था। युवावस्था में उन्होंने एथलेटिक्स, दर्शन और चित्रकला का अध्ययन किया। वह भगवान अपोलो के संस्कारों के दौरान एक मशाल वाहक और नर्तक थे। हालाँकि इस बात पर सवाल हैं कि उन्होंने कितने नाटक लिखे, उनमें से 18 या 19 आज भी पूर्ण रूप में जीवित हैं। अपनी दूसरी असफल शादी के बाद, उन्होंने सलामिस पर एक गुफा में निवास किया जिसे अब यूरिपिड्स की गुफा कहा जाता है। 406 ईसा पूर्व में उनकी मृत्यु हो गई जो अब मैसेडोनिया है।', 'role': 'assistant'}]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[{'content': 'Can foxes fit down rabbit burrows? Reply in Hindi.', 'role': 'user'}, {'content': 'लोमड़ी खरगोश के बिलों में फिट नहीं हो सकते हैं, लेकिन यह उन्हें कोशिश करने से नहीं रोकता है। एक लोमड़ी जितना संभव हो उतना नीचे जाएगी और फिर खरगोश को पकड़ने और उसे बाहर निकालने के प्रयास में अपने अर्ध-वापस लेने योग्य पंजों का उपयोग करके खुदाई करेगी।', 'role': 'assistant'}]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Hinglish (Code-Mixing and Transliteration) for colloquial usage

We use the prompts below with the HinGE dataset and the Hinglish-TOP dataset (https://github.com/google-research-datasets/Hinglish-TOP-Dataset) to add samples that cover different combinations of translation between English, Hindi and Hinglish.

english_to_hinglish_prompt_en = "Translate the following from English into Hinglish."
hingligh_to_english_prompt_en = "Translate the following from Hinglish into English."
english_to_hinglish_prompt_hn = "English se Hinglish mein translate kare."
hingligh_to_english_prompt_hn = "Hinglish se English mein translate kare."
hindi_to_hinglish_prompt_hi = "निम्नलिखित का हिंदी से हिंग्लिश में अनुवाद करें।"
hingligh_to_hindi_prompt_hi = "निम्नलिखित का हिंग्लिश से हिंदी में अनुवाद करें।"
hindi_to_hinglish_prompt_hn = "Hindi se Hinglish mein translate kare."
hingligh_to_hindi_prompt_hn = "Hinglish se Hindi mein translate kare."

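import random

def format_to_norobots(sample):
    # One of four translation directions is chosen with equal probability (0.25 each);
    # the prompt language is then chosen with probability 0.5.
    p = 0.25
    language_p = 0.5
    ranom_num = random.random()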
    use_hn_prompt = bernoulli_sample(language_p)

    hindi_messages = []
    if ranom_num<= 0.25:
        #en_to_hn
        if use_hn_prompt:
            user_message = {"content":f"{english_to_hinglish_prompt_hn}\n{sample['English']}", "role": "user"}
        else:
            user_message = {"content":f"{english_to_hinglish_prompt_en}\n{sample['English']}", "role": "user"}
        asst_message = {"content":sample["Hinglish"], "role": "assistant"}
        hindi_messages.extend([user_message, asst_message])
    elif ranom_num<= 0.5:
        #hn_to_en
        if use_hn_prompt:
            user_message = {"content":f"{hingligh_to_english_prompt_hn}\n{sample['Hinglish']}", "role": "user"}
        else:
            user_message = {"content":f"{hingligh_to_english_prompt_en}\n{sample['Hinglish']}", "role": "user"}
        asst_message = {"content":sample["English"], "role": "assistant"}
        hindi_messages.extend([user_message, asst_message])
    elif ranom_num<= 0.75:
        #hi_to_hn
        if use_hn_prompt:
            user_message = {"content":f"{hindi_to_hinglish_prompt_hn}\n{sample['Hindi']}", "role": "user"}
        else:
            user_message = {"content":f"{hindi_to_hinglish_prompt_hi}\n{sample['Hindi']}", "role": "user"}
        asst_message = {"content":sample["Hinglish"], "role": "assistant"}
        hindi_messages.extend([user_message, asst_message])
    else:
        #hn_to_hi
        if use_hn_prompt:
            user_message = {"content":f"{hingligh_to_hindi_prompt_hn}\n{sample['Hinglish']}", "role": "user"}
        else:
            user_message = {"content":f"{hingligh_to_hindi_prompt_hi}\n{sample['Hinglish']}", "role": "user"}
        asst_message = {"content":sample["Hindi"], "role": "assistant"}
        hindi_messages.extend([user_message, asst_message])

    return {"category": "Transliteration and Code Mixing", "hindi_messages": hindi_messages}

Example output:

[{'content': 'Hinglish se English mein translate kare.\nMere liye reminder set karo to wake up at 6:30 am tomorrow.',
   'role': 'user'},
  {'content': 'Set a reminder for me to wake up at 630 am tomorrow.',
   'role': 'assistant'}]

Add the Hindi specific System prompt

With a probability of 0.2, we add the system prompt "You are a native Hindi speaker who can converse at expert level in both Hindi and colloquial Hinglish."

system_prompt = "You are a native Hindi speaker who can converse at expert level in both Hindi and colloquial Hinglish."
def add_system_prompt(sample):
    p = 0.2
    add_system_message = bernoulli_sample(p)
    first_message = sample["hindi_messages"][0]
    if add_system_message:
        if first_message["role"] == "system":
            first_message["content"] = f"{first_message['content']} {system_prompt}"
        else:
            system_message = {'content': system_prompt, 'role': 'system'}
            sample["hindi_messages"].insert(0, system_message)
    return {"hindi_messages": sample["hindi_messages"]}

Example output:

{'category': 'Generation',
 'hindi_messages': [{'content': 'You are a native Hindi speaker who can converse at expert level in both Hindi and colloquial Hinglish.',
   'role': 'system'},
  {'content': 'Help write a letter of 100 -200 words to my future self for Kyra, reflecting on her goals and aspirations. First write in English and then translate to Hindi',
   'role': 'user'},
  {'content': "Dear Future Self,\n\nI hope you're happy and proud of what you've achieved. As I write this, I'm excited to think about our goals and how far you've come. One goal was to be a machine learning engineer. I hope you've worked hard and become skilled in this field. Keep learning and innovating. Traveling was important to us. I hope you've seen different places and enjoyed the beauty of our world. Remember the memories and lessons. Starting a family mattered to us. If you have kids, treasure every moment. Be patient, loving, and grateful for your family.\n\nTake care of yourself. Rest, reflect, and cherish the time you spend with loved ones. Remember your dreams and celebrate what you've achieved. Your determination brought you here. I'm excited to see the person you've become, the impact you've made, and the love and joy in your life. Embrace opportunities and keep dreaming big.\n\nWith love,\nKyra\nTranslation of the above into Hindi:\nप्रिय भविष्य स्वयं, मुझे आशा है कि आपने जो हासिल किया है उससे आप खुश और गर्वित हैं। जब मैं यह लिख रहा हूँ, तो मैं अपने लक्ष्यों के बारे में सोचने के लिए उत्साहित हूं और सोचता हूँ कि आप कितनी दूर आ गए हैं। एक लक्ष्य मशीन लर्निंग इंजीनियर बनना था। मुझे आशा है कि आपने कड़ी मेहनत की होगी और इस क्षेत्र में कुशल बनेंगे। सीखते रहें और नवाचार करते रहें। हमारे लिए यात्रा करना महत्वपूर्ण था। मुझे आशा है कि आपने विभिन्न स्थानों को देखा होगा और हमारी दुनिया की सुंदरता का आनंद लिया होगा। यादों और सबक को याद रखें। परिवार शुरू करना हमारे लिए मायने रखता था। अगर आपके बच्चे हैं, तो हर पल को संजो कर रखें। अपने परिवार के लिए धैर्य, प्यार और आभारी रहें। अपना ख्याल रखें। आराम करें, सोचें और अपने प्रियजनों के साथ बिताए समय को संजो कर रखें। अपने सपनों को याद रखें और जो आपने हासिल किया है उसका जश्न मनाएं। आपका दृढ़ संकल्प आपको यहाँ ले आया है। आप जिस व्यक्ति के रूप में उभरे हैं, आपने जो प्रभाव डाला है और आपके जीवन में प्यार और खुशी को देखकर मैं उत्साहित हूं। अवसरों को अपनाएँ और बड़े सपने देखते रहें। प्यार के साथ, कायरा।",
   'role': 'assistant'}]}

Training

The code in training/OpenHathi_Instruct.ipynb uses PEFT for QLoRA and TRL's SFTTrainer for supervised fine-tuning. The model was trained on Google Colab with an A100 40GB GPU. The finetuned model is smangrul/OpenHathi-7B-Hi-v0.1-Instruct.
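As a rough illustration of why QLoRA helps on a single GPU, the sketch below estimates the memory taken by the base model weights alone at 16-bit and at 4-bit precision. It assumes roughly 7e9 parameters (taken from the "7B" in the model name) and ignores activations, optimizer state, LoRA adapter weights, and quantization constants, so it is a back-of-the-envelope figure rather than a measurement from the notebook. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: about 7e9 parameters, read off the "7B" in the model name.
num_params = 7e9

fp16_gb = weight_memory_gb(num_params, 16)
four_bit_gb = weight_memory_gb(num_params, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{four_bit_gb:.1f} GB")
```

Under these assumptions the 4-bit base model needs about a quarter of the weight memory of the 16-bit one, which leaves more of the GPU for activations and the trainable adapter parameters.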

Inference

The code in chatbot/Openhathi_Instruct_Chatbot.ipynb runs on a Google Colab T4 16GB GPU and uses Gradio to create the chatbot with streaming generation capability.

A few sample generations:

[Images: essay_hindi, hinglish_se_english, Closed_QA_hindi]
