Nepali is spoken by over 17 million people, yet it remains severely underrepresented in large language models. LLaMA 3, Mistral, and GPT-4 all have some multilingual capability, but their Nepali performance lags well behind English, especially for formal written Nepali, technical terminology, and culturally specific contexts. Fine-tuning a LLaMA 3 model on Nepali text can narrow that gap.
This guide covers the complete process: understanding when fine-tuning is the right approach, setting up the QLoRA training pipeline, preparing Nepali datasets, running training on a GPU, and pushing your model to HuggingFace Hub so the community can benefit. This is a technically advanced tutorial — if you're new to LLMs, you may want to start with the RAG chatbot guide first.
When Should You Fine-tune vs RAG vs Prompt Engineering?
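This is the most important question in applied LLM work. Fine-tuning is expensive, time-consuming, and often unnecessary. Before starting, work through the options from cheapest to most expensive: try prompt engineering first; reach for RAG when the model mainly lacks facts or documents it could look up; fine-tune only when the problem is the model's behavior itself, such as weak Nepali fluency or an output style that prompting cannot fix.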
QLoRA: Efficient Fine-tuning with Low-Rank Adapters
Fully fine-tuning an 8B-parameter LLaMA 3 model means updating every weight: roughly 16GB in 16-bit precision, or 32GB in fp32. Once gradients and optimizer states are added, training needs 80GB+ of VRAM. QLoRA (Quantized Low-Rank Adaptation) makes this practical on a single 16–24GB GPU by combining two techniques: 4-bit quantization (about 4x less weight memory than 16-bit) and LoRA adapters (only a tiny fraction of the parameters is trained).
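As a rough sanity check on these memory figures, the storage needed for the weights alone can be estimated from the parameter count and the precision. The sketch below assumes a round 8 billion parameters and ignores gradients, optimizer states, activations, and quantization metadata, so real usage will be higher; the function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: round figure for an 8B-parameter model

for label, bits in [("fp32", 32), ("bf16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
# fp32: 32.0 GB
# bf16: 16.0 GB
# 4-bit: 4.0 GB
```

The 4x gap between the 16-bit and 4-bit rows is where the quantization saving comes from; the rest of the training budget goes to activations and the small set of adapter weights.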
LoRA Mathematics
The key insight behind LoRA is that the weight updates during fine-tuning have a low intrinsic rank — meaning they can be well approximated by a product of two small matrices, even though the full weight matrix is huge.
$$W' = W + \frac{\alpha}{r} BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$
W is the frozen pre-trained weight (d×k). B and A are the trainable LoRA matrices with rank r (typically 4–64). The product BA has rank at most r, so it captures a low-dimensional update to W. The scaling factor α/r controls the magnitude of the update. For d=k=4096 and r=16, LoRA adds just 131K parameters (2 × 4096 × 16 = 131,072) versus 16.7M for the full matrix, a 128x reduction.
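The update rule can be sketched in a few lines of plain Python on toy-sized matrices. This is only an illustration of the arithmetic, not how PEFT implements it; the helper names are made up for this example. It relies on the standard LoRA initialization in which B starts at zero, so the adapted weight equals the original weight before any training.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_update(W, B, A, alpha):
    """Return W + (alpha / r) * B @ A, where r is the inner dimension of B @ A."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


random.seed(0)
d, k, r, alpha = 6, 5, 2, 4  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
B = [[0.0] * r for _ in range(d)]  # B starts at zero
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]

# With B = 0 the update is zero, so the model starts exactly at the pre-trained weights
assert lora_update(W, B, A, alpha) == W

print(lora_param_count(4096, 4096, 16))                # 131072
print(4096 * 4096 // lora_param_count(4096, 4096, 16))  # 128
```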
Step 1: Install Dependencies
# Install all required packages for QLoRA fine-tuning
# (quote each requirement so the shell does not treat ">=" as an output redirect)
pip install "transformers>=4.40.0" \
    "peft>=0.10.0" \
    "datasets>=2.18.0" \
    "accelerate>=0.28.0" \
    "bitsandbytes>=0.43.0" \
    "trl>=0.8.0" \
    huggingface_hub \
    wandb \
    scipy
# Log in to HuggingFace (to access LLaMA 3 — requires approval from Meta)
huggingface-cli login
# Log in to W&B for experiment tracking (optional but recommended)
wandb login
# Check GPU memory
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0)}'); print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
- LLaMA 3 8B with QLoRA (4-bit): Minimum 16GB VRAM (RTX 3090, RTX 4080, A100)
- LLaMA 3 70B with QLoRA: Minimum 48GB VRAM — requires multi-GPU or A100/H100
- Inference only (4-bit): 8B model fits in 8GB VRAM (RTX 3070, RTX 4060 Ti)
- Nepal context: Most Nepali engineers won't have local GPUs, so use Google Colab A100, Kaggle (30 hours/week of free GPU), or RunPod ($0.40/hr for RTX 3090)
HuggingFace Zero GPU on Spaces gives free A100 access for inference. For training, Colab Pro gives A100 access. A typical QLoRA fine-tuning run on a 10k sample Nepali dataset takes about 2–3 hours on an A100 — well within Colab Pro's limits.
Alternatively, use Kaggle Notebooks (30 hours/week free T4 GPU) for smaller experiments, then scale to Colab for full training runs.
Step 2: Load Model in 4-bit Quantization
# load_model.py: Load LLaMA 3 with 4-bit quantization (QLoRA setup)
import os

import torch
from huggingface_hub import login
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

# ── HuggingFace authentication ──
# LLaMA 3 requires accepting Meta's license on hf.co/meta-llama
login(token=os.environ.get("HF_TOKEN"))

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # or Meta-Llama-3-8B-Instruct

# ── 4-bit quantization configuration ──
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",              # NormalFloat4, designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16 (numerically stable)
    bnb_4bit_use_double_quant=True,         # Double quantisation saves extra ~0.4 bits/param
)

print(f"Loading {MODEL_ID} in 4-bit...")
print("Expected VRAM: ~5GB for 8B model in 4-bit")

# ── Load model ──
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # automatically distribute across available GPUs
    torch_dtype=torch.bfloat16,
    trust_remote_code=False,
)

# ── Load tokenizer ──
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=False,
    padding_side="right",  # right padding is needed for SFTTrainer
)

# LLaMA 3 doesn't have a pad token by default
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Check memory usage
print(f"Model loaded. GPU memory used: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Model parameters: {model.num_parameters():,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
Step 3: Prepare the Nepali Dataset
# prepare_dataset.py: Format Nepali training data for SFT
import json
import os

from datasets import Dataset, DatasetDict, load_dataset
from transformers import AutoTokenizer

TOKENIZER_ID = "meta-llama/Meta-Llama-3-8B"
MAX_LENGTH = 2048


# ── Conversation format for LLaMA 3 ──
# LLaMA 3 uses a specific chat template with special tokens
def format_nepali_example(example: dict) -> dict:
    """
    Format training examples as LLaMA 3 chat format.
    Input: {"instruction": "...", "input": "...", "output": "..."}
    Output: {"text": "<|begin_of_text|><|start_header_id|>system..."}
    """
    system_prompt = (
        "तपाईं एक सहायक हुनुहुन्छ जो नेपाली भाषामा राम्रोसँग जवाफ दिन्छ। "
        "सधैं स्पष्ट, सटीक र उपयोगी जानकारी प्रदान गर्नुहोस्।"
    )
    # English: "You are an assistant who responds well in Nepali language.
    # Always provide clear, accurate and useful information."
    instruction = example.get("instruction", "")
    context = example.get("input", "")
    output = example.get("output", "")

    # Build user message ("सन्दर्भ" means "context")
    user_message = instruction
    if context:
        user_message = f"{instruction}\nसन्दर्भ: {context}"

    # LLaMA 3 chat format: every role header is followed by a blank line ("\n\n")
    text = (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{output}<|eot_id|>"
    )
    return {"text": text}


def load_nepali_dataset(data_path: str) -> DatasetDict:
    """
    Load and format a Nepali instruction dataset.
    Expected format: JSONL with instruction/input/output fields.
    Returns a DatasetDict with "train" and "test" splits.
    """
    # Option 1: Load from local JSONL when the path exists on disk
    if os.path.isfile(data_path):
        examples = []
        with open(data_path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    examples.append(json.loads(line))
        dataset = Dataset.from_list(examples)
        print(f"Loaded {len(dataset)} examples from {data_path}")
    # Option 2: Otherwise treat the argument as a HuggingFace Hub dataset ID
    else:
        dataset = load_dataset(data_path, split="train")
        print(f"Loaded {len(dataset)} examples from HuggingFace")

    # Format all examples
    dataset = dataset.map(
        format_nepali_example,
        remove_columns=dataset.column_names,
        desc="Formatting examples",
    )

    # Filter by length (drop examples longer than MAX_LENGTH tokens)
    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)

    def filter_by_length(example):
        return len(tokenizer(example["text"])["input_ids"]) <= MAX_LENGTH

    dataset = dataset.filter(filter_by_length, desc="Filtering by length")
    print(f"After filtering: {len(dataset)} examples")

    # Train/val split
    split = dataset.train_test_split(test_size=0.05, seed=42)
    print(f"Train: {len(split['train'])} | Val: {len(split['test'])}")
    return split


# ── Show sample ──
if __name__ == "__main__":
    # Replace with your actual dataset path
    dataset = load_nepali_dataset("Shushant/nepali-alpaca")
    print("\nSample training example:")
    print(dataset["train"][0]["text"][:500])
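Before formatting a local file, it can help to check that every line really matches the expected instruction/input/output layout. The following is a minimal sketch using only the standard library; the function name and the choice to treat "instruction" and "output" as required while leaving "input" optional are assumptions made for this example, and the sample records exist only for the demonstration.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("instruction", "output")  # assumption: "input" is optional context


def validate_jsonl(path):
    """Return (valid_records, errors) for a JSONL instruction file."""
    valid, errors = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are skipped
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                errors.append(f"line {line_no}: expected a JSON object")
                continue
            missing = [k for k in REQUIRED_FIELDS if not str(record.get(k, "")).strip()]
            if missing:
                errors.append(f"line {line_no}: missing or empty {missing}")
                continue
            valid.append(record)
    return valid, errors


# Demonstration on a temporary file: one good record, one empty instruction, one broken line
sample_lines = [
    json.dumps(
        {"instruction": "नेपालको राजधानी कुन हो?", "input": "", "output": "नेपालको राजधानी काठमाडौं हो।"},
        ensure_ascii=False,
    ),
    json.dumps({"instruction": "", "output": "..."}),
    "{not valid json",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sample.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_lines) + "\n")
    records, problems = validate_jsonl(demo_path)

print(f"{len(records)} valid, {len(problems)} rejected")
for problem in problems:
    print(" ", problem)
```

Running a check like this first makes failures point at a specific line of the data file rather than surfacing later inside the formatting or tokenization step.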
Step 4: Configure LoRA
# configure_lora.py: Set up PEFT LoRA adapters
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training


def configure_lora(model) -> object:
    """
    Apply LoRA adapters to LLaMA 3 for parameter-efficient fine-tuning.
    Only the LoRA adapter weights (A and B matrices) will be trained.
    """
    # ── Prepare model for k-bit training ──
    # This enables gradient checkpointing and handles dtype casting
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True,  # saves memory, trains slower
    )

    # ── LoRA Configuration ──
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,  # language modelling objective
        # Rank of LoRA matrices (higher = more params = better but slower)
        # r=16 is a good default; try r=8 for less data, r=32 for more
        r=16,
        # LoRA scaling factor: effective update = (alpha/r) * BA
        # alpha=32 with r=16 gives scaling of 2.0
        lora_alpha=32,
        # Which weight matrices to apply LoRA to
        # For LLaMA: attention projections + MLP gates
        target_modules=[
            "q_proj",     # query projection
            "k_proj",     # key projection
            "v_proj",     # value projection
            "o_proj",     # output projection
            "gate_proj",  # MLP gate
            "up_proj",    # MLP up
            "down_proj",  # MLP down
        ],
        # Dropout on LoRA layers (regularisation)
        lora_dropout=0.05,
        # Don't update biases (saves params)
        bias="none",
        # Inference mode is off during training
        inference_mode=False,
    )

    # ── Apply LoRA to model ──
    model = get_peft_model(model, lora_config)

    # ── Print trainable parameter count ──
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    pct = 100 * trainable_params / total_params
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,} ({pct:.3f}%)")
    print(f"Frozen parameters: {total_params - trainable_params:,}")
    return model
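The trainable-parameter count printed by this function can be predicted by hand, which is a useful check that the adapters landed on the modules you intended. The sketch below assumes the layer shapes from the published LLaMA 3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, and key/value projections of width 1024 because of grouped-query attention); the helper name is illustrative.

```python
# Assumed LLaMA 3 8B shapes, taken from the published model config
HIDDEN = 4096
INTERMEDIATE = 14336
NUM_LAYERS = 32
KV_DIM = 1024  # 8 key/value heads x head dimension 128 (grouped-query attention)

# (in_features, out_features) of each linear layer that receives an adapter
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r, target_modules):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) parameters."""
    per_layer = sum(
        r * (n_in + n_out)
        for name, (n_in, n_out) in MODULE_SHAPES.items()
        if name in target_modules
    )
    return per_layer * NUM_LAYERS


all_linear = lora_trainable_params(16, list(MODULE_SHAPES))
attention_only = lora_trainable_params(16, ["q_proj", "k_proj", "v_proj", "o_proj"])
print(f"All seven modules, r=16: {all_linear:,}")     # 41,943,040
print(f"Attention only, r=16:    {attention_only:,}")  # 13,631,488
```

Roughly 42M trainable parameters against about 8B frozen ones is around half a percent of the model, which is why the optimizer state stays small.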
Step 5: Full Training with SFTTrainer
# train_nepali_llama.py: Complete QLoRA training script
import wandb
from trl import SFTConfig, SFTTrainer

from configure_lora import configure_lora
from load_model import model, tokenizer
from prepare_dataset import load_nepali_dataset

# ── Weights & Biases tracking ──
wandb.init(
    project="nepali-llama3-finetune",
    name="qlora-8b-nepali-v1",
    config={
        "model": "Meta-Llama-3-8B",
        "method": "QLoRA",
        "language": "Nepali",
        "lora_r": 16,
        "lora_alpha": 32,
    },
)

# ── Apply LoRA ──
model = configure_lora(model)

# ── Load dataset ──
dataset = load_nepali_dataset("Shushant/nepali-alpaca")

# ── Training configuration ──
# Note: some argument names here (max_seq_length, eval_strategy, and tokenizer=
# on SFTTrainer) have changed across TRL and transformers releases; check the
# documentation for the versions you have installed.
training_args = SFTConfig(
    output_dir="./outputs/nepali-llama3-qlora",
    # Training duration
    num_train_epochs=3,
    max_steps=-1,  # -1 = use num_train_epochs
    # Batch size (micro-batching for gradient accumulation)
    per_device_train_batch_size=2,  # 2 examples per GPU per step
    gradient_accumulation_steps=8,  # effective batch = 2×8 = 16
    per_device_eval_batch_size=4,
    # Optimiser
    optim="paged_adamw_8bit",  # memory-efficient 8-bit AdamW
    learning_rate=2e-4,
    weight_decay=0.001,
    max_grad_norm=0.3,  # gradient clipping
    # Learning rate schedule
    warmup_ratio=0.03,           # 3% of steps for warmup
    lr_scheduler_type="cosine",  # cosine decay
    # Precision
    fp16=False,
    bf16=True,  # bfloat16 needs an Ampere or newer GPU; on a T4 use fp16=True, bf16=False
    # Sequence length
    max_seq_length=2048,
    # Evaluation & checkpointing
    eval_strategy="steps",  # called evaluation_strategy in older transformers releases
    eval_steps=100,
    save_strategy="steps",
    save_steps=200,          # must stay a multiple of eval_steps for load_best_model_at_end
    save_total_limit=3,      # keep only 3 latest checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    # Logging
    logging_dir="./logs",
    logging_steps=25,
    report_to="wandb",
    # Packing short sequences for efficiency
    packing=False,  # set True if most examples are short
    # Dataset formatting
    dataset_text_field="text",
)

# ── Initialise trainer ──
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

# ── Start training ──
print("Starting QLoRA fine-tuning of LLaMA 3 on Nepali text...")
print(f"Training examples: {len(dataset['train'])}")
print(f"Validation examples: {len(dataset['test'])}")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print()
trainer.train()

# ── Save the final model ──
print("\nSaving fine-tuned model...")
trainer.save_model("./outputs/nepali-llama3-qlora/final")
tokenizer.save_pretrained("./outputs/nepali-llama3-qlora/final")
wandb.finish()
print("Training complete!")
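To see how the batch and epoch settings translate into optimizer steps, a quick estimate helps when choosing `eval_steps` and `save_steps`. This is an approximation under stated assumptions: a 10,000-example training set (the size mentioned earlier in this guide), a single GPU, and simple rounding. The Trainer's exact step count can differ slightly depending on version and on how the last partial batch is handled.

```python
import math


def estimate_schedule(num_examples, per_device_batch, grad_accum, epochs, warmup_ratio, num_gpus=1):
    """Approximate optimizer-step counts for a training run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = estimate_schedule(
    num_examples=10_000,  # assumption: illustrative dataset size
    per_device_batch=2,
    grad_accum=8,
    epochs=3,
    warmup_ratio=0.03,
)
print(plan)
# {'effective_batch': 16, 'steps_per_epoch': 625, 'total_steps': 1875, 'warmup_steps': 57}
```

With about 1,875 steps in total, evaluating every 100 steps gives roughly 18 validation points over the run.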
Training Hyperparameter Reference
| Hyperparameter | Value Used | Range | Effect |
|---|---|---|---|
| LoRA rank (r) | 16 | 4–64 | Higher = more capacity, more VRAM, slower |
| LoRA alpha (α) | 32 | 8–128 | α/r scaling factor; 2.0 is a good default |
| LoRA dropout | 0.05 | 0–0.1 | Regularisation; 0 if you have lots of data |
| Learning rate | 2e-4 | 1e-5 – 5e-4 | Higher = faster but can diverge |
| Batch size (effective) | 16 | 8–64 | Larger = more stable gradients, needs more memory |
| Epochs | 3 | 1–5 | More data → fewer epochs needed |
| Warmup ratio | 0.03 | 0.01–0.05 | % of steps for LR warmup |
| LR scheduler | cosine | linear/cosine/constant | Cosine decay tends to give cleaner convergence |
| Quantization | NF4 4-bit | 4-bit/8-bit | 4-bit saves most memory; 8-bit is more stable |
| Max sequence length | 2048 | 512–4096 | Longer = better for long texts, needs more VRAM |
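The warmup ratio and cosine scheduler rows interact, and it can be easier to reason about them by computing the learning rate at a few steps. The function below is a sketch of the usual linear-warmup-then-cosine-decay shape, written from the formula rather than copied from any library; the step counts in the usage lines are illustrative.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-4, warmup_steps=0):
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000   # illustrative
WARMUP_STEPS = 30    # 3% of the steps, matching warmup_ratio=0.03

for step in (0, 15, 30, 515, 1000):
    print(f"step {step:>4}: lr = {cosine_lr(step, TOTAL_STEPS, 2e-4, WARMUP_STEPS):.2e}")
```

The rate climbs to the peak by the end of warmup, is at half the peak midway through the decay, and reaches zero at the final step.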
Training Loss — Expected Progress
- Early phase: rapid loss drop as the model adapts to Nepali script and instruction format
- Middle phase: gradual improvement as it learns vocabulary and grammar patterns
- Late phase: fine-grained tuning of cultural context and idiomatic expressions
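A loss value is easier to interpret once converted to perplexity, which is the exponential of the mean per-token cross-entropy (in nats) that the trainer reports as `eval_loss`. The loss values below are hypothetical placeholders chosen only to show the conversion; they are not measurements from this training run.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity = exp(mean per-token cross-entropy in nats)."""
    return math.exp(mean_cross_entropy)


# Hypothetical eval_loss values, for illustration only
for loss in (2.0, 1.5, 1.0):
    print(f"eval_loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")
```

Perplexity depends on the tokenizer, so it is best used to compare checkpoints of the same model rather than different models.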
Step 6: Run Inference
# inference.py: Run inference with the fine-tuned Nepali model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL_ID = "meta-llama/Meta-Llama-3-8B"
ADAPTER_PATH = "./outputs/nepali-llama3-qlora/final"

# ── Option 1: Load base model + LoRA adapters separately ──
# (Useful during development, since adapters can be swapped easily)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

# ── Option 2: Merge LoRA into base model (faster inference) ──
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained("./merged_model")

# The training examples end each turn with <|eot_id|>, so generation must stop
# on that token as well as on the tokenizer's default EOS token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]


def generate_nepali(
    instruction: str,
    context: str = "",
    max_new_tokens: int = 512,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> str:
    """Generate a Nepali response for a given instruction."""
    # Use the same system prompt and layout as in training (prepare_dataset.py)
    system = (
        "तपाईं एक सहायक हुनुहुन्छ जो नेपाली भाषामा राम्रोसँग जवाफ दिन्छ। "
        "सधैं स्पष्ट, सटीक र उपयोगी जानकारी प्रदान गर्नुहोस्।"
    )
    user_msg = f"{instruction}\nसन्दर्भ: {context}" if context else instruction
    prompt = (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=terminators,
            repetition_penalty=1.1,
        )
    # Decode only the new tokens (not the prompt)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return response.strip()


# ── Test examples ──
examples = [
    {
        "instruction": "नेपालको राजधानी सहरको बारेमा बताउनुहोस्।",
        "description": "Simple factual question about Nepal",
    },
    {
        "instruction": "मेसिन लर्निङ भनेको के हो? सरल भाषामा बुझाउनुहोस्।",
        "description": "Explain machine learning in Nepali",
    },
    {
        "instruction": "एउटा Python कोड लेख्नुहोस् जसले नेपाली वर्णमाला प्रिन्ट गर्छ।",
        "description": "Code generation in Nepali",
    },
]

print("Testing fine-tuned Nepali LLaMA 3:\n")
for ex in examples:
    print(f"Task: {ex['description']}")
    print(f"Instruction: {ex['instruction']}")
    response = generate_nepali(ex["instruction"])
    print(f"Response: {response}")
    print("-" * 60 + "\n")
Step 7: Push to HuggingFace Hub
# push_to_hub.py: Share your fine-tuned model with the community
import os

import torch
from huggingface_hub import HfApi, login
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# -- Authentication --
# Read the token from the environment rather than hard-coding it in the script
login(token=os.environ.get("HF_TOKEN"))

REPO_ID = "hexcodenepal/llama3-8b-nepali-qlora"  # your HuggingFace username/model-name
BASE_MODEL_ID = "meta-llama/Meta-Llama-3-8B"
ADAPTER_PATH = "./outputs/nepali-llama3-qlora/final"
MERGED_DIR = "./merged_llama3_nepali"

# -- Load the base model and the trained adapter (same as in inference.py) --
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)

# -- Option A: Push LoRA adapter only (small, fast) --
# Others can use it with the base model
model.push_to_hub(
    REPO_ID,
    private=False,  # True if you don't want public access
    safe_serialization=True,
)
tokenizer.push_to_hub(REPO_ID)

# -- Option B: Push merged model (larger but no dependency on base) --
print("Merging LoRA adapters into base model...")
merged = model.merge_and_unload()
merged.save_pretrained(MERGED_DIR, safe_serialization=True)
tokenizer.save_pretrained(MERGED_DIR)

# -- Create Model Card (written before the upload so it is included) --
model_card = (
    "---\n"
    "language: ne\n"
    "license: llama3\n"
    "base_model: meta-llama/Meta-Llama-3-8B\n"
    "tags:\n"
    " - llama3\n"
    " - nepali\n"
    " - qlora\n"
    "---\n\n"
    "# LLaMA 3 8B Fine-tuned on Nepali Text\n\n"
    "QLoRA fine-tuned version of Meta-Llama-3-8B on Nepali instruction data.\n\n"
    "## Training Details\n"
    "- Base model: meta-llama/Meta-Llama-3-8B\n"
    "- Method: QLoRA (4-bit NF4 + LoRA rank 16)\n"
    "- Dataset: Nepali Alpaca + custom data\n"
    "- Training time: ~3 hours on 1x A100 40GB\n\n"
    "Developed by HexCode Nepal.\n"
)
with open(os.path.join(MERGED_DIR, "README.md"), "w", encoding="utf-8") as f:
    f.write(model_card)

api = HfApi()
api.create_repo(repo_id=REPO_ID + "-merged", private=False, exist_ok=True)
api.upload_folder(
    folder_path=MERGED_DIR,
    repo_id=REPO_ID + "-merged",
    repo_type="model",
)
print(f"Model pushed to: https://huggingface.co/{REPO_ID}")
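Anyone who downloads the published adapter has to rebuild exactly the prompt layout used in training, so it is worth keeping that layout in one small helper that both the dataset formatter and the inference script could import. The sketch below is one way to do that, assuming the LLaMA 3 chat layout in which each role header is followed by a blank line; the function name is hypothetical and not part of any library.

```python
SYSTEM_PROMPT = (
    "तपाईं एक सहायक हुनुहुन्छ जो नेपाली भाषामा राम्रोसँग जवाफ दिन्छ। "
    "सधैं स्पष्ट, सटीक र उपयोगी जानकारी प्रदान गर्नुहोस्।"
)


def build_llama3_prompt(user_message, system_prompt=SYSTEM_PROMPT, assistant_reply=None):
    """
    Build a LLaMA 3 chat-format string.
    With assistant_reply=None the string ends at the assistant header (inference).
    With a reply, the full training example is returned.
    """
    parts = [
        "<|begin_of_text|>",
        f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>",
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>",
        "<|start_header_id|>assistant<|end_header_id|>\n\n",
    ]
    if assistant_reply is not None:
        parts.append(f"{assistant_reply}<|eot_id|>")
    return "".join(parts)


question = "नेपालको राजधानी कुन हो?"
inference_prompt = build_llama3_prompt(question)
training_text = build_llama3_prompt(question, assistant_reply="नेपालको राजधानी काठमाडौं हो।")

# The inference prompt is an exact prefix of the training text,
# so the model sees the same context it was trained on.
assert training_text.startswith(inference_prompt)
print(inference_prompt)
```

Building both strings from the same function removes the risk of the system prompt or header spacing drifting apart between training and inference.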
Nepali Dataset Sources
| Dataset | Source | Size | Type | License |
|---|---|---|---|---|
| Nepali Alpaca | Shushant/nepali-alpaca (HuggingFace) | ~52k examples | Instruction following | Apache 2.0 |
| Nepali Wikipedia | HuggingFace datasets: wikipedia (ne) | ~50k articles | Pre-training / knowledge | CC-BY-SA |
| Nepali News Corpus | GitHub: sanjaalcorps/NepaliNLP | ~100k articles | Language modelling | CC BY |
| FLORES-200 Nepali | Meta/flores-200 (HuggingFace) | ~1k sentences | Translation benchmark | CC BY-SA 4.0 |
| Nepali StorySet | GitHub: nepali-nlp | ~10k stories | Text generation | MIT |
| Oscar Nepali | oscar-corpus/OSCAR-2301 (ne) | ~200MB | Web crawl / pre-training | CC0 |
| Custom crawl | Wikipedia, Kantipur, eKantipur, Nagarik | Variable | Domain-specific | Check individual TOS |
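Web-crawled sources in particular tend to mix Nepali with English navigation text and stray whitespace, so a light cleaning pass is usually worthwhile before training. The following is a minimal sketch using only the standard library: it normalizes Unicode, collapses whitespace, and keeps a document only if most of its letters fall in the Devanagari block (U+0900 to U+097F). The thresholds and function names are assumptions for illustration, and because Hindi and several other languages share the script, this filters by script rather than by language.

```python
import unicodedata


def clean_nepali_text(text: str) -> str:
    """Normalize to NFC and collapse all runs of whitespace to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def devanagari_ratio(text: str) -> float:
    """Share of letters and combining marks that lie in the Devanagari block."""
    letters = [
        ch for ch in text
        if ch.isalpha() or unicodedata.category(ch).startswith("M")
    ]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return devanagari / len(letters)


def keep_document(text: str, min_ratio: float = 0.7, min_chars: int = 20) -> bool:
    """Keep documents that are long enough and mostly written in Devanagari."""
    cleaned = clean_nepali_text(text)
    return len(cleaned) >= min_chars and devanagari_ratio(cleaned) >= min_ratio


docs = [
    "नेपाल एक सुन्दर देश हो जहाँ हिमाल र पहाडहरू छन्।",
    "Click here to subscribe to our newsletter and read more.",
    "  नेपाल \n\n देश ",
]
for doc in docs:
    print(keep_document(doc), repr(clean_nepali_text(doc))[:60])
```

The first document is kept, the English line is dropped by the script ratio, and the last one is dropped for being too short after cleaning.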
The Path Forward for Nepali NLP
You now have a complete QLoRA fine-tuning pipeline for adapting LLaMA 3 to Nepali text. The techniques here (4-bit quantization, LoRA adapters, and SFTTrainer) are among the most widely used tools for efficient LLM fine-tuning. With them, a single engineer with access to a rented GPU can produce models that would have required a team and a significant compute budget just two years ago.
Nepali NLP is at an exciting inflection point. The language has enough online presence to gather meaningful training data, but is underrepresented enough that well-trained models create genuine value. Applications like Nepali language customer service bots, government document assistants, educational tools for students in rural Nepal, and Nepali content generation systems are all within reach.
Please open-source what you build. Push your datasets and models to HuggingFace Hub. Write about your process. The Nepali NLP community is small but growing, and every contribution — whether a 500-example dataset or a fine-tuned model — moves the field forward for everyone.