A multi-level contextual classification model for English-to-Tamil sentiment analysis can be built on top of Google's Gemma models, which may improve results on Tamil text in particular. Here's a structured approach:
1. Data Preparation:
Dataset Creation: Compile a dataset containing English sentences, their Tamil translations, sentiment labels (positive/negative), and, for positive sentiments, an additional label indicating respect.
Example Data Structure:
| English Text | Tamil Translation | Sentiment | Respect (if Positive) |
|--------------|-------------------|-----------|-----------------------|
| I am very happy to meet you | உங்களை சந்திப்பதில் மிகவும் மகிழ்ச்சி | Positive | Respect |
| I am disappointed with your work | உங்கள் வேலைக்கு நான் வருத்தப்படுகிறேன் | Negative | |
| You have done an excellent job, well done | நீங்கள் சிறந்த வேலை செய்தீர்கள், நல்லது | Positive | Respect |
| This is not good, I expected better | இது நல்லதல்ல, நான் நல்லவை எதிர்பார்த்தேன் | Negative | |
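The labelling rule in this structure (a respect label only on positive rows) can be checked mechanically before any training starts. The following is a minimal sketch, not part of the original pipeline: the helper name `validate_row` is hypothetical, and it assumes each row is a plain dictionary with `sentiment` and `respect` keys whose values are lowercase strings or `None`.

```python
ALLOWED_SENTIMENTS = {"positive", "negative"}
ALLOWED_RESPECT = {"respect", "not respect"}


def validate_row(row):
    """Return a list of problems found in one dataset row (empty if valid)."""
    problems = []
    sentiment = row.get("sentiment")
    respect = row.get("respect")

    if sentiment not in ALLOWED_SENTIMENTS:
        problems.append(f"unknown sentiment: {sentiment!r}")
    # Positive rows must carry a second-level label.
    if sentiment == "positive" and respect not in ALLOWED_RESPECT:
        problems.append("positive row needs a respect label")
    # Negative rows must leave the second-level label empty.
    if sentiment == "negative" and respect is not None:
        problems.append("negative row must not carry a respect label")
    return problems


# Illustrative rows only
print(validate_row({"sentiment": "positive", "respect": "respect"}))
print(validate_row({"sentiment": "negative", "respect": "respect"}))
```

Running a check like this over every row catches mislabelled examples early, before they leak into either classification stage.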
2. Model Architecture:
Stage 1: Sentiment Classification (Positive/Negative) using Gemma.
Stage 2: For Positive sentiments, classify as Respect/Not Respect.
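The two stages form a simple cascade: every text passes through stage 1, and only texts judged positive reach stage 2. The sketch below illustrates that control flow only. The function `classify_two_stage` and the keyword-based `toy_sentiment` and `toy_respect` stand-ins are hypothetical; in practice each would be replaced by a trained classifier.

```python
from typing import Callable, Optional, Tuple


def classify_two_stage(
    text: str,
    sentiment_fn: Callable[[str], str],
    respect_fn: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Run stage 1 on every text and stage 2 only on positive ones."""
    sentiment = sentiment_fn(text)
    if sentiment != "positive":
        return sentiment, None  # stage 2 is skipped for negative texts
    return sentiment, respect_fn(text)


# Toy stand-ins (keyword rules for illustration, not real models)
def toy_sentiment(text: str) -> str:
    return "positive" if "happy" in text.lower() else "negative"


def toy_respect(text: str) -> str:
    return "respect" if "you" in text.lower() else "not respect"


print(classify_two_stage("I am very happy to meet you", toy_sentiment, toy_respect))
print(classify_two_stage("This is not good", toy_sentiment, toy_respect))
```

Passing the two stages in as callables keeps the cascade logic independent of whichever models end up behind it.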
3. Implementation Steps:
    import pandas as pd
    import torch
    from sklearn.model_selection import train_test_split
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Sample dataset
    data = {
        'text_english': [
            "I am very happy to meet you",
            "I am disappointed with your work",
            "You have done an excellent job, well done",
            "This is not good, I expected better",
            "Thank you very much for your support",
            "I don't like your attitude",
            "I'm grateful for your guidance",
            "Your work lacks quality",
            "Well done, you've made us proud",
            "I appreciate your effort",
            "You are an inspiration",
            "I regret working with you",
        ],
        'text_tamil': [
            "உங்களை சந்திப்பதில் மிகவும் மகிழ்ச்சி",
            "உங்கள் வேலைக்கு நான் வருத்தப்படுகிறேன்",
            "நீங்கள் சிறந்த வேலை செய்தீர்கள், நல்லது",
            "இது நல்லதல்ல, நான் நல்லவை எதிர்பார்த்தேன்",
            "உங்கள் உதவிக்காக மிகவும் நன்றி",
            "உங்கள் அணுகுமுறை எனக்கு பிடிக்கவில்லை",
            "உங்கள் வழிகாட்டலுக்கு நான் நன்றி கூறுகிறேன்",
            "உங்கள் வேலை தரம் குறைவாக உள்ளது",
            "நல்ல செய்தி, நீங்கள் எங்களை பெருமைப்படுத்தினீர்கள்",
            "உங்கள் முயற்சியை நான் பாராட்டுகிறேன்",
            "நீங்கள் ஒரு பேரனுபவம்",
            "உங்களுடன் பணியாற்றியது வருத்தமாக உள்ளது",
        ],
        'sentiment': [
            'positive', 'negative', 'positive', 'negative',
            'positive', 'negative', 'positive', 'negative',
            'positive', 'positive', 'positive', 'negative',
        ],
        'respect': [
            'respect', None, 'respect', None,
            'respect', None, 'respect', None,
            'respect', 'respect', 'respect', None,
        ],
    }

    # Convert data to DataFrame
    df = pd.DataFrame(data)

    # Split the data
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
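    # Initialize tokenizer and model
    # (google/gemma-2b is a gated checkpoint: access must be granted on Hugging Face first)
    tokenizer = AutoTokenizer.from_pretrained('google/gemma-2b')
    model = AutoModelForSequenceClassification.from_pretrained('google/gemma-2b', num_labels=2)

    # Dataset wrapper: Trainer expects each item to be a dict holding
    # the tokenized inputs together with a 'labels' entry
    class SentimentDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {key: tensor[idx] for key, tensor in self.encodings.items()}
            item['labels'] = self.labels[idx]
            return item

    # Encoding function: tokenizes the texts and maps positive -> 1, negative -> 0
    def encode_data(texts, sentiments):
        inputs = tokenizer(texts.tolist(), return_tensors="pt", padding=True, truncation=True)
        labels = torch.tensor([1 if sentiment == "positive" else 0 for sentiment in sentiments])
        return SentimentDataset(inputs, labels)

    # Encode training and testing data
    train_dataset = encode_data(train_df['text_tamil'], train_df['sentiment'])
    test_dataset = encode_data(test_df['text_tamil'], test_df['sentiment'])

    # Training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
        num_train_epochs=3,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
    )

    # Trainer (takes the datasets, not the bare label tensors)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    # Train the model
    trainer.train()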
    # Function for multi-level classification
    import random  # only needed for the stage-2 placeholder below

    def classify_text(text, model, tokenizer):
        # Tokenize and move the inputs to the device the model lives on
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        sentiment = 'positive' if torch.argmax(probs, dim=-1).item() == 1 else 'negative'
        respect = None
        if sentiment == 'positive':
            # Placeholder for respect detection logic: a random guess
            # standing in for a trained stage-2 classifier
            respect_prob = random.choice([0.8, 0.2])
            respect = 'respect' if respect_prob > 0.5 else 'not respect'
        return sentiment, respect

    # Example classification
    model.eval()  # switch off dropout for inference
    for text in test_df['text_tamil'].tolist():
        sentiment, respect = classify_text(text, model, tokenizer)
        print(f"Text: {text} | Sentiment: {sentiment} | Respect: {respect}")
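The respect step above is only a random placeholder. Replacing it with a trained stage-2 classifier would first need a training set restricted to positive texts. The sketch below shows one way to build it from parallel lists. The helper `build_respect_examples` and the short example rows are hypothetical. Note that every positive row in the sample dataset is labelled `respect`, so real training data would also need `not respect` examples before stage 2 could learn anything.

```python
def build_respect_examples(texts, sentiments, respects):
    """Keep only positive rows and encode respect as 1, anything else as 0."""
    examples = []
    for text, sentiment, respect in zip(texts, sentiments, respects):
        if sentiment != "positive":
            continue  # stage 2 never sees negative texts
        examples.append((text, 1 if respect == "respect" else 0))
    return examples


# Illustrative rows only; the 'not respect' row is invented for the demo
texts = ["Thank you very much", "Nice one, mate", "Your work lacks quality"]
sentiments = ["positive", "positive", "negative"]
respects = ["respect", "not respect", None]

for text, label in build_respect_examples(texts, sentiments, respects):
    print(label, text)
```

The resulting pairs could then be tokenized and passed to a second sequence-classification model, trained the same way as the sentiment model.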