BanglaBERT vs. Frontier LLMs: Diagnosing Zero-Shot Collapse in Bangla NLP

25 Jun 2026

Table Of Links

Abstract

I. INTRODUCTION

II. RELATED WORKS

III. BACKGROUND STUDY

IV. CORPUS CREATION

V. IMPLEMENTATION DETAILS

VI. RESULT ANALYSIS & DISCUSSION

VII. FUTURE RESEARCH DIRECTIONS

VIII. CONCLUSION AND REFERENCES

A. Performance Analysis of PLMs

Table I shows the results of hyperparameter optimization for several pre-trained language models, in which key parameters such as learning rate, batch size, and number of epochs were adjusted repeatedly to improve model performance. For example, BanglaBERT achieved its best results with a learning rate of 2e-5, a batch size of 8, and 20 epochs, whereas Bangla BERT Base peaked at a learning rate of 2e-4, a batch size of 8, and 15 epochs. XLM-RoBERTa performed best with the same settings as BanglaBERT. The multilingual model mBERT performed best with a learning rate of 1e-5, a batch size of 8, and 20 epochs. Interestingly, sahajBERT obtained its highest performance with a somewhat smaller batch size of 6, while keeping a learning rate of 2e-5 and 15 epochs.

TABLE I: Hyperparameter Optimization for Diverse Pretrained Language Models

Table II compares performance metrics across the pre-trained language models, examining their efficacy under several evaluation criteria. BanglaBERT performs best, with the highest accuracy, precision, recall, and F1-score among the models: an accuracy of 0.8810, precision of 0.8765, recall of 0.8799, and F1-score of 0.8780.

TABLE II: Performance Metrics Comparison of Various Pretrained Language Models Across Different Evaluation Criteria

Fig. 3: Confusion matrices for BanglaBERT, Bangla BERT Base, XLM-RoBERTa, mBERT, and sahajBERT in political sentiment analysis. Each subfigure displays the model’s classification accuracy across sentiment categories, revealing its strengths and limitations in sentiment prediction.
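The tuned settings reported for Table I could be recorded as a small configuration map, which keeps the per-model values in one place when re-running the fine-tuning. This is only a sketch: the `FineTuneConfig` class, the `BEST_CONFIGS` dictionary, and the `steps_per_epoch` helper are hypothetical names, and the training-set size passed to the helper is a placeholder, since the corpus size is not stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for one model's tuned hyperparameters."""
    learning_rate: float
    batch_size: int
    epochs: int


# Values as reported in the text for Table I; the keys are illustrative labels.
BEST_CONFIGS = {
    "BanglaBERT": FineTuneConfig(2e-5, 8, 20),
    "Bangla BERT Base": FineTuneConfig(2e-4, 8, 15),
    "XLM-RoBERTa": FineTuneConfig(2e-5, 8, 20),  # same settings as BanglaBERT
    "mBERT": FineTuneConfig(1e-5, 8, 20),        # learning rate read as 1e-5
    "sahajBERT": FineTuneConfig(2e-5, 6, 15),
}


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps in one epoch, counting a final partial batch."""
    return -(-num_examples // batch_size)  # ceiling division
```

For instance, `steps_per_epoch(1000, BEST_CONFIGS["sahajBERT"].batch_size)` gives the number of updates per epoch for a placeholder training set of 1000 examples.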

Meanwhile, Bangla BERT Base, XLM-RoBERTa, mBERT, and sahajBERT all perform well but fall significantly below BanglaBERT on the overall metrics. Furthermore, Figure 3 presents the confusion matrices for all the pre-trained language models (PLMs), providing insight into how correctly each model classifies instances across the different sentiment categories.
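To make the link between a confusion matrix and the four reported metrics concrete, the sketch below derives accuracy, precision, recall, and F1-score from a matrix of counts. It assumes macro-averaging over classes, which the section does not confirm, and the function name `metrics_from_confusion` is hypothetical. Any counts passed to it here are toy values, not results from the paper.

```python
def metrics_from_confusion(cm):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] is the number of examples whose true class is i and whose
    predicted class is j. Macro-averaging is an assumption of this sketch.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Calling `metrics_from_confusion([[40, 10], [5, 45]])` on a made-up two-class matrix shows how errors concentrated in one class lower that class's recall while accuracy alone hides the imbalance.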

B. Case Study Discoveries in Large Language Models

What components should be present in a well-organized prompt? A well-crafted prompt for political sentiment analysis should include clear instructions that focus the analysis on the “short description” column. It must supply enough context for accurate sentiment predictions, emphasizing the pertinent language components and relationships within the text. Additionally, incorporating keywords or commands directs the model’s focus, improving its ability to make precise sentiment classifications.
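A prompt with these components might be assembled as in the following sketch. The wording of the instructions is purely illustrative, since the authors’ actual prompt is not reproduced in this section, and `build_prompt` is a hypothetical helper name.

```python
def build_prompt(short_description: str) -> str:
    """Assemble a zero-shot prompt: instructions, context, then the input text.

    The instruction wording is an assumption made for illustration only.
    """
    instructions = (
        "You are analyzing political sentiment in Bangla news text.\n"
        "Read the short description and classify its sentiment.\n"
        "Answer with exactly one word: Positive or Negative."
    )
    return (
        f"{instructions}\n\n"
        f"Short description: {short_description.strip()}\n"
        "Sentiment:"
    )
```

Ending the prompt with `Sentiment:` is one common way to steer the model toward a one-word answer that is easy to parse.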

What might be the underlying causes of zero-shot collapse? Zero-shot collapse in political sentiment analysis may stem from several causes. First, insufficient pre-training data or biases in the training dataset can hinder the model’s ability to generalize across different political contexts or sentiments. Second, the complexity of the sentiment analysis task, coupled with the nuances of political language, is difficult to capture accurately and further degrades the model’s predictions, potentially leading to zero-shot collapse.

What are the benefits of few-shot learning? The advantages of few-shot learning in Political Sentiment Analysis lie in its ability to facilitate model adaptation and generalization to new sentiment classification tasks with minimal labeled data. This approach is particularly beneficial in situations where acquiring extensive labeled data for training is challenging or costly. In 5-shot learning, the model benefits from five labeled examples per sentiment classification task, enabling it to learn key linguistic patterns and features relevant to sentiment analysis while reducing reliance on large annotated datasets.

Similarly, 10-shot learning provides the model with ten labeled examples per task, further enhancing its capacity to understand task-specific characteristics and generalize effectively within the context of political sentiment analysis. With 15-shot learning, the model gains access to an increased number of examples, facilitating even more robust adaptation and generalization across diverse sentiment analysis tasks.
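The 5-, 10-, and 15-shot settings differ only in how many labeled demonstrations precede the query text. The sketch below builds such a prompt from a pool of labeled examples. It assumes the k demonstrations are sampled at random from the pool, which the section does not specify, and `build_few_shot_prompt` is a hypothetical name.

```python
import random


def build_few_shot_prompt(pool, query, k, seed=0):
    """Build a k-shot prompt from a pool of (text, label) pairs.

    Demonstrations are sampled at random with a fixed seed; the sampling
    strategy is an assumption of this sketch. k=0 yields a zero-shot prompt.
    """
    if k > len(pool):
        raise ValueError("k exceeds the number of labeled examples in the pool")
    rng = random.Random(seed)
    demos = rng.sample(pool, k)

    lines = ["Classify the sentiment of each text as Positive or Negative.", ""]
    for text, label in demos:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)
```

Fixing the seed keeps the demonstrations identical across models, so that differences in results reflect the model rather than the choice of examples.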

How is the effectiveness of a prompt evaluated? In political sentiment analysis, a prompt’s effectiveness is gauged by how well it guides the model to predict the correct sentiment label (Positive or Negative) for the provided text. Evaluation involves measuring accuracy, precision, recall, and F1 score on a held-out test set. The control settings for the large language models (LLMs) are temperature 1.0, top-p 1.0, maximum tokens 256, frequency penalty 0.0, and presence penalty 0.0. Table III gives an overview of the results for the zero-shot and few-shot learning approaches in political sentiment analysis.

TABLE III: Performance of 5-shot, 10-shot, and 15-shot learning with GPT-3.5 Turbo and Gemini 1.5 Pro models
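The control settings and the scoring step can be sketched as follows. The dictionary keys are illustrative and would need to be mapped to whichever API is used; `parse_label` and `accuracy` are hypothetical helpers, and treating unparseable outputs as errors is an assumption of this sketch rather than the authors’ stated procedure.

```python
# Control settings as listed in the text; key names are illustrative.
DECODING_SETTINGS = {
    "temperature": 1.0,
    "top_p": 1.0,
    "max_tokens": 256,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}

LABELS = ("Positive", "Negative")


def parse_label(raw_output: str):
    """Return the single label named in the output, or None if absent or ambiguous."""
    lowered = raw_output.lower()
    found = [label for label in LABELS if label.lower() in lowered]
    return found[0] if len(found) == 1 else None


def accuracy(predictions, gold):
    """Share of examples whose parsed label matches the gold label.

    Unparseable predictions (None) count as wrong.
    """
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Because a temperature of 1.0 means the model samples its output, repeated runs on the same test set may produce slightly different scores.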

How does the performance of LLMs compare to traditional methods of political sentiment analysis? As Table III indicates, LLMs such as GPT-3.5 Turbo and Gemini 1.5 Pro handle the various few-shot scenarios effectively and offer substantial benefits over traditional approaches to political sentiment analysis. They consistently attain high accuracy, indicating an ability to capture nuanced sentiments that traditional PLM-based approaches may struggle with.

LLMs also achieve competitive precision and recall, detecting significant political sentiment accurately while minimizing mistakes. Their adaptability to changing circumstances and evolving linguistic patterns increases their usefulness, especially in dynamic sociopolitical environments. In addition, their strong generalization capabilities allow for accurate predictions even with minimal training data, which matters in real-world applications where data availability may be restricted.

How can LLMs be further improved for the task of political sentiment analysis? Chain-of-thought prompting encourages LLMs to present their reasoning steps alongside their sentiment predictions, which makes it possible to detect biases or misconceptions in the reasoning process while fostering transparency and trust in the model’s output. A two-stage prompting technique first identifies sentiment cues in the text and then feeds them to the LLM for analysis, which can improve accuracy by ensuring that the model concentrates on the most informative parts of the text. Finally, keeping humans in the evaluation loop, especially for ambiguous cases, provides vital feedback for improving the model’s performance and identifying biases, ensuring that LLM predictions are consistent with human understanding of political sentiment.
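These three ideas can be combined in a single pipeline, sketched below under stated assumptions. The `llm` argument is a hypothetical callable that takes a prompt string and returns the model’s text; no particular API is implied. The prompt wording, the `two_stage_sentiment` name, and the rule that flags unparseable answers for human review are all illustrative choices.

```python
from typing import Callable, Optional


def _extract_label(line: str) -> Optional[str]:
    """Return Positive or Negative if exactly one of them appears in the line."""
    lowered = line.lower()
    has_pos = "positive" in lowered
    has_neg = "negative" in lowered
    if has_pos != has_neg:
        return "Positive" if has_pos else "Negative"
    return None


def two_stage_sentiment(text: str, llm: Callable[[str], str]) -> dict:
    """Stage 1 extracts sentiment cues; stage 2 reasons over them and classifies."""
    cue_prompt = (
        "List the words or phrases in the text that signal political sentiment, "
        "one per line.\n\nText: " + text
    )
    cues = []
    for line in llm(cue_prompt).splitlines():
        cue = line.strip("-* \t")  # drop simple bullet markers and padding
        if cue:
            cues.append(cue)

    classify_prompt = (
        "Using the text and the sentiment cues, explain your reasoning in one or "
        "two sentences, then give the final answer on the last line as "
        "'Sentiment: Positive' or 'Sentiment: Negative'.\n\n"
        f"Text: {text}\nCues: {', '.join(cues)}"
    )
    response_lines = llm(classify_prompt).strip().splitlines()
    label = _extract_label(response_lines[-1]) if response_lines else None

    return {
        "cues": cues,
        "reasoning": "\n".join(response_lines[:-1]),
        "label": label,
        "needs_review": label is None,  # route unclear cases to a human
    }
```

Returning the reasoning text alongside the label lets a reviewer inspect why a prediction was made, and the `needs_review` flag marks the cases that would be passed to a human annotator.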


This paper is available on arXiv under a CC BY 4.0 license.
