## Summary Observation

This page records benchmark runs of recent models against the GLUE benchmark.

Neural networks' ability to judge whether a sentence is grammatically well-formed remains poor: our best CoLA result is a Matthews correlation of 0.66, while the leaderboard tops out around 0.75.
## Hyperparameters considered

- learning rate
- training time (epochs)
- choice of model
- model size
- batch size
- optimizer
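These knobs can be enumerated into a run grid before launching fine-tuning jobs. A minimal sketch (the `sweep` helper is our own illustration, and the values shown are a subset of the WNLI runs reported later, not the full grid actually used):

```python
from itertools import product

def sweep(models, epochs, lrs, batch_sizes):
    """Yield one run configuration per point in the hyperparameter grid."""
    for model, n_epochs, lr, batch in product(models, epochs, lrs, batch_sizes):
        yield {"model": model, "epochs": n_epochs, "lr": lr, "batch": batch}

# Example: part of the bert-base-uncased WNLI sweep below
runs = list(sweep(models=["bert-base-uncased"],
                  epochs=[5, 20],
                  lrs=[2e-5, 2e-6, 2e-7],
                  batch_sizes=[32]))
# 1 model x 2 epoch settings x 3 learning rates x 1 batch size = 6 runs
```

Each dict can then be passed to whatever fine-tuning entry point is in use.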
## GLUE benchmark description

| Benchmark | Dataset size | Metric | Comments |
|-----------|--------------|--------|----------|
| CoLA  | 10.657k  | Matthews corr. | Grammar; drawn from 23 linguistics publications; single sentence |
| SST-2 | 70.045k  | Accuracy | Sentiment |
| MRPC  | 6.212k   | F1/Acc | Paraphrase, sentence pair |
| STS-B | 8.631k   | Pearson/Spearman corr. | Sentence-pair similarity |
| QQP   | 795.241k | F1/Acc | Sentence pair |
| MNLI  | 431.997k | Accuracy | Multi-NLI, matched and mismatched |
| QNLI  | 116.672k | Accuracy | Question-NLI |
| RTE   | 5.770k   | Accuracy | |
| WNLI  | 0.858k   | Accuracy | Winograd-NLI |
| SNLI  | 569.036k | Accuracy | |
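The Metric column maps to standard formulas. A self-contained numpy sketch of the three non-accuracy metrics (function names are ours; in practice the `scipy`/`sklearn` implementations would be used):

```python
import numpy as np

def matthews_corr(y_true, y_pred):
    """CoLA metric: Matthews correlation coefficient for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom else 0.0

def pearson(a, b):
    """STS-B metric: linear correlation between predicted and gold scores."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """STS-B metric: Pearson correlation on rank-transformed scores
    (this simple rank uses argsort twice and ignores tie handling)."""
    rank = lambda x: np.argsort(np.argsort(np.asarray(x)))
    return pearson(rank(a), rank(b))
```

Accuracy and F1 (MRPC, QQP) are the usual classification metrics and are omitted for brevity.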
## DeBERTa: Decoding-enhanced BERT with disentangled attention

- Disentangled attention mechanism: each word is represented by two vectors, one for content and one for position; attention weights among words are computed from disentangled matrices over contents and relative positions.
- Enhanced mask decoder: incorporates absolute positions in the decoding layer when predicting masked tokens.
- Needs only half the training data to beat older models.
- deberta-large runs with batch size 8; larger batch sizes run out of memory (WNLI).
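To make the disentangled-attention bullet concrete: the raw score is the sum of content-to-content, content-to-position, and position-to-content terms, scaled by 1/sqrt(3d) as in the DeBERTa paper. A toy single-head numpy sketch; all tensors here are random placeholders, not real model weights:

```python
import numpy as np

def rel_idx(i, j, k):
    # Clipped relative distance delta(i, j), mapped into [0, 2k)
    return int(np.clip(i - j, -k, k - 1)) + k

def disentangled_attention(H, P, Wq, Wk, Wqr, Wkr, k=4):
    """Toy single-head attention: H is (n, d) content states,
    P is (2k, d) relative-position embeddings."""
    n, d = H.shape
    Qc, Kc = H @ Wq, H @ Wk      # content query/key
    Qr, Kr = P @ Wqr, P @ Wkr    # relative-position query/key
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            c2c = Qc[i] @ Kc[j]                 # content -> content
            c2p = Qc[i] @ Kr[rel_idx(i, j, k)]  # content -> position
            p2c = Kc[j] @ Qr[rel_idx(j, i, k)]  # position -> content
            A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)
    A = np.exp(A - A.max(axis=1, keepdims=True))  # row-wise softmax
    return A / A.sum(axis=1, keepdims=True)

# Random illustration: 5 tokens, hidden size 8, clipping window k=4
rng = np.random.default_rng(0)
n, d, k = 5, 8, 4
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
Wq, Wk, Wqr, Wkr = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
attn = disentangled_attention(H, P, Wq, Wk, Wqr, Wkr, k=k)
```

The real model vectorizes the double loop and shares the content key/query projections across heads; this sketch only shows where the three score terms come from.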
- Funnel-Transformer: "Filtering out Sequential Redundancy for Efficient Language Processing"
CoLA is the worst-performing benchmark on the leaderboard, topping out around 0.75; we reach 0.66 with funnel-transformer-xlarge.

MRPC, RTE, SST-2, and STS-B eval results are within 5% of the leaderboard.

The MRPC score with deberta-large is very respectable at Acc/F1 = 0.91/0.93, almost on par with the leaderboard's 0.92/0.94. MRPC is a paraphrase benchmark, so the most advanced models appear to be good at detecting semantic similarity at the sentence level.

We have not yet run the NLI benchmarks other than WNLI, where the best result was only 0.563 accuracy.
## Model comparisons and parameters

| Model | # Parameters | Comments |
|-------|--------------|----------|
| BERT-base | 110M | |
| BERT-large | 345M | |
| DeBERTa-1.5 | 1.5B | Surpasses human performance on SuperGLUE |
| DeBERTa-base | 134M | |
| DeBERTa-large | 390M | |
| Benchmark | Model / Parameters | Eval Acc |
|-----------|--------------------|----------|
| WNLI | bert-base-uncased, 20 epochs, LR=2e-5, batch=32 | 0.338 |
| WNLI | bert-base-uncased, 5 epochs, LR=2e-4, batch=32 | 0.437 |
| WNLI | bert-base-uncased, 5 epochs, LR=2e-6, batch=32 | 0.563 |
| WNLI | bert-base-uncased, 5 epochs, LR=2e-7, batch=32 | 0.563 |
| WNLI | textattack/bert-base-uncased-WNLI, 5 epochs, LR=2e-5, batch=32 | 0.5 |
| WNLI | textattack/bert-base-uncased-WNLI, 5 epochs, LR=5e-5, batch=64 | 0.5 |
| WNLI | textattack/bert-base-uncased-WNLI, 5 epochs, LR=5e-5, batch=32, MaxSeq=256 | 0.5 |
## Attempts at the large models: GLUE (Shibainu), 5 epochs, batch size = 8
| Benchmark | Model | Eval Acc/F1 |
|-----------|-------|-------------|
| MRPC | deberta-large | 0.911/0.936 |
| MRPC | funnel-transformer-large (batch=8) | 0.909/0.935 |
| MRPC | ernie-2.0-large | 0.895/0.925 |
| MRPC | funnel-transformer-large (batch=16) | 0.889/0.919 |
| MRPC | funnel-transformer-xlarge (batch=8) | 0.887/0.919 |
| MRPC | bert-large-uncased | 0.882/0.917 |
| MRPC | deberta-base (3 epochs) | 0.877/0.913 |
| MRPC | albert-large-v1 | 0.870/0.907 |
| MRPC | albert-large-v2 | 0.850/0.894 |
| MRPC | electra-discriminator-large | 0.684/0.812 |
| MRPC | roberta-large | 0.683/0.812 |
| Benchmark | Model | Eval Acc |
|-----------|-------|----------|
| WNLI | deberta-large (batch=8, 5 epochs) | 0.563 |
| WNLI | funnel-transformer-xlarge | 0.563 |
| WNLI | deberta-large (batch=8, 15 epochs) | 0.493 |
| WNLI | ernie-2.0-large | 0.423 |
| Benchmark | Model | Eval Pearson | Eval Spearman |
|-----------|-------|--------------|---------------|
| STS-B | ernie-2.0-large | 0.924 | 0.921 |
| STS-B | funnel-transformer-xlarge | 0.921 | 0.920 |
| STS-B | deberta-large | 0.904 | 0.910 |
| STS-B | bert-large-uncased | 0.907 | 0.904 |
| Benchmark | Model | Eval Acc | Train time |
|-----------|-------|----------|------------|
| SST-2 | bert-large-uncased | 0.934 | 20,086 |
| SST-2 | ernie-2.0-large | 0.928 | 18,922 |
| SST-2 | deberta-large | 0.512 | |
| SST-2 | funnel-transformer-xlarge | 0.509 | |
| SST-2 | funnel-transformer-large (batch=16) | 0.509 | |
| Benchmark | Model | Eval Acc |
|-----------|-------|----------|
| RTE | funnel-transformer-xlarge | 0.902 |
| RTE | ernie-2.0-large | 0.830 |
| RTE | bert-large-uncased (batch=8) | 0.729 |
| RTE | deberta-large | 0.472 |
| Benchmark | Model | Eval Matthews corr. |
|-----------|-------|---------------------|
| CoLA | funnel-transformer-xlarge | 0.663 |
| CoLA | ernie-2.0-large | 0.660 |
| CoLA | bert-large-uncased | 0.619 |
| CoLA | deberta-large | 0 (??) |
## Attempts at the small models: GLUE (Shibainu), batch size = 32

| Benchmark | Model |
|-----------|-------|
| MRPC | funnel-transformer-intermediate (batch=32) |
| MRPC | funnel-transformer-intermediate (batch=48) |
| MRPC | deberta-base (3 epochs) |