## Summary Observation

This page records benchmark runs of recent models against the GLUE benchmark.

Neural networks' ability to judge whether a sentence is grammatically well-formed remains poor: our best CoLA result is a Matthews correlation of 0.66, while the leaderboard tops out around 0.75.
## Hyperparameters considered

- learning rate
- training time (epochs)
- choice of model
- model size
- batch size
- optimizer
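These knobs can be enumerated into a run grid before launching fine-tuning jobs. A minimal sketch (the `sweep` helper is our own illustration, and the values shown are a subset of the WNLI runs reported later, not the full grid actually used):

```python
from itertools import product

def sweep(models, epochs, lrs, batch_sizes):
    """Yield one run configuration per point in the hyperparameter grid."""
    for model, n_epochs, lr, batch in product(models, epochs, lrs, batch_sizes):
        yield {"model": model, "epochs": n_epochs, "lr": lr, "batch": batch}

# Example: part of the bert-base-uncased WNLI sweep below
runs = list(sweep(models=["bert-base-uncased"],
                  epochs=[5, 20],
                  lrs=[2e-5, 2e-6, 2e-7],
                  batch_sizes=[32]))
# 1 model x 2 epoch settings x 3 learning rates x 1 batch size = 6 runs
```

Each dict can then be passed to whatever fine-tuning entry point is in use.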
## GLUE benchmark description

| Benchmark | Dataset size | Metric | Comments |
|-----------|--------------|--------|----------|
| CoLA  | 10.657k  | Matthews corr. | Grammar; drawn from 23 linguistics publications; single sentence |
| SST-2 | 70.045k  | Accuracy | Sentiment |
| MRPC  | 6.212k   | F1/Acc | Paraphrase, sentence pair |
| STS-B | 8.631k   | Pearson/Spearman corr. | Sentence-pair similarity |
| QQP   | 795.241k | F1/Acc | Sentence pair |
| MNLI  | 431.997k | Accuracy | Multi-NLI, matched and mismatched |
| QNLI  | 116.672k | Accuracy | Question-NLI |
| RTE   | 5.770k   | Accuracy | |
| WNLI  | 0.858k   | Accuracy | Winograd-NLI |
| SNLI  | 569.036k | Accuracy | |
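The Metric column maps to standard formulas. A self-contained numpy sketch of the three non-accuracy metrics (function names are ours; in practice the `scipy`/`sklearn` implementations would be used):

```python
import numpy as np

def matthews_corr(y_true, y_pred):
    """CoLA metric: Matthews correlation coefficient for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom else 0.0

def pearson(a, b):
    """STS-B metric: linear correlation between predicted and gold scores."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """STS-B metric: Pearson correlation on rank-transformed scores
    (this simple rank uses argsort twice and ignores tie handling)."""
    rank = lambda x: np.argsort(np.argsort(np.asarray(x)))
    return pearson(rank(a), rank(b))
```

Accuracy and F1 (MRPC, QQP) are the usual classification metrics and are omitted for brevity.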
## DeBERTa: Decoding-enhanced BERT with disentangled attention

- Disentangled attention mechanism: each word is represented by two vectors, one for content and one for position; attention weights among words are computed from disentangled matrices over contents and relative positions.
- Enhanced mask decoder: incorporates absolute positions in the decoding layer when predicting masked tokens.
- Needs only half the training data to beat older models.
- deberta-large runs with batch size 8; larger batch sizes run out of memory (WNLI).
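To make the disentangled-attention bullet concrete: the raw score is the sum of content-to-content, content-to-position, and position-to-content terms, scaled by 1/sqrt(3d) as in the DeBERTa paper. A toy single-head numpy sketch; all tensors here are random placeholders, not real model weights:

```python
import numpy as np

def rel_idx(i, j, k):
    # Clipped relative distance delta(i, j), mapped into [0, 2k)
    return int(np.clip(i - j, -k, k - 1)) + k

def disentangled_attention(H, P, Wq, Wk, Wqr, Wkr, k=4):
    """Toy single-head attention: H is (n, d) content states,
    P is (2k, d) relative-position embeddings."""
    n, d = H.shape
    Qc, Kc = H @ Wq, H @ Wk      # content query/key
    Qr, Kr = P @ Wqr, P @ Wkr    # relative-position query/key
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            c2c = Qc[i] @ Kc[j]                 # content -> content
            c2p = Qc[i] @ Kr[rel_idx(i, j, k)]  # content -> position
            p2c = Kc[j] @ Qr[rel_idx(j, i, k)]  # position -> content
            A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)
    A = np.exp(A - A.max(axis=1, keepdims=True))  # row-wise softmax
    return A / A.sum(axis=1, keepdims=True)

# Random illustration: 5 tokens, hidden size 8, clipping window k=4
rng = np.random.default_rng(0)
n, d, k = 5, 8, 4
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
Wq, Wk, Wqr, Wkr = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
attn = disentangled_attention(H, P, Wq, Wk, Wqr, Wkr, k=k)
```

The real model vectorizes the double loop and shares the content key/query projections across heads; this sketch only shows where the three score terms come from.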
- Funnel-Transformer: "Filtering out Sequential Redundancy for Efficient Language Processing"
CoLA is the worst-performing benchmark on the leaderboard, topping out around 0.75; we reach 0.66 with funnel-transformer-xlarge.

MRPC, RTE, SST-2, and STS-B eval results are within 5% of the leaderboard.

The MRPC score with deberta-large is very respectable at Acc/F1 = 0.91/0.93, almost on par with the leaderboard's 0.92/0.94. MRPC is a paraphrase benchmark, so the most advanced models appear to be good at detecting semantic similarity at the sentence level.

We have not yet run the NLI benchmarks other than WNLI, where the best result was only 0.563 accuracy.
## Model comparisons and parameters

| Model | # Parameters | Comments |
|-------|--------------|----------|
| BERT-base | 110M | |
| BERT-large | 345M | |
| DeBERTa-1.5 | 1.5B | Surpasses human performance on SuperGLUE |
| DeBERTa-base | 134M | |
| DeBERTa-large | 390M | |
| Benchmark | Model / Parameters | Eval Acc |
|-----------|--------------------|----------|
| WNLI | bert-base-uncased, 20 epochs, LR=2e-5, batch=32 | 0.338 |
| WNLI | bert-base-uncased, 5 epochs, LR=2e-4, batch=32 | 0.437 |
| WNLI | bert-base-uncased, 5 epochs, LR=2e-6, batch=32 | 0.563 |
| WNLI | bert-base-uncased, 5 epochs, LR=2e-7, batch=32 | 0.563 |
| WNLI | textattack/bert-base-uncased-WNLI, 5 epochs, LR=2e-5, batch=32 | 0.5 |
| WNLI | textattack/bert-base-uncased-WNLI, 5 epochs, LR=5e-5, batch=64 | 0.5 |
| WNLI | textattack/bert-base-uncased-WNLI, 5 epochs, LR=5e-5, batch=32, MaxSeq=256 | 0.5 |
## Attempts at the large models: GLUE (Shibainu), 5 epochs, batch size = 8
| Benchmark | Model | Eval Acc/F1 |
|-----------|-------|-------------|
| MRPC | deberta-large | 0.911/0.936 |
| MRPC | funnel-transformer-large (batch=8) | 0.909/0.935 |
| MRPC | ernie-2.0-large | 0.895/0.925 |
| MRPC | funnel-transformer-large (batch=16) | 0.889/0.919 |
| MRPC | funnel-transformer-xlarge (batch=8) | 0.887/0.919 |
| MRPC | bert-large-uncased | 0.882/0.917 |
| MRPC | deberta-base (3 epochs) | 0.877/0.913 |
| MRPC | albert-large-v1 | 0.870/0.907 |
| MRPC | albert-large-v2 | 0.850/0.894 |
| MRPC | electra-discriminator-large | 0.684/0.812 |
| MRPC | roberta-large | 0.683/0.812 |
| Benchmark | Model | Eval Acc |
|-----------|-------|----------|
| WNLI | deberta-large (batch=8, 5 epochs) | 0.563 |
| WNLI | funnel-transformer-xlarge | 0.563 |
| WNLI | deberta-large (batch=8, 15 epochs) | 0.493 |
| WNLI | ernie-2.0-large | 0.423 |
| Benchmark | Model | Eval Pearson | Eval Spearman |
|-----------|-------|--------------|---------------|
| STS-B | ernie-2.0-large | 0.924 | 0.921 |
| STS-B | funnel-transformer-xlarge | 0.921 | 0.920 |
| STS-B | deberta-large | 0.904 | 0.910 |
| STS-B | bert-large-uncased | 0.907 | 0.904 |
| Benchmark | Model | Eval Acc | Train time |
|-----------|-------|----------|------------|
| SST-2 | bert-large-uncased | 0.934 | 20,086 |
| SST-2 | ernie-2.0-large | 0.928 | 18,922 |
| SST-2 | deberta-large | 0.512 | |
| SST-2 | funnel-transformer-xlarge | 0.509 | |
| SST-2 | funnel-transformer-large (batch=16) | 0.509 | |
| Benchmark | Model | Eval Acc |
|-----------|-------|----------|
| RTE | funnel-transformer-xlarge | 0.902 |
| RTE | ernie-2.0-large | 0.830 |
| RTE | bert-large-uncased (batch=8) | 0.729 |
| RTE | deberta-large | 0.472 |
| Benchmark | Model | Eval Matthews corr. |
|-----------|-------|---------------------|
| CoLA | funnel-transformer-xlarge | 0.663 |
| CoLA | ernie-2.0-large | 0.660 |
| CoLA | bert-large-uncased | 0.619 |
| CoLA | deberta-large | 0 (??) |
## Attempts at the small models: GLUE (Shibainu), batch size = 32

| Benchmark | Model |
|-----------|-------|
| MRPC | funnel-transformer-intermediate (batch=32) |
| MRPC | funnel-transformer-intermediate (batch=48) |
| MRPC | deberta-base (3 epochs) |