TOP ACTIVATIONS
MAX = 0.740

mom my said , " we 'll be home soon ."
was covered in sticky cob we bs and had to clean
smiled and said , " we are making o atmeal !
and decided that maybe cob we bs weren 't so bad
hug him and say " we love you , Max ".
she didn 't hate cob we bs anymore and even helped
too . He says " we are friends , we are
, " Yes , cob we bs can be scary ,
covered in dust and cob we bs . She was holding
said his father , " we are too far away to
Tim my said , " we can wash them when we
't like spiders or cob we bs . But I like
and called it a " we ird rock ." Tim my
He was always making cob we bs to catch flies for
didn 't know what " we igh " meant , so
finished and Sam said " we should go back to playing
could not . The cob we bs and the spider held
each other and said " we did a great job !
air and looked for cob we bs to play in .
didn 't know what " we igh " meant , so

MIN ACTIVATIONS
MIN = -0.560

her tears . She said sorry too . "
said . " We 're sorry we lied and broke the
was wet and cold and sorry . He hugged Mia .
" I 'm sorry , Ben , I didn
. She said she was sorry . She said they should
. They said they were sorry . They said they would
, mom . I am sorry , bee . I did
. They said they were sorry to their mom and dad
said , " I 'm sorry , Mom . I didn
L ily felt sorry for Ben . She said
. They said they were sorry . Mama hugged them .
Anna and Ben were very sorry . They had not meant
L ily and Ben were sorry too . They said they
is okay . He is sorry . He gives Tom his
" I 'm sorry , Ben . I wanted
. Anna and Ben are sorry . They nod . They
" I am sorry , Anna . You can
L ily and Ben were sorry . They learned their lesson
L ila , I 'm sorry . That dog was very
. They said they were sorry . They said they had

INTERVAL 0.415 - 0.740
CONTAINS 0.410%

, they filled out some forms . Then they waited and
The fix er said , " You are
handle . " Look , Ben , what is
He still liked the ot ters , but he knew they
. He said , " Look , what is this ?"

INTERVAL 0.090 - 0.415
CONTAINS 29.733%

happy and said , " Thank you , Lily ! You
started to get very cold and the wind started to blow
moms first . Where is your mom ?" L
course you can have some glue ." She gave him some
Mom , can I have this one ?" She asked her

INTERVAL -0.235 - 0.090
CONTAINS 61.134%

doing , Lily ?" " I 'm picking
kept moving away from Jack . Finally he caught it with
Sam tried to stop him . They said they were sorry
. <|endoftext|> Once upon a time , there was a normal
bake . Then , we can enjoy it for dinner ."

INTERVAL -0.560 - -0.235
CONTAINS 8.723%

had been so compassionate and kind , and had warned him
. " I am the wise owl ," replied the owl
happy and the cat was happy too . The cat kept
toe . He is very sad and in pain ."
the park for the frame , but they couldn 't find
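
Note on the intervals: the four INTERVAL headings are consistent with splitting the range from MIN to MAX into equal-width bins. The sketch below checks this under that assumption. The dashboard's actual binning code is not shown here, and `bin_edges` is a hypothetical helper name.

```python
def bin_edges(lo, hi, n_bins):
    """Return n_bins + 1 equally spaced edges from lo to hi (assumed binning)."""
    width = (hi - lo) / n_bins
    # Round to three decimals to match the precision of the INTERVAL headings.
    return [round(lo + i * width, 3) for i in range(n_bins + 1)]


# Values copied from the MIN, MAX, and CONTAINS lines above.
edges = bin_edges(-0.560, 0.740, 4)
shares = [8.723, 61.134, 29.733, 0.410]  # percent per interval, lowest interval first

print(edges)                  # compare against the INTERVAL headings
print(round(sum(shares), 3))  # total percentage covered by the four intervals
```

The computed edges can be compared directly against the INTERVAL headings, and the CONTAINS percentages should total 100 if the four intervals cover every sampled activation.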
Token mom
Token ID1995
Feature activation-1.78e-01

Pos logit contributions
we+0.421
els+0.390
orns+0.376
ers+0.371
les+0.369
Neg logit contributions
 sorry-0.252
 okay-0.245
 thank-0.244
urry-0.224
 know-0.222
Tokenmy
Token ID1820
Feature activation+1.95e-01

Pos logit contributions
 sorry+0.372
 thank+0.359
 okay+0.358
 pleasure+0.327
 know+0.318
Neg logit contributions
we-0.442
els-0.402
orns-0.392
ays-0.385
ting-0.385
Token said
Token ID531
Feature activation-5.98e-02

Pos logit contributions
we+0.497
els+0.451
orns+0.441
ting+0.435
les+0.435
Neg logit contributions
 sorry-0.387
 thank-0.374
 okay-0.373
 pleasure-0.340
urry-0.328
Token,
Token ID11
Feature activation-2.88e-01

Pos logit contributions
 sorry+0.086
 know+0.086
 thank+0.086
 pleasure+0.084
urry+0.084
Neg logit contributions
we-0.171
ting-0.150
ers-0.150
els-0.146
les-0.145
Token "
Token ID366
Feature activation-1.35e-01

Pos logit contributions
 sorry+0.450
 thank+0.431
 okay+0.424
 pleasure+0.397
 know+0.385
Neg logit contributions
we-0.840
ting-0.746
els-0.744
ers-0.716
ays-0.715
Tokenwe
Token ID732
Feature activation+7.27e-01

Pos logit contributions
 okay+0.323
 thank+0.294
 please+0.290
 sorry+0.288
 pleasure+0.284
Neg logit contributions
els-0.300
ays-0.280
ries-0.271
orns-0.271
les-0.269
Token'll
Token ID1183
Feature activation-6.28e-02

Pos logit contributions
we+2.056
els+2.009
les+1.957
ting+1.926
orns+1.921
Neg logit contributions
 sorry-1.038
 okay-0.998
 thank-0.980
 pleasure-0.906
 tum-0.864
Token be
Token ID307
Feature activation-1.98e-01

Pos logit contributions
 sorry+0.098
 okay+0.088
 thank+0.085
 glad+0.083
 pleasure+0.080
Neg logit contributions
we-0.176
els-0.162
ting-0.157
al-0.154
en-0.154
Token home
Token ID1363
Feature activation-8.06e-02

Pos logit contributions
 sorry+0.277
 okay+0.266
 thank+0.263
 pleasure+0.218
 know+0.215
Neg logit contributions
we-0.635
els-0.594
orns-0.589
ays-0.577
ting-0.569
Token soon
Token ID2582
Feature activation+3.66e-03

Pos logit contributions
 sorry+0.167
 thank+0.162
 okay+0.151
 pleasure+0.143
 know+0.141
Neg logit contributions
we-0.201
els-0.172
ers-0.170
ting-0.170
ays-0.167
Token."
Token ID526
Feature activation+1.04e-01

Pos logit contributions
we+0.008
els+0.007
orns+0.007
les+0.007
ers+0.007
Neg logit contributions
 sorry-0.008
 thank-0.008
 okay-0.008
 glad-0.007
 pleasure-0.007
Token was
Token ID373
Feature activation+1.61e-01

Pos logit contributions
we+0.115
els+0.106
orns+0.105
ays+0.102
ting+0.100
Neg logit contributions
 sorry-0.098
 thank-0.096
 okay-0.094
 know-0.085
 pleasure-0.083
Token covered
Token ID5017
Feature activation+1.51e-01

Pos logit contributions
we+0.440
els+0.417
orns+0.402
ers+0.397
ays+0.392
Neg logit contributions
 sorry-0.267
 thank-0.267
 okay-0.254
urry-0.239
 know-0.230
Token in
Token ID287
Feature activation+1.64e-01

Pos logit contributions
ucks+0.407
rob+0.396
we+0.390
ipt+0.383
orns+0.380
Neg logit contributions
 okay-0.276
 sorry-0.260
,-0.238
 thank-0.231
 sad-0.227
Token sticky
Token ID23408
Feature activation+8.35e-03

Pos logit contributions
we+0.433
rob+0.414
ays+0.407
ucks+0.401
orns+0.399
Neg logit contributions
 sorry-0.315
 okay-0.313
 thank-0.309
 worry-0.290
 know-0.282
Token cob
Token ID22843
Feature activation+1.40e-01

Pos logit contributions
we+0.019
ipt+0.019
ucks+0.018
ays+0.018
rob+0.018
Neg logit contributions
 thank-0.018
 sorry-0.017
 okay-0.017
 tum-0.016
,-0.016
Tokenwe
Token ID732
Feature activation+7.24e-01

Pos logit contributions
ipt+0.101
ucks+0.093
ays+0.073
els+0.063
rob+0.063
Neg logit contributions
 sorry-0.480
 okay-0.468
 not-0.447
 glad-0.443
 thank-0.443
Tokenbs
Token ID1443
Feature activation-1.53e-01

Pos logit contributions
we+2.222
els+2.071
orns+2.043
ays+2.001
les+1.985
Neg logit contributions
 sorry-1.080
 thank-1.046
 okay-1.028
 pleasure-0.896
 know-0.847
Token and
Token ID290
Feature activation+1.19e-01

Pos logit contributions
 sorry+0.301
 okay+0.297
 thank+0.292
 know+0.254
 tired+0.249
Neg logit contributions
we-0.390
els-0.362
orns-0.360
ucks-0.355
ipt-0.351
Token had
Token ID550
Feature activation-7.87e-02

Pos logit contributions
we+0.286
els+0.262
orns+0.260
ays+0.254
ting+0.248
Neg logit contributions
 sorry-0.258
 thank-0.252
 okay-0.249
 know-0.223
 pleasure-0.222
Token to
Token ID284
Feature activation-1.10e-01

Pos logit contributions
 sorry+0.141
 thank+0.138
 okay+0.136
 pleasure+0.119
 know+0.118
Neg logit contributions
we-0.219
els-0.202
orns-0.200
ays-0.195
ting-0.192
Token clean
Token ID3424
Feature activation-1.52e-01

Pos logit contributions
 sorry+0.163
 okay+0.155
 thank+0.154
 pleasure+0.137
 glad+0.134
Neg logit contributions
we-0.327
els-0.310
orns-0.301
ting-0.291
ers-0.291
Token smiled
Token ID13541
Feature activation-1.77e-02

Pos logit contributions
we+0.369
els+0.339
orns+0.336
ays+0.325
ting+0.322
Neg logit contributions
 sorry-0.313
 thank-0.306
 okay-0.301
 know-0.268
 pleasure-0.267
Token and
Token ID290
Feature activation+1.21e-01

Pos logit contributions
 sorry+0.039
 thank+0.038
 okay+0.038
 know+0.035
 pleasure+0.034
Neg logit contributions
we-0.041
els-0.038
orns-0.037
ays-0.036
ting-0.036
Token said
Token ID531
Feature activation-6.74e-02

Pos logit contributions
we+0.320
els+0.291
orns+0.285
ays+0.281
ting+0.280
Neg logit contributions
 sorry-0.230
 okay-0.225
 thank-0.221
 pleasure-0.200
urry-0.196
Token,
Token ID11
Feature activation-2.91e-01

Pos logit contributions
 forget+0.076
 sorry+0.073
 okay+0.072
 know+0.071
 pleasure+0.070
Neg logit contributions
we-0.217
ers-0.198
ting-0.196
els-0.194
med-0.192
Token "
Token ID366
Feature activation-1.36e-01

Pos logit contributions
 sorry+0.466
 okay+0.452
 thank+0.447
 pleasure+0.427
 know+0.426
Neg logit contributions
we-0.819
ting-0.736
els-0.712
ipt-0.692
med-0.692
Tokenwe
Token ID732
Feature activation+7.24e-01

Pos logit contributions
 okay+0.296
 thank+0.292
 pleasure+0.283
 know+0.271
 please+0.269
Neg logit contributions
els-0.317
ays-0.291
ries-0.291
eds-0.284
orns-0.275
Token are
Token ID389
Feature activation-9.01e-02

Pos logit contributions
we+2.038
els+1.994
les+1.936
orns+1.915
al+1.878
Neg logit contributions
 sorry-1.074
 okay-1.044
 thank-0.996
 pleasure-0.909
 tum-0.883
Token making
Token ID1642
Feature activation+9.36e-03

Pos logit contributions
 thank+0.130
urry+0.128
 sorry+0.120
 okay+0.116
 tum+0.114
Neg logit contributions
we-0.250
els-0.240
ers-0.232
orns-0.227
ting-0.227
Token o
Token ID267
Feature activation+1.39e-01

Pos logit contributions
we+0.024
orns+0.022
els+0.022
ays+0.021
ting+0.021
Neg logit contributions
 sorry-0.019
 okay-0.019
 thank-0.019
 pleasure-0.017
 know-0.016
Tokenatmeal
Token ID45280
Feature activation+5.25e-02

Pos logit contributions
we+0.270
els+0.237
orns+0.231
ays+0.223
ting+0.220
Neg logit contributions
 sorry-0.372
 thank-0.365
 okay-0.361
 pleasure-0.334
 know-0.331
Token!
Token ID0
Feature activation+6.56e-02

Pos logit contributions
we+0.142
els+0.127
orns+0.126
ers+0.123
ays+0.123
Neg logit contributions
 sorry-0.099
 thank-0.095
 okay-0.094
 pleasure-0.085
 know-0.084
Token and
Token ID290
Feature activation+1.17e-01

Pos logit contributions
 sorry+0.142
 thank+0.137
 okay+0.133
 pleasure+0.130
 know+0.129
Neg logit contributions
we-0.156
els-0.138
ting-0.135
ers-0.134
les-0.133
Token decided
Token ID3066
Feature activation+1.11e-01

Pos logit contributions
we+0.298
els+0.272
ting+0.261
les+0.258
orns+0.255
Neg logit contributions
 okay-0.224
 sorry-0.223
 thank-0.212
 pleasure-0.207
urry-0.206
Token that
Token ID326
Feature activation+1.87e-02

Pos logit contributions
we+0.311
els+0.273
ting+0.270
ers+0.267
orns+0.262
Neg logit contributions
 sorry-0.194
 thank-0.186
 okay-0.182
 pleasure-0.180
urry-0.170
Token maybe
Token ID3863
Feature activation-2.79e-01

Pos logit contributions
we+0.052
els+0.049
orns+0.048
ays+0.047
ting+0.046
Neg logit contributions
 sorry-0.034
 thank-0.033
 okay-0.033
 know-0.028
 pleasure-0.028
Token cob
Token ID22843
Feature activation+1.32e-01

Pos logit contributions
 sorry+0.539
 thank+0.528
 okay+0.526
 know+0.453
 pleasure+0.452
Neg logit contributions
we-0.734
els-0.678
orns-0.677
ays-0.664
ucks-0.649
Tokenwe
Token ID732
Feature activation+7.24e-01

Pos logit contributions
ipt+0.040
rob+0.027
ucks+0.024
ays+0.023
orns+0.023
Neg logit contributions
 sorry-0.526
 okay-0.520
 thank-0.499
 please-0.475
 not-0.471
Tokenbs
Token ID1443
Feature activation-1.53e-01

Pos logit contributions
we+2.217
els+2.079
orns+2.057
ays+1.997
ting+1.994
Neg logit contributions
 sorry-1.073
 thank-1.032
 okay-1.021
 pleasure-0.882
 glad-0.842
Token weren
Token ID6304
Feature activation-2.11e-01

Pos logit contributions
 sorry+0.279
 thank+0.273
 okay+0.271
 know+0.238
 pleasure+0.230
Neg logit contributions
we-0.416
els-0.385
orns-0.381
ays-0.371
ucks-0.368
Token't
Token ID470
Feature activation-1.54e-01

Pos logit contributions
 sorry+0.246
 okay+0.236
 thank+0.232
 know+0.184
 please+0.179
Neg logit contributions
we-0.720
els-0.679
orns-0.675
ays-0.660
ipt-0.657
Token so
Token ID523
Feature activation-1.58e-01

Pos logit contributions
 sorry+0.256
 thank+0.256
 okay+0.241
 pleasure+0.223
urry+0.216
Neg logit contributions
we-0.438
els-0.408
orns-0.404
ting-0.391
les-0.385
Token bad
Token ID2089
Feature activation-2.41e-01

Pos logit contributions
 thank+0.204
 sorry+0.203
 okay+0.200
 pleasure+0.189
urry+0.170
Neg logit contributions
we-0.482
els-0.457
les-0.443
orns-0.440
ting-0.432
Token hug
Token ID16225
Feature activation-2.26e-01

Pos logit contributions
we+0.088
els+0.088
orns+0.087
les+0.086
ting+0.083
Neg logit contributions
 sorry-0.050
 okay-0.050
 pleasure-0.043
urry-0.043
 glad-0.041
Token him
Token ID683
Feature activation+2.25e-01

Pos logit contributions
 sorry+0.463
 thank+0.460
 okay+0.452
 please+0.383
 worry+0.381
Neg logit contributions
we-0.589
orns-0.557
els-0.555
ays-0.547
ipt-0.534
Token and
Token ID290
Feature activation+1.24e-01

Pos logit contributions
we+0.570
els+0.517
orns+0.515
ays+0.499
rob+0.487
Neg logit contributions
 sorry-0.479
 thank-0.469
 okay-0.462
 know-0.405
 pleasure-0.399
Token say
Token ID910
Feature activation-2.02e-01

Pos logit contributions
we+0.379
els+0.360
orns+0.354
ting+0.344
les+0.344
Neg logit contributions
 sorry-0.169
 okay-0.167
 thank-0.150
 pleasure-0.141
urry-0.140
Token "
Token ID366
Feature activation-1.28e-01

Pos logit contributions
 okay+0.283
 sorry+0.277
 worry+0.266
 know+0.265
 pleasure+0.263
Neg logit contributions
we-0.601
els-0.543
ers-0.541
ting-0.540
orns-0.535
Tokenwe
Token ID732
Feature activation+7.23e-01

Pos logit contributions
 okay+0.311
 thank+0.305
 sorry+0.287
 know+0.282
 understand+0.280
Neg logit contributions
els-0.274
ays-0.244
ries-0.243
orns-0.230
eds-0.230
Token love
Token ID1842
Feature activation-2.00e-01

Pos logit contributions
we+2.039
els+1.921
orns+1.898
les+1.887
ting+1.867
Neg logit contributions
 sorry-1.139
 okay-1.074
 thank-1.052
 worry-0.930
urry-0.922
Token you
Token ID345
Feature activation+7.47e-02

Pos logit contributions
 thank+0.379
 sorry+0.366
 okay+0.357
 please+0.332
 pleasure+0.315
Neg logit contributions
we-0.503
els-0.493
ays-0.487
ipt-0.481
orns-0.474
Token,
Token ID11
Feature activation-2.89e-01

Pos logit contributions
we+0.221
els+0.205
orns+0.202
ays+0.198
ting+0.198
Neg logit contributions
 sorry-0.118
 thank-0.114
 okay-0.112
 know-0.098
 pleasure-0.098
Token Max
Token ID5436
Feature activation-6.85e-03

Pos logit contributions
 sorry+0.572
 okay+0.552
 thank+0.546
 pleasure+0.516
 know+0.515
Neg logit contributions
we-0.697
els-0.636
orns-0.621
ting-0.615
ays-0.607
Token".
Token ID1911
Feature activation+1.55e-01

Pos logit contributions
 sorry+0.013
 okay+0.013
 thank+0.013
 know+0.011
 pleasure+0.011
Neg logit contributions
we-0.019
els-0.017
orns-0.017
ays-0.017
ipt-0.017
Token she
Token ID673
Feature activation-1.57e-01

Pos logit contributions
 sorry+0.573
 okay+0.544
 thank+0.542
urry+0.511
 pleasure+0.509
Neg logit contributions
we-0.790
orns-0.710
els-0.706
ting-0.690
les-0.674
Token didn
Token ID1422
Feature activation-2.26e-01

Pos logit contributions
 sorry+0.259
 thank+0.253
 okay+0.250
 know+0.215
 pleasure+0.215
Neg logit contributions
we-0.457
els-0.425
orns-0.419
ays-0.411
ting-0.405
Token't
Token ID470
Feature activation-1.53e-01

Pos logit contributions
 sorry+0.315
 thank+0.304
 okay+0.298
 pleasure+0.257
 know+0.247
Neg logit contributions
we-0.718
els-0.671
orns-0.661
ays-0.652
ting-0.642
Token hate
Token ID5465
Feature activation-2.86e-01

Pos logit contributions
 sorry+0.187
 okay+0.168
 pleasure+0.167
 thank+0.164
 glad+0.157
Neg logit contributions
we-0.456
en-0.443
els-0.442
les-0.440
ting-0.439
Token cob
Token ID22843
Feature activation+1.30e-01

Pos logit contributions
 sorry+0.684
 okay+0.665
 thank+0.654
 worry+0.588
 sad+0.587
Neg logit contributions
we-0.613
ucks-0.558
els-0.556
ipt-0.556
orns-0.556
Tokenwe
Token ID732
Feature activation+7.23e-01

Pos logit contributions
ipt+0.046
ucks+0.026
ays+0.022
els+0.018
acked+0.013
Neg logit contributions
 sorry-0.510
 okay-0.507
 thank-0.478
 glad-0.463
urry-0.460
Tokenbs
Token ID1443
Feature activation-1.54e-01

Pos logit contributions
we+2.156
els+2.015
orns+1.987
les+1.942
ting+1.941
Neg logit contributions
 sorry-1.116
 thank-1.088
 okay-1.047
 pleasure-0.948
 glad-0.888
Token anymore
Token ID7471
Feature activation-3.82e-02

Pos logit contributions
 sorry+0.274
 okay+0.269
 thank+0.266
 know+0.227
 worry+0.224
Neg logit contributions
we-0.423
els-0.391
orns-0.390
ucks-0.385
ipt-0.383
Token and
Token ID290
Feature activation+1.07e-01

Pos logit contributions
 sorry+0.083
 thank+0.081
 okay+0.080
 pleasure+0.074
 know+0.071
Neg logit contributions
we-0.093
els-0.084
orns-0.083
ting-0.080
ays-0.080
Token even
Token ID772
Feature activation-8.90e-02

Pos logit contributions
we+0.286
els+0.261
orns+0.256
ting+0.250
les+0.246
Neg logit contributions
 sorry-0.200
 okay-0.194
 thank-0.191
 pleasure-0.174
urry-0.173
Token helped
Token ID4193
Feature activation-6.73e-02

Pos logit contributions
 sorry+0.172
 okay+0.166
 thank+0.165
 pleasure+0.152
urry+0.147
Neg logit contributions
we-0.232
els-0.212
orns-0.209
ting-0.204
les-0.203
Token too
Token ID1165
Feature activation-4.49e-02

Pos logit contributions
we+1.056
els+0.968
ers+0.961
ting+0.957
orns+0.944
Neg logit contributions
 sorry-0.723
 thank-0.719
urry-0.671
 okay-0.670
 know-0.661
Token.
Token ID13
Feature activation+6.62e-02

Pos logit contributions
 sorry+0.119
 thank+0.115
 okay+0.113
 pleasure+0.111
 please+0.108
Neg logit contributions
we-0.071
ers-0.062
les-0.062
orns-0.060
els-0.060
Token He
Token ID679
Feature activation-3.18e-02

Pos logit contributions
we+0.171
orns+0.158
els+0.156
ays+0.155
ucks+0.153
Neg logit contributions
 sorry-0.126
 thank-0.123
 okay-0.122
 pleasure-0.109
 worry-0.108
Token says
Token ID1139
Feature activation+5.03e-02

Pos logit contributions
 sorry+0.071
 okay+0.070
 thank+0.069
 pleasure+0.063
 know+0.062
Neg logit contributions
we-0.073
els-0.067
orns-0.066
ays-0.064
les-0.063
Token "
Token ID366
Feature activation-1.29e-01

Pos logit contributions
we+0.139
ers+0.127
ting+0.124
els+0.124
orns+0.122
Neg logit contributions
 pleasure-0.072
 know-0.072
 worry-0.072
 okay-0.071
urry-0.071
Tokenwe
Token ID732
Feature activation+7.22e-01

Pos logit contributions
 okay+0.292
 thank+0.283
 understand+0.264
 know+0.263
 sorry+0.257
Neg logit contributions
els-0.301
ries-0.277
ays-0.260
eds-0.248
we-0.240
Token are
Token ID389
Feature activation-9.07e-02

Pos logit contributions
we+1.997
els+1.905
orns+1.875
les+1.866
ting+1.839
Neg logit contributions
 sorry-1.134
 okay-1.064
 thank-1.041
urry-0.935
 pleasure-0.927
Token friends
Token ID2460
Feature activation-1.16e-01

Pos logit contributions
 sorry+0.169
 thank+0.165
 okay+0.164
urry+0.150
 know+0.149
Neg logit contributions
we-0.235
els-0.219
orns-0.214
ting-0.208
ers-0.207
Token,
Token ID11
Feature activation-2.90e-01

Pos logit contributions
 sorry+0.199
 thank+0.194
 okay+0.192
 pleasure+0.165
 know+0.164
Neg logit contributions
we-0.332
els-0.309
orns-0.305
ays-0.298
ting-0.293
Token we
Token ID356
Feature activation-9.55e-02

Pos logit contributions
 sorry+0.541
 okay+0.517
 thank+0.516
 know+0.479
 pleasure+0.477
Neg logit contributions
we-0.752
els-0.681
orns-0.669
ting-0.657
ays-0.649
Token are
Token ID389
Feature activation-9.06e-02

Pos logit contributions
 sorry+0.168
 okay+0.154
 thank+0.150
urry+0.146
 tum+0.141
Neg logit contributions
we-0.244
orns-0.233
els-0.233
les-0.231
ers-0.227
Token,
Token ID11
Feature activation-2.91e-01

Pos logit contributions
 forget+0.078
 pleasure+0.078
 know+0.074
 miss+0.068
 thank+0.067
Neg logit contributions
we-0.201
ers-0.181
ting-0.178
els-0.177
med-0.173
Token "
Token ID366
Feature activation-1.32e-01

Pos logit contributions
 thank+0.463
 sorry+0.460
 pleasure+0.460
 okay+0.458
 know+0.447
Neg logit contributions
we-0.826
ting-0.740
els-0.711
ers-0.685
med-0.684
TokenYes
Token ID5297
Feature activation+7.30e-02

Pos logit contributions
 okay+0.285
 thank+0.268
 please+0.265
 sorry+0.264
 know+0.258
Neg logit contributions
els-0.310
ays-0.291
ries-0.283
les-0.277
orns-0.270
Token,
Token ID11
Feature activation-2.89e-01

Pos logit contributions
we+0.225
els+0.214
les+0.211
ting+0.208
ries+0.206
Neg logit contributions
 sorry-0.083
 know-0.077
 pleasure-0.076
 okay-0.074
 determined-0.074
Token cob
Token ID22843
Feature activation+1.34e-01

Pos logit contributions
 sorry+0.547
 thank+0.527
 okay+0.521
 pleasure+0.468
 know+0.467
Neg logit contributions
we-0.776
els-0.711
orns-0.698
ting-0.680
ays-0.679
Tokenwe
Token ID732
Feature activation+7.22e-01

Pos logit contributions
ipt+0.075
ays+0.055
ucks+0.050
els+0.050
orns+0.048
Neg logit contributions
 sorry-0.510
 okay-0.479
 thank-0.476
 please-0.448
 not-0.447
Tokenbs
Token ID1443
Feature activation-1.51e-01

Pos logit contributions
we+2.067
els+2.046
orns+2.000
les+1.986
ers+1.942
Neg logit contributions
 sorry-1.009
 okay-0.982
 thank-0.939
 pleasure-0.886
 tum-0.828
Token can
Token ID460
Feature activation-1.10e-01

Pos logit contributions
 sorry+0.258
 thank+0.254
 okay+0.249
 know+0.219
 please+0.211
Neg logit contributions
we-0.427
els-0.395
orns-0.393
ipt-0.386
rob-0.386
Token be
Token ID307
Feature activation-1.98e-01

Pos logit contributions
 sorry+0.194
 thank+0.183
 okay+0.179
 pleasure+0.167
 glad+0.164
Neg logit contributions
we-0.314
els-0.292
orns-0.282
ting-0.274
ays-0.274
Token scary
Token ID14343
Feature activation-1.37e-01

Pos logit contributions
 sorry+0.318
 thank+0.317
 okay+0.301
 pleasure+0.273
urry+0.263
Neg logit contributions
we-0.581
els-0.542
orns-0.535
ers-0.515
ting-0.513
Token,
Token ID11
Feature activation-2.92e-01

Pos logit contributions
 sorry+0.269
 okay+0.263
 thank+0.261
 know+0.229
 pleasure+0.228
Neg logit contributions
we-0.354
els-0.329
orns-0.323
ays-0.316
ipt-0.311
Token covered
Token ID5017
Feature activation+1.45e-01

Pos logit contributions
 sorry+0.613
 okay+0.584
 thank+0.555
 sad+0.514
 glad+0.511
Neg logit contributions
we-0.747
orns-0.708
els-0.694
rob-0.689
ays-0.687
Token in
Token ID287
Feature activation+1.62e-01

Pos logit contributions
rob+0.409
orns+0.384
ucks+0.380
we+0.379
ymes+0.354
Neg logit contributions
 sorry-0.265
 okay-0.256
,-0.231
 not-0.225
 sad-0.217
Token dust
Token ID8977
Feature activation+3.08e-01

Pos logit contributions
we+0.394
rob+0.367
orns+0.358
ays+0.357
ting+0.347
Neg logit contributions
 sorry-0.364
 okay-0.359
 thank-0.347
 worry-0.335
 know-0.323
Token and
Token ID290
Feature activation+1.22e-01

Pos logit contributions
we+0.844
orns+0.795
ucks+0.794
ipt+0.790
els+0.784
Neg logit contributions
 sorry-0.548
 okay-0.527
 thank-0.522
 sad-0.440
 tired-0.435
Token cob
Token ID22843
Feature activation+1.28e-01

Pos logit contributions
we+0.281
els+0.263
orns+0.262
ays+0.260
les+0.251
Neg logit contributions
 sorry-0.276
 thank-0.266
 okay-0.266
 know-0.237
 glad-0.232
Tokenwe
Token ID732
Feature activation+7.22e-01

Pos logit contributions
ipt+0.074
acked+0.063
ays+0.057
ucks+0.054
les+0.048
Neg logit contributions
 sorry-0.454
 okay-0.427
 not-0.414
 thank-0.402
 glad-0.400
Tokenbs
Token ID1443
Feature activation-1.54e-01

Pos logit contributions
we+2.319
els+2.161
orns+2.143
ays+2.100
ting+2.083
Neg logit contributions
 sorry-0.986
 thank-0.946
 okay-0.941
 know-0.793
 pleasure-0.767
Token.
Token ID13
Feature activation+6.60e-02

Pos logit contributions
 okay+0.270
 sorry+0.267
 thank+0.255
 tired+0.220
 please+0.218
Neg logit contributions
we-0.424
rob-0.399
ucks-0.398
orns-0.398
ipt-0.396
Token She
Token ID1375
Feature activation-3.30e-02

Pos logit contributions
we+0.182
els+0.169
orns+0.167
ays+0.164
ting+0.160
Neg logit contributions
 sorry-0.119
 thank-0.117
 okay-0.115
 pleasure-0.102
 know-0.100
Token was
Token ID373
Feature activation+1.60e-01

Pos logit contributions
 sorry+0.057
 thank+0.056
 okay+0.054
 know+0.049
 sad+0.047
Neg logit contributions
we-0.094
orns-0.088
els-0.087
ays-0.087
ting-0.086
Token holding
Token ID4769
Feature activation-5.10e-03

Pos logit contributions
we+0.496
els+0.465
ers+0.455
ting+0.446
web+0.444
Neg logit contributions
 thank-0.220
 pleasure-0.196
 okay-0.183
urry-0.182
 sorry-0.178
Token said
Token ID531
Feature activation-5.99e-02

Pos logit contributions
 sorry+0.115
 okay+0.114
 thank+0.114
 pleasure+0.107
urry+0.103
Neg logit contributions
we-0.174
els-0.156
orns-0.152
ting-0.149
ers-0.148
Token his
Token ID465
Feature activation+1.56e-01

Pos logit contributions
 thank+0.117
 sorry+0.116
 okay+0.114
 pleasure+0.110
 know+0.109
Neg logit contributions
we-0.144
els-0.129
ting-0.126
ers-0.125
orns-0.122
Token father
Token ID2988
Feature activation-1.66e-01

Pos logit contributions
we+0.430
els+0.417
ipt+0.386
ers+0.386
ucks+0.385
Neg logit contributions
 thank-0.232
 okay-0.229
 sorry-0.225
urry-0.213
 pleasure-0.205
Token,
Token ID11
Feature activation-2.88e-01

Pos logit contributions
 sorry+0.341
 thank+0.334
 okay+0.331
 pleasure+0.326
 know+0.303
Neg logit contributions
we-0.397
els-0.363
ting-0.351
ers-0.350
les-0.347
Token "
Token ID366
Feature activation-1.38e-01

Pos logit contributions
 sorry+0.523
 thank+0.517
 okay+0.504
 pleasure+0.463
 know+0.459
Neg logit contributions
we-0.783
els-0.713
ting-0.694
orns-0.688
ays-0.672
Tokenwe
Token ID732
Feature activation+7.21e-01

Pos logit contributions
 okay+0.333
 sorry+0.308
 please+0.306
 glad+0.301
 thank+0.296
Neg logit contributions
els-0.295
ries-0.274
ays-0.273
orns-0.272
les-0.271
Token are
Token ID389
Feature activation-8.87e-02

Pos logit contributions
els+2.086
we+2.067
les+2.006
orns+1.985
ays+1.945
Neg logit contributions
 sorry-0.963
 okay-0.951
 thank-0.924
 pleasure-0.873
 tum-0.816
Token too
Token ID1165
Feature activation-4.83e-02

Pos logit contributions
 thank+0.147
 sorry+0.138
 okay+0.135
urry+0.134
 know+0.130
Neg logit contributions
we-0.243
els-0.230
ting-0.222
orns-0.219
ers-0.218
Token far
Token ID1290
Feature activation-5.55e-03

Pos logit contributions
 thank+0.077
 sorry+0.072
 pleasure+0.070
 okay+0.069
 yes+0.062
Neg logit contributions
we-0.139
els-0.132
ays-0.123
les-0.123
eds-0.123
Token away
Token ID1497
Feature activation+2.39e-01

Pos logit contributions
 sorry+0.013
 thank+0.013
 okay+0.013
 know+0.011
 pleasure+0.011
Neg logit contributions
we-0.013
els-0.011
orns-0.011
ays-0.011
ucks-0.011
Token to
Token ID284
Feature activation-1.08e-01

Pos logit contributions
we+0.582
els+0.532
orns+0.525
ays+0.510
ting+0.504
Neg logit contributions
 sorry-0.509
 thank-0.497
 okay-0.490
 pleasure-0.441
 know-0.438
Token Tim
Token ID5045
Feature activation-5.71e-02

Pos logit contributions
 sorry+0.119
 thank+0.116
 okay+0.112
 know+0.096
 glad+0.093
Neg logit contributions
we-0.225
els-0.211
orns-0.209
ays-0.206
ucks-0.201
Tokenmy
Token ID1820
Feature activation+1.81e-01

Pos logit contributions
 sorry+0.154
 thank+0.150
 okay+0.147
 pleasure+0.140
urry+0.135
Neg logit contributions
we-0.106
els-0.095
orns-0.093
les-0.091
ays-0.088
Token said
Token ID531
Feature activation-6.42e-02

Pos logit contributions
we+0.419
les+0.390
ers+0.375
els+0.373
ipt+0.361
Neg logit contributions
 sorry-0.376
 pleasure-0.364
 okay-0.359
 thank-0.351
urry-0.338
Token,
Token ID11
Feature activation-2.95e-01

Pos logit contributions
 pleasure+0.159
 thank+0.150
urry+0.148
 forget+0.147
 know+0.147
Neg logit contributions
we-0.120
ers-0.098
ting-0.094
les-0.094
med-0.093
Token "
Token ID366
Feature activation-1.37e-01

Pos logit contributions
 thank+0.453
 sorry+0.452
 okay+0.450
 pleasure+0.439
urry+0.425
Neg logit contributions
we-0.860
ting-0.779
els-0.751
ers-0.729
les-0.728
Tokenwe
Token ID732
Feature activation+7.21e-01

Pos logit contributions
 okay+0.352
 sorry+0.334
 thank+0.327
 please+0.307
 glad+0.304
Neg logit contributions
els-0.275
ays-0.253
ries-0.251
orns-0.249
rob-0.231
Token can
Token ID460
Feature activation-1.10e-01

Pos logit contributions
we+1.996
els+1.947
les+1.909
orns+1.877
ays+1.859
Neg logit contributions
 sorry-1.122
 thank-1.087
 okay-1.068
 pleasure-0.994
 glad-0.920
Token wash
Token ID13502
Feature activation-1.73e-01

Pos logit contributions
 sorry+0.176
 okay+0.166
 thank+0.159
 pleasure+0.153
 glad+0.151
Neg logit contributions
we-0.322
els-0.302
orns-0.285
les-0.282
ays-0.282
Token them
Token ID606
Feature activation+4.08e-01

Pos logit contributions
 sorry+0.343
 okay+0.340
 thank+0.321
 please+0.283
 sad+0.280
Neg logit contributions
we-0.448
rob-0.430
ipt-0.428
orns-0.423
ucks-0.398
Token when
Token ID618
Feature activation+6.26e-02

Pos logit contributions
we+0.972
els+0.886
orns+0.877
ays+0.849
ting+0.837
Neg logit contributions
 sorry-0.899
 thank-0.878
 okay-0.877
 pleasure-0.782
 know-0.780
Token we
Token ID356
Feature activation-1.01e-01

Pos logit contributions
we+0.180
els+0.165
orns+0.162
ays+0.160
ting+0.159
Neg logit contributions
 sorry-0.104
 thank-0.102
 okay-0.100
 pleasure-0.088
 know-0.088
Token't
Token ID470
Feature activation-1.47e-01

Pos logit contributions
 sorry+0.394
 thank+0.372
 okay+0.363
 know+0.291
 please+0.273
Neg logit contributions
we-1.139
orns-1.070
els-1.069
ays-1.035
ting-1.028
Token like
Token ID588
Feature activation-1.52e-01

Pos logit contributions
 sorry+0.157
 thank+0.146
 pleasure+0.141
 yes+0.130
 Once+0.128
Neg logit contributions
orns-0.450
we-0.445
els-0.432
les-0.429
en-0.426
Token spiders
Token ID26120
Feature activation+2.56e-01

Pos logit contributions
 sorry+0.343
 thank+0.333
 okay+0.327
 know+0.293
 worry+0.289
Neg logit contributions
we-0.354
els-0.322
ipt-0.321
orns-0.321
ays-0.318
Token or
Token ID393
Feature activation+5.17e-02

Pos logit contributions
we+0.609
orns+0.550
els+0.549
les+0.537
ting+0.533
Neg logit contributions
 sorry-0.543
 thank-0.530
 okay-0.521
 pleasure-0.500
 glad-0.476
Token cob
Token ID22843
Feature activation+1.27e-01

Pos logit contributions
we+0.113
els+0.102
orns+0.101
ting+0.097
ays+0.096
Neg logit contributions
 sorry-0.121
 thank-0.119
 okay-0.119
 pleasure-0.109
urry-0.107
Tokenwe
Token ID732
Feature activation+7.21e-01

Pos logit contributions
ipt+0.081
acked+0.056
rolley+0.048
ucks+0.043
eled+0.039
Neg logit contributions
 sorry-0.464
 okay-0.449
 thank-0.424
 not-0.408
 help-0.406
Tokenbs
Token ID1443
Feature activation-1.57e-01

Pos logit contributions
we+2.196
els+2.060
orns+2.038
ays+1.990
les+1.976
Neg logit contributions
 sorry-1.068
 thank-1.031
 okay-1.015
 pleasure-0.896
 glad-0.850
Token.
Token ID13
Feature activation+6.99e-02

Pos logit contributions
 sorry+0.277
 okay+0.273
 thank+0.269
 please+0.234
 know+0.231
Neg logit contributions
we-0.434
ipt-0.410
ucks-0.399
els-0.398
rob-0.395
Token But
Token ID887
Feature activation-2.09e-01

Pos logit contributions
we+0.185
els+0.169
orns+0.169
ays+0.166
ting+0.164
Neg logit contributions
 sorry-0.125
 thank-0.124
 okay-0.121
 know-0.109
 pleasure-0.109
Token I
Token ID314
Feature activation-1.05e-01

Pos logit contributions
 sorry+0.402
 thank+0.386
 okay+0.380
 know+0.364
urry+0.361
Neg logit contributions
we-0.528
els-0.481
orns-0.458
ting-0.453
ries-0.446
Token like
Token ID588
Feature activation-1.51e-01

Pos logit contributions
 sorry+0.169
 thank+0.148
 pleasure+0.135
 okay+0.134
 yes+0.121
Neg logit contributions
we-0.288
orns-0.261
les-0.258
ucks-0.258
al-0.249
Token and
Token ID290
Feature activation+1.07e-01

Pos logit contributions
we+0.357
els+0.347
ers+0.332
les+0.329
orns+0.327
Neg logit contributions
 sorry-0.287
 thank-0.276
 pleasure-0.269
 okay-0.262
 tum-0.245
Token called
Token ID1444
Feature activation-6.38e-04

Pos logit contributions
we+0.287
els+0.274
orns+0.263
ting+0.256
les+0.255
Neg logit contributions
 sorry-0.188
 okay-0.185
 thank-0.184
 pleasure-0.173
urry-0.166
Token it
Token ID340
Feature activation+1.73e-01

Pos logit contributions
 sorry+0.001
 thank+0.001
 okay+0.001
 pleasure+0.001
 know+0.001
Neg logit contributions
we-0.002
els-0.001
orns-0.001
ays-0.001
ting-0.001
Token a
Token ID257
Feature activation+4.90e-03

Pos logit contributions
we+0.415
els+0.382
orns+0.362
ers+0.357
ays+0.356
Neg logit contributions
 sorry-0.370
 thank-0.369
 okay-0.355
 pleasure-0.342
 glad-0.328
Token "
Token ID366
Feature activation-1.44e-01

Pos logit contributions
we+0.010
els+0.009
orns+0.008
les+0.008
ymes+0.008
Neg logit contributions
 okay-0.011
 sorry-0.011
urry-0.010
 thank-0.010
 glad-0.010
Tokenwe
Token ID732
Feature activation+7.21e-01

Pos logit contributions
 okay+0.345
 know+0.323
 thank+0.319
 please+0.316
 sorry+0.314
Neg logit contributions
we-0.264
els-0.254
ays-0.247
ting-0.246
rob-0.243
Tokenird
Token ID1447
Feature activation-2.01e-01

Pos logit contributions
we+2.223
els+2.119
orns+2.080
ays+2.034
les+2.020
Neg logit contributions
 sorry-1.021
 thank-0.977
 okay-0.973
 pleasure-0.842
 glad-0.797
Token rock
Token ID3881
Feature activation+1.82e-01

Pos logit contributions
 sorry+0.514
 thank+0.503
 okay+0.495
 pleasure+0.452
 glad+0.447
Neg logit contributions
we-0.415
els-0.379
orns-0.364
ays-0.358
ucks-0.346
Token."
Token ID526
Feature activation+9.33e-02

Pos logit contributions
we+0.466
els+0.427
orns+0.423
ays+0.411
ting+0.405
Neg logit contributions
 sorry-0.364
 thank-0.355
 okay-0.352
 know-0.311
 pleasure-0.309
Token Tim
Token ID5045
Feature activation-6.40e-02

Pos logit contributions
we+0.267
els+0.246
orns+0.245
ays+0.238
ting+0.236
Neg logit contributions
 sorry-0.159
 thank-0.156
 okay-0.151
 pleasure-0.133
 know-0.130
Tokenmy
Token ID1820
Feature activation+1.74e-01

Pos logit contributions
 sorry+0.154
 okay+0.151
 thank+0.151
 know+0.138
 pleasure+0.135
Neg logit contributions
we-0.139
els-0.125
orns-0.123
ays-0.120
ting-0.118
Token He
Token ID679
Feature activation-5.05e-02

Pos logit contributions
we+0.143
orns+0.134
els+0.131
ays+0.128
ucks+0.127
Neg logit contributions
 sorry-0.092
 okay-0.089
 thank-0.088
 pleasure-0.076
 worry-0.075
Token was
Token ID373
Feature activation+1.41e-01

Pos logit contributions
 sorry+0.090
 thank+0.087
 okay+0.086
 pleasure+0.075
 know+0.074
Neg logit contributions
we-0.142
els-0.131
orns-0.130
ays-0.127
ting-0.125
Token always
Token ID1464
Feature activation-2.40e-01

Pos logit contributions
we+0.362
orns+0.351
ting+0.345
ers+0.344
els+0.333
Neg logit contributions
 thank-0.231
urry-0.213
 pleasure-0.204
 promise-0.203
 okay-0.197
Token making
Token ID1642
Feature activation+4.30e-03

Pos logit contributions
 thank+0.359
 sorry+0.327
 okay+0.304
urry+0.288
 pleasure+0.287
Neg logit contributions
we-0.741
orns-0.684
ting-0.680
els-0.674
ers-0.656
Token cob
Token ID22843
Feature activation+1.26e-01

Pos logit contributions
we+0.011
els+0.010
orns+0.010
ays+0.010
ting+0.009
Neg logit contributions
 sorry-0.009
 thank-0.009
 okay-0.009
 know-0.008
 pleasure-0.008
Tokenwe
Token ID732
Feature activation+7.21e-01

Pos logit contributions
ipt+0.055
ays+0.053
els+0.048
les+0.043
ucks+0.040
Neg logit contributions
 sorry-0.476
 okay-0.467
 thank-0.450
 glad-0.434
 know-0.432
Tokenbs
Token ID1443
Feature activation-1.56e-01

Pos logit contributions
we+2.317
els+2.165
orns+2.138
ays+2.107
ucks+2.073
Neg logit contributions
 sorry-0.978
 thank-0.944
 okay-0.930
 know-0.780
 pleasure-0.767
Token to
Token ID284
Feature activation-1.12e-01

Pos logit contributions
 sorry+0.290
 okay+0.281
 thank+0.281
 know+0.245
 pleasure+0.239
Neg logit contributions
we-0.421
els-0.389
orns-0.383
ays-0.377
ipt-0.375
Token catch
Token ID4929
Feature activation-6.35e-03

Pos logit contributions
 sorry+0.181
 thank+0.174
 okay+0.173
 pleasure+0.150
 know+0.146
Neg logit contributions
we-0.331
els-0.308
orns-0.305
ays-0.296
ting-0.294
Token flies
Token ID17607
Feature activation+3.61e-01

Pos logit contributions
 okay+0.012
 sorry+0.012
 thank+0.011
 know+0.011
 sad+0.011
Neg logit contributions
we-0.017
ipt-0.016
ays-0.016
rob-0.016
ucks-0.015
Token for
Token ID329
Feature activation-1.39e-01

Pos logit contributions
we+1.017
els+0.925
ucks+0.920
rob+0.913
ays+0.909
Neg logit contributions
 okay-0.645
 sorry-0.632
 thank-0.619
 know-0.559
 tired-0.548
Token didn
Token ID1422
Feature activation-2.21e-01

Pos logit contributions
we+0.052
els+0.048
les+0.046
orns+0.046
ting+0.045
Neg logit contributions
 sorry-0.037
 okay-0.036
 thank-0.035
 pleasure-0.032
urry-0.031
Token't
Token ID470
Feature activation-1.50e-01

Pos logit contributions
 sorry+0.315
 thank+0.303
 okay+0.300
 pleasure+0.264
 know+0.252
Neg logit contributions
we-0.689
els-0.643
orns-0.628
ays-0.625
ting-0.617
Token know
Token ID760
Feature activation-4.36e-01

Pos logit contributions
 pleasure+0.167
 tum+0.162
 sorry+0.158
 okay+0.141
urry+0.135
Neg logit contributions
les-0.438
ting-0.429
we-0.422
en-0.417
els-0.414
Token what
Token ID644
Feature activation-2.03e-02

Pos logit contributions
 sorry+0.771
 thank+0.754
 okay+0.752
 pleasure+0.633
 know+0.632
Neg logit contributions
we-1.229
orns-1.131
els-1.121
ays-1.109
ucks-1.104
Token "
Token ID366
Feature activation-1.40e-01

Pos logit contributions
 sorry+0.046
 okay+0.044
 thank+0.044
 pleasure+0.041
urry+0.040
Neg logit contributions
we-0.047
els-0.043
orns-0.042
ting-0.040
ays-0.039
Tokenwe
Token ID732
Feature activation+7.20e-01

Pos logit contributions
 okay+0.337
 know+0.314
 thank+0.306
 understand+0.303
 please+0.301
Neg logit contributions
els-0.249
we-0.247
ays-0.243
ucks-0.241
orns-0.234
Tokenigh
Token ID394
Feature activation+1.04e-01

Pos logit contributions
we+2.094
els+2.002
orns+1.950
ting+1.918
les+1.915
Neg logit contributions
 sorry-1.130
 okay-1.069
 thank-1.060
 pleasure-0.954
 glad-0.902
Token"
Token ID1
Feature activation+1.08e-01

Pos logit contributions
we+0.230
orns+0.207
els+0.206
ays+0.198
ucks+0.196
Neg logit contributions
 sorry-0.243
 thank-0.241
 okay-0.238
 know-0.216
 pleasure-0.212
Token meant
Token ID4001
Feature activation+4.19e-02

Pos logit contributions
we+0.282
els+0.265
orns+0.260
ays+0.256
ucks+0.251
Neg logit contributions
 okay-0.206
 sorry-0.203
 thank-0.199
 know-0.180
 please-0.173
Token,
Token ID11
Feature activation-2.98e-01

Pos logit contributions
we+0.117
els+0.107
ting+0.104
ers+0.102
ries+0.100
Neg logit contributions
 sorry-0.070
 thank-0.068
 okay-0.065
 pleasure-0.064
 tum-0.060
Token so
Token ID523
Feature activation-1.57e-01

Pos logit contributions
 sorry+0.407
 thank+0.391
 okay+0.387
 pleasure+0.337
 know+0.330
Neg logit contributions
we-0.947
els-0.885
orns-0.870
ting-0.860
ays-0.853
Token finished
Token ID5201
Feature activation-1.94e-01

Pos logit contributions
we+0.382
ers+0.356
ting+0.347
orns+0.345
med+0.344
Neg logit contributions
 thank-0.237
 sorry-0.226
 okay-0.209
urry-0.209
 know-0.205
Token and
Token ID290
Feature activation+1.07e-01

Pos logit contributions
 sorry+0.387
 thank+0.377
 okay+0.375
 pleasure+0.331
 know+0.329
Neg logit contributions
we-0.498
els-0.459
orns-0.454
ays-0.442
ting-0.435
Token Sam
Token ID3409
Feature activation-2.87e-02

Pos logit contributions
we+0.289
els+0.261
orns+0.258
ting+0.255
ays+0.253
Neg logit contributions
 sorry-0.196
 thank-0.190
 okay-0.189
 pleasure-0.170
urry-0.168
Token said
Token ID531
Feature activation-7.63e-02

Pos logit contributions
 sorry+0.058
 thank+0.056
 okay+0.055
 pleasure+0.050
 know+0.048
Neg logit contributions
we-0.073
els-0.067
orns-0.066
les-0.064
ting-0.064
Token "
Token ID366
Feature activation-1.46e-01

Pos logit contributions
 pleasure+0.122
 forget+0.115
urry+0.115
 okay+0.109
 age+0.109
Neg logit contributions
we-0.199
ers-0.184
med-0.178
ting-0.177
les-0.177
Tokenwe
Token ID732
Feature activation+7.19e-01

Pos logit contributions
 okay+0.348
 thank+0.326
 pleasure+0.313
 worry+0.308
 please+0.308
Neg logit contributions
els-0.319
ays-0.295
ries-0.277
orns-0.267
eds-0.267
Token should
Token ID815
Feature activation-1.16e-02

Pos logit contributions
we+2.042
els+1.941
orns+1.906
les+1.882
ting+1.866
Neg logit contributions
 sorry-1.133
 okay-1.074
 thank-1.074
 pleasure-0.951
 glad-0.908
Token go
Token ID467
Feature activation-1.91e-01

Pos logit contributions
 sorry+0.016
 glad+0.015
 okay+0.015
 pleasure+0.014
 thank+0.014
Neg logit contributions
we-0.032
els-0.030
al-0.030
orns-0.030
ers-0.029
Token back
Token ID736
Feature activation+3.32e-02

Pos logit contributions
 sorry+0.370
 thank+0.358
 okay+0.355
 pleasure+0.320
 glad+0.310
Neg logit contributions
we-0.500
orns-0.455
els-0.450
ting-0.448
ays-0.439
Token to
Token ID284
Feature activation-1.13e-01

Pos logit contributions
we+0.093
els+0.087
orns+0.085
ays+0.083
ting+0.081
Neg logit contributions
 sorry-0.059
 thank-0.058
 okay-0.057
 know-0.050
 pleasure-0.050
Token playing
Token ID2712
Feature activation-1.32e-01

Pos logit contributions
 sorry+0.239
 okay+0.235
 thank+0.232
 pleasure+0.211
urry+0.207
Neg logit contributions
we-0.269
els-0.242
orns-0.241
ting-0.237
ays-0.233
Token could
Token ID714
Feature activation-5.52e-02

Pos logit contributions
 sorry+0.271
 thank+0.265
 okay+0.257
 pleasure+0.238
 know+0.223
Neg logit contributions
we-0.418
els-0.382
orns-0.377
les-0.364
ays-0.363
Token not
Token ID407
Feature activation-3.40e-01

Pos logit contributions
 pleasure+0.089
 sorry+0.071
 thank+0.069
 dear+0.067
onest+0.066
Neg logit contributions
we-0.144
ting-0.122
els-0.120
les-0.116
web-0.115
Token.
Token ID13
Feature activation+7.14e-02

Pos logit contributions
 sorry+0.691
 pleasure+0.659
 thank+0.638
 glad+0.634
 okay+0.613
Neg logit contributions
we-0.820
els-0.744
orns-0.730
les-0.724
ting-0.724
Token The
Token ID383
Feature activation-2.23e-02

Pos logit contributions
we+0.179
els+0.167
ays+0.164
orns+0.164
ucks+0.159
Neg logit contributions
 thank-0.139
 sorry-0.136
 okay-0.130
 pleasure-0.121
 know-0.120
Token cob
Token ID22843
Feature activation+1.27e-01

Pos logit contributions
 sorry+0.047
 okay+0.046
 thank+0.046
 know+0.042
 pleasure+0.042
Neg logit contributions
we-0.053
els-0.049
orns-0.048
ting-0.046
ays-0.046
Tokenwe
Token ID732
Feature activation+7.19e-01

Pos logit contributions
ipt+0.095
ting+0.090
ucks+0.089
ays+0.086
els+0.086
Neg logit contributions
 sorry-0.439
 okay-0.420
 thank-0.415
 not-0.392
 please-0.391
Tokenbs
Token ID1443
Feature activation-1.58e-01

Pos logit contributions
we+2.153
els+2.020
orns+1.995
ays+1.946
ting+1.928
Neg logit contributions
 sorry-1.116
 thank-1.080
 okay-1.069
 pleasure-0.934
 know-0.889
Token and
Token ID290
Feature activation+1.21e-01

Pos logit contributions
 sorry+0.275
 thank+0.267
 okay+0.265
 know+0.227
 please+0.220
Neg logit contributions
we-0.448
orns-0.413
els-0.413
ucks-0.400
ays-0.399
Token the
Token ID262
Feature activation+1.47e-01

Pos logit contributions
we+0.283
orns+0.255
els+0.254
ays+0.249
ucks+0.243
Neg logit contributions
 sorry-0.277
 thank-0.273
 okay-0.265
 know-0.241
 pleasure-0.235
Token spider
Token ID19230
Feature activation+1.61e-01

Pos logit contributions
we+0.332
els+0.301
orns+0.298
ays+0.295
ucks+0.286
Neg logit contributions
 sorry-0.346
 thank-0.337
 okay-0.333
 pleasure-0.299
 know-0.299
Token held
Token ID2714
Feature activation+5.26e-02

Pos logit contributions
we+0.411
els+0.378
orns+0.374
ays+0.365
ting+0.358
Neg logit contributions
 sorry-0.322
 thank-0.313
 okay-0.310
 know-0.274
 pleasure-0.272
Token each
Token ID1123
Feature activation+2.19e-01

Pos logit contributions
 sorry+0.095
 thank+0.094
 okay+0.093
 pleasure+0.084
 know+0.082
Neg logit contributions
we-0.090
orns-0.082
els-0.081
ays-0.078
ucks-0.076
Token other
Token ID584
Feature activation+1.36e-01

Pos logit contributions
we+0.522
els+0.474
orns+0.468
ays+0.460
ucks+0.443
Neg logit contributions
 sorry-0.490
 thank-0.489
 okay-0.479
 know-0.439
 pleasure-0.429
Token and
Token ID290
Feature activation+1.04e-01

Pos logit contributions
we+0.310
els+0.279
orns+0.270
ting+0.268
les+0.267
Neg logit contributions
 sorry-0.307
 thank-0.297
 okay-0.295
 pleasure-0.268
 know-0.268
Token said
Token ID531
Feature activation-7.49e-02

Pos logit contributions
we+0.281
els+0.257
orns+0.252
ting+0.249
ays+0.246
Neg logit contributions
 sorry-0.192
 okay-0.187
 thank-0.182
 pleasure-0.163
urry-0.161
Token "
Token ID366
Feature activation-1.42e-01

Pos logit contributions
 okay+0.110
 forget+0.106
 sorry+0.106
 know+0.104
 worry+0.102
Neg logit contributions
we-0.213
ting-0.192
els-0.191
ers-0.190
med-0.186
Tokenwe
Token ID732
Feature activation+7.19e-01

Pos logit contributions
 okay+0.330
 pleasure+0.310
 thank+0.308
 understand+0.288
 glad+0.287
Neg logit contributions
els-0.296
ays-0.272
eds-0.263
ucks-0.258
ries-0.258
Token did
Token ID750
Feature activation-1.19e-01

Pos logit contributions
we+2.015
els+1.923
les+1.871
ers+1.841
ting+1.834
Neg logit contributions
 sorry-1.080
 okay-1.011
 thank-1.008
 tum-0.906
 pleasure-0.894
Token a
Token ID257
Feature activation+1.22e-02

Pos logit contributions
 sorry+0.294
 okay+0.282
 thank+0.282
urry+0.277
 glad+0.263
Neg logit contributions
we-0.223
els-0.203
ers-0.189
ries-0.185
orns-0.183
Token great
Token ID1049
Feature activation-2.96e-01

Pos logit contributions
we+0.033
les+0.031
ymes+0.031
ers+0.030
els+0.030
Neg logit contributions
 sorry-0.016
urry-0.015
 okay-0.015
 thank-0.014
 glad-0.013
Token job
Token ID1693
Feature activation-3.50e-01

Pos logit contributions
 sorry+0.334
 okay+0.324
 thank+0.288
 glad+0.264
 pleasure+0.248
Neg logit contributions
we-0.982
les-0.910
els-0.891
orns-0.885
ting-0.882
Token!
Token ID0
Feature activation+5.33e-02

Pos logit contributions
 sorry+0.761
 thank+0.736
 okay+0.732
 know+0.672
 pleasure+0.663
Neg logit contributions
we-0.840
els-0.743
ting-0.716
ays-0.713
orns-0.709
Token air
Token ID1633
Feature activation+3.01e-01

Pos logit contributions
we+0.317
orns+0.288
els+0.288
ting+0.277
les+0.273
Neg logit contributions
 sorry-0.285
 thank-0.279
 okay-0.275
 know-0.249
urry-0.247
Token and
Token ID290
Feature activation+1.04e-01

Pos logit contributions
we+0.743
els+0.678
orns+0.666
ays+0.651
ting+0.648
Neg logit contributions
 sorry-0.631
 thank-0.614
 okay-0.609
 pleasure-0.549
 know-0.547
Token looked
Token ID3114
Feature activation-1.44e-02

Pos logit contributions
we+0.261
els+0.239
orns+0.235
ays+0.229
ting+0.227
Neg logit contributions
 sorry-0.217
 thank-0.212
 okay-0.210
 pleasure-0.188
 know-0.187
Token for
Token ID329
Feature activation-1.41e-01

Pos logit contributions
 thank+0.028
 sorry+0.028
urry+0.028
 okay+0.028
 know+0.027
Neg logit contributions
we-0.030
ting-0.030
les-0.030
els-0.029
ays-0.028
Token cob
Token ID22843
Feature activation+1.27e-01

Pos logit contributions
 sorry+0.300
 thank+0.292
 okay+0.290
 pleasure+0.259
 know+0.259
Neg logit contributions
we-0.344
els-0.316
orns-0.311
ays-0.299
ting-0.298
Tokenwe
Token ID732
Feature activation+7.19e-01

Pos logit contributions
ipt+0.079
ucks+0.057
ays+0.052
acked+0.047
Look+0.044
Neg logit contributions
 sorry-0.445
 okay-0.445
 thank-0.408
 not-0.407
 help-0.407
Tokenbs
Token ID1443
Feature activation-1.56e-01

Pos logit contributions
we+2.271
els+2.122
orns+2.099
ays+2.055
ting+2.035
Neg logit contributions
 sorry-1.016
 thank-0.979
 okay-0.966
 pleasure-0.813
 know-0.803
Token to
Token ID284
Feature activation-1.14e-01

Pos logit contributions
 sorry+0.297
 okay+0.294
 thank+0.290
 know+0.249
 pleasure+0.249
Neg logit contributions
we-0.410
els-0.379
orns-0.373
ucks-0.371
ipt-0.369
Token play
Token ID711
Feature activation-1.37e-01

Pos logit contributions
 sorry+0.173
 okay+0.166
 thank+0.162
 pleasure+0.142
urry+0.138
Neg logit contributions
we-0.344
els-0.323
orns-0.318
ting-0.310
les-0.305
Token in
Token ID287
Feature activation+1.60e-01

Pos logit contributions
 okay+0.259
 sorry+0.255
 thank+0.239
 please+0.214
 happy+0.213
Neg logit contributions
we-0.357
els-0.356
ipt-0.355
ays-0.352
ucks-0.345
Token.
Token ID13
Feature activation+5.35e-02

Pos logit contributions
we+0.440
els+0.403
ucks+0.399
orns+0.398
ays+0.397
Neg logit contributions
 sorry-0.296
 okay-0.291
 thank-0.285
 know-0.255
 pleasure-0.250
Token didn
Token ID1422
Feature activation-2.23e-01

Pos logit contributions
we+0.037
els+0.034
orns+0.033
les+0.032
ays+0.032
Neg logit contributions
 sorry-0.029
 okay-0.027
 thank-0.027
 pleasure-0.025
 glad-0.024
Token't
Token ID470
Feature activation-1.53e-01

Pos logit contributions
 sorry+0.326
 thank+0.314
 okay+0.311
 pleasure+0.278
 glad+0.263
Neg logit contributions
we-0.686
els-0.641
orns-0.625
ays-0.622
ting-0.615
Token know
Token ID760
Feature activation-4.40e-01

Pos logit contributions
 tum+0.179
 pleasure+0.174
 sorry+0.169
 glad+0.142
sing+0.140
Neg logit contributions
les-0.429
ers-0.417
le-0.411
ting-0.403
en-0.402
Token what
Token ID644
Feature activation-2.44e-02

Pos logit contributions
 sorry+0.739
 thank+0.718
 okay+0.711
 pleasure+0.613
 know+0.610
Neg logit contributions
we-1.278
els-1.184
orns-1.172
ays-1.146
ting-1.130
Token "
Token ID366
Feature activation-1.48e-01

Pos logit contributions
 sorry+0.048
 okay+0.045
 thank+0.044
 pleasure+0.042
urry+0.041
Neg logit contributions
we-0.063
els-0.058
orns-0.056
ers-0.054
ries-0.053
Tokenwe
Token ID732
Feature activation+7.19e-01

Pos logit contributions
 okay+0.347
 know+0.322
 thank+0.316
 please+0.311
 sorry+0.310
Neg logit contributions
we-0.278
els-0.275
ays-0.272
ucks-0.262
orns-0.257
Tokenigh
Token ID394
Feature activation+1.03e-01

Pos logit contributions
we+2.000
els+1.951
orns+1.914
ting+1.876
les+1.872
Neg logit contributions
 sorry-1.158
 okay-1.099
 thank-1.070
 pleasure-1.003
 glad-0.940
Token"
Token ID1
Feature activation+1.02e-01

Pos logit contributions
we+0.235
els+0.211
orns+0.210
ays+0.203
ucks+0.200
Neg logit contributions
 sorry-0.233
 thank-0.231
 okay-0.229
 know-0.207
 pleasure-0.202
Token meant
Token ID4001
Feature activation+3.98e-02

Pos logit contributions
we+0.257
els+0.241
orns+0.237
ays+0.232
ucks+0.226
Neg logit contributions
 okay-0.201
 sorry-0.199
 thank-0.194
 know-0.175
 please-0.169
Token,
Token ID11
Feature activation-3.04e-01

Pos logit contributions
we+0.116
els+0.106
ting+0.105
ers+0.101
ries+0.099
Neg logit contributions
 sorry-0.057
 thank-0.056
 pleasure-0.055
 okay-0.052
 tum-0.051
Token so
Token ID523
Feature activation-1.59e-01

Pos logit contributions
 sorry+0.442
 thank+0.424
 okay+0.421
 pleasure+0.390
urry+0.386
Neg logit contributions
we-0.922
els-0.860
ting-0.838
orns-0.831
ays-0.824
Token her
Token ID607
Feature activation+1.45e-01

Pos logit contributions
we+0.325
orns+0.287
ipt+0.283
ymes+0.279
rob+0.279
Neg logit contributions
 sorry-0.212
 thank-0.208
 okay-0.200
 know-0.182
 sad-0.181
Token tears
Token ID10953
Feature activation+8.35e-02

Pos logit contributions
we+0.393
orns+0.360
els+0.359
ays+0.350
ting+0.346
Neg logit contributions
 sorry-0.283
 thank-0.273
 okay-0.270
 pleasure-0.237
 know-0.234
Token.
Token ID13
Feature activation+6.40e-02

Pos logit contributions
we+0.200
els+0.183
orns+0.179
ays+0.175
ting+0.173
Neg logit contributions
 sorry-0.182
 thank-0.178
 okay-0.176
 pleasure-0.159
 know-0.158
Token She
Token ID1375
Feature activation-3.91e-02

Pos logit contributions
we+0.170
els+0.156
orns+0.155
ays+0.154
ting+0.150
Neg logit contributions
 sorry-0.118
 thank-0.116
 okay-0.114
 know-0.102
 pleasure-0.101
Token said
Token ID531
Feature activation-6.01e-02

Pos logit contributions
 sorry+0.069
 thank+0.068
 okay+0.067
 know+0.059
 pleasure+0.057
Neg logit contributions
we-0.109
els-0.102
orns-0.101
ays-0.099
ting-0.098
Token sorry
Token ID7926
Feature activation-5.60e-01

Pos logit contributions
 pleasure+0.074
 forget+0.073
 know+0.071
urry+0.070
 believe+0.066
Neg logit contributions
we-0.180
els-0.167
ers-0.162
ting-0.161
ays-0.160
Token too
Token ID1165
Feature activation-4.51e-02

Pos logit contributions
 sorry+1.068
 thank+1.035
 okay+1.012
 pleasure+0.959
 glad+0.919
Neg logit contributions
we-1.494
els-1.336
orns-1.304
les-1.299
ays-1.288
Token.
Token ID13
Feature activation+7.03e-02

Pos logit contributions
 pleasure+0.123
 thank+0.119
 sorry+0.116
 believe+0.113
urry+0.112
Neg logit contributions
we-0.065
els-0.055
les-0.054
ers-0.050
ymes-0.050
Token
Token ID198
Feature activation-1.68e-02

Pos logit contributions
we+0.183
els+0.165
ays+0.164
orns+0.162
ucks+0.161
Neg logit contributions
 sorry-0.131
 thank-0.130
 okay-0.128
 know-0.117
 pleasure-0.117
Token
Token ID198
Feature activation-1.73e-02

Pos logit contributions
 sorry+0.028
 thank+0.027
 okay+0.027
 know+0.023
 glad+0.023
Neg logit contributions
we-0.049
els-0.046
orns-0.046
ays-0.045
ting-0.044
Token"
Token ID1
Feature activation+1.19e-01

Pos logit contributions
 sorry+0.039
 thank+0.038
 okay+0.038
 know+0.035
 glad+0.034
Neg logit contributions
we-0.041
els-0.038
orns-0.037
ays-0.037
les-0.036
Token said
Token ID531
Feature activation-6.82e-02

Pos logit contributions
 pleasure+0.257
 sorry+0.229
 thank+0.229
 okay+0.225
urry+0.225
Neg logit contributions
we-0.228
els-0.218
ers-0.213
les-0.212
ep-0.208
Token.
Token ID13
Feature activation+6.79e-02

Pos logit contributions
 thank+0.164
 pleasure+0.163
 know+0.157
 forget+0.157
 sorry+0.154
Neg logit contributions
we-0.138
els-0.116
ting-0.112
ers-0.111
ays-0.107
Token "
Token ID366
Feature activation-1.36e-01

Pos logit contributions
we+0.196
els+0.177
ays+0.172
ipt+0.170
ucks+0.170
Neg logit contributions
 thank-0.111
 sorry-0.110
 okay-0.108
 pleasure-0.099
 know-0.097
TokenWe
Token ID1135
Feature activation+3.33e-01

Pos logit contributions
 okay+0.351
 sorry+0.342
 thank+0.314
 glad+0.314
 understand+0.310
Neg logit contributions
els-0.259
orns-0.250
les-0.249
ays-0.238
rob-0.238
Token're
Token ID821
Feature activation-3.63e-01

Pos logit contributions
we+0.806
els+0.804
orns+0.761
les+0.747
al+0.739
Neg logit contributions
 sorry-0.580
 thank-0.558
 okay-0.526
 pleasure-0.525
 glad-0.496
Token sorry
Token ID7926
Feature activation-5.59e-01

Pos logit contributions
 thank+0.190
 pleasure+0.132
 hesitate+0.122
urry+0.119
 know+0.107
Neg logit contributions
we-1.371
els-1.325
ting-1.285
ries-1.276
orns-1.273
Token we
Token ID356
Feature activation-1.05e-01

Pos logit contributions
 sorry+0.950
 thank+0.942
 pleasure+0.914
 determined+0.871
 okay+0.865
Neg logit contributions
we-1.536
els-1.350
ting-1.301
les-1.299
ucks-1.294
Token lied
Token ID19837
Feature activation-1.66e-01

Pos logit contributions
 sorry+0.192
 thank+0.189
 okay+0.180
 pleasure+0.175
urry+0.162
Neg logit contributions
we-0.273
els-0.250
orns-0.245
ting-0.245
les-0.242
Token and
Token ID290
Feature activation+1.23e-01

Pos logit contributions
 sorry+0.316
 okay+0.310
 thank+0.308
 know+0.268
 pleasure+0.266
Neg logit contributions
we-0.443
els-0.413
orns-0.409
ays-0.395
ting-0.389
Token broke
Token ID6265
Feature activation+2.25e-02

Pos logit contributions
we+0.321
ting+0.271
els+0.265
ries+0.261
eds+0.256
Neg logit contributions
 okay-0.196
urry-0.193
 pleasure-0.190
 sorry-0.185
 thank-0.174
Token the
Token ID262
Feature activation+1.48e-01

Pos logit contributions
rob+0.058
we+0.057
orns+0.056
ucks+0.054
ipt+0.054
Neg logit contributions
 okay-0.044
 sorry-0.044
 thank-0.038
 worry-0.037
 sad-0.036
Token was
Token ID373
Feature activation+1.56e-01

Pos logit contributions
 sorry+0.064
 thank+0.062
 okay+0.061
 pleasure+0.054
 know+0.053
Neg logit contributions
we-0.098
els-0.090
orns-0.089
ays-0.087
ting-0.086
Token wet
Token ID9583
Feature activation-4.30e-02

Pos logit contributions
we+0.497
els+0.464
orns+0.458
ting+0.448
ers+0.448
Neg logit contributions
 thank-0.206
 sorry-0.196
 okay-0.187
 pleasure-0.175
 know-0.169
Token and
Token ID290
Feature activation+1.20e-01

Pos logit contributions
 sorry+0.081
 okay+0.080
 thank+0.071
 sad+0.066
 tired+0.066
Neg logit contributions
we-0.118
els-0.110
orns-0.108
rob-0.107
ays-0.106
Token cold
Token ID4692
Feature activation-1.71e-01

Pos logit contributions
we+0.344
els+0.313
orns+0.311
ting+0.301
ays+0.299
Neg logit contributions
 thank-0.187
 okay-0.179
 pleasure-0.179
 sorry-0.176
urry-0.167
Token and
Token ID290
Feature activation+1.21e-01

Pos logit contributions
 sorry+0.327
 okay+0.316
 thank+0.307
 tired+0.265
 know+0.262
Neg logit contributions
we-0.460
els-0.434
orns-0.417
ays-0.416
ucks-0.412
Token sorry
Token ID7926
Feature activation-5.59e-01

Pos logit contributions
we+0.352
els+0.322
orns+0.321
ting+0.311
ays+0.310
Neg logit contributions
 thank-0.187
 sorry-0.184
 okay-0.182
 pleasure-0.172
urry-0.162
Token.
Token ID13
Feature activation+6.89e-02

Pos logit contributions
 sorry+1.198
 thank+1.170
 okay+1.160
 pleasure+1.048
 know+1.037
Neg logit contributions
we-1.366
els-1.244
orns-1.227
ays-1.195
ting-1.178
Token He
Token ID679
Feature activation-3.46e-02

Pos logit contributions
we+0.189
els+0.173
orns+0.172
ays+0.169
ucks+0.166
Neg logit contributions
 sorry-0.124
 thank-0.122
 okay-0.120
 pleasure-0.107
 know-0.106
Token hugged
Token ID41130
Feature activation-4.07e-02

Pos logit contributions
 sorry+0.058
 thank+0.057
 okay+0.055
 know+0.048
 glad+0.047
Neg logit contributions
we-0.101
els-0.094
orns-0.093
ays-0.092
ting-0.090
Token Mia
Token ID32189
Feature activation+4.34e-02

Pos logit contributions
 sorry+0.065
 thank+0.063
 okay+0.059
 sad+0.051
 please+0.051
Neg logit contributions
we-0.126
els-0.116
rob-0.116
ays-0.116
ymes-0.114
Token.
Token ID13
Feature activation+6.71e-02

Pos logit contributions
we+0.104
els+0.095
orns+0.094
ays+0.091
ting+0.090
Neg logit contributions
 sorry-0.095
 thank-0.093
 okay-0.093
 pleasure-0.082
 know-0.082
Token
Token ID198
Feature activation-1.99e-02

Pos logit contributions
we+0.195
els+0.180
orns+0.178
ays+0.174
ting+0.172
Neg logit contributions
 sorry-0.118
 thank-0.116
 okay-0.115
 pleasure-0.102
 know-0.100
Token
Token ID198
Feature activation-1.92e-02

Pos logit contributions
 sorry+0.029
 okay+0.028
 thank+0.027
 know+0.024
 glad+0.023
Neg logit contributions
we-0.063
els-0.060
ting-0.059
orns-0.059
ays-0.058
Token"
Token ID1
Feature activation+1.18e-01

Pos logit contributions
 sorry+0.044
 okay+0.043
 thank+0.040
 know+0.039
 please+0.038
Neg logit contributions
we-0.044
els-0.043
ays-0.042
orns-0.042
ries-0.041
TokenI
Token ID40
Feature activation+3.41e-01

Pos logit contributions
els+0.244
orns+0.234
les+0.232
eds+0.230
ries+0.225
Neg logit contributions
 sorry-0.282
 okay-0.281
 thank-0.260
 glad-0.250
 tired-0.250
Token'm
Token ID1101
Feature activation-2.96e-01

Pos logit contributions
we+0.962
les+0.879
els+0.845
en+0.829
ucks+0.825
Neg logit contributions
 sorry-0.511
 pleasure-0.464
 thank-0.455
 glad-0.407
 okay-0.405
Token sorry
Token ID7926
Feature activation-5.59e-01

Pos logit contributions
 thank+0.257
 pleasure+0.230
 forget+0.205
 promise+0.196
 know+0.193
Neg logit contributions
we-1.007
ting-0.949
els-0.941
ays-0.913
ries-0.906
Token,
Token ID11
Feature activation-2.88e-01

Pos logit contributions
 sorry+0.686
 pleasure+0.678
 thank+0.663
 glad+0.648
 determined+0.635
Neg logit contributions
we-1.738
els-1.596
les-1.551
ucks-1.548
ting-1.537
Token Ben
Token ID3932
Feature activation+2.60e-02

Pos logit contributions
 sorry+0.580
 okay+0.569
 thank+0.565
 know+0.533
urry+0.532
Neg logit contributions
we-0.678
els-0.634
orns-0.610
ting-0.608
ays-0.603
Token,
Token ID11
Feature activation-2.88e-01

Pos logit contributions
we+0.069
els+0.063
orns+0.063
les+0.062
ers+0.062
Neg logit contributions
 sorry-0.047
 thank-0.045
 okay-0.045
 pleasure-0.042
 glad-0.040
Token I
Token ID314
Feature activation-1.10e-01

Pos logit contributions
 sorry+0.493
 okay+0.475
 thank+0.472
 pleasure+0.452
 know+0.450
Neg logit contributions
we-0.772
els-0.694
ting-0.685
orns-0.673
ays-0.661
Token didn
Token ID1422
Feature activation-2.22e-01

Pos logit contributions
 sorry+0.182
 thank+0.181
 pleasure+0.171
 glad+0.155
 okay+0.148
Neg logit contributions
we-0.301
les-0.265
els-0.258
ucks-0.252
orns-0.252
Token.
Token ID13
Feature activation+6.87e-02

Pos logit contributions
we+0.961
els+0.869
ers+0.854
orns+0.851
ting+0.839
Neg logit contributions
 thank-0.842
 sorry-0.837
 okay-0.778
 pleasure-0.778
urry-0.777
Token She
Token ID1375
Feature activation-4.24e-02

Pos logit contributions
we+0.154
ays+0.150
ucks+0.146
els+0.144
orns+0.144
Neg logit contributions
 thank-0.135
 sorry-0.134
 okay-0.133
 believe-0.128
 pleasure-0.125
Token said
Token ID531
Feature activation-5.88e-02

Pos logit contributions
 sorry+0.068
 thank+0.066
 okay+0.065
 know+0.056
 pleasure+0.056
Neg logit contributions
we-0.126
els-0.118
orns-0.116
ays-0.114
ting-0.112
Token she
Token ID673
Feature activation-1.49e-01

Pos logit contributions
 pleasure+0.095
 okay+0.093
 sorry+0.093
 thank+0.092
urry+0.092
Neg logit contributions
we-0.162
els-0.148
ting-0.143
orns-0.143
ries-0.142
Token was
Token ID373
Feature activation+1.61e-01

Pos logit contributions
 sorry+0.286
 thank+0.278
 okay+0.270
 pleasure+0.248
urry+0.237
Neg logit contributions
we-0.404
els-0.364
orns-0.361
les-0.349
ays-0.348
Token sorry
Token ID7926
Feature activation-5.59e-01

Pos logit contributions
we+0.598
els+0.571
ers+0.567
ting+0.556
orns+0.555
Neg logit contributions
 thank-0.080
urry-0.073
 pleasure-0.067
 know-0.061
 worry-0.059
Token.
Token ID13
Feature activation+6.98e-02

Pos logit contributions
 sorry+1.119
 thank+1.100
 okay+1.063
 pleasure+1.034
 know+0.990
Neg logit contributions
we-1.433
els-1.277
orns-1.244
ays-1.219
ting-1.212
Token She
Token ID1375
Feature activation-4.16e-02

Pos logit contributions
we+0.162
ays+0.153
ucks+0.150
orns+0.149
els+0.148
Neg logit contributions
 thank-0.136
 sorry-0.134
 okay-0.133
 believe-0.127
 pleasure-0.125
Token said
Token ID531
Feature activation-5.89e-02

Pos logit contributions
 sorry+0.068
 thank+0.066
 okay+0.065
 pleasure+0.056
 know+0.056
Neg logit contributions
we-0.123
els-0.114
orns-0.113
ays-0.110
ting-0.109
Token they
Token ID484
Feature activation-1.07e-01

Pos logit contributions
 pleasure+0.095
 okay+0.093
 sorry+0.093
urry+0.092
 thank+0.092
Neg logit contributions
we-0.162
els-0.146
ting-0.143
orns-0.142
ries-0.142
Token should
Token ID815
Feature activation-1.91e-03

Pos logit contributions
 sorry+0.205
 thank+0.199
 okay+0.189
 pleasure+0.187
urry+0.175
Neg logit contributions
we-0.279
els-0.250
orns-0.249
les-0.244
ers-0.238
Token.
Token ID13
Feature activation+7.16e-02

Pos logit contributions
we+0.422
els+0.384
orns+0.376
ers+0.365
ays+0.364
Neg logit contributions
 sorry-0.422
 thank-0.418
 okay-0.402
 pleasure-0.384
urry-0.374
Token They
Token ID1119
Feature activation+3.16e-02

Pos logit contributions
we+0.157
ays+0.146
ucks+0.144
els+0.144
ipt+0.139
Neg logit contributions
 sorry-0.150
 thank-0.150
 okay-0.148
urry-0.139
 know-0.137
Token said
Token ID531
Feature activation-5.76e-02

Pos logit contributions
we+0.072
els+0.067
les+0.066
orns+0.066
ers+0.064
Neg logit contributions
 sorry-0.063
 okay-0.060
 pleasure-0.059
 thank-0.058
urry-0.058
Token they
Token ID484
Feature activation-1.06e-01

Pos logit contributions
 pleasure+0.107
urry+0.107
 know+0.102
 believe+0.101
 forget+0.096
Neg logit contributions
we-0.141
ries-0.124
els-0.123
ting-0.116
ays-0.114
Token were
Token ID547
Feature activation-4.43e-02

Pos logit contributions
 sorry+0.204
 thank+0.198
 pleasure+0.187
 okay+0.185
urry+0.175
Neg logit contributions
we-0.272
els-0.240
orns-0.238
les-0.237
ers-0.229
Token sorry
Token ID7926
Feature activation-5.59e-01

Pos logit contributions
 thank+0.021
 pleasure+0.016
 know+0.015
urry+0.014
 sorry+0.013
Neg logit contributions
we-0.171
els-0.166
orns-0.161
ers-0.161
ting-0.159
Token.
Token ID13
Feature activation+7.13e-02

Pos logit contributions
 sorry+1.239
 thank+1.219
 okay+1.178
 pleasure+1.168
 know+1.110
Neg logit contributions
we-1.312
els-1.145
orns-1.095
ays-1.090
ucks-1.086
Token They
Token ID1119
Feature activation+3.16e-02

Pos logit contributions
we+0.165
ays+0.151
els+0.149
ucks+0.149
orns+0.145
Neg logit contributions
 sorry-0.146
 thank-0.145
 okay-0.143
urry-0.133
 know-0.132
Token said
Token ID531
Feature activation-5.73e-02

Pos logit contributions
we+0.077
els+0.071
orns+0.070
les+0.070
ers+0.068
Neg logit contributions
 sorry-0.061
 okay-0.058
 pleasure-0.056
 thank-0.056
urry-0.054
Token they
Token ID484
Feature activation-1.06e-01

Pos logit contributions
urry+0.105
 pleasure+0.104
 know+0.102
 believe+0.099
 okay+0.095
Neg logit contributions
we-0.143
ries-0.124
els-0.124
ting-0.119
ays-0.116
Token would
Token ID561
Feature activation-6.50e-02

Pos logit contributions
 sorry+0.201
 thank+0.195
 pleasure+0.183
 okay+0.182
urry+0.171
Neg logit contributions
we-0.275
els-0.243
orns-0.243
les-0.241
ers-0.233
Token,
Token ID11
Feature activation-2.83e-01

Pos logit contributions
we+0.176
ting+0.166
ers+0.163
eds+0.151
ucks+0.151
Neg logit contributions
 sorry-0.114
 thank-0.112
 determined-0.110
 miss-0.110
 pleasure-0.109
Token mom
Token ID1995
Feature activation-2.00e-01

Pos logit contributions
 okay+0.520
 sorry+0.517
 know+0.497
urry+0.496
 thank+0.487
Neg logit contributions
we-0.643
els-0.603
ting-0.592
orns-0.586
ays-0.577
Token.
Token ID13
Feature activation+7.19e-02

Pos logit contributions
 sorry+0.379
 okay+0.377
 thank+0.374
 know+0.331
 glad+0.320
Neg logit contributions
we-0.533
els-0.495
orns-0.485
ays-0.477
rob-0.468
Token I
Token ID314
Feature activation-1.00e-01

Pos logit contributions
we+0.186
els+0.169
orns+0.169
ays+0.166
ucks+0.165
Neg logit contributions
 sorry-0.136
 okay-0.133
 thank-0.133
 know-0.119
 worry-0.117
Token am
Token ID716
Feature activation-1.31e-01

Pos logit contributions
 sorry+0.167
 thank+0.146
 okay+0.139
 pleasure+0.132
urry+0.131
Neg logit contributions
we-0.278
ers-0.250
les-0.250
orns-0.246
ucks-0.241
Token sorry
Token ID7926
Feature activation-5.59e-01

Pos logit contributions
urry+0.092
 tum+0.092
 thank+0.090
 worry+0.083
 hesitate+0.079
Neg logit contributions
we-0.446
ers-0.421
els-0.409
ting-0.409
ep-0.401
Token,
Token ID11
Feature activation-2.83e-01

Pos logit contributions
 sorry+1.212
 thank+1.198
 okay+1.153
 pleasure+1.143
 know+1.123
Neg logit contributions
we-1.298
els-1.097
ting-1.084
orns-1.082
ays-1.080
Token bee
Token ID20697
Feature activation+1.21e-01

Pos logit contributions
 sorry+0.462
 thank+0.445
 okay+0.439
 know+0.427
urry+0.420
Neg logit contributions
we-0.755
els-0.693
orns-0.687
ting-0.683
ays-0.670
Token.
Token ID13
Feature activation+7.29e-02

Pos logit contributions
we+0.301
els+0.279
orns+0.273
ays+0.266
ting+0.261
Neg logit contributions
 sorry-0.253
 thank-0.247
 okay-0.245
 know-0.216
 pleasure-0.216
Token I
Token ID314
Feature activation-1.02e-01

Pos logit contributions
we+0.201
orns+0.183
els+0.182
ays+0.180
ucks+0.179
Neg logit contributions
 sorry-0.123
 okay-0.121
 thank-0.119
 know-0.107
 worry-0.105
Token did
Token ID750
Feature activation-1.05e-01

Pos logit contributions
 sorry+0.187
 thank+0.173
 okay+0.167
 pleasure+0.161
 glad+0.156
Neg logit contributions
we-0.273
orns-0.244
les-0.243
els-0.238
ers-0.238
Token.
Token ID13
Feature activation+6.74e-02

Pos logit contributions
 sorry+0.257
 thank+0.252
 okay+0.250
 know+0.218
 glad+0.208
Neg logit contributions
we-0.386
orns-0.371
els-0.368
ays-0.364
rob-0.350
Token They
Token ID1119
Feature activation+3.07e-02

Pos logit contributions
we+0.177
els+0.163
orns+0.161
ays+0.157
ting+0.155
Neg logit contributions
 sorry-0.133
 thank-0.130
 okay-0.129
 pleasure-0.114
 know-0.113
Token said
Token ID531
Feature activation-5.77e-02

Pos logit contributions
we+0.070
els+0.066
orns+0.066
ers+0.064
les+0.064
Neg logit contributions
 sorry-0.061
 okay-0.058
 pleasure-0.057
 thank-0.056
urry-0.054
Token they
Token ID484
Feature activation-1.06e-01

Pos logit contributions
 pleasure+0.094
 forget+0.088
 know+0.088
urry+0.086
 believe+0.086
Neg logit contributions
we-0.153
els-0.138
ries-0.133
ays-0.132
ting-0.132
Token were
Token ID547
Feature activation-4.41e-02

Pos logit contributions
 sorry+0.199
 thank+0.195
 okay+0.180
 pleasure+0.177
urry+0.164
Neg logit contributions
we-0.284
orns-0.255
els-0.254
les-0.246
ting-0.243
Token sorry
Token ID7926
Feature activation-5.59e-01

Pos logit contributions
 thank+0.037
 sorry+0.031
 pleasure+0.030
 okay+0.029
 know+0.029
Neg logit contributions
we-0.158
els-0.150
orns-0.148
ers-0.147
ting-0.146
Token to
Token ID284
Feature activation-1.05e-01

Pos logit contributions
 sorry+1.222
 thank+1.198
 okay+1.161
 pleasure+1.136
 know+1.079
Neg logit contributions
we-1.347
els-1.180
orns-1.132
ays-1.124
ucks-1.122
Token their
Token ID511
Feature activation+1.54e-01

Pos logit contributions
 sorry+0.242
 okay+0.236
 thank+0.235
 pleasure+0.217
 glad+0.211
Neg logit contributions
we-0.235
els-0.214
orns-0.207
ting-0.201
ays-0.201
Token mom
Token ID1995
Feature activation-2.03e-01

Pos logit contributions
we+0.461
els+0.438
ucks+0.421
ays+0.414
orns+0.413
Neg logit contributions
 okay-0.220
 sorry-0.219
 thank-0.213
 pleasure-0.187
urry-0.184
Token and
Token ID290
Feature activation+1.25e-01

Pos logit contributions
 sorry+0.462
 thank+0.452
 okay+0.448
 pleasure+0.406
 know+0.403
Neg logit contributions
we-0.469
els-0.427
orns-0.420
ays-0.408
ting-0.402
Token dad
Token ID9955
Feature activation-1.97e-01

Pos logit contributions
we+0.299
els+0.277
ipt+0.268
ting+0.266
ucks+0.266
Neg logit contributions
 okay-0.206
 pleasure-0.202
 sorry-0.201
urry-0.198
 thank-0.189
Token said
Token ID531
Feature activation-6.00e-02

Pos logit contributions
 sorry+0.046
 thank+0.045
 okay+0.044
 know+0.038
 pleasure+0.037
Neg logit contributions
we-0.082
els-0.076
orns-0.075
ays-0.074
ting-0.073
Token,
Token ID11
Feature activation-2.90e-01

Pos logit contributions
 pleasure+0.085
 forget+0.082
 know+0.071
 miss+0.070
 age+0.070
Neg logit contributions
we-0.174
els-0.160
ers-0.158
med-0.151
ries-0.151
Token "
Token ID366
Feature activation-1.32e-01

Pos logit contributions
 pleasure+0.467
 thank+0.464
 sorry+0.451
 okay+0.450
 know+0.449
Neg logit contributions
we-0.807
ting-0.720
els-0.707
orns-0.676
ays-0.675
TokenI
Token ID40
Feature activation+3.44e-01

Pos logit contributions
 okay+0.316
 sorry+0.287
 worry+0.273
 please+0.268
 thank+0.267
Neg logit contributions
els-0.302
orns-0.282
ays-0.282
ries-0.280
les-0.277
Token'm
Token ID1101
Feature activation-2.92e-01

Pos logit contributions
we+0.905
les+0.801
ucks+0.799
els+0.770
orns+0.759
Neg logit contributions
 sorry-0.523
 pleasure-0.504
 thank-0.472
 glad-0.436
 determined-0.413
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 pleasure+0.168
 thank+0.166
 hesitate+0.145
 forget+0.144
urry+0.143
Neg logit contributions
we-1.031
ting-0.971
els-0.964
 yourselves-0.952
ep-0.944
Token,
Token ID11
Feature activation-2.89e-01

Pos logit contributions
 pleasure+0.737
 thank+0.736
 sorry+0.719
 determined+0.701
 glad+0.684
Neg logit contributions
we-1.676
els-1.522
ucks-1.496
ting-1.475
ays-1.471
Token Mom
Token ID11254
Feature activation-1.21e-01

Pos logit contributions
 sorry+0.454
 thank+0.439
 okay+0.437
urry+0.419
 know+0.410
Neg logit contributions
we-0.781
els-0.727
ays-0.701
orns-0.699
ting-0.696
Token.
Token ID13
Feature activation+6.86e-02

Pos logit contributions
 sorry+0.275
 thank+0.273
 okay+0.265
 pleasure+0.252
urry+0.244
Neg logit contributions
we-0.271
els-0.243
orns-0.233
ting-0.230
ays-0.230
Token I
Token ID314
Feature activation-1.07e-01

Pos logit contributions
we+0.180
ays+0.161
els+0.160
ucks+0.159
orns+0.157
Neg logit contributions
 okay-0.118
 sorry-0.115
 thank-0.114
 determined-0.107
 know-0.105
Token didn
Token ID1422
Feature activation-2.18e-01

Pos logit contributions
 sorry+0.209
 thank+0.203
 pleasure+0.190
 okay+0.185
 glad+0.178
Neg logit contributions
we-0.271
les-0.235
els-0.233
ucks-0.230
orns-0.229
Token\n
Token ID198
Feature activation-2.41e-02

Pos logit contributions
we+0.270
els+0.246
orns+0.246
ays+0.240
ucks+0.237
Neg logit contributions
 sorry-0.186
 thank-0.184
 okay-0.183
 pleasure-0.164
 know-0.157
Token\n
Token ID198
Feature activation-2.45e-02

Pos logit contributions
 sorry+0.033
 thank+0.032
 okay+0.031
 know+0.029
 glad+0.027
Neg logit contributions
we-0.078
els-0.075
ting-0.073
ays-0.073
orns-0.073
TokenL
Token ID43
Feature activation+3.26e-01

Pos logit contributions
 sorry+0.060
 okay+0.059
 thank+0.057
 know+0.054
 glad+0.053
Neg logit contributions
ays-0.053
we-0.052
els-0.051
ries-0.050
ting-0.049
Tokenily
Token ID813
Feature activation+1.60e-01

Pos logit contributions
we+0.845
ting+0.785
els+0.772
ays+0.754
ning+0.751
Neg logit contributions
 sorry-0.644
 okay-0.629
 thank-0.612
 know-0.574
urry-0.551
Token felt
Token ID2936
Feature activation-4.41e-02

Pos logit contributions
we+0.429
els+0.396
orns+0.388
ays+0.379
les+0.377
Neg logit contributions
 sorry-0.300
 thank-0.293
 okay-0.290
 pleasure-0.264
 know-0.253
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 pleasure+0.029
urry+0.028
 thank+0.027
 okay+0.024
 age+0.021
Neg logit contributions
we-0.154
orns-0.149
ting-0.148
els-0.148
les-0.148
Token for
Token ID329
Feature activation-1.39e-01

Pos logit contributions
 sorry+1.090
 thank+1.083
 pleasure+1.059
 okay+1.049
urry+0.967
Neg logit contributions
we-1.407
els-1.266
les-1.248
ting-1.236
orns-1.233
Token Ben
Token ID3932
Feature activation+2.61e-02

Pos logit contributions
 sorry+0.303
 thank+0.298
 okay+0.295
 pleasure+0.268
 know+0.266
Neg logit contributions
we-0.331
els-0.302
orns-0.296
ays-0.287
ting-0.286
Token.
Token ID13
Feature activation+6.97e-02

Pos logit contributions
we+0.051
els+0.046
ting+0.045
orns+0.045
ers+0.044
Neg logit contributions
 thank-0.063
 sorry-0.062
 pleasure-0.061
 okay-0.059
urry-0.056
Token She
Token ID1375
Feature activation-4.11e-02

Pos logit contributions
we+0.184
els+0.170
orns+0.168
ays+0.166
ting+0.164
Neg logit contributions
 sorry-0.129
 thank-0.129
 okay-0.128
 pleasure-0.115
 know-0.112
Token said
Token ID531
Feature activation-6.17e-02

Pos logit contributions
 sorry+0.069
 okay+0.066
 thank+0.066
 pleasure+0.062
urry+0.058
Neg logit contributions
we-0.116
els-0.107
orns-0.106
les-0.104
ting-0.103
Token.
Token ID13
Feature activation+6.45e-02

Pos logit contributions
 sorry+0.266
 thank+0.262
 okay+0.250
 pleasure+0.248
urry+0.234
Neg logit contributions
we-0.248
els-0.217
orns-0.207
ers-0.205
ays-0.202
Token They
Token ID1119
Feature activation+3.00e-02

Pos logit contributions
we+0.172
els+0.157
ays+0.157
ucks+0.155
orns+0.154
Neg logit contributions
 thank-0.112
 sorry-0.112
 okay-0.109
 pleasure-0.099
 know-0.098
Token said
Token ID531
Feature activation-6.15e-02

Pos logit contributions
els+0.066
we+0.066
les+0.062
ers+0.060
orns+0.060
Neg logit contributions
 sorry-0.059
 pleasure-0.059
 okay-0.058
urry-0.057
 thank-0.054
Token they
Token ID484
Feature activation-1.07e-01

Pos logit contributions
 pleasure+0.087
 forget+0.079
urry+0.078
 permit+0.076
 know+0.073
Neg logit contributions
we-0.176
els-0.169
ries-0.165
ting-0.158
ers-0.158
Token were
Token ID547
Feature activation-4.70e-02

Pos logit contributions
 sorry+0.203
 thank+0.201
 okay+0.191
 pleasure+0.190
urry+0.180
Neg logit contributions
we-0.276
els-0.257
les-0.246
orns-0.246
ays-0.240
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 thank+0.025
 pleasure+0.021
urry+0.017
 know+0.017
 okay+0.014
Neg logit contributions
we-0.181
els-0.178
ers-0.170
ting-0.169
orns-0.168
Token.
Token ID13
Feature activation+6.55e-02

Pos logit contributions
 sorry+1.208
 thank+1.193
 okay+1.155
 pleasure+1.114
 know+1.069
Neg logit contributions
we-1.355
els-1.210
orns-1.166
ays-1.151
ting-1.145
Token Mama
Token ID40084
Feature activation-1.31e-01

Pos logit contributions
we+0.169
ays+0.152
els+0.152
ucks+0.151
orns+0.150
Neg logit contributions
 thank-0.120
 sorry-0.120
 okay-0.116
 pleasure-0.107
 know-0.107
Token hugged
Token ID41130
Feature activation-3.59e-02

Pos logit contributions
 sorry+0.224
 thank+0.219
 okay+0.213
 know+0.185
 glad+0.179
Neg logit contributions
we-0.377
orns-0.354
els-0.351
ays-0.344
ting-0.337
Token them
Token ID606
Feature activation+4.09e-01

Pos logit contributions
 sorry+0.084
 okay+0.081
 thank+0.080
 sad+0.071
 please+0.069
Neg logit contributions
we-0.088
rob-0.083
web-0.081
els-0.078
orns-0.078
Token.
Token ID13
Feature activation+6.44e-02

Pos logit contributions
we+0.957
els+0.873
orns+0.857
ays+0.836
ting+0.828
Neg logit contributions
 sorry-0.909
 thank-0.890
 okay-0.879
 pleasure-0.801
 know-0.794
TokenAnna
Token ID31160
Feature activation+4.11e-01

Pos logit contributions
 sorry+0.046
 okay+0.046
 thank+0.044
 please+0.041
 know+0.041
Neg logit contributions
we-0.041
ays-0.039
els-0.039
ries-0.038
orns-0.037
Token and
Token ID290
Feature activation+1.24e-01

Pos logit contributions
we+1.056
els+0.971
orns+0.960
ays+0.934
ting+0.920
Neg logit contributions
 sorry-0.828
 thank-0.808
 okay-0.801
 know-0.710
 pleasure-0.706
Token Ben
Token ID3932
Feature activation+2.74e-02

Pos logit contributions
we+0.341
els+0.312
orns+0.310
ays+0.303
ucks+0.296
Neg logit contributions
 sorry-0.234
 thank-0.228
 okay-0.224
 know-0.196
 pleasure-0.195
Token were
Token ID547
Feature activation-4.16e-02

Pos logit contributions
we+0.070
els+0.064
ting+0.063
les+0.063
ers+0.062
Neg logit contributions
 sorry-0.052
 thank-0.050
 okay-0.050
 pleasure-0.048
urry-0.045
Token very
Token ID845
Feature activation-2.23e-01

Pos logit contributions
 thank+0.046
 pleasure+0.042
urry+0.038
 okay+0.035
 please+0.034
Neg logit contributions
we-0.133
els-0.129
ers-0.128
ep-0.126
orns-0.124
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 pleasure+0.333
 thank+0.288
're+0.279
 Once+0.271
 yes+0.261
Neg logit contributions
we-0.587
ep-0.582
ucks-0.570
els-0.568
ins-0.566
Token.
Token ID13
Feature activation+7.08e-02

Pos logit contributions
 sorry+1.263
 thank+1.242
 okay+1.218
 pleasure+1.159
 know+1.110
Neg logit contributions
we-1.291
els-1.154
orns-1.122
ays-1.104
ting-1.093
Token They
Token ID1119
Feature activation+3.60e-02

Pos logit contributions
we+0.199
els+0.185
orns+0.184
ays+0.176
les+0.174
Neg logit contributions
 sorry-0.131
 thank-0.126
 okay-0.124
 know-0.109
 pleasure-0.109
Token had
Token ID550
Feature activation-6.39e-02

Pos logit contributions
we+0.084
els+0.081
les+0.078
orns+0.078
ers+0.077
Neg logit contributions
 sorry-0.068
 okay-0.068
 pleasure-0.067
 thank-0.065
urry-0.064
Token not
Token ID407
Feature activation-3.39e-01

Pos logit contributions
 sorry+0.119
 thank+0.116
 okay+0.111
 pleasure+0.106
urry+0.105
Neg logit contributions
we-0.167
orns-0.153
els-0.153
ays-0.146
ers-0.145
Token meant
Token ID4001
Feature activation+4.55e-02

Pos logit contributions
 sorry+0.591
 thank+0.570
 pleasure+0.540
 okay+0.530
urry+0.496
Neg logit contributions
we-0.885
orns-0.835
ers-0.821
els-0.820
ep-0.789
TokenL
Token ID43
Feature activation+3.31e-01

Pos logit contributions
 sorry+0.054
 thank+0.054
 okay+0.054
 please+0.049
 glad+0.048
Neg logit contributions
we-0.043
els-0.039
ting-0.038
orns-0.038
ays-0.038
Tokenily
Token ID813
Feature activation+1.62e-01

Pos logit contributions
we+0.863
ting+0.782
els+0.781
rob+0.763
orns+0.751
Neg logit contributions
 okay-0.672
 sorry-0.640
 thank-0.618
urry-0.575
 please-0.554
Token and
Token ID290
Feature activation+1.22e-01

Pos logit contributions
we+0.390
els+0.357
orns+0.352
ays+0.342
ting+0.337
Neg logit contributions
 sorry-0.353
 thank-0.346
 okay-0.344
 know-0.307
 pleasure-0.302
Token Ben
Token ID3932
Feature activation+2.86e-02

Pos logit contributions
we+0.359
els+0.332
orns+0.329
ays+0.321
ting+0.317
Neg logit contributions
 sorry-0.207
 thank-0.201
 okay-0.199
 know-0.170
 pleasure-0.169
Token were
Token ID547
Feature activation-4.55e-02

Pos logit contributions
we+0.071
els+0.065
les+0.065
ting+0.064
ers+0.063
Neg logit contributions
 sorry-0.054
 thank-0.051
 okay-0.051
 pleasure-0.051
urry-0.048
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 thank+0.040
 pleasure+0.036
urry+0.031
 please+0.028
 okay+0.026
Neg logit contributions
we-0.157
els-0.152
ers-0.148
ep-0.147
orns-0.147
Token too
Token ID1165
Feature activation-4.70e-02

Pos logit contributions
 sorry+1.300
 thank+1.294
 pleasure+1.271
 okay+1.240
 know+1.177
Neg logit contributions
we-1.215
els-1.055
ays-1.001
ucks-0.999
les-0.994
Token.
Token ID13
Feature activation+7.13e-02

Pos logit contributions
 pleasure+0.128
 thank+0.125
 sorry+0.117
 believe+0.113
 please+0.113
Neg logit contributions
we-0.076
els-0.069
ting-0.066
les-0.063
ays-0.062
Token They
Token ID1119
Feature activation+3.16e-02

Pos logit contributions
we+0.175
els+0.160
ays+0.157
orns+0.157
ucks+0.154
Neg logit contributions
 sorry-0.148
 thank-0.145
 okay-0.144
 pleasure-0.129
 know-0.128
Token said
Token ID531
Feature activation-5.74e-02

Pos logit contributions
we+0.071
els+0.067
les+0.067
orns+0.066
ting+0.065
Neg logit contributions
 sorry-0.061
 pleasure-0.060
 okay-0.059
 thank-0.057
urry-0.057
Token they
Token ID484
Feature activation-1.06e-01

Pos logit contributions
 pleasure+0.102
urry+0.100
 know+0.098
 believe+0.097
 forget+0.093
Neg logit contributions
we-0.146
els-0.129
ries-0.128
ting-0.124
ays-0.123
Token is
Token ID318
Feature activation+1.95e-01

Pos logit contributions
 sorry+0.053
 thank+0.052
 okay+0.051
 pleasure+0.046
 know+0.046
Neg logit contributions
we-0.060
els-0.055
orns-0.054
ays-0.053
ting-0.052
Token okay
Token ID8788
Feature activation-5.18e-01

Pos logit contributions
we+0.628
els+0.596
ers+0.589
orns+0.577
ting+0.573
Neg logit contributions
 thank-0.214
 pleasure-0.186
urry-0.180
 sorry-0.171
 know-0.169
Token.
Token ID13
Feature activation+6.43e-02

Pos logit contributions
 sorry+1.213
 thank+1.210
 okay+1.144
 pleasure+1.118
 know+1.070
Neg logit contributions
we-1.158
els-1.039
orns-0.988
ting-0.957
ays-0.955
Token He
Token ID679
Feature activation-2.77e-02

Pos logit contributions
we+0.167
els+0.153
orns+0.152
ays+0.150
ucks+0.149
Neg logit contributions
 sorry-0.119
 thank-0.116
 okay-0.116
 know-0.103
 pleasure-0.103
Token is
Token ID318
Feature activation+1.93e-01

Pos logit contributions
 sorry+0.059
 thank+0.058
 okay+0.057
 pleasure+0.051
 know+0.051
Neg logit contributions
we-0.067
els-0.062
orns-0.061
ays-0.059
ting-0.058
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
we+0.611
els+0.579
ers+0.567
orns+0.565
ting+0.554
Neg logit contributions
 thank-0.227
 pleasure-0.198
 sorry-0.197
urry-0.190
 okay-0.189
Token.
Token ID13
Feature activation+6.68e-02

Pos logit contributions
 sorry+1.253
 thank+1.237
 okay+1.201
 pleasure+1.197
 know+1.132
Neg logit contributions
we-1.270
els-1.104
orns-1.060
ays-1.044
les-1.042
Token He
Token ID679
Feature activation-2.76e-02

Pos logit contributions
we+0.183
els+0.169
orns+0.168
ays+0.164
ucks+0.162
Neg logit contributions
 sorry-0.120
 thank-0.117
 okay-0.116
 pleasure-0.102
 know-0.102
Token gives
Token ID3607
Feature activation+4.47e-02

Pos logit contributions
 sorry+0.058
 thank+0.056
 okay+0.056
 pleasure+0.050
 know+0.050
Neg logit contributions
we-0.069
els-0.063
orns-0.062
ays-0.060
ting-0.060
Token Tom
Token ID4186
Feature activation-4.31e-03

Pos logit contributions
we+0.143
rob+0.132
ymes+0.131
web+0.130
ays+0.129
Neg logit contributions
 okay-0.057
 sorry-0.056
 thank-0.052
 help-0.044
 tired-0.044
Token his
Token ID465
Feature activation+1.59e-01

Pos logit contributions
 sorry+0.008
 okay+0.008
 thank+0.008
 know+0.007
 tired+0.006
Neg logit contributions
we-0.012
els-0.011
orns-0.011
ays-0.011
rob-0.011
Token\n
Token ID198
Feature activation-1.68e-02

Pos logit contributions
we+0.183
els+0.165
ays+0.164
orns+0.162
ucks+0.161
Neg logit contributions
 sorry-0.131
 thank-0.130
 okay-0.128
 know-0.117
 pleasure-0.117
Token\n
Token ID198
Feature activation-1.73e-02

Pos logit contributions
 sorry+0.028
 thank+0.027
 okay+0.027
 know+0.023
 glad+0.023
Neg logit contributions
we-0.049
els-0.046
orns-0.046
ays-0.045
ting-0.044
Token"
Token ID1
Feature activation+1.19e-01

Pos logit contributions
 sorry+0.039
 thank+0.038
 okay+0.038
 know+0.035
 glad+0.034
Neg logit contributions
we-0.041
els-0.038
orns-0.037
ays-0.037
les-0.036
TokenI
Token ID40
Feature activation+3.46e-01

Pos logit contributions
els+0.254
we+0.243
ays+0.243
orns+0.240
eds+0.236
Neg logit contributions
 thank-0.290
 sorry-0.288
 okay-0.287
 please-0.258
 glad-0.253
Token'm
Token ID1101
Feature activation-2.87e-01

Pos logit contributions
we+0.822
ucks+0.758
les+0.730
ymes+0.675
le+0.672
Neg logit contributions
 sorry-0.533
 pleasure-0.481
 thank-0.472
 Once-0.452
 determined-0.441
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 hesitate+0.163
 thank+0.154
 plan+0.147
 dare+0.137
 desire+0.134
Neg logit contributions
we-0.998
 yourselves-0.933
ting-0.933
ays-0.912
els-0.908
Token,
Token ID11
Feature activation-2.87e-01

Pos logit contributions
 sorry+0.708
 pleasure+0.696
 determined+0.677
 thank+0.671
 hardly+0.664
Neg logit contributions
we-1.690
els-1.564
ucks-1.511
les-1.504
ries-1.503
Token Ben
Token ID3932
Feature activation+2.87e-02

Pos logit contributions
 sorry+0.568
 okay+0.551
 thank+0.550
urry+0.540
 know+0.537
Neg logit contributions
we-0.652
els-0.622
ting-0.595
orns-0.589
ays-0.589
Token.
Token ID13
Feature activation+7.10e-02

Pos logit contributions
we+0.070
els+0.063
orns+0.063
les+0.062
ers+0.061
Neg logit contributions
 sorry-0.059
 thank-0.057
 okay-0.056
 pleasure-0.053
urry-0.051
Token I
Token ID314
Feature activation-1.08e-01

Pos logit contributions
we+0.196
ays+0.176
els+0.175
orns+0.175
ucks+0.173
Neg logit contributions
 sorry-0.117
 okay-0.116
 thank-0.116
 know-0.106
 determined-0.104
Token wanted
Token ID2227
Feature activation-7.35e-02

Pos logit contributions
 sorry+0.207
 thank+0.201
 pleasure+0.184
 okay+0.183
 glad+0.172
Neg logit contributions
we-0.278
les-0.241
els-0.241
orns-0.239
ucks-0.239
Token.
Token ID13
Feature activation+6.84e-02

Pos logit contributions
 sorry+0.237
 thank+0.231
 okay+0.227
 pleasure+0.210
 know+0.204
Neg logit contributions
we-0.256
els-0.231
orns-0.225
ays-0.219
ting-0.216
Token Anna
Token ID11735
Feature activation+1.22e-01

Pos logit contributions
we+0.165
els+0.151
ays+0.150
orns+0.146
ucks+0.146
Neg logit contributions
 sorry-0.141
 thank-0.140
 okay-0.139
 pleasure-0.128
 know-0.124
Token and
Token ID290
Feature activation+1.26e-01

Pos logit contributions
we+0.269
els+0.247
orns+0.240
ays+0.237
les+0.234
Neg logit contributions
 sorry-0.284
 thank-0.277
 okay-0.277
 pleasure-0.257
 know-0.246
Token Ben
Token ID3932
Feature activation+3.08e-02

Pos logit contributions
we+0.331
els+0.305
orns+0.300
ays+0.293
ting+0.289
Neg logit contributions
 sorry-0.249
 thank-0.243
 okay-0.241
 pleasure-0.214
 know-0.212
Token are
Token ID389
Feature activation-9.21e-02

Pos logit contributions
els+0.080
les+0.079
we+0.078
ting+0.076
ers+0.074
Neg logit contributions
 pleasure-0.048
 sorry-0.045
 okay-0.045
urry-0.042
 thank-0.040
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 thank+0.131
 pleasure+0.120
 okay+0.108
 sorry+0.107
urry+0.105
Neg logit contributions
we-0.287
els-0.272
ers-0.263
ting-0.257
orns-0.255
Token.
Token ID13
Feature activation+6.85e-02

Pos logit contributions
 sorry+1.270
 thank+1.261
 okay+1.227
 pleasure+1.201
 know+1.132
Neg logit contributions
we-1.275
els-1.127
orns-1.076
ays-1.071
ting-1.067
Token They
Token ID1119
Feature activation+3.71e-02

Pos logit contributions
we+0.182
els+0.167
orns+0.165
ays+0.161
ting+0.159
Neg logit contributions
 sorry-0.133
 thank-0.130
 okay-0.129
 pleasure-0.114
 know-0.113
Token nod
Token ID18666
Feature activation-3.53e-01

Pos logit contributions
els+0.093
les+0.092
orns+0.087
ting+0.087
ers+0.087
Neg logit contributions
 okay-0.058
 pleasure-0.057
 sorry-0.053
urry-0.053
 glad-0.048
Token.
Token ID13
Feature activation+6.89e-02

Pos logit contributions
 sorry+0.766
 thank+0.748
 okay+0.745
 know+0.663
 pleasure+0.654
Neg logit contributions
we-0.862
els-0.791
orns-0.785
ays-0.759
ting-0.745
Token They
Token ID1119
Feature activation+3.74e-02

Pos logit contributions
we+0.180
els+0.164
orns+0.162
ays+0.160
ucks+0.156
Neg logit contributions
 sorry-0.135
 thank-0.132
 okay-0.131
 pleasure-0.117
 know-0.115
Token\n
Token ID198
Feature activation-1.86e-02

Pos logit contributions
we+0.185
els+0.169
orns+0.167
ays+0.165
ucks+0.162
Neg logit contributions
 sorry-0.139
 thank-0.135
 okay-0.135
 pleasure-0.119
 know-0.118
Token\n
Token ID198
Feature activation-1.85e-02

Pos logit contributions
 sorry+0.032
 thank+0.031
 okay+0.031
 know+0.027
 pleasure+0.027
Neg logit contributions
we-0.053
els-0.049
orns-0.049
ays-0.048
ting-0.047
Token"
Token ID1
Feature activation+1.21e-01

Pos logit contributions
 sorry+0.040
 thank+0.039
 okay+0.038
 know+0.035
 glad+0.035
Neg logit contributions
we-0.045
orns-0.043
els-0.042
ays-0.041
ting-0.040
TokenI
Token ID40
Feature activation+3.49e-01

Pos logit contributions
we+0.223
els+0.223
orns+0.220
ays+0.211
les+0.204
Neg logit contributions
 sorry-0.321
 thank-0.317
 okay-0.314
 glad-0.284
 pleasure-0.282
Token am
Token ID716
Feature activation-1.32e-01

Pos logit contributions
we+0.844
ucks+0.778
cer+0.727
Look+0.709
eddy+0.706
Neg logit contributions
 sorry-0.552
 thank-0.482
 pleasure-0.466
 Once-0.444
 determined-0.437
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 hesitate+0.091
 ending+0.077
 thank+0.077
 tum+0.075
 dare+0.072
Neg logit contributions
we-0.438
 yourselves-0.426
els-0.417
ays-0.415
ucks-0.412
Token,
Token ID11
Feature activation-2.87e-01

Pos logit contributions
 sorry+0.794
 thank+0.772
 pleasure+0.764
 okay+0.741
 determined+0.718
Neg logit contributions
we-1.664
els-1.506
ucks-1.463
les-1.453
ays-1.438
Token Anna
Token ID11735
Feature activation+1.19e-01

Pos logit contributions
 sorry+0.482
 okay+0.470
 thank+0.467
urry+0.439
 know+0.435
Neg logit contributions
we-0.776
els-0.735
ays-0.713
ting-0.700
les-0.694
Token.
Token ID13
Feature activation+7.16e-02

Pos logit contributions
we+0.283
els+0.256
orns+0.254
ays+0.245
les+0.244
Neg logit contributions
 sorry-0.258
 thank-0.251
 okay-0.249
 pleasure-0.229
 know-0.223
Token You
Token ID921
Feature activation+1.24e-01

Pos logit contributions
we+0.194
els+0.174
ays+0.174
orns+0.172
ucks+0.170
Neg logit contributions
 sorry-0.129
 okay-0.127
 thank-0.127
 know-0.114
 determined-0.110
Token can
Token ID460
Feature activation-1.04e-01

Pos logit contributions
we+0.311
els+0.298
orns+0.290
les+0.284
ays+0.282
Neg logit contributions
 sorry-0.223
 thank-0.212
 okay-0.210
 pleasure-0.207
urry-0.203
TokenL
Token ID43
Feature activation+3.30e-01

Pos logit contributions
 sorry+0.047
 okay+0.047
 thank+0.045
 glad+0.044
 know+0.043
Neg logit contributions
we-0.040
els-0.039
ays-0.039
orns-0.039
ting-0.038
Tokenily
Token ID813
Feature activation+1.62e-01

Pos logit contributions
we+0.857
ting+0.812
ning+0.782
rob+0.780
els+0.774
Neg logit contributions
 okay-0.637
 sorry-0.635
 thank-0.598
 know-0.562
urry-0.551
Token and
Token ID290
Feature activation+1.24e-01

Pos logit contributions
we+0.379
els+0.345
orns+0.340
ays+0.330
ting+0.325
Neg logit contributions
 sorry-0.364
 thank-0.356
 okay-0.353
 pleasure-0.318
 know-0.316
Token Ben
Token ID3932
Feature activation+2.81e-02

Pos logit contributions
we+0.307
els+0.287
orns+0.279
ays+0.275
ting+0.271
Neg logit contributions
 sorry-0.253
 thank-0.249
 okay-0.246
 pleasure-0.226
 know-0.223
Token were
Token ID547
Feature activation-4.60e-02

Pos logit contributions
we+0.069
els+0.063
ting+0.062
ers+0.062
les+0.062
Neg logit contributions
 sorry-0.054
 thank-0.052
 pleasure-0.051
 okay-0.051
urry-0.048
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 thank+0.051
 pleasure+0.044
urry+0.040
 okay+0.039
 please+0.038
Neg logit contributions
we-0.151
els-0.144
ers-0.142
ting-0.139
orns-0.139
Token.
Token ID13
Feature activation+6.73e-02

Pos logit contributions
 sorry+1.278
 thank+1.263
 pleasure+1.255
 okay+1.220
 know+1.142
Neg logit contributions
we-1.237
els-1.061
les-1.032
ucks-1.019
ting-1.013
Token They
Token ID1119
Feature activation+3.05e-02

Pos logit contributions
we+0.190
els+0.176
orns+0.175
ting+0.167
ays+0.167
Neg logit contributions
 sorry-0.124
 thank-0.119
 okay-0.117
 know-0.104
 pleasure-0.103
Token learned
Token ID4499
Feature activation-1.03e-01

Pos logit contributions
we+0.062
les+0.060
els+0.058
ting+0.058
ers+0.058
Neg logit contributions
 pleasure-0.062
 sorry-0.059
urry-0.059
 okay-0.058
 thank-0.056
Token their
Token ID511
Feature activation+1.54e-01

Pos logit contributions
 sorry+0.234
 thank+0.229
 okay+0.227
 pleasure+0.205
 know+0.203
Neg logit contributions
we-0.239
els-0.218
orns-0.214
ays-0.208
ting-0.205
Token lesson
Token ID11483
Feature activation-1.42e-01

Pos logit contributions
we+0.380
els+0.354
ucks+0.340
 stations+0.338
ipt+0.334
Neg logit contributions
 sorry-0.275
 okay-0.268
 thank-0.264
 pleasure-0.253
 please-0.251
Token L
Token ID406
Feature activation+5.81e-02

Pos logit contributions
 sorry+0.536
 thank+0.531
 know+0.526
 okay+0.526
urry+0.523
Neg logit contributions
we-0.704
els-0.642
ting-0.632
ries-0.619
orns-0.608
Tokenila
Token ID10102
Feature activation+2.28e-01

Pos logit contributions
we+0.110
els+0.097
orns+0.096
ays+0.092
ting+0.091
Neg logit contributions
 sorry-0.157
 thank-0.155
 okay-0.154
 pleasure-0.140
 know-0.140
Token,
Token ID11
Feature activation-2.96e-01

Pos logit contributions
we+0.705
els+0.653
orns+0.644
ting+0.630
ers+0.628
Neg logit contributions
 sorry-0.327
 thank-0.315
 okay-0.307
 pleasure-0.289
 know-0.273
Token I
Token ID314
Feature activation-1.12e-01

Pos logit contributions
 sorry+0.535
 okay+0.519
 thank+0.516
 pleasure+0.505
urry+0.488
Neg logit contributions
we-0.755
els-0.685
ting-0.665
orns-0.653
ries-0.643
Token'm
Token ID1101
Feature activation-2.95e-01

Pos logit contributions
 sorry+0.184
 thank+0.172
 pleasure+0.164
 okay+0.149
 glad+0.147
Neg logit contributions
we-0.285
les-0.265
els-0.262
orns-0.257
ep-0.254
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 thank+0.200
 pleasure+0.182
 forget+0.177
 hesitate+0.172
 promise+0.161
Neg logit contributions
we-0.993
els-0.937
ting-0.927
ers-0.922
ep-0.918
Token.
Token ID13
Feature activation+6.19e-02

Pos logit contributions
 sorry+1.211
 thank+1.193
 okay+1.158
 pleasure+1.122
 know+1.081
Neg logit contributions
we-1.342
els-1.169
ting-1.132
orns-1.129
ays-1.128
Token That
Token ID1320
Feature activation+6.46e-02

Pos logit contributions
we+0.178
els+0.163
orns+0.161
ays+0.159
ting+0.157
Neg logit contributions
 sorry-0.104
 thank-0.102
 okay-0.101
 pleasure-0.089
 know-0.088
Token dog
Token ID3290
Feature activation+5.80e-02

Pos logit contributions
we+0.156
els+0.143
orns+0.140
ays+0.137
ucks+0.136
Neg logit contributions
 sorry-0.141
 thank-0.139
 okay-0.138
 know-0.124
 pleasure-0.121
Token was
Token ID373
Feature activation+1.58e-01

Pos logit contributions
we+0.126
els+0.113
orns+0.111
ays+0.108
les+0.106
Neg logit contributions
 sorry-0.139
 thank-0.136
 okay-0.135
 pleasure-0.125
 know-0.123
Token very
Token ID845
Feature activation-2.31e-01

Pos logit contributions
we+0.503
els+0.470
orns+0.465
ays+0.457
ting+0.450
Neg logit contributions
 sorry-0.225
 thank-0.215
 okay-0.214
 pleasure-0.177
 know-0.176
Token.
Token ID13
Feature activation+6.93e-02

Pos logit contributions
 sorry+0.493
 thank+0.473
 okay+0.472
 know+0.406
 glad+0.400
Neg logit contributions
we-0.633
orns-0.591
els-0.587
ays-0.569
rob-0.566
Token They
Token ID1119
Feature activation+3.52e-02

Pos logit contributions
we+0.159
els+0.144
ays+0.142
orns+0.142
ting+0.140
Neg logit contributions
 thank-0.148
 sorry-0.147
 okay-0.146
 pleasure-0.138
urry-0.132
Token said
Token ID531
Feature activation-5.65e-02

Pos logit contributions
we+0.087
els+0.081
orns+0.079
les+0.079
ting+0.078
Neg logit contributions
 okay-0.066
 pleasure-0.065
 sorry-0.065
 thank-0.061
urry-0.060
Token they
Token ID484
Feature activation-1.04e-01

Pos logit contributions
 pleasure+0.109
urry+0.101
 promise+0.099
 believe+0.098
 know+0.098
Neg logit contributions
we-0.138
ting-0.124
els-0.122
ries-0.120
orns-0.117
Token were
Token ID547
Feature activation-4.24e-02

Pos logit contributions
 sorry+0.184
 thank+0.179
 okay+0.172
 pleasure+0.162
urry+0.150
Neg logit contributions
we-0.294
els-0.268
orns-0.265
ting-0.256
les-0.256
Token sorry
Token ID7926
Feature activation-5.58e-01

Pos logit contributions
 thank+0.040
 pleasure+0.036
 okay+0.033
 sorry+0.033
 know+0.033
Neg logit contributions
we-0.147
els-0.139
ers-0.136
ting-0.136
orns-0.135
Token.
Token ID13
Feature activation+6.97e-02

Pos logit contributions
 sorry+1.218
 thank+1.196
 okay+1.167
 pleasure+1.118
 know+1.071
Neg logit contributions
we-1.344
els-1.197
orns-1.168
ting-1.138
ays-1.135
Token They
Token ID1119
Feature activation+3.50e-02

Pos logit contributions
we+0.169
els+0.153
orns+0.151
ays+0.150
ucks+0.147
Neg logit contributions
 sorry-0.143
 thank-0.143
 okay-0.140
 pleasure-0.130
urry-0.126
Token said
Token ID531
Feature activation-5.62e-02

Pos logit contributions
we+0.089
els+0.082
orns+0.081
les+0.080
ers+0.080
Neg logit contributions
 sorry-0.064
 okay-0.064
 pleasure-0.061
 thank-0.060
urry-0.058
Token they
Token ID484
Feature activation-1.05e-01

Pos logit contributions
 pleasure+0.105
urry+0.101
 know+0.098
 thank+0.097
 believe+0.097
Neg logit contributions
we-0.141
ting-0.125
els-0.122
ries-0.121
orns-0.119
Token had
Token ID550
Feature activation-6.31e-02

Pos logit contributions
 sorry+0.187
 thank+0.182
 okay+0.174
 pleasure+0.167
urry+0.154
Neg logit contributions
we-0.292
els-0.264
orns-0.263
les-0.254
ting-0.253
Token,
Token ID11
Feature activation-3.07e-01

Pos logit contributions
we+0.191
ers+0.181
els+0.174
ting+0.169
ays+0.169
Neg logit contributions
 sorry-0.080
 okay-0.080
 pleasure-0.077
 thank-0.075
 know-0.064
Token they
Token ID484
Feature activation-1.24e-01

Pos logit contributions
 sorry+0.626
 thank+0.613
 okay+0.603
 pleasure+0.549
 know+0.547
Neg logit contributions
we-0.768
els-0.706
orns-0.688
ays-0.678
ting-0.675
Token filled
Token ID5901
Feature activation+9.62e-03

Pos logit contributions
 sorry+0.271
 okay+0.262
 thank+0.255
 pleasure+0.242
urry+0.224
Neg logit contributions
we-0.294
els-0.275
orns-0.261
ting-0.261
les-0.258
Token out
Token ID503
Feature activation+6.71e-02

Pos logit contributions
we+0.027
orns+0.027
ting+0.025
els+0.025
les+0.024
Neg logit contributions
 sorry-0.017
 thank-0.016
 happy-0.015
 okay-0.015
 know-0.015
Token some
Token ID617
Feature activation+1.79e-01

Pos logit contributions
we+0.188
orns+0.173
rob+0.168
ting+0.167
els+0.164
Neg logit contributions
 sorry-0.127
 thank-0.123
 okay-0.116
 tired-0.110
 glad-0.107
Token forms
Token ID5107
Feature activation+4.23e-01

Pos logit contributions
we+0.440
orns+0.391
ays+0.387
els+0.380
ting+0.379
Neg logit contributions
 thank-0.387
 sorry-0.386
 okay-0.367
 know-0.341
 tum-0.336
Token.
Token ID13
Feature activation+5.21e-02

Pos logit contributions
we+1.109
orns+1.044
els+1.023
rob+1.012
ays+0.995
Neg logit contributions
 thank-0.798
 sorry-0.798
 okay-0.789
 pleasure-0.693
 glad-0.690
Token Then
Token ID3244
Feature activation+6.13e-02

Pos logit contributions
we+0.136
els+0.128
ays+0.126
orns+0.125
ting+0.124
Neg logit contributions
 sorry-0.096
 okay-0.094
 thank-0.093
 know-0.081
 glad-0.081
Token they
Token ID484
Feature activation-1.23e-01

Pos logit contributions
we+0.176
els+0.167
ting+0.167
ries+0.165
ted+0.160
Neg logit contributions
 sorry-0.081
urry-0.077
 pleasure-0.075
 okay-0.075
 thank-0.074
Token waited
Token ID13488
Feature activation+6.09e-03

Pos logit contributions
 sorry+0.254
 okay+0.244
 thank+0.244
 pleasure+0.221
 glad+0.213
Neg logit contributions
we-0.312
els-0.290
orns-0.282
ays-0.276
ting-0.273
Token and
Token ID290
Feature activation+1.03e-01

Pos logit contributions
we+0.017
els+0.015
ting+0.015
ays+0.015
les+0.015
Neg logit contributions
 sorry-0.011
 thank-0.010
 okay-0.010
 know-0.010
 pleasure-0.009
Token
Token ID220
Feature activation-2.08e-01

Pos logit contributions
we+0.186
els+0.171
orns+0.170
ays+0.167
ting+0.164
Neg logit contributions
 sorry-0.120
 thank-0.117
 okay-0.116
 know-0.100
 pleasure-0.100
Token
Token ID198
Feature activation-3.17e-02

Pos logit contributions
 sorry+0.358
 thank+0.347
 okay+0.345
 pleasure+0.299
 know+0.299
Neg logit contributions
we-0.592
els-0.549
orns-0.542
ays-0.529
ting-0.525
Token
Token ID198
Feature activation-3.26e-02

Pos logit contributions
 sorry+0.050
 thank+0.049
 okay+0.048
 know+0.043
 glad+0.042
Neg logit contributions
we-0.094
els-0.091
orns-0.088
ays-0.088
ting-0.086
TokenThe
Token ID464
Feature activation+3.05e-01

Pos logit contributions
 okay+0.080
 sorry+0.080
 thank+0.080
 know+0.075
 glad+0.074
Neg logit contributions
els-0.066
we-0.066
ays-0.064
orns-0.062
ries-0.061
Token fix
Token ID4259
Feature activation-8.12e-02

Pos logit contributions
we+0.726
els+0.660
orns+0.640
ting+0.635
les+0.624
Neg logit contributions
 okay-0.645
 sorry-0.644
 thank-0.634
 know-0.577
 pleasure-0.575
Tokener
Token ID263
Feature activation+4.95e-01

Pos logit contributions
 sorry+0.177
 thank+0.174
 okay+0.169
 know+0.153
 pleasure+0.150
Neg logit contributions
we-0.193
els-0.177
orns-0.177
ays-0.172
ucks-0.171
Token said
Token ID531
Feature activation-7.27e-02

Pos logit contributions
we+1.349
els+1.213
ting+1.188
les+1.177
ers+1.176
Neg logit contributions
 sorry-0.882
 okay-0.850
 thank-0.833
 pleasure-0.774
urry-0.721
Token,
Token ID11
Feature activation-2.98e-01

Pos logit contributions
 forget+0.093
urry+0.092
 know+0.091
 okay+0.089
 pleasure+0.088
Neg logit contributions
we-0.213
ers-0.196
ting-0.194
els-0.189
med-0.188
Token "
Token ID366
Feature activation-1.45e-01

Pos logit contributions
 sorry+0.450
 okay+0.435
 thank+0.429
 know+0.396
 pleasure+0.391
Neg logit contributions
we-0.883
ting-0.799
els-0.791
ays-0.763
orns-0.763
TokenYou
Token ID1639
Feature activation+3.78e-01

Pos logit contributions
 okay+0.326
 thank+0.312
 pleasure+0.304
 please+0.286
 ok+0.283
Neg logit contributions
els-0.333
ays-0.310
ries-0.301
eds-0.299
orns-0.292
Token are
Token ID389
Feature activation-9.87e-02

Pos logit contributions
les+0.857
al+0.853
ers+0.850
we+0.846
els+0.807
Neg logit contributions
 sorry-0.620
urry-0.566
 okay-0.558
 thank-0.555
 glad-0.553
Token handle
Token ID5412
Feature activation-5.12e-02

Pos logit contributions
 sorry+0.744
 okay+0.715
 thank+0.714
 pleasure+0.662
 glad+0.646
Neg logit contributions
we-0.797
els-0.762
orns-0.740
les-0.715
ers-0.702
Token.
Token ID13
Feature activation+4.98e-02

Pos logit contributions
 sorry+0.094
 thank+0.093
 okay+0.090
 know+0.079
 please+0.079
Neg logit contributions
we-0.137
orns-0.129
ucks-0.123
rob-0.123
els-0.122
Token
Token ID198
Feature activation-3.89e-02

Pos logit contributions
we+0.138
els+0.128
orns+0.127
ays+0.124
ting+0.121
Neg logit contributions
 sorry-0.090
 thank-0.088
 okay-0.086
 pleasure-0.076
 know-0.075
Token
Token ID198
Feature activation-3.88e-02

Pos logit contributions
 sorry+0.059
 okay+0.058
 thank+0.058
 know+0.052
urry+0.049
Neg logit contributions
we-0.118
els-0.111
orns-0.109
ting-0.108
ays-0.107
Token"
Token ID1
Feature activation+1.07e-01

Pos logit contributions
 okay+0.091
 sorry+0.090
 thank+0.088
 know+0.081
urry+0.081
Neg logit contributions
we-0.087
els-0.079
orns-0.078
ays-0.078
ting-0.075
TokenLook
Token ID8567
Feature activation+5.29e-01

Pos logit contributions
les+0.226
eds+0.223
ays+0.211
orns+0.207
els+0.207
Neg logit contributions
 okay-0.245
 worry-0.223
 thank-0.221
 please-0.217
 understand-0.215
Token,
Token ID11
Feature activation-3.02e-01

Pos logit contributions
we+1.583
els+1.570
ays+1.491
ries+1.482
les+1.465
Neg logit contributions
 sorry-0.693
 thank-0.661
 okay-0.631
 pleasure-0.601
 know-0.582
Token Ben
Token ID3932
Feature activation+1.94e-02

Pos logit contributions
 sorry+0.612
 okay+0.592
 thank+0.585
 determined+0.564
urry+0.562
Neg logit contributions
we-0.657
els-0.646
ays-0.612
ries-0.604
ting-0.597
Token,
Token ID11
Feature activation-3.03e-01

Pos logit contributions
we+0.060
els+0.057
orns+0.056
ers+0.055
les+0.054
Neg logit contributions
 sorry-0.027
 thank-0.025
 okay-0.024
 pleasure-0.022
 glad-0.021
Token what
Token ID644
Feature activation-1.89e-02

Pos logit contributions
 sorry+0.588
 okay+0.567
 thank+0.566
 pleasure+0.516
urry+0.504
Neg logit contributions
we-0.773
els-0.714
orns-0.695
ting-0.681
ays-0.676
Token is
Token ID318
Feature activation+1.89e-01

Pos logit contributions
 sorry+0.042
 thank+0.041
 okay+0.041
 pleasure+0.037
 glad+0.036
Neg logit contributions
we-0.044
els-0.040
orns-0.040
ays-0.038
ting-0.037
Token He
Token ID679
Feature activation-3.54e-02

Pos logit contributions
we+0.162
orns+0.153
ays+0.151
ucks+0.151
els+0.151
Neg logit contributions
 sorry-0.136
 okay-0.135
 thank-0.133
 worry-0.120
 pleasure-0.119
Token still
Token ID991
Feature activation-1.43e-01

Pos logit contributions
 sorry+0.066
 okay+0.063
 thank+0.062
 pleasure+0.056
urry+0.054
Neg logit contributions
we-0.096
els-0.088
orns-0.087
les-0.085
ays-0.084
Token liked
Token ID8288
Feature activation-2.41e-02

Pos logit contributions
 sorry+0.267
 thank+0.262
 okay+0.261
urry+0.244
 pleasure+0.243
Neg logit contributions
we-0.353
orns-0.321
els-0.319
les-0.315
ers-0.312
Token the
Token ID262
Feature activation+1.45e-01

Pos logit contributions
 sorry+0.050
 thank+0.049
 okay+0.048
 pleasure+0.042
 know+0.042
Neg logit contributions
we-0.061
els-0.056
orns-0.055
ays-0.055
ucks-0.053
Token ot
Token ID30972
Feature activation+1.22e-01

Pos logit contributions
we+0.349
orns+0.325
els+0.320
ers+0.310
les+0.306
Neg logit contributions
 sorry-0.292
 okay-0.285
 thank-0.280
urry-0.265
 know-0.258
Tokenters
Token ID1010
Feature activation+5.28e-01

Pos logit contributions
we+0.137
els+0.127
ays+0.116
orns+0.111
ipt+0.110
Neg logit contributions
 sorry-0.408
 okay-0.401
 thank-0.393
 please-0.363
 worry-0.361
Token,
Token ID11
Feature activation-2.92e-01

Pos logit contributions
we+1.269
els+1.176
orns+1.138
ting+1.129
ers+1.113
Neg logit contributions
 sorry-1.047
 thank-1.047
 okay-1.011
 pleasure-0.970
urry-0.932
Token but
Token ID475
Feature activation-1.72e-01

Pos logit contributions
 sorry+0.417
 thank+0.404
 okay+0.394
 pleasure+0.338
 know+0.335
Neg logit contributions
we-0.915
els-0.852
orns-0.846
ays-0.823
ting-0.819
Token he
Token ID339
Feature activation-1.11e-01

Pos logit contributions
 sorry+0.274
 thank+0.268
 okay+0.262
 pleasure+0.235
 know+0.230
Neg logit contributions
we-0.504
els-0.467
orns-0.458
ays-0.449
ting-0.446
Token knew
Token ID2993
Feature activation-1.95e-01

Pos logit contributions
 sorry+0.203
 thank+0.196
 okay+0.195
 pleasure+0.174
 know+0.167
Neg logit contributions
we-0.307
els-0.283
orns-0.279
ays-0.270
les-0.268
Token they
Token ID484
Feature activation-1.09e-01

Pos logit contributions
 sorry+0.325
 thank+0.314
 okay+0.313
 glad+0.258
 pleasure+0.257
Neg logit contributions
we-0.565
orns-0.528
els-0.527
ays-0.518
ucks-0.514
Token.
Token ID13
Feature activation+6.47e-02

Pos logit contributions
we+0.105
els+0.096
orns+0.094
ays+0.092
ting+0.091
Neg logit contributions
 sorry-0.093
 thank-0.090
 okay-0.090
 pleasure-0.081
 know-0.080
Token He
Token ID679
Feature activation-3.48e-02

Pos logit contributions
we+0.179
els+0.165
orns+0.163
ays+0.161
ucks+0.157
Neg logit contributions
 sorry-0.116
 thank-0.113
 okay-0.113
 pleasure-0.099
 know-0.097
Token said
Token ID531
Feature activation-6.13e-02

Pos logit contributions
 sorry+0.061
 thank+0.060
 okay+0.059
 know+0.052
 pleasure+0.051
Neg logit contributions
we-0.098
els-0.091
orns-0.090
ays-0.088
ting-0.087
Token,
Token ID11
Feature activation-2.93e-01

Pos logit contributions
 pleasure+0.078
 forget+0.077
 okay+0.073
urry+0.072
 miss+0.071
Neg logit contributions
we-0.180
els-0.171
ers-0.166
ays-0.163
ries-0.161
Token "
Token ID366
Feature activation-1.34e-01

Pos logit contributions
 okay+0.483
 sorry+0.477
 pleasure+0.475
 thank+0.474
 know+0.456
Neg logit contributions
we-0.805
ting-0.723
els-0.710
ays-0.681
med-0.675
TokenLook
Token ID8567
Feature activation+5.27e-01

Pos logit contributions
 know+0.283
 understand+0.270
 okay+0.268
 thank+0.261
 want+0.260
Neg logit contributions
eds-0.266
ries-0.265
els-0.260
les-0.251
ators-0.246
Token,
Token ID11
Feature activation-2.92e-01

Pos logit contributions
we+1.702
els+1.667
ays+1.575
ries+1.572
les+1.554
Neg logit contributions
 sorry-0.598
 thank-0.586
 okay-0.546
 pleasure-0.528
 know-0.501
Token what
Token ID644
Feature activation-2.22e-02

Pos logit contributions
 sorry+0.580
 okay+0.574
 thank+0.561
 pleasure+0.544
urry+0.535
Neg logit contributions
we-0.679
els-0.633
orns-0.598
ays-0.591
ting-0.585
Token is
Token ID318
Feature activation+1.95e-01

Pos logit contributions
 sorry+0.044
 thank+0.043
 okay+0.043
 pleasure+0.039
 glad+0.037
Neg logit contributions
we-0.056
els-0.052
orns-0.051
ays-0.050
ymes-0.048
Token this
Token ID428
Feature activation+1.34e-01

Pos logit contributions
we+0.503
orns+0.457
els+0.450
ucks+0.442
ays+0.439
Neg logit contributions
 sorry-0.407
 thank-0.400
 okay-0.384
 pleasure-0.336
 know-0.335
Token?"
Token ID1701
Feature activation+1.20e-02

Pos logit contributions
we+0.337
els+0.311
orns+0.306
ays+0.298
ting+0.293
Neg logit contributions
 sorry-0.276
 thank-0.265
 okay-0.264
 pleasure-0.236
 know-0.234
Token happy
Token ID3772
Feature activation-3.88e-01

Pos logit contributions
 pleasure+0.320
 thank+0.298
're+0.283
 hesitate+0.281
 believe+0.276
Neg logit contributions
ins-0.590
we-0.570
ucks-0.556
acked-0.537
ep-0.535
Token and
Token ID290
Feature activation+1.04e-01

Pos logit contributions
 sorry+0.756
 thank+0.735
 okay+0.726
 pleasure+0.646
 know+0.641
Neg logit contributions
we-1.036
els-0.944
orns-0.937
ays-0.911
ting-0.906
Token said
Token ID531
Feature activation-7.53e-02

Pos logit contributions
we+0.245
ting+0.214
ers+0.212
les+0.212
orns+0.211
Neg logit contributions
 okay-0.197
 sorry-0.196
 pleasure-0.191
urry-0.183
 thank-0.180
Token,
Token ID11
Feature activation-3.04e-01

Pos logit contributions
 pleasure+0.107
 forget+0.105
 promise+0.103
urry+0.101
 know+0.100
Neg logit contributions
we-0.201
ers-0.186
med-0.176
ting-0.175
ries-0.174
Token "
Token ID366
Feature activation-1.45e-01

Pos logit contributions
 sorry+0.514
 okay+0.498
 pleasure+0.492
 thank+0.492
 know+0.483
Neg logit contributions
we-0.857
ting-0.758
els-0.704
ers-0.700
ipt-0.695
TokenThank
Token ID10449
Feature activation+1.18e-01

Pos logit contributions
 okay+0.270
 thank+0.264
 know+0.238
 forget+0.237
 sorry+0.235
Neg logit contributions
els-0.407
ays-0.375
ries-0.371
we-0.351
ymes-0.342
Token you
Token ID345
Feature activation+6.86e-02

Pos logit contributions
we+0.305
els+0.299
ays+0.290
orns+0.287
les+0.279
Neg logit contributions
 sorry-0.230
 thank-0.225
 okay-0.222
 please-0.196
 know-0.189
Token,
Token ID11
Feature activation-3.04e-01

Pos logit contributions
we+0.186
ers+0.175
ting+0.174
eds+0.173
ucks+0.170
Neg logit contributions
 determined-0.094
 sorry-0.089
 seem-0.085
 pleasure-0.085
 know-0.085
Token Lily
Token ID20037
Feature activation+1.41e-02

Pos logit contributions
urry+0.669
 okay+0.668
 sorry+0.666
 pleasure+0.651
 know+0.640
Neg logit contributions
we-0.574
web-0.493
ting-0.487
ucks-0.485
ers-0.482
Token!
Token ID0
Feature activation+5.77e-02

Pos logit contributions
we+0.036
els+0.033
orns+0.033
ays+0.032
ting+0.031
Neg logit contributions
 sorry-0.028
 thank-0.028
 okay-0.028
 know-0.025
 pleasure-0.024
Token You
Token ID921
Feature activation+1.21e-01

Pos logit contributions
we+0.126
orns+0.114
ting+0.111
ays+0.111
ucks+0.111
Neg logit contributions
 sorry-0.128
 thank-0.125
 okay-0.123
 know-0.115
 determined-0.114
Token started
Token ID2067
Feature activation+4.75e-02

Pos logit contributions
we+0.449
els+0.423
orns+0.418
ays+0.404
ucks+0.402
Neg logit contributions
 sorry-0.313
 okay-0.306
 thank-0.301
 know-0.263
 tired-0.261
Token to
Token ID284
Feature activation-1.17e-01

Pos logit contributions
we+0.140
els+0.128
orns+0.127
ays+0.125
ting+0.124
Neg logit contributions
 sorry-0.078
 thank-0.077
 okay-0.074
 pleasure-0.067
 know-0.064
Token get
Token ID651
Feature activation-1.47e-01

Pos logit contributions
 sorry+0.223
 thank+0.218
 okay+0.215
 pleasure+0.196
 glad+0.189
Neg logit contributions
we-0.307
els-0.283
orns-0.279
ting-0.269
ays-0.269
Token very
Token ID845
Feature activation-2.36e-01

Pos logit contributions
 sorry+0.245
 thank+0.237
 okay+0.236
 pleasure+0.202
 know+0.202
Neg logit contributions
we-0.432
els-0.401
orns-0.396
ays-0.388
ting-0.383
Token cold
Token ID4692
Feature activation-1.76e-01

Pos logit contributions
 thank+0.398
 sorry+0.395
 okay+0.379
 pleasure+0.359
 know+0.339
Neg logit contributions
we-0.655
els-0.610
orns-0.596
ays-0.595
les-0.589
Token and
Token ID290
Feature activation+1.04e-01

Pos logit contributions
 sorry+0.317
 okay+0.311
 thank+0.308
 tired+0.271
 know+0.261
Neg logit contributions
we-0.485
els-0.457
ymes-0.437
orns-0.431
ucks-0.428
Token the
Token ID262
Feature activation+1.29e-01

Pos logit contributions
we+0.269
els+0.246
orns+0.242
ays+0.236
ting+0.234
Neg logit contributions
 sorry-0.199
 thank-0.198
 okay-0.193
 pleasure-0.178
urry-0.174
Token wind
Token ID2344
Feature activation+2.70e-03

Pos logit contributions
we+0.294
els+0.276
orns+0.274
ays+0.264
ucks+0.258
Neg logit contributions
 sorry-0.303
 thank-0.294
 okay-0.293
 pleasure-0.261
 know-0.259
Token started
Token ID2067
Feature activation+4.82e-02

Pos logit contributions
orns+0.007
we+0.007
els+0.007
ucks+0.007
rob+0.006
Neg logit contributions
 sorry-0.005
 thank-0.005
 okay-0.005
 sad-0.004
 tired-0.004
Token to
Token ID284
Feature activation-1.16e-01

Pos logit contributions
we+0.143
els+0.131
orns+0.129
ays+0.127
ting+0.126
Neg logit contributions
 sorry-0.079
 thank-0.077
 okay-0.075
 pleasure-0.067
 know-0.065
Token blow
Token ID6611
Feature activation+7.07e-02

Pos logit contributions
 sorry+0.210
 thank+0.203
 okay+0.201
 pleasure+0.179
 know+0.175
Neg logit contributions
we-0.320
els-0.296
orns-0.291
ays-0.284
ting-0.283
Token moms
Token ID34015
Feature activation-6.01e-03

Pos logit contributions
we+0.475
els+0.451
orns+0.426
ucks+0.423
ipt+0.412
Neg logit contributions
 sorry-0.310
 okay-0.302
 thank-0.300
urry-0.269
 know-0.266
Token first
Token ID717
Feature activation-1.06e-01

Pos logit contributions
 sorry+0.011
 thank+0.011
 okay+0.011
 pleasure+0.010
 know+0.010
Neg logit contributions
we-0.016
els-0.015
orns-0.014
ting-0.014
ays-0.014
Token.
Token ID13
Feature activation+7.40e-02

Pos logit contributions
 sorry+0.253
 thank+0.243
 okay+0.239
 pleasure+0.223
 glad+0.222
Neg logit contributions
we-0.223
els-0.205
ers-0.194
orns-0.193
ays-0.189
Token Where
Token ID6350
Feature activation-6.03e-02

Pos logit contributions
we+0.184
els+0.169
ays+0.165
orns+0.165
ucks+0.164
Neg logit contributions
 sorry-0.145
 thank-0.143
 okay-0.140
 pleasure-0.128
 know-0.127
Token is
Token ID318
Feature activation+2.04e-01

Pos logit contributions
 pleasure+0.136
 sorry+0.134
 thank+0.125
 glad+0.125
 grin+0.122
Neg logit contributions
we-0.119
els-0.115
ymes-0.110
ers-0.110
ays-0.105
Token your
Token ID534
Feature activation+1.33e-01

Pos logit contributions
we+0.547
els+0.505
orns+0.499
ays+0.486
ting+0.480
Neg logit contributions
 sorry-0.388
 thank-0.377
 okay-0.374
 pleasure-0.329
 know-0.327
Token mom
Token ID1995
Feature activation-2.00e-01

Pos logit contributions
we+0.321
els+0.311
orns+0.297
ries+0.288
ucks+0.287
Neg logit contributions
 sorry-0.242
 okay-0.235
 thank-0.233
urry-0.216
 know-0.215
Token?"
Token ID1701
Feature activation+2.06e-02

Pos logit contributions
 sorry+0.410
 thank+0.399
 okay+0.382
 pleasure+0.375
urry+0.348
Neg logit contributions
we-0.493
els-0.458
orns-0.442
ays-0.438
ting-0.430
Token
Token ID198
Feature activation-1.31e-02

Pos logit contributions
we+0.053
els+0.048
orns+0.047
ucks+0.047
ays+0.047
Neg logit contributions
 sorry-0.040
 okay-0.039
 thank-0.039
 pleasure-0.037
 know-0.034
Token
Token ID198
Feature activation-1.30e-02

Pos logit contributions
 sorry+0.018
 thank+0.018
 okay+0.017
 know+0.016
 glad+0.014
Neg logit contributions
we-0.042
els-0.040
orns-0.040
ting-0.040
ays-0.039
TokenL
Token ID43
Feature activation+3.40e-01

Pos logit contributions
 thank+0.031
 sorry+0.031
 okay+0.030
 know+0.029
 please+0.028
Neg logit contributions
we-0.028
orns-0.027
els-0.027
ays-0.027
ting-0.027
Token course
Token ID1781
Feature activation+3.38e-01

Pos logit contributions
 sorry+0.312
 thank+0.309
 okay+0.304
 know+0.282
 pleasure+0.282
Neg logit contributions
we-0.163
els-0.141
orns-0.139
ays-0.135
ucks-0.128
Token you
Token ID345
Feature activation+6.72e-02

Pos logit contributions
we+0.839
ers+0.726
els+0.717
ting+0.717
les+0.708
Neg logit contributions
 sorry-0.636
 okay-0.592
 thank-0.584
 know-0.573
 pleasure-0.565
Token can
Token ID460
Feature activation-1.14e-01

Pos logit contributions
les+0.138
els+0.135
al+0.128
ers+0.128
we+0.123
Neg logit contributions
 sorry-0.127
 tum-0.120
 glad-0.119
 okay-0.116
 worry-0.115
Token have
Token ID423
Feature activation-2.78e-01

Pos logit contributions
 sorry+0.226
 okay+0.208
 thank+0.199
 glad+0.193
 pleasure+0.188
Neg logit contributions
we-0.285
els-0.267
les-0.250
orns-0.248
ers-0.248
Token some
Token ID617
Feature activation+1.78e-01

Pos logit contributions
 sorry+0.636
 thank+0.625
 okay+0.620
 pleasure+0.558
 know+0.556
Neg logit contributions
we-0.644
els-0.582
orns-0.575
ays-0.560
ting-0.552
Token glue
Token ID22749
Feature activation+1.68e-01

Pos logit contributions
we+0.397
els+0.367
orns+0.357
ays+0.345
les+0.342
Neg logit contributions
 sorry-0.413
 thank-0.400
 okay-0.399
 pleasure-0.362
 know-0.362
Token."
Token ID526
Feature activation+8.80e-02

Pos logit contributions
we+0.471
els+0.436
orns+0.430
ays+0.420
ting+0.416
Neg logit contributions
 sorry-0.297
 thank-0.290
 okay-0.288
 pleasure-0.249
 know-0.248
Token She
Token ID1375
Feature activation-5.29e-02

Pos logit contributions
we+0.237
els+0.218
orns+0.213
ays+0.213
ting+0.209
Neg logit contributions
 sorry-0.163
 okay-0.158
 thank-0.157
 pleasure-0.138
urry-0.135
Token gave
Token ID2921
Feature activation-7.12e-02

Pos logit contributions
 thank+0.095
 sorry+0.094
 okay+0.091
 know+0.081
 pleasure+0.079
Neg logit contributions
we-0.146
orns-0.136
els-0.135
ting-0.131
ays-0.131
Token him
Token ID683
Feature activation+2.21e-01

Pos logit contributions
 sorry+0.128
 thank+0.127
 okay+0.125
 know+0.110
 pleasure+0.107
Neg logit contributions
we-0.200
orns-0.189
els-0.181
ipt-0.178
rob-0.176
Token some
Token ID617
Feature activation+1.78e-01

Pos logit contributions
we+0.625
orns+0.556
els+0.553
rob+0.553
ays+0.542
Neg logit contributions
 sorry-0.407
 thank-0.404
 okay-0.399
 know-0.352
 pleasure-0.348
TokenMom
Token ID29252
Feature activation+3.27e-01

Pos logit contributions
 okay+0.329
 thank+0.305
 pleasure+0.305
 please+0.300
 permit+0.291
Neg logit contributions
els-0.292
orns-0.282
eds-0.282
les-0.281
ays-0.272
Token,
Token ID11
Feature activation-3.04e-01

Pos logit contributions
we+0.879
els+0.834
ries+0.802
ays+0.786
les+0.784
Neg logit contributions
 sorry-0.589
 thank-0.532
 okay-0.527
 pleasure-0.478
urry-0.475
Token can
Token ID460
Feature activation-1.13e-01

Pos logit contributions
 sorry+0.524
 okay+0.489
 thank+0.466
urry+0.443
 glad+0.432
Neg logit contributions
we-0.820
els-0.766
ting-0.735
al-0.724
ays-0.722
Token I
Token ID314
Feature activation-1.13e-01

Pos logit contributions
 sorry+0.191
 okay+0.182
 thank+0.180
 pleasure+0.164
 glad+0.161
Neg logit contributions
we-0.325
els-0.297
orns-0.286
ting-0.284
ays-0.284
Token have
Token ID423
Feature activation-2.76e-01

Pos logit contributions
 sorry+0.202
 okay+0.175
 glad+0.169
 pleasure+0.161
 thank+0.157
Neg logit contributions
we-0.289
els-0.275
al-0.274
les-0.271
orns-0.261
Token this
Token ID428
Feature activation+1.36e-01

Pos logit contributions
 sorry+0.576
 thank+0.574
 okay+0.567
 pleasure+0.504
 know+0.502
Neg logit contributions
we-0.691
els-0.628
orns-0.616
ays-0.605
ipt-0.602
Token one
Token ID530
Feature activation+6.65e-02

Pos logit contributions
we+0.331
els+0.307
orns+0.300
ays+0.298
les+0.291
Neg logit contributions
 sorry-0.283
 thank-0.270
 okay-0.270
 know-0.244
urry-0.244
Token?"
Token ID1701
Feature activation-3.46e-03

Pos logit contributions
we+0.157
els+0.145
ays+0.141
orns+0.141
les+0.139
Neg logit contributions
 sorry-0.141
 okay-0.132
 thank-0.131
urry-0.119
 glad-0.119
Token She
Token ID1375
Feature activation-5.32e-02

Pos logit contributions
 sorry+0.006
 okay+0.006
 thank+0.006
 pleasure+0.005
urry+0.005
Neg logit contributions
we-0.010
els-0.009
ays-0.009
les-0.009
ting-0.009
Token asked
Token ID1965
Feature activation-3.39e-02

Pos logit contributions
 sorry+0.098
 thank+0.094
 okay+0.094
 pleasure+0.084
 know+0.081
Neg logit contributions
we-0.146
els-0.135
orns-0.131
ays-0.129
les-0.128
Token her
Token ID607
Feature activation+1.37e-01

Pos logit contributions
 sorry+0.072
 thank+0.070
 okay+0.069
 pleasure+0.064
 know+0.063
Neg logit contributions
we-0.083
els-0.075
ting-0.072
ays-0.070
ers-0.070
Token doing
Token ID1804
Feature activation-1.68e-01

Pos logit contributions
we+0.165
els+0.160
ting+0.152
orns+0.150
ays+0.148
Neg logit contributions
 thank-0.116
 pleasure-0.102
 sorry-0.101
 tum-0.093
 yes-0.091
Token,
Token ID11
Feature activation-3.07e-01

Pos logit contributions
 thank+0.319
 sorry+0.311
 okay+0.291
 know+0.287
 pleasure+0.286
Neg logit contributions
we-0.442
els-0.404
orns-0.390
ting-0.389
ays-0.385
Token Lily
Token ID20037
Feature activation+1.39e-02

Pos logit contributions
 sorry+0.564
 thank+0.546
 okay+0.535
 pleasure+0.496
urry+0.496
Neg logit contributions
we-0.789
els-0.745
orns-0.714
ays-0.710
les-0.704
Token?"
Token ID1701
Feature activation-9.74e-04

Pos logit contributions
we+0.037
els+0.034
orns+0.034
ays+0.033
ting+0.032
Neg logit contributions
 sorry-0.027
 thank-0.026
 okay-0.026
 know-0.022
 pleasure-0.022
Token
Token ID220
Feature activation-2.14e-01

Pos logit contributions
 sorry+0.002
 thank+0.002
 okay+0.002
 pleasure+0.002
 know+0.002
Neg logit contributions
we-0.003
els-0.002
orns-0.002
ays-0.002
ting-0.002
Token
Token ID198
Feature activation-4.19e-02

Pos logit contributions
 sorry+0.386
 thank+0.377
 okay+0.369
 pleasure+0.327
 know+0.322
Neg logit contributions
we-0.593
els-0.547
orns-0.541
ays-0.531
ting-0.519
Token
Token ID198
Feature activation-4.11e-02

Pos logit contributions
 sorry+0.063
 okay+0.063
 thank+0.062
 know+0.053
 glad+0.051
Neg logit contributions
we-0.128
els-0.122
orns-0.120
ays-0.117
ting-0.116
Token"
Token ID1
Feature activation+1.01e-01

Pos logit contributions
 okay+0.092
 sorry+0.092
 thank+0.089
 please+0.082
 know+0.081
Neg logit contributions
we-0.096
orns-0.091
els-0.090
ays-0.089
rob-0.088
TokenI
Token ID40
Feature activation+3.34e-01

Pos logit contributions
els+0.189
les+0.184
orns+0.182
we+0.181
ers+0.172
Neg logit contributions
 okay-0.261
 sorry-0.259
 thank-0.249
 glad-0.237
 please-0.234
Token'm
Token ID1101
Feature activation-2.96e-01

Pos logit contributions
we+0.939
les+0.816
els+0.794
orns+0.792
ays+0.789
Neg logit contributions
 sorry-0.537
 thank-0.492
 pleasure-0.461
 okay-0.422
 glad-0.389
Token picking
Token ID10868
Feature activation-7.68e-02

Pos logit contributions
 thank+0.572
 pleasure+0.517
 know+0.509
 hesitate+0.507
 forget+0.502
Neg logit contributions
we-0.733
els-0.683
ting-0.673
ries-0.649
ays-0.643
Token kept
Token ID4030
Feature activation-1.26e-03

Pos logit contributions
we+0.272
els+0.250
orns+0.247
ting+0.244
ays+0.240
Neg logit contributions
 sorry-0.199
 thank-0.197
 okay-0.191
 pleasure-0.179
 know-0.176
Token moving
Token ID3867
Feature activation+4.66e-02

Pos logit contributions
 sorry+0.002
 okay+0.002
 thank+0.002
 glad+0.002
 tired+0.002
Neg logit contributions
we-0.003
els-0.003
orns-0.003
ays-0.003
ucks-0.003
Token away
Token ID1497
Feature activation+2.29e-01

Pos logit contributions
we+0.125
els+0.114
orns+0.114
ays+0.113
ucks+0.112
Neg logit contributions
 sorry-0.090
 okay-0.089
 thank-0.086
 know-0.075
 pleasure-0.074
Token from
Token ID422
Feature activation+1.62e-01

Pos logit contributions
we+0.564
els+0.516
orns+0.508
ays+0.494
ting+0.490
Neg logit contributions
 sorry-0.486
 thank-0.475
 okay-0.470
 pleasure-0.422
 know-0.420
Token Jack
Token ID3619
Feature activation-1.36e-01

Pos logit contributions
we+0.399
els+0.357
orns+0.356
ucks+0.352
ays+0.352
Neg logit contributions
 sorry-0.358
 okay-0.341
 thank-0.339
 pleasure-0.307
 tired-0.302
Token.
Token ID13
Feature activation+5.22e-02

Pos logit contributions
 sorry+0.303
 thank+0.297
 okay+0.288
 pleasure+0.267
 know+0.261
Neg logit contributions
we-0.317
els-0.288
orns-0.282
ting-0.275
ays-0.272
Token Finally
Token ID9461
Feature activation-1.87e-01

Pos logit contributions
we+0.143
orns+0.131
els+0.131
ays+0.130
ipt+0.127
Neg logit contributions
 sorry-0.094
 thank-0.092
 okay-0.090
 know-0.077
 pleasure-0.077
Token he
Token ID339
Feature activation-1.22e-01

Pos logit contributions
 pleasure+0.247
 know+0.236
 okay+0.235
urry+0.232
 sorry+0.224
Neg logit contributions
we-0.518
els-0.503
ting-0.500
ers-0.498
med-0.498
Token caught
Token ID4978
Feature activation+2.30e-03

Pos logit contributions
 sorry+0.224
 thank+0.218
 okay+0.216
 pleasure+0.189
 know+0.189
Neg logit contributions
we-0.337
els-0.312
orns-0.308
ays-0.300
ting-0.296
Token it
Token ID340
Feature activation+1.73e-01

Pos logit contributions
we+0.006
rob+0.006
orns+0.006
ucks+0.006
ipt+0.005
Neg logit contributions
 sorry-0.005
 okay-0.005
 thank-0.004
 tired-0.004
 sad-0.004
Token with
Token ID351
Feature activation+1.58e-01

Pos logit contributions
we+0.411
els+0.372
orns+0.367
ting+0.365
ays+0.358
Neg logit contributions
 sorry-0.370
 thank-0.366
 okay-0.344
 pleasure-0.328
 know-0.325
Token Sam
Token ID3409
Feature activation-1.12e-02

Pos logit contributions
we+0.438
els+0.410
orns+0.403
ays+0.394
ting+0.388
Neg logit contributions
 sorry-0.295
 thank-0.285
 okay-0.284
 know-0.247
 pleasure-0.245
Token tried
Token ID3088
Feature activation-3.02e-02

Pos logit contributions
 sorry+0.020
 thank+0.019
 okay+0.019
 pleasure+0.017
 know+0.016
Neg logit contributions
we-0.032
els-0.029
orns-0.029
ays-0.028
ting-0.028
Token to
Token ID284
Feature activation-1.02e-01

Pos logit contributions
 sorry+0.042
 okay+0.041
 thank+0.039
 know+0.033
 glad+0.031
Neg logit contributions
we-0.097
els-0.091
orns-0.090
ays-0.089
rob-0.089
Token stop
Token ID2245
Feature activation-2.72e-01

Pos logit contributions
 sorry+0.163
 okay+0.150
 thank+0.149
 pleasure+0.141
 glad+0.137
Neg logit contributions
we-0.298
els-0.282
orns-0.276
ting-0.264
ers-0.263
Token him
Token ID683
Feature activation+2.25e-01

Pos logit contributions
 sorry+0.583
 okay+0.557
 thank+0.556
 know+0.484
 glad+0.482
Neg logit contributions
we-0.669
orns-0.619
els-0.615
ays-0.599
rob-0.593
Token.
Token ID13
Feature activation+7.20e-02

Pos logit contributions
we+0.481
els+0.447
orns+0.440
ting+0.424
ays+0.423
Neg logit contributions
 sorry-0.523
 thank-0.520
 okay-0.495
 pleasure-0.481
 glad-0.464
Token They
Token ID1119
Feature activation+3.51e-02

Pos logit contributions
we+0.177
ays+0.162
els+0.161
ipt+0.158
orns+0.158
Neg logit contributions
 thank-0.146
 sorry-0.144
 okay-0.140
 pleasure-0.129
 know-0.128
Token said
Token ID531
Feature activation-5.65e-02

Pos logit contributions
we+0.084
els+0.081
les+0.077
orns+0.077
ers+0.076
Neg logit contributions
 sorry-0.068
 okay-0.066
 pleasure-0.064
 thank-0.063
urry-0.061
Token they
Token ID484
Feature activation-1.02e-01

Pos logit contributions
 pleasure+0.077
 forget+0.073
 know+0.072
urry+0.070
 permit+0.069
Neg logit contributions
we-0.159
els-0.149
ting-0.146
ries-0.146
ers-0.143
Token were
Token ID547
Feature activation-4.23e-02

Pos logit contributions
 sorry+0.203
 thank+0.198
 pleasure+0.188
 okay+0.185
urry+0.175
Neg logit contributions
we-0.258
els-0.230
les-0.226
orns-0.225
ers-0.218
Token sorry
Token ID7926
Feature activation-5.54e-01

Pos logit contributions
 thank+0.016
urry+0.011
 know+0.011
 pleasure+0.011
 worry+0.009
Neg logit contributions
we-0.163
els-0.160
ers-0.157
orns-0.156
ting-0.155
Token.
Token ID13
Feature activation+4.71e-02

Pos logit contributions
 sorry+0.028
 thank+0.028
 okay+0.028
 pleasure+0.025
 know+0.024
Neg logit contributions
we-0.037
els-0.035
orns-0.034
ays-0.034
ucks-0.033
Token<|endoftext|>
Token ID50256
Feature activation+3.48e-02

Pos logit contributions
we+0.134
els+0.124
orns+0.123
ays+0.120
ting+0.119
Neg logit contributions
 sorry-0.082
 thank-0.079
 okay-0.079
 pleasure-0.068
 know-0.068
TokenOnce
Token ID7454
Feature activation-1.26e-01

Pos logit contributions
els+0.096
we+0.095
ries+0.087
eds+0.086
ucks+0.085
Neg logit contributions
 okay-0.055
 sorry-0.054
 thank-0.054
 tired-0.049
 Once-0.049
Token upon
Token ID2402
Feature activation-6.41e-02

Pos logit contributions
 sorry+0.275
 thank+0.266
 okay+0.260
 pleasure+0.246
 hesitate+0.233
Neg logit contributions
we-0.269
ays-0.262
els-0.257
les-0.249
orns-0.244
Token a
Token ID257
Feature activation+3.19e-03

Pos logit contributions
 sorry+0.115
 okay+0.113
 thank+0.112
 know+0.098
 pleasure+0.096
Neg logit contributions
we-0.178
els-0.168
orns-0.164
ays-0.162
ting-0.159
Token time
Token ID640
Feature activation-1.25e-01

Pos logit contributions
we+0.008
orns+0.007
les+0.007
els+0.007
ers+0.007
Neg logit contributions
 sorry-0.006
 thank-0.006
 okay-0.005
 know-0.005
 pleasure-0.005
Token,
Token ID11
Feature activation-3.09e-01

Pos logit contributions
 thank+0.192
 sorry+0.184
 pleasure+0.177
 forget+0.176
 know+0.174
Neg logit contributions
we-0.342
els-0.315
ays-0.304
ries-0.303
orns-0.301
Token there
Token ID612
Feature activation+5.40e-02

Pos logit contributions
 sorry+0.674
 thank+0.666
 okay+0.649
 pleasure+0.594
 know+0.588
Neg logit contributions
we-0.736
els-0.671
orns-0.659
ays-0.643
ting-0.631
Token was
Token ID373
Feature activation+1.35e-01

Pos logit contributions
we+0.122
els+0.111
orns+0.109
ays+0.106
ting+0.104
Neg logit contributions
 sorry-0.125
 thank-0.123
 okay-0.122
 pleasure-0.110
 know-0.110
Token a
Token ID257
Feature activation+2.01e-03

Pos logit contributions
we+0.404
els+0.379
orns+0.366
ays+0.364
ucks+0.359
Neg logit contributions
 sorry-0.222
 okay-0.222
 thank-0.211
 pleasure-0.188
 glad-0.183
Token normal
Token ID3487
Feature activation-1.64e-01

Pos logit contributions
we+0.005
ting+0.004
les+0.004
ymes+0.004
els+0.004
Neg logit contributions
 okay-0.003
 sorry-0.003
 thank-0.003
 know-0.003
 pleasure-0.003
Token bake
Token ID28450
Feature activation-3.45e-02

Pos logit contributions
 sorry+0.174
 okay+0.164
 thank+0.161
 pleasure+0.146
urry+0.144
Neg logit contributions
we-0.295
els-0.277
orns-0.270
ers-0.263
ting-0.261
Token.
Token ID13
Feature activation+6.50e-02

Pos logit contributions
 okay+0.066
 thank+0.063
 sorry+0.063
 please+0.054
 happy+0.051
Neg logit contributions
we-0.093
orns-0.089
els-0.088
ipt-0.088
ting-0.086
Token Then
Token ID3244
Feature activation+6.48e-02

Pos logit contributions
we+0.169
els+0.153
orns+0.153
ays+0.151
ucks+0.151
Neg logit contributions
 sorry-0.122
 thank-0.118
 okay-0.116
 worry-0.103
 know-0.102
Token,
Token ID11
Feature activation-2.91e-01

Pos logit contributions
we+0.158
els+0.142
ting+0.140
ries+0.139
al+0.139
Neg logit contributions
 sorry-0.112
 glad-0.108
 determined-0.108
urry-0.106
 pleasure-0.101
Token we
Token ID356
Feature activation-9.93e-02

Pos logit contributions
 sorry+0.477
 thank+0.441
 okay+0.440
 pleasure+0.398
 glad+0.394
Neg logit contributions
we-0.838
els-0.766
orns-0.747
ting-0.745
ays-0.734
Token can
Token ID460
Feature activation-1.07e-01

Pos logit contributions
 sorry+0.176
 okay+0.163
 thank+0.161
 pleasure+0.148
 glad+0.148
Neg logit contributions
we-0.253
els-0.251
orns-0.242
les-0.242
al-0.237
Token enjoy
Token ID2883
Feature activation-1.91e-01

Pos logit contributions
 sorry+0.183
 okay+0.170
 thank+0.161
 pleasure+0.155
 glad+0.155
Neg logit contributions
we-0.293
els-0.283
ays-0.263
ucks-0.262
ries-0.261
Token it
Token ID340
Feature activation+1.81e-01

Pos logit contributions
 sorry+0.419
 thank+0.411
 okay+0.411
 pleasure+0.360
 glad+0.359
Neg logit contributions
we-0.459
orns-0.422
els-0.418
ays-0.412
ipt-0.411
Token for
Token ID329
Feature activation-1.36e-01

Pos logit contributions
we+0.413
els+0.370
orns+0.367
ays+0.355
ting+0.349
Neg logit contributions
 sorry-0.417
 thank-0.407
 okay-0.399
 pleasure-0.366
 know-0.364
Token dinner
Token ID8073
Feature activation-9.04e-02

Pos logit contributions
 sorry+0.266
 thank+0.262
 okay+0.260
 know+0.227
 pleasure+0.227
Neg logit contributions
we-0.359
els-0.333
orns-0.330
ays-0.329
les-0.319
Token."
Token ID526
Feature activation+9.96e-02

Pos logit contributions
 sorry+0.190
 thank+0.186
 okay+0.185
 pleasure+0.165
 know+0.164
Neg logit contributions
we-0.222
els-0.204
orns-0.201
ays-0.196
ting-0.193
Token had
Token ID550
Feature activation-8.26e-02

Pos logit contributions
we+0.196
els+0.179
orns+0.176
ting+0.172
ays+0.171
Neg logit contributions
 sorry-0.143
 thank-0.139
 okay-0.138
 pleasure-0.124
 know-0.121
Token been
Token ID587
Feature activation+6.27e-03

Pos logit contributions
 sorry+0.162
 thank+0.156
 okay+0.151
 pleasure+0.139
urry+0.138
Neg logit contributions
we-0.213
orns-0.193
els-0.193
ers-0.188
ting-0.186
Token so
Token ID523
Feature activation-1.60e-01

Pos logit contributions
we+0.018
els+0.016
ers+0.016
orns+0.016
ting+0.016
Neg logit contributions
 thank-0.011
 sorry-0.011
 okay-0.010
 pleasure-0.010
 know-0.009
Token compassionate
Token ID32533
Feature activation-3.04e-01

Pos logit contributions
 thank+0.208
 pleasure+0.183
 okay+0.182
 tum+0.181
 sorry+0.180
Neg logit contributions
we-0.462
ers-0.424
ting-0.418
ymes-0.415
els-0.414
Token and
Token ID290
Feature activation+1.12e-01

Pos logit contributions
 sorry+0.584
 thank+0.569
 okay+0.563
 pleasure+0.498
 know+0.496
Neg logit contributions
we-0.808
els-0.744
orns-0.735
ays-0.717
ting-0.708
Token kind
Token ID1611
Feature activation-2.42e-01

Pos logit contributions
we+0.329
ting+0.293
ers+0.285
orns+0.283
els+0.279
Neg logit contributions
 sorry-0.156
 okay-0.151
 thank-0.151
urry-0.145
 pleasure-0.136
Token,
Token ID11
Feature activation-2.99e-01

Pos logit contributions
 sorry+0.461
 thank+0.458
 okay+0.451
 pleasure+0.398
 know+0.392
Neg logit contributions
we-0.640
els-0.598
orns-0.585
ays-0.579
rob-0.563
Token and
Token ID290
Feature activation+1.12e-01

Pos logit contributions
 sorry+0.565
 thank+0.544
 okay+0.532
 know+0.484
urry+0.483
Neg logit contributions
we-0.786
orns-0.709
els-0.704
ting-0.696
ays-0.675
Token had
Token ID550
Feature activation-8.63e-02

Pos logit contributions
we+0.296
ting+0.263
orns+0.261
els+0.259
ers+0.253
Neg logit contributions
 sorry-0.201
 okay-0.196
 thank-0.187
urry-0.179
 pleasure-0.172
Token warned
Token ID7728
Feature activation-1.05e-01

Pos logit contributions
 sorry+0.170
 thank+0.164
 okay+0.161
urry+0.144
 pleasure+0.143
Neg logit contributions
we-0.223
orns-0.203
els-0.203
ays-0.195
ers-0.194
Token him
Token ID683
Feature activation+2.20e-01

Pos logit contributions
 sorry+0.248
 thank+0.242
 okay+0.239
 pleasure+0.221
 know+0.218
Neg logit contributions
we-0.230
els-0.209
orns-0.202
ting-0.197
ays-0.195
Token.
Token ID13
Feature activation+5.08e-02

Pos logit contributions
we+0.327
els+0.310
ting+0.287
les+0.274
ays+0.272
Neg logit contributions
 pleasure-0.398
 thank-0.396
 sorry-0.383
 okay-0.380
 tum-0.352
Token "
Token ID366
Feature activation-1.48e-01

Pos logit contributions
we+0.127
ting+0.115
ipt+0.113
els+0.111
ers+0.110
Neg logit contributions
 sorry-0.094
 okay-0.094
 thank-0.092
 pleasure-0.087
urry-0.084
TokenI
Token ID40
Feature activation+3.32e-01

Pos logit contributions
 sorry+0.347
 okay+0.345
 thank+0.341
 please+0.329
 know+0.324
Neg logit contributions
els-0.291
les-0.289
ays-0.284
orns-0.283
ers-0.276
Token am
Token ID716
Feature activation-1.41e-01

Pos logit contributions
we+0.859
les+0.766
orns+0.753
al+0.751
els+0.730
Neg logit contributions
 sorry-0.566
 pleasure-0.532
 okay-0.531
 thank-0.525
 glad-0.475
Token the
Token ID262
Feature activation+1.34e-01

Pos logit contributions
 thank+0.272
urry+0.255
 pleasure+0.251
 okay+0.250
 promise+0.249
Neg logit contributions
we-0.321
ting-0.303
els-0.284
orns-0.284
ers-0.278
Token wise
Token ID10787
Feature activation-2.88e-01

Pos logit contributions
we+0.319
orns+0.303
ting+0.293
els+0.285
rob+0.279
Neg logit contributions
 sorry-0.265
 thank-0.259
urry-0.253
 okay-0.241
 know-0.233
Token owl
Token ID39610
Feature activation+1.25e-01

Pos logit contributions
 sorry+0.812
 okay+0.782
 thank+0.761
urry+0.737
 pleasure+0.724
Neg logit contributions
we-0.454
orns-0.433
ting-0.417
els-0.408
les-0.401
Token,"
Token ID553
Feature activation-8.89e-02

Pos logit contributions
we+0.309
orns+0.279
les+0.275
els+0.275
ting+0.275
Neg logit contributions
 sorry-0.237
 thank-0.235
 okay-0.228
 pleasure-0.226
urry-0.209
Token replied
Token ID8712
Feature activation-1.43e-01

Pos logit contributions
 sorry+0.135
 thank+0.132
 okay+0.127
 know+0.109
 pleasure+0.104
Neg logit contributions
we-0.275
els-0.258
orns-0.256
ays-0.253
ucks-0.246
Token the
Token ID262
Feature activation+1.34e-01

Pos logit contributions
 sorry+0.341
 thank+0.334
 okay+0.331
 pleasure+0.301
 know+0.299
Neg logit contributions
we-0.314
els-0.285
orns-0.280
ays-0.271
ting-0.267
Token owl
Token ID39610
Feature activation+1.25e-01

Pos logit contributions
we+0.320
orns+0.299
ting+0.294
els+0.290
les+0.281
Neg logit contributions
 sorry-0.266
 okay-0.257
 thank-0.256
urry-0.246
 pleasure-0.242
Token happy
Token ID3772
Feature activation-3.83e-01

Pos logit contributions
we+0.440
ers+0.396
els+0.396
ting+0.396
ays+0.392
Neg logit contributions
 thank-0.198
 sorry-0.183
 okay-0.170
urry-0.162
 pleasure-0.156
Token and
Token ID290
Feature activation+1.09e-01

Pos logit contributions
 sorry+0.786
 thank+0.764
 okay+0.758
 know+0.677
 pleasure+0.676
Neg logit contributions
we-0.979
els-0.884
orns-0.873
ays-0.852
ting-0.851
Token the
Token ID262
Feature activation+1.37e-01

Pos logit contributions
we+0.282
ting+0.257
les+0.241
med+0.239
els+0.239
Neg logit contributions
 sorry-0.200
 okay-0.199
 thank-0.190
 please-0.176
urry-0.175
Token cat
Token ID3797
Feature activation+4.53e-02

Pos logit contributions
we+0.336
els+0.306
orns+0.303
ting+0.293
ays+0.292
Neg logit contributions
 sorry-0.288
 thank-0.281
 okay-0.279
 know-0.250
 pleasure-0.249
Token was
Token ID373
Feature activation+1.43e-01

Pos logit contributions
we+0.117
els+0.107
orns+0.106
ays+0.103
ting+0.102
Neg logit contributions
 sorry-0.090
 thank-0.087
 okay-0.087
 pleasure-0.077
 know-0.076
Token happy
Token ID3772
Feature activation-3.82e-01

Pos logit contributions
we+0.441
ers+0.396
ting+0.395
els+0.394
med+0.392
Neg logit contributions
 thank-0.193
 sorry-0.188
 okay-0.170
urry-0.164
 know-0.160
Token too
Token ID1165
Feature activation-5.00e-02

Pos logit contributions
 sorry+0.805
 thank+0.777
 okay+0.772
 know+0.707
urry+0.696
Neg logit contributions
we-0.955
els-0.837
orns-0.828
ting-0.820
ays-0.812
Token.
Token ID13
Feature activation+5.66e-02

Pos logit contributions
 sorry+0.124
 thank+0.120
 okay+0.115
 know+0.112
 pleasure+0.110
Neg logit contributions
we-0.098
ting-0.088
les-0.083
els-0.080
ays-0.078
Token The
Token ID383
Feature activation-3.28e-02

Pos logit contributions
we+0.161
els+0.149
orns+0.148
ays+0.144
ting+0.142
Neg logit contributions
 sorry-0.098
 thank-0.095
 okay-0.094
 pleasure-0.082
 know-0.081
Token cat
Token ID3797
Feature activation+4.61e-02

Pos logit contributions
 sorry+0.067
 thank+0.066
 okay+0.065
 know+0.059
 pleasure+0.059
Neg logit contributions
we-0.081
els-0.074
orns-0.074
ting-0.071
ays-0.070
Token kept
Token ID4030
Feature activation-3.09e-03

Pos logit contributions
we+0.117
els+0.107
orns+0.106
ays+0.102
les+0.102
Neg logit contributions
 sorry-0.093
 okay-0.090
 thank-0.090
 pleasure-0.080
 know-0.079
Token toe
Token ID21189
Feature activation+7.80e-02

Pos logit contributions
we+0.388
els+0.357
orns+0.348
ays+0.336
ting+0.333
Neg logit contributions
 sorry-0.316
 okay-0.307
 thank-0.307
 know-0.274
 pleasure-0.274
Token.
Token ID13
Feature activation+7.17e-02

Pos logit contributions
we+0.200
els+0.186
orns+0.182
ays+0.177
ucks+0.176
Neg logit contributions
 sorry-0.154
 thank-0.150
 okay-0.149
 know-0.130
 please-0.129
Token He
Token ID679
Feature activation-3.09e-02

Pos logit contributions
we+0.189
els+0.174
orns+0.172
ays+0.169
ting+0.167
Neg logit contributions
 sorry-0.135
 thank-0.132
 okay-0.130
 pleasure-0.117
 know-0.115
Token is
Token ID318
Feature activation+1.94e-01

Pos logit contributions
 sorry+0.061
 thank+0.060
 okay+0.059
 pleasure+0.054
 glad+0.052
Neg logit contributions
we-0.079
els-0.073
orns-0.072
ays-0.070
les-0.069
Token very
Token ID845
Feature activation-2.25e-01

Pos logit contributions
we+0.579
els+0.544
orns+0.535
ers+0.520
ting+0.520
Neg logit contributions
 thank-0.280
 sorry-0.268
 okay-0.262
 pleasure-0.247
urry-0.245
Token sad
Token ID6507
Feature activation-4.22e-01

Pos logit contributions
 thank+0.332
 pleasure+0.318
 okay+0.313
 sorry+0.305
urry+0.281
Neg logit contributions
we-0.644
els-0.601
orns-0.590
les-0.578
eds-0.572
Token and
Token ID290
Feature activation+1.23e-01

Pos logit contributions
 sorry+0.903
 thank+0.881
 okay+0.876
 know+0.778
 pleasure+0.777
Neg logit contributions
we-1.038
els-0.953
orns-0.937
ays-0.912
ting-0.897
Token in
Token ID287
Feature activation+1.64e-01

Pos logit contributions
we+0.389
els+0.365
orns+0.361
ays+0.354
ting+0.354
Neg logit contributions
 thank-0.160
 okay-0.158
 sorry-0.158
 pleasure-0.142
urry-0.133
Token pain
Token ID2356
Feature activation-2.14e-02

Pos logit contributions
we+0.431
els+0.391
orns+0.385
ays+0.380
ting+0.373
Neg logit contributions
 sorry-0.333
 thank-0.324
 okay-0.319
 know-0.284
 pleasure-0.282
Token."
Token ID526
Feature activation+1.01e-01

Pos logit contributions
 sorry+0.043
 thank+0.041
 okay+0.041
 know+0.036
 glad+0.035
Neg logit contributions
we-0.054
els-0.051
orns-0.050
ipt-0.049
ays-0.049
Token\n
Token ID198
Feature activation-2.27e-02

Pos logit contributions
we+0.256
els+0.230
orns+0.230
ays+0.225
ucks+0.223
Neg logit contributions
 sorry-0.196
 okay-0.193
 thank-0.192
 pleasure-0.177
 know-0.168
Token the
Token ID262
Feature activation+1.35e-01

Pos logit contributions
we+0.110
els+0.104
orns+0.103
ays+0.098
les+0.098
Neg logit contributions
 sorry-0.095
 okay-0.092
 thank-0.092
 pleasure-0.082
 know-0.080
Token park
Token ID3952
Feature activation+1.78e-01

Pos logit contributions
we+0.326
els+0.297
orns+0.293
ays+0.286
ting+0.282
Neg logit contributions
 sorry-0.299
 thank-0.290
 okay-0.289
 pleasure-0.257
 glad-0.256
Token for
Token ID329
Feature activation-1.41e-01

Pos logit contributions
we+0.425
ers+0.390
ting+0.377
els+0.369
orns+0.360
Neg logit contributions
 sorry-0.368
 thank-0.358
 okay-0.334
 pleasure-0.317
 know-0.317
Token the
Token ID262
Feature activation+1.38e-01

Pos logit contributions
 sorry+0.298
 thank+0.292
 okay+0.291
 know+0.260
 pleasure+0.257
Neg logit contributions
we-0.355
els-0.321
ays-0.318
orns-0.317
les-0.307
Token frame
Token ID5739
Feature activation+5.23e-02

Pos logit contributions
we+0.341
els+0.307
orns+0.302
ting+0.291
ymes+0.287
Neg logit contributions
 sorry-0.281
 thank-0.276
 okay-0.268
 know-0.247
 pleasure-0.245
Token,
Token ID11
Feature activation-3.04e-01

Pos logit contributions
we+0.134
orns+0.124
els+0.123
ers+0.121
ting+0.118
Neg logit contributions
 sorry-0.104
 thank-0.100
 okay-0.096
 pleasure-0.088
 know-0.087
Token but
Token ID475
Feature activation-1.77e-01

Pos logit contributions
 sorry+0.441
 thank+0.432
 okay+0.408
 know+0.361
 pleasure+0.358
Neg logit contributions
we-0.949
els-0.870
orns-0.859
ting-0.858
ays-0.837
Token they
Token ID484
Feature activation-1.18e-01

Pos logit contributions
 sorry+0.354
 thank+0.353
 okay+0.329
 pleasure+0.321
 know+0.315
Neg logit contributions
we-0.435
els-0.401
ting-0.391
orns-0.386
ays-0.377
Token couldn
Token ID3521
Feature activation-3.05e-01

Pos logit contributions
 thank+0.193
 sorry+0.192
 pleasure+0.177
 okay+0.164
 glad+0.150
Neg logit contributions
we-0.337
els-0.307
ting-0.300
ipt-0.297
orns-0.296
Token't
Token ID470
Feature activation-1.52e-01

Pos logit contributions
 sorry+0.349
 okay+0.331
 thank+0.330
 know+0.258
 please+0.248
Neg logit contributions
we-1.042
els-0.994
orns-0.983
ting-0.951
les-0.946
Token find
Token ID1064
Feature activation-1.92e-01

Pos logit contributions
 sorry+0.255
 thank+0.234
 okay+0.226
 pleasure+0.219
 glad+0.216
Neg logit contributions
we-0.435
els-0.403
orns-0.389
ting-0.388
al-0.384