TOP ACTIVATIONS
MAX = 0.119

of wonderful memories from <|endoftext|> Lu cy was a little girl
to play with Spot in Mrs . Lee 's yard .
know you did not mean to ruin the cake . But
was an old judge named Mr . Owl . He lived
was a kind lady named Mrs . Smith . She lived
it , because she liked Mrs . Lee and her cat
dragon loved it ! <|endoftext|> Lu cy was jealous of her
Do you know what is in the mystery box ?" Sue
. The end . <|endoftext|> Mrs . Brown took her little
morning j ogs . <|endoftext|> Lu cy wanted to be an
was an elderly cat named Mr . Wh isk ers .
there was a lady named Mrs . Greene . One day
know you did not mean to hurt the bowl . But
Which animal do you want to see next ?" Tim my
The cat wanted to eat Twe ety ! Twe
" Where do you want to go , Blank et ?"
was a gray cat named Mr . Wh isk ers .
The end . <|endoftext|> Lu cy was very excited to
know you did not mean to break the book . It
to visit their neighbor , Mrs . Lee . Mrs .

MIN ACTIVATIONS
MIN = -0.104

he felt a fl apping sensation in the air . It
, Elsa felt a strange sensation in the water . What
a pleasant t ing ling sensation throughout his body . The
felt a strange t ingly sensation and when it stopped they
and Tom experienced a strange sensation they had never felt before
, they felt a strange sensation . They felt light and
Suddenly , he felt a sensation . He began to slide
and Tom felt a strange sensation in their stomach s .
look at it . His pipe was much nicer .
and Ben felt a funny sensation in their stomach s .
shells . She put her shell down and paused to look
. She put her pretty sheet down , and lay on
, he felt a strange sensation , like something powerful was
. He felt a burning sensation in his mouth . He
, it felt a burning sensation and started to jump around
something bad happened . Bob accident ly spilled some of the
Pretty soon her marks were higher than anyone else
much he weighed . He loop ed around the tree ,
feels a hot and burning sensation in her mouth . She
clean . She spread the sheet back on the bed and

INTERVAL 0.063 - 0.119
CONTAINS 1.383%

kids . They might like to see it too ."
mind . She just wanted to share her laundry with her
. They like to play in the park . One day
named Kitty . Kitty loved to play with her cr ay
asked if baby was ready to go , so they all

INTERVAL 0.007 - 0.063
CONTAINS 36.550%

her t eddy bear , " I don 't want to
Tim my . Tim my liked to run and play because
can get a new one . We still have other toys
. They were very upset and told her to stay away
ooter . He looked at Bob and said , � �

INTERVAL -0.048 - 0.007
CONTAINS 59.807%

the little ant . They bit him all over . He
not touch it ." They watch the butterfly rise
was red , her favorite color . She felt a little
swing on the swings . That sounds fun ." he said
I was scared ." Her mom smiled and said , "

INTERVAL -0.104 - -0.048
CONTAINS 2.259%

After they finished cleaning , the man said , " Thank
screamed and kicked . She said she only wanted to wear
. Then , they put the expensive blanket into their bag
at the car . He looks at Lily and Tom .
girl a hug and they said goodbye . The girl waved
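
The four INTERVAL ranges above look like an equal-width split of the span from MIN = -0.104 to MAX = 0.119, and the four CONTAINS figures sum to roughly 100%. The sketch below reproduces that arithmetic under that assumption. The function names `interval_edges` and `bin_shares` are hypothetical and are not taken from whatever tool produced this listing. The sample values are a handful of the per-token feature activations listed in this dump, not the full dataset.

```python
def interval_edges(lo, hi, n_bins=4):
    """Return n_bins + 1 equally spaced edges from lo to hi (assumed binning)."""
    width = (hi - lo) / n_bins
    # Pin the last edge to hi exactly so the maximum value always lands in the top bin.
    return [lo + i * width for i in range(n_bins)] + [hi]


def bin_shares(values, edges):
    """Percentage of values falling in each bin, listed from the top bin down."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [lo, hi), except the top bin, which also includes its upper edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = len(values) or 1
    return [100.0 * c / total for c in reversed(counts)]


# Edges derived from the displayed MIN and MAX.
edges = interval_edges(-0.104, 0.119)

# A few feature activations copied from the per-token entries in this dump.
sample = [0.119, 0.0619, 0.0383, -0.0224, -0.0544]
shares = bin_shares(sample, edges)
```

With the displayed MIN and MAX, the computed edges are approximately -0.104, -0.04825, 0.0075, 0.06325 and 0.119, which agree with the listed interval boundaries up to the rounding of the displayed values.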
Token of
Token ID286
Feature activation+3.83e-02

Pos logit contributions
der+0.006
rier+0.005
bled+0.005
lessly+0.005
lic+0.004
Neg logit contributions
e-0.006
 moderation-0.006
 arrest-0.006
 hate-0.006
 mischief-0.006
Token wonderful
Token ID7932
Feature activation+4.33e-02

Pos logit contributions
e+0.070
 moderation+0.069
 hate+0.064
 arrest+0.063
 client+0.063
Neg logit contributions
der-0.084
lessly-0.071
rier-0.068
ier-0.065
lights-0.064
Token memories
Token ID9846
Feature activation-2.24e-02

Pos logit contributions
e+0.069
 moderation+0.068
 hate+0.062
 arrest+0.061
 client+0.061
Neg logit contributions
der-0.105
lessly-0.090
rier-0.087
ier-0.084
lights-0.081
Token from
Token ID422
Feature activation+2.37e-02

Pos logit contributions
der+0.043
rier+0.039
lessly+0.034
lic+0.033
bled+0.031
Neg logit contributions
e-0.042
keeping-0.037
 visit-0.036
 moderation-0.036
 arrest-0.035
Token<|endoftext|>
Token ID50256
Feature activation+6.19e-02

Pos logit contributions
 moderation+0.044
e+0.044
|+0.041
 hate+0.041
 argue+0.041
Neg logit contributions
der-0.050
lessly-0.042
rier-0.040
lights-0.038
ier-0.038
TokenLu
Token ID25596
Feature activation+1.19e-01

Pos logit contributions
e+0.112
 moderation+0.112
 hate+0.103
 client+0.102
 arrest+0.102
Neg logit contributions
der-0.136
lessly-0.115
rier-0.109
ier-0.105
lights-0.103
Tokency
Token ID948
Feature activation+5.41e-02

Pos logit contributions
 moderation+0.163
 client+0.153
 mischief+0.144
 argue+0.143
 conflicts+0.141
Neg logit contributions
der-0.277
lessly-0.260
ing-0.258
led-0.246
ly-0.241
Token was
Token ID373
Feature activation-6.29e-03

Pos logit contributions
 moderation+0.147
e+0.144
 hate+0.137
 argue+0.137
 client+0.136
Neg logit contributions
der-0.071
lessly-0.052
rier-0.046
ier-0.043
lights-0.040
Token a
Token ID257
Feature activation+3.34e-02

Pos logit contributions
der+0.013
lessly+0.011
rier+0.010
ier+0.010
lights+0.009
Neg logit contributions
 moderation-0.012
e-0.012
 client-0.012
 arrest-0.012
 argue-0.012
Token little
Token ID1310
Feature activation+3.82e-02

Pos logit contributions
 moderation+0.073
keeping+0.073
e+0.072
 arrest+0.069
 argue+0.067
Neg logit contributions
der-0.058
lessly-0.047
rier-0.046
ier-0.040
ing-0.039
Token girl
Token ID2576
Feature activation+3.66e-02

Pos logit contributions
 moderation+0.071
e+0.071
 arrest+0.065
 argue+0.064
 hate+0.064
Neg logit contributions
der-0.083
lessly-0.069
rier-0.068
ier-0.063
lights-0.062
Token to
Token ID284
Feature activation+6.78e-02

Pos logit contributions
e+0.059
 moderation+0.056
|+0.052
 generation+0.052
days+0.051
Neg logit contributions
der-0.049
lessly-0.035
ing-0.033
 that-0.032
rier-0.032
Token play
Token ID711
Feature activation+2.05e-02

Pos logit contributions
 moderation+0.106
keeping+0.100
 conflicts+0.094
|+0.093
atherine+0.093
Neg logit contributions
der-0.142
lessly-0.112
 all-0.105
 that-0.105
lic-0.104
Token with
Token ID351
Feature activation+4.73e-02

Pos logit contributions
 moderation+0.047
e+0.046
 hate+0.044
 client+0.044
 argue+0.043
Neg logit contributions
der-0.035
lessly-0.028
rier-0.026
ier-0.025
lights-0.024
Token Spot
Token ID15899
Feature activation+6.69e-02

Pos logit contributions
e+0.077
 argue+0.076
 moderation+0.075
keeping+0.072
|+0.071
Neg logit contributions
der-0.108
rier-0.089
lessly-0.087
ly-0.081
lights-0.081
Token in
Token ID287
Feature activation+7.40e-02

Pos logit contributions
 moderation+0.184
 arrest+0.177
keeping+0.177
e+0.176
|+0.173
Neg logit contributions
der-0.087
lessly-0.056
ing-0.051
rier-0.050
ier-0.047
Token Mrs
Token ID9074
Feature activation+1.12e-01

Pos logit contributions
 argue+0.170
beh+0.169
 client+0.162
days+0.161
 colony+0.159
Neg logit contributions
der-0.092
lessly-0.076
lights-0.064
rier-0.063
 all-0.060
Token.
Token ID13
Feature activation+3.19e-02

Pos logit contributions
 moderation+0.295
e+0.294
|+0.285
 arrest+0.285
 hate+0.282
Neg logit contributions
der-0.162
rier-0.106
ier-0.103
lessly-0.089
ing-0.088
Token Lee
Token ID5741
Feature activation+4.80e-02

Pos logit contributions
e+0.047
 moderation+0.042
 hate+0.038
|+0.037
beh+0.036
Neg logit contributions
der-0.083
rier-0.071
lessly-0.065
lights-0.063
ier-0.063
Token's
Token ID338
Feature activation-1.82e-03

Pos logit contributions
 moderation+0.127
e+0.122
|+0.119
keeping+0.118
 arrest+0.118
Neg logit contributions
der-0.067
lessly-0.047
rier-0.041
ier-0.041
ing-0.039
Token yard
Token ID12699
Feature activation-1.62e-02

Pos logit contributions
der+0.004
rier+0.003
lessly+0.003
ing+0.003
ows+0.003
Neg logit contributions
e-0.003
|-0.003
 moderation-0.003
Susan-0.003
beh-0.003
Token.
Token ID13
Feature activation+1.76e-02

Pos logit contributions
der+0.029
lessly+0.024
rier+0.023
ier+0.021
lights+0.021
Neg logit contributions
e-0.036
 moderation-0.035
 hate-0.033
 arrest-0.033
 client-0.033
Token know
Token ID760
Feature activation-8.07e-03

Pos logit contributions
 moderation+0.043
e+0.042
 generation+0.042
 arrest+0.042
keeping+0.042
Neg logit contributions
der-0.072
lessly-0.059
ier-0.059
lic-0.056
ing-0.055
Token you
Token ID345
Feature activation+3.51e-02

Pos logit contributions
der+0.019
lessly+0.017
rier+0.016
ier+0.015
lights+0.015
Neg logit contributions
e-0.013
 moderation-0.013
 hate-0.012
 arrest-0.011
 argue-0.011
Token did
Token ID750
Feature activation-3.59e-03

Pos logit contributions
 moderation+0.078
e+0.076
 arrest+0.075
 client+0.073
 courtroom+0.073
Neg logit contributions
der-0.063
lessly-0.049
ier-0.047
rier-0.042
ing-0.041
Token not
Token ID407
Feature activation+5.10e-02

Pos logit contributions
der+0.006
lessly+0.004
ier+0.004
rier+0.004
ing+0.004
Neg logit contributions
|-0.008
 moderation-0.008
e-0.008
keeping-0.008
atherine-0.008
Token mean
Token ID1612
Feature activation+4.87e-02

Pos logit contributions
 moderation+0.099
 generation+0.099
 arrest+0.098
e+0.097
keeping+0.095
Neg logit contributions
der-0.089
ier-0.073
lessly-0.072
lic-0.067
ing-0.067
Token to
Token ID284
Feature activation+1.08e-01

Pos logit contributions
e+0.117
 moderation+0.116
|+0.112
 generation+0.106
 arrest+0.104
Neg logit contributions
der-0.086
lessly-0.063
ing-0.062
rier-0.062
ier-0.058
Token ruin
Token ID16866
Feature activation+6.29e-02

Pos logit contributions
 moderation+0.195
e+0.195
|+0.186
 generation+0.182
atherine+0.176
Neg logit contributions
der-0.229
rier-0.182
ier-0.176
lessly-0.174
ing-0.173
Token the
Token ID262
Feature activation-3.64e-04

Pos logit contributions
e+0.113
|+0.111
 moderation+0.106
 argue+0.106
 hate+0.105
Neg logit contributions
der-0.139
rier-0.113
lessly-0.107
lights-0.103
ier-0.100
Token cake
Token ID12187
Feature activation+1.90e-02

Pos logit contributions
der+0.001
lessly+0.001
rier+0.001
ier+0.001
lights+0.001
Neg logit contributions
e-0.001
 moderation-0.001
 hate-0.001
 client-0.001
 arrest-0.001
Token.
Token ID13
Feature activation+3.20e-02

Pos logit contributions
e+0.041
 moderation+0.040
 hate+0.038
 argue+0.038
 client+0.037
Neg logit contributions
der-0.035
lessly-0.030
rier-0.027
lic-0.026
lights-0.025
Token But
Token ID887
Feature activation-1.73e-02

Pos logit contributions
e+0.062
 moderation+0.061
 argue+0.060
 arrest+0.058
|+0.058
Neg logit contributions
der-0.066
lessly-0.053
rier-0.051
ier-0.046
lights-0.045
Token was
Token ID373
Feature activation+2.97e-02

Pos logit contributions
der+0.018
lessly+0.013
rier+0.012
ier+0.012
ing+0.012
Neg logit contributions
 moderation-0.030
e-0.029
 client-0.029
 arrest-0.029
 argue-0.028
Token an
Token ID281
Feature activation+2.59e-02

Pos logit contributions
 client+0.080
 courtroom+0.079
 argue+0.079
keeping+0.078
 harm+0.076
Neg logit contributions
der-0.034
lessly-0.024
 brightly-0.018
bled-0.018
lic-0.018
Token old
Token ID1468
Feature activation+2.95e-02

Pos logit contributions
 moderation+0.055
e+0.055
keeping+0.053
|+0.053
 arrest+0.052
Neg logit contributions
der-0.047
lessly-0.037
rier-0.037
ier-0.034
lights-0.033
Token judge
Token ID5052
Feature activation+7.25e-03

Pos logit contributions
e+0.051
 moderation+0.051
 hate+0.046
 client+0.046
 arrest+0.046
Neg logit contributions
der-0.068
lessly-0.058
rier-0.055
ier-0.053
lights-0.052
Token named
Token ID3706
Feature activation+4.63e-02

Pos logit contributions
 moderation+0.015
e+0.014
 argue+0.014
|+0.014
 hate+0.014
Neg logit contributions
der-0.014
lessly-0.011
rier-0.011
ier-0.011
ing-0.010
Token Mr
Token ID1770
Feature activation+1.08e-01

Pos logit contributions
 moderation+0.077
e+0.073
 argue+0.069
keeping+0.068
 hate+0.068
Neg logit contributions
der-0.105
lessly-0.083
lic-0.082
rier-0.081
ier-0.075
Token.
Token ID13
Feature activation+4.24e-02

Pos logit contributions
e+0.290
 moderation+0.283
 arrest+0.282
 argue+0.281
 hate+0.279
Neg logit contributions
der-0.145
ier-0.089
rier-0.086
lessly-0.084
ing-0.077
Token Owl
Token ID37007
Feature activation+4.05e-02

Pos logit contributions
e+0.067
 moderation+0.065
 hate+0.062
 argue+0.061
 obey+0.060
Neg logit contributions
der-0.096
rier-0.079
lessly-0.077
ier-0.074
ing-0.071
Token.
Token ID13
Feature activation+2.19e-02

Pos logit contributions
 moderation+0.103
e+0.100
 argue+0.097
 arrest+0.096
 hate+0.096
Neg logit contributions
der-0.062
lessly-0.046
rier-0.040
ier-0.040
ing-0.038
Token He
Token ID679
Feature activation+8.70e-03

Pos logit contributions
e+0.044
 moderation+0.043
 argue+0.040
 hate+0.040
 arrest+0.040
Neg logit contributions
der-0.043
lessly-0.035
rier-0.033
ier-0.030
ing-0.029
Token lived
Token ID5615
Feature activation-1.24e-02

Pos logit contributions
 moderation+0.021
 arrest+0.019
e+0.019
 mischief+0.019
 argue+0.018
Neg logit contributions
der-0.014
lessly-0.011
lic-0.010
ier-0.010
rier-0.009
Token was
Token ID373
Feature activation+2.69e-02

Pos logit contributions
der+0.020
lessly+0.014
ier+0.013
rier+0.013
ing+0.012
Neg logit contributions
 moderation-0.035
e-0.034
 client-0.034
 arrest-0.034
 argue-0.033
Token a
Token ID257
Feature activation+3.81e-02

Pos logit contributions
 client+0.068
 argue+0.067
 courtroom+0.066
keeping+0.065
 arrest+0.064
Neg logit contributions
der-0.037
lessly-0.028
ier-0.022
ing-0.022
bled-0.022
Token kind
Token ID1611
Feature activation+1.60e-02

Pos logit contributions
e+0.088
 moderation+0.087
keeping+0.082
 arrest+0.081
 argue+0.081
Neg logit contributions
der-0.065
lessly-0.052
rier-0.050
ier-0.045
lights-0.044
Token lady
Token ID10846
Feature activation+3.78e-02

Pos logit contributions
 moderation+0.025
e+0.025
 client+0.024
 hate+0.023
 arrest+0.022
Neg logit contributions
der-0.038
lessly-0.033
rier-0.031
ier-0.030
lights-0.030
Token named
Token ID3706
Feature activation+3.76e-02

Pos logit contributions
 moderation+0.068
 argue+0.064
 arrest+0.063
 mischief+0.063
e+0.062
Neg logit contributions
der-0.081
lessly-0.068
ier-0.063
rier-0.063
ing-0.063
Token Mrs
Token ID9074
Feature activation+1.07e-01

Pos logit contributions
 moderation+0.069
e+0.069
 argue+0.064
 hate+0.064
 client+0.063
Neg logit contributions
der-0.077
lessly-0.065
lic-0.063
rier-0.061
ier-0.059
Token.
Token ID13
Feature activation+3.51e-02

Pos logit contributions
e+0.255
 moderation+0.253
 hate+0.243
 argue+0.241
 arrest+0.241
Neg logit contributions
der-0.178
lessly-0.132
rier-0.125
ier-0.123
lights-0.112
Token Smith
Token ID4176
Feature activation+6.24e-02

Pos logit contributions
e+0.042
 moderation+0.040
 hate+0.037
 argue+0.035
 client+0.035
Neg logit contributions
der-0.099
lessly-0.086
rier-0.083
ier-0.082
lights-0.079
Token.
Token ID13
Feature activation+1.56e-02

Pos logit contributions
 moderation+0.162
 arrest+0.157
e+0.157
 argue+0.154
 mischief+0.152
Neg logit contributions
der-0.090
lessly-0.067
ing-0.057
ier-0.056
rier-0.055
Token She
Token ID1375
Feature activation+5.93e-03

Pos logit contributions
e+0.034
 moderation+0.034
 hate+0.031
 arrest+0.031
 argue+0.031
Neg logit contributions
der-0.029
lessly-0.023
rier-0.022
ier-0.021
lights-0.020
Token lived
Token ID5615
Feature activation-8.57e-03

Pos logit contributions
 moderation+0.013
e+0.013
 arrest+0.012
 argue+0.012
 hate+0.012
Neg logit contributions
der-0.011
lessly-0.009
ier-0.008
lic-0.007
rier-0.007
Token it
Token ID340
Feature activation-2.66e-04

Pos logit contributions
e+0.010
 moderation+0.010
 hate+0.009
 arrest+0.009
 client+0.009
Neg logit contributions
der-0.011
lessly-0.009
rier-0.009
ier-0.008
lights-0.008
Token,
Token ID11
Feature activation-4.98e-03

Pos logit contributions
der+0.000
lessly+0.000
rier+0.000
lights+0.000
ier+0.000
Neg logit contributions
e-0.001
 moderation-0.001
 hate-0.001
 argue-0.001
 client-0.001
Token because
Token ID780
Feature activation+1.84e-02

Pos logit contributions
der+0.007
lessly+0.006
rier+0.005
ier+0.005
lights+0.005
Neg logit contributions
e-0.013
 moderation-0.012
 client-0.012
 arrest-0.012
|-0.012
Token she
Token ID673
Feature activation-4.36e-03

Pos logit contributions
e+0.040
 moderation+0.039
 hate+0.037
 arrest+0.036
 client+0.036
Neg logit contributions
der-0.034
lessly-0.028
rier-0.027
ier-0.025
lights-0.025
Token liked
Token ID8288
Feature activation+3.08e-02

Pos logit contributions
der+0.008
lessly+0.007
lic+0.006
rier+0.006
ier+0.006
Neg logit contributions
 moderation-0.009
e-0.009
 argue-0.009
 arrest-0.009
 hate-0.009
Token Mrs
Token ID9074
Feature activation+1.06e-01

Pos logit contributions
e+0.062
 moderation+0.059
|+0.058
 hate+0.057
 arrest+0.056
Neg logit contributions
der-0.061
lessly-0.049
rier-0.047
lights-0.044
ing-0.044
Token.
Token ID13
Feature activation+3.55e-02

Pos logit contributions
e+0.267
 moderation+0.266
 arrest+0.260
 hate+0.254
 argue+0.253
Neg logit contributions
der-0.165
lessly-0.115
rier-0.112
ier-0.111
lights-0.101
Token Lee
Token ID5741
Feature activation+4.78e-02

Pos logit contributions
e+0.053
 moderation+0.047
 hate+0.045
 argue+0.043
 arrest+0.042
Neg logit contributions
der-0.092
rier-0.077
lessly-0.077
ier-0.072
lights-0.071
Token and
Token ID290
Feature activation+1.03e-02

Pos logit contributions
e+0.109
 moderation+0.108
 hate+0.101
 arrest+0.101
 client+0.100
Neg logit contributions
der-0.083
lessly-0.067
rier-0.063
ier-0.059
lights-0.057
Token her
Token ID607
Feature activation+1.15e-02

Pos logit contributions
e+0.020
 moderation+0.019
 hate+0.018
 argue+0.018
 arrest+0.018
Neg logit contributions
der-0.022
lessly-0.018
rier-0.017
ier-0.017
lights-0.016
Token cat
Token ID3797
Feature activation+2.36e-02

Pos logit contributions
e+0.020
 moderation+0.019
|+0.019
 arrest+0.017
keeping+0.017
Neg logit contributions
der-0.027
rier-0.022
lessly-0.022
ier-0.020
lights-0.020
Token dragon
Token ID10441
Feature activation+3.23e-03

Pos logit contributions
der+0.114
lessly+0.096
rier+0.092
ier+0.086
lights+0.085
Neg logit contributions
e-0.107
 moderation-0.106
 hate-0.097
 argue-0.096
 arrest-0.095
Token loved
Token ID6151
Feature activation+8.96e-03

Pos logit contributions
e+0.006
 moderation+0.006
 hate+0.006
 argue+0.006
 arrest+0.006
Neg logit contributions
der-0.007
rier-0.005
lessly-0.005
lights-0.005
ier-0.005
Token it
Token ID340
Feature activation-4.46e-03

Pos logit contributions
 moderation+0.020
e+0.019
|+0.018
 client+0.018
 arrest+0.018
Neg logit contributions
der-0.017
lessly-0.014
rier-0.012
ier-0.012
lights-0.012
Token!
Token ID0
Feature activation-1.20e-02

Pos logit contributions
der+0.008
lessly+0.006
rier+0.006
ier+0.006
lights+0.006
Neg logit contributions
e-0.010
 moderation-0.010
 hate-0.009
 argue-0.009
 arrest-0.009
Token<|endoftext|>
Token ID50256
Feature activation+4.39e-02

Pos logit contributions
der+0.023
lessly+0.019
rier+0.017
lights+0.016
ier+0.016
Neg logit contributions
 moderation-0.026
e-0.026
 argue-0.024
 hate-0.024
 client-0.024
TokenLu
Token ID25596
Feature activation+1.06e-01

Pos logit contributions
e+0.080
 moderation+0.080
 hate+0.074
 client+0.073
 arrest+0.073
Neg logit contributions
der-0.096
lessly-0.081
rier-0.077
ier-0.074
lights-0.072
Tokency
Token ID948
Feature activation+2.41e-02

Pos logit contributions
 moderation+0.160
 client+0.153
 mischief+0.150
 courtroom+0.144
 meetings+0.143
Neg logit contributions
der-0.229
lessly-0.219
led-0.212
ing-0.209
ly-0.202
Token was
Token ID373
Feature activation-1.35e-02

Pos logit contributions
 moderation+0.061
e+0.059
 client+0.057
 hate+0.056
 argue+0.056
Neg logit contributions
der-0.036
lessly-0.027
rier-0.024
ier-0.023
lights-0.022
Token jealous
Token ID19354
Feature activation-2.30e-02

Pos logit contributions
der+0.028
lessly+0.023
rier+0.022
ier+0.021
lights+0.020
Neg logit contributions
 moderation-0.026
e-0.026
 client-0.024
 arrest-0.024
 argue-0.024
Token of
Token ID286
Feature activation+5.27e-02

Pos logit contributions
der+0.042
rier+0.039
lessly+0.037
bled+0.034
lic+0.033
Neg logit contributions
e-0.043
 hate-0.042
 moderation-0.041
 client-0.040
 argue-0.039
Token her
Token ID607
Feature activation+3.73e-02

Pos logit contributions
e+0.098
 moderation+0.096
|+0.094
 client+0.092
keeping+0.089
Neg logit contributions
der-0.110
lessly-0.084
rier-0.079
lights-0.077
ing-0.076
TokenDo
Token ID5211
Feature activation+1.03e-02

Pos logit contributions
e+0.049
 client+0.048
 moderation+0.047
 hate+0.046
 mischief+0.046
Neg logit contributions
der-0.060
lessly-0.048
rier-0.047
lic-0.046
ing-0.045
Token you
Token ID345
Feature activation+4.73e-02

Pos logit contributions
 moderation+0.018
keeping+0.017
e+0.017
 mischief+0.017
 generation+0.016
Neg logit contributions
der-0.022
lessly-0.018
rier-0.017
ing-0.017
lic-0.017
Token know
Token ID760
Feature activation+2.04e-02

Pos logit contributions
keeping+0.078
 moderation+0.076
 arrest+0.074
atherine+0.073
 client+0.073
Neg logit contributions
der-0.102
lessly-0.092
ier-0.083
lic-0.079
ly-0.079
Token what
Token ID644
Feature activation+4.35e-02

Pos logit contributions
e+0.041
 moderation+0.041
 hate+0.038
 client+0.038
 arrest+0.037
Neg logit contributions
der-0.040
lessly-0.035
rier-0.033
ier-0.030
lights-0.030
Token is
Token ID318
Feature activation+2.31e-02

Pos logit contributions
 moderation+0.117
 argue+0.116
|+0.113
e+0.112
 hate+0.110
Neg logit contributions
der-0.051
ing-0.034
ier-0.031
 that-0.028
lessly-0.028
Token in
Token ID287
Feature activation+1.06e-01

Pos logit contributions
e+0.053
 moderation+0.052
 hate+0.049
 client+0.048
 arrest+0.048
Neg logit contributions
der-0.041
lessly-0.033
rier-0.031
ier-0.029
lights-0.028
Token the
Token ID262
Feature activation+9.87e-03

Pos logit contributions
e+0.218
|+0.212
 moderation+0.211
 argue+0.208
 hate+0.208
Neg logit contributions
der-0.185
rier-0.149
lessly-0.146
ier-0.136
lights-0.134
Token mystery
Token ID10715
Feature activation-1.21e-02

Pos logit contributions
e+0.023
 moderation+0.023
 hate+0.021
 argue+0.021
 arrest+0.021
Neg logit contributions
der-0.017
lessly-0.013
rier-0.013
ier-0.012
lights-0.011
Token box
Token ID3091
Feature activation-4.53e-03

Pos logit contributions
der+0.031
lessly+0.028
rier+0.026
bled+0.025
ier+0.024
Neg logit contributions
e-0.017
 moderation-0.016
 client-0.015
 preschool-0.015
 courtroom-0.014
Token?"
Token ID1701
Feature activation+1.52e-02

Pos logit contributions
der+0.013
lessly+0.012
rier+0.011
lights+0.011
bled+0.010
Neg logit contributions
e-0.005
 moderation-0.005
 argue-0.004
 fight-0.004
 hate-0.004
Token Sue
Token ID26089
Feature activation+1.97e-02

Pos logit contributions
 moderation+0.025
 client+0.023
e+0.023
 conflicts+0.022
 generation+0.022
Neg logit contributions
der-0.036
lic-0.029
lessly-0.029
ly-0.028
ing-0.027
Token.
Token ID13
Feature activation+3.03e-03

Pos logit contributions
e+0.004
 moderation+0.004
 hate+0.003
 arrest+0.003
 client+0.003
Neg logit contributions
der-0.003
lessly-0.002
rier-0.002
ier-0.002
lights-0.002
Token The
Token ID383
Feature activation-5.44e-02

Pos logit contributions
e+0.007
 moderation+0.007
 argue+0.007
 hate+0.007
 harm+0.007
Neg logit contributions
der-0.005
lessly-0.004
lights-0.003
rier-0.003
lic-0.003
Token end
Token ID886
Feature activation+1.97e-04

Pos logit contributions
der+0.117
lessly+0.097
rier+0.095
lights+0.089
ier+0.087
Neg logit contributions
e-0.102
 moderation-0.098
|-0.092
keeping-0.091
 arrest-0.090
Token.
Token ID13
Feature activation+5.03e-03

Pos logit contributions
e+0.000
 moderation+0.000
 hate+0.000
 client+0.000
 argue+0.000
Neg logit contributions
der-0.000
rier-0.000
lessly-0.000
ier-0.000
bled-0.000
Token<|endoftext|>
Token ID50256
Feature activation+5.21e-02

Pos logit contributions
 argue+0.012
e+0.012
 arrest+0.012
 hate+0.012
 moderation+0.012
Neg logit contributions
der-0.008
lessly-0.006
lights-0.005
lic-0.005
rier-0.005
TokenMrs
Token ID27034
Feature activation+1.06e-01

Pos logit contributions
 moderation+0.114
 argue+0.111
 client+0.110
 hate+0.110
e+0.109
Neg logit contributions
der-0.089
lessly-0.074
ing-0.066
ier-0.066
rier-0.065
Token.
Token ID13
Feature activation+2.93e-02

Pos logit contributions
 moderation+0.266
e+0.264
 hate+0.257
 argue+0.254
 arrest+0.252
Neg logit contributions
der-0.162
lessly-0.115
rier-0.112
ier-0.110
ing-0.100
Token Brown
Token ID4373
Feature activation+4.86e-02

Pos logit contributions
 moderation+0.046
e+0.045
 hate+0.044
 harm+0.043
 client+0.042
Neg logit contributions
der-0.068
lessly-0.057
ing-0.056
rier-0.054
lic-0.054
Token took
Token ID1718
Feature activation-7.05e-03

Pos logit contributions
 moderation+0.106
e+0.104
 client+0.099
 arrest+0.098
 argue+0.098
Neg logit contributions
der-0.086
lessly-0.069
ier-0.064
rier-0.063
ing-0.060
Token her
Token ID607
Feature activation+2.27e-02

Pos logit contributions
der+0.015
lessly+0.013
rier+0.012
ier+0.011
ing+0.011
Neg logit contributions
 moderation-0.013
e-0.013
 argue-0.012
 hate-0.012
 client-0.012
Token little
Token ID1310
Feature activation+2.33e-02

Pos logit contributions
 moderation+0.045
e+0.044
|+0.041
atherine+0.040
 argue+0.040
Neg logit contributions
der-0.044
lessly-0.036
rier-0.034
lights-0.034
ing-0.032
Token morning
Token ID3329
Feature activation-1.18e-02

Pos logit contributions
der+0.006
rier+0.005
lessly+0.005
ier+0.005
lights+0.005
Neg logit contributions
e-0.004
 moderation-0.004
|-0.004
 arrest-0.004
Susan-0.004
Token j
Token ID474
Feature activation+2.98e-02

Pos logit contributions
der+0.024
lessly+0.020
rier+0.019
ier+0.018
lights+0.017
Neg logit contributions
e-0.025
 moderation-0.024
 hate-0.022
 arrest-0.022
 argue-0.022
Tokenogs
Token ID18463
Feature activation-1.67e-02

Pos logit contributions
e+0.060
 moderation+0.060
 hate+0.055
 arrest+0.055
 client+0.054
Neg logit contributions
der-0.060
lessly-0.050
rier-0.047
ier-0.044
lights-0.044
Token.
Token ID13
Feature activation-5.70e-03

Pos logit contributions
der+0.031
lessly+0.024
rier+0.024
lights+0.022
ier+0.022
Neg logit contributions
e-0.038
 moderation-0.035
 hate-0.034
 argue-0.033
|-0.033
Token<|endoftext|>
Token ID50256
Feature activation+5.03e-02

Pos logit contributions
der+0.010
lessly+0.008
rier+0.007
lights+0.006
lic+0.006
Neg logit contributions
e-0.013
 moderation-0.013
 argue-0.013
 hate-0.013
 harm-0.012
TokenLu
Token ID25596
Feature activation+1.06e-01

Pos logit contributions
 moderation+0.104
e+0.101
 argue+0.099
 hate+0.099
 client+0.098
Neg logit contributions
der-0.094
lessly-0.080
ier-0.071
rier-0.071
ing-0.070
Tokency
Token ID948
Feature activation+3.08e-02

Pos logit contributions
 moderation+0.141
 client+0.134
 mischief+0.130
 meetings+0.127
 arrest+0.123
Neg logit contributions
der-0.256
lessly-0.237
ing-0.234
led-0.225
ier-0.220
Token wanted
Token ID2227
Feature activation+7.12e-04

Pos logit contributions
 moderation+0.084
e+0.080
 client+0.078
 argue+0.078
 hate+0.077
Neg logit contributions
der-0.040
lessly-0.028
rier-0.024
ier-0.024
ing-0.023
Token to
Token ID284
Feature activation+6.27e-02

Pos logit contributions
e+0.002
 moderation+0.002
|+0.001
 client+0.001
 arrest+0.001
Neg logit contributions
der-0.001
lessly-0.001
rier-0.001
ier-0.001
lights-0.001
Token be
Token ID307
Feature activation+2.55e-02

Pos logit contributions
 moderation+0.112
atherine+0.108
|+0.107
 conflicts+0.107
 client+0.103
Neg logit contributions
der-0.114
lessly-0.092
rier-0.089
lights-0.088
ier-0.082
Token an
Token ID281
Feature activation+2.30e-02

Pos logit contributions
e+0.049
 moderation+0.049
 arrest+0.048
|+0.048
 argue+0.047
Neg logit contributions
der-0.047
lessly-0.042
rier-0.041
ier-0.039
ing-0.035
Token was
Token ID373
Feature activation+2.40e-02

Pos logit contributions
der+0.019
lessly+0.014
rier+0.013
ier+0.013
ing+0.012
Neg logit contributions
 moderation-0.034
e-0.033
 client-0.032
 arrest-0.032
 argue-0.032
Token an
Token ID281
Feature activation+2.39e-02

Pos logit contributions
 client+0.062
 argue+0.061
 courtroom+0.060
 harm+0.059
keeping+0.059
Neg logit contributions
der-0.032
lessly-0.022
ing-0.018
rier-0.018
ier-0.017
Token elderly
Token ID13830
Feature activation+3.85e-02

Pos logit contributions
 moderation+0.049
e+0.049
|+0.046
 argue+0.046
 arrest+0.045
Neg logit contributions
der-0.047
lessly-0.038
rier-0.037
ier-0.034
lights-0.033
Token cat
Token ID3797
Feature activation+2.22e-02

Pos logit contributions
e+0.065
 moderation+0.065
 hate+0.059
 client+0.058
 arrest+0.058
Neg logit contributions
der-0.090
lessly-0.077
rier-0.073
ier-0.070
lights-0.069
Token named
Token ID3706
Feature activation+4.97e-02

Pos logit contributions
 moderation+0.033
e+0.030
 argue+0.029
 hate+0.028
 mischief+0.028
Neg logit contributions
der-0.057
lessly-0.048
rier-0.045
ing-0.044
ier-0.044
Token Mr
Token ID1770
Feature activation+1.06e-01

Pos logit contributions
e+0.095
 moderation+0.092
 client+0.090
 argue+0.089
 hate+0.088
Neg logit contributions
der-0.097
lessly-0.074
lic-0.073
ing-0.071
rier-0.068
Token.
Token ID13
Feature activation+3.99e-02

Pos logit contributions
e+0.280
 arrest+0.272
 moderation+0.272
 hate+0.271
 argue+0.270
Neg logit contributions
der-0.146
ier-0.093
lessly-0.090
rier-0.085
ing-0.078
Token Wh
Token ID854
Feature activation+6.00e-02

Pos logit contributions
e+0.070
 hate+0.068
 client+0.068
 moderation+0.068
 obey+0.067
Neg logit contributions
der-0.079
lessly-0.064
rier-0.062
ing-0.061
lic-0.060
Tokenisk
Token ID1984
Feature activation+5.96e-02

Pos logit contributions
 moderation+0.081
e+0.081
 argue+0.077
 arrest+0.074
 hate+0.074
Neg logit contributions
der-0.160
lessly-0.136
ier-0.132
rier-0.126
ing-0.124
Tokeners
Token ID364
Feature activation+1.98e-02

Pos logit contributions
 moderation+0.163
e+0.158
 argue+0.157
 hate+0.156
 client+0.153
Neg logit contributions
der-0.075
lessly-0.059
ier-0.052
rier-0.051
ing-0.044
Token.
Token ID13
Feature activation+2.11e-02

Pos logit contributions
 moderation+0.050
e+0.048
 argue+0.046
 arrest+0.046
 client+0.046
Neg logit contributions
der-0.030
lessly-0.022
rier-0.019
ier-0.019
ing-0.018
Token there
Token ID612
Feature activation-1.09e-02

Pos logit contributions
e+0.000
 moderation+0.000
 hate+0.000
 arrest+0.000
 client+0.000
Neg logit contributions
der-0.000
lessly-0.000
rier-0.000
ier-0.000
lights-0.000
Token was
Token ID373
Feature activation+3.00e-02

Pos logit contributions
der+0.015
lessly+0.012
ier+0.010
rier+0.010
lights+0.010
Neg logit contributions
 moderation-0.028
e-0.027
 client-0.027
 arrest-0.027
 argue-0.026
Token a
Token ID257
Feature activation+3.92e-02

Pos logit contributions
 client+0.078
 argue+0.077
 courtroom+0.076
keeping+0.074
 arrest+0.072
Neg logit contributions
der-0.039
lessly-0.030
ing-0.024
 brightly-0.023
bled-0.023
Token lady
Token ID10846
Feature activation+3.66e-02

Pos logit contributions
e+0.092
 moderation+0.091
keeping+0.087
 arrest+0.085
 argue+0.085
Neg logit contributions
der-0.065
lessly-0.051
rier-0.050
ier-0.044
lic-0.043
Token named
Token ID3706
Feature activation+4.19e-02

Pos logit contributions
 moderation+0.062
e+0.057
 argue+0.057
 arrest+0.056
 client+0.056
Neg logit contributions
der-0.083
lessly-0.071
rier-0.066
ier-0.066
ing-0.065
Token Mrs
Token ID9074
Feature activation+1.06e-01

Pos logit contributions
e+0.077
 moderation+0.077
 argue+0.072
 client+0.071
 arrest+0.070
Neg logit contributions
der-0.085
lessly-0.071
lic-0.069
rier-0.066
ly-0.064
Token.
Token ID13
Feature activation+3.83e-02

Pos logit contributions
e+0.249
 moderation+0.244
 hate+0.236
 argue+0.234
 arrest+0.233
Neg logit contributions
der-0.182
lessly-0.136
rier-0.129
ier-0.129
lights-0.116
Token Greene
Token ID28132
Feature activation+3.38e-02

Pos logit contributions
e+0.047
 moderation+0.045
 hate+0.042
 client+0.040
 argue+0.040
Neg logit contributions
der-0.106
lessly-0.093
rier-0.089
ier-0.087
lic-0.084
Token.
Token ID13
Feature activation+2.00e-02

Pos logit contributions
 moderation+0.083
e+0.080
 arrest+0.077
 hate+0.076
 argue+0.076
Neg logit contributions
der-0.055
lessly-0.042
rier-0.037
ier-0.037
lights-0.035
Token One
Token ID1881
Feature activation+3.25e-02

Pos logit contributions
e+0.042
 moderation+0.042
 hate+0.039
 arrest+0.038
 argue+0.038
Neg logit contributions
der-0.038
lessly-0.031
rier-0.030
ier-0.028
lights-0.027
Token day
Token ID1110
Feature activation-2.48e-02

Pos logit contributions
 arrest+0.054
 argue+0.053
 moderation+0.053
keeping+0.051
e+0.051
Neg logit contributions
der-0.072
lessly-0.064
ing-0.060
ier-0.057
lic-0.056
Token know
Token ID760
Feature activation-6.43e-03

Pos logit contributions
 moderation+0.032
e+0.032
 arrest+0.030
keeping+0.029
 client+0.029
Neg logit contributions
der-0.071
lessly-0.062
ier-0.058
rier-0.057
lic-0.056
Token you
Token ID345
Feature activation+2.95e-02

Pos logit contributions
der+0.017
lessly+0.015
rier+0.014
ier+0.014
lights+0.014
Neg logit contributions
e-0.009
 moderation-0.009
 hate-0.008
 arrest-0.008
 client-0.008
Token did
Token ID750
Feature activation-5.62e-03

Pos logit contributions
 moderation+0.056
e+0.056
 arrest+0.052
 client+0.052
 argue+0.051
Neg logit contributions
der-0.062
lessly-0.052
ier-0.048
rier-0.048
lights-0.045
Token not
Token ID407
Feature activation+4.76e-02

Pos logit contributions
der+0.011
lessly+0.009
rier+0.009
ier+0.009
lights+0.008
Neg logit contributions
e-0.011
 moderation-0.011
|-0.010
 arrest-0.010
 client-0.010
Token mean
Token ID1612
Feature activation+4.80e-02

Pos logit contributions
 moderation+0.071
e+0.070
 arrest+0.069
|+0.068
atherine+0.067
Neg logit contributions
der-0.110
lessly-0.095
ier-0.092
lic-0.085
ing-0.084
Token to
Token ID284
Feature activation+1.05e-01

Pos logit contributions
e+0.106
 moderation+0.104
|+0.099
 arrest+0.096
 hate+0.094
Neg logit contributions
der-0.093
lessly-0.073
rier-0.070
ier-0.067
ing-0.066
Token hurt
Token ID5938
Feature activation+5.41e-02

Pos logit contributions
 moderation+0.177
e+0.177
|+0.163
 arrest+0.160
 client+0.157
Neg logit contributions
der-0.242
lessly-0.201
rier-0.196
ier-0.190
lights-0.184
Token the
Token ID262
Feature activation-1.23e-02

Pos logit contributions
|+0.099
e+0.091
 moderation+0.091
 argue+0.091
 hate+0.089
Neg logit contributions
der-0.116
rier-0.096
ing-0.093
lessly-0.092
ier-0.089
Token bowl
Token ID9396
Feature activation+7.95e-03

Pos logit contributions
der+0.026
lessly+0.022
rier+0.021
ier+0.020
lights+0.019
Neg logit contributions
 moderation-0.024
e-0.024
 hate-0.022
 argue-0.022
 arrest-0.021
Token.
Token ID13
Feature activation+2.80e-02

Pos logit contributions
e+0.017
 moderation+0.016
 hate+0.015
 argue+0.015
 client+0.014
Neg logit contributions
der-0.015
lessly-0.013
rier-0.012
lic-0.012
lights-0.011
Token But
Token ID887
Feature activation-2.11e-02

Pos logit contributions
e+0.052
 moderation+0.051
 argue+0.047
 hate+0.047
 arrest+0.047
Neg logit contributions
der-0.062
lessly-0.052
rier-0.049
ier-0.047
lights-0.046
TokenWhich
Token ID13828
Feature activation+2.21e-02

Pos logit contributions
e+0.034
 moderation+0.033
 hate+0.030
 client+0.029
 arrest+0.029
Neg logit contributions
der-0.065
lessly-0.056
rier-0.054
ier-0.052
lights-0.051
Token animal
Token ID5044
Feature activation+1.43e-02

Pos logit contributions
 moderation+0.039
e+0.038
keeping+0.038
|+0.036
 hate+0.036
Neg logit contributions
der-0.045
rier-0.035
lessly-0.034
ing-0.034
lights-0.033
Token do
Token ID466
Feature activation+2.31e-02

Pos logit contributions
e+0.029
 moderation+0.029
 hate+0.027
 arrest+0.026
 client+0.026
Neg logit contributions
der-0.028
lessly-0.023
rier-0.022
ier-0.021
lights-0.021
Token you
Token ID345
Feature activation+4.80e-02

Pos logit contributions
 moderation+0.037
keeping+0.036
 argue+0.035
e+0.035
 courtroom+0.035
Neg logit contributions
der-0.052
lessly-0.043
rier-0.042
ier-0.041
ing-0.040
Token want
Token ID765
Feature activation+2.47e-02

Pos logit contributions
 moderation+0.068
e+0.065
 client+0.063
 arrest+0.062
keeping+0.060
Neg logit contributions
der-0.119
lessly-0.105
ier-0.099
rier-0.096
lic-0.093
Token to
Token ID284
Feature activation+1.05e-01

Pos logit contributions
e+0.047
 moderation+0.047
 hate+0.043
 arrest+0.043
 client+0.043
Neg logit contributions
der-0.052
lessly-0.044
rier-0.042
ier-0.040
lights-0.039
Token see
Token ID766
Feature activation+3.71e-02

Pos logit contributions
 moderation+0.162
e+0.158
 client+0.154
keeping+0.154
atherine+0.153
Neg logit contributions
der-0.242
lessly-0.196
rier-0.194
ier-0.192
lights-0.180
Token next
Token ID1306
Feature activation+1.14e-02

Pos logit contributions
e+0.061
 moderation+0.058
 hate+0.052
 arrest+0.051
 client+0.050
Neg logit contributions
der-0.091
lessly-0.081
rier-0.076
ier-0.072
lights-0.071
Token?"
Token ID1701
Feature activation+1.20e-02

Pos logit contributions
e+0.020
 moderation+0.020
 hate+0.018
 argue+0.018
 client+0.018
Neg logit contributions
der-0.026
lessly-0.022
rier-0.021
ier-0.020
lights-0.020
Token Tim
Token ID5045
Feature activation+1.52e-02

Pos logit contributions
e+0.018
 moderation+0.018
 client+0.017
 arrest+0.016
 argue+0.016
Neg logit contributions
der-0.030
lessly-0.026
rier-0.024
lic-0.023
lights-0.023
Tokenmy
Token ID1820
Feature activation-5.46e-03

Pos logit contributions
 moderation+0.025
 arrest+0.025
e+0.024
 client+0.023
 generation+0.022
Neg logit contributions
der-0.035
lessly-0.030
ier-0.028
rier-0.027
ing-0.027
Token The
Token ID383
Feature activation-4.05e-02

Pos logit contributions
 moderation+0.015
e+0.015
 hate+0.014
 arrest+0.014
 client+0.014
Neg logit contributions
der-0.025
lessly-0.021
rier-0.021
ier-0.020
lights-0.020
Token cat
Token ID3797
Feature activation+4.79e-03

Pos logit contributions
der+0.105
lessly+0.090
rier+0.086
ier+0.082
lights+0.081
Neg logit contributions
e-0.058
 moderation-0.057
|-0.051
keeping-0.051
 argue-0.050
Token wanted
Token ID2227
Feature activation+4.11e-03

Pos logit contributions
e+0.010
 moderation+0.009
 hate+0.009
 client+0.009
 arrest+0.009
Neg logit contributions
der-0.010
lessly-0.008
rier-0.008
ier-0.008
lights-0.007
Token to
Token ID284
Feature activation+6.78e-02

Pos logit contributions
e+0.010
 moderation+0.010
 argue+0.009
 client+0.009
atherine+0.009
Neg logit contributions
der-0.006
lessly-0.005
ing-0.004
rier-0.004
ier-0.004
Token eat
Token ID4483
Feature activation+2.18e-02

Pos logit contributions
 moderation+0.137
packs+0.131
atherine+0.130
 conflicts+0.128
 courtroom+0.124
Neg logit contributions
der-0.112
 that-0.089
 all-0.082
 spark-0.080
rier-0.080
Token Twe
Token ID24205
Feature activation+1.05e-01

Pos logit contributions
e+0.033
 moderation+0.033
 hate+0.030
 arrest+0.029
 argue+0.029
Neg logit contributions
der-0.054
lessly-0.047
rier-0.045
ier-0.044
lights-0.043
Tokenety
Token ID2963
Feature activation+5.16e-02

Pos logit contributions
 argue+0.122
 client+0.122
 mischief+0.122
e+0.120
 moderation+0.117
Neg logit contributions
der-0.270
ing-0.248
lessly-0.233
ier-0.233
lic-0.229
Token!
Token ID0
Feature activation+4.16e-03

Pos logit contributions
 moderation+0.130
 client+0.129
e+0.129
 courtroom+0.126
 arrest+0.125
Neg logit contributions
der-0.073
lessly-0.051
rier-0.049
ing-0.048
ier-0.047
Token\n
Token ID198
Feature activation-1.69e-03

Pos logit contributions
 moderation+0.006
e+0.006
 arrest+0.006
 hate+0.006
 argue+0.006
Neg logit contributions
der-0.010
lessly-0.009
ier-0.009
rier-0.008
lights-0.008
Token\n
Token ID198
Feature activation-2.95e-03

Pos logit contributions
der+0.003
lessly+0.003
rier+0.002
ier+0.002
lic+0.002
Neg logit contributions
 moderation-0.003
e-0.003
 client-0.003
 arrest-0.003
 argue-0.003
TokenTwe
Token ID32665
Feature activation+8.68e-02

Pos logit contributions
der+0.009
lessly+0.008
lights+0.008
rier+0.008
ier+0.008
Neg logit contributions
e-0.003
 moderation-0.003
 hate-0.002
 arrest-0.002
 argue-0.002
Token "
Token ID366
Feature activation+2.51e-02

Pos logit contributions
der+0.031
lessly+0.024
lights+0.024
ier+0.023
rier+0.023
Neg logit contributions
e-0.042
 moderation-0.042
 hate-0.040
 argue-0.039
 client-0.039
TokenWhere
Token ID8496
Feature activation+1.76e-02

Pos logit contributions
 moderation+0.040
e+0.040
 hate+0.038
 client+0.038
 argue+0.037
Neg logit contributions
der-0.058
lessly-0.050
rier-0.048
ier-0.045
ing-0.045
Token do
Token ID466
Feature activation+3.30e-02

Pos logit contributions
 moderation+0.040
e+0.039
 arrest+0.037
keeping+0.037
 argue+0.037
Neg logit contributions
der-0.029
lessly-0.024
ier-0.022
rier-0.021
ing-0.021
Token you
Token ID345
Feature activation+5.41e-02

Pos logit contributions
 moderation+0.060
 argue+0.058
keeping+0.057
e+0.057
|+0.056
Neg logit contributions
der-0.067
lessly-0.054
rier-0.054
ing-0.052
ier-0.051
Token want
Token ID765
Feature activation+3.22e-02

Pos logit contributions
 moderation+0.078
 client+0.072
e+0.072
keeping+0.072
atherine+0.070
Neg logit contributions
der-0.127
lessly-0.113
ier-0.109
rier-0.102
ly-0.099
Token to
Token ID284
Feature activation+1.05e-01

Pos logit contributions
e+0.059
 moderation+0.058
 hate+0.054
 arrest+0.053
 client+0.053
Neg logit contributions
der-0.071
lessly-0.060
rier-0.058
ier-0.055
lights-0.054
Token go
Token ID467
Feature activation+6.65e-02

Pos logit contributions
 moderation+0.182
e+0.175
 client+0.163
|+0.163
keeping+0.161
Neg logit contributions
der-0.230
lessly-0.192
rier-0.190
ier-0.186
lights-0.174
Token,
Token ID11
Feature activation+1.03e-02

Pos logit contributions
e+0.111
 moderation+0.103
 hate+0.097
 argue+0.095
 arrest+0.095
Neg logit contributions
der-0.169
lessly-0.146
rier-0.139
lights-0.131
ier-0.130
Token Blank
Token ID31990
Feature activation+4.69e-02

Pos logit contributions
e+0.017
 moderation+0.017
 hate+0.015
 client+0.015
 argue+0.015
Neg logit contributions
der-0.024
lessly-0.021
ier-0.020
rier-0.020
lights-0.019
Tokenet
Token ID316
Feature activation-5.12e-03

Pos logit contributions
e+0.089
 moderation+0.088
 hate+0.082
 arrest+0.081
 client+0.081
Neg logit contributions
der-0.100
lessly-0.084
rier-0.080
ier-0.077
lights-0.074
Token?"
Token ID1701
Feature activation+1.47e-02

Pos logit contributions
der+0.012
lessly+0.010
rier+0.010
lights+0.009
ier+0.009
Neg logit contributions
e-0.009
 moderation-0.009
 hate-0.008
 argue-0.008
 client-0.008
Token was
Token ID373
Feature activation+2.43e-02

Pos logit contributions
der+0.022
lessly+0.017
rier+0.015
ier+0.015
lights+0.014
Neg logit contributions
 moderation-0.037
e-0.036
 client-0.035
 arrest-0.035
 argue-0.034
Token a
Token ID257
Feature activation+3.75e-02

Pos logit contributions
 client+0.065
 argue+0.065
 courtroom+0.064
keeping+0.063
 harm+0.062
Neg logit contributions
der-0.029
lessly-0.021
 brightly-0.016
lic-0.015
ier-0.015
Token gray
Token ID12768
Feature activation+1.48e-02

Pos logit contributions
e+0.087
 moderation+0.086
keeping+0.082
 arrest+0.080
 argue+0.080
Neg logit contributions
der-0.063
lessly-0.050
rier-0.049
ier-0.043
lic-0.042
Token cat
Token ID3797
Feature activation+1.81e-02

Pos logit contributions
e+0.022
 moderation+0.022
 hate+0.019
 arrest+0.019
 argue+0.019
Neg logit contributions
der-0.038
lessly-0.033
rier-0.031
ier-0.030
lights-0.030
Token named
Token ID3706
Feature activation+4.48e-02

Pos logit contributions
 moderation+0.026
e+0.024
 argue+0.023
 arrest+0.022
 client+0.022
Neg logit contributions
der-0.047
lessly-0.040
rier-0.038
ier-0.037
lights-0.037
Token Mr
Token ID1770
Feature activation+1.05e-01

Pos logit contributions
 moderation+0.076
 argue+0.075
 client+0.074
keeping+0.074
e+0.073
Neg logit contributions
der-0.096
lessly-0.076
lic-0.075
ing-0.071
rier-0.068
Token.
Token ID13
Feature activation+4.16e-02

Pos logit contributions
 arrest+0.273
e+0.273
 moderation+0.272
 argue+0.271
 hate+0.269
Neg logit contributions
der-0.143
ier-0.090
lessly-0.088
rier-0.083
ing-0.076
Token Wh
Token ID854
Feature activation+5.77e-02

Pos logit contributions
e+0.069
 client+0.069
 hate+0.067
 moderation+0.067
 obey+0.066
Neg logit contributions
der-0.088
lessly-0.071
rier-0.068
lic-0.066
lights-0.065
Tokenisk
Token ID1984
Feature activation+5.95e-02

Pos logit contributions
 moderation+0.082
 argue+0.078
e+0.077
 arrest+0.075
 hate+0.073
Neg logit contributions
der-0.150
lessly-0.127
ier-0.123
ows-0.117
rier-0.117
Tokeners
Token ID364
Feature activation+1.97e-02

Pos logit contributions
 moderation+0.165
 argue+0.159
e+0.158
 hate+0.158
 client+0.155
Neg logit contributions
der-0.073
lessly-0.057
ier-0.050
rier-0.048
ing-0.042
Token.
Token ID13
Feature activation+2.10e-02

Pos logit contributions
 moderation+0.048
e+0.045
 arrest+0.045
 mischief+0.045
 argue+0.045
Neg logit contributions
der-0.031
lessly-0.022
ier-0.020
rier-0.019
ing-0.019
Token\n
Token ID198
Feature activation-1.12e-02

Pos logit contributions
der+0.022
lessly+0.019
lic+0.016
ing+0.014
lights+0.014
Neg logit contributions
 arrest-0.036
 client-0.036
 argue-0.036
 moderation-0.035
e-0.035
TokenThe
Token ID464
Feature activation-5.60e-02

Pos logit contributions
der+0.018
lessly+0.017
lic+0.015
ing+0.013
rier+0.012
Neg logit contributions
 client-0.023
 moderation-0.023
 argue-0.023
 harm-0.023
keeping-0.023
Token end
Token ID886
Feature activation-1.02e-04

Pos logit contributions
der+0.113
lessly+0.094
rier+0.091
ier+0.083
lights+0.082
Neg logit contributions
e-0.112
 moderation-0.110
 arrest-0.101
 hate-0.101
|-0.100
Token.
Token ID13
Feature activation-2.31e-03

Pos logit contributions
der+0.000
rier+0.000
lessly+0.000
ier+0.000
lights+0.000
Neg logit contributions
e-0.000
 moderation-0.000
 hate-0.000
|-0.000
 argue-0.000
Token<|endoftext|>
Token ID50256
Feature activation+4.87e-02

Pos logit contributions
der+0.004
lessly+0.003
rier+0.003
lic+0.003
ing+0.002
Neg logit contributions
 argue-0.005
e-0.005
 moderation-0.005
 hate-0.005
 harm-0.005
TokenLu
Token ID25596
Feature activation+1.05e-01

Pos logit contributions
 moderation+0.107
 client+0.106
 argue+0.106
 hate+0.103
 arrest+0.102
Neg logit contributions
der-0.081
lessly-0.068
ing-0.062
lic-0.061
ier-0.059
Tokency
Token ID948
Feature activation+3.47e-02

Pos logit contributions
 moderation+0.145
 client+0.145
 mischief+0.138
 arrest+0.132
 meetings+0.129
Neg logit contributions
der-0.246
lessly-0.226
ing-0.225
led-0.216
ier-0.211
Token was
Token ID373
Feature activation+7.85e-03

Pos logit contributions
 moderation+0.093
e+0.090
 client+0.087
 hate+0.086
 argue+0.086
Neg logit contributions
der-0.046
lessly-0.033
rier-0.029
ier-0.029
ing-0.027
Token very
Token ID845
Feature activation+4.78e-02

Pos logit contributions
 moderation+0.016
e+0.016
 client+0.015
 arrest+0.015
 argue+0.015
Neg logit contributions
der-0.015
lessly-0.013
rier-0.012
ier-0.011
ing-0.011
Token excited
Token ID6568
Feature activation-1.78e-03

Pos logit contributions
 moderation+0.095
e+0.095
 arrest+0.089
 client+0.088
keeping+0.087
Neg logit contributions
der-0.094
lessly-0.080
rier-0.076
ier-0.071
lights-0.069
Token to
Token ID284
Feature activation+5.63e-02

Pos logit contributions
der+0.003
lessly+0.003
rier+0.003
ier+0.002
lights+0.002
Neg logit contributions
e-0.004
 moderation-0.004
 hate-0.004
 arrest-0.003
 argue-0.003
Token know
Token ID760
Feature activation-8.93e-03

Pos logit contributions
 moderation+0.029
e+0.028
 arrest+0.027
 courtroom+0.026
 client+0.026
Neg logit contributions
der-0.058
lessly-0.051
ier-0.048
lic-0.046
rier-0.045
Token you
Token ID345
Feature activation+2.86e-02

Pos logit contributions
der+0.024
lessly+0.021
rier+0.020
ier+0.019
lights+0.019
Neg logit contributions
e-0.012
 moderation-0.012
 hate-0.011
 arrest-0.011
 client-0.011
Token did
Token ID750
Feature activation-7.80e-03

Pos logit contributions
 moderation+0.059
e+0.057
 arrest+0.056
 client+0.055
 courtroom+0.054
Neg logit contributions
der-0.056
lessly-0.045
ier-0.043
rier-0.039
ing-0.038
Token not
Token ID407
Feature activation+4.57e-02

Pos logit contributions
der+0.015
lessly+0.012
rier+0.012
ier+0.012
lights+0.011
Neg logit contributions
 moderation-0.016
e-0.016
|-0.015
 arrest-0.015
keeping-0.015
Token mean
Token ID1612
Feature activation+4.68e-02

Pos logit contributions
 moderation+0.075
 arrest+0.074
 generation+0.072
e+0.072
keeping+0.071
Neg logit contributions
der-0.096
lessly-0.084
ier-0.081
lic-0.076
ing-0.073
Token to
Token ID284
Feature activation+1.05e-01

Pos logit contributions
e+0.103
 moderation+0.102
|+0.096
 arrest+0.093
atherine+0.091
Neg logit contributions
der-0.090
lessly-0.071
rier-0.068
ier-0.065
ing-0.064
Token break
Token ID2270
Feature activation+4.97e-02

Pos logit contributions
 moderation+0.177
e+0.172
|+0.165
keeping+0.162
 generation+0.158
Neg logit contributions
der-0.236
lessly-0.190
rier-0.185
ier-0.182
ly-0.179
Token the
Token ID262
Feature activation-1.88e-02

Pos logit contributions
e+0.093
 moderation+0.092
 hate+0.086
 client+0.085
 argue+0.085
Neg logit contributions
der-0.107
lessly-0.089
rier-0.087
ier-0.082
lights-0.080
Token book
Token ID1492
Feature activation-1.72e-03

Pos logit contributions
der+0.040
lessly+0.033
rier+0.033
ier+0.031
lights+0.029
Neg logit contributions
e-0.035
 moderation-0.035
 hate-0.032
|-0.032
 argue-0.032
Token.
Token ID13
Feature activation+2.31e-02

Pos logit contributions
der+0.003
lessly+0.003
rier+0.003
lic+0.003
lights+0.002
Neg logit contributions
e-0.003
 moderation-0.003
 hate-0.003
 argue-0.003
 client-0.003
Token It
Token ID632
Feature activation-2.04e-02

Pos logit contributions
e+0.043
 moderation+0.042
 hate+0.039
 argue+0.038
 arrest+0.038
Neg logit contributions
der-0.051
lessly-0.043
rier-0.041
ier-0.039
lights-0.038
Token to
Token ID284
Feature activation+5.34e-02

Pos logit contributions
e+0.004
 moderation+0.004
 client+0.004
 argue+0.004
 arrest+0.004
Neg logit contributions
der-0.006
lessly-0.005
rier-0.005
ier-0.005
lights-0.005
Token visit
Token ID3187
Feature activation+4.37e-02

Pos logit contributions
beh+0.145
keeping+0.140
 colony+0.138
 conflicts+0.136
 moderation+0.135
Neg logit contributions
 all-0.041
 that-0.037
 for-0.036
der-0.036
 and-0.028
Token their
Token ID511
Feature activation+2.70e-02

Pos logit contributions
 moderation+0.086
e+0.085
 arrest+0.082
 hate+0.080
 client+0.080
Neg logit contributions
der-0.087
lessly-0.070
rier-0.066
ier-0.065
lights-0.063
Token neighbor
Token ID4780
Feature activation+4.62e-02

Pos logit contributions
e+0.060
 moderation+0.058
keeping+0.052
|+0.052
 hate+0.052
Neg logit contributions
der-0.048
rier-0.037
lights-0.036
ing-0.035
lessly-0.032
Token,
Token ID11
Feature activation+4.69e-03

Pos logit contributions
 moderation+0.123
e+0.119
 mischief+0.118
 hate+0.118
 arrest+0.117
Neg logit contributions
der-0.052
ing-0.034
ly-0.030
 for-0.028
boards-0.027
Token Mrs
Token ID9074
Feature activation+1.05e-01

Pos logit contributions
e+0.009
 moderation+0.009
keeping+0.008
 argue+0.008
 arrest+0.008
Neg logit contributions
der-0.010
lessly-0.008
lic-0.008
rier-0.008
ier-0.007
Token.
Token ID13
Feature activation+4.71e-02

Pos logit contributions
e+0.268
 moderation+0.267
 arrest+0.254
 hate+0.254
 argue+0.249
Neg logit contributions
der-0.158
lessly-0.114
rier-0.111
ier-0.106
lights-0.094
Token Lee
Token ID5741
Feature activation+3.92e-02

Pos logit contributions
e+0.064
 moderation+0.059
 hate+0.056
 conflicts+0.054
 obey+0.053
Neg logit contributions
der-0.120
rier-0.103
lessly-0.097
ier-0.094
ing-0.092
Token.
Token ID13
Feature activation+3.13e-02

Pos logit contributions
 moderation+0.100
e+0.100
 hate+0.094
 arrest+0.093
 argue+0.093
Neg logit contributions
der-0.057
lessly-0.044
rier-0.040
ier-0.038
lights-0.036
Token Mrs
Token ID9074
Feature activation+9.45e-02

Pos logit contributions
e+0.052
 moderation+0.050
 hate+0.045
 harm+0.044
 arrest+0.044
Neg logit contributions
der-0.073
lessly-0.061
rier-0.059
lights-0.056
ier-0.056
Token.
Token ID13
Feature activation+4.60e-02

Pos logit contributions
e+0.226
 moderation+0.224
 hate+0.211
 arrest+0.210
 argue+0.208
Neg logit contributions
der-0.156
lessly-0.122
rier-0.115
ier-0.109
lights-0.104
Token he
Token ID339
Feature activation-3.02e-02

Pos logit contributions
der+0.083
lessly+0.069
rier+0.066
ier+0.062
lights+0.061
Neg logit contributions
e-0.083
 moderation-0.083
 hate-0.077
 argue-0.076
 client-0.076
Token felt
Token ID2936
Feature activation-5.44e-02

Pos logit contributions
der+0.073
lessly+0.063
rier+0.061
ier+0.058
lights+0.057
Neg logit contributions
e-0.051
 moderation-0.047
 hate-0.045
 arrest-0.044
 argue-0.044
Token a
Token ID257
Feature activation+2.40e-03

Pos logit contributions
der+0.128
lessly+0.106
lights+0.106
rier+0.106
ier+0.101
Neg logit contributions
e-0.093
 moderation-0.092
 hate-0.088
 mischief-0.084
 panic-0.081
Token fl
Token ID781
Feature activation-3.85e-02

Pos logit contributions
 hate+0.004
 panic+0.004
e+0.003
 reason+0.003
 moderation+0.003
Neg logit contributions
der-0.006
lights-0.005
ier-0.005
lessly-0.005
lic-0.005
Tokenapping
Token ID5912
Feature activation-3.44e-02

Pos logit contributions
der+0.089
lights+0.075
boards+0.072
lessly+0.067
ier+0.062
Neg logit contributions
e-0.068
 hate-0.059
 arrest-0.053
 panic-0.050
 charge-0.050
Token sensation
Token ID18098
Feature activation-9.39e-02

Pos logit contributions
der+0.072
lessly+0.057
rier+0.057
ier+0.054
lic+0.054
Neg logit contributions
e-0.069
 hate-0.062
 moderation-0.060
 argue-0.059
 fight-0.058
Token in
Token ID287
Feature activation+2.88e-02

Pos logit contributions
der+0.194
rier+0.166
lights+0.165
lic+0.163
boards+0.151
Neg logit contributions
e-0.174
 hate-0.130
 harm-0.123
aster-0.113
 fight-0.110
Token the
Token ID262
Feature activation-2.29e-02

Pos logit contributions
e+0.051
 moderation+0.051
 client+0.047
 hate+0.046
 argue+0.046
Neg logit contributions
der-0.064
lessly-0.055
rier-0.052
ier-0.049
lights-0.049
Token air
Token ID1633
Feature activation-5.25e-02

Pos logit contributions
der+0.044
lessly+0.035
rier+0.034
ier+0.033
lights+0.031
Neg logit contributions
e-0.050
 moderation-0.049
 hate-0.045
 arrest-0.045
 client-0.044
Token.
Token ID13
Feature activation+1.34e-03

Pos logit contributions
der+0.107
lessly+0.089
lic+0.087
rier+0.087
lights+0.086
Neg logit contributions
e-0.101
 hate-0.089
 moderation-0.088
 argue-0.086
 fight-0.085
Token It
Token ID632
Feature activation-3.22e-02

Pos logit contributions
e+0.003
 moderation+0.003
 argue+0.003
 hate+0.003
 arrest+0.003
Neg logit contributions
der-0.002
lessly-0.002
rier-0.001
lic-0.001
lights-0.001
Token,
Token ID11
Feature activation-5.19e-02

Pos logit contributions
der+0.087
lessly+0.072
rier+0.068
ier+0.063
lights+0.063
Neg logit contributions
e-0.114
 moderation-0.112
 hate-0.105
 arrest-0.104
 client-0.104
Token Elsa
Token ID19226
Feature activation+5.16e-03

Pos logit contributions
der+0.124
lessly+0.109
rier+0.102
ier+0.101
lights+0.101
Neg logit contributions
 moderation-0.085
e-0.084
 client-0.075
 hate-0.075
 heaven-0.073
Token felt
Token ID2936
Feature activation-6.03e-02

Pos logit contributions
e+0.010
 moderation+0.009
 hate+0.009
 arrest+0.009
 client+0.009
Neg logit contributions
der-0.011
lessly-0.010
rier-0.009
ier-0.009
lights-0.009
Token a
Token ID257
Feature activation-7.50e-03

Pos logit contributions
der+0.138
rier+0.118
lights+0.117
lessly+0.116
lic+0.110
Neg logit contributions
 moderation-0.106
e-0.104
 hate-0.100
 mischief-0.093
 panic-0.090
Token strange
Token ID6283
Feature activation-1.47e-02

Pos logit contributions
der+0.018
lights+0.017
lessly+0.016
ier+0.016
rier+0.016
Neg logit contributions
 hate-0.011
 panic-0.011
e-0.011
 moderation-0.010
 reason-0.010
Token sensation
Token ID18098
Feature activation-9.19e-02

Pos logit contributions
der+0.033
lessly+0.030
rier+0.029
lights+0.029
lic+0.028
Neg logit contributions
 hate-0.024
e-0.024
 moderation-0.022
 panic-0.022
 reason-0.022
Token in
Token ID287
Feature activation+2.95e-02

Pos logit contributions
der+0.190
rier+0.175
lights+0.159
lic+0.154
lessly+0.148
Neg logit contributions
e-0.180
 hate-0.144
aster-0.136
 harm-0.134
 moderation-0.122
Token the
Token ID262
Feature activation-2.92e-02

Pos logit contributions
e+0.046
 moderation+0.044
 client+0.043
|+0.041
 argue+0.041
Neg logit contributions
der-0.073
lessly-0.062
rier-0.058
lights-0.056
ier-0.055
Token water
Token ID1660
Feature activation-3.84e-02

Pos logit contributions
der+0.072
lessly+0.060
rier+0.059
ier+0.059
lights+0.053
Neg logit contributions
e-0.047
 moderation-0.044
 hate-0.043
|-0.042
 meet-0.041
Token.
Token ID13
Feature activation-1.42e-02

Pos logit contributions
der+0.078
lessly+0.066
rier+0.066
lights+0.064
lic+0.064
Neg logit contributions
e-0.071
 hate-0.063
 moderation-0.062
 arrest-0.061
 fight-0.061
Token What
Token ID1867
Feature activation+9.99e-04

Pos logit contributions
der+0.030
ier+0.027
lessly+0.027
lights+0.026
rier+0.026
Neg logit contributions
 moderation-0.025
e-0.023
 client-0.022
 hate-0.022
 courtroom-0.022
Token a
Token ID257
Feature activation-2.57e-03

Pos logit contributions
der+0.128
lights+0.106
rier+0.102
boards+0.099
ing+0.098
Neg logit contributions
e-0.089
 moderation-0.086
 hate-0.082
 panic-0.076
 heaven-0.075
Token pleasant
Token ID15497
Feature activation+5.73e-03

Pos logit contributions
der+0.006
lights+0.005
rier+0.005
boards+0.005
lessly+0.005
Neg logit contributions
e-0.004
 hate-0.004
 moderation-0.004
 panic-0.004
 reason-0.004
Token t
Token ID256
Feature activation-3.50e-03

Pos logit contributions
e+0.008
 hate+0.008
 moderation+0.008
 panic+0.008
 visit+0.008
Neg logit contributions
der-0.014
lessly-0.012
rier-0.012
bled-0.012
lic-0.012
Tokening
Token ID278
Feature activation-4.05e-02

Pos logit contributions
der+0.008
lessly+0.007
rier+0.006
lights+0.006
lic+0.006
Neg logit contributions
e-0.006
 moderation-0.005
 arrest-0.005
 hate-0.005
 client-0.005
Tokenling
Token ID1359
Feature activation-2.55e-02

Pos logit contributions
der+0.081
lic+0.074
boards+0.067
lessly+0.065
ows+0.064
Neg logit contributions
e-0.077
 hate-0.071
 harm-0.068
 moderation-0.067
 panic-0.065
Token sensation
Token ID18098
Feature activation-9.11e-02

Pos logit contributions
der+0.050
lic+0.047
rier+0.043
lessly+0.041
ier+0.041
Neg logit contributions
e-0.049
 hate-0.045
 moderation-0.043
 harm-0.043
 panic-0.041
Token throughout
Token ID3690
Feature activation+2.19e-02

Pos logit contributions
der+0.184
lic+0.162
rier+0.161
lights+0.150
boards+0.144
Neg logit contributions
e-0.171
 hate-0.126
 harm-0.123
aster-0.117
 ever-0.112
Token his
Token ID465
Feature activation+2.31e-02

Pos logit contributions
e+0.035
 moderation+0.034
 hate+0.031
 arrest+0.031
 client+0.029
Neg logit contributions
der-0.054
lessly-0.047
rier-0.045
ier-0.043
lights-0.043
Token body
Token ID1767
Feature activation-3.15e-02

Pos logit contributions
 moderation+0.032
e+0.032
 argue+0.030
Susan+0.029
 client+0.029
Neg logit contributions
der-0.059
lessly-0.049
rier-0.047
lights-0.046
ier-0.045
Token.
Token ID13
Feature activation-3.83e-03

Pos logit contributions
der+0.060
rier+0.052
lessly+0.050
lights+0.049
lic+0.048
Neg logit contributions
e-0.066
 moderation-0.059
 hate-0.057
keeping-0.056
 arrest-0.056
Token The
Token ID383
Feature activation-5.65e-02

Pos logit contributions
der+0.008
lessly+0.006
rier+0.006
ier+0.006
lights+0.006
Neg logit contributions
e-0.008
 moderation-0.008
 hate-0.007
 arrest-0.007
 client-0.007
Token felt
Token ID2936
Feature activation-5.33e-02

Pos logit contributions
der+0.056
rier+0.047
lessly+0.046
ier+0.043
lights+0.042
Neg logit contributions
e-0.049
 moderation-0.046
 argue-0.043
 hate-0.043
 arrest-0.042
Token a
Token ID257
Feature activation+1.67e-04

Pos logit contributions
der+0.121
lights+0.095
rier+0.094
lessly+0.094
ier+0.090
Neg logit contributions
e-0.103
 moderation-0.098
 hate-0.094
 mischief-0.087
 panic-0.085
Token strange
Token ID6283
Feature activation-1.55e-02

Pos logit contributions
e+0.000
 hate+0.000
 moderation+0.000
 panic+0.000
 reason+0.000
Neg logit contributions
der-0.000
lessly-0.000
lights-0.000
ier-0.000
rier-0.000
Token t
Token ID256
Feature activation-1.10e-03

Pos logit contributions
der+0.035
lessly+0.030
rier+0.029
lights+0.028
ier+0.028
Neg logit contributions
e-0.027
 moderation-0.026
 hate-0.026
 argue-0.025
 fight-0.024
Tokeningly
Token ID4420
Feature activation-2.30e-02

Pos logit contributions
der+0.002
lessly+0.002
lights+0.002
rier+0.002
lic+0.002
Neg logit contributions
e-0.002
 arrest-0.002
 client-0.002
 hate-0.002
 moderation-0.002
Token sensation
Token ID18098
Feature activation-9.00e-02

Pos logit contributions
der+0.049
lessly+0.042
lights+0.041
rier+0.040
lic+0.040
Neg logit contributions
e-0.042
 moderation-0.041
 hate-0.040
 arrest-0.038
 fight-0.037
Token and
Token ID290
Feature activation-7.09e-03

Pos logit contributions
der+0.175
lic+0.147
rier+0.145
lights+0.134
lessly+0.134
Neg logit contributions
e-0.172
 hate-0.135
 harm-0.127
 fight-0.122
 moderation-0.118
Token when
Token ID618
Feature activation-8.20e-03

Pos logit contributions
der+0.015
rier+0.013
lessly+0.013
ier+0.012
lights+0.012
Neg logit contributions
e-0.013
 moderation-0.012
 hate-0.012
 argue-0.012
 arrest-0.011
Token it
Token ID340
Feature activation-2.99e-02

Pos logit contributions
der+0.014
lessly+0.011
rier+0.011
ier+0.010
lights+0.010
Neg logit contributions
e-0.019
 moderation-0.019
 hate-0.018
|-0.018
 client-0.017
Token stopped
Token ID5025
Feature activation-8.60e-03

Pos logit contributions
der+0.068
rier+0.058
lessly+0.057
ier+0.050
lights+0.050
Neg logit contributions
e-0.056
 hate-0.047
 panic-0.047
 argue-0.046
|-0.046
Token they
Token ID484
Feature activation+6.97e-03

Pos logit contributions
der+0.016
lessly+0.014
rier+0.012
ier+0.012
lights+0.012
Neg logit contributions
e-0.018
 moderation-0.018
 hate-0.017
 argue-0.017
 arrest-0.017
Token and
Token ID290
Feature activation-3.42e-02

Pos logit contributions
e+0.010
 moderation+0.009
 hate+0.009
 arrest+0.009
 argue+0.009
Neg logit contributions
der-0.010
lessly-0.008
rier-0.008
ier-0.008
lights-0.007
Token Tom
Token ID4186
Feature activation+1.54e-02

Pos logit contributions
der+0.072
lessly+0.060
rier+0.055
ing+0.054
ier+0.052
Neg logit contributions
 moderation-0.063
e-0.059
|-0.058
oof-0.054
 hate-0.054
Token experienced
Token ID5924
Feature activation-3.32e-02

Pos logit contributions
e+0.028
 hate+0.024
 argue+0.024
 moderation+0.023
 arrest+0.023
Neg logit contributions
der-0.036
rier-0.032
lessly-0.030
lights-0.028
ier-0.028
Token a
Token ID257
Feature activation+6.86e-03

Pos logit contributions
der+0.076
lessly+0.065
rier+0.064
ier+0.061
lights+0.060
Neg logit contributions
e-0.057
 hate-0.050
 moderation-0.050
 argue-0.049
 mischief-0.049
Token strange
Token ID6283
Feature activation-1.19e-02

Pos logit contributions
e+0.014
 moderation+0.014
 hate+0.013
 arrest+0.013
 client+0.013
Neg logit contributions
der-0.014
lessly-0.011
rier-0.011
ier-0.010
lights-0.010
Token sensation
Token ID18098
Feature activation-8.93e-02

Pos logit contributions
der+0.027
lessly+0.023
rier+0.023
ier+0.022
lights+0.021
Neg logit contributions
e-0.021
 moderation-0.020
 hate-0.019
 argue-0.018
 client-0.018
Token they
Token ID484
Feature activation-1.23e-02

Pos logit contributions
der+0.170
rier+0.157
lic+0.136
lessly+0.128
boards+0.125
Neg logit contributions
e-0.175
 hate-0.143
 harm-0.140
 fight-0.132
 reason-0.126
Token had
Token ID550
Feature activation-1.40e-02

Pos logit contributions
der+0.025
rier+0.021
lessly+0.020
ier+0.019
lights+0.018
Neg logit contributions
e-0.025
 hate-0.023
 argue-0.023
 moderation-0.023
 arrest-0.022
Token never
Token ID1239
Feature activation+1.12e-02

Pos logit contributions
der+0.032
lessly+0.027
rier+0.027
ier+0.025
lights+0.025
Neg logit contributions
e-0.025
 moderation-0.024
 hate-0.022
 arrest-0.022
 argue-0.022
Token felt
Token ID2936
Feature activation-2.33e-02

Pos logit contributions
e+0.015
 moderation+0.015
 hate+0.014
 client+0.013
 arrest+0.013
Neg logit contributions
der-0.030
lessly-0.026
rier-0.025
ier-0.024
lights-0.024
Token before
Token ID878
Feature activation+3.14e-02

Pos logit contributions
der+0.047
lessly+0.038
rier+0.038
lights+0.037
ier+0.035
Neg logit contributions
e-0.046
 moderation-0.045
 hate-0.044
 arrest-0.041
 mischief-0.041
Token,
Token ID11
Feature activation-4.18e-02

Pos logit contributions
der+0.061
lessly+0.050
rier+0.047
ier+0.043
lights+0.042
Neg logit contributions
e-0.077
 moderation-0.073
 hate-0.070
 argue-0.069
 arrest-0.069
Token they
Token ID484
Feature activation-2.51e-02

Pos logit contributions
der+0.086
lessly+0.072
rier+0.068
ier+0.065
lights+0.063
Neg logit contributions
e-0.083
 moderation-0.082
 hate-0.076
 arrest-0.075
 client-0.075
Token felt
Token ID2936
Feature activation-4.81e-02

Pos logit contributions
der+0.075
rier+0.069
lessly+0.064
ier+0.060
lights+0.059
Neg logit contributions
e-0.030
 panic-0.025
 argue-0.025
 hate-0.023
 moderation-0.022
Token a
Token ID257
Feature activation+1.25e-03

Pos logit contributions
der+0.115
lessly+0.098
rier+0.095
ier+0.092
lights+0.090
Neg logit contributions
e-0.081
 moderation-0.080
 hate-0.075
 mischief-0.071
 panic-0.069
Token strange
Token ID6283
Feature activation-1.12e-02

Pos logit contributions
 hate+0.002
 panic+0.002
e+0.002
 reason+0.002
 fight+0.002
Neg logit contributions
der-0.003
ier-0.003
lights-0.003
lessly-0.003
boards-0.003
Token sensation
Token ID18098
Feature activation-8.83e-02

Pos logit contributions
der+0.026
lessly+0.023
rier+0.022
ier+0.021
lights+0.021
Neg logit contributions
e-0.019
 hate-0.018
 argue-0.017
 fight-0.017
 panic-0.017
Token.
Token ID13
Feature activation+4.24e-03

Pos logit contributions
der+0.190
rier+0.172
lic+0.156
lessly+0.150
lights+0.144
Neg logit contributions
e-0.163
 hate-0.119
aster-0.115
 harm-0.113
 fight-0.105
Token They
Token ID1119
Feature activation-2.18e-02

Pos logit contributions
e+0.009
 moderation+0.009
 hate+0.008
 arrest+0.008
 client+0.008
Neg logit contributions
der-0.009
lessly-0.007
rier-0.007
ier-0.006
lights-0.006
Token felt
Token ID2936
Feature activation-4.82e-02

Pos logit contributions
der+0.051
rier+0.047
lessly+0.043
ier+0.043
lights+0.041
Neg logit contributions
e-0.039
 argue-0.035
 hate-0.035
 panic-0.034
 moderation-0.033
Token light
Token ID1657
Feature activation-3.82e-02

Pos logit contributions
der+0.115
lessly+0.098
rier+0.095
ier+0.094
lic+0.093
Neg logit contributions
e-0.081
 moderation-0.077
 hate-0.074
 mischief-0.070
 panic-0.069
Token and
Token ID290
Feature activation-9.28e-04

Pos logit contributions
der+0.074
lessly+0.064
lic+0.061
boards+0.056
rier+0.055
Neg logit contributions
e-0.070
 panic-0.060
 hate-0.059
 harm-0.059
 fight-0.058
TokenSuddenly
Token ID38582
Feature activation-3.46e-02

Pos logit contributions
der+0.018
lessly+0.015
rier+0.015
ier+0.014
lights+0.014
Neg logit contributions
e-0.011
 moderation-0.011
 hate-0.010
 client-0.010
 argue-0.010
Token,
Token ID11
Feature activation-4.14e-02

Pos logit contributions
der+0.069
lessly+0.057
rier+0.054
lights+0.050
ier+0.049
Neg logit contributions
e-0.071
 moderation-0.066
 hate-0.065
 arrest-0.065
 Susan-0.063
Token he
Token ID339
Feature activation-2.47e-02

Pos logit contributions
der+0.091
lessly+0.075
rier+0.073
ier+0.069
lights+0.068
Neg logit contributions
e-0.074
 moderation-0.074
 hate-0.070
 arrest-0.068
 client-0.067
Token felt
Token ID2936
Feature activation-4.71e-02

Pos logit contributions
der+0.067
rier+0.060
lessly+0.059
lights+0.054
ier+0.054
Neg logit contributions
e-0.035
 moderation-0.029
 hate-0.028
 panic-0.028
 arrest-0.028
Token a
Token ID257
Feature activation-1.26e-04

Pos logit contributions
der+0.115
lessly+0.097
lights+0.097
rier+0.096
ier+0.093
Neg logit contributions
e-0.078
 moderation-0.077
 hate-0.072
 mischief-0.068
 panic-0.066
Token sensation
Token ID18098
Feature activation-8.81e-02

Pos logit contributions
der+0.000
lights+0.000
ier+0.000
lessly+0.000
rier+0.000
Neg logit contributions
 panic-0.000
 hate-0.000
e-0.000
 reason-0.000
 weep-0.000
Token.
Token ID13
Feature activation-3.63e-03

Pos logit contributions
der+0.190
rier+0.165
lights+0.162
lic+0.150
boards+0.147
Neg logit contributions
e-0.168
 hate-0.130
aster-0.121
 harm-0.120
 panic-0.112
Token He
Token ID679
Feature activation-2.10e-02

Pos logit contributions
der+0.007
lessly+0.006
ier+0.005
rier+0.005
lights+0.005
Neg logit contributions
e-0.007
 moderation-0.007
 hate-0.007
 client-0.007
 courtroom-0.007
Token began
Token ID2540
Feature activation-1.06e-02

Pos logit contributions
der+0.045
rier+0.041
lessly+0.039
ier+0.037
lights+0.037
Neg logit contributions
e-0.040
 hate-0.035
 panic-0.034
 moderation-0.034
 argue-0.034
Token to
Token ID284
Feature activation+3.21e-02

Pos logit contributions
der+0.022
lessly+0.018
rier+0.017
ier+0.017
lights+0.016
Neg logit contributions
e-0.021
 moderation-0.021
 hate-0.019
 client-0.019
 arrest-0.019
Token slide
Token ID10649
Feature activation-7.78e-03

Pos logit contributions
 moderation+0.053
e+0.051
keeping+0.049
 client+0.048
atherine+0.048
Neg logit contributions
der-0.074
lessly-0.061
rier-0.058
ier-0.054
ing-0.053
Token and
Token ID290
Feature activation-3.14e-02

Pos logit contributions
e+0.009
 moderation+0.008
 hate+0.008
 arrest+0.008
 argue+0.007
Neg logit contributions
der-0.008
lessly-0.007
rier-0.007
ier-0.006
lights-0.006
Token Tom
Token ID4186
Feature activation+2.31e-02

Pos logit contributions
der+0.064
lessly+0.047
rier+0.045
ing+0.043
ier+0.042
Neg logit contributions
 moderation-0.062
e-0.061
|-0.058
oof-0.057
days-0.056
Token felt
Token ID2936
Feature activation-5.33e-02

Pos logit contributions
e+0.045
 moderation+0.041
 hate+0.040
 argue+0.039
 arrest+0.039
Neg logit contributions
der-0.051
rier-0.042
lessly-0.042
lights-0.039
ier-0.039
Token a
Token ID257
Feature activation+5.51e-03

Pos logit contributions
der+0.124
lights+0.098
rier+0.096
lessly+0.095
ier+0.092
Neg logit contributions
e-0.100
 hate-0.092
 moderation-0.091
 panic-0.086
 mischief-0.085
Token strange
Token ID6283
Feature activation-1.25e-02

Pos logit contributions
e+0.009
 hate+0.009
 panic+0.009
 moderation+0.008
 arrest+0.008
Neg logit contributions
der-0.013
lights-0.011
rier-0.010
ier-0.010
lessly-0.010
Token sensation
Token ID18098
Feature activation-8.73e-02

Pos logit contributions
der+0.028
lessly+0.025
rier+0.024
lights+0.024
lic+0.024
Neg logit contributions
e-0.021
 hate-0.021
 argue-0.019
 fight-0.019
 reason-0.019
Token in
Token ID287
Feature activation+3.53e-02

Pos logit contributions
der+0.180
lic+0.156
rier+0.155
boards+0.147
lights+0.141
Neg logit contributions
e-0.162
 hate-0.128
 harm-0.123
aster-0.113
 fight-0.111
Token their
Token ID511
Feature activation+3.69e-03

Pos logit contributions
e+0.059
 moderation+0.058
 client+0.054
|+0.052
 hate+0.052
Neg logit contributions
der-0.084
lessly-0.072
rier-0.068
lights-0.065
ier-0.064
Token stomach
Token ID11384
Feature activation-1.30e-02

Pos logit contributions
 moderation+0.007
e+0.007
|+0.007
Susan+0.006
keeping+0.006
Neg logit contributions
der-0.007
lessly-0.006
lights-0.006
rier-0.006
ly-0.005
Tokens
Token ID82
Feature activation+5.12e-03

Pos logit contributions
der+0.026
lessly+0.024
bled+0.023
rier+0.023
lights+0.022
Neg logit contributions
e-0.022
 panic-0.021
 arrest-0.020
 hate-0.020
 fight-0.020
Token.
Token ID13
Feature activation+7.19e-03

Pos logit contributions
e+0.011
 moderation+0.010
 hate+0.010
 arrest+0.010
 argue+0.009
Neg logit contributions
der-0.009
lic-0.008
lessly-0.008
rier-0.008
lights-0.007
Token look
Token ID804
Feature activation-1.68e-02

Pos logit contributions
 moderation+0.125
atherine+0.120
 client+0.115
|+0.114
Elizabeth+0.112
Neg logit contributions
der-0.156
ier-0.122
rier-0.122
lessly-0.122
ly-0.114
Token at
Token ID379
Feature activation+3.55e-02

Pos logit contributions
der+0.032
lessly+0.025
rier+0.024
lic+0.023
lights+0.023
Neg logit contributions
e-0.035
 hate-0.034
 moderation-0.034
 argue-0.032
 client-0.031
Token it
Token ID340
Feature activation-7.64e-03

Pos logit contributions
e+0.070
 moderation+0.070
 hate+0.065
 client+0.064
 argue+0.064
Neg logit contributions
der-0.073
lessly-0.061
rier-0.058
ier-0.055
lights-0.054
Token.
Token ID13
Feature activation-6.96e-03

Pos logit contributions
der+0.014
lessly+0.012
lic+0.011
rier+0.011
lights+0.011
Neg logit contributions
e-0.016
 moderation-0.015
 hate-0.015
 argue-0.014
 harm-0.014
Token His
Token ID2399
Feature activation-2.67e-02

Pos logit contributions
der+0.013
lessly+0.010
rier+0.009
lic+0.008
ing+0.008
Neg logit contributions
e-0.016
 moderation-0.015
 argue-0.015
 hate-0.015
 arrest-0.015
Token pipe
Token ID12656
Feature activation-8.65e-02

Pos logit contributions
der+0.059
lessly+0.050
rier+0.047
ier+0.045
lights+0.045
Neg logit contributions
e-0.048
 moderation-0.047
|-0.044
 arrest-0.042
 hate-0.042
Token was
Token ID373
Feature activation-4.75e-02

Pos logit contributions
der+0.176
lessly+0.150
rier+0.146
ier+0.134
lights+0.131
Neg logit contributions
e-0.171
 moderation-0.156
 hate-0.155
 arrest-0.153
 client-0.149
Token much
Token ID881
Feature activation-2.31e-03

Pos logit contributions
der+0.107
rier+0.086
lessly+0.086
lights+0.080
ier+0.080
Neg logit contributions
e-0.086
 moderation-0.083
 hate-0.082
 argue-0.076
 harm-0.076
Token nicer
Token ID36597
Feature activation-1.88e-03

Pos logit contributions
der+0.005
lessly+0.004
rier+0.004
lic+0.004
ows+0.004
Neg logit contributions
 moderation-0.004
 hate-0.004
 panic-0.004
 harm-0.004
 fight-0.004
Token.
Token ID13
Feature activation-6.60e-03

Pos logit contributions
der+0.004
lessly+0.003
rier+0.003
lic+0.003
lights+0.003
Neg logit contributions
e-0.004
 moderation-0.003
keeping-0.003
 hate-0.003
 client-0.003
Token\n
Token ID198
Feature activation-2.38e-02

Pos logit contributions
der+0.012
lessly+0.010
rier+0.009
lights+0.008
ing+0.008
Neg logit contributions
e-0.015
 moderation-0.014
 argue-0.014
 hate-0.014
 arrest-0.014
Token and
Token ID290
Feature activation-2.83e-02

Pos logit contributions
e+0.024
 moderation+0.022
 hate+0.021
 arrest+0.021
 argue+0.021
Neg logit contributions
der-0.024
lessly-0.020
rier-0.020
ier-0.018
lights-0.018
Token Ben
Token ID3932
Feature activation+1.22e-02

Pos logit contributions
der+0.061
lessly+0.047
rier+0.044
ier+0.042
ing+0.042
Neg logit contributions
 moderation-0.054
e-0.052
|-0.050
days-0.048
keeping-0.048
Token felt
Token ID2936
Feature activation-5.31e-02

Pos logit contributions
e+0.024
 hate+0.021
 moderation+0.021
 argue+0.021
 arrest+0.021
Neg logit contributions
der-0.027
rier-0.022
lessly-0.022
lights-0.021
ier-0.020
Token a
Token ID257
Feature activation+6.29e-03

Pos logit contributions
der+0.123
lights+0.097
rier+0.095
lessly+0.094
ier+0.091
Neg logit contributions
e-0.099
 hate-0.091
 moderation-0.090
 panic-0.085
 mischief-0.084
Token funny
Token ID8258
Feature activation+1.01e-02

Pos logit contributions
e+0.011
 hate+0.010
 panic+0.010
 moderation+0.010
 arrest+0.010
Neg logit contributions
der-0.015
lights-0.012
rier-0.012
boards-0.012
ier-0.012
Token sensation
Token ID18098
Feature activation-8.59e-02

Pos logit contributions
e+0.019
 hate+0.018
 moderation+0.017
 reason+0.017
 arrest+0.017
Neg logit contributions
der-0.021
lessly-0.018
rier-0.018
lights-0.017
ier-0.017
Token in
Token ID287
Feature activation+3.45e-02

Pos logit contributions
der+0.177
lic+0.154
boards+0.150
rier+0.149
lights+0.143
Neg logit contributions
e-0.160
 hate-0.125
 harm-0.119
aster-0.110
 fight-0.108
Token their
Token ID511
Feature activation+6.46e-03

Pos logit contributions
e+0.055
 moderation+0.054
 client+0.050
 hate+0.049
 arrest+0.048
Neg logit contributions
der-0.085
lessly-0.073
rier-0.070
ier-0.066
lights-0.066
Token stomach
Token ID11384
Feature activation-1.13e-02

Pos logit contributions
 moderation+0.013
e+0.012
|+0.011
 client+0.011
keeping+0.011
Neg logit contributions
der-0.013
lessly-0.011
lights-0.010
rier-0.010
ier-0.009
Tokens
Token ID82
Feature activation+1.00e-02

Pos logit contributions
der+0.023
lessly+0.022
bled+0.021
rier+0.020
lights+0.020
Neg logit contributions
e-0.018
 panic-0.017
 hate-0.017
 fight-0.017
 arrest-0.016
Token.
Token ID13
Feature activation+9.32e-03

Pos logit contributions
e+0.021
 moderation+0.018
 hate+0.018
keeping+0.018
 argue+0.017
Neg logit contributions
der-0.019
lic-0.017
lessly-0.016
rier-0.015
lights-0.015
Token shells
Token ID19679
Feature activation-2.78e-02

Pos logit contributions
e+0.036
 moderation+0.036
 hate+0.033
 arrest+0.033
 client+0.032
Neg logit contributions
der-0.047
lessly-0.040
rier-0.038
ier-0.036
lights-0.036
Token.
Token ID13
Feature activation-9.04e-03

Pos logit contributions
der+0.057
rier+0.048
lic+0.048
lessly+0.047
lights+0.044
Neg logit contributions
e-0.051
 moderation-0.047
keeping-0.045
 hate-0.043
 argue-0.043
Token She
Token ID1375
Feature activation-2.70e-02

Pos logit contributions
der+0.018
lessly+0.015
rier+0.014
ier+0.014
lights+0.013
Neg logit contributions
e-0.018
 moderation-0.018
 hate-0.017
 client-0.017
 arrest-0.017
Token put
Token ID1234
Feature activation-3.29e-02

Pos logit contributions
der+0.061
rier+0.053
lessly+0.053
ier+0.049
lights+0.047
Neg logit contributions
e-0.050
 moderation-0.045
 hate-0.044
 argue-0.043
 arrest-0.042
Token her
Token ID607
Feature activation-1.27e-02

Pos logit contributions
der+0.088
lessly+0.071
rier+0.069
lic+0.066
ier+0.066
Neg logit contributions
e-0.053
 moderation-0.050
 argue-0.039
 arrest-0.039
 preschool-0.039
Token shell
Token ID7582
Feature activation-8.55e-02

Pos logit contributions
der+0.025
lessly+0.021
rier+0.020
lights+0.019
ier+0.019
Neg logit contributions
 moderation-0.025
e-0.025
 client-0.024
 hate-0.023
 arrest-0.023
Token down
Token ID866
Feature activation-1.03e-02

Pos logit contributions
der+0.192
rier+0.150
lessly+0.148
ier+0.139
lic+0.136
Neg logit contributions
e-0.162
packs-0.141
 moderation-0.136
keeping-0.130
 seek-0.130
Token and
Token ID290
Feature activation-2.32e-02

Pos logit contributions
der+0.021
rier+0.018
lic+0.017
lessly+0.016
lights+0.016
Neg logit contributions
e-0.019
 moderation-0.019
 hate-0.017
 preschool-0.017
 visit-0.016
Token paused
Token ID24487
Feature activation-3.49e-02

Pos logit contributions
der+0.053
rier+0.046
lessly+0.045
ier+0.043
lic+0.040
Neg logit contributions
e-0.041
 moderation-0.037
 argue-0.037
 hate-0.036
 arrest-0.036
Token to
Token ID284
Feature activation+3.88e-02

Pos logit contributions
der+0.071
rier+0.063
lessly+0.062
lic+0.062
lights+0.061
Neg logit contributions
e-0.060
 moderation-0.058
 hate-0.056
 visit-0.053
 preschool-0.053
Token look
Token ID804
Feature activation-4.15e-02

Pos logit contributions
e+0.060
 moderation+0.059
 client+0.053
 hate+0.053
 arrest+0.052
Neg logit contributions
der-0.097
lessly-0.083
rier-0.080
ier-0.077
lights-0.075
Token.
Token ID13
Feature activation-8.36e-03

Pos logit contributions
der+0.015
lessly+0.012
rier+0.012
lights+0.011
ier+0.010
Neg logit contributions
e-0.021
 moderation-0.021
 hate-0.020
 arrest-0.019
 client-0.019
Token She
Token ID1375
Feature activation-2.32e-02

Pos logit contributions
der+0.016
lessly+0.013
rier+0.012
ier+0.012
lights+0.011
Neg logit contributions
e-0.018
 moderation-0.018
 hate-0.016
 arrest-0.016
 argue-0.016
Token put
Token ID1234
Feature activation-2.68e-02

Pos logit contributions
der+0.049
lessly+0.042
rier+0.041
ier+0.038
lights+0.038
Neg logit contributions
e-0.044
 moderation-0.043
 hate-0.040
 arrest-0.040
 argue-0.039
Token her
Token ID607
Feature activation-1.54e-02

Pos logit contributions
der+0.066
lessly+0.054
rier+0.052
lights+0.050
ier+0.050
Neg logit contributions
e-0.047
 moderation-0.046
 hate-0.038
 arrest-0.038
|-0.037
Token pretty
Token ID2495
Feature activation-1.74e-02

Pos logit contributions
der+0.030
rier+0.025
lessly+0.025
lights+0.023
ier+0.022
Neg logit contributions
e-0.028
 moderation-0.028
 client-0.028
|-0.028
 arrest-0.027
Token sheet
Token ID9629
Feature activation-8.43e-02

Pos logit contributions
der+0.037
lessly+0.031
rier+0.030
ier+0.028
lights+0.028
Neg logit contributions
 moderation-0.032
e-0.032
 client-0.030
 hate-0.030
 arrest-0.030
Token down
Token ID866
Feature activation-5.37e-03

Pos logit contributions
der+0.189
rier+0.144
lessly+0.133
ier+0.128
lic+0.125
Neg logit contributions
e-0.176
 moderation-0.153
|-0.141
 visit-0.140
 hate-0.139
Token,
Token ID11
Feature activation-3.17e-02

Pos logit contributions
der+0.011
rier+0.009
lic+0.008
lessly+0.008
lights+0.008
Neg logit contributions
e-0.011
 moderation-0.011
 hate-0.010
 arrest-0.009
 heaven-0.009
Token and
Token ID290
Feature activation-2.65e-02

Pos logit contributions
der+0.065
lessly+0.054
rier+0.054
ier+0.053
lights+0.050
Neg logit contributions
 moderation-0.064
e-0.063
 hate-0.058
 arrest-0.057
 heaven-0.055
Token lay
Token ID3830
Feature activation-2.40e-02

Pos logit contributions
der+0.058
lessly+0.050
rier+0.050
ier+0.047
lights+0.045
Neg logit contributions
e-0.049
 moderation-0.047
 hate-0.044
 arrest-0.043
 argue-0.042
Token on
Token ID319
Feature activation+1.40e-02

Pos logit contributions
der+0.054
lessly+0.043
lic+0.043
ier+0.041
rier+0.040
Neg logit contributions
e-0.044
 moderation-0.043
keeping-0.040
 heaven-0.036
 preschool-0.035
Token,
Token ID11
Feature activation-3.77e-02

Pos logit contributions
der+0.052
lessly+0.044
rier+0.043
lights+0.042
lic+0.041
Neg logit contributions
e-0.037
 panic-0.032
 hate-0.032
 fight-0.032
 argue-0.031
Token he
Token ID339
Feature activation-1.47e-02

Pos logit contributions
der+0.071
lessly+0.057
rier+0.056
ier+0.053
lights+0.051
Neg logit contributions
 moderation-0.082
e-0.082
 hate-0.076
 arrest-0.074
 argue-0.073
Token felt
Token ID2936
Feature activation-4.72e-02

Pos logit contributions
der+0.033
lessly+0.029
rier+0.029
ier+0.027
lights+0.026
Neg logit contributions
e-0.027
 moderation-0.025
 hate-0.023
 argue-0.023
 arrest-0.023
Token a
Token ID257
Feature activation+5.16e-03

Pos logit contributions
der+0.112
lights+0.099
rier+0.093
lessly+0.092
lic+0.091
Neg logit contributions
 moderation-0.080
e-0.078
 hate-0.078
 panic-0.071
 mischief-0.071
Token strange
Token ID6283
Feature activation-5.87e-03

Pos logit contributions
 hate+0.008
 panic+0.008
e+0.008
 moderation+0.008
 reason+0.007
Neg logit contributions
der-0.012
lights-0.011
ier-0.010
lessly-0.010
boards-0.010
Token sensation
Token ID18098
Feature activation-8.41e-02

Pos logit contributions
der+0.013
lessly+0.012
lights+0.012
rier+0.011
ier+0.011
Neg logit contributions
e-0.010
 hate-0.010
 moderation-0.009
 panic-0.009
 argue-0.009
Token,
Token ID11
Feature activation-3.44e-02

Pos logit contributions
der+0.182
rier+0.162
lights+0.156
lic+0.150
cases+0.145
Neg logit contributions
e-0.156
 hate-0.116
aster-0.113
 harm-0.109
 fight-0.097
Token like
Token ID588
Feature activation+1.84e-02

Pos logit contributions
der+0.071
rier+0.064
ier+0.063
lights+0.062
lic+0.061
Neg logit contributions
e-0.061
 moderation-0.061
 hate-0.059
 mischief-0.056
 panic-0.055
Token something
Token ID1223
Feature activation-1.70e-02

Pos logit contributions
 moderation+0.033
e+0.033
 hate+0.030
 arrest+0.030
 mischief+0.029
Neg logit contributions
der-0.040
lessly-0.034
rier-0.034
ier-0.032
lights-0.031
Token powerful
Token ID3665
Feature activation-3.24e-02

Pos logit contributions
der+0.037
rier+0.032
lights+0.030
lessly+0.030
ier+0.029
Neg logit contributions
 hate-0.030
e-0.030
 harm-0.027
 panic-0.027
 moderation-0.027
Token was
Token ID373
Feature activation-1.35e-02

Pos logit contributions
der+0.063
rier+0.062
lessly+0.058
lights+0.056
lic+0.055
Neg logit contributions
e-0.059
 hate-0.056
 harm-0.055
 fight-0.052
 arrest-0.051
Token.
Token ID13
Feature activation+1.22e-03

Pos logit contributions
der+0.087
lessly+0.086
lic+0.084
lights+0.082
rier+0.077
Neg logit contributions
e-0.075
 moderation-0.067
 fight-0.061
 argue-0.058
 heaven-0.058
Token He
Token ID679
Feature activation-3.08e-02

Pos logit contributions
 moderation+0.002
e+0.002
 hate+0.002
 client+0.002
 heaven+0.002
Neg logit contributions
der-0.002
lessly-0.002
ier-0.002
rier-0.002
ing-0.002
Token felt
Token ID2936
Feature activation-5.22e-02

Pos logit contributions
der+0.067
lessly+0.062
rier+0.061
ier+0.057
lights+0.056
Neg logit contributions
e-0.056
 moderation-0.051
 hate-0.049
 argue-0.047
 arrest-0.046
Token a
Token ID257
Feature activation+4.78e-06

Pos logit contributions
der+0.119
lessly+0.105
lights+0.102
lic+0.099
ier+0.098
Neg logit contributions
 moderation-0.091
e-0.089
 hate-0.085
 mischief-0.078
 heaven-0.075
Token burning
Token ID9482
Feature activation-3.33e-02

Feature not active on prev token; no predictions were affected.
Token sensation
Token ID18098
Feature activation-8.40e-02

Pos logit contributions
der+0.077
lessly+0.068
lic+0.062
rier+0.061
lights+0.060
Neg logit contributions
e-0.061
 moderation-0.055
 hate-0.051
 argue-0.049
 harm-0.049
Token in
Token ID287
Feature activation+3.66e-02

Pos logit contributions
der+0.178
rier+0.170
lic+0.160
lights+0.158
lessly+0.155
Neg logit contributions
e-0.151
aster-0.108
 hate-0.105
 harm-0.095
 moderation-0.091
Token his
Token ID465
Feature activation+4.01e-03

Pos logit contributions
e+0.064
 moderation+0.063
 client+0.059
 hate+0.059
 argue+0.058
Neg logit contributions
der-0.083
lessly-0.070
rier-0.067
ier-0.064
lights-0.063
Token mouth
Token ID5422
Feature activation-4.31e-02

Pos logit contributions
|+0.009
 moderation+0.009
 argue+0.009
e+0.009
Susan+0.009
Neg logit contributions
der-0.007
rier-0.005
lessly-0.005
lights-0.004
ier-0.004
Token.
Token ID13
Feature activation+3.18e-03

Pos logit contributions
der+0.094
rier+0.093
lights+0.091
lessly+0.089
bled+0.087
Neg logit contributions
e-0.069
 moderation-0.054
 hate-0.053
 harm-0.048
 panic-0.048
Token He
Token ID679
Feature activation-2.80e-02

Pos logit contributions
 moderation+0.007
e+0.006
 hate+0.006
 client+0.006
 mischief+0.006
Neg logit contributions
der-0.006
lessly-0.005
ier-0.005
rier-0.005
lights-0.005
Token,
Token ID11
Feature activation-4.08e-02

Pos logit contributions
der+0.064
lessly+0.053
rier+0.051
lights+0.048
ier+0.045
Neg logit contributions
e-0.074
 moderation-0.070
 hate-0.067
 arrest-0.067
 argue-0.065
Token it
Token ID340
Feature activation-4.05e-02

Pos logit contributions
der+0.091
lessly+0.077
rier+0.074
ier+0.071
lights+0.069
Neg logit contributions
e-0.074
 moderation-0.073
 hate-0.067
 arrest-0.066
 client-0.066
Token felt
Token ID2936
Feature activation-4.51e-02

Pos logit contributions
der+0.092
rier+0.083
lessly+0.083
ier+0.074
lights+0.074
Neg logit contributions
e-0.071
 panic-0.061
 hate-0.061
 quarrel-0.061
 fight-0.060
Token a
Token ID257
Feature activation+3.11e-04

Pos logit contributions
der+0.104
lessly+0.089
lights+0.089
rier+0.088
ier+0.084
Neg logit contributions
 moderation-0.081
e-0.075
 hate-0.073
 mischief-0.069
 panic-0.068
Token burning
Token ID9482
Feature activation-3.13e-02

Pos logit contributions
 hate+0.001
 panic+0.000
 moderation+0.000
e+0.000
 reason+0.000
Neg logit contributions
der-0.001
lights-0.001
ier-0.001
lessly-0.001
rier-0.001
Token sensation
Token ID18098
Feature activation-8.36e-02

Pos logit contributions
der+0.070
lessly+0.062
rier+0.057
lic+0.056
ier+0.055
Neg logit contributions
e-0.057
 moderation-0.052
 hate-0.050
 harm-0.048
 fight-0.048
Token and
Token ID290
Feature activation-8.61e-03

Pos logit contributions
der+0.178
rier+0.167
lights+0.149
lic+0.145
lessly+0.142
Neg logit contributions
e-0.155
 hate-0.113
aster-0.111
 harm-0.109
 fight-0.101
Token started
Token ID2067
Feature activation-1.02e-02

Pos logit contributions
der+0.019
lessly+0.017
rier+0.017
ier+0.016
lights+0.016
Neg logit contributions
e-0.014
 moderation-0.014
 hate-0.013
 arrest-0.013
 argue-0.013
Token to
Token ID284
Feature activation+3.39e-02

Pos logit contributions
der+0.021
lessly+0.018
rier+0.017
ier+0.016
lights+0.016
Neg logit contributions
e-0.020
 moderation-0.020
 client-0.018
 hate-0.018
 arrest-0.018
Token jump
Token ID4391
Feature activation-2.92e-02

Pos logit contributions
e+0.059
 moderation+0.059
 client+0.056
atherine+0.055
keeping+0.054
Neg logit contributions
der-0.076
lessly-0.061
rier-0.060
ier-0.055
lights-0.054
Token around
Token ID1088
Feature activation+2.61e-02

Pos logit contributions
der+0.061
rier+0.049
lights+0.049
lessly+0.048
ier+0.047
Neg logit contributions
e-0.053
 hate-0.048
 argue-0.047
|-0.047
 fight-0.047
Token something
Token ID1223
Feature activation-4.03e-02

Pos logit contributions
der+0.043
lessly+0.032
rier+0.030
ier+0.028
lights+0.026
Neg logit contributions
 moderation-0.075
e-0.074
 hate-0.070
|-0.070
 arrest-0.070
Token bad
Token ID2089
Feature activation-4.17e-02

Pos logit contributions
der+0.094
rier+0.081
lessly+0.077
lights+0.074
ier+0.071
Neg logit contributions
e-0.063
 hate-0.063
 moderation-0.060
 reason-0.060
 harm-0.059
Token happened
Token ID3022
Feature activation-2.47e-02

Pos logit contributions
der+0.109
lessly+0.096
rier+0.096
lights+0.089
ier+0.088
Neg logit contributions
e-0.057
 moderation-0.053
 hate-0.050
 client-0.049
 arrest-0.048
Token.
Token ID13
Feature activation-1.18e-02

Pos logit contributions
der+0.049
lessly+0.041
rier+0.040
lights+0.038
ier+0.037
Neg logit contributions
e-0.051
 moderation-0.048
 hate-0.046
 argue-0.045
 arrest-0.044
Token Bob
Token ID5811
Feature activation+2.29e-02

Pos logit contributions
der+0.026
lessly+0.021
rier+0.020
ier+0.019
lights+0.019
Neg logit contributions
e-0.022
 moderation-0.022
 argue-0.020
 hate-0.020
 client-0.020
Token accident
Token ID5778
Feature activation-8.36e-02

Pos logit contributions
 moderation+0.045
e+0.045
 hate+0.041
 client+0.041
 arrest+0.041
Neg logit contributions
der-0.047
lessly-0.039
rier-0.037
ier-0.036
lights-0.034
Tokenly
Token ID306
Feature activation-2.18e-02

Pos logit contributions
der+0.111
rier+0.105
lights+0.097
lessly+0.091
ier+0.070
Neg logit contributions
e-0.174
 arrest-0.164
 hate-0.161
 client-0.161
 mischief-0.156
Token spilled
Token ID31896
Feature activation-3.93e-02

Pos logit contributions
der+0.046
lessly+0.039
rier+0.039
lights+0.036
ier+0.036
Neg logit contributions
e-0.042
 moderation-0.040
 hate-0.038
 arrest-0.038
 argue-0.037
Token some
Token ID617
Feature activation-1.67e-02

Pos logit contributions
der+0.091
lessly+0.078
rier+0.077
lic+0.075
ier+0.074
Neg logit contributions
e-0.064
 moderation-0.063
 hate-0.058
 argue-0.056
 arrest-0.056
Token of
Token ID286
Feature activation+3.58e-02

Pos logit contributions
der+0.034
lessly+0.028
rier+0.027
lic+0.027
lights+0.025
Neg logit contributions
e-0.033
 moderation-0.031
 hate-0.030
 reason-0.028
 mischief-0.027
Token the
Token ID262
Feature activation-4.04e-02

Pos logit contributions
e+0.063
 moderation+0.062
 client+0.058
 hate+0.058
 arrest+0.057
Neg logit contributions
der-0.080
lessly-0.068
rier-0.065
ier-0.062
lights-0.061
Token\n
Token ID198
Feature activation-1.24e-02

Pos logit contributions
der+0.006
lessly+0.005
rier+0.004
lights+0.004
lic+0.004
Neg logit contributions
e-0.008
 moderation-0.007
 arrest-0.007
 client-0.007
 argue-0.007
Token\n
Token ID198
Feature activation-1.16e-02

Pos logit contributions
der+0.020
lessly+0.018
lic+0.015
ing+0.014
rier+0.014
Neg logit contributions
e-0.028
 moderation-0.028
 arrest-0.027
 client-0.027
 argue-0.027
TokenPretty
Token ID35700
Feature activation+9.02e-03

Pos logit contributions
der+0.023
lessly+0.021
lic+0.017
rier+0.017
ier+0.017
Neg logit contributions
e-0.023
 moderation-0.023
 client-0.022
 arrest-0.022
 argue-0.021
Token soon
Token ID2582
Feature activation-2.67e-02

Pos logit contributions
e+0.015
 moderation+0.015
 hate+0.014
 argue+0.014
 client+0.013
Neg logit contributions
der-0.021
lessly-0.018
rier-0.018
ier-0.017
lights-0.016
Token her
Token ID607
Feature activation-1.58e-04

Pos logit contributions
der+0.049
lessly+0.040
rier+0.039
lights+0.035
ier+0.035
Neg logit contributions
e-0.058
 moderation-0.057
 hate-0.054
 arrest-0.053
 client-0.052
Token marks
Token ID8849
Feature activation-8.32e-02

Pos logit contributions
der+0.000
lessly+0.000
rier+0.000
ier+0.000
lights+0.000
Neg logit contributions
 moderation-0.000
e-0.000
|-0.000
 arrest-0.000
keeping-0.000
Token were
Token ID547
Feature activation-3.63e-02

Pos logit contributions
der+0.173
rier+0.155
lessly+0.141
lights+0.132
lic+0.130
Neg logit contributions
e-0.150
 hate-0.136
 arrest-0.130
 fight-0.129
 harm-0.127
Token higher
Token ID2440
Feature activation-2.32e-02

Pos logit contributions
der+0.074
lessly+0.060
rier+0.060
lights+0.054
ier+0.053
Neg logit contributions
e-0.069
 moderation-0.068
 hate-0.065
 argue-0.063
 quarrel-0.061
Token than
Token ID621
Feature activation+3.05e-02

Pos logit contributions
der+0.046
lic+0.041
lessly+0.040
rier+0.038
bled+0.037
Neg logit contributions
e-0.041
 moderation-0.040
 hate-0.038
 harm-0.037
 conflicts-0.037
Token anyone
Token ID2687
Feature activation+3.03e-02

Pos logit contributions
e+0.047
 moderation+0.046
 client+0.042
 hate+0.042
 argue+0.042
Neg logit contributions
der-0.076
lessly-0.066
rier-0.063
ier-0.062
lights-0.059
Token else
Token ID2073
Feature activation+4.13e-02

Pos logit contributions
 moderation+0.062
e+0.060
 client+0.056
 arrest+0.056
 argue+0.056
Neg logit contributions
der-0.060
lessly-0.051
rier-0.046
ier-0.046
lights-0.043
Token much
Token ID881
Feature activation+2.59e-02

Pos logit contributions
e+0.103
 moderation+0.103
 hate+0.095
 arrest+0.094
 client+0.094
Neg logit contributions
der-0.095
lessly-0.079
rier-0.074
ier-0.071
lights-0.069
Token he
Token ID339
Feature activation+9.36e-03

Pos logit contributions
e+0.069
 moderation+0.068
 client+0.064
 arrest+0.064
 argue+0.064
Neg logit contributions
der-0.036
lessly-0.027
rier-0.024
ier-0.023
lights-0.023
Token weighed
Token ID20261
Feature activation+1.22e-02

Pos logit contributions
e+0.020
 moderation+0.020
 hate+0.018
 arrest+0.018
 client+0.018
Neg logit contributions
der-0.018
lessly-0.015
rier-0.014
ier-0.013
lights-0.013
Token.
Token ID13
Feature activation-2.44e-03

Pos logit contributions
e+0.024
 moderation+0.024
 hate+0.022
 client+0.021
 argue+0.021
Neg logit contributions
der-0.025
lessly-0.021
rier-0.019
lic-0.019
lights-0.019
Token He
Token ID679
Feature activation-3.22e-02

Pos logit contributions
der+0.004
lessly+0.004
rier+0.003
lights+0.003
lic+0.003
Neg logit contributions
e-0.006
 moderation-0.005
 argue-0.005
 hate-0.005
 arrest-0.005
Token loop
Token ID9052
Feature activation-8.28e-02

Pos logit contributions
der+0.067
lessly+0.057
rier+0.056
ier+0.052
lights+0.052
Neg logit contributions
e-0.063
 moderation-0.060
 hate-0.057
 arrest-0.056
 client-0.056
Tokened
Token ID276
Feature activation-4.97e-02

Pos logit contributions
der+0.130
rier+0.097
lessly+0.096
lights+0.083
ier+0.073
Neg logit contributions
e-0.199
 arrest-0.176
keeping-0.173
 hate-0.168
|-0.167
Token around
Token ID1088
Feature activation+8.82e-03

Pos logit contributions
der+0.098
lessly+0.078
rier+0.076
ier+0.074
lic+0.073
Neg logit contributions
e-0.103
 moderation-0.092
 arrest-0.091
 hate-0.091
 argue-0.088
Token the
Token ID262
Feature activation-3.32e-02

Pos logit contributions
e+0.016
 moderation+0.016
 hate+0.015
 arrest+0.015
 argue+0.014
Neg logit contributions
der-0.019
lessly-0.016
rier-0.016
ier-0.015
lights-0.015
Token tree
Token ID5509
Feature activation-3.12e-03

Pos logit contributions
der+0.074
lessly+0.062
rier+0.061
ier+0.056
ing+0.056
Neg logit contributions
 moderation-0.056
e-0.056
 argue-0.052
 hate-0.051
|-0.051
Token,
Token ID11
Feature activation-3.32e-02

Pos logit contributions
der+0.006
lessly+0.005
rier+0.005
ier+0.004
lights+0.004
Neg logit contributions
e-0.007
 moderation-0.007
 hate-0.006
 arrest-0.006
 argue-0.006
Token feels
Token ID5300
Feature activation-4.16e-02

Pos logit contributions
der+0.052
lessly+0.046
rier+0.044
ier+0.043
lights+0.041
Neg logit contributions
e-0.043
 moderation-0.041
 hate-0.040
 arrest-0.039
 argue-0.039
Token a
Token ID257
Feature activation+4.66e-04

Pos logit contributions
der+0.093
lessly+0.079
rier+0.075
lights+0.074
ier+0.074
Neg logit contributions
e-0.076
 moderation-0.075
 hate-0.071
 mischief-0.066
 arrest-0.066
Token hot
Token ID3024
Feature activation-3.10e-03

Pos logit contributions
 hate+0.001
e+0.001
 moderation+0.001
 panic+0.001
 reason+0.001
Neg logit contributions
der-0.001
lights-0.001
boards-0.001
ier-0.001
lic-0.001
Token and
Token ID290
Feature activation+1.04e-03

Pos logit contributions
der+0.007
lessly+0.007
lic+0.006
lights+0.006
rier+0.006
Neg logit contributions
e-0.005
 moderation-0.005
 hate-0.005
 arrest-0.004
 harm-0.004
Token burning
Token ID9482
Feature activation-2.66e-02

Pos logit contributions
e+0.002
 moderation+0.002
 hate+0.002
 arrest+0.002
 argue+0.002
Neg logit contributions
der-0.002
lessly-0.002
rier-0.002
ier-0.002
lights-0.002
Token sensation
Token ID18098
Feature activation-8.25e-02

Pos logit contributions
der+0.056
lessly+0.047
lic+0.045
rier+0.042
ier+0.042
Neg logit contributions
e-0.053
 moderation-0.048
 hate-0.045
 argue-0.044
keeping-0.044
Token in
Token ID287
Feature activation+3.31e-02

Pos logit contributions
der+0.174
rier+0.156
lic+0.155
lessly+0.148
lights+0.147
Neg logit contributions
e-0.140
 hate-0.103
 harm-0.094
aster-0.087
 moderation-0.086
Token her
Token ID607
Feature activation+1.37e-02

Pos logit contributions
e+0.050
 moderation+0.048
 client+0.045
 hate+0.044
 argue+0.044
Neg logit contributions
der-0.084
lessly-0.073
rier-0.069
ier-0.066
lights-0.065
Token mouth
Token ID5422
Feature activation-3.62e-02

Pos logit contributions
 moderation+0.026
e+0.026
|+0.024
 argue+0.024
 hate+0.024
Neg logit contributions
der-0.028
lessly-0.023
rier-0.023
ier-0.021
lights-0.021
Token.
Token ID13
Feature activation+1.58e-02

Pos logit contributions
der+0.080
rier+0.072
lights+0.071
lessly+0.070
bled+0.070
Neg logit contributions
e-0.059
 hate-0.049
 moderation-0.049
 heaven-0.045
 argue-0.044
Token She
Token ID1375
Feature activation-2.23e-02

Pos logit contributions
e+0.033
 moderation+0.032
 hate+0.030
 arrest+0.030
 client+0.030
Neg logit contributions
der-0.031
lessly-0.025
rier-0.024
ier-0.023
lights-0.022
Token clean
Token ID3424
Feature activation-1.11e-02

Pos logit contributions
 moderation+0.008
e+0.008
 hate+0.007
 obey+0.007
 panic+0.007
Neg logit contributions
der-0.013
rier-0.011
lessly-0.011
ier-0.011
lic-0.010
Token.
Token ID13
Feature activation-1.10e-02

Pos logit contributions
der+0.022
lessly+0.019
lic+0.019
rier+0.017
ier+0.017
Neg logit contributions
e-0.022
 moderation-0.021
keeping-0.019
 mischief-0.019
 heaven-0.019
Token She
Token ID1375
Feature activation-2.44e-02

Pos logit contributions
der+0.021
lessly+0.017
rier+0.016
lights+0.014
ier+0.014
Neg logit contributions
e-0.024
 moderation-0.023
 argue-0.023
 hate-0.022
 arrest-0.022
Token spread
Token ID4104
Feature activation-5.22e-02

Pos logit contributions
der+0.052
rier+0.047
lessly+0.044
ier+0.041
lights+0.040
Neg logit contributions
e-0.047
 moderation-0.044
 hate-0.041
 arrest-0.040
 argue-0.040
Token the
Token ID262
Feature activation-5.38e-02

Pos logit contributions
der+0.150
rier+0.124
lic+0.121
lessly+0.119
ier+0.118
Neg logit contributions
e-0.067
 moderation-0.062
 hate-0.054
 mischief-0.053
 panic-0.050
Token sheet
Token ID9629
Feature activation-8.24e-02

Pos logit contributions
der+0.119
lessly+0.097
rier+0.095
ier+0.090
lights+0.086
Neg logit contributions
e-0.103
 moderation-0.101
 hate-0.092
 arrest-0.090
 argue-0.089
Token back
Token ID736
Feature activation+1.23e-02

Pos logit contributions
der+0.190
rier+0.147
lic+0.132
ier+0.130
lessly+0.125
Neg logit contributions
e-0.156
 visit-0.127
|-0.123
 hate-0.121
 panic-0.121
Token on
Token ID319
Feature activation+2.11e-02

Pos logit contributions
e+0.025
 moderation+0.023
 hate+0.022
packs+0.022
 argue+0.022
Neg logit contributions
der-0.024
rier-0.020
lessly-0.019
lic-0.019
ier-0.018
Token the
Token ID262
Feature activation-3.95e-02

Pos logit contributions
e+0.038
 moderation+0.038
 hate+0.035
 client+0.034
 argue+0.034
Neg logit contributions
der-0.047
lessly-0.040
rier-0.038
ier-0.036
lights-0.036
Token bed
Token ID3996
Feature activation-3.91e-02

Pos logit contributions
der+0.074
lessly+0.060
rier+0.057
ier+0.054
lights+0.052
Neg logit contributions
e-0.086
 moderation-0.085
 hate-0.079
 arrest-0.078
 client-0.078
Token and
Token ID290
Feature activation-2.21e-02

Pos logit contributions
der+0.074
lessly+0.062
rier+0.062
lic+0.059
bled+0.057
Neg logit contributions
e-0.076
 arrest-0.066
 moderation-0.066
 heaven-0.064
keeping-0.063
Token kids
Token ID3988
Feature activation+5.33e-02

Pos logit contributions
e+0.075
|+0.074
 moderation+0.074
 argue+0.070
 courtroom+0.069
Neg logit contributions
der-0.077
lessly-0.060
rier-0.059
ing-0.058
ier-0.058
Token.
Token ID13
Feature activation+2.51e-02

Pos logit contributions
 moderation+0.130
e+0.126
 arrest+0.122
|+0.122
 mischief+0.122
Neg logit contributions
der-0.084
lessly-0.065
rier-0.061
ier-0.059
ing-0.054
Token They
Token ID1119
Feature activation-4.67e-03

Pos logit contributions
e+0.043
 moderation+0.042
 hate+0.038
 client+0.038
 arrest+0.038
Neg logit contributions
der-0.058
lessly-0.050
rier-0.048
ier-0.047
lights-0.045
Token might
Token ID1244
Feature activation+2.87e-03

Pos logit contributions
der+0.010
lessly+0.009
rier+0.008
ier+0.008
lights+0.008
Neg logit contributions
e-0.009
 moderation-0.009
 hate-0.008
 arrest-0.008
 client-0.008
Token like
Token ID588
Feature activation+5.36e-02

Pos logit contributions
e+0.005
 moderation+0.005
keeping+0.004
 client+0.004
 courtroom+0.004
Neg logit contributions
der-0.007
lessly-0.006
rier-0.005
ier-0.005
lights-0.005
Token to
Token ID284
Feature activation+7.87e-02

Pos logit contributions
e+0.118
 moderation+0.115
|+0.108
atherine+0.105
 client+0.104
Neg logit contributions
der-0.102
lessly-0.078
rier-0.076
lights-0.070
ier-0.069
Token see
Token ID766
Feature activation+1.89e-02

Pos logit contributions
 moderation+0.106
e+0.101
atherine+0.098
keeping+0.098
 client+0.097
Neg logit contributions
der-0.196
lessly-0.165
rier-0.162
ier-0.157
lights-0.149
Token it
Token ID340
Feature activation-1.36e-03

Pos logit contributions
e+0.035
 moderation+0.035
 hate+0.032
 client+0.032
 arrest+0.032
Neg logit contributions
der-0.041
lessly-0.034
rier-0.033
ier-0.031
lights-0.030
Token too
Token ID1165
Feature activation+5.24e-02

Pos logit contributions
der+0.003
lessly+0.002
rier+0.002
lights+0.002
ier+0.002
Neg logit contributions
e-0.003
 moderation-0.003
 hate-0.003
 argue-0.003
 arrest-0.002
Token."
Token ID526
Feature activation+1.15e-02

Pos logit contributions
e+0.104
 moderation+0.103
 hate+0.097
 argue+0.095
 arrest+0.094
Neg logit contributions
der-0.106
lessly-0.088
rier-0.085
ier-0.081
lights-0.078
Token
Token ID198
Feature activation+2.55e-03

Pos logit contributions
 moderation+0.020
e+0.020
 hate+0.018
 client+0.018
 arrest+0.017
Neg logit contributions
der-0.027
lessly-0.023
rier-0.022
ier-0.021
lights-0.020
Token mind
Token ID2000
Feature activation+2.74e-02

Pos logit contributions
e+0.007
 moderation+0.007
 arrest+0.007
 client+0.007
 hate+0.006
Neg logit contributions
der-0.011
lessly-0.010
rier-0.009
ier-0.009
lights-0.009
Token.
Token ID13
Feature activation+1.11e-02

Pos logit contributions
e+0.052
 moderation+0.051
 hate+0.048
 argue+0.047
 client+0.047
Neg logit contributions
der-0.058
lessly-0.049
rier-0.047
ier-0.045
lights-0.044
Token She
Token ID1375
Feature activation-2.77e-02

Pos logit contributions
 moderation+0.022
e+0.022
 hate+0.021
 client+0.021
 argue+0.020
Neg logit contributions
der-0.022
lessly-0.019
rier-0.018
ier-0.017
lights-0.017
Token just
Token ID655
Feature activation-1.05e-02

Pos logit contributions
der+0.066
rier+0.061
lessly+0.060
lights+0.057
ier+0.057
Neg logit contributions
e-0.043
 hate-0.040
 moderation-0.039
 argue-0.038
 client-0.037
Token wanted
Token ID2227
Feature activation-4.65e-03

Pos logit contributions
der+0.026
rier+0.024
lessly+0.023
lights+0.022
ier+0.022
Neg logit contributions
e-0.016
 hate-0.014
 moderation-0.014
 argue-0.013
 client-0.012
Token to
Token ID284
Feature activation+6.68e-02

Pos logit contributions
der+0.010
lessly+0.008
rier+0.008
ier+0.008
lights+0.007
Neg logit contributions
e-0.009
 moderation-0.009
 arrest-0.008
|-0.008
 hate-0.008
Token share
Token ID2648
Feature activation+3.06e-02

Pos logit contributions
 moderation+0.122
 colony+0.114
 conflicts+0.113
e+0.110
keeping+0.110
Neg logit contributions
der-0.121
lessly-0.097
rier-0.095
ier-0.092
ly-0.091
Token her
Token ID607
Feature activation+4.18e-02

Pos logit contributions
e+0.044
 moderation+0.044
 hate+0.039
 argue+0.038
 client+0.037
Neg logit contributions
der-0.080
lessly-0.070
rier-0.067
ier-0.065
lic-0.063
Token laundry
Token ID25724
Feature activation-2.27e-02

Pos logit contributions
e+0.075
 moderation+0.073
 arrest+0.067
 hate+0.067
 client+0.067
Neg logit contributions
der-0.094
lessly-0.080
rier-0.076
ier-0.073
lights-0.072
Token with
Token ID351
Feature activation+4.54e-02

Pos logit contributions
der+0.045
lessly+0.038
lic+0.035
rier+0.034
ier+0.033
Neg logit contributions
e-0.046
 moderation-0.046
 argue-0.042
 hate-0.042
keeping-0.042
Token her
Token ID607
Feature activation+2.42e-02

Pos logit contributions
e+0.075
 moderation+0.074
|+0.074
 argue+0.073
 arrest+0.072
Neg logit contributions
der-0.104
lessly-0.085
rier-0.082
ier-0.079
lights-0.078
Token.
Token ID13
Feature activation+2.80e-02

Pos logit contributions
e+0.045
 moderation+0.044
 hate+0.040
 arrest+0.039
 argue+0.039
Neg logit contributions
der-0.039
lessly-0.033
rier-0.031
ier-0.029
lic-0.028
Token They
Token ID1119
Feature activation+1.94e-02

Pos logit contributions
 arrest+0.065
e+0.064
 client+0.063
 moderation+0.063
keeping+0.062
Neg logit contributions
der-0.046
lessly-0.036
lights-0.032
rier-0.031
lic-0.028
Token like
Token ID588
Feature activation+7.14e-02

Pos logit contributions
 moderation+0.039
e+0.038
 arrest+0.035
 client+0.035
keeping+0.035
Neg logit contributions
der-0.039
lessly-0.033
rier-0.030
ier-0.030
lights-0.028
Token to
Token ID284
Feature activation+7.75e-02

Pos logit contributions
e+0.202
 moderation+0.195
atherine+0.189
|+0.188
elope+0.182
Neg logit contributions
der-0.084
ing-0.049
lessly-0.048
 for-0.042
 that-0.038
Token play
Token ID711
Feature activation+2.95e-02

Pos logit contributions
atherine+0.112
keeping+0.106
 client+0.105
 moderation+0.104
 courtroom+0.100
Neg logit contributions
der-0.168
lights-0.140
lessly-0.134
ly-0.133
lic-0.130
Token in
Token ID287
Feature activation+8.38e-02

Pos logit contributions
e+0.064
 moderation+0.063
 hate+0.058
 arrest+0.058
 argue+0.058
Neg logit contributions
der-0.056
lessly-0.046
rier-0.043
ier-0.041
lights-0.040
Token the
Token ID262
Feature activation+2.04e-02

Pos logit contributions
beh+0.231
 client+0.228
 moderation+0.213
e+0.212
oxy+0.212
Neg logit contributions
lights-0.059
 that-0.056
der-0.055
 all-0.053
 spark-0.040
Token park
Token ID3952
Feature activation+3.28e-02

Pos logit contributions
e+0.034
 moderation+0.032
 client+0.030
 argue+0.030
 hate+0.030
Neg logit contributions
der-0.046
lessly-0.039
ing-0.037
rier-0.037
lights-0.037
Token.
Token ID13
Feature activation+2.04e-02

Pos logit contributions
e+0.076
 moderation+0.074
 hate+0.070
 argue+0.069
 client+0.069
Neg logit contributions
der-0.057
lessly-0.046
rier-0.044
ier-0.041
lights-0.040
Token One
Token ID1881
Feature activation+2.81e-02

Pos logit contributions
e+0.043
 arrest+0.042
keeping+0.042
 moderation+0.041
 client+0.041
Neg logit contributions
der-0.039
lessly-0.030
lights-0.028
rier-0.026
lic-0.024
Token day
Token ID1110
Feature activation-2.27e-02

Pos logit contributions
 moderation+0.039
e+0.039
 arrest+0.037
 client+0.036
keeping+0.035
Neg logit contributions
der-0.072
lessly-0.064
rier-0.059
ier-0.058
lights-0.058
Token named
Token ID3706
Feature activation+4.91e-02

Pos logit contributions
 moderation+0.048
e+0.047
 client+0.044
 hate+0.044
 argue+0.044
Neg logit contributions
der-0.035
lessly-0.027
rier-0.025
ier-0.024
lights-0.023
Token Kitty
Token ID21168
Feature activation+6.67e-02

Pos logit contributions
 moderation+0.077
e+0.072
 argue+0.067
 client+0.064
 hate+0.064
Neg logit contributions
der-0.123
lessly-0.106
rier-0.096
lights-0.094
lic-0.094
Token.
Token ID13
Feature activation+1.37e-02

Pos logit contributions
 moderation+0.164
e+0.161
 arrest+0.153
 client+0.152
 argue+0.152
Neg logit contributions
der-0.105
lessly-0.082
rier-0.073
ier-0.072
lights-0.066
Token Kitty
Token ID21168
Feature activation+3.60e-02

Pos logit contributions
e+0.028
 moderation+0.027
 client+0.026
 arrest+0.026
 argue+0.026
Neg logit contributions
der-0.027
lessly-0.022
rier-0.020
lights-0.019
lic-0.019
Token loved
Token ID6151
Feature activation+2.98e-02

Pos logit contributions
 moderation+0.078
e+0.075
 client+0.072
 arrest+0.072
 argue+0.071
Neg logit contributions
der-0.067
lessly-0.054
rier-0.048
ier-0.048
ing-0.046
Token to
Token ID284
Feature activation+7.16e-02

Pos logit contributions
e+0.073
 moderation+0.071
|+0.066
 client+0.065
 argue+0.065
Neg logit contributions
der-0.047
lessly-0.034
ing-0.032
lic-0.029
lights-0.029
Token play
Token ID711
Feature activation+1.96e-02

Pos logit contributions
e+0.075
 moderation+0.075
 client+0.066
 arrest+0.063
 hate+0.062
Neg logit contributions
der-0.208
lessly-0.184
rier-0.175
lights-0.171
ier-0.170
Token with
Token ID351
Feature activation+4.19e-02

Pos logit contributions
e+0.040
 moderation+0.036
 hate+0.034
 arrest+0.034
|+0.033
Neg logit contributions
der-0.042
lessly-0.036
rier-0.035
ier-0.033
lights-0.032
Token her
Token ID607
Feature activation+3.44e-02

Pos logit contributions
 moderation+0.075
e+0.070
 argue+0.069
 courtroom+0.068
 client+0.068
Neg logit contributions
der-0.094
lessly-0.078
lights-0.075
ing-0.071
rier-0.068
Token cr
Token ID1067
Feature activation-2.33e-02

Pos logit contributions
 moderation+0.067
e+0.067
 argue+0.062
 courtroom+0.062
 generation+0.062
Neg logit contributions
der-0.069
lessly-0.056
lights-0.055
ing-0.051
rier-0.050
Tokenay
Token ID323
Feature activation+2.25e-02

Pos logit contributions
der+0.045
rier+0.036
lessly+0.036
ier+0.032
lights+0.029
Neg logit contributions
e-0.051
 moderation-0.049
|-0.044
 arrest-0.044
oof-0.044
Token asked
Token ID1965
Feature activation-1.36e-02

Pos logit contributions
e+0.024
 moderation+0.022
 hate+0.021
 argue+0.021
 client+0.020
Neg logit contributions
der-0.028
rier-0.024
lessly-0.024
ier-0.022
lights-0.022
Token if
Token ID611
Feature activation-9.49e-04

Pos logit contributions
der+0.026
lessly+0.022
rier+0.020
ier+0.019
lights+0.019
Neg logit contributions
e-0.029
 moderation-0.028
 hate-0.027
 argue-0.026
 client-0.026
Token baby
Token ID5156
Feature activation+4.09e-02

Pos logit contributions
der+0.002
lessly+0.001
rier+0.001
ier+0.001
lights+0.001
Neg logit contributions
e-0.002
 moderation-0.002
 hate-0.002
 arrest-0.002
 client-0.002
Token was
Token ID373
Feature activation+4.96e-03

Pos logit contributions
 moderation+0.084
e+0.083
 arrest+0.077
 hate+0.076
|+0.076
Neg logit contributions
der-0.081
lessly-0.065
rier-0.063
ier-0.062
ing-0.059
Token ready
Token ID3492
Feature activation-8.13e-03

Pos logit contributions
e+0.010
 moderation+0.010
 hate+0.009
 client+0.009
 arrest+0.009
Neg logit contributions
der-0.010
lessly-0.008
rier-0.008
ier-0.008
lights-0.007
Token to
Token ID284
Feature activation+7.79e-02

Pos logit contributions
der+0.017
lessly+0.014
rier+0.014
lights+0.013
lic+0.013
Neg logit contributions
e-0.015
 moderation-0.015
 hate-0.014
 argue-0.014
 visit-0.014
Token go
Token ID467
Feature activation+4.40e-02

Pos logit contributions
 moderation+0.132
e+0.131
 client+0.120
 hate+0.119
 arrest+0.118
Neg logit contributions
der-0.177
lessly-0.151
rier-0.143
ier-0.139
lights-0.135
Token,
Token ID11
Feature activation-2.09e-02

Pos logit contributions
e+0.084
 moderation+0.078
 visit+0.077
 hate+0.075
 meet+0.075
Neg logit contributions
der-0.102
rier-0.081
lessly-0.080
lights-0.072
ier-0.072
Token so
Token ID523
Feature activation+9.15e-03

Pos logit contributions
der+0.039
lessly+0.031
rier+0.031
ier+0.031
lights+0.029
Neg logit contributions
 moderation-0.044
e-0.043
 hate-0.042
 argue-0.042
 arrest-0.040
Token they
Token ID484
Feature activation-1.98e-02

Pos logit contributions
e+0.020
 moderation+0.020
 hate+0.019
 argue+0.019
 client+0.019
Neg logit contributions
der-0.017
lessly-0.014
rier-0.013
ier-0.012
lights-0.012
Token all
Token ID477
Feature activation+7.37e-03

Pos logit contributions
der+0.045
rier+0.038
lessly+0.037
lights+0.035
ier+0.034
Neg logit contributions
e-0.035
 argue-0.033
 moderation-0.032
 hate-0.031
 client-0.031
Token her
Token ID607
Feature activation+6.44e-03

Pos logit contributions
 moderation+0.048
e+0.048
|+0.045
keeping+0.044
 colony+0.044
Neg logit contributions
der-0.064
lessly-0.051
rier-0.048
ly-0.046
lights-0.046
Token t
Token ID256
Feature activation+2.57e-02

Pos logit contributions
e+0.013
 moderation+0.012
|+0.012
 arrest+0.011
keeping+0.011
Neg logit contributions
der-0.014
lessly-0.011
rier-0.011
lights-0.011
ier-0.010
Tokeneddy
Token ID21874
Feature activation+3.81e-02

Pos logit contributions
e+0.053
 moderation+0.053
 hate+0.049
 argue+0.049
 arrest+0.048
Neg logit contributions
der-0.051
lessly-0.042
rier-0.040
ier-0.038
lights-0.038
Token bear
Token ID6842
Feature activation+2.97e-02

Pos logit contributions
 moderation+0.072
e+0.072
 hate+0.066
 arrest+0.066
 client+0.065
Neg logit contributions
der-0.082
lessly-0.068
rier-0.065
ier-0.062
lights-0.061
Token,
Token ID11
Feature activation-3.36e-02

Pos logit contributions
e+0.068
 moderation+0.068
 hate+0.063
 client+0.063
 arrest+0.063
Neg logit contributions
der-0.051
lessly-0.041
rier-0.039
ier-0.036
lights-0.035
Token "
Token ID366
Feature activation+1.57e-02

Pos logit contributions
der+0.050
lessly+0.040
rier+0.038
ier+0.037
lights+0.037
Neg logit contributions
 moderation-0.083
e-0.082
 hate-0.079
 argue-0.078
 client-0.077
TokenI
Token ID40
Feature activation+3.06e-02

Pos logit contributions
 moderation+0.026
e+0.026
 arrest+0.025
 argue+0.025
 client+0.025
Neg logit contributions
der-0.036
lessly-0.031
rier-0.029
ier-0.027
ing-0.027
Token don
Token ID836
Feature activation+8.98e-03

Pos logit contributions
 moderation+0.044
e+0.044
 arrest+0.041
keeping+0.040
 client+0.040
Neg logit contributions
der-0.076
lessly-0.067
rier-0.062
ier-0.062
ing-0.059
Token't
Token ID470
Feature activation+3.76e-02

Pos logit contributions
e+0.010
 arrest+0.009
 hate+0.008
 moderation+0.008
 argue+0.008
Neg logit contributions
der-0.027
rier-0.023
lessly-0.023
lights-0.022
ier-0.022
Token want
Token ID765
Feature activation+4.12e-02

Pos logit contributions
e+0.053
 moderation+0.053
 arrest+0.050
 client+0.049
|+0.047
Neg logit contributions
der-0.094
lessly-0.085
ier-0.078
rier-0.078
lic-0.072
Token to
Token ID284
Feature activation+9.58e-02

Pos logit contributions
e+0.087
 moderation+0.086
 client+0.081
 argue+0.081
|+0.080
Neg logit contributions
der-0.079
lessly-0.061
rier-0.056
ier-0.056
lights-0.055
Token Tim
Token ID5045
Feature activation+6.02e-02

Pos logit contributions
 moderation+0.089
keeping+0.083
e+0.083
 conflicts+0.083
beh+0.082
Neg logit contributions
der-0.092
lessly-0.072
lic-0.072
ing-0.072
ly-0.067
Tokenmy
Token ID1820
Feature activation+4.26e-02

Pos logit contributions
 arrest+0.117
 client+0.117
 moderation+0.116
 courtroom+0.116
 mischief+0.114
Neg logit contributions
der-0.113
lessly-0.093
ier-0.087
ing-0.085
ly-0.077
Token.
Token ID13
Feature activation+2.46e-02

Pos logit contributions
 moderation+0.113
e+0.107
 argue+0.106
|+0.105
 client+0.105
Neg logit contributions
der-0.060
lessly-0.042
ier-0.035
ing-0.035
rier-0.034
Token Tim
Token ID5045
Feature activation+4.07e-02

Pos logit contributions
e+0.039
 moderation+0.039
 client+0.038
 arrest+0.036
atherine+0.035
Neg logit contributions
der-0.058
lessly-0.050
lights-0.046
rier-0.045
lic-0.045
Tokenmy
Token ID1820
Feature activation+2.74e-02

Pos logit contributions
 arrest+0.076
 client+0.076
 moderation+0.074
 courtroom+0.072
 argue+0.072
Neg logit contributions
der-0.076
lessly-0.066
ing-0.062
ier-0.061
led-0.059
Token liked
Token ID8288
Feature activation+3.82e-02

Pos logit contributions
 moderation+0.068
 argue+0.064
 arrest+0.063
 client+0.062
 mischief+0.062
Neg logit contributions
der-0.040
lessly-0.030
ing-0.026
lic-0.025
ier-0.024
Token to
Token ID284
Feature activation+7.63e-02

Pos logit contributions
e+0.099
 moderation+0.096
|+0.095
atherine+0.091
 argue+0.088
Neg logit contributions
der-0.055
lessly-0.036
ing-0.036
rier-0.032
lic-0.030
Token run
Token ID1057
Feature activation+1.35e-02

Pos logit contributions
 moderation+0.100
atherine+0.100
keeping+0.094
|+0.092
e+0.092
Neg logit contributions
der-0.187
lessly-0.155
lights-0.149
lic-0.146
rier-0.144
Token and
Token ID290
Feature activation+3.32e-02

Pos logit contributions
e+0.032
 moderation+0.031
 hate+0.030
|+0.030
 argue+0.030
Neg logit contributions
der-0.023
rier-0.018
lessly-0.018
ier-0.016
lights-0.016
Token play
Token ID711
Feature activation+1.87e-02

Pos logit contributions
 moderation+0.048
 client+0.045
e+0.044
atherine+0.044
keeping+0.041
Neg logit contributions
der-0.073
lessly-0.069
ing-0.060
 for-0.059
lic-0.059
Token because
Token ID780
Feature activation+1.87e-02

Pos logit contributions
e+0.038
 moderation+0.035
 hate+0.034
 arrest+0.033
 argue+0.033
Neg logit contributions
der-0.038
lessly-0.033
rier-0.032
ier-0.030
lights-0.030
Token can
Token ID460
Feature activation+3.03e-02

Pos logit contributions
 moderation+0.059
e+0.058
keeping+0.055
 arrest+0.055
 courtroom+0.055
Neg logit contributions
der-0.029
lessly-0.021
ier-0.018
rier-0.017
lights-0.015
Token get
Token ID651
Feature activation+2.44e-02

Pos logit contributions
e+0.038
 moderation+0.037
 hate+0.034
 argue+0.033
 arrest+0.033
Neg logit contributions
der-0.085
lessly-0.076
rier-0.072
ier-0.070
lights-0.069
Token a
Token ID257
Feature activation+2.58e-02

Pos logit contributions
e+0.048
 moderation+0.047
 hate+0.044
 client+0.043
 arrest+0.043
Neg logit contributions
der-0.051
lessly-0.043
rier-0.041
ier-0.039
lights-0.038
Token new
Token ID649
Feature activation+2.52e-02

Pos logit contributions
 moderation+0.049
e+0.049
 hate+0.046
 argue+0.046
 client+0.045
Neg logit contributions
der-0.054
lessly-0.045
rier-0.042
ier-0.041
lights-0.040
Token one
Token ID530
Feature activation+6.76e-03

Pos logit contributions
 moderation+0.048
e+0.048
 argue+0.045
 hate+0.045
 arrest+0.044
Neg logit contributions
der-0.052
lessly-0.045
rier-0.041
ier-0.040
lic-0.039
Token.
Token ID13
Feature activation+1.91e-02

Pos logit contributions
e+0.014
 moderation+0.014
 hate+0.013
 argue+0.013
 client+0.013
Neg logit contributions
der-0.013
lessly-0.011
rier-0.010
ier-0.010
lights-0.009
Token We
Token ID775
Feature activation+1.03e-02

Pos logit contributions
e+0.032
 moderation+0.032
 hate+0.029
 arrest+0.028
 client+0.028
Neg logit contributions
der-0.045
lessly-0.039
rier-0.037
ier-0.036
lights-0.035
Token still
Token ID991
Feature activation+2.31e-02

Pos logit contributions
 moderation+0.026
e+0.025
 arrest+0.024
keeping+0.024
 client+0.024
Neg logit contributions
der-0.015
lessly-0.012
ier-0.011
rier-0.010
lights-0.009
Token have
Token ID423
Feature activation+1.84e-02

Pos logit contributions
 moderation+0.029
e+0.028
 client+0.027
atherine+0.026
 arrest+0.025
Neg logit contributions
der-0.061
lessly-0.055
ier-0.050
rier-0.049
lights-0.048
Token other
Token ID584
Feature activation+3.70e-02

Pos logit contributions
e+0.036
 moderation+0.035
|+0.034
 client+0.034
 hate+0.033
Neg logit contributions
der-0.038
lessly-0.030
rier-0.028
ier-0.028
lights-0.027
Token toys
Token ID14958
Feature activation+9.63e-03

Pos logit contributions
e+0.066
 moderation+0.065
 client+0.059
 hate+0.059
|+0.059
Neg logit contributions
der-0.083
lessly-0.070
rier-0.067
ier-0.065
lights-0.064
Token.
Token ID13
Feature activation-6.14e-03

Pos logit contributions
der+0.041
lessly+0.034
rier+0.033
lights+0.032
lic+0.031
Neg logit contributions
e-0.044
 moderation-0.043
 hate-0.041
 argue-0.040
 client-0.039
Token They
Token ID1119
Feature activation-2.20e-02

Pos logit contributions
der+0.013
lessly+0.011
rier+0.010
ier+0.010
lights+0.010
Neg logit contributions
e-0.012
 moderation-0.012
 hate-0.011
 arrest-0.011
 client-0.011
Token were
Token ID547
Feature activation-4.11e-02

Pos logit contributions
der+0.048
rier+0.042
lessly+0.042
lights+0.038
ier+0.038
Neg logit contributions
e-0.040
 moderation-0.038
 argue-0.037
 hate-0.037
 arrest-0.036
Token very
Token ID845
Feature activation+1.90e-02

Pos logit contributions
der+0.081
lessly+0.067
rier+0.063
ier+0.060
lights+0.058
Neg logit contributions
e-0.086
 moderation-0.085
 hate-0.079
 arrest-0.078
 client-0.078
Token upset
Token ID9247
Feature activation-2.86e-02

Pos logit contributions
e+0.033
 moderation+0.033
 hate+0.031
 argue+0.030
 client+0.030
Neg logit contributions
der-0.044
lessly-0.037
rier-0.035
ier-0.034
lights-0.034
Token and
Token ID290
Feature activation+7.27e-03

Pos logit contributions
rier+0.053
der+0.051
lessly+0.046
lic+0.046
bled+0.045
Neg logit contributions
 hate-0.052
e-0.051
 heaven-0.047
 arrest-0.047
 moderation-0.046
Token told
Token ID1297
Feature activation-3.45e-02

Pos logit contributions
e+0.012
 moderation+0.012
 hate+0.012
 argue+0.012
 client+0.011
Neg logit contributions
der-0.017
lessly-0.014
rier-0.014
ier-0.013
lights-0.013
Token her
Token ID607
Feature activation+1.07e-02

Pos logit contributions
der+0.087
lessly+0.073
rier+0.070
lights+0.067
ier+0.067
Neg logit contributions
e-0.055
 moderation-0.054
 arrest-0.048
 client-0.048
 hate-0.048
Token to
Token ID284
Feature activation+3.70e-02

Pos logit contributions
e+0.025
 moderation+0.025
 hate+0.023
 arrest+0.023
 client+0.023
Neg logit contributions
der-0.018
lessly-0.014
rier-0.013
ier-0.013
lights-0.012
Token stay
Token ID2652
Feature activation-7.28e-03

Pos logit contributions
 moderation+0.058
e+0.058
 client+0.054
 arrest+0.052
|+0.052
Neg logit contributions
der-0.088
lessly-0.075
rier-0.072
ier-0.069
lights-0.067
Token away
Token ID1497
Feature activation-7.13e-03

Pos logit contributions
der+0.015
lessly+0.012
rier+0.012
ier+0.011
lights+0.011
Neg logit contributions
e-0.015
 moderation-0.015
 hate-0.014
 argue-0.013
 arrest-0.013
Tokenooter
Token ID25141
Feature activation-2.20e-02

Pos logit contributions
e+0.029
 moderation+0.028
 hate+0.026
 arrest+0.026
 client+0.025
Neg logit contributions
der-0.038
lessly-0.032
rier-0.032
ier-0.030
lights-0.029
Token.
Token ID13
Feature activation-1.49e-03

Pos logit contributions
der+0.042
lessly+0.036
rier+0.035
lic+0.033
ier+0.033
Neg logit contributions
e-0.045
 moderation-0.042
 hate-0.040
 client-0.040
 argue-0.040
Token He
Token ID679
Feature activation-2.67e-02

Pos logit contributions
der+0.003
lessly+0.003
rier+0.002
ier+0.002
lights+0.002
Neg logit contributions
e-0.003
 moderation-0.003
 hate-0.003
 client-0.003
 arrest-0.003
Token looked
Token ID3114
Feature activation-6.20e-02

Pos logit contributions
der+0.056
lessly+0.048
rier+0.047
ier+0.044
lights+0.043
Neg logit contributions
e-0.052
 moderation-0.049
 hate-0.047
 argue-0.046
 arrest-0.046
Token at
Token ID379
Feature activation+3.51e-03

Pos logit contributions
der+0.137
rier+0.108
ier+0.106
lights+0.101
lessly+0.101
Neg logit contributions
 hate-0.114
e-0.111
 argue-0.103
 panic-0.102
 moderation-0.101
Token Bob
Token ID5811
Feature activation+4.53e-02

Pos logit contributions
e+0.006
 moderation+0.006
 hate+0.006
 client+0.006
 argue+0.006
Neg logit contributions
der-0.008
lessly-0.007
rier-0.006
ier-0.006
lights-0.006
Token and
Token ID290
Feature activation-2.56e-02

Pos logit contributions
e+0.101
 moderation+0.099
 hate+0.093
 argue+0.092
 arrest+0.091
Neg logit contributions
der-0.082
lessly-0.067
rier-0.063
ier-0.060
lights-0.059
Token said
Token ID531
Feature activation-4.99e-02

Pos logit contributions
der+0.052
lessly+0.045
rier+0.043
ier+0.041
lights+0.040
Neg logit contributions
e-0.051
 moderation-0.049
 hate-0.046
 argue-0.046
 client-0.045
Token,
Token ID11
Feature activation-4.42e-02

Pos logit contributions
der+0.103
lights+0.097
lessly+0.092
rier+0.083
lic+0.082
Neg logit contributions
e-0.085
 hate-0.078
 argue-0.075
 Jerry-0.074
 client-0.074
Token �
Token ID6184
Feature activation+9.92e-03

Pos logit contributions
der+0.080
lights+0.076
ier+0.074
rier+0.070
lessly+0.069
Neg logit contributions
e-0.082
 argue-0.078
 meet-0.078
 hate-0.076
 quarrel-0.075
Token�
Token ID95
Feature activation-2.06e-04

Pos logit contributions
e+0.018
 moderation+0.017
 hate+0.016
 arrest+0.016
 client+0.016
Neg logit contributions
der-0.022
lessly-0.019
lights-0.018
rier-0.018
ier-0.018
Token the
Token ID262
Feature activation-3.36e-02

Pos logit contributions
der+0.022
lessly+0.018
rier+0.017
ing+0.017
ier+0.016
Neg logit contributions
 moderation-0.021
e-0.020
 client-0.020
 hate-0.020
 arrest-0.019
Token little
Token ID1310
Feature activation+7.04e-03

Pos logit contributions
der+0.079
lessly+0.066
rier+0.065
ier+0.061
lights+0.059
Neg logit contributions
e-0.058
 moderation-0.057
|-0.052
 hate-0.051
 argue-0.051
Token ant
Token ID1885
Feature activation+2.38e-02

Pos logit contributions
e+0.009
 moderation+0.009
|+0.008
 client+0.008
 argue+0.008
Neg logit contributions
der-0.020
rier-0.017
lessly-0.017
ier-0.016
ing-0.016
Token.
Token ID13
Feature activation+7.94e-03

Pos logit contributions
 moderation+0.058
e+0.058
 client+0.054
 hate+0.054
 arrest+0.054
Neg logit contributions
der-0.037
lessly-0.029
rier-0.027
ier-0.025
lights-0.024
Token They
Token ID1119
Feature activation-1.73e-02

Pos logit contributions
e+0.017
 moderation+0.017
 hate+0.016
 client+0.015
 argue+0.015
Neg logit contributions
der-0.016
lessly-0.012
rier-0.012
lights-0.011
ier-0.011
Token bit
Token ID1643
Feature activation+5.81e-03

Pos logit contributions
der+0.038
lessly+0.033
rier+0.031
ier+0.030
lights+0.029
Neg logit contributions
e-0.031
 moderation-0.031
 hate-0.029
 arrest-0.028
 client-0.028
Token him
Token ID683
Feature activation+5.63e-03

Pos logit contributions
e+0.011
 moderation+0.011
 client+0.011
 hate+0.010
 arrest+0.010
Neg logit contributions
der-0.012
lessly-0.010
rier-0.010
ier-0.009
lights-0.009
Token all
Token ID477
Feature activation+1.64e-02

Pos logit contributions
e+0.014
 moderation+0.014
 hate+0.013
 arrest+0.013
 client+0.013
Neg logit contributions
der-0.009
lessly-0.007
rier-0.006
ier-0.006
lights-0.006
Token over
Token ID625
Feature activation+3.31e-02

Pos logit contributions
e+0.035
 moderation+0.034
 hate+0.033
 argue+0.032
 arrest+0.032
Neg logit contributions
der-0.032
lessly-0.025
rier-0.025
ier-0.024
lights-0.023
Token.
Token ID13
Feature activation+6.78e-03

Pos logit contributions
e+0.065
 moderation+0.065
 client+0.063
|+0.061
 hate+0.061
Neg logit contributions
der-0.066
lessly-0.055
rier-0.051
ier-0.049
lights-0.049
Token He
Token ID679
Feature activation-2.03e-02

Pos logit contributions
e+0.014
 moderation+0.014
 hate+0.013
 client+0.013
 argue+0.013
Neg logit contributions
der-0.014
lessly-0.011
rier-0.010
lights-0.010
ier-0.010
Token not
Token ID407
Feature activation+1.06e-02

Pos logit contributions
der+0.039
lessly+0.035
rier+0.032
ier+0.032
lic+0.031
Neg logit contributions
e-0.027
 moderation-0.027
 argue-0.024
 hate-0.024
 client-0.024
Token touch
Token ID3638
Feature activation+2.92e-02

Pos logit contributions
e+0.022
 moderation+0.022
|+0.021
 arrest+0.021
 client+0.021
Neg logit contributions
der-0.020
lessly-0.016
rier-0.016
ier-0.015
lights-0.014
Token it
Token ID340
Feature activation+1.33e-02

Pos logit contributions
e+0.057
 moderation+0.057
 hate+0.053
 client+0.052
 arrest+0.052
Neg logit contributions
der-0.060
lessly-0.050
rier-0.048
ier-0.046
lights-0.044
Token."
Token ID526
Feature activation+1.14e-02

Pos logit contributions
e+0.024
 moderation+0.022
 hate+0.022
 argue+0.021
 harm+0.020
Neg logit contributions
der-0.029
lessly-0.026
rier-0.024
lic-0.023
lights-0.023
Token
Token ID198
Feature activation+4.87e-03

Pos logit contributions
e+0.023
 moderation+0.022
 hate+0.021
 client+0.021
 arrest+0.021
Neg logit contributions
der-0.023
lessly-0.019
rier-0.018
ier-0.017
lights-0.017
Token
Token ID198
Feature activation+3.41e-03

Pos logit contributions
e+0.009
 moderation+0.009
 hate+0.008
 arrest+0.008
 client+0.008
Neg logit contributions
der-0.011
lessly-0.009
rier-0.009
ier-0.009
lights-0.008
TokenThey
Token ID2990
Feature activation-2.33e-02

Pos logit contributions
e+0.006
 moderation+0.006
 hate+0.005
 client+0.005
 arrest+0.005
Neg logit contributions
der-0.008
lessly-0.007
rier-0.007
ier-0.006
lights-0.006
Token watch
Token ID2342
Feature activation-2.22e-02

Pos logit contributions
der+0.061
lessly+0.053
rier+0.052
ier+0.050
lights+0.049
Neg logit contributions
e-0.033
 moderation-0.032
 hate-0.030
 argue-0.029
 arrest-0.029
Token the
Token ID262
Feature activation-3.65e-02

Pos logit contributions
der+0.051
lessly+0.044
rier+0.042
ier+0.040
lights+0.039
Neg logit contributions
e-0.039
 moderation-0.038
 hate-0.035
 argue-0.035
 client-0.034
Token butterfly
Token ID35113
Feature activation-5.16e-03

Pos logit contributions
der+0.087
lessly+0.075
rier+0.072
ier+0.069
lights+0.067
Neg logit contributions
e-0.061
 moderation-0.060
 hate-0.055
 arrest-0.054
 client-0.054
Token rise
Token ID4485
Feature activation-3.77e-02

Pos logit contributions
der+0.010
lessly+0.009
rier+0.008
ier+0.008
lights+0.008
Neg logit contributions
e-0.011
 moderation-0.010
 hate-0.010
 argue-0.010
 arrest-0.010
Token was
Token ID373
Feature activation-3.83e-02

Pos logit contributions
der+0.081
rier+0.073
lessly+0.070
ier+0.066
lights+0.062
Neg logit contributions
e-0.062
 hate-0.056
 argue-0.054
 quarrel-0.054
 fight-0.053
Token red
Token ID2266
Feature activation-1.37e-02

Pos logit contributions
der+0.085
lessly+0.073
boards+0.067
rier+0.067
ing+0.066
Neg logit contributions
e-0.069
 moderation-0.066
 hate-0.062
 quarrel-0.059
 argue-0.059
Token,
Token ID11
Feature activation-1.75e-02

Pos logit contributions
lic+0.032
lessly+0.030
rier+0.028
der+0.028
bled+0.027
Neg logit contributions
e-0.020
 arrest-0.018
 hate-0.017
 fight-0.017
 moderation-0.016
Token her
Token ID607
Feature activation-8.10e-03

Pos logit contributions
der+0.036
lessly+0.030
ier+0.030
lic+0.029
rier+0.028
Neg logit contributions
 moderation-0.035
e-0.034
 hate-0.032
 argue-0.030
 arrest-0.029
Token favorite
Token ID4004
Feature activation+9.20e-03

Pos logit contributions
der+0.019
lessly+0.016
rier+0.015
ier+0.015
lights+0.014
Neg logit contributions
e-0.014
 moderation-0.014
 hate-0.013
 arrest-0.012
 client-0.012
Token color
Token ID3124
Feature activation-2.69e-02

Pos logit contributions
 moderation+0.018
e+0.018
 client+0.016
|+0.016
 arrest+0.016
Neg logit contributions
der-0.019
lessly-0.016
rier-0.015
ier-0.015
lights-0.014
Token.
Token ID13
Feature activation+1.19e-02

Pos logit contributions
der+0.053
lic+0.051
lessly+0.050
rier+0.047
omed+0.041
Neg logit contributions
e-0.049
 hate-0.042
 fight-0.041
 moderation-0.041
 quarrel-0.040
Token She
Token ID1375
Feature activation-2.00e-02

Pos logit contributions
 moderation+0.023
e+0.022
 hate+0.020
 client+0.020
 heaven+0.020
Neg logit contributions
der-0.024
lessly-0.021
ier-0.020
rier-0.020
lights-0.020
Token felt
Token ID2936
Feature activation-4.68e-02

Pos logit contributions
der+0.045
rier+0.041
lessly+0.040
ier+0.038
lights+0.037
Neg logit contributions
e-0.035
 hate-0.032
 moderation-0.032
 argue-0.030
atherine-0.029
Token a
Token ID257
Feature activation+2.20e-03

Pos logit contributions
der+0.108
lessly+0.092
lights+0.089
lic+0.087
rier+0.087
Neg logit contributions
 moderation-0.081
e-0.080
 hate-0.077
 mischief-0.069
 heaven-0.068
Token little
Token ID1310
Feature activation+9.56e-04

Pos logit contributions
 hate+0.004
e+0.004
 moderation+0.003
 panic+0.003
 reason+0.003
Neg logit contributions
der-0.005
lights-0.005
lessly-0.004
boards-0.004
ier-0.004
Token swing
Token ID9628
Feature activation-9.92e-03

Pos logit contributions
 moderation+0.007
e+0.007
 client+0.007
keeping+0.006
|+0.006
Neg logit contributions
der-0.011
lessly-0.009
rier-0.009
ier-0.009
lights-0.009
Token on
Token ID319
Feature activation+5.66e-02

Pos logit contributions
der+0.019
lessly+0.016
rier+0.015
ier+0.014
lights+0.014
Neg logit contributions
e-0.021
 moderation-0.020
 hate-0.019
 argue-0.019
|-0.019
Token the
Token ID262
Feature activation-2.35e-02

Pos logit contributions
 moderation+0.111
e+0.108
 client+0.108
 hate+0.103
|+0.103
Neg logit contributions
der-0.114
lessly-0.089
rier-0.087
lights-0.083
ier-0.082
Token swings
Token ID26728
Feature activation+3.04e-03

Pos logit contributions
der+0.064
lessly+0.056
rier+0.054
ier+0.053
lights+0.052
Neg logit contributions
e-0.030
 moderation-0.030
 hate-0.027
 arrest-0.026
 argue-0.026
Token.
Token ID13
Feature activation+1.64e-02

Pos logit contributions
e+0.007
 moderation+0.007
 hate+0.006
 argue+0.006
 arrest+0.006
Neg logit contributions
der-0.005
lessly-0.004
rier-0.004
ier-0.004
lights-0.004
Token That
Token ID1320
Feature activation-2.42e-02

Pos logit contributions
e+0.028
 moderation+0.027
 hate+0.025
 client+0.024
 argue+0.024
Neg logit contributions
der-0.038
lessly-0.033
rier-0.032
ier-0.031
lights-0.030
Token sounds
Token ID5238
Feature activation-3.26e-02

Pos logit contributions
der+0.051
lessly+0.043
rier+0.041
ier+0.039
lights+0.038
Neg logit contributions
e-0.047
 moderation-0.047
 hate-0.043
 argue-0.043
 client-0.043
Token fun
Token ID1257
Feature activation+5.55e-03

Pos logit contributions
der+0.083
rier+0.071
lessly+0.069
ier+0.066
lights+0.066
Neg logit contributions
e-0.047
 hate-0.045
 moderation-0.045
 argue-0.041
|-0.041
Token."
Token ID526
Feature activation+4.39e-03

Pos logit contributions
e+0.011
 moderation+0.011
 hate+0.010
 argue+0.010
 arrest+0.010
Neg logit contributions
der-0.011
lessly-0.009
rier-0.009
ier-0.009
lights-0.008
Token he
Token ID339
Feature activation-3.29e-02

Pos logit contributions
e+0.009
 moderation+0.008
keeping+0.008
 client+0.008
 harm+0.008
Neg logit contributions
der-0.009
lessly-0.007
rier-0.006
ing-0.006
ly-0.006
Token said
Token ID531
Feature activation-4.73e-02

Pos logit contributions
der+0.071
lessly+0.061
rier+0.061
lights+0.056
ier+0.056
Neg logit contributions
e-0.062
 argue-0.056
 moderation-0.054
 hate-0.054
 client-0.053
Token I
Token ID314
Feature activation+3.59e-02

Pos logit contributions
e+0.048
 moderation+0.047
|+0.043
 client+0.043
 argue+0.042
Neg logit contributions
der-0.072
lessly-0.060
rier-0.059
ier-0.056
lights-0.055
Token was
Token ID373
Feature activation-1.44e-02

Pos logit contributions
e+0.054
 moderation+0.052
 hate+0.049
 argue+0.047
 arrest+0.046
Neg logit contributions
der-0.092
lessly-0.080
rier-0.078
lights-0.075
ier-0.074
Token scared
Token ID12008
Feature activation+3.14e-03

Pos logit contributions
der+0.030
lessly+0.025
rier+0.024
ier+0.023
lights+0.022
Neg logit contributions
 moderation-0.027
e-0.027
 client-0.025
 arrest-0.025
|-0.025
Token."
Token ID526
Feature activation+8.78e-03

Pos logit contributions
e+0.006
 hate+0.006
 moderation+0.006
 harm+0.005
 arrest+0.005
Neg logit contributions
der-0.006
lessly-0.005
lic-0.005
lights-0.005
bled-0.005
Token Her
Token ID2332
Feature activation-1.01e-02

Pos logit contributions
 moderation+0.019
e+0.019
 arrest+0.018
 conflicts+0.018
 harm+0.018
Neg logit contributions
der-0.015
lessly-0.011
rier-0.010
lights-0.010
ing-0.009
Token mom
Token ID1995
Feature activation-3.61e-03

Pos logit contributions
der+0.020
rier+0.015
lessly+0.015
lights+0.014
ier+0.013
Neg logit contributions
e-0.022
 moderation-0.021
|-0.021
 arrest-0.020
atherine-0.019
Token smiled
Token ID13541
Feature activation-4.77e-02

Pos logit contributions
der+0.011
rier+0.009
lessly+0.009
ier+0.009
lights+0.009
Neg logit contributions
e-0.004
 moderation-0.004
 hate-0.003
 argue-0.003
|-0.003
Token and
Token ID290
Feature activation-1.93e-02

Pos logit contributions
boards+0.107
der+0.104
lic+0.103
lessly+0.099
lights+0.099
Neg logit contributions
e-0.058
 heaven-0.058
 worry-0.055
 visit-0.054
 hate-0.054
Token said
Token ID531
Feature activation-5.58e-02

Pos logit contributions
der+0.045
rier+0.042
lessly+0.041
lights+0.039
ier+0.038
Neg logit contributions
e-0.029
 moderation-0.027
 obey-0.026
 arrest-0.026
 Susan-0.025
Token,
Token ID11
Feature activation-3.97e-02

Pos logit contributions
der+0.114
lights+0.105
lessly+0.105
lic+0.091
boards+0.091
Neg logit contributions
e-0.099
 moderation-0.085
 hate-0.083
 argue-0.079
 obey-0.078
Token "
Token ID366
Feature activation+4.39e-03

Pos logit contributions
der+0.070
lights+0.062
ier+0.062
lessly+0.060
rier+0.058
Neg logit contributions
e-0.077
 moderation-0.074
 argue-0.073
 hate-0.072
 meet-0.072
TokenAfter
Token ID3260
Feature activation-2.08e-02

Pos logit contributions
der+0.050
lessly+0.043
rier+0.039
ier+0.038
lic+0.037
Neg logit contributions
 moderation-0.044
e-0.044
 client-0.041
 hate-0.041
 argue-0.040
Token they
Token ID484
Feature activation-8.02e-03

Pos logit contributions
der+0.042
lessly+0.034
rier+0.033
ier+0.031
lights+0.030
Neg logit contributions
e-0.042
 moderation-0.042
 hate-0.039
 client-0.039
 arrest-0.039
Token finished
Token ID5201
Feature activation+6.12e-03

Pos logit contributions
der+0.019
rier+0.017
lessly+0.015
lights+0.014
ier+0.014
Neg logit contributions
e-0.015
 moderation-0.014
 hate-0.013
|-0.013
 argue-0.013
Token cleaning
Token ID12724
Feature activation+4.50e-03

Pos logit contributions
 moderation+0.009
e+0.009
 argue+0.008
 hate+0.008
 arrest+0.008
Neg logit contributions
der-0.016
lessly-0.014
rier-0.013
ier-0.013
lic-0.013
Token,
Token ID11
Feature activation-3.07e-02

Pos logit contributions
 moderation+0.009
e+0.009
keeping+0.008
 mischief+0.008
 argue+0.008
Neg logit contributions
der-0.009
lessly-0.008
lic-0.007
rier-0.007
ier-0.007
Token the
Token ID262
Feature activation-5.63e-02

Pos logit contributions
der+0.069
lessly+0.059
rier+0.056
ier+0.054
lights+0.053
Neg logit contributions
e-0.054
 moderation-0.054
 hate-0.049
 arrest-0.049
 client-0.049
Token man
Token ID582
Feature activation+4.85e-03

Pos logit contributions
der+0.120
lessly+0.100
rier+0.096
ier+0.092
lights+0.090
Neg logit contributions
 moderation-0.107
e-0.106
 hate-0.101
 client-0.101
 arrest-0.099
Token said
Token ID531
Feature activation-5.68e-02

Pos logit contributions
e+0.010
 moderation+0.009
 arrest+0.008
 hate+0.008
 argue+0.008
Neg logit contributions
der-0.010
lessly-0.009
rier-0.009
ier-0.008
lights-0.008
Token,
Token ID11
Feature activation-4.63e-02

Pos logit contributions
der+0.118
lights+0.102
lessly+0.101
ier+0.089
rier+0.089
Neg logit contributions
e-0.101
 argue-0.091
 hate-0.090
 moderation-0.087
 client-0.086
Token "
Token ID366
Feature activation-1.15e-03

Pos logit contributions
lights+0.095
der+0.094
ier+0.092
rier+0.084
lessly+0.083
Neg logit contributions
 meet-0.066
 visit-0.065
e-0.061
 argue-0.059
 Susan-0.057
TokenThank
Token ID10449
Feature activation+7.29e-03

Pos logit contributions
der+0.003
lessly+0.002
rier+0.002
ier+0.002
lights+0.002
Neg logit contributions
e-0.002
 moderation-0.002
 argue-0.002
 client-0.002
 hate-0.002
Token screamed
Token ID25421
Feature activation-3.74e-02

Pos logit contributions
der+0.051
rier+0.047
lessly+0.045
ier+0.044
lights+0.044
Neg logit contributions
e-0.028
 hate-0.027
 argue-0.026
 moderation-0.026
 arrest-0.025
Token and
Token ID290
Feature activation-9.07e-03

Pos logit contributions
der+0.075
lic+0.072
lights+0.068
lessly+0.068
rier+0.066
Neg logit contributions
e-0.064
 hate-0.061
 argue-0.057
 heaven-0.055
 fight-0.055
Token kicked
Token ID12165
Feature activation-1.44e-02

Pos logit contributions
der+0.024
rier+0.021
lessly+0.020
lights+0.020
ier+0.020
Neg logit contributions
e-0.013
 hate-0.012
 argue-0.012
 moderation-0.011
 panic-0.011
Token.
Token ID13
Feature activation+7.87e-03

Pos logit contributions
der+0.031
lessly+0.026
rier+0.025
ier+0.024
lights+0.023
Neg logit contributions
e-0.027
 moderation-0.027
 hate-0.025
 client-0.025
 argue-0.025
Token She
Token ID1375
Feature activation-2.88e-02

Pos logit contributions
e+0.015
 moderation+0.015
 hate+0.014
 arrest+0.014
 argue+0.014
Neg logit contributions
der-0.017
lessly-0.014
rier-0.013
ier-0.013
lights-0.012
Token said
Token ID531
Feature activation-5.24e-02

Pos logit contributions
der+0.061
lessly+0.051
rier+0.048
ier+0.047
lights+0.045
Neg logit contributions
 moderation-0.055
e-0.054
 hate-0.050
 client-0.050
 arrest-0.049
Token she
Token ID673
Feature activation-3.31e-02

Pos logit contributions
der+0.103
lessly+0.093
rier+0.086
lights+0.083
ier+0.081
Neg logit contributions
e-0.100
 moderation-0.100
 hate-0.095
 client-0.094
 argue-0.092
Token only
Token ID691
Feature activation-1.06e-02

Pos logit contributions
der+0.065
lessly+0.053
rier+0.049
ier+0.048
lights+0.046
Neg logit contributions
 moderation-0.069
e-0.068
 arrest-0.063
 hate-0.063
 client-0.062
Token wanted
Token ID2227
Feature activation+7.42e-03

Pos logit contributions
der+0.023
lessly+0.019
rier+0.018
ier+0.018
lights+0.017
Neg logit contributions
 moderation-0.020
e-0.019
 arrest-0.018
 client-0.018
|-0.018
Token to
Token ID284
Feature activation+7.94e-02

Pos logit contributions
e+0.014
 moderation+0.014
|+0.013
 arrest+0.013
 hate+0.013
Neg logit contributions
der-0.016
lessly-0.013
rier-0.012
ier-0.012
lights-0.012
Token wear
Token ID5806
Feature activation+2.17e-02

Pos logit contributions
 moderation+0.137
 colony+0.124
atherine+0.121
e+0.119
keeping+0.119
Neg logit contributions
der-0.164
lessly-0.135
rier-0.129
lights-0.128
ier-0.128
Token.
Token ID13
Feature activation-7.62e-03

Pos logit contributions
der+0.052
lessly+0.046
lic+0.045
rier+0.045
lights+0.043
Neg logit contributions
e-0.050
keeping-0.045
 argue-0.044
 hate-0.044
 moderation-0.043
Token Then
Token ID3244
Feature activation-3.60e-02

Pos logit contributions
der+0.016
lessly+0.013
rier+0.012
lights+0.011
ier+0.011
Neg logit contributions
e-0.016
 moderation-0.015
 hate-0.014
 client-0.014
 argue-0.014
Token,
Token ID11
Feature activation-5.09e-02

Pos logit contributions
der+0.064
lessly+0.051
rier+0.048
ier+0.046
lights+0.044
Neg logit contributions
e-0.082
 moderation-0.081
 hate-0.076
 arrest-0.075
 client-0.075
Token they
Token ID484
Feature activation-2.65e-02

Pos logit contributions
der+0.096
lessly+0.079
rier+0.074
ier+0.070
lights+0.068
Neg logit contributions
e-0.109
 moderation-0.108
 hate-0.101
 client-0.100
 arrest-0.100
Token put
Token ID1234
Feature activation-3.09e-02

Pos logit contributions
der+0.060
lessly+0.050
rier+0.050
ier+0.047
lights+0.046
Neg logit contributions
e-0.049
 moderation-0.045
 argue-0.044
 hate-0.044
 arrest-0.043
Token the
Token ID262
Feature activation-5.21e-02

Pos logit contributions
der+0.078
lessly+0.065
rier+0.061
ier+0.060
lic+0.059
Neg logit contributions
e-0.052
 moderation-0.050
 argue-0.043
 hate-0.043
 arrest-0.043
Token expensive
Token ID5789
Feature activation-1.80e-02

Pos logit contributions
der+0.113
lessly+0.095
rier+0.093
ier+0.087
lights+0.084
Neg logit contributions
e-0.094
 moderation-0.091
 client-0.085
 argue-0.085
|-0.085
Token blanket
Token ID18447
Feature activation-4.87e-02

Pos logit contributions
der+0.042
lessly+0.036
rier+0.035
ier+0.033
lights+0.032
Neg logit contributions
e-0.030
 moderation-0.030
 hate-0.027
 client-0.027
 arrest-0.027
Token into
Token ID656
Feature activation+3.15e-02

Pos logit contributions
der+0.108
rier+0.081
lic+0.081
lessly+0.080
ier+0.075
Neg logit contributions
e-0.094
packs-0.081
keeping-0.081
 moderation-0.080
 argue-0.079
Token their
Token ID511
Feature activation-4.48e-03

Pos logit contributions
e+0.057
 moderation+0.056
 hate+0.053
 client+0.052
 argue+0.052
Neg logit contributions
der-0.069
lessly-0.059
rier-0.056
ier-0.053
lights-0.052
Token bag
Token ID6131
Feature activation-5.09e-02

Pos logit contributions
der+0.008
rier+0.007
lessly+0.007
lights+0.007
ly+0.006
Neg logit contributions
e-0.009
 moderation-0.009
 client-0.008
Susan-0.008
 argue-0.008
Token at
Token ID379
Feature activation+4.16e-03

Pos logit contributions
der+0.098
rier+0.075
lessly+0.074
lights+0.074
ier+0.071
Neg logit contributions
e-0.098
 hate-0.098
 moderation-0.091
 heaven-0.088
 argue-0.088
Token the
Token ID262
Feature activation-4.78e-02

Pos logit contributions
e+0.007
 moderation+0.006
 hate+0.006
 argue+0.006
 client+0.006
Neg logit contributions
der-0.010
lessly-0.009
rier-0.008
lights-0.008
ier-0.008
Token car
Token ID1097
Feature activation-1.72e-02

Pos logit contributions
der+0.118
lessly+0.101
rier+0.098
ier+0.094
lights+0.092
Neg logit contributions
e-0.076
 moderation-0.075
 hate-0.067
 client-0.067
 arrest-0.067
Token.
Token ID13
Feature activation+1.11e-02

Pos logit contributions
der+0.036
lessly+0.031
lic+0.031
rier+0.029
lights+0.029
Neg logit contributions
e-0.033
 argue-0.029
 arrest-0.029
 fight-0.029
 hate-0.029
Token He
Token ID679
Feature activation-2.25e-02

Pos logit contributions
e+0.023
 moderation+0.023
 hate+0.021
 arrest+0.021
 client+0.021
Neg logit contributions
der-0.022
lessly-0.018
rier-0.017
ier-0.016
lights-0.016
Token looks
Token ID3073
Feature activation-5.34e-02

Pos logit contributions
der+0.049
lessly+0.043
rier+0.041
ier+0.040
lights+0.038
Neg logit contributions
e-0.042
 moderation-0.040
 hate-0.039
 arrest-0.039
 argue-0.037
Token at
Token ID379
Feature activation+6.25e-04

Pos logit contributions
der+0.114
lessly+0.089
lights+0.088
rier+0.087
ier+0.084
Neg logit contributions
 hate-0.105
e-0.105
 moderation-0.096
 argue-0.094
 heaven-0.092
Token Lily
Token ID20037
Feature activation+3.07e-02

Pos logit contributions
e+0.001
 moderation+0.001
 hate+0.001
 argue+0.001
 client+0.001
Neg logit contributions
der-0.002
lessly-0.001
rier-0.001
ier-0.001
lights-0.001
Token and
Token ID290
Feature activation-1.25e-02

Pos logit contributions
e+0.075
 moderation+0.074
 hate+0.070
 client+0.069
 arrest+0.069
Neg logit contributions
der-0.049
lessly-0.038
rier-0.036
ier-0.033
lights-0.032
Token Tom
Token ID4186
Feature activation+3.81e-02

Pos logit contributions
der+0.036
lessly+0.032
rier+0.031
ier+0.030
lights+0.029
Neg logit contributions
e-0.015
 moderation-0.014
 hate-0.013
 arrest-0.012
 client-0.012
Token.
Token ID13
Feature activation+1.58e-02

Pos logit contributions
 moderation+0.093
e+0.092
 client+0.087
 hate+0.087
 arrest+0.086
Neg logit contributions
der-0.059
lessly-0.047
rier-0.044
ier-0.041
lights-0.038
Token girl
Token ID2576
Feature activation+3.62e-02

Pos logit contributions
der+0.066
lessly+0.056
rier+0.054
ier+0.051
lights+0.049
Neg logit contributions
e-0.064
 moderation-0.063
 arrest-0.059
 argue-0.058
 client-0.058
Token a
Token ID257
Feature activation+3.97e-03

Pos logit contributions
 argue+0.075
 moderation+0.075
 client+0.074
 hate+0.073
e+0.073
Neg logit contributions
der-0.063
lessly-0.053
rier-0.046
ing-0.046
ier-0.045
Token hug
Token ID16225
Feature activation-1.78e-02

Pos logit contributions
e+0.009
 moderation+0.009
 hate+0.008
 arrest+0.008
 client+0.008
Neg logit contributions
der-0.007
lessly-0.006
rier-0.006
ier-0.005
lights-0.005
Token and
Token ID290
Feature activation-1.71e-02

Pos logit contributions
der+0.032
rier+0.028
lessly+0.027
lights+0.026
lic+0.024
Neg logit contributions
e-0.039
 moderation-0.034
keeping-0.034
 hate-0.033
 arrest-0.032
Token they
Token ID484
Feature activation-2.00e-02

Pos logit contributions
der+0.035
lessly+0.030
rier+0.029
ier+0.027
lights+0.026
Neg logit contributions
e-0.033
 moderation-0.033
 hate-0.031
 arrest-0.030
 client-0.030
Token said
Token ID531
Feature activation-5.13e-02

Pos logit contributions
der+0.043
rier+0.036
lessly+0.036
ier+0.033
lights+0.032
Neg logit contributions
e-0.039
 moderation-0.037
 hate-0.034
 argue-0.034
|-0.034
Token goodbye
Token ID24829
Feature activation-3.15e-02

Pos logit contributions
der+0.107
lessly+0.091
lights+0.090
rier+0.088
lic+0.086
Neg logit contributions
e-0.089
 hate-0.077
 argue-0.075
 meet-0.074
 moderation-0.073
Token.
Token ID13
Feature activation-7.21e-03

Pos logit contributions
der+0.066
lessly+0.058
lights+0.058
lic+0.057
rier+0.057
Neg logit contributions
e-0.052
keeping-0.048
 argue-0.046
 visit-0.046
 hate-0.045
Token The
Token ID383
Feature activation-5.86e-02

Pos logit contributions
der+0.014
lessly+0.012
rier+0.010
lights+0.010
lic+0.009
Neg logit contributions
e-0.015
 moderation-0.015
 argue-0.015
 hate-0.015
 client-0.014
Token girl
Token ID2576
Feature activation+2.71e-03

Pos logit contributions
der+0.130
lessly+0.110
rier+0.105
ier+0.101
lights+0.098
Neg logit contributions
e-0.107
 moderation-0.106
 hate-0.098
 client-0.096
 arrest-0.096
Token waved
Token ID26834
Feature activation-4.46e-02

Pos logit contributions
e+0.006
 moderation+0.006
 hate+0.005
 arrest+0.005
 client+0.005
Neg logit contributions
der-0.005
lessly-0.004
rier-0.004
ier-0.004
lights-0.004