TOP ACTIVATIONS
MAX = 0.131

to never try to catch Twe ety again . From that
, Bugs said , " Twe ety , I have a
The cat wanted to eat Twe ety ! Twe
of wonderful memories from <|endoftext|> Lu cy was a little girl
dare to climb up to Twe ety 's nest .
a little bird ie like Twe ety . So , Twe
world from up high like Twe ety , but he was
j olly little bird named Twe ety . Twe ety loved
was a little bird named Twe ety . Twe ety was
was a little bird named Twe ety . Twe ety loved
was a little bird named Twe ety . Twe ety loved
was a little bird named Twe ety . Twe ety was
was a little bird named Twe ety . He had an
was a little bird named Twe ety . Twe ety lived
dragon loved it ! <|endoftext|> Lu cy was jealous of her
dishes ever again ! <|endoftext|> Ste lla was sad . She
there was a bird named Twe ety . Twe ety lived
was a little bird named Twe ety . Twe ety was
a curious little bird named Twe ety . Twe ety loved
named Jerry . Twe ety flew to Jerry 's

MIN ACTIVATIONS
MIN = -0.137

he felt a fl apping sensation in the air . It
, Elsa felt a strange sensation in the water . What
Suddenly , he felt a sensation . He began to slide
and Tom experienced a strange sensation they had never felt before
, they felt a strange sensation . They felt light and
felt a strange t ingly sensation and when it stopped they
and Ben felt a funny sensation in their stomach s .
the camera up to her eye , the park seemed magical
a pleasant t ing ling sensation throughout his body . The
and Tom felt a strange sensation in their stomach s .
she unsc rew ed the lid and unveiled a mysterious message
, he felt a strange sensation , like something powerful was
, it felt a burning sensation and started to jump around
friendly voice from the reef replied back :
, the fairy pointed her wand his way , and with
bird looked at her in amaz ement , and said :
found a magical o int ment . He touched it and
in a blink of an eye . She could see cars
The fairy waved her wand and said , " Sh
, the fairy waved her wand once more and the world
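Each TOP and MIN row appears to show the selected token with five tokens of context on either side. In the first row, for instance, "Twe" (activation 0.131, the MAX) sits in the middle of eleven tokens. The sketch below shows that selection in Python under stated assumptions. It assumes per-token activations are available for every sequence and that the window is a fixed five tokens. `top_contexts` is a hypothetical helper, not the dashboard's own code. The sample activations are the values listed for the first row's tokens in the per-token data. Joining tokens with single spaces only approximates how the dashboard displays them.

```python
import heapq

def top_contexts(seqs, k=20, window=5, largest=True):
    """Pick the k highest (or lowest) activating tokens across all sequences.

    seqs is a list of (tokens, activations) pairs of equal length. Each hit
    is returned as (activation, context), where context joins up to
    `window` tokens on either side of the selected token.
    """
    # Flatten every token position into (activation, sequence index, position).
    candidates = [
        (act, s, i)
        for s, (_, acts) in enumerate(seqs)
        for i, act in enumerate(acts)
    ]
    pick = heapq.nlargest if largest else heapq.nsmallest
    results = []
    for act, s, i in pick(k, candidates):
        toks = seqs[s][0]
        # Clip the context window at the sequence boundaries.
        start, end = max(0, i - window), min(len(toks), i + window + 1)
        results.append((act, " ".join(toks[start:end])))
    return results

# Tokens and activations for the first TOP ACTIVATIONS row.
tokens = "to never try to catch Twe ety again . From that".split()
acts = [0.0234, 0.0222, 0.0348, 0.0727, 0.0447, 0.131,
        0.0606, 0.0511, -0.00388, -0.00999, 0.0043]
print(top_contexts([(tokens, acts)], k=1))
```

With `largest=False`, the same function gives the MIN-style rows.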

INTERVAL 0.064 - 0.131
CONTAINS 0.907%

me . I don 't like cars and trucks , I
what it would be like to sail in a boat like
you . I don 't like brushes ."
cat ! Do you want to play with my toy ?"
I don ’ t want to go

INTERVAL -0.003 - 0.064
CONTAINS 51.829%

for the advice . From then on , the small bear
saw him from the window and said , " Jack !
. They were all very happy . Amy saw a big
who worked on his farm everyday . Every day he would
, my sweet boys . You can look at it and

INTERVAL -0.070 - -0.003
CONTAINS 47.165%

a little bit embarrassed . They carried boxes full
go home and change . And next time , be more
find out . Finally , the truck stopped at
dry . And next time , let 's hide somewhere else
happy together ." They took the dolls to the

INTERVAL -0.137 - -0.070
CONTAINS 0.100%

Ben jamin smiled and said , " That 's great
as can be . She said to him , " Don
the table suddenly closed its lid ! He was so sad
at the little girl and replied , " No , I
her mom and dad and said , “
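The four INTERVAL rows split the range from MIN (-0.137) to MAX (0.131) into equal-width slices of about 0.067 each. CONTAINS appears to give the percentage of all recorded token activations that fall in each slice, and the four percentages sum to roughly 100%. The bucketing can be sketched in Python, assuming a flat list of per-token activations is available. `interval_shares` is a hypothetical helper, not the dashboard's own code, and the sample values are illustrative only.

```python
def interval_shares(acts, n_bins=4):
    """Split [min(acts), max(acts)] into n_bins equal-width intervals.

    Returns (low, high, percent) tuples, highest interval first,
    mirroring the order of the INTERVAL rows.
    """
    lo, hi = min(acts), max(acts)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for a in acts:
        # Bin index; the maximum itself is placed in the top bin.
        i = n_bins - 1 if width == 0 else min(int((a - lo) / width), n_bins - 1)
        counts[i] += 1
    total = len(acts)
    return [
        (lo + i * width, lo + (i + 1) * width, 100.0 * counts[i] / total)
        for i in reversed(range(n_bins))
    ]

rows = interval_shares([-0.137, -0.1, 0.0, 0.01, 0.05, 0.131])
for low, high, pct in rows:
    print(f"INTERVAL {low:.3f} - {high:.3f}  CONTAINS {pct:.3f}%")
```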
Token to
Token ID284
Feature activation+2.34e-02

Pos logit contributions
der+0.032
lights+0.031
rier+0.031
lic+0.028
boards+0.028
Neg logit contributions
e-0.032
 moderation-0.031
oof-0.031
 hate-0.030
 client-0.028
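Each per-token block lists the five vocabulary entries with the largest positive logit contributions and the five with the most negative. Picking those two lists is a top-k / bottom-k query, sketched below in Python. The sketch assumes the contributions for the whole vocabulary are already available as a dict from token string to value; how the dashboard derives them is not shown here. `top_and_bottom` is a hypothetical helper. The sample values come from the first " to" block, plus one hypothetical near-zero entry ("xyz") that should be filtered out.

```python
import heapq

def top_and_bottom(contribs, k=5):
    """Return the k most positive and k most negative (token, value) pairs.

    heapq.nlargest/nsmallest are documented as equivalent to a sorted()
    slice, so tied values keep the dict's insertion order.
    """
    pos = heapq.nlargest(k, contribs.items(), key=lambda kv: kv[1])
    neg = heapq.nsmallest(k, contribs.items(), key=lambda kv: kv[1])
    return pos, neg

sample = {
    "der": 0.032, "lights": 0.031, "rier": 0.031, "lic": 0.028, "boards": 0.028,
    "e": -0.032, " moderation": -0.031, "oof": -0.031, " hate": -0.030,
    " client": -0.028,
    "xyz": 0.0,  # hypothetical near-zero entry, not from the dashboard
}
pos, neg = top_and_bottom(sample)
print("Pos logit contributions", pos)
print("Neg logit contributions", neg)
```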
Token never
Token ID1239
Feature activation+2.22e-02

Pos logit contributions
e+0.036
 moderation+0.034
oof+0.033
 hate+0.032
 client+0.029
Neg logit contributions
der-0.054
rier-0.052
lights-0.050
lic-0.047
lessly-0.046
Token try
Token ID1949
Feature activation+3.48e-02

Pos logit contributions
e+0.036
 moderation+0.035
oof+0.034
 hate+0.031
atherine+0.030
Neg logit contributions
der-0.049
rier-0.047
lights-0.044
lic-0.042
ly-0.042
Token to
Token ID284
Feature activation+7.27e-02

Pos logit contributions
e+0.083
 moderation+0.079
oof+0.078
 hate+0.075
atherine+0.073
Neg logit contributions
der-0.055
rier-0.049
lights-0.046
lessly-0.041
lic-0.040
Token catch
Token ID4929
Feature activation+4.47e-02

Pos logit contributions
 moderation+0.116
e+0.114
oof+0.110
atherine+0.100
 hate+0.099
Neg logit contributions
der-0.159
rier-0.152
lights-0.143
ly-0.136
lic-0.135
Token Twe
Token ID24205
Feature activation+1.31e-01

Pos logit contributions
e+0.075
 moderation+0.072
oof+0.069
 hate+0.065
atherine+0.063
Neg logit contributions
der-0.098
rier-0.094
lights-0.087
lessly-0.082
ly-0.082
Tokenety
Token ID2963
Feature activation+6.06e-02

Pos logit contributions
e+0.170
 moderation+0.165
 hate+0.158
 client+0.144
aster+0.143
Neg logit contributions
der-0.304
rier-0.294
lic-0.289
ly-0.286
ing-0.283
Token again
Token ID757
Feature activation+5.11e-02

Pos logit contributions
e+0.157
 moderation+0.156
oof+0.154
 hate+0.148
atherine+0.147
Neg logit contributions
der-0.075
rier-0.068
lights-0.059
ly-0.056
ows-0.052
Token.
Token ID13
Feature activation-3.88e-03

Pos logit contributions
e+0.127
 moderation+0.123
oof+0.120
 hate+0.117
 client+0.110
Neg logit contributions
der-0.071
rier-0.067
lights-0.063
lic-0.057
lessly-0.056
Token From
Token ID3574
Feature activation-9.99e-03

Pos logit contributions
der+0.008
rier+0.008
lights+0.008
lic+0.007
lessly+0.007
Neg logit contributions
oof-0.007
e-0.006
 moderation-0.006
 hate-0.006
atherine-0.006
Token that
Token ID326
Feature activation+4.30e-03

Pos logit contributions
der+0.013
rier+0.013
lights+0.012
lic+0.011
boards+0.011
Neg logit contributions
e-0.025
oof-0.024
 moderation-0.024
 hate-0.023
atherine-0.022
Token,
Token ID11
Feature activation-1.23e-02

Pos logit contributions
e+0.029
 moderation+0.025
 hate+0.025
oof+0.025
 quarrel+0.022
Neg logit contributions
der-0.023
rier-0.022
lights-0.022
lic-0.019
lessly-0.018
Token Bugs
Token ID44991
Feature activation+1.81e-02

Pos logit contributions
der+0.028
rier+0.027
lights+0.026
lic+0.024
lessly+0.024
Neg logit contributions
e-0.019
 moderation-0.018
oof-0.018
 hate-0.017
 client-0.016
Token said
Token ID531
Feature activation-5.37e-02

Pos logit contributions
e+0.033
oof+0.030
 moderation+0.030
 hate+0.030
 supposed+0.026
Neg logit contributions
rier-0.037
der-0.037
lights-0.036
boards-0.031
lessly-0.031
Token,
Token ID11
Feature activation-4.07e-02

Pos logit contributions
lights+0.101
der+0.093
rier+0.091
lessly+0.086
boards+0.082
Neg logit contributions
e-0.105
 hate-0.099
oof-0.096
 moderation-0.091
 quarrel-0.090
Token "
Token ID366
Feature activation+2.35e-02

Pos logit contributions
lights+0.073
rier+0.069
der+0.065
boards+0.058
ows+0.057
Neg logit contributions
e-0.080
oof-0.077
 hate-0.075
 moderation-0.071
 quarrel-0.071
TokenTwe
Token ID32665
Feature activation+1.30e-01

Pos logit contributions
e+0.040
 moderation+0.038
oof+0.036
 hate+0.035
 client+0.034
Neg logit contributions
der-0.050
rier-0.049
lights-0.044
lic-0.044
ly-0.044
Tokenety
Token ID2963
Feature activation+4.57e-02

Pos logit contributions
e+0.172
 hate+0.163
 moderation+0.160
 client+0.158
 quarrel+0.139
Neg logit contributions
der-0.296
rier-0.277
ly-0.276
lic-0.265
ing-0.257
Token,
Token ID11
Feature activation-1.23e-02

Pos logit contributions
e+0.126
 moderation+0.125
oof+0.122
 hate+0.117
 client+0.115
Neg logit contributions
der-0.047
rier-0.043
lights-0.038
lic-0.034
ly-0.033
Token I
Token ID314
Feature activation+3.18e-02

Pos logit contributions
der+0.024
rier+0.023
lights+0.022
lic+0.020
lessly+0.020
Neg logit contributions
e-0.023
 moderation-0.022
oof-0.022
 hate-0.021
 client-0.020
Token have
Token ID423
Feature activation+1.87e-02

Pos logit contributions
 moderation+0.046
oof+0.046
atherine+0.046
 client+0.044
cle+0.044
Neg logit contributions
der-0.060
lic-0.056
rier-0.056
ows-0.055
ly-0.053
Token a
Token ID257
Feature activation+1.11e-02

Pos logit contributions
e+0.037
oof+0.036
 moderation+0.035
 hate+0.034
atherine+0.033
Neg logit contributions
der-0.034
rier-0.032
lights-0.029
boards-0.028
lic-0.027
Token The
Token ID383
Feature activation-3.05e-02

Pos logit contributions
e+0.012
 moderation+0.012
oof+0.012
 hate+0.011
 client+0.010
Neg logit contributions
der-0.016
rier-0.016
lights-0.016
lessly-0.014
lic-0.014
Token cat
Token ID3797
Feature activation+1.27e-02

Pos logit contributions
der+0.067
rier+0.062
lights+0.059
 lid+0.058
lic+0.057
Neg logit contributions
e-0.051
 moderation-0.048
oof-0.047
atherine-0.043
beh-0.043
Token wanted
Token ID2227
Feature activation-1.46e-03

Pos logit contributions
e+0.027
 moderation+0.026
oof+0.025
 hate+0.024
 client+0.023
Neg logit contributions
der-0.023
rier-0.021
lights-0.020
lic-0.019
lessly-0.018
Token to
Token ID284
Feature activation+5.39e-02

Pos logit contributions
der+0.002
rier+0.001
ly+0.001
boards+0.001
lights+0.001
Neg logit contributions
e-0.004
 moderation-0.004
oof-0.004
atherine-0.004
 hate-0.004
Token eat
Token ID4483
Feature activation+2.58e-02

Pos logit contributions
 moderation+0.108
atherine+0.102
cle+0.098
athy+0.097
e+0.096
Neg logit contributions
der-0.081
rier-0.075
ly-0.070
lic-0.069
 lid-0.069
Token Twe
Token ID24205
Feature activation+1.29e-01

Pos logit contributions
e+0.043
 moderation+0.041
oof+0.040
 hate+0.038
 client+0.035
Neg logit contributions
der-0.057
rier-0.055
lights-0.053
lic-0.049
lessly-0.049
Tokenety
Token ID2963
Feature activation+5.74e-02

Pos logit contributions
e+0.169
 hate+0.163
 client+0.158
 moderation+0.157
 generation+0.137
Neg logit contributions
der-0.284
ly-0.276
lic-0.276
ing-0.268
rier-0.268
Token!
Token ID0
Feature activation+3.14e-03

Pos logit contributions
e+0.145
 moderation+0.142
oof+0.141
 client+0.137
 hate+0.137
Neg logit contributions
der-0.072
rier-0.065
ly-0.055
lights-0.055
ows-0.054
Token\n
Token ID198
Feature activation-3.47e-03

Pos logit contributions
e+0.005
oof+0.005
 moderation+0.005
 hate+0.004
atherine+0.004
Neg logit contributions
der-0.007
lights-0.007
rier-0.007
lessly-0.006
lic-0.006
Token\n
Token ID198
Feature activation-3.56e-03

Pos logit contributions
der+0.006
rier+0.006
lights+0.006
lic+0.005
lessly+0.005
Neg logit contributions
e-0.007
 moderation-0.007
oof-0.007
 hate-0.007
 client-0.006
TokenTwe
Token ID32665
Feature activation+1.14e-01

Pos logit contributions
der+0.010
rier+0.010
lights+0.010
lessly+0.009
lic+0.009
Neg logit contributions
e-0.004
 moderation-0.004
oof-0.004
 hate-0.003
 client-0.003
Token of
Token ID286
Feature activation+3.02e-02

Pos logit contributions
e+0.002
 moderation+0.002
 hate+0.002
oof+0.002
aster+0.002
Neg logit contributions
rier-0.002
der-0.002
lic-0.001
lights-0.001
atively-0.001
Token wonderful
Token ID7932
Feature activation+4.15e-02

Pos logit contributions
e+0.056
 moderation+0.053
oof+0.052
 hate+0.050
 client+0.046
Neg logit contributions
der-0.062
rier-0.059
lights-0.056
lic-0.052
lessly-0.052
Token memories
Token ID9846
Feature activation-2.25e-02

Pos logit contributions
e+0.068
 moderation+0.065
oof+0.063
 hate+0.062
atherine+0.055
Neg logit contributions
der-0.093
rier-0.091
lights-0.084
lic-0.080
lessly-0.079
Token from
Token ID422
Feature activation+3.17e-02

Pos logit contributions
rier+0.043
der+0.039
lic+0.036
 lid+0.035
lights+0.035
Neg logit contributions
e-0.043
 supposed-0.037
 hate-0.035
 moderation-0.035
oof-0.034
Token<|endoftext|>
Token ID50256
Feature activation+4.17e-02

Pos logit contributions
e+0.062
oof+0.060
 moderation+0.060
 hate+0.057
 client+0.054
Neg logit contributions
der-0.059
rier-0.057
lights-0.055
lessly-0.048
ly-0.047
TokenLu
Token ID25596
Feature activation+1.29e-01

Pos logit contributions
e+0.074
 moderation+0.067
oof+0.064
 hate+0.061
atherine+0.056
Neg logit contributions
der-0.092
rier-0.091
lights-0.085
lessly-0.078
ly-0.075
Tokency
Token ID948
Feature activation+5.35e-02

Pos logit contributions
 moderation+0.195
oof+0.183
 hate+0.176
 client+0.176
 supposed+0.166
Neg logit contributions
der-0.260
ly-0.255
ows-0.251
rier-0.251
lights-0.248
Token was
Token ID373
Feature activation-4.57e-04

Pos logit contributions
e+0.137
 moderation+0.135
oof+0.132
 hate+0.128
 client+0.122
Neg logit contributions
der-0.068
rier-0.063
lights-0.059
lic-0.051
lessly-0.051
Token a
Token ID257
Feature activation+2.55e-02

Pos logit contributions
der+0.001
rier+0.001
lights+0.001
lessly+0.001
lic+0.001
Neg logit contributions
 moderation-0.001
e-0.001
oof-0.001
 hate-0.001
 client-0.001
Token little
Token ID1310
Feature activation+3.34e-02

Pos logit contributions
e+0.056
oof+0.055
 moderation+0.055
beh+0.051
 supposed+0.050
Neg logit contributions
der-0.039
rier-0.039
lights-0.034
lic-0.032
ly-0.032
Token girl
Token ID2576
Feature activation+2.77e-02

Pos logit contributions
e+0.065
 moderation+0.063
oof+0.061
 hate+0.058
 supposed+0.054
Neg logit contributions
der-0.064
rier-0.062
lights-0.059
lessly-0.054
lic-0.053
Token dare
Token ID16498
Feature activation+2.18e-02

Pos logit contributions
e+0.043
 moderation+0.042
oof+0.038
atherine+0.035
 hate+0.033
Neg logit contributions
der-0.089
rier-0.086
lessly-0.081
lic-0.081
lights-0.080
Token to
Token ID284
Feature activation+5.45e-02

Pos logit contributions
e+0.051
 moderation+0.051
oof+0.047
atherine+0.046
 hate+0.045
Neg logit contributions
der-0.032
rier-0.031
ly-0.028
lights-0.026
lessly-0.025
Token climb
Token ID12080
Feature activation+3.34e-02

Pos logit contributions
 moderation+0.112
e+0.105
atherine+0.104
oof+0.103
cle+0.102
Neg logit contributions
der-0.089
rier-0.083
ly-0.077
lic-0.069
ows-0.067
Token up
Token ID510
Feature activation+3.42e-02

Pos logit contributions
e+0.075
 moderation+0.074
oof+0.071
 hate+0.069
atherine+0.067
Neg logit contributions
der-0.053
rier-0.050
lights-0.048
lessly-0.042
lic-0.041
Token to
Token ID284
Feature activation+4.96e-02

Pos logit contributions
e+0.077
 moderation+0.075
oof+0.073
 hate+0.071
atherine+0.067
Neg logit contributions
der-0.055
rier-0.052
lights-0.049
lessly-0.044
lic-0.044
Token Twe
Token ID24205
Feature activation+1.28e-01

Pos logit contributions
e+0.100
oof+0.097
 moderation+0.097
atherine+0.095
cle+0.094
Neg logit contributions
der-0.079
rier-0.072
ly-0.064
lights-0.062
atively-0.059
Tokenety
Token ID2963
Feature activation+5.67e-02

Pos logit contributions
e+0.166
 hate+0.163
 moderation+0.156
 client+0.149
 charge+0.135
Neg logit contributions
der-0.285
lic-0.272
ly-0.266
ing-0.264
ows-0.263
Token's
Token ID338
Feature activation-1.83e-03

Pos logit contributions
e+0.146
 moderation+0.142
oof+0.141
 hate+0.136
atherine+0.130
Neg logit contributions
der-0.073
rier-0.067
lights-0.062
lessly-0.053
lic-0.053
Token nest
Token ID16343
Feature activation-1.31e-02

Pos logit contributions
der+0.004
rier+0.004
lights+0.003
lic+0.003
lessly+0.003
Neg logit contributions
e-0.003
 moderation-0.003
oof-0.003
 hate-0.003
 client-0.003
Token.
Token ID13
Feature activation+7.07e-03

Pos logit contributions
der+0.024
rier+0.023
lic+0.021
lights+0.021
boards+0.020
Neg logit contributions
e-0.027
 moderation-0.025
oof-0.023
 hate-0.023
 supposed-0.022
Token\n
Token ID198
Feature activation-1.06e-02

Pos logit contributions
e+0.013
 moderation+0.013
oof+0.013
 hate+0.012
atherine+0.012
Neg logit contributions
der-0.014
rier-0.013
lights-0.013
lessly-0.012
lic-0.011
Token a
Token ID257
Feature activation+2.06e-02

Pos logit contributions
e+0.073
 moderation+0.068
oof+0.067
atherine+0.065
 hate+0.064
Neg logit contributions
der-0.074
rier-0.069
lights-0.063
ly-0.062
lessly-0.059
Token little
Token ID1310
Feature activation+2.03e-02

Pos logit contributions
e+0.043
 moderation+0.042
oof+0.040
 hate+0.036
atherine+0.036
Neg logit contributions
der-0.037
rier-0.036
lic-0.032
lights-0.032
ly-0.031
Token bird
Token ID6512
Feature activation+3.89e-02

Pos logit contributions
e+0.038
 moderation+0.036
oof+0.035
 hate+0.034
 client+0.032
Neg logit contributions
der-0.041
rier-0.039
lights-0.038
lic-0.035
lessly-0.035
Tokenie
Token ID494
Feature activation+3.65e-02

Pos logit contributions
 moderation+0.102
e+0.099
oof+0.098
atherine+0.096
 hate+0.093
Neg logit contributions
der-0.046
rier-0.041
lights-0.037
ly-0.036
lic-0.032
Token like
Token ID588
Feature activation+5.17e-02

Pos logit contributions
 moderation+0.098
e+0.097
oof+0.096
atherine+0.091
 hate+0.090
Neg logit contributions
der-0.042
rier-0.036
lights-0.031
ly-0.031
lessly-0.030
Token Twe
Token ID24205
Feature activation+1.26e-01

Pos logit contributions
e+0.103
 moderation+0.098
oof+0.097
 hate+0.091
atherine+0.088
Neg logit contributions
der-0.099
rier-0.092
lights-0.086
boards-0.080
lessly-0.080
Tokenety
Token ID2963
Feature activation+6.15e-02

Pos logit contributions
e+0.183
 hate+0.165
 moderation+0.161
 client+0.161
 charge+0.148
Neg logit contributions
der-0.289
rier-0.265
ly-0.264
ing-0.252
lic-0.250
Token.
Token ID13
Feature activation+9.43e-03

Pos logit contributions
 moderation+0.165
e+0.163
oof+0.162
atherine+0.157
 generation+0.155
Neg logit contributions
der-0.067
rier-0.054
ly-0.052
ows-0.048
lights-0.047
Token So
Token ID1406
Feature activation-1.04e-02

Pos logit contributions
e+0.019
 moderation+0.019
oof+0.018
 hate+0.018
 client+0.016
Neg logit contributions
der-0.017
rier-0.017
lights-0.016
lic-0.014
lessly-0.014
Token,
Token ID11
Feature activation-2.79e-02

Pos logit contributions
der+0.016
rier+0.015
lights+0.015
lic+0.013
boards+0.013
Neg logit contributions
e-0.025
 moderation-0.024
oof-0.024
 hate-0.023
 client-0.021
Token Twe
Token ID24205
Feature activation+8.22e-02

Pos logit contributions
der+0.056
rier+0.053
lights+0.050
lessly+0.046
lic+0.046
Neg logit contributions
e-0.053
 moderation-0.050
oof-0.049
 hate-0.047
atherine-0.045
Token world
Token ID995
Feature activation+4.23e-03

Pos logit contributions
der+0.033
rier+0.032
lights+0.029
lessly+0.027
lic+0.027
Neg logit contributions
e-0.033
oof-0.031
 moderation-0.031
 hate-0.029
atherine-0.028
Token from
Token ID422
Feature activation+3.27e-02

Pos logit contributions
e+0.009
 hate+0.009
 moderation+0.009
oof+0.008
 supposed+0.008
Neg logit contributions
der-0.007
rier-0.007
lights-0.006
lic-0.006
lessly-0.006
Token up
Token ID510
Feature activation+3.18e-02

Pos logit contributions
e+0.069
oof+0.065
 moderation+0.063
atherine+0.062
beh+0.061
Neg logit contributions
der-0.056
rier-0.053
lights-0.048
lessly-0.041
ly-0.041
Token high
Token ID1029
Feature activation+4.31e-03

Pos logit contributions
e+0.067
oof+0.065
 moderation+0.063
 hate+0.060
atherine+0.059
Neg logit contributions
der-0.054
rier-0.053
lights-0.050
lessly-0.046
lic-0.044
Token like
Token ID588
Feature activation+5.39e-02

Pos logit contributions
e+0.009
 moderation+0.008
 hate+0.008
oof+0.008
 harm+0.008
Neg logit contributions
der-0.007
rier-0.007
lic-0.007
lights-0.007
lessly-0.006
Token Twe
Token ID24205
Feature activation+1.24e-01

Pos logit contributions
e+0.098
 moderation+0.093
oof+0.092
 hate+0.087
atherine+0.082
Neg logit contributions
der-0.111
rier-0.106
lights-0.100
lessly-0.093
boards-0.092
Tokenety
Token ID2963
Feature activation+5.20e-02

Pos logit contributions
e+0.165
 hate+0.149
 moderation+0.148
 client+0.137
oof+0.132
Neg logit contributions
der-0.307
rier-0.286
lic-0.277
ly-0.277
lights-0.271
Token,
Token ID11
Feature activation-8.77e-03

Pos logit contributions
e+0.134
 moderation+0.132
oof+0.130
 hate+0.125
atherine+0.121
Neg logit contributions
der-0.067
rier-0.061
lights-0.056
lessly-0.049
boards-0.048
Token but
Token ID475
Feature activation-7.95e-03

Pos logit contributions
der+0.013
rier+0.013
lights+0.012
lic+0.011
lessly+0.010
Neg logit contributions
e-0.021
 moderation-0.020
oof-0.019
 hate-0.019
 client-0.018
Token he
Token ID339
Feature activation-1.17e-02

Pos logit contributions
der+0.011
rier+0.010
lights+0.009
lessly+0.008
lic+0.008
Neg logit contributions
e-0.020
 moderation-0.019
oof-0.019
 hate-0.019
 client-0.018
Token was
Token ID373
Feature activation-2.52e-02

Pos logit contributions
der+0.019
rier+0.015
ly+0.014
lic+0.014
lights+0.014
Neg logit contributions
 moderation-0.027
e-0.026
oof-0.025
 hate-0.025
 generation-0.024
Token j
Token ID474
Feature activation+4.62e-02

Pos logit contributions
e+0.085
 moderation+0.081
oof+0.080
 hate+0.076
beh+0.075
Neg logit contributions
der-0.050
rier-0.049
lights-0.043
lic-0.041
lessly-0.038
Tokenolly
Token ID5098
Feature activation+4.08e-02

Pos logit contributions
e+0.104
 moderation+0.100
oof+0.098
 hate+0.095
 client+0.089
Neg logit contributions
der-0.076
rier-0.072
lights-0.068
lic-0.061
lessly-0.060
Token little
Token ID1310
Feature activation+3.88e-02

Pos logit contributions
e+0.079
 moderation+0.077
oof+0.074
 hate+0.072
 supposed+0.066
Neg logit contributions
der-0.079
rier-0.075
lights-0.071
lic-0.066
lessly-0.065
Token bird
Token ID6512
Feature activation+2.65e-02

Pos logit contributions
e+0.073
 moderation+0.070
oof+0.068
 hate+0.066
 client+0.061
Neg logit contributions
der-0.077
rier-0.074
lights-0.071
lic-0.065
lessly-0.065
Token named
Token ID3706
Feature activation+6.22e-02

Pos logit contributions
e+0.037
 moderation+0.036
oof+0.035
 hate+0.033
atherine+0.031
Neg logit contributions
der-0.064
rier-0.061
lights-0.059
lic-0.055
lessly-0.055
Token Twe
Token ID24205
Feature activation+1.23e-01

Pos logit contributions
e+0.117
 moderation+0.115
 hate+0.111
 client+0.108
beh+0.104
Neg logit contributions
der-0.116
rier-0.108
lic-0.101
lights-0.100
ly-0.096
Tokenety
Token ID2963
Feature activation+5.31e-02

Pos logit contributions
 hate+0.168
 moderation+0.166
e+0.164
 client+0.154
 quarrel+0.153
Neg logit contributions
der-0.268
ows-0.252
rier-0.248
lic-0.247
ing-0.245
Token.
Token ID13
Feature activation+1.69e-02

Pos logit contributions
 moderation+0.145
e+0.143
oof+0.139
 hate+0.137
atherine+0.134
Neg logit contributions
der-0.059
rier-0.050
lights-0.046
lessly-0.040
lic-0.039
Token Twe
Token ID24205
Feature activation+1.02e-01

Pos logit contributions
oof+0.016
e+0.016
 moderation+0.015
 hate+0.014
 generation+0.010
Neg logit contributions
der-0.049
rier-0.048
lights-0.047
lic-0.045
lessly-0.044
Tokenety
Token ID2963
Feature activation+3.10e-02

Pos logit contributions
e+0.130
 moderation+0.128
 hate+0.122
 client+0.117
aster+0.110
Neg logit contributions
der-0.240
lic-0.225
rier-0.224
ows-0.222
lights-0.217
Token loved
Token ID6151
Feature activation+2.71e-02

Pos logit contributions
 moderation+0.058
e+0.058
oof+0.055
 hate+0.053
 client+0.051
Neg logit contributions
der-0.060
rier-0.054
lights-0.054
lic-0.050
lessly-0.050
Token was
Token ID373
Feature activation+4.03e-02

Pos logit contributions
der+0.026
rier+0.023
lights+0.023
lic+0.020
boards+0.020
Neg logit contributions
e-0.045
 moderation-0.044
oof-0.044
 hate-0.042
 client-0.041
Token a
Token ID257
Feature activation+3.37e-02

Pos logit contributions
 client+0.096
beh+0.094
cle+0.092
e+0.092
 hate+0.091
Neg logit contributions
der-0.049
lights-0.040
lic-0.039
rier-0.039
lessly-0.037
Token little
Token ID1310
Feature activation+4.00e-02

Pos logit contributions
e+0.082
 moderation+0.079
oof+0.077
 hate+0.073
beh+0.072
Neg logit contributions
der-0.047
rier-0.047
lights-0.041
lic-0.039
ly-0.036
Token bird
Token ID6512
Feature activation+2.63e-02

Pos logit contributions
e+0.077
 moderation+0.073
oof+0.071
 hate+0.069
 client+0.064
Neg logit contributions
der-0.078
rier-0.075
lights-0.072
lic-0.066
lessly-0.066
Token named
Token ID3706
Feature activation+6.10e-02

Pos logit contributions
e+0.037
 moderation+0.036
oof+0.035
 hate+0.033
atherine+0.031
Neg logit contributions
der-0.063
rier-0.060
lights-0.058
lic-0.055
lessly-0.054
Token Twe
Token ID24205
Feature activation+1.21e-01

Pos logit contributions
e+0.113
 moderation+0.112
 hate+0.108
 client+0.105
beh+0.103
Neg logit contributions
der-0.114
rier-0.108
lights-0.100
lic-0.099
ly-0.095
Tokenety
Token ID2963
Feature activation+5.43e-02

Pos logit contributions
e+0.163
 hate+0.154
 moderation+0.150
 client+0.145
 quarrel+0.139
Neg logit contributions
der-0.278
rier-0.256
ly-0.249
ows-0.247
lights-0.242
Token.
Token ID13
Feature activation+1.82e-02

Pos logit contributions
e+0.145
 moderation+0.144
oof+0.140
 hate+0.137
atherine+0.134
Neg logit contributions
der-0.064
rier-0.054
lights-0.050
lessly-0.044
lic-0.042
Token Twe
Token ID24205
Feature activation+1.01e-01

Pos logit contributions
e+0.017
oof+0.017
 moderation+0.017
 hate+0.015
 client+0.011
Neg logit contributions
der-0.052
rier-0.051
lights-0.050
lic-0.047
lessly-0.047
Tokenety
Token ID2963
Feature activation+3.38e-02

Pos logit contributions
e+0.125
 moderation+0.113
 hate+0.109
 client+0.106
oof+0.099
Neg logit contributions
der-0.254
rier-0.237
lights-0.230
lic-0.226
ly-0.226
Token was
Token ID373
Feature activation-9.44e-03

Pos logit contributions
 moderation+0.064
e+0.063
oof+0.061
 hate+0.059
 client+0.056
Neg logit contributions
der-0.064
rier-0.059
lights-0.058
lic-0.054
lessly-0.054
Token was
Token ID373
Feature activation+3.93e-02

Pos logit contributions
der+0.027
rier+0.024
lights+0.024
lic+0.021
boards+0.021
Neg logit contributions
e-0.046
 moderation-0.046
oof-0.045
 hate-0.043
 client-0.042
Token a
Token ID257
Feature activation+3.25e-02

Pos logit contributions
 client+0.103
beh+0.102
cle+0.099
illas+0.098
 courtroom+0.096
Neg logit contributions
der-0.038
lights-0.026
lic-0.026
boards-0.025
rier-0.025
Token little
Token ID1310
Feature activation+3.62e-02

Pos logit contributions
e+0.080
 moderation+0.076
oof+0.075
 hate+0.071
beh+0.070
Neg logit contributions
der-0.045
rier-0.044
lights-0.038
lic-0.038
lessly-0.034
Token bird
Token ID6512
Feature activation+2.60e-02

Pos logit contributions
e+0.070
 moderation+0.067
oof+0.065
 hate+0.063
 supposed+0.059
Neg logit contributions
der-0.070
rier-0.068
lights-0.064
lic-0.059
lessly-0.058
Token named
Token ID3706
Feature activation+5.66e-02

Pos logit contributions
e+0.035
 moderation+0.034
oof+0.032
 hate+0.031
atherine+0.029
Neg logit contributions
der-0.064
rier-0.062
lights-0.060
lic-0.056
lessly-0.056
Token Twe
Token ID24205
Feature activation+1.21e-01

Pos logit contributions
e+0.104
 moderation+0.100
 hate+0.096
 client+0.094
oof+0.092
Neg logit contributions
der-0.110
rier-0.105
lights-0.097
lic-0.096
ly-0.090
Tokenety
Token ID2963
Feature activation+4.93e-02

Pos logit contributions
e+0.155
 moderation+0.144
 hate+0.144
 client+0.134
 quarrel+0.133
Neg logit contributions
der-0.283
rier-0.261
ly-0.260
ows-0.256
lic-0.253
Token.
Token ID13
Feature activation+1.58e-02

Pos logit contributions
e+0.131
 moderation+0.131
oof+0.127
 hate+0.124
atherine+0.121
Neg logit contributions
der-0.057
rier-0.049
lights-0.046
lessly-0.040
lic-0.040
Token Twe
Token ID24205
Feature activation+1.02e-01

Pos logit contributions
e+0.014
 moderation+0.014
oof+0.014
 hate+0.013
 quarrel+0.009
Neg logit contributions
der-0.046
rier-0.044
lights-0.044
lic-0.042
ows-0.041
Tokenety
Token ID2963
Feature activation+3.22e-02

Pos logit contributions
e+0.118
 moderation+0.110
 hate+0.101
 client+0.096
oof+0.093
Neg logit contributions
der-0.265
rier-0.249
lic-0.241
ly-0.239
lights-0.239
Token loved
Token ID6151
Feature activation+2.64e-02

Pos logit contributions
e+0.061
 moderation+0.061
oof+0.059
 hate+0.056
 client+0.054
Neg logit contributions
der-0.061
rier-0.056
lights-0.055
lic-0.051
lessly-0.051
Token was
Token ID373
Feature activation+4.32e-02

Pos logit contributions
der+0.024
rier+0.022
lights+0.022
lic+0.019
boards+0.019
Neg logit contributions
 moderation-0.044
e-0.044
oof-0.043
 hate-0.041
 client-0.040
Token a
Token ID257
Feature activation+3.49e-02

Pos logit contributions
 client+0.113
beh+0.112
cle+0.110
 courtroom+0.107
illas+0.107
Neg logit contributions
der-0.041
lights-0.028
rier-0.027
lic-0.027
lessly-0.026
Token little
Token ID1310
Feature activation+3.87e-02

Pos logit contributions
e+0.085
 moderation+0.082
oof+0.080
 hate+0.076
beh+0.076
Neg logit contributions
der-0.048
rier-0.048
lights-0.041
lic-0.041
ly-0.037
Token bird
Token ID6512
Feature activation+2.38e-02

Pos logit contributions
e+0.075
 moderation+0.071
oof+0.069
 hate+0.067
 supposed+0.063
Neg logit contributions
der-0.075
rier-0.073
lights-0.069
lic-0.063
lessly-0.063
Token named
Token ID3706
Feature activation+5.60e-02

Pos logit contributions
e+0.033
 moderation+0.032
oof+0.030
 hate+0.029
atherine+0.027
Neg logit contributions
der-0.058
rier-0.055
lights-0.054
lic-0.050
lessly-0.050
Token Twe
Token ID24205
Feature activation+1.21e-01

Pos logit contributions
e+0.102
 moderation+0.100
 hate+0.095
 client+0.094
oof+0.091
Neg logit contributions
der-0.108
rier-0.104
lights-0.096
lic-0.095
ly-0.089
Tokenety
Token ID2963
Feature activation+4.82e-02

Pos logit contributions
e+0.152
 hate+0.149
 moderation+0.142
 client+0.142
 quarrel+0.138
Neg logit contributions
der-0.280
ows-0.254
rier-0.250
ly-0.249
lic-0.247
Token.
Token ID13
Feature activation+1.74e-02

Pos logit contributions
 moderation+0.130
e+0.129
oof+0.125
 hate+0.123
atherine+0.120
Neg logit contributions
der-0.055
rier-0.046
lights-0.043
lessly-0.038
lic-0.036
Token Twe
Token ID24205
Feature activation+1.00e-01

Pos logit contributions
e+0.018
 moderation+0.017
oof+0.017
 hate+0.016
 client+0.012
Neg logit contributions
der-0.049
rier-0.048
lights-0.047
lic-0.044
lessly-0.044
Tokenety
Token ID2963
Feature activation+3.10e-02

Pos logit contributions
e+0.121
 moderation+0.112
 hate+0.109
 client+0.107
oof+0.097
Neg logit contributions
der-0.251
rier-0.230
lic-0.227
ows-0.223
lights-0.223
Token loved
Token ID6151
Feature activation+2.73e-02

Pos logit contributions
e+0.058
 moderation+0.058
oof+0.055
 hate+0.054
 client+0.051
Neg logit contributions
der-0.060
rier-0.055
lights-0.054
lic-0.050
lessly-0.050
Token was
Token ID373
Feature activation+4.64e-02

Pos logit contributions
der+0.025
rier+0.024
lights+0.023
lic+0.020
boards+0.020
Neg logit contributions
e-0.039
 moderation-0.039
oof-0.038
 hate-0.036
 client-0.035
Token a
Token ID257
Feature activation+3.77e-02

Pos logit contributions
beh+0.115
 client+0.114
cle+0.112
atherine+0.109
e+0.109
Neg logit contributions
der-0.051
lights-0.041
rier-0.036
lic-0.034
 lid-0.033
Token little
Token ID1310
Feature activation+3.96e-02

Pos logit contributions
e+0.091
 moderation+0.088
oof+0.086
 hate+0.082
beh+0.081
Neg logit contributions
der-0.053
rier-0.052
lights-0.046
lic-0.044
ly-0.040
Token bird
Token ID6512
Feature activation+2.38e-02

Pos logit contributions
e+0.075
 moderation+0.072
oof+0.070
 hate+0.067
 client+0.062
Neg logit contributions
der-0.078
rier-0.075
lights-0.072
lic-0.066
lessly-0.066
Token named
Token ID3706
Feature activation+5.82e-02

Pos logit contributions
e+0.035
 moderation+0.034
oof+0.032
 hate+0.031
 client+0.029
Neg logit contributions
der-0.056
rier-0.054
lights-0.052
lic-0.049
lessly-0.048
Token Twe
Token ID24205
Feature activation+1.21e-01

Pos logit contributions
e+0.103
 moderation+0.103
 hate+0.097
 client+0.094
oof+0.093
Neg logit contributions
der-0.115
rier-0.110
lights-0.103
lic-0.101
ows-0.095
Tokenety
Token ID2963
Feature activation+4.99e-02

Pos logit contributions
e+0.150
 hate+0.139
 moderation+0.138
 client+0.129
 quarrel+0.127
Neg logit contributions
der-0.291
rier-0.273
ows-0.266
ly-0.265
lights-0.259
Token.
Token ID13
Feature activation+2.00e-02

Pos logit contributions
e+0.131
 moderation+0.131
oof+0.127
 hate+0.124
atherine+0.120
Neg logit contributions
der-0.060
rier-0.052
lights-0.050
lessly-0.043
lic-0.043
Token Twe
Token ID24205
Feature activation+1.03e-01

Pos logit contributions
e+0.020
oof+0.019
 moderation+0.019
 hate+0.017
 client+0.013
Neg logit contributions
der-0.057
rier-0.056
lights-0.055
lic-0.052
ows-0.052
Tokenety
Token ID2963
Feature activation+3.30e-02

Pos logit contributions
e+0.120
 moderation+0.109
 hate+0.101
 client+0.097
oof+0.095
Neg logit contributions
der-0.267
rier-0.254
lights-0.246
lic-0.243
lessly-0.241
Token was
Token ID373
Feature activation-9.10e-03

Pos logit contributions
e+0.063
 moderation+0.062
oof+0.060
 hate+0.058
 client+0.055
Neg logit contributions
der-0.063
rier-0.058
lights-0.057
lic-0.053
lessly-0.052
Token was
Token ID373
Feature activation+4.06e-02

Pos logit contributions
der+0.031
rier+0.028
lights+0.027
lic+0.024
boards+0.024
Neg logit contributions
e-0.050
 moderation-0.050
oof-0.049
 hate-0.047
 client-0.046
Token a
Token ID257
Feature activation+3.22e-02

Pos logit contributions
 client+0.104
beh+0.102
cle+0.100
illas+0.097
atherine+0.097
Neg logit contributions
der-0.044
lic-0.033
lights-0.032
rier-0.031
lessly-0.030
Token little
Token ID1310
Feature activation+3.93e-02

Pos logit contributions
e+0.078
 moderation+0.075
oof+0.074
 hate+0.070
beh+0.069
Neg logit contributions
der-0.045
rier-0.044
lights-0.039
lic-0.038
lessly-0.035
Token bird
Token ID6512
Feature activation+2.44e-02

Pos logit contributions
e+0.076
 moderation+0.072
oof+0.071
 hate+0.068
 client+0.063
Neg logit contributions
der-0.076
rier-0.073
lights-0.070
lic-0.064
lessly-0.064
Token named
Token ID3706
Feature activation+5.69e-02

Pos logit contributions
e+0.034
 moderation+0.033
oof+0.031
 hate+0.030
 client+0.028
Neg logit contributions
der-0.060
rier-0.057
lights-0.056
lic-0.052
lessly-0.052
Token Twe
Token ID24205
Feature activation+1.20e-01

Pos logit contributions
e+0.105
 moderation+0.102
 hate+0.099
 client+0.097
oof+0.093
Neg logit contributions
der-0.109
rier-0.105
lights-0.095
lic-0.095
ly-0.090
Tokenety
Token ID2963
Feature activation+4.81e-02

Pos logit contributions
e+0.156
 hate+0.152
 moderation+0.145
 quarrel+0.142
 client+0.141
Neg logit contributions
der-0.282
rier-0.257
ly-0.253
ows-0.249
lic-0.244
Token.
Token ID13
Feature activation+1.72e-02

Pos logit contributions
e+0.128
 moderation+0.127
oof+0.124
 hate+0.121
atherine+0.118
Neg logit contributions
der-0.057
rier-0.049
lights-0.045
lessly-0.040
lic-0.038
Token He
Token ID679
Feature activation+1.42e-03

Pos logit contributions
e+0.017
 moderation+0.017
oof+0.016
 hate+0.015
 generation+0.011
Neg logit contributions
der-0.049
rier-0.048
lights-0.047
lic-0.045
lessly-0.044
Token had
Token ID550
Feature activation+2.65e-03

Pos logit contributions
 moderation+0.003
e+0.003
oof+0.003
 hate+0.002
atherine+0.002
Neg logit contributions
der-0.003
rier-0.002
lights-0.002
lic-0.002
lessly-0.002
Token an
Token ID281
Feature activation-2.01e-03

Pos logit contributions
e+0.005
 moderation+0.005
oof+0.005
 hate+0.005
atherine+0.005
Neg logit contributions
der-0.005
rier-0.004
lights-0.004
lic-0.004
lessly-0.004
Token was
Token ID373
Feature activation+3.94e-02

Pos logit contributions
der+0.026
rier+0.023
lights+0.022
boards+0.020
lic+0.020
Neg logit contributions
e-0.048
 moderation-0.048
oof-0.047
 hate-0.045
 client-0.045
Token a
Token ID257
Feature activation+3.26e-02

Pos logit contributions
beh+0.100
 client+0.100
cle+0.097
illas+0.096
atherine+0.094
Neg logit contributions
der-0.041
lights-0.029
rier-0.029
boards-0.027
lic-0.027
Token little
Token ID1310
Feature activation+3.73e-02

Pos logit contributions
e+0.080
 moderation+0.077
oof+0.075
 hate+0.071
beh+0.071
Neg logit contributions
der-0.045
rier-0.044
lights-0.039
lic-0.038
lessly-0.034
Token bird
Token ID6512
Feature activation+2.41e-02

Pos logit contributions
e+0.072
 moderation+0.069
oof+0.067
 hate+0.065
 client+0.060
Neg logit contributions
der-0.072
rier-0.070
lights-0.066
lic-0.061
lessly-0.060
Token named
Token ID3706
Feature activation+5.54e-02

Pos logit contributions
e+0.032
 moderation+0.030
oof+0.029
 hate+0.028
 client+0.025
Neg logit contributions
der-0.061
rier-0.058
lights-0.057
lic-0.053
lessly-0.053
Token Twe
Token ID24205
Feature activation+1.20e-01

Pos logit contributions
e+0.103
 moderation+0.099
 hate+0.094
oof+0.091
 client+0.091
Neg logit contributions
der-0.108
rier-0.105
lights-0.096
lic-0.094
ly-0.088
Tokenety
Token ID2963
Feature activation+4.79e-02

Pos logit contributions
e+0.157
 moderation+0.144
 hate+0.143
 quarrel+0.132
 client+0.131
Neg logit contributions
der-0.284
rier-0.266
ly-0.260
ows-0.254
lic-0.250
Token.
Token ID13
Feature activation+1.58e-02

Pos logit contributions
e+0.127
 moderation+0.126
oof+0.122
 hate+0.120
atherine+0.116
Neg logit contributions
der-0.057
rier-0.050
lights-0.046
lessly-0.041
lic-0.040
Token Twe
Token ID24205
Feature activation+1.02e-01

Pos logit contributions
 moderation+0.015
e+0.014
oof+0.014
 hate+0.013
 generation+0.010
Neg logit contributions
der-0.045
rier-0.044
lights-0.044
lic-0.042
ows-0.041
Tokenety
Token ID2963
Feature activation+3.22e-02

Pos logit contributions
e+0.124
 moderation+0.114
 hate+0.105
 client+0.100
oof+0.098
Neg logit contributions
der-0.264
rier-0.250
lights-0.239
lic-0.239
ly-0.239
Token lived
Token ID5615
Feature activation-3.63e-03

Pos logit contributions
e+0.061
 moderation+0.060
oof+0.058
 hate+0.056
 client+0.053
Neg logit contributions
der-0.062
rier-0.057
lights-0.056
lic-0.052
lessly-0.052
Token dragon
Token ID10441
Feature activation+1.18e-02

Pos logit contributions
der+0.080
rier+0.077
lights+0.073
lic+0.067
lessly+0.067
Neg logit contributions
e-0.085
 moderation-0.082
oof-0.080
 hate-0.077
atherine-0.072
Token loved
Token ID6151
Feature activation+9.38e-03

Pos logit contributions
e+0.023
 moderation+0.021
 hate+0.020
oof+0.020
 client+0.019
Neg logit contributions
der-0.023
rier-0.022
lights-0.021
lic-0.019
boards-0.019
Token it
Token ID340
Feature activation-3.82e-03

Pos logit contributions
e+0.020
 moderation+0.020
oof+0.019
 hate+0.018
atherine+0.017
Neg logit contributions
der-0.016
rier-0.015
lights-0.015
lessly-0.013
lic-0.013
Token!
Token ID0
Feature activation-1.50e-02

Pos logit contributions
der+0.006
rier+0.006
lights+0.005
lic+0.005
boards+0.005
Neg logit contributions
e-0.009
 moderation-0.008
oof-0.008
 hate-0.008
 supposed-0.008
Token<|endoftext|>
Token ID50256
Feature activation+2.69e-02

Pos logit contributions
der+0.025
rier+0.023
lights+0.022
lessly+0.020
lic+0.020
Neg logit contributions
e-0.033
 moderation-0.032
oof-0.032
 hate-0.031
 client-0.029
TokenLu
Token ID25596
Feature activation+1.20e-01

Pos logit contributions
e+0.049
 moderation+0.045
oof+0.044
 hate+0.042
atherine+0.038
Neg logit contributions
der-0.057
rier-0.056
lights-0.053
lessly-0.048
ly-0.047
Tokency
Token ID948
Feature activation+2.92e-02

Pos logit contributions
 moderation+0.204
oof+0.191
 client+0.189
 supposed+0.185
astics+0.184
Neg logit contributions
ly-0.217
ows-0.212
der-0.212
rier-0.206
ed-0.203
Token was
Token ID373
Feature activation-6.63e-03

Pos logit contributions
e+0.070
 moderation+0.069
oof+0.068
 hate+0.066
 client+0.063
Neg logit contributions
der-0.041
rier-0.038
lights-0.036
lic-0.032
boards-0.032
Token jealous
Token ID19354
Feature activation-1.86e-02

Pos logit contributions
der+0.012
rier+0.012
lights+0.011
lessly+0.010
lic+0.010
Neg logit contributions
e-0.013
 moderation-0.013
oof-0.012
 hate-0.012
 client-0.011
Token of
Token ID286
Feature activation+5.47e-02

Pos logit contributions
rier+0.033
der+0.029
lights+0.028
lic+0.028
 lid+0.026
Neg logit contributions
e-0.037
 hate-0.036
 moderation-0.034
 supposed-0.034
oof-0.034
Token her
Token ID607
Feature activation+2.95e-02

Pos logit contributions
e+0.106
oof+0.103
 moderation+0.099
 hate+0.095
beh+0.093
Neg logit contributions
der-0.102
rier-0.090
lights-0.089
boards-0.079
ly-0.079
Token dishes
Token ID16759
Feature activation-7.95e-03

Pos logit contributions
 moderation+0.010
e+0.010
 hate+0.009
 mischief+0.008
 matter+0.008
Neg logit contributions
lic-0.014
der-0.014
ows-0.013
rier-0.013
lights-0.013
Token ever
Token ID1683
Feature activation+4.34e-02

Pos logit contributions
lic+0.014
lights+0.013
der+0.013
rier+0.013
 lid+0.011
Neg logit contributions
e-0.016
 moderation-0.015
 hate-0.015
 supposed-0.013
 quarrel-0.013
Token again
Token ID757
Feature activation+3.95e-02

Pos logit contributions
e+0.109
 moderation+0.105
 hate+0.101
oof+0.100
 supposed+0.095
Neg logit contributions
der-0.060
rier-0.058
lights-0.054
lic-0.049
lessly-0.047
Token!
Token ID0
Feature activation-2.34e-02

Pos logit contributions
e+0.096
 moderation+0.092
 hate+0.088
oof+0.088
 client+0.082
Neg logit contributions
der-0.058
rier-0.056
lights-0.053
lic-0.048
lessly-0.046
Token<|endoftext|>
Token ID50256
Feature activation+2.43e-02

Pos logit contributions
der+0.036
rier+0.031
lights+0.030
lessly+0.029
lic+0.027
Neg logit contributions
e-0.055
oof-0.052
 moderation-0.052
 hate-0.051
 supposed-0.047
TokenSte
Token ID7447
Feature activation+1.20e-01

Pos logit contributions
e+0.046
 moderation+0.043
oof+0.042
 hate+0.041
atherine+0.038
Neg logit contributions
der-0.049
rier-0.047
lights-0.045
lessly-0.041
lic-0.041
Tokenlla
Token ID8466
Feature activation+5.93e-02

Pos logit contributions
oof+0.230
 client+0.230
 moderation+0.223
 hate+0.220
 nephew+0.214
Neg logit contributions
lessly-0.186
der-0.173
lic-0.169
rier-0.160
ows-0.158
Token was
Token ID373
Feature activation-9.17e-04

Pos logit contributions
e+0.126
 moderation+0.124
oof+0.121
 hate+0.116
 client+0.110
Neg logit contributions
der-0.101
rier-0.094
lights-0.090
lic-0.082
lessly-0.081
Token sad
Token ID6507
Feature activation-1.91e-02

Pos logit contributions
der+0.002
rier+0.002
lights+0.001
lessly+0.001
lic+0.001
Neg logit contributions
 moderation-0.002
oof-0.002
e-0.002
 client-0.002
 hate-0.002
Token.
Token ID13
Feature activation+9.66e-03

Pos logit contributions
rier+0.036
lic+0.034
lights+0.031
boards+0.031
der+0.030
Neg logit contributions
 hate-0.035
e-0.033
 moderation-0.032
oof-0.030
 supposed-0.030
Token She
Token ID1375
Feature activation-1.09e-02

Pos logit contributions
e+0.023
 moderation+0.022
oof+0.022
 hate+0.021
 client+0.020
Neg logit contributions
der-0.014
rier-0.014
lights-0.013
lic-0.011
lessly-0.011
Token there
Token ID612
Feature activation-2.15e-02

Pos logit contributions
e+0.055
 moderation+0.052
oof+0.051
 hate+0.050
 client+0.046
Neg logit contributions
der-0.045
rier-0.043
lights-0.041
lic-0.037
lessly-0.037
Token was
Token ID373
Feature activation+3.93e-02

Pos logit contributions
der+0.029
rier+0.026
lights+0.025
lic+0.023
boards+0.022
Neg logit contributions
e-0.052
 moderation-0.052
oof-0.051
 hate-0.049
 client-0.048
Token a
Token ID257
Feature activation+3.24e-02

Pos logit contributions
 client+0.104
beh+0.103
cle+0.100
illas+0.098
 courtroom+0.097
Neg logit contributions
der-0.037
lic-0.026
lights-0.026
rier-0.024
 lid-0.023
Token bird
Token ID6512
Feature activation+2.26e-02

Pos logit contributions
e+0.079
 moderation+0.076
oof+0.075
 hate+0.071
beh+0.070
Neg logit contributions
der-0.045
rier-0.045
lights-0.039
lic-0.038
lessly-0.034
Token named
Token ID3706
Feature activation+5.76e-02

Pos logit contributions
e+0.033
 moderation+0.031
oof+0.030
 hate+0.029
 client+0.027
Neg logit contributions
der-0.054
rier-0.052
lights-0.050
lic-0.047
lessly-0.047
Token Twe
Token ID24205
Feature activation+1.20e-01

Pos logit contributions
e+0.110
 moderation+0.108
 hate+0.104
 client+0.101
atherine+0.098
Neg logit contributions
der-0.103
rier-0.101
lic-0.092
lights-0.092
ly-0.085
Tokenety
Token ID2963
Feature activation+4.85e-02

Pos logit contributions
e+0.155
 moderation+0.140
 hate+0.139
 client+0.128
 quarrel+0.126
Neg logit contributions
der-0.288
rier-0.269
ly-0.263
ows-0.253
lic-0.250
Token.
Token ID13
Feature activation+1.70e-02

Pos logit contributions
e+0.129
 moderation+0.128
oof+0.124
 hate+0.122
atherine+0.118
Neg logit contributions
der-0.057
rier-0.050
lights-0.046
lessly-0.040
lic-0.040
Token Twe
Token ID24205
Feature activation+1.01e-01

Pos logit contributions
e+0.016
oof+0.016
 moderation+0.016
 hate+0.015
 client+0.011
Neg logit contributions
der-0.048
rier-0.047
lights-0.047
lic-0.044
ows-0.043
Tokenety
Token ID2963
Feature activation+3.19e-02

Pos logit contributions
e+0.119
 moderation+0.108
 hate+0.100
 client+0.094
oof+0.094
Neg logit contributions
der-0.267
rier-0.253
lights-0.243
ly-0.241
lic-0.240
Token lived
Token ID5615
Feature activation-3.24e-03

Pos logit contributions
e+0.060
 moderation+0.059
oof+0.057
 hate+0.055
 client+0.052
Neg logit contributions
der-0.061
rier-0.058
lights-0.056
lic-0.052
lessly-0.051
Token was
Token ID373
Feature activation+3.84e-02

Pos logit contributions
der+0.028
rier+0.025
lights+0.025
lic+0.022
boards+0.022
Neg logit contributions
e-0.048
 moderation-0.047
oof-0.046
 hate-0.044
 client-0.043
Token a
Token ID257
Feature activation+3.34e-02

Pos logit contributions
 client+0.100
beh+0.100
cle+0.097
illas+0.095
 harm+0.095
Neg logit contributions
der-0.037
lic-0.027
lights-0.026
boards-0.026
rier-0.024
Token little
Token ID1310
Feature activation+3.81e-02

Pos logit contributions
e+0.082
 moderation+0.079
oof+0.078
 hate+0.074
beh+0.073
Neg logit contributions
der-0.046
rier-0.045
lights-0.039
lic-0.039
ly-0.035
Token bird
Token ID6512
Feature activation+2.50e-02

Pos logit contributions
e+0.075
 moderation+0.071
oof+0.070
 hate+0.067
 supposed+0.063
Neg logit contributions
der-0.073
rier-0.070
lights-0.066
lic-0.061
lessly-0.061
Token named
Token ID3706
Feature activation+5.88e-02

Pos logit contributions
e+0.034
 moderation+0.034
oof+0.032
 hate+0.031
atherine+0.028
Neg logit contributions
der-0.061
rier-0.058
lights-0.057
lic-0.053
lessly-0.053
Token Twe
Token ID24205
Feature activation+1.20e-01

Pos logit contributions
e+0.110
 moderation+0.108
 hate+0.104
 client+0.102
atherine+0.098
Neg logit contributions
der-0.109
rier-0.105
lic-0.096
lights-0.096
ly-0.090
Tokenety
Token ID2963
Feature activation+5.00e-02

Pos logit contributions
e+0.160
 hate+0.148
 moderation+0.143
 client+0.137
 quarrel+0.135
Neg logit contributions
der-0.280
ly-0.252
rier-0.250
ows-0.249
lic-0.247
Token.
Token ID13
Feature activation+1.54e-02

Pos logit contributions
e+0.134
 moderation+0.134
oof+0.130
 hate+0.127
atherine+0.124
Neg logit contributions
der-0.058
rier-0.049
lights-0.045
lessly-0.040
lic-0.039
Token Twe
Token ID24205
Feature activation+9.92e-02

Pos logit contributions
e+0.015
 moderation+0.014
oof+0.014
 hate+0.013
 client+0.010
Neg logit contributions
der-0.044
rier-0.043
lights-0.042
lic-0.040
lessly-0.040
Tokenety
Token ID2963
Feature activation+3.08e-02

Pos logit contributions
e+0.123
 moderation+0.111
 hate+0.106
 client+0.102
oof+0.096
Neg logit contributions
der-0.252
rier-0.232
lic-0.229
lights-0.227
ly-0.226
Token was
Token ID373
Feature activation-1.02e-02

Pos logit contributions
 moderation+0.058
e+0.058
oof+0.056
 hate+0.054
 client+0.052
Neg logit contributions
der-0.058
rier-0.054
lights-0.053
lic-0.049
lessly-0.049
Token a
Token ID257
Feature activation+3.15e-02

Pos logit contributions
 client+0.094
beh+0.094
cle+0.092
atherine+0.089
illas+0.089
Neg logit contributions
der-0.038
lights-0.027
lic-0.027
rier-0.026
lessly-0.025
Token curious
Token ID11040
Feature activation+1.03e-02

Pos logit contributions
e+0.077
 moderation+0.074
oof+0.073
 hate+0.069
beh+0.069
Neg logit contributions
der-0.043
rier-0.043
lights-0.037
lic-0.036
ly-0.033
Token little
Token ID1310
Feature activation+3.46e-02

Pos logit contributions
e+0.020
 moderation+0.020
oof+0.019
 hate+0.018
atherine+0.017
Neg logit contributions
der-0.020
rier-0.019
lights-0.018
lic-0.016
lessly-0.016
Token bird
Token ID6512
Feature activation+2.63e-02

Pos logit contributions
e+0.060
 moderation+0.057
oof+0.055
 hate+0.053
 client+0.049
Neg logit contributions
der-0.074
rier-0.072
lights-0.069
lic-0.063
lessly-0.063
Token named
Token ID3706
Feature activation+5.74e-02

Pos logit contributions
e+0.035
 moderation+0.035
oof+0.033
 hate+0.032
atherine+0.030
Neg logit contributions
der-0.064
rier-0.061
lights-0.060
lic-0.056
lessly-0.056
Token Twe
Token ID24205
Feature activation+1.19e-01

Pos logit contributions
e+0.105
 moderation+0.102
 hate+0.098
 client+0.094
oof+0.090
Neg logit contributions
der-0.112
rier-0.107
lic-0.100
lights-0.099
ly-0.094
Tokenety
Token ID2963
Feature activation+5.06e-02

Pos logit contributions
e+0.143
 hate+0.136
 moderation+0.133
 client+0.128
 courtroom+0.123
Neg logit contributions
der-0.297
rier-0.279
ows-0.267
ly-0.266
lic-0.264
Token.
Token ID13
Feature activation+9.85e-03

Pos logit contributions
 moderation+0.137
e+0.136
oof+0.132
 hate+0.130
atherine+0.126
Neg logit contributions
der-0.057
rier-0.048
lights-0.045
lessly-0.039
lic-0.038
Token Twe
Token ID24205
Feature activation+9.78e-02

Pos logit contributions
 moderation+0.009
oof+0.009
e+0.009
 hate+0.009
Jerry+0.006
Neg logit contributions
der-0.028
rier-0.027
lights-0.027
lic-0.025
lessly-0.025
Tokenety
Token ID2963
Feature activation+2.96e-02

Pos logit contributions
e+0.109
 moderation+0.100
 hate+0.096
 client+0.093
oof+0.087
Neg logit contributions
der-0.259
rier-0.246
lights-0.237
lic-0.232
ly-0.231
Token loved
Token ID6151
Feature activation+2.24e-02

Pos logit contributions
 moderation+0.055
e+0.054
oof+0.052
 hate+0.050
 client+0.048
Neg logit contributions
der-0.058
rier-0.052
lights-0.052
lic-0.049
lessly-0.048
Token named
Token ID3706
Feature activation+3.88e-02

Pos logit contributions
e+0.019
 moderation+0.018
oof+0.017
 hate+0.016
 client+0.014
Neg logit contributions
der-0.043
rier-0.041
lights-0.040
lic-0.038
lessly-0.038
Token Jerry
Token ID13075
Feature activation+6.82e-02

Pos logit contributions
e+0.071
 moderation+0.068
 hate+0.063
oof+0.062
cle+0.059
Neg logit contributions
der-0.077
rier-0.074
lic-0.067
ly-0.064
lights-0.064
Token.
Token ID13
Feature activation+6.89e-03

Pos logit contributions
e+0.178
 moderation+0.173
oof+0.169
 hate+0.165
 client+0.157
Neg logit contributions
der-0.087
rier-0.080
lights-0.073
lic-0.065
lessly-0.064
Token (newline)
Token ID198
Feature activation-3.25e-03

Pos logit contributions
e+0.014
 moderation+0.013
oof+0.013
 hate+0.013
 client+0.012
Neg logit contributions
der-0.013
rier-0.012
lights-0.012
lic-0.011
lessly-0.011
Token (newline)
Token ID198
Feature activation-3.23e-03

Pos logit contributions
der+0.005
rier+0.005
lights+0.005
lic+0.005
lessly+0.004
Neg logit contributions
e-0.007
 moderation-0.007
oof-0.007
 hate-0.006
 client-0.006
TokenTwe
Token ID32665
Feature activation+1.19e-01

Pos logit contributions
der+0.008
rier+0.008
lights+0.008
lic+0.007
lessly+0.007
Neg logit contributions
e-0.004
 moderation-0.004
oof-0.004
 hate-0.004
 client-0.003
Tokenety
Token ID2963
Feature activation+2.02e-02

Pos logit contributions
e+0.162
 moderation+0.154
 hate+0.149
 client+0.133
 mischief+0.130
Neg logit contributions
der-0.278
rier-0.254
ly-0.253
lic-0.251
ows-0.247
Token flew
Token ID13112
Feature activation-3.41e-02

Pos logit contributions
e+0.039
 moderation+0.038
oof+0.037
 hate+0.035
 client+0.033
Neg logit contributions
der-0.039
rier-0.037
lights-0.035
lic-0.033
lessly-0.032
Token to
Token ID284
Feature activation+2.03e-02

Pos logit contributions
der+0.066
rier+0.061
lights+0.058
boards+0.052
lic+0.050
Neg logit contributions
e-0.070
oof-0.065
 moderation-0.063
 hate-0.062
atherine-0.058
Token Jerry
Token ID13075
Feature activation+5.54e-02

Pos logit contributions
e+0.044
oof+0.039
 moderation+0.039
 hate+0.039
beh+0.039
Neg logit contributions
der-0.033
rier-0.032
lights-0.027
ly-0.026
lic-0.026
Token's
Token ID338
Feature activation-1.77e-02

Pos logit contributions
 moderation+0.127
oof+0.126
e+0.125
 hate+0.120
 client+0.116
Neg logit contributions
der-0.081
rier-0.071
lights-0.065
ly-0.064
ows-0.063
Token he
Token ID339
Feature activation-3.40e-02

Pos logit contributions
der+0.065
rier+0.063
lights+0.060
lic+0.054
lessly+0.054
Neg logit contributions
e-0.071
 moderation-0.068
oof-0.066
 hate-0.064
 client-0.060
Token felt
Token ID2936
Feature activation-4.61e-02

Pos logit contributions
der+0.076
rier+0.075
lights+0.071
lessly+0.066
lic+0.065
Neg logit contributions
e-0.057
 moderation-0.051
oof-0.050
 hate-0.050
 client-0.045
Token a
Token ID257
Feature activation-1.15e-02

Pos logit contributions
der+0.105
lights+0.103
rier+0.102
lic+0.095
boards+0.093
Neg logit contributions
e-0.076
 moderation-0.071
 hate-0.070
oof-0.059
 supposed-0.058
Token fl
Token ID781
Feature activation-3.96e-02

Pos logit contributions
lights+0.027
der+0.026
boards+0.025
rier+0.025
ows+0.024
Neg logit contributions
 hate-0.018
e-0.018
 moderation-0.016
 panic-0.014
 matter-0.014
Tokenapping
Token ID5912
Feature activation-4.72e-02

Pos logit contributions
der+0.086
lights+0.086
boards+0.083
rier+0.071
lic+0.070
Neg logit contributions
e-0.070
 hate-0.059
oof-0.055
 moderation-0.050
 supposed-0.047
Token sensation
Token ID18098
Feature activation-1.09e-01

Pos logit contributions
der+0.091
rier+0.087
lights+0.082
lic+0.081
boards+0.080
Neg logit contributions
e-0.095
 hate-0.084
 moderation-0.080
oof-0.078
 quarrel-0.073
Token in
Token ID287
Feature activation+2.04e-02

Pos logit contributions
lights+0.206
rier+0.204
der+0.201
lic+0.195
boards+0.192
Neg logit contributions
e-0.210
 hate-0.159
 supposed-0.153
aster-0.146
 harm-0.135
Token the
Token ID262
Feature activation-2.99e-02

Pos logit contributions
e+0.036
 moderation+0.034
oof+0.033
 hate+0.032
 client+0.029
Neg logit contributions
der-0.044
rier-0.042
lights-0.040
lic-0.038
boards-0.037
Token air
Token ID1633
Feature activation-6.24e-02

Pos logit contributions
der+0.052
rier+0.049
lights+0.047
lic+0.043
lessly+0.042
Neg logit contributions
e-0.066
 moderation-0.062
oof-0.059
 hate-0.059
 client-0.055
Token.
Token ID13
Feature activation-1.10e-02

Pos logit contributions
der+0.111
lights+0.110
rier+0.109
lic+0.107
boards+0.097
Neg logit contributions
e-0.127
 hate-0.110
 moderation-0.106
 quarrel-0.098
 supposed-0.096
Token It
Token ID632
Feature activation-3.95e-02

Pos logit contributions
der+0.015
rier+0.014
lights+0.013
lic+0.011
lessly+0.011
Neg logit contributions
e-0.028
 moderation-0.027
oof-0.027
 hate-0.026
 client-0.025
Token,
Token ID11
Feature activation-3.76e-02

Pos logit contributions
der+0.062
rier+0.061
lights+0.058
lic+0.053
lessly+0.051
Neg logit contributions
e-0.097
 moderation-0.093
 hate-0.089
oof-0.089
 client-0.084
Token Elsa
Token ID19226
Feature activation+1.28e-02

Pos logit contributions
der+0.078
lights+0.076
rier+0.075
lic+0.072
boards+0.069
Neg logit contributions
e-0.066
 moderation-0.064
 hate-0.059
oof-0.057
 client-0.055
Token felt
Token ID2936
Feature activation-4.90e-02

Pos logit contributions
e+0.023
 moderation+0.021
oof+0.021
 hate+0.020
 client+0.019
Neg logit contributions
der-0.027
rier-0.026
lights-0.025
lessly-0.023
lic-0.023
Token a
Token ID257
Feature activation-1.86e-02

Pos logit contributions
der+0.107
rier+0.106
lights+0.105
lic+0.096
boards+0.094
Neg logit contributions
e-0.082
 moderation-0.080
 hate-0.076
oof-0.069
 supposed-0.066
Token strange
Token ID6283
Feature activation-2.99e-02

Pos logit contributions
lights+0.043
der+0.042
rier+0.041
boards+0.040
ows+0.039
Neg logit contributions
e-0.028
 hate-0.028
 moderation-0.026
oof-0.023
 matter-0.022
Token sensation
Token ID18098
Feature activation-1.03e-01

Pos logit contributions
lights+0.062
rier+0.061
der+0.060
boards+0.059
lic+0.058
Neg logit contributions
e-0.052
 hate-0.051
 moderation-0.046
oof-0.044
 quarrel-0.043
Token in
Token ID287
Feature activation+2.31e-02

Pos logit contributions
rier+0.205
lights+0.191
der+0.190
lic+0.178
boards+0.174
Neg logit contributions
e-0.208
 hate-0.169
aster-0.162
 supposed-0.156
 moderation-0.143
Token the
Token ID262
Feature activation-2.48e-02

Pos logit contributions
e+0.037
 moderation+0.035
oof+0.034
 hate+0.032
 client+0.030
Neg logit contributions
der-0.053
rier-0.051
lights-0.049
lessly-0.045
lic-0.045
Token water
Token ID1660
Feature activation-3.77e-02

Pos logit contributions
der+0.052
rier+0.050
lights+0.046
ows+0.044
lic+0.044
Neg logit contributions
e-0.045
 hate-0.041
 moderation-0.041
oof-0.040
 quarrel-0.035
Token.
Token ID13
Feature activation-2.51e-02

Pos logit contributions
rier+0.068
lights+0.068
der+0.067
lic+0.065
ows+0.057
Neg logit contributions
e-0.073
 hate-0.064
 moderation-0.061
 quarrel-0.057
aster-0.056
Token What
Token ID1867
Feature activation-4.23e-03

Pos logit contributions
lights+0.049
rier+0.046
der+0.044
lic+0.043
ier+0.042
Neg logit contributions
 moderation-0.046
e-0.044
 hate-0.042
 client-0.040
oof-0.039
TokenSuddenly
Token ID38582
Feature activation-3.54e-02

Pos logit contributions
der+0.036
rier+0.035
lights+0.033
lic+0.032
lessly+0.032
Neg logit contributions
e-0.024
 moderation-0.023
oof-0.023
 hate-0.021
 client-0.020
Token,
Token ID11
Feature activation-3.51e-02

Pos logit contributions
der+0.061
rier+0.058
lights+0.055
lic+0.051
lessly+0.050
Neg logit contributions
e-0.076
 hate-0.069
 moderation-0.069
oof-0.067
cle-0.064
Token he
Token ID339
Feature activation-3.06e-02

Pos logit contributions
der+0.066
rier+0.063
lights+0.060
lic+0.056
ows+0.055
Neg logit contributions
e-0.069
 moderation-0.066
 hate-0.064
oof-0.063
 client-0.058
Token felt
Token ID2936
Feature activation-3.74e-02

Pos logit contributions
rier+0.076
der+0.075
lights+0.071
lessly+0.066
ows+0.064
Neg logit contributions
e-0.046
 moderation-0.038
 hate-0.038
oof-0.038
cle-0.033
Token a
Token ID257
Feature activation-1.33e-02

Pos logit contributions
der+0.085
lights+0.084
rier+0.083
lic+0.076
boards+0.076
Neg logit contributions
e-0.061
 moderation-0.058
 hate-0.055
oof-0.051
 supposed-0.047
Token sensation
Token ID18098
Feature activation-1.01e-01

Pos logit contributions
lights+0.033
der+0.031
boards+0.030
rier+0.030
ows+0.029
Neg logit contributions
e-0.019
 hate-0.019
 moderation-0.016
 panic-0.016
 matter-0.015
Token.
Token ID13
Feature activation-1.86e-02

Pos logit contributions
rier+0.199
lights+0.197
der+0.195
boards+0.182
lic+0.177
Neg logit contributions
e-0.200
 hate-0.157
aster-0.151
 supposed-0.142
 harm-0.130
Token He
Token ID679
Feature activation-3.34e-02

Pos logit contributions
der+0.027
lights+0.027
rier+0.027
lic+0.024
lessly+0.023
Neg logit contributions
e-0.040
 moderation-0.039
 hate-0.038
oof-0.037
 client-0.036
Token began
Token ID2540
Feature activation-1.69e-02

Pos logit contributions
rier+0.065
der+0.064
lights+0.060
lessly+0.054
boards+0.053
Neg logit contributions
e-0.067
 hate-0.059
oof-0.059
 moderation-0.059
atherine-0.053
Token to
Token ID284
Feature activation+2.44e-02

Pos logit contributions
der+0.029
rier+0.028
lights+0.026
lic+0.024
lessly+0.024
Neg logit contributions
e-0.037
 moderation-0.035
oof-0.034
 hate-0.034
 client-0.031
Token slide
Token ID10649
Feature activation+5.75e-03

Pos logit contributions
oof+0.042
e+0.042
 moderation+0.041
atherine+0.038
cle+0.036
Neg logit contributions
der-0.050
rier-0.047
lights-0.043
ly-0.042
lessly-0.041
Token and
Token ID290
Feature activation-1.95e-02

Pos logit contributions
e+0.025
 moderation+0.024
oof+0.024
 hate+0.023
 client+0.022
Neg logit contributions
der-0.024
rier-0.023
lights-0.022
lic-0.020
lessly-0.020
Token Tom
Token ID4186
Feature activation+2.39e-02

Pos logit contributions
der+0.038
rier+0.035
lights+0.033
ly+0.032
atively+0.032
Neg logit contributions
oof-0.035
 moderation-0.035
e-0.034
 hate-0.031
 supposed-0.030
Token experienced
Token ID5924
Feature activation-3.52e-02

Pos logit contributions
e+0.043
 hate+0.038
 moderation+0.038
oof+0.038
atherine+0.034
Neg logit contributions
der-0.051
rier-0.051
lights-0.048
lic-0.043
boards-0.043
Token a
Token ID257
Feature activation-3.55e-03

Pos logit contributions
der+0.075
rier+0.074
lights+0.071
lic+0.067
boards+0.066
Neg logit contributions
e-0.061
 hate-0.053
 moderation-0.052
oof-0.050
 supposed-0.048
Token strange
Token ID6283
Feature activation-2.28e-02

Pos logit contributions
der+0.006
rier+0.006
lights+0.006
lessly+0.005
lic+0.005
Neg logit contributions
e-0.007
 moderation-0.007
oof-0.007
 hate-0.007
atherine-0.006
Token sensation
Token ID18098
Feature activation-9.94e-02

Pos logit contributions
der+0.048
rier+0.046
lights+0.044
lic+0.041
boards+0.041
Neg logit contributions
e-0.041
 moderation-0.038
 hate-0.037
oof-0.037
 client-0.034
Token they
Token ID484
Feature activation-2.07e-02

Pos logit contributions
rier+0.188
der+0.171
lic+0.161
boards+0.159
lights+0.156
Neg logit contributions
e-0.200
 hate-0.163
 supposed-0.151
aster-0.149
 harm-0.144
Token had
Token ID550
Feature activation-1.92e-02

Pos logit contributions
der+0.040
rier+0.039
lights+0.036
lic+0.033
boards+0.033
Neg logit contributions
e-0.041
 moderation-0.037
 hate-0.037
oof-0.037
 client-0.034
Token never
Token ID1239
Feature activation+1.49e-02

Pos logit contributions
der+0.039
rier+0.039
lights+0.037
lic+0.033
lessly+0.033
Neg logit contributions
e-0.035
 moderation-0.033
oof-0.032
 hate-0.032
 supposed-0.029
Token felt
Token ID2936
Feature activation-1.03e-02

Pos logit contributions
e+0.021
 moderation+0.020
oof+0.019
 hate+0.018
 client+0.016
Neg logit contributions
der-0.037
rier-0.035
lights-0.034
lic-0.033
lessly-0.032
Token before
Token ID878
Feature activation+3.15e-02

Pos logit contributions
der+0.020
rier+0.019
lights+0.019
lic+0.017
boards+0.017
Neg logit contributions
e-0.020
 moderation-0.019
 hate-0.018
oof-0.018
 supposed-0.017
Token,
Token ID11
Feature activation-2.93e-02

Pos logit contributions
der+0.048
rier+0.046
lights+0.042
lic+0.040
lessly+0.039
Neg logit contributions
e-0.071
 moderation-0.066
oof-0.065
 hate-0.064
 client-0.060
Token they
Token ID484
Feature activation-2.67e-02

Pos logit contributions
der+0.055
rier+0.053
lights+0.051
lic+0.046
lessly+0.046
Neg logit contributions
e-0.058
 moderation-0.056
oof-0.055
 hate-0.053
 client-0.049
Token felt
Token ID2936
Feature activation-3.45e-02

Pos logit contributions
rier+0.075
der+0.073
lights+0.067
lessly+0.063
rily+0.061
Neg logit contributions
e-0.034
 hate-0.026
 moderation-0.025
 seen-0.023
 quarrel-0.023
Token a
Token ID257
Feature activation-9.84e-03

Pos logit contributions
der+0.080
rier+0.077
lights+0.075
lic+0.072
boards+0.070
Neg logit contributions
e-0.056
 moderation-0.052
 hate-0.050
oof-0.045
 supposed-0.041
Token strange
Token ID6283
Feature activation-2.60e-02

Pos logit contributions
lights+0.022
der+0.022
boards+0.021
rier+0.021
lic+0.020
Neg logit contributions
e-0.016
 hate-0.015
 moderation-0.013
 panic-0.013
 quarrel-0.012
Token sensation
Token ID18098
Feature activation-9.92e-02

Pos logit contributions
der+0.054
rier+0.054
lights+0.052
boards+0.051
lic+0.050
Neg logit contributions
e-0.047
 hate-0.044
 moderation-0.039
oof-0.039
 quarrel-0.038
Token.
Token ID13
Feature activation-5.72e-03

Pos logit contributions
rier+0.199
der+0.188
lic+0.177
boards+0.175
lights+0.174
Neg logit contributions
e-0.194
 hate-0.145
aster-0.144
 supposed-0.131
 harm-0.123
Token They
Token ID1119
Feature activation-2.69e-02

Pos logit contributions
der+0.010
rier+0.010
lights+0.009
lessly+0.008
lic+0.008
Neg logit contributions
e-0.012
 moderation-0.012
oof-0.011
 hate-0.011
 client-0.010
Token felt
Token ID2936
Feature activation-3.67e-02

Pos logit contributions
rier+0.059
der+0.057
lights+0.054
lic+0.049
boards+0.048
Neg logit contributions
e-0.050
 hate-0.044
 moderation-0.042
oof-0.039
 quarrel-0.038
Token light
Token ID1657
Feature activation-3.53e-02

Pos logit contributions
der+0.084
rier+0.081
lights+0.080
lic+0.078
boards+0.076
Neg logit contributions
e-0.060
 moderation-0.055
 hate-0.054
oof-0.047
 supposed-0.044
Token and
Token ID290
Feature activation+1.22e-02

Pos logit contributions
der+0.064
lic+0.064
boards+0.062
rier+0.058
lessly+0.056
Neg logit contributions
e-0.064
 hate-0.052
 harm-0.047
 matter-0.046
 quarrel-0.045
Token felt
Token ID2936
Feature activation-4.21e-02

Pos logit contributions
der+0.059
rier+0.058
lights+0.054
lic+0.050
boards+0.049
Neg logit contributions
e-0.055
 moderation-0.050
oof-0.049
 hate-0.048
 client-0.044
Token a
Token ID257
Feature activation-1.11e-02

Pos logit contributions
der+0.093
lights+0.088
rier+0.087
boards+0.081
lic+0.079
Neg logit contributions
e-0.076
 moderation-0.070
 hate-0.068
oof-0.063
 supposed-0.060
Token strange
Token ID6283
Feature activation-3.03e-02

Pos logit contributions
der+0.024
lights+0.022
rier+0.022
boards+0.022
ows+0.020
Neg logit contributions
e-0.020
 hate-0.018
 moderation-0.018
oof-0.017
 matter-0.016
Token t
Token ID256
Feature activation+4.57e-03

Pos logit contributions
der+0.062
rier+0.060
lights+0.059
lic+0.056
boards+0.055
Neg logit contributions
e-0.055
 hate-0.052
 moderation-0.051
oof-0.048
 quarrel-0.045
Tokeningly
Token ID4420
Feature activation-3.76e-02

Pos logit contributions
e+0.009
oof+0.008
 hate+0.008
 moderation+0.008
cle+0.008
Neg logit contributions
der-0.009
lights-0.008
rier-0.008
boards-0.007
lic-0.007
Token sensation
Token ID18098
Feature activation-9.82e-02

Pos logit contributions
lights+0.074
der+0.074
rier+0.072
lic+0.070
boards+0.067
Neg logit contributions
e-0.071
 hate-0.066
 moderation-0.065
oof-0.059
 supposed-0.057
Token and
Token ID290
Feature activation+7.20e-03

Pos logit contributions
der+0.173
rier+0.173
lic+0.170
lights+0.164
boards+0.159
Neg logit contributions
e-0.192
 hate-0.151
 supposed-0.145
aster-0.133
 moderation-0.129
Token when
Token ID618
Feature activation-1.29e-03

Pos logit contributions
e+0.013
 hate+0.012
 moderation+0.012
oof+0.010
 quarrel+0.010
Neg logit contributions
rier-0.015
der-0.014
lights-0.014
lic-0.013
boards-0.013
Token it
Token ID340
Feature activation-2.76e-02

Pos logit contributions
der+0.002
rier+0.002
lights+0.002
lessly+0.002
boards+0.002
Neg logit contributions
e-0.003
oof-0.003
 moderation-0.003
 hate-0.003
atherine-0.003
Token stopped
Token ID5025
Feature activation-9.60e-03

Pos logit contributions
rier+0.059
der+0.058
lights+0.052
lessly+0.049
ows+0.048
Neg logit contributions
e-0.052
 hate-0.043
 moderation-0.041
oof-0.040
 quarrel-0.039
Token they
Token ID484
Feature activation+2.56e-03

Pos logit contributions
der+0.017
rier+0.015
lights+0.015
lic+0.015
boards+0.014
Neg logit contributions
e-0.021
 moderation-0.020
 hate-0.019
oof-0.018
cle-0.017
Token and
Token ID290
Feature activation-1.74e-02

Pos logit contributions
e+0.034
 moderation+0.031
oof+0.030
 hate+0.030
atherine+0.028
Neg logit contributions
der-0.030
rier-0.029
lights-0.028
lessly-0.025
lic-0.025
Token Ben
Token ID3932
Feature activation+1.78e-02

Pos logit contributions
der+0.033
rier+0.029
lights+0.027
ly+0.026
lessly+0.026
Neg logit contributions
oof-0.034
e-0.033
 moderation-0.033
 hate-0.030
 supposed-0.030
Token felt
Token ID2936
Feature activation-4.45e-02

Pos logit contributions
e+0.036
 moderation+0.033
oof+0.033
 hate+0.032
atherine+0.029
Neg logit contributions
der-0.034
rier-0.033
lights-0.031
lic-0.028
lessly-0.028
Token a
Token ID257
Feature activation-8.62e-03

Pos logit contributions
der+0.097
lights+0.091
rier+0.090
boards+0.085
lic+0.083
Neg logit contributions
e-0.081
 hate-0.073
 moderation-0.072
oof-0.067
 supposed-0.064
Token funny
Token ID8258
Feature activation-7.90e-03

Pos logit contributions
der+0.018
rier+0.017
lights+0.017
boards+0.016
lic+0.015
Neg logit contributions
e-0.016
 hate-0.015
 moderation-0.014
oof-0.014
 matter-0.012
Token sensation
Token ID18098
Feature activation-9.80e-02

Pos logit contributions
der+0.015
rier+0.015
lights+0.015
boards+0.014
lic+0.014
Neg logit contributions
e-0.015
 hate-0.014
 moderation-0.013
oof-0.013
 quarrel-0.012
Token in
Token ID287
Feature activation+2.81e-02

Pos logit contributions
boards+0.187
rier+0.182
lic+0.182
der+0.181
lights+0.178
Neg logit contributions
e-0.190
 hate-0.149
 supposed-0.142
aster-0.137
 harm-0.127
Token their
Token ID511
Feature activation-6.69e-03

Pos logit contributions
e+0.046
 moderation+0.043
oof+0.042
 hate+0.040
 client+0.037
Neg logit contributions
der-0.064
rier-0.061
lights-0.059
lic-0.055
lessly-0.055
Token stomach
Token ID11384
Feature activation-1.36e-02

Pos logit contributions
der+0.012
lights+0.011
rier+0.011
ly+0.010
 lid+0.010
Neg logit contributions
 moderation-0.013
e-0.013
oof-0.013
 hate-0.012
 supposed-0.012
Tokens
Token ID82
Feature activation-3.86e-03

Pos logit contributions
lights+0.026
rier+0.026
der+0.025
boards+0.024
lic+0.023
Neg logit contributions
e-0.023
 hate-0.021
 quarrel-0.019
 panic-0.019
 harm-0.018
Token.
Token ID13
Feature activation-3.05e-03

Pos logit contributions
lic+0.007
der+0.006
rier+0.006
lights+0.006
boards+0.006
Neg logit contributions
e-0.008
 hate-0.007
 moderation-0.007
oof-0.006
 quarrel-0.006
Token the
Token ID262
Feature activation-6.12e-02

Pos logit contributions
lights+0.089
rier+0.084
der+0.084
lic+0.078
boards+0.076
Neg logit contributions
e-0.086
 hate-0.077
 moderation-0.075
oof-0.071
 client-0.066
Token camera
Token ID4676
Feature activation-5.05e-02

Pos logit contributions
der+0.120
rier+0.116
lights+0.110
lessly+0.101
lic+0.100
Neg logit contributions
e-0.116
 moderation-0.110
oof-0.108
 hate-0.103
atherine-0.097
Token up
Token ID510
Feature activation-1.98e-02

Pos logit contributions
rier+0.082
der+0.081
lights+0.078
lic+0.071
lessly+0.069
Neg logit contributions
e-0.111
 hate-0.100
oof-0.100
 moderation-0.099
 supposed-0.093
Token to
Token ID284
Feature activation+8.49e-03

Pos logit contributions
der+0.032
rier+0.030
lights+0.028
lic+0.026
lessly+0.025
Neg logit contributions
e-0.045
 moderation-0.043
oof-0.042
 hate-0.041
atherine-0.038
Token her
Token ID607
Feature activation-1.45e-02

Pos logit contributions
 moderation+0.016
e+0.016
oof+0.015
 hate+0.014
beh+0.014
Neg logit contributions
der-0.016
rier-0.015
lights-0.014
lessly-0.013
ly-0.013
Token eye
Token ID4151
Feature activation-9.76e-02

Pos logit contributions
der+0.023
rier+0.022
lights+0.020
lessly+0.018
lic+0.018
Neg logit contributions
e-0.033
 moderation-0.032
oof-0.031
 hate-0.030
atherine-0.029
Token,
Token ID11
Feature activation-3.19e-02

Pos logit contributions
lights+0.160
der+0.159
rier+0.154
lic+0.145
lessly+0.141
Neg logit contributions
e-0.210
 moderation-0.180
 hate-0.175
oof-0.175
cle-0.165
Token the
Token ID262
Feature activation-6.31e-02

Pos logit contributions
der+0.056
rier+0.054
lights+0.052
lic+0.047
boards+0.047
Neg logit contributions
e-0.067
 moderation-0.063
 hate-0.062
oof-0.062
 client-0.056
Token park
Token ID3952
Feature activation-1.62e-02

Pos logit contributions
der+0.129
rier+0.124
lights+0.119
lic+0.110
lessly+0.109
Neg logit contributions
e-0.115
 moderation-0.109
oof-0.107
 hate-0.103
 client-0.095
Token seemed
Token ID3947
Feature activation-4.49e-02

Pos logit contributions
der+0.030
rier+0.029
lights+0.027
lic+0.025
lessly+0.025
Neg logit contributions
e-0.033
oof-0.030
 moderation-0.030
 hate-0.030
 client-0.027
Token magical
Token ID10883
Feature activation-3.24e-02

Pos logit contributions
der+0.088
rier+0.083
lights+0.082
boards+0.075
lic+0.074
Neg logit contributions
e-0.083
 hate-0.080
 moderation-0.079
oof-0.074
 supposed-0.073
Token a
Token ID257
Feature activation-8.82e-03

Pos logit contributions
der+0.091
lights+0.088
rier+0.085
boards+0.084
lic+0.080
Neg logit contributions
e-0.063
 moderation-0.059
 hate-0.057
 supposed-0.048
oof-0.046
Token pleasant
Token ID15497
Feature activation+7.69e-04

Pos logit contributions
der+0.019
lights+0.018
rier+0.018
boards+0.017
ows+0.016
Neg logit contributions
e-0.016
 hate-0.015
 moderation-0.015
oof-0.013
 matter-0.013
Token t
Token ID256
Feature activation+7.19e-03

Pos logit contributions
e+0.001
 hate+0.001
 moderation+0.001
 matter+0.001
 panic+0.001
Neg logit contributions
boards-0.002
rier-0.002
der-0.002
lights-0.002
lic-0.002
Tokening
Token ID278
Feature activation-3.73e-02

Pos logit contributions
e+0.014
oof+0.012
 hate+0.012
 moderation+0.011
atherine+0.011
Neg logit contributions
der-0.014
lights-0.013
rier-0.013
boards-0.013
lic-0.012
Tokenling
Token ID1359
Feature activation-2.87e-02

Pos logit contributions
lic+0.073
boards+0.070
der+0.068
ows+0.066
rier+0.065
Neg logit contributions
e-0.072
 hate-0.065
 matter-0.059
 moderation-0.059
 harm-0.057
Token sensation
Token ID18098
Feature activation-9.74e-02

Pos logit contributions
lic+0.057
rier+0.054
der+0.052
boards+0.052
lights+0.052
Neg logit contributions
e-0.054
 hate-0.049
 moderation-0.046
 matter-0.045
 harm-0.043
Token throughout
Token ID3690
Feature activation+2.29e-02

Pos logit contributions
rier+0.186
lic+0.182
der+0.178
lights+0.177
 lid+0.173
Neg logit contributions
e-0.187
 hate-0.139
 supposed-0.135
aster-0.134
 ever-0.121
Token his
Token ID465
Feature activation+1.14e-02

Pos logit contributions
e+0.036
 moderation+0.034
 hate+0.030
oof+0.029
 generation+0.027
Neg logit contributions
der-0.055
rier-0.053
lights-0.051
lic-0.050
ows-0.049
Token body
Token ID1767
Feature activation-3.16e-02

Pos logit contributions
e+0.018
oof+0.017
 moderation+0.017
 hate+0.016
atherine+0.015
Neg logit contributions
der-0.025
rier-0.024
lights-0.023
ly-0.022
lessly-0.021
Token.
Token ID13
Feature activation-8.79e-03

Pos logit contributions
rier+0.057
lights+0.055
der+0.054
lic+0.053
boards+0.048
Neg logit contributions
e-0.067
 moderation-0.057
 hate-0.057
 supposed-0.053
 seen-0.049
Token The
Token ID383
Feature activation-4.72e-02

Pos logit contributions
rier+0.015
der+0.015
lights+0.014
lic+0.013
lessly+0.013
Neg logit contributions
e-0.018
 moderation-0.018
oof-0.017
 hate-0.017
 client-0.016
Token and
Token ID290
Feature activation-1.73e-02

Pos logit contributions
e+0.017
 moderation+0.016
oof+0.016
 hate+0.015
atherine+0.014
Neg logit contributions
der-0.015
rier-0.015
lights-0.014
lic-0.012
lessly-0.012
Token Tom
Token ID4186
Feature activation+3.31e-02

Pos logit contributions
der+0.031
rier+0.027
lights+0.025
ly+0.024
lessly+0.023
Neg logit contributions
oof-0.036
e-0.035
 moderation-0.034
 supposed-0.032
vous-0.031
Token felt
Token ID2936
Feature activation-4.35e-02

Pos logit contributions
e+0.065
 moderation+0.060
oof+0.059
 hate+0.058
 client+0.053
Neg logit contributions
der-0.065
rier-0.063
lights-0.060
lic-0.054
lessly-0.054
Token a
Token ID257
Feature activation-8.44e-03

Pos logit contributions
der+0.095
lights+0.090
rier+0.089
boards+0.084
lic+0.081
Neg logit contributions
e-0.079
 hate-0.072
 moderation-0.070
oof-0.065
 supposed-0.061
Token strange
Token ID6283
Feature activation-2.60e-02

Pos logit contributions
der+0.017
rier+0.016
lights+0.016
boards+0.016
lic+0.015
Neg logit contributions
e-0.016
 hate-0.014
 moderation-0.014
oof-0.014
 supposed-0.012
Token sensation
Token ID18098
Feature activation-9.73e-02

Pos logit contributions
rier+0.053
lights+0.053
der+0.053
boards+0.052
lic+0.051
Neg logit contributions
e-0.046
 hate-0.044
 moderation-0.040
oof-0.039
 quarrel-0.038
Token in
Token ID287
Feature activation+3.01e-02

Pos logit contributions
rier+0.184
der+0.180
boards+0.179
lic+0.179
lights+0.172
Neg logit contributions
e-0.188
 hate-0.149
 supposed-0.140
aster-0.137
 harm-0.128
Token their
Token ID511
Feature activation-8.39e-03

Pos logit contributions
e+0.052
 moderation+0.049
oof+0.049
 hate+0.046
 client+0.043
Neg logit contributions
der-0.065
rier-0.062
lights-0.060
lessly-0.055
lic-0.054
Token stomach
Token ID11384
Feature activation-1.41e-02

Pos logit contributions
der+0.015
lights+0.014
rier+0.014
ly+0.012
 lid+0.012
Neg logit contributions
 moderation-0.017
e-0.017
oof-0.016
 hate-0.015
 supposed-0.015
Tokens
Token ID82
Feature activation-7.29e-03

Pos logit contributions
rier+0.027
lights+0.026
der+0.025
lic+0.024
atively+0.024
Neg logit contributions
e-0.025
 hate-0.022
 quarrel-0.021
 panic-0.020
 harm-0.020
Token.
Token ID13
Feature activation-4.03e-03

Pos logit contributions
lic+0.012
rier+0.012
der+0.012
lights+0.011
boards+0.010
Neg logit contributions
e-0.016
 hate-0.014
 moderation-0.014
oof-0.013
 quarrel-0.012
Token she
Token ID673
Feature activation-2.12e-02

Pos logit contributions
der+0.018
rier+0.017
lights+0.016
lic+0.015
lessly+0.014
Neg logit contributions
e-0.025
 moderation-0.024
oof-0.024
 hate-0.023
 client-0.022
Token unsc
Token ID28594
Feature activation-3.96e-02

Pos logit contributions
der+0.041
rier+0.040
lights+0.038
lic+0.034
lessly+0.034
Neg logit contributions
e-0.041
 moderation-0.039
oof-0.038
 hate-0.037
 client-0.034
Tokenrew
Token ID1809
Feature activation-3.65e-02

Pos logit contributions
der+0.081
rier+0.079
lights+0.073
lessly+0.068
lic+0.067
Neg logit contributions
e-0.072
 moderation-0.071
oof-0.068
 hate-0.066
 client-0.060
Tokened
Token ID276
Feature activation-4.64e-02

Pos logit contributions
der+0.060
rier+0.055
lights+0.054
lic+0.048
boards+0.046
Neg logit contributions
e-0.085
 moderation-0.073
 hate-0.073
oof-0.072
atherine-0.070
Token the
Token ID262
Feature activation-6.65e-02

Pos logit contributions
der+0.098
rier+0.093
lights+0.092
lic+0.086
boards+0.083
Neg logit contributions
e-0.083
 hate-0.073
 moderation-0.070
oof-0.069
 client-0.064
Token lid
Token ID19789
Feature activation-9.73e-02

Pos logit contributions
der+0.129
lights+0.121
rier+0.120
lic+0.111
lessly+0.107
Neg logit contributions
e-0.128
 moderation-0.122
 hate-0.117
oof-0.116
 client-0.110
Token and
Token ID290
Feature activation-1.11e-02

Pos logit contributions
der+0.185
lights+0.182
rier+0.182
lic+0.171
arl+0.150
Neg logit contributions
e-0.161
 hate-0.125
 seen-0.114
 been-0.110
 quarrel-0.108
Token unveiled
Token ID18475
Feature activation-4.10e-02

Pos logit contributions
der+0.024
rier+0.023
lights+0.022
lic+0.020
lessly+0.020
Neg logit contributions
e-0.020
 moderation-0.018
 hate-0.017
oof-0.017
atherine-0.016
Token a
Token ID257
Feature activation-2.27e-02

Pos logit contributions
der+0.088
lights+0.086
rier+0.086
lic+0.081
boards+0.080
Neg logit contributions
e-0.064
 moderation-0.059
 hate-0.056
oof-0.056
 client-0.051
Token mysterious
Token ID11428
Feature activation-4.80e-02

Pos logit contributions
der+0.044
rier+0.042
lights+0.040
lessly+0.036
boards+0.036
Neg logit contributions
e-0.044
 moderation-0.042
oof-0.041
 hate-0.040
 client-0.037
Token message
Token ID3275
Feature activation-7.80e-02

Pos logit contributions
der+0.103
rier+0.098
lights+0.096
lessly+0.088
lic+0.088
Neg logit contributions
e-0.082
 moderation-0.077
oof-0.076
 hate-0.076
 client-0.068
Token,
Token ID11
Feature activation-2.59e-02

Pos logit contributions
der+0.054
rier+0.053
lights+0.052
lic+0.049
lessly+0.045
Neg logit contributions
e-0.048
 hate-0.041
 moderation-0.038
 disagree-0.038
 quarrel-0.037
Token he
Token ID339
Feature activation-1.18e-02

Pos logit contributions
der+0.039
rier+0.037
lights+0.035
lic+0.032
lessly+0.031
Neg logit contributions
e-0.062
 moderation-0.060
 hate-0.057
oof-0.057
 client-0.053
Token felt
Token ID2936
Feature activation-3.57e-02

Pos logit contributions
der+0.024
rier+0.024
lights+0.023
lessly+0.021
lic+0.020
Neg logit contributions
e-0.022
 moderation-0.020
oof-0.020
 hate-0.020
 client-0.018
Token a
Token ID257
Feature activation-7.32e-03

Pos logit contributions
lights+0.082
der+0.081
rier+0.078
boards+0.075
lic+0.074
Neg logit contributions
e-0.058
 moderation-0.056
 hate-0.055
oof-0.046
 supposed-0.046
Token strange
Token ID6283
Feature activation-2.15e-02

Pos logit contributions
lights+0.016
der+0.016
boards+0.015
rier+0.015
ows+0.014
Neg logit contributions
e-0.012
 hate-0.012
 moderation-0.011
 panic-0.010
 matter-0.010
Token sensation
Token ID18098
Feature activation-9.66e-02

Pos logit contributions
lights+0.045
der+0.044
rier+0.043
boards+0.041
lic+0.040
Neg logit contributions
e-0.038
 hate-0.037
 moderation-0.035
oof-0.033
 quarrel-0.031
Token,
Token ID11
Feature activation-3.00e-02

Pos logit contributions
rier+0.193
lights+0.189
der+0.185
lic+0.175
boards+0.173
Neg logit contributions
e-0.188
 hate-0.144
aster-0.143
 supposed-0.140
 moderation-0.121
Token like
Token ID588
Feature activation+2.31e-02

Pos logit contributions
rier+0.058
lights+0.057
der+0.056
lic+0.053
ows+0.051
Neg logit contributions
e-0.057
 moderation-0.054
 hate-0.054
oof-0.047
 supposed-0.047
Token something
Token ID1223
Feature activation-1.55e-02

Pos logit contributions
e+0.041
 moderation+0.039
 hate+0.037
oof+0.036
 supposed+0.033
Neg logit contributions
rier-0.047
der-0.047
lights-0.045
lic-0.042
ows-0.041
Token powerful
Token ID3665
Feature activation-3.19e-02

Pos logit contributions
rier+0.033
lights+0.032
der+0.032
boards+0.027
lic+0.027
Neg logit contributions
 hate-0.027
e-0.027
 supposed-0.024
 moderation-0.023
oof-0.022
Token was
Token ID373
Feature activation-7.53e-03

Pos logit contributions
rier+0.063
lights+0.059
lic+0.055
der+0.053
lessly+0.049
Neg logit contributions
e-0.061
 hate-0.058
 supposed-0.054
 harm-0.052
oof-0.050
Token,
Token ID11
Feature activation-2.93e-02

Pos logit contributions
der+0.046
rier+0.045
lights+0.043
lic+0.038
lessly+0.038
Neg logit contributions
e-0.069
 moderation-0.064
oof-0.063
 hate-0.063
 client-0.057
Token it
Token ID340
Feature activation-3.88e-02

Pos logit contributions
der+0.059
rier+0.056
lights+0.053
lessly+0.049
lic+0.049
Neg logit contributions
e-0.055
 moderation-0.053
oof-0.052
 hate-0.050
atherine-0.046
Token felt
Token ID2936
Feature activation-3.34e-02

Pos logit contributions
rier+0.083
der+0.079
lights+0.076
lessly+0.071
ows+0.068
Neg logit contributions
e-0.071
 hate-0.060
oof-0.059
 moderation-0.059
 quarrel-0.055
Token a
Token ID257
Feature activation-9.17e-03

Pos logit contributions
lights+0.074
der+0.074
rier+0.073
lic+0.066
boards+0.066
Neg logit contributions
 moderation-0.055
e-0.054
 hate-0.051
oof-0.046
 supposed-0.043
Token burning
Token ID9482
Feature activation-3.93e-02

Pos logit contributions
lights+0.020
der+0.019
rier+0.018
boards+0.018
ows+0.018
Neg logit contributions
 hate-0.016
e-0.015
 moderation-0.015
oof-0.013
 matter-0.013
Token sensation
Token ID18098
Feature activation-9.54e-02

Pos logit contributions
der+0.081
rier+0.078
lights+0.075
lic+0.075
boards+0.072
Neg logit contributions
e-0.074
 moderation-0.063
 hate-0.063
oof-0.059
 matter-0.057
Token and
Token ID290
Feature activation+8.12e-03

Pos logit contributions
rier+0.195
lights+0.179
der+0.178
lic+0.167
boards+0.160
Neg logit contributions
e-0.188
 supposed-0.143
 hate-0.142
aster-0.141
 moderation-0.124
Token started
Token ID2067
Feature activation-1.11e-02

Pos logit contributions
e+0.014
 moderation+0.013
 hate+0.012
oof+0.012
 quarrel+0.011
Neg logit contributions
rier-0.017
der-0.017
lights-0.016
lic-0.015
ows-0.015
Token to
Token ID284
Feature activation+3.17e-02

Pos logit contributions
der+0.020
rier+0.019
lights+0.018
lessly+0.016
lic+0.016
Neg logit contributions
e-0.023
 moderation-0.022
 hate-0.021
oof-0.021
 supposed-0.020
Token jump
Token ID4391
Feature activation-1.89e-02

Pos logit contributions
e+0.059
oof+0.056
 moderation+0.056
 hate+0.051
atherine+0.051
Neg logit contributions
der-0.064
rier-0.060
lights-0.057
lessly-0.052
lic-0.052
Token around
Token ID1088
Feature activation+2.36e-02

Pos logit contributions
der+0.036
lights+0.036
rier+0.036
lic+0.032
boards+0.031
Neg logit contributions
e-0.034
 hate-0.030
oof-0.026
 quarrel-0.026
 disagree-0.025
Token friendly
Token ID8030
Feature activation-1.19e-02

Pos logit contributions
der+0.071
lights+0.065
rier+0.064
ows+0.060
boards+0.059
Neg logit contributions
e-0.060
 hate-0.055
 moderation-0.055
oof-0.053
 client-0.050
Token voice
Token ID3809
Feature activation-5.93e-02

Pos logit contributions
der+0.026
lights+0.026
rier+0.024
boards+0.024
ows+0.023
Neg logit contributions
e-0.019
oof-0.017
 hate-0.017
 moderation-0.017
 judge-0.016
Token from
Token ID422
Feature activation-1.81e-02

Pos logit contributions
der+0.137
rier+0.136
lights+0.128
boards+0.118
lessly+0.112
Neg logit contributions
e-0.099
 hate-0.080
oof-0.077
 harm-0.071
cle-0.071
Token the
Token ID262
Feature activation-3.62e-02

Pos logit contributions
der+0.037
rier+0.035
lights+0.034
boards+0.032
lic+0.032
Neg logit contributions
e-0.034
 moderation-0.031
oof-0.031
 hate-0.030
 client-0.028
Token reef
Token ID25088
Feature activation-2.83e-02

Pos logit contributions
der+0.078
ows+0.072
lights+0.072
lic+0.069
rier+0.067
Neg logit contributions
 moderation-0.060
e-0.058
 hate-0.055
 generation-0.054
 judge-0.053
Token replied
Token ID8712
Feature activation-9.42e-02

Pos logit contributions
der+0.051
rier+0.050
lights+0.048
lessly+0.042
lic+0.042
Neg logit contributions
e-0.058
 hate-0.050
 moderation-0.050
oof-0.049
 quarrel-0.046
Token back
Token ID736
Feature activation-5.48e-03

Pos logit contributions
lights+0.216
der+0.186
rier+0.181
boards+0.170
lessly+0.166
Neg logit contributions
e-0.158
 hate-0.111
oof-0.108
 judge-0.107
 courtroom-0.106
Token:
Token ID25
Feature activation-2.41e-02

Pos logit contributions
rier+0.009
der+0.008
lights+0.008
boards+0.008
lic+0.007
Neg logit contributions
e-0.012
oof-0.011
 hate-0.011
 moderation-0.011
 client-0.011
Token
Token ID6184
Feature activation+2.54e-02

Pos logit contributions
der+0.037
lights+0.036
rier+0.036
ly+0.030
boards+0.030
Neg logit contributions
e-0.056
 moderation-0.052
oof-0.052
 hate-0.051
 client-0.047
Token
Token ID95
Feature activation+1.55e-02

Pos logit contributions
e+0.048
oof+0.043
 moderation+0.042
 hate+0.041
atherine+0.039
Neg logit contributions
der-0.052
lights-0.050
rier-0.049
lessly-0.045
boards-0.044
Token
Token ID26391
Feature activation+4.07e-02

Pos logit contributions
e+0.026
 moderation+0.025
oof+0.024
 hate+0.023
 client+0.021
Neg logit contributions
der-0.034
rier-0.033
lights-0.032
lic-0.030
lessly-0.029
Token,
Token ID11
Feature activation-3.81e-02

Pos logit contributions
der+0.066
rier+0.065
lights+0.060
lic+0.054
lessly+0.053
Neg logit contributions
e-0.088
 moderation-0.079
oof-0.078
 hate-0.078
 client-0.072
Token the
Token ID262
Feature activation-5.42e-02

Pos logit contributions
der+0.081
rier+0.078
lights+0.074
lic+0.068
lessly+0.068
Neg logit contributions
e-0.068
 moderation-0.064
oof-0.062
 hate-0.060
 client-0.055
Token fairy
Token ID25607
Feature activation-2.55e-02

Pos logit contributions
der+0.105
rier+0.097
lights+0.096
lic+0.086
ows+0.085
Neg logit contributions
 moderation-0.100
e-0.100
 hate-0.096
 client-0.095
 judge-0.088
Token pointed
Token ID6235
Feature activation-5.66e-02

Pos logit contributions
der+0.055
rier+0.054
lights+0.051
lessly+0.047
boards+0.045
Neg logit contributions
e-0.047
oof-0.039
 hate-0.039
 moderation-0.038
atherine-0.036
Token her
Token ID607
Feature activation-4.56e-02

Pos logit contributions
der+0.114
lights+0.108
rier+0.103
lic+0.100
boards+0.096
Neg logit contributions
e-0.107
 hate-0.096
oof-0.092
 moderation-0.091
 seen-0.084
Token wand
Token ID11569
Feature activation-9.30e-02

Pos logit contributions
der+0.082
rier+0.076
lights+0.074
lic+0.069
boards+0.066
Neg logit contributions
e-0.094
 moderation-0.087
 hate-0.087
oof-0.082
 client-0.082
Token his
Token ID465
Feature activation-5.42e-02

Pos logit contributions
rier+0.161
lights+0.160
der+0.159
lic+0.151
boards+0.138
Neg logit contributions
e-0.192
 hate-0.161
 supposed-0.152
 moderation-0.148
oof-0.147
Token way
Token ID835
Feature activation-2.79e-02

Pos logit contributions
der+0.082
lights+0.076
rier+0.075
lic+0.070
boards+0.066
Neg logit contributions
e-0.126
 moderation-0.118
 hate-0.117
oof-0.115
 client-0.112
Token,
Token ID11
Feature activation-3.77e-02

Pos logit contributions
der+0.057
lic+0.052
rier+0.051
lights+0.049
boards+0.048
Neg logit contributions
e-0.049
 hate-0.045
 supposed-0.044
 moderation-0.043
oof-0.042
Token and
Token ID290
Feature activation-1.66e-02

Pos logit contributions
der+0.069
rier+0.067
lights+0.066
lic+0.059
boards+0.059
Neg logit contributions
e-0.076
 hate-0.069
 moderation-0.069
oof-0.067
 supposed-0.063
Token with
Token ID351
Feature activation+1.40e-03

Pos logit contributions
der+0.035
rier+0.034
lights+0.033
boards+0.030
ows+0.029
Neg logit contributions
e-0.029
 moderation-0.025
oof-0.025
 hate-0.025
atherine-0.023
Token bird
Token ID6512
Feature activation-7.96e-03

Pos logit contributions
der+0.122
rier+0.119
lights+0.112
lessly+0.104
ly+0.102
Neg logit contributions
e-0.096
 moderation-0.091
oof-0.090
 hate-0.084
atherine-0.077
Token looked
Token ID3114
Feature activation-6.77e-02

Pos logit contributions
rier+0.016
der+0.016
lights+0.014
boards+0.014
lic+0.013
Neg logit contributions
e-0.016
 hate-0.014
oof-0.013
 moderation-0.013
atherine-0.012
Token at
Token ID379
Feature activation-4.31e-03

Pos logit contributions
der+0.137
rier+0.125
boards+0.125
lights+0.124
lic+0.119
Neg logit contributions
e-0.117
 hate-0.117
oof-0.101
 afar-0.099
 moderation-0.099
Token her
Token ID607
Feature activation-1.72e-02

Pos logit contributions
der+0.010
rier+0.009
lights+0.009
lessly+0.008
lic+0.008
Neg logit contributions
e-0.007
 moderation-0.007
oof-0.006
 hate-0.006
 client-0.006
Token in
Token ID287
Feature activation+1.16e-02

Pos logit contributions
der+0.027
rier+0.026
lights+0.024
lic+0.022
lessly+0.022
Neg logit contributions
e-0.040
 moderation-0.038
oof-0.038
 hate-0.037
 client-0.034
Token amaz
Token ID40642
Feature activation-9.16e-02

Pos logit contributions
e+0.018
 moderation+0.017
oof+0.017
 hate+0.016
 client+0.014
Neg logit contributions
der-0.027
rier-0.026
lights-0.025
lic-0.024
boards-0.023
Tokenement
Token ID972
Feature activation-2.20e-02

Pos logit contributions
lic+0.151
ows+0.143
der+0.142
lessly+0.140
rier+0.138
Neg logit contributions
e-0.176
pa-0.150
 moderation-0.145
oof-0.138
atherine-0.135
Token,
Token ID11
Feature activation-3.96e-02

Pos logit contributions
rier+0.038
lic+0.037
boards+0.035
der+0.034
lights+0.034
Neg logit contributions
e-0.044
 hate-0.040
oof-0.037
 moderation-0.036
 quarrel-0.035
Token and
Token ID290
Feature activation-1.03e-02

Pos logit contributions
lights+0.071
rier+0.071
der+0.070
boards+0.067
lic+0.065
Neg logit contributions
e-0.076
 hate-0.071
 moderation-0.070
oof-0.067
 afar-0.063
Token said
Token ID531
Feature activation-7.13e-02

Pos logit contributions
der+0.021
rier+0.020
lights+0.020
boards+0.019
ows+0.018
Neg logit contributions
e-0.019
 moderation-0.016
 hate-0.016
oof-0.016
atherine-0.015
Token:
Token ID25
Feature activation-9.56e-03

Pos logit contributions
lights+0.144
der+0.125
boards+0.123
rier+0.118
ows+0.112
Neg logit contributions
e-0.134
 hate-0.115
oof-0.108
 moderation-0.103
 disagree-0.101
Token found
Token ID1043
Feature activation-2.96e-02

Pos logit contributions
der+0.017
rier+0.017
lights+0.016
lessly+0.015
lic+0.015
Neg logit contributions
e-0.014
 moderation-0.013
oof-0.012
 hate-0.012
 client-0.011
Token a
Token ID257
Feature activation-2.36e-02

Pos logit contributions
der+0.071
lights+0.067
rier+0.066
lic+0.064
lessly+0.059
Neg logit contributions
e-0.048
 moderation-0.043
oof-0.038
 hate-0.037
 supposed-0.035
Token magical
Token ID10883
Feature activation-3.97e-02

Pos logit contributions
der+0.046
lights+0.043
rier+0.043
lessly+0.038
boards+0.038
Neg logit contributions
e-0.045
 moderation-0.044
 hate-0.042
oof-0.042
 client-0.039
Token o
Token ID267
Feature activation+3.25e-03

Pos logit contributions
der+0.083
rier+0.079
lights+0.077
lic+0.072
lessly+0.072
Neg logit contributions
e-0.071
 moderation-0.068
 hate-0.064
oof-0.063
 client-0.058
Tokenint
Token ID600
Feature activation-4.95e-02

Pos logit contributions
e+0.005
 moderation+0.005
oof+0.004
aster+0.004
cle+0.004
Neg logit contributions
der-0.008
rier-0.008
lights-0.007
lessly-0.007
rily-0.007
Tokenment
Token ID434
Feature activation-9.13e-02

Pos logit contributions
der+0.074
lights+0.070
rier+0.069
lic+0.063
boards+0.058
Neg logit contributions
e-0.116
 seen-0.090
 moderation-0.089
oof-0.088
 hate-0.087
Token.
Token ID13
Feature activation-1.34e-02

Pos logit contributions
lights+0.161
rier+0.158
lic+0.158
der+0.156
stals+0.130
Neg logit contributions
e-0.188
 moderation-0.157
 seen-0.144
oof-0.139
 hate-0.138
Token He
Token ID679
Feature activation-3.51e-02

Pos logit contributions
der+0.019
rier+0.017
lights+0.016
lic+0.015
lessly+0.015
Neg logit contributions
e-0.033
 moderation-0.032
oof-0.031
 hate-0.031
 supposed-0.029
Token touched
Token ID12615
Feature activation-3.09e-02

Pos logit contributions
der+0.068
rier+0.067
lights+0.063
lessly+0.057
lic+0.056
Neg logit contributions
e-0.070
 moderation-0.064
oof-0.063
 hate-0.062
atherine-0.057
Token it
Token ID340
Feature activation-3.95e-02

Pos logit contributions
der+0.076
lights+0.074
rier+0.073
lic+0.070
boards+0.068
Neg logit contributions
e-0.050
 moderation-0.039
oof-0.035
 hate-0.033
 afar-0.031
Token and
Token ID290
Feature activation-5.58e-03

Pos logit contributions
lights+0.076
rier+0.076
der+0.075
lic+0.073
boards+0.066
Neg logit contributions
e-0.076
 hate-0.062
 moderation-0.060
oof-0.056
 practice-0.056
Token in
Token ID287
Feature activation-4.90e-03

Pos logit contributions
der+0.088
rier+0.084
lic+0.083
lights+0.080
boards+0.078
Neg logit contributions
e-0.089
 moderation-0.086
 hate-0.083
oof-0.080
 supposed-0.072
Token a
Token ID257
Feature activation-2.25e-02

Pos logit contributions
der+0.010
rier+0.010
lights+0.009
lessly+0.009
lic+0.008
Neg logit contributions
e-0.009
 moderation-0.008
oof-0.008
 hate-0.008
 client-0.007
Token blink
Token ID21019
Feature activation-5.64e-02

Pos logit contributions
der+0.044
rier+0.041
lights+0.040
lic+0.037
lessly+0.036
Neg logit contributions
e-0.043
 moderation-0.042
oof-0.040
 hate-0.040
 client-0.037
Token of
Token ID286
Feature activation-2.49e-03

Pos logit contributions
rier+0.091
der+0.089
lic+0.087
boards+0.084
lights+0.083
Neg logit contributions
e-0.116
 hate-0.098
 moderation-0.097
oof-0.089
 supposed-0.084
Token an
Token ID281
Feature activation-2.87e-02

Pos logit contributions
der+0.005
rier+0.005
lights+0.005
lic+0.005
boards+0.005
Neg logit contributions
e-0.004
 moderation-0.004
oof-0.004
 hate-0.004
 client-0.003
Token eye
Token ID4151
Feature activation-9.08e-02

Pos logit contributions
der+0.053
rier+0.050
lights+0.049
lic+0.045
boards+0.045
Neg logit contributions
e-0.058
 moderation-0.055
oof-0.054
 hate-0.053
 client-0.049
Token.
Token ID13
Feature activation-1.67e-02

Pos logit contributions
lights+0.153
rier+0.149
der+0.149
lic+0.140
boards+0.127
Neg logit contributions
e-0.194
 moderation-0.162
 hate-0.160
oof-0.154
 supposed-0.153
Token She
Token ID1375
Feature activation-3.30e-02

Pos logit contributions
der+0.026
rier+0.024
lights+0.022
lic+0.021
lessly+0.021
Neg logit contributions
e-0.039
 moderation-0.037
oof-0.036
 hate-0.036
 supposed-0.034
Token could
Token ID714
Feature activation-2.71e-02

Pos logit contributions
rier+0.065
der+0.065
lights+0.061
lic+0.056
lessly+0.056
Neg logit contributions
e-0.064
 moderation-0.057
 hate-0.057
oof-0.056
 client-0.051
Token see
Token ID766
Feature activation+3.17e-03

Pos logit contributions
der+0.065
rier+0.063
lights+0.061
boards+0.057
lic+0.057
Neg logit contributions
e-0.042
 moderation-0.037
 hate-0.037
oof-0.036
 client-0.032
Token cars
Token ID5006
Feature activation+3.84e-03

Pos logit contributions
e+0.006
 moderation+0.005
 hate+0.005
oof+0.005
 client+0.005
Neg logit contributions
der-0.007
rier-0.006
lights-0.006
lic-0.006
boards-0.006
Token \n
Token ID198
Feature activation-1.21e-02

Pos logit contributions
der+0.016
lic+0.015
rier+0.015
lights+0.015
lessly+0.014
Neg logit contributions
e-0.029
 moderation-0.029
 hate-0.028
 client-0.028
 supposed-0.028
TokenThe
Token ID464
Feature activation-5.80e-02

Pos logit contributions
lic+0.018
der+0.018
lessly+0.017
rier+0.016
boards+0.016
Neg logit contributions
 supposed-0.024
 moderation-0.024
 hate-0.023
 client-0.023
e-0.023
Token fairy
Token ID25607
Feature activation-2.13e-02

Pos logit contributions
der+0.113
rier+0.108
lights+0.102
lic+0.094
lessly+0.094
Neg logit contributions
e-0.112
 moderation-0.107
oof-0.104
 hate-0.101
atherine-0.093
Token waved
Token ID26834
Feature activation-4.35e-02

Pos logit contributions
der+0.040
rier+0.039
lights+0.037
lessly+0.033
boards+0.033
Neg logit contributions
e-0.043
 moderation-0.039
oof-0.039
 hate-0.038
atherine-0.035
Token her
Token ID607
Feature activation-4.19e-02

Pos logit contributions
der+0.096
lights+0.093
rier+0.088
boards+0.088
lic+0.087
Neg logit contributions
e-0.071
 hate-0.062
oof-0.060
 moderation-0.056
atherine-0.052
Token wand
Token ID11569
Feature activation-9.07e-02

Pos logit contributions
der+0.064
lights+0.060
rier+0.059
lic+0.058
boards+0.056
Neg logit contributions
e-0.099
oof-0.088
 hate-0.088
 moderation-0.088
 client-0.082
Token and
Token ID290
Feature activation-1.27e-02

Pos logit contributions
rier+0.161
lights+0.160
lic+0.158
der+0.157
boards+0.149
Neg logit contributions
e-0.179
 hate-0.146
 quarrel-0.131
oof-0.131
 moderation-0.129
Token said
Token ID531
Feature activation-7.73e-02

Pos logit contributions
rier+0.027
der+0.026
lights+0.025
boards+0.023
lessly+0.022
Neg logit contributions
e-0.023
oof-0.020
 moderation-0.020
 hate-0.019
atherine-0.018
Token,
Token ID11
Feature activation-4.37e-02

Pos logit contributions
lights+0.153
der+0.129
rier+0.128
boards+0.124
lessly+0.119
Neg logit contributions
e-0.147
 hate-0.121
oof-0.118
 client-0.109
 disagree-0.109
Token "
Token ID366
Feature activation+1.67e-02

Pos logit contributions
lights+0.075
der+0.069
rier+0.069
boards+0.063
ier+0.060
Neg logit contributions
e-0.087
 hate-0.080
oof-0.078
 moderation-0.077
 disagree-0.075
TokenSh
Token ID2484
Feature activation+2.44e-02

Pos logit contributions
e+0.028
 moderation+0.028
 hate+0.026
oof+0.026
 client+0.026
Neg logit contributions
der-0.035
rier-0.032
lessly-0.031
lic-0.031
lights-0.031
Token,
Token ID11
Feature activation-4.76e-02

Pos logit contributions
der+0.080
rier+0.076
lights+0.072
lic+0.066
boards+0.065
Neg logit contributions
e-0.103
 moderation-0.096
oof-0.094
 hate-0.093
 client-0.086
Token the
Token ID262
Feature activation-6.26e-02

Pos logit contributions
der+0.087
rier+0.083
lights+0.079
lic+0.072
lessly+0.072
Neg logit contributions
e-0.098
 moderation-0.093
oof-0.091
 hate-0.088
 client-0.083
Token fairy
Token ID25607
Feature activation-2.56e-02

Pos logit contributions
der+0.121
rier+0.116
lights+0.110
lic+0.101
lessly+0.101
Neg logit contributions
e-0.122
 moderation-0.116
oof-0.113
 hate-0.109
 client-0.102
Token waved
Token ID26834
Feature activation-5.29e-02

Pos logit contributions
der+0.053
rier+0.052
lights+0.049
boards+0.045
lessly+0.045
Neg logit contributions
e-0.048
oof-0.042
 hate-0.041
 moderation-0.041
 client-0.037
Token her
Token ID607
Feature activation-4.52e-02

Pos logit contributions
der+0.108
lights+0.105
boards+0.101
lic+0.099
rier+0.098
Neg logit contributions
e-0.093
 hate-0.083
oof-0.077
 moderation-0.073
 generation-0.069
Token wand
Token ID11569
Feature activation-9.06e-02

Pos logit contributions
der+0.071
lights+0.065
rier+0.064
boards+0.064
lic+0.064
Neg logit contributions
e-0.105
 hate-0.094
oof-0.093
 moderation-0.092
 client-0.087
Token once
Token ID1752
Feature activation-9.07e-03

Pos logit contributions
rier+0.160
lights+0.156
der+0.156
lic+0.152
boards+0.147
Neg logit contributions
e-0.184
 hate-0.152
oof-0.137
 moderation-0.135
 seen-0.133
Token more
Token ID517
Feature activation-1.76e-03

Pos logit contributions
der+0.017
rier+0.017
boards+0.017
lic+0.016
lights+0.016
Neg logit contributions
e-0.019
 moderation-0.016
 hate-0.015
oof-0.015
 harm-0.015
Token and
Token ID290
Feature activation-7.59e-03

Pos logit contributions
der+0.004
rier+0.003
lights+0.003
boards+0.003
lic+0.003
Neg logit contributions
e-0.003
 hate-0.003
 moderation-0.003
oof-0.003
 supposed-0.003
Token the
Token ID262
Feature activation-5.69e-02

Pos logit contributions
der+0.016
rier+0.015
lights+0.014
boards+0.014
lic+0.013
Neg logit contributions
e-0.014
 moderation-0.013
oof-0.012
 hate-0.012
atherine-0.011
Token world
Token ID995
Feature activation-3.31e-02

Pos logit contributions
der+0.108
rier+0.104
lights+0.098
lessly+0.090
lic+0.089
Neg logit contributions
e-0.112
 moderation-0.108
oof-0.105
 hate-0.101
atherine-0.094
Token me
Token ID502
Feature activation+3.23e-02

Pos logit contributions
 moderation+0.081
e+0.081
oof+0.076
 supposed+0.070
atherine+0.070
Neg logit contributions
der-0.105
rier-0.099
lights-0.093
ly-0.090
lic-0.089
Token.
Token ID13
Feature activation+3.52e-02

Pos logit contributions
e+0.074
 moderation+0.073
oof+0.072
 hate+0.069
atherine+0.066
Neg logit contributions
der-0.049
rier-0.047
lights-0.043
lessly-0.039
lic-0.038
Token I
Token ID314
Feature activation+1.63e-02

Pos logit contributions
e+0.075
 moderation+0.073
oof+0.070
 hate+0.068
 supposed+0.064
Neg logit contributions
der-0.061
rier-0.057
lights-0.054
lic-0.050
lessly-0.050
Token don
Token ID836
Feature activation+1.06e-02

Pos logit contributions
 moderation+0.027
e+0.026
oof+0.025
atherine+0.023
 client+0.023
Neg logit contributions
der-0.035
rier-0.033
lights-0.031
lic-0.030
lessly-0.030
Token't
Token ID470
Feature activation+4.27e-02

Pos logit contributions
e+0.018
 moderation+0.017
oof+0.017
 hate+0.016
 client+0.015
Neg logit contributions
der-0.023
rier-0.022
lights-0.021
lic-0.020
lessly-0.020
Token like
Token ID588
Feature activation+7.18e-02

Pos logit contributions
 moderation+0.071
cle+0.068
oof+0.068
atherine+0.067
e+0.067
Neg logit contributions
der-0.077
ly-0.073
rier-0.072
lessly-0.069
lic-0.064
Token cars
Token ID5006
Feature activation+3.71e-02

Pos logit contributions
e+0.138
 moderation+0.131
oof+0.128
 hate+0.120
cle+0.119
Neg logit contributions
der-0.142
rier-0.133
lights-0.126
ly-0.116
boards-0.114
Token and
Token ID290
Feature activation+5.53e-02

Pos logit contributions
e+0.083
 moderation+0.077
 hate+0.076
oof+0.075
 client+0.068
Neg logit contributions
der-0.062
rier-0.061
lights-0.059
lic-0.055
lessly-0.053
Token trucks
Token ID13960
Feature activation+2.25e-02

Pos logit contributions
e+0.092
 moderation+0.087
oof+0.084
 hate+0.081
 client+0.075
Neg logit contributions
der-0.123
rier-0.118
lights-0.114
lic-0.106
lessly-0.105
Token,
Token ID11
Feature activation+1.17e-02

Pos logit contributions
e+0.050
 hate+0.046
 moderation+0.045
oof+0.044
 client+0.041
Neg logit contributions
der-0.037
lights-0.037
rier-0.037
lic-0.034
lessly-0.032
Token I
Token ID314
Feature activation+2.95e-02

Pos logit contributions
e+0.023
 moderation+0.022
oof+0.021
 hate+0.021
 client+0.019
Neg logit contributions
der-0.022
rier-0.021
lights-0.020
lic-0.019
lessly-0.019
Token what
Token ID644
Feature activation+1.34e-02

Pos logit contributions
der+0.092
rier+0.091
lights+0.089
lic+0.084
boards+0.081
Neg logit contributions
e-0.069
 moderation-0.065
 hate-0.062
oof-0.061
 supposed-0.056
Token it
Token ID340
Feature activation-9.16e-03

Pos logit contributions
oof+0.032
e+0.031
 moderation+0.031
 hate+0.030
atherine+0.028
Neg logit contributions
der-0.017
lights-0.016
rier-0.015
boards-0.014
ly-0.014
Token would
Token ID561
Feature activation+9.91e-03

Pos logit contributions
rier+0.018
der+0.018
lights+0.016
lessly+0.015
lic+0.014
Neg logit contributions
e-0.018
 moderation-0.017
 hate-0.017
oof-0.016
 supposed-0.015
Token be
Token ID307
Feature activation+2.69e-02

Pos logit contributions
e+0.016
 moderation+0.015
oof+0.014
 hate+0.014
 client+0.013
Neg logit contributions
der-0.023
rier-0.022
lights-0.021
lic-0.020
lessly-0.020
Token like
Token ID588
Feature activation+3.22e-02

Pos logit contributions
e+0.048
 moderation+0.045
oof+0.043
 hate+0.043
 supposed+0.042
Neg logit contributions
der-0.058
rier-0.055
lights-0.054
boards-0.049
lic-0.048
Token to
Token ID284
Feature activation+7.36e-02

Pos logit contributions
e+0.076
 moderation+0.073
oof+0.071
 hate+0.070
 client+0.066
Neg logit contributions
der-0.049
rier-0.047
lights-0.044
lic-0.039
lessly-0.039
Token sail
Token ID14936
Feature activation+5.08e-03

Pos logit contributions
 moderation+0.118
e+0.118
oof+0.113
atherine+0.107
 client+0.106
Neg logit contributions
der-0.156
rier-0.148
lights-0.144
lessly-0.133
lic-0.132
Token in
Token ID287
Feature activation+6.19e-02

Pos logit contributions
e+0.011
 moderation+0.011
oof+0.010
 hate+0.010
 supposed+0.009
Neg logit contributions
der-0.009
rier-0.008
lights-0.008
lic-0.007
boards-0.007
Token a
Token ID257
Feature activation+8.57e-03

Pos logit contributions
e+0.121
 moderation+0.115
oof+0.114
 hate+0.110
 client+0.103
Neg logit contributions
der-0.117
rier-0.113
lights-0.108
lessly-0.099
lic-0.097
Token boat
Token ID8848
Feature activation-7.16e-03

Pos logit contributions
e+0.018
 moderation+0.017
oof+0.017
 hate+0.017
 client+0.015
Neg logit contributions
der-0.015
rier-0.014
lights-0.014
lessly-0.012
lic-0.012
Token like
Token ID588
Feature activation+4.33e-02

Pos logit contributions
der+0.012
rier+0.011
lights+0.010
lic+0.009
lessly+0.009
Neg logit contributions
e-0.017
 moderation-0.015
oof-0.015
 hate-0.015
 supposed-0.014
Token you
Token ID345
Feature activation+3.37e-02

Pos logit contributions
 moderation+0.025
e+0.025
 hate+0.024
oof+0.024
 supposed+0.024
Neg logit contributions
der-0.036
rier-0.036
lights-0.033
ly-0.032
lic-0.031
Token.
Token ID13
Feature activation+2.15e-02

Pos logit contributions
e+0.068
oof+0.063
 hate+0.062
 moderation+0.060
 supposed+0.057
Neg logit contributions
der-0.060
lights-0.059
rier-0.057
lic-0.051
boards-0.051
Token I
Token ID314
Feature activation+1.30e-02

Pos logit contributions
e+0.042
 moderation+0.042
oof+0.040
 hate+0.039
 supposed+0.038
Neg logit contributions
der-0.039
rier-0.036
lights-0.034
lic-0.032
lessly-0.032
Token don
Token ID836
Feature activation+6.85e-03

Pos logit contributions
e+0.021
 moderation+0.021
oof+0.020
 client+0.019
 hate+0.018
Neg logit contributions
der-0.028
rier-0.026
lights-0.025
lic-0.024
lessly-0.024
Token't
Token ID470
Feature activation+3.66e-02

Pos logit contributions
e+0.010
 moderation+0.009
oof+0.009
 hate+0.009
 supposed+0.008
Neg logit contributions
der-0.016
rier-0.016
lights-0.015
lic-0.014
lessly-0.014
Token like
Token ID588
Feature activation+7.31e-02

Pos logit contributions
 moderation+0.056
e+0.056
oof+0.055
cle+0.054
atherine+0.053
Neg logit contributions
der-0.075
rier-0.069
lessly-0.068
ly-0.067
ows-0.064
Token brushes
Token ID36377
Feature activation-8.82e-04

Pos logit contributions
e+0.149
 moderation+0.141
oof+0.141
 hate+0.133
atherine+0.128
Neg logit contributions
der-0.136
rier-0.127
lights-0.122
lessly-0.109
lic-0.109
Token."
Token ID526
Feature activation-2.11e-03

Pos logit contributions
lic+0.002
der+0.002
lights+0.002
atively+0.002
boards+0.002
Neg logit contributions
e-0.002
 moderation-0.001
 hate-0.001
oof-0.001
 supposed-0.001
Token
Token ID220
Feature activation+1.80e-02

Pos logit contributions
der+0.004
rier+0.004
lights+0.004
lic+0.003
lessly+0.003
Neg logit contributions
e-0.004
 moderation-0.004
oof-0.004
 hate-0.004
 client-0.003
Token
Token ID198
Feature activation-1.35e-02

Pos logit contributions
e+0.033
 moderation+0.032
oof+0.031
 hate+0.030
 client+0.027
Neg logit contributions
der-0.036
rier-0.035
lights-0.033
lessly-0.030
ly-0.030
Token
Token ID198
Feature activation-1.41e-02

Pos logit contributions
der+0.024
rier+0.023
lights+0.022
lic+0.020
lessly+0.020
Neg logit contributions
e-0.028
 moderation-0.027
oof-0.026
 hate-0.026
 client-0.024
Token cat
Token ID3797
Feature activation+5.26e-02

Pos logit contributions
e+0.045
 moderation+0.043
oof+0.042
 hate+0.040
 client+0.037
Neg logit contributions
der-0.061
rier-0.058
lights-0.056
lic-0.052
lessly-0.052
Token!
Token ID0
Feature activation+3.53e-02

Pos logit contributions
e+0.126
 moderation+0.125
oof+0.120
 hate+0.118
 client+0.113
Neg logit contributions
der-0.075
rier-0.070
lights-0.065
lic-0.060
lessly-0.057
Token Do
Token ID2141
Feature activation+1.79e-02

Pos logit contributions
e+0.064
 moderation+0.063
oof+0.060
 hate+0.057
 supposed+0.054
Neg logit contributions
der-0.071
rier-0.068
lights-0.064
lic-0.062
lessly-0.060
Token you
Token ID345
Feature activation+5.31e-02

Pos logit contributions
 moderation+0.032
 generation+0.030
e+0.029
 client+0.029
 supposed+0.028
Neg logit contributions
der-0.032
ly-0.029
rier-0.028
lic-0.028
boards-0.028
Token want
Token ID765
Feature activation+4.68e-02

Pos logit contributions
cle+0.092
atherine+0.091
 client+0.091
athy+0.089
Elizabeth+0.084
Neg logit contributions
der-0.092
ly-0.090
lessly-0.086
lic-0.086
ows-0.083
Token to
Token ID284
Feature activation+7.81e-02

Pos logit contributions
e+0.118
 moderation+0.116
 hate+0.111
 client+0.110
atherine+0.110
Neg logit contributions
der-0.057
lights-0.045
rier-0.045
boards-0.044
ly-0.044
Token play
Token ID711
Feature activation+4.89e-02

Pos logit contributions
 moderation+0.115
e+0.108
 supposed+0.103
atherine+0.102
 client+0.102
Neg logit contributions
der-0.165
rier-0.151
lights-0.150
lic-0.148
ly-0.142
Token with
Token ID351
Feature activation+8.43e-02

Pos logit contributions
e+0.113
 moderation+0.110
oof+0.107
 hate+0.104
 client+0.099
Neg logit contributions
der-0.076
rier-0.072
lights-0.067
lic-0.061
lessly-0.060
Token my
Token ID616
Feature activation+2.11e-02

Pos logit contributions
 moderation+0.184
 supposed+0.180
 hate+0.176
e+0.175
oof+0.174
Neg logit contributions
der-0.119
rier-0.107
lights-0.106
ly-0.105
boards-0.099
Token toy
Token ID13373
Feature activation+1.67e-02

Pos logit contributions
e+0.046
 moderation+0.044
oof+0.041
beh+0.040
 hate+0.040
Neg logit contributions
der-0.037
rier-0.034
lights-0.032
 lid-0.030
ly-0.028
Token?"
Token ID1701
Feature activation+1.06e-02

Pos logit contributions
e+0.029
 moderation+0.028
oof+0.027
 hate+0.026
 client+0.024
Neg logit contributions
der-0.036
rier-0.034
lights-0.033
lic-0.031
lessly-0.031
Token
Token ID129
Feature activation+3.26e-02

Pos logit contributions
e+0.099
 moderation+0.098
oof+0.093
 hate+0.091
 client+0.083
Neg logit contributions
der-0.121
lights-0.113
rier-0.113
lic-0.108
ows-0.106
Token
Token ID241
Feature activation+2.59e-02

Pos logit contributions
 moderation+0.060
e+0.058
oof+0.054
 hate+0.053
 client+0.051
Neg logit contributions
der-0.064
rier-0.060
lessly-0.055
ows-0.054
lights-0.054
TokenI
Token ID40
Feature activation+1.96e-02

Pos logit contributions
e+0.047
 moderation+0.046
oof+0.045
 client+0.042
atherine+0.042
Neg logit contributions
der-0.051
lessly-0.046
rier-0.044
lic-0.043
lights-0.042
Token don
Token ID836
Feature activation+4.66e-03

Pos logit contributions
e+0.026
 moderation+0.024
oof+0.023
 hate+0.022
 client+0.020
Neg logit contributions
der-0.050
rier-0.048
lights-0.047
lic-0.044
lessly-0.044
Tokenâ
Token ID22940
Feature activation+6.56e-02

Pos logit contributions
e+0.007
oof+0.006
 moderation+0.006
 hate+0.006
 supposed+0.005
Neg logit contributions
der-0.011
rier-0.011
lights-0.011
lic-0.010
ly-0.010
Token
Token ID26391
Feature activation+7.30e-02

Pos logit contributions
 client+0.062
 moderation+0.062
oof+0.057
 hate+0.057
 nephew+0.054
Neg logit contributions
der-0.161
ly-0.157
rier-0.156
lessly-0.155
lic-0.154
Token
Token ID8151
Feature activation+4.05e-02

Pos logit contributions
 moderation+0.115
e+0.110
oof+0.107
 hate+0.105
 generation+0.098
Neg logit contributions
der-0.150
lic-0.146
lights-0.143
ows-0.141
rier-0.138
Tokent
Token ID83
Feature activation+2.31e-02

Pos logit contributions
e+0.093
 moderation+0.091
oof+0.088
 hate+0.086
 client+0.081
Neg logit contributions
der-0.063
rier-0.060
lights-0.056
lic-0.052
lessly-0.051
Token want
Token ID765
Feature activation+2.36e-02

Pos logit contributions
e+0.031
 moderation+0.031
oof+0.030
atherine+0.027
 client+0.027
Neg logit contributions
der-0.055
rier-0.051
lessly-0.050
lights-0.049
lic-0.048
Token to
Token ID284
Feature activation+5.88e-02

Pos logit contributions
e+0.055
 moderation+0.053
oof+0.051
 hate+0.050
 client+0.047
Neg logit contributions
der-0.037
rier-0.034
lights-0.033
lic-0.030
boards-0.029
Token go
Token ID467
Feature activation+4.43e-02

Pos logit contributions
e+0.093
 moderation+0.088
oof+0.085
 hate+0.081
 client+0.074
Neg logit contributions
der-0.135
rier-0.130
lights-0.125
lic-0.117
lessly-0.116
Token for
Token ID329
Feature activation+2.10e-02

Pos logit contributions
e+0.062
 moderation+0.057
oof+0.057
 hate+0.055
 client+0.052
Neg logit contributions
der-0.040
lights-0.039
rier-0.039
lic-0.036
boards-0.034
Token the
Token ID262
Feature activation-2.58e-02

Pos logit contributions
e+0.034
 moderation+0.031
oof+0.030
 hate+0.029
 client+0.026
Neg logit contributions
der-0.049
rier-0.047
lights-0.046
lic-0.042
boards-0.042
Token advice
Token ID5608
Feature activation-3.59e-02

Pos logit contributions
der+0.058
rier+0.055
lights+0.054
lic+0.050
boards+0.049
Neg logit contributions
e-0.042
 moderation-0.042
 hate-0.039
oof-0.038
 client-0.035
Token.
Token ID13
Feature activation-2.31e-02

Pos logit contributions
der+0.057
rier+0.056
lic+0.055
lights+0.055
boards+0.048
Neg logit contributions
e-0.081
 moderation-0.073
 hate-0.071
oof-0.068
 supposed-0.065
Token From
Token ID3574
Feature activation-1.67e-02

Pos logit contributions
rier+0.037
der+0.036
lights+0.036
lic+0.032
boards+0.031
Neg logit contributions
e-0.049
 moderation-0.047
oof-0.047
 hate-0.044
atherine-0.043
Token then
Token ID788
Feature activation+1.34e-03

Pos logit contributions
der+0.028
rier+0.027
lights+0.025
lic+0.024
boards+0.024
Neg logit contributions
e-0.037
oof-0.035
 moderation-0.034
 hate-0.033
atherine-0.032
Token on
Token ID319
Feature activation+1.78e-02

Pos logit contributions
e+0.003
oof+0.003
 moderation+0.003
 hate+0.003
 supposed+0.002
Neg logit contributions
der-0.002
rier-0.002
lights-0.002
lic-0.002
boards-0.002
Token,
Token ID11
Feature activation-1.66e-02

Pos logit contributions
e+0.044
 moderation+0.041
oof+0.040
 hate+0.039
 client+0.037
Neg logit contributions
der-0.026
rier-0.025
lights-0.023
lic-0.023
boards-0.021
Token the
Token ID262
Feature activation-3.61e-02

Pos logit contributions
der+0.031
rier+0.030
lights+0.028
lessly+0.026
lic+0.025
Neg logit contributions
e-0.033
oof-0.032
 moderation-0.032
 hate-0.030
 client-0.028
Token small
Token ID1402
Feature activation-8.21e-03

Pos logit contributions
der+0.083
rier+0.077
lights+0.075
ows+0.072
boards+0.070
Neg logit contributions
 moderation-0.059
e-0.055
 hate-0.055
oof-0.050
 client-0.049
Token bear
Token ID6842
Feature activation+1.08e-02

Pos logit contributions
der+0.019
lights+0.018
rier+0.018
lic+0.017
boards+0.017
Neg logit contributions
e-0.013
 moderation-0.013
 hate-0.012
oof-0.011
 client-0.011
Token saw
Token ID2497
Feature activation-3.59e-02

Pos logit contributions
e+0.005
 hate+0.004
 moderation+0.004
oof+0.004
atherine+0.004
Neg logit contributions
der-0.006
rier-0.006
lights-0.006
lessly-0.005
boards-0.005
Token him
Token ID683
Feature activation-6.67e-03

Pos logit contributions
rier+0.091
der+0.088
lights+0.086
ows+0.082
lic+0.082
Neg logit contributions
e-0.048
 moderation-0.047
 hate-0.042
oof-0.038
aster-0.037
Token from
Token ID422
Feature activation+1.09e-03

Pos logit contributions
der+0.012
rier+0.012
lights+0.011
lessly+0.010
lic+0.010
Neg logit contributions
e-0.013
 moderation-0.013
oof-0.013
 hate-0.012
 client-0.011
Token the
Token ID262
Feature activation-3.01e-02

Pos logit contributions
e+0.002
oof+0.002
 moderation+0.002
 hate+0.002
 client+0.002
Neg logit contributions
der-0.002
rier-0.002
lights-0.002
lessly-0.002
ly-0.002
Token window
Token ID4324
Feature activation-2.77e-02

Pos logit contributions
der+0.061
rier+0.058
lights+0.056
lic+0.052
lessly+0.051
Neg logit contributions
e-0.055
 moderation-0.054
 hate-0.053
oof-0.050
 client-0.047
Token and
Token ID290
Feature activation+5.77e-03

Pos logit contributions
rier+0.048
der+0.046
lic+0.043
lights+0.041
lessly+0.039
Neg logit contributions
e-0.060
 hate-0.053
 moderation-0.052
 quarrel-0.047
oof-0.046
Token said
Token ID531
Feature activation-6.35e-02

Pos logit contributions
e+0.010
 moderation+0.009
 hate+0.009
oof+0.008
 supposed+0.008
Neg logit contributions
rier-0.012
der-0.012
lights-0.012
ows-0.010
boards-0.010
Token,
Token ID11
Feature activation-4.27e-02

Pos logit contributions
lights+0.126
der+0.109
rier+0.109
lic+0.103
ows+0.101
Neg logit contributions
e-0.116
 hate-0.109
 moderation-0.099
 disagree-0.095
 lawyer-0.094
Token "
Token ID366
Feature activation+1.45e-02

Pos logit contributions
lights+0.081
rier+0.074
der+0.070
ows+0.064
ier+0.064
Neg logit contributions
e-0.076
 hate-0.072
 moderation-0.070
cle-0.068
oof-0.068
TokenJack
Token ID14295
Feature activation+3.42e-02

Pos logit contributions
e+0.023
oof+0.021
 moderation+0.021
 hate+0.020
 client+0.019
Neg logit contributions
der-0.033
rier-0.031
lights-0.030
lessly-0.030
lic-0.029
Token!
Token ID0
Feature activation+1.58e-02

Pos logit contributions
e+0.083
 moderation+0.078
oof+0.077
 hate+0.075
 supposed+0.071
Neg logit contributions
der-0.052
rier-0.049
lights-0.047
boards-0.041
lessly-0.041
Token.
Token ID13
Feature activation-7.00e-03

Pos logit contributions
e+0.060
 moderation+0.058
oof+0.057
 hate+0.055
 client+0.053
Neg logit contributions
der-0.030
rier-0.028
lights-0.026
lessly-0.023
lic-0.023
Token They
Token ID1119
Feature activation-1.33e-02

Pos logit contributions
der+0.011
lights+0.010
rier+0.010
lic+0.009
boards+0.009
Neg logit contributions
e-0.016
 moderation-0.016
oof-0.015
 hate-0.014
 supposed-0.014
Token were
Token ID547
Feature activation-2.04e-02

Pos logit contributions
der+0.023
rier+0.021
lights+0.021
lic+0.020
lessly+0.020
Neg logit contributions
e-0.027
 moderation-0.027
oof-0.026
 hate-0.024
 client-0.023
Token all
Token ID477
Feature activation+1.86e-02

Pos logit contributions
der+0.038
rier+0.037
lights+0.036
lic+0.034
lessly+0.033
Neg logit contributions
 moderation-0.039
e-0.039
oof-0.038
 hate-0.035
cle-0.033
Token very
Token ID845
Feature activation+3.58e-02

Pos logit contributions
 moderation+0.038
e+0.038
oof+0.038
 hate+0.035
 client+0.034
Neg logit contributions
rier-0.031
der-0.031
lights-0.030
lessly-0.028
lic-0.028
Token happy
Token ID3772
Feature activation-3.44e-04

Pos logit contributions
e+0.062
 moderation+0.059
oof+0.057
 hate+0.055
 client+0.051
Neg logit contributions
der-0.077
rier-0.074
lights-0.071
lic-0.066
lessly-0.066
Token.
Token ID13
Feature activation-6.25e-03

Pos logit contributions
der+0.001
rier+0.001
lights+0.000
lic+0.000
lessly+0.000
Neg logit contributions
e-0.001
 moderation-0.001
oof-0.001
 hate-0.001
 supposed-0.001
Token Amy
Token ID14235
Feature activation+2.30e-02

Pos logit contributions
der+0.010
lights+0.009
rier+0.008
boards+0.008
lic+0.008
Neg logit contributions
e-0.014
 moderation-0.014
oof-0.013
 hate-0.013
beh-0.013
Token saw
Token ID2497
Feature activation-3.02e-02

Pos logit contributions
 moderation+0.051
oof+0.048
e+0.047
 hate+0.044
 client+0.043
Neg logit contributions
der-0.038
rier-0.032
lights-0.031
ly-0.031
boards-0.031
Token a
Token ID257
Feature activation-8.82e-03

Pos logit contributions
der+0.063
rier+0.060
lights+0.058
lic+0.053
lessly+0.053
Neg logit contributions
e-0.055
 moderation-0.052
oof-0.051
 hate-0.049
 client-0.045
Token big
Token ID1263
Feature activation+3.36e-03

Pos logit contributions
der+0.015
rier+0.014
lights+0.013
lic+0.012
lessly+0.012
Neg logit contributions
e-0.019
 moderation-0.019
oof-0.018
 hate-0.018
 client-0.016
Token who
Token ID508
Feature activation+3.98e-02

Pos logit contributions
 moderation+0.065
e+0.064
oof+0.063
 hate+0.061
atherine+0.058
Neg logit contributions
der-0.053
rier-0.049
lights-0.046
lessly-0.042
ly-0.042
Token worked
Token ID3111
Feature activation-3.46e-02

Pos logit contributions
e+0.071
 moderation+0.069
oof+0.067
 hate+0.062
atherine+0.061
Neg logit contributions
der-0.082
rier-0.076
lights-0.074
lic-0.070
lessly-0.069
Token on
Token ID319
Feature activation+5.92e-02

Pos logit contributions
der+0.064
rier+0.062
lights+0.057
lic+0.053
boards+0.051
Neg logit contributions
e-0.071
 moderation-0.066
 hate-0.065
oof-0.064
 supposed-0.062
Token his
Token ID465
Feature activation+2.45e-02

Pos logit contributions
e+0.111
oof+0.108
 moderation+0.103
 hate+0.100
atherine+0.098
Neg logit contributions
der-0.114
rier-0.106
lights-0.103
boards-0.096
lessly-0.093
Token farm
Token ID5318
Feature activation+7.02e-03

Pos logit contributions
e+0.045
oof+0.044
 moderation+0.043
atherine+0.039
 hate+0.039
Neg logit contributions
der-0.047
rier-0.045
lights-0.041
ly-0.040
boards-0.039
Token everyday
Token ID10908
Feature activation+1.29e-02

Pos logit contributions
e+0.016
 moderation+0.016
oof+0.015
 hate+0.015
 client+0.014
Neg logit contributions
der-0.011
rier-0.010
lights-0.010
lic-0.009
lessly-0.009
Token.
Token ID13
Feature activation+7.17e-03

Pos logit contributions
e+0.029
 moderation+0.028
 hate+0.027
oof+0.026
 client+0.025
Neg logit contributions
der-0.020
rier-0.020
lights-0.018
lic-0.018
ows-0.017
Token Every
Token ID3887
Feature activation+1.22e-02

Pos logit contributions
e+0.019
oof+0.018
 moderation+0.018
 hate+0.017
atherine+0.017
Neg logit contributions
der-0.009
rier-0.008
lights-0.008
boards-0.007
lessly-0.007
Token day
Token ID1110
Feature activation-2.06e-02

Pos logit contributions
e+0.017
oof+0.016
 moderation+0.016
 hate+0.014
atherine+0.014
Neg logit contributions
der-0.030
rier-0.028
lights-0.027
lessly-0.026
lic-0.026
Token he
Token ID339
Feature activation+5.57e-03

Pos logit contributions
der+0.026
rier+0.024
lights+0.022
lic+0.019
lessly+0.019
Neg logit contributions
e-0.055
 moderation-0.053
oof-0.052
 hate-0.051
 client-0.048
Token would
Token ID561
Feature activation-8.78e-03

Pos logit contributions
e+0.012
 moderation+0.012
oof+0.011
 hate+0.011
atherine+0.010
Neg logit contributions
der-0.010
rier-0.009
lights-0.009
lic-0.008
lessly-0.008
Token,
Token ID11
Feature activation+2.15e-02

Pos logit contributions
der+0.014
rier+0.013
lights+0.013
boards+0.012
lic+0.012
Neg logit contributions
e-0.019
oof-0.018
 hate-0.018
 moderation-0.018
 client-0.017
Token my
Token ID616
Feature activation-4.86e-03

Pos logit contributions
e+0.037
 moderation+0.035
oof+0.034
 hate+0.034
 client+0.031
Neg logit contributions
der-0.046
rier-0.043
lights-0.043
lic-0.040
boards-0.039
Token sweet
Token ID6029
Feature activation+1.56e-02

Pos logit contributions
der+0.010
rier+0.009
lights+0.009
lic+0.008
boards+0.008
Neg logit contributions
e-0.009
 moderation-0.009
oof-0.009
 hate-0.008
 client-0.008
Token boys
Token ID6510
Feature activation+3.61e-02

Pos logit contributions
e+0.030
 moderation+0.028
oof+0.028
 hate+0.027
 client+0.025
Neg logit contributions
der-0.031
rier-0.029
lights-0.028
lic-0.026
lessly-0.026
Token.
Token ID13
Feature activation+1.95e-02

Pos logit contributions
e+0.080
 moderation+0.076
oof+0.075
 hate+0.074
 client+0.068
Neg logit contributions
der-0.060
rier-0.057
lights-0.056
lic-0.050
lessly-0.049
Token You
Token ID921
Feature activation+4.86e-03

Pos logit contributions
e+0.035
 moderation+0.033
oof+0.032
 hate+0.031
 client+0.029
Neg logit contributions
der-0.040
rier-0.039
lights-0.038
lic-0.035
lessly-0.035
Token can
Token ID460
Feature activation+1.26e-03

Pos logit contributions
e+0.010
 moderation+0.009
oof+0.009
 hate+0.009
 client+0.008
Neg logit contributions
der-0.009
rier-0.009
lights-0.008
lic-0.008
lessly-0.008
Token look
Token ID804
Feature activation-6.47e-03

Pos logit contributions
e+0.002
 moderation+0.002
oof+0.001
 hate+0.001
 client+0.001
Neg logit contributions
der-0.003
rier-0.003
lights-0.003
lic-0.003
lessly-0.003
Token at
Token ID379
Feature activation+3.82e-02

Pos logit contributions
der+0.011
rier+0.010
lights+0.010
lic+0.009
boards+0.009
Neg logit contributions
e-0.014
 moderation-0.013
oof-0.013
 hate-0.013
 supposed-0.012
Token it
Token ID340
Feature activation-5.53e-03

Pos logit contributions
e+0.070
 moderation+0.066
oof+0.065
 hate+0.063
 client+0.059
Neg logit contributions
der-0.078
rier-0.074
lights-0.071
lessly-0.064
boards-0.063
Token and
Token ID290
Feature activation+3.89e-02

Pos logit contributions
der+0.010
lights+0.009
rier+0.009
lic+0.009
lessly+0.008
Neg logit contributions
e-0.012
 moderation-0.011
oof-0.011
 hate-0.010
 generation-0.009
Token a
Token ID257
Feature activation-1.21e-02

Pos logit contributions
e+0.013
 moderation+0.012
 hate+0.012
oof+0.011
 supposed+0.011
Neg logit contributions
der-0.016
rier-0.016
lights-0.015
boards-0.014
lic-0.014
Token little
Token ID1310
Feature activation+7.36e-04

Pos logit contributions
der+0.022
rier+0.021
lights+0.020
lic+0.018
lessly+0.018
Neg logit contributions
e-0.025
 moderation-0.024
oof-0.024
 hate-0.022
atherine-0.021
Token bit
Token ID1643
Feature activation-1.97e-02

Pos logit contributions
e+0.001
 moderation+0.001
 hate+0.001
oof+0.001
 client+0.001
Neg logit contributions
der-0.002
rier-0.001
lights-0.001
lic-0.001
lessly-0.001
Token embarrassed
Token ID21100
Feature activation-1.98e-02

Pos logit contributions
der+0.038
rier+0.038
lights+0.037
atively+0.035
boards+0.035
Neg logit contributions
 hate-0.034
e-0.031
 moderation-0.029
 harm-0.028
 matter-0.028
Token.
Token ID13
Feature activation-1.69e-03

Pos logit contributions
rier+0.033
der+0.032
lights+0.031
lic+0.031
ows+0.028
Neg logit contributions
e-0.041
 hate-0.040
 moderation-0.038
oof-0.036
 supposed-0.036
Token
Token ID198
Feature activation-1.36e-02

Pos logit contributions
der+0.003
rier+0.003
lights+0.003
lic+0.002
lessly+0.002
Neg logit contributions
e-0.004
 moderation-0.003
oof-0.003
 hate-0.003
 client-0.003
Token
Token ID198
Feature activation-1.28e-02

Pos logit contributions
der+0.022
rier+0.020
lights+0.020
lic+0.019
lessly+0.018
Neg logit contributions
e-0.030
 moderation-0.029
oof-0.028
 hate-0.028
 client-0.026
TokenThey
Token ID2990
Feature activation-2.17e-02

Pos logit contributions
der+0.026
rier+0.024
lights+0.024
lic+0.024
lessly+0.023
Neg logit contributions
e-0.023
oof-0.022
 moderation-0.021
 hate-0.020
 client-0.020
Token carried
Token ID5281
Feature activation-1.31e-02

Pos logit contributions
der+0.044
rier+0.043
lights+0.041
lic+0.037
lessly+0.037
Neg logit contributions
e-0.041
 moderation-0.038
 hate-0.037
oof-0.037
 client-0.034
Token boxes
Token ID10559
Feature activation-2.86e-02

Pos logit contributions
der+0.022
rier+0.021
lights+0.021
ly+0.020
lessly+0.019
Neg logit contributions
 moderation-0.026
e-0.026
oof-0.026
 hate-0.025
 supposed-0.024
Token full
Token ID1336
Feature activation-3.15e-02

Pos logit contributions
der+0.066
rier+0.058
lic+0.056
lights+0.050
boards+0.049
Neg logit contributions
e-0.048
 moderation-0.042
 seen-0.040
 hate-0.040
oof-0.038
Token go
Token ID467
Feature activation+2.52e-02

Pos logit contributions
e+0.001
 moderation+0.001
oof+0.001
 hate+0.001
 client+0.001
Neg logit contributions
der-0.001
rier-0.001
lights-0.001
lic-0.001
lessly-0.001
Token home
Token ID1363
Feature activation+3.04e-02

Pos logit contributions
e+0.046
oof+0.040
 hate+0.038
 moderation+0.038
 meet+0.035
Neg logit contributions
der-0.060
rier-0.058
lights-0.053
boards-0.049
lessly-0.049
Token and
Token ID290
Feature activation+3.70e-02

Pos logit contributions
e+0.061
 moderation+0.057
 hate+0.054
oof+0.054
 obey+0.050
Neg logit contributions
der-0.057
rier-0.057
lights-0.054
lic-0.053
boards-0.052
Token change
Token ID1487
Feature activation+2.24e-02

Pos logit contributions
e+0.049
 moderation+0.045
oof+0.044
 hate+0.041
 client+0.038
Neg logit contributions
der-0.094
rier-0.090
lights-0.087
lic-0.082
lessly-0.082
Token.
Token ID13
Feature activation+1.69e-02

Pos logit contributions
e+0.030
 moderation+0.028
oof+0.027
 hate+0.026
atherine+0.022
Neg logit contributions
der-0.057
rier-0.055
lights-0.053
lic-0.052
boards-0.051
Token And
Token ID843
Feature activation-9.84e-03

Pos logit contributions
e+0.031
 moderation+0.030
oof+0.029
 hate+0.028
 supposed+0.027
Neg logit contributions
der-0.034
rier-0.031
lights-0.030
lic-0.028
lessly-0.028
Token next
Token ID1306
Feature activation-1.37e-02

Pos logit contributions
der+0.023
rier+0.022
lights+0.021
lic+0.020
boards+0.020
Neg logit contributions
e-0.016
 moderation-0.015
oof-0.014
 hate-0.014
 client-0.012
Token time
Token ID640
Feature activation+7.33e-03

Pos logit contributions
der+0.040
rier+0.039
lights+0.037
lessly+0.036
lic+0.035
Neg logit contributions
e-0.014
 moderation-0.013
oof-0.013
 hate-0.011
 supposed-0.009
Token,
Token ID11
Feature activation-1.01e-02

Pos logit contributions
e+0.018
 moderation+0.018
oof+0.017
 hate+0.017
 client+0.016
Neg logit contributions
der-0.010
rier-0.010
lights-0.009
lic-0.008
lessly-0.008
Token be
Token ID307
Feature activation+1.48e-02

Pos logit contributions
der+0.024
rier+0.023
lights+0.023
lessly+0.021
lic+0.021
Neg logit contributions
e-0.015
 moderation-0.014
oof-0.014
 hate-0.014
 client-0.012
Token more
Token ID517
Feature activation+2.99e-02

Pos logit contributions
e+0.022
 moderation+0.022
 hate+0.021
oof+0.021
 client+0.020
Neg logit contributions
der-0.033
rier-0.033
lights-0.031
lic-0.029
lessly-0.029
Token find
Token ID1064
Feature activation+5.45e-03

Pos logit contributions
e+0.060
oof+0.058
 moderation+0.056
atherine+0.050
 hate+0.049
Neg logit contributions
der-0.085
rier-0.083
lights-0.075
lessly-0.072
ly-0.070
Token out
Token ID503
Feature activation+3.55e-02

Pos logit contributions
e+0.011
 moderation+0.011
oof+0.010
 hate+0.010
 client+0.009
Neg logit contributions
der-0.010
rier-0.010
lights-0.009
lic-0.009
lessly-0.008
Token.
Token ID13
Feature activation-4.86e-03

Pos logit contributions
e+0.068
 moderation+0.062
 hate+0.061
oof+0.060
 client+0.057
Neg logit contributions
der-0.068
rier-0.066
lights-0.063
lic-0.060
boards-0.059
Token
Token ID198
Feature activation-1.61e-02

Pos logit contributions
der+0.009
rier+0.009
lights+0.008
lessly+0.008
lic+0.007
Neg logit contributions
e-0.010
 moderation-0.009
oof-0.009
 hate-0.009
 client-0.008
Token
Token ID198
Feature activation-1.47e-02

Pos logit contributions
der+0.029
rier+0.028
lights+0.027
lic+0.025
lessly+0.024
Neg logit contributions
e-0.033
 moderation-0.032
oof-0.031
 hate-0.030
 client-0.028
TokenFinally
Token ID11158
Feature activation-2.72e-02

Pos logit contributions
der+0.029
rier+0.027
lights+0.026
lic+0.026
lessly+0.025
Neg logit contributions
 moderation-0.028
e-0.028
oof-0.026
 hate-0.025
 client-0.024
Token,
Token ID11
Feature activation-2.34e-02

Pos logit contributions
der+0.045
rier+0.045
lights+0.039
lic+0.037
lessly+0.037
Neg logit contributions
e-0.063
 moderation-0.057
oof-0.057
 hate-0.056
 client-0.053
Token the
Token ID262
Feature activation-4.47e-02

Pos logit contributions
der+0.041
rier+0.039
lights+0.038
lic+0.034
lessly+0.034
Neg logit contributions
e-0.049
 moderation-0.047
oof-0.046
 hate-0.044
atherine-0.042
Token truck
Token ID7779
Feature activation-1.13e-02

Pos logit contributions
der+0.086
rier+0.082
lights+0.079
lic+0.072
lessly+0.072
Neg logit contributions
e-0.088
 moderation-0.084
oof-0.082
 hate-0.081
 client-0.076
Token stopped
Token ID5025
Feature activation-1.13e-02

Pos logit contributions
der+0.022
rier+0.021
lights+0.020
lessly+0.019
lic+0.018
Neg logit contributions
e-0.022
oof-0.020
 moderation-0.019
 hate-0.019
 client-0.018
Token at
Token ID379
Feature activation+3.13e-02

Pos logit contributions
der+0.017
rier+0.017
lights+0.016
lic+0.014
boards+0.014
Neg logit contributions
e-0.027
 moderation-0.026
oof-0.025
 hate-0.025
 client-0.023
Token dry
Token ID5894
Feature activation+1.41e-03

Pos logit contributions
e+0.045
 moderation+0.042
oof+0.041
 hate+0.040
 supposed+0.036
Neg logit contributions
der-0.048
rier-0.046
lights-0.044
lic-0.041
lessly-0.040
Token.
Token ID13
Feature activation+1.69e-02

Pos logit contributions
e+0.002
 moderation+0.002
oof+0.001
 hate+0.001
 supposed+0.001
Neg logit contributions
lic-0.003
atively-0.003
rier-0.003
der-0.003
 lid-0.003
Token And
Token ID843
Feature activation-9.64e-03

Pos logit contributions
e+0.031
 moderation+0.029
oof+0.028
 hate+0.028
 client+0.026
Neg logit contributions
der-0.035
rier-0.033
lights-0.032
lic-0.029
lessly-0.029
Token next
Token ID1306
Feature activation-1.28e-02

Pos logit contributions
der+0.022
rier+0.021
lights+0.021
lic+0.020
boards+0.020
Neg logit contributions
e-0.016
 moderation-0.015
oof-0.014
 hate-0.014
atherine-0.012
Token time
Token ID640
Feature activation+3.95e-03

Pos logit contributions
der+0.038
rier+0.037
lights+0.035
lessly+0.034
lic+0.033
Neg logit contributions
e-0.013
oof-0.012
 moderation-0.012
 hate-0.011
 supposed-0.009
Token,
Token ID11
Feature activation-1.22e-02

Pos logit contributions
e+0.010
 moderation+0.009
oof+0.009
 hate+0.009
 supposed+0.008
Neg logit contributions
der-0.006
rier-0.005
lights-0.005
lic-0.005
lessly-0.005
Token let
Token ID1309
Feature activation+3.54e-03

Pos logit contributions
der+0.031
lights+0.030
rier+0.029
lic+0.027
lessly+0.027
Neg logit contributions
e-0.017
 moderation-0.016
 hate-0.015
oof-0.015
 client-0.013
Token's
Token ID338
Feature activation+3.14e-03

Pos logit contributions
e+0.006
 moderation+0.006
oof+0.006
 hate+0.006
 client+0.006
Neg logit contributions
der-0.007
rier-0.007
lights-0.006
ly-0.006
boards-0.006
Token hide
Token ID7808
Feature activation+2.15e-02

Pos logit contributions
e+0.005
 moderation+0.005
oof+0.005
 hate+0.005
 client+0.004
Neg logit contributions
der-0.007
rier-0.007
lights-0.006
lic-0.006
lessly-0.006
Token somewhere
Token ID7382
Feature activation+1.49e-02

Pos logit contributions
e+0.041
 moderation+0.035
oof+0.034
 hate+0.032
aster+0.029
Neg logit contributions
der-0.044
lic-0.041
rier-0.041
lights-0.039
boards-0.039
Token else
Token ID2073
Feature activation+4.00e-02

Pos logit contributions
e+0.019
 moderation+0.018
 hate+0.018
 supposed+0.016
oof+0.015
Neg logit contributions
der-0.037
rier-0.036
lic-0.034
lights-0.033
boards-0.033
Token happy
Token ID3772
Feature activation-3.27e-03

Pos logit contributions
der+0.087
rier+0.081
lights+0.080
boards+0.076
lic+0.074
Neg logit contributions
e-0.070
 hate-0.066
oof-0.064
 moderation-0.062
 supposed-0.057
Token together
Token ID1978
Feature activation+1.45e-03

Pos logit contributions
rier+0.006
der+0.006
lights+0.005
lic+0.005
lessly+0.005
Neg logit contributions
e-0.006
 moderation-0.006
oof-0.006
 hate-0.006
 supposed-0.006
Token."
Token ID526
Feature activation+6.72e-03

Pos logit contributions
e+0.003
 moderation+0.003
oof+0.003
 hate+0.003
 supposed+0.003
Neg logit contributions
der-0.003
rier-0.002
lights-0.002
lic-0.002
boards-0.002
Token
Token ID198
Feature activation-6.56e-03

Pos logit contributions
e+0.012
 moderation+0.012
oof+0.012
 hate+0.011
 client+0.010
Neg logit contributions
der-0.014
rier-0.014
lights-0.012
lessly-0.011
lic-0.011
Token
Token ID198
Feature activation-2.53e-03

Pos logit contributions
der+0.014
rier+0.014
lights+0.013
boards+0.012
lessly+0.012
Neg logit contributions
e-0.012
oof-0.011
 moderation-0.011
 hate-0.010
atherine-0.009
TokenThey
Token ID2990
Feature activation-2.95e-02

Pos logit contributions
der+0.005
rier+0.004
lights+0.004
lic+0.004
lessly+0.004
Neg logit contributions
e-0.005
 moderation-0.005
oof-0.005
 hate-0.005
 client-0.004
Token took
Token ID1718
Feature activation-2.53e-02

Pos logit contributions
der+0.063
rier+0.063
lights+0.059
lessly+0.054
lic+0.054
Neg logit contributions
e-0.053
 moderation-0.048
oof-0.047
 hate-0.046
atherine-0.042
Token the
Token ID262
Feature activation-5.06e-02

Pos logit contributions
der+0.048
rier+0.047
lights+0.045
lic+0.040
lessly+0.040
Neg logit contributions
e-0.048
 moderation-0.047
oof-0.046
 hate-0.044
 supposed-0.041
Token dolls
Token ID36062
Feature activation+3.71e-03

Pos logit contributions
der+0.099
rier+0.095
lights+0.091
lic+0.083
lessly+0.083
Neg logit contributions
e-0.098
 moderation-0.093
oof-0.091
 hate-0.088
 client-0.082
Token to
Token ID284
Feature activation+2.15e-02

Pos logit contributions
e+0.009
 moderation+0.009
oof+0.008
 hate+0.008
 client+0.008
Neg logit contributions
der-0.005
rier-0.005
lights-0.005
lic-0.004
lessly-0.004
Token the
Token ID262
Feature activation-4.38e-02

Pos logit contributions
e+0.043
oof+0.042
 moderation+0.041
beh+0.040
 hate+0.039
Neg logit contributions
der-0.037
rier-0.036
lights-0.033
lic-0.030
lessly-0.029
Token
Token ID198
Feature activation-2.58e-02

Pos logit contributions
der+0.042
rier+0.039
lights+0.037
lessly+0.035
lic+0.035
Neg logit contributions
e-0.057
 moderation-0.055
oof-0.053
 hate-0.053
 supposed-0.050
TokenBen
Token ID11696
Feature activation+3.99e-03

Pos logit contributions
der+0.051
rier+0.048
lessly+0.045
lights+0.044
lic+0.044
Neg logit contributions
e-0.046
 moderation-0.044
oof-0.044
 hate-0.043
 supposed-0.042
Tokenjamin
Token ID13337
Feature activation+3.89e-03

Pos logit contributions
e+0.008
 moderation+0.007
oof+0.007
 hate+0.007
atherine+0.006
Neg logit contributions
der-0.008
rier-0.008
lights-0.008
lic-0.007
lessly-0.007
Token smiled
Token ID13541
Feature activation-4.62e-02

Pos logit contributions
e+0.008
 moderation+0.008
oof+0.008
 hate+0.007
 client+0.007
Neg logit contributions
der-0.007
rier-0.007
lights-0.006
lic-0.006
lessly-0.006
Token and
Token ID290
Feature activation-2.17e-02

Pos logit contributions
boards+0.105
lic+0.100
lights+0.096
pan+0.090
oons+0.090
Neg logit contributions
e-0.064
 hate-0.063
oof-0.055
 worry-0.053
 nephew-0.053
Token said
Token ID531
Feature activation-7.68e-02

Pos logit contributions
rier+0.045
der+0.044
lights+0.044
boards+0.041
lic+0.038
Neg logit contributions
e-0.038
oof-0.035
 hate-0.034
 moderation-0.033
atherine-0.031
Token,
Token ID11
Feature activation-4.90e-02

Pos logit contributions
lights+0.160
der+0.140
boards+0.138
rier+0.129
ows+0.124
Neg logit contributions
e-0.138
 hate-0.118
oof-0.107
 disagree-0.106
 obey-0.105
Token "
Token ID366
Feature activation+6.05e-03

Pos logit contributions
lights+0.086
der+0.078
rier+0.077
boards+0.075
ows+0.071
Neg logit contributions
e-0.098
 hate-0.090
 moderation-0.088
oof-0.087
cle-0.085
TokenThat
Token ID2504
Feature activation-1.24e-02

Pos logit contributions
e+0.010
 moderation+0.010
oof+0.009
 hate+0.009
 client+0.009
Neg logit contributions
der-0.013
rier-0.013
lights-0.012
lessly-0.012
lic-0.011
Token's
Token ID338
Feature activation-8.42e-03

Pos logit contributions
der+0.026
rier+0.025
lights+0.024
lic+0.022
lessly+0.022
Neg logit contributions
e-0.022
 moderation-0.021
oof-0.021
 hate-0.020
 client-0.018
Token great
Token ID1049
Feature activation+1.03e-02

Pos logit contributions
der+0.017
rier+0.016
lights+0.015
lessly+0.014
lic+0.014
Neg logit contributions
e-0.016
 moderation-0.015
oof-0.014
 hate-0.014
 client-0.014
Token as
Token ID355
Feature activation+1.19e-02

Pos logit contributions
rier+0.051
der+0.048
lights+0.046
lic+0.045
ows+0.041
Neg logit contributions
e-0.061
 hate-0.058
 moderation-0.055
oof-0.053
 supposed-0.051
Token can
Token ID460
Feature activation-5.52e-03

Pos logit contributions
e+0.026
 moderation+0.025
oof+0.024
 hate+0.024
 client+0.022
Neg logit contributions
der-0.020
rier-0.019
lights-0.018
lic-0.017
boards-0.016
Token be
Token ID307
Feature activation+4.76e-02

Pos logit contributions
der+0.012
rier+0.011
lights+0.011
lic+0.010
boards+0.010
Neg logit contributions
e-0.010
 hate-0.009
 moderation-0.009
oof-0.009
 client-0.008
Token.
Token ID13
Feature activation-1.90e-02

Pos logit contributions
e+0.084
 moderation+0.079
oof+0.076
 hate+0.075
 client+0.068
Neg logit contributions
der-0.102
rier-0.098
lights-0.094
lic-0.087
lessly-0.086
Token She
Token ID1375
Feature activation-3.80e-02

Pos logit contributions
der+0.032
rier+0.032
lights+0.030
lic+0.027
lessly+0.027
Neg logit contributions
e-0.040
 moderation-0.039
oof-0.037
 hate-0.037
atherine-0.034
Token said
Token ID531
Feature activation-7.12e-02

Pos logit contributions
rier+0.082
der+0.081
lights+0.077
boards+0.070
lessly+0.069
Neg logit contributions
e-0.066
oof-0.060
 hate-0.059
 moderation-0.059
atherine-0.054
Token to
Token ID284
Feature activation+5.64e-03

Pos logit contributions
lights+0.131
der+0.117
rier+0.114
lic+0.113
boards+0.107
Neg logit contributions
e-0.141
 hate-0.130
 moderation-0.125
 client-0.120
oof-0.118
Token him
Token ID683
Feature activation-8.25e-03

Pos logit contributions
e+0.009
 moderation+0.009
oof+0.009
 hate+0.008
 client+0.008
Neg logit contributions
der-0.012
rier-0.012
lights-0.011
lessly-0.010
lic-0.010
Token,
Token ID11
Feature activation-4.59e-02

Pos logit contributions
der+0.013
lights+0.013
rier+0.013
boards+0.011
lic+0.011
Neg logit contributions
e-0.018
 hate-0.016
oof-0.016
 moderation-0.016
 client-0.016
Token "
Token ID366
Feature activation+1.04e-02

Pos logit contributions
lights+0.071
rier+0.068
der+0.067
boards+0.059
lic+0.058
Neg logit contributions
e-0.102
 hate-0.096
 moderation-0.096
oof-0.092
 supposed-0.088
TokenDon
Token ID3987
Feature activation+5.71e-03

Pos logit contributions
e+0.018
oof+0.017
 moderation+0.016
 hate+0.016
 client+0.016
Neg logit contributions
der-0.022
rier-0.021
lessly-0.019
lic-0.019
lights-0.019
Token the
Token ID262
Feature activation-4.91e-02

Pos logit contributions
der+0.024
rier+0.023
lights+0.022
lic+0.019
lessly+0.019
Neg logit contributions
e-0.040
 moderation-0.038
oof-0.038
 hate-0.037
 client-0.035
Token table
Token ID3084
Feature activation-4.00e-02

Pos logit contributions
der+0.093
lights+0.089
rier+0.088
lic+0.080
lessly+0.078
Neg logit contributions
e-0.094
 moderation-0.093
 hate-0.089
oof-0.086
 client-0.086
Token suddenly
Token ID6451
Feature activation-3.85e-02

Pos logit contributions
rier+0.078
der+0.078
lights+0.071
lessly+0.069
lic+0.067
Neg logit contributions
e-0.074
 hate-0.065
 moderation-0.064
 quarrel-0.061
oof-0.060
Token closed
Token ID4838
Feature activation-4.60e-02

Pos logit contributions
rier+0.080
der+0.076
lights+0.075
lessly+0.068
ows+0.067
Neg logit contributions
e-0.067
 hate-0.059
 moderation-0.058
oof-0.057
 client-0.053
Token its
Token ID663
Feature activation-5.03e-02

Pos logit contributions
lic+0.104
der+0.102
rier+0.097
boards+0.093
lessly+0.092
Neg logit contributions
e-0.065
 moderation-0.055
oof-0.053
 hate-0.050
 afar-0.050
Token lid
Token ID19789
Feature activation-8.07e-02

Pos logit contributions
der+0.072
lights+0.068
rier+0.065
lic+0.063
lessly+0.062
Neg logit contributions
e-0.118
oof-0.113
 moderation-0.113
 hate-0.111
 client-0.108
Token!
Token ID0
Feature activation-1.26e-02

Pos logit contributions
rier+0.172
der+0.155
lights+0.153
lic+0.146
stals+0.130
Neg logit contributions
e-0.127
 seen-0.100
 hate-0.099
 been-0.093
 supposed-0.087
Token He
Token ID679
Feature activation-2.50e-02

Pos logit contributions
der+0.022
rier+0.022
lights+0.022
lessly+0.019
lic+0.019
Neg logit contributions
e-0.025
 moderation-0.024
oof-0.023
 hate-0.023
 client-0.022
Token was
Token ID373
Feature activation-2.75e-02

Pos logit contributions
der+0.046
rier+0.046
lights+0.043
lessly+0.039
lic+0.038
Neg logit contributions
e-0.051
 moderation-0.047
oof-0.046
 hate-0.046
 client-0.042
Token so
Token ID523
Feature activation+2.76e-02

Pos logit contributions
der+0.056
rier+0.053
lights+0.053
boards+0.047
lic+0.047
Neg logit contributions
e-0.052
 moderation-0.048
 hate-0.047
oof-0.045
 supposed-0.044
Token sad
Token ID6507
Feature activation-1.55e-02

Pos logit contributions
e+0.042
 hate+0.039
 moderation+0.039
oof+0.034
 supposed+0.034
Neg logit contributions
der-0.066
lights-0.063
rier-0.062
boards-0.058
lic-0.057
Token at
Token ID379
Feature activation+2.59e-03

Pos logit contributions
der+0.110
rier+0.103
lights+0.098
boards+0.089
lic+0.088
Neg logit contributions
e-0.127
 hate-0.121
oof-0.118
 moderation-0.118
 supposed-0.108
Token the
Token ID262
Feature activation-3.29e-02

Pos logit contributions
e+0.005
 moderation+0.004
oof+0.004
 hate+0.004
 client+0.004
Neg logit contributions
der-0.006
rier-0.005
lights-0.005
lic-0.005
lessly-0.005
Token little
Token ID1310
Feature activation+3.71e-03

Pos logit contributions
der+0.071
rier+0.067
lights+0.064
lessly+0.058
lic+0.058
Neg logit contributions
e-0.058
 moderation-0.054
oof-0.053
 hate-0.053
 client-0.049
Token girl
Token ID2576
Feature activation+2.70e-02

Pos logit contributions
e+0.006
 moderation+0.005
 hate+0.005
 client+0.005
oof+0.005
Neg logit contributions
der-0.009
lights-0.008
rier-0.008
lic-0.008
lessly-0.007
Token and
Token ID290
Feature activation-2.57e-03

Pos logit contributions
e+0.065
 moderation+0.063
oof+0.062
 hate+0.060
 client+0.057
Neg logit contributions
der-0.039
rier-0.037
lights-0.035
lessly-0.031
lic-0.031
Token replied
Token ID8712
Feature activation-7.03e-02

Pos logit contributions
der+0.005
rier+0.005
lights+0.005
lessly+0.004
boards+0.004
Neg logit contributions
e-0.005
 moderation-0.004
 hate-0.004
oof-0.004
 client-0.004
Token,
Token ID11
Feature activation-4.71e-02

Pos logit contributions
lights+0.149
der+0.128
lessly+0.124
lic+0.123
rier+0.122
Neg logit contributions
e-0.123
 hate-0.104
 disagree-0.096
 lawyer-0.095
oof-0.094
Token "
Token ID366
Feature activation+1.71e-02

Pos logit contributions
lights+0.078
der+0.072
rier+0.071
boards+0.065
ows+0.064
Neg logit contributions
e-0.100
 hate-0.092
 moderation-0.091
oof-0.089
 quarrel-0.086
TokenNo
Token ID2949
Feature activation+6.90e-03

Pos logit contributions
e+0.028
 moderation+0.027
oof+0.026
 hate+0.025
 supposed+0.024
Neg logit contributions
der-0.038
rier-0.036
lights-0.034
lic-0.034
lessly-0.033
Token,
Token ID11
Feature activation+7.79e-03

Pos logit contributions
e+0.018
 moderation+0.017
oof+0.017
 hate+0.016
 client+0.016
Neg logit contributions
der-0.009
rier-0.009
lights-0.008
lic-0.007
lessly-0.007
Token I
Token ID314
Feature activation+2.63e-02

Pos logit contributions
e+0.013
 moderation+0.013
 hate+0.012
oof+0.012
 client+0.011
Neg logit contributions
der-0.017
rier-0.016
lights-0.016
lic-0.015
boards-0.014
Token her
Token ID607
Feature activation-2.96e-02

Pos logit contributions
der+0.057
rier+0.055
lights+0.055
lic+0.052
boards+0.050
Neg logit contributions
e-0.037
 moderation-0.034
oof-0.033
 hate-0.031
atherine-0.028
Token mom
Token ID1995
Feature activation+1.16e-02

Pos logit contributions
der+0.068
rier+0.065
lights+0.063
lic+0.059
lessly+0.059
Neg logit contributions
e-0.047
 moderation-0.045
oof-0.043
 hate-0.042
 client-0.038
Token and
Token ID290
Feature activation-1.76e-02

Pos logit contributions
e+0.027
 moderation+0.025
oof+0.025
 hate+0.024
atherine+0.023
Neg logit contributions
der-0.019
rier-0.018
lights-0.017
lic-0.016
boards-0.015
Token dad
Token ID9955
Feature activation+1.16e-02

Pos logit contributions
rier+0.035
der+0.035
lights+0.033
lic+0.031
boards+0.030
Neg logit contributions
e-0.032
 moderation-0.030
oof-0.030
 hate-0.029
atherine-0.028
Token and
Token ID290
Feature activation-1.94e-02

Pos logit contributions
e+0.027
 moderation+0.024
 hate+0.024
oof+0.024
 supposed+0.022
Neg logit contributions
der-0.019
rier-0.018
lights-0.018
lic-0.016
boards-0.015
Token said
Token ID531
Feature activation-7.20e-02

Pos logit contributions
der+0.038
rier+0.037
lights+0.035
lic+0.032
boards+0.032
Neg logit contributions
e-0.037
 moderation-0.035
oof-0.035
 hate-0.034
 client-0.031
Token,
Token ID11
Feature activation-4.40e-02

Pos logit contributions
lights+0.141
der+0.127
rier+0.119
boards+0.118
lic+0.116
Neg logit contributions
e-0.133
 hate-0.120
oof-0.113
 moderation-0.112
 client-0.107
Tokenâ
Token ID22940
Feature activation+2.52e-02

Pos logit contributions
lights+0.073
rier+0.068
der+0.068
boards+0.062
ows+0.060
Neg logit contributions
e-0.092
 hate-0.085
 moderation-0.084
oof-0.084
cle-0.080
Token
Token ID26391
Feature activation+4.72e-02

Pos logit contributions
e+0.028
 moderation+0.026
oof+0.025
 hate+0.023
 client+0.020
Neg logit contributions
der-0.070
rier-0.068
lights-0.065
lic-0.062
lessly-0.062
Token
Token ID129
Feature activation+1.32e-02

Pos logit contributions
e+0.086
 moderation+0.082
oof+0.080
 hate+0.077
 client+0.071
Neg logit contributions
der-0.097
rier-0.093
lights-0.089
lic-0.083
lessly-0.082
Token
Token ID241
Feature activation+1.26e-02

Pos logit contributions
 moderation+0.026
e+0.024
oof+0.023
 hate+0.023
 client+0.022
Neg logit contributions
der-0.025
rier-0.024
lights-0.022
lic-0.021
lessly-0.021