TOP ACTIVATIONS
MAX = 0.117

had never given her the idea . He wished he had
had been told by his parents . So he stayed on
don 't you ask your parents to teach you ? They
said it was not their fault . She said the truck
You can 't change my idea ." " Yes
closet is not a good idea . They learned that playing
the cell without asking your mom first ?" J
said it was not his fault . She said the bird
books . They share their wishes with each other . They
parent was not a good idea . From that day on
play with it for a minute , please ?" L ila
They would always share their wishes and help each other .
grateful for her mom 's idea to come on this adventure
him to always ask his mom and dad before he tasted
and accepted her mom 's idea . But when
learned to always ask his mom and dad before he ate
wanted was not a good idea . From then on ,
about the rock and their wishes . But the
said it was not their fault . They said the man
. They still have their wishes . <|endoftext|> L ila and

MIN ACTIVATIONS
MIN = -0.147

on the lake in a kay ak . As they were
the air , leaving a spir aling trail of black ink
butterfly , tw irling and spir aling in the air .
different styles like stars or spir als . Everyone was so
forth in the sky , spir alling higher and higher .
green and a long , spir alling tail . The princess
Fro ggy , in a kay ak . Fro ggy said
saw a man in a kay ak . The man said
liked to paddle in the kay ak and splash in the
and exciting . A baby fer ret !
Anne began to build a kay ak . They found some
and Anne jumped in the kay ak and padd led away
river . They saw a kay ak on the shore .
for a ride in the kay ak and her mom said
present ; it was a kay ak ! She was so
<|endoftext|> One day , a sw an named Sam was swimming
Every day , a nearby sw an would come to the
old man who had a kay ak . He decided to
spl ashed around in the kay ak , just like two
and saw ed until the kay ak was finished .

INTERVAL 0.051 - 0.117
CONTAINS 3.913%

play in the park every day . They liked to swing
. You are my best friend too . You are very
nice . And from that day on , the little boy
." Then he give his mom a big hug , and
an important lesson . No matter what danger he might face

INTERVAL -0.015 - 0.051
CONTAINS 65.732%

. We are going to love the baby . We are
together . <|endoftext|> Once upon a time , there was a
's mom asked " Did you have fun playing with the
something cool ." " No , Ben , that
from Lily . She tried to separate the book from him

INTERVAL -0.081 - -0.015
CONTAINS 30.118%

and sat up in the garden to wait .
. He went over to poke it and it turned out
ride together . <|endoftext|> Once upon a time , in a
full of water and salt . When she pulled the trigger
on and started to dance around the office . Her mom

INTERVAL -0.147 - -0.081
CONTAINS 0.237%

, so he started to cl ap his paws . He
leave the board in the p uddle and he went back
Mom plays along and pret ends she is his helper
to never touch a strange je ep again . <|endoftext|> On
big , green blanket and unp acked all their food .
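
The four INTERVAL ranges above appear to be equal-width bins spanning MIN to MAX. The sketch below is an assumption inferred from the printed numbers, not a description of the tool that produced this dump. It rebuilds the bin edges from the two extremes and checks that the CONTAINS percentages account for all samples. The helper name `equal_width_bins` is hypothetical.

```python
# Assumption: the dashboard splits [MIN, MAX] into four equal-width bins.
# Values are copied from the MIN / MAX / CONTAINS lines above.

def equal_width_bins(lo, hi, n):
    """Return n (start, end) pairs of equal width covering [lo, hi]."""
    width = (hi - lo) / n
    return [
        (round(lo + i * width, 3), round(lo + (i + 1) * width, 3))
        for i in range(n)
    ]

MIN_ACT = -0.147
MAX_ACT = 0.117

bins = equal_width_bins(MIN_ACT, MAX_ACT, 4)

# Percentages as printed, listed from the highest interval to the lowest.
contains = [3.913, 65.732, 30.118, 0.237]
total = sum(contains)

for (start, end), pct in zip(reversed(bins), contains):
    print(f"INTERVAL {start} - {end}  CONTAINS {pct}%")
print(f"total = {total:.3f}%")
```

Under that assumption each bin is 0.066 wide, and the reconstructed edges match the printed ones (-0.147, -0.081, -0.015, 0.051, 0.117). The two outer bins together hold about 4% of samples, so most activations of this feature sit close to zero.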
Token had
Token ID550
Feature activation+3.51e-02

Pos logit contributions
 injuries+0.078
 mercy+0.071
 hesitation+0.064
 poisoning+0.064
 permission+0.062
Neg logit contributions
 plains-0.085
 spir-0.085
aling-0.081
ab-0.080
irl-0.080
Token never
Token ID1239
Feature activation+5.62e-02

Pos logit contributions
 injuries+0.047
 mercy+0.041
 hesitation+0.038
 poisoning+0.038
 permission+0.034
Neg logit contributions
 plains-0.063
 spir-0.063
aling-0.061
irl-0.061
 cycl-0.060
Token given
Token ID1813
Feature activation+6.08e-03

Pos logit contributions
 injuries+0.082
 mercy+0.074
 hesitation+0.068
 poisoning+0.066
 permission+0.066
Neg logit contributions
 plains-0.099
 spir-0.098
irl-0.095
ab-0.094
aling-0.093
Token her
Token ID607
Feature activation+8.69e-02

Pos logit contributions
 injuries+0.009
 mercy+0.008
 hesitation+0.007
 poisoning+0.007
 permission+0.007
Neg logit contributions
 plains-0.010
 spir-0.010
aling-0.010
ab-0.010
irl-0.010
Token the
Token ID262
Feature activation+3.79e-02

Pos logit contributions
 injuries+0.127
 mercy+0.115
 poisoning+0.106
 hesitation+0.105
 forgiveness+0.095
Neg logit contributions
 spir-0.147
 plains-0.146
aling-0.139
irl-0.139
ab-0.139
Token idea
Token ID2126
Feature activation+1.17e-01

Pos logit contributions
 injuries+0.073
 mercy+0.068
 hesitation+0.063
 poisoning+0.063
 permission+0.061
Neg logit contributions
 plains-0.047
 spir-0.047
aling-0.045
ab-0.044
irl-0.044
Token.
Token ID13
Feature activation-5.21e-03

Pos logit contributions
 injuries+0.190
 mercy+0.167
 hesitation+0.156
 poisoning+0.154
 forgiveness+0.141
Neg logit contributions
aling-0.174
ab-0.168
 spir-0.166
 sorts-0.160
 zoom-0.160
Token He
Token ID679
Feature activation-2.05e-03

Pos logit contributions
 plains+0.008
 spir+0.008
aling+0.008
ab+0.008
 cycl+0.007
Neg logit contributions
 injuries-0.009
 mercy-0.008
 hesitation-0.008
 poisoning-0.007
 permission-0.007
Token wished
Token ID16555
Feature activation+7.29e-02

Pos logit contributions
 spir+0.004
 plains+0.004
aling+0.003
ab+0.003
irl+0.003
Neg logit contributions
 injuries-0.003
 mercy-0.003
 hesitation-0.002
 poisoning-0.002
 permission-0.002
Token he
Token ID339
Feature activation+5.01e-02

Pos logit contributions
 injuries+0.099
 mercy+0.088
 poisoning+0.084
 hesitation+0.084
 permission+0.078
Neg logit contributions
 plains-0.123
 spir-0.122
 cycl-0.114
aling-0.114
 jung-0.112
Token had
Token ID550
Feature activation+3.46e-02

Pos logit contributions
 injuries+0.077
 mercy+0.070
 hesitation+0.064
 poisoning+0.064
 permission+0.061
Neg logit contributions
 plains-0.082
 spir-0.082
aling-0.079
ab-0.078
irl-0.077
Token had
Token ID550
Feature activation+2.60e-02

Pos logit contributions
 injuries+0.056
 mercy+0.050
 hesitation+0.046
 poisoning+0.046
 permission+0.044
Neg logit contributions
 spir-0.060
 plains-0.060
irl-0.057
aling-0.056
ab-0.056
Token been
Token ID587
Feature activation-2.60e-03

Pos logit contributions
 injuries+0.042
 mercy+0.039
 hesitation+0.034
 poisoning+0.033
 forgiveness+0.033
Neg logit contributions
 spir-0.041
 plains-0.040
ab-0.040
aling-0.040
irl-0.040
Token told
Token ID1297
Feature activation+5.17e-02

Pos logit contributions
 spir+0.004
 plains+0.004
aling+0.004
ab+0.004
irl+0.004
Neg logit contributions
 injuries-0.004
 mercy-0.004
 poisoning-0.004
 hesitation-0.004
 permission-0.003
Token by
Token ID416
Feature activation+2.83e-02

Pos logit contributions
 injuries+0.073
 mercy+0.066
 hesitation+0.058
 poisoning+0.058
 fault+0.055
Neg logit contributions
 plains-0.090
 spir-0.089
aling-0.088
ab-0.086
 sorts-0.085
Token his
Token ID465
Feature activation+4.84e-02

Pos logit contributions
 injuries+0.042
 mercy+0.037
 hesitation+0.035
 permission+0.034
 forgiveness+0.033
Neg logit contributions
 plains-0.050
ab-0.049
aling-0.049
 spir-0.048
irl-0.047
Token parents
Token ID3397
Feature activation+1.11e-01

Pos logit contributions
 injuries+0.061
 mercy+0.056
 poisoning+0.053
 hesitation+0.049
t+0.045
Neg logit contributions
 spir-0.091
 plains-0.090
irl-0.086
 cycl-0.084
aling-0.084
Token.
Token ID13
Feature activation-8.35e-03

Pos logit contributions
 injuries+0.165
 mercy+0.146
 poisoning+0.126
 hesitation+0.122
 forgiveness+0.120
Neg logit contributions
 plains-0.188
 spir-0.187
aling-0.183
 sorts-0.182
ab-0.178
Token So
Token ID1406
Feature activation+3.50e-02

Pos logit contributions
irl+0.014
 spir+0.014
ab+0.013
 plains+0.013
aling+0.013
Neg logit contributions
 injuries-0.011
 mercy-0.010
 hesitation-0.010
 permission-0.010
 either-0.009
Token he
Token ID339
Feature activation-7.04e-03

Pos logit contributions
 injuries+0.033
 mercy+0.028
 hesitation+0.025
 poisoning+0.024
 permission+0.023
Neg logit contributions
 plains-0.078
 spir-0.077
ab-0.075
aling-0.075
irl-0.075
Token stayed
Token ID9658
Feature activation-3.26e-02

Pos logit contributions
 plains+0.011
 spir+0.011
irl+0.011
 sorts+0.011
aling+0.011
Neg logit contributions
 injuries-0.010
 mercy-0.010
 poisoning-0.009
 help-0.009
 hesitation-0.009
Token on
Token ID319
Feature activation-3.15e-03

Pos logit contributions
 plains+0.054
 spir+0.053
aling+0.051
ab+0.051
 cycl+0.050
Neg logit contributions
 injuries-0.051
 mercy-0.045
 poisoning-0.042
 hesitation-0.042
 permission-0.041
Token don
Token ID836
Feature activation+4.97e-02

Pos logit contributions
 injuries+0.071
 mercy+0.062
 hesitation+0.054
 poisoning+0.053
 forgiveness+0.053
Neg logit contributions
 plains-0.128
 spir-0.126
aling-0.124
ab-0.122
 cycl-0.120
Token't
Token ID470
Feature activation+3.80e-02

Pos logit contributions
 injuries+0.035
 mercy+0.028
 poisoning+0.021
 hesitation+0.021
 forgiveness+0.019
Neg logit contributions
 plains-0.123
 spir-0.122
ab-0.119
aling-0.118
 cycl-0.117
Token you
Token ID345
Feature activation+3.98e-02

Pos logit contributions
 injuries+0.057
 mercy+0.051
 poisoning+0.047
 hesitation+0.047
 permission+0.044
Neg logit contributions
 spir-0.064
 plains-0.063
aling-0.061
irl-0.061
ab-0.060
Token ask
Token ID1265
Feature activation+7.10e-02

Pos logit contributions
 injuries+0.062
 mercy+0.056
 hesitation+0.051
 poisoning+0.051
 forgiveness+0.048
Neg logit contributions
 plains-0.063
 spir-0.063
aling-0.062
irl-0.062
ab-0.061
Token your
Token ID534
Feature activation+4.64e-02

Pos logit contributions
 injuries+0.104
 mercy+0.097
 hesitation+0.089
 poisoning+0.088
 fault+0.086
Neg logit contributions
 plains-0.114
 spir-0.109
irl-0.108
aling-0.107
ab-0.107
Token parents
Token ID3397
Feature activation+1.09e-01

Pos logit contributions
 injuries+0.067
 mercy+0.061
 poisoning+0.060
 hesitation+0.057
t+0.055
Neg logit contributions
 plains-0.076
 spir-0.073
irl-0.072
 sights-0.071
ab-0.071
Token to
Token ID284
Feature activation+4.50e-02

Pos logit contributions
 injuries+0.155
 mercy+0.142
 hesitation+0.122
 poisoning+0.120
 fault+0.113
Neg logit contributions
 plains-0.183
ab-0.183
aling-0.180
 spir-0.173
 sorts-0.171
Token teach
Token ID4545
Feature activation+2.34e-02

Pos logit contributions
 injuries+0.070
 mercy+0.064
 poisoning+0.061
 hesitation+0.059
 fault+0.059
Neg logit contributions
 spir-0.067
irl-0.067
 plains-0.066
aling-0.065
 cycl-0.062
Token you
Token ID345
Feature activation+7.38e-02

Pos logit contributions
 injuries+0.041
 mercy+0.036
 hesitation+0.035
 poisoning+0.035
 fault+0.033
Neg logit contributions
 plains-0.032
irl-0.032
 spir-0.032
 sorts-0.030
ab-0.030
Token?
Token ID30
Feature activation-3.49e-02

Pos logit contributions
 injuries+0.115
 mercy+0.100
 poisoning+0.093
 hesitation+0.092
t+0.084
Neg logit contributions
 plains-0.119
 spir-0.117
irl-0.116
aling-0.115
ab-0.114
Token They
Token ID1119
Feature activation-1.16e-03

Pos logit contributions
 spir+0.049
 plains+0.049
irl+0.046
aling+0.046
ab+0.045
Neg logit contributions
 injuries-0.061
 mercy-0.056
 poisoning-0.052
 hesitation-0.052
 permission-0.051
Token said
Token ID531
Feature activation+5.24e-02

Pos logit contributions
 spir+0.033
 sorts+0.031
 plains+0.031
 cycl+0.030
irl+0.029
Neg logit contributions
 injuries-0.033
 mercy-0.029
t-0.028
 fault-0.028
 poisoning-0.028
Token it
Token ID340
Feature activation+3.76e-02

Pos logit contributions
 injuries+0.076
 mercy+0.066
 poisoning+0.063
 hesitation+0.059
 permission+0.056
Neg logit contributions
 spir-0.089
irl-0.086
 plains-0.082
ab-0.081
 sings-0.080
Token was
Token ID373
Feature activation-2.59e-02

Pos logit contributions
 injuries+0.073
 mercy+0.065
 hesitation+0.065
 permission+0.060
 poisoning+0.060
Neg logit contributions
 spir-0.048
 plains-0.046
ab-0.044
irl-0.044
 cycl-0.043
Token not
Token ID407
Feature activation+4.07e-02

Pos logit contributions
 spir+0.048
irl+0.048
 plains+0.046
ep+0.045
ab+0.045
Neg logit contributions
 injuries-0.032
 mercy-0.029
 hesitation-0.027
t-0.025
 poisoning-0.025
Token their
Token ID511
Feature activation+4.98e-02

Pos logit contributions
 injuries+0.063
 mercy+0.058
 poisoning+0.053
 hesitation+0.052
 permission+0.052
Neg logit contributions
 plains-0.068
 spir-0.068
aling-0.066
irl-0.064
ab-0.064
Token fault
Token ID8046
Feature activation+1.07e-01

Pos logit contributions
 injuries+0.069
 mercy+0.063
 hesitation+0.060
 poisoning+0.058
t+0.057
Neg logit contributions
 plains-0.082
 spir-0.079
 sights-0.078
 sorts-0.077
ab-0.076
Token.
Token ID13
Feature activation-8.90e-03

Pos logit contributions
 injuries+0.201
 mercy+0.183
 poisoning+0.175
use+0.175
 hesitation+0.173
Neg logit contributions
 spir-0.113
 soaring-0.104
 zoom-0.103
ab-0.102
 plains-0.101
Token She
Token ID1375
Feature activation+4.56e-03

Pos logit contributions
 plains+0.010
 spir+0.010
ab+0.009
irl+0.009
 cycl+0.009
Neg logit contributions
 injuries-0.019
 mercy-0.018
 hesitation-0.016
 poisoning-0.016
 permission-0.016
Token said
Token ID531
Feature activation+4.12e-02

Pos logit contributions
 injuries+0.008
 hesitation+0.007
 mercy+0.007
 poisoning+0.006
 permission+0.006
Neg logit contributions
 spir-0.007
 cycl-0.006
ep-0.006
 plains-0.006
irl-0.006
Token the
Token ID262
Feature activation+2.50e-02

Pos logit contributions
 injuries+0.068
 mercy+0.059
 poisoning+0.057
 sleeve+0.053
 hesitation+0.052
Neg logit contributions
 spir-0.060
 sings-0.056
irl-0.056
 soaring-0.055
 zoom-0.055
Token truck
Token ID7779
Feature activation+1.84e-02

Pos logit contributions
 injuries+0.046
 mercy+0.042
 poisoning+0.041
 hesitation+0.041
 permission+0.039
Neg logit contributions
 plains-0.031
 spir-0.030
 cycl-0.030
 sights-0.030
ep-0.029
Token You
Token ID921
Feature activation+1.01e-02

Pos logit contributions
 spir+0.016
 plains+0.016
irl+0.016
aling+0.016
 cycl+0.015
Neg logit contributions
 injuries-0.014
 mercy-0.013
 poisoning-0.012
 hesitation-0.012
 forgiveness-0.012
Token can
Token ID460
Feature activation+5.16e-02

Pos logit contributions
 injuries+0.015
 mercy+0.013
 hesitation+0.012
 poisoning+0.012
 permission+0.011
Neg logit contributions
 spir-0.017
 plains-0.017
aling-0.017
ab-0.017
irl-0.017
Token't
Token ID470
Feature activation+3.76e-02

Pos logit contributions
 injuries+0.090
 mercy+0.081
 hesitation+0.080
 poisoning+0.076
 permission+0.075
Neg logit contributions
irl-0.075
 spir-0.072
aling-0.070
 zoom-0.069
 plains-0.068
Token change
Token ID1487
Feature activation-1.98e-03

Pos logit contributions
 injuries+0.067
 mercy+0.062
 hesitation+0.058
 poisoning+0.057
 fault+0.055
Neg logit contributions
 spir-0.052
 plains-0.051
aling-0.050
irl-0.050
ab-0.049
Token my
Token ID616
Feature activation+4.65e-02

Pos logit contributions
 spir+0.003
 plains+0.003
ab+0.003
irl+0.003
 sorts+0.003
Neg logit contributions
 injuries-0.003
 mercy-0.003
 hesitation-0.003
 poisoning-0.003
 forgiveness-0.003
Token idea
Token ID2126
Feature activation+1.06e-01

Pos logit contributions
 injuries+0.085
 mercy+0.082
 hesitation+0.075
 poisoning+0.074
t+0.071
Neg logit contributions
 plains-0.059
ab-0.059
 spir-0.058
irl-0.056
 sights-0.054
Token."
Token ID526
Feature activation+2.40e-03

Pos logit contributions
 injuries+0.199
 mercy+0.184
 hesitation+0.167
 forgiveness+0.156
 poisoning+0.153
Neg logit contributions
ab-0.145
aling-0.133
 plains-0.129
irl-0.126
 spir-0.124
Token
Token ID198
Feature activation+2.25e-02

Pos logit contributions
 injuries+0.005
 mercy+0.005
 poisoning+0.004
 permission+0.004
 hesitation+0.004
Neg logit contributions
 plains-0.003
 spir-0.002
ab-0.002
irl-0.002
 cycl-0.002
Token
Token ID198
Feature activation+1.31e-02

Pos logit contributions
 injuries+0.040
 mercy+0.037
 hesitation+0.035
 poisoning+0.035
 permission+0.035
Neg logit contributions
 plains-0.032
 spir-0.031
irl-0.031
ab-0.029
aling-0.029
Token"
Token ID1
Feature activation+2.06e-02

Pos logit contributions
 injuries+0.020
 mercy+0.018
 hesitation+0.017
 poisoning+0.017
 permission+0.016
Neg logit contributions
 spir-0.022
 plains-0.022
aling-0.021
irl-0.021
ab-0.021
TokenYes
Token ID5297
Feature activation+5.76e-02

Pos logit contributions
 injuries+0.036
 mercy+0.034
 fault+0.033
 hesitation+0.033
 permission+0.031
Neg logit contributions
 spir-0.029
irl-0.028
 plains-0.027
ab-0.027
 cycl-0.026
Token closet
Token ID21615
Feature activation-2.95e-03

Pos logit contributions
 plains+0.025
 spir+0.025
aling+0.025
irl+0.024
ab+0.024
Neg logit contributions
 injuries-0.035
 mercy-0.031
 poisoning-0.029
 hesitation-0.029
 permission-0.029
Token is
Token ID318
Feature activation-2.03e-02

Pos logit contributions
 plains+0.005
 spir+0.005
ab+0.005
aling+0.005
irl+0.005
Neg logit contributions
 injuries-0.004
 mercy-0.004
 hesitation-0.004
 poisoning-0.004
 permission-0.003
Token not
Token ID407
Feature activation+3.78e-02

Pos logit contributions
 spir+0.035
 plains+0.034
irl+0.033
aling+0.033
ab+0.032
Neg logit contributions
 injuries-0.028
 mercy-0.026
 hesitation-0.024
 poisoning-0.023
t-0.022
Token a
Token ID257
Feature activation+7.37e-03

Pos logit contributions
 injuries+0.059
 mercy+0.053
 poisoning+0.049
 hesitation+0.049
 permission+0.047
Neg logit contributions
 plains-0.061
 spir-0.061
aling-0.059
ab-0.058
irl-0.058
Token good
Token ID922
Feature activation+1.95e-02

Pos logit contributions
 injuries+0.013
 mercy+0.013
 hesitation+0.012
 poisoning+0.011
 permission+0.011
Neg logit contributions
 spir-0.010
 plains-0.009
ab-0.009
 sorts-0.008
 zoom-0.008
Token idea
Token ID2126
Feature activation+1.06e-01

Pos logit contributions
 injuries+0.034
 mercy+0.032
 hesitation+0.029
 poisoning+0.029
 permission+0.028
Neg logit contributions
 spir-0.027
 plains-0.027
ab-0.026
 sorts-0.026
irl-0.025
Token.
Token ID13
Feature activation-7.07e-03

Pos logit contributions
 injuries+0.191
 mercy+0.177
 hesitation+0.157
 severe+0.155
 worry+0.147
Neg logit contributions
ab-0.143
 spir-0.135
 plains-0.130
 sorts-0.128
aling-0.125
Token They
Token ID1119
Feature activation+8.19e-03

Pos logit contributions
 plains+0.009
 spir+0.009
ab+0.008
aling+0.008
irl+0.008
Neg logit contributions
 injuries-0.014
 mercy-0.013
 hesitation-0.012
 poisoning-0.012
 permission-0.012
Token learned
Token ID4499
Feature activation+3.21e-02

Pos logit contributions
 injuries+0.013
 mercy+0.012
 hesitation+0.011
 poisoning+0.011
 permission+0.011
Neg logit contributions
 spir-0.013
 plains-0.013
aling-0.012
irl-0.012
 cycl-0.012
Token that
Token ID326
Feature activation+4.57e-02

Pos logit contributions
 injuries+0.048
 hesitation+0.041
 poisoning+0.036
 mercy+0.036
 supervision+0.034
Neg logit contributions
 spir-0.053
 plains-0.052
 sorts-0.052
irl-0.051
arg-0.051
Token playing
Token ID2712
Feature activation-2.73e-02

Pos logit contributions
 injuries+0.072
 mercy+0.065
 hesitation+0.060
 poisoning+0.060
 permission+0.057
Neg logit contributions
 plains-0.073
 spir-0.073
aling-0.070
ab-0.069
irl-0.069
Token the
Token ID262
Feature activation+1.78e-02

Pos logit contributions
 plains+0.030
 spir+0.029
 sorts+0.028
irl+0.028
aling+0.028
Neg logit contributions
 injuries-0.037
 mercy-0.033
 hesitation-0.032
 poisoning-0.031
 forgiveness-0.028
Token cell
Token ID2685
Feature activation+7.77e-03

Pos logit contributions
 injuries+0.031
 mercy+0.028
 hesitation+0.026
 poisoning+0.026
 permission+0.025
Neg logit contributions
 plains-0.026
 spir-0.026
aling-0.025
ab-0.024
irl-0.024
Token without
Token ID1231
Feature activation+1.43e-02

Pos logit contributions
 injuries+0.010
 mercy+0.009
 hesitation+0.008
 poisoning+0.008
 forgiveness+0.007
Neg logit contributions
 plains-0.015
 spir-0.015
ab-0.014
aling-0.014
irl-0.014
Token asking
Token ID4737
Feature activation+6.92e-02

Pos logit contributions
 injuries+0.022
 mercy+0.020
 hesitation+0.018
 permission+0.018
 poisoning+0.018
Neg logit contributions
 plains-0.024
 spir-0.024
ab-0.023
aling-0.023
irl-0.023
Token your
Token ID534
Feature activation+5.17e-02

Pos logit contributions
 injuries+0.103
 mercy+0.094
 hesitation+0.086
 poisoning+0.083
 fault+0.080
Neg logit contributions
 plains-0.113
ab-0.110
 spir-0.109
aling-0.108
irl-0.108
Token mom
Token ID1995
Feature activation+1.06e-01

Pos logit contributions
 injuries+0.060
 mercy+0.055
 poisoning+0.052
 hesitation+0.050
t+0.043
Neg logit contributions
 spir-0.096
 plains-0.096
ab-0.092
irl-0.092
 sorts-0.090
Token first
Token ID717
Feature activation+6.40e-02

Pos logit contributions
 injuries+0.164
 mercy+0.149
 hesitation+0.136
 poisoning+0.131
 forgiveness+0.124
Neg logit contributions
ab-0.171
aling-0.170
 plains-0.170
 spir-0.165
 sorts-0.162
Token?"
Token ID1701
Feature activation+1.01e-03

Pos logit contributions
 injuries+0.096
 mercy+0.088
 poisoning+0.079
 hesitation+0.078
 forgiveness+0.072
Neg logit contributions
 spir-0.104
 plains-0.104
irl-0.103
ab-0.102
aling-0.101
Token
Token ID198
Feature activation+2.76e-03

Pos logit contributions
 injuries+0.002
 mercy+0.002
 poisoning+0.002
 hesitation+0.002
 fault+0.001
Neg logit contributions
 spir-0.001
 sorts-0.001
 plains-0.001
ab-0.001
irl-0.001
Token
Token ID198
Feature activation-3.08e-03

Pos logit contributions
 injuries+0.004
 mercy+0.004
 poisoning+0.004
 hesitation+0.004
 permission+0.003
Neg logit contributions
 spir-0.005
 plains-0.004
irl-0.004
aling-0.004
ab-0.004
TokenJ
Token ID41
Feature activation-3.42e-02

Pos logit contributions
 spir+0.005
 plains+0.005
aling+0.005
ab+0.004
irl+0.004
Neg logit contributions
 injuries-0.005
 mercy-0.005
 hesitation-0.004
 poisoning-0.004
 permission-0.004
Token said
Token ID531
Feature activation+4.66e-02

Pos logit contributions
 spir+0.028
 plains+0.027
 cycl+0.026
 sorts+0.026
aling+0.026
Neg logit contributions
 injuries-0.025
 mercy-0.022
 poisoning-0.021
 hesitation-0.021
t-0.020
Token it
Token ID340
Feature activation+3.51e-02

Pos logit contributions
 injuries+0.067
 mercy+0.055
 poisoning+0.054
 hesitation+0.050
 payoff+0.048
Neg logit contributions
 spir-0.077
irl-0.074
ab-0.074
 plains-0.073
 soaring-0.072
Token was
Token ID373
Feature activation-2.65e-02

Pos logit contributions
 injuries+0.067
 mercy+0.059
 hesitation+0.058
 poisoning+0.055
 permission+0.054
Neg logit contributions
 spir-0.046
 plains-0.044
ab-0.043
irl-0.042
 sorts-0.042
Token not
Token ID407
Feature activation+3.99e-02

Pos logit contributions
 spir+0.051
irl+0.051
 plains+0.050
aling+0.049
ab+0.049
Neg logit contributions
 injuries-0.031
 mercy-0.027
 hesitation-0.025
 poisoning-0.024
t-0.023
Token his
Token ID465
Feature activation+5.79e-02

Pos logit contributions
 injuries+0.057
 mercy+0.053
 permission+0.048
 poisoning+0.047
 hesitation+0.047
Neg logit contributions
 plains-0.072
 spir-0.071
aling-0.069
irl-0.068
 cycl-0.067
Token fault
Token ID8046
Feature activation+1.06e-01

Pos logit contributions
 injuries+0.072
 mercy+0.064
 poisoning+0.060
 hesitation+0.056
t+0.051
Neg logit contributions
 spir-0.105
 plains-0.103
ab-0.101
irl-0.101
 sights-0.100
Token.
Token ID13
Feature activation-1.43e-02

Pos logit contributions
 injuries+0.201
 mercy+0.180
 poisoning+0.176
 payoff+0.172
use+0.172
Neg logit contributions
 spir-0.111
 zoom-0.103
 soaring-0.103
 sorts-0.101
ab-0.100
Token She
Token ID1375
Feature activation+1.27e-03

Pos logit contributions
 plains+0.017
 spir+0.016
aling+0.015
ab+0.015
irl+0.014
Neg logit contributions
 injuries-0.030
 mercy-0.028
 hesitation-0.026
 poisoning-0.025
 permission-0.025
Token said
Token ID531
Feature activation+3.48e-02

Pos logit contributions
 injuries+0.002
 hesitation+0.002
 mercy+0.002
 poisoning+0.002
 fault+0.002
Neg logit contributions
 spir-0.002
 cycl-0.002
 plains-0.002
ep-0.002
 sorts-0.002
Token the
Token ID262
Feature activation+1.78e-02

Pos logit contributions
 injuries+0.055
 mercy+0.046
 poisoning+0.045
 hesitation+0.043
 fault+0.041
Neg logit contributions
 spir-0.053
 plains-0.050
 soaring-0.049
irl-0.049
 zoom-0.048
Token bird
Token ID6512
Feature activation+8.33e-03

Pos logit contributions
 injuries+0.029
 mercy+0.027
 poisoning+0.026
 hesitation+0.025
 permission+0.024
Neg logit contributions
 spir-0.027
 plains-0.026
irl-0.025
ab-0.025
ep-0.024
Token books
Token ID3835
Feature activation-2.12e-03

Pos logit contributions
 spir+0.016
 plains+0.016
 cycl+0.015
aling+0.015
 sorts+0.015
Neg logit contributions
 injuries-0.021
 mercy-0.019
 poisoning-0.018
 hesitation-0.018
t-0.017
Token.
Token ID13
Feature activation-5.27e-03

Pos logit contributions
 plains+0.004
ab+0.003
aling+0.003
 spir+0.003
irl+0.003
Neg logit contributions
 injuries-0.003
 mercy-0.003
 hesitation-0.003
 permission-0.003
 forgiveness-0.002
Token They
Token ID1119
Feature activation-1.71e-03

Pos logit contributions
 plains+0.006
 spir+0.006
ab+0.006
aling+0.006
irl+0.006
Neg logit contributions
 injuries-0.011
 mercy-0.010
 hesitation-0.009
 poisoning-0.009
 permission-0.009
Token share
Token ID2648
Feature activation+8.87e-03

Pos logit contributions
 plains+0.003
 spir+0.003
aling+0.003
ab+0.003
irl+0.003
Neg logit contributions
 injuries-0.003
 mercy-0.002
 hesitation-0.002
 poisoning-0.002
 permission-0.002
Token their
Token ID511
Feature activation+3.79e-02

Pos logit contributions
 injuries+0.013
 mercy+0.012
 hesitation+0.011
 poisoning+0.011
 permission+0.010
Neg logit contributions
 spir-0.015
 plains-0.015
aling-0.015
irl-0.014
ab-0.014
Token wishes
Token ID12802
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.067
 mercy+0.062
 hesitation+0.058
 poisoning+0.057
t+0.055
Neg logit contributions
 plains-0.053
 spir-0.052
irl-0.050
ab-0.049
 sights-0.048
Token with
Token ID351
Feature activation+1.31e-02

Pos logit contributions
 injuries+0.143
 mercy+0.131
 hesitation+0.118
 poisoning+0.115
 permission+0.109
Neg logit contributions
 spir-0.186
 plains-0.186
ab-0.180
irl-0.180
aling-0.180
Token each
Token ID1123
Feature activation+3.31e-02

Pos logit contributions
 injuries+0.020
 mercy+0.018
 hesitation+0.017
 poisoning+0.016
 permission+0.016
Neg logit contributions
 plains-0.022
 spir-0.022
irl-0.021
aling-0.020
 sorts-0.020
Token other
Token ID584
Feature activation+7.82e-02

Pos logit contributions
 injuries+0.053
 mercy+0.048
 poisoning+0.043
 hesitation+0.043
 forgiveness+0.041
Neg logit contributions
 plains-0.054
 spir-0.052
ab-0.051
 sorts-0.050
aling-0.049
Token.
Token ID13
Feature activation-9.49e-03

Pos logit contributions
 injuries+0.103
 mercy+0.091
 hesitation+0.082
 poisoning+0.081
 permission+0.077
Neg logit contributions
 plains-0.147
 spir-0.146
aling-0.140
ab-0.140
irl-0.138
Token They
Token ID1119
Feature activation+1.59e-03

Pos logit contributions
 plains+0.011
 spir+0.011
aling+0.010
ab+0.010
irl+0.010
Neg logit contributions
 injuries-0.019
 mercy-0.018
 hesitation-0.017
 poisoning-0.017
 permission-0.016
Token parent
Token ID2560
Feature activation+6.77e-02

Pos logit contributions
 injuries+0.061
 mercy+0.056
 poisoning+0.055
 hesitation+0.053
t+0.046
Neg logit contributions
irl-0.111
 spir-0.110
 plains-0.109
 sights-0.105
 cycl-0.103
Token was
Token ID373
Feature activation-1.01e-02

Pos logit contributions
 injuries+0.105
 mercy+0.095
 poisoning+0.086
 hesitation+0.085
 permission+0.080
Neg logit contributions
 spir-0.108
 plains-0.105
aling-0.105
ab-0.104
 sorts-0.102
Token not
Token ID407
Feature activation+4.90e-02

Pos logit contributions
 spir+0.018
 plains+0.018
aling+0.017
irl+0.017
ab+0.017
Neg logit contributions
 injuries-0.014
 mercy-0.012
 poisoning-0.011
 hesitation-0.011
 permission-0.010
Token a
Token ID257
Feature activation+2.47e-02

Pos logit contributions
 injuries+0.074
 mercy+0.069
 permission+0.064
 hesitation+0.063
 poisoning+0.062
Neg logit contributions
 plains-0.087
 spir-0.084
irl-0.081
aling-0.080
 cycl-0.078
Token good
Token ID922
Feature activation+2.94e-02

Pos logit contributions
 injuries+0.040
 mercy+0.036
 hesitation+0.034
 poisoning+0.033
 permission+0.031
Neg logit contributions
 spir-0.039
 plains-0.038
ab-0.037
 sorts-0.035
 cycl-0.035
Token idea
Token ID2126
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.048
 mercy+0.042
 hesitation+0.039
 poisoning+0.039
 permission+0.036
Neg logit contributions
 spir-0.046
 plains-0.045
ab-0.044
 sorts-0.044
irl-0.042
Token.
Token ID13
Feature activation-5.24e-03

Pos logit contributions
 injuries+0.194
 mercy+0.178
 hesitation+0.158
 poisoning+0.151
 fiber+0.149
Neg logit contributions
ab-0.132
 spir-0.128
 sorts-0.120
aling-0.118
 soaring-0.114
Token From
Token ID3574
Feature activation+5.97e-04

Pos logit contributions
 spir+0.008
 plains+0.008
aling+0.007
irl+0.007
ab+0.007
Neg logit contributions
 injuries-0.009
 mercy-0.008
 poisoning-0.007
 hesitation-0.007
 forgiveness-0.007
Token that
Token ID326
Feature activation+1.20e-02

Pos logit contributions
 injuries+0.001
 fault+0.001
 mercy+0.001
 hesitation+0.001
t+0.001
Neg logit contributions
aling-0.001
 plains-0.001
aw-0.001
irl-0.001
 spir-0.001
Token day
Token ID1110
Feature activation+5.88e-02

Pos logit contributions
 injuries+0.026
 mercy+0.024
 hesitation+0.022
 poisoning+0.021
 fault+0.021
Neg logit contributions
 spir-0.013
 sorts-0.012
 plains-0.012
ab-0.012
aling-0.012
Token on
Token ID319
Feature activation+1.78e-02

Pos logit contributions
 injuries+0.109
 mercy+0.100
 poisoning+0.093
 hesitation+0.093
 permission+0.090
Neg logit contributions
 plains-0.079
 spir-0.079
aling-0.074
irl-0.073
ab-0.073
Token play
Token ID711
Feature activation-1.51e-02

Pos logit contributions
 injuries+0.146
 hesitation+0.143
 mercy+0.130
 permission+0.123
 poisoning+0.122
Neg logit contributions
 zoom-0.081
irl-0.077
aling-0.072
ab-0.071
 spir-0.069
Token with
Token ID351
Feature activation+1.41e-02

Pos logit contributions
 spir+0.018
irl+0.018
ab+0.017
 plains+0.017
aling+0.017
Neg logit contributions
 injuries-0.030
 mercy-0.028
 hesitation-0.027
 poisoning-0.025
 permission-0.025
Token it
Token ID340
Feature activation+6.36e-02

Pos logit contributions
 injuries+0.025
 hesitation+0.023
 mercy+0.023
 poisoning+0.021
 fault+0.021
Neg logit contributions
 plains-0.019
 spir-0.019
irl-0.019
aling-0.018
ep-0.018
Token for
Token ID329
Feature activation+4.51e-02

Pos logit contributions
 injuries+0.107
 hesitation+0.101
 mercy+0.098
 permission+0.091
 supervision+0.089
Neg logit contributions
irl-0.095
ab-0.095
aling-0.091
ep-0.090
 plains-0.087
Token a
Token ID257
Feature activation+2.84e-02

Pos logit contributions
 injuries+0.073
 mercy+0.066
 hesitation+0.065
 poisoning+0.064
 fault+0.062
Neg logit contributions
 plains-0.066
irl-0.065
 spir-0.064
aling-0.063
ep-0.060
Token minute
Token ID5664
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.046
 mercy+0.043
 hesitation+0.040
 poisoning+0.039
 permission+0.038
Neg logit contributions
 spir-0.042
 plains-0.042
ab-0.041
irl-0.039
 cycl-0.039
Token,
Token ID11
Feature activation+2.28e-02

Pos logit contributions
 injuries+0.190
 mercy+0.181
 hesitation+0.159
 poisoning+0.157
 permission+0.153
Neg logit contributions
ab-0.149
irl-0.142
aling-0.134
 plains-0.132
 sorts-0.131
Token please
Token ID3387
Feature activation+7.08e-02

Pos logit contributions
 injuries+0.033
 mercy+0.029
 hesitation+0.027
 poisoning+0.026
t+0.024
Neg logit contributions
 spir-0.038
irl-0.038
 plains-0.037
ab-0.037
 zoom-0.036
Token?"
Token ID1701
Feature activation+1.71e-02

Pos logit contributions
 injuries+0.102
 mercy+0.092
 hesitation+0.086
 poisoning+0.083
 fault+0.079
Neg logit contributions
irl-0.119
 spir-0.118
 plains-0.117
ab-0.117
aling-0.116
Token L
Token ID406
Feature activation-3.79e-02

Pos logit contributions
 injuries+0.032
 mercy+0.030
 poisoning+0.029
 fault+0.027
 hesitation+0.027
Neg logit contributions
 spir-0.024
irl-0.022
 plains-0.021
 zoom-0.021
 cycl-0.020
Tokenila
Token ID10102
Feature activation-3.49e-02

Pos logit contributions
 spir+0.067
 plains+0.066
irl+0.065
ab+0.065
aling+0.065
Neg logit contributions
 injuries-0.055
 mercy-0.049
 poisoning-0.044
 hesitation-0.044
 permission-0.042
Token They
Token ID1119
Feature activation-2.74e-03

Pos logit contributions
 spir+0.012
 plains+0.012
ab+0.011
 sorts+0.011
irl+0.011
Neg logit contributions
 injuries-0.026
 mercy-0.024
 hesitation-0.023
 poisoning-0.022
 permission-0.022
Token would
Token ID561
Feature activation+7.79e-03

Pos logit contributions
 plains+0.004
 spir+0.004
aling+0.004
ab+0.004
irl+0.004
Neg logit contributions
 injuries-0.004
 mercy-0.004
 poisoning-0.004
 hesitation-0.004
 permission-0.004
Token always
Token ID1464
Feature activation-4.10e-03

Pos logit contributions
 injuries+0.012
 mercy+0.011
 hesitation+0.010
 poisoning+0.010
 permission+0.010
Neg logit contributions
 spir-0.013
 plains-0.012
aling-0.012
irl-0.012
ab-0.012
Token share
Token ID2648
Feature activation+1.29e-02

Pos logit contributions
 plains+0.007
 spir+0.007
aling+0.007
ab+0.007
irl+0.007
Neg logit contributions
 injuries-0.006
 mercy-0.005
 hesitation-0.005
 poisoning-0.005
 permission-0.005
Token their
Token ID511
Feature activation+4.47e-02

Pos logit contributions
 injuries+0.021
 mercy+0.020
 hesitation+0.018
 poisoning+0.017
t+0.017
Neg logit contributions
 spir-0.020
 plains-0.019
irl-0.019
ab-0.018
 sorts-0.018
Token wishes
Token ID12802
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.078
 mercy+0.073
 hesitation+0.068
 poisoning+0.067
t+0.064
Neg logit contributions
 spir-0.063
 plains-0.062
irl-0.060
ab-0.059
 sorts-0.057
Token and
Token ID290
Feature activation-1.72e-02

Pos logit contributions
 injuries+0.146
 mercy+0.136
 hesitation+0.122
 poisoning+0.118
 permission+0.112
Neg logit contributions
 spir-0.181
 plains-0.179
irl-0.176
aling-0.175
ab-0.175
Token help
Token ID1037
Feature activation+4.70e-02

Pos logit contributions
 spir+0.028
 plains+0.028
 sorts+0.027
irl+0.027
aling+0.027
Neg logit contributions
 injuries-0.026
 mercy-0.022
 hesitation-0.021
 poisoning-0.021
t-0.020
Token each
Token ID1123
Feature activation+1.50e-02

Pos logit contributions
 injuries+0.072
 mercy+0.070
 hesitation+0.068
 fault+0.063
 poisoning+0.061
Neg logit contributions
 spir-0.070
 plains-0.068
irl-0.066
 sorts-0.065
ab-0.064
Token other
Token ID584
Feature activation+6.13e-02

Pos logit contributions
 injuries+0.025
 mercy+0.023
 poisoning+0.021
 hesitation+0.021
 fault+0.020
Neg logit contributions
 spir-0.023
 plains-0.023
 sorts-0.023
ab-0.022
aling-0.021
Token.
Token ID13
Feature activation-1.03e-02

Pos logit contributions
 injuries+0.079
 mercy+0.070
 hesitation+0.063
 poisoning+0.062
 permission+0.059
Neg logit contributions
 plains-0.116
 spir-0.116
aling-0.112
ab-0.111
irl-0.111
Token grateful
Token ID14066
Feature activation+4.91e-02

Pos logit contributions
 spir+0.048
 plains+0.046
aling+0.046
irl+0.045
 sorts+0.044
Neg logit contributions
 injuries-0.064
 mercy-0.058
 poisoning-0.053
 hesitation-0.053
 fault-0.051
Token for
Token ID329
Feature activation+4.15e-02

Pos logit contributions
 injuries+0.097
 mercy+0.091
 hesitation+0.081
 supervision+0.080
 babys+0.080
Neg logit contributions
 sorts-0.051
 plains-0.050
 spir-0.049
 sights-0.047
ab-0.047
Token her
Token ID607
Feature activation+6.31e-02

Pos logit contributions
 injuries+0.054
 mercy+0.048
 hesitation+0.044
 poisoning+0.043
 forgiveness+0.043
Neg logit contributions
 plains-0.081
ab-0.081
 spir-0.079
aling-0.078
 cycl-0.077
Token mom
Token ID1995
Feature activation+7.96e-02

Pos logit contributions
 injuries+0.097
 mercy+0.089
 poisoning+0.083
 hesitation+0.080
t+0.079
Neg logit contributions
 spir-0.099
 plains-0.095
irl-0.095
 sorts-0.092
 sights-0.092
Token's
Token ID338
Feature activation+3.97e-02

Pos logit contributions
 injuries+0.133
 mercy+0.121
 hesitation+0.106
 poisoning+0.106
 fault+0.102
Neg logit contributions
 spir-0.120
 plains-0.119
 sorts-0.117
aling-0.115
ab-0.114
Token idea
Token ID2126
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.052
 mercy+0.043
 poisoning+0.042
 hesitation+0.038
t+0.037
Neg logit contributions
 spir-0.069
 sorts-0.069
 plains-0.067
 sights-0.067
irl-0.066
Token to
Token ID284
Feature activation+4.25e-02

Pos logit contributions
 injuries+0.187
 mercy+0.176
 hesitation+0.151
 fault+0.151
 fiber+0.146
Neg logit contributions
 spir-0.138
 sorts-0.136
 plains-0.128
ab-0.124
aling-0.121
Token come
Token ID1282
Feature activation+4.19e-03

Pos logit contributions
 injuries+0.066
 mercy+0.059
 hesitation+0.054
 poisoning+0.054
 permission+0.052
Neg logit contributions
 spir-0.069
 plains-0.069
aling-0.066
irl-0.066
ab-0.066
Token on
Token ID319
Feature activation+1.64e-02

Pos logit contributions
 injuries+0.005
 mercy+0.004
 poisoning+0.004
 hesitation+0.004
 permission+0.004
Neg logit contributions
 plains-0.009
 spir-0.008
aling-0.008
ab-0.008
irl-0.008
Token this
Token ID428
Feature activation+2.60e-02

Pos logit contributions
 injuries+0.026
 mercy+0.024
 poisoning+0.022
t+0.022
 fault+0.022
Neg logit contributions
 spir-0.026
 plains-0.025
irl-0.024
 sorts-0.023
 sights-0.023
Token adventure
Token ID8855
Feature activation+7.85e-03

Pos logit contributions
 injuries+0.042
 mercy+0.038
 hesitation+0.035
 poisoning+0.035
 help+0.034
Neg logit contributions
 spir-0.040
 plains-0.039
aling-0.039
ab-0.038
 cycl-0.038
Token him
Token ID683
Feature activation+5.11e-02

Pos logit contributions
 injuries+0.065
 mercy+0.057
 poisoning+0.056
 hesitation+0.055
 fault+0.052
Neg logit contributions
 plains-0.070
 spir-0.070
aling-0.067
irl-0.066
 cycl-0.065
Token to
Token ID284
Feature activation+3.38e-02

Pos logit contributions
 injuries+0.059
 mercy+0.047
 hesitation+0.046
 poisoning+0.044
 permission+0.040
Neg logit contributions
 plains-0.101
 spir-0.100
aling-0.097
 cycl-0.096
irl-0.096
Token always
Token ID1464
Feature activation+2.13e-02

Pos logit contributions
 injuries+0.052
 mercy+0.047
 hesitation+0.044
 poisoning+0.044
 permission+0.041
Neg logit contributions
 spir-0.053
 plains-0.052
irl-0.051
aling-0.051
ab-0.049
Token ask
Token ID1265
Feature activation+6.39e-02

Pos logit contributions
 injuries+0.034
 mercy+0.030
 hesitation+0.028
 poisoning+0.028
 fault+0.027
Neg logit contributions
 spir-0.033
 plains-0.033
aling-0.032
irl-0.032
ab-0.032
Token his
Token ID465
Feature activation+5.80e-02

Pos logit contributions
 injuries+0.096
 mercy+0.091
 fault+0.087
 hesitation+0.085
 poisoning+0.084
Neg logit contributions
 plains-0.095
 spir-0.089
aling-0.089
 jung-0.086
ab-0.086
Token mom
Token ID1995
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.071
 mercy+0.064
 poisoning+0.061
 hesitation+0.060
t+0.058
Neg logit contributions
 plains-0.103
 spir-0.101
ab-0.098
irl-0.098
 sights-0.096
Token and
Token ID290
Feature activation+2.80e-02

Pos logit contributions
 injuries+0.184
 mercy+0.177
 hesitation+0.162
 poisoning+0.153
 fault+0.151
Neg logit contributions
ab-0.132
 plains-0.130
aling-0.126
ep-0.125
c-0.125
Token dad
Token ID9955
Feature activation+7.25e-02

Pos logit contributions
 injuries+0.035
t+0.034
 hesitation+0.030
 fault+0.029
 mercy+0.028
Neg logit contributions
 cycl-0.044
 plains-0.044
 rode-0.044
 zoom-0.043
 spir-0.043
Token before
Token ID878
Feature activation+3.78e-02

Pos logit contributions
 injuries+0.105
 mercy+0.094
 hesitation+0.083
 poisoning+0.083
 blood+0.080
Neg logit contributions
 plains-0.113
aling-0.113
ab-0.113
 sorts-0.107
 spir-0.107
Token he
Token ID339
Feature activation+3.01e-02

Pos logit contributions
 injuries+0.049
 mercy+0.046
 poisoning+0.042
 hesitation+0.040
 fault+0.040
Neg logit contributions
 spir-0.067
 plains-0.066
irl-0.064
 cycl-0.063
aling-0.061
Token tasted
Token ID29187
Feature activation-1.62e-02

Pos logit contributions
 injuries+0.051
 mercy+0.047
 hesitation+0.045
 poisoning+0.044
 fault+0.042
Neg logit contributions
 spir-0.043
 plains-0.043
aling-0.042
ab-0.042
irl-0.040
Token and
Token ID290
Feature activation-3.53e-02

Pos logit contributions
 injuries+0.075
 mercy+0.071
 hesitation+0.062
 poisoning+0.061
 forgiveness+0.059
Neg logit contributions
 spir-0.082
 plains-0.081
aling-0.078
ab-0.077
 sorts-0.075
Token accepted
Token ID6292
Feature activation+1.17e-02

Pos logit contributions
 plains+0.073
ab+0.070
 spir+0.069
aling+0.067
 jung+0.065
Neg logit contributions
 injuries-0.039
 mercy-0.037
 help-0.035
 poisoning-0.033
 forgiveness-0.033
Token her
Token ID607
Feature activation+5.90e-02

Pos logit contributions
 injuries+0.022
 mercy+0.021
t+0.019
 fault+0.019
 poisoning+0.019
Neg logit contributions
 spir-0.014
irl-0.013
 sorts-0.012
 strut-0.011
 plains-0.011
Token mom
Token ID1995
Feature activation+6.69e-02

Pos logit contributions
 injuries+0.065
 poisoning+0.062
 mercy+0.060
t+0.059
 hesitation+0.051
Neg logit contributions
irl-0.111
 spir-0.110
 zoom-0.105
 sights-0.104
 sorts-0.101
Token's
Token ID338
Feature activation+2.99e-02

Pos logit contributions
 injuries+0.135
 mercy+0.126
 poisoning+0.119
 hesitation+0.119
 permission+0.116
Neg logit contributions
 plains-0.078
 spir-0.078
irl-0.073
aling-0.072
ab-0.071
Token idea
Token ID2126
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.029
 poisoning+0.025
 mercy+0.024
t+0.021
 fault+0.020
Neg logit contributions
 spir-0.059
 sorts-0.058
irl-0.058
 sights-0.057
 cycl-0.055
Token.
Token ID13
Feature activation-7.69e-03

Pos logit contributions
 injuries+0.189
 mercy+0.179
 hesitation+0.161
 fault+0.155
 poisoning+0.153
Neg logit contributions
 spir-0.129
 sorts-0.124
ab-0.121
aling-0.120
 plains-0.116
Token
Token ID198
Feature activation+2.12e-02

Pos logit contributions
 spir+0.014
 plains+0.014
irl+0.014
aling+0.014
ab+0.013
Neg logit contributions
 injuries-0.009
 mercy-0.008
 poisoning-0.008
 fault-0.008
 either-0.007
Token
Token ID198
Feature activation+1.81e-02

Pos logit contributions
 injuries+0.035
 mercy+0.032
 hesitation+0.029
 poisoning+0.029
 permission+0.028
Neg logit contributions
 plains-0.033
 spir-0.033
aling-0.032
irl-0.031
ab-0.031
TokenBut
Token ID1537
Feature activation+3.13e-02

Pos logit contributions
 injuries+0.027
 mercy+0.023
 poisoning+0.023
t+0.022
 forgiveness+0.021
Neg logit contributions
 plains-0.030
 spir-0.029
ab-0.028
 sorts-0.028
 sights-0.028
Token when
Token ID618
Feature activation+3.68e-02

Pos logit contributions
 injuries+0.045
 mercy+0.044
 hesitation+0.038
 fault+0.036
 poisoning+0.036
Neg logit contributions
 spir-0.052
 plains-0.051
irl-0.051
ab-0.050
aling-0.049
Token learned
Token ID4499
Feature activation+2.84e-02

Pos logit contributions
 spir+0.016
 plains+0.016
aling+0.015
irl+0.015
 sorts+0.015
Neg logit contributions
 injuries-0.019
 mercy-0.017
 hesitation-0.016
 poisoning-0.015
 permission-0.015
Token to
Token ID284
Feature activation+2.79e-02

Pos logit contributions
 injuries+0.040
 hesitation+0.032
 poisoning+0.030
 mercy+0.030
 worry+0.028
Neg logit contributions
 plains-0.051
 spir-0.050
aling-0.049
 sorts-0.048
irl-0.047
Token always
Token ID1464
Feature activation+1.66e-02

Pos logit contributions
 injuries+0.042
 mercy+0.037
 hesitation+0.035
 poisoning+0.034
 permission+0.032
Neg logit contributions
 spir-0.046
 plains-0.046
aling-0.044
irl-0.044
ab-0.044
Token ask
Token ID1265
Feature activation+7.00e-02

Pos logit contributions
 injuries+0.025
 mercy+0.023
 hesitation+0.021
 poisoning+0.021
 permission+0.020
Neg logit contributions
 spir-0.027
 plains-0.027
aling-0.026
irl-0.026
ab-0.026
Token his
Token ID465
Feature activation+6.17e-02

Pos logit contributions
 injuries+0.103
 mercy+0.094
 fault+0.088
 hesitation+0.087
t+0.087
Neg logit contributions
 plains-0.110
 spir-0.105
aling-0.101
 jung-0.100
 sorts-0.098
Token mom
Token ID1995
Feature activation+1.05e-01

Pos logit contributions
 injuries+0.068
 mercy+0.063
 poisoning+0.060
 hesitation+0.057
t+0.057
Neg logit contributions
 plains-0.115
 spir-0.114
irl-0.110
 sights-0.110
 jung-0.108
Token and
Token ID290
Feature activation+2.75e-02

Pos logit contributions
 injuries+0.182
 mercy+0.171
 hesitation+0.152
 poisoning+0.144
 forgiveness+0.141
Neg logit contributions
 plains-0.147
 sorts-0.140
ab-0.139
 spir-0.136
aling-0.134
Token dad
Token ID9955
Feature activation+7.02e-02

Pos logit contributions
 injuries+0.034
 hesitation+0.028
t+0.027
 mercy+0.026
 either+0.026
Neg logit contributions
 sorts-0.046
 plains-0.046
 cycl-0.046
 spir-0.046
 rode-0.045
Token before
Token ID878
Feature activation+3.91e-02

Pos logit contributions
 injuries+0.096
 mercy+0.082
 hesitation+0.071
 poisoning+0.070
 blood+0.067
Neg logit contributions
 plains-0.126
 sorts-0.122
 spir-0.122
aling-0.121
ab-0.120
Token he
Token ID339
Feature activation+2.78e-02

Pos logit contributions
 injuries+0.054
 mercy+0.050
 poisoning+0.045
 hesitation+0.044
 fault+0.043
Neg logit contributions
 spir-0.068
 plains-0.067
irl-0.065
 cycl-0.063
aling-0.062
Token ate
Token ID15063
Feature activation-1.98e-02

Pos logit contributions
 injuries+0.048
 mercy+0.044
 hesitation+0.041
 poisoning+0.040
 fault+0.039
Neg logit contributions
 spir-0.040
 plains-0.040
aling-0.039
ab-0.038
irl-0.037
Token wanted
Token ID2227
Feature activation+5.37e-02

Pos logit contributions
 injuries+0.082
 mercy+0.076
 hesitation+0.070
 poisoning+0.070
 permission+0.067
Neg logit contributions
 plains-0.061
 spir-0.060
aling-0.059
ab-0.058
irl-0.057
Token was
Token ID373
Feature activation+1.69e-02

Pos logit contributions
 injuries+0.091
 mercy+0.085
 hesitation+0.077
 severe+0.071
 babys+0.071
Neg logit contributions
ab-0.075
 sorts-0.073
 plains-0.072
aling-0.071
 spir-0.070
Token not
Token ID407
Feature activation+5.55e-02

Pos logit contributions
 injuries+0.027
 mercy+0.024
 hesitation+0.021
 poisoning+0.021
 sleeve+0.020
Neg logit contributions
aling-0.025
 spir-0.025
 plains-0.025
 sorts-0.025
ab-0.025
Token a
Token ID257
Feature activation+2.33e-02

Pos logit contributions
 injuries+0.089
 mercy+0.080
 hesitation+0.074
 poisoning+0.073
 permission+0.070
Neg logit contributions
 plains-0.088
 spir-0.087
aling-0.084
ab-0.083
irl-0.082
Token good
Token ID922
Feature activation+2.84e-02

Pos logit contributions
 injuries+0.044
 mercy+0.041
 hesitation+0.039
 poisoning+0.036
 permission+0.036
Neg logit contributions
ab-0.029
 spir-0.029
 plains-0.028
 sorts-0.026
 cycl-0.026
Token idea
Token ID2126
Feature activation+1.04e-01

Pos logit contributions
 injuries+0.044
 mercy+0.040
 hesitation+0.036
 poisoning+0.036
 permission+0.034
Neg logit contributions
 plains-0.046
 spir-0.046
ab-0.044
aling-0.043
irl-0.043
Token.
Token ID13
Feature activation-6.55e-03

Pos logit contributions
 injuries+0.204
 mercy+0.182
 hesitation+0.162
 severe+0.160
 swollen+0.155
Neg logit contributions
ab-0.126
aling-0.119
 spir-0.115
 sorts-0.114
 plains-0.109
Token From
Token ID3574
Feature activation-6.35e-03

Pos logit contributions
 spir+0.011
irl+0.011
 plains+0.010
ab+0.010
aling+0.010
Neg logit contributions
 injuries-0.010
 mercy-0.009
 poisoning-0.008
 fault-0.008
 forgiveness-0.008
Token then
Token ID788
Feature activation+1.93e-02

Pos logit contributions
 plains+0.011
aling+0.011
irl+0.010
 spir+0.010
ep+0.010
Neg logit contributions
 injuries-0.009
 fault-0.008
 mercy-0.008
t-0.008
 hesitation-0.007
Token on
Token ID319
Feature activation+1.06e-02

Pos logit contributions
 injuries+0.029
 mercy+0.026
 poisoning+0.025
 help+0.025
 hesitation+0.024
Neg logit contributions
 plains-0.032
 spir-0.032
 sorts-0.030
arg-0.029
 array-0.029
Token,
Token ID11
Feature activation+5.21e-03

Pos logit contributions
 injuries+0.016
 mercy+0.014
 permission+0.013
 hesitation+0.013
 poisoning+0.013
Neg logit contributions
 plains-0.018
irl-0.017
 spir-0.017
 cycl-0.016
ab-0.016
Token about
Token ID546
Feature activation+1.53e-02

Pos logit contributions
 injuries+0.095
 mercy+0.087
 hesitation+0.081
 poisoning+0.080
 permission+0.077
Neg logit contributions
 plains-0.068
 spir-0.068
aling-0.065
irl-0.064
ab-0.064
Token the
Token ID262
Feature activation+1.83e-02

Pos logit contributions
 injuries+0.021
 mercy+0.018
 hesitation+0.016
 forgiveness+0.016
 poisoning+0.016
Neg logit contributions
ab-0.030
 plains-0.029
aling-0.029
 spir-0.029
 cycl-0.028
Token rock
Token ID3881
Feature activation+5.24e-03

Pos logit contributions
 injuries+0.035
 mercy+0.032
 poisoning+0.030
 hesitation+0.030
 permission+0.029
Neg logit contributions
 spir-0.023
 plains-0.023
irl-0.022
aling-0.021
ab-0.021
Token and
Token ID290
Feature activation-2.35e-02

Pos logit contributions
 injuries+0.006
 mercy+0.006
 poisoning+0.005
 hesitation+0.005
 permission+0.005
Neg logit contributions
 spir-0.010
 plains-0.010
aling-0.010
irl-0.010
ab-0.010
Token their
Token ID511
Feature activation+1.88e-02

Pos logit contributions
 plains+0.041
 spir+0.041
ab+0.039
aling+0.039
irl+0.039
Neg logit contributions
 injuries-0.034
 mercy-0.030
 hesitation-0.028
 poisoning-0.027
 permission-0.026
Token wishes
Token ID12802
Feature activation+1.04e-01

Pos logit contributions
 injuries+0.030
 mercy+0.028
 poisoning+0.026
 hesitation+0.026
t+0.024
Neg logit contributions
 plains-0.029
 spir-0.029
irl-0.027
 cycl-0.027
ab-0.026
Token.
Token ID13
Feature activation-8.40e-03

Pos logit contributions
 injuries+0.138
 mercy+0.122
 poisoning+0.109
 hesitation+0.109
 permission+0.104
Neg logit contributions
 plains-0.194
 spir-0.193
aling-0.186
ab-0.184
irl-0.184
Token
Token ID198
Feature activation+2.09e-02

Pos logit contributions
 spir+0.014
 plains+0.014
aling+0.013
irl+0.013
ab+0.013
Neg logit contributions
 injuries-0.012
 mercy-0.011
 poisoning-0.010
 hesitation-0.010
 forgiveness-0.010
Token
Token ID198
Feature activation+1.26e-02

Pos logit contributions
 injuries+0.029
 mercy+0.026
 poisoning+0.023
 hesitation+0.023
 forgiveness+0.022
Neg logit contributions
 plains-0.037
 spir-0.037
ab-0.036
aling-0.036
 cycl-0.035
TokenBut
Token ID1537
Feature activation+2.76e-02

Pos logit contributions
 injuries+0.016
 mercy+0.014
 poisoning+0.012
t+0.012
 forgiveness+0.012
Neg logit contributions
 plains-0.024
 spir-0.024
 sorts-0.023
ab-0.023
 sights-0.022
Token the
Token ID262
Feature activation-1.43e-03

Pos logit contributions
 injuries+0.037
 mercy+0.034
 hesitation+0.030
 poisoning+0.029
 fault+0.029
Neg logit contributions
 spir-0.049
 plains-0.049
irl-0.047
aling-0.047
ab-0.046
Token said
Token ID531
Feature activation+3.74e-02

Pos logit contributions
 spir+0.037
 sorts+0.036
 cycl+0.035
 rode+0.034
 zoom+0.034
Neg logit contributions
 injuries-0.044
t-0.038
 fault-0.037
 poisoning-0.037
 mercy-0.036
Token it
Token ID340
Feature activation+3.31e-02

Pos logit contributions
 injuries+0.057
 mercy+0.049
 poisoning+0.047
 hesitation+0.043
 payoff+0.041
Neg logit contributions
 spir-0.060
irl-0.057
 plains-0.055
aw-0.054
 sights-0.054
Token was
Token ID373
Feature activation-2.69e-02

Pos logit contributions
 injuries+0.055
 mercy+0.049
 hesitation+0.046
 poisoning+0.045
 permission+0.044
Neg logit contributions
 spir-0.051
 plains-0.051
aling-0.049
ab-0.048
irl-0.048
Token not
Token ID407
Feature activation+4.21e-02

Pos logit contributions
 spir+0.054
irl+0.053
 plains+0.053
aling+0.052
ab+0.052
Neg logit contributions
 injuries-0.030
 mercy-0.026
 hesitation-0.024
 poisoning-0.022
 forgiveness-0.021
Token their
Token ID511
Feature activation+3.48e-02

Pos logit contributions
 injuries+0.059
 mercy+0.057
 permission+0.050
 poisoning+0.050
 help+0.049
Neg logit contributions
 plains-0.078
 spir-0.077
irl-0.074
aling-0.073
ab-0.072
Token fault
Token ID8046
Feature activation+1.04e-01

Pos logit contributions
 injuries+0.035
 mercy+0.029
 poisoning+0.026
 hesitation+0.026
t+0.022
Neg logit contributions
 spir-0.073
 plains-0.073
 sorts-0.072
 sights-0.072
 zoom-0.071
Token.
Token ID13
Feature activation-2.22e-02

Pos logit contributions
 injuries+0.201
 mercy+0.183
use+0.179
 poisoning+0.175
 hesitation+0.173
Neg logit contributions
 spir-0.100
 zoom-0.096
 track-0.094
 soaring-0.093
 sorts-0.092
Token They
Token ID1119
Feature activation-4.57e-03

Pos logit contributions
 plains+0.020
 spir+0.020
 sorts+0.017
ab+0.017
arg+0.017
Neg logit contributions
 injuries-0.052
 mercy-0.049
 hesitation-0.046
 poisoning-0.045
 permission-0.045
Token said
Token ID531
Feature activation+3.26e-02

Pos logit contributions
 spir+0.007
 sorts+0.007
 cycl+0.007
 plains+0.007
 zoom+0.006
Neg logit contributions
 injuries-0.008
 mercy-0.006
 hesitation-0.006
 fault-0.006
 poisoning-0.006
Token the
Token ID262
Feature activation+8.20e-03

Pos logit contributions
 injuries+0.054
 mercy+0.048
 poisoning+0.046
 sleeve+0.042
 hesitation+0.042
Neg logit contributions
 spir-0.048
irl-0.044
 zoom-0.043
 plains-0.043
 soaring-0.042
Token man
Token ID582
Feature activation+2.64e-02

Pos logit contributions
 injuries+0.015
 mercy+0.014
 poisoning+0.013
 hesitation+0.013
t+0.012
Neg logit contributions
 spir-0.011
 plains-0.010
 zoom-0.010
 cycl-0.010
 sights-0.010
Token.
Token ID13
Feature activation-1.13e-02

Pos logit contributions
 injuries+0.036
 mercy+0.033
 hesitation+0.032
 permission+0.029
 revenge+0.029
Neg logit contributions
ab-0.029
 plains-0.028
 spir-0.027
aling-0.027
 sorts-0.026
Token They
Token ID1119
Feature activation-3.02e-03

Pos logit contributions
 plains+0.016
 spir+0.016
 sorts+0.015
ab+0.014
irl+0.014
Neg logit contributions
 injuries-0.020
 mercy-0.019
 hesitation-0.018
 poisoning-0.018
 permission-0.017
Token still
Token ID991
Feature activation+4.35e-03

Pos logit contributions
 spir+0.005
 plains+0.005
aling+0.005
irl+0.005
ab+0.005
Neg logit contributions
 injuries-0.004
 mercy-0.004
 hesitation-0.003
 poisoning-0.003
 permission-0.003
Token have
Token ID423
Feature activation+1.69e-03

Pos logit contributions
 injuries+0.007
 mercy+0.006
 hesitation+0.005
 poisoning+0.005
 help+0.005
Neg logit contributions
 plains-0.007
 spir-0.007
aling-0.007
irl-0.007
ab-0.007
Token their
Token ID511
Feature activation+3.63e-02

Pos logit contributions
 injuries+0.003
 mercy+0.003
 poisoning+0.003
 hesitation+0.003
 forgiveness+0.003
Neg logit contributions
 plains-0.002
 spir-0.002
irl-0.002
 cycl-0.002
 sorts-0.002
Token wishes
Token ID12802
Feature activation+1.04e-01

Pos logit contributions
 injuries+0.065
 mercy+0.063
 hesitation+0.059
 poisoning+0.059
t+0.054
Neg logit contributions
 spir-0.046
 plains-0.045
irl-0.045
ab-0.041
 sights-0.041
Token.
Token ID13
Feature activation-9.62e-03

Pos logit contributions
 injuries+0.146
 mercy+0.136
 hesitation+0.124
 poisoning+0.120
 permission+0.112
Neg logit contributions
 plains-0.175
 spir-0.175
ab-0.169
irl-0.169
aling-0.166
Token<|endoftext|>
Token ID50256
Feature activation-1.19e-02

Pos logit contributions
 plains+0.015
 spir+0.014
ab+0.013
aling+0.013
 sorts+0.013
Neg logit contributions
 injuries-0.017
 mercy-0.015
 hesitation-0.014
 poisoning-0.014
 permission-0.014
TokenL
Token ID43
Feature activation-2.01e-02

Pos logit contributions
 spir+0.022
 plains+0.022
aling+0.021
 cycl+0.021
irl+0.021
Neg logit contributions
 injuries-0.015
 mercy-0.013
 poisoning-0.012
 hesitation-0.012
 either-0.012
Tokenila
Token ID10102
Feature activation-3.99e-02

Pos logit contributions
 plains+0.033
ab+0.033
 spir+0.032
irl+0.032
 sorts+0.031
Neg logit contributions
 injuries-0.032
 mercy-0.029
 hesitation-0.026
 forgiveness-0.025
 poisoning-0.025
Token and
Token ID290
Feature activation-3.33e-03

Pos logit contributions
 spir+0.079
 plains+0.078
aling+0.076
irl+0.075
 cycl+0.074
Neg logit contributions
 injuries-0.044
 mercy-0.038
 poisoning-0.037
 help-0.034
 hesitation-0.033
Token on
Token ID319
Feature activation-8.53e-03

Pos logit contributions
 plains+0.017
 spir+0.017
irl+0.017
 cycl+0.017
aling+0.016
Neg logit contributions
 injuries-0.017
 mercy-0.015
 hesitation-0.014
 help-0.014
 permission-0.014
Token the
Token ID262
Feature activation-2.10e-02

Pos logit contributions
 spir+0.014
 plains+0.014
aling+0.014
ab+0.014
irl+0.014
Neg logit contributions
 injuries-0.013
 mercy-0.012
 hesitation-0.010
 poisoning-0.010
 permission-0.010
Token lake
Token ID13546
Feature activation-6.63e-02

Pos logit contributions
 spir+0.026
ab+0.026
aling+0.025
 plains+0.025
irl+0.025
Neg logit contributions
 injuries-0.040
 mercy-0.037
 poisoning-0.035
 hesitation-0.034
 permission-0.033
Token in
Token ID287
Feature activation-3.12e-02

Pos logit contributions
aling+0.149
 spir+0.148
irl+0.147
ab+0.147
 cycl+0.147
Neg logit contributions
 injuries-0.056
 mercy-0.045
 poisoning-0.043
 either-0.041
 yet-0.040
Token a
Token ID257
Feature activation-4.24e-02

Pos logit contributions
 plains+0.056
ab+0.056
 spir+0.056
aling+0.055
irl+0.055
Neg logit contributions
 injuries-0.043
 mercy-0.039
 hesitation-0.035
 poisoning-0.034
 permission-0.034
Token kay
Token ID34681
Feature activation-1.47e-01

Pos logit contributions
ab+0.071
aling+0.070
 spir+0.067
 plains+0.066
 cycl+0.066
Neg logit contributions
 injuries-0.058
 mercy-0.050
 help-0.049
 worry-0.049
 poisoning-0.045
Tokenak
Token ID461
Feature activation-4.53e-02

Pos logit contributions
 spir+0.255
 strut+0.236
 sings+0.228
 pian+0.224
 plains+0.223
Neg logit contributions
 injuries-0.196
t-0.189
 poisoning-0.173
 mercy-0.165
guard-0.157
Token.
Token ID13
Feature activation-1.15e-02

Pos logit contributions
irl+0.104
 spir+0.103
aling+0.102
 plains+0.101
ab+0.099
Neg logit contributions
 injuries-0.034
 mercy-0.028
 poisoning-0.025
 either-0.024
t-0.024
Token As
Token ID1081
Feature activation+2.41e-03

Pos logit contributions
 spir+0.021
aling+0.021
irl+0.020
ab+0.020
ep+0.020
Neg logit contributions
 injuries-0.015
 mercy-0.013
 poisoning-0.013
 hesitation-0.012
 yet-0.012
Token they
Token ID484
Feature activation-1.06e-02

Pos logit contributions
 injuries+0.003
 mercy+0.002
 hesitation+0.002
 poisoning+0.002
 forgiveness+0.002
Neg logit contributions
 spir-0.005
 plains-0.005
aling-0.005
ab-0.005
irl-0.005
Token were
Token ID547
Feature activation-1.50e-02

Pos logit contributions
 spir+0.016
 plains+0.016
aling+0.016
ab+0.016
irl+0.015
Neg logit contributions
 injuries-0.017
 mercy-0.015
 hesitation-0.015
 poisoning-0.014
 help-0.014
Token the
Token ID262
Feature activation-2.92e-02

Pos logit contributions
 spir+0.051
 plains+0.048
 sorts+0.047
 jung+0.044
ab+0.042
Neg logit contributions
 injuries-0.065
 mercy-0.060
 poisoning-0.057
t-0.056
 hesitation-0.056
Token air
Token ID1633
Feature activation-2.33e-02

Pos logit contributions
irl+0.040
aling+0.039
 spir+0.038
 cycl+0.037
 plains+0.037
Neg logit contributions
 injuries-0.054
 mercy-0.049
 poisoning-0.045
 help-0.044
 hesitation-0.044
Token,
Token ID11
Feature activation-7.31e-03

Pos logit contributions
 plains+0.046
 spir+0.045
aling+0.044
irl+0.044
ab+0.044
Neg logit contributions
 injuries-0.028
 mercy-0.025
 hesitation-0.022
 poisoning-0.022
 permission-0.021
Token leaving
Token ID4305
Feature activation-1.65e-02

Pos logit contributions
 plains+0.014
aling+0.014
 spir+0.014
irl+0.014
ab+0.014
Neg logit contributions
 injuries-0.009
 mercy-0.008
 poisoning-0.007
 hesitation-0.007
 forgiveness-0.007
Token a
Token ID257
Feature activation-2.80e-02

Pos logit contributions
 spir+0.025
 plains+0.025
 sorts+0.024
ab+0.023
 sights+0.023
Neg logit contributions
 injuries-0.027
 mercy-0.025
 hesitation-0.023
 poisoning-0.023
 fault-0.022
Token spir
Token ID9158
Feature activation-1.47e-01

Pos logit contributions
 plains+0.040
 spir+0.039
aling+0.039
irl+0.038
ab+0.038
Neg logit contributions
 injuries-0.049
 mercy-0.045
 poisoning-0.041
 hesitation-0.041
 forgiveness-0.039
Tokenaling
Token ID4272
Feature activation-5.73e-02

Pos logit contributions
irl+0.241
irling+0.222
 sights+0.207
 sorts+0.199
ep+0.199
Neg logit contributions
 injuries-0.208
 mercy-0.202
 permission-0.202
 help-0.183
 poisoning-0.179
Token trail
Token ID8025
Feature activation-4.73e-02

Pos logit contributions
 plains+0.086
 spir+0.086
irl+0.085
aling+0.085
 cycl+0.081
Neg logit contributions
 injuries-0.093
 mercy-0.084
 poisoning-0.077
 hesitation-0.076
 permission-0.075
Token of
Token ID286
Feature activation-3.33e-02

Pos logit contributions
 plains+0.084
 spir+0.083
irl+0.082
 cycl+0.080
aling+0.078
Neg logit contributions
 injuries-0.065
 mercy-0.058
 poisoning-0.054
t-0.054
 hesitation-0.052
Token black
Token ID2042
Feature activation-3.90e-02

Pos logit contributions
 spir+0.045
 plains+0.045
 sorts+0.041
ab+0.041
irl+0.041
Neg logit contributions
 injuries-0.061
 mercy-0.056
 poisoning-0.053
 hesitation-0.052
 permission-0.050
Token ink
Token ID16882
Feature activation-3.24e-02

Pos logit contributions
 spir+0.048
 plains+0.047
aling+0.046
irl+0.046
ab+0.044
Neg logit contributions
 injuries-0.076
 mercy-0.070
 hesitation-0.065
 poisoning-0.065
 permission-0.063
Token butterfly
Token ID35113
Feature activation-4.72e-02

Pos logit contributions
 plains+0.073
 spir+0.072
aling+0.071
irl+0.069
ab+0.069
Neg logit contributions
 injuries-0.065
 mercy-0.058
 poisoning-0.054
 hesitation-0.053
 permission-0.051
Token,
Token ID11
Feature activation-1.17e-02

Pos logit contributions
 plains+0.097
aling+0.094
 cycl+0.093
irl+0.092
 sorts+0.091
Neg logit contributions
 injuries-0.049
 poisoning-0.040
 mercy-0.040
 either-0.038
 hesitation-0.038
Token tw
Token ID665
Feature activation-1.03e-01

Pos logit contributions
 plains+0.019
aling+0.019
 cycl+0.019
ab+0.019
irl+0.018
Neg logit contributions
 injuries-0.017
 mercy-0.015
 poisoning-0.014
 hesitation-0.014
 blood-0.013
Tokenirling
Token ID24297
Feature activation-7.58e-02

Pos logit contributions
 plains+0.112
 cycl+0.112
 sorts+0.105
aling+0.100
 blasted+0.095
Neg logit contributions
 injuries-0.201
 hesitation-0.171
t-0.167
 sleeve-0.167
 sting-0.165
Token and
Token ID290
Feature activation-5.56e-02

Pos logit contributions
aling+0.113
 plains+0.112
 cycl+0.109
irl+0.108
 spir+0.107
Neg logit contributions
 injuries-0.119
 mercy-0.106
 poisoning-0.104
 hesitation-0.103
 either-0.097
Token spir
Token ID9158
Feature activation-1.46e-01

Pos logit contributions
aling+0.080
ab+0.079
 cycl+0.077
arg+0.077
irl+0.076
Neg logit contributions
 injuries-0.083
 hesitation-0.081
 help-0.076
 to-0.074
 payoff-0.071
Tokenaling
Token ID4272
Feature activation-5.57e-02

Pos logit contributions
irl+0.261
ep+0.246
irling+0.244
 plains+0.230
 sights+0.229
Neg logit contributions
 permission-0.162
 help-0.153
t-0.149
 injuries-0.149
 hesitation-0.148
Token in
Token ID287
Feature activation-3.76e-02

Pos logit contributions
aling+0.097
irl+0.097
 plains+0.096
 cycl+0.094
arg+0.092
Neg logit contributions
 injuries-0.073
 mercy-0.065
 hesitation-0.062
 poisoning-0.060
 permission-0.059
Token the
Token ID262
Feature activation-3.37e-02

Pos logit contributions
 plains+0.056
 spir+0.055
aling+0.054
irl+0.053
ab+0.053
Neg logit contributions
 injuries-0.063
 mercy-0.058
 hesitation-0.054
 poisoning-0.053
 permission-0.051
Token air
Token ID1633
Feature activation-3.40e-02

Pos logit contributions
irl+0.054
aling+0.053
arg+0.052
 cycl+0.050
ab+0.049
Neg logit contributions
 injuries-0.050
 mercy-0.045
 hesitation-0.045
 help-0.044
 payoff-0.043
Token.
Token ID13
Feature activation-1.76e-02

Pos logit contributions
 plains+0.066
 spir+0.066
aling+0.064
irl+0.063
ab+0.063
Neg logit contributions
 injuries-0.041
 mercy-0.036
 hesitation-0.033
 poisoning-0.032
 permission-0.030
Token different
Token ID1180
Feature activation-3.35e-02

Pos logit contributions
 plains+0.069
 spir+0.068
aling+0.067
ab+0.066
irl+0.066
Neg logit contributions
 injuries-0.071
 mercy-0.064
 hesitation-0.059
 poisoning-0.058
 permission-0.057
Token styles
Token ID12186
Feature activation-1.98e-02

Pos logit contributions
 spir+0.042
 plains+0.042
aling+0.042
irl+0.041
ab+0.040
Neg logit contributions
 injuries-0.063
 mercy-0.058
 hesitation-0.054
 poisoning-0.054
 permission-0.053
Token like
Token ID588
Feature activation-1.25e-02

Pos logit contributions
 plains+0.041
 spir+0.040
aling+0.039
irl+0.039
 cycl+0.039
Neg logit contributions
 injuries-0.022
 mercy-0.019
 poisoning-0.017
 hesitation-0.016
 permission-0.016
Token stars
Token ID5788
Feature activation-2.45e-02

Pos logit contributions
 plains+0.020
aling+0.019
 spir+0.019
ab+0.019
irl+0.019
Neg logit contributions
 injuries-0.020
 mercy-0.018
 hesitation-0.017
 poisoning-0.016
 permission-0.016
Token or
Token ID393
Feature activation-3.22e-02

Pos logit contributions
 plains+0.050
 spir+0.049
aling+0.049
irl+0.049
 cycl+0.048
Neg logit contributions
 injuries-0.027
 mercy-0.023
 poisoning-0.021
 hesitation-0.021
 either-0.020
Token spir
Token ID9158
Feature activation-1.46e-01

Pos logit contributions
 spir+0.041
 plains+0.041
aling+0.038
irl+0.037
ab+0.037
Neg logit contributions
 injuries-0.062
 mercy-0.057
 poisoning-0.053
 hesitation-0.053
 permission-0.051
Tokenals
Token ID874
Feature activation-3.10e-02

Pos logit contributions
irl+0.244
irling+0.230
ep+0.223
 sorts+0.222
 cycl+0.220
Neg logit contributions
 injuries-0.191
 permission-0.179
 help-0.169
 nor-0.167
 fault-0.165
Token.
Token ID13
Feature activation-1.79e-02

Pos logit contributions
 plains+0.059
 spir+0.059
aling+0.057
irl+0.056
ab+0.056
Neg logit contributions
 injuries-0.039
 mercy-0.035
 hesitation-0.031
 poisoning-0.031
 permission-0.029
Token Everyone
Token ID11075
Feature activation-2.27e-02

Pos logit contributions
irl+0.032
 spir+0.032
aling+0.031
ep+0.030
 cycl+0.030
Neg logit contributions
 injuries-0.022
 mercy-0.020
 help-0.019
 forgiveness-0.019
 either-0.019
Token was
Token ID373
Feature activation-4.22e-02

Pos logit contributions
 plains+0.041
aling+0.039
 spir+0.038
irl+0.038
arg+0.038
Neg logit contributions
 mercy-0.027
 injuries-0.026
 help-0.025
 forgiveness-0.024
 poisoning-0.024
Token so
Token ID523
Feature activation-1.27e-02

Pos logit contributions
 plains+0.064
 spir+0.063
aling+0.061
ab+0.060
irl+0.060
Neg logit contributions
 injuries-0.070
 mercy-0.064
 hesitation-0.060
 poisoning-0.059
 permission-0.057
Token forth
Token ID6071
Feature activation-1.10e-02

Pos logit contributions
ab+0.081
arg+0.079
aling+0.077
irl+0.077
 plains+0.077
Neg logit contributions
 injuries-0.070
 to-0.065
 help-0.064
 poisoning-0.064
 mercy-0.061
Token in
Token ID287
Feature activation-3.60e-02

Pos logit contributions
 plains+0.023
 spir+0.023
aling+0.022
irl+0.022
ab+0.022
Neg logit contributions
 injuries-0.012
 mercy-0.011
 poisoning-0.009
 hesitation-0.009
 permission-0.009
Token the
Token ID262
Feature activation-4.43e-02

Pos logit contributions
 spir+0.055
 plains+0.054
 sorts+0.051
aling+0.050
irl+0.050
Neg logit contributions
 injuries-0.061
 mercy-0.056
 poisoning-0.052
 hesitation-0.051
 permission-0.049
Token sky
Token ID6766
Feature activation-4.08e-02

Pos logit contributions
aling+0.075
irl+0.073
ab+0.072
arg+0.071
 sorts+0.070
Neg logit contributions
 injuries-0.064
 mercy-0.053
 blood-0.052
 help-0.052
 last-0.050
Token,
Token ID11
Feature activation-1.50e-02

Pos logit contributions
 spir+0.088
irl+0.087
aling+0.087
 plains+0.087
 cycl+0.085
Neg logit contributions
 injuries-0.040
 mercy-0.033
 hesitation-0.030
 poisoning-0.029
 help-0.027
Token spir
Token ID9158
Feature activation-1.45e-01

Pos logit contributions
irl+0.026
 plains+0.026
aling+0.025
ab+0.025
arg+0.025
Neg logit contributions
 injuries-0.019
 poisoning-0.017
 help-0.016
 hesitation-0.016
 mercy-0.016
Tokenalling
Token ID9221
Feature activation-6.11e-02

Pos logit contributions
irl+0.229
irling+0.213
 sorts+0.205
 array+0.200
itions+0.195
Neg logit contributions
 injuries-0.199
 permission-0.190
 mercy-0.187
 help-0.187
t-0.183
Token higher
Token ID2440
Feature activation-4.36e-02

Pos logit contributions
 spir+0.101
 plains+0.101
aling+0.098
irl+0.097
ab+0.096
Neg logit contributions
 injuries-0.092
 mercy-0.083
 poisoning-0.076
 hesitation-0.076
 permission-0.072
Token and
Token ID290
Feature activation-6.78e-02

Pos logit contributions
 plains+0.081
 spir+0.080
ab+0.075
aling+0.075
irl+0.074
Neg logit contributions
 injuries-0.061
 mercy-0.055
 hesitation-0.049
 poisoning-0.049
 forgiveness-0.048
Token higher
Token ID2440
Feature activation-4.35e-02

Pos logit contributions
irl+0.106
arg+0.105
 cycl+0.103
ab+0.102
 plains+0.095
Neg logit contributions
 injuries-0.089
 first-0.089
 last-0.086
 yet-0.086
 mercy-0.084
Token.
Token ID13
Feature activation-1.58e-02

Pos logit contributions
 plains+0.081
 spir+0.081
aling+0.077
ab+0.076
irl+0.076
Neg logit contributions
 injuries-0.058
 mercy-0.052
 poisoning-0.047
 hesitation-0.047
 forgiveness-0.045
Token green
Token ID4077
Feature activation-2.11e-02

Pos logit contributions
 spir+0.024
 plains+0.024
aling+0.022
ab+0.022
irl+0.022
Neg logit contributions
 injuries-0.035
 mercy-0.032
 poisoning-0.030
 hesitation-0.030
 permission-0.029
Token and
Token ID290
Feature activation-3.27e-02

Pos logit contributions
 plains+0.032
 spir+0.032
ab+0.030
 sights+0.028
 jung+0.028
Neg logit contributions
 injuries-0.037
 mercy-0.034
 poisoning-0.031
 hesitation-0.030
 forgiveness-0.030
Token a
Token ID257
Feature activation-1.88e-02

Pos logit contributions
aling+0.048
ab+0.046
irl+0.045
 plains+0.045
 cycl+0.045
Neg logit contributions
 injuries-0.052
 mercy-0.050
 help-0.048
 hesitation-0.047
 poisoning-0.045
Token long
Token ID890
Feature activation-3.66e-02

Pos logit contributions
aling+0.026
 plains+0.026
 spir+0.025
irl+0.025
ab+0.024
Neg logit contributions
 injuries-0.034
 mercy-0.031
 hesitation-0.029
 poisoning-0.028
 forgiveness-0.027
Token,
Token ID11
Feature activation-1.69e-02

Pos logit contributions
aling+0.053
 spir+0.052
 plains+0.052
irl+0.050
 cycl+0.050
Neg logit contributions
 injuries-0.063
 mercy-0.056
 hesitation-0.053
 poisoning-0.052
 permission-0.050
Token spir
Token ID9158
Feature activation-1.42e-01

Pos logit contributions
 plains+0.024
aling+0.024
irl+0.024
 cycl+0.023
 spir+0.023
Neg logit contributions
 injuries-0.028
 mercy-0.026
 hesitation-0.025
 poisoning-0.024
 help-0.023
Tokenalling
Token ID9221
Feature activation-5.61e-02

Pos logit contributions
irl+0.249
irling+0.229
ep+0.223
 sorts+0.218
 sights+0.216
Neg logit contributions
 injuries-0.178
 permission-0.175
 mercy-0.171
 help-0.165
 hesitation-0.164
Token tail
Token ID7894
Feature activation-2.79e-02

Pos logit contributions
 spir+0.094
 plains+0.094
irl+0.093
aling+0.093
 cycl+0.090
Neg logit contributions
 injuries-0.081
 mercy-0.072
 hesitation-0.066
 poisoning-0.066
 permission-0.064
Token.
Token ID13
Feature activation-2.29e-02

Pos logit contributions
 plains+0.052
 spir+0.052
aling+0.051
irl+0.051
ab+0.050
Neg logit contributions
 injuries-0.035
 mercy-0.031
 hesitation-0.028
 poisoning-0.028
 permission-0.026
Token The
Token ID383
Feature activation-8.50e-03

Pos logit contributions
 spir+0.036
irl+0.036
 plains+0.036
aling+0.035
 cycl+0.034
Neg logit contributions
 injuries-0.034
 mercy-0.031
 hesitation-0.029
 poisoning-0.029
 forgiveness-0.028
Token princess
Token ID21752
Feature activation-4.96e-02

Pos logit contributions
 spir+0.012
 plains+0.012
aling+0.012
irl+0.012
 cycl+0.011
Neg logit contributions
 injuries-0.015
 mercy-0.014
 hesitation-0.013
 poisoning-0.013
 forgiveness-0.012
Token Fro
Token ID9734
Feature activation-2.35e-02

Pos logit contributions
 injuries+0.053
 mercy+0.047
 hesitation+0.042
 poisoning+0.042
 permission+0.042
Neg logit contributions
 spir-0.048
 plains-0.047
 sorts-0.044
aling-0.044
 cycl-0.044
Tokenggy
Token ID19970
Feature activation-2.04e-02

Pos logit contributions
 spir+0.035
irl+0.034
 plains+0.034
ab+0.033
aling+0.032
Neg logit contributions
 injuries-0.039
 mercy-0.036
 hesitation-0.035
 permission-0.034
 poisoning-0.033
Token,
Token ID11
Feature activation+3.09e-02

Pos logit contributions
 plains+0.045
 spir+0.045
aling+0.043
ab+0.043
irl+0.043
Neg logit contributions
 injuries-0.020
 mercy-0.017
 poisoning-0.015
 hesitation-0.015
 permission-0.014
Token in
Token ID287
Feature activation-2.76e-02

Pos logit contributions
 injuries+0.052
 mercy+0.047
 hesitation+0.043
 poisoning+0.043
 permission+0.042
Neg logit contributions
 spir-0.047
 plains-0.047
aling-0.044
ab-0.044
irl-0.044
Token a
Token ID257
Feature activation-4.32e-02

Pos logit contributions
 plains+0.047
 spir+0.047
aling+0.046
ab+0.045
irl+0.045
Neg logit contributions
 injuries-0.041
 mercy-0.036
 hesitation-0.033
 poisoning-0.033
 permission-0.031
Token kay
Token ID34681
Feature activation-1.42e-01

Pos logit contributions
 plains+0.065
 spir+0.065
aling+0.064
ab+0.062
 cycl+0.061
Neg logit contributions
 injuries-0.072
 mercy-0.063
 poisoning-0.060
 hesitation-0.060
 help-0.058
Tokenak
Token ID461
Feature activation-3.54e-02

Pos logit contributions
 spir+0.219
 plains+0.201
 strut+0.195
 cycl+0.194
 sings+0.184
Neg logit contributions
 injuries-0.230
 poisoning-0.202
 mercy-0.201
t-0.196
 hesitation-0.177
Token.
Token ID13
Feature activation-3.60e-03

Pos logit contributions
 spir+0.077
 plains+0.076
irl+0.076
aling+0.074
 sorts+0.074
Neg logit contributions
 injuries-0.031
 mercy-0.025
 poisoning-0.025
 help-0.024
 hesitation-0.023
Token Fro
Token ID9734
Feature activation-4.19e-02

Pos logit contributions
 spir+0.006
 plains+0.006
aling+0.006
ab+0.006
irl+0.006
Neg logit contributions
 injuries-0.005
 mercy-0.004
 poisoning-0.004
 hesitation-0.004
 permission-0.004
Tokenggy
Token ID19970
Feature activation-5.12e-02

Pos logit contributions
 spir+0.064
irl+0.063
 plains+0.063
ab+0.061
aling+0.059
Neg logit contributions
 injuries-0.070
 mercy-0.064
 hesitation-0.060
 permission-0.059
 poisoning-0.057
Token said
Token ID531
Feature activation+3.73e-02

Pos logit contributions
 plains+0.092
 spir+0.091
aling+0.090
irl+0.088
ab+0.088
Neg logit contributions
 injuries-0.068
 mercy-0.061
 poisoning-0.056
 hesitation-0.056
 help-0.052
Token saw
Token ID2497
Feature activation-2.38e-02

Pos logit contributions
 plains+0.023
 spir+0.022
ab+0.021
aling+0.021
irl+0.021
Neg logit contributions
 injuries-0.020
 mercy-0.019
 help-0.018
 hesitation-0.017
 poisoning-0.017
Token a
Token ID257
Feature activation-2.38e-02

Pos logit contributions
 spir+0.037
 plains+0.036
 sorts+0.034
aling+0.033
irl+0.033
Neg logit contributions
 injuries-0.040
 mercy-0.035
 poisoning-0.033
 hesitation-0.033
 permission-0.031
Token man
Token ID582
Feature activation-1.06e-02

Pos logit contributions
aling+0.039
 plains+0.038
 spir+0.037
ab+0.037
irl+0.036
Neg logit contributions
 injuries-0.038
 mercy-0.034
 poisoning-0.032
 hesitation-0.031
 forgiveness-0.030
Token in
Token ID287
Feature activation-3.21e-02

Pos logit contributions
 plains+0.018
 spir+0.017
irl+0.017
ab+0.016
 cycl+0.016
Neg logit contributions
 injuries-0.017
 mercy-0.015
 hesitation-0.014
 poisoning-0.014
 help-0.013
Token a
Token ID257
Feature activation-4.54e-02

Pos logit contributions
 spir+0.052
 plains+0.051
aling+0.049
irl+0.048
 sorts+0.048
Neg logit contributions
 injuries-0.050
 mercy-0.046
 poisoning-0.042
 hesitation-0.042
 permission-0.040
Token kay
Token ID34681
Feature activation-1.39e-01

Pos logit contributions
 spir+0.069
aling+0.068
 plains+0.068
irl+0.068
ab+0.066
Neg logit contributions
 injuries-0.075
 mercy-0.065
 poisoning-0.063
 hesitation-0.060
 help-0.060
Tokenak
Token ID461
Feature activation-2.97e-02

Pos logit contributions
 spir+0.235
 strut+0.216
 plains+0.205
 sings+0.199
 cycl+0.197
Neg logit contributions
 injuries-0.207
t-0.185
 poisoning-0.183
 mercy-0.181
guard-0.162
Token.
Token ID13
Feature activation-1.85e-02

Pos logit contributions
 spir+0.066
irl+0.066
 plains+0.065
 cycl+0.063
aling+0.062
Neg logit contributions
 injuries-0.025
 mercy-0.021
 poisoning-0.019
 either-0.019
 hesitation-0.018
Token The
Token ID383
Feature activation-9.93e-03

Pos logit contributions
 spir+0.030
 plains+0.030
irl+0.030
aling+0.029
ab+0.029
Neg logit contributions
 injuries-0.028
 mercy-0.025
 hesitation-0.023
 poisoning-0.023
 forgiveness-0.022
Token man
Token ID582
Feature activation-2.58e-02

Pos logit contributions
 spir+0.020
 plains+0.020
aling+0.020
ab+0.019
irl+0.019
Neg logit contributions
 injuries-0.012
 mercy-0.010
 hesitation-0.009
 poisoning-0.009
 permission-0.009
Token said
Token ID531
Feature activation+3.20e-02

Pos logit contributions
 plains+0.048
 spir+0.047
irl+0.046
aling+0.046
 sorts+0.045
Neg logit contributions
 injuries-0.032
 mercy-0.030
 poisoning-0.026
 hesitation-0.026
 either-0.026
Token liked
Token ID8288
Feature activation-7.78e-03

Pos logit contributions
 plains+0.003
 spir+0.003
aling+0.002
ab+0.002
irl+0.002
Neg logit contributions
 injuries-0.002
 help-0.002
 mercy-0.002
 poisoning-0.002
 hesitation-0.002
Token to
Token ID284
Feature activation-1.30e-02

Pos logit contributions
 spir+0.018
 plains+0.018
aling+0.018
ab+0.018
irl+0.018
Neg logit contributions
 injuries-0.006
 mercy-0.005
 hesitation-0.004
 poisoning-0.004
 permission-0.004
Token paddle
Token ID39517
Feature activation-6.00e-02

Pos logit contributions
 plains+0.022
ab+0.022
 spir+0.021
aling+0.021
 sorts+0.019
Neg logit contributions
 injuries-0.019
 help-0.018
 mercy-0.017
 hesitation-0.016
 worry-0.016
Token in
Token ID287
Feature activation-2.54e-02

Pos logit contributions
 plains+0.099
 spir+0.098
irl+0.093
ab+0.092
aling+0.092
Neg logit contributions
 injuries-0.095
 mercy-0.085
 hesitation-0.078
 poisoning-0.077
 forgiveness-0.074
Token the
Token ID262
Feature activation-2.57e-02

Pos logit contributions
 plains+0.043
 spir+0.043
aling+0.042
ab+0.041
irl+0.041
Neg logit contributions
 injuries-0.038
 mercy-0.034
 hesitation-0.031
 poisoning-0.031
 permission-0.029
Token kay
Token ID34681
Feature activation-1.39e-01

Pos logit contributions
aling+0.037
ab+0.037
 spir+0.037
irl+0.036
 plains+0.036
Neg logit contributions
 injuries-0.044
 mercy-0.040
 poisoning-0.037
 hesitation-0.036
 help-0.036
Tokenak
Token ID461
Feature activation-2.64e-02

Pos logit contributions
 spir+0.236
 strut+0.216
 plains+0.204
 cycl+0.204
 array+0.202
Neg logit contributions
 injuries-0.194
t-0.181
 poisoning-0.173
 mercy-0.161
guard-0.155
Token and
Token ID290
Feature activation-3.78e-02

Pos logit contributions
 spir+0.055
 plains+0.053
irl+0.053
aling+0.053
ab+0.051
Neg logit contributions
 injuries-0.028
 mercy-0.025
 poisoning-0.022
 hesitation-0.022
 permission-0.021
Token splash
Token ID22870
Feature activation-6.43e-02

Pos logit contributions
ab+0.065
 plains+0.063
 spir+0.063
aling+0.061
irl+0.060
Neg logit contributions
 injuries-0.055
 help-0.050
 mercy-0.049
 hesitation-0.047
 permission-0.046
Token in
Token ID287
Feature activation-3.03e-02

Pos logit contributions
 plains+0.097
 spir+0.096
aling+0.093
irl+0.091
ab+0.091
Neg logit contributions
 injuries-0.107
 mercy-0.098
 poisoning-0.090
 hesitation-0.090
 permission-0.087
Token the
Token ID262
Feature activation-2.39e-02

Pos logit contributions
 plains+0.054
 spir+0.053
aling+0.053
ab+0.053
irl+0.052
Neg logit contributions
 injuries-0.042
 mercy-0.037
 hesitation-0.034
 poisoning-0.033
 permission-0.032
Token and
Token ID290
Feature activation-4.62e-02

Pos logit contributions
 plains+0.000
 spir+0.000
ab+0.000
 sorts+0.000
aling+0.000
Neg logit contributions
 injuries-0.000
 mercy-0.000
 hesitation-0.000
 poisoning-0.000
 forgiveness-0.000
Token exciting
Token ID7895
Feature activation-6.44e-03

Pos logit contributions
ab+0.090
aling+0.088
 plains+0.088
irl+0.087
 spir+0.085
Neg logit contributions
 injuries-0.055
 mercy-0.049
 hesitation-0.047
 help-0.044
 permission-0.043
Token.
Token ID13
Feature activation-2.08e-02

Pos logit contributions
 plains+0.012
aling+0.012
 spir+0.012
ab+0.012
irl+0.012
Neg logit contributions
 injuries-0.008
 mercy-0.007
 hesitation-0.007
 poisoning-0.007
 permission-0.007
Token A
Token ID317
Feature activation-2.03e-02

Pos logit contributions
irl+0.040
ep+0.039
aling+0.038
 spir+0.037
ab+0.037
Neg logit contributions
 injuries-0.026
 mercy-0.023
 hesitation-0.022
 poisoning-0.022
 help-0.021
Token baby
Token ID5156
Feature activation-2.59e-02

Pos logit contributions
aling+0.033
 plains+0.032
irl+0.031
ab+0.031
 spir+0.031
Neg logit contributions
 injuries-0.031
 mercy-0.029
 hesitation-0.027
 help-0.027
 poisoning-0.026
Token fer
Token ID11354
Feature activation-1.38e-01

Pos logit contributions
 plains+0.045
 spir+0.045
aling+0.044
irl+0.043
ab+0.043
Neg logit contributions
 injuries-0.037
 mercy-0.033
 poisoning-0.030
 hesitation-0.030
 forgiveness-0.029
Tokenret
Token ID1186
Feature activation-5.24e-02

Pos logit contributions
 spir+0.198
 plains+0.198
 cycl+0.179
 sights+0.175
arg+0.174
Neg logit contributions
 injuries-0.230
 mercy-0.217
 poisoning-0.199
t-0.199
 hesitation-0.193
Token!
Token ID0
Feature activation-2.77e-02

Pos logit contributions
 plains+0.083
 spir+0.080
 sorts+0.080
irl+0.079
aling+0.077
Neg logit contributions
 injuries-0.080
 mercy-0.073
 hesitation-0.068
 poisoning-0.068
t-0.068
Token
Token ID220
Feature activation-5.82e-03

Pos logit contributions
irl+0.052
ep+0.050
 spir+0.049
aling+0.049
 plains+0.049
Neg logit contributions
 injuries-0.035
 mercy-0.032
 hesitation-0.030
 poisoning-0.029
 forgiveness-0.029
Token
Token ID198
Feature activation+1.47e-03

Pos logit contributions
 spir+0.012
 plains+0.012
aling+0.012
irl+0.012
 cycl+0.012
Neg logit contributions
 injuries-0.006
 mercy-0.005
 fault-0.005
 poisoning-0.005
 hesitation-0.004
Token
Token ID198
Feature activation-1.80e-03

Pos logit contributions
 injuries+0.002
 mercy+0.001
 poisoning+0.001
 hesitation+0.001
 help+0.001
Neg logit contributions
 plains-0.003
 spir-0.003
aling-0.003
ab-0.003
irl-0.003
Token Anne
Token ID15397
Feature activation-3.34e-03

Pos logit contributions
 spir+0.008
 plains+0.008
aling+0.007
irl+0.007
ab+0.007
Neg logit contributions
 injuries-0.009
 mercy-0.008
 hesitation-0.008
 poisoning-0.008
 permission-0.007
Token began
Token ID2540
Feature activation-2.47e-02

Pos logit contributions
 plains+0.005
 spir+0.005
irl+0.004
ab+0.004
aling+0.004
Neg logit contributions
 injuries-0.006
 mercy-0.005
 help-0.005
 hesitation-0.005
 permission-0.005
Token to
Token ID284
Feature activation-2.10e-02

Pos logit contributions
 plains+0.066
ab+0.066
aling+0.065
 spir+0.065
irl+0.064
Neg logit contributions
 injuries-0.012
 mercy-0.008
 help-0.006
 hesitation-0.006
 permission-0.005
Token build
Token ID1382
Feature activation-4.05e-02

Pos logit contributions
 plains+0.035
ab+0.035
aling+0.034
 spir+0.034
aw+0.033
Neg logit contributions
 injuries-0.030
 help-0.030
 mercy-0.028
 worry-0.027
 permission-0.026
Token a
Token ID257
Feature activation-2.84e-02

Pos logit contributions
ab+0.083
aling+0.082
 plains+0.078
 cycl+0.076
irl+0.076
Neg logit contributions
 injuries-0.047
 mercy-0.042
 help-0.037
 either-0.036
 hesitation-0.035
Token kay
Token ID34681
Feature activation-1.37e-01

Pos logit contributions
aling+0.045
ab+0.045
 plains+0.042
 spir+0.042
irl+0.042
Neg logit contributions
 injuries-0.046
 mercy-0.040
 help-0.038
 poisoning-0.036
 worry-0.036
Tokenak
Token ID461
Feature activation-3.45e-02

Pos logit contributions
 spir+0.232
 strut+0.221
 sings+0.218
 plains+0.210
 cycl+0.205
Neg logit contributions
 injuries-0.188
t-0.177
 mercy-0.166
 poisoning-0.160
guard-0.154
Token.
Token ID13
Feature activation-2.39e-02

Pos logit contributions
 plains+0.072
 spir+0.072
irl+0.071
 cycl+0.070
ab+0.069
Neg logit contributions
 injuries-0.033
 mercy-0.031
 help-0.027
 poisoning-0.026
 permission-0.025
Token They
Token ID1119
Feature activation-2.32e-02

Pos logit contributions
irl+0.041
 spir+0.040
aling+0.039
ab+0.038
irling+0.037
Neg logit contributions
 injuries-0.031
 mercy-0.030
 help-0.027
 hesitation-0.027
 poisoning-0.027
Token found
Token ID1043
Feature activation-3.40e-02

Pos logit contributions
 plains+0.034
ab+0.033
aling+0.032
 spir+0.032
irl+0.032
Neg logit contributions
 injuries-0.036
 help-0.035
 mercy-0.034
 hesitation-0.032
 permission-0.032
Token some
Token ID617
Feature activation+3.08e-03

Pos logit contributions
 plains+0.048
 spir+0.048
 sorts+0.044
aling+0.044
ab+0.044
Neg logit contributions
 injuries-0.062
 mercy-0.055
 poisoning-0.052
 hesitation-0.052
 forgiveness-0.049
Token and
Token ID290
Feature activation-1.32e-02

Pos logit contributions
 spir+0.055
 plains+0.055
irl+0.053
aling+0.053
 sorts+0.051
Neg logit contributions
 injuries-0.023
 mercy-0.023
 hesitation-0.020
 help-0.020
 permission-0.019
Token Anne
Token ID15397
Feature activation-3.45e-03

Pos logit contributions
 spir+0.019
 plains+0.019
aling+0.018
irl+0.018
ab+0.017
Neg logit contributions
 injuries-0.023
 mercy-0.022
 hesitation-0.020
 poisoning-0.020
 permission-0.019
Token jumped
Token ID11687
Feature activation-5.78e-02

Pos logit contributions
 plains+0.006
 spir+0.005
 sights+0.005
phony+0.005
irl+0.005
Neg logit contributions
 help-0.005
 hesitation-0.005
 mercy-0.005
 permission-0.005
 injuries-0.005
Token in
Token ID287
Feature activation-1.95e-02

Pos logit contributions
 plains+0.096
 spir+0.093
aling+0.091
irl+0.090
ab+0.090
Neg logit contributions
 injuries-0.083
 mercy-0.077
 help-0.072
 hesitation-0.071
 poisoning-0.070
Token the
Token ID262
Feature activation-3.56e-02

Pos logit contributions
 spir+0.030
 plains+0.029
 sorts+0.028
irl+0.027
 jung+0.026
Neg logit contributions
 injuries-0.032
 mercy-0.030
 poisoning-0.028
t-0.027
 fault-0.027
Token kay
Token ID34681
Feature activation-1.37e-01

Pos logit contributions
irl+0.059
ab+0.058
 sorts+0.056
aling+0.056
vel+0.054
Neg logit contributions
 injuries-0.049
 help-0.045
 mercy-0.041
 hesitation-0.041
 last-0.041
Tokenak
Token ID461
Feature activation-2.91e-02

Pos logit contributions
 spir+0.235
 strut+0.227
 sings+0.217
 soaring+0.211
 planets+0.210
Neg logit contributions
 injuries-0.178
t-0.169
 poisoning-0.156
 mercy-0.150
 sleeve-0.145
Token and
Token ID290
Feature activation-5.70e-02

Pos logit contributions
 spir+0.061
irl+0.061
 plains+0.061
aling+0.060
ab+0.059
Neg logit contributions
 injuries-0.029
 mercy-0.026
 hesitation-0.023
 poisoning-0.022
 permission-0.021
Token padd
Token ID14098
Feature activation-9.86e-02

Pos logit contributions
ab+0.098
irl+0.098
vel+0.096
arg+0.095
 plains+0.095
Neg logit contributions
 help-0.067
 to-0.066
 Dad-0.058
 mom-0.057
 first-0.057
Tokenled
Token ID992
Feature activation-5.15e-02

Pos logit contributions
 spir+0.164
 plains+0.157
irl+0.151
arg+0.149
aling+0.149
Neg logit contributions
 injuries-0.143
 mercy-0.136
 hesitation-0.122
 poisoning-0.120
 permission-0.119
Token away
Token ID1497
Feature activation+6.64e-03

Pos logit contributions
aling+0.082
 plains+0.081
ab+0.080
irl+0.080
 spir+0.078
Neg logit contributions
 injuries-0.076
 mercy-0.071
 hesitation-0.067
 help-0.067
 poisoning-0.063
Token river
Token ID7850
Feature activation-5.35e-02

Pos logit contributions
ab+0.030
aling+0.030
 spir+0.029
 plains+0.029
 cycl+0.029
Neg logit contributions
 injuries-0.040
 mercy-0.037
 poisoning-0.035
 hesitation-0.034
 help-0.033
Token.
Token ID13
Feature activation+2.45e-03

Pos logit contributions
 spir+0.105
 plains+0.105
aling+0.101
irl+0.101
ab+0.101
Neg logit contributions
 injuries-0.064
 mercy-0.057
 poisoning-0.051
 hesitation-0.050
 permission-0.048
Token They
Token ID1119
Feature activation-4.01e-03

Pos logit contributions
 injuries+0.004
 mercy+0.003
 poisoning+0.003
 hesitation+0.003
 either+0.003
Neg logit contributions
 spir-0.004
 plains-0.004
aling-0.004
irl-0.004
ab-0.004
Token saw
Token ID2497
Feature activation-2.52e-02

Pos logit contributions
 plains+0.006
 spir+0.006
aling+0.006
ab+0.006
irl+0.006
Neg logit contributions
 injuries-0.006
 mercy-0.005
 poisoning-0.005
 help-0.005
 hesitation-0.005
Token a
Token ID257
Feature activation-2.16e-02

Pos logit contributions
 plains+0.036
 spir+0.036
 sorts+0.032
 jung+0.032
 sights+0.031
Neg logit contributions
 injuries-0.047
 mercy-0.041
 poisoning-0.039
 hesitation-0.039
 permission-0.037
Token kay
Token ID34681
Feature activation-1.37e-01

Pos logit contributions
aling+0.036
 plains+0.035
ab+0.035
 spir+0.034
 cycl+0.033
Neg logit contributions
 injuries-0.032
 mercy-0.030
 poisoning-0.028
 hesitation-0.026
 help-0.026
Tokenak
Token ID461
Feature activation-3.06e-02

Pos logit contributions
 spir+0.204
 plains+0.191
 strut+0.187
 cycl+0.187
 sings+0.178
Neg logit contributions
 injuries-0.219
t-0.203
 poisoning-0.200
 mercy-0.195
guard-0.177
Token on
Token ID319
Feature activation-3.52e-02

Pos logit contributions
 plains+0.063
irl+0.062
 spir+0.062
aling+0.062
ab+0.061
Neg logit contributions
 injuries-0.029
 poisoning-0.025
 mercy-0.025
 hesitation-0.024
t-0.023
Token the
Token ID262
Feature activation-2.24e-02

Pos logit contributions
 plains+0.060
 spir+0.060
aling+0.058
ab+0.057
irl+0.057
Neg logit contributions
 injuries-0.052
 mercy-0.047
 hesitation-0.043
 poisoning-0.043
 permission-0.041
Token shore
Token ID15191
Feature activation-4.73e-02

Pos logit contributions
 plains+0.028
 spir+0.027
 jung+0.023
 track+0.022
 sights+0.022
Neg logit contributions
 injuries-0.046
 mercy-0.042
 hesitation-0.041
 forgiveness-0.040
 poisoning-0.038
Token.
Token ID13
Feature activation-5.67e-03

Pos logit contributions
 plains+0.090
 spir+0.090
aling+0.086
ab+0.085
irl+0.085
Neg logit contributions
 injuries-0.061
 mercy-0.054
 hesitation-0.048
 poisoning-0.048
 forgiveness-0.045
Token for
Token ID329
Feature activation+2.14e-02

Pos logit contributions
 plains+0.009
 spir+0.009
ab+0.008
aling+0.008
irl+0.008
Neg logit contributions
 injuries-0.008
 mercy-0.007
 poisoning-0.007
 hesitation-0.007
 permission-0.007
Token a
Token ID257
Feature activation+2.21e-02

Pos logit contributions
 injuries+0.035
 mercy+0.032
 poisoning+0.030
 fault+0.030
 hesitation+0.029
Neg logit contributions
 spir-0.032
 plains-0.032
irl-0.031
aling-0.030
 cycl-0.029
Token ride
Token ID6594
Feature activation-2.96e-02

Pos logit contributions
 injuries+0.039
 mercy+0.036
 hesitation+0.035
 poisoning+0.034
 help+0.034
Neg logit contributions
 plains-0.030
 spir-0.030
aling-0.030
ab-0.028
irl-0.028
Token in
Token ID287
Feature activation-1.58e-02

Pos logit contributions
 plains+0.048
 spir+0.048
aling+0.046
ab+0.046
irl+0.045
Neg logit contributions
 injuries-0.046
 mercy-0.042
 hesitation-0.038
 poisoning-0.038
 forgiveness-0.036
Token the
Token ID262
Feature activation-2.29e-02

Pos logit contributions
 spir+0.022
 sorts+0.021
irl+0.021
 plains+0.021
aling+0.020
Neg logit contributions
 injuries-0.028
 mercy-0.027
 poisoning-0.025
 fault-0.024
 hesitation-0.024
Token kay
Token ID34681
Feature activation-1.36e-01

Pos logit contributions
ab+0.038
aling+0.037
irl+0.037
 spir+0.036
 sorts+0.035
Neg logit contributions
 injuries-0.032
 help-0.028
 mercy-0.028
 poisoning-0.027
 hesitation-0.027
Tokenak
Token ID461
Feature activation-1.74e-02

Pos logit contributions
 spir+0.236
 strut+0.218
 sings+0.208
 plains+0.205
 pian+0.203
Neg logit contributions
 injuries-0.184
t-0.175
 poisoning-0.168
 mercy-0.153
guard-0.151
Token and
Token ID290
Feature activation-4.29e-02

Pos logit contributions
 plains+0.033
 spir+0.033
aling+0.032
ab+0.032
irl+0.031
Neg logit contributions
 injuries-0.022
 mercy-0.020
 hesitation-0.017
 poisoning-0.017
 permission-0.016
Token her
Token ID607
Feature activation+2.08e-02

Pos logit contributions
 plains+0.085
 spir+0.083
ab+0.083
aling+0.081
irl+0.079
Neg logit contributions
 injuries-0.048
 mercy-0.044
 help-0.043
 hesitation-0.043
 permission-0.041
Token mom
Token ID1995
Feature activation+3.55e-02

Pos logit contributions
 injuries+0.022
 mercy+0.021
 poisoning+0.019
 hesitation+0.017
t+0.016
Neg logit contributions
 spir-0.042
 plains-0.042
 cycl-0.041
irl-0.041
 zoom-0.040
Token said
Token ID531
Feature activation+3.19e-02

Pos logit contributions
 injuries+0.051
 mercy+0.044
 hesitation+0.040
 poisoning+0.040
 permission+0.037
Neg logit contributions
 spir-0.063
 plains-0.062
ab-0.061
aling-0.060
 cycl-0.060
Token present
Token ID1944
Feature activation+1.80e-02

Pos logit contributions
 spir+0.010
 plains+0.010
ab+0.009
irl+0.009
aling+0.009
Neg logit contributions
 injuries-0.009
 mercy-0.008
 hesitation-0.007
 poisoning-0.007
 permission-0.007
Token;
Token ID26
Feature activation+1.10e-02

Pos logit contributions
 injuries+0.028
 mercy+0.026
 hesitation+0.024
 poisoning+0.022
 fault+0.022
Neg logit contributions
 spir-0.029
 plains-0.028
ab-0.028
 sorts-0.027
aling-0.027
Token it
Token ID340
Feature activation-1.12e-04

Pos logit contributions
 injuries+0.013
 mercy+0.012
 poisoning+0.011
 forgiveness+0.011
 help+0.011
Neg logit contributions
 plains-0.021
aling-0.021
 spir-0.021
ab-0.020
irl-0.020
Token was
Token ID373
Feature activation-3.47e-02

Pos logit contributions
 plains+0.000
aling+0.000
 spir+0.000
irl+0.000
ab+0.000
Neg logit contributions
 injuries-0.000
 mercy-0.000
 poisoning-0.000
t-0.000
 help-0.000
Token a
Token ID257
Feature activation-2.25e-02

Pos logit contributions
 spir+0.054
 plains+0.053
irl+0.051
 sorts+0.051
aling+0.050
Neg logit contributions
 injuries-0.056
 mercy-0.052
 hesitation-0.047
 poisoning-0.046
 permission-0.044
Token kay
Token ID34681
Feature activation-1.36e-01

Pos logit contributions
 plains+0.032
 spir+0.031
aling+0.030
ab+0.030
irl+0.030
Neg logit contributions
 injuries-0.040
 mercy-0.036
 poisoning-0.034
 hesitation-0.034
 permission-0.032
Tokenak
Token ID461
Feature activation-3.28e-02

Pos logit contributions
 spir+0.256
 sings+0.245
 strut+0.245
 plains+0.233
 planets+0.230
Neg logit contributions
t-0.177
 injuries-0.155
 poisoning-0.150
guard-0.132
 mercy-0.125
Token!
Token ID0
Feature activation-6.10e-03

Pos logit contributions
irl+0.067
 plains+0.067
 spir+0.065
aling+0.064
 cycl+0.063
Neg logit contributions
 injuries-0.032
 poisoning-0.028
 mercy-0.028
t-0.026
 help-0.026
Token She
Token ID1375
Feature activation-2.36e-02

Pos logit contributions
 spir+0.010
 plains+0.010
irl+0.010
aling+0.010
ep+0.010
Neg logit contributions
 injuries-0.008
 mercy-0.008
 hesitation-0.007
 poisoning-0.007
 help-0.007
Token was
Token ID373
Feature activation-3.76e-02

Pos logit contributions
 plains+0.043
aling+0.042
 spir+0.041
ab+0.040
irl+0.039
Neg logit contributions
 injuries-0.028
 mercy-0.025
 help-0.025
 poisoning-0.024
t-0.023
Token so
Token ID523
Feature activation-8.64e-03

Pos logit contributions
 plains+0.082
 spir+0.078
ab+0.077
irl+0.077
aling+0.077
Neg logit contributions
 injuries-0.037
 mercy-0.032
 hesitation-0.032
 permission-0.030
 help-0.030
Token<|endoftext|>
Token ID50256
Feature activation-1.20e-02

Pos logit contributions
 spir+0.017
 plains+0.017
aling+0.017
ab+0.017
irl+0.017
Neg logit contributions
 injuries-0.010
 mercy-0.009
 poisoning-0.008
 hesitation-0.008
 forgiveness-0.007
TokenOne
Token ID3198
Feature activation-5.43e-03

Pos logit contributions
 plains+0.022
 spir+0.022
ab+0.021
aling+0.021
irl+0.021
Neg logit contributions
 injuries-0.017
 mercy-0.015
 hesitation-0.013
 poisoning-0.013
 permission-0.013
Token day
Token ID1110
Feature activation+2.79e-02

Pos logit contributions
 spir+0.010
 plains+0.009
aling+0.009
ab+0.009
 cycl+0.009
Neg logit contributions
 injuries-0.007
 mercy-0.007
 hesitation-0.006
 help-0.006
 poisoning-0.006
Token,
Token ID11
Feature activation+1.86e-02

Pos logit contributions
 injuries+0.037
 mercy+0.033
 hesitation+0.028
 poisoning+0.028
 forgiveness+0.027
Neg logit contributions
irl-0.053
 spir-0.052
ab-0.052
 plains-0.052
 cycl-0.050
Token a
Token ID257
Feature activation-3.95e-02

Pos logit contributions
 injuries+0.030
 mercy+0.027
 hesitation+0.025
 poisoning+0.025
 permission+0.024
Neg logit contributions
 plains-0.029
 spir-0.029
ab-0.028
aling-0.028
irl-0.028
Token sw
Token ID1509
Feature activation-1.36e-01

Pos logit contributions
aling+0.082
 sorts+0.079
 plains+0.078
 spir+0.077
alling+0.076
Neg logit contributions
 injuries-0.045
 help-0.038
 mercy-0.037
 poisoning-0.036
 worry-0.034
Tokenan
Token ID272
Feature activation-8.10e-02

Pos logit contributions
 sorts+0.229
 spir+0.228
aling+0.219
 sings+0.218
 plains+0.216
Neg logit contributions
 injuries-0.184
 poisoning-0.153
 mercy-0.150
t-0.148
 permission-0.147
Token named
Token ID3706
Feature activation-2.85e-02

Pos logit contributions
 plains+0.133
 spir+0.130
aling+0.130
irl+0.127
 cycl+0.125
Neg logit contributions
 injuries-0.112
 mercy-0.101
 help-0.097
 poisoning-0.097
t-0.094
Token Sam
Token ID3409
Feature activation-1.69e-02

Pos logit contributions
aling+0.061
ab+0.058
 plains+0.058
 cycl+0.058
irl+0.058
Neg logit contributions
 injuries-0.024
 help-0.024
 mercy-0.023
 either-0.021
 poisoning-0.021
Token was
Token ID373
Feature activation-2.07e-02

Pos logit contributions
 plains+0.026
 spir+0.025
aling+0.025
irl+0.024
ab+0.024
Neg logit contributions
 injuries-0.027
 mercy-0.024
 poisoning-0.023
 hesitation-0.023
 help-0.022
Token swimming
Token ID14899
Feature activation-7.27e-02

Pos logit contributions
 plains+0.040
 sorts+0.038
 cycl+0.038
 spir+0.037
aling+0.037
Neg logit contributions
 help-0.024
 injuries-0.022
 to-0.022
 first-0.021
 poisoning-0.020
TokenEvery
Token ID6109
Feature activation+8.51e-03

Pos logit contributions
 spir+0.009
 plains+0.009
ab+0.009
irl+0.009
aling+0.009
Neg logit contributions
 injuries-0.006
 mercy-0.006
 hesitation-0.005
 poisoning-0.005
 forgiveness-0.005
Token day
Token ID1110
Feature activation+2.94e-02

Pos logit contributions
 injuries+0.015
 mercy+0.013
 poisoning+0.013
 hesitation+0.012
 permission+0.012
Neg logit contributions
 plains-0.012
 spir-0.012
irl-0.012
aling-0.012
ab-0.011
Token,
Token ID11
Feature activation-5.74e-03

Pos logit contributions
 injuries+0.032
 mercy+0.028
 hesitation+0.024
 poisoning+0.024
 permission+0.023
Neg logit contributions
 plains-0.061
 spir-0.061
aling-0.059
irl-0.059
ab-0.059
Token a
Token ID257
Feature activation-3.43e-02

Pos logit contributions
 plains+0.010
 spir+0.010
aling+0.010
irl+0.010
ab+0.010
Neg logit contributions
 injuries-0.008
 mercy-0.007
 hesitation-0.006
 poisoning-0.006
 permission-0.006
Token nearby
Token ID6716
Feature activation-4.66e-02

Pos logit contributions
 plains+0.056
 spir+0.056
aling+0.054
irl+0.053
ab+0.053
Neg logit contributions
 injuries-0.053
 mercy-0.048
 hesitation-0.045
 poisoning-0.044
 forgiveness-0.042
Token sw
Token ID1509
Feature activation-1.36e-01

Pos logit contributions
 spir+0.078
aling+0.077
irl+0.076
 plains+0.075
ab+0.074
Neg logit contributions
 injuries-0.068
 mercy-0.062
 hesitation-0.058
 help-0.056
 poisoning-0.056
Tokenan
Token ID272
Feature activation-7.29e-02

Pos logit contributions
 spir+0.204
 sorts+0.196
 plains+0.192
aling+0.188
 sings+0.187
Neg logit contributions
 injuries-0.217
 mercy-0.199
 permission-0.189
 fault-0.181
 poisoning-0.175
Token would
Token ID561
Feature activation-3.48e-03

Pos logit contributions
 plains+0.106
 spir+0.105
aling+0.100
ab+0.099
irl+0.099
Neg logit contributions
 injuries-0.126
 mercy-0.115
 poisoning-0.106
 hesitation-0.106
 permission-0.102
Token come
Token ID1282
Feature activation-1.20e-02

Pos logit contributions
 plains+0.005
 spir+0.005
ab+0.005
aling+0.005
irl+0.005
Neg logit contributions
 injuries-0.006
 mercy-0.005
 hesitation-0.005
 poisoning-0.005
 help-0.005
Token to
Token ID284
Feature activation-5.45e-04

Pos logit contributions
 plains+0.022
 spir+0.022
aling+0.021
ab+0.021
irl+0.021
Neg logit contributions
 injuries-0.016
 mercy-0.014
 poisoning-0.013
 hesitation-0.013
 forgiveness-0.012
Token the
Token ID262
Feature activation-2.08e-02

Pos logit contributions
 plains+0.001
 spir+0.001
aling+0.001
ab+0.001
irl+0.001
Neg logit contributions
 injuries-0.001
 mercy-0.001
 hesitation-0.001
 poisoning-0.001
 permission-0.001
Token old
Token ID1468
Feature activation-3.14e-02

Pos logit contributions
 spir+0.050
ab+0.050
 plains+0.049
 cycl+0.049
irl+0.048
Neg logit contributions
 injuries-0.042
 mercy-0.039
 permission-0.037
 poisoning-0.037
 hesitation-0.036
Token man
Token ID582
Feature activation-1.32e-02

Pos logit contributions
aling+0.067
 spir+0.066
 plains+0.065
irl+0.063
ab+0.063
Neg logit contributions
 injuries-0.034
 mercy-0.029
 poisoning-0.027
 hesitation-0.025
 forgiveness-0.025
Token who
Token ID508
Feature activation-2.20e-02

Pos logit contributions
 plains+0.021
 spir+0.020
irl+0.019
 cycl+0.019
ab+0.019
Neg logit contributions
 injuries-0.021
 mercy-0.019
 poisoning-0.018
 help-0.018
 hesitation-0.017
Token had
Token ID550
Feature activation-4.25e-04

Pos logit contributions
 plains+0.035
 spir+0.034
aling+0.033
ab+0.032
phony+0.032
Neg logit contributions
 help-0.031
 injuries-0.031
 mercy-0.030
 poisoning-0.030
t-0.029
Token a
Token ID257
Feature activation-1.02e-02

Pos logit contributions
ab+0.001
aling+0.001
 spir+0.001
irl+0.001
 plains+0.001
Neg logit contributions
 injuries-0.000
 mercy-0.000
 help-0.000
 permission-0.000
 poisoning-0.000
Token kay
Token ID34681
Feature activation-1.36e-01

Pos logit contributions
 plains+0.016
 spir+0.016
aling+0.016
ab+0.015
irl+0.015
Neg logit contributions
 injuries-0.017
 mercy-0.015
 poisoning-0.014
 hesitation-0.014
 help-0.014
Tokenak
Token ID461
Feature activation-3.35e-02

Pos logit contributions
 spir+0.240
 plains+0.219
 strut+0.219
 sings+0.209
 cycl+0.207
Neg logit contributions
 injuries-0.186
t-0.182
 poisoning-0.169
 mercy-0.161
guard-0.150
Token.
Token ID13
Feature activation-1.47e-03

Pos logit contributions
 spir+0.073
 plains+0.072
irl+0.072
aling+0.072
ab+0.070
Neg logit contributions
 injuries-0.030
 mercy-0.025
 poisoning-0.023
 help-0.022
 permission-0.022
Token He
Token ID679
Feature activation-2.59e-02

Pos logit contributions
irl+0.003
irling+0.003
 spir+0.003
ab+0.003
ep+0.003
Neg logit contributions
 yet-0.001
 injuries-0.001
 help-0.001
 either-0.001
 mercy-0.001
Token decided
Token ID3066
Feature activation+2.85e-02

Pos logit contributions
aling+0.039
 spir+0.039
 plains+0.038
ab+0.038
 sorts+0.036
Neg logit contributions
 injuries-0.039
 mercy-0.035
 help-0.035
 poisoning-0.035
t-0.034
Token to
Token ID284
Feature activation+1.10e-03

Pos logit contributions
 injuries+0.024
 mercy+0.019
 hesitation+0.016
 poisoning+0.016
 forgiveness+0.014
Neg logit contributions
 plains-0.068
 spir-0.067
aling-0.065
 cycl-0.065
irl-0.064
Token spl
Token ID4328
Feature activation-9.16e-02

Pos logit contributions
ab+0.068
 plains+0.067
aling+0.065
irl+0.065
arg+0.065
Neg logit contributions
 injuries-0.051
 hesitation-0.049
 help-0.047
 mercy-0.047
 poisoning-0.044
Tokenashed
Token ID5263
Feature activation-3.49e-02

Pos logit contributions
 spir+0.187
 plains+0.180
 planets+0.174
 jung+0.173
 outer+0.171
Neg logit contributions
 injuries-0.093
t-0.088
 poisoning-0.081
less-0.075
 mercy-0.074
Token around
Token ID1088
Feature activation-1.18e-02

Pos logit contributions
 spir+0.049
 plains+0.048
ab+0.045
irl+0.045
 sorts+0.045
Neg logit contributions
 injuries-0.064
 mercy-0.058
 poisoning-0.054
 hesitation-0.053
 permission-0.052
Token in
Token ID287
Feature activation-2.72e-02

Pos logit contributions
 spir+0.020
 plains+0.020
ab+0.019
aling+0.019
irl+0.019
Neg logit contributions
 injuries-0.018
 mercy-0.016
 poisoning-0.015
 hesitation-0.014
 permission-0.014
Token the
Token ID262
Feature activation-3.35e-02

Pos logit contributions
 spir+0.043
 plains+0.042
 sorts+0.040
irl+0.039
aling+0.039
Neg logit contributions
 injuries-0.045
 mercy-0.042
 poisoning-0.038
 hesitation-0.037
 forgiveness-0.036
Token kay
Token ID34681
Feature activation-1.36e-01

Pos logit contributions
ab+0.061
aling+0.061
irl+0.059
 plains+0.059
 sorts+0.058
Neg logit contributions
 injuries-0.043
 mercy-0.036
 help-0.035
 hesitation-0.034
 poisoning-0.033
Tokenak
Token ID461
Feature activation-3.10e-02

Pos logit contributions
 spir+0.222
 strut+0.209
 sings+0.202
 cycl+0.199
 plains+0.199
Neg logit contributions
 injuries-0.188
t-0.180
 poisoning-0.168
 mercy-0.159
guard-0.154
Token,
Token ID11
Feature activation-4.24e-03

Pos logit contributions
 spir+0.063
 plains+0.063
irl+0.062
aling+0.062
ab+0.060
Neg logit contributions
 injuries-0.034
 mercy-0.030
 hesitation-0.027
 poisoning-0.026
 permission-0.025
Token just
Token ID655
Feature activation-8.63e-03

Pos logit contributions
 plains+0.007
irl+0.007
aling+0.007
ab+0.007
 spir+0.007
Neg logit contributions
 injuries-0.006
 poisoning-0.005
 hesitation-0.005
 mercy-0.005
 help-0.005
Token like
Token ID588
Feature activation-6.81e-03

Pos logit contributions
 plains+0.016
 spir+0.016
aling+0.016
ab+0.015
irl+0.015
Neg logit contributions
 injuries-0.011
 mercy-0.010
 hesitation-0.009
 poisoning-0.009
 permission-0.008
Token two
Token ID734
Feature activation+3.27e-02

Pos logit contributions
 spir+0.012
 plains+0.011
aling+0.011
irl+0.011
ab+0.011
Neg logit contributions
 injuries-0.010
 mercy-0.009
 poisoning-0.008
 hesitation-0.008
 permission-0.008
Token and
Token ID290
Feature activation-6.91e-02

Pos logit contributions
 spir+0.101
 plains+0.100
ab+0.094
irl+0.093
aling+0.092
Neg logit contributions
 injuries-0.096
 mercy-0.088
 hesitation-0.080
 poisoning-0.079
 forgiveness-0.077
Token saw
Token ID2497
Feature activation-2.71e-02

Pos logit contributions
 plains+0.101
irl+0.099
ab+0.099
 spir+0.098
arg+0.098
Neg logit contributions
 injuries-0.112
 mercy-0.104
 help-0.097
 hesitation-0.094
 poisoning-0.093
Tokened
Token ID276
Feature activation-4.51e-02

Pos logit contributions
 spir+0.041
 plains+0.041
aling+0.039
ab+0.039
irl+0.039
Neg logit contributions
 injuries-0.045
 mercy-0.041
 poisoning-0.038
 hesitation-0.038
 permission-0.036
Token until
Token ID1566
Feature activation+2.62e-02

Pos logit contributions
aling+0.074
 plains+0.074
irl+0.074
 spir+0.073
ab+0.072
Neg logit contributions
 injuries-0.066
 mercy-0.060
 hesitation-0.056
 poisoning-0.055
 permission-0.053
Token the
Token ID262
Feature activation-4.24e-03

Pos logit contributions
 injuries+0.039
 mercy+0.035
 poisoning+0.032
 hesitation+0.031
 permission+0.030
Neg logit contributions
 spir-0.045
 plains-0.045
irl-0.042
 cycl-0.041
 sorts-0.041
Token kay
Token ID34681
Feature activation-1.35e-01

Pos logit contributions
 plains+0.005
 spir+0.005
aling+0.005
ab+0.005
irl+0.005
Neg logit contributions
 injuries-0.008
 mercy-0.007
 hesitation-0.007
 poisoning-0.007
 permission-0.007
Tokenak
Token ID461
Feature activation-3.80e-02

Pos logit contributions
 spir+0.224
 strut+0.216
 sings+0.211
 cycl+0.200
 plains+0.199
Neg logit contributions
 injuries-0.191
t-0.180
 poisoning-0.162
 mercy-0.161
guard-0.146
Token was
Token ID373
Feature activation-4.50e-02

Pos logit contributions
 plains+0.062
aling+0.060
irl+0.060
 spir+0.060
ab+0.059
Neg logit contributions
 injuries-0.056
 mercy-0.050
 poisoning-0.046
t-0.046
 hesitation-0.044
Token finished
Token ID5201
Feature activation+5.60e-04

Pos logit contributions
 plains+0.064
 spir+0.064
aling+0.061
ab+0.061
irl+0.060
Neg logit contributions
 injuries-0.079
 mercy-0.072
 hesitation-0.067
 poisoning-0.067
 permission-0.064
Token.
Token ID13
Feature activation-2.48e-02

Pos logit contributions
 injuries+0.001
 mercy+0.001
 poisoning+0.001
 hesitation+0.001
 permission+0.001
Neg logit contributions
 plains-0.001
 spir-0.001
aling-0.001
ab-0.001
irl-0.001
Token\n
Token ID198
Feature activation+4.22e-03

Pos logit contributions
 spir+0.049
aling+0.048
irl+0.048
 plains+0.048
ab+0.047
Neg logit contributions
 injuries-0.028
 mercy-0.025
 hesitation-0.022
 poisoning-0.022
 forgiveness-0.021
Token play
Token ID711
Feature activation-2.44e-02

Pos logit contributions
 plains+0.027
 spir+0.026
aling+0.026
ab+0.026
 cycl+0.024
Neg logit contributions
 injuries-0.032
 mercy-0.029
 help-0.029
 hesitation-0.027
 worry-0.027
Token in
Token ID287
Feature activation-2.17e-02

Pos logit contributions
aling+0.040
 spir+0.039
 plains+0.038
 sorts+0.037
 cycl+0.037
Neg logit contributions
 injuries-0.037
 help-0.034
 mercy-0.033
 poisoning-0.032
 forgiveness-0.030
Token the
Token ID262
Feature activation-3.13e-02

Pos logit contributions
ab+0.043
aling+0.043
 plains+0.042
 spir+0.041
 cycl+0.041
Neg logit contributions
 injuries-0.026
 mercy-0.022
 help-0.020
 hesitation-0.020
 poisoning-0.020
Token park
Token ID3952
Feature activation-4.28e-02

Pos logit contributions
ab+0.052
aling+0.052
 cycl+0.049
arg+0.049
irl+0.048
Neg logit contributions
 injuries-0.042
 help-0.037
 poisoning-0.036
 first-0.035
 mercy-0.035
Token every
Token ID790
Feature activation+2.48e-02

Pos logit contributions
 spir+0.098
aling+0.098
 cycl+0.097
 plains+0.094
ab+0.094
Neg logit contributions
 injuries-0.033
 either-0.029
 yet-0.028
 help-0.027
 mercy-0.026
Token day
Token ID1110
Feature activation+5.40e-02

Pos logit contributions
 injuries+0.025
 help+0.023
 first+0.022
t+0.022
 hesitation+0.021
Neg logit contributions
 spir-0.049
aling-0.048
 plains-0.047
arg-0.047
 cycl-0.046
Token.
Token ID13
Feature activation+1.81e-03

Pos logit contributions
 injuries+0.056
 mercy+0.045
 hesitation+0.042
 poisoning+0.040
 permission+0.039
Neg logit contributions
 plains-0.116
 spir-0.115
aling-0.113
 cycl-0.112
ab-0.111
Token They
Token ID1119
Feature activation-7.95e-03

Pos logit contributions
 injuries+0.003
 mercy+0.003
 hesitation+0.003
 poisoning+0.003
 permission+0.003
Neg logit contributions
 spir-0.003
 plains-0.003
aling-0.002
ab-0.002
irl-0.002
Token liked
Token ID8288
Feature activation-5.25e-03

Pos logit contributions
aling+0.013
 plains+0.013
 spir+0.013
ab+0.013
arg+0.012
Neg logit contributions
 injuries-0.011
 help-0.010
 mercy-0.010
 poisoning-0.010
 hesitation-0.010
Token to
Token ID284
Feature activation-1.40e-02

Pos logit contributions
 spir+0.014
aling+0.014
 plains+0.014
ab+0.014
irl+0.014
Neg logit contributions
 injuries-0.002
 mercy-0.002
 poisoning-0.001
 hesitation-0.001
 permission-0.001
Token swing
Token ID9628
Feature activation-5.23e-02

Pos logit contributions
 plains+0.022
ab+0.022
aling+0.021
 spir+0.021
 sorts+0.020
Neg logit contributions
 help-0.022
 injuries-0.022
 mercy-0.019
 worry-0.019
 hesitation-0.019
Token.
Token ID13
Feature activation-1.04e-02

Pos logit contributions
 injuries+0.119
 mercy+0.109
 poisoning+0.104
 hesitation+0.103
 payoff+0.098
Neg logit contributions
aling-0.076
 spir-0.074
 plains-0.074
ab-0.073
 zoom-0.071
Token You
Token ID921
Feature activation+2.86e-02

Pos logit contributions
 plains+0.015
 spir+0.015
aling+0.015
ab+0.015
irl+0.014
Neg logit contributions
 injuries-0.018
 mercy-0.016
 hesitation-0.015
 poisoning-0.015
 permission-0.014
Token are
Token ID389
Feature activation+8.66e-03

Pos logit contributions
 injuries+0.042
 mercy+0.037
 hesitation+0.034
 poisoning+0.034
 permission+0.032
Neg logit contributions
 spir-0.049
 plains-0.049
aling-0.047
irl-0.047
ab-0.047
Token my
Token ID616
Feature activation+4.91e-02

Pos logit contributions
 injuries+0.011
 mercy+0.011
 poisoning+0.009
 forgiveness+0.009
 help+0.009
Neg logit contributions
 plains-0.017
 spir-0.016
 cycl-0.016
ab-0.016
aling-0.015
Token best
Token ID1266
Feature activation+5.13e-02

Pos logit contributions
 injuries+0.084
 mercy+0.073
 poisoning+0.071
 hesitation+0.069
t+0.066
Neg logit contributions
 spir-0.071
 plains-0.070
ab-0.068
 sorts-0.066
 sights-0.065
Token friend
Token ID1545
Feature activation+7.09e-02

Pos logit contributions
 injuries+0.089
 mercy+0.078
 poisoning+0.078
 hesitation+0.076
t+0.068
Neg logit contributions
ab-0.071
 spir-0.069
 plains-0.068
irl-0.065
 zoom-0.065
Token too
Token ID1165
Feature activation+6.40e-02

Pos logit contributions
 injuries+0.134
 hesitation+0.114
 mercy+0.110
 poisoning+0.103
 fault+0.102
Neg logit contributions
ab-0.099
 sorts-0.090
aling-0.089
 plains-0.089
 spir-0.086
Token.
Token ID13
Feature activation-1.31e-02

Pos logit contributions
 injuries+0.101
 mercy+0.090
 hesitation+0.084
 poisoning+0.083
 permission+0.080
Neg logit contributions
 plains-0.100
 spir-0.099
ab-0.097
aling-0.094
irl-0.094
Token You
Token ID921
Feature activation+2.10e-02

Pos logit contributions
 plains+0.019
 spir+0.019
aling+0.018
ab+0.018
irl+0.018
Neg logit contributions
 injuries-0.022
 mercy-0.020
 hesitation-0.019
 poisoning-0.019
 permission-0.018
Token are
Token ID389
Feature activation+5.99e-03

Pos logit contributions
 injuries+0.030
 mercy+0.027
 hesitation+0.024
 poisoning+0.024
 permission+0.023
Neg logit contributions
 spir-0.037
 plains-0.037
aling-0.036
ab-0.035
irl-0.035
Token very
Token ID845
Feature activation-1.71e-03

Pos logit contributions
 injuries+0.008
 mercy+0.008
 poisoning+0.007
 forgiveness+0.007
 help+0.007
Neg logit contributions
 plains-0.011
 spir-0.011
 cycl-0.011
ab-0.011
aling-0.010
Token nice
Token ID3621
Feature activation+4.42e-02

Pos logit contributions
 plains+0.012
 spir+0.011
irl+0.011
 cycl+0.011
aling+0.011
Neg logit contributions
 injuries-0.010
 mercy-0.009
 poisoning-0.009
 permission-0.009
 hesitation-0.008
Token.
Token ID13
Feature activation-1.73e-02

Pos logit contributions
 injuries+0.067
 mercy+0.059
 poisoning+0.052
 hesitation+0.052
 forgiveness+0.049
Neg logit contributions
 spir-0.075
 plains-0.074
 sorts-0.073
ab-0.073
aling-0.070
Token And
Token ID843
Feature activation-2.33e-03

Pos logit contributions
 spir+0.034
irl+0.034
 plains+0.033
ab+0.032
aling+0.032
Neg logit contributions
 injuries-0.019
 mercy-0.018
 poisoning-0.015
 hesitation-0.015
 forgiveness-0.015
Token from
Token ID422
Feature activation-1.25e-02

Pos logit contributions
 spir+0.004
 plains+0.004
aling+0.004
ab+0.004
irl+0.004
Neg logit contributions
 injuries-0.003
 mercy-0.003
 poisoning-0.002
 hesitation-0.002
 permission-0.002
Token that
Token ID326
Feature activation+5.99e-03

Pos logit contributions
 plains+0.026
 spir+0.025
aling+0.025
irl+0.024
ab+0.024
Neg logit contributions
 injuries-0.014
 mercy-0.012
 poisoning-0.011
 hesitation-0.011
 fault-0.010
Token day
Token ID1110
Feature activation+5.69e-02

Pos logit contributions
 injuries+0.008
 mercy+0.007
 hesitation+0.007
 poisoning+0.007
 help+0.007
Neg logit contributions
 spir-0.011
 plains-0.011
aling-0.010
irl-0.010
 cycl-0.010
Token on
Token ID319
Feature activation+1.09e-02

Pos logit contributions
 injuries+0.096
 mercy+0.085
 hesitation+0.081
 poisoning+0.080
t+0.078
Neg logit contributions
 spir-0.087
 plains-0.087
 sorts-0.083
aling-0.080
irl-0.080
Token,
Token ID11
Feature activation+1.96e-03

Pos logit contributions
 injuries+0.016
 mercy+0.015
 permission+0.013
 hesitation+0.013
 poisoning+0.013
Neg logit contributions
 plains-0.019
 spir-0.019
irl-0.018
ab-0.018
 cycl-0.017
Token the
Token ID262
Feature activation+3.04e-03

Pos logit contributions
 injuries+0.002
 mercy+0.002
 poisoning+0.002
 hesitation+0.002
 help+0.002
Neg logit contributions
 plains-0.004
 spir-0.004
ab-0.004
irl-0.004
aling-0.004
Token little
Token ID1310
Feature activation-4.19e-03

Pos logit contributions
 injuries+0.005
 mercy+0.005
 poisoning+0.004
 hesitation+0.004
 permission+0.004
Neg logit contributions
 plains-0.005
 spir-0.005
aling-0.004
ab-0.004
irl-0.004
Token boy
Token ID2933
Feature activation-2.46e-02

Pos logit contributions
 plains+0.007
 spir+0.007
ab+0.006
aling+0.006
 sorts+0.006
Neg logit contributions
 injuries-0.007
 mercy-0.006
 hesitation-0.006
 poisoning-0.006
 permission-0.005
Token."
Token ID526
Feature activation+3.05e-03

Pos logit contributions
 injuries+0.038
 mercy+0.034
 hesitation+0.031
 poisoning+0.030
 forgiveness+0.029
Neg logit contributions
 plains-0.040
 spir-0.039
ab-0.039
 sorts-0.038
aling-0.038
Token Then
Token ID3244
Feature activation+3.16e-02

Pos logit contributions
 injuries+0.004
 mercy+0.004
 hesitation+0.004
 poisoning+0.004
 permission+0.003
Neg logit contributions
 plains-0.005
 spir-0.005
aling-0.005
ab-0.005
irl-0.005
Token he
Token ID339
Feature activation-1.21e-02

Pos logit contributions
 injuries+0.037
 mercy+0.032
 poisoning+0.029
 hesitation+0.028
 permission+0.026
Neg logit contributions
 spir-0.064
 plains-0.064
aling-0.061
irl-0.061
 cycl-0.061
Token give
Token ID1577
Feature activation-2.34e-02

Pos logit contributions
 plains+0.020
 spir+0.019
aling+0.018
irl+0.018
ab+0.018
Neg logit contributions
 injuries-0.019
 mercy-0.017
 poisoning-0.016
 hesitation-0.015
 help-0.015
Token his
Token ID465
Feature activation+5.13e-02

Pos logit contributions
 plains+0.040
 spir+0.039
aling+0.038
ab+0.037
irl+0.037
Neg logit contributions
 injuries-0.035
 mercy-0.031
 hesitation-0.029
 poisoning-0.029
 forgiveness-0.027
Token mom
Token ID1995
Feature activation+6.53e-02

Pos logit contributions
 injuries+0.065
 mercy+0.057
 poisoning+0.055
 hesitation+0.053
t+0.050
Neg logit contributions
irl-0.091
 spir-0.087
 sorts-0.087
 sights-0.085
 zoom-0.085
Token a
Token ID257
Feature activation+2.06e-02

Pos logit contributions
 injuries+0.083
 mercy+0.077
 permission+0.069
 poisoning+0.068
 hesitation+0.067
Neg logit contributions
irl-0.122
 plains-0.122
 spir-0.122
ab-0.118
aling-0.117
Token big
Token ID1263
Feature activation+8.62e-04

Pos logit contributions
 injuries+0.033
 mercy+0.029
 hesitation+0.026
 poisoning+0.026
 permission+0.025
Neg logit contributions
 spir-0.033
 plains-0.033
ab-0.032
irl-0.032
 sorts-0.032
Token hug
Token ID16225
Feature activation+3.99e-02

Pos logit contributions
 injuries+0.001
 mercy+0.001
 poisoning+0.001
 hesitation+0.001
 permission+0.001
Neg logit contributions
 spir-0.002
 plains-0.002
ab-0.002
 sorts-0.002
aling-0.002
Token,
Token ID11
Feature activation+8.31e-03

Pos logit contributions
 injuries+0.063
 mercy+0.058
 hesitation+0.052
 fault+0.050
 poisoning+0.050
Neg logit contributions
 plains-0.062
 spir-0.061
ab-0.061
aling-0.059
 cycl-0.058
Token and
Token ID290
Feature activation-3.79e-02

Pos logit contributions
 injuries+0.008
 mercy+0.007
 hesitation+0.006
 poisoning+0.006
 forgiveness+0.006
Neg logit contributions
 plains-0.018
irl-0.018
 spir-0.018
ab-0.017
aling-0.017
Token an
Token ID281
Feature activation-1.09e-02

Pos logit contributions
 injuries+0.046
 mercy+0.039
 poisoning+0.035
t+0.035
 hesitation+0.034
Neg logit contributions
 plains-0.072
 spir-0.070
 sorts-0.067
aling-0.066
 cycl-0.066
Token important
Token ID1593
Feature activation+1.02e-02

Pos logit contributions
 spir+0.015
 plains+0.015
aling+0.014
ab+0.014
irl+0.014
Neg logit contributions
 injuries-0.019
 mercy-0.018
 hesitation-0.017
 poisoning-0.016
 permission-0.016
Token lesson
Token ID11483
Feature activation+6.16e-02

Pos logit contributions
 injuries+0.011
 mercy+0.009
 hesitation+0.008
 poisoning+0.008
 help+0.008
Neg logit contributions
 spir-0.022
 plains-0.022
aling-0.022
ab-0.021
irl-0.021
Token.
Token ID13
Feature activation-1.35e-02

Pos logit contributions
 injuries+0.068
 mercy+0.058
 poisoning+0.051
 hesitation+0.051
 permission+0.047
Neg logit contributions
 plains-0.129
 spir-0.128
aling-0.125
ab-0.122
irl-0.122
Token No
Token ID1400
Feature activation+3.62e-02

Pos logit contributions
 spir+0.022
 plains+0.022
irl+0.022
ab+0.021
aling+0.021
Neg logit contributions
 injuries-0.020
 mercy-0.019
 hesitation-0.017
 poisoning-0.017
 forgiveness-0.016
Token matter
Token ID2300
Feature activation+5.44e-02

Pos logit contributions
 injuries+0.047
 mercy+0.043
 hesitation+0.040
 poisoning+0.037
 forgiveness+0.037
Neg logit contributions
 spir-0.070
 plains-0.069
aling-0.066
arg-0.066
irl-0.066
Token what
Token ID644
Feature activation+5.80e-02

Pos logit contributions
 injuries+0.071
 mercy+0.067
 forgiveness+0.058
 permission+0.058
 hesitation+0.057
Neg logit contributions
 spir-0.104
 plains-0.102
aling-0.099
ab-0.099
irl-0.098
Token danger
Token ID3514
Feature activation+3.92e-02

Pos logit contributions
 injuries+0.076
 mercy+0.067
 hesitation+0.060
 poisoning+0.060
 permission+0.057
Neg logit contributions
 plains-0.108
 spir-0.108
aling-0.104
ab-0.103
irl-0.103
Token he
Token ID339
Feature activation+3.26e-02

Pos logit contributions
 injuries+0.055
 mercy+0.049
 hesitation+0.044
 poisoning+0.044
 permission+0.042
Neg logit contributions
 plains-0.069
 spir-0.069
aling-0.067
ab-0.066
irl-0.066
Token might
Token ID1244
Feature activation+5.38e-02

Pos logit contributions
 injuries+0.050
 mercy+0.046
 hesitation+0.042
 poisoning+0.042
 permission+0.040
Neg logit contributions
 spir-0.053
 plains-0.053
aling-0.050
irl-0.050
ab-0.050
Token face
Token ID1986
Feature activation+2.29e-02

Pos logit contributions
 injuries+0.090
 mercy+0.080
 poisoning+0.073
 forgiveness+0.071
 hesitation+0.070
Neg logit contributions
 spir-0.082
 plains-0.080
aling-0.079
irl-0.077
 sorts-0.077
Token.
Token ID13
Feature activation-2.20e-02

Pos logit contributions
 injuries+0.030
 mercy+0.027
 hesitation+0.025
 poisoning+0.024
 forgiveness+0.023
Neg logit contributions
 plains-0.033
ab-0.033
 spir-0.033
aling-0.032
 sorts-0.031
Token We
Token ID775
Feature activation+2.99e-02

Pos logit contributions
 spir+0.033
 plains+0.033
irl+0.032
aling+0.031
ab+0.031
Neg logit contributions
 injuries-0.036
 mercy-0.034
 poisoning-0.031
 hesitation-0.030
 forgiveness-0.030
Token are
Token ID389
Feature activation-1.07e-02

Pos logit contributions
 injuries+0.042
 mercy+0.037
 hesitation+0.033
 poisoning+0.033
 forgiveness+0.031
Neg logit contributions
 spir-0.053
 plains-0.053
aling-0.052
ab-0.051
irl-0.051
Token going
Token ID1016
Feature activation-2.34e-02

Pos logit contributions
 plains+0.021
 spir+0.020
 cycl+0.020
 sorts+0.019
aling+0.019
Neg logit contributions
 injuries-0.013
 mercy-0.012
 help-0.012
 permission-0.011
 hesitation-0.011
Token to
Token ID284
Feature activation+5.44e-03

Pos logit contributions
 plains+0.053
 spir+0.052
aling+0.051
ab+0.051
irl+0.051
Neg logit contributions
 injuries-0.022
 mercy-0.019
 hesitation-0.016
 poisoning-0.016
 forgiveness-0.015
Token love
Token ID1842
Feature activation+2.87e-02

Pos logit contributions
 injuries+0.009
 mercy+0.008
 poisoning+0.008
 hesitation+0.007
 permission+0.007
Neg logit contributions
 spir-0.008
 plains-0.008
irl-0.008
aling-0.008
 sorts-0.008
Token the
Token ID262
Feature activation+1.02e-02

Pos logit contributions
 injuries+0.043
 mercy+0.039
 hesitation+0.037
 poisoning+0.035
 fault+0.033
Neg logit contributions
 plains-0.048
 spir-0.047
ab-0.045
irl-0.045
 sorts-0.045
Token baby
Token ID5156
Feature activation+2.48e-02

Pos logit contributions
 injuries+0.017
 mercy+0.015
 hesitation+0.014
 poisoning+0.014
 supervision+0.013
Neg logit contributions
 spir-0.016
 plains-0.015
 sorts-0.015
 sights-0.015
ab-0.014
Token.
Token ID13
Feature activation-1.66e-02

Pos logit contributions
 injuries+0.040
 mercy+0.036
 hesitation+0.033
 poisoning+0.032
 permission+0.030
Neg logit contributions
 plains-0.039
 spir-0.038
ab-0.038
 sorts-0.037
 sights-0.036
Token We
Token ID775
Feature activation+3.47e-02

Pos logit contributions
 spir+0.024
 plains+0.024
irl+0.024
aling+0.023
ab+0.023
Neg logit contributions
 injuries-0.028
 mercy-0.026
 poisoning-0.023
 hesitation-0.023
 forgiveness-0.023
Token are
Token ID389
Feature activation-1.02e-02

Pos logit contributions
 injuries+0.048
 mercy+0.042
 hesitation+0.038
 poisoning+0.038
 forgiveness+0.036
Neg logit contributions
 spir-0.062
 plains-0.062
aling-0.060
ab-0.060
irl-0.060
Token together
Token ID1978
Feature activation+5.13e-02

Pos logit contributions
 injuries+0.079
 mercy+0.066
t+0.065
 poisoning+0.064
 help+0.062
Neg logit contributions
 spir-0.101
 plains-0.101
irl-0.099
arg-0.095
 cycl-0.094
Token.
Token ID13
Feature activation-9.27e-03

Pos logit contributions
 injuries+0.059
 mercy+0.050
 poisoning+0.045
 hesitation+0.044
 permission+0.041
Neg logit contributions
 plains-0.105
 spir-0.105
aling-0.102
irl-0.101
 cycl-0.100
Token<|endoftext|>
Token ID50256
Feature activation+2.00e-03

Pos logit contributions
 spir+0.018
 plains+0.018
irl+0.018
ab+0.018
aling+0.018
Neg logit contributions
 injuries-0.010
 mercy-0.009
 poisoning-0.008
 hesitation-0.008
 fault-0.008
TokenOnce
Token ID7454
Feature activation+5.50e-02

Pos logit contributions
 injuries+0.003
 mercy+0.003
 hesitation+0.003
 permission+0.002
 poisoning+0.002
Neg logit contributions
 plains-0.003
 spir-0.003
ab-0.003
irl-0.003
aling-0.003
Token upon
Token ID2402
Feature activation-2.70e-02

Pos logit contributions
 injuries+0.113
 mercy+0.109
 poisoning+0.098
 hesitation+0.097
 permission+0.097
Neg logit contributions
irl-0.061
 plains-0.058
ab-0.057
 spir-0.057
 sorts-0.056
Token a
Token ID257
Feature activation-1.01e-02

Pos logit contributions
 plains+0.045
 spir+0.045
aling+0.043
ab+0.043
irl+0.043
Neg logit contributions
 injuries-0.040
 mercy-0.036
 poisoning-0.033
 hesitation-0.033
 permission-0.032
Token time
Token ID640
Feature activation+2.54e-02

Pos logit contributions
 plains+0.016
 spir+0.016
irl+0.015
ab+0.015
aling+0.015
Neg logit contributions
 injuries-0.016
 mercy-0.015
 hesitation-0.014
 poisoning-0.014
 permission-0.014
Token,
Token ID11
Feature activation+2.62e-02

Pos logit contributions
 injuries+0.038
 mercy+0.035
 permission+0.031
 hesitation+0.030
 poisoning+0.029
Neg logit contributions
irl-0.044
ab-0.044
 plains-0.043
 spir-0.043
 soaring-0.040
Token there
Token ID612
Feature activation+5.93e-03

Pos logit contributions
 injuries+0.046
 mercy+0.042
 hesitation+0.038
 poisoning+0.038
 permission+0.037
Neg logit contributions
 plains-0.038
 spir-0.038
ab-0.037
irl-0.037
aling-0.036
Token was
Token ID373
Feature activation-2.30e-02

Pos logit contributions
 injuries+0.012
 mercy+0.011
 hesitation+0.011
 permission+0.011
 poisoning+0.010
Neg logit contributions
 spir-0.007
 plains-0.007
irl-0.007
 sorts-0.007
ab-0.007
Token a
Token ID257
Feature activation-3.70e-02

Pos logit contributions
 plains+0.036
 spir+0.036
aling+0.035
irl+0.035
ab+0.034
Neg logit contributions
 injuries-0.037
 mercy-0.034
 hesitation-0.031
 poisoning-0.031
 permission-0.030
Token's
Token ID338
Feature activation+1.67e-02

Pos logit contributions
 spir+0.007
 plains+0.007
aling+0.007
ab+0.007
irl+0.007
Neg logit contributions
 injuries-0.006
 mercy-0.006
 hesitation-0.005
 poisoning-0.005
 permission-0.005
Token mom
Token ID1995
Feature activation+3.13e-02

Pos logit contributions
 mercy+0.018
 injuries+0.017
 poisoning+0.016
 hesitation+0.015
t+0.014
Neg logit contributions
irl-0.032
 zoom-0.031
 spir-0.031
 sights-0.031
 plains-0.031
Token asked
Token ID1965
Feature activation+4.79e-02

Pos logit contributions
 injuries+0.046
 mercy+0.041
 hesitation+0.037
 poisoning+0.037
 permission+0.036
Neg logit contributions
 spir-0.054
 plains-0.053
aling-0.051
ab-0.051
 cycl-0.051
Token "
Token ID366
Feature activation+7.66e-03

Pos logit contributions
 injuries+0.063
 mercy+0.059
 fault+0.052
 hesitation+0.051
 poisoning+0.051
Neg logit contributions
 spir-0.083
aling-0.080
 plains-0.080
 sorts-0.080
ab-0.078
TokenDid
Token ID11633
Feature activation+3.08e-02

Pos logit contributions
 injuries+0.011
 mercy+0.010
 hesitation+0.009
 poisoning+0.009
 fault+0.008
Neg logit contributions
 plains-0.014
 spir-0.014
aling-0.013
ab-0.013
irl-0.013
Token you
Token ID345
Feature activation+2.33e-02

Pos logit contributions
 injuries+0.036
 mercy+0.033
 poisoning+0.028
 hesitation+0.028
 forgiveness+0.027
Neg logit contributions
 plains-0.062
 spir-0.061
ab-0.060
aling-0.059
 sorts-0.058
Token have
Token ID423
Feature activation+6.08e-03

Pos logit contributions
 injuries+0.038
 mercy+0.034
 hesitation+0.031
 poisoning+0.031
 permission+0.030
Neg logit contributions
 plains-0.037
 spir-0.037
aling-0.035
ab-0.035
irl-0.034
Token fun
Token ID1257
Feature activation+4.24e-02

Pos logit contributions
 injuries+0.010
 mercy+0.009
 poisoning+0.008
 hesitation+0.008
 forgiveness+0.008
Neg logit contributions
 spir-0.009
 plains-0.009
aling-0.009
irl-0.009
 cycl-0.009
Token playing
Token ID2712
Feature activation-1.60e-02

Pos logit contributions
 injuries+0.061
 mercy+0.055
 poisoning+0.050
 hesitation+0.050
 permission+0.048
Neg logit contributions
 spir-0.074
 plains-0.074
aling-0.070
ab-0.069
irl-0.069
Token with
Token ID351
Feature activation+9.83e-03

Pos logit contributions
 spir+0.025
 plains+0.025
aling+0.023
 cycl+0.023
 sorts+0.023
Neg logit contributions
 injuries-0.026
 mercy-0.023
 poisoning-0.022
 forgiveness-0.021
 help-0.021
Token the
Token ID262
Feature activation+1.68e-02

Pos logit contributions
 injuries+0.017
 mercy+0.015
 hesitation+0.014
 poisoning+0.014
 permission+0.013
Neg logit contributions
 spir-0.015
 plains-0.015
aling-0.014
irl-0.014
ab-0.014
Token something
Token ID1223
Feature activation+5.25e-02

Pos logit contributions
 injuries+0.004
 mercy+0.003
 hesitation+0.003
 poisoning+0.003
 permission+0.003
Neg logit contributions
 plains-0.004
 spir-0.004
aling-0.004
ab-0.004
irl-0.004
Token cool
Token ID3608
Feature activation+2.23e-03

Pos logit contributions
 injuries+0.093
 mercy+0.080
 hesitation+0.077
 poisoning+0.072
 permission+0.071
Neg logit contributions
 spir-0.075
 plains-0.075
irl-0.073
 sorts-0.072
ab-0.072
Token."
Token ID526
Feature activation+8.48e-03

Pos logit contributions
 injuries+0.004
 mercy+0.004
 hesitation+0.003
 forgiveness+0.003
 protests+0.003
Neg logit contributions
ab-0.003
 plains-0.003
 sorts-0.003
ep-0.003
 sights-0.003
Token\n
Token ID198
Feature activation+2.49e-02

Pos logit contributions
 injuries+0.015
 mercy+0.014
 hesitation+0.013
 poisoning+0.013
 permission+0.012
Neg logit contributions
 spir-0.012
 plains-0.012
ab-0.012
irl-0.012
 sorts-0.012
Token\n
Token ID198
Feature activation+1.72e-02

Pos logit contributions
 injuries+0.043
 mercy+0.039
 hesitation+0.037
 permission+0.036
 poisoning+0.035
Neg logit contributions
 spir-0.038
ab-0.037
irl-0.037
 plains-0.037
 sorts-0.036
Token"
Token ID1
Feature activation+2.17e-02

Pos logit contributions
 injuries+0.028
 mercy+0.025
 hesitation+0.023
 poisoning+0.023
 permission+0.022
Neg logit contributions
 spir-0.027
 plains-0.027
aling-0.026
ab-0.026
irl-0.026
TokenNo
Token ID2949
Feature activation+3.02e-02

Pos logit contributions
 injuries+0.037
 mercy+0.033
 hesitation+0.031
 fault+0.030
 poisoning+0.030
Neg logit contributions
 spir-0.034
 plains-0.033
ab-0.032
irl-0.032
 cycl-0.031
Token,
Token ID11
Feature activation+3.42e-02

Pos logit contributions
 injuries+0.044
 poisoning+0.037
 mercy+0.036
t+0.034
 forgiveness+0.032
Neg logit contributions
 plains-0.049
irl-0.048
 spir-0.048
 sorts-0.047
 sights-0.046
Token Ben
Token ID3932
Feature activation+1.63e-02

Pos logit contributions
 injuries+0.051
 mercy+0.046
 hesitation+0.041
 poisoning+0.041
 fault+0.039
Neg logit contributions
 spir-0.058
 plains-0.056
irl-0.054
ab-0.054
aling-0.053
Token,
Token ID11
Feature activation+3.80e-02

Pos logit contributions
 injuries+0.026
 mercy+0.021
 hesitation+0.021
 permission+0.020
 forgiveness+0.019
Neg logit contributions
ab-0.026
 spir-0.026
 plains-0.026
ep-0.025
 sorts-0.025
Token that
Token ID326
Feature activation+2.18e-02

Pos logit contributions
 injuries+0.057
 mercy+0.052
 hesitation+0.048
 poisoning+0.046
 fault+0.044
Neg logit contributions
 spir-0.063
 plains-0.061
irl-0.059
ab-0.058
aling-0.057
Token from
Token ID422
Feature activation-1.72e-02

Pos logit contributions
 injuries+0.052
 mercy+0.045
 hesitation+0.043
 poisoning+0.041
 forgiveness+0.040
Neg logit contributions
 spir-0.057
 plains-0.055
 sorts-0.053
ab-0.052
aling-0.052
Token Lily
Token ID20037
Feature activation+2.47e-02

Pos logit contributions
 plains+0.025
 spir+0.025
irl+0.023
aling+0.023
 jung+0.022
Neg logit contributions
 injuries-0.028
 mercy-0.027
 poisoning-0.025
 hesitation-0.024
 fault-0.024
Token.
Token ID13
Feature activation-4.26e-03

Pos logit contributions
 injuries+0.038
 mercy+0.036
 hesitation+0.031
 poisoning+0.031
 forgiveness+0.030
Neg logit contributions
 plains-0.040
 spir-0.039
ab-0.038
aling-0.038
 sorts-0.037
Token She
Token ID1375
Feature activation-1.37e-02

Pos logit contributions
 spir+0.007
 plains+0.007
ab+0.006
aling+0.006
irl+0.006
Neg logit contributions
 injuries-0.007
 mercy-0.006
 hesitation-0.006
 poisoning-0.006
 permission-0.006
Token tried
Token ID3088
Feature activation+7.59e-03

Pos logit contributions
 plains+0.023
 spir+0.023
aling+0.022
ab+0.022
irl+0.021
Neg logit contributions
 injuries-0.020
 mercy-0.019
 poisoning-0.017
 hesitation-0.016
 help-0.016
Token to
Token ID284
Feature activation+8.44e-03

Pos logit contributions
 injuries+0.009
 mercy+0.008
 poisoning+0.007
 hesitation+0.006
 permission+0.006
Neg logit contributions
 spir-0.015
 plains-0.015
aling-0.014
irl-0.014
 sorts-0.014
Token separate
Token ID4553
Feature activation-1.95e-02

Pos logit contributions
 injuries+0.014
 mercy+0.013
 poisoning+0.012
 hesitation+0.012
 fault+0.012
Neg logit contributions
 spir-0.012
 plains-0.012
aling-0.012
irl-0.012
ab-0.011
Token the
Token ID262
Feature activation+8.81e-03

Pos logit contributions
 spir+0.023
 sorts+0.022
 plains+0.021
ab+0.020
 sights+0.020
Neg logit contributions
 injuries-0.039
 mercy-0.037
ache-0.033
 poisoning-0.033
 hesitation-0.033
Token book
Token ID1492
Feature activation-1.53e-03

Pos logit contributions
 injuries+0.015
 mercy+0.014
 poisoning+0.013
 hesitation+0.013
 permission+0.013
Neg logit contributions
 spir-0.013
 plains-0.013
aling-0.013
irl-0.012
 cycl-0.012
Token from
Token ID422
Feature activation-1.95e-02

Pos logit contributions
 plains+0.003
 spir+0.003
aling+0.003
ab+0.002
 sorts+0.002
Neg logit contributions
 injuries-0.002
 mercy-0.002
 hesitation-0.002
 permission-0.002
 forgiveness-0.002
Token him
Token ID683
Feature activation+2.93e-02

Pos logit contributions
 plains+0.027
 spir+0.027
irl+0.025
 jung+0.024
 sorts+0.024
Neg logit contributions
 injuries-0.033
 mercy-0.032
 fault-0.029
 hesitation-0.029
 poisoning-0.029
Token and
Token ID290
Feature activation-4.47e-02

Pos logit contributions
 spir+0.037
irl+0.037
 plains+0.036
aling+0.036
 sorts+0.035
Neg logit contributions
 injuries-0.017
 mercy-0.013
 poisoning-0.013
 hesitation-0.012
 help-0.012
Token sat
Token ID3332
Feature activation-6.67e-02

Pos logit contributions
acking+0.078
ab+0.078
aling+0.077
irl+0.075
 plains+0.074
Neg logit contributions
 injuries-0.058
 help-0.054
 mercy-0.054
 to-0.052
 mom-0.051
Token up
Token ID510
Feature activation-3.47e-02

Pos logit contributions
 spir+0.103
aling+0.099
 plains+0.099
 sorts+0.097
irl+0.097
Neg logit contributions
 injuries-0.111
 mercy-0.094
 poisoning-0.092
 permission-0.090
 yet-0.089
Token in
Token ID287
Feature activation-3.00e-02

Pos logit contributions
 plains+0.059
 spir+0.058
aling+0.056
irl+0.055
 cycl+0.055
Neg logit contributions
 injuries-0.052
 mercy-0.046
 hesitation-0.043
 poisoning-0.043
 permission-0.041
Token the
Token ID262
Feature activation-3.30e-02

Pos logit contributions
 spir+0.051
 plains+0.051
aling+0.048
ab+0.048
irl+0.048
Neg logit contributions
 injuries-0.045
 mercy-0.041
 poisoning-0.037
 hesitation-0.037
 permission-0.035
Token garden
Token ID11376
Feature activation-5.47e-02

Pos logit contributions
irl+0.053
aling+0.052
ab+0.049
 cycl+0.048
arg+0.048
Neg logit contributions
 injuries-0.050
 help-0.047
 mercy-0.044
 poisoning-0.044
 permission-0.043
Token to
Token ID284
Feature activation-1.82e-02

Pos logit contributions
irl+0.115
aling+0.115
 spir+0.115
 plains+0.114
ab+0.111
Neg logit contributions
 injuries-0.058
 mercy-0.047
 poisoning-0.044
 hesitation-0.043
 permission-0.041
Token wait
Token ID4043
Feature activation+2.11e-02

Pos logit contributions
 plains+0.037
acking+0.036
ab+0.036
acked+0.035
ventures+0.035
Neg logit contributions
 help-0.024
 injuries-0.021
 mercy-0.019
 permission-0.018
 forgiveness-0.018
Token.
Token ID13
Feature activation-1.21e-02

Pos logit contributions
 injuries+0.026
 mercy+0.022
 hesitation+0.020
 poisoning+0.020
 permission+0.019
Neg logit contributions
 plains-0.041
 spir-0.041
aling-0.040
irl-0.039
ab-0.039
Token\n
Token ID198
Feature activation+1.69e-02

Pos logit contributions
irl+0.023
 spir+0.023
ep+0.022
 plains+0.022
ab+0.022
Neg logit contributions
 injuries-0.014
t-0.013
 help-0.012
 to-0.012
 yet-0.012
Token\n
Token ID198
Feature activation+9.44e-03

Pos logit contributions
 injuries+0.021
 mercy+0.019
 poisoning+0.017
 hesitation+0.017
 permission+0.016
Neg logit contributions
 spir-0.032
 plains-0.032
aling-0.031
ab-0.031
irl-0.031
Token.
Token ID13
Feature activation-7.39e-03

Pos logit contributions
 plains+0.057
 sorts+0.056
irl+0.056
 spir+0.055
aling+0.054
Neg logit contributions
 injuries-0.032
 mercy-0.027
 either-0.026
 hesitation-0.026
 poisoning-0.024
Token He
Token ID679
Feature activation-2.60e-02

Pos logit contributions
irl+0.018
irling+0.017
 spir+0.016
ep+0.016
ab+0.016
Neg logit contributions
 injuries-0.006
 yet-0.005
 to-0.005
t-0.005
 mercy-0.005
Token went
Token ID1816
Feature activation-3.26e-02

Pos logit contributions
ab+0.051
aling+0.050
 sorts+0.049
 spir+0.048
 plains+0.048
Neg logit contributions
 injuries-0.029
 mercy-0.027
 worry-0.024
 help-0.023
 hesitation-0.023
Token over
Token ID625
Feature activation-1.93e-02

Pos logit contributions
 plains+0.051
 spir+0.051
aling+0.049
ab+0.048
irl+0.047
Neg logit contributions
 injuries-0.054
 mercy-0.048
 hesitation-0.044
 poisoning-0.043
 forgiveness-0.043
Token to
Token ID284
Feature activation-1.29e-02

Pos logit contributions
 plains+0.040
 spir+0.038
 sorts+0.038
 jung+0.037
ab+0.036
Neg logit contributions
 injuries-0.024
 mercy-0.020
 fault-0.018
 hesitation-0.017
 poisoning-0.017
Token poke
Token ID22620
Feature activation-5.21e-02

Pos logit contributions
 plains+0.024
 spir+0.024
ab+0.023
irl+0.023
aling+0.023
Neg logit contributions
 injuries-0.016
 mercy-0.015
 help-0.015
 permission-0.014
 hesitation-0.014
Token it
Token ID340
Feature activation+1.08e-02

Pos logit contributions
 spir+0.083
 plains+0.082
aling+0.078
ab+0.078
irl+0.076
Neg logit contributions
 injuries-0.085
 mercy-0.077
 poisoning-0.070
 hesitation-0.070
 fault-0.067
Token and
Token ID290
Feature activation-5.02e-02

Pos logit contributions
 injuries+0.015
 mercy+0.013
 hesitation+0.012
 poisoning+0.012
 permission+0.011
Neg logit contributions
 spir-0.020
 plains-0.019
aling-0.019
ab-0.019
irl-0.018
Token it
Token ID340
Feature activation-1.76e-02

Pos logit contributions
 plains+0.093
irl+0.091
 spir+0.090
ab+0.090
aling+0.090
Neg logit contributions
 injuries-0.065
 mercy-0.059
 hesitation-0.053
 poisoning-0.053
 help-0.051
Token turned
Token ID2900
Feature activation-4.33e-02

Pos logit contributions
 plains+0.028
 spir+0.027
aling+0.027
ab+0.027
irl+0.026
Neg logit contributions
 injuries-0.027
 mercy-0.025
 poisoning-0.023
 hesitation-0.022
 help-0.021
Token out
Token ID503
Feature activation-5.31e-03

Pos logit contributions
 plains+0.067
irl+0.065
 spir+0.064
aling+0.063
 cycl+0.062
Neg logit contributions
 injuries-0.070
 mercy-0.065
 hesitation-0.060
 poisoning-0.059
 permission-0.056
Token ride
Token ID6594
Feature activation-3.82e-02

Pos logit contributions
 spir+0.016
aling+0.016
irl+0.016
 plains+0.016
 cycl+0.015
Neg logit contributions
 injuries-0.010
 mercy-0.009
 help-0.009
 either-0.008
t-0.008
Token together
Token ID1978
Feature activation+4.74e-02

Pos logit contributions
aling+0.073
 spir+0.071
 cycl+0.070
 plains+0.070
irl+0.070
Neg logit contributions
 injuries-0.045
 mercy-0.038
 either-0.038
 yet-0.036
 help-0.035
Token.
Token ID13
Feature activation-2.48e-02

Pos logit contributions
 injuries+0.051
 mercy+0.043
 hesitation+0.039
 poisoning+0.039
 permission+0.037
Neg logit contributions
 spir-0.098
 plains-0.098
aling-0.096
irl-0.095
 cycl-0.095
Token<|endoftext|>
Token ID50256
Feature activation-1.75e-02

Pos logit contributions
 plains+0.050
 spir+0.049
aling+0.049
irl+0.048
 cycl+0.047
Neg logit contributions
 injuries-0.027
 mercy-0.023
 poisoning-0.022
 forgiveness-0.021
 help-0.021
TokenOnce
Token ID7454
Feature activation+4.09e-02

Pos logit contributions
 plains+0.033
 spir+0.033
aling+0.032
irl+0.031
ab+0.031
Neg logit contributions
 injuries-0.023
 mercy-0.020
 hesitation-0.018
 poisoning-0.018
 permission-0.017
Token upon
Token ID2402
Feature activation-3.61e-02

Pos logit contributions
 injuries+0.088
 mercy+0.085
 poisoning+0.075
 hesitation+0.074
 permission+0.073
Neg logit contributions
irl-0.044
 sorts-0.042
ab-0.041
 plains-0.039
 spir-0.039
Token a
Token ID257
Feature activation-1.97e-02

Pos logit contributions
 plains+0.061
 spir+0.061
aling+0.059
ab+0.058
irl+0.058
Neg logit contributions
 injuries-0.053
 mercy-0.048
 poisoning-0.044
 hesitation-0.043
 permission-0.041
Token time
Token ID640
Feature activation+1.54e-02

Pos logit contributions
 plains+0.030
 spir+0.030
irl+0.030
ab+0.030
aling+0.029
Neg logit contributions
 injuries-0.032
 mercy-0.030
 hesitation-0.027
 permission-0.027
 poisoning-0.027
Token,
Token ID11
Feature activation+1.98e-02

Pos logit contributions
 injuries+0.025
 mercy+0.023
 permission+0.020
 hesitation+0.019
 poisoning+0.019
Neg logit contributions
irl-0.025
ab-0.025
 spir-0.024
 plains-0.024
 soaring-0.023
Token in
Token ID287
Feature activation-1.94e-02

Pos logit contributions
 injuries+0.033
 mercy+0.030
 hesitation+0.027
 poisoning+0.027
 permission+0.026
Neg logit contributions
 spir-0.031
 plains-0.031
ab-0.029
irl-0.029
aling-0.029
Token a
Token ID257
Feature activation-5.58e-02

Pos logit contributions
 plains+0.034
 spir+0.034
aling+0.034
ab+0.033
 cycl+0.032
Neg logit contributions
 injuries-0.027
 mercy-0.024
 poisoning-0.022
 hesitation-0.022
 permission-0.021
Token full
Token ID1336
Feature activation+1.39e-02

Pos logit contributions
 spir+0.079
 plains+0.078
aling+0.075
irl+0.075
ab+0.074
Neg logit contributions
 injuries-0.076
 mercy-0.069
 hesitation-0.063
 poisoning-0.063
 forgiveness-0.061
Token of
Token ID286
Feature activation-3.19e-02

Pos logit contributions
 injuries+0.024
 mercy+0.023
 hesitation+0.021
 fault+0.021
 permission+0.020
Neg logit contributions
 spir-0.019
 plains-0.019
ab-0.018
 sorts-0.018
aling-0.017
Token water
Token ID1660
Feature activation-1.68e-02

Pos logit contributions
 spir+0.049
 plains+0.049
irl+0.046
aling+0.046
 sorts+0.046
Neg logit contributions
 injuries-0.052
 mercy-0.047
 poisoning-0.043
 hesitation-0.043
 permission-0.041
Token and
Token ID290
Feature activation-5.12e-02

Pos logit contributions
 spir+0.030
 plains+0.030
ab+0.028
aling+0.028
irl+0.027
Neg logit contributions
 injuries-0.023
 mercy-0.022
 hesitation-0.019
 forgiveness-0.018
 permission-0.018
Token salt
Token ID8268
Feature activation-2.34e-02

Pos logit contributions
 plains+0.078
 spir+0.078
aling+0.075
ab+0.074
irl+0.074
Neg logit contributions
 injuries-0.085
 mercy-0.077
 hesitation-0.071
 poisoning-0.071
 permission-0.068
Token.
Token ID13
Feature activation-1.68e-02

Pos logit contributions
 spir+0.038
 plains+0.037
ab+0.036
irl+0.035
aling+0.035
Neg logit contributions
 injuries-0.036
 mercy-0.033
 hesitation-0.030
 forgiveness-0.029
 permission-0.028
Token When
Token ID1649
Feature activation+2.64e-02

Pos logit contributions
 plains+0.022
 spir+0.022
ab+0.021
aling+0.021
irl+0.020
Neg logit contributions
 injuries-0.031
 mercy-0.029
 hesitation-0.027
 poisoning-0.027
 permission-0.026
Token she
Token ID673
Feature activation-1.19e-02

Pos logit contributions
 injuries+0.030
 mercy+0.025
 hesitation+0.021
 poisoning+0.021
 help+0.021
Neg logit contributions
ab-0.057
aling-0.056
 plains-0.056
 spir-0.055
 sorts-0.053
Token pulled
Token ID5954
Feature activation-2.69e-02

Pos logit contributions
 spir+0.017
 plains+0.017
aling+0.017
ab+0.016
irl+0.016
Neg logit contributions
 injuries-0.020
 mercy-0.019
 hesitation-0.017
 poisoning-0.017
 permission-0.017
Token the
Token ID262
Feature activation+1.80e-03

Pos logit contributions
 spir+0.037
 plains+0.035
ab+0.035
irl+0.034
aling+0.033
Neg logit contributions
 injuries-0.049
 mercy-0.047
 hesitation-0.042
 forgiveness-0.041
 poisoning-0.041
Token trigger
Token ID7616
Feature activation+4.96e-02

Pos logit contributions
 mercy+0.003
 injuries+0.003
 hesitation+0.003
 forgiveness+0.003
 fault+0.002
Neg logit contributions
 spir-0.003
ab-0.003
irl-0.002
 plains-0.002
 sights-0.002
Token on
Token ID319
Feature activation-6.29e-03

Pos logit contributions
 injuries+0.007
 mercy+0.006
 poisoning+0.006
 permission+0.006
 help+0.006
Neg logit contributions
 plains-0.009
 spir-0.009
aling-0.009
 sorts-0.009
ab-0.008
Token and
Token ID290
Feature activation-4.79e-02

Pos logit contributions
 spir+0.010
 plains+0.010
irl+0.010
ab+0.010
aling+0.010
Neg logit contributions
 injuries-0.010
 mercy-0.009
 hesitation-0.008
 poisoning-0.008
 permission-0.008
Token started
Token ID2067
Feature activation-2.64e-02

Pos logit contributions
 plains+0.088
aling+0.087
ab+0.086
acking+0.083
arg+0.081
Neg logit contributions
 injuries-0.062
 mercy-0.057
 help-0.054
 poisoning-0.054
 forgiveness-0.052
Token to
Token ID284
Feature activation-1.68e-02

Pos logit contributions
 plains+0.075
aling+0.074
ab+0.074
irl+0.073
 spir+0.072
Neg logit contributions
 injuries-0.009
 help-0.003
 mercy-0.003
 to-0.003
 permission-0.003
Token dance
Token ID9280
Feature activation-4.44e-02

Pos logit contributions
 plains+0.028
ab+0.026
aling+0.026
aw+0.025
 spir+0.025
Neg logit contributions
 injuries-0.026
 help-0.025
 mercy-0.024
 worry-0.024
 hesitation-0.022
Token around
Token ID1088
Feature activation-2.10e-02

Pos logit contributions
 plains+0.090
 spir+0.089
aling+0.089
 cycl+0.087
irl+0.086
Neg logit contributions
 injuries-0.049
 mercy-0.042
 poisoning-0.039
 help-0.037
 hesitation-0.037
Token the
Token ID262
Feature activation-1.26e-02

Pos logit contributions
 plains+0.038
 spir+0.038
aling+0.036
ab+0.036
irl+0.036
Neg logit contributions
 injuries-0.029
 mercy-0.026
 hesitation-0.024
 poisoning-0.024
 permission-0.022
Token office
Token ID2607
Feature activation-3.64e-02

Pos logit contributions
aling+0.017
irl+0.017
 plains+0.017
 spir+0.017
 cycl+0.016
Neg logit contributions
 injuries-0.023
 poisoning-0.020
 mercy-0.020
 help-0.019
 permission-0.019
Token.
Token ID13
Feature activation-1.06e-02

Pos logit contributions
irl+0.085
 plains+0.084
 spir+0.083
aling+0.082
 cycl+0.082
Neg logit contributions
 injuries-0.030
 mercy-0.023
 either-0.022
 help-0.021
 poisoning-0.021
Token Her
Token ID2332
Feature activation+2.61e-02

Pos logit contributions
 spir+0.020
aling+0.020
irl+0.020
 plains+0.019
ep+0.019
Neg logit contributions
 injuries-0.013
 mercy-0.012
 poisoning-0.011
 help-0.011
 hesitation-0.010
Token mom
Token ID1995
Feature activation+1.88e-02

Pos logit contributions
 injuries+0.023
 mercy+0.021
 hesitation+0.018
 poisoning+0.017
 permission+0.015
Neg logit contributions
 spir-0.059
 plains-0.058
irl-0.057
 cycl-0.056
ab-0.056
Token,
Token ID11
Feature activation+1.74e-03

Pos logit contributions
 spir+0.008
 plains+0.008
aling+0.007
ab+0.007
irl+0.007
Neg logit contributions
 injuries-0.007
 mercy-0.006
 hesitation-0.006
 poisoning-0.005
 permission-0.005
Token so
Token ID523
Feature activation+6.86e-03

Pos logit contributions
 injuries+0.002
 mercy+0.002
 poisoning+0.002
 hesitation+0.002
 permission+0.002
Neg logit contributions
 spir-0.003
 plains-0.003
aling-0.003
irl-0.003
 sorts-0.003
Token he
Token ID339
Feature activation-1.60e-02

Pos logit contributions
 injuries+0.007
 mercy+0.006
 hesitation+0.005
 poisoning+0.005
 forgiveness+0.004
Neg logit contributions
 plains-0.015
 spir-0.015
aling-0.015
ab-0.015
irl-0.015
Token started
Token ID2067
Feature activation-2.55e-02

Pos logit contributions
 plains+0.024
 spir+0.024
aling+0.023
 sorts+0.022
ab+0.022
Neg logit contributions
 injuries-0.026
 mercy-0.024
 poisoning-0.023
 help-0.022
 permission-0.022
Token to
Token ID284
Feature activation-2.56e-02

Pos logit contributions
 plains+0.066
aling+0.064
 spir+0.064
ab+0.064
 cycl+0.063
Neg logit contributions
 injuries-0.015
 mercy-0.010
 permission-0.009
 hesitation-0.009
 help-0.009
Token cl
Token ID537
Feature activation-8.40e-02

Pos logit contributions
 plains+0.045
ab+0.043
aling+0.042
 spir+0.042
 cycl+0.041
Neg logit contributions
 injuries-0.037
 help-0.034
 mercy-0.033
 worry-0.032
 hesitation-0.031
Tokenap
Token ID499
Feature activation-4.55e-02

Pos logit contributions
 plains+0.105
 spir+0.096
 cycl+0.090
 jung+0.086
 sights+0.082
Neg logit contributions
 injuries-0.148
 permission-0.144
 help-0.144
 mercy-0.139
 worry-0.138
Token his
Token ID465
Feature activation+3.05e-02

Pos logit contributions
 plains+0.090
 spir+0.088
aling+0.087
 cycl+0.086
irl+0.083
Neg logit contributions
 injuries-0.054
 mercy-0.047
 permission-0.043
 help-0.042
 hesitation-0.042
Token paws
Token ID49193
Feature activation+1.13e-02

Pos logit contributions
 injuries+0.044
 mercy+0.039
 permission+0.037
 help+0.037
 hesitation+0.035
Neg logit contributions
aling-0.055
 plains-0.051
 spir-0.051
irl-0.050
 sorts-0.049
Token.
Token ID13
Feature activation-1.07e-04

Pos logit contributions
 injuries+0.012
 mercy+0.010
 hesitation+0.009
 poisoning+0.009
 permission+0.009
Neg logit contributions
 plains-0.023
 spir-0.023
 cycl-0.022
irl-0.022
aling-0.022
Token He
Token ID679
Feature activation-2.30e-02

Pos logit contributions
irl+0.000
irling+0.000
 spir+0.000
ep+0.000
aling+0.000
Neg logit contributions
 help-0.000
 to-0.000
 injuries-0.000
t-0.000
 yet-0.000
Token leave
Token ID2666
Feature activation-1.26e-02

Pos logit contributions
 injuries+0.005
 mercy+0.004
 hesitation+0.004
 help+0.004
 permission+0.004
Neg logit contributions
 plains-0.006
 spir-0.005
ab-0.005
aling-0.005
 sorts-0.005
Token the
Token ID262
Feature activation+2.63e-03

Pos logit contributions
 plains+0.019
 spir+0.019
aling+0.018
irl+0.017
 sorts+0.017
Neg logit contributions
 injuries-0.020
 mercy-0.019
 hesitation-0.017
 poisoning-0.017
 fault-0.016
Token board
Token ID3096
Feature activation-2.76e-02

Pos logit contributions
 injuries+0.005
 mercy+0.005
 poisoning+0.004
 hesitation+0.004
 permission+0.004
Neg logit contributions
 plains-0.003
 spir-0.003
aling-0.003
ab-0.003
irl-0.003
Token in
Token ID287
Feature activation-1.95e-02

Pos logit contributions
 plains+0.045
 spir+0.045
aling+0.043
ab+0.043
irl+0.042
Neg logit contributions
 injuries-0.043
 mercy-0.038
 hesitation-0.035
 poisoning-0.035
 permission-0.034
Token the
Token ID262
Feature activation-2.64e-02

Pos logit contributions
 sorts+0.025
 spir+0.025
 plains+0.024
irl+0.022
aling+0.022
Neg logit contributions
 injuries-0.034
 mercy-0.033
 poisoning-0.031
ache-0.030
 hesitation-0.030
Token p
Token ID279
Feature activation-9.65e-02

Pos logit contributions
irl+0.036
aling+0.036
 plains+0.036
 spir+0.036
ab+0.036
Neg logit contributions
 injuries-0.048
 mercy-0.043
 poisoning-0.041
 help-0.040
 permission-0.040
Tokenuddle
Token ID24500
Feature activation-3.34e-02

Pos logit contributions
 spir+0.152
 plains+0.146
 cycl+0.142
 strut+0.141
aling+0.136
Neg logit contributions
 injuries-0.155
 fault-0.133
 mercy-0.125
t-0.123
 poisoning-0.123
Token and
Token ID290
Feature activation-3.57e-02

Pos logit contributions
 plains+0.058
 spir+0.058
aling+0.057
ab+0.056
 sorts+0.055
Neg logit contributions
 injuries-0.047
 mercy-0.043
 hesitation-0.039
 poisoning-0.037
 forgiveness-0.036
Token he
Token ID339
Feature activation-1.66e-02

Pos logit contributions
 plains+0.062
ab+0.060
 spir+0.060
aling+0.058
irl+0.058
Neg logit contributions
 injuries-0.053
 mercy-0.048
 help-0.043
 poisoning-0.043
 permission-0.043
Token went
Token ID1816
Feature activation-3.27e-02

Pos logit contributions
 plains+0.029
 spir+0.028
aling+0.027
ab+0.027
irl+0.027
Neg logit contributions
 injuries-0.024
 mercy-0.021
 poisoning-0.019
 hesitation-0.019
 permission-0.019
Token back
Token ID736
Feature activation+4.77e-03

Pos logit contributions
 plains+0.069
 spir+0.069
aling+0.067
ab+0.066
irl+0.066
Neg logit contributions
 injuries-0.035
 mercy-0.030
 hesitation-0.027
 poisoning-0.026
 permission-0.025
Token\n
Token ID198
Feature activation+1.42e-02

Pos logit contributions
 injuries+0.028
 mercy+0.025
 hesitation+0.024
 poisoning+0.023
 permission+0.022
Neg logit contributions
 plains-0.030
 spir-0.030
irl-0.029
aling-0.029
ab-0.028
TokenMom
Token ID29252
Feature activation+2.99e-02

Pos logit contributions
 injuries+0.016
 mercy+0.014
 poisoning+0.012
 hesitation+0.012
 permission+0.011
Neg logit contributions
 plains-0.029
 spir-0.029
aling-0.028
ab-0.028
irl-0.028
Token plays
Token ID5341
Feature activation-3.76e-02

Pos logit contributions
 injuries+0.042
 mercy+0.040
 poisoning+0.035
 hesitation+0.035
 help+0.035
Neg logit contributions
 spir-0.050
 plains-0.050
irl-0.048
aling-0.047
arg-0.047
Token along
Token ID1863
Feature activation-2.64e-02

Pos logit contributions
 plains+0.064
 spir+0.064
 cycl+0.063
aling+0.063
irl+0.061
Neg logit contributions
 injuries-0.053
 mercy-0.047
 poisoning-0.045
 help-0.044
 hesitation-0.044
Token and
Token ID290
Feature activation-6.44e-02

Pos logit contributions
 plains+0.043
 spir+0.043
aling+0.041
ab+0.041
irl+0.040
Neg logit contributions
 injuries-0.041
 mercy-0.037
 hesitation-0.034
 poisoning-0.034
 permission-0.032
Token pret
Token ID2181
Feature activation-9.32e-02

Pos logit contributions
 plains+0.103
arg+0.100
 array+0.099
irl+0.098
ab+0.098
Neg logit contributions
 mercy-0.085
 injuries-0.085
 help-0.083
 first-0.081
 hesitation-0.078
Tokenends
Token ID2412
Feature activation+3.45e-03

Pos logit contributions
 sorts+0.207
 plains+0.195
 spir+0.191
 array+0.184
 sights+0.184
Neg logit contributions
 injuries-0.080
t-0.069
 mercy-0.065
 help-0.061
 poisoning-0.058
Token she
Token ID673
Feature activation+2.83e-02

Pos logit contributions
 injuries+0.004
 mercy+0.003
 hesitation+0.003
 poisoning+0.003
 permission+0.003
Neg logit contributions
 spir-0.007
 plains-0.007
aling-0.007
irl-0.007
ab-0.007
Token is
Token ID318
Feature activation-1.30e-02

Pos logit contributions
 injuries+0.032
 mercy+0.030
 either+0.030
 help+0.030
t+0.028
Neg logit contributions
 spir-0.049
 plains-0.049
 array-0.047
irl-0.047
arg-0.047
Token his
Token ID465
Feature activation+4.05e-02

Pos logit contributions
ab+0.023
 plains+0.023
 cycl+0.022
aling+0.022
irl+0.022
Neg logit contributions
 injuries-0.017
 mercy-0.015
 hesitation-0.015
 permission-0.014
 help-0.014
Token helper
Token ID31904
Feature activation+4.98e-02

Pos logit contributions
 injuries+0.070
 mercy+0.064
 poisoning+0.059
 hesitation+0.059
 permission+0.057
Neg logit contributions
 plains-0.059
 spir-0.058
aling-0.055
ab-0.055
irl-0.055
Token to
Token ID284
Feature activation+3.25e-02

Pos logit contributions
 injuries+0.053
 mercy+0.043
 hesitation+0.040
 poisoning+0.039
 permission+0.035
Neg logit contributions
 plains-0.092
 spir-0.092
aling-0.088
 sorts-0.087
irl-0.087
Token never
Token ID1239
Feature activation+5.45e-02

Pos logit contributions
 injuries+0.050
 mercy+0.045
 hesitation+0.041
 poisoning+0.041
 permission+0.040
Neg logit contributions
 plains-0.053
 spir-0.053
aling-0.051
ab-0.050
irl-0.050
Token touch
Token ID3638
Feature activation-1.36e-02

Pos logit contributions
 injuries+0.101
 mercy+0.091
 poisoning+0.088
 hesitation+0.087
 fault+0.085
Neg logit contributions
 spir-0.069
aling-0.068
 plains-0.068
ab-0.065
irl-0.065
Token a
Token ID257
Feature activation+1.77e-02

Pos logit contributions
 spir+0.021
 plains+0.021
irl+0.019
ab+0.019
 sorts+0.019
Neg logit contributions
 injuries-0.022
 mercy-0.020
 hesitation-0.019
 poisoning-0.019
t-0.017
Token strange
Token ID6283
Feature activation-7.29e-03

Pos logit contributions
 injuries+0.035
 mercy+0.031
 poisoning+0.029
 hesitation+0.029
 forgiveness+0.028
Neg logit contributions
 plains-0.022
 spir-0.022
aling-0.021
irl-0.021
ab-0.021
Token je
Token ID11223
Feature activation-8.38e-02

Pos logit contributions
aling+0.010
 plains+0.010
 spir+0.010
irl+0.010
ab+0.009
Neg logit contributions
 injuries-0.013
 mercy-0.012
 hesitation-0.011
 poisoning-0.011
 forgiveness-0.011
Tokenep
Token ID538
Feature activation-2.25e-02

Pos logit contributions
 sorts+0.068
 spir+0.064
 plains+0.059
 strut+0.057
irl+0.057
Neg logit contributions
 injuries-0.209
 mercy-0.190
 blood-0.174
t-0.172
 help-0.172
Token again
Token ID757
Feature activation+4.86e-02

Pos logit contributions
 plains+0.043
 spir+0.043
aling+0.042
ab+0.041
irl+0.041
Neg logit contributions
 injuries-0.028
 mercy-0.025
 hesitation-0.023
 poisoning-0.022
 forgiveness-0.021
Token.
Token ID13
Feature activation-1.45e-02

Pos logit contributions
 injuries+0.078
 mercy+0.078
 poisoning+0.070
 hesitation+0.070
 fault+0.067
Neg logit contributions
 plains-0.068
 spir-0.066
ab-0.066
aling-0.064
irl-0.063
Token<|endoftext|>
Token ID50256
Feature activation-1.77e-02

Pos logit contributions
 plains+0.018
 spir+0.018
ab+0.016
aling+0.016
irl+0.016
Neg logit contributions
 injuries-0.028
 mercy-0.027
 hesitation-0.025
 poisoning-0.025
 permission-0.024
TokenOn
Token ID2202
Feature activation-2.11e-02

Pos logit contributions
 spir+0.034
 plains+0.033
aling+0.033
 cycl+0.032
irl+0.032
Neg logit contributions
 injuries-0.020
 mercy-0.018
 poisoning-0.017
 hesitation-0.016
 either-0.016
Token big
Token ID1263
Feature activation-3.18e-02

Pos logit contributions
aling+0.022
 plains+0.022
 spir+0.021
ab+0.020
irl+0.020
Neg logit contributions
 injuries-0.022
 mercy-0.019
 poisoning-0.018
 hesitation-0.018
 help-0.017
Token,
Token ID11
Feature activation-1.70e-02

Pos logit contributions
 plains+0.054
aling+0.054
 spir+0.053
ab+0.051
 cycl+0.051
Neg logit contributions
 injuries-0.047
 mercy-0.041
 hesitation-0.039
 poisoning-0.039
 help-0.036
Token green
Token ID4077
Feature activation-5.28e-02

Pos logit contributions
aling+0.026
 plains+0.026
irl+0.025
 cycl+0.025
arg+0.025
Neg logit contributions
 injuries-0.028
 poisoning-0.024
 mercy-0.024
 hesitation-0.023
 help-0.023
Token blanket
Token ID18447
Feature activation-2.65e-02

Pos logit contributions
 spir+0.083
aling+0.082
 plains+0.082
irl+0.078
ab+0.078
Neg logit contributions
 injuries-0.084
 mercy-0.076
 poisoning-0.071
 hesitation-0.070
 permission-0.067
Token and
Token ID290
Feature activation-4.35e-02

Pos logit contributions
 plains+0.050
 spir+0.050
aling+0.048
irl+0.048
ab+0.048
Neg logit contributions
 injuries-0.034
 mercy-0.030
 poisoning-0.027
 hesitation-0.027
 permission-0.026
Token unp
Token ID8593
Feature activation-1.09e-01

Pos logit contributions
aling+0.083
ab+0.079
 plains+0.079
acking+0.077
 spir+0.075
Neg logit contributions
 injuries-0.056
 mercy-0.051
 help-0.047
 poisoning-0.046
 hesitation-0.045
Tokenacked
Token ID6021
Feature activation-4.96e-02

Pos logit contributions
 spir+0.141
 strut+0.135
 soaring+0.131
 cycl+0.128
 prow+0.117
Neg logit contributions
t-0.164
 to-0.160
 poisoning-0.156
guard-0.140
 anymore-0.137
Token all
Token ID477
Feature activation-2.27e-02

Pos logit contributions
aling+0.089
 plains+0.081
 cycl+0.081
irl+0.080
alling+0.079
Neg logit contributions
 injuries-0.070
 mercy-0.064
 permission-0.060
 hesitation-0.059
 help-0.059
Token their
Token ID511
Feature activation+3.93e-02

Pos logit contributions
 plains+0.044
 cycl+0.044
 spir+0.043
aling+0.043
arg+0.043
Neg logit contributions
 injuries-0.022
 mercy-0.021
 poisoning-0.020
 help-0.020
 either-0.019
Token food
Token ID2057
Feature activation+1.84e-02

Pos logit contributions
 injuries+0.069
 mercy+0.064
 hesitation+0.059
 poisoning+0.058
 permission+0.056
Neg logit contributions
 plains-0.057
 spir-0.056
ab-0.053
irl-0.053
aling-0.052
Token.
Token ID13
Feature activation-1.09e-02

Pos logit contributions
 injuries+0.026
 mercy+0.024
 hesitation+0.021
 poisoning+0.020
 forgiveness+0.020
Neg logit contributions
 plains-0.032
 spir-0.032
 sorts-0.031
ab-0.030
irl-0.030