Reducing Bias in Modeling Real-world Password Strength via Deep Learning and Dynamic Dictionaries

Dario Pasquini†,§, Marco Cianfriglia§, Giuseppe Ateniese‡ and Massimo Bernaschi§
†Sapienza University of Rome, ‡Stevens Institute of Technology, §Institute of Applied Computing CNR

arXiv:2010.12269v3 [cs.CR] 12 Dec 2020
Abstract

Password security hinges on an accurate understanding of the techniques adopted by attackers. Unfortunately, real-world adversaries resort to pragmatic guessing strategies, such as dictionary attacks, that are inherently difficult to model in password security studies. In order to be representative of the actual threat, dictionary attacks must be thoughtfully configured and tuned. However, this process requires domain knowledge and expertise that cannot be easily replicated by researchers and security practitioners. The consequence of inaccurately calibrating those attacks is the unreliability of password security analyses, impaired by a severe measurement bias.

In the present work, we introduce new guessing techniques that make dictionary attacks consistently more resilient to inadequate configurations. Our framework allows dictionary attacks to self-heal and converge towards optimal attack performance, requiring no supervision or domain knowledge. To achieve this: (1) we use a deep neural network to model and then simulate the proficiency of expert adversaries; (2) we then introduce dynamic guessing strategies within dictionary attacks. These mimic experts' ability to adapt their guessing strategies on the fly by incorporating knowledge of their targets.

Our techniques enable more robust and sound password strength estimates within dictionary attacks, eventually reducing bias in modeling real-world threats in password security.

1 Introduction

Passwords have proven to be irreplaceable. They are still preferred over safer options and appear essential in fallback mechanisms. However, users tend to select their passwords as easy-to-remember strings, which results in very skewed distributions that can be easily modeled by an attacker. This makes passwords, and the authentication systems that implement them, inherently susceptible to guessing attacks. In this scenario, the security of the authentication protocol cannot be stated via a security parameter (e.g., the key size). The only way to establish the soundness of a system is to learn and model attackers' capabilities. To this end, simulating password guessing attacks has become a requisite practice: (1) administrators rely on cracking sessions to reactively evaluate the security of their accounts; (2) researchers use password guessing techniques to validate the soundness of proactive password checking approaches [34,46]. Ultimately, modeling attackers' capabilities is critical to ensure the security of passwords.

In this direction, more than three decades of active research have provided us with powerful password models [34, 36, 37, 45]. However, very little progress has been made to systematically model real-world attackers [32,43]. Indeed, professional password crackers rarely harness the fully-automated approaches developed in academia. They rely on more pragmatic guessing techniques that present stronger inductive biases. In offline attacks, professionals use high-throughput and flexible techniques such as dictionary attacks with mangling rules [1]. Moreover, they rely on highly tuned setups that result from profound expertise refined over years of practical experience [32, 43]. However, reproducing or modeling these proprietary attack strategies is very difficult, and the end results rarely mimic actual real-world threats [43]. This failure often results in an overestimation of password security that sways studies' conclusions and further jeopardizes password-based systems.

In the present work, we develop a new generation of dictionary attacks that more closely resembles real-world attackers' abilities and guessing strategies. In the process, we devise two complementary techniques that aim to systematically mimic different attackers' behaviors:

By rethinking the underlying framework, we devise the Adaptive Mangling Rules attack. This artificially simulates the optimal configurations harnessed by expert adversaries by explicitly handling the conditional nature of mangling rules. Here, during the attack, each word from the dictionary is associated with a dedicated and possibly unique rules-set that is created at runtime via a deep neural network.
Using this technique, we confirmed that standard attacks, based on off-the-shelf dictionaries and rules-sets, are sub-optimal and can be compressed by up to an order of magnitude in the number of guesses. Furthermore, we are the first to explicitly model the strong relationship that binds mangling rules and dictionary words, demonstrating its connection with optimal configurations.

Our second contribution introduces dynamic guessing strategies within dictionary attacks [37]. Real-world adversaries perform their guessing attacks by incorporating prior knowledge of the targets and dynamically adjusting their guesses during the attack. In doing so, professionals seek to optimize their configurations and maximize the number of compromised passwords. Unfortunately, automatic guessing techniques fail to model this adversarial behavior. In contrast, we demonstrate that dynamic guessing strategies can be enabled in dictionary attacks and substantially improve the guessing attack's effectiveness while requiring no prior optimization. More prominently, our technique makes dictionary attacks consistently more resilient to misconfigurations by promoting the completeness of the dictionary at runtime.

Finally, we combine these methodologies and introduce the Adaptive Dynamic Mangling rules attack (AdaMs). We show that it automatically causes the guessing strategy to progress towards an optimal one, regardless of the initial attack setup. The AdaMs attack consistently reduces the overestimation induced by inexpert configurations in dictionary attacks, enabling more robust and sound password strength estimates.

Organization: Section 2 gives an overview of the fundamental concepts needed for the comprehension of our contributions. In Section 3, we introduce Adaptive Mangling Rules alongside the intuitions and tools on which they are based. Section 4 discusses dynamic mangling rules attacks. Finally, Section 5 aggregates the previous methodologies, introducing the AdaMs attack. The motivation and evaluation of the proposed techniques are presented in their respective sections. Section 6 concludes the paper; supplementary information is provided in the Appendices.

2 Background and preliminaries

In this Section, we start by covering password guessing attacks and their foundations in Section 2.1. In Section 2.2, we focus on dictionary attacks, which are the basis of our contributions. Next, Section 2.3 briefly discusses relevant related works. Finally, we define the threat model in Section 2.4.

2.1 Password Guessing

Human-chosen passwords do not distribute uniformly in the exponentially large key-space. Users tend to choose easy-to-remember passwords that aggregate in relatively few dense clusters. Real-world passwords, therefore, tend to cluster in very bounded distributions that can be modeled by an attacker, making authentication systems intrinsically susceptible to guessing attacks. In a guessing attack, the attacker aims at recovering plaintext credentials by attempting several candidate passwords (guesses) till success or budget exhaustion; this happens by either searching for collisions of password hashes (offline attack) or attempting remote logins (online attack). In this process, the attacker relies on a so-called password model that defines which guesses, and in which order, should be tried to maximize the effectiveness of the attack (see Section 2.4).
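To make the offline setting concrete, here is a minimal sketch of such a guessing loop (our illustration, not code from the paper; the unsalted SHA-1 hashes and the tiny guess list are toy assumptions):

```python
import hashlib

def offline_guessing_attack(guesses, leaked_hashes, budget):
    """Try candidate passwords against a set of (unsalted) password
    hashes until a guessing budget is exhausted."""
    recovered = {}
    for i, guess in enumerate(guesses):
        if i >= budget:
            break
        digest = hashlib.sha1(guess.encode()).hexdigest()
        if digest in leaked_hashes:      # collision found: the hash is cracked
            recovered[digest] = guess
    return recovered

# Usage: in practice, guesses come from a password model and the
# budget is orders of magnitude larger.
hashes = {hashlib.sha1(p.encode()).hexdigest() for p in ["password123", "qwerty"]}
print(offline_guessing_attack(["123456", "password123"], hashes, budget=10**6))
```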
Generally speaking, a password model can be understood as a suitable estimation of the password distribution that enables an educated exploration of the key-space. Existing password models are built over a heterogeneous set of assumptions and rely on either intuitive or rigorous security definitions. From the most practical point of view, they can be divided into two macro-classes, i.e., parametric and nonparametric password models.

Parametric approaches build on top of probabilistic reasoning; they assume that real-world password distributions are sufficiently smooth to be accurately described by suitable parametric probabilistic models. Here, a password mass function is explicitly [34, 36] or implicitly [37] derived from a set of observable data (i.e., previously leaked passwords) and used to assign a probability to each element of the key-space. During the guessing attack, guesses are produced by traversing the key-space following the decreasing probability order imposed by the modeled mass function. These approaches are, in general, relatively slow and unsuitable for practical offline attacks. Although simple models such as Markov Chains can be employed [9], more advanced and effective models such as the neural network ones [34,37] are hardly considered outside the research domain due to their inefficiency.
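As a toy instance of the parametric class, the following sketch (ours; the order-1 model and the three-password training corpus are illustrative) estimates a password mass function with a character-level Markov Chain:

```python
from collections import Counter, defaultdict

def train_markov(passwords, order=1):
    """Estimate character-transition probabilities from leaked passwords."""
    counts = defaultdict(Counter)
    for pw in passwords:
        chars = "\x02" * order + pw + "\x03"          # start/end markers
        for i in range(len(chars) - order):
            counts[chars[i:i + order]][chars[i + order]] += 1
    return {ctx: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for ctx, cnt in counts.items()}

def password_probability(model, pw, order=1):
    """P(pw) as the product of character-transition probabilities."""
    p, chars = 1.0, "\x02" * order + pw + "\x03"
    for i in range(len(chars) - order):
        p *= model.get(chars[i:i + order], {}).get(chars[i + order], 0.0)
    return p

model = train_markov(["password", "password1", "pass123"])
print(password_probability(model, "pass1"))
```

Sorting the key-space by such probabilities yields the decreasing-probability guess ordering described above, which is also why these models are slow: the key-space must be enumerated or sampled.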
Nonparametric models such as Probabilistic Context-Free Grammars (PCFG) and dictionary attacks rely on simpler and more intuitive constructions, which tend to be closer to human logic. Generally, these treat passwords as realizations of templates and generate novel guesses by abstracting such patterns and applying them to ground-truth. These approaches maintain a collection of tokens that are either directly given as part of the model configuration (e.g., the dictionary and rules-set for a dictionary attack) or extracted from observed passwords in a setup phase (e.g., terminals/grammar for PCFG). In contrast with parametric models, these can produce only a limited number of guesses, which is a function of the chosen configuration. A detailed discussion of dictionary attacks follows in the next Section.
2.2 Dictionary Attacks

Dictionary attacks can be traced back to the inception of password security studies [35, 41]. They stem from the observation that users tend to pick their passwords from a bounded and predictable pool of candidates; common natural words and numeric patterns dominate most of this skewed distribution [40]. An attacker, collecting such strings (i.e., creating a dictionary/wordlist), can use them as high-quality guesses during a guessing attack, rapidly covering the key-space's densest zone. These dictionaries are typically constructed by aggregating passwords revealed in previous incidents and plain-word dictionaries.

Although dictionary attacks can produce only a limited number of guesses¹, these can be extended through mangling rules. Mangling rules attacks describe password distributions by factorizing guesses into two main components: (1) dictionary-words and (2) string transformations (mangling rules). These transformations aim at replicating users' composition behavior, such as leeting or concatenating digits (e.g., "pa$$w0rd" or "password123") [26]. Mangling transformations are modeled by the attacker and collected in sets (i.e., rules-sets). During the guessing attack, each dictionary word is extended in real-time through mangling rules, creating novel guesses that augment the guessing attack's coverage over the key-space. Hereafter, we use the terms dictionary attack and mangling rules attack interchangeably.

¹ The required disk space inherently bounds the number of guesses issued by plain dictionary attacks. Guessing attacks can easily go beyond 10^12 guesses, and storing such a quantity of strings is not practical.

The most widely known implementations of mangling rules are included in the password cracking software Hashcat [6] and John the Ripper [8] (JtR). Here, mangling rules are encoded through simple custom programming languages. Hashcat and JtR share almost overlapping mangling rules languages, although a few peculiar instructions are unique to each tool. However, they consistently differ in the way mangling rules are applied during the attack. Hashcat follows a word-major order, where all the rules of the rules-set are applied to a single dictionary-word before the next dictionary-word is considered. In contrast, JtR follows a rule-major order, where a rule is applied to all the dictionary words before moving to the next rule. In our work, we rely on the approach of Hashcat, as the word-major order is necessary to efficiently implement the adaptive mangling rules attack that we introduce in Section 3.3. The community behind these software packages has developed numerous mangling rules-sets that are publicly available.
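The two orders can be sketched as follows (our illustration; `apply_rule` and the two toy rules stand in for a real rule interpreter such as Hashcat's):

```python
def word_major(words, rules, apply_rule):
    """Hashcat-style: exhaust all rules on one word, then move on."""
    for w in words:
        for r in rules:
            yield apply_rule(r, w)

def rule_major(words, rules, apply_rule):
    """JtR-style: apply one rule to every word, then move on."""
    for r in rules:
        for w in words:
            yield apply_rule(r, w)

# Toy rule interpreter: 'u' uppercases, '$1' appends '1' (loosely echoing
# Hashcat/JtR rule syntax; the real languages are far richer).
demo = {"u": str.upper, "$1": lambda w: w + "1"}
print(list(word_major(["love", "cookie"], ["u", "$1"], lambda r, w: demo[r](w))))
# ['LOVE', 'love1', 'COOKIE', 'cookie1']
# rule_major would instead yield ['LOVE', 'COOKIE', 'love1', 'cookie1'].
```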
Despite their simplicity, mangling rules attacks represent a substantial threat in offline password guessing. Mangling rules are extremely fast and inherently parallel; they are naturally suited for both parallel hardware (i.e., GPUs) and distributed setups, making them one of the few guessing approaches suitable for large-scale attacks (e.g., botnets). Furthermore, real-world attackers update their guessing strategy dynamically during the attack [43]. Based on prior knowledge and the initially matched passwords, they tune their guess-generation process to describe their target set of passwords better and eventually recover more of them. To this end, professionals prefer extremely flexible tools that allow for fast and complete customization. While state-of-the-art probabilistic models fail at that, mangling rules make any form of customization feasible as well as natural.

2.3 Related Works

Although dictionary attacks are ubiquitous in password security research [20, 23, 24, 30, 34], little effort has been spent studying them. This Section covers the most relevant contributions.

Ur et al. [43] first made explicit the large performance gap between optimized and stock configurations for mangling rules attacks. In their work, Ur et al. recruited professional figures in password recovery and compared their performance against off-the-shelf parametric/nonparametric approaches in different guessing scenarios. Here, professional attackers were shown capable of vastly outperforming any password model, thanks to custom dictionaries, proprietary mangling rules, and the ability to create tailored rules for the attacked set of passwords (referred to as freestyle rules). Finally, the authors show that the performance gap between professional and non-professional attacks can be reduced by combining the guesses of multiple password models.

More recently, Liu et al. [32] produced a set of tools that can be used to optimize the configuration of dictionary attacks. These solutions extend previous approaches [3, 7], making them faster. Their core contribution is an algorithm capable of inverting almost all mangling rules; that is, given a rule r and a password to evaluate p, the rule-inversion function produces as output a regex that matches all the preimages of p under r, i.e., all the dictionary entries that, transformed by r, would produce p. At the cost of an initial pre-computation phase, following this approach, it is possible to count dictionary-words/mangling-rules hits on an attacked set without enumerating all the possible guesses. Liu et al. used the method to optimize the ordering of mangling rules in a rules-set by sorting them in decreasing hits-count order.² In doing so, the authors observed that default rules-sets only rarely follow an optimal ordering.

² Primarily, for rule-major order setups (e.g., JtR).

Based on the same general approach, they speed up the automatic generation of mangling rules [3] and augment dictionaries by adding missing words in consideration of known attacked sets [7]. Similarly, they derive an approximate guess-number calculator for rule-major order attacks.
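As a toy illustration of the inversion idea (ours, not Liu et al.'s tooling; only two trivially invertible rules are modeled, and real inverters emit regexes rather than string lists):

```python
def invert_append_1(p):
    """Inverse of Hashcat's '$1' rule (append '1'): preimages of p."""
    return [p[:-1]] if p.endswith("1") else []

def invert_capitalize(p):
    """Inverse of Hashcat's 'c' rule (capitalize)."""
    return [p.lower()] if p[:1].isupper() and p[1:].islower() else []

def count_hits(attacked_set, dictionary, inverters):
    """Count rule/word hits without enumerating every guess: invert each
    attacked password and test membership in the dictionary."""
    hits = {name: 0 for name in inverters}
    words = set(dictionary)
    for p in attacked_set:
        for name, inv in inverters.items():
            hits[name] += sum(w in words for w in inv(p))
    return hits

inverters = {"$1": invert_append_1, "c": invert_capitalize}
print(count_hits(["Cookie", "love1"], ["cookie", "love"], inverters))
# {'$1': 1, 'c': 1}  ->  rules can then be sorted by decreasing hit count
```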
2.4 Threat Model

In our study, we primarily model the case of trawling, offline attacks. Here, an adversary aims at recovering a set of passwords X (also referred to as the attacked-set) coming from an arbitrary password distribution P(x) by performing a guessing attack. To better describe both the current trend in password storing techniques [27, 38, 39] and real-world attackers' goals [17], we assume a rational attacker who is bound to produce a limited number of guesses. More precisely, this attacker aims at maximizing the number of guessed passwords in X given a predefined budget, i.e., a maximal number of guesses the attacker is willing to perform on X. Hereafter, we model this strategy under the form of the β-success-rate [18,19]:

$$s_\beta(X) = \sum_{i=1}^{\beta} P(x_i). \quad (1)$$
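For intuition, a minimal computation of Eq. 1 (ours; the toy probabilities are invented) over a distribution sorted in decreasing probability order:

```python
def beta_success_rate(probabilities, beta):
    """Eq. 1: the probability mass covered by the beta most likely
    passwords, i.e., the expected success of an attacker who spends a
    budget of beta guesses in optimal (decreasing-probability) order."""
    ranked = sorted(probabilities, reverse=True)
    return sum(ranked[:beta])

# Toy distribution: a '123456'-like heavy head and a long flat tail.
dist = [0.05, 0.02, 0.01] + [0.001] * 920
print(beta_success_rate(dist, beta=3))   # 0.08
```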
Experimental setup In our construction, we do not impose any limitation on the nature of P(x) nor on the attacker's a priori knowledge. However, in our experiments, we consider a weak attacker who does not retain any initial knowledge of the target distribution, i.e., who cannot provide an optimal attack configuration for X before the attack. This last assumption better describes the use-case of the automatic guessing approaches currently employed in password security studies.

In the attacks reported in the paper, we always sort the words in the dictionary according to their frequency. The password leaks that we use throughout the paper are listed in Appendix A.

3 The Adaptive Mangling Rules attack

In this Section, we introduce the first core block of our password model: the Adaptive Mangling Rules. We start in Section 3.1, where we make explicit the conditional nature of mangling rules while discussing its connection with optimal attack configurations. In Section 3.2, we model the functional relationship connecting mangling rules and dictionary words via a deep neural network. Finally, leveraging the introduced tools, we establish the Adaptive Mangling Rules attack in Section 3.3.

Motivation: Dictionary attacks are highly sensitive to their configuration; while parametric approaches tend to be more robust to train-set and hyper-parameter choices, the performance of dictionary attacks crucially depends on the selected dictionary and rules-set [32, 43]. As evidenced by Ur et al. [43], real-world attackers rely on extremely optimized configurations. Here, dictionaries and mangling rules are jointly created over time through practical experience [1], harnessing domain knowledge and expertise that is mostly unknown to the academic community [32]. Very often, password security studies rely on publicly available dictionaries and rules-sets that are not as effective as the advanced configurations adopted by professionals. Unavoidably, this leads to a constant overestimation of password strength that skews the conclusions of studies and reactive analyses.

Hereafter, we show that the domain knowledge of professional attackers can be suitably approximated with a Deep Neural Network. Given that, we devise a new dictionary attack that autonomously promotes functional interaction between the dictionary and the rules-set, implicitly simulating the precision of real-world attackers' configurations. We start by presenting the intuition behind our technique. Formalization and methodology are reported later.

3.1 The conditional nature of mangling rules

As introduced in Section 2.2, dictionary attacks describe password distributions by factorizing guesses into two main components: a dictionary word w and a transformation rule r. Here, the word w acts as a semantic base, whereas r is a syntactic transformation that aims at providing a suitable guess through the manipulation of w. Generally speaking, such a factorized representation can be thought of as an approximation of typical users' composition behavior: starting from a plain word or phrase, users manipulate it by performing operations such as leeting, appending characters, or concatenation.

At configuration time, such transformations are abstracted and collected in arbitrarily large rules-sets under the form of mangling rules. Then, during the attack, guesses are produced by exhaustively applying the collected rules to all the words in the dictionary. In this generation process, rules are applied unconditionally to all the words, assuming that the abstracted syntactic transformations interact equally with all the elements in the dictionary. However, arguably, users do not follow the same simplistic model in their password composition process. Users first select words and then mangling transformations conditioned on those words. That is, mangling transformations are subjective and depend on the base words to which they are applied. For instance, users may prefer to append digits at the end of a name (e.g., "jimmy" to "jimmy91"), repeat short words rather than long ones (e.g., "why" to "whywhywhy"), or capitalize certain strings over others (e.g., "cookie" to "COOKIE").

Pragmatically, we can think of each mangling rule as a function that is valid on an arbitrarily small subset of the dictionary space, strictly defined by the users' composition habits. Thus, applying a mangling rule to words outside this domain unavoidably brings it to produce guesses that have only a negligible probability of inducing hits during the guessing attack (i.e., that do not replicate users' behavior). This concept is captured in Figure 1, where four panels depict the hits distribution of the rules-set "best64" for four different dictionaries. Each dictionary represents a specific subset of the dictionary space that has been obtained by filtering out suitable strings from the RockYou leak; namely, these are passwords composed of:
[Figure 1: Distribution of hits per rule for 4 different input dictionaries for the same attacked-set, i.e., animoto. Within a plot, each bar depicts the normalized number of hits for one of the 77 mangling rules in best64. We performed the attack with Hashcat. Panels: (a) Only digits. (b) Only capital letters. (c) Strings of length 5. (d) Strings of length 10.]
digits (Figure 1a), capital letters (Figure 1b), passwords of length 5 (Figure 1c), and passwords of length 10 (Figure 1d). The four histograms show how mangling rules selectively and heterogeneously interact with the underlying dictionaries. Rules that produce many hits for a specific dictionary inevitably perform very poorly with the others.

Eventually, the conditional nature of mangling rules has a critical impact in defining the effectiveness of a dictionary attack. To reach optimal performance, an attacker has to resort to a setup that a priori maximizes the conditional effectiveness of mangling rules. In this direction, we can see the highly optimized configurations used by experts as pairs of dictionaries and rules-sets that organically support each other in the guess-generation process.³ On the other hand, configurations based on arbitrarily chosen rules-sets and dictionaries may not be fully compatible and, as we show later in the paper, generate a large number of low-quality guesses. Unavoidably, this phenomenon makes adversary models based on mangling rules inaccurate and induces an overestimation of password strength [43].

³ This has also been indirectly observed by Ur et al. in their ablation study on pros' guessing strategy, where the greatest improvement was achieved with a proprietary dictionary in tandem with a proprietary rules-set.

Next, we show how modeling the conditional nature of mangling rules allows us to cast dictionary attacks that are inherently more resilient to poor configurations.

3.2 A Model of Rule/Word Compatibility

We introduce the notion of compatibility, which refers to the functional relation among dictionary words and mangling rules discussed in the previous Section. The compatibility can be thought of as a continuous value defined between a mangling rule r and a dictionary-word w that, intuitively, measures the utility of applying the rule r on w. More formally, we model compatibility as a function:

$$\pi : R \times W \to [0, 1],$$

where R and W are the rule-space (i.e., the set of all the suitable transformations r : W → W) and the dictionary-space (i.e., the set of all possible dictionary words), respectively. Values of π(w, r) close to 1 indicate that the transformation induced by r is well-defined on w and would lead to a valuable guess. Values close to 0, instead, indicate that users would not apply r over w, i.e., the guess will likely fall outside the dense zone of the password distribution.

This formalization of the compatibility function also leads to a straightforward probabilistic interpretation that better supports the learning process through a neural network. Indeed, we can think of π as a probability function over the event:

$$r(w) \in X,$$

where X is an abstraction of the attacked set of passwords. More precisely, we have that:

$$\forall w \in W,\ r \in R : \quad \pi(r, w) = P(r(w) \in X).$$

In other words, P(r(w) ∈ X) is the probability of guessing an element of X by trying the guess g = r(w) produced by the application of r over w. Furthermore, such a probability can be seen as an unnormalized version of the password distribution, creating a direct link to probabilistic password models [34, 36] (more details are given in Appendix C). However, here, the password distribution is defined over the factorized domain R × W rather than directly over the key-space.

This factorized form offers practical advantages over the classic formulation. In more detail, by choosing and fixing a specific rule-space R (i.e., a rules-set), we can reshape the compatibility function as:

$$\pi_R : W \to [0, 1]^{|R|}. \quad (2)$$

This version of the compatibility function takes as input a dictionary-word and outputs a compatibility value for each rule in the chosen rules-set with a single inference. This form is concretely more computationally convenient and will be used to model the neural approximation of the compatibility function.
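The practical difference between the two formulations can be sketched as follows (our illustration; `net` is a toy stand-in for a trained scorer over a four-rule rules-set):

```python
import numpy as np

def pi(net, word, rule_index):
    """π: score a single (rule, word) pair -- one inference per pair."""
    return float(net(word)[rule_index])

def pi_R(net, word):
    """π_R (Eq. 2): score every rule of the rules-set in one inference."""
    return net(word)                                  # shape: (|R|,)

net = lambda w: (np.array([0.9, 0.1, 0.6, 0.02])
                 if w == "cookie" else np.full(4, 0.05))
print(pi(net, "cookie", 0))   # 0.9
print(pi_R(net, "cookie"))    # [0.9  0.1  0.6  0.02]
```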
Next, we show how the compatibility function can be inferred from raw data using a deep neural network.
3.2.1 Learning the compatibility function

As stated before, the probabilistic interpretation of the compatibility function makes it possible to learn π using a neural network. Indeed, the probability P(r(w) ∈ X), in any form, can be described through a binary classification: for each word/rule pair (w, r), we have to predict one of two possible outcomes: g ∈ X or g ∉ X, where g = r(w). In solving this classification task, we can train a neural network via logistic regression and obtain a good approximation of the probability P(r(w) ∈ X).

In the same way, the reshaped formulation of π (i.e., Eq. 2) describes a multi-label classification. In a multi-label classification, each input participates simultaneously in multiple binary classifications, i.e., an input is associated with multiple classes at the same time. More formally, having a fixed number of possible classes n, each data point is mapped to a binary vector in {0, 1}^n. In our case, n = |R| and each bit in the binary vector corresponds to the outcome of the event r_j(w) ∈ X for a rule r_j ∈ R.

To train a model, then, we have to resort to a supervised learning approach. We have to create a suitable training-set composed of (input, label) pairs that the neural network can model during the training. Under our construction, we can easily produce suitable labels by performing a mangling rules attack. In particular, having fixed a rules-set R, we collect pairs (w_i, y_i), where w_i is the input to our model (i.e., a dictionary-word) and y_i is the label vector associated with w_i. As explained before, the label y_i asserts the membership of the list of guesses [r_1(w_i), r_2(w_i), ..., r_{|R|}(w_i)] in a hypothetical target set of passwords X, i.e.:

$$y_i = [\,r_1(w_i) \in X,\ r_2(w_i) \in X,\ \dots,\ r_{|R|}(w_i) \in X\,] \quad (3)$$

To collect labels, then, we have to concretize X by choosing a representative set of passwords. Intuitively, such a set should be sufficiently large and diverse since it describes the entire key-space. Hereafter, we refer to this set as X_A. This is the set of passwords we attack during the process of collecting labels. In the same way, we have to choose another set of strings W that represents and generalizes the dictionary-space. This is used as input to the neural network during the training process, and as the input dictionary during the simulated guessing attack. Details on the adopted sets are given at the end of the section.

Finally, given X_A and W, and having chosen a rules-space R, we construct the set of labels by simulating a guessing attack; that is, for each entry w_i in the dictionary W, we collect the label vector y_i (Eq. 3). In doing so, we used a modified version of Hashcat described in Appendix H. Alternatively, the technique proposed in [32] can be used to speed up the label collection. Unlike in an actual guessing attack, during this process we do not remove passwords from X_A when they are guessed correctly; that is, the same password can be guessed multiple times by different combinations of rules and words. This is necessary to correctly model the functional compatibility. In the same way, we do not consider the identity mangling rule (i.e., ':') in the construction of the training set. When it occurs, we remove it from the rules-set. To the same end, we do not consider hits caused by conditional identity transformations, i.e., r(w) = w.
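A compact sketch of this label-collection step (ours; `apply_rule` abstracts the rule interpreter, and the toy sets stand in for W and X_A):

```python
import numpy as np

def build_training_set(W, rules, X_A, apply_rule):
    """Build (word, label-vector) pairs per Eq. 3. Hits are kept even for
    already-guessed passwords; identity transformations are ignored."""
    X_A = set(X_A)
    data = []
    for w in W:
        y = np.zeros(len(rules), dtype=np.uint8)
        for j, r in enumerate(rules):
            g = apply_rule(r, w)
            if g != w and g in X_A:       # skip r(w) == w (identity hits)
                y[j] = 1
        data.append((w, y))
    return data

rules = [lambda w: w.upper(), lambda w: w + "1", lambda w: w[::-1]]
pairs = build_training_set(["cookie", "love"], rules, {"COOKIE", "love1"},
                           apply_rule=lambda r, w: r(w))
# -> [('cookie', [1, 0, 0]), ('love', [0, 1, 0])]
```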
Training set configuration The creation of a training set entails the proper selection of the sets X_A and W as well as the rules-set R. Arguably, the most critical choice is the set X_A, as this is the ground-truth on which we base the approximation of the compatibility function. In our study, we select X_A to be the password leak discovered by 4iQ in the Dark Web [12]. We completely anonymized all entries by removing users' information, and obtained a set of ∼4·10^8 unique passwords. We use this set as X_A within our models. Similarly, we want W to be a good description of the dictionary-space. However, in this case, we exploit the generalization capability of the neural network, which can automatically infer a general description of the input space from a relatively small training set. In our experiments, we use the LinkedIn leak as W.

Finally, we train three neural networks that learn the compatibility function for three different rules-sets, namely PasswordPro, generated, and generated2. Those sets are provided with the Hashcat software and have been widely studied in previous works [32, 34, 37]. Table 1 lists them along with some additional information.

Name          Cardinality   Brief Description
PasswordPro   3120          Manually produced.
generated     14728         Automatically generated.
generated2    65117         Automatically generated.

Table 1: The Hashcat mangling rules-sets used.

Eventually, the labels we collect in the guessing process are extremely sparse. In our experiments, more than 95% of the guesses are misses, causing our training-set to be extremely unbalanced towards the negative class.

Model definition and training We construct our model over a residual structure [25] primarily composed of mono-dimensional convolution layers. Here, input strings are first embedded at the character level via a linear transformation; then, a series of residual blocks is sequentially applied to extract a global representation of each dictionary word. Finally, such representations are mapped into the label-space by means of a single linear layer that performs the classification task. This architecture is trained in a multi-label classification; each output of the final dense layer is squashed into the interval [0, 1] via the logistic (sigmoid) function, and binary cross-entropy is applied
to each probability separately. The network's loss is then obtained by summing up all the cross-entropies of the |R| classes.
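A minimal Keras sketch of an architecture in this spirit (ours; the layer sizes, block count, vocabulary, and maximum word length are illustrative guesses, while the actual specification is in Appendix D):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """1D-convolutional residual block over character positions."""
    h = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    h = layers.Conv1D(filters, 3, padding="same")(h)
    return layers.ReLU()(layers.Add()([x, h]))

def build_compatibility_net(num_rules, vocab_size=128, max_len=16, dim=64):
    """Characters -> embedding -> residual conv blocks -> |R| logits."""
    inp = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, dim)(inp)        # character-level embedding
    for _ in range(4):
        x = residual_block(x, dim)
    x = layers.GlobalAveragePooling1D()(x)            # global word representation
    out = layers.Dense(num_rules)(x)                  # one logit per mangling rule
    return tf.keras.Model(inp, out)

model = build_compatibility_net(num_rules=3120)       # e.g., PasswordPro
model.summary()
```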
As mentioned in the previous Section, our training-set is extremely unbalanced toward the negative class; that is, the vast majority of the ground-truth labels assigned to a training instance are negative. Additionally, a similar disproportion appears in the distribution per rule. Typically, we have many rules that count only a few positive examples, whereas others have orders of magnitude more hits. In our framework, we alleviate the negative effects of those disproportions by inductive bias. In particular, we achieve it by considering a focal regularization in our loss function [31].

The focal loss was originally developed for object detection tasks in which there is a strong imbalance between foreground and background classes; we adopt it to account for sparse and underrepresented labels when learning the compatibility function. This focal loss is mainly characterized by a modulating factor γ that dynamically reduces the importance of well-classified instances in the computation of the loss function, allowing the model to focus on hard examples (e.g., underrepresented rules). More formally, the form of regularized binary cross-entropy that we adopt is defined as:

$$FL(p_j, y_j) = \begin{cases} -(1-\alpha)\,(1-p_j)^{\gamma}\log(p_j) & \text{if } y_j = 1 \\ -\alpha\, p_j^{\gamma}\log(1-p_j) & \text{if } y_j = 0 \end{cases}$$

where p_j is the probability assigned by the model to the j-th class, and y_j is the ground-truth label (i.e., 1/hit and 0/miss). The parameter α in the equation allows us to assign an a priori importance factor to the negative class. We use it to down-weight the correct predictions of the negative class in the loss function, which would be dominant otherwise. In our setup, we dynamically select α based on the distribution of the hits observed in the training set. In particular, we choose α = p̄/(1−p̄), where p̄ is the ratio of positive labels (i.e., hits/guesses) in the dataset. Differently, we fix γ=2, as we found this value to be optimal in our experiments. Summing up, our loss function is defined as:

$$L_f = \mathbb{E}_{x,y}\left[\sum_{j=1}^{|R|} FL\big(\mathrm{sigmoid}(f(x)_j),\ y_j\big)\right],$$

where f are the logits of the neural network. We train the model using Adam stochastic gradient descent [29] until an early-stopping criterion based on the AUC of a validation set is reached.
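The loss can be sketched in TensorFlow as follows (our rendering of the formulas above; the positive-label ratio passed in is a toy value):

```python
import tensorflow as tf

def make_focal_loss(p_bar, gamma=2.0):
    """FL from the equation above, with alpha = p_bar / (1 - p_bar),
    where p_bar is the ratio of positive labels (hits/guesses)."""
    alpha = p_bar / (1.0 - p_bar)
    def focal_loss(y_true, logits):
        p = tf.sigmoid(logits)
        pos = -(1.0 - alpha) * (1.0 - p) ** gamma * tf.math.log(p + 1e-9)
        neg = -alpha * p ** gamma * tf.math.log(1.0 - p + 1e-9)
        per_class = tf.where(tf.equal(y_true, 1.0), pos, neg)
        return tf.reduce_sum(per_class, axis=-1)   # sum over the |R| classes
    return focal_loss

# e.g., ~5% of the collected guesses are hits:
loss_fn = make_focal_loss(p_bar=0.05)
# model.compile(optimizer="adam", loss=loss_fn)   # model outputs raw logits
```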
Maintaining the same general architecture, we train different networks of different sizes. In our experiments, we noticed that large networks provide a better approximation of the compatibility function, although small networks can be used to reduce the computational cost with a limited loss in utility. In the paper, we report the results only for our biggest networks.

We implemented our framework in TensorFlow; the models have been trained on an NVIDIA DGX-2 machine. A complete description of the architectures employed is given in Appendix D. Additionally, Appendix I contains further remarks on the neural approximation of the compatibility function.

Ultimately, we obtain three different neural networks: one for each rules-set reported in Table 1. Summing up, each neural network is an approximation of the compatibility function π_R for the respective rules-set R, capable of assigning a compatibility score to each rule in R with a single network inference, i.e., Eq. 2. The suitability of these neural approximations will be proven later in the paper.

Additional approaches To improve the performance of our method, we further investigated domain-specific constructions for multi-label classification. In particular, we tested label embedding techniques. Those are approaches that aim at implicitly modeling the correlation among labels. However, although unconditional dependence is evident in the modeled domain, we found no concrete advantage in directly considering it during the training. In the same direction, we investigated more sophisticated embedding techniques, where labels and dictionary-words were jointly mapped to the same latent space [48], yet achieving similar performance. Additionally, we tested implementations based on transformer networks [44], obtaining no substantial improvement. We attribute such a result to the lack of dominant long-term relationships among the characters composing dictionary-words. In such a domain, we believe convolutional filters to be fully capable of capturing characters' interactions. Furthermore, convolutional layers are significantly more efficient than the multi-head attention mechanism used by transformer networks.

3.3 Adaptive Mangling Rules

As motivated in Section 3.2, each word in the dictionary interacts just with a limited number of mangling transformations that are conditionally defined by users' composition habits. While modern rules-sets can contain more than ten thousand entries, each dictionary-word w will interact only with a small subset of compatible rules, say R_w. As stated before, optimized configurations compose over pairs of dictionaries and rules-sets that have been created to mutually support each other. This is achieved by implicitly maximizing the average cardinality of the compatible set of rules R_w for each dictionary-word w in the dictionary.

In doing so, advanced attackers rely on domain knowledge and intuition to create optimized configurations. But, thanks to the explicit form of the compatibility function, it is possible to simulate their expertise. The intuition is that, given a dictionary-word w, we can infer the compatible rules-set R_w (i.e., the set of rules that interact well with w) according to the
[Figure 2: Comparison between adaptive and classic mangling rules on four combinations of password leaks (dictionary/attacked-set) using the rules-set PasswordPro; β=0.5 is used for the adaptive case. Each panel plots the fraction of guessed passwords against the number of guesses, comparing the adaptive and standard attacks: (a) MyHeritage on animoto, (b) animoto on MyHeritage, (c) animoto on RockYou, (d) RockYou on animoto.]
More formally, given π for the rules-set R and a dictionary-word w, we can determine the compatible rules-set for w by thresholding the compatibility values assigned by the neural network to the rules in R:

        R_w ≈ R_w^β = { r | r ∈ R ∧ π(w, r) > (1 − β) },        (4)

where β ∈ (0, 1] is a threshold parameter whose effect will be discussed later.

   At this point, we simulate high-quality configuration attacks by ensuring that a dictionary-word never interacts with rules outside its compatible rules-set R_w^β. Algorithm 1 implements this strategy by following a word-major order in the generation of guesses: every dictionary-word is limited to interacting with the subset of compatible rules R_w^β decided by the neural net. Intuitively, this is equivalent to assigning and applying a dedicated (and possibly unique) rules-set to each word in the dictionary. Note that the selection of the compatible rules-set is performed at runtime, during the attack, and does not require any pre-computation. We call this novel guessing strategy Adaptive Mangling Rules, since the rules-set is continuously adapted during the attack to better assist the selected dictionary.

Algorithm 1: Adaptive mangling rules attack.
   Data: dictionary D, rules-set R, budget β, neural net π_R
   1  forall w ∈ D do
   2      R_w^β = { r | π_R(w)_r > (1 − β) };
   3      forall r ∈ R_w^β do
   4          g = r(w);
   5          issue g;
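For concreteness, the following is a minimal Python sketch of Algorithm 1. The callable rules and the function pi_model (standing in for the neural approximation of π, which scores all the rules of R with a single inference) are illustrative assumptions, not the actual implementation.

    from typing import Callable, Iterable, Sequence

    def adaptive_mangling_attack(
        dictionary: Iterable[str],
        rules: Sequence[Callable[[str], str]],       # each rule maps a word to a guess
        pi_model: Callable[[str], Sequence[float]],  # assumed: one score per rule in R
        beta: float,                                 # attack budget, 0 < beta <= 1
        issue: Callable[[str], None],                # consumer of the produced guesses
    ) -> None:
        """Word-major guess generation restricted to compatible rules."""
        threshold = 1.0 - beta
        for word in dictionary:
            scores = pi_model(word)            # a single inference scores {w} x R
            for rule, score in zip(rules, scores):
                if score > threshold:          # Eq. 4: r belongs to R_w^beta
                    issue(rule(word))

Note that with β = 1 the threshold drops to 0 and virtually every rule is applied, recovering the standard mangling rules attack.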
   The efficacy of adaptive mangling rules over the standard attack is shown in Figure 2, where multiple examples are reported. The adaptive mangling rules reduce the number of produced guesses while leaving the hits count mostly unchanged. In our experiments, the adaptive approach induces compatible rules-sets that, on average, are an order of magnitude smaller than the complete rules-set. Typically, for β=0.5, only ∼10–15% of the rules are conditionally applied to the dictionary-words. Considering the percentage of guessed passwords for adaptive and non-adaptive attacks, this means that approximately 90% of the guesses produced by classic, unoptimized mangling rules attacks are wasted. Figure 3 further reports the distribution of the rules selected during the adaptive attack of Figure 2a. It emphasizes how heterogeneously mangling rules interact with the underlying dictionary. Although very few rules interact well with all the words (e.g., selection frequency > 70%), most of the mangling rules participate only in rare events. Further empirical validation for the adaptive mangling rules is given in Section 5.

Figure 3: Selection frequencies of adaptive mangling rules for the 3120 rules of PasswordPro (y-axis: selection ratio; x-axis: rules).

The Attack Budget   Unlike standard dictionary attacks, whose effectiveness depends solely on the initial configuration, adaptive mangling rules can be controlled by an additional scalar parameter that we refer to as the attack budget β. This parameter defines the compatibility threshold that a rule must exceed to be included in the rules-set R_w^β for a word w. Indirectly, this value determines the average size of the compatible rules-sets and, consequently, the total number of guesses performed during the attack. More precisely, low values of β force the compatible rules-sets to include only rules with high compatibility. Those produce a limited number of guesses, inducing very precise attacks at the cost of missing possible hits (i.e., high precision, low recall). Higher values of β translate into a more permissive selection, where rules with low compatibility are also included in the compatible set. Those increase the number of produced guesses, inducing more exhaustive, yet less precise, attacks (i.e., higher recall, lower precision). When β reaches 1, the adaptive mangling rules attack becomes a standard mangling rules attack, since all the rules are unconditionally included in the compatible rules-set. The effect of the budget parameter is better captured by the examples reported in Figure 4, where the performance for multiple values of β is visualized and compared with the total hits and guesses of a standard mangling rules attack.

Figure 4: Effect of the parameter β on guessing performance for four different combinations of password sets and PasswordPro rules: (a) MyHeritage on animoto, (b) animoto on MyHeritage, (c) animoto on RockYou, (d) RockYou on animoto; each panel compares adaptive attacks with β ∈ {0.4, 0.5, 0.6, 0.7}. Plots are normalized according to the results of the standard mangling rules attack (i.e., β = 1): for instance, (x=0.1, y=0.95) means that we guessed 95% of the passwords guessed by the standard mangling rules attack while performing 10% of the guesses it required. (Axes: number of relative HITS vs. number of relative GUESSES.)

   The budget parameter β can be used to model different types of adversaries. For instance, rational attackers [17] change their configuration in consideration of the practical cost of performing the attack. This parameter permits us to easily describe those attackers and evaluate password security accordingly. For instance, using a low budget (e.g., β=0.4), we can model a greedy attacker who selects an attack configuration that maximizes guessing precision at the expense of the number of compromised accounts (a rational behavior in the case of an expensive hash function).

   Seeking a more pragmatic interpretation, the budget parameter is implicitly equivalent to early-stopping4 (i.e., Eq. 1) in which single guesses are sorted in optimal order, i.e., guesses are exhaustively generated before the attack and indirectly sorted by decreasing probability/compatibility.

   4 The attack stops before the guesses are exhausted.

   The optimal value of β depends on the rules-set. In our tests, we found the optimal values to be 0.6, 0.8 and 0.8 for PasswordPro, generated and generated2, respectively. Hereafter, we use these setups, unless otherwise specified.

Computational cost   One of the core advantages of dictionary attacks over more sophisticated approaches [34, 36, 45] is their speed. For mangling rules attacks, generating guesses has an almost negligible impact. Despite being consistently more complex in their mechanisms, adaptive mangling rules do not change this feature.

   In Algorithm 1, the only additional operation over the standard mangling rules attack is the selection of compatible rules for each dictionary-word via the trained neural net. As discussed in Section 3.2.1, this operation requires just a single network inference; that is, with a single inference, we obtain a compatibility score for each element in {w} × R. Furthermore, inference for multiple consecutive words can be trivially batched and computed in parallel, further reducing the computation's impact (see the sketch below).

   Table 2 reports the number of compatibility values that different neural networks can compute per second. In the table, we used our largest networks without any form of optimization. Nevertheless, the overhead over the plain mangling rules attack is minimal (see Appendix G). Additionally, similar to standard dictionary attacks, adaptive mangling rules attacks are inherently parallel and, therefore, distributable and scalable.

Table 2: Number of compatibility scores computed per second (c/s) for different networks. Values computed on a single NVIDIA V100 GPU.

   generated2 (large): 130,550,403 c/s
   generated (large): 89,049,382 c/s
   PasswordPro (large): 31,836,734 c/s
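The batching mentioned above can be sketched as follows. Here, pi_model is assumed to accept a batch of words and return one row of |R| compatibility scores per word; the batch size is an illustrative default, not a value taken from our experiments.

    import itertools
    from typing import Callable, Iterable, Iterator, Sequence, Tuple

    def batched_compatibility(
        words: Iterable[str],
        pi_model: Callable[[Sequence[str]], Sequence[Sequence[float]]],
        batch_size: int = 4096,
    ) -> Iterator[Tuple[str, Sequence[float]]]:
        """Score batch_size consecutive words per (parallel) network inference."""
        iterator = iter(words)
        while True:
            batch = list(itertools.islice(iterator, batch_size))
            if not batch:
                return
            # one forward pass: row i holds the |R| scores of batch[i]
            for word, row in zip(batch, pi_model(batch)):
                yield word, row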
4   Dynamic Dictionary attacks

This section introduces the second and last component of our password model: a dynamic mechanism that systematically adapts the guessing configuration to the unknown attacked-set. In Section 4.1, we introduce the Dynamic Dictionary Augmentation technique. Next, in Section 4.2, we introduce the concept of Dynamic Budgets.

Motivation:   As widely documented [18, 21, 33, 37], password composition habits slightly change from sub-population to sub-population. Although passwords tend to follow the same general distribution, credentials created in different environments exhibit unique biases. Users within the same group usually choose passwords related to each other, influenced mostly by environmental factors or the underlying applicative layer. Major factors are, for example, users' mother tongue [21], community interests [47] and imposed password composition policies [30]. These have a significant impact on the final password distribution and, consequently, on the guessability of the passwords [28]. The same factors that shape a password distribution are generally available to attackers, who can collect and use them to drastically improve the configuration of their guessing attacks.

   Unfortunately, current automatic reactive/proactive guessing techniques fail to describe this natural adversarial behavior [28, 32, 33, 43, 46]. Those methods are based on static configurations that apply the same guessing strategy to every attacked-set of passwords, mostly ignoring trivial information that can be either collected a priori or distilled from the running attack. In this section, we discuss suitable modifications of the mangling-rules framework that describe a more realistic guessing strategy. In particular, to avoid requiring any prior knowledge of the attacked-set, we rely on the concept of a dynamic attack [37]. Here, a dynamic attacker is an adversary who changes his guessing strategy according to the attack's success rate. Successful guesses are used to select future attempts, with the goal of exploiting the non-i.i.d. nature of passwords originating from the same environment. In other words, dynamic password guessing attacks automatically collect information on the target password distribution and use it to forge guessing configurations unique to the attacked set during the attack. This general guessing approach can also be linked to the optimal guessing strategy harnessed from human experts in [43], where mangling rules were created at execution time based on the initially guessed passwords.

4.1   Dynamic Dictionary Augmentation

In [37], dynamic adaptation of the guessing strategy is obtained through latent space manipulations of deep generative models. A similar effect is reproduced within our mangling rules approach by relying on a consistently simpler, yet powerful, solution based on hit-recycling. That is, every time we guess a new password by applying a mangling rule to a dictionary word, we insert the guessed password into the dictionary at runtime. In practice, we dynamically augment the dictionary during the attack using the guessed passwords.5 In the process, every new hit is directly reconsidered and semantically extended through mangling rules. This recursive method brings about massive chains/trees of hits that can extend for thousands of levels.6

Figure 5: Example of a small hits-tree induced by the dynamic attack performed on the phpBB leak (rooted at "steph", with nodes such as steph69, phpphp, php123, php1234, thephpman and php123456). In the tree, every vertex is a guessed password; an edge between two nodes indicates that the child password has been guessed by applying a mangling rule to the parent password.

   Figure 5 depicts an extremely small subtree ("hits-tree") obtained by attacking the password leak phpBB. The tree starts when the word "steph" is mangled, incidentally producing the word "phpphp". Since the latter lies in a dense zone of the attacked set (i.e., it is a common user practice to insert the name of the website or related strings in passwords), it induces multiple hits and causes the attack to focus on that specific zone of the key-space. The focus of the attack grows exponentially hit after hit and stops automatically only when no more passwords are matched. Eventually, this process makes it possible to guess passwords that would be missed by the static approach. For instance, in Figure 5, all the nodes in bold are passwords matched by the dynamic attack but missed by the static one (i.e., a standard dictionary attack) under the same configuration.

   Figure 6 compares the guessing performance of the dynamic attack against the static version on a few examples for the PasswordPro rules-set. The plots show that the dynamic augmentation of the dictionary has a very heterogeneous effect on the guessing attacks. In the case of Figure 6a, the dynamic attack produces a substantial increment in the number of guesses as well as in the number of hits, i.e., from ∼15% to ∼80% recovered passwords. Arguably, such a gap is due to the minimal size of the original dictionary phpBB. In the attack of Figure 6b, instead, a similar improvement is achieved while requiring only a small number of additional guesses. On the other hand, in the attack depicted in Figure 6c, the dynamic augmentation has a limited effect on the final hits count.

   5 Although we have not found any direct reference to the hit-recycling technique in the literature, it is likely well known and routinely deployed by professionals.
   6 I.e., a forest, where the root of each tree is a word from the original dictionary.
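To make the mechanism explicit, the following is a minimal sketch of the hit-recycling loop in its plain (non-adaptive) form; the rules are again assumed to be callables, the membership test on the attacked set stands in for the actual hash comparison, and a production implementation would also deduplicate guesses.

    from collections import deque
    from typing import Callable, Iterable, Sequence, Set

    def dynamic_dictionary_attack(
        dictionary: Iterable[str],
        rules: Sequence[Callable[[str], str]],
        attacked_set: Set[str],
        issue: Callable[[str], None],
    ) -> None:
        """Dynamic dictionary augmentation via hit-recycling."""
        remaining = set(attacked_set)
        queue = deque(dictionary)          # roots of the hits-forest
        while queue:
            word = queue.popleft()
            for rule in rules:
                guess = rule(word)
                issue(guess)
                if guess in remaining:     # a hit is removed from the target...
                    remaining.remove(guess)
                    queue.append(guess)    # ...and recycled as a dictionary-word

The breadth-first queue makes the hits-forest of footnote 6 explicit: the original dictionary words are the roots, and every hit re-enters the queue as a new parent.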
Figure 6: Performance comparison between the dynamic and the classic (static) attack for five different setups of dictionary/attacked-set: (a) phpBB on animoto, (b) RockYou on animoto, (c) MyHeritage on animoto, (d) animoto on RockYou, (e) animoto on MyHeritage. The rules-set PasswordPro in non-adaptive mode is used in all the reported attacks. The five setups have been handpicked to fully represent the possible effects of the dynamic dictionary augmentation. (Axes: guessed passwords vs. number of guesses.)
Figure 7: Guessing attacks performed on the animoto leak using three different dictionaries (phpBB, RockYou, MyHeritage): (a) standard attack, (b) dynamic attack. The left panel reports the guessing curves for the static setup; the right panel reports those for the dynamic setup. The x-axis (number of guesses) is logarithmic; the y-axis reports the percentage of guessed passwords.
However, it increases the attack precision in the initial phase. Conversely, the attacks in Figures 6d and 6e show a decreased precision in the initial phase of the attack, which is compensated later by the dynamic approach. The same results are reported in Appendix F for the rules-sets generated and generated2.

   Another interesting property of the dynamic augmentation is that it makes the guessing attack consistently less sensitive to the choice of the input dictionary. Indeed, in contrast with the static approach, different choices of the initial dictionary tend to produce very homogeneous results in the dynamic approach. This behavior is captured in Figure 7, where the results obtained by varying three input dictionaries are compared between the static and the dynamic attack. The standard attacks (Figure 7a) result in very different outcomes; for instance, using phpBB we match 15% of the attacked-set, whereas we match more than 80% with MyHeritage. These differences in performance are leveled out by the dynamic augmentation of the dictionary (Figure 7b): all the dynamic attacks recover ∼80% of the attacked-set. Intuitively, dynamic augmentation remedies deficiencies in the initial configuration of the dictionary, promoting its completeness. These claims will find further support in Section 5.

4.2   Dynamic budgets

Adaptive mangling rules (Section 3.3) demonstrated that it is possible to consistently improve the precision of the guessing attack by promoting compatibility between the rules-set and the dictionary (i.e., by simulating high-quality configurations at runtime). This approach assumes that the compatibility function modeled before the attack is sufficiently general to simulate good configurations for every possible attacked-set. However, as motivated in the introduction of Section 4, every attacked set of passwords presents peculiar biases and, therefore, different compatibility relations among rules and dictionary-words. To reduce the effect of this dependence, we introduce an additional dynamic approach supporting the adaptive mangling rules framework. Rather than modifying the neural network at runtime (which is neither a practical nor a reliable solution), we alter the selection process of compatible rules by acting on the budget parameter β.

   Algorithm 2 details our solution. Here, rather than having a global parameter β for all the rules of the rules-set R, we have a budget vector B that assigns a dedicated budget value to each rule in R (i.e., B ∈ (0, 1]^|R|). Initially, all the budget values in B are set to the same value β (i.e., ∀r∈R : B_r = β), given as an input parameter. During the attack, the elements of B are individually increased and decreased to better describe the attacked set of passwords.

Algorithm 2: Adaptive rules with Dynamic budget
   Data: dictionary D, rules-set R, attacked-set X, budget β
   1  forall w ∈ D do
   2      R_w^β = { r | π_R(w)_r > (1 − B_r) };
   3      forall r ∈ R_w^β do
   4          g = r(w);
   5          if g ∈ X then
   6              X = X − {g};
   7              B_r = B_r + ∆;
   8      B = B · (|B|·β / ∑_i B_i);
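A compact Python rendering of Algorithm 2 is sketched below, under the same illustrative assumptions as before (callable rules, a pi_model returning one score per rule). For simplicity, the increment delta is a fixed constant here, whereas in our scheme it scales inversely with the number of guesses produced, and the bounds on the budgets discussed next are omitted.

    from typing import Callable, Iterable, Sequence, Set

    def dynamic_budget_attack(
        dictionary: Iterable[str],
        rules: Sequence[Callable[[str], str]],
        attacked_set: Set[str],
        pi_model: Callable[[str], Sequence[float]],
        beta: float,
        delta: float,
        issue: Callable[[str], None],
    ) -> None:
        """Adaptive mangling rules driven by a per-rule dynamic budget."""
        budgets = [beta] * len(rules)     # B: all budgets initialized to beta
        target_mass = len(rules) * beta   # total budget to preserve
        remaining = set(attacked_set)
        for word in dictionary:
            scores = pi_model(word)
            for i, rule in enumerate(rules):
                if scores[i] > 1.0 - budgets[i]:   # rule i is compatible with w
                    guess = rule(word)
                    issue(guess)
                    if guess in remaining:
                        remaining.remove(guess)
                        budgets[i] += delta        # reward the hitting rule
            scale = target_mass / sum(budgets)     # keep sum(B) = |R| * beta
            budgets = [b * scale for b in budgets]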

Within this context, increasing the budget B_r of a rule r means reducing the compatibility threshold needed to include r in the compatible rules-set of a dictionary-word w and, consequently, making r more popular during the attack. On the other hand, by decreasing B_r, we reduce the chances of selection for r; r is then selected only for high-compatibility words.

   In the algorithm, we increase the budget B_r whenever the rule r produces a hit. The added increment is a small value ∆ that scales inversely with the number of guesses produced. At the end of the internal loop, the vector B is normalized; i.e., we scale the values in B so that ∑_r B_r = |R|·β. Normalizing B has two aims. (1) It reduces the budgets of non-hitting rules (the mass we add to the budget of rule r is subtracted from all the other budgets). (2) It keeps the total budget of the attack (i.e., |R|·β) unchanged, so that, for a given β, dynamic and static budgets lead to almost the same number of guesses during the attack. Furthermore, we impose a maximum and a minimum bound on the increments and decrements of B. This prevents budget values of zero (rule always excluded) or equal to or greater than one (rule always included).

   As for the dynamic dictionary augmentation, the dynamic budget always has a positive, yet heterogeneous, effect on the guessing performance. Mostly, the number of hits increases or remains unaffected. Among the proposed techniques, this is the one with the mildest effect; still, it is particularly useful when combined with dynamic dictionary augmentation in the next section. Appendix E further details the improvement induced by the dynamic budgets.
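As a worked illustration of this bounded, mass-conserving update, the helper below rewards a hitting rule, rescales the vector to restore the total budget |R|·β, and clips the result; the concrete lo/hi bounds are illustrative assumptions, as the text only requires budgets to stay strictly between 0 and 1.

    from typing import List, Sequence

    def update_budgets(
        budgets: Sequence[float],
        hit_index: int,
        beta: float,
        delta: float,
        lo: float = 0.05,   # assumed bound: a rule is never permanently excluded
        hi: float = 0.95,   # assumed bound: a rule is never always included
    ) -> List[float]:
        """One budget update: reward a hit, renormalize, enforce the bounds."""
        updated = list(budgets)
        updated[hit_index] += delta
        scale = len(updated) * beta / sum(updated)   # restore sum(B) = |R| * beta
        return [min(hi, max(lo, b * scale)) for b in updated]

    # For example, with three rules, beta = 0.5 and delta = 0.1:
    # update_budgets([0.5, 0.5, 0.5], hit_index=0, beta=0.5, delta=0.1)
    # -> [0.5625, 0.46875, 0.46875]  (the hitting rule gains the mass the others lose)

Note that the final clipping can make the conservation approximate; in this sketch, exactness is traded for keeping every budget inside (0, 1).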
5   Adaptive, Dynamic Mangling rules: AdaMs

The results of the previous section confirm the effectiveness of the dynamic guessing mechanisms. We increased the number of hits compared to classic dictionary attacks by using the produced guesses to improve the attack on the fly. However, in the process, we also increased the number of guesses, possibly in a way that is hard to control and gauge. Moreover, by changing the dictionary at runtime, we disrupt any optimization of the initial configuration, such as an a priori ordering of the wordlist [32] or a joint optimization with the rules-set.7 Unavoidably, this leads to sub-optimal attacks that may overestimate password strength.

   To mitigate this phenomenon, we combine the dynamic augmentation technique with the Adaptive Mangling Rules framework. The latter seeks an optimal configuration at runtime on the dynamic dictionary, promoting compatibility with the rules-set and limiting the impact of imperfect dictionary-words. This process is further supported by the dynamic budgets, which address possible covariate-shifts [42] of the compatibility function induced by the augmented dictionary.

   Hereafter, we refer to this final guessing strategy as AdaMs (Adaptive, Dynamic Mangling rules). Details on the implementation of AdaMs are given in Appendix H, whereas we benchmark it in Appendix G.

   7 I.e., new words may not interact well with the mangling rules in use.

5.1   Evaluation

Figure 8 reports an extensive comparison of AdaMs against standard mangling-rules attacks. In the figure, we test all the dictionary/rules-set pairs obtained by combining the dictionaries MyHeritage, RockYou, animoto and phpBB with the rules-sets PasswordPro and generated, on four attacked-sets; results for generated2 are reported in Appendix F. Hereafter, we switch to a logarithmic scale, given the heterogeneity of the number of guesses produced by the various configurations.

   For the reasons given in the previous sections, AdaMs outperforms standard mangling rules within the same configurations while requiring fewer guesses on average. More interestingly, AdaMs attacks generally exceed the hits count of all the standard attacks regardless of the selected dictionary; in particular, this is always true for the generated rules-set. Conversely, in cases where the dynamic dictionary augmentation offers only a small gain in the number of hits (e.g., attacking RockYou), AdaMs equalizes the performance of the various dictionaries, typically towards the optimal configuration for the standard attack. In Figures 8d and 8h, all the configurations of AdaMs reach a number of hits comparable to the best configuration for the standard attack (i.e., the one using MyHeritage) while requiring up to an order of magnitude fewer guesses (e.g., Figure 8d), further confirming that the best standard attack is far from optimal. In the reported experiments, the only outlier is phpBB when used against zooks in Figure 8b. Here, AdaMs did not reach/exceed all the standard attacks in the number of hits, despite consistently redressing the initial configuration. However, this discrepancy is canceled out when more mangling rules are considered, i.e., in Figure 8f.

   Eventually, the AdaMs attack makes the initial selection of the dictionary systematically less influential. For instance, in our experiments, a set such as phpBB reaches the same performance as wordlists that are two orders of magnitude larger (e.g., RockYou). The crucial factor remains the rules-set's cardinality, which ultimately determines the magnitude of the attack, even though it does not appreciably affect the guessing performance.

   The effectiveness of AdaMs is better captured by the results reported in Figure 9. Here, we create a synthetic optimal dictionary for an attacked-set and evaluate the capability of AdaMs to converge to the performance of such an optimal configuration. To this end, given a password leak X, we randomly divide it into two disjoint sets of equal size, say Xdict and Xtarget. Then, we attack Xtarget by using both Xdict (i.e., the optimal dictionary) and an external dictionary (i.e., a sub-optimal dictionary). Arguably, Xdict is the a priori optimal dictionary to attack Xtarget, since Xdict and Xtarget are samples of the very
