Rethink Decision Tree Traversal - arXiv

Page created by Thomas Garcia
 
CONTINUE READING
Rethink Decision Tree Traversal
                                                                             Jinxiong Zhang         ID

                                                                           jinxiongzhang@qq.com
arXiv:2209.04825v2 [cs.LG] 6 Oct 2022

                                                                                     Abstract
                                                  We will show how to implement binary decision tree traversal in the language of
                                              matrix computation. Our main contribution is to propose some equivalent algorithms of
                                              binary tree traversal based on novel matrix representation of the hierarchical structure
                                              of the decision tree. Our key idea is to travel the binary decision tree by maximum inner
                                              product search. We not only implement decision tree methods without the recursive
                                              traverse but also delve into the partitioning nature of tree-based methods.

                                        1    Introduction
                                        QuickScorer [12] and RapidScorer [21] are proposed based on bit-vectors of the false nodes
                                        in order to speed up the additive ensemble of regression trees in learning to rank. Inspired
                                        by [12], more works, such as [2; 11; 13; 15], focus on the application and acceleration of
                                        additive tree models while we will pay attention to the theory of algorithms specially the
                                        representation of binary decision tree in the language of matrix computation.
                                            Based on so-called Tree Supervision Loss, a hierarchical classifier is built from the weights
                                        of the softmax layer in convolutional neural networks in [18]. In [20; 19], tree regularization
                                        is used to enhance the interpretability of deep neural networks. A generalized tree repre-
                                        sentation termed TART is based on transition matrix shown in [22]. We introduce some
                                        matrix to represent the structure of binary decision tree, which means that matrix can imply
                                        hierarchy. And we will show how to generate equivalent algorithms of binary tree traversal,
                                        which is to translate the binary decision tree to matrix computation.
                                            We begin with the algorithm 1 and provide some illustration. Then we modify 1 to matrix
                                        computation, where we introduce bit-matrix and sign-matrix to represent the decision tree
                                        structure. And we consider binary decision tree evaluation as error-correcting output codes.
                                        Later we discuss the interpretable representation. Finally we give a summary.

                                        2    QuickScorer
                                        The QuickScorer is a tree-based rank model, which has reduced control hazard, smaller
                                        branch mis-prediction rate and better memory access patterns in experiment [12]. The
                                        core of QuickScorer is to represent the candidate exit leaves with a bitvetor [12] as shown in
                                        algorithm (1). Based on the bit-vectors of false nodes, tree traversal is via the bitwise logical
                                        AND operations when all the results of associated test evaluation are known as shown in
                                        algorithm (1). Its correctness is originally proved in [12]. More implementation or extension
                                        of QuickScorer can be found in [2; 11; 13; 21; 15].

                                                                                         1
Given a tree T (N, L), the traversal of the input x is described in algorithm 1 and its
correctness is proved in [12]. Here we follow the description of tree model in [12].
        A decision tree T (N, L) is composed of a set of internal nodes N = {n0 , n1 , · · · , nt }
        and and a set of leaves L = {l0 , l1 , · · · , lt+1 } where each n ∈ N is associated with
        a Boolean test and each leaf l ∈ L stores the prediction l.val ∈ R. All the nodes
        whose Boolean conditions evaluate to F alse are called false nodes, and true
        nodes otherwise. If a visited node in N is a false one, then the right branch is
        taken, and the left branch otherwise.
   Hereinafter, we assume that nodes of T are numbered in breadth-first order and leaves
from left to right.

Algorithm 1 Scorer of QuickScorer
      Input:
         • input feature vector x
         • a binary decision tree T (N, L) with
             –   internal nodes N : {n0 , · · · , nt−1 }, 1 ≤ t < ∞
             –   a set of leaves L: {l0 , · · · , lt }, 1 ≤ t < ∞
             –   the prediction of leaf l ∈ L :l.val ∈ R
             –   node bitvector associated with n ∈ N : n.bitvector
    Output: tree traversal output value
 1: procedure Score(x, T )
 2:    Initialize the result bitvector v ← 1, 1, · · · , 1
 3:    Find the F asle nodes U (x, T ) given x, T
 4:    for node u ∈ U do                                               . iterate over the false nodes
 5:        ~v ← ~v ∧ u.bitvector                                           . update by logical AND
 6:    j ← index of the leftmost bit set to 1 of ~v
 7:    return lj .val

    It is best to read [12] for more details on how to evaluate a single binary decision tree
stored as a set of precomputed bitvectors. Here we adopt the illustration from [16] in figure
(1) and (2).

2.1      Construction of Bit-vectors and Bit-matrix
In this section, we will introduce the construction of the bit-matrix based on (1).
    In [12], the idea of tree traversal using bitvectors is based on the node bitvector as below:
           Every internal node n is associated with a node bitvector n.bitvector (of the
        same length), acting as a bitmask that encodes (with 0’s) the set of leaves to be
        removed from L whenever n is a false node.
      In [21], QuickScorer is described as below
        QuickScorer maintains a bitvector, composed of ∧ bits, one per leaf, to indicate
        the possible exit leaf candidates with corresponding bits equal to 1.

                                                    2
Figure 1: The bitvector in QuickScorer

   We call the bit-vector in QuickScorer as bit-vector for false node defined in (1).
Definition 1 The bit-vector for a false node is defined as a binary vector in {0, 1}n if
the decision tree has n internal nodes, where the digit 1 corresponds to the possible exit leaf
candidates and the digit 0 corresponds to the impossible exit leaf when the associated Boolean
test is false.
    The last bit is always equal to 1 in each bitvector by definition.
    Now we consider the simplest case : (1) all internal nodes are true; (2) all internal nodes
are false. In the first case, there is no false nodes and the result bitvector v is only initialized
in algorithm 1. Thus, the first bit must correspond to the leftmost leaf. In the second case,
there is no true nodes and the result bitvector v is bit-wise logical AND of all bitvectors in
the tree, where the rightmost leaf must be indicated by 1.

Definition 2 The bit-matrix B of a decision tree is a binary matrix where each column is
the bitvector of the tree.
                                       T
    The simplest bit-matrix is 0      1 . The bit-matrix of the decision tree in figure (1) is
given as follows.                                               
                                       0       0       1   1   1
                                      0       1       1   1   1
                                                                
                                      1       1       0   0   1
                                    B=
                                      1
                                                                 .
                                              1       0   1   1
                                      1       1       1   1   0
                                       1       1       1   1   1
   The next is to show the relation between the structure of decision tree and its bit-matrix.

                                                   3
Figure 2: The scorer in QuickScorer

2.2    Some Properties of Bit-matrice
By the definition 2, the bit-matrix is binary matrix. Here we want to search some properties
of the bit-matrix of a binary decision tree specially its rank.

Definition 3 The complement of the bitvector b denoted by c is defined by

                                           b + c = ~1.
                                        T           T
For example, the complement of 0 1 is 1 0 .
    The complement bit-vectors of a node is a binary vector where the digit 1 corresponds
to the possible exit leaf candidates over its subtree when it is a true node. The bitvector b
and its complement bitvector ~1 − b cannot exist in the same bit-matrix of a decision tree.
    We can calculate a leaf bitvector as the product of bit-vectors of false nodes and comple-
ment bit-vectors of true nodes. For example, the bitvectors of the false nodes are 111111,
001111, 011111, 111101, 001101 and complement bit-vectors of true nodes are 001100 and
001000. And their product is 001000 which indicates the exit leaf directly

Lemma 1 The bitvector b is linearly independent of its complement bitvector ~1 − b.

Lemma 2 The bitvectors of the right subtree are linearly independent of the bitvectors of
the left subtree.

Proof 2.1 By the definition, the bits corresponding to the exit leaf candidates of the left
subtree are always set to 1 in the bitvectors of the right subtree while there is at least one
0-bit corresponding to the exit leaf candidates of the left subtree in the bitvectors of the left
subtree. Thus this lemma is completed.

   Based on the above lemma, we can obtain the following theorem 1.

                                               4
Theorem 1 The bit-matrix B is invertible for any decision tree.

Proof 2.2 We use induction. The induction hypothesis is that (Bn×(n−1) ) is invertible for
all integer n ≥ 2.
                                       T
    When n = 2, we have B = 0 1 so it holds.
    Assume that (Bn×(n−1) ) is invertible if t ≤ N , we will prove that (B(N +1)×N ) is invert-
ible. By the assumption, it holds for the right subtree and left subtree. By the above lemma,
it holds for the whole tree.
    As above, it holds for all integer n ≥ 2 by induction.

From 1 we can find that the augmented matrix (Bn×(n−1) ~1n ) invertible. The rank is an
invariant of bit-matrices if their size is determined. The column is unique and distinct in
the matrix B, where the digit 1/0 corresponds to the possible/impossible exit leaf when the
associated Boolean test is false.
    The augmented matrix (B ~1) of the tree in 1 is symmetric but the coding scheme of
bitvectors is based on the false nodes. This symmetry is emergent.
    The structure of a decision tree is determined by its bit-matrix and the structure of the
subtree is determined by a submatrix of the bit-matrix. And we can observe that recursive
patterns emerge in the bit-matrix as below.
                                                
                               0 0 1 1 1                      
                              0 1 1 1 1             0 0 1
                                                
                              1 1 0 0 1   0 1 1 
                              1 1 0 1 1 , 1 1 0 .
                                                            
                                                
                              1 1 1 1 0             1 1 1
                               1 1 1 1 1

    The hierarchy of the tree is encoded in the position order of bitvectors.

3     Beyond Bitvector: Diverse Tree Evaluation
We will convert the bitwise logical AND operation to multiplication of binary matrix and
vector. Specially, we focus on how to implement binary decision tree in the language of
matrix computation.

3.1    The Computational Perspective
There are two key steps of QuickScorer :(1) the first is to find the false nodes given the input
and tree; (2) the second is to find the terminal node based on the bitvectors of false nodes.
Here, we will implement these steps in the language of matrix computation.
   From the computational perspective, the of logical AND of bitvectors is equivalent to
their element-wise multiplication n algorithm (1)
                                                               Y
                 SF = arg max( ∧ u.bitvector) = arg max(          u.bitvector).              (1)
                                u∈U
                                                             u∈U

So the index of the bit set to 1 of logical AND in algorithm (1) requires that all operand
must be 1 in the index corresponding to the bit set to 1 of ~v . And it is equivalent to the
intersection ∩u∈U arg max(u.bitvector) by combining the fact that 1 is the maximum of the

                                               5
bitvector in algorithm (1). Then we can replace the logical AND operation in the algorithm
(1) with the matrix computation operation based on the following relation
                                                            X
                 SF = ∩ arg max(u.bitvector) = arg max(       u.bitvector).             (2)
                          u∈U
                                                                      u∈U

   The step to find the false nodes in (1) is to evaluate all the Boolean tests of internal
nodes and we use a binary vector to represent such procedure as defined in (4).
Definition 4 The test vector ~t of the input x is a binary column vector where the digit 1
corresponds to the false node and the digit 0 corresponds to the true node for a given tree
T (N, L).
Based on the test vector ~t and bit-matrix B, we replace the logical AND over the bitvectors
of false nodes by the matrix computation as below:

                                              ~v = B~t + ~1                                       (3)

where ~1 is a bitvector with all bits set to 1.
    The matrix-vector multiplication B~t is the sum of bitvectors of all false nodes. The key
observation is that the maximum of formula (3) corresponds to bit set to 1 of the result
bitvector in algorithm (1). In another word, arg max of formula (3) is equal to arg max
of the result vector in algorithm (1), where the operator arg max returns all the index of
maximum in the vector. Additionally, finding the index of leftmost maximum of formula
(3) is to find the minimum index in arg max of formula (3).
    And we can obtain the theorem (2).
Theorem 2 The algorithm (1) is equivalent to the algorithm (2).

Algorithm 2 Scorer of QuickScorer in language of matrix computation
      Input:
         • a feature vector x
         • a binary decision tree T (N, L) with
             –   internal nodes N : {n0 , · · · , nt−1 }, 1 ≤ t < ∞
             –   a set of leaves L: {l0 , · · · , lt }, 1 ≤ t < ∞
             –   the prediction of leaf l ∈ L :l.val ∈ R
             –   the bit-matrix of T (N, L): B
      Output: tree traversal output value
 1:   procedure Scorer(x, T )
 2:      Find the test vector ~t given x, T                                          . as defined in 4
 3:      Compute the result vector: ~v ← B~t + ~1               . tree traversal via matrix operation
 4:      j ← min arg max ~v                                   . index of the leftmost maximum of v
 5:      return lj .val

      The matrix-vector multiplication is the linear combination of column vector, so
                                               X
                                         B~t =     ti Bi ,
                                                     i

                                                    6
where Bi is the i-th bitvector. And we can obtain the following relation:
                                               X                     X
             SF = arg max(B~t + ~1) = arg max(    ti Bi ) = arg max(   pi ti Bi )
                                                  i                    i

if each pi is positive. Thus, the non-negative number is sufficient for the test vector ~t and
bitmatrix B. In another word, we extend the algorithm 1 from binary value to non-negative
value.
    Another observation of B~t is that arg max B~t = ∩i arg max(ti Bi ). Let B̃i replace the 0
in Bi by −1, then we can find the following equation
                                                                             X
           arg max B~t = ∩i arg max(ti Bi ) = ∩i arg max(ti B̃i ) = arg max(   ti B̃i ),    (4)
                                                                            i

so bit-matrix is not necessary for the algorithm (1).
   In contrast to [12], we can define the bitvector for the true node acting as a bitmask that
encodes (with 0’s) the set of leaves to be removed whenever the internal node is a true node.
The bitwise logical AND between ~1 and the node bitvector of a true node n corresponds to
the removal of the leaves in the right subtree of n from the candidate exit leaves. And we
can obtain the following equation
                                  Y
                  ST = arg max( t.bitvector) = ∩ arg max(t.bitvector)
                                                      t∈T
                                 t∈T

where T is the true node of a decision tree and t.bitvector is the bitvector for the internal
node t. The exit leaf cannot removed by the logical AND of these bitvectors but leftmost
bit set to 1 does not correspond to the exit leaf. In another word, we have the following
relation ST \ SF 6= ∅ and min(ST ∩ SF ) = min SF where SF is defined in 2.
    The inner product between the i-th column of the B and the test vector ~t is the i-th
value of B~t, which is the number of false nodes that select the i-th candidate leaf. And the
maximum of B~t is the number of all false nodes, which is equal to ~t, ~t or the `1 norm of ~t
and independent of the norm of column vectors of B.
    However, computation procedures can be equivalent in form but different in efficiency.
The arg max and intersection operation may take more time to execute. And these diverse
equivalent form will help us understand the algorithm further. In algorithm 2, we replace
the procedure to find the false nodes and the iteration over the false nodes in algorithm 1
with the test vector and matrix-vector multiplication.
    These observation and facts imply the possibility to design more efficient algorithms
equivalent to the algorithm (1). We can find more substitute for the bit-matrix encoding
schemes. In contrast to [18], we find that the hierarchy of each node is embedding in the
column vectors of the bit-matrix. In algorithm (1), it is the index of leftmost bit set to 1
in the result vector that we only require, which is the index of the exit leaf that we really
need during tree traversal when the input x and the decision tree T are given. Next we will
introduce how to construct an algorithm based on the unique and constant maximum to
traversal the binary tree.

3.2    The Algorithmic Perspective
The algorithm QuickScorer in [12] looks like magicians’ trick and its lack of intuitions makes
it difficult to understand the deeper insight behind it. The bitvector is position-sensitive

                                              7
in algorithm 1, where the nodes of are required to be numbered in breadth-first order and
leaves from left to right.
    Since the reachability of a leaf node is only determined by the internal nodes over the
path from its root to itself, it is these bitvectors in algorithm (1) that play the lead role
during traversal rather than the bitvectors of other false nodes. Thus we should take it into
consideration when designing new test vector ~s.

Definition 5 The signed test vector ~s of the input x is a binary vector where the digit
1 corresponds to the false node and the digit −1 corresponds to the true node for a given
decision tree.

    By comparing definition 1 and 5, we can find the equation ~s = 2~t − ~1.
    We only consider the matrix-vector B~t in 3 as the linear combination of bitvectors while
it is the inner products of the column vectors of B and test vector ~t that determine the
maximum of B~t. The bitvectors as rows in matrix B is designed for the internal nodes
rather than the leaf nodes.
    We need to design vectors to represent the leaf node as the column vector and ensure
that the inner product of the leaf vector and the test vector is equal to the number of the
internal nodes over the path from the root node to itself if the input reaches this leaf node.

Definition 6 The representation vector of a leaf node is a ternary vector determined by the
following rules:
   • the digit +1 corresponds to the true nodes over its path from the root node;
   • the digit −1 corresponds to the false nodes over its path from the root node;
   • and digit 0 corresponds to the rest of internal nodes.

    The representation vector of a leaf node actually describes the path from the root to this
leaf node, which contains the encoding character: −1 represents an edge to a left child, and
+1 represents an edge to a right child. And we defined the signed matrix to describe the
stricture of the binary decision tree as 7.
Definition 7 The signed matrix S of a decision tree consists of its leaf representation vec-
tors as its row vector.
  It is easy to find the upper bound of the inner product of a leaf vector p~ = (p1 , · · · , pd )T ∈
{+1, 0, −1}d and the signed test vector ~s = (s1 , · · · , sd )T ∈ {1, −1}d as shown below
                                            d
                                            X               d
                                                            X
                                h~
                                 p, ~si =         pi si ≤         |pi | = k~
                                                                           pk1 .                 (5)
                                            i=1             i=1

If pi si > 0, we say that the leaf node and the input reach a consensus on the i-th test.

Definition 8 The depth of a leaf node is the path length measured as the number of non-
terminal nodes between the root node and itself. The depth vector d~ of a decision tree consists
                                                               ~
of all leaf nodes depth and depth matrix D is defined as diag(d).

   If the number of consensuses reached between the input and the leaf node is equal to
the depth of the leaf node, then the leaf is the destination of the input in the decision tree.
   We propose the algorithm termed SignQuickScorer as (3).

                                                      8
Algorithm 3 SignQuickScorer
      Input:
        • a feature vector x
        • a binary decision tree T
    Output: tree traversal output value
 1: procedure SignQuickScorer(x, T )
 2:    Find the signed test vector ~s given x, T                          . as defined in 5
 3:    Compute the result vector: ~v ← D−1 S~s         . D−1 and S are precomputed in 7 8.
 4:    j ← arg max ~v                                         . index of the maximum of v
 5:    return lj .val

Theorem 3 The algorithm SignQuickScorer(3) is correct.
                                                                                             ∗
Proof 3.1 We only need to prove the inner product of the normalized exit leaf vector hs∗s,s∗ i
and the test vector ~t is the unique maximum of D−1 S~t.
                             hs∗ ,~ti
    First, we prove that hs∗ ,s∗ i = max{D−1 S~t}. And maxb∈B ht, bi = ht, b∗ i = ht, ti if and
                                                    hs∗ ,~ti
only if ht − b∗ , ti = 0 by the inequality 5. Thus hs∗ ,s∗ i ≤ 1. And according to the definition
5 and 6, we have s∗ , ~t − s∗ = 0 and imply that s∗ , ~t = hs∗ , s∗ i so max{D−1 S~t} = 1.
    For the rest of leaf vectors s ∈ T and the test vector ~t we have the equation: s, ~t <
hs, sibecause there is at least one test which the input does not match over the path from the
root node to this leaf node. So that s, ~t − s 6= 0. And we complete this proof.

    And the result vector can be divided element-wisely by the number of the internal nodes
 over the path, then we can obtain the unique maximum as the indicator of the exit leaf.
    We can take the example in figure 2 as follows
                                                                                   
                  −1 −1 0         0    0           2 0 0 0 0 0                        −1
        1         −1 +1 0                                                             0 
     1                           0    0  
                                            
                                                    0 2 0 0 0 0
                                                                                     
           , S = +1 0 −1 −1 0  , D = 0 0 3 0 0 0 , D−1 S~s =  11  .
                                                                                 
~s =  −1
                +1 0 −1 +1 0                    0 0 0 3 0 0                      
     −1                                                                          31 
                  +1 0 +1 0 −1                    0 0 0 0 3 0                     − 
       +1                                                                                 3
                                                                                         1
                    +1 0 +1 0 +1                     0 0 0 0 0 3                         3

    And we can regard the algorithm (3) as the decoding of error-correcting output coding
[4; 5] shown in (4), which is deigned for multi-classification by combining binary classifiers.
Both methods believe on the cluster assumption, which is based on the similarity in essence.

    Each class is attached to an unique code in error-correcting output coding(ECOC) [3; 10;
1] while a leaf vector only indicates a specific region of the feature space and each class may
be attached to more than one leaves in (3). And the key step in (4) is find the maximum
similarity, while key step in (3) is to find the maximum scaled inner product. The value
attached to the leaf node is voted by majority or averaging which is different from other
mixed regression method[6].
    Next we will show how to take advantage of the unique maximizer of D−1 S~s.

                                               9
Algorithm 4 SignQuickScorer as template matching
      Input:
        • a feature vector x
        • a binary decision tree T
      Output: tree traversal output value
 1:   procedure SignQuickScorer(x, T )
 2:      Find the test vector ~t given x, T                                   . as defined in 5
                                                       hSi ,~ti
 3:      Find the most likely template: j ← argi max     d~i
                                                                  . Si is the i-thh column of S
 4:      return lj .val.

3.3      The Representation Perspective
We define the δ function to replace the max operator in (3) as following:
                                      (
                                        0 x 6= 0
                               δ(x) =              ∀x ∈ R.
                                        1 x=0
According to the proof of 3, we can find the index of the leftmost maximum of ~v by the
following equation                      X
                                   j=      δ(~vi − 1)i
                                              i

because that max ~v = 1 and arg max ~v is unique, where v = D−1 S~t.
   And we find a succinct expression of 3 as shown in algorithm 5. We can explain this δ

Algorithm 5 SignQuickScorer as Attention
      Input:
        • a feature vector x
        • a binary decision tree T
    Output: tree traversal output value
 1: procedure SignQuickScorer(x, T )
 2:    Find the test vector ~t given x, T                                     . as defined in 5.
 3:    ComputeP the result vector: ~v ← D S t
                                          −1 ~
                                                                           . as computed in 3.
 4:    return i δ(~vi − 1)li .val

function as selector in order to find the most likely prediction value.
                             hSi ,~ti
   Note that ~vi − 1 ⇐⇒        d~i
                                      = 1 ⇐⇒ Si , ~t = d~i so we can obtain a variant of the
algorithm (3) as shown in (6).
   In terms of attention mechanism [17; 14; 9], the decision tree is in the following form:
                            X
                    T (x) =       δ((D−1 S~t)i − 1)li .val = Attention(~t, S, l)          (6)
                               i

where the weight assigned to each value is computed by the following function
                                               (
                                                 0 (D−1 S~t)i 6= 1
                      wi = δ((D−1 S~t)i − 1) =                     .
                                                 1 otherwise

                                                  10
Algorithm 6 SignQuickScorer
      Input:
        • a feature vector x
        • a binary decision tree T
    Output: tree traversal output value
 1: procedure SignQuickScorer(x, T )
 2:    Find the test vector ~t given x, T                                          . ~t defined in 5.
       Compute                            ~ ~                   ~
 3:
              P the result vector: ~v ← S t − d.           . S, d defined in (7), (8), respectively.
 4:    return i δ(~vi )li .val

   In fact, it is the hard attention. And we can output a soft attention distribution by
applying the scaled dot-product scoring function instead of the δ function as below
                                           exp((D−1 S~t)i )
                                     wi = P                    .
                                                     −1 S~
                                            i exp((D     t)i )
Similarly, we can discover that the soft decision tree is the soft attention.
    The Boolean expression may occur in different path so that we can rewrite the signed
test vector ~s as a selection of the raw Boolean expression ~s = M~e, where the matrix M
describe the contribution of each Boolean expression and the importance of each attribute.
    It is a general consensus that decision trees are of interpretability. Interpretablity is
partially dependent on the model complexity. For some model sparsity is a proxy for inter-
pretability. However, interpretability is independent of the model complexity for decision
trees because of their explicit decision boundaries. Thus, the algorithm 5 provides an inter-
pretable explanation of hard attention.
    In [24; 23], we restricted the input vector in numerical data. The difficulty is to handle
the hybrid data. The method 5 is a general scheme on how to convert the rule-based methods
to the attention mechanism. And test vector ~t is the feature vector in {+1, −1}d extracted
from the raw feature space and we only need d + 1 patterns determined by the leaf vector
in {+1, 0, −1}d , where different patterns may share the same value.
    Each path in decision trees consists of Boolean interactions of features [? ]. The lack
of feature transformation in (3), (4), (5) may restrict their predictability and stability. We
can discover that the most expensive computation of (3), (4), (5) and (6) is maximum inner
product search, which belong to the nearest neighbor search problem.
    The binary decision tree traversal is a router/gate based on Boolean tests, which select
a destination leaf node. For example, the centroid of the i-th leaf node is to minimize
                                      X
                                          δ(~vi − 1)d(x, ci )
                                       x∈X

and the value of the i-th leaf node is to minimize
                                   X
                                          δ(~vi − 1)d(y, li .val)
                                 (x,y)∈X ×Y

where vi is dependent of x as computed in algorithm (5); d(·, ·) is the quasi-distance function.
Ensemble and mixture essence occurs in the decision tree. We would like find the prototype
sample of each leaf node and convert the decision problem into nearest neighbor search
directly while it is beyond our scope here.

                                                 11
4    Discussion
We clarify some properties of the bit-matrix of decision tree in QuickScorer. And we intro-
duce the emergent recursive patterns in the bit-matrix and some equivalent algorithms of
QuickScorer, which is to evaluate the binary decision tree in the language of matrix compu-
tation. Our main contribution is to propose some novel methods to perform decision tree
traversal from different perspectives.

References
 [1] Erin L Allwein, Robert E Schapire, and Yoram Singer. Reducing multiclass to binary:
     a unifying approach for margin classifiers. Journal of Machine Learning Research,
     1(2):113–141, 2001.
 [2] Domenico Dato, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele
     Perego, Nicola Tonellotto, and Rossano Venturini. Fast ranking with additive ensembles
     of oblivious and non-oblivious regression trees. ACM Transactions on Information
     Systems (TOIS), 35(2):1–31, 2016.
 [3] Thomas G Dietterich and Ghulum Bakiri. Solving multiclass learning problems via
     error-correcting output codes. Journal of Artificial Intelligence Research, 2(1):263–286,
     1994.

 [4] Sergio Escalera, Oriol Pujol, and Petia Radeva. Decoding of ternary error correcting
     output codes. pages 753–763, 2006.
 [5] Escalerasergio, Pujoloriol, and Radevapetia. On the decoding process in ternary error-
     correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelli-
     gence, 2010.
 [6] Kun Gai, Xiaoqiang Zhu, Han Li, Kai Liu, and Zhe Wang. Learning piece-wise linear
     models from large scale data for ad click prediction. arXiv: Machine Learning, 2017.
 [7] Jiun Tian Hoe, Kam Woh Ng, Tianyu Zhang, Chee Seng Chan, Yi-Zhe Song, and
     Tao Xiang. One loss for all: Deep hashing with a single cosine similarity based learn-
     ing objective. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wort-
     man Vaughan, editors, Advances in Neural Information Processing Systems, volume 34,
     pages 24286–24298. Curran Associates, Inc., 2021.
 [8] Rong Kang, Yue Cao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. Maximum-
     margin hamming hashing. 2019 IEEE/CVF International Conference on Computer
     Vision (ICCV), pages 8251–8260, 2019.
 [9] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast
     autoregressive transformers with linear attention. In Proceedings of the International
     Conference on Machine Learning (ICML), 2020.

[10] Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias
     and variance. In In Proceedings of the Twelfth International Conference on Machine
     Learning, pages 313–321. Morgan Kaufmann, 1995.

                                             12
[11] Francesco Lettich, Claudio Lucchese, Franco Nardini, Salvatore Orlando, Raffaele
     Perego, Nicola Tonellotto, and Rossano Venturini. GPU-based parallelization of
     QuickScorer to speed-up document ranking with tree ensembles. In 7th Italian In-
     formation Retrieval Workshop, IIR 2016, volume 1653. CEUR-WS, 2016.
[12] Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola
     Tonellotto, and Rossano Venturini. QuickScorer: A fast algorithm to rank documents
     with additive ensembles of regression trees. In Proceedings of the 38th International
     ACM SIGIR Conference on Research and Development in Information Retrieval, pages
     73–82, 2015.
[13] Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola
     Tonellotto, and Rossano Venturini. QuickScorer: Efficient traversal of large ensembles
     of decision trees. In Joint European Conference on Machine Learning and Knowledge
     Discovery in Databases, pages 383–387. Springer, 2017.
[14] Hideya Mino, Masao Utiyama, Eiichiro Sumita, and Takenobu Tokunaga. Key-value
     attention mechanism for neural machine translation. In Proceedings of the Eighth Inter-
     national Joint Conference on Natural Language Processing (Volume 2: Short Papers),
     pages 290–295, Taipei, Taiwan, November 2017. Asian Federation of Natural Language
     Processing.
[15] Romina Molina, Fernando Loor, Veronica Gil-Costa, Franco Maria Nardini, Raffaele
     Perego, and Salvatore Trani. Efficient traversal of decision tree ensembles with FPGAs.
     Journal of Parallel and Distributed Computing, 155:38–49, 2021.
[16] Shitoumu. On quick scorer predication of the additive ensembles of regression trees
     model. Website. Accessed March 22, 2022.
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
     Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon,
     U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed-
     itors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran
     Associates, Inc., 2017.
[18] Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk,
     Sarah Adel Bargal, and Joseph E. Gonzalez. NBDT: Neural-backed decision trees,
     2020.
[19] Mike Wu, Michael C. Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, and Finale
     Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In
     Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirti-
     eth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Sym-
     posium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18.
     AAAI Press, 2018.
[20] Mike Wu, Sonali Parbhoo, Michael C. Hughes, Volker Roth, and Finale Doshi-Velez.
     Optimizing for interpretability in deep neural networks with tree regularization. J.
     Artif. Int. Res., 72:1–37, jan 2022.

                                             13
[21] Ting Ye, Hucheng Zhou, Will Y Zou, Bin Gao, and Ruofei Zhang. RapidScorer: fast
     tree ensemble evaluation by maximizing compactness in data level parallelization. In
     Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Dis-
     covery & Data Mining, pages 941–950, 2018.
[22] Jaemin Yoo and Lee Sael. Transition matrix representation of trees with transposed
     convolutions. In Proceedings of the 2022 SIAM International Conference on Data Min-
     ing (SDM). SIAM, 2022.
[23] Jinxiong Zhang. Decision machines: Interpreting decision tree as a model combination
     method. ArXiv, abs/2101.11347, 2021.

[24] Jinxiong Zhang. Yet another representation of binary decision trees: A mathematical
     demonstration. ArXiv, abs/2101.07077, 2021.

                                           14
You can also read