Meta-Model Structure Selection: Building Polynomial NARX Model for Regression and Classification

 
                                         Wilson Rocha Lacerda Junior · Samir Angelo Milani Martins · Erivelton Geraldo
                                         Nepomuceno
arXiv:2109.09917v1 [cs.LG] 21 Sep 2021


Abstract This work presents a new meta-heuristic approach to select the structure of polynomial NARX models for regression and classification problems. The method takes into account the complexity of the model and the contribution of each term to build parsimonious models by proposing a new cost function formulation. The robustness of the new algorithm is tested on several simulated and experimental systems with different nonlinear characteristics. The obtained results show that the proposed algorithm is capable of identifying the correct model, for cases where the proper model structure is known, and of determining parsimonious models for experimental data, even for those systems for which traditional and contemporary methods habitually fail. The new algorithm is validated against classical methods such as the FROLS and recent randomized approaches.

Keywords System Identification · Regression and Classification · NARX Model · Meta-heuristic · Model Structure Selection

Wilson Rocha Lacerda Junior
Control and Modelling Group (GCOM), Department of Electrical Engineering, Federal University of São João del-Rei, Minas Gerais, Brazil
E-mail: wilsonrljr@outlook.com

Samir Angelo Milani Martins
Control and Modelling Group (GCOM), Department of Electrical Engineering, Federal University of São João del-Rei, Minas Gerais, Brazil
E-mail: martins@ufsj.edu.br

Erivelton Geraldo Nepomuceno
Control and Modelling Group (GCOM), Department of Electrical Engineering, Federal University of São João del-Rei, Minas Gerais, Brazil
E-mail: nepomuceno@ufsj.edu.br

1 Introduction

System identification is a method of identifying the dynamic model of a system from measurements of the system inputs and outputs [1]. In particular, nonlinear system identification has received much attention from researchers since the 1950s, and many relevant results have been developed [2,3,4,5]. In this context, one frequently employed model representation is the NARMAX (Non-linear Autoregressive Models with Moving Average and Exogenous Input), which was introduced in 1981 aiming at representing a broad class of nonlinear systems [6,7,8].

There are many NARMAX model set representations, such as polynomial, generalized additive, and neural networks. Among these types of the extended model set, the power-form polynomial is the most commonly used NARMAX representation [1]. Fitting polynomial NARMAX models is a simple task if the terms in the model are known a priori, which is not the case in real-world problems. Selecting the model terms, however, is fundamental if the goal of the identification is to obtain models that can reproduce the dynamics of the original system. Problems related to overparameterization and numerical ill-conditioning are typical because of the limitations of the identification algorithms in selecting the appropriate terms that should compose the final model [9,10].

In that respect, one of the most traditional algorithms for structure selection of polynomial NARMAX models was developed by [11] based on the Orthogonal Least Squares (OLS) and the Error Reduction Ratio (ERR), called Forward Regression Orthogonal Least Squares (FROLS). Numerous variants of the FROLS algorithm have been developed to improve the model selection performance, such as [12,13,14,15]. The drawbacks of the
FROLS have been extensively reviewed in the literature, e.g., in [16,17,18]. Most of these weak points are related to i) the Prediction Error Minimization (PEM) framework; ii) the inadequacy of the ERR index in measuring the absolute importance of regressors; iii) the use of information criteria such as the Akaike Information Criterion (AIC) [19], the Final Prediction Error (FPE) [20] and the Bayesian Information Criterion (BIC) [21] to select the model order. Regarding the information criteria, although these techniques work well for linear models, in a nonlinear context no simple relation between model size and accuracy can be established [18,22].

As a consequence of the limitations of OLS-based algorithms, some recent research endeavors have significantly strayed from the classical FROLS scheme by reformulating the Model Structure Selection (MSS) process in a probabilistic framework and using random sampling methods [18,23,24,25,26]. Nevertheless, these techniques based on meta-heuristics and probabilistic frameworks present some flaws. The meta-heuristic approaches rely on AIC, FPE, BIC and other information criteria to formulate the cost function of the optimization problem, generally resulting in over-parameterized models.

Last but not least, due to the importance of classification techniques for decision-making tasks in engineering, business, health science, and many other fields, it is surprising that only a few researchers have addressed this problem using classical regression techniques. The authors in [27] presented a novel algorithm that combines logistic regression with the NARX methodology to deal with systems with a dichotomous response variable. The results in that work, although very interesting, are based on the FROLS algorithm and, therefore, inherit most of the drawbacks of the traditional technique, opening new paths for research.

This work proposes a technique for the identification of nonlinear systems using meta-heuristics that fills the mentioned gaps concerning the structure selection of NARMAX models for regression and classification. The method uses an alternative to the cited information criteria as the index indicating the accuracy of the model as a function of the size of the model. Finally, the proposed algorithm is adapted to deal with classification problems to represent systems with binary outputs that depend on continuous time predictors.

The remainder of this work is organized as follows: Section 2 provides the basic framework and notation for nonlinear system identification of NARX models. Section 3 presents the necessary tools to formulate the cost function of the identification strategy. This section also introduces the new algorithm and reports the results obtained on several systems taken from the literature and physical systems. Section 4 adapts the technique to develop NARX models considering systems with binary responses that depend on continuous predictors. Section 5 recaps the primary considerations of this study and proposes possible future works.

2 Background

2.1 Polynomial NARX model

The polynomial Multiple-Input Multiple-Output (MIMO) NARX is a mathematical model based on difference equations that relates the current output to past inputs and outputs, mathematically described as [12,7]:

$$y_{i,k} = F_i^{\ell}\Big[ y_{1,k-1}, \dots, y_{1,k-n^i_{y_1}}, \dots, y_{s,k-1}, \dots, y_{s,k-n^i_{y_s}}, x_{1,k-d}, x_{1,k-d-1}, \dots, x_{1,k-d-n^i_{x_1}}, \dots, x_{r,k-d}, x_{r,k-d-1}, \dots, x_{r,k-d-n^i_{x_r}} \Big] + \xi_{i,k}, \tag{1}$$

where $n_y \in \mathbb{N}^*$, $n_x \in \mathbb{N}$ are the maximum lags for the system output and input, respectively; $x_k \in \mathbb{R}^{n_x}$ is the system input and $y_k \in \mathbb{R}^{n_y}$ is the system output at discrete time $k \in \mathbb{N}^n$; $e_k \in \mathbb{R}^{n_e}$ stands for uncertainties and possible noise at discrete time $k$. In this case, $F^{\ell}$ is some nonlinear function of the input and output regressors with nonlinearity degree $\ell \in \mathbb{N}$, and $d$ is a time delay typically set to $d = 1$.

The number of possible terms of the MIMO NARX model given the $i$th polynomial degree, $\ell_i$, is:

$$n_{m_r} = \sum_{j=0}^{\ell_i} n_{ij}, \tag{2}$$

where

$$n_{ij} = \frac{n_{i,j-1}\left[\sum_{k=1}^{s} n^i_{y_k} + \sum_{k=1}^{r} n^i_{x_k} + j - 1\right]}{j}, \qquad n_{i0} = 1, \quad j = 1, \dots, \ell_i. \tag{3}$$

Parsimony makes the polynomial NARX models a widely known model family. This characteristic means that a wide range of behaviors can be represented concisely using only a few terms of the vast search space formed by candidate regressors, and usually only a small data set is required to estimate a model.

2.2 Importance of Structure Selection

Identifying the correct structure is fundamental to allow the user to analyze the system dynamics
consistently. The selection of regressors, however, is not a simple task. If $\ell$, $n_x$, and $n_y$ increase, the number of candidate models becomes too large for a brute-force approach. In the MIMO case, this problem is far worse than in the Single-Input Single-Output (SISO) one if many inputs and outputs are required. The total number of all different models is given by

$$n_m = \begin{cases} 2^{n_r} & \text{for SISO models,} \\ 2^{n_{m_r}} & \text{for MIMO models,} \end{cases} \tag{4}$$

where $n_r$ and $n_{m_r}$ are the values computed using Eqs. (2) and (3).

A classical solution to the regressor selection problem is the FROLS algorithm associated with the ERR test. The FROLS method transforms the set of regressors in the search space into a set of orthogonal vectors, for which the ERR evaluates the individual contribution to the desired output variance by calculating the normalized energy coefficient $C(x, y)$ between two vectors, defined as:

$$C(x, y) = \frac{(x^\top y)^2}{(x^\top x)(y^\top y)}. \tag{5}$$

An approach often used is to stop the algorithm using some information criterion, e.g., AIC [19].

2.3 The Sigmoid Linear Unit Function

Definition 1 (Sigmoidal function) Let $F$ represent a class of bounded functions $\varphi : \mathbb{R} \mapsto \mathbb{R}$. If $\varphi(x)$ satisfies

$$\lim_{x \to \infty} \varphi(x) = \alpha, \qquad \lim_{x \to -\infty} \varphi(x) = \beta, \qquad \text{with } \alpha > \beta,$$

the function is called sigmoidal.

In this particular case, and following Definition 1 with $\alpha = 1$ and $\beta = 0$, we write an "S"-shaped curve as

$$\varsigma(x) = \frac{1}{1 + e^{-a(x - c)}}. \tag{6}$$

In that case, we can specify $a$, the rate of change. If $a$ is close to zero, the sigmoid function will be gradual. If $a$ is large, the sigmoid function will have an abrupt or sharp transition. If $a$ is negative, the sigmoid will go from 1 to zero. The parameter $c$ corresponds to the $x$ value where $y = 0.5$.

The Sigmoid Linear Unit Function (SiLU) is defined by the sigmoid function multiplied by its input,

$$\mathrm{silu}(x) = x\,\varsigma(x), \tag{7}$$

which can be viewed as a steeper sigmoid function with overshoot.

2.4 Meta-heuristics

In general, nature-inspired optimization algorithms have become increasingly widespread over the last two decades due to their flexibility, simplicity, versatility, and local optima avoidance in real-life applications.

Two essential characteristics of meta-heuristic algorithms are exploitation and exploration [28]. Exploitation is related to the local information in the search process regarding the best nearby solution. Exploration, on the other hand, is related to exploring a vast area of the search space to find an even better solution and not get stuck in local optima. [29] shows that there is no consensus about the notion of exploration and exploitation in evolutionary computing, and the definitions are not generally accepted. However, there is general agreement that they work like opposite forces and are usually hard to balance. In this sense, a combination of two metaheuristics, called a hybrid metaheuristic, can be used to provide a more robust algorithm.

2.4.1 The Binary hybrid Particle Swarm Optimization and Gravitational Search Algorithm (BPSOGSA) algorithm

As can be observed in most meta-heuristic algorithms, achieving a good balance between the exploration and exploitation phases is a challenging task. In this paper, to provide more powerful performance by assuring higher flexibility in the search process, a BPSOGSA hybridized using a low-level co-evolutionary heterogeneous technique [30], proposed by [31], is used. The main concept of the BPSOGSA is to combine the high capability of the particles in Particle Swarm Optimization (PSO) to scan the whole search space for the best global solution with the ability of the Gravitational Search Algorithm (GSA) to look over local solutions in a binary space.

2.4.2 Standard PSO algorithm

In PSO [32,33], each particle represents a candidate solution and consists of two parts: the location in the search space, $\vec{x}_{np,d} \in \mathbb{R}^{np \times d}$, and the respective velocity, $\vec{v}_{np,d} \in \mathbb{R}^{np \times d}$, where $np = 1, 2, \cdots, n_a$, $n_a$ is the size of the swarm, and $d$ is the dimension of the problem. In this respect, the following equation represents the initial population:

$$\vec{x}_{np,d} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n_a,1} & x_{n_a,2} & \cdots & x_{n_a,d} \end{bmatrix} \tag{8}$$
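As a minimal sketch of the population layout in Eq. (8), the snippet below builds an $n_a \times d$ array of positions together with a matching array of velocities. The function name `init_swarm`, the sampling bounds, the seed, and the choice of zero initial velocities are illustrative assumptions; the text only fixes the shape of the population.

```python
import random


def init_swarm(n_agents, dim, lower=-1.0, upper=1.0, seed=0):
    """Return (positions, velocities) as n_agents x dim nested lists.

    Assumption: positions are sampled uniformly in [lower, upper] and
    velocities start at zero; the text only specifies the n_a x d layout.
    """
    rng = random.Random(seed)
    positions = [[rng.uniform(lower, upper) for _ in range(dim)]
                 for _ in range(n_agents)]
    velocities = [[0.0] * dim for _ in range(n_agents)]
    return positions, velocities


# Each row is one particle, each column one dimension of the problem.
positions, velocities = init_swarm(n_agents=4, dim=3)
```

Storing one particle per row keeps the indexing aligned with $x_{np,d}$ in Eq. (8): `positions[np][d]` is the coordinate of particle `np` along dimension `d`.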
At each iteration, $t$, the position and velocity of a particle are updated according to

$$v^{t+1}_{np,d} = \zeta v^{t}_{np,d} + c_1 \kappa_1 \left(pbest^{t}_{np} - x^{t}_{np,d}\right) + c_2 \kappa_2 \left(gbest^{t}_{np} - x^{t}_{np,d}\right), \tag{9}$$

where $\kappa_j \in \mathbb{R}$, for $j = 1, 2$, are real-valued, continuous random variables in the interval $[0, 1]$, $\zeta \in \mathbb{R}$ is an inertia factor that controls the influence of the previous velocity on the current one (also representing a trade-off between exploration and exploitation), $c_1$ is the cognitive factor related to $pbest$ (best particle) and $c_2$ is the social factor related to $gbest$ (global solution). The values of the velocity, $\vec{v}_{np,d}$, are usually bounded in the range $[v_{min}, v_{max}]$ to guarantee that the randomness of the system does not lead to particles rushing out of the search space. The positions are updated in the search space according to

$$x^{t+1}_{np,d} = x^{t}_{np,d} + v^{t+1}_{np,d}. \tag{10}$$

2.4.3 Standard GSA algorithm

In GSA [34], the agents are measured by their masses, which are proportional to their respective values of the fitness function. These agents share information related to their gravitational force in order to attract each other to locations closer to the global optimum. The larger the mass, the better the solution it represents, and heavier agents move more slowly than lighter ones. In GSA, each mass (agent) has four specifications: position, inertial mass, active gravitational mass, and passive gravitational mass. The position of the mass corresponds to a solution to the problem, and its gravitational and inertial masses are determined using a fitness function.

Consider a population formed by agents described in Eq. (8). At a specific time $t$, the velocity and position of each agent are updated, respectively, as follows:

$$v^{t+1}_{i,d} = \kappa_i \times v^{t}_{i,d} + a^{t}_{i,d}, \qquad x^{t+1}_{i,d} = x^{t}_{i,d} + v^{t+1}_{i,d}, \tag{11}$$

where $\kappa_i$ gives a stochastic characteristic to the search. The acceleration, $a^{t}_{i,d}$, is computed according to the law of motion [34]:

$$a^{t}_{i,d} = \frac{F^{t}_{i,d}}{M^{t}_{ii}}, \tag{12}$$

where $t$ is a specific time, $M_{ii}$ is the inertial mass of object $i$ and $F_{i,d}$ is the gravitational force acting on mass $i$ in a $d$-dimensional space. The detailed process to calculate and update both $F_{i,d}$ and $M_{ii}$ can be found in [34].

2.4.4 The binary hybrid optimization algorithm

The algorithms are combined according to [31]:

$$v^{t+1}_{i} = \zeta \times v^{t}_{i} + c'_1 \times \kappa \times a^{t}_{i} + c'_2 \times \kappa \times \left(gbest - x^{t}_{i}\right), \tag{13}$$

where $c'_j \in \mathbb{R}$ is an acceleration coefficient. Eq. (13) has the advantage of accelerating the exploitation phase by saving and using the location of the best mass found so far. However, because this method can affect the exploration phase as well, [35] proposed a solution to this issue by setting adaptive values for $c'_j$, described by [36]:

$$c'_1 = -2 \times \frac{t^3}{\max(t)^3} + 2, \tag{14}$$

$$c'_2 = 2 \times \frac{t^3}{\max(t)^3} + 2. \tag{15}$$

In each iteration, the positions of particles are updated as stated in Eq. (10) to Eq. (11).

To avoid convergence to a local optimum when mapping the continuous space to discrete solutions, the following transfer function is used [37]:

$$S(v_{ik}) = \frac{2}{\pi} \arctan\left(\frac{\pi}{2} v_{ik}\right). \tag{17}$$

Considering a uniformly distributed random number $\kappa \in (0, 1)$, the positions of the agents in the binary space are updated according to

$$x^{t+1}_{np,d} = X(m, n) = \begin{cases} \left(x^{t}_{np,d}\right)^{-1}, & \text{if } \kappa < S\left(v^{t+1}_{ik}\right), \\ x^{t}_{np,d}, & \text{if } \kappa \geq S\left(v^{t+1}_{ik}\right). \end{cases} \tag{18}$$

3 Meta-Model Structure Selection (Meta-MSS): Building NARX for Regression

This section addresses the use of a meta-heuristic method to select the NARX model structure. The BPSOGSA is implemented to search for the best model structure in a decision space formed by a predefined dictionary of regressors. The objective function of the optimization problem is based on the root mean squared error of the free run simulation output multiplied by a penalty factor that takes into account the complexity and the individual contribution of each regressor to build the final model.
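The binary hybrid update of Section 2.4.4, on which this search relies, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the two random factors in Eq. (13) are drawn independently, the absolute value in the transfer function is added so that $S(v)$ can be read as a flip probability, and $(x)^{-1}$ in Eq. (18) is read as the bit complement.

```python
import math
import random


def adaptive_coefficients(t, t_max):
    """Eqs. (14)-(15) as printed: c1' goes from 2 to 0, c2' from 2 to 4."""
    ratio = t ** 3 / t_max ** 3
    return -2.0 * ratio + 2.0, 2.0 * ratio + 2.0


def hybrid_velocity(v, a, x, gbest, zeta, c1, c2, rng):
    """Eq. (13) for one agent; v, a, x and gbest are equal-length lists.

    Assumption: each kappa is an independent draw from [0, 1).
    """
    return [zeta * vi + c1 * rng.random() * ai + c2 * rng.random() * (gi - xi)
            for vi, ai, xi, gi in zip(v, a, x, gbest)]


def transfer(v):
    """Eq. (17); the absolute value is an assumption so that S(v) >= 0."""
    return abs((2.0 / math.pi) * math.atan((math.pi / 2.0) * v))


def binary_update(x, v, rng):
    """Eq. (18): flip a bit when kappa < S(v), otherwise keep it."""
    return [1 - xi if rng.random() < transfer(vi) else xi
            for xi, vi in zip(x, v)]


rng = random.Random(0)
c1, c2 = adaptive_coefficients(t=5, t_max=10)
v_new = hybrid_velocity([0.5, -0.2], [0.1, 0.0], [1, 0], [1, 1],
                        0.9, c1, c2, rng)
x_new = binary_update([1, 0], v_new, rng)
```

With this reading, a velocity close to zero leaves a bit almost surely unchanged, while a large velocity of either sign makes a flip likely, so the velocity magnitude acts as a measure of how strongly the agent "wants" to change the current regressor choice.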
3.1 Encoding scheme

This section describes the use of the BPSOGSA for model structure selection. First, one should define the dimension of the test function. In this regard, $n_y$, $n_x$ and $\ell$ are set to generate all possible regressors, and a general matrix of regressors, $\Psi$, is built. The number of columns of $\Psi$ is assigned to the variable noV, and the number of agents, $N$, is defined. Then a binary noV $\times$ $N$ matrix, referred to as $X$, is randomly generated with the position of each agent in the search space. Each column of $X$ represents a possible solution; in other words, a possible model structure to be evaluated at each iteration. Since each column of $\Psi$ corresponds to a possible regressor, a value of 1 in $X$ indicates that, in its respective position, the column of $\Psi$ is included in the reduced matrix of regressors, while a value of 0 indicates that the regressor column is ignored.

Example 1 Consider a case where all possible regressors are defined based on $\ell = 1$ and $n_y = n_u = 2$. Then $\Psi$ is defined by

$$[\text{constant} \quad y(k-1) \quad y(k-2) \quad u(k-1) \quad u(k-2)] \tag{19}$$

Because there are 5 possible regressors, noV = 5. Assume $N = 5$; then $X$ can be represented, for example, as

$$X = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \end{bmatrix} \tag{20}$$

The first column of $X$ is transposed and used to generate a candidate solution:

$$X = \begin{bmatrix} \text{constant} & y(k-1) & y(k-2) & u(k-1) & u(k-2) \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$

Hence, in this example, the first model to be tested is $\alpha y(k-1) + \beta u(k-2)$, where $\alpha$ and $\beta$ are parameters estimated via the Least Squares method. After that, the second column of $X$ is tested, and so on.

3.2 Formulation of the objective function

For each candidate model structure randomly defined, the linear-in-the-parameters system can be solved directly using the Least Squares algorithm. The variance of the estimated parameters can be calculated as:

$$\hat{\sigma}^2 = \hat{\sigma}_e^2 V_{jj}, \tag{21}$$

where $\hat{\sigma}_e^2$ is the estimated noise variance calculated as

$$\hat{\sigma}_e^2 = \frac{1}{N - m} \sum_{k=1}^{N} \left(y_k - \psi_{k-1}^\top \hat{\Theta}\right)^2 \tag{22}$$

and $V_{jj}$ is the $j$th diagonal element of $(\Psi^\top \Psi)^{-1}$.

The estimated standard error of the $j$th regression coefficient $\hat{\Theta}_j$ is the positive square root of the diagonal elements of $\hat{\sigma}^2$,

$$\mathrm{se}(\hat{\Theta}_j) = \sqrt{\hat{\sigma}^2_{jj}}. \tag{23}$$

A penalty test considers the standard error of the regression coefficients to determine the statistical relevance of each regressor. The t-test is used in this study to perform a hypothesis test on the coefficients to check the significance of individual regressors in the multiple linear regression model. The hypothesis statements involve testing the null hypothesis described as:

$$H_0 : \Theta_j = 0, \qquad H_a : \Theta_j \neq 0.$$

In practice, one can compute a t-statistic as

$$T_0 = \frac{\hat{\Theta}_j}{\mathrm{se}(\hat{\Theta}_j)}, \tag{24}$$

which measures the number of standard deviations that $\hat{\Theta}_j$ is away from 0. More precisely, let

$$-t_{\alpha/2, N-m} < T < t_{\alpha/2, N-m}, \tag{25}$$

where $t_{\alpha/2, N-m}$ is the t value obtained considering $\alpha$ as the significance level and $N - m$ the degrees of freedom. Then, if $T_0$ does not lie in the acceptance region of Eq. (25), the null hypothesis, $H_0 : \Theta_j = 0$, is rejected and it is concluded that $\Theta_j$ is significant at $\alpha$. Otherwise, $\Theta_j$ is not significantly different from zero, and the null hypothesis $\Theta_j = 0$ cannot be rejected.

3.2.1 Penalty value based on the Derivative of the Sigmoid Linear Unit function

We propose a penalty value based on the derivative of Eq. (7), defined as:

$$\dot{\varsigma}(x(\varrho)) = \varsigma(x)\left[1 + (a(x - c))(1 - \varsigma(x))\right]. \tag{26}$$

In this respect, the parameters of Eq. (26) are defined as follows: $x$ has the dimension of noV; $c$ = noV/2; and $a$ is defined by the number of regressors of the current test model divided by $c$. This approach results in a different curve for each model, considering the number of regressors of the current model. As the number of

regressors increases, the slope of the sigmoid curve becomes steeper. The penalty value, ϱ, corresponds to the value in y of the corresponding sigmoid curve regarding the number of regressors in x. It is imperative to point out that, because the derivative of the sigmoid function returns negative values, we normalize ς as

$$\varrho = \varsigma - \min(\varsigma), \tag{27}$$

so ϱ ∈ R+.
    However, two different models can have the same number of regressors and present significantly different results. This situation can be explained based on the importance of each regressor in the composition of the model. In this respect, we use the Student's t-test to determine the statistical relevance of each regressor and introduce this information in the penalty function. In each case, the procedure returns the number of regressors that are not significant for the model, which we call nΘ,H0. Then, the penalty value is chosen considering the model size as

$$\text{model size} = n_{\Theta} + n_{\Theta,H_0}. \tag{28}$$

   The objective function considers the relative root squared error of the model and ϱ and is defined as

$$F = \frac{\sqrt{\sum_{k=1}^{n} (y_k - \hat{y}_k)^2}}{\sqrt{\sum_{k=1}^{n} (y_k - \bar{y})^2}} \times \varrho. \tag{29}$$

    With this approach, even if the tested models have the same number of regressors, the models which contain redundant regressors are penalized with a more substantial penalty value.

    Finally, Algorithm 1 summarizes the method.

3.3 Case Studies: Simulation Results

In this section, six simulation examples are considered to illustrate the effectiveness of the Meta-MSS algorithm. An analysis of the algorithm performance has been carried out considering different tuning parameters. The selected systems are generally used as benchmarks for model structure selection algorithms and were taken from [38,18,24,10,14,39,40]. Finally, a comparative analysis with respect to the Randomized Model Structure Selection (RaMSS) [18], the FROLS [1], and the Reversible-jump Markov chain Monte Carlo (RJMCMC) [24] algorithms has been carried out to assess the quality of the proposed method.

Algorithm 1: Meta-structure selection (Meta-MSS) algorithm
    Result: Model which has the best fitness value
    Input: {(uk), (yk), k = 1, . . . , N}, M = {ψj, j = 1, . . . , m}, ny, nu, ℓ, max iteration, noV, np
 1  P ← Build initial population of random agents in the search space, S
 2  v ← set the agent's velocity equal to zero at the first iteration
 3  Ψ ← Build the general matrix of regressors based on ny, nu and ℓ
 4  repeat
 5      for i = 1 : d do
 6          $m_i \leftarrow \vec{x}_{np,i}$    ⊲ Extract the model encoding from population
 7          $\Psi_r \leftarrow \Psi(m_i)$    ⊲ Delete the Ψ columns where mi = 0    Ex.1
 8          $\hat{\Theta} \leftarrow (\Psi_r^{\top} \Psi_r)^{-1} \Psi_r^{\top} y$
 9          ŷ ← Free-run simulation of the model
10          $V \leftarrow (\Psi^{\top} \Psi)^{-1}$
11          $\hat{\sigma}_e^2 = \frac{1}{N-m} \sum_{k=1}^{N} (y_k - \psi_{k-1}^{\top} \hat{\Theta})^2$    ⊲ Eq.22
12          for h = 1 : τ do
13              $\hat{\sigma}^2 \leftarrow \hat{\sigma}_e^2 V_{h,h}$    ⊲ Eq.21
14              $se(\hat{\Theta}_j) \leftarrow \sqrt{\hat{\sigma}^2_{h,h}}$    ⊲ Eq.23
15              $T_0 \leftarrow \hat{\Theta}_j / se(\hat{\Theta})$    ⊲ Eq.24
16              p ← regressors where $-t_{\alpha/2,N-m} < T_0 < t_{\alpha/2,N-m}$    ⊲ Eq.25
17          end
18          Remove the p regressors from Ψr
19          Check for empty model
20          if Model is empty then
21              Generate a new population
22              Repeat the steps from line 6 to 18
23          end
24          n1 ← size(p)    ⊲ Number of redundant terms
25          $\hat{\Theta} \leftarrow (\Psi_r^{\top} \Psi_r)^{-1} \Psi_r^{\top} y$    ⊲ Re-estimation
26          $F_i \leftarrow \frac{\sqrt{\sum_{k=1}^{n} (y_k - \hat{y}_k)^2}}{\sqrt{\sum_{k=1}^{n} (y_k - \bar{y})^2}} \times \varrho$    ⊲ Eq.29
27          $P_i^n$ ← Encoded Ψr
28          Evaluate the fitness for each agent, Fi(t)
29      end
30      $P \leftarrow P^n$    ⊲ Update the population
31      foreach $\vec{x}_{np,d} \in P$ do
32          Calculate the acceleration of each agent    ⊲ Eq.12
33          Adapt the c′j coefficients    ⊲ Eq.16
34          Update the velocity of the agents    ⊲ Eq.13
35          Update the position of the agents    ⊲ Eq.11
36      end
37  until max iterations is reached
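The objective function of Section 3.2 (Eqs. 26 to 29, step 26 of the Meta-MSS algorithm) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions that the text does not fix: Eq. (7) is taken to be the standard logistic function, x is taken to be the grid 1, ..., noV, the normalization of Eq. (27) is applied to the derivative values of Eq. (26), and the penalty is read from the curve at x = model size. The function names are hypothetical.

```python
import math


def sigmoid(z):
    """Standard logistic function (assumed form of Eq. 7)."""
    return 1.0 / (1.0 + math.exp(-z))


def penalty_curve(n_regressors, no_v):
    """Normalized penalty values over x = 1, ..., noV (Eqs. 26-27)."""
    c = no_v / 2.0          # c = noV / 2
    a = n_regressors / c    # a = number of regressors / c
    raw = []
    for x in range(1, no_v + 1):  # assumption: x is the integer grid 1..noV
        s = sigmoid(a * (x - c))
        raw.append(s * (1.0 + a * (x - c) * (1.0 - s)))  # Eq. (26)
    lowest = min(raw)
    return [value - lowest for value in raw]  # Eq. (27): shift to >= 0


def rrse(y, y_hat):
    """Relative root squared error, the first factor of Eq. (29)."""
    y_bar = sum(y) / len(y)
    num = math.sqrt(sum((yk - yh) ** 2 for yk, yh in zip(y, y_hat)))
    den = math.sqrt(sum((yk - y_bar) ** 2 for yk in y))
    return num / den


def fitness(y, y_hat, n_theta, n_theta_h0, no_v):
    """Eq. (29): error multiplied by the penalty at the model size."""
    model_size = n_theta + n_theta_h0        # Eq. (28)
    curve = penalty_curve(n_theta, no_v)
    # Assumption: the penalty is the curve value at x = model size.
    rho = curve[min(model_size, no_v) - 1]
    return rrse(y, y_hat) * rho


y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
f_clean = fitness(y, y_hat, n_theta=4, n_theta_h0=0, no_v=10)
f_redundant = fitness(y, y_hat, n_theta=4, n_theta_h0=2, no_v=10)
print(f_clean, f_redundant)
```

In this toy case both candidates have four regressors and the same prediction error, but the one with two non-significant terms receives the larger fitness value, which is the behavior the penalty is meant to produce.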

   The simulation models are described as:

S1: $$y_k = -1.7y_{k-1} - 0.8y_{k-2} + x_{k-1} + 0.81x_{k-2} + e_k, \tag{30}$$
    with $x_k \sim U(-2, 2)$ and $e_k \sim N(0, 0.01^2)$;
S2: $$y_k = 0.8y_{k-1} + 0.4x_{k-1} + 0.4x_{k-1}^2 + 0.4x_{k-1}^3 + e_k, \tag{31}$$
    with $x_k \sim N(0, 0.3^2)$ and $e_k \sim N(0, 0.01^2)$;
S3: $$y_k = 0.2y_{k-1}^3 + 0.7y_{k-1}x_{k-1} + 0.6x_{k-2}^2 - 0.7y_{k-2}x_{k-2}^2 - 0.5y_{k-2} + e_k, \tag{32}$$
    with $x_k \sim U(-1, 1)$ and $e_k \sim N(0, 0.01^2)$;
S4: $$y_k = 0.7y_{k-1}x_{k-1} - 0.5y_{k-2} + 0.6x_{k-2}^2 - 0.7y_{k-2}x_{k-2}^2 + e_k, \tag{33}$$
    with $x_k \sim U(-1, 1)$ and $e_k \sim N(0, 0.04^2)$;
S5: $$y_k = 0.7y_{k-1}x_{k-1} - 0.5y_{k-2} + 0.6x_{k-2}^2 - 0.7y_{k-2}x_{k-2}^2 + 0.2e_{k-1} - 0.3x_{k-1}e_{k-2} + e_k, \tag{34}$$
    with $x_k \sim U(-1, 1)$ and $e_k \sim N(0, 0.02^2)$;
S6: $$y_k = 0.75y_{k-2} + 0.25x_{k-2} - 0.2y_{k-2}x_{k-2} + e_k$$
    with $x_k \sim N(0, 0.25^2)$ and $e_k \sim N(0, 0.02^2)$,

where U(a, b) are samples evenly distributed over [a, b], and N(η, σ²) are samples with a Gaussian distribution with mean η and standard deviation σ. All realizations of the systems are composed of a total of 500 input-output data samples. Also, the same random seed is used for reproducibility purposes.
    All tests have been performed in the Matlab® 2018a environment, on a Dell Inspiron 5448 Core i5-5200U CPU 2.20GHz with 12GB of RAM.
    Following the aforementioned studies, the maximum lags for the input and output are chosen to be, respectively, nu = ny = 4 and the nonlinear degree is ℓ = 3. The parameters related to the BPSOGSA are detailed in Table 1.

        Table 1: Parameters used in Meta-MSS

 Parameters   nu   ny   ℓ   p-value   max iter   n agents   α    G0
   Values      4    4   3    0.05        30         10      23   100

    300 runs of the Meta-MSS algorithm have been executed for each model, aiming to compare some statistics about the algorithm performance. The elapsed time, the time required to obtain the final model, and correctness, the percentage of exact model selections, are analyzed.
    The results in Table 2 are obtained with the parameters configured according to Table 1.

        Table 2: Overall performance of the Meta-MSS

                         S1      S2      S3      S4      S5      S6
 Correct model          100%    100%    100%    100%    100%    100%
 Elapsed time (mean)    5.16s   3.90s   3.40s   2.37s   1.40s   3.80s

    Table 2 shows that all the model terms are correctly selected using the Meta-MSS. It is worth noting that even the model S5, which has an autoregressive noise, was correctly selected using the proposed algorithm. This result stems from the evaluation of all regressors individually, with the ones considered redundant being removed from the model.
    Figure 1 presents the convergence of each execution of Meta-MSS. It is noticeable that the majority of executions converge to the correct model structures within 10 or fewer iterations. The reason for this lies in the maximum number of iterations and the number of search agents. The first one is related to the acceleration coefficient, which boosts the exploration phase of the algorithm, while the latter increases the number of candidate models to be evaluated. Intuitively, one can see that both parameters influence the elapsed time and, more importantly, the model structure selected to compose the final model. Consequently, an inappropriate choice of one of them may result in under- or over-parameterized models, since the algorithm can converge to a local optimum. The next subsection presents an analysis of the influence of max iter and n agents on the algorithm performance.

3.4 Meta-MSS vs RaMSS vs C-RaMSS

The systems S1, S2, S3, S4 and S6 have been used as benchmarks by [41], so we can directly compare our results with those reported by the author in his thesis. All techniques used ny = nu = 4 and ℓ = 3. The RaMSS and the RaMSS with Conditional Linear Family (C-RaMSS) used the following configuration for the tuning parameters: K = 1, α = 0.997, NP = 200 and v = 0.1. The Meta-Structure Selection Algorithm was tuned according to Table 1.
    In terms of correctness, the Meta-MSS outperforms (or at least equals) the RaMSS and C-RaMSS for all analyzed systems, as shown in Table 3. Regarding S6, the correctness rate is 18 percentage points higher than that of RaMSS with NP = 200 (100% against 82%), and the elapsed time required for C-RaMSS to reach 100% correctness (48.52 s) is 1276.84% of that required by the Meta-MSS (3.80 s), i.e., almost 13 times as long. Furthermore, the Meta-MSS is notably more computationally efficient than C-RaMSS and similar to RaMSS.

3.5 Meta-MSS vs FROLS

The FROLS algorithm has been tested on all the systems and the results are detailed in Table 5. It can

Fig. 1: The convergence of each execution of Meta-MSS algorithm. Panels: (1) System S1; (2) System S2; (3) System S3; (4) System S4; (5) System S5; (6) System S6.

Table 3: Comparative analysis between Meta-MSS, RaMSS, and C-RaMSS

 Algorithm           Metric                  S1        S2        S3        S4        S6
 Meta-MSS            Correct model           100%      100%      100%      100%      100%
                     Elapsed time (mean)     5.16s     3.90s     3.40s     2.37s     3.80s
 RaMSS (NP = 100)    Correct model           90.33%    100%      100%      100%      66%
                     Elapsed time (mean)     3.27s     1.24s     2.59s     1.67s     6.66s
 RaMSS (NP = 200)    Correct model           78.33%    100%      100%      100%      82%
                     Elapsed time (mean)     6.25s     2.07s     4.42s     2.77s     9.16s
 C-RaMSS             Correct model           93.33%    100%      100%      100%      100%
                     Elapsed time (mean)     18s       10.50s    16.96s    10.56s    48.52s
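
The S6 column of Table 3 lends itself to a quick arithmetic cross-check. The short Python sketch below only re-uses the numbers printed in the table; the variable names are illustrative. It separates the time ratio expressed as a percentage from the relative increase, and reports the correctness gain in percentage points.

```python
# Values copied from the S6 column of Table 3.
meta_mss_time_s = 3.80
c_ramss_time_s = 48.52
meta_mss_correct = 100.0
ramss_np100_correct = 66.0
ramss_np200_correct = 82.0

# Elapsed-time comparison between C-RaMSS and Meta-MSS.
time_ratio = c_ramss_time_s / meta_mss_time_s
percent_of_meta_mss = 100.0 * time_ratio         # C-RaMSS time as % of Meta-MSS time
percent_increase = 100.0 * (time_ratio - 1.0)    # how much longer, in %

# Correctness gains, in percentage points.
gain_vs_np100 = meta_mss_correct - ramss_np100_correct
gain_vs_np200 = meta_mss_correct - ramss_np200_correct

print(f"time ratio: {time_ratio:.2f}x")
print(f"C-RaMSS time as a percentage of Meta-MSS time: {percent_of_meta_mss:.2f}%")
print(f"relative increase: {percent_increase:.2f}%")
print(f"correctness gain vs RaMSS (NP = 100): {gain_vs_np100:.0f} points")
print(f"correctness gain vs RaMSS (NP = 200): {gain_vs_np200:.0f} points")
```

The ratio 48.52/3.80 is about 12.77, so the C-RaMSS time is 1276.84% of the Meta-MSS time, which corresponds to a relative increase of 1176.84%.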

be seen that only the model terms selected for S2 and S6 are correct using FROLS. The FROLS fails to select two out of four regressors for S1. Regarding S3, the term $y_{k-1}$ is included in the model instead of $y_{k-1}^3$. Similarly, the term $y_{k-4}$ is wrongly added in model S4 instead of $y_{k-2}$. Finally, an incorrect model structure is returned for S5 as well, with the addition of the spurious term $y_{k-4}$.

3.6 Meta-MSS vs RJMCMC

The system S4 is taken from [24]. Again, the maximum lags for the input and output are ny = nu = 4 and the nonlinear degree is ℓ = 3. In their work, the authors executed the algorithm 10 times on the same input-output data. The RJMCMC was able to select the true model structure 7 times out of the 10 runs. On the other hand, the Meta-MSS can get the true model in all runs of the algorithm. The results are summarized in Table 6. Besides, there are two main drawbacks related to the RJMCMC method which are overcome by the Meta-MSS: the former is computationally expensive and required an execution with 30,000 iterations. Furthermore, it assumes different probability distributions which are chosen to ease the computations for the parameters involved in the procedure.

3.7 Full-scale F-16 aircraft

The F-16 Ground Vibration Test has been used as a benchmark for system identification. The case exhibits clearance and friction nonlinearities at the mounting interface of the payloads. The empirical data were acquired on a full-scale F-16 aircraft during a Siemens LMS Ground Vibration Testing Master Class; the data, as well as a detailed formulation of the identification problem, are available at Nonlinear System Identification Benchmarks (http://www.nonlinearbenchmark.org/).
    Several datasets are available concerning different input signals and frequencies. This work considers the data recorded under multisine excitations with a full frequency grid from 2 to 15Hz. According to [42], at each force level, 9 periods were acquired considering a

single realization of the input signal. There are 8192 samples per period. Note that transients are present in the first period of measurement.

Table 5: Comparative analysis - Meta-MSS vs FROLS

                   Meta-MSS                         FROLS
 System   Regressor              Correct   Regressor              Correct
 S1       $y_{k-1}$              yes       $y_{k-1}$              yes
          $y_{k-2}$              yes       $y_{k-4}$              no
          $x_{k-1}$              yes       $x_{k-1}$              yes
          $x_{k-2}$              yes       $x_{k-4}$              no
 S2       $y_{k-1}$              yes       $y_{k-1}$              yes
          $x_{k-1}$              yes       $x_{k-1}$              yes
          $x_{k-1}^2$            yes       $x_{k-1}^2$            yes
          $x_{k-1}^3$            yes       $x_{k-1}^3$            yes
 S3       $y_{k-1}^3$            yes       $y_{k-1}$              no
          $y_{k-1}x_{k-1}$       yes       $y_{k-1}x_{k-1}$       yes
          $x_{k-2}^2$            yes       $x_{k-2}^2$            yes
          $y_{k-2}x_{k-2}^2$     yes       $y_{k-2}x_{k-2}^2$     yes
          $y_{k-2}$              yes       $y_{k-2}$              yes
 S4       $y_{k-1}x_{k-1}$       yes       $y_{k-1}x_{k-1}$       yes
          $y_{k-2}$              yes       $y_{k-4}$              no
          $x_{k-2}^2$            yes       $x_{k-2}^2$            yes
          $y_{k-2}x_{k-2}^2$     yes       $y_{k-2}x_{k-2}^2$     yes
 S5       $y_{k-1}x_{k-1}$       yes       $y_{k-1}x_{k-1}$       yes
          $y_{k-2}$              yes       $y_{k-4}$              no
          $x_{k-2}^2$            yes       $x_{k-2}^2$            yes
          $y_{k-2}x_{k-2}^2$     yes       $y_{k-2}x_{k-2}^2$     yes
 S6       $y_{k-2}$              yes       $y_{k-2}$              yes
          $x_{k-1}$              yes       $x_{k-1}$              yes
          $y_{k-2}x_{k-2}$       yes       $y_{k-2}x_{k-1}$       yes

    This case study represents a significant challenge because it involves nonparametric analysis of the data, linearized modeling, and damping ratios versus the excitation level and nonlinear modeling around a single mode. Also, the order of the system is reasonably high. In the 2-15Hz band, the F-16 possesses about 10 resonance modes.
    The Meta-MSS algorithm and the FROLS are used to select models to represent the dynamics of the F-16 aircraft described above. In the first approach, the maximum nonlinearity degree and the lag of inputs and output were set to 2 and 10, respectively. In this case, the Meta-MSS selects a model with 15 terms, but the model selected through FROLS diverged. Thus, we set the maximum lag to 2. The Meta-MSS has chosen 3 regressors to form the model, while the FROLS failed again to build an adequate model. Finally, the maximum lag was set to 20 and the maximum nonlinearity degree was defined to be 1. For the latter case, the Meta-MSS parameters are defined as listed in Table 7. The same input and output lags are considered in the FROLS approach. Table 8 details the results considering the second acceleration signal as output. For this case, following the recommendation in [42], the models are evaluated using the metric ermst, which is defined as:

$$e_{rmst} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} (y_k - \hat{y}_k)^2}. \tag{35}$$

    As highlighted in Table 8, the Meta-MSS algorithm returns a model with 9 regressors and a better performance than the model with 18 terms built using FROLS. The step-by-step procedure used by FROLS results in the selection of the first 12 output terms, while only 4 output regressors are selected using Meta-MSS. From Table 8, one can see that the Meta-MSS algorithm has an affordable computational cost, since the time to select the model is acceptable, even when compared with FROLS, which is known to be one of the most efficient methods for structure selection.
    Further, it is interesting to note that the Meta-MSS returned a linear model even when the tests were performed using the maximum nonlinearity degree ℓ = 2. This result demonstrates the excellent performance of the method, since the classical one was not able to reach a satisfactory result. Figure 2 depicts the free run simulation of each model.

4 Meta-Model Structure Selection (Meta-MSS): Building NARX for Classification

Because many real-life problems associate continuous and discrete variables, classification has been one of the most widely studied techniques for decision-making tasks in engineering, health science, business and many more. Many methods and algorithms have been developed for data classification, including logistic regression [43], random forest [44], support vector machines [45], k-nearest neighbors [46] and the logistic-NARX model for binary classification [27]. The former three algorithms are widely used, but the interpretation of such models is a hard task. Regarding logistic-NARX, besides the computational efficiency and transparency, it allows the inclusion of lagged terms straightforwardly while other techniques include lagged terms explicitly.
    Following the logistic-NARX approach, this section adapts the Meta-MSS algorithm to develop NARX models focusing on the prediction of systems with binary responses that depend on continuous predictors. The primary motivation comes from the fact that the logistic-NARX approach inherits not only the strengths of the FROLS but also all of its drawbacks related to being stuck

in locally optimal solutions. A direct comparison with the methods above is performed using the identification and evaluation of two simulated models and an empirical system.

Table 6: Comparative analysis - Meta-MSS vs RJMCMC

               Meta-MSS                                              RJMCMC
 System   Model                 Correct   Model 1 (7×)          Model 2               Model 3               Model 4               Correct
 S4       $y_{k-1}x_{k-1}$      yes       $y_{k-1}x_{k-1}$      $y_{k-1}x_{k-1}$      $y_{k-1}x_{k-1}$      $y_{k-1}x_{k-1}$      yes
          $y_{k-2}$             yes       $y_{k-2}$             $y_{k-2}$             $y_{k-2}$             $y_{k-2}$             yes
          $x_{k-2}^2$           yes       $x_{k-2}^2$           $x_{k-2}^2$           $x_{k-2}^2$           $x_{k-2}^2$           yes
          $y_{k-2}x_{k-2}^2$    yes       $y_{k-2}x_{k-2}^2$    $y_{k-2}x_{k-2}^2$    $y_{k-2}x_{k-2}^2$    $y_{k-2}x_{k-2}^2$    yes
          -                     -         -                     $y_{k-3}x_{k-3}$      $x_{k-4}^2$           $x_{k-1}x_{k-3}^2$    no

Table 7: Parameters used in Meta-Structure Selection Algorithm for the F-16 benchmark

 Parameters   nu   nu2   ny   ℓ   p-value   max iter   n agents   α    G0
   Values     20    20   20   1    0.05        30         15      23   100

Fig. 2: Models obtained using the Meta-MSS and the FROLS algorithm. Panels: (1) Meta-MSS: ℓ = 1, ny = nx1 = nx2 = 20; (2) Meta-MSS: ℓ = 2, ny = nx1 = nx2 = 2; (3) Meta-MSS: ℓ = 2, ny = nx1 = nx2 = 10; (4) FROLS: ℓ = 1, ny = nx1 = nx2 = 20. The FROLS was only capable of returning a stable model when setting ℓ = 1. The Meta-MSS, in contrast, returned satisfactory models in all cases.

4.1 Logistic NARX modeling approach using Meta-MSSc algorithm

In [27], the logistic-NARX is based on the FROLS algorithm to select the terms to compose the following probability model

$$p_k = \frac{1}{1 + e^{-\psi_{k-1}^{\top} \hat{\Theta}}}. \tag{36}$$

    The biserial coefficient is used to measure the relationship between a continuous variable and a dichotomous variable according to [47]:

$$r(x, y) = \frac{\bar{X}_1 - \bar{X}_0}{\sigma_X} \sqrt{\frac{n_1 n_0}{N^2}}, \tag{37}$$

where $\bar{X}_0$ is the mean value of the continuous variable X for all the observations that belong to class 0, $\bar{X}_1$ is the mean value of variable X for all the observations that belong to class 1, σX is the standard deviation of variable X, n0 is the number of observations that belong to class 0, n1 is the number of observations that belong to class 1, and N is the total number of data points. Even though it is based on FROLS, the logistic-NARX approach requires the user to set a maximum number of regressors to form the final model, which is not required when using the Meta-MSS algorithm for binary classification.
    The objective function of the Meta-MSS is adapted to use the biserial correlation to measure the association between the variables instead of the RMSE. For the continuous regression problem, the parameters are estimated using the LS method, which minimizes the sum of squared errors of the model output. Because we are dealing with categorical response variables, this approach is not capable of producing minimum variance unbiased estimators, so the parameters are estimated via Stochastic Gradient Descent (SGD) [48].
    Apart from those changes, the main aspects of the standard Meta-MSS algorithm are retained, such as the regressor significance evaluation and all aspects of exploration and exploitation of the search space. Because the parameters are now estimated using SGD, the method becomes more computationally demanding, and this can slow it down, especially for large models.

                           Table 8: Identified NARX model using Algorithm 6 and FROLS.

                       Meta-MSS         Meta-MSS | ℓ = nx1 = nx2 = 2   Meta-MSS | ℓ = 2, nx1 = nx2 = 10           FROLS
                Model term Parameter    Model term      Parameter      Model term        Parameter        Model term Parameter
                  yk−1        0.7166      yk−2            0.6481         yk−1              1.3442           yk−1           1.7829
                  yk−5        0.2389      x1 k−1          1.5361         yk−2             −0.8141           yk−2          −1.8167
                  yk−8       −0.0716      x2 k−2          1.3857         yk−4              0.3592           yk−3           1.3812
                  yk−13       −0.0867                                    x1 k−6            14.8635          yk−6           1.5213
                  x1 k−2      1.5992                                     x1 k−7           −14.7748          yk−9           0.3625
                 x1 k−13     −1.1414                                     x2 k−1           −3.2129           x2 k−7        −2.4253
                  x2 k−4      2.2248                                     x2 k−3            7.1903           x1 k−1         1.8534
                  x2 k−8     −0.8383                                     x2 k−8           −4.0374           x2 k−3         1.9866
                 x2 k−13     −1.1189                                                                        x1 k−8        −1.5305
                                                                                                            x2 k−1         0.6547
                                                                                                            yk−7           1.2767
                                                                                                            yk−5           1.3378
                                                                                                            yk−10         −0.3234
                                                                                                            yk−4          −1.0199
                                                                                                            yk−8          −0.7116
                                                                                                            yk−12         −0.2222
                                                                                                            yk−11          0.3761
                                                                                                           x1 k−20         0.0245
    ermst              0.0862                      0.1268                           0.0982                         0.0876
 Elapsed time          27.92s                       16.78s                          207.13s                        18.01s
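
To make the reading of Table 8 concrete, the sketch below simulates in free run the
three-term model listed in the second column group (yk−2, x1 k−1, x2 k−2). The parameter
values are copied from the table. The input signals, the initial conditions, and the
function name free_run are assumptions introduced for illustration and do not correspond
to the benchmark data.

```python
import numpy as np

# Terms as (signal, lag, parameter), copied from the second model of Table 8.
terms = [("y", 2, 0.6481), ("x1", 1, 1.5361), ("x2", 2, 1.3857)]

def free_run(terms, x1, x2, y_init):
    """Simulate the model feeding back its own past predictions, not measured outputs."""
    max_lag = max(lag for _, lag, _ in terms)
    y_hat = np.zeros(len(x1))
    y_hat[:max_lag] = y_init[:max_lag]
    signals = {"x1": x1, "x2": x2}
    for k in range(max_lag, len(x1)):
        acc = 0.0
        for name, lag, theta in terms:
            past = y_hat[k - lag] if name == "y" else signals[name][k - lag]
            acc += theta * past
        y_hat[k] = acc
    return y_hat

# Synthetic inputs (assumed, for illustration only).
rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-1.0, 1.0, n)
y_hat = free_run(terms, x1, x2, y_init=np.zeros(2))
print(y_hat[:5])
```

With measured outputs available, the free-run prediction y_hat would be compared against
them to obtain an error figure of the kind reported in the last rows of the table.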

4.2 Electroencephalography Eye State Identification

This dataset was built by [49] and contains a 117-second EEG eye state corpus with a
total of 14,980 EEG measurements from 14 different sensors, taken with the Emotiv EEG
neuroheadset to predict eye states [27]. The dataset is now frequently used as a
benchmark and is available from the Machine Learning Repository of the University of
California, Irvine (UCI) [50]. The reader is referred to [51] for additional information
regarding the experiment.
    Following the method in [27], the data is separated into a training set composed of
80% of the data and a testing set with the remainder. The eye state is encoded as
follows: 1 indicates the eye-closed state and 0 the eye-open state.
    Some statistical analysis was performed on the training dataset to check whether the
data have missing values or outliers to be fixed. Values corresponding to inaccurate or
corrupt records were found in all time series provided by the sensors. The detected
inaccurate values are replaced with the mean value of the remaining measurements for
each variable. Also, each input sequence is transformed using scaling and centering
transformations. The Logistic NARX based on FROLS was not able to achieve satisfactory
performance when trained with the original dataset. The authors explained the poor
performance as a consequence of the high variability and dependency between the measured
variables. Table 9 reports that the Meta-MSSc, on the other hand, was capable of
building a model with 10 terms and an accuracy of 65.04%.
    This result may appear to be a poor performance. However, the Logistic NARX achieved
0.7199, and the highest score achieved by the popular techniques was 0.6473, considering
the case where a principal component analysis (PCA) drastically reduced the data
dimensionality. Thus, this result shows a strong performance of the Meta-MSSc algorithm.
For comparison purposes, a PCA is performed, and the first five principal components are
selected as a representation of the original data. Table 10 illustrates that Meta-MSSc
has built the model with the best accuracy, together with the Logistic NARX approach.
The models built without autoregressive inputs have the worst classification accuracy,
although this is improved with the addition of autoregressive terms. However, even with
autoregressive information, the popular techniques do not achieve a classification
accuracy comparable to those obtained by the Meta-MSSc and Logistic NARX methods.

5 Conclusion

This study presents the structure selection of polynomial NARX models using a hybrid and
binary Particle Swarm Optimization and Gravitational Search Algorithm. The selection
procedure considers the individual importance of each regressor along with the
free-run-simulation performance to apply a penalty function to candidate solutions. The
technique, called the Meta-MSS algorithm in its standard form, is extended and analyzed
in two main categories: (i) the regression approach and (ii) the identification of
systems with binary responses using a logistic approach.
    The Meta-MSS algorithm outperformed or at least was comparable to classical
approaches like FROLS and modern techniques such as

Table 9: Identified NARX model using Meta-MSS. This model was built using the original EEG measurements. No
comparison was made because the FROLS-based technique was not capable of generating a model that performed
well enough.

               Model term    constant    x1 k−1    x4 k−30   x4 k−36   x4 k−38      x4 k−41   x6 k−2        x7 k−5   x12 k−1    x13 k−1
 Meta-MSSc
               Parameter      0.2055    −0.1077    0.1689    0.1061    0.0751       0.1393    0.3573       −0.7471   −0.4736    0.3875

Table 10: Accuracy performance between different methods for Electroencephalography Eye State Identification

                                          Method                                 Classification accuracy
                                        Meta-MSSc                                         0.7480
                                      Logistic NARX                                       0.7199
                                     Regression NARX                                      0.6643
                        Random Forest (without autoregressive inputs)                     0.5475
                    Support Vector Machine (without autoregressive inputs)                0.6029
                     K-Nearest Neighbors (without autoregressive inputs)                  0.5041
                         Random Forest (with autoregressive inputs)                       0.6365
                     Support Vector Machine (with autoregressive inputs)                  0.6473
                       K-Nearest Neighbors (with autoregressive inputs)                   0.5662
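
The preprocessing described in Section 4.2 (80/20 split, mean replacement of corrupt
values, centering and scaling, and projection onto the first five principal components)
can be sketched as follows. This is an assumed pipeline, not the authors' code: the
outlier rule in flag_outliers, the chronological split, and the synthetic stand-in for
the 14 sensor channels are all hypothetical choices, since the paper does not specify
them.

```python
import numpy as np

def flag_outliers(X, k=10.0):
    """Hypothetical rule: flag entries more than k robust standard deviations from the median."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return np.abs(X - med) > k * 1.4826 * mad

def replace_with_mean(X, bad_mask):
    """Replace flagged entries by the mean of the remaining entries of each column."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        good = ~bad_mask[:, j]
        X[bad_mask[:, j], j] = X[good, j].mean()
    return X

def standardize(train, test):
    """Center and scale both sets using statistics of the training set only."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sd, (test - mu) / sd

def pca_project(train, test, n_components=5):
    """Project onto the leading principal directions of the centered training data."""
    _, _, vt = np.linalg.svd(train, full_matrices=False)
    W = vt[:n_components].T
    return train @ W, test @ W

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Synthetic stand-in for the 14 sensor channels, with one artificial corrupt record.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))
X[10, 3] = 1e5

n_train = int(0.8 * len(X))            # assumed chronological 80/20 split
X_train, X_test = X[:n_train], X[n_train:]
X_train = replace_with_mean(X_train, flag_outliers(X_train))
Z_train, Z_test = standardize(X_train, X_test)
P_train, P_test = pca_project(Z_train, Z_test, n_components=5)
print(P_train.shape, P_test.shape)
```

The accuracy helper corresponds to the classification accuracy reported in Table 10,
that is, the fraction of test samples whose predicted eye state matches the recorded one.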

RaMSS, C-RaMSS, RJMCMC, and a meta-heuristic based algorithm. This statement considers
the results obtained in the model selection of 6 simulated models taken from the
literature, and the performance on the F-16 Ground Vibration benchmark.
    The latter category demonstrates the robust performance of the technique using an
adapted algorithm, called Meta-MSSc, to build models that predict binary outcomes in
classification problems. Again, the proposed algorithm outperformed or at least was
comparable to popular techniques such as K-Nearest Neighbors, Random Forests and Support
Vector Machine, and recent approaches based on the FROLS algorithm using NARX models.
Besides the simulated example, the electroencephalography eye state identification
showed that the Meta-MSSc algorithm could handle the problem better than all of the
compared techniques. In this case study, the new algorithm returned a model with
satisfactory performance even when the data dimensionality was not transformed using
data reduction techniques, which was not possible with the algorithms used for
comparison purposes.
    Furthermore, despite the stochastic nature of the Meta-MSS algorithm, the individual
evaluation of the regressors and the penalty function result in fast convergence. In
this respect, the computational efficiency is better than or at least consistent with
other stochastic procedures, such as RaMSS, C-RaMSS and RJMCMC. The computational effort
depends on the number of search agents, the maximum number of iterations, and the search
space dimensionality. Therefore, in some cases, the elapsed time of the Meta-MSS is
comparable to that of the FROLS.
    The development of a meta-heuristic based algorithm for model selection such as the
Meta-MSS permits a broad exploration in the field of system identification. Although
some analyses are out of scope and, therefore, are not addressed in this paper, future
work is open for research regarding the inclusion of noise process terms in model
structure selection, which is an important problem concerning the identification of
polynomial autoregressive models. In this respect, an exciting continuation of this work
would be to implement an extended version of Meta-MSS to return NARMAX models.

References

 1. S.A. Billings, Nonlinear system identification: NARMAX methods in the time, frequency, and spatio-temporal domains (John Wiley & Sons, Chichester, 2013)
 2. N. Wiener, Nonlinear problems in random theory. Tech. rep., Massachusetts Institute of Technology (1958)
 3. W.J. Rugh, Nonlinear System Theory (Johns Hopkins University Press, 1981)
 4. R. Haber, L. Keviczky, Nonlinear System Identification - Input-Output Modeling Approach, vol. 1 and 2 (Kluwer Academic Publishers, 1999)
 5. R. Pintelon, J. Schoukens, System Identification: A Frequency Domain Approach (John Wiley & Sons, 2012)
 6. S.A. Billings, I.J. Leontaritis, in Proceedings of the IEEE Conference on Control and its Application (1981), pp. 183–187
 7. I.J. Leontaritis, S.A. Billings, International Journal of Control 41(2), 303 (1985)
 8. S. Chen, S.A. Billings, International Journal of Control 49(3), 1013 (1989)
 9. L.A. Aguirre, S.A. Billings, Physica. D, Nonlinear Phenomena 80(1-2), 26 (1995)
10. L. Piroddi, W. Spinelli, International Journal of Control 76(17), 1767 (2003)
11. M.L. Korenberg, S.A. Billings, Y.P. Liu, P.J. McIlroy, International Journal of Control 48(1), 193 (1988)
12. S.A. Billings, S. Chen, M.J. Korenberg, International Journal of Control 49(6), 2157 (1989)
13. M. Farina, L. Piroddi, International Journal of Systems Science 43(2), 319 (2012)

14. Y. Guo, L.Z. Guo, S.A. Billings, H. Wei, International Journal of Systems Science 46(5), 776 (2015)
15. K.Z. Mao, S.A. Billings, Mechanical Systems and Signal Processing 13(2), 351 (1999)
16. S.A. Billings, L.A. Aguirre, International Journal of Bifurcation and Chaos 5(06), 1541 (1995)
17. P. Palumbo, L. Piroddi, Journal of Sound and Vibration 239(3), 405 (2001)
18. A. Falsone, L. Piroddi, M. Prandini, Automatica 60, 227 (2015)
19. H. Akaike, IEEE Transactions on Automatic Control 19(6), 716 (1974)
20. H. Akaike, Annals of the Institute of Statistical Mathematics 21(1), 243 (1969)
21. G. Schwarz, The Annals of Statistics 6(2), 461 (1978)
22. S. Chen, X. Hong, C.J. Harris, IEEE Transactions on Automatic Control 48(6), 1029 (2003)
23. R. Tempo, G. Calafiore, F. Dabbene, Randomized algorithms for analysis and control of uncertain systems: with applications (Springer Science & Business Media, 2012)
24. T. Baldacchino, S.R. Anderson, V. Kadirkamanathan, Automatica 49(9), 2641 (2013)
25. K. Rodriguez-Vazquez, C.M. Fonseca, P.J. Fleming, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 34(4), 531 (2004)
26. A.G.V. Severino, F.M.U.d. Araújo, in Simpósio Brasileiro de Automação Inteligente (2017), pp. 609–614
27. J.R.A. Solares, H.L. Wei, S.A. Billings, Neural Computing and Applications 31(1), 11 (2019)
28. C. Blum, A. Roli, ACM Computing Surveys (CSUR) 35(3), 268 (2003)
29. A.E. Eiben, C.A. Schippers, Fundamenta Informaticae 35(1-4), 35 (1998)
30. E.G. Talbi, Journal of Heuristics 8(5), 541 (2002)
31. S. Mirjalili, S.Z.M. Hashim, in 2010 International Conference on Computer and Information Application (IEEE, 2010), pp. 374–377
32. J. Kennedy, R.C. Eberhart, Particle swarm optimization, in IEEE International Conference on Neural Networks (1995)
33. J. Kennedy, Encyclopedia of Machine Learning pp. 760–766 (2010)
34. E. Rashedi, H. Nezamabadi-Pour, S. Saryazdi, Information Sciences 179(13), 2232 (2009)
35. S. Mirjalili, A. Lewis, Neural Computing and Applications 25(7-8), 1569 (2014)
36. S. Mirjalili, G.G. Wang, L.d.S. Coelho, Neural Computing and Applications 25(6), 1423 (2014)
37. S. Mirjalili, A. Lewis, Swarm and Evolutionary Computation 9, 1 (2013)
38. H. Wei, S.A. Billings, International Journal of Modelling, Identification and Control 3(4), 341 (2008)
39. M. Bonin, V. Seghezza, L. Piroddi, IET Control Theory & Applications 4(7), 1157 (2010)
40. L.A. Aguirre, B.H.G. Barbosa, A.P. Braga, Mechanical Systems and Signal Processing 24(8), 2855 (2010)
41. F. Bianchi, A. Falsone, M. Prandini, L. Piroddi, A randomised approach for NARX model identification based on a multivariate Bernoulli distribution. Master's thesis, Politecnico di Milano (2017)
42. J.P. Noël, M. Schoukens, in 2017 Workshop on Nonlinear System Identification Benchmarks (2017), pp. 19–23
43. D.W. Hosmer Jr, S. Lemeshow, R.X. Sturdivant, Applied logistic regression, vol. 398 (John Wiley & Sons, 2013)
44. L. Breiman, Machine Learning 45(1), 5 (2001)
45. N. Cristianini, J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods (Cambridge University Press, 2000)
46. M. Kuhn, K. Johnson, Applied predictive modeling, vol. 26 (Springer, 2013)
47. J. Pallant, SPSS survival manual (McGraw-Hill Education (UK), 2013)
48. L. Bottou, in Neural networks: Tricks of the trade (Springer, 2012), pp. 421–436
49. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, ACM SIGKDD Explorations Newsletter 11(1), 10 (2009)
50. A. Asuncion, D. Newman, UCI Machine Learning Repository (2007)
51. T. Wang, S.U. Guan, K.L. Man, T.O. Ting, Mathematical Problems in Engineering 2014 (2014)