Deep Inertial Odometry with Accurate IMU Preintegration

Rooholla Khorrambakht1, Chris Xiaoxuan Lu2, Hamed Damirchi1, Zhenghua Chen3, Zhengguo Li3

1 Rooholla Khorrambakht and Hamed Damirchi are with the Faculty of Electrical Engineering and Computer Science, K. N. Toosi University of Technology, Iran, r.khorrambakht,
2 Chris Xiaoxuan Lu is with the School of Informatics, University of Edinburgh, United Kingdom,
3 Zhenghua Chen and Zhengguo Li are with the Institute for Infocomm Research (I2R), A*STAR, Singapore, Chen Zhenghua,

arXiv:2101.07061v1 [cs.LG] 18 Jan 2021

Abstract— Inertial Measurement Units (IMUs) are interoceptive modalities that provide ego-motion measurements independent of environmental factors. They are widely adopted in various autonomous systems. Motivated by the limitations in processing the noisy measurements from these sensors using their mathematical models, researchers have recently proposed various deep learning architectures to estimate inertial odometry in an end-to-end manner. Nevertheless, the high-frequency and redundant measurements from IMUs lead to long raw sequences to be processed. In this study, we investigate the efficacy of accurate preintegration as a more realistic solution to the IMU motion model for deep inertial odometry (DIO); the resultant DIO is a fusion of model-driven and data-driven approaches. Accurate IMU preintegration has the potential to outperform the numerical approximation of the continuous IMU model used in existing DIOs. Experimental results validate the proposed DIO.

I. INTRODUCTION

Inertial Odometry (IO) is concerned with the estimation of ego-pose transformations using raw IMU measurements. Performing IO based on the mathematical model of an IMU leads to large accumulated errors in long runs and is only reliable for short time intervals between the measurements from other complementary sensors in a fusion setup. In contrast, learning the inertial odometry casts the IO problem as a sequence modeling task and is motivated by the flexibility of deep learning in exploiting the complex motion patterns in the data as pseudo-measurements for keeping the error in check.

IMU measurements are produced at high frequencies, which leads to long sequences for relatively short time periods. Processing these long sequences using Recurrent Neural Networks (RNNs) is challenging and is prone to washout [1], [2], while processing them using Convolutional Neural Networks (CNNs) requires deep architectures to cover large enough receptive fields. Although WaveNets have been employed to address such problems [3], they have not yet found widespread adoption due to their specialized architecture. As an alternative, we proposed exploiting preintegration as a model-aware preprocessing step for compressing the temporal dimension of the raw motion measurements from the IMU [4]. As pointed out in the literature [5], correct inductive biases and proper intermediate representations can lead to significant performance boosts. By introducing preintegration as a method of extracting intermediate representations, we achieved significant performance improvements while also reducing the computational load. Furthermore, the produced intermediate representations do not impose any architectural constraints on the deep learning models.

However, the validity of this compression relies on the correctness of the assumptions made in formulating the motion preintegration. The formulation presented in [4] relied on the derivations of [6], which are based on the assumption that the angular velocity in the body frame and the linear acceleration in the world frame remain constant between two consecutive samples. This assumption holds when the motion is not highly dynamic or when the IMU sample rate is high. The authors of [7] addressed this issue by exploiting switched linear systems theory and proposed an accurate preintegration formulation for Visual Inertial Odometry (VIO) applications. This study adopts the formulation of [7] to compute the PreIntegrated (PI) features in our deep learning setup and studies its effectiveness in two motion domains, one dynamic and one moderate. As such, the model-driven and data-driven approaches co-exist under one roof. Using real-world datasets, we show that the two features lead to similar performance in moderate motions. However, at higher speeds and dynamic motions, accurate preintegration yields better performance. In either case, the fusion of IMU preintegration and deep learning results in a deep inertial odometry (DIO) that is useful for the emerging cognitive navigation for robots. Cognitive navigation has the potential to replace the popular metric navigation due to the development of artificial intelligence [8].

The remainder of this report is structured as follows. Section II presents the preintegration theory and the formulation of its accurate solution. Next, the experimental setups, datasets, and results are introduced in Section III, and the concluding remarks are presented in Section IV.

II. PREINTEGRATION THEORY AND PI FEATURES

Preintegration is the method of computing motion constraint variables between keyframes in a pose graph [6]. It is based on the mathematical model of the IMU and compresses the samples between frames into 9D vectors constraining the orientations, velocities, and positions of adjacent nodes in the graph. In this section, the relevant formulations for computing Forster and accurate PI features are presented.

A. IMU Model

The measurement model of an IMU may be expressed as follows [6]:

\[
\begin{aligned}
{}_B\tilde{\omega}(t) &= {}_B\omega_{WB}(t) + b^g(t) + \eta^g(t) \\
{}_B\tilde{a}(t) &= R_{WB}^\top(t)\,\big({}_W a(t) - {}_W g\big) + b^a(t) + \eta^a(t)
\end{aligned}
\tag{1}
\]

where $R_{WB}$ represents the orientation of the IMU body frame $B$ with respect to the world frame $W$, ${}_W a(t)$ represents the acceleration of the IMU with respect to the world frame, and ${}_B\omega_{WB}(t)$ represents the angular velocity of the IMU expressed in its local body coordinate frame. The IMU measurements ${}_B\tilde{\omega}(t)$ and ${}_B\tilde{a}(t)$ are contaminated with additive Gaussian noise $\eta^g(t)$, $\eta^a(t)$ and random walk biases $b^g(t)$ and $b^a(t)$. Furthermore, ${}_W g$ in the above equation represents the known gravity vector in the world coordinate frame.

Based on the kinematic equations of the IMU, the motion propagation of the sensor in the world frame may be expressed as follows [6]:

\[
\dot{R}_{WB} = R_{WB}\,{}_B\omega_{WB}^{\wedge}, \qquad {}_W\dot{v} = {}_W a, \qquad {}_W\dot{p} = {}_W v
\tag{2}
\]

where the $(\cdot)^{\wedge}$ operator converts a 3D vector into its skew-symmetric matrix representation. Through integration we have:

\[
\begin{aligned}
R_{WB}(t+\Delta t) &= R_{WB}(t)\,\exp\!\left(\int_t^{t+\Delta t} {}_B\omega_{WB}(\tau)\,d\tau\right) \\
{}_W v(t+\Delta t) &= {}_W v(t) + \int_t^{t+\Delta t} {}_W a(\tau)\,d\tau \\
{}_W p(t+\Delta t) &= {}_W p(t) + \int_t^{t+\Delta t} {}_W v(\tau)\,d\tau + \iint_t^{t+\Delta t} {}_W a(\tau)\,d\tau^2
\end{aligned}
\]

where $\Delta t$ is the sampling interval of the sensor and $\exp(\cdot)$ is the SO(3) exponential map.

B. Forster Formulation

Assuming the angular velocity in the body frame and the acceleration in the world frame to be constant between two IMU samples, Eqs. (1) and (2) lead to the following discrete motion propagation model:

\[
\begin{aligned}
R(t+\Delta t) &= R(t)\,\mathrm{Exp}\big((\tilde{\omega}(t) - b^g(t) - \eta^{gd}(t))\,\Delta t\big) \\
v(t+\Delta t) &= v(t) + g\,\Delta t + R(t)\,\big(\tilde{a}(t) - b^a(t) - \eta^{ad}(t)\big)\,\Delta t \\
p(t+\Delta t) &= p(t) + v(t)\,\Delta t + \tfrac{1}{2}\,g\,\Delta t^2 + \tfrac{1}{2}\,R(t)\,\big(\tilde{a}(t) - b^a(t) - \eta^{ad}(t)\big)\,\Delta t^2
\end{aligned}
\tag{3}
\]

Based on the above model, the motion corresponding to a batch of IMU samples between time steps $i$ and $j$ may be formulated as follows:

\[
\begin{aligned}
R_j &= R_i \prod_{k=i}^{j-1} \mathrm{Exp}\big((\tilde{\omega}_k - b^g_k - \eta^{gd}_k)\,\Delta t\big) \\
v_j &= v_i + g\,\Delta t_{ij} + \sum_{k=i}^{j-1} R_k\,\big(\tilde{a}_k - b^a_k - \eta^{ad}_k\big)\,\Delta t \\
p_j &= p_i + \sum_{k=i}^{j-1} \Big[ v_k\,\Delta t + \tfrac{1}{2}\,g\,\Delta t^2 + \tfrac{1}{2}\,R_k\,\big(\tilde{a}_k - b^a_k - \eta^{ad}_k\big)\,\Delta t^2 \Big]
\end{aligned}
\tag{4}
\]

As proposed in [6], by multiplying both sides of the above equations by $R_i^\top$, the initial-state-dependent and sensor-dependent integrated measurements can be separated:

\[
\begin{aligned}
\Delta R_{ij} &\doteq R_i^\top R_j = \prod_{k=i}^{j-1} \exp\big((\tilde{\omega}_k - b^g_k - \eta^{gd}_k)\,\Delta t\big) \\
\Delta v_{ij} &\doteq R_i^\top\big(v_j - v_i - g\,\Delta t_{ij}\big) = \sum_{k=i}^{j-1} \Delta R_{ik}\,\big(\tilde{a}_k - b^a_k - \eta^{ad}_k\big)\,\Delta t \\
\Delta p_{ij} &\doteq R_i^\top\big(p_j - p_i - v_i\,\Delta t_{ij} - \tfrac{1}{2}\,g\,\Delta t_{ij}^2\big) = \sum_{k=i}^{j-1} \Big[ \Delta v_{ik}\,\Delta t + \tfrac{1}{2}\,\Delta R_{ik}\,\big(\tilde{a}_k - b^a_k - \eta^{ad}_k\big)\,\Delta t^2 \Big]
\end{aligned}
\tag{5}
\]

The right-hand sides of the above equations represent motion constraints that are functions of the IMU measurements only. In [4], we proposed adopting these constraints as an intermediate representation for deep inertial odometry, which we called PI features. It is worth noting that we substitute the bias-corrected values in the above equations with their corresponding raw measurements and rely on our model to compensate for their impact.

C. Accurate Preintegration

The assumption of constant world-frame acceleration between consecutive IMU samples can be violated in the case of highly dynamic motions, or when the IMU sampling frequency is not very high. To account for this problem, [7] exploits switched linear systems theory to model the state transition between each pair of IMU samples. In other words, [7] assumes the acceleration and angular velocity between two IMU samples to be constant in the body frame as opposed to the world frame, which is a more realistic assumption. With this assumption, and after deriving the closed-form solutions of the state transition between each pair of IMU samples, the following accurate preintegrated constraints are derived:

\[
\begin{aligned}
\Delta R_{ij} &= \prod_{k=i}^{j-1} \exp(\Theta_k) \\
\Delta v_{ij} &= \sum_{k=i}^{j-1} \Delta R_{ik}\,\Gamma(\Theta_k)\,\big(\tilde{a}_k - b^a_k - \eta^{ad}_k\big)\,\Delta t \\
\Delta p_{ij} &= \sum_{k=i}^{j-1} \Big[ \Delta v_{ik}\,\Delta t + \Delta R_{ik}\,\Lambda(\Theta_k)\,\big(\tilde{a}_k - b^a_k - \eta^{ad}_k\big)\,\Delta t^2 \Big]
\end{aligned}
\tag{6}
\]

where $\Lambda(\cdot)$ and $\Gamma(\cdot)$ are corrective terms defined in [7]. Furthermore, $\Theta_k$ is defined as follows:

\[
\Theta_k = \big(\tilde{\omega}_k - b^g_k - \eta^{gd}_k\big)\,\Delta t
\]

It can be shown that when $\Theta_k$ is small (high sampling rate or moderate dynamics), $\Lambda(\cdot)$ converges to 0.5 and $\Gamma(\cdot)$ converges to 1. In other words, when the sampling rate of the IMU is high or the motion is not highly dynamic, the two preintegrated features become identical.

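As an illustration of how the Forster-style constraints of Eq. (5) can be evaluated when the raw measurements are substituted for the bias-corrected ones, the following is a minimal NumPy sketch. The function names (`skew`, `so3_exp`, `forster_pi_features`) and the array layout are assumptions made for this example and are not taken from the implementation used in the experiments. The corrective terms of [7] are not reproduced here.

```python
import numpy as np


def skew(v):
    """Return the 3x3 skew-symmetric matrix of a 3D vector."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])


def so3_exp(phi):
    """SO(3) exponential map of a rotation vector (Rodrigues' formula)."""
    theta = np.linalg.norm(phi)
    K = skew(phi)
    if theta < 1e-8:
        # First-order approximation for very small angles.
        return np.eye(3) + K
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta ** 2) * (K @ K))


def forster_pi_features(gyro, accel, dt):
    """Accumulate (dR, dv, dp) over one window using raw IMU samples.

    gyro, accel: arrays of shape (N, 3); dt: sampling interval in seconds.
    Biases and noise are ignored (assumption: raw measurements are used).
    """
    gyro = np.asarray(gyro, dtype=float)
    accel = np.asarray(accel, dtype=float)
    dR = np.eye(3)
    dv = np.zeros(3)
    dp = np.zeros(3)
    for w, a in zip(gyro, accel):
        rotated = dR @ a                      # dR_ik * a_k
        dp = dp + dv * dt + 0.5 * rotated * dt ** 2
        dv = dv + rotated * dt
        dR = dR @ so3_exp(w * dt)             # right-multiply the increment
    return dR, dv, dp


if __name__ == "__main__":
    # Hypothetical window of 10 samples at 100 Hz.
    rng = np.random.default_rng(0)
    dR, dv, dp = forster_pi_features(rng.normal(size=(10, 3)),
                                     rng.normal(size=(10, 3)), 0.01)
    print(dR.shape, dv.shape, dp.shape)
```

The position term is updated before the velocity and rotation terms so that each step uses $\Delta v_{ik}$ and $\Delta R_{ik}$, i.e. the values accumulated up to, but not including, sample $k$.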
III. EXPERIMENTS

In this section, we use two real-world datasets to investigate the effectiveness of accurate preintegration compared to two baselines: a model trained using raw IMU data and another using preintegrated features based on Forster's formulation.

A. Setup

1) Network Architecture: In this study, we use the base architecture from the IO-Net paper [9], which is a single-layer bi-directional LSTM with a hidden state size of 128 and independently learned initial hidden states. The selection of LSTM brings the flexibility of feeding inputs with various temporal resolutions without any architectural modifications. Furthermore, as reported in [9], bi-directionality improves the capacity of the model for capturing the motion dynamics by allowing the predictions from one step to use both past and future histories of the signal.

As depicted in Fig. 1, we consider a temporal history of 200 IMU samples on each inference. The window length of 200 is the value for which we achieved a good balance between performance and computational load. The presented architecture is common to all of our experiments. Two experiments are designed, one for each type of preintegrated features (accurate and Forster). In this paper, we choose a preintegration length of 10 IMU samples based on the slowest ground-truth frequency in our datasets. Our baseline experiment bypasses the preintegration module and feeds the LSTM with the raw IMU samples. For each odometry transformation between 10 IMU samples, the hidden states of the LSTM are fed into fully connected layers with $\xi_i \in se(3)$ predictions as outputs. These se(3) vectors are converted into SE(3) transformation matrices through the exponential mapping function, $T_i = \exp(\xi_i)$.

Fig. 1: The architecture of the models used in all our experiments. [Diagram: preintegration blocks, one per integration window, feed LSTM cells LSTM1 to LSTM20 (hidden size 128), each followed by an FC layer.]

2) Training and Loss: The training loss is formulated similarly to DPC-Net [10], with geodesic distances between the network predictions and ground-truth labels as the loss:

\[
\mathcal{L} = \sum_{i} \Phi_i^\top\,\Sigma^{-1}\,\Phi_i, \qquad
\Phi_i \doteq \log\!\big(\Delta T^{*\,-1}_{i,i+1}\,\exp(\xi_i^{\wedge})\big)^{\vee}
\]

where $\Delta T^{*}_{i,i+1} = T^{*\,-1}_{i}\,T^{*}_{i+1}$ are odometry labels and $\Sigma$ is an empirical covariance matrix computed using the training data. Furthermore, the $(\cdot)^{\wedge}$ and $(\cdot)^{\vee}$ operators respectively transform se(3) vectors into their matrix form and vice versa.

Finally, the models are implemented using PyTorch, and the Adam optimizer with an initial learning rate of 0.001 is used to train them. In order to avoid overfitting, dropout layers with a rate of 25% are added between the LSTM outputs and the FC inputs. On average, our models converged after 100 epochs.

3) Datasets: We have chosen two datasets to represent fast and slow motion distributions: the OxfordIO pedestrian odometry dataset [11] as a representative of an application domain with moderate motions, and the Kitti autonomous driving dataset as an example of a domain with fast motions. Fig. 2 illustrates the cumulative distribution of the dynamic acceleration of a snippet from the test sets of the two datasets. As can be seen in the graph, the 90th percentile of the OxfordIO dataset is at around 1.25 m/s^2, while this value for the Kitti dataset is 2.5 m/s^2, with maximum accelerations of over 3 m/s^2, which agrees with our assumption about the fast and slow motion nature of each dataset.

Fig. 2: The cumulative distribution of the dynamic acceleration of the Kitti and OxfordIO datasets. [Plot: cumulative probability (0.0 to 1.0) versus dynamic acceleration (m/s^2, 0.0 to 3.5), with one curve each for Kitti and OxfordIO.]

In terms of sensor specifications, the Kitti dataset is recorded using a car equipped with vision, Lidar, and RTK-GPS+IMU units traversing urban and countryside environments. The IMU measurements are available at a rate of 100 Hz, and a 10 Hz centimeter-level accurate ground truth is provided through the fusion of Lidar and GPS-IMU sensors. On the other hand, the OxfordIO dataset contains the IMU readings at a rate of 100 Hz from a smartphone held in various configurations by multiple users undergoing different motion patterns. The ground truth for this dataset is recorded using a Vicon motion capture system with millimeter-level accuracy.

B. Autonomous Driving Motion Domain

This section investigates the effectiveness of accurate preintegration in improving the IO performance of models trained on a
dynamic motion model.

1) Baselines: We train three models using the Kitti dataset. The base model takes the raw IMU measurements as input, while the other two are fed with the two types of PI features, one computed using the accurate formulation and the other using Forster's method. Based on the 10 Hz frequency of the ground-truth labels and the 100 Hz IMU sampling rate, we set the odometry step (preintegration length) to 10 IMU samples (100/10 = 10). Furthermore, sequences 00 to 10 excluding {05, 07, 10, 03} were used for training, and testing was performed using sequences {05, 07, 10, 03}.

2) Evaluation Metric: The relative translation and rotation errors defined by the KITTI benchmark [12] are adopted as the evaluation metric. These relative errors are computed as the averaged position/orientation errors within all possible sub-sequences of lengths 100 m, ..., 800 m. We use the open-source implementation provided in [13].

3) Results: The results of this experiment are reported in Table II. Each of the three main columns of the table represents the performance of one of the three baselines, and each row indicates the performance on one test sequence. As can be seen, the average translation error of the model trained with accurate PI features is lower than that of the other two baselines, by 0.88 percentage points compared to Forster's method and 8.66 percentage points compared to the model with raw input. It is important to note that the models using either preintegration method perform better than the baseline model with raw inputs. It is also important to note that the orientation errors of the two preintegration methods are close to each other because, based on Eq. 6 and Eq. 5, both the accurate and Forster integration methods employ identical formulas to compute the rotation portion of the PI features.

TABLE II: The performance evaluation of the three baseline models tested on the Kitti dataset.

            IO-Net                  PI-Net (Forster)        PI-Net (Accurate PI)
Test seq.   trel (%)  rrel (deg/m)  trel (%)  rrel (deg/m)  trel (%)  rrel (deg/m)
10          11.37     0.018         10.23     0.015         10.6      0.019
07          20.91     0.016         8.52      0.013         5.82      0.014
05          18.6      0.035         8.8       0.021         8.77      0.019
avg.        16.96     0.023         9.18      0.0163        8.3       0.017

C. Pedestrian Odometry Motion Domain

Unlike the driving motion domain, pedestrians do not exhibit frequent high acceleration/deceleration and high-speed motions. Thus, in this section, we repeat the experiments on this domain to investigate the hypothesis that both accurate and Forster methods perform similarly under moderate motions. It is important to note that the sampling frequencies of the IMUs in both datasets are equal, which is important for isolating the motion characteristics as the only influential factor.

1) Baselines: The three baseline models of the previous section were trained on the handheld domain of the OxfordIO dataset. In order to maintain comparability, we adopted identical odometry steps and integration lengths (10 IMU samples). The sequences shown in Table III are used for testing, and the remaining sequences were used for training and validation.

2) Evaluation Metric: We integrate the 6-DOF SE(3) odometry predictions corresponding to batches of 200 IMU samples to calculate the displacements. We compare these displacements against their ground-truth values to compute the error for each model. We then divide these errors by the displacement length to compute normalized error values.

3) Results: The results of this experiment are reported in Table III. As can be seen in the table, the difference between the average performance of the two preintegration methods is marginal in this experiment. However, similar to the previous section, both preintegration methods surpass the performance of the model operating on raw data. The small gap between the performances of the two preintegration methods was expected for this motion domain. As indicated in Section II, the two preintegration formulations become identical when the IMU sampling frequency is high or when the accelerations and rotation rates are not highly dynamic.

TABLE III: The performance evaluation of the three baseline models tested on the OxfordIO dataset.

Test seq.        IO-Net (%)   PI-Net (Forster) (%)   PI-Net (Accurate PI) (%)
handheld-d1-s2   6.5          4.94                   5.0
handheld-d1-s5   3.33         2.91                   2.67
handheld-d1-s6   3.12         2.82                   2.91
handheld-d3-s1   4.48         3.42                   3.6
handheld-d4-s1   4.32         4.28                   3.92
handheld-d4-s3   4.6          4.05                   4.14
handheld-d5-s1   3.64         3.53                   3.42
average          4.35         3.71                   3.66

IV. CONCLUSION AND DISCUSSION

Preintegration reduces the temporal dimension of the raw IMU signals by incorporating the mathematical model of the sensor. This reduction of temporal steps leads to fewer recursions by the RNN, which facilitates faster inference and better performance. In this study, we investigated the impact of the numerical inaccuracy of the preintegration formulation on the performance of a deep learning model trained using PI features. We observed that the adoption of accurate preintegration leads to performance improvements in highly dynamic motions. In contrast, the performance gap is marginal when the movements are not highly dynamic or, equivalently, when the sampling frequency of the sensor is very high.
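The normalized error used for the pedestrian domain (Section III-C, Evaluation Metric) can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the "displacement length" is taken to be the norm of the ground-truth displacement over the window, and the result is returned as a fraction rather than a percentage. The exact definition used to produce Table III may differ.

```python
import numpy as np


def make_transform(R, t):
    """Build a 4x4 SE(3) matrix from a rotation matrix and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def integrate_odometry(steps):
    """Chain relative SE(3) steps (4x4 matrices) into one window transform."""
    T = np.eye(4)
    for dT in steps:
        T = T @ dT
    return T


def normalized_displacement_error(pred_steps, gt_steps):
    """Displacement error over a window divided by the ground-truth displacement.

    Assumption: the displacement length is the norm of the integrated
    ground-truth translation over the window.
    """
    t_pred = integrate_odometry(pred_steps)[:3, 3]
    t_gt = integrate_odometry(gt_steps)[:3, 3]
    length = np.linalg.norm(t_gt)
    if length < 1e-12:
        raise ValueError("ground-truth displacement is zero; error is undefined")
    return np.linalg.norm(t_pred - t_gt) / length


if __name__ == "__main__":
    # Hypothetical window: 20 odometry steps (200 IMU samples / 10 per step).
    gt = [make_transform(np.eye(3), [0.10, 0.0, 0.0]) for _ in range(20)]
    pred = [make_transform(np.eye(3), [0.11, 0.0, 0.0]) for _ in range(20)]
    print(normalized_displacement_error(pred, gt))
```

Multiplying the returned value by 100 would express it as a percentage, which is the unit reported in Table III.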
