Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming

Page created by Rosa Ferguson
 
CONTINUE READING
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
Outatime: Using Speculation to Enable Low-Latency
           Continuous Interaction for Mobile Cloud Gaming

                                Kyungmin Lee† David Chu‡ Eduardo Cuervo‡ Johannes Kopf‡
                                 Yury Degtyarev◦ Sergey Grizan Alec Wolman‡ Jason Flinn†
                                                          †                     ‡
                                                     Univeristy of Michigan       Microsoft Research
                                      ◦                                             
                                          St. Petersburg Polytechnic University       Siberian Federal University

ABSTRACT                                                                                      1.      INTRODUCTION
Gaming on phones, tablets and laptops is very popular.                                           Gaming is the most popular mobile activity, accounting
Cloud gaming — where remote servers perform game execu-                                       for nearly a third of the time spent on mobile devices [18].
tion and rendering on behalf of thin clients that simply send                                 Recently, cloud gaming — where datacenter servers execute
input and display output frames — promises any device the                                     the games on behalf of thin clients that merely transmit
ability to play any game any time. Unfortunately, the reality                                 UI input events and display output rendered by the servers
is that wide-area network latencies are often prohibitive; cel-                               — has emerged as an interesting alternative to traditional
lular, Wi-Fi and even wired residential end host round trip                                   client-side execution. Cloud gaming offers several advan-
times (RTTs) can exceed 100ms, a threshold above which                                        tages. First, every client can enjoy the high-end graphics
many gamers tend to deem responsiveness unacceptable.                                         provided by powerful server GPUs. This is especially appeal-
   In this paper, we present Outatime, a speculative execu-                                   ing for mobile devices such as basic laptops, phones, tablets,
tion system for mobile cloud gaming that is able to mask up                                   TVs and other displays lacking high-end GPUs. Second,
to 120ms of network latency. Outatime renders speculative                                     cloud gaming eases developer pain. Platform compatibil-
frames of future possible outcomes, delivering them to the                                    ity headaches and vexing per-platform performance tuning
client one entire RTT ahead of time, and recovers quickly                                     — sources of much developer frustration [32, 40, 44] — are
from mis-speculations when they occur. Clients perceive lit-                                  eliminated. Third, cloud gaming simplifies software deploy-
tle latency. To achieve this, Outatime combines: 1) future                                    ment and management. Server management (e.g., for bug
state prediction; 2) state approximation with image-based                                     fixes, software updates, hardware upgrades, content addi-
rendering and event time-shifting; 3) fast state checkpoint                                   tions, etc.) is far easier than modifying clients. Finally,
and rollback; and 4) state compression for bandwidth sav-                                     players can select from a vast library of titles and instantly
ings.                                                                                         play any of them. Sony, Nvidia, Amazon and OnLive are
   To evaluate the Outatime speculation system, we use two                                    among the providers that currently offer cloud gaming ser-
high quality, commercially-released games: a twitch-based                                     vices [1–3, 10].
first person shooter, Doom 3, and an action role playing                                         However, cloud gaming faces a key technical dilemma:
game, Fable 3. Through user studies and performance bench-                                    how can players attain real-time interactivity in the face of
marks, we find that players strongly prefer Outatime to                                       wide-area latency? Real-time interactivity means client in-
traditional thin-client gaming where the network RTT is                                       put events should be quickly reflected on the client display.
fully visible, and that Outatime successfully mimics play-                                    User studies have shown that players are sensitive to as little
ing across a low-latency network.                                                             as 60 ms latency, and are aggravated at latencies in excess
                                                                                              of 100ms [6,12,33]. A further delay degradation from 150ms
                                                                                              to 250ms lowers user engagement by 75% [9].
Categories and Subject Descriptors                                                               One way to address latency is to move servers closer to
C.2.4 [Computer-Communication Networks]: Distributed                                          clients. Unfortunately, not only are decentralized edge servers
Systems—Distributed Applications; I.6.8 [Simulation and                                       more expensive to build and maintain, local spikes in de-
Modeling]: Types of Simulation—Gaming                                                         mand cannot be routed to remote servers which further mag-
                                                                                              nifies costs. Most importantly, high latencies are often at-
                                                                                              tributed to the networks’s last mile. Recent studies have
Keywords                                                                                      found that the 95th percentile of network latencies for 3G,
Cloud gaming; Speculation; Network latency                                                    Wi-Fi and LTE are over 600ms, 300ms and 400ms, respec-
                                                                                              tively [21, 22, 35]. In fact, even well-established residential
Permission to make digital or hard copies of all or part of this work for personal or         wired last mile links tend to suffer from latencies in excess of
classroom use is granted without fee provided that copies are not made or distributed         100ms when under load [5, 37]. Unlike non-interactive video
for profit or commercial advantage and that copies bear this notice and the full cita-        streaming, buffering is not possible for interactive gaming.
tion on the first page. Copyrights for components of this work owned by others than              Instead, we propose to mitigate wide-area latency via spec-
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
                                                                                              ulative execution. Our system, Outatime,1 delivers real-time
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.                                   gaming interactivity as fast as — and in some cases, even
MobiSys’15, May 18–22, 2015, Florence, Italy.
Copyright c 2015 ACM 978-1-4503-3494-5/15/05 ...$15.00.                                       1
http://dx.doi.org/10.1145/2742647.2742656.                                                        Outatime : License plates of a car capable of time travel.
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
faster than — traditional local client-side execution, despite       siveness on Outatime when operating at up to 120ms RTT
latencies up to 120ms.                                               when compared head-to-head to a system with no latency.
   Outatime’s basic approach is to employ speculative execu-         Latencies up to 250ms RTT were even acceptable for less ex-
tion to render multiple possible frame outputs which could           perienced players. Moreover, unlike in standard cloud gam-
occur RTT milliseconds in the future. While the funda-               ing systems, Outatime players’ in-game skills performance
mentals of speculative execution are well-understood, the            did not drop off as RTT increased up to 120ms. Over-
dynamism and sensitivity of graphically intensive twitch-            all, player surveys scored Outatime gameplay highly. We
based interaction makes for stringent demands on specula-            have also deployed Outatime to the general public on lim-
tion. Dynamism leads to rapid state space explosion. Sen-            ited occasions and received positive reception [14, 19, 27, 39].
sitivity means users are able to (visually) identify incorrect       While we have focused our evaluation on desktop and laptop
states. Outatime employs the following techniques to man-            clients experiencing high network latency, our techniques are
age speculative state in the face of dynamism and sensitivity.       broadly applicable to phone and tablet clients as well. Outa-
Future State Prediction: Given the user’s historical ten-            time’s only client prerequisites are video decode and modest
dencies and recent behavior, we show that some categories            GPU capabilities which are standard in today’s devices.
of user actions are highly predictable. We develop a Markov-            The remainder of the paper is organized as follows. §2
based prediction model that examines recent user input to            reviews game architectures and the impact of latency. §3
forecast expected future input. We use two techniques to im-         presents an overview of the Outatime architecture. §4 and
prove prediction quality: supersampling of input events, and         §5 detail our two main methods of speculation. §6 and §7
Kalman filtering to improve users’ perception of smoothness.         discusses how we save, load and compress state. §8 intro-
                                                                     duces application to multiplayer. §9 covers the implemen-
State Approximation: For some types of mispredictions,
                                                                     tation. §10 evaluates Outatime via user study and perfor-
Outatime approximates the correct state by applying error
                                                                     mance benchmarks. §11 covers related work and §13 dis-
compensation on the (mis)predicted frame. The resulting
                                                                     cusses implications of the work.
frame is very close to what the client ought to see. Our mis-
prediction compensation uses image-based rendering (IBR),
a technique that transforms pre-rendered images from one             2.      BACKGROUND & IMPACT OF LATENCY
viewpoint to a different viewpoint using only a small amount             The vast majority of game applications are structured
of additional 3D metadata.                                           around the game loop, a repetitive execution of the follow-
   Certain user inputs (e.g., firing a gun) cannot be easily         ing stages: 1) read user input; 2) update game state; and
predicted. For these, we use parallel speculative executions         3) render and display frame. Each iteration of this loop is
to explore multiple outcomes. However, the set of all possi-         a logical tick of the game clock and corresponds to 32ms
ble states over long RTTs can be very large. To address this,        of wall-clock time for an effective frame rate of 30 frames
we approximate the state space by subsampling it, and time           per second (fps).2 The time taken for one iteration of the
shifting actual event streams to match event streams that            game loop is the frame time. Frame time is a key metric
are “close enough” i.e., below players’ sensitivity thresholds.      for assessing interactivity since it corresponds to the delay
These techniques greatly reduces the state space with mini-          between a user’s input and observed output.
mal impact on the quality of interaction, thereby permitting             Network latency has an acute effect on interaction for
speculation within a reasonable deadline.                            cloud gaming. In standard cloud gaming, the frame time
State Checkpoint and Rollback: Core to Outatime is                   must include the additional overhead of the network RTT,
a real-time process checkpoint and rollback service. Check-          as illustrated in Figure 1a. Let time be discretized into 32ms
point and rollback prevents mispredictions from propagat-            clock ticks, and let the RTT be 4 ticks (128ms). At t5 , the
ing forward. We develop a hybrid memory-page based and               client reads user input i5 and transmits it to the server. At
object-based checkpoint mechanism that is fast and well-             t7 , the server receives the input, updates the game state,
suited for graphically-intensive real-time simulations like games.   and renders the frame, f5 . At t8 , the server transmits the
State Compression for Saving Bandwidth: Specula-                     output frame to the client, which receives it at t10 . Note
tion’s latency reduction benefits come at the cost of higher         that the frame time incurs the full RTT overhead. In this
bandwidth utilization. To reduce this overhead, we develop           example, an RTT of 128ms results in a frame time of 160
a video encoding scheme which provides a 40% bitrate re-             ms.
duction over standard encoding by taking advantage of the
visual coherence of speculated frames. The final bitrate             3.      GOALS AND SYSTEM ARCHITECTURE
overhead of speculation depends on the RTT. It is 1.9× for
120ms.                                                                  For Outatime, responsiveness is paramount; Outatime’s
   To illustrate Outatime in the context of fast interaction,        goal is to consistently deliver low frame times (< 32ms) at
we evaluate Outatime’s prediction techniques using two ac-           high frame rate (> 30fps) even in the face of long RTTs.
tion games where even small latencies are disadvantageous.           In exchange, we are willing to transmit a higher volume of
Doom 3 is a twitch-based first person shooter where re-              data and potentially even introduce (small and ephemeral)
sponsiveness is paramount. Fable 3 is a role playing game            visual artifacts, ideally sufficiently minor that most players
with frequent fast action combat. Both are high-quality,             rarely notice.
commercially-released games, and are very similar to phone              The basic principle underlying Outatime is to specula-
and tablet games in the first person shooter and role playing        tively generate possible output frames and transmit them to
genres, respectively.                                                the client a full RTT ahead of the client’s actual correspond-
   Through interactive gamer testing, we found that even ex-         ing input. As shown in Figure 1b, the client sends input as
perienced players perceived only minor differences in respon-        2     1
                                                                                  ≈ 32ms for mathematical convenience.
                                                                         30f ps
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
TICK = 33ms                                                                                                     TICK = 33ms

                                                                  i5 => f5: rendering for t5                                                     i0 => i’1, i’2, …, i’5 => f’5: prediction for t5

         Server                                                                                                          Server
                                                                     t7        t8     t9                                                         t2    t3    t4

                                       ….      ….                          ….        ….                                                               ….    ….                        ….     ….
         Client                                                                                                          Client
                   t0       t1    t2                   t5    t6      t7                     t10    t11   t12                      t0     t1      t2                 t5     t6    t7                 t10   t11   t12

                                                      i5: input for t5                     f5: frame for t5                       i0: input for t0                 i5 f’5: frame for t5
                                                                      frame time                                                                                  frame time

        (a) Standard cloud gaming: Frame time depends on net latency.                                                                         (b) Outatime: Frame time is negligible.

                                 Figure 1: Comparison of frame delivery time lines. RTT= 4 ticks, server processing time = 1 tick.

                   Client              Server                                                                           impulse, consists of events that are inherently sporadic such
                                         Restore from                  Update               Checkpoint to
                                                                                                                        as firing a weapon or activating an object, yet are fundamen-
              Sample Input                                                                                              tal to the player’s perception of responsiveness. For exam-
                                        Mis-speculation*             Simulation             Updated State
                                                                                                                        ple, in first person shooters, instantaneous weapon firing is
                                                                                                                        core to gameplay. Unlike navigation inputs, the sporadic na-
                                            Navigation: Supersampled                  Impulse: Parallel
                                               Markov Prediction                    Timeline Speculation
                                                                                                                        ture of impulse events makes them less amenable to predic-
                                                                                                                        tion. Instead, Outatime generates parallel speculations for
              Display Frame                                                                                             multiple possible future impulse time lines. To tame state
                                                  RTT < threshold?                  Event Time Shifting                 space explosion, Outatime subsamples the state space and
              Image-Based                   YES                           NO                                            time shifts impulse events to the closest speculated timeline.
               Rendering                                                                                                This enables Outatime to provide the player the perception
                                       Kalman Shake           Partial Cube Map +
                 Decode                 Reduction*          Depth Map Rendering*                                        that impulse is handled instantaneously. Besides navigation
                Bitstream                                                                                               and impulse, we classify other input that is slow relative to
                                                                                                                        RTT as delay tolerant. One typical example of delay toler-
              RX Bitstream             TX Bitstream           Joint Video Encoding                                      ant input is activating the heads-up display. Delay tolerant
                                                                                                  * As needed           input is not subject to speculation, and we discuss how it is
                                                                                                                        handled in §5.
        Figure 2: The Outatime Architecture. Bold boxes represent                                                          Figure 2 also shows how Outatime’s server and client are
        the main areas of this paper’s technical focus.                                                                 equipped to deal with speculations that are occasionally
                                                                                                                        wrong. The server continually checkpoints valid state and
                                                                                                                        restores from (mis-)speculated state at 30 fps. The client ex-
        before; at t0 , the client sends the input i0 which happens to                                                  ecutes IBR — a very basic graphics procedure — to compen-
        be the input generated more than one RTT interval prior                                                         sate for navigation mispredictions when they occur. Other-
        to t5 . The server receives i0 at t2 , computes a sequence of                                                   wise, Outatime, like standard cloud gaming systems, makes
        probable future input up to one RTT later as i01 , i02 , ..., i05 (we                                           minimal assumptions about client capabilities. Namely, the
        use 0 to denote speculation), renders its respective frame f50 ,                                                client should be able to perform standard operations such as
        and sends these to the client. Upon reception at the client                                                     decode a bitstream, display frames and transmit standard
        at time t5 , the client verifies that its actual input sequence                                                 input such as button, mouse, keyboard and touch events. In
        recorded during the elapsed interval matches the server’s                                                       contrast, high-end games that run solely on a client device
        predicted sequence: i1 = i01 , i2 = i02 , ..., i5 = i05 . If the input                                          can demand much more powerful CPU and GPU processing,
        sequences match, then the client can safely display f50 with-                                                   as we show in §10.
        out modification because we ensure that the game output is
        deterministic for a given input [11]. If the input sequence
        differs, the client approximates the actual state by applying
        error compensation to f50 and displays a corrected frame.
        We describe error compensation in detail in §4.5. Unlike                                                        4.        SPECULATION FOR NAVIGATION
        in standard cloud gaming where clients wait more than one                                                          Navigation speculation entails predicting a sequence of fu-
        RTT for a response, Outatime immediately delivers response                                                      ture navigation input events at discrete time steps. Hence,
        frames to the client after the corresponding input.                                                             we use a discrete time Markov chain for navigation infer-
           Speculation performance in Outatime depends upon be-                                                         ence. We first describe how we applied the Markov model
        ing able to accurately predict future input and generate its                                                    to input prediction, and our use of supersampling to improve
        corresponding output frames. Outatime does this by iden-                                                        the inference accuracy. Next, we refine our prediction in one
        tifying two main classes of game input, and building specu-                                                     of two ways, depending on the severity of the expected er-
        lation mechanisms for each, as illustrated in Figure 2. The                                                     ror. We determine the expected error as a function of RTT
        first class, navigation, consists of input events that control                                                  from offline training. When errors are sufficiently low (typi-
        view (rotation) and movement (translation) and modify the                                                       cally corresponding to RTT< 40ms), we apply an additional
        player’s field of view. Navigation inputs tend to exhibit                                                       Kalman filter to reduce video “shake”). Otherwise, we use
        continuity over short time windows, and therefore Outatime                                                      misprediction compensation on the client to post-process the
        makes effective predictions for navigation. The second class,                                                   frame rendered by the server.
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
4.1    Basic Markov Prediction                                     its more frenetic gameplay. Based on subjective assessment,
   We construct a Markov model for navigation. Time is             prediction error below 4◦ is under the threshold at which
quantized, with each discrete interval representing a game         output frame differences are perceivable.
tick. Let the random variable navigation vector Nt represent          Based on these results, we make two observations. First,
the change in 3-D translation and rotation at time t:              for RTT ≤ 40ms (where 98% and 93% of errors are less than
              Nt = {δx,t , δy,t , δz,t , θx,t , θy,t , θz,t }      4◦ for Doom 3 and Fable 3 respectively), errors are suffi-
                                                                   ciently minor and infrequent. Note that the client can de-
Each component above is quantized. Let nt represent an             tect the magnitude of the error (because it knows the ground
actual empirical navigation vector received from the client.       truth), and drop any frames with excessive error. A frame
Our state estimation problem is to find the maximum like-          rate drop from 30fps to 30 × 0.95 = 28.5fps is unlikely to af-
lihood estimator N̂t+λ where λ is the RTT.                         fect most players’ perceptions. For RTT > 40ms, we require
   Using the Markov model, the probability distribution of         additional misprediction compensation mechanisms. Before
the navigation vector at the next time step is dependent           discussing both of these cases in turn, we first address the
only upon the navigation vector from the current time step:        question of how much training is needed for successful ap-
p(Nt+1 |Nt ). We predict the most likely navigation vector         plication of the predictive model.
N̂t+1 at the next time step as:
               N̂t+1 = E[p(Nt+1 |Nt = nt )]                        4.3    Bootstrap Time
                    = argmax p(Nt+1 |Nt = nt )                        Construction of a reasonable Markov Model requires suf-
                        Nt+1
                                                                   ficient training data collected during an observation period.
where Nt = nt indicates that the current time step has been        Figure 5 shows that prediction error improves as observation
assigned a fixed value by sampling the actual user input           time increases from 30 seconds to 300 seconds, after which
nt . In many cases, the RTT is longer than a single time           the prediction error distribution remains stable. Somewhat
step (32ms). To handle this case, we predict the most likely       surprisingly, having test players different from training play-
value after one RTT as:              Y                             ers only marginally impacts prediction performance as long
  N̂t+λ = argmax p(Nt+1 |Nt = nt )         p(Nt+i+1 |Nt+i )        as test and train players are of similar skill level. Therefore,
            Nt+λ
                                     i=1..λ−1                      we chose to use a single model per coarse-grained skill level
where λ represents the RTT latency expressed in units of           (novice or experienced) which is agnostic of the player.
clock ticks.
  Our results indicate that the Markov assumption holds up         4.4    Shake Reduction with Kalman Filtering
well in practice: namely, Nt+1 is memoryless (i.e., indepen-          While the Markov model yields high prediction accuracy
dent of the past given Nt ). In fact, additional history in        for RTT< 40ms, minor mispredictions can introduce a dis-
the form of longer Markov chains did not show a measur-            tracting visual effect that we describe as video shake. As a
able benefit in terms of prediction accuracy. Rather than          simple example, consider a single dimension of input such
constructing a single model for the entire navigation vector,      as yaw. The ground truth over three frames may be that
instead we treat each component of the vector N indepen-           the yaw remains unchanged, but the prediction error might
dently, and construct six separate models. The benefit of          be +2◦ , −3◦ , +3◦ . Unfortunately, the user would perceive a
this approach is that less training is required when estimat-      shaking effect because the frames would jump by 5◦ in one
ing N̂ , and we observed that this assumption of treating the      direction, and then 6◦ in another. From our experience with
vector components independently does not hurt prediction           early prototypes, the manifested shakiness was sufficiently
accuracy. Below in §4.3, we discuss the issue of training in       noticeable so as to reduce playability.
more detail.                                                          We apply a Kalman filter [25] in order to compensate for
4.2    Supersampling                                               video shake. The filter’s advantage is that it weighs esti-
                                                                   mates in proportion to sample noise and prediction error.
   We further refine our navigation predictions by supersam-
                                                                   Conceptually, when errors in past predictions are low rela-
pling: sampling input at a rate that is faster than the game’s
                                                                   tive to sample noise, predictions are given greater weight for
usage of the input. Supersampling helps with prediction ac-
                                                                   state update. Conversely, when measurement noise is low,
curacy because it lowers sampling noise. To construct a
                                                                   samples make greater contribution to the new state. For
supersampled Markov model, we first poll the input device
                                                                   space, we omit technical development of the filter for our
at the fastest rate possible. This rate is dependent on the
                                                                   problem. One interesting filter modification we make is that
specific input device. It is at least 100Hz for touch digitizers
                                                                   we extend the filter to support error accumulation over vari-
and at least 125Hz for standard mice. With a 32ms clock
                                                                   able RTT time steps; samples are weighed against an RTT’s
tick, we can often capture at least four samples per tick.
                                                                   worth of prediction error.
We then build the Markov model as before. The inference is
                                                                      Using the Kalman formulation, we assume a linear model
similar to the equation above, with the main difference being
                                               +
                                                                   with Gaussian noise, that is:
the production operator incrementing by i = 0.25. A sum-                                  Nt+1 = ANt + ω                      (1)
mary of navigation prediction accuracy from the user study
described in §10 is shown in Figure 3. Most dimensions of                                    nt = Nt + ν                      (2)
rotational and translational displacement exhibit little per-      for some state transition matrix A and white Gaussian noise
formance degradation with longer RTTs. Yaw (θx ), player’s         ω. We assign A = {aij } as a matrix which encodes the
horizontal view angle, exhibits the most error, and we show        maximum likelihood Markov transitions where aij = 1 if
its performance in detail in Figure 4 for user traces collected    i = nt and j = N̂t+λ , and zero otherwise; ω = N (0, Qω ) is a
from both Doom 3 and Fable 3 at various RTTs from 40ms             normal distribution where Qω is the covariance matrix of the
to 240ms. Doom 3 exhibits greater error than Fable 3 due to        dynamic noise; ν = N (0, Qν ) is a normal distribution where
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
1
                                            1                                           1
                                                                                                                                 0.8
                                           0.8                                      0.8
                                                                                                                                                   30 seconds

                                                                                                                       CDF (%)
                                                                                                                                 0.6

                                                                          CDF (%)
                                 CDF (%)
                                           0.6                                      0.6                                                            60 seconds
                                                          RTT 40ms                                    RTT 40ms                                     90 seconds
                                                                                                                                 0.4
                                           0.4            RTT 80ms                  0.4               RTT 80ms                                     150 seconds
                                                          RTT 160ms                                   RTT 160ms                  0.2               300 seconds
                                           0.2                                      0.2
                                                          RTT 240ms                                   RTT 240ms                                    450 seconds
                                            0                                           0                                         0
                                                                                                                                   0              50           100
                                             0         20         40                     0           20           40
                                                                                          Prediction Error (degree)                    Prediction Error (degree)
                                              Prediction Error (degree)

Figure 3: Doom 3 Navigation               (a) Doom 3                      (b) Fable 3                                  Figure 5: Error Decreases
Prediction Summary.         Roll                                                                                       with More Observation Time.
(θz ) is not an input in Doom 3 Figure 4: Prediction for Yaw (θx ), the navigation  component                          Data is for Fable 3 at RTT=
and need not be predicted.       with the highest variance. Error under 4◦ is imperceptible.                           160ms.

Qν is the covariance matrix of the sampling noise. We assign
covariances Qω and Qν based on a priori observations.
   For the base case where the RTT is one clock tick, we
generate navigation predictions N̂t+1|t for time t + 1 given
observations up to and including t as follows:                                                            ,
                         N̂t+1|t = AN̂t|t                  (3)
Similarly, navigation error covariance predictions Pt+1|t are
                                                                                            Input Image        Input Depth                       Output Image
computed as follows.
                     Pt+1|t = APt|t AT + Qω                (4)                      Figure 6: Image-based Rendering Example w/ Fable 3. For-
   When a new measurement nt arrives, we update the esti-                           ward translation and leftward rotation is applied. The dog
mate of both the navigation prediction and its error covari-                        (indicated by green arrow) is closer and toward the center
ance at the current time step as:                                                   after IBR.
               N̂t|t = N̂t|t−1 + Gt (nt − N̂t|t−1 )        (5)
                        Pt|t = Pt|t−1 − Gt Pt|t−1          (6)                      4.5.1        Image-based Rendering
where Gt is the Kalman gain, which is:                                                 We compensate for mispredictions with image-based ren-
                  Gt = Pt|t−1 [Pt|t−1 + Qv ]−1             (7)                      dering (IBR). IBR is a means to derive novel camera view-
Lastly, we initialize the Kalman filter based on the a priori                       points from fixed initial camera viewpoints. The specific
distribution: N̂1|0 = E[N1 ].                                                       implementation of IBR that we use operates by having ini-
   Note the feedback loop between the error covariance and                          tial cameras capture depth information (f ∆ ) in addition to
Kalman gain weighting favors the measurement nt when                                2D RGB color information (f 0 ). It then applies a warp to
the prediction error is high and favors the prediction N̂t|t−1                      create a new 2D image (f 00 ) from f 0 and f ∆ [38]. Figure 6
when error is low. Moreover, the weighting naturally smooths                        illustrates an example whereby an original image and its
shaking caused by inaccurate prediction by favoring the ex-                         depth information is used to generate a new image from a
isting value of N̂ .                                                                novel viewpoint. Note that the new image is both trans-
   To extend λ timesteps for predictions of N̂t+λ|t , we com-                       lated and rotated with respect to the original, and contains
pute λ intermediate iterations of Equation (3),(4), (6) and                         some visual artifacts in the form of blurred pixels and low
(7). When a new measurement nt+λ is received, the accu-                             resolution areas when IBR is inaccurate.
mulation of prediction error Pt+λ|t will have increased the                            To enable IBR, two requirements must be satisfied. First,
gain Gt+λ . The large gain then favors the observation when                         the depth information must accurately reflect the 3D scene.
adjusting the estimate in Equation (5).                                             Fortunately, the graphics pipeline’s z-buffer precisely con-
                                                                                    tains per-pixel depth information and is already a byproduct
                                                                                    of standard rendering. Second, the original 2D scene must
                                                                                    be sufficiently large so as to ensure that any new view is
                                                                                    bounded within the original. To handle this case, instead of
                                                                                    rendering a normal 2D image by default, we render a cube
4.5   Misprediction Compensation with Image-                                        map [17] centered at the player’s position. As shown in Fig-
      based Rendering                                                               ure 7, the cube map draws a panoramic 360◦ image on the
   When RTT > 40ms, a noticeable fraction of navigation                             six sides of a cube. In this way, the cube map ensures that
input is mispredicted, resulting in users perceiving lack of                        any IBR image is within its bounds.
motor control. Our goal in misprediction compensation is                               Unfortunately, naı̈ve use of the depth map and cube map
for the server to generate auxiliary view data f ∆ alongside                        can lead to significant overhead. The cube map’s six faces
its predicted frame f 0 such that the client can reconstruct a                      are approximately3 six times the size of the original image.
frame f 00 that is a much better approximation of the desired
frame f than f 0 .                                                                  3
                                                                                        The original image is not square but rather 16:9 or 4:3.
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
TOP
  FRONT

                                                99% Error Coverage (degree)
                                                                              300
                                   CLIP                                              Doom 3, yaw
                                                                              250
 LEFT                                                                                Doom 3, pitch
                                                                              200    Fable 3, yaw
                                                                                     Fable 3, pitch
                                                                              150                                            (a) Visible Smears
                                                                              100

                                                                               50
                                    RIGHT
                                                                                0
                                                                                 0    100      200      300
(BACK                                                                                   RTT (ms)
not shown)                   BOTTOM            Figure 8: Angular coverage of                                                (b) Patched Smears
                                               99% of prediction errors is much
Figure 7:   Cube Map Example w/                less than 360◦ even for high                                    Figure 9: Misprediction’s visual artifacts ap-
Doom 3. Clip region shown.                     RTT.                                                            pear as smears which we mitigate.

The z-buffer is the same resolution as the original image,                                      player’s vertical view angle, is very narrow (because players
and depth information is needed for every cube face. Taken                                      hardly look up or down), and both Fable 3’s yaw and pitch
together, the total overhead is nominally 12×. This cost is                                     ranges are modest at under 80◦ even for RTT ≥ 300ms.
incurred at multiple points in the system where data size is                                    Even for Doom 3’s pronounced yaw range, only 225◦ of cov-
the main determinant of resource utilization, such as server                                    erage is needed at 250 ms. The clip parameters are also
rendering, encoding, bandwidth and decoding. We use the                                         applied to the depth map in order to similarly reduce its
following technique to reduce this overhead.                                                    size.
                                                                                                   In theory, compounding translation error on top of rota-
4.5.2    Clipped Cube Map                                                                       tion error can further expand the clip region. It turns out
                                                                                                that translation accuracy (see Figure 3) is sufficiently high
   We observe that it is unlikely that the player’s true view                                   to obviate consideration of accumulated translation error for
will diverge egregiously from the most likely predicted view;                                   the purposes of clipping.
transmitting a cube map that can compensate for errors in
360◦ is gratuitous. Therefore, we render a clipped cube map                                     4.5.3     Patching Visual Artifacts
rather than a full cube map. The percentage of clipping
depends on the expected variance of the prediction error. If                                      Lastly, we add a technique to mitigate visual artifacts.
the variance is high, then we render more of the cube. On                                      These occasionally appear when the navigation prediction
the other hand, if the prediction variance is low, we render                                   is wrong and there is significant depth disparity between
less of the cube. The dotted line in Figure 7 marks the clip                                   foreground and background objects. For example, in Fig-
region for an example rendering.                                                               ure 9a, the floating stones are much closer to the camera
   In order to size the clip, we define a cut plane c such that                                than the lava in the background. As a result, IBR reveals
the clipped cube bounds the output image with probability                                      “blind spots” behind the stones as indicated by the green
1 − . The cut plane then is a function of the variance of the                                 arrows, which are manifest as visual “smears”.
prediction, and hence the partial cube map approaches a full                                      Our blind spot patching technique mitigates this problem,
cube when player movement exhibits high variance over the                                      as seen in Figure 9b. As part of IBR, we extrude the object
subject RTT horizon. To calculate c, we choose not a single                                    borders in the depth map with lightweight image processing
predicted Markov state, but rather a set N of k states such                                    which propagates depth values from foreground borders onto
that the set covers 1 −  of the expected probability density:                                 the background. However, we keep the color buffer the same.
                   X                                                                           This way blind spots will exhibit less depth disparity and
     N = {nit+1 |       p(Nt+1 = nit+1 |Nt = nt ) ≥ 1 − }                                     will share adjacent colors with the background, appearing as
                 i=1..k                                                                        more reasonable extensions of the background. The results
The clipped cube map then only needs to cover the range                                        are more pleasing scenes especially when near and far objects
represented by the states in N . For a single dimension such                                   are intermingled.
as yaw, the range is then simply the largest distance dif-
ference, and the cut plane along the yaw axis is defined as
follows:                                                                                        5.    SPECULATION FOR IMPULSE EVENTS
       cyaw = max yaw(nit+1 ) − min yaw(njt+1 )                                                   The prototypical impulse events are fire for first person
               nit+1 ∈N               j
                                    nt+1 ∈N
                                                                                                shooters, and interact (with other characters or objects)
This suffices to cover 1 −  of the probable yaw states.                                        for role playing games. We define an impulse event as be-
  In practice, error ranges are significantly less than 360◦                                    ing registered when its corresponding user input is activated.
and therefore the size of the cube map can be substantially                                     For example, a user’s button activation may register a fire
reduced. Figure 8 shows the distribution of cyaw and cpitch                                     event.
in Fable 3 and Doom 3 for  = 0.01, meaning that 99%                                              The objective for impulse speculation is to respond quickly
of mispredictions are compensated. Doom 3’s pitch range,                                        to player’s impulse input while avoiding any visual inconsis-
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
Impulse Timeline                                                                                       5.2     Time-Shifting
33ms                                                RTT = 8 ticks
                                                                                                                    To address the shortcomings of subsampling, time-shifting
                                                                                                                 causes activations to be registered either earlier or later in
                                                                                                                 time in order to align them with the nearest subsampled
         t0           t1          t2        t3          t4          t5          t6       t7          t8
                                                                                                                 tick. Time shifting to an earlier time is feasible using specu-
                           shift forward         shift backward          shift forward                           lation because the shift occurs on a speculative sequence at
                                                                                                                 the server — not an actual sequence that has already been
       Speculative Sequences                                                                                     committed by the client. Put another way, as long as the
                                                                                          X,X       f’,18        client has not yet displayed the output frame at a particular
                                           X,?                                                                   tick, it is always safe to shift an event backwards to that
                                                                                         X,~X       f’,28
         ?,?                                                                                                     tick.
                                                                                         ~X,X       f’,38
                                           ~X,?
                                                                                                                    Specifically, for any integer k, an activation issued be-
               X: activation
              ~X: no activation
                                                                                         ~X,~X      f’,48        tween tσ∗k− σ2 and tσ∗k−1 is deferred until tσ∗k . An activa-
                                                                                                 Speculative     tion issued between tσ∗k+1 and tσ∗k+ σ2 −1 is treated as if it
                                                                                                   frames        had arrived earlier in time at tσ∗k . Figure 10a illustrates
                       (a) Speculative timeline and state branches
                                                                                                               14combined subsampling and time-shifting, where the activa-

                                                                                                                 tions that occur at t1 through t2 are shifted later to t3 and
                                                                                                                 activations that occur at t4 are shifted earlier to t3 . The
                                                                                                                 corresponding state tree in Figure 10a shows the possible
                                                                                                                 event sequences and four resulting speculative frames, f801 ,
                                                                                                                 f802 , f803 and f804 . Note that it is not necessary to handle ac-
                                                                                                                 tivations at t0 within the illustrated 8 tick window because
       (b) ∼X, ∼X                      (c) ∼X, X              (d) X,∼X                        (e) X,X            speculations that started at earlier clock ticks (e.g. at t−1 )
                                                                                                                 would have covered them.
       Figure 10: Subsampling and time-shifting impulse events
                                                                                                                    The ability to time-shift both forward and backward al-
       allows the server to bound speculation to a maximum of
                                                                                                                 lows us to further halve the subsampling rate to double σ
       four sequences even for RTT= 256ms. Screenshots (b) –
                                                                                                                 without impacting player perception. Using 60ms as the
       (e) show speculative frames corresponding to four activation
                                                                                                                 threshold of player perception [6, 33], we note that time-
       sequences of weapon fire and no fire.
                                                                                                                 shifting forward alone permits a subsampling period of σ = 2
                                                                                                                 (64ms) with an average shift of 32ms. With the added abil-
       tencies. For example, in a first person shooter, weapons                                                  ity to time-shift backward as well, we can support a subsam-
       should fire quickly when triggered, and enemies should not                                                pling period of σ = 4 (128ms) yet still maintain an average
       reappear shortly after dying. The latter type of visual (and                                              shift of only 32ms. For σ = 4 and RTT≤ 256ms, we gen-
       semantic) inconsistency is disconcerting to players, yet may                                              erate a maximum of four speculative sequences as shown
       occur when mispredictions occur in a prediction-based ap-                                                 in Figure 10a. When RTT > 256ms, we further lower the
       proach. Therefore, we employ a speculation technique for                                                  subsampling frequency sufficiently to ensure that we bound
       impulse that differs substantially from navigation specula-                                               speculation to a maximum of four sequences. Specifically,
       tion. Rather than attempt to predict impulse events, instead                                              σ = λ2 . While this can potentially result in users noticing
       we explore multiple outcomes in parallel.                                                                 the lowered sample rate, it allows us to cap the overhead of
          An overview of Outatime’s impulse speculation is as fol-                                               speculation.
       lows. The server creates a speculative input sequence for all
       possible event sequences that may occur within one RTT,                                                 5.3     Advanced Impulse Events
       executes each sequence, renders the final frame of each se-                                                While binary impulse events are the most common, some
       quence, and sends the set of speculative input sequences and                                            games provide more options. For example, a Fable 3 player
       frame pairs to the client. Upon reception, the client chooses                                           may cast a magic spell either directionally or unidirection-
       the event sequence that matches the events that actually                                                ally which is a ternary impulse event due to mutual ex-
       transpired, and displays its corresponding frame.                                                       clusion. Some first person shooters support primary and
          As RTT increases, the number of possible sequences grows                                             secondary fire modes (Doom 3 does not) which is also a
       exponentially. Consider an RTT of 256ms, which is 8 clock                                               ternary impulse event. With a ternary (or quaternary) im-
       ticks. An activation may lead to an event registration at                                               pulse event, the state branching factor is three (or four)
       any of the 8 ticks, leading to an overwhelming 28 possible                                              rather than two at every subsampling tick. With four par-
       sequences. In general, 2λ sequences are possible for an RTT                                             allel speculative sequences and a subsampling interval of
       of λ ticks. We use two state approximations to tame state                                               σ = 128ms, Outatime is able to support RTT ≤ 128ms
       space explosion: subsampling and time-shifting.                                                         for ternary and quaternary impulse events without lowering
                                                                                                               the subsampling frequency.
       5.1        Subsampling
          We reduce the number of possible sequences by only per-                                              5.4     Delay Tolerant Events
       mitting activations at the subsampling periodicity σ which                                                We classify any input event that is slow relative to RTT
       is a periodicity greater than one clock tick. The benefit is                                            as delay tolerant. We use a practical observation to simplify
                                             λ
       that the state space is reduced to 2 σ . The drawback is                                                handling of delay tolerant events. According to our mea-
       that subsampling alone would cause activations not falling                                              surements on Fable 3 and Doom 3, delay tolerant events
       on the sampling periodicity to be lost, which would appear                                              exhibited very high cool down times that exceeded 256ms.
       counter-intuitive to users.                                                                             The cool down time is the period after an event is registered
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
during which no other impulse events can be registered. For       pages are discarded. During speculation, the allocator also
example, in Doom 3, weapon reloading takes anywhere from          defers page deallocation resulting from object delete until
1000ms to 2500ms during which time the weapon reload              commit because deleted objects may need to be restored if
animation is shown. Weapon switching takes even longer.           the speculation is later invalidated.
Fable 3 delay tolerant events have even higher cool down             For object-level checkpointing, we track lifetimes of ob-
times. We take the approach that whenever a delay toler-          jects rather than pages for any object the developer has
ant input is activated at the client, it is permissible to miss   marked for tracking. To rollback a speculation, we delete
one full RTT of the event’s consequences, as long as we can       any object that did not exist at the checkpoint, and restore
compress time after the RTT. The time compression pro-            any objects that were deleted during speculation. More-
cedure works as follows: for a delay tolerant event which         over, we take advantage of inverse functions when available
displays τ frames worth of animation during its cool down         to quickly undo object state changes. Many SSOs in games
(e.g. a weapon reload animation which takes τ frames), we         expose custom inverse functions that undo the actions of
may miss λ frames due to the RTT. During the remaining            previous state change. For example, for geometry objects
τ − λ frames, we choose to compress time by sampling τ − λ        that speculatively execute a move(∆x) function, a suitable
frames uniformly from the original animation sequence τ .         inverse is move(−∆x). Identifying and tracking invertible
The net effect is that delay tolerant event animations ap-        functions is a trade-off in developer effort; we found the sav-
pear to play at fast speed. In return, we are assured that        ings in checkpoint and rollback time to be very significant
all events are properly processed because the delay tolerant      in certain cases (§10).
event’s cool down is greater than the RTT. For example,
weapon switching or reloading immediately followed by fir-
ing is handled correctly.                                         7.    BANDWIDTH COMPRESSION
                                                                     Navigation and impulse speculation generate additional
6.   FAST CHECKPOINT AND ROLLBACK                                 frames to transmit from server to client. As an example, con-
   As with other systems that perform speculation [30, 42],       sider impulse speculation which for RTT of 256ms transmits
Outatime uses checkpoint and restore to play forward a spec-      four speculative frames for four possible worlds. Nominally,
ulative sequence, and roll back the sequence if it turns out      this bandwidth overhead is four times that of transmitting
to be incorrect. In contrast to these previous systems, our       a single frame.
continuous 30fps interactivity performance constraints are           We can achieve a large reduction in bandwidth by ob-
qualitatively much more demanding, and we highlight how           serving that frames from different speculations share sig-
we have managed these requirements.                               nificant spatial and temporal coherence. Using Figure 10a
   Unique among speculation systems, we use a hybrid of           as an example, f801 and f802 are likely to look very simi-
page-level checkpointing and object-level checkpointing. This     lar, with the only difference being two frames’ worth of a
is because we need very fast checkpointing and restore; whereas   weapon discharge animation in f801 . Corresponding screen-
page-level checkpointing is efficient when most objects need      shots Figure 10b–10e show that the surrounding environ-
checkpointing, object-level checkpointing is higher perfor-       ment is largely unchanged, and therefore the spatial coher-
mance when few objects need checkpointing. In general,            ence is often high. In addition, when Outatime speculates
it is only necessary to checkpoint Simulation State Objects       for the next four frames, f901 -f901 , f901 is likely to look similar
(SSOs): those objects which reproduce the world state. Check-     not only to f801 , but also to f802 , and therefore the temporal
pointing objects which have no bearing on the simulation,         coherence is also often high. Similarly, navigation specula-
such as buffer contents, only increase runtime overhead. There-   tion’s clipped cube map faces often exhibit both temporal
fore, we choose either object-level or page-level checkpoint-     and spatial coherence.
ing based on the density of SSOs among co-located objects.           Outatime takes advantage of temporal and spatial coher-
We currently make the choice between page-level or object-        ence to reduce bandwidth by joint encoding of speculative
level manually at the granularity of the binaries (executables    frames. Encoding is the server-side process of compressing
and libraries), though it is also conceivable to do so auto-      raw RGB frames into a compact bitstream which are then
matically at the level of the compilation units.                  transmitted to the client where they are decoded and dis-
   For page-level checkpointing, we intercept calls to the de-    played. A key step of standard codecs such as H.264 is to
fault libc memory allocator with a version that implements        divide each frame into macroblocks (e.g., 64 × 64 bit). A
page-level copy-on-write. At the start of a speculation (at       search process then identifies macroblocks that are equiv-
every clock tick for navigation and at each σ clock ticks for     alent (in some lossy domain) both intra-frame and inter-
impulse), the allocator marks all pages read-only. When a         frame. In Outatime, we perform joint encoding by extend-
page fault occurs, the allocator makes a copy of the original     ing the search process to be inter-speculation; macroblocks
page and sets the protection level of the faulted page to read-   across streams of different speculations are compared for
write. When new input arrives, the allocator invalidates and      equivalence. When an equivalency is found, we need only
discards some speculative sequences which do not match the        transmit the data for the first macroblock, and use pointers
new input. For example in Figure 10a, if no event activa-         to it for the other macroblocks.
tion occurs at t3 , then the sequences corresponding to f801         The addition of inter-speculation search does not change
and f802 are invalid. State changes of the other speculative      the client’s decoding complexity but does introduce more
sequences up until t3 are committed. In order to correctly        encoding complexity on the server. Fortunately, modern
roll back a speculation, the allocator copies back the original   GPUs are equipped with very fast hardware accelerated en-
content of the dirty pages using the copies that it created.      coders [23, 31]. We have reprogrammed these otherwise idle
The allocator also tracks any pages created as a result of        hardware accelerated capabilities for our speculation’s joint
new object allocations since the last checkpoint. Any such        encoding.
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
8.    MULTIPLAYER                                                        executing, and returns framebuffers as encoded bitstream
   Thus far, we have described Outatime from the perspec-                packets to the master using shared memory. To support nav-
tive of a single user. Outatime works in a straightforward               igation speculation, we add an additional slave command:
manner for multiplayer as well, though it is useful to clarify           rendercube, which produces the cubemap and depth maps
some nuances. As a matter of background, we briefly review               necessary for IBR. The number of slaves spawned depends
distributed consistency in multiplayer systems. The stan-                on the network latency. When RTT exceeds 128 ms, four
dard architecture of a multiplayer gaming system is com-                 slaves can cover four speculative state branches. Otherwise,
posed of traditional thick clients at the end hosts and a game           three slaves suffice. The client is a simple thin client with
state coordination server which reconciles distributed state             the ability to perform IBR.
updates to produce an eventually consistent view of events.                 We implemented bandwidth compression with hardware-
For responsiveness, each client may perform local dead reck-             accelerated video encode and decode. The server-side joint
oning [8,16]. As an example, player one locally computes the             video encode pipeline and client-side decode pipeline used
position of player two based off of last reported trajectory.            the Nvidia NVENC hardware encoder and Nvidia Video
If player one should fire at player two who deviates from                Processor decoder, respectively [31]. The encode pipeline
the dead-reckoned path, whether a hit is actually scored                 consists of raw frame capture, color space conversion and
depends on the coordination server’s reconciliation choice.              H.264 bitstream encoding. The decode pipeline consists of
Reconciliation can be crude and disconcerting when local                 H.264 bitstream decoding and color space conversion to raw
dead-reckoned results are overridden; users perceive glitches            frames. We implemented the client-side IBR function as an
such as: 1) an opponent’s avatar appears to teleport if the              OpenGL GLSL shader which consumes decoded raw frames
opponent does not follow the dead-reckoned path, 2) a player             and produces a compensated final frame for display.
in a firefight fires first yet still suffers a fatality, 3) “sponging”      Lastly, we also examined the source code for Unreal En-
occurs — a phenomenon whereby a player sees an opponent                  gine [16], one of several widely used commercial game en-
soak up lots of damage without getting hurt [4].                         gines upon which many games are built, and verified that
   With multiplayer, Outatime applies the architecture of                the modifications described above are general and feasible
Figure 2 to clients without altering the coordination server:            there as well. Specifically, support exists for rendering mul-
end hosts run thin clients and servers run end hosts’ cor-               tiple views, the key primitive of rendercube [15]. Other
responding Outatime server processes. The coordination                   key operations such as checkpoint, rollback and determinism
server — which need not be co-located with the Outatime                  can be implemented at the level of the C++ library linker.
server processes — runs as in standard multiplayer. Ou-                  Therefore, we suggest speculative execution is broadly ap-
tatime’s multiplayer consistency is equivalent to standard               plicable across commercial titles, and can be systematized
multiplayer’s because dead-reckoning is still used for oppo-             with additional work.
nents’ positions; glitches can occur, but they are no more or
less frequent than in standard multiplayer. As future work,              10.    EVALUATION
we are interested in extending Outatime’s speculative ap-                  We use both user studies and performance benchmark-
proach in conjunction with AI-led state approximations [7]               ing to characterize the cost and benefits of Outatime. User
to better remedy glitches that appear generally in any mul-              studies are useful to assess perceived responsiveness and vi-
tiplayer game.                                                           sual quality degradation, and to learn how macro-level sys-
                                                                         tem behavior impacts gameplay. Our primary tests are on
                                                                         Doom 3 because twitch-based gaming is very sensitive to la-
9.    IMPLEMENTATION                                                     tency. We confirm the results with limited secondary tests
  To prototype Outatime, we modified Doom 3 (originally                  on Fable 3. A summary of our findings are as follows.
366,000 lines of code) and Fable 3 (originally 959,000 lines
of code). Doom 3 was released in 2004 and open sourced                      • Based on subjective assessment, players — including
in 2011. Fable 3 was released in 2011. While both games                       seasoned gamers — rate Outatime playable with only
are several years old, we note that the core gameplay of first                minor impairment noticeable up to 128ms.
person shooters and role playing games upon which Outa-
                                                                            • Users demonstrate very little drop off of in-game skills
time relies has not fundamentally changed in newer games.
                                                                              with Outatime as compared to a standard cloud gam-
The following section’s discussion is with respect to Doom 3.
                                                                              ing system.
Our experience with Fable 3 was similar
  We have made the following key modifications to Doom 3.                   • Our real-time checkpointing and rollback service is a
To enable deterministic speculation, we made changes ac-                      key enabler for speculation, taking less than 1ms for
cording to [11] such as de-randomizing the random number                      checkpointing app state.
generator, enforcing deterministic thread scheduling, and re-
placing timer interrupts with non-time-based function call-                 • Speculation imposes increased demands on resource.
backs. To support impulse speculation, we spawn up to four                    At 128ms, bandwidth consumption is 1.97× higher
Doom 3 slaves, each of which is a modified instance of the                    than standard cloud gaming. However, frame rates
original game. Each slave accepts the following commands:                     are still satisfactorily fast at 52fps at 95th percentile.
advance consumes an input sequence and simulates game
logic accordingly; render produces a frame corresponding to              10.1    Experimental Setup
the current simulation state; undo discards any uncommitted                 We tested Outatime against the following baselines. Stan-
state; commit makes any input applied thus far permanent.                dard Thick Client consists of out-of-the-box Doom 3, which
Each slave receives instructions from our master process re-             is a traditional client-only application. Standard Thin Client
garding the speculation (i.e., input sequence) it should be              emulates the traditional cloud gaming architecture shown in
Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming
5
                                                                                                                                             120

                                                                                                                  Task Completion Time (s)
Mean Opinion Score

                     4.5

                      4                                                                                                                      100
                     3.5
                                                                                                                                              80
                      3

                     2.5                                                                                                                      60
                      2

                     1.5
                                                                                                                                              40       Outatime
                                Standard Thin Client                                                                                                   Standard Thin Client
                      1
                                Outatime                                                                                                      20       Standard Thick Client
                     0.5        Standard Thick Client
                           0   50   100   150   200   250   300   350   400                                                                    0
                           Network Latency RTT (ms)                                                                                                0   100   200     300       400
                                                                                                                                                         RTT (ms)
Figure 11: Impact of Latency on User
                                                                              Figure 12: Remaining Health                                    Figure 13: Task Completion Time
Experience

Figure 1a, where Doom 3 is executed on a server without                                        10.2    User Perception of Gameplay
speculation, and the player submits input and views output                                       The user study centered around three criteria.
frames on a client. The server consists of an HP z420 ma-
chine with quad core Intel i7, 16GB memory, and an Nvidia                                         • Mean Opinion Score (MOS): Participants assign a sub-
GTX 680 GPU w/4GB memory. For Thin Client and Out-                                                  jective 1–5 score on their experience where 5 indicates
atime, we emulated a network with a defined RTT. The em-                                            no difference from reference, 4 indicates minor differ-
ulation consisted of delaying input processing and output                                           ences, 3 indicates acceptable differences, 2 indicates
frames to and from the server and client by a fixed RTT.                                            annoying differences and 1 indicates unplayable. MOS
The client process was hosted on the same machine as the                                            is a standard metric in the evaluation of video and
server in order to finely control network RTT. We also used                                         audio communication services.
the same machine to run the Thick Client. User input was
issued via mouse and keyboard. We configured Doom 3 for
a 1024 × 1024 output resolution. We conducted formal user                                         • Skill Impact: We use the decrease in players’ in-game
studies with sixty-four participants overall over three sepa-                                       health as a proxy for the skill degradation resulting
rate study sessions. Two studies were for Doom 3 and con-                                           from higher latency.
sisted of twenty-three participants (Study A) and eighteen
participants (Study B). The third was for Fable 3 (Study C)
and is described in more detail later in this section. Both                                       • Task Completion Time: Participants are asked to fin-
Study A and Study B consisted of coworkers and colleagues                                           ish an in-game task in the shortest possible time under
who were recruited based on their voluntary response to a                                           varying latency conditions.
call for participation in a gaming study. The self-reported
age range was 24–42 (39 male, 2 female). The main dif-                                            Each participant first played a reference level on the Thick
ference between the studies was that Study B’s participants                                    Client system, during which time they had an opportunity
were drawn primarily from a local gaming interest group and                                    to familiarize themselves with game controls, as well as ex-
were compensated with a $5 gift card, whereas Study A’s                                        perience best-case responsiveness and visual quality. Next,
participants were not. Prior to engagement, all participants                                   they re-played the level three to ten times with either Outa-
were provided an overview of the study, and consented to                                       time or Thin Client and an RTT selected randomly from
participate in accordance with institutional ethics and pri-                                   {0ms, 64ms, 128ms, 256ms, 384ms}. Among the multiple
vacy policies. They also made a self-assessment regarding                                      replays, they also played once on Thick Client as a control.
their own video game skill at three granularities: 1) overall                                  Participants and the experimenter were blind to the system
video game experience, 2) experience with the first person                                     configuration during re-plays. Some of the participants re-
shooter genre, and 3) experience with Doom 3 specifically.                                     peated the entire process for a second level. We configured
While all Study A’s participants reported either Beginner                                      the level so that participants only had access to the fastest
(score=2) or No Experience (score=1) for Doom 3, Study                                         firing weapon so that any degradations in responsiveness
B’s participants self-reported as being “gamers” much more                                     would be more readily apparent.
frequently, with 72% having previously played Doom 3. For                                         After each re-play, participants were asked to rank their
both Study A and B, participants first played the unmod-                                       experience relative to the reference on an MOS scale ac-
ified game to gain familiarity. They then played the same                                      cording to three questions: (1) How was your overall user
map at various network latency settings with and without                                       experience? (2) How was the responsiveness of the con-
Outatime enabled. Participants were blind as to the network                                    trols? (3) How was the graphical visual quality? We also
latency and whether Outatime was enabled.                                                      solicited free-form comments and recorded in-game vocal ex-
                                                                                               clamations which turned out to be illuminating. Lastly, we
                                                                                               recorded general player statistics during play, such as re-
                                                                                               maining player health and time to finish the level.
You can also read