The Xbox One System on a Chip and Kinect Sensor
John Sell, Patrick O'Connor, Microsoft Corporation

This article has been accepted for publication in IEEE Micro but has not yet been fully edited. Some content may change prior to final publication.

1 Abstract

The System on a Chip at the heart of the Xbox One entertainment console is one of the largest consumer designs to date, with five billion transistors. The Xbox One Kinect image and voice sensor uses time of flight technology to provide high resolution, low latency, lighting-independent three-dimensional image sensing. Together they provide unique voice and gesture interaction with high performance games and other entertainment applications.

2 Terms

CPU     Central Processing Unit
DRAM    Dynamic Random Access Memory
DSP     Digital Signal Processor
GPU     Graphics Processing Unit
HDMI    High-Definition Multimedia Interface
MMU     Memory Management Unit
PCI(e)  Peripheral Component Interconnect (Express)
SoC     System on a Chip
SRAM    Static Random Access Memory

3 Xbox One System

The Xbox One system pictured in figure 1 includes the Kinect image and audio sensors, console, and wireless controller.

Figure 1: Xbox One Kinect, Console, and Wireless Controller

Figure 2 shows a block diagram of the system. The main SoC contains all of the principal computation components. The South Bridge chip expands the SoC input and output to access optical disc, hard disc, and flash storage, HDMI input, Kinect, and wireless devices.

Figure 2: Xbox One System

4 Main SoC

A single SoC departs from the initial implementations of previous high performance consoles. One chip enables the most efficient allocation of memory and other resources. It avoids the latency, bandwidth limitations, and power consumption of communicating between separate computation chips.

Microsoft collaborated with Advanced Micro Devices (AMD) to develop the SoC. SRAM and GPU circuits with redundancy comprise more than 50% of the 370-mm² chip, resulting in yield comparable to much smaller designs.

Figure 3 shows the SoC organization. The SoC provides simultaneous system and user services, video input and output, voice recognition, and three-dimensional image recognition.


Significant features include:
•  Unified, but not uniform, main memory
•  Universal host-guest virtual memory management
•  High bandwidth CPU cache coherency
•  Power islands matching features and performance to active tasks

Figure 3: SoC Organization

4.1 Main Memory

Main memory consists of 8 Gbytes of low cost DDR3 external DRAM and 32 Mbytes of internal SRAM. This provides the necessary bandwidth while saving power and considerable cost over wider or faster external DRAM-only alternatives.

Peak DRAM bandwidth is 68 Gbytes per second. Peak SRAM bandwidth ranges between 109 and 204 Gbytes per second, depending on the mix of transactions. Sustainable total peak bandwidth is about 200 Gbytes per second.
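As a rough cross-check of these figures (assuming the widely reported 256-bit DDR3-2133 DRAM interface and a 128-byte-per-cycle SRAM path at the 853 MHz GPU clock; neither width is stated in this article): 32 bytes × 2133 MT/s ≈ 68 Gbytes per second for DRAM, and 128 bytes × 853 MHz ≈ 109 Gbytes per second per SRAM direction, with mixed simultaneous reads and writes accounting for the higher SRAM figure.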

MMU hardware maps guest virtual addresses to guest physical addresses to physical addresses for virtualization and security. The implementation sizes the caching of fully translated page addresses, and uses large pages where appropriate, to avoid significant performance impact from the two-dimensional translation.
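The following minimal C sketch illustrates the two-dimensional walk under simplifying assumptions: a toy linear page table, 4 KB pages, and the guest walk reduced to a single lookup. (In real nested paging, each guest table access itself requires host translation, which is why the two-dimensional walk is expensive and why caching the end-to-end result matters.) None of these structures are the Xbox One MMU's actual formats.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12                          /* toy 4 KB pages */
    #define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

    typedef struct {
        uint64_t vpn;                              /* virtual page number */
        uint64_t pfn;                              /* page frame number */
    } pte_t;

    /* Toy linear page-table lookup; returns false on a page fault. */
    static bool lookup(const pte_t *pt, int n, uint64_t vpn, uint64_t *pfn)
    {
        for (int i = 0; i < n; i++)
            if (pt[i].vpn == vpn) { *pfn = pt[i].pfn; return true; }
        return false;
    }

    /* Guest virtual -> guest physical -> system physical: one walk per
     * dimension. Caching the end-to-end result (guest vpn -> system pfn)
     * avoids paying for both walks on every access; larger pages shrink
     * the tables being walked. */
    bool translate(const pte_t *guest_pt, int gn, const pte_t *host_pt, int hn,
                   uint64_t guest_va, uint64_t *system_pa)
    {
        uint64_t guest_pfn, system_pfn;
        if (!lookup(guest_pt, gn, guest_va >> PAGE_SHIFT, &guest_pfn))
            return false;
        if (!lookup(host_pt, hn, guest_pfn, &system_pfn))
            return false;
        *system_pa = (system_pfn << PAGE_SHIFT) | (guest_va & PAGE_MASK);
        return true;
    }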

System software manages physical memory allocation. System software and hardware keep page tables synchronized so that the CPU, GPU, and other processors can share memory and pass pointers rather than copying data, and so that a linear data structure in a GPU or CPU virtual space can have physical pages scattered across DRAM and SRAM. The unified memory system frees applications from the mechanics of where data is located, but GPU-intensive applications can specify which data should be in SRAM for best performance.

The GPU graphics core and several specialized processors share the GPU MMU, which supports 16 virtual spaces. PCIe input and output and the audio processors share the IO MMU, which supports virtual spaces for each PCI bus/device/function. Each CPU core has its own MMU (CPU access to SRAM maps through a CPU MMU and the GPU MMU).

The design provides 32 Gbytes per second peak DRAM access with hardware-maintained CPU cache coherency for data shared by the CPU, GPU, and other processors. Hardware-maintained coherency improves performance and software reliability.

The implementation restricts shared CPU-cache-coherent data (and PCIe and audio data, most of which is CPU-cache-coherent) to DRAM for simplification and cost savings. GPU SRAM access and non-CPU-cache-coherent DRAM access bypass CPU cache coherency checking.

4.2 CPU

The CPU contains eight AMD Jaguar single-thread 64-bit x86 cores in two clusters of four. The cores contain individual first-level code and data caches. Each cluster contains a shared 2 MB second-level cache.

The CPU cores operate at 1750 MHz in full performance mode. Each cluster can operate at a different frequency. The system selectively powers individual cores and clusters to match workload requirements.

Jaguar provides good performance and excellent power-performance efficiency.


The CPU contains minor modifications from earlier Jaguar implementations to support two clusters and increased CPU cache coherent bandwidth.

4.3 GPU

Figure 4 shows the graphics core and the independent processors and functions sharing the GPU MMU. The GPU contains AMD graphics technology supporting a customized version of Microsoft DirectX graphics features. Hardware and software customizations provide more direct access to hardware resources than standard DirectX. They reduce the CPU overhead of managing graphics activity and combined CPU and GPU processing. Kinect makes extensive use of combined CPU-GPU computation.

The graphics core contains two graphics command and two compute command processors. Each command processor supports 16 work streams. The two geometry primitive engines, 12 compute units, and four render backend depth and color engines in the graphics core support two independent graphics contexts.

The graphics core operates at 853 MHz in full performance mode. System software selects lower frequencies and powers the graphics core and compute unit resources to match tasks.

                                                     Figure 4: GPU
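For context, if each of the 12 compute units follows AMD's contemporaneous GCN organization of 64 single-precision lanes (an assumption; the article does not state lane counts), peak arithmetic throughput at the full clock is 12 × 64 lanes × 2 operations per fused multiply-add × 853 MHz ≈ 1.31 teraflops.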


4.4 Independent GPU Processors and Functions
Eight independent processors and functions
share the GPU MMU. These engines support
applications and system services. They augment
GPU and CPU processing, and are more power-
performance efficient at their tasks.

Four of the engines provide copy, format
conversion, compression, and decompression
services. The video decode and encode engines
support multiple streams and a range of formats.
The audio-video input and output engines
support multiple streams, synchronization, and
digital rights management. Audio-video output
includes resizing and compositing three images,
and saving results in main memory in addition to
display output.

4.5 Audio Processors

The SoC contains eight audio processors and supporting hardware, shown in figure 5. The processors support applications and system services with multiple work queues. Matching their collective audio processing capability would require two CPU cores.

The four DSP cores are Tensilica-based designs incorporating standard and specialized instructions. Two include single precision vector floating point totaling 15.4 billion operations per second.

The other four audio processors implement:
•  Sample rate conversion
•  Equalization and dynamic range compression
•  Filter and volume processing
•  512-stream Xbox Media Audio format decompression

The audio processors use the IO MMU. This path to main memory provides lower latency than the GPU MMU path. Low latency is important for games, which frequently make instantaneous audio decisions, and for Kinect audio processing.

Figure 5: Audio Processors

5 Xbox One Kinect

The Xbox One Kinect is the second-generation Microsoft three-dimensional image and audio sensor. It is an integral part of the Xbox One system. The three-dimensional image and audio sensors and the SoC computation capabilities, operating in parallel with games and other applications, provide an unprecedented level of voice, gesture, and physical interaction with the system.


5.1 Image Sensor Goals and Requirements

User experience drove the image sensor goals:
•  Resolution sufficient for software to reliably detect and track the range of human sizes from young children to small and large adults: a limiting dimension is the diameter of a small child's wrist, approximately 2.5 cm
•  Camera field of view wide enough for users to interact close to the camera in small spaces and relatively far away in larger rooms
•  Camera dynamic range sufficient for users throughout the space with widely varying clothing colors
•  Lighting independence
•  Stability and repeatability
•  Sufficiently low latency for natural-feeling gesture and physical interaction

These goals led to the key requirements:
•  Field of view of 70 degrees horizontal × 60 degrees vertical
•  Aperture F# < 1.1
•  Depth resolution within 1% of distance
•  Minimum software resolvable object less than 2.5 cm
•  Operating range from 0.8 m to 4.2 m from the camera
•  Illumination from the camera and operation independent of room lighting
•  Maximum of 14 milliseconds exposure time
•  Less than 20 milliseconds latency from the beginning of each exposure to data delivered over USB 3.0 to main system software
•  Depth accuracy within 2% across all lighting, color, users, and other conditions in the operating range

5.2 Time of Flight Camera Architecture

Figure 6 shows the three-dimensional image sensor system. The system consists of the sensor chip and a camera SoC. The SoC manages the sensor and communications with the Xbox One console.

                              Figure 6: Three-dimensional Image Sensor System


The time of flight system modulates a camera light source with a square wave. It uses phase detection to measure the time it takes light to travel from the light source to the object and back to the sensor, and calculates distance from the results.

The timing generator creates a modulation square wave. The system uses this signal to modulate both the local light source (transmitter) and the pixel (receiver).

The light travels to the object and back in time Δt. The system calculates Δt by estimating the received light phase at each pixel with knowledge of the modulation frequency. The system calculates depth from the speed of light in air: 1 cm in 33 picoseconds.
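A minimal C sketch of this phase-to-time-to-depth arithmetic follows. The halving accounts for the round trip: light covers 1 cm in about 33 ps one way, so a 1 cm depth change shifts the round trip by about 67 ps.

    #define C_M_PER_S 299792458.0    /* speed of light; air is ~0.03% slower */
    #define TWO_PI    6.283185307179586

    /* Convert a pixel's measured phase (radians) at modulation frequency
     * fmod (Hz) to depth (meters). Phase gives the round-trip time,
     * so the result is halved. */
    double depth_from_phase(double phase_rad, double fmod_hz)
    {
        double round_trip_s = phase_rad / (TWO_PI * fmod_hz);
        return C_M_PER_S * round_trip_s / 2.0;
    }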
5.3 Differential Pixels

Figure 7 shows the time of flight sensor and signal waveforms. A laser diode illuminates the subjects. The time of flight differential pixel array receives the reflected light.

A differential pixel distinguishes the time of flight sensor from a classic camera sensor. The modulation input controls the conversion of incoming light to charge in the differential pixel's two outputs. The timing generator creates clock signals to control the pixel array and a synchronous signal to modulate the light source. The waveforms illustrate phase determination.

                                           Figure 7: Time of Flight Sensor

The light source transmits the light signal. It travels out from the camera, reflects off any object in the field of view, and returns to the sensor lens with some delay (phase shift) and attenuation.

The lens focuses the light on the sensor pixels. A synchronous clock modulates the pixel receiver. When the clock is high, photons falling on the pixel contribute charge to the A-out side of the pixel. When the clock is low, photons contribute charge to the B-out side of the pixel.

The (A−B) differential signal provides an output whose value depends both on the returning light level and on the time it arrives with respect to the pixel clock. This is the essence of time of flight phase detection.

Some interesting properties of the pixel output lead to a very useful set of output images, as sketched below:
•  (A+B) gives a 'normal' grey scale image illuminated by normal ambient (room) lighting (the 'ambient image')
•  (A−B) gives phase information after an arctangent calculation (the 'depth image')
•  |A−B| gives a grey scale image that is independent of ambient (room) lighting (the 'active image')

Chip optical and electrical parameters determine the quality of the resulting image. It does not depend significantly on mechanical factors.
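The following C sketch computes the three images from differential pixel samples. It assumes a pair of captures at 0 and 90 degree clock phases so the arctangent has both quadrature components; the actual Kinect capture sequence uses more phases and frequencies.

    #include <math.h>

    typedef struct { double a, b; } capture_t;   /* A-out and B-out charge */

    /* (A+B): total incident light -- the 'ambient image'. */
    double ambient(capture_t c0) { return c0.a + c0.b; }

    /* Magnitude of the differential signal across the two quadrature
     * captures: responds only to the modulated (camera) light -- the
     * 'active image'. */
    double active(capture_t c0, capture_t c90)
    {
        return hypot(c0.a - c0.b, c90.a - c90.b);
    }

    /* Arctangent of the quadrature components: the phase used for the
     * 'depth image'. Wraps every 2*pi. */
    double phase(capture_t c0, capture_t c90)
    {
        return atan2(c90.a - c90.b, c0.a - c0.b);
    }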


Multiphase captures cancel linearity errors, and simple temperature compensation keeps accuracy within specifications.

Key benefits of the time of flight system are:
•  One depth sample per pixel: X-Y resolution is determined by chip dimensions
•  Depth resolution is a function of the signal to noise ratio and the modulation frequency, that is: transmit light power, receiver sensitivity, modulation contrast, and lens f-number
•  Higher frequency: the phase-to-distance ratio scales directly with modulation frequency, resulting in finer resolution
•  Complexity is in the circuit design; the overall system, and particularly the mechanical aspects, are simplified
•  The sensor outputs three possible images from the same pixel data:
   1. Depth reading per pixel
   2. 'Active' image, independent of room/ambient lighting
   3. Standard 'passive' image, based upon room/ambient lighting

5.4 Dynamic Range

High dynamic range is important. To provide a robust experience in multiplayer situations, we want to detect someone wearing bright clothes standing close to the camera and simultaneously detect someone wearing very dark clothes standing at the back of the play space.

With time of flight, depth resolution is a function of the signal to noise ratio at the sensor, where signal is the received light power and noise is a combination of shot noise in the light and circuit noise in the sensor electronics. We want to exceed a minimum signal to noise ratio for all pixels imaging the users in the room, independent of how many users there are, the clothes they are wearing, or where they are in the room.

For an optical system, the incident power density falls off with the square of distance. Reflectivity of typical clothes can vary from more than 95% to less than 10%. This requires that the sensor show a per-pixel dynamic range in excess of 2500x.

A photographer can adjust aperture and shutter time in a camera to achieve optimal exposure for a subject. The Kinect time of flight system must keep the aperture wide open to minimize the light power required. It takes two images back-to-back with different but fixed shutter times of approximately 100 and 1000 microseconds, and selects the best result pixel by pixel. The design provides non-destructive pixel reading, and light integration involves reading each pixel multiple times to select the best result.
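A minimal C sketch of the per-pixel selection between the two shutters, with an illustrative saturation threshold and the short exposure rescaled to the long one:

    #include <stdint.h>

    #define ADC_SATURATED 4000u    /* illustrative clipping level */
    #define SHUTTER_RATIO 10.0     /* ~1000 us / ~100 us */

    /* Prefer the long shutter for signal to noise; fall back to the
     * short shutter (rescaled to the long shutter's scale) when the
     * long exposure clipped. */
    double select_exposure(uint16_t short_raw, uint16_t long_raw)
    {
        if (long_raw < ADC_SATURATED)
            return (double)long_raw;
        return (double)short_raw * SHUTTER_RATIO;
    }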
5.5 Sensing over Long Range with Fine Resolution

The system measures the phase shift of a modulated signal, then calculates depth from the phase using:

d = (c × φ) / (4π × fmod)

Depth is d, c is the speed of light, φ is the measured phase shift, and fmod is the modulation frequency.

Increasing the modulation frequency increases resolution, that is, the depth resolution for a given phase uncertainty. Power limits what modulation frequencies can practically be used, and higher frequency increases phase aliasing.

Phase wraps around at 360°. This causes the depth reading to alias. For example, aliasing starts at a depth of 1.87 m with an 80 MHz modulation frequency.

Kinect acquires images at multiple modulation frequencies, illustrated in figure 8. This allows ambiguity elimination as far away as the equivalent of the beat frequency of the different frequencies, which is greater than 10 m for Kinect with the chosen frequencies of approximately 120 MHz, 80 MHz, and 16 MHz.

Figure 8: Multiple Modulation Frequencies
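A minimal C sketch of de-aliasing with two of the frequencies, searching small unwrap counts for the pair of candidate depths that agree; the real pipeline combines all three frequencies and weights by noise statistics.

    #include <math.h>

    #define C_M_PER_S 299792458.0
    #define TWO_PI    6.283185307179586

    /* Each frequency's depth is ambiguous modulo c / (2 * fmod). */
    double dealias(double phase1, double f1, double phase2, double f2)
    {
        double wrap1 = C_M_PER_S / (2.0 * f1);   /* 1.87 m at 80 MHz */
        double wrap2 = C_M_PER_S / (2.0 * f2);
        double d1 = (phase1 / TWO_PI) * wrap1;   /* wrapped depths */
        double d2 = (phase2 / TWO_PI) * wrap2;
        double best = d1, best_err = INFINITY;

        for (int n1 = 0; n1 < 8; n1++) {         /* candidate unwrap counts */
            for (int n2 = 0; n2 < 8; n2++) {
                double c1 = d1 + n1 * wrap1;
                double c2 = d2 + n2 * wrap2;
                double err = fabs(c1 - c2);
                if (err < best_err) { best_err = err; best = 0.5 * (c1 + c2); }
            }
        }
        return best;                             /* de-aliased depth, m */
    }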


5.6 Depth Image

The GPU in the main SoC calculates depth from the phase information delivered by the camera. This takes a small part of each frame time.

Figure 9 shows a depth image captured at a distance of approximately 2.5 m, direct from the camera, without averaging or further processing. The coloring is a result of test software that assigns a color to each recognized user for engineering use.

Figure 9: Depth Image

Figure 10 illustrates de-aliasing performance. It shows an image of a long corridor. The system obtains smooth depth readings out to 16 m in this example without wrapping.

Figure 10: Depth Range

Figure 11 illustrates the wide dynamic depth range applied to human figure recognition. One figure is close to the camera and the other is far away. The system captures both clearly.

Figure 11: Dynamic Range Figure Recognition

5.7 Face Recognition

Face recognition is important for a personalized user experience. It is difficult to achieve high quality results in many situations with normal photography due to the wide variety of room lighting conditions. The photo in figure 12 is an example of how room lighting and the resulting shadowing can dramatically change how a person looks to a camera, in this case from a lamp to the side of the TV.

Figure 12: High Contrast Ambient Lighting Situation

Figure 13 shows the same scene captured with the Kinect three-dimensional sensor. The sensor data provides an image that is independent of the wide variation in room lighting.



Figure 13: Kinect Image in High Contrast Ambient Lighting Situation

The resolution is lower than that of the high definition RGB camera that Kinect also contains. However, the fixed illumination more than compensates, so that the system can provide robust face recognition to applications.

6 Conclusion

The Xbox One SoC incorporates five billion transistors to provide high performance computation, graphics, audio processing, and audio-video input and output for multiple, simultaneous applications and system services. The Xbox One Kinect adds low latency three-dimensional image and voice sensing. Together, the SoC and Kinect provide unique voice and gesture control. The system recognizes individual users, who can use voice and movement within many applications, switch instantly between functions, and combine games, TV, and music, while interacting with friends via services such as Skype audio and video.

John Sell is a hardware architect at Microsoft and chief architect of the Xbox One SoC. Sell has an MS in electrical engineering and computer science from the University of California, Berkeley, and a BS in engineering from Harvey Mudd College, Claremont, CA.

Patrick O'Connor is a Senior Director of Engineering at Microsoft, responsible for hardware and software development of sensors and custom silicon. O'Connor has a BS in electrical engineering from Trinity College, Dublin.

Microsoft Corporation
1065 La Avenida
Mountain View, CA 94043

