The Xbox One System on a Chip and Kinect Sensor
John Sell, Patrick O'Connor, Microsoft Corporation

This article has been accepted for publication in IEEE Micro but has not yet been fully edited. Some content may change prior to final publication.

1 Abstract

The System on a Chip at the heart of the Xbox One entertainment console is one of the largest consumer designs to date, with five billion transistors. The Xbox One Kinect image and voice sensor uses time of flight technology to provide high resolution, low latency, lighting-independent three-dimensional image sensing. Together they provide unique voice and gesture interaction with high performance games and other entertainment applications.

2 Terms

CPU     Central Processing Unit
DRAM    Dynamic Random Access Memory
DSP     Digital Signal Processor
GPU     Graphics Processing Unit
HDMI    High-Definition Multimedia Interface
MMU     Memory Management Unit
PCI(e)  Peripheral Component Interconnect (Express)
SoC     System on a Chip
SRAM    Static Random Access Memory

3 Xbox One System

The Xbox One system pictured in figure 1 includes the Kinect image and audio sensors, console, and wireless controller.

Figure 1: Xbox One Kinect, Console, and Wireless Controller

Figure 2 shows a block diagram of the system. The main SoC contains all of the principal computation components. The South Bridge chip expands the SoC input and output to access optical disc, hard disc, and flash storage, HDMI input, Kinect, and wireless devices.

Figure 2: Xbox One System

4 Main SoC

A single SoC departs from the initial implementations of previous high performance consoles. One chip enables the most efficient allocation of memory and other resources. It avoids the latency, bandwidth limitations, and power consumption of communicating between separate computation chips.

Microsoft collaborated with Advanced Micro Devices (AMD) to develop the SoC. SRAM and GPU circuits with redundancy comprise more than 50% of the 370-mm² chip, resulting in yield comparable to much smaller designs.

Figure 3 shows the SoC organization. The SoC provides simultaneous system and user services, video input and output, voice recognition, and three-dimensional image recognition.


Significant features include:
•  Unified, but not uniform, main memory
•  Universal host-guest virtual memory management
•  High bandwidth CPU cache coherency
•  Power islands matching features and performance to active tasks

Figure 3: SoC Organization

4.1 Main Memory

Main memory consists of 8 Gbytes of low cost DDR3 external DRAM and 32 Mbytes of internal SRAM. This provides the necessary bandwidth while saving power and considerable cost over wider or faster external DRAM-only alternatives.

Peak DRAM bandwidth is 68 Gbytes per second. Peak SRAM bandwidth ranges between 109 and 204 Gbytes per second, depending on the mix of transactions. Sustainable total peak bandwidth is about 200 Gbytes per second.
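As a rough cross-check of these figures (assuming the widely reported 256-bit DDR3-2133 DRAM interface and a 128-byte-per-cycle SRAM path at the 853 MHz GPU clock; neither width is stated in this article): 32 bytes × 2133 MT/s ≈ 68 Gbytes per second for DRAM, and 128 bytes × 853 MHz ≈ 109 Gbytes per second per SRAM direction, with mixed simultaneous reads and writes accounting for the higher SRAM figure.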

MMU hardware maps guest virtual addresses to guest physical addresses to physical addresses for virtualization and security. The implementation sizes the caching of fully translated page addresses, and uses large pages where appropriate, to avoid significant performance impact from the two-dimensional translation.
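The following minimal C sketch illustrates the two-dimensional walk under simplifying assumptions: a toy linear page table, 4 KB pages, and the guest walk reduced to a single lookup. (In real nested paging, each guest table access itself requires host translation, which is why the two-dimensional walk is expensive and why caching the end-to-end result matters.) None of these structures are the Xbox One MMU's actual formats.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12                          /* toy 4 KB pages */
    #define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

    typedef struct {
        uint64_t vpn;                              /* virtual page number */
        uint64_t pfn;                              /* page frame number */
    } pte_t;

    /* Toy linear page-table lookup; returns false on a page fault. */
    static bool lookup(const pte_t *pt, int n, uint64_t vpn, uint64_t *pfn)
    {
        for (int i = 0; i < n; i++)
            if (pt[i].vpn == vpn) { *pfn = pt[i].pfn; return true; }
        return false;
    }

    /* Guest virtual -> guest physical -> system physical: one walk per
     * dimension. Caching the end-to-end result (guest vpn -> system pfn)
     * avoids paying for both walks on every access; larger pages shrink
     * the tables being walked. */
    bool translate(const pte_t *guest_pt, int gn, const pte_t *host_pt, int hn,
                   uint64_t guest_va, uint64_t *system_pa)
    {
        uint64_t guest_pfn, system_pfn;
        if (!lookup(guest_pt, gn, guest_va >> PAGE_SHIFT, &guest_pfn))
            return false;
        if (!lookup(host_pt, hn, guest_pfn, &system_pfn))
            return false;
        *system_pa = (system_pfn << PAGE_SHIFT) | (guest_va & PAGE_MASK);
        return true;
    }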

System software manages physical memory allocation. System software and hardware keep page tables synchronized so that the CPU, GPU, and other processors can share memory and pass pointers rather than copying data, and so that a linear data structure in a GPU or CPU virtual space can have physical pages scattered across DRAM and SRAM. The unified memory system frees applications from the mechanics of where data is located, but GPU-intensive applications can specify which data should be in SRAM for best performance.

The GPU graphics core and several specialized processors share the GPU MMU, which supports 16 virtual spaces. PCIe input and output and the audio processors share the IO MMU, which supports virtual spaces for each PCI bus/device/function. Each CPU core has its own MMU (CPU access to SRAM maps through a CPU MMU and the GPU MMU).

The design provides 32 Gbytes per second peak DRAM access with hardware-maintained CPU cache coherency for data shared by the CPU, GPU, and other processors. Hardware-maintained coherency improves performance and software reliability.

The implementation restricts shared CPU-cache-coherent data (and PCIe and audio data, most of which is CPU-cache-coherent) to DRAM for simplification and cost savings. GPU SRAM access and non-CPU-cache-coherent DRAM access bypass CPU cache coherency checking.

4.2 CPU

The CPU contains eight AMD Jaguar single-thread 64-bit x86 cores in two clusters of four. The cores contain individual first-level code and data caches. Each cluster contains a shared 2 MB second-level cache.

The CPU cores operate at 1750 MHz in full performance mode. Each cluster can operate at a different frequency. The system selectively powers individual cores and clusters to match workload requirements.

Jaguar provides good performance and excellent power-performance efficiency.


The CPU contains minor modifications from earlier Jaguar implementations to support two clusters and increased CPU cache coherent bandwidth.

4.3 GPU

Figure 4 shows the graphics core and the independent processors and functions sharing the GPU MMU. The GPU contains AMD graphics technology supporting a customized version of Microsoft DirectX graphics features. Hardware and software customizations provide more direct access to hardware resources than standard DirectX. They reduce the CPU overhead of managing graphics activity and combined CPU and GPU processing. Kinect makes extensive use of combined CPU-GPU computation.

The graphics core contains two graphics command and two compute command processors. Each command processor supports 16 work streams. The two geometry primitive engines, 12 compute units, and four render backend depth and color engines in the graphics core support two independent graphics contexts.

The graphics core operates at 853 MHz in full performance mode. System software selects lower frequencies and powers the graphics core and compute unit resources to match tasks.

                                                     Figure 4: GPU
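For context, if each of the 12 compute units follows AMD's contemporaneous GCN organization of 64 single-precision lanes (an assumption; the article does not state lane counts), peak arithmetic throughput at the full clock is 12 × 64 lanes × 2 operations per fused multiply-add × 853 MHz ≈ 1.31 teraflops.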


4.4 Independent GPU Processors and Functions
Eight independent processors and functions
share the GPU MMU. These engines support
applications and system services. They augment
GPU and CPU processing, and are more power-
performance efficient at their tasks.

Four of the engines provide copy, format
conversion, compression, and decompression
services. The video decode and encode engines
support multiple streams and a range of formats.
The audio-video input and output engines
support multiple streams, synchronization, and
digital rights management. Audio-video output
includes resizing and compositing three images,
and saving results in main memory in addition to
display output.

4.5 Audio Processors

The SoC contains eight audio processors and supporting hardware, shown in figure 5. The processors support applications and system services with multiple work queues. Matching their collective audio processing capability would require two CPU cores.

The four DSP cores are Tensilica-based designs incorporating standard and specialized instructions. Two include single precision vector floating point totaling 15.4 billion operations per second.

The other four audio processors implement:
•  Sample rate conversion
•  Equalization and dynamic range compression
•  Filter and volume processing
•  512-stream Xbox Media Audio format decompression

The audio processors use the IO MMU. This path to main memory provides lower latency than the GPU MMU path. Low latency is important for games, which frequently make instantaneous audio decisions, and for Kinect audio processing.

Figure 5: Audio Processors

5 Xbox One Kinect

The Xbox One Kinect is the second-generation Microsoft three-dimensional image and audio sensor. It is an integral part of the Xbox One system. The three-dimensional image and audio sensors and the SoC computation capabilities, operating in parallel with games and other applications, provide an unprecedented level of voice, gesture, and physical interaction with the system.


5.1 Image Sensor Goals and Requirements

User experience drove the image sensor goals:
•  Resolution sufficient for software to reliably detect and track the range of human sizes from young children to small and large adults: a limiting dimension is the diameter of a small child's wrist, approximately 2.5 cm
•  Camera field of view wide enough for users to interact close to the camera in small spaces and relatively far away in larger rooms
•  Camera dynamic range sufficient for users throughout the space with widely varying clothing colors
•  Lighting independence
•  Stability and repeatability
•  Sufficiently low latency for natural-feeling gesture and physical interaction

These goals led to the key requirements:
•  Field of view of 70 degrees horizontal × 60 degrees vertical
•  Aperture F# < 1.1
•  Depth resolution within 1% of distance
•  Minimum software resolvable object less than 2.5 cm
•  Operating range from 0.8 m to 4.2 m from the camera
•  Illumination from the camera and operation independent of room lighting
•  Maximum of 14 milliseconds exposure time
•  Less than 20 milliseconds latency from the beginning of each exposure to data delivered over USB 3.0 to main system software
•  Depth accuracy within 2% across all lighting, color, users, and other conditions in the operating range

5.2 Time of Flight Camera Architecture

Figure 6 shows the three-dimensional image sensor system. The system consists of the sensor chip and a camera SoC. The SoC manages the sensor and communications with the Xbox One console.

                              Figure 6: Three-dimensional Image Sensor System


The time of flight system modulates a camera light source with a square wave. It uses phase detection to measure the time it takes light to travel from the light source to the object and back to the sensor, and calculates distance from the results.

The timing generator creates a modulation square wave. The system uses this signal to modulate both the local light source (transmitter) and the pixel (receiver).

The light travels to the object and back in time Δt. The system calculates Δt by estimating the received light phase at each pixel with knowledge of the modulation frequency. The system calculates depth from the speed of light in air: 1 cm in 33 picoseconds.
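A minimal C sketch of this phase-to-time-to-depth arithmetic follows. The halving accounts for the round trip: light covers 1 cm in about 33 ps one way, so a 1 cm depth change shifts the round trip by about 67 ps.

    #define C_M_PER_S 299792458.0    /* speed of light; air is ~0.03% slower */
    #define TWO_PI    6.283185307179586

    /* Convert a pixel's measured phase (radians) at modulation frequency
     * fmod (Hz) to depth (meters). Phase gives the round-trip time,
     * so the result is halved. */
    double depth_from_phase(double phase_rad, double fmod_hz)
    {
        double round_trip_s = phase_rad / (TWO_PI * fmod_hz);
        return C_M_PER_S * round_trip_s / 2.0;
    }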
5.3 Differential Pixels

Figure 7 shows the time of flight sensor and signal waveforms. A laser diode illuminates the subjects. The time of flight differential pixel array receives the reflected light.

A differential pixel distinguishes the time of flight sensor from a classic camera sensor. The modulation input controls the conversion of incoming light to charge in the differential pixel's two outputs. The timing generator creates clock signals to control the pixel array and a synchronous signal to modulate the light source. The waveforms illustrate phase determination.

                                           Figure 7: Time of Flight Sensor

The light source transmits the light signal. It travels out from the camera, reflects off any object in the field of view, and returns to the sensor lens with some delay (phase shift) and attenuation.

The lens focuses the light on the sensor pixels. A synchronous clock modulates the pixel receiver. When the clock is high, photons falling on the pixel contribute charge to the A-out side of the pixel. When the clock is low, photons contribute charge to the B-out side of the pixel.

The (A−B) differential signal provides an output whose value depends both on the returning light level and on the time it arrives with respect to the pixel clock. This is the essence of time of flight phase detection.

Some interesting properties of the pixel output lead to a very useful set of output images, as sketched below:
•  (A+B) gives a 'normal' grey scale image illuminated by normal ambient (room) lighting (the 'ambient image')
•  (A−B) gives phase information after an arctangent calculation (the 'depth image')
•  |A−B| gives a grey scale image that is independent of ambient (room) lighting (the 'active image')

Chip optical and electrical parameters determine the quality of the resulting image. It does not depend significantly on mechanical factors.
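The following C sketch computes the three images from differential pixel samples. It assumes a pair of captures at 0 and 90 degree clock phases so the arctangent has both quadrature components; the actual Kinect capture sequence uses more phases and frequencies.

    #include <math.h>

    typedef struct { double a, b; } capture_t;   /* A-out and B-out charge */

    /* (A+B): total incident light -- the 'ambient image'. */
    double ambient(capture_t c0) { return c0.a + c0.b; }

    /* Magnitude of the differential signal across the two quadrature
     * captures: responds only to the modulated (camera) light -- the
     * 'active image'. */
    double active(capture_t c0, capture_t c90)
    {
        return hypot(c0.a - c0.b, c90.a - c90.b);
    }

    /* Arctangent of the quadrature components: the phase used for the
     * 'depth image'. Wraps every 2*pi. */
    double phase(capture_t c0, capture_t c90)
    {
        return atan2(c90.a - c90.b, c0.a - c0.b);
    }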


Multiphase captures cancel linearity errors, and simple temperature compensation keeps accuracy within specifications.

Key benefits of the time of flight system are:
•  One depth sample per pixel: X-Y resolution is determined by chip dimensions
•  Depth resolution is a function of the signal to noise ratio and the modulation frequency, that is: transmit light power, receiver sensitivity, modulation contrast, and lens f-number
•  Higher frequency: the phase-to-distance ratio scales directly with modulation frequency, resulting in finer resolution
•  Complexity is in the circuit design; the overall system, and particularly the mechanical aspects, are simplified
•  The sensor outputs three possible images from the same pixel data:
   1. Depth reading per pixel
   2. 'Active' image, independent of room/ambient lighting
   3. Standard 'passive' image, based upon room/ambient lighting

5.4 Dynamic Range

High dynamic range is important. To provide a robust experience in multiplayer situations, we want to detect someone wearing bright clothes standing close to the camera and simultaneously detect someone wearing very dark clothes standing at the back of the play space.

With time of flight, depth resolution is a function of the signal to noise ratio at the sensor, where signal is the received light power and noise is a combination of shot noise in the light and circuit noise in the sensor electronics. We want to exceed a minimum signal to noise ratio for all pixels imaging the users in the room, independent of how many users there are, the clothes they are wearing, or where they are in the room.

For an optical system, the incident power density falls off with the square of distance. Reflectivity of typical clothes can vary from more than 95% to less than 10%. This requires that the sensor show a per-pixel dynamic range in excess of 2500x.

A photographer can adjust aperture and shutter time in a camera to achieve optimal exposure for a subject. The Kinect time of flight system must keep the aperture wide open to minimize the light power required. It takes two images back-to-back with different but fixed shutter times of approximately 100 and 1000 microseconds, and selects the best result pixel by pixel. The design provides non-destructive pixel reading, and light integration involves reading each pixel multiple times to select the best result.
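A minimal C sketch of the per-pixel selection between the two shutters, with an illustrative saturation threshold and the short exposure rescaled to the long one:

    #include <stdint.h>

    #define ADC_SATURATED 4000u    /* illustrative clipping level */
    #define SHUTTER_RATIO 10.0     /* ~1000 us / ~100 us */

    /* Prefer the long shutter for signal to noise; fall back to the
     * short shutter (rescaled to the long shutter's scale) when the
     * long exposure clipped. */
    double select_exposure(uint16_t short_raw, uint16_t long_raw)
    {
        if (long_raw < ADC_SATURATED)
            return (double)long_raw;
        return (double)short_raw * SHUTTER_RATIO;
    }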
5.5 Sensing over Long Range with Fine Resolution

The system measures the phase shift of a modulated signal, then calculates depth from the phase using:

d = (c × φ) / (4π × fmod)

Depth is d, c is the speed of light, φ is the measured phase shift, and fmod is the modulation frequency.

Increasing the modulation frequency increases resolution, that is, the depth resolution for a given phase uncertainty. Power limits what modulation frequencies can practically be used, and higher frequency increases phase aliasing.

Phase wraps around at 360°. This causes the depth reading to alias. For example, aliasing starts at a depth of 1.87 m with an 80 MHz modulation frequency.

Kinect acquires images at multiple modulation frequencies, illustrated in figure 8. This allows ambiguity elimination as far away as the equivalent of the beat frequency of the different frequencies, which is greater than 10 m for Kinect with the chosen frequencies of approximately 120 MHz, 80 MHz, and 16 MHz.

Figure 8: Multiple Modulation Frequencies
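A minimal C sketch of de-aliasing with two of the frequencies, searching small unwrap counts for the pair of candidate depths that agree; the real pipeline combines all three frequencies and weights by noise statistics.

    #include <math.h>

    #define C_M_PER_S 299792458.0
    #define TWO_PI    6.283185307179586

    /* Each frequency's depth is ambiguous modulo c / (2 * fmod). */
    double dealias(double phase1, double f1, double phase2, double f2)
    {
        double wrap1 = C_M_PER_S / (2.0 * f1);   /* 1.87 m at 80 MHz */
        double wrap2 = C_M_PER_S / (2.0 * f2);
        double d1 = (phase1 / TWO_PI) * wrap1;   /* wrapped depths */
        double d2 = (phase2 / TWO_PI) * wrap2;
        double best = d1, best_err = INFINITY;

        for (int n1 = 0; n1 < 8; n1++) {         /* candidate unwrap counts */
            for (int n2 = 0; n2 < 8; n2++) {
                double c1 = d1 + n1 * wrap1;
                double c2 = d2 + n2 * wrap2;
                double err = fabs(c1 - c2);
                if (err < best_err) { best_err = err; best = 0.5 * (c1 + c2); }
            }
        }
        return best;                             /* de-aliased depth, m */
    }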


5.6 Depth Image

The GPU in the main SoC calculates depth from the phase information delivered by the camera. This takes a small part of each frame time.

Figure 9 shows a depth image captured at a distance of approximately 2.5 m, direct from the camera, without averaging or further processing. The coloring is a result of test software that assigns a color to each recognized user for engineering use.

Figure 9: Depth Image

Figure 10 illustrates de-aliasing performance. It shows an image of a long corridor. The system obtains smooth depth readings out to 16 m in this example without wrapping.

Figure 10: Depth Range

Figure 11 illustrates the wide dynamic depth range applied to human figure recognition. One figure is close to the camera and the other is far away. The system captures both clearly.

Figure 11: Dynamic Range Figure Recognition

5.7 Face Recognition

Face recognition is important for a personalized user experience. It is difficult to achieve high quality results in many situations with normal photography due to the wide variety of room lighting conditions. The photo in figure 12 is an example of how room lighting and the resulting shadowing can dramatically change how a person looks to a camera, in this case from a lamp to the side of the TV.

Figure 12: High Contrast Ambient Lighting Situation

Figure 13 shows the same scene captured with the Kinect three-dimensional sensor. The sensor data provides an image that is independent of the wide variation in room lighting.



Figure 13: Kinect Image in High Contrast Ambient Lighting Situation

The resolution is lower than that of the high definition RGB camera that Kinect also contains. However, the fixed illumination more than compensates, so that the system can provide robust face recognition to applications.

6 Conclusion

The Xbox One SoC incorporates five billion transistors to provide high performance computation, graphics, audio processing, and audio-video input and output for multiple, simultaneous applications and system services. The Xbox One Kinect adds low latency three-dimensional image and voice sensing. Together, the SoC and Kinect provide unique voice and gesture control. The system recognizes individual users, who can use voice and movement within many applications, switch instantly between functions, and combine games, TV, and music, while interacting with friends via services such as Skype audio and video.

John Sell is a hardware architect at Microsoft and chief architect of the Xbox One SoC. Sell has an MS in electrical engineering and computer science from the University of California, Berkeley, and a BS in engineering from Harvey Mudd College, Claremont, CA.

Patrick O'Connor is a Senior Director of Engineering at Microsoft, responsible for hardware and software development of sensors and custom silicon. O'Connor has a BS in electrical engineering from Trinity College, Dublin.

Microsoft Corporation
1065 La Avenida
Mountain View, CA 94043

