Perform Automatic Speech Recognition (ASR) with Wav2Letter using PyArmNN and Debian Packages
Tutorial
Version 21.08

Document ID: 102603_2108_01_en
Issue 01
Non-Confidential
Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.

Release information

Document history
Issue                 Date                              Confidentiality                         Change
2108-01               9 August 2021                     Non-Confidential                        Initial release

Proprietary Notice

This document is protected by copyright and other related rights and the practice or
implementation of the information contained in this document may be protected by one or more
patents or pending patent applications. No part of this document may be reproduced in any form
by any means without the express prior written permission of Arm. No license, express or implied,
by estoppel or otherwise to any intellectual property rights is granted by this document unless
specifically stated.

Your access to the information in this document is conditional upon your acceptance that you
will not use or permit others to use the information for the purposes of determining whether
implementations infringe any third party patents.

THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION,
THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-
INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE
DOCUMENT. For the avoidance of doubt, Arm makes no representation with respect to, and has
undertaken no analysis to identify or understand the scope and content of, third party patents,
copyrights, trade secrets, or other rights.

This document may include technical inaccuracies or typographical errors.

TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR
ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL,
INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND
REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS
DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

This document consists solely of commercial items. You shall be responsible for ensuring that
any use, duplication or disclosure of this document complies fully with any relevant export laws
and regulations to assure that this document or any portion thereof is not exported, directly
or indirectly, in violation of such export laws. Use of the word “partner” in reference to Arm’s
customers is not intended to create or refer to any partnership relationship with any other
company. Arm may make changes to this document at any time and without notice.

If any of the provisions contained in these terms conflict with any of the provisions of any click
through or signed written agreement covering this document with Arm, then the click through or
signed written agreement prevails over and supersedes the conflicting provisions of these terms.
This document may be translated into other languages for convenience, and you agree that if there
is any conflict between the English version of this document and any translation, the terms of the
English version of the Agreement shall prevail.

The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks
of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands
and names mentioned in this document may be the trademarks of their respective owners. Please
follow Arm’s trademark usage guidelines at https://www.arm.com/company/policies/trademarks.

Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.

Arm Limited. Company 02557590 registered in England.

110 Fulbourn Road, Cambridge, England CB1 9NJ.

(LES-PRE-20349)

Confidentiality Status

This document is Non-Confidential. The right to use, copy and disclose this document may be
subject to license restrictions in accordance with the terms of the agreement entered into by Arm
and the party that Arm delivered this document to.

Unrestricted Access is an Arm internal classification.

Product Status

The information in this document is Final, that is for a developed product.

Web address

developer.arm.com

Inclusive language commitment
Arm values inclusive communities. Arm recognizes that we and our industry have used language
that can be offensive. Arm strives to lead the industry and create change.

We believe that this document contains no offensive language. To report offensive language in this
document, email terms@arm.com.

Contents

1 Introduction
1.1 Conventions
1.2 Additional reading
1.3 Feedback
1.4 Other information

2 Overview
2.1 Overview of PyArmNN
2.2 Speech recognition
2.3 Before you begin

3 Device-specific installation
3.1 Install on Raspberry Pi
3.2 Install on Odroid N2 Plus

4 Running the application
4.1 Initializing the project
4.2 Get an audio file for the example
4.3 Run the example

5 Code deep dive
5.1 Initialization
5.2 Creating a network
5.3 Automatic speech recognition pipeline

6 Related information

7 Next steps

1 Introduction

1.1 Conventions
The following subsections describe conventions used in Arm documents.

Glossary
The Arm Glossary is a list of terms used in Arm documentation, together with definitions for
those terms. The Arm Glossary does not contain terms that are industry standard unless the Arm
meaning differs from the generally accepted meaning.

See the Arm® Glossary for more information: developer.arm.com/glossary.

Typographic conventions
Arm documentation uses typographical conventions to convey specific meaning.

Convention            Use
italic                Introduces special terminology, denotes cross-references, and citations.
bold                  Highlights interface elements, such as menu names. Denotes signal names. Also
                      used for terms in descriptive lists, where appropriate.
monospace             Denotes text that you can enter at the keyboard, such as commands, file and
                      program names, and source code.
monospace italic      Denotes arguments to monospace text where the argument is to be replaced by a
                      specific value.
monospace bold        Denotes language keywords when used outside example code.
monospace underline   Denotes a permitted abbreviation for a command or option. You can enter the
                      underlined text instead of the full command or option name.
< and >               Encloses replaceable terms for assembler syntax where they appear in code or
                      code fragments. For example:
                        MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>
SMALL CAPITALS        Used in body text for a few terms that have specific technical meanings, that
                      are defined in the Arm Glossary. For example, IMPLEMENTATION DEFINED,
                      IMPLEMENTATION SPECIFIC, UNKNOWN, and UNPREDICTABLE.
Caution               This represents a recommendation which, if not followed, might lead to system
                      failure or damage.
Warning               This represents a requirement for the system that, if not followed, might
                      result in system failure or damage.
Danger                This represents a requirement for the system that, if not followed, will
                      result in system failure or damage.
Note                  This represents an important piece of information that needs your attention.
Tip                   This represents a useful tip that might make it easier, better or faster to
                      perform a task.
Remember              This is a reminder of something important that relates to the information you
                      are reading.

1.2 Additional reading
This document contains information that is specific to this product. For other relevant
information, see the resources listed in the Related information chapter of this guide.

1.3 Feedback
Arm welcomes feedback on this product and its documentation.

Feedback on this product
If you have any comments or suggestions about this product, contact your supplier and give:
•   The product name.
•   The product revision or version.
•   An explanation with as much information as you can provide. Include symptoms and diagnostic
    procedures if appropriate.

Feedback on content
If you have comments on content, send an e-mail to errata@arm.com. Give:
•   The title Perform Automatic Speech Recognition (ASR) with Wav2Letter using PyArmNN and
    Debian Packages Tutorial.
•   The number 102603_2108_01_en.
•   If applicable, the page number(s) to which your comments refer.
•   A concise explanation of your comments.

Arm also welcomes general suggestions for additions and improvements.

Note: Arm tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the
quality of the represented document when used with any other PDF reader.

1.4 Other information
See the Arm website for other relevant information.

•   Arm® Developer.
•   Arm® Documentation.
•   Technical Support.
•   Arm® Glossary.

2 Overview
This guide reviews a sample application that performs Automatic Speech Recognition (ASR) with
the PyArmNN API. The guide explains how the speech recognition application works, then gives
instructions for running the application on a Raspberry Pi or an Odroid N2 Plus.

2.1 Overview of PyArmNN
The Arm NN library optimizes neural networks for Arm hardware. Arm NN provides significant
performance increases compared with other frameworks when running on Arm Cortex-A CPUs.

PyArmNN is a Python package that provides a wrapper around the C++ Arm NN API. PyArmNN
does not implement any computational kernels. PyArmNN delegates all operations to Arm NN,
meaning you access the power of Arm NN from Python.

PyArmNN enables rapid development, allowing you to produce and test prototypes in minutes.

Both PyArmNN and Arm NN use parsers to import models from different external frameworks.
Available parsers include the following:
•   TensorFlow Lite
•   ONNX
•   PyTorch via ONNX

The parser converts the imported model into an Arm NN network graph, which can be optimized
for Arm hardware.

You can find more information about PyArmNN and example code for PyArmNN in the Arm
Software GitHub repository.
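
After installing PyArmNN, you can verify that the package loads and report the Arm NN version it
was built against. The following is a minimal check:

 import pyarmnn as ann

 # Print the version of Arm NN that the PyArmNN package was built against
 print(f"Working with Arm NN {ann.ARMNN_VERSION}")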

2.2 Speech recognition
Speech recognition is the process of a program recognizing and translating spoken language into a
written format or a format understood by an application.

Many speech recognition applications can be broken down into two steps. The first step is the pre-
processing of the raw audio data, which usually involves applying some signal processing such as
a Fast-Fourier transform. The second step is a model, often a neural network, which takes the
processed data as input.
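
To illustrate the first step, the following sketch computes the magnitude spectrum of a single
audio frame with NumPy. The frame length and windowing here are assumptions for illustration, not
parameters taken from this guide's application:

 import numpy as np

 # Illustrative only: frame length and window choice are assumptions
 frame = np.random.randn(512).astype(np.float32)   # one 512-sample audio frame
 window = np.hanning(len(frame))                   # taper the frame edges
 spectrum = np.abs(np.fft.rfft(frame * window))    # magnitude spectrum of the frame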

Some recent advances in neural networks use the raw data as input and perform all the analysis
within the network itself. However, these networks tend to be large because they have all the
feature extraction and analysis built into them.

Speech recognition has many use cases, most of which fall under the umbrella of hands-free
dictation. Some examples include enabling car drivers to perform tasks without taking their
hands off the steering wheel, automated caption generation that makes media more accessible, and
virtual assistants that help with day-to-day tasks.

2.3 Before you begin
This guide requires installing PyArmNN on your device. From Arm NN 20.11, we provide Debian
packages for Ubuntu 64-bit. These packages are the easiest way to install PyArmNN and are what
we recommend you use. To use these packages, you must run a supported 64-bit Debian-based
Linux operating system.

We provide installation steps for different devices in the Device-specific installation section.

3 Device-specific installation
In this section, we provide installation steps to set up your Raspberry Pi and Odroid N2 Plus
devices.

3.1 Install on Raspberry Pi
The following steps cover setting up your Raspberry Pi for this example.

Before you begin
To run the example in this guide, you must install a 64-bit operating system on your Raspberry Pi.
Ubuntu has official support for 64-bit 20.04 and 21.04 on the Raspberry Pi.

Procedure
1. Install Ubuntu. See the installation instructions for the Raspberry Pi 4 on the Ubuntu website.
   This guide has been tested on Ubuntu 20.04 and Ubuntu 21.04.
2. Download and install the required packages. Add the PPA to your sources from the software-
   properties-common package as shown in the following commands:
      sudo apt install software-properties-common
      sudo add-apt-repository ppa:armnn/ppa
      sudo apt update
      export ARMNN_MAJOR_VERSION=25
      sudo apt-get install -y python3-pyarmnn libarmnn-cpuacc-backend${ARMNN_MAJOR_VERSION} libarmnn-cpuref-backend${ARMNN_MAJOR_VERSION}

   In the apt-get command shown in the preceding code, 25 is the libarmnn major version. You can
   replace this version number with the latest supported version listed on our GitHub repository.
   These packages provide the TensorFlow Lite parser for Arm NN, which this guide uses.
3. Enter the following commands to install Git and Git Large File Storage to download models
   from the model zoo:
      sudo apt-get install git git-lfs
      git lfs install --skip-repo

4. Install pip using the following command:
      sudo apt install python3-pip

3.2 Install on Odroid N2 Plus
The Odroid N2 Plus has both an Arm Cortex-A CPU and an Arm Mali GPU. This means you can
configure your setup to use both the CPU and GPU. The following steps cover setting up your
Odroid N2 Plus to run the example in this guide.

Procedure
1. Install Ubuntu Mate 20.04. Use the official Ubuntu Mate images on the Odroid website.

2. Do a basic apt update and apt upgrade. The following commands ensure everything is
   up to date.
      sudo apt update
      sudo apt upgrade

3. Install the OpenCL drivers to enable GPU support using the following commands. The Ubuntu
   Mate images come with official support, but some users have reported problems. If you have
   already installed these drivers, you can skip this step.
      mkdir temp-folder
      cd temp-folder
      sudo apt-get install clinfo ocl-icd-libopencl1
      sudo apt-get download mali-fbdev
      ar -xv mali-fbdev_*
      tar -xvf data.tar.xz
      sudo rm /usr/lib/aarch64-linux-gnu/libOpenCL.so*
      sudo cp -r usr/* /usr/
      sudo mkdir /etc/OpenCL
      sudo mkdir /etc/OpenCL/vendors/
      sudo bash -c 'echo "libmali.so" > /etc/OpenCL/vendors/mali.icd'

4. Check your OpenCL installation by running the clinfo command.
      clinfo

5. Download and install the required packages to run the example. Arm has tested the Odroid
   with Arm NN major version 24. To install this version, use the following commands:
      sudo apt install software-properties-common
      sudo add-apt-repository ppa:armnn/ppa
      sudo apt update
      export ARMNN_MAJOR_VERSION=24
      sudo apt-get install -y python3-pyarmnn libarmnn-cpuacc-backend${ARMNN_MAJOR_VERSION} libarmnn-gpuacc-backend${ARMNN_MAJOR_VERSION} libarmnn-cpuref-backend${ARMNN_MAJOR_VERSION}

   These packages provide the TensorFlow Lite parser for Arm NN, which is what this guide uses.
   The packages also provide both the CPU and GPU accelerators.
6. Install Git and Git Large File Storage to download models from the model zoo using the
   following commands:
      sudo apt-get install git git-lfs
      git lfs install --skip-repo

7. Install pip using the following command:
      sudo apt install python3-pip

4 Running the application
This section explains how to retrieve and run all the code and models that you require to use the
Automatic Speech Recognition application. By the end of this section, you should have a text
output from a WAV file.

4.1 Initializing the project
The following steps cover initializing the project on your device.

About this task
Get the example code from GitHub.

Procedure
1. Create a workspace for the project with the following command:
      mkdir ~/workspace && cd ~/workspace

2. Clone the Arm NN repository with the following command:
      git clone https://github.com/ARM-software/armnn/

3. Check out the master branch with the following command:
      cd armnn && git checkout master

4. Navigate to the example folder with the following command:
      cd python/pyarmnn/examples/speech_recognition

5. Install the libsndfile and PortAudio packages with the following command:
      sudo apt-get install libsndfile1 libportaudio2

6. Install the required Python modules with the following command:
      pip3 install -r requirements.txt

7. To get the model from the Arm Model Zoo, navigate back to your workspace with the following
   command:
      cd ~/workspace

8. Clone the Model Zoo repository with the following command:
      git clone https://github.com/ARM-software/ML-zoo

9. Copy the model file to the example application with the following commands:
      cd armnn/python/pyarmnn/examples/speech_recognition
      cp -r ~/workspace/ML-zoo/models/speech_recognition/wav2letter/tflite_int8 .

4.2 Get an audio file for the example
To run this example, you need a WAV file. We have provided an audio file for use with this guide,
available from the PyArmNN repository.

To download the file from the PyArmNN repository, use wget with the download link of your file.
For example, to get the file quick_brown_fox_16000khz.wav, use the following command:

 wget https://git.mlplatform.org/ml/armnn.git/plain/python/pyarmnn/examples/speech_recognition/tests/testdata/quick_brown_fox_16000khz.wav

4.3 Run the example
The following steps cover running the example on your device.

Procedure
1. To run this example, use the following command:
      python3 run_audio_file.py --audio_file_path <audio file path> --model_file_path <model file path> --labels_file_path <labels file path>

   The label file is the tests/testdata/wav2letter_labels.txt file in your example repository.
2. Optional flags: use --preferred_backends to run with a specific backend. You can enter
   multiple values in preference order, separated by whitespace. For example, pass CpuAcc
   CpuRef for [“CpuAcc”, “CpuRef”]. To see all available options, use --help.
   The available values are:
    •    CpuAcc for the CPU backend
    •    GpuAcc for the GPU backend
    •    CpuRef for the CPU reference kernels

    The following is an example, with the file quick_brown_fox_16000khz.wav:

      python3 run_audio_file.py --audio_file_path tests/testdata/quick_brown_fox_16000khz.wav --model_file_path tflite_int8/wav2letter_int8.tflite --labels_file_path tests/testdata/wav2letter_labels.txt --preferred_backends CpuAcc CpuRef

Results
After the script has finished running, the decoded text is displayed on the console.

5 Code deep dive
This section of the guide explains how the Wav2Letter application processes a WAV file and
generates a transcript of human speech.

The application takes a WAV file of human speech as input, then uses the Mel-Frequency
Cepstral Coefficients (MFCC) class to generate the input for the model. MFCCs convert the raw
audio waveform into a compact set of features that is easier for a convolutional neural network
to parse. The model assigns each chunk of converted audio the correct phoneme, which the
application then converts into a correctly spelled word.

The Wav2Letter application performs the following steps:
1. Initialization:
    a. Point the application to the sample audio, model, and label file.
    b. Build the dictionary of phonemes to spelling.
2. Creating a network: Convert the network to be run by Arm NN.
3. Automatic speech recognition pipeline:
    a. Feed the audio into the conversion process (MFCC).
    b. Input the sample into the model.
    c. The model processes the sample.
    d. Read output of the phonemes the model processed.
    e. Use the dictionary to convert to correct spelling.
    f.   Output the words as strings.
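
The following sketch summarizes how these steps fit together in code. The names below are
illustrative placeholders for the example's classes and functions, not verbatim code from the
repository:

 # A high-level sketch of the pipeline; names are illustrative placeholders
 def transcribe(audio_blocks, preprocessor, executor, labels, decode):
     """Run ASR over an iterable of audio blocks and return the transcript."""
     text = ""
     for block in audio_blocks:                           # step 3a: blocks of audio samples
         features = preprocessor.extract_features(block)  # MFCCs plus derivatives
         output = executor.run(features)                  # steps 3b to 3d: inference
         text += decode(output, labels)                   # steps 3e to 3f: phonemes to spelling
     return text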

5.1 Initialization
The application parses the supplied user arguments, which include a path to an audio file. The
application loads the audio file into the AudioCapture class, which initializes the audio source. The
ModelParams class sets the sampling parameters that the model requires.

The AudioCapture class captures chunks of audio data from the source. To run ASR on the audio
file, the application creates a generator object that yields blocks of audio data from the file.
Each block has a minimum sample size.

To interpret the inference result of the loaded network, the application loads the labels that are
associated with the model. Each label represents a phoneme. The dict_labels() function
creates a dictionary that is keyed on the classification index at the output node of the model. The
values of the dictionary are the characters that correspond to the appropriate phonemes.
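
The following condensed sketch shows this initialization flow. The module paths and constructor
signatures here are assumptions based on the description above; check the example source for the
exact definitions:

 # Condensed sketch; run from the speech_recognition example folder.
 # Module paths and signatures are assumptions, not verbatim from the repository.
 from audio_capture import AudioCapture, ModelParams
 from audio_utils import dict_labels

 model_params = ModelParams("tflite_int8/wav2letter_int8.tflite")
 audio_capture = AudioCapture(model_params)

 # Generator that yields blocks of audio data, each with a minimum sample size
 buffer = audio_capture.from_audio_file("tests/testdata/quick_brown_fox_16000khz.wav")

 # Dictionary keyed on the model's output classification index; values are the
 # characters for the corresponding phonemes
 labels = dict_labels("tests/testdata/wav2letter_labels.txt")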

5.2 Creating a network
A PyArmNN application must import a graph from a file using an appropriate parser. These parsers
are libraries for loading neural networks of various formats into the Arm NN runtime. Arm NN
provides parsers for various model file types, including TFLite, TF, and ONNX.

Arm NN supports optimized execution on multiple CPU, GPU, and Ethos-N devices. Before
executing a graph, the application uses IRuntime() to create a runtime context with default
options that are appropriate to the device. We can optimize the imported graph by specifying
a list of backends in order of preference, and then implement backend-specific optimizations.
Each backend is identified by a unique string: CpuAcc, GpuAcc, and CpuRef, which represent the
accelerated CPU, the accelerated GPU, and the CPU reference kernels, respectively.

Arm NN splits the entire graph into subgraphs based on these backends. Each subgraph is then
optimized, and the corresponding subgraph in the original graph is substituted with its optimized
version.
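
The following short sketch shows what creating the runtime context and the backend preference
list looks like with PyArmNN; the backend choice here is just an example:

 import pyarmnn as ann

 # Create a runtime context with default options
 options = ann.CreationOptions()
 runtime = ann.IRuntime(options)

 # Backends in order of preference; the optimizer assigns each subgraph
 # to the first backend in this list that supports it
 preferred_backends = [ann.BackendId('CpuAcc'), ann.BackendId('CpuRef')]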

The Optimize() function optimizes the graph for inference, then LoadNetwork() loads the
optimized network onto the compute device. The LoadNetwork() function also creates the
backend-specific workloads for the layers and a backend-specific workload factory.

Parsers extract the input information for the network:
•   The GetSubgraphInputTensorNames() function extracts all the input names.
•   The GetNetworkInputBindingInfo() function obtains the input binding information
    of the graph. The input binding information contains all the essential information about the
    input. It is a tuple consisting of integer identifiers for the bindable layers and tensor
    information, including the data type, quantization info, dimension count, and total elements.

To get the output binding information for an output layer, the parser retrieves the output tensor
names and calls the GetNetworkOutputBindingInfo() function.

For this application, the main point of contact with PyArmNN is through the
ArmnnNetworkExecutor class, which handles the network creation step for you.

The following code shows the network creation step:

 # common/network_executor.py
 # The provided wav2letter model is in .tflite format, so we use TfLiteParser() to import the graph
 if ext == '.tflite':
     parser = ann.ITfLiteParser()
 network = parser.CreateNetworkFromBinaryFile(model_file)
 ...
 # Optimize the network for the list of preferred backends
 opt_network, messages = ann.Optimize(
     network, preferred_backends, self.runtime.GetDeviceSpec(), ann.OptimizerOptions())
 # Load the optimized network onto the runtime device
 self.network_id, _ = self.runtime.LoadNetwork(opt_network)
 # Get the input and output binding information
 self.input_binding_info = parser.GetNetworkInputBindingInfo(graph_id, input_names[0])
 self.output_binding_info = parser.GetNetworkOutputBindingInfo(graph_id, output_name)

5.3 Automatic speech recognition pipeline
To extract the Mel-Frequency Cepstral Coefficients (MFCCs) that the network uses as features
from a given audio frame, we use the MFCC class. MFCCs are the result of computing the dot
product of the Discrete Cosine Transform (DCT) matrix and the log of the Mel energy.

After extracting all the MFCCs that the application needs for an inference from the audio data, we
compute the first and second MFCC derivatives with respect to time. The computation convolves
the derivatives with one-dimensional Savitzky-Golay filters. The MFCCs and the derivatives are
concatenated to make the input tensor for the model.

The following code shows the MFCC extraction and derivative computation:

 # preprocess.py
 # Extract MFCC features
 log_mel_energy = np.maximum(log_mel_energy, log_mel_energy.max() - top_db)
 mfcc_feats = np.dot(self.__dct_matrix, log_mel_energy)
 ...
 # Compute first and second derivatives (delta and delta-delta respectively) by passing a
 # Savitzky-Golay filter as a 1D convolution over the features
 for i in range(features.shape[1]):
     idelta = np.convolve(features[:, i], self.__savgol_order1_coeffs, 'same')
     mfcc_delta_np[:, i] = idelta
     ideltadelta = np.convolve(features[:, i], self.__savgol_order2_coeffs, 'same')
     mfcc_delta2_np[:, i] = ideltadelta

 # audio_utils.py
 # Quantize the input data and create input tensors with PyArmNN
 input_tensor = quantize_input(input_tensor, input_binding_info)
 input_tensors = ann.make_input_tensors([input_binding_info], [input_tensor])

Note: ArmnnNetworkExecutor has already created the output tensors for you.

After creating the workload tensors, the compute device performs inference for the loaded
network by using the EnqueueWorkload() function of the runtime context.

The following code shows calling the workload_tensors_to_ndarray() function to obtain
the inference results as a list of ndarrays:

 # common/network_executor.py
 status = runtime.EnqueueWorkload(net_id, input_tensors, self.output_tensors)
 self.output_result = ann.workload_tensors_to_ndarray(self.output_tensors)

The output from the inference must be decoded to obtain the recognized characters from the
speech. A simple greedy decoder classifies the results by taking the highest element of the output
as a key for the labels dictionary. The value returned is a character. The character is appended to a
list, and the list is filtered to remove unwanted characters. The produced string is displayed on the
console.
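
The following is a minimal sketch of such a greedy decoder. The blank symbol and the filtering
rules here are assumptions for illustration; the example's own decode function is the
authoritative implementation:

 import numpy as np

 def greedy_decode(output, labels):
     """Simplified greedy decoder: pick the most likely label per time step."""
     indices = np.argmax(output, axis=1)   # highest-scoring class per time step
     decoded = []
     previous = None
     for idx in indices:
         ch = labels[int(idx)]
         if ch != previous:                # collapse consecutive duplicates
             decoded.append(ch)
         previous = ch
     # Filter out unwanted characters; the blank symbol is assumed here to be '$'
     return ''.join(c for c in decoded if c != '$')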

6 Related information
Here are some resources related to the material in this guide:

•   Accelerated ML inference on Raspberry Pi with PyArmNN
•   AI and machine learning content from Arm
•   Arm's Machine Learning blog
•   Object recognition with Arm NN and Raspberry Pi
•   The original research paper from Facebook AI Research - for full details on how this model
    works.
•   Arm Community - ask development questions and find articles and blogs on specific topics
    from Arm experts.
•   PyArmNN API
•   PyArmNN repository

7 Next steps
Now that you understand how to perform automatic speech recognition with PyArmNN, you
can create your own application. We suggest implementing your own network, which you can
do by updating the parameters of ModelParams and MfccParams to match your custom model.
The ArmnnNetworkExecutor class handles the network optimization and loading for you.
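
As a purely hypothetical sketch of that step, with module paths, parameter names, and values
that are illustrative assumptions rather than the example's exact signatures:

 # Hypothetical sketch only: module paths, parameter names, and values are
 # illustrative assumptions; check the example source for the exact signatures.
 from preprocess import MFCCParams
 from audio_capture import ModelParams

 model_params = ModelParams("my_custom_asr_model.tflite")
 mfcc_params = MFCCParams(sampling_freq=16000, num_fbank_bins=128,
                          mel_lo_freq=0, mel_hi_freq=8000,
                          num_mfcc_feats=13, frame_len=512)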

An important step to improve the accuracy of the generated output sentences is to provide
cleaner data to the network. You can do this by adding more preprocessing steps, such as noise
reduction, to your audio pipeline.

In this application, we used a greedy decoder to decode the integer-encoded output. However,
you can achieve better results by implementing a beam search decoder. You can even try adding a
language model at the end to try to correct any spelling mistakes the model produces.
