This is an open-source project (formerly named Listen, Attend and Spell - PyTorch Implementation) for end-to-end automatic speech recognition (ASR) implemented with PyTorch, the well-known deep learning toolkit.
Data is collected from the Ted2srt webpage. Run `python3 scraper/preprocess.py` from the root directory to scrape and generate the dataset. Scraped data is saved at `scraper/data/`, and the processed data is saved to `data/`. Alternatively, download the preprocessed data here.
To train each model, run:

```
python3 main.py --config config/<dataset>/<config_file>.yaml --njobs 8
```

Set `<dataset>` to `ted` to use our scraped data, or to `libri` to use the public data from OpenSLR.
Configuration files are stored as follows:

Extractor | Classifier | Configuration file
---|---|---
MLP | RNN | `mlp_rnn.yaml`
CNN | RNN | `cnn_rnn.yaml`
ANN | RNN | `ann_rnn.yaml`
RNN | RNN | `rnn_rnn.yaml`

Extractor | Classifier | Configuration file
---|---|---
CNN | MLP | `cnn_mlp.yaml`
CNN | CNN | `cnn_cnn.yaml`
CNN | ANN | `cnn_ann.yaml`
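
For example, to train the model with a CNN extractor and an RNN classifier on the scraped data:

```
python3 main.py --config config/ted/cnn_rnn.yaml --njobs 8
```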
Experiment results are stored at experiment_results.md.
The model has two main subcomponents. The first is the extractor, which encodes the audio features of every frame into a latent representation $h$. The second is the classifier, which takes the latent representation and makes a prediction for each frame by classifying it into a predefined set of word tokens such as "a", "the", "-tion", etc. Finally, a beam search decoding algorithm decodes the raw classification results into a sentence. A typical ASR model has a CNN extractor and an RNN classifier.
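As a minimal illustrative sketch, the extractor/classifier split can be expressed as two chained PyTorch modules. The class names, hidden size, and vocabulary size below are assumptions for illustration, not the repository's exact implementation, and downsampling is omitted for brevity:

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """Illustrative extractor: maps per-frame audio features x to a latent representation h."""
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())

    def forward(self, x):            # x: (batch, time, feat_dim)
        return self.net(x)           # h: (batch, time, hidden_dim)

class Classifier(nn.Module):
    """Illustrative classifier: predicts a word-token distribution for every frame of h."""
    def __init__(self, hidden_dim=256, vocab_size=5000):
        super().__init__()
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h):            # h: (batch, time, hidden_dim)
        out, _ = self.rnn(h)
        return self.out(out)         # logits: (batch, time, vocab_size)

x = torch.randn(2, 523, 40)          # dummy batch of filter-bank features
logits = Classifier()(Extractor()(x))
print(logits.shape)                  # torch.Size([2, 523, 5000])
```

Beam search decoding then turns these per-frame token scores into a sentence.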
For our experiments, we first fix the classifier to be an RNN and compare how the four NN variants perform as the extractor. Second, we fix the extractor to be a CNN and replace the classifier with the four NN variants.
Preprocess the scraped data for input into `Dataset` and `DataLoader`. This includes cleaning the data, cutting the audio into multiple slices according to the SRT-annotated times, and preparing a label for each audio slice.
Symbols are removed from the labels, and the text is converted to lowercase.
Less accurate data are removed: a check is done for SRT entries that start at the same time (e.g. `00:00:12,820 --> 00:00:14,820`), SRT files that do not include the introduction music time are filtered out, and laughter and applause annotations are removed.
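A rough sketch of this kind of label cleaning; the helper below is an assumption for illustration, and the exact rules in `scraper/preprocess.py` may differ:

```python
import re

def clean_label(text):
    """Illustrative cleaning: drop non-speech cues, lowercase, strip symbols."""
    text = re.sub(r"\((?:laughter|applause)\)", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"[^a-z' ]+", " ", text.lower())   # remove symbols, keep letters and apostrophes
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

print(clean_label("Six months ago, I got an email (Laughter)"))
# six months ago i got an email
```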
A raw SRT snippet such as

```
1
00:00:12,312 --> 00:00:14,493
Six months ago, I got an email

2
00:00:14,493 --> 00:00:15,900
from a man in Israel
```

is converted to

```
<audio_id>-1 six months ago i got an email
<audio_id>-2 from a man in israel
```

and stored at `<audio_id>.trans.txt`. The corresponding sliced audio files are named `<audio_id>_<audio_index>.mp3`.
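A hedged sketch of this conversion; the regex and helper below are assumptions, and the actual scraper may parse the SRT files differently:

```python
import re

# One SRT cue: index, start --> end timestamps, then one or more text lines.
SRT_BLOCK = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.+?)(?:\n\n|\Z)",
    re.DOTALL,
)

def srt_to_transcript(srt_text, audio_id):
    """Turn an SRT string into '<audio_id>-<index> <cleaned text>' transcript lines."""
    lines = []
    for index, start, end, text in SRT_BLOCK.findall(srt_text):
        # start/end would drive the audio slicing step
        text = " ".join(text.split())                    # join multi-line cues
        text = re.sub(r"[^a-z' ]+", " ", text.lower())   # same cleaning as sketched above
        lines.append(f"{audio_id}-{index} {' '.join(text.split())}")
    return "\n".join(lines)

raw = """1
00:00:12,312 --> 00:00:14,493
Six months ago, I got an email

2
00:00:14,493 --> 00:00:15,900
from a man in Israel
"""
print(srt_to_transcript(raw, "<audio_id>"))
# <audio_id>-1 six months ago i got an email
# <audio_id>-2 from a man in israel
```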
After `build_dataset()` has preprocessed the data, the data is split into train-dev-test sets.
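A minimal sketch of such a split, assuming a simple seeded shuffle; the ratios and helper name are illustrative, and the actual split used after `build_dataset()` may differ:

```python
import random

def split_dataset(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then carve off dev and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_dev, n_test = int(len(items) * dev_frac), int(len(items) * test_frac)
    return items[n_dev + n_test:], items[:n_dev], items[n_dev:n_dev + n_test]

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```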
Example statistics of one loaded audio signal:

```
Sample Rate: 44100
Shape: (1, 84055)
Dtype: torch.float32
Max: 0.523
Min: -0.319
Mean: -0.000
Std Dev: 0.081
```
Waveform plot of the sample audio signal, which is 1.9 s long. The duration can be obtained as `signal_frames / sample_rate`.
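These statistics and the duration can be reproduced along the following lines (a sketch; the file path is a placeholder):

```python
import torchaudio

signal, sample_rate = torchaudio.load("data/<audio_id>_1.mp3")   # placeholder path
print("Sample Rate:", sample_rate)
print("Shape:", tuple(signal.shape))
print("Dtype:", signal.dtype)
print("Max: {:.3f}  Min: {:.3f}  Mean: {:.3f}  Std Dev: {:.3f}".format(
    signal.max().item(), signal.min().item(), signal.mean().item(), signal.std().item()))
print("Duration: {:.1f}s".format(signal.shape[1] / sample_rate)) # signal_frames / sample_rate
```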
The steps to compute filter banks are motivated by mimicking how humans perceive audio signals [1].
A Kaldi-compliant filter bank transformation is applied to the audio signals; 40 mel coefficients are kept.
```python
import torchaudio

feat_dim = 40  # number of mel bins to keep
# signal is the waveform loaded above; plot_spectrogram is a plotting helper (as in the torchaudio tutorials)
waveform_trans = torchaudio.compliance.kaldi.fbank(signal, frame_length=25, frame_shift=10, num_mel_bins=feat_dim)
plot_spectrogram(waveform_trans.transpose(0, 1).detach(), title="Filter Banks", ylabel="mel bins")
```
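The resulting `waveform_trans` is a 2-D tensor of shape `(num_frames, feat_dim)`: one 40-dimensional filter-bank vector per 25 ms window, hopped every 10 ms, which is why it is transposed before plotting so that the mel bins lie on the y-axis.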
The extractor generates a sequence of feature vectors, each extracted from a small overlapping window of audio frames. It transforms the input $x$ into a high-level representation $h = (h_1, \dots, h_L)$.
The extractor also downsamples the timesteps, e.g. by a factor of 4 from 523 timesteps to 130 timesteps in the RNN extractor. In the CNN extractors, downsampling is achieved with max pooling.
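For illustration only (the module below is an assumption, not the repository's extractor), two convolution + max-pooling stages reproduce the 4x time downsampling mentioned above:

```python
import torch
import torch.nn as nn

class CNNExtractor(nn.Module):
    """Illustrative CNN extractor: two conv + max-pool stages give 4x time downsampling."""
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),                      # time: T -> T/2
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),                      # time: T/2 -> T/4
        )

    def forward(self, x):                         # x: (batch, time, feat_dim)
        h = self.conv(x.transpose(1, 2))          # Conv1d expects (batch, feat_dim, time)
        return h.transpose(1, 2)                  # h: (batch, time/4, hidden_dim)

x = torch.randn(1, 523, 40)
print(CNNExtractor()(x).shape)                    # torch.Size([1, 130, 256])
```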
The classifier generates an output sequence $(y_1, \dots, y_T)$ from the input $h$, the output of the extractor. The classifier's output $y$ is a sequence of word tokens and is expected to have a time dimension of the same size as $h$.
The original README can be accessed here.
Liu, A., Lee, H.-Y., & Lee, L.-S. (2019). Adversarial Training of End-to-End Speech Recognition Using a Criticizing Language Model. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Liu, A. H., Sung, T.-W., Chuang, S.-P., Lee, H.-Y., & Lee, L.-S. (2019). Sequence-to-Sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding. arXiv preprint arXiv:1910.12740. Retrieved from http://arxiv.org/abs/1910.12740