Given an audio signal as input, we optimize our model to predict future samples from a given signal context.
We train the model to distinguish a sample $z_{i+k}$ that is $k$ steps in the future from distractor samples $\widetilde{z}$ drawn from a proposal distribution $p_n$, by minimizing the contrastive loss for each step.
We optimize the loss $L = \sum_{k=1}^{K}L_k$, summing (1) over different step sizes.
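A minimal sketch of this step-wise contrastive objective, assuming `z` and `c` are the encoder and context outputs, `step_proj` is a list of per-step projections $h_k$, and $T > K$; following the paper, distractors are drawn uniformly from the same utterance (the proposal distribution $p_n$), but the shapes and sampling details here are illustrative, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def wav2vec_loss(z, c, step_proj, K=12, n_negatives=10):
    """z: latents (B, T, D); c: context vectors (B, T, D);
    step_proj: list of K nn.Linear(D, D) modules, one h_k per step size.
    Assumes T > K so every step size has at least one prediction."""
    B, T, D = z.shape
    loss = z.new_zeros(())
    for k in range(1, K + 1):
        pred = step_proj[k - 1](c[:, : T - k])         # h_k(c_i), (B, T-k, D)
        pos = z[:, k:]                                 # true z_{i+k}
        pos_logit = (pred * pos).sum(-1)               # (B, T-k)
        # Distractors z~ sampled uniformly from the same utterance (p_n).
        idx = torch.randint(0, T, (B, (T - k) * n_negatives, 1), device=z.device)
        neg = z.gather(1, idx.expand(-1, -1, D)).view(B, T - k, n_negatives, D)
        neg_logit = (pred.unsqueeze(2) * neg).sum(-1)  # (B, T-k, n_negatives)
        # Binary contrastive loss: positives vs. distractors, summed over k.
        loss = loss - F.logsigmoid(pos_logit).mean() - F.logsigmoid(-neg_logit).mean()
    return loss
```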
Embeds the audio signal in a latent space.
Composed of a five-layer CNN.
Combines multiple timesteps of the encoder to obtain contextualized representations, i.e., language-model-style modeling of $p(z_{i+k} \mid z_i, \ldots, z_{i-r})$.
Composed of a nine-layer CNN.
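As a rough PyTorch sketch, the two stacks could look like the following; the layer counts, kernel sizes (10, 8, 4, 4, 4), and strides (5, 4, 2, 2, 2) follow the paper, while the normalization and padding details are simplified assumptions.

```python
import torch.nn as nn

def conv_block(c_in, c_out, kernel, stride):
    return nn.Sequential(nn.Conv1d(c_in, c_out, kernel, stride=stride),
                         nn.GroupNorm(1, c_out), nn.ReLU())

# Encoder network f: raw waveform -> latent z (five layers).
encoder = nn.Sequential(*[
    conv_block(1 if i == 0 else 512, 512, k, s)
    for i, (k, s) in enumerate([(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)])])

# Context network g: latent z -> context c (nine layers, kernel 3, stride 1).
context = nn.Sequential(*[conv_block(512, 512, 3, 1) for _ in range(9)])
```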
This paper builds directly on the wav2vec above: a quantization module is inserted between the encoder and the context network to discretize the encoder features, so that methods from NLP can be applied. Specifically, two quantization strategies are used: Gumbel-Softmax and online k-means clustering (as in VQ-VAE).
The paper also discusses the role of a grouped quantization strategy in avoiding codebook mode collapse. This work is the foundation of the later wav2vec 2.0.
This paper shows for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
Encoder: a seven-layer CNN;
Context network: a Transformer with convolutional relative positional embeddings; the Base model has 12 layers and the Large model has 24.
Quantization module: for self-supervised training, the output of the feature encoder $z$ is discretized to a finite set of speech representations via product quantization.
Gumbel-Softmax is a reparameterization trick that makes it possible to sample according to given probabilities while keeping the computation graph differentiable. Reference: Gumbel Softmax 是什么? - 知乎.
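A small demonstration of the trick, followed by a rough sketch of how it can drive the product quantizer. The group/entry counts ($G = 2$, $V = 320$) are the paper's values; the linear projection and dimensions in `ProductQuantizer` are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Gumbel-Softmax in isolation: add Gumbel(0, 1) noise to the logits, then take
# a temperature-scaled softmax. With hard=True the forward pass is one-hot
# while gradients flow through the soft relaxation (straight-through).
logits = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)
sample = F.gumbel_softmax(logits, tau=0.5, hard=True)
sample.max().backward()
print(sample, logits.grad)      # one-hot sample, yet logits receive gradients

# Sketch of product quantization driven by Gumbel-Softmax, as in wav2vec 2.0.
class ProductQuantizer(nn.Module):
    def __init__(self, dim=512, groups=2, entries=320):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.codebook = nn.Parameter(torch.randn(groups, entries, dim // groups))
        self.to_logits = nn.Linear(dim, groups * entries)

    def forward(self, z, tau=2.0):
        B, T, D = z.shape
        logits = self.to_logits(z).view(B, T, self.groups, self.entries)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True)  # pick codewords
        q = torch.einsum('btge,ged->btgd', onehot, self.codebook)
        return q.reshape(B, T, D)  # concatenate the selected group codewords
```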
Pre-training:
Masking: to pre-train the model, a certain proportion of time steps in the latent feature-encoder space is masked, similar to masked language modeling in BERT: a proportion of the feature-encoder outputs (time steps) is masked before they are fed to the context network, and the masked positions are replaced with a trained feature vector shared across all masked time steps (see the sketch after this subsection).
Loss function:
\[\mathcal{L} = \mathcal{L}_m + \mathcal{L}_d\]
where $\mathcal{L}_m$ is the contrastive loss, measuring how accurately the masked positions are predicted, and $\mathcal{L}_d$ is the diversity loss, which encourages use of all codebook entries by maximizing the entropy of the codeword distribution; without it, the contrastive loss could collapse with every representation mapped to the same vector.
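Minimal sketches of the pieces just described: BERT-style span masking with a shared learned mask vector (the start probability 0.065 and span length 10 are the paper's defaults), the contrastive term $\mathcal{L}_m$ over masked steps, and the diversity term $\mathcal{L}_d$. Function names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def mask_timesteps(x, mask_emb, p=0.065, span=10):
    """x: feature-encoder outputs (B, T, D); mask_emb: learned (D,) vector."""
    B, T, D = x.shape
    starts = torch.rand(B, T) < p                     # sample span start indices
    mask = torch.zeros(B, T, dtype=torch.bool)
    for b in range(B):
        for s in torch.nonzero(starts[b]).flatten():
            mask[b, s:s + span] = True                # extend each start by `span`
    x = x.clone()
    x[mask] = mask_emb                                # shared vector for all masked steps
    return x, mask

def contrastive_loss(c_masked, q_pos, q_negs, kappa=0.1):
    """c_masked: (N, D) context outputs at masked steps; q_pos: (N, D) true
    quantized targets; q_negs: (N, K, D) distractors from other masked steps."""
    cand = torch.cat([q_pos.unsqueeze(1), q_negs], dim=1)           # (N, 1+K, D)
    sim = F.cosine_similarity(c_masked.unsqueeze(1), cand, dim=-1) / kappa
    target = sim.new_zeros(len(sim), dtype=torch.long)              # index 0 = positive
    return F.cross_entropy(sim, target)

def diversity_loss(probs):
    """probs: (B, T, G, V) soft codeword probabilities from the quantizer."""
    avg = probs.mean(dim=(0, 1))                        # (G, V) average usage
    entropy = -(avg * torch.log(avg + 1e-7)).sum(-1)    # entropy per codebook
    G, V = avg.shape
    return (G * V - torch.exp(entropy).sum()) / (G * V) # 0 when usage is uniform
```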
The significance of self-supervised learning:
Differences between self-supervision in speech and in CV/NLP:
Pre-training uses a masked prediction task.
The masked prediction task raises two key questions: how to mask, and where to compute the prediction loss.
How to mask: the scheme from wav2vec 2.0 is adopted: span start indices are sampled at random, and each span is then extended by $l$ steps.
Where to compute the loss: without loss of generality, the prediction loss can combine a term over unmasked time steps and a term over masked time steps. When the loss is computed only over unmasked steps, the model resembles the acoustic model in a hybrid speech recognition system; when only the masked steps are predicted, the model must infer the masked content from the surrounding context, analogous to a language model. HuBERT computes the loss only over the masked steps (sketched in the code after this discussion).
In the other extreme with $\alpha = 1$, the loss is only computed over the masked timesteps where the model has to predict the targets corresponding to the unseen frames from context, analogous to language modeling. It forces the model to learn both the acoustic representation of unmasked segments and the long-range temporal structure of the speech data.

HuBERT produces its pseudo-labels by ensembling multiple clustering models, using k-means for clustering; the quantization tricks from the wav2vec papers can also be applied here.
Clustering is initially performed on MFCC features. In each subsequent iteration, clustering is performed on features taken from an intermediate layer of the previous iteration's encoder.
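A sketch of the masked-only ($\alpha = 1$) objective with k-means cluster IDs as targets; names and shapes are assumptions.

```python
import torch.nn.functional as F

def hubert_loss(logits, cluster_ids, mask):
    """logits: (B, T, C) predictions over C clusters; cluster_ids: (B, T)
    frame-level pseudo-labels; mask: (B, T) bool, True at masked frames."""
    return F.cross_entropy(logits[mask], cluster_ids[mask])
```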
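And a sketch of the iterative target-refinement loop. The cluster counts (100 on MFCCs, then 500) and the layer choice follow the Base-model recipe in the paper, while `mfcc_features`, `train_hubert`, `extract_features`, and `corpus` are hypothetical placeholders.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def make_targets(feature_list, n_clusters):
    """Cluster frame-level features and return per-frame pseudo-labels."""
    km = MiniBatchKMeans(n_clusters=n_clusters).fit(np.concatenate(feature_list))
    return [km.predict(f) for f in feature_list]

# Iteration 0: targets from MFCC clustering (100 clusters for HuBERT-Base).
labels = make_targets(mfcc_features, n_clusters=100)  # hypothetical features
model = train_hubert(corpus, labels)                  # hypothetical trainer

# Iteration 1: re-cluster on an intermediate layer of the previous model
# (500 clusters on 6th-transformer-layer features in the Base recipe).
feats = [extract_features(model, utt, layer=6) for utt in corpus]  # hypothetical
labels = make_targets(feats, n_clusters=500)
model = train_hubert(corpus, labels)
```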
During fine-tuning, the convolutional waveform audio encoder parameters are fixed. Like wav2vec 2.0, we introduce a freeze-step hyperparameter to control how many fine-tuning steps the transformer parameters are fixed, and only the new softmax matrix is trained.
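A sketch of that schedule, assuming the model exposes `conv_encoder` and `transformer` submodules and is fine-tuned with a CTC head; all names here are assumptions, not the actual fairseq interface.

```python
def finetune(model, loader, optimizer, ctc_loss, freeze_steps):
    # The convolutional waveform encoder stays frozen for all of fine-tuning.
    for p in model.conv_encoder.parameters():
        p.requires_grad = False
    for step, (audio, transcript) in enumerate(loader):
        # For the first `freeze_steps` updates only the new softmax/CTC head
        # trains; afterwards the transformer is unfrozen as well.
        for p in model.transformer.parameters():
            p.requires_grad = step >= freeze_steps
        loss = ctc_loss(model(audio), transcript)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```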
Both wav2vec 2.0 and HuBERT use a masked prediction task for self-supervised pre-training. They differ in their prediction targets: wav2vec 2.0 contrasts the context output against quantized latent representations learned jointly during training, whereas HuBERT classifies masked frames into cluster assignments generated offline and refined across iterations.
wav2vec: Unsupervised Pre-training for Speech Recognition, INTERSPEECH 2019. ↩
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS 2020. ↩
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021. ↩