Given an audio signal as input, we optimize our model to predict future samples from a given signal context.
We train the model to distinguish a sample $z_{i+k}$ that is $k$ steps in the future from distractor samples $\widetilde{z}$ drawn from a proposal distribution $p_n$, by minimizing the contrastive loss for each step.
We optimize the loss $L = \sum_{k=1}^{K}L_k$, summing (1) over different step sizes.
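As a rough illustration of the objective above, the sketch below computes a per-step contrastive loss and sums it over step sizes. It assumes the form given in the wav2vec paper, $L_k = -\sum_i \big(\log\sigma(z_{i+k}^\top h_k(c_i)) + \lambda\,\mathbb{E}_{\widetilde{z}\sim p_n}[\log\sigma(-\widetilde{z}^\top h_k(c_i))]\big)$. The function names, the uniform proposal over frames of the same sequence, and the plain-list vectors are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def step_loss(z, pred, k, num_neg, rng, lam=1.0):
    """Contrastive loss L_k for a single step size k.

    z:    list of T encoder outputs z_i (each a list of floats)
    pred: list of T vectors standing in for h_k(c_i) (assumed precomputed)
    """
    T = len(z)
    total = 0.0
    for i in range(T - k):
        # Positive term: the true sample k steps in the future.
        total += log_sigmoid(dot(z[i + k], pred[i]))
        # Negative term: distractors drawn uniformly from the same sequence
        # (assumed proposal distribution), averaged to estimate the expectation.
        negatives = [z[rng.randrange(T)] for _ in range(num_neg)]
        total += lam * sum(log_sigmoid(-dot(n, pred[i])) for n in negatives) / num_neg
    return -total


def total_loss(z, preds_by_k, num_neg, rng):
    # L = sum over k of L_k; preds_by_k maps each step size k to its predictions.
    return sum(step_loss(z, preds_by_k[k], k, num_neg, rng) for k in preds_by_k)
```

Since every term is a log-sigmoid, each $L_k$ is strictly positive, and it shrinks as the true future sample scores higher than the distractors.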
Encoder network: embeds the audio signal in a latent space.
Context network: combines multiple timesteps of the encoder to obtain contextualized representations. In other words, like a language model, it models $p(z_{i+k} \vert z_i \ldots z_{i-r})$.
Quantization module: For self-supervised training we discretize the output of the feature encoder $z$ to a finite set of speech representations via product quantization.
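Gumbel softmax is a reparameterization trick: it lets us sample according to given probabilities while keeping the computation graph differentiable. Reference: "Gumbel Softmax 是什么?" (What is Gumbel Softmax?), 知乎 (Zhihu).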
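To make the trick concrete, here is a minimal sketch of Gumbel-softmax sampling in plain Python. The function name `gumbel_softmax` and the temperature argument `tau` are illustrative choices; in an autodiff framework the returned soft vector is what gradients would flow through, which this stdlib-only sketch does not model.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return a relaxed one-hot sample over len(logits) categories."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    s = sum(exps)
    return [e / s for e in exps]
```

The index of the largest entry is distributed according to softmax of the logits (the Gumbel-max property), and a lower `tau` pushes the output closer to a one-hot vector.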
Masking: To pre-train the model we mask a certain proportion of time steps in the latent feature encoder space, similar to masked language modeling in BERT. We mask a proportion of the feature encoder outputs, or time steps, before feeding them to the context network and replace them with a trained feature vector shared between all masked time steps.
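\[\mathcal{L}=\mathcal{L}_m + \mathcal{L}_d\]
where $\mathcal{L}_m$ is the contrastive loss, which measures how accurately the masked positions are predicted, and $\mathcal{L}_d$ is the diversity loss, which maximizes the entropy of codebook usage to encourage the use of different codebook vectors. This prevents all representations from taking the same vector, which would make the contrastive loss collapse.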
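The diversity term can be sketched as a negative entropy of the averaged codebook distribution. The block below assumes the form used in the wav2vec 2.0 paper, $\mathcal{L}_d = \frac{1}{GV}\sum_{g=1}^{G}\sum_{v=1}^{V}\bar{p}_{g,v}\log\bar{p}_{g,v}$, with $G$ codebook groups of $V$ entries each; the nested-list layout `probs[g][b][v]` and the function name are assumptions made for illustration.

```python
import math


def diversity_loss(probs):
    """Negative entropy of average codebook usage, normalized by G * V.

    probs[g][b][v]: softmax probability of entry v in group g for item b.
    """
    G = len(probs)
    V = len(probs[0][0])
    total = 0.0
    for group in probs:
        B = len(group)
        # Average the per-item distributions to get usage of each entry.
        avg = [sum(row[v] for row in group) / B for v in range(V)]
        # Sum of p * log(p) is the negative entropy; skip zero entries.
        total += sum(p * math.log(p) for p in avg if p > 0)
    return total / (G * V)
```

Minimizing this value maximizes the entropy: if every item picks the same entry the loss is 0, while evenly spread usage gives the lowest possible value, $-\log(V)/V$.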
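How masking is done: following the scheme from wav2vec 2.0, starting time steps are sampled at random and each mask is then extended over the next $l$ steps.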
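A minimal sketch of this span masking follows. It assumes that a proportion `p` of the time steps is sampled without replacement as start indices and that spans may overlap, as described for wav2vec 2.0; the function name, return values, and parameter names are illustrative.

```python
import random


def span_mask(num_steps, p, span, rng):
    """Return (mask, starts): mask[t] is True when time step t is masked."""
    num_starts = int(round(p * num_steps))
    # Sample start indices without replacement.
    starts = rng.sample(range(num_steps), num_starts)
    mask = [False] * num_steps
    for s in starts:
        # Extend each start over `span` steps, clipped at the sequence end.
        for t in range(s, min(s + span, num_steps)):
            mask[t] = True
    return mask, sorted(starts)
```

Because spans can overlap or run off the end of the sequence, the number of masked steps is at most `num_starts * span` and usually smaller.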
In the other extreme with $\alpha = 1$, the loss is only computed over the masked timesteps where the model has to predict the targets corresponding to the unseen frames from context, analogous to language modeling. It forces the model to learn both the acoustic representation of unmasked segments and the long-range temporal structure of the speech data.

HuBERT creates its pseudo-labels with an ensemble of several clustering models. The clustering itself uses k-means. The quantization technique used in wav2vec 2.0 (product quantization) can also be applied here.
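The role of $\alpha$ can be sketched as a weighted sum of cross-entropy over masked and unmasked frames, $L = \alpha L_m + (1-\alpha) L_u$, which is the form given in the HuBERT paper. The function below is a hypothetical illustration: `log_probs`, `targets`, and `mask` are assumed inputs, with `targets` standing in for the k-means cluster ids used as pseudo-labels.

```python
import math


def hubert_loss(log_probs, targets, mask, alpha):
    """Weighted cross-entropy over masked (L_m) and unmasked (L_u) frames.

    log_probs[t][c]: predicted log-probability of cluster c at frame t
    targets[t]:      pseudo-label (cluster id) of frame t
    mask[t]:         True if frame t was masked in the input
    """
    frames = range(len(targets))
    loss_m = -sum(log_probs[t][targets[t]] for t in frames if mask[t])
    loss_u = -sum(log_probs[t][targets[t]] for t in frames if not mask[t])
    return alpha * loss_m + (1 - alpha) * loss_u
```

With `alpha=1` only the masked frames contribute, and with `alpha=0` only the unmasked ones do.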
During fine-tuning, the convolutional waveform audio encoder parameters are fixed. Like wav2vec 2.0, we introduce a freeze-step hyperparameter to control how many fine-tuning steps the transformer parameters are fixed, and only the new softmax matrix is trained.
wav2vec 2.0 and HuBERT both use a masked prediction task for self-supervised pre-training. The differences between the two are:
wav2vec: Unsupervised Pre-training for Speech Recognition, INTERSPEECH 2019
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS 2020
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 2021