Scaling Speech Decoding With Self-Supervised Learning (2024)

Dulhan Jayalath1, Gilad Landau1, Brendan Shillingford3,
Mark Woolrich2 & Oiwi Parker Jones1
{1PNPL, 2OHBA}, University of Oxford & 3Google DeepMind
{dulhan, oiwi}@robots.ox.ac.uk

Abstract

The past few years have produced a series of spectacular advances in the decoding of speech from brain activity. The engine of these advances has been the acquisition of labelled data, with increasingly large datasets acquired from single subjects. However, participants exhibit individual differences, such as in anatomy, and datasets use varied scanners and task designs. As a result, prior work has struggled to leverage data from multiple subjects, multiple datasets, multiple tasks, and unlabelled datasets. In turn, the field has not benefited from the rapidly growing number of open neural data repositories to exploit large-scale data and deep learning. This gap exists for all neural data, but especially for magnetoencephalography (MEG), where the scale of individual datasets has not yet caught up with other modalities. To address this, we develop a set of neuroscience-inspired self-supervised objectives, together with a neural architecture, for representation learning from heterogeneous and unlabelled neural recordings. Experimental results with MEG show that representations learned with these objectives scale with data, generalise across subjects, datasets, and tasks, outperform the raw input representation, and even surpass comparable self-supervised approaches. In addition, we set new benchmarks for two foundational speech decoding tasks. Collectively, these methods now unlock the potential for training speech decoding models with orders of magnitude more existing data.

[Figure 1]

1 Introduction

In his Bitter Lesson, Richard Sutton argues that a major conclusion of 70 years of AI research is that general methods exploiting large-scale computation will outperform model-based approaches as the availability of compute increases (Sutton, 2019). In line with this, the generality of deep learning, via statistical learning from ever bigger datasets, has allowed the field to leverage computation in a way that appears to scale arbitrarily, leading to astounding advances across a diverse set of domains (Jumper et al., 2021; Caron et al., 2021; OpenAI, 2023; Radford et al., 2023).

In the domain of brain data, and of tasks like speech decoding, the bitter lesson has not yet been fully assimilated. State-of-the-art brain-computer interfaces (BCIs) have tried to scale up labelled datasets for individual subjects, using either invasive (Moses et al., 2021; Willett et al., 2023) or non-invasive brain recordings (Tang et al., 2023), mapping these to transcripts of attempted or imagined speech. Yet, a number of obstacles to scale remain. With few exceptions at present, e.g. Défossez et al. (2023), speech decoding models tend not to train on data from more than one subject. Moreover, they do not combine data from multiple datasets and in general do not utilise unlabelled data, or data from diverse tasks. Thus the size of training data has been limited to how much can be acquired for a single subject, and data from other subjects, or from the growing number of public data repositories, has not been leveraged. There are many reasons for these limitations; individual brains and data from different neuroimaging scanners differ, for example. But overcoming these limitations, as has begun to happen in neighbouring sub-fields, such as Jiang et al. (2024), holds the promise of training models on collective, internet-scale data.

While neuroimaging modalities such as electroencephalography (EEG) are more abundant, MEG may be a better modality for decoding as it provides a richer signal (Lopes da Silva, 2013; Hall et al., 2014). Given the scarcity of speech-labelled MEG data and the relative abundance of other MEG data, self-supervised learning (SSL) appears promising as it is an avenue for domains where labels are rare or hard to obtain (Balestriero et al., 2023). But the scale of public MEG data, while large, is still not at the volume behind breakthroughs in self-supervised image and natural language processing, let alone that of EEG. Thus, SSL methods for MEG need to be highly data-efficient. Pretext tasks are one such method, in which domain-specific self-supervised tasks are used to pre-train a model on unlabelled data by generating implicit training labels through transformations of the input, in order to help a downstream task. We develop a set of these tasks, informed by advances in neuroscience, for learning with unlabelled brain data (Figure 1), and design an architecture for processing continuous multi-sensor neuroimaging signals, which we train using our pretext tasks. In order to scale existing non-invasive datasets, we provide a unified method that allows us to leverage data from other experiments that do not have the same labels (by treating them as unlabelled) and that come from different subjects and neuroimaging scanners. We evaluate the representations learned with our approach on heard speech datasets acquired with non-invasive MEG, setting the baselines for speech detection and voicing classification on this data. The results not only demonstrate that scaling with unlabelled data works in speech decoding, but also show that these representations can generalise across datasets, tasks, and even novel subjects for the first time. Our main contributions are:

  • A set of domain-specific self-supervised pretext tasks for representation learning that can scale speech decoding over multiple subjects, multiple studies, and unlabelled data;

  • A data-efficient neural architecture for learning these self-supervised objectives and training downstream speech decoding from brain data; and

  • A comprehensive experimental evaluation, using multiple times the volume of data in prior work, that verifies the above claims and additionally provides evidence for the existence of scaling laws when pre-training models with unlabelled MEG recordings.

2 Related Work

Prior work in speech decoding has focused almost entirely on supervised learning with decoding models that typically do not generalise across participants or experiments. This is true both in recent state-of-the-art invasive studies (Moses et al., 2021; Metzger et al., 2023; Willett et al., 2023; Chen et al., 2024) and non-invasive studies (Tang et al., 2023). These prior works have scaled up the experimental data collected within individual subjects, but are unable to leverage data from other subjects and experiments. Focusing on semantic rather than phonetic decoding, the method developed by Tang et al. (2023) is remarkable for showing an ability to generalise across labelled task data when listening to speech, imagining speech, or even watching videos. They do not, however, leverage unlabelled data and are unable to show generalisation between subjects.

Specific studies into the limitations of generalising models between subjects show that while performance decreases on average when subjects are pooled, there are exceptions. Csaky et al. (2022) find that a subset of individuals perform better when evaluated with a group-level model than with individual models. Exploiting audio data in a multi-modal framework, Défossez et al. (2023) show that decoding performance improves for a segment identification task as data from multiple subjects listening to connected speech are aggregated. However, they do not demonstrate the ability to generalise to novel subjects and must retrain their model for new datasets. Moreover, although they repeat the result within two MEG and two EEG datasets, Défossez et al. (2023) do not show any improvements from pooling data across datasets. Their method is also unable to incorporate data without corresponding audio labels, and so they do not combine data from studies with other kinds of labels either; cf. Wang & Ji (2022); Duan et al. (2023); Wang et al. (2023a). Unfortunately, the first two of these papers included a bug in their evaluation code. As such, their methods may perform no better than a baseline that provides pure noise inputs to the model (Jo et al., 2024).

In general, speech decoding has centred on different kinds of speech: listening, imagining, speaking out loud, and, for paralysed patients, attempting to speak aloud. We focus here on listening because it is easier to decode than imagined speech (e.g. Martin et al. (2014)). There is also some evidence of a functional overlap between listening and imagined speech representations in the brain (Wandelt et al., 2024), though we acknowledge that the question of overlap has been contested (Langland-Hassan & Vicente, 2018). Prior work has also investigated the two tasks that we focus on here (Dash et al., 2020; Moses et al., 2021; Gwilliams et al., 2023). The first of these, speech detection, formed the backbone of Moses et al. (2021), where a speech detection model was trained and subsequently used to detect isolated words, which were in turn classified and checked against a language model to generate acceptable sentences. Hamilton et al. (2018) further elaborated on the neural anatomy underlying speech detection, categorising neural responses in the superior temporal gyrus (STG) to sustained speech and speech onset. As for the second task, voicing classification, Gwilliams et al. (2023) used this task as a proxy for phoneme classification, as pooling phonemes into unvoiced or voiced segments (e.g. /p t k f s/ vs /b d g v z/) improves data efficiency. We note that voicing classification and speech detection are related tasks, as voicing is a subclass of speech. This makes them foundational for building hierarchical speech decoding pipelines similar to prior surgical decoding work (Moses et al., 2021; Willett et al., 2023).

In the computer vision literature, there have been a plethora of methods that use self-supervised pretext tasks for representation learning (Agrawal et al., 2015; Doersch et al., 2015; Noroozi & Favaro, 2016; Larsson et al., 2016; Zhang et al., 2016; Gidaris et al., 2018). Until now, similar approaches have not translated to the brain decoding literature. However, prior work has used other methods to leverage unlabelled brain data. For example, Jiang et al. (2024) succeeded in cross-dataset and cross-task generalisation, using a transformer with tokenised brain signals and a masked token prediction objective. Although this work combined unlabelled datasets, their results studied simpler non-speech tasks with EEG. Wang et al. (2023b) used a similar approach, replacing tokens with contextualised embeddings of time-frequency input representations. Their impressive speech detection results were achieved with invasive neural recordings, which are comparatively rare and thus have much less potential to scale than non-invasive data. Perhaps the closest work to ours in terms of unlocking scaling with neural data is BIOT (Yang et al., 2023). This is a self-supervised architecture for encoding bio-signals that is similarly capable of training with different datasets, labels, and varied numbers of sensors. Like the previous works, the approach tokenises signals for a transformer architecture, but instead of a masked loss it uses a contrastive pre-training objective. While theoretically supporting MEG, Yang et al. (2023) evaluate BIOT on simple ECG/EEG tasks rather than addressing the comparatively complex challenge of speech decoding with MEG data.

3 Method

To encode continuous neuroimaging data, we introduce a neural architecture to embed heterogeneous brain signals. We leverage this architecture for self-supervised learning from unlabelled MEG data using a set of pretext tasks designed to generate generalisable brain representations for speech decoding. With this approach, we hope to replicate similar successes in computer vision (Gidaris et al., 2018; Chen et al., 2020).

3.1 Network Architecture

Our two-stage neural network architecture (Figure 2) uses pretext tasks in pre-training to learn a representation with unlabelled brain data. Then, the fine-tuning stage uses this representation to learn the downstream task by training with labelled data.

[Figure 2]

We divide recordings into windows of length $w$ seconds, i.e. $t$ samples. At train time, each batch of windows is standardised such that each sensor has zero mean and unit variance. The network takes the standardised sample windows as input. To combine heterogeneous datasets, which have different numbers of sensors $S$, we apply a dataset-conditional linear layer to the sensor dimension, projecting the signal into a shared space with dimension $d_{\mathrm{shared}}$. Then, to encode the signal, we construct a wave-to-wave convolutional encoder architecture, the cortex encoder, inspired by work in neural audio codecs (Zeghidour et al., 2022; Défossez et al., 2022). Specifically, our convolutional encoder adapts the implementation of the SEANet architecture (Tagliasacchi et al., 2020) used in Défossez et al. (2022), which we describe here and as part of Figure 2. As these codecs typically operate on mono audio signals in $\mathbb{R}^{1\times t}$, while our signals are in $\mathbb{R}^{d_{\mathrm{shared}}\times t}$, we increase the input convolutional channel dimension from $1$ to $d_{\mathrm{shared}}$ while also inflating the channel dimensions of subsequent convolutions. We refer to the output dimension of embeddings from this backbone as $d_{\mathrm{backbone}}$. Thus, the backbone takes as input a window in $\mathbb{R}^{S\times t}$ and encodes it into $\tau$ embeddings (where $\tau < t$), each of dimension $d_{\mathrm{backbone}}$ (i.e. an output in $\mathbb{R}^{d_{\mathrm{backbone}}\times\tau}$).
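To make the data flow concrete, here is a minimal PyTorch sketch of a dataset-conditional sensor projection followed by a strided 1-D convolutional encoder. The class names, channel widths, kernel sizes, strides, and the example sensor counts are illustrative assumptions, not the exact SEANet-based implementation.

```python
import torch
import torch.nn as nn

class DatasetConditionalProjection(nn.Module):
    """One linear map per dataset, projecting that dataset's S sensors into a
    shared d_shared-dimensional space (sensor counts here are placeholders)."""
    def __init__(self, sensor_counts: dict, d_shared: int):
        super().__init__()
        self.proj = nn.ModuleDict({name: nn.Linear(s, d_shared)
                                   for name, s in sensor_counts.items()})

    def forward(self, x: torch.Tensor, dataset: str) -> torch.Tensor:
        # x: (batch, S, t) -> (batch, d_shared, t)
        return self.proj[dataset](x.transpose(1, 2)).transpose(1, 2)


class CortexEncoder(nn.Module):
    """Strided 1-D convolutional encoder mapping (batch, d_shared, t) windows
    to (batch, d_backbone, tau) embedding sequences with tau < t."""
    def __init__(self, d_shared: int = 272, d_backbone: int = 512,
                 strides: tuple = (2, 2, 4)):
        super().__init__()
        layers, c = [], d_shared
        for s in strides:
            layers += [nn.Conv1d(c, 2 * c, kernel_size=2 * s, stride=s, padding=s),
                       nn.ELU()]
            c *= 2
        layers.append(nn.Conv1d(c, d_backbone, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Example: a batch of 1 s windows (250 Hz) from a hypothetical 306-sensor scanner.
proj = DatasetConditionalProjection({"cam_can": 306, "armeni": 269}, d_shared=272)
encoder = CortexEncoder(d_shared=272, d_backbone=512)
x = torch.randn(8, 306, 250)            # standardised window batch
z = encoder(proj(x, "cam_can"))         # (8, d_backbone, tau)
```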

Just as speakers have different voices, neural responses between subjects have different characteristics. Consequently, individual variation leads to models that do not generalise well across subjects (Csaky et al., 2022). In the speech literature, models include speaker conditioning to account for these differences (Gibiansky et al., 2017). We take a similar approach by introducing subject conditioning. Zeghidour et al. (2022) find that conditioning is equally effective at the encoder bottleneck as in other stages of the model. Hence, we place ours at the cortex encoder bottleneck for simplicity. We use feature-wise linear modulation (FiLM) (Perez et al., 2018) as our conditioning method.
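A minimal sketch of FiLM-style subject conditioning at the bottleneck follows, assuming one learned (gamma, beta) pair per subject; the embedding-table formulation and sizes are illustrative rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class SubjectFiLM(nn.Module):
    """Feature-wise linear modulation (Perez et al., 2018): each subject ID
    selects a (gamma, beta) pair that scales and shifts the channel dimension
    of the bottleneck representation."""
    def __init__(self, num_subjects: int, d_backbone: int):
        super().__init__()
        self.gamma = nn.Embedding(num_subjects, d_backbone)
        self.beta = nn.Embedding(num_subjects, d_backbone)
        nn.init.ones_(self.gamma.weight)    # start as the identity transform
        nn.init.zeros_(self.beta.weight)

    def forward(self, z: torch.Tensor, subject_id: torch.Tensor) -> torch.Tensor:
        # z: (batch, d_backbone, tau); subject_id: (batch,)
        g = self.gamma(subject_id).unsqueeze(-1)   # (batch, d_backbone, 1)
        b = self.beta(subject_id).unsqueeze(-1)
        return g * z + b

film = SubjectFiLM(num_subjects=641, d_backbone=512)
z = torch.randn(8, 512, 17)
z_cond = film(z, torch.randint(0, 641, (8,)))
```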

Following the advice of Balestriero et al. (2023, Section 3.2), we use a two-layer feedforward projector to alleviate misalignment between our pretext and downstream tasks in the representation. After the projector, linear classifiers make predictions for each of the pretext tasks. When fine-tuning, we train a linear decoder for a downstream task on top of the pre-trained representation, which remains frozen. Thus, we backpropagate only through the classifier. A trainable dataset-specific linear layer can be introduced for a novel dataset.
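A sketch of this fine-tuning recipe is shown below, assuming PyTorch modules for the backbone and decoder; the optimiser choice and learning rate are placeholders, not the paper's settings.

```python
from typing import Optional

import torch
import torch.nn as nn

def build_finetune_optimizer(backbone: nn.Module, decoder: nn.Module,
                             new_input_proj: Optional[nn.Module] = None,
                             lr: float = 1e-3) -> torch.optim.Optimizer:
    """Freeze the pre-trained backbone and train only the linear decoder
    (and, for a novel dataset, a fresh dataset-specific input projection)."""
    for p in backbone.parameters():
        p.requires_grad = False                 # representation stays frozen
    params = list(decoder.parameters())
    if new_input_proj is not None:              # trainable layer for a novel dataset
        params += list(new_input_proj.parameters())
    return torch.optim.AdamW(params, lr=lr)
```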

For speech detection, our classifier makes a prediction for each individual embedding. For voicing classification, where there is only one label for each sample window, the $\tau$ embeddings (jointly an $\mathbb{R}^{d_{\mathrm{backbone}}\times\tau}$ tensor) are flattened into a single vector representing the entire window. This is the input to the voicing classifier and is referred to as full epoch decoding in the neuroimaging literature (Csaky et al., 2023).
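The two decoding heads might then look as follows; the sizes are carried over from the encoder sketch above and are illustrative.

```python
import torch
import torch.nn as nn

d_backbone, tau = 512, 17   # illustrative sizes from the encoder sketch above

# Speech detection: one prediction per embedding (speech vs. silence at each step).
detection_head = nn.Linear(d_backbone, 1)

# Voicing classification: the tau embeddings of a window are flattened into a
# single vector ("full epoch decoding") and classified as voiced vs. voiceless.
voicing_head = nn.Linear(d_backbone * tau, 1)

z = torch.randn(8, d_backbone, tau)                     # frozen backbone output
speech_logits = detection_head(z.transpose(1, 2))       # (8, tau, 1)
voicing_logits = voicing_head(z.flatten(start_dim=1))   # (8, 1)
```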

3.2 Pretext Tasks

Our pretext tasks are unsupervised objectives that aim to learn generalisable features for speech decoding. Since different datasets use varied numbers of sensors, we construct these tasks with labels that are agnostic to the number of sensors in the signal.

Band prediction. In the literature, neural responses can be segmented into functional frequency bands (Giraud & Poeppel, 2012; Piai et al., 2014; Mai et al., 2016). Delta ($\delta$) waves (0.1–4 Hz) are commonly associated with the rhythmic structure of heard speech (Luo et al., 2010). Theta ($\theta$) waves (4–8 Hz) reliably track (Luo & Poeppel, 2007) and phase-lock to the amplitude envelope of heard sentences (Peelle et al., 2012). Alpha ($\alpha$) waves (8–12 Hz) relate to attentional processes and the inhibition of irrelevant information, helping to focus on relevant speech signals (Strauß et al., 2015). Beta ($\beta$) waves (12–30 Hz) are implicated in top-down predictive coding (Bressler & Richter, 2015), which affects lexical processing (Weiss & Mueller, 2012). Gamma ($\gamma$) waves (30–70 Hz) occur with higher cognitive functions (e.g. memory, learning, reasoning, and planning) (Fries, 2009; Buzsáki & Wang, 2012). Finally, High Gamma ($\gamma^{\mathrm{high}}$) waves (>70 Hz) have been linked specifically to speech detection (Hamilton et al., 2018) and phonemic feature classification in the STG (Mesgarani et al., 2014), as well as phonemic feature classification in the ventral sensorimotor cortex (vSMC) (Cheung et al., 2016). As High Gamma is a relatively wide band, we split it into two sub-bands: Lower High Gamma ($\gamma^{\mathrm{high}}_{\mathrm{lower}}$, 70–100 Hz) and Upper High Gamma ($\gamma^{\mathrm{high}}_{\mathrm{upper}}$, 100–150 Hz).

To learn representations that can distinguish between these bands, our band prediction task applies a band-stop filter for a randomly selected band $\omega$ to the sample $x$, passes the filtered sample $x^{\omega'}$ through the network backbone $g$ and the corresponding linear predictor $f_{\mathrm{band}}$, and requires the network to predict the frequency band that was rejected. This yields the loss

\mathcal{L}_{\mathrm{band}} = \sum_{x \in B} \mathcal{L}_{\mathrm{CE}}\big(f_{\mathrm{band}}(g(x^{\omega'})),\ \omega\big),    (1)

where $B$ is a mini-batch of samples, $\omega \in \{\delta, \theta, \alpha, \beta, \gamma, \gamma^{\mathrm{high}}_{\mathrm{lower}}, \gamma^{\mathrm{high}}_{\mathrm{upper}}\}$, and $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss, as this is a multi-class classification task.

Phase shift prediction. Phase coupling between networks of neuron populations is necessary for coordinating brain activity (Fries, 2005; Vidaurre et al., 2018). Thus, since phase often synchronises between communicating brain areas, phase coupling between spatially distant sensors is likely to be a useful feature. Supporting this insight, recent work (Jiang et al., 2024) also finds phase to be an essential component of the signal.

To learn representations that encode phase differences between brain areas, this task applies a discrete uniform random phase shift $\phi \in \{0, \frac{\pi}{8}, \frac{\pi}{4}, \frac{3\pi}{8}, \frac{\pi}{2}, \frac{5\pi}{8}, \frac{3\pi}{4}, \frac{7\pi}{8}\}$ to a uniformly randomly selected proportion $\rho$ of the sensors. Applying this shift to random sensors is critical since sensors are placed in different positions, capturing different regions of the brain. Uniform random selection ensures differences between any two regions of the brain are represented. The objective of this task is to predict the phase shift. This leads to a similar loss

\mathcal{L}_{\mathrm{phase}} = \sum_{x \in B} \mathcal{L}_{\mathrm{CE}}\big(f_{\mathrm{phase}}(g(x^{\phi})),\ \phi\big),    (2)

where $x^{\phi}$ denotes the signal with a phase shift $\phi$ applied to a proportion of the sensors. We use a discrete set of possible phase shifts, treating this as a multi-class classification task rather than a regression task, to ease the difficulty of the problem, as MEG scanners typically have a large number of sensors.
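One way to realise the phase-shift transformation is via the analytic (Hilbert-transformed) signal, as sketched below; the authors' exact implementation may differ.

```python
from typing import Tuple

import numpy as np
from scipy.signal import hilbert

PHASE_SHIFTS = np.arange(8) * np.pi / 8            # {0, pi/8, ..., 7*pi/8}

def apply_phase_shift(x: np.ndarray, rho: float = 0.5) -> Tuple[np.ndarray, int]:
    """Shift the phase of a random rho-proportion of sensors by a random
    discrete angle; return (transformed window, class label). x: (sensors, time)."""
    n_sensors = x.shape[0]
    label = np.random.randint(len(PHASE_SHIFTS))
    phi = PHASE_SHIFTS[label]
    chosen = np.random.choice(n_sensors, size=max(1, int(rho * n_sensors)),
                              replace=False)
    analytic = hilbert(x[chosen], axis=-1)         # complex analytic signal
    x_shifted = x.copy()
    x_shifted[chosen] = np.real(analytic * np.exp(1j * phi))
    return x_shifted, label
```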

Amplitude scale prediction. MEG and EEG record signals with an array of sensors at different spatial locations, each capturing some signal sources more intensely than others. Representing the relative amplitude difference between sensors could therefore be important for differentiating between neural responses originating from distinct parts of the brain. Within speech, Hamilton et al. (2018) find that localised regions of the STG respond to sustained speech and speech onsets. Differentiating between neural responses from this region and others may be essential for decoding speech perception.

Thus, this pretext task focuses on learning representations that encode relative sensor amplitude differences. Similar to the phase shift task, we select a random proportion $\rho$ of the sensors and apply a discrete random amplitude scaling coefficient $A \in [-2, 2]$, discretised into 16 scaling factors, to the signal. The objective is to predict the scaling factor, leading to the loss

\mathcal{L}_{\mathrm{amplitude}} = \sum_{x \in B} \mathcal{L}_{\mathrm{CE}}\big(f_{\mathrm{amplitude}}(g(x^{A})),\ A\big),    (3)

where $x^{A}$ is the signal scaled with $A$.
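The amplitude-scale transformation admits a very direct sketch; the particular discretisation of [-2, 2] into 16 factors shown here is an assumption.

```python
from typing import Tuple

import numpy as np

SCALES = np.linspace(-2.0, 2.0, 16)                # 16 factors spanning [-2, 2]

def apply_amplitude_scale(x: np.ndarray, rho: float = 0.2) -> Tuple[np.ndarray, int]:
    """Scale a random rho-proportion of sensors by a random discrete factor;
    return (transformed window, class label). x: (sensors, time)."""
    n_sensors = x.shape[0]
    label = np.random.randint(len(SCALES))
    chosen = np.random.choice(n_sensors, size=max(1, int(rho * n_sensors)),
                              replace=False)
    x_scaled = x.copy()
    x_scaled[chosen] *= SCALES[label]
    return x_scaled, label
```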

These pretext tasks capture complementary time- and frequency-domain properties of the signal. Hence, during pre-training, we combine them, creating an augmented version of the input for every pretext task by applying the matching transformation. We feed the augmented inputs through the network backbone and apply the corresponding classifier to predict the transformation, summing the weighted losses such that our final pre-training loss is given by

\mathcal{L}_{\mathrm{SSL}} = w_1 \mathcal{L}_{\mathrm{band}} + w_2 \mathcal{L}_{\mathrm{phase}} + w_3 \mathcal{L}_{\mathrm{amplitude}},    (4)

where $w_i$ is a constant coefficient for each self-supervised loss.
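Putting the tasks together, the combined objective of Eq. (4) reduces to a weighted sum; the phase and amplitude loss helpers are assumed to mirror the band_prediction_loss sketch above, and the weights are placeholders.

```python
import torch

def ssl_loss(batch, g, heads, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the three pretext losses (Eq. 4). band_prediction_loss,
    phase_prediction_loss, and amplitude_prediction_loss are assumed helpers
    analogous to the band prediction sketch above."""
    w1, w2, w3 = weights
    return (w1 * band_prediction_loss(batch, g, heads["band"])
            + w2 * phase_prediction_loss(batch, g, heads["phase"])
            + w3 * amplitude_prediction_loss(batch, g, heads["amplitude"]))
```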

4 Experiments

In this section, we evaluate the representations learned with our pretext tasks by measuring their ability to scale downstream performance with unlabelled data. This includes understanding how well they can generalise across datasets, subjects, and tasks. We focus our evaluation on MEG data as the signal is rich, with better spatial resolution than EEG (Lopes da Silva, 2013) and faster sampling rates than fMRI (Hall et al., 2014).

We pre-train all models to completion and then fine-tune on labelled data for each task. In all tables and figures, we quote the receiver operating characteristic area under the curve (ROC AUC), where chance is always 0.5 regardless of class balance. We show the test ROC AUC at the best validation ROC AUC (early stopping) and quote uncertainty as the standard error of the mean over three seeds. Additionally, we state the $t$-score and $p$-value from single-sample, one-sided $t$-tests against chance.
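A sketch of how these statistics can be computed with scikit-learn and SciPy, assuming one (labels, scores) pair per seed:

```python
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.metrics import roc_auc_score

def summarise_runs(runs):
    """runs: list of (labels, predicted_scores) pairs, one per seed. Returns
    mean ROC AUC, standard error of the mean, and a one-sided single-sample
    t-test of the per-seed AUCs against chance (0.5)."""
    aucs = np.array([roc_auc_score(y, s) for y, s in runs])
    sem = aucs.std(ddof=1) / np.sqrt(len(aucs))
    t, p = ttest_1samp(aucs, popmean=0.5, alternative="greater")
    return aucs.mean(), sem, t, p
```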

4.1 Experimental setup

Datasets. Unless specified otherwise, our experiments use Cam-CAN (Shafto et al., 2014; Taylor et al., 2017) as an unlabelled representation learning dataset for pre-training. This is a study containing 641 subjects with resting and sensorimotor tasks, totalling approximately 160 hours of MEG recordings. For our downstream tasks, we use two labelled heard speech MEG datasets where participants listen to short stories or audiobooks. Armeni et al. (2022) contains 3 subjects who each listen to 10 hours of recordings (30 hours total), while Gwilliams et al. (2023) has 27 subjects, each recorded for 2 hours (54 hours total). Overall, we utilise over 200 hours of data. To the best of our knowledge, this is the largest volume of MEG data ever used for speech decoding.

Preprocessing. Each recording is in $\mathbb{R}^{S\times T}$, where $S$ is the number of sensors and $T$ is the number of time points sampled by the scanner. To eliminate high-frequency muscle movement artifacts, we apply a low-pass filter at 125 Hz, as well as a high-pass filter at 0.5 Hz to remove slow-drift artifacts. Since the datasets were recorded in Europe, where the electric grid frequency is 50 Hz, we apply a notch filter at multiples of 50 Hz to account for line noise. Treating the low-pass cutoff as the Nyquist frequency, we downsample the signal to twice that, 250 Hz, avoiding aliasing within our band of interest. Finally, we detect bad sensor channels, i.e. those with significant noise and artifacts, using a variance threshold, and replace them by interpolating from the spatially nearest sensors.
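An illustrative MNE-Python version of this pipeline is sketched below; the variance z-score rule and threshold used to flag bad channels are simplified stand-ins for the paper's criterion.

```python
import numpy as np
import mne

def preprocess(raw: mne.io.Raw, var_threshold: float = 5.0) -> mne.io.Raw:
    """Band-pass, notch, resample, and interpolate noisy channels."""
    raw.load_data()
    raw.filter(l_freq=0.5, h_freq=125.0)                    # slow drifts + muscle artifacts
    raw.notch_filter(freqs=np.arange(50.0, 126.0, 50.0))    # 50 Hz line noise and harmonic
    raw.resample(250.0)                                     # 2 x the 125 Hz cutoff
    picks = mne.pick_types(raw.info, meg=True)
    variances = raw.get_data(picks=picks).var(axis=1)
    z = np.abs((variances - variances.mean()) / variances.std())
    raw.info["bads"] = [raw.ch_names[picks[i]] for i in np.where(z > var_threshold)[0]]
    return raw.interpolate_bads()                           # replace flagged channels
```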

Downstream tasks. We evaluate our methods with two fundamental speech decoding tasks of increasing difficulty. The first, speech detection, determines whether speech occurs in the auditory stimulus using the neural response. The second task is voicing classification. Given data aligned at the occurrence of a phoneme, the task is to recognise whether the phoneme is voiced or voiceless, where voicing is a binary phonetic feature that categorises whether a speech sound is associated with vocal cord vibration. We select these tasks as they are simpler than phoneme recognition, but are foundational because they must be solved to decode speech accurately into natural language.

4.2 Learning Generalisable Representations Using Pretext Tasks

Our first experiment investigates whether our self-supervised objectives produce generalisable representations. In Table 1, we show the results of pre-training models with each pretext task independently as well as together. Here, all of our pretext tasks lead to results that are statistically significant and outperform a baseline fine-tuned without pre-training. This provides initial evidence that our tasks are helpful for speech decoding. Interestingly, the combination of all pretext tasks leads to better generalisation than any task on its own. As we hypothesised earlier, this may be because our pretext tasks capture complementary properties in time- and frequency-space, ensuring that our representation includes more salient features for speech decoding than any individual task provides.

Table 1: Downstream ROC AUC on Armeni et al. (2022) and Gwilliams et al. (2023) after pre-training with each pretext task individually and with all tasks combined, compared against baselines. Best results in bold.

| Experiment | Armeni ROC AUC | t | p | Gwilliams ROC AUC | t | p |
| --- | --- | --- | --- | --- | --- | --- |
| Linear | 0.559 ± 2e-4 | 341 | 4e-6 | 0.527 ± 7e-5 | 379 | 3e-6 |
| BIOT + linear | 0.500 ± 4e-4 | 0 | 6e-1 | 0.499 ± 2e-4 | -3 | 1e+0 |
| Ours + linear, no pre-training | 0.519 ± 0.002 | 8 | 7e-3 | 0.498 ± 0.003 | 0 | 7e-1 |
| Ours + linear, Amp (ρ = 0.2) | 0.602 ± 0.001 | 114 | 4e-5 | 0.532 ± 0.005 | 6 | 1e-2 |
| Ours + linear, Phase (ρ = 0.5) | 0.603 ± 0.003 | 35 | 4e-4 | 0.535 ± 0.003 | 12 | 3e-3 |
| Ours + linear, Band | 0.616 ± 0.003 | 44 | 3e-4 | 0.542 ± 0.001 | 46 | 2e-4 |
| Ours + linear, All tasks | **0.621** ± 0.003 | 36 | 4e-4 | **0.543** ± 0.003 | 13 | 3e-3 |

Now, we turn to the other baselines. Our approach significantly outperforms the equivalent model given raw MEG input instead of a pre-trained representation (the linear experiment). Here, the baseline has substantially more trainable parameters because the input dimension is far larger without an encoder. Even with this bias favouring the raw-input experiment, using our representation still performs better. We also compare our approach to BIOT (Yang et al., 2023), a comparable state-of-the-art self-supervised method. When BIOT is pre-trained on exactly the same data, its fine-tuned probe entirely fails to generalise, even after exhaustive hyperparameter tuning. We put this down to three critical reasons. Firstly, BIOT was designed around considerably lower-dimensional signals: their EEG evaluation used an order of magnitude fewer sensors than our MEG data, and with MEG their transformer approach requires many more channel embeddings, making it difficult to learn the complex interactions between sensors. Secondly, our self-supervised objective extracts features tailored to speech decoding, which is essential for solving speech decoding tasks; BIOT performs well on the simple EEG tasks in Yang et al. (2023)'s evaluation, but non-invasive speech decoding is significantly more challenging. Thirdly, these obstacles together suggest that a vast amount of data is required to learn their objective with MEG. Indeed, given that they pre-train with over 50 thousand hours of EEG data in their evaluation, their objective appears too general to efficiently learn a representation for speech decoding from the limited amount of MEG pre-training data (160 hours) available to us. This highlights the importance of data efficiency in SSL methods for MEG.

Among the individual pretext tasks, band prediction leads the rest. Perhaps this is because, by learning to discriminate between meaningful bands, the representation readily identifies phase-locking to speech onset in theta waves (Peelle et al., 2012). Further investigation is necessary here. The proportions of sensors to which transformations are applied, $\rho = 0.5$ for phase shift prediction and $\rho = 0.2$ for amplitude scale prediction, were determined through a hyperparameter search. We conjecture that a smaller $\rho$ is optimal for amplitude scale prediction since this leads to representations that are especially strong at discriminating amplitude differences among small groups of sensors. Perhaps this makes it easier to distinguish between neural responses from distinct parts of the brain such as the STG, which is associated with speech onset (Hamilton et al., 2018). In contrast, a larger $\rho$ for phase shift prediction could lead to representations that better discriminate neural synchrony information, which is distributed across the brain rather than localised. As a result, a large proportion of the sensors in a MEG scanner should encode information about this feature.

4.3 Scaling Speech Decoding With Unlabelled Data

[Figure 3]

Here, we analyse generalisation as we increase the volume of unlabelled pre-training data, measuring scaling performance on downstream tasks. As before, we pre-train with the combined pretext tasks. Figure 3 shows ROC AUC as we increase the amount of unlabelled data in pre-training up to approximately 160 hours. For both tasks, pre-training with any amount of data is sufficient to beat chance, and there is a clear improvement in accuracy as the amount of unlabelled data increases. For speech detection on Armeni et al. (2022), scaling appears logarithmic in log-space; for all others, ROC AUC improves log-linearly within the data regime we study. In any case, adding unlabelled data improves generalisation. Notably, we have scaled far beyond the data regime of prior surgical and non-surgical work, and yet performance has continued to scale. Thus, our self-supervision approach may remain useful as the volume of open data in the field continues to grow rapidly.

Our results also reveal several new and notable phenomena. Firstly, we scaled up the pre-training dataset by increasing the number of subjects. Since this led to consistent and almost monotonic improvements in downstream accuracy, our method is an exception to the common consensus that pooling subjects worsens generalisation. Secondly, as we pre-trained our model with a different dataset to those we fine-tuned on, our representation shows cross-dataset generalisation. This is particularly surprising as Armeni et al. (2022), Gwilliams et al. (2023), and our pre-training dataset all use entirely different scanners. Performing well across these datasets indicates that, together, our architecture and pretext tasks successfully generate representations that generalise across heterogeneous scanners. Finally, we note that our pre-training dataset contained no language data whatsoever, yet still improved downstream accuracy on language tasks. Remarkably, this shows that unlabelled brain data collected from any task (including those that are not linguistic) can be used to improve speech decoding performance.

Since the results show improvements on both downstream tasks, our pretext tasks appear sufficiently generic to produce representations that work with multiple speech decoding tasks while still generalising well on each task individually. This is generally a challenging trade-off to manage. However, we notice that, in both tasks, the base accuracy is higher and the improvement in ROC AUC is steeper for Armeni et al. (2022). This is likely because this dataset has more within-subject data. The weaker results for Gwilliams et al. (2023) may be a consequence of the larger number of subjects with shorter intra-subject recordings and greater subject variation. These observations support the findings of other recent work such as Csaky et al. (2022).

4.4 Scaling Unlabelled Data Improves Generalisation To Novel Subjects

In neuroimaging, brain data is generally highly variable across participants, making it difficult to transfer models to novel subjects (Csaky et al., 2022). Whilst we have shown generalisation across subjects, here we investigate whether we can generalise to novel subjects, an even more difficult challenge. This is critical for widely deploying speech BCIs to new patients. In this experiment, we fine-tune only on Gwilliams et al. (2023) and hold out three subjects with which we evaluate novel subject generalisation.

[Figure 4]

Figure 4 shows that scaling up the amount of unlabelled data used in pre-training not only improves accuracy on previously seen subjects, but also yields a positive log-linear trend in performance for novel subjects. This indicates that scaling our method is an encouraging direction for resolving the challenges of subject variance faced by prior work. Moreover, as far as we are aware, this is the first result to demonstrate novel subject generalisation in speech decoding from MEG.

4.5 Aggregating Unlabelled MEG Datasets

To scale up unlabelled data further than individual studies, we must be able to combine many existing datasets. As a preliminary investigation, we combine two of the largest public MEG datasets: MOUS (Schoffelen et al., 2019) and Cam-CAN (Shafto et al., 2014; Taylor et al., 2017). In this section, we investigate how pre-training with these combined datasets affects downstream performance, using the same experimental setup as Figure 3.

Table 2: Downstream ROC AUC on Armeni et al. (2022) and Gwilliams et al. (2023) when pre-training on Cam-CAN, MOUS, and their combination. Best results in bold.

| Pre-training dataset | Hours | Armeni ROC AUC | t | p | Gwilliams ROC AUC | t | p |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cam-CAN | 159 | **0.621** ± 0.003 | 36 | 4e-4 | 0.543 ± 0.003 | 13 | 3e-3 |
| MOUS | 160 | 0.605 ± 0.000 | 261 | 7e-6 | 0.543 ± 0.004 | 9 | 5e-3 |
| Cam-CAN + MOUS | 319 | 0.611 ± 0.003 | 40 | 3e-4 | **0.546** ± 0.002 | 20 | 1e-3 |

The results in Table 2 show, for the first time, that combining datasets can improve performance on downstream speech decoding tasks. It leads to better performance on Gwilliams et al. (2023) compared to pre-training on either dataset alone. Interestingly, this was not the case for Armeni et al. (2022), where pre-training on Cam-CAN alone performed best. Combined pre-training did, however, outperform training only on MOUS. It is surprising that pre-training on Cam-CAN was better than pre-training on MOUS when evaluating on Armeni et al. (2022), given that MOUS and Armeni et al. (2022) both used speech tasks and were acquired on the same MEG scanner. Cam-CAN, by contrast, did not use a speech task and was acquired on a different MEG scanner. We hypothesise that the better results for Cam-CAN are due to it being a cleaner dataset. During our experiments, we found that data quality, even among unlabelled data, can have a significant effect, as artefacts in recordings disrupt learning.

While the combination of the two datasets includes far more hours of data than any prior work on deep learning with MEG, further work is needed to aggregate more datasets. Here, we were limited by our compute budget. Increasing the number of datasets could eventually enable the network to always improve over the best single dataset. Just as increasing the number of subjects (rather than only within-subject data) improves novel subject generalisation, a larger number of datasets may be key to scaling results when datasets are aggregated in pre-training.

4.6 Limitations

Although our results are significant in demonstrating a viable path forward to scale up speech BCIs, there remain a number of limitations to the present work. We focused here on two downstream tasks: speech detection and voicing classification. Ultimately, we would like to expand this work to predict full transcripts from brain recordings (i.e. brain-to-text). This has been achieved with surgical data (Moses et al., 2021; Willett et al., 2023) but not yet convincingly with non-invasive methods like MEG or EEG (Jo et al., 2024). Speech detection has played an important role in the development of full brain-to-text in a surgical context (Moses et al., 2021), and we hope it may play a similar role for non-invasive methods. Prior work has further used voicing classification as a stand-in for phoneme classification (Gwilliams et al., 2022), and we have been able to improve on these results here. In future work, we would like to expand this to all English phonemes. Secondly, while we have been able to demonstrate the utility of a few pretext tasks, we do not claim to have exhausted the full set of useful tasks. Rather, we conjecture that more useful pretext tasks remain to be found, and believe a useful avenue of research will be into other input representations for brain recordings. For example, this paper did not make use of spatial features. Another limitation is our emphasis on heard speech over other types of speech, such as attempted or imagined speech. We hypothesise that the methods presented here will generalise to these other varieties of speech, though this has yet to be shown. But perhaps the biggest limitation of the present work is that, while it surpasses the amount of data used in other studies, it remains to be seen how much speech decoding tasks can be improved by scaling up the number of datasets used in training. In sharing this work now, we believe that the current proof of concept will be sufficiently impactful to the field as we continue to actively scale up the datasets that we can leverage.

5 Conclusion

Ultimately, solving speech decoding could transform the lives of patients with severe communication difficulties. This promise has not yet materialised because the field has been blocked by its inability to scale up data to leverage deep learning. Prior methods have been unable to aggregate data across different datasets, labels, or subjects because of heterogeneity in recording hardware, experiment design, and participants. A handful of studies have shown weak signals towards alleviating these issues, but until now no general solution has been developed. We provided a unified method that data-efficiently leverages unlabelled recordings through generic pretext tasks, showing that all of these problems can be solved. We verified this with experiments showing that our method not only scales with heterogeneous data but also generalises across datasets, subjects, and tasks. Our method unlocks the potential of the bitter lesson, providing a general way to exploit more computation by using more data. We implore the research community to employ the vast quantities of data and compute available to realise this potential. If scale is all you need in speech decoding, then the bitter lesson may not be so bitter.

Ethics Statement

In this work, we use data from studies that involve human subjects (Armeni et al., 2022; Gwilliams et al., 2023; Shafto et al., 2014; Taylor et al., 2017; Schoffelen et al., 2019). These datasets are public, cited, and have their own ethical approvals. The documentation for these is available with the publications for the respective datasets.

While there are clear positive impacts, we acknowledge that insights from neural speech decoding research may not all be beneficial. Research in this field could enable paralysed patients to communicate freely and materially assist those with minor communication difficulty (e.g. stammering). As the technology matures, it could also enable new ways of communicating with others and interacting with devices without the risks of invasive surgical implants. Nevertheless, the maturity of this technology could also present potential negative societal impacts. For one, reading inner speech creates new concerns over data controls as this information is likely to be highly sensitive and personal to individuals. Given access to this technology, there is also the risk that bad actors could extract sensitive information from target individuals without consent. Moreover, there are possible long horizon effects associated with speech decoding research. Broad adoption of this technology could lead to the gradual erosion of privacy over inner speech within society. In addition, asymmetric effects, where some individuals or organisations can read inner speech but others are unable to, could worsen societal inequality. Within the scope of this paper, we mitigate risks associated with inner speech by focusing on decoding heard speech where there is low potential for abuse. Nonetheless, we acknowledge that this is still a stepping stone towards solving inner speech decoding.

Reproducibility Statement

In the supplementary materials, we have provided an anonymised code repository with instructions for reproducing our main experiments. We also include details on experiment design and setup (Section 4.1 and Appendix A), hyperparameters (Appendix B), and compute (Appendix C). While we attempt to be exhaustive with these details, any information not found directly in the main body or appendices can be located in the supplementary materials.

References

  • Agrawal etal. (2015)Pulkit Agrawal, João Carreira, and Jitendra Malik.Learning to see by moving.In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 37–45. IEEE Computer Society, 2015.doi: 10.1109/ICCV.2015.13.URL https://doi.org/10.1109/ICCV.2015.13.
  • Armeni etal. (2022)Kristijan Armeni, Umut Güçlü, Marcel van Gerven, and Jan-Mathijs Schoffelen.A 10-hour within-participant magnetoencephalography narrative dataset to test models of language comprehension.Scientific Data, 9(1):278, June 2022.ISSN 2052-4463.doi: 10.1038/s41597-022-01382-7.URL https://www.nature.com/articles/s41597-022-01382-7.
  • Balestriero etal. (2023)Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Grégoire Mialon, Yuandong Tian, Avi Schwarzschild, AndrewGordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, and Micah Goldblum.A cookbook of self-supervised learning.CoRR, abs/2304.12210, 2023.doi: 10.48550/ARXIV.2304.12210.URL https://doi.org/10.48550/arXiv.2304.12210.
  • Bressler & Richter (2015)StevenL Bressler and CraigG Richter.Interareal oscillatory synchronization in top-down neocortical processing.Current Opinion in Neurobiology, 31:62–66, 2015.
  • Buzsáki & Wang (2012)György Buzsáki and Xiao-Jing Wang.Mechanisms of gamma oscillations.Annual Review of Neuroscience, 35:203–225, 2012.
  • Caron etal. (2021)Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.Emerging properties in self-supervised vision transformers.In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 9630–9640. IEEE, 2021.doi: 10.1109/ICCV48922.2021.00951.URL https://doi.org/10.1109/ICCV48922.2021.00951.
  • Chen etal. (2020)Ting Chen, Simon Kornblith, Mohammad Norouzi, and GeoffreyE. Hinton.A simple framework for contrastive learning of visual representations.In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 1597–1607. PMLR, 2020.URL http://proceedings.mlr.press/v119/chen20j.html.
  • Chen etal. (2024)Xupeng Chen, Ran Wang, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Friedman, Werner Doyle, Orrin Devinsky, Yao Wang, and Adeen Flinker.A neural speech decoding framework leveraging deep learning and speech synthesis.Nature Machine Intelligence, pp. 1–14, April 2024.ISSN 2522-5839.doi: 10.1038/s42256-024-00824-8.URL https://www.nature.com/articles/s42256-024-00824-8.
  • Cheung etal. (2016)Connie Cheung, LibertyS Hamilton, Keith Johnson, and EdwardF Chang.The auditory representation of speech sounds in human motor cortex.eLife, 5:e12577, 2016.
  • Csaky etal. (2022)Richard Csaky, Mats W.J. van Es, OiwiParker Jones, and MarkW. Woolrich.Group‐level brain decoding with deep learning.Human Brain Mapping, 44:6105 – 6119, 2022.URL https://doi.org/10.1002/hbm.26500.
  • Csaky etal. (2023)Richard Csaky, MatsW.J. van Es, OiwiParker Jones, and Mark Woolrich.Interpretable many-class decoding for MEG.NeuroImage, 282:120396, November 2023.ISSN 10538119.doi: 10.1016/j.neuroimage.2023.120396.URL https://linkinghub.elsevier.com/retrieve/pii/S1053811923005475.
  • Dash etal. (2020)Debadatta Dash, Paul Ferrari, Satwik Dutta, and Jun Wang.NeuroVAD: Real-Time Voice Activity Detection from Non-Invasive Neuromagnetic Signals.Sensors, 20(8):2248, January 2020.ISSN 1424-8220.doi: 10.3390/s20082248.URL https://www.mdpi.com/1424-8220/20/8/2248.
  • Défossez etal. (2022)Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi.High fidelity neural audio compression.CoRR, abs/2210.13438, 2022.doi: 10.48550/ARXIV.2210.13438.URL https://doi.org/10.48550/arXiv.2210.13438.
  • Doersch etal. (2015)Carl Doersch, Abhinav Gupta, and AlexeiA. Efros.Unsupervised visual representation learning by context prediction.In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1422–1430. IEEE Computer Society, 2015.doi: 10.1109/ICCV.2015.167.URL https://doi.org/10.1109/ICCV.2015.167.
  • Duan etal. (2023)Yiqun Duan, Charles Chau, Zhen Wang, Yu-Kai Wang, and Chin-Teng Lin.DeWave: Discrete encoding of EEG waves for EEG to text translation.In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, December 10 - 16, 2023.URL http://papers.nips.cc/paper_files/paper/2023/hash/1f2fd23309a5b2d2537d063b29ec1b52-Abstract-Conference.html.
  • Défossez etal. (2023)Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King.Decoding speech perception from non-invasive brain recordings.Nature Machine Intelligence, 5(10):1097–1107, October 2023.ISSN 2522-5839.doi: 10.1038/s42256-023-00714-5.URL https://www.nature.com/articles/s42256-023-00714-5.
  • Fries (2005)Pascal Fries.A mechanism for cognitive dynamics: neuronal communication through neuronal coherence.Trends in Cognitive Sciences, 9(10):474–480, October 2005.ISSN 1364-6613.doi: 10.1016/j.tics.2005.08.011.URL https://www.sciencedirect.com/science/article/pii/S1364661305002421.
  • Fries (2009)Pascal Fries.Neuronal gamma-band synchronization as a fundamental process in cortical computation.Annual Review of Neuroscience, 32:209–224, 2009.
  • Gibiansky etal. (2017)Andrew Gibiansky, SercanÖmer Arik, GregoryFrederick Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou.Deep voice 2: Multi-speaker neural text-to-speech.In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, HannaM. Wallach, Rob Fergus, S.V.N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 2962–2970, 2017.URL https://proceedings.neurips.cc/paper/2017/hash/c59b469d724f7919b7d35514184fdc0f-Abstract.html.
  • Gidaris etal. (2018)Spyros Gidaris, Praveer Singh, and Nikos Komodakis.Unsupervised representation learning by predicting image rotations.In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.URL https://openreview.net/forum?id=S1v4N2l0-.
  • Giraud & Poeppel (2012)Anne-Lise Giraud and David Poeppel.Cortical oscillations and speech processing: emerging computational principles and operations.Nature Neuroscience, 15(4):511–517, April 2012.ISSN 1546-1726.doi: 10.1038/nn.3063.URL https://www.nature.com/articles/nn.3063.
  • Gwilliams etal. (2022)Laura Gwilliams, Jean-Rémi King, Alec Marantz, and David Poeppel.Neural dynamics of phoneme sequences reveal position-invariant code for content and order.Nature Communications, 13(1):6606, November 2022.ISSN 2041-1723.doi: 10.1038/s41467-022-34326-1.URL https://www.nature.com/articles/s41467-022-34326-1.
  • Gwilliams etal. (2023)Laura Gwilliams, Graham Flick, Alec Marantz, Liina Pylkkänen, David Poeppel, and Jean-Rémi King.Introducing MEG-MASC a high-quality magneto-encephalography dataset for evaluating natural speech processing.Scientific Data, 10(1):862, December 2023.ISSN 2052-4463.doi: 10.1038/s41597-023-02752-5.URL https://www.nature.com/articles/s41597-023-02752-5.
  • Hall etal. (2014)EmmaL. Hall, SiânE. Robson, PeterG. Morris, and MatthewJ. Brookes.The relationship between MEG and fMRI.NeuroImage, 102:80–91, 2014.URL https://doi.org/10.1016/j.neuroimage.2013.11.005.
  • Hamilton etal. (2018)LibertyS. Hamilton, Erik Edwards, and EdwardF. Chang.A Spatial Map of Onset and Sustained Responses to Speech in the Human Superior Temporal Gyrus.Current Biology, 28(12):1860–1871.e4, June 2018.ISSN 09609822.doi: 10.1016/j.cub.2018.04.033.URL https://linkinghub.elsevier.com/retrieve/pii/S0960982218304615.
  • Jiang etal. (2024)Weibang Jiang, Liming Zhao, and Bao liang Lu.Large brain model for learning generic representations with tremendous EEG data in BCI.In The Twelfth International Conference on Learning Representations, 2024.URL https://openreview.net/forum?id=QzTpTRVtrP.
  • Jo et al. (2024) Hyejeong Jo, Yiqian Yang, Juhyeok Han, Yiqun Duan, Hui Xiong, and WonHee Lee. Are EEG-to-text models working? arXiv, 2024. URL https://arxiv.org/abs/2405.06459.
  • Jumper etal. (2021)JohnM. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon AA Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, AndrewW. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis.Highly accurate protein structure prediction with alphafold.Nature, 596:583 – 589, 2021.URL https://doi.org/10.1038/s41586-021-03819-2.
  • Langland-Hassan & Vicente (2018)Peter Langland-Hassan and Agustín Vicente.Inner Speech: New Voices.Oxford University Press, 2018.
  • Larsson etal. (2016)Gustav Larsson, Michael Maire, and Gregory Shakhnarovich.Learning representations for automatic colorization.In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (eds.), Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pp. 577–593. Springer, 2016.doi: 10.1007/978-3-319-46493-0_35.URL https://doi.org/10.1007/978-3-319-46493-0_35.
  • Lopes da Silva (2013)Fernando Lopes da Silva.EEG and MEG: Relevance to Neuroscience.Neuron, 80(5):1112–1128, December 2013.ISSN 0896-6273.doi: 10.1016/j.neuron.2013.10.017.URL https://www.sciencedirect.com/science/article/pii/S0896627313009203.
  • Loshchilov & Hutter (2019)Ilya Loshchilov and Frank Hutter.Decoupled weight decay regularization.In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Luo & Poeppel (2007)Huan Luo and David Poeppel.Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex.Neuron, 54(6):1001–1010, 2007.
  • Luo etal. (2010)Huan Luo, Zuxiang Liu, and David Poeppel.Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation.PLOS Biology, 8(8):e1000445, 2010.
  • Mai etal. (2016)Guangting Mai, JamesW. Minett, and William S.Y. Wang.Delta, theta, beta, and gamma brain oscillations index levels of auditory sentence processing.NeuroImage, 133:516–528, June 2016.ISSN 1053-8119.doi: 10.1016/j.neuroimage.2016.02.064.URL https://www.sciencedirect.com/science/article/pii/S1053811916001737.
  • Martin etal. (2014)Stéphanie Martin, Peter Brunner, Chris Holdgraf, Hans-Jochen Heinze, NathanE Crone, Jochem Rieger, Gerwin Schalk, RobertT Knight, and BrianN Pasley.Decoding spectrotemporal features of overt and covert speech from the human cortex.Frontiers in Neuroengineering, 7:14, 2014.
  • Mesgarani etal. (2014)Nima Mesgarani, Connie Cheung, Keith Johnson, and EdwardF. Chang.Phonetic feature encoding in human superior temporal gyrus.Science, 343(6174):1006–1010, 2014.doi: 10.1126/science.1245994.
  • Metzger etal. (2023)SeanL. Metzger, KayloT. Littlejohn, AlexanderB. Silva, DavidA. Moses, MargaretP. Seaton, Ran Wang, MaximilianE. Dougherty, JessieR. Liu, Peter Wu, MichaelA. Berger, Inga Zhuravleva, Adelyn Tu-Chan, Karunesh Ganguly, GopalaK. Anumanchipalli, and EdwardF. Chang.A high-performance neuroprosthesis for speech decoding and avatar control.Nature, 620:1037–1046, 2023.
  • Moses etal. (2021)DavidA. Moses, SeanL. Metzger, JessieR. Liu, GopalaK. Anumanchipalli, JosephG. Makin, PengfeiF. Sun, Josh Chartier, MaximilianE. Dougherty, PatriciaM. Liu, GaryM. Abrams, Adelyn Tu-Chan, Karunesh Ganguly, and EdwardF. Chang.Neuroprosthesis for Decoding Speech in a Paralyzed Person with Anarthria.New England Journal of Medicine, 385(3):217–227, July 2021.ISSN 0028-4793.doi: 10.1056/NEJMoa2027540.URL https://doi.org/10.1056/NEJMoa2027540.
  • Noroozi & Favaro (2016)Mehdi Noroozi and Paolo Favaro.Unsupervised learning of visual representations by solving jigsaw puzzles.In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (eds.), Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI, volume 9910 of Lecture Notes in Computer Science, pp. 69–84. Springer, 2016.doi: 10.1007/978-3-319-46466-4_5.URL https://doi.org/10.1007/978-3-319-46466-4_5.
  • OpenAI (2023)OpenAI.GPT-4 technical report.CoRR, abs/2303.08774, 2023.doi: 10.48550/ARXIV.2303.08774.URL https://doi.org/10.48550/arXiv.2303.08774.
  • Peelle etal. (2012)JonathanE. Peelle, Joachim Gross, and MatthewH. Davis.Phase-locked responses to speech in human auditory cortex are enhanced during comprehension.Cerebral Cortex, 23(6):1378–1387, 2012.
  • Perez etal. (2018)Ethan Perez, Florian Strub, Harm deVries, Vincent Dumoulin, and AaronC. Courville.FiLM: Visual reasoning with a general conditioning layer.In SheilaA. McIlraith and KilianQ. Weinberger (eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3942–3951. AAAI Press, 2018.doi: 10.1609/AAAI.V32I1.11671.URL https://doi.org/10.1609/aaai.v32i1.11671.
  • Piai etal. (2014)Vitória Piai, Ardi Roelofs, and Eric Maris.Oscillatory brain responses in spoken word production reflect lexical frequency and sentential constraint.Neuropsychologia, 53:146–156, January 2014.ISSN 0028-3932.doi: 10.1016/j.neuropsychologia.2013.11.014.URL https://www.sciencedirect.com/science/article/pii/S0028393213004119.
  • Radford etal. (2023)Alec Radford, JongWook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.Robust speech recognition via large-scale weak supervision.In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 28492–28518. PMLR, 2023.URL https://proceedings.mlr.press/v202/radford23a.html.
  • Schoffelen etal. (2019)Jan-Mathijs Schoffelen, Robert Oostenveld, Nietzsche H.L. Lam, Julia Uddén, Annika Hultén, and Peter Hagoort.A 204-subject multimodal neuroimaging dataset to study language processing.Scientific Data, 6(1):17, April 2019.ISSN 2052-4463.doi: 10.1038/s41597-019-0020-y.URL https://www.nature.com/articles/s41597-019-0020-y.
  • Shafto etal. (2014)MeredithA. Shafto, LorraineK. Tyler, Marie Dixon, JasonR. Taylor, JamesBenedict Rowe, Rhodri Cusack, AndrewJ. Calder, WilliamD. Marslen-Wilson, JohnS. Duncan, T.Dalgleish, Richard N.A. Henson, Carol Brayne, and FionaE. Matthews.The Cambridge centre for ageing and neuroscience (Cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing.BMC Neurology, 14, 2014.
  • Strauß etal. (2015)Antje Strauß, MollyJ Henry, Mathias Scharinger, and Jonas Obleser.Alpha phase determines successful lexical decision in noise.Journal of Neuroscience, 35(7):3256–3262, 2015.
  • Sutton (2019)Richard Sutton.The bitter lesson.Incomplete Ideas (blog), 2019.URL http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
  • Tagliasacchi etal. (2020)Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, and Dominik Roblek.SEANet: A multi-modal speech enhancement network.In Helen Meng, BoXu, and ThomasFang Zheng (eds.), Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pp. 1126–1130. ISCA, 2020.doi: 10.21437/INTERSPEECH.2020-1563.URL https://doi.org/10.21437/Interspeech.2020-1563.
  • Tang etal. (2023)Jerry Tang, Amanda LeBel, Shailee Jain, and AlexanderG. Huth.Semantic reconstruction of continuous language from non-invasive brain recordings.Nature Neuroscience, 26(5):858–866, May 2023.ISSN 1546-1726.doi: 10.1038/s41593-023-01304-9.URL https://www.nature.com/articles/s41593-023-01304-9.
  • Taylor etal. (2017)JasonR. Taylor, Nitin Williams, Rhodri Cusack, Tibor Auer, MeredithA. Shafto, Marie Dixon, LorraineK. Tyler, Cam-CAN Group, and Richard N.A. Henson.The Cambridge centre for ageing and neuroscience (Cam-CAN) data repository: Structural and functional MRI, MEG, and cognitive data from a cross-sectional adult lifespan sample.Neuroimage, 144:262 – 269, 2017.
  • Vidaurre etal. (2018)Diego Vidaurre, LaurenceT. Hunt, AndrewJ. Quinn, Benjamin A.E. Hunt, MatthewJ. Brookes, AnnaC. Nobre, and MarkW. Woolrich.Spontaneous cortical activity transiently organises into frequency specific phase-coupling networks.Nature Communications, 9(1):2987, July 2018.ISSN 2041-1723.doi: 10.1038/s41467-018-05316-z.URL https://www.nature.com/articles/s41467-018-05316-z.
  • Wandelt etal. (2024)SarahK Wandelt, DavidA. Bjånes, Kelsie Pejsa, Brian Lee, CharlesY Liu, and Richard Andersen.Representation of internal speech by single neurons in human supramarginal gyrus.Nature Human Behaviour, 2024.URL https://doi.org/10.1038/s41562-024-01867-y.
  • Wang etal. (2023a)BoWang, Xiran Xu, Longxiang Zhang, Boda Xiao, Xihong Wu, and Jing Chen.Semantic reconstruction of continuous language from MEG signals.CoRR, abs/2309.07701, 2023a.doi: 10.48550/ARXIV.2309.07701.URL https://doi.org/10.48550/arXiv.2309.07701.
  • Wang etal. (2023b)Christopher Wang, Vighnesh Subramaniam, AdamUri Yaari, Gabriel Kreiman, Boris Katz, Ignacio Cases, and Andrei Barbu.BrainBERT: Self-supervised representation learning for intracranial recordings.In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023b.URL https://openreview.net/pdf?id=xmcYx_reUn6.
  • Wang & Ji (2022)Zhenhailong Wang and Heng Ji.Open vocabulary electroencephalography-to-text decoding and zero-shot sentiment classification.In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Virtual Event, February 22 - March 1, pp. 5350–5358. AAAI Press, 2022.doi: 10.1609/AAAI.V36I5.20472.URL https://doi.org/10.1609/aaai.v36i5.20472.
  • Weiss & Mueller (2012)Sabine Weiss and HorstM. Mueller.“Too many betas do not spoil the broth”: the role of beta brain oscillations in language processing.Frontiers in Psychology, 3, 2012.doi: 10.3389/fpsyg.2012.00201.
  • Willett etal. (2023)FrancisR. Willett, ErinM. Kunz, Chaofei Fan, DonaldT. Avansino, GuyH. Wilson, EunYoung Choi, Foram Kamdar, MatthewF. Glasser, LeighR. Hochberg, Shaul Druckmann, KrishnaV. Shenoy, and JaimieM. Henderson.A high-performance speech neuroprosthesis.Nature, 620(7976):1031–1036, August 2023.ISSN 1476-4687.doi: 10.1038/s41586-023-06377-x.URL https://www.nature.com/articles/s41586-023-06377-x.
  • Yang etal. (2023)Chaoqi Yang, M.Brandon Westover, and Jimeng Sun.BIOT: biosignal transformer for cross-data learning in the wild.In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.URL http://papers.nips.cc/paper_files/paper/2023/hash/f6b30f3e2dd9cb53bbf2024402d02295-Abstract-Conference.html.
  • Zeghidour etal. (2022)Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi.Soundstream: An end-to-end neural audio codec.IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022.doi: 10.1109/TASLP.2021.3129994.URL https://doi.org/10.1109/TASLP.2021.3129994.
  • Zhang etal. (2016)Richard Zhang, Phillip Isola, and AlexeiA. Efros.Colorful image colorization.In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (eds.), Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, volume 9907 of Lecture Notes in Computer Science, pp. 649–666. Springer, 2016.doi: 10.1007/978-3-319-46487-9_40.URL https://doi.org/10.1007/978-3-319-46487-9_40.

Appendix A Experiment Details

We pre-train with non-overlapping sample windows from all subjects and sessions. We scale the amount of unlabelled Cam-CAN data by increasing the number of pre-training subjects through the sequence 1, 2, 4, 8, 17, 36, 74, 152, 312, and 641, successively selecting additional subjects at random at each step. Each seed uses a different set of subjects to reduce the influence of outlier subjects.
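As a minimal sketch of this subject-scaling procedure, the snippet below draws nested subject subsets for a given seed; the function name and the assumption that each larger subset extends the previous one are ours, for illustration only.

```python
import random

# Number of Cam-CAN subjects at each scaling step (from the text above).
SUBJECT_COUNTS = [1, 2, 4, 8, 17, 36, 74, 152, 312, 641]

def subject_subsets(all_subjects, seed):
    """Return one subject list per scaling step for a given seed.

    Each seed shuffles the full subject pool differently, and each larger
    subset extends the previous one by adding further randomly chosen subjects.
    """
    rng = random.Random(seed)
    shuffled = list(all_subjects)
    rng.shuffle(shuffled)
    return [shuffled[:n] for n in SUBJECT_COUNTS]
```

Because the subsets are prefixes of one shuffled pool, results at larger data scales remain directly comparable to the smaller-scale runs within a seed.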

When fine-tuning with Armeni etal. (2022), we hold out session 010 from all subjects during training and validation, using it for evaluation only. Similarly, when fine-tuning with Gwilliams etal. (2023), we hold out session 1 from subjects 23, 24, 25, 26, and 27 and use these sessions for evaluation only; because this dataset has limited within-subject data, we do not hold out a session from every subject as before. For our novel-subject experiments, we hold out subjects 1, 2, and 3 entirely and use their data only during evaluation. We note that Gwilliams etal. (2023) uses four different tasks per subject, with task order randomised across subjects, and that the two sessions for each task are repeats. Consequently, although a held-out recording is itself unseen, its task may have been seen in the training set.
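For concreteness, the following sketch expresses the session-level evaluation split described above; the helper name and the exact identifier strings are assumptions for illustration, not the authors' code.

```python
# Subjects whose session 1 is reserved for evaluation in Gwilliams etal. (2023).
GWILLIAMS_EVAL_SUBJECTS = {"23", "24", "25", "26", "27"}

def is_eval_session(dataset: str, subject: str, session: str) -> bool:
    """True if a (subject, session) recording is reserved for evaluation."""
    if dataset == "armeni2022":
        # Session 010 is held out from every subject.
        return session == "010"
    if dataset == "gwilliams2023":
        # Session 1 is held out only for the evaluation subjects.
        return subject in GWILLIAMS_EVAL_SUBJECTS and session == "1"
    return False
```

The novel-subject experiments instead remove subjects 1, 2, and 3 from training altogether, so that filter operates at the subject level rather than the session level.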

In all experiments, we fine-tune to completion (usually around 300 epochs), taking the test metric at the best validation loss (early stopping). We use three randomly selected seeds for each pre-training run and its corresponding fine-tuning run. For speech detection, since our encoder reduces the temporal dimension from 125 samples (a 0.5 s window at a 250 Hz sample rate) down to 5 embeddings, we downsample the speech detection labels to match using PyTorch’s torch.nn.functional.interpolate. Each label therefore represents a 0.1 s period of time.
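A minimal sketch of this label downsampling is shown below. The paper only states that torch.nn.functional.interpolate is used; the interpolation mode, the thresholding back to binary labels, and the function name are our assumptions.

```python
import torch
import torch.nn.functional as F

def downsample_labels(labels: torch.Tensor, target_len: int = 5) -> torch.Tensor:
    """Downsample per-sample speech labels to the encoder's temporal resolution.

    labels: (batch, 125) binary speech/no-speech labels for a 0.5 s window at 250 Hz.
    Returns: (batch, target_len) labels, one per ~0.1 s embedding.
    """
    x = labels.float().unsqueeze(1)  # (batch, 1, 125) for 1D interpolation
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
    return (x.squeeze(1) > 0.5).long()  # back to binary labels, (batch, target_len)
```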

Appendix B Hyperparameters

We conducted a search over hyperparameters of interest to optimise our self-supervised objectives and neural architecture. Although this search indicated an ideal architectural configuration, in practice we altered the final architecture because training became unstable when the data was scaled up. Our final hyperparameters therefore balance the best values from the search against training stability. These values are detailed in Table 3; a configuration sketch follows the table.

Table 3: Final hyperparameter values.

Window length (s): 0.5
ρ (phase): 0.5
ρ (amplitude): 0.2
{w1, w2, w3}: {1.0, 1.0, 1.0}
d_shared: 512
d_backbone: 512
SEANet convolution channels: (512, 512, 512, 512)
SEANet downsampling ratios: (5, 5, 1)
FiLM conditioning dimension: 16
Subject embedding dimension: 16
Pre-training epochs: 200
Optimizer: AdamW (Loshchilov & Hutter, 2019)
Learning rate: 0.000066
Train ratio: 0.8
Validation ratio: 0.1
Test ratio: 0.1
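As a configuration sketch, the Table 3 values could be collected as follows; the dictionary keys and the optimizer wiring are illustrative assumptions rather than the authors' code, and only the AdamW choice and learning rate come from the table.

```python
import torch

# Table 3 hyperparameters gathered into one place (key names are ours).
CONFIG = {
    "window_length_s": 0.5,
    "rho_phase": 0.5,
    "rho_amplitude": 0.2,
    "loss_weights": (1.0, 1.0, 1.0),
    "d_shared": 512,
    "d_backbone": 512,
    "seanet_channels": (512, 512, 512, 512),
    "seanet_ratios": (5, 5, 1),
    "film_dim": 16,
    "subject_embedding_dim": 16,
    "pretrain_epochs": 200,
    "learning_rate": 6.6e-5,
    "splits": {"train": 0.8, "val": 0.1, "test": 0.1},
}

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW with decoupled weight decay (Loshchilov & Hutter, 2019).
    return torch.optim.AdamW(model.parameters(), lr=CONFIG["learning_rate"])
```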

Appendix C Compute Resources

All experiments were run on individual NVIDIA V100 and A100 GPUs with up to 40 GiB of GPU memory on a system with up to 1 TiB of RAM. Each pre-training run with the maximum amount of pre-training data took approximately 200 hours (8.3 days). Fine-tuning following pre-training took up to another 12 hours. We estimate that we used approximately 3,000 hours of compute for the final experimental runs, including hyperparameter searches. In total, over the course of developing this work from idea to final paper, we used around 10,000 hours of GPU compute.

Appendix D Licences For Datasets And Code

The Armeni etal. (2022) dataset is distributed under CC-BY-4.0, while the Gwilliams etal. (2023) dataset is distributed under the CC0 1.0 Universal licence. The Schoffelen etal. (2019) dataset is distributed under the RU-DI-HD-1.0 licence from the Donders Institute. The licence for the Cam-CAN (Shafto etal., 2014; Taylor etal., 2017) dataset is unknown. The SEANet code adapted from Défossez etal. (2022) is distributed under the MIT licence, and the OSL library, which we use for preprocessing, is under the BSD-3-Clause licence.
