Predicting Natural Speech EEG Recordings with a Contextualized Speech Model.
| Main Author: | |
|---|---|
| Other Authors: | |
| Published: | 2024 |
| Subjects: | |
Summary:

EEG recordings of audiobook comprehension were analyzed. The audiobook waveform was processed through a pre-trained deep-learning speech model (Whisper-base). A sliding-window approach was used to feed the model up to 30 s of preceding speech waveform, which was re-represented as an 80-channel log-Mel spectrogram. The spectrogram was then fed forward, via an initial convolutional layer, through the successive layers of a Transformer encoder artificial neural network. This entire process can be considered to implement a graded transformation of the input speech into a contextualized, more linguistic representation.

At each Transformer layer, input time frames are contextualized via a self-attention computation that re-represents each frame as a weighted average of itself and preceding frames. The bottom row illustrates a summary of the self-attention weightings computed at each layer for the first 3 s of the audiobook stimulus. Attention weights (all positive values) relating each time frame to each preceding time frame are shown as the colored shading on each matrix row. Specifically, points along the diagonal correspond to the attention weight at any time point t (from 0 to 3 s) that is unrelated to preceding context, whereas points to the left of the diagonal correspond to the weights applied to preceding frames to re-represent the current frame. The wealth of color to the left of the diagonal in layers 3, 5, and 6 demonstrates the importance of prior context in Whisper's operation. The range of values in each matrix is L1: [0, 0.16], L2: [0, 0.15], L3: [0, 0.09], L4: [0, 0.13], L5: [0, 0.11], L6: [0, 0.09], with color intensity reflecting weight strength. The self-attention computation is illustrated in detail in **Fig A in [S1 Text](http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1012537#pcbi.1012537.s001)**, and the self-attention weight maps from the eight attention heads that were averaged to generate the visualization above are shown in **Figs B and C in [S1 Text](http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1012537#pcbi.1012537.s001)**.

Whisper layer outputs were used to predict co-registered EEG data in a cross-validated multiple regression framework (illustrated above for the final layer output only). To reduce computational burden, Whisper vectors were reduced to 10 dimensions by projection onto pre-derived PCA axes (computed from different audiobook data; see also **Fig D in [S1 Text](http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1012537#pcbi.1012537.s001)**), and both the EEG and the model data were resampled to 32 Hz.
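To make the feature-extraction pipeline concrete, the sketch below outlines how a 30 s audio window can be converted to an 80-channel log-Mel spectrogram, passed through the Whisper-base encoder, reduced to 10 PCA dimensions, and resampled to 32 Hz. It is a minimal illustration, not the authors' code: the Hugging Face `transformers`, `scikit-learn`, and `scipy` libraries, the function names, and the placeholder PCA-fitting data are all assumptions made for the example.

```python
# Hedged sketch: contextualized Whisper-base encoder features for EEG encoding.
# Assumes the Hugging Face `transformers`, `scikit-learn`, and `scipy` packages;
# variable names (e.g. `audio_window`) and the PCA placeholder data are illustrative.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel
from sklearn.decomposition import PCA
from scipy.signal import resample

SR = 16_000                      # Whisper expects 16 kHz audio
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def encoder_features(audio_window: np.ndarray, layer: int = 6) -> np.ndarray:
    """Return layer activations (frames x 512) for up to 30 s of preceding audio."""
    # 1) Log-Mel spectrogram: 80 mel channels, padded/truncated to 30 s (3000 frames).
    feats = feature_extractor(audio_window, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        # 2) Convolutional front end + 6 Transformer encoder layers, keeping all hidden states.
        out = model.encoder(feats.input_features, output_hidden_states=True)
    # hidden_states[0] is the convolutional output; hidden_states[layer] is encoder layer `layer`.
    return out.hidden_states[layer].squeeze(0).numpy()   # (1500 frames at 50 Hz, 512 dims)

def attention_map(audio_window: np.ndarray, layer: int = 6) -> np.ndarray:
    """Self-attention weights for one layer, averaged over the 8 heads (frames x frames)."""
    feats = feature_extractor(audio_window, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        out = model.encoder(feats.input_features, output_attentions=True)
    return out.attentions[layer - 1].squeeze(0).mean(dim=0).numpy()

# 3) Dimensionality reduction: project onto 10 PCA axes fitted on held-out audiobook features.
held_out = np.random.randn(10_000, 512)          # placeholder for features from other audio
pca = PCA(n_components=10).fit(held_out)

def eeg_ready_features(audio_window: np.ndarray, duration_s: float) -> np.ndarray:
    acts = encoder_features(audio_window)
    acts = acts[: int(duration_s * 50)]          # drop frames beyond the actual audio length
    reduced = pca.transform(acts)                # (frames, 10)
    # 4) Resample from Whisper's 50 Hz frame rate to 32 Hz to match the EEG.
    return resample(reduced, int(duration_s * 32), axis=0)
```

Whisper-base produces 1500 encoder frames for each 30 s window (a 50 Hz frame rate), so the final resampling step aligns the 10-dimensional feature time series with the 32 Hz grid on which the EEG is analyzed.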
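The prediction step can be sketched in the same spirit. The summary states only that Whisper layer outputs predicted EEG in a cross-validated multiple regression framework; the snippet below therefore uses ridge regression on time-lagged features with 5-fold cross-validation purely as illustrative stand-ins, not as the paper's exact method, lags, or regularization.

```python
# Hedged sketch: cross-validated multiple regression from Whisper features to EEG.
# The ridge penalty, lag count, and 5-fold split are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def lag_features(X: np.ndarray, n_lags: int) -> np.ndarray:
    """Stack the current and the n_lags preceding 32 Hz feature frames at each time point."""
    lagged = [np.roll(X, lag, axis=0) for lag in range(n_lags + 1)]
    for lag, arr in enumerate(lagged):
        arr[:lag] = 0.0                       # zero out frames shifted in from before onset
    return np.hstack(lagged)

def predict_eeg(features: np.ndarray, eeg: np.ndarray, n_lags: int = 16) -> np.ndarray:
    """Return per-electrode prediction accuracy (Pearson r), averaged over CV folds."""
    X = lag_features(features, n_lags)        # (time, 10 * (n_lags + 1))
    scores = np.zeros(eeg.shape[1])
    for train, test in KFold(n_splits=5).split(X):
        model = Ridge(alpha=1.0).fit(X[train], eeg[train])
        pred = model.predict(X[test])
        for ch in range(eeg.shape[1]):
            scores[ch] += np.corrcoef(pred[:, ch], eeg[test, ch])[0, 1] / 5
    return scores
```

In this kind of encoding analysis the lagged design matrix lets the regression capture a delayed, spread-out EEG response to each feature frame; the 16 lags used here (500 ms at 32 Hz) are an arbitrary illustrative choice.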