Predicting Natural Speech EEG Recordings with a Contextualized Speech Model.
| Main Author: | |
|---|---|
| Other Authors: | |
| Published: | 2024 |
| Subjects: | |
Summary:

EEG recordings of audiobook comprehension were analyzed. The audiobook waveform was processed through a pre-trained deep-learning speech model (Whisper-base). A sliding-window approach was used to feed the model up to 30 s of preceding speech waveform, which was re-represented as an 80-channel log-Mel spectrogram. The spectrogram was then fed forward, via an initial convolutional layer, through the successive layers of a Transformer encoder artificial neural network. This entire process can be considered to implement a graded transformation of the input speech into a contextualized, more linguistic representation.

At each Transformer layer, input time frames are contextualized via a self-attention computation that re-represents each frame as a weighted average of itself and preceding frames. The bottom row illustrates a summary of the self-attention weightings computed at each layer for the first 3 s of the audiobook stimulus. Attention weights (all positive values) relating each time frame to each preceding time frame are shown as the colored shading on each matrix row. Specifically, points along the diagonal correspond to the attention weight at any time point t (from 0 to 3 s) that is unrelated to preceding context, whereas points to the left of the diagonal correspond to the weights applied to preceding frames to re-represent the current frame. The wealth of color to the left of the diagonal in layers 3, 5, and 6 demonstrates the importance of prior context in Whisper's operation. The range of values in each matrix is L1: [0, 0.16], L2: [0, 0.15], L3: [0, 0.09], L4: [0, 0.13], L5: [0, 0.11], L6: [0, 0.09], with color intensity reflecting weight strength. The self-attention computation is illustrated in detail in **Fig A in [S1 Text](http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1012537#pcbi.1012537.s001)**, and the self-attention weight maps from the eight attention heads that were averaged to generate the visualization above are shown in **Figs B and C in [S1 Text](http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1012537#pcbi.1012537.s001)**.

Whisper layer outputs were used to predict co-registered EEG data in a cross-validated multiple regression framework (illustrated above for the final layer output only). To reduce computational burden, Whisper vectors were reduced to 10 dimensions by projection onto pre-derived PCA axes (computed from different audiobook data; see also **Fig D in [S1 Text](http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1012537#pcbi.1012537.s001)**), and both the EEG and the model data were resampled to 32 Hz.
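To make the feature-extraction pipeline concrete, the sketch below outlines how a 30 s audio window can be converted to an 80-channel log-Mel spectrogram, passed through the Whisper-base encoder, reduced to 10 PCA dimensions, and resampled to 32 Hz. It is a minimal illustration, not the authors' code: the Hugging Face `transformers`, `scikit-learn`, and `scipy` libraries, the function names, and the placeholder PCA-fitting data are all assumptions made for the example.

```python
# Hedged sketch: contextualized Whisper-base encoder features for EEG encoding.
# Assumes the Hugging Face `transformers`, `scikit-learn`, and `scipy` packages;
# variable names (e.g. `audio_window`) and the PCA placeholder data are illustrative.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel
from sklearn.decomposition import PCA
from scipy.signal import resample

SR = 16_000                      # Whisper expects 16 kHz audio
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def encoder_features(audio_window: np.ndarray, layer: int = 6) -> np.ndarray:
    """Return layer activations (frames x 512) for up to 30 s of preceding audio."""
    # 1) Log-Mel spectrogram: 80 mel channels, padded/truncated to 30 s (3000 frames).
    feats = feature_extractor(audio_window, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        # 2) Convolutional front end + 6 Transformer encoder layers, keeping all hidden states.
        out = model.encoder(feats.input_features, output_hidden_states=True)
    # hidden_states[0] is the convolutional output; hidden_states[layer] is encoder layer `layer`.
    return out.hidden_states[layer].squeeze(0).numpy()   # (1500 frames at 50 Hz, 512 dims)

def attention_map(audio_window: np.ndarray, layer: int = 6) -> np.ndarray:
    """Self-attention weights for one layer, averaged over the 8 heads (frames x frames)."""
    feats = feature_extractor(audio_window, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        out = model.encoder(feats.input_features, output_attentions=True)
    return out.attentions[layer - 1].squeeze(0).mean(dim=0).numpy()

# 3) Dimensionality reduction: project onto 10 PCA axes fitted on held-out audiobook features.
held_out = np.random.randn(10_000, 512)          # placeholder for features from other audio
pca = PCA(n_components=10).fit(held_out)

def eeg_ready_features(audio_window: np.ndarray, duration_s: float) -> np.ndarray:
    acts = encoder_features(audio_window)
    acts = acts[: int(duration_s * 50)]          # drop frames beyond the actual audio length
    reduced = pca.transform(acts)                # (frames, 10)
    # 4) Resample from Whisper's 50 Hz frame rate to 32 Hz to match the EEG.
    return resample(reduced, int(duration_s * 32), axis=0)
```

Whisper-base produces 1500 encoder frames for each 30 s window (a 50 Hz frame rate), so the final resampling step aligns the 10-dimensional feature time series with the 32 Hz grid on which the EEG is analyzed.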
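The prediction step can be sketched in the same spirit. The summary states only that Whisper layer outputs predicted EEG in a cross-validated multiple regression framework; the snippet below therefore uses ridge regression on time-lagged features with 5-fold cross-validation purely as illustrative stand-ins, not as the paper's exact method, lags, or regularization.

```python
# Hedged sketch: cross-validated multiple regression from Whisper features to EEG.
# The ridge penalty, lag count, and 5-fold split are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def lag_features(X: np.ndarray, n_lags: int) -> np.ndarray:
    """Stack the current and the n_lags preceding 32 Hz feature frames at each time point."""
    lagged = [np.roll(X, lag, axis=0) for lag in range(n_lags + 1)]
    for lag, arr in enumerate(lagged):
        arr[:lag] = 0.0                       # zero out frames shifted in from before onset
    return np.hstack(lagged)

def predict_eeg(features: np.ndarray, eeg: np.ndarray, n_lags: int = 16) -> np.ndarray:
    """Return per-electrode prediction accuracy (Pearson r), averaged over CV folds."""
    X = lag_features(features, n_lags)        # (time, 10 * (n_lags + 1))
    scores = np.zeros(eeg.shape[1])
    for train, test in KFold(n_splits=5).split(X):
        model = Ridge(alpha=1.0).fit(X[train], eeg[train])
        pred = model.predict(X[test])
        for ch in range(eeg.shape[1]):
            scores[ch] += np.corrcoef(pred[:, ch], eeg[test, ch])[0, 1] / 5
    return scores
```

In this kind of encoding analysis the lagged design matrix lets the regression capture a delayed, spread-out EEG response to each feature frame; the 16 lags used here (500 ms at 32 Hz) are an arbitrary illustrative choice.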