MelNet - Audio Samples

Audio samples accompanying the paper MelNet: A Generative Model for Audio in the Frequency Domain.

Notes:


Single-Speaker Speech Generation

Samples generated by MelNet trained on the task of unconditional single-speaker speech generation using professionally recorded audiobook data from the Blizzard 2013 dataset.

Samples

Samples from the model without biasing or priming.

Biased Samples

Samples from the model using a bias of 1.0.
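This page does not define the bias. As a rough illustration only, the sketch below assumes the sampling bias introduced by Graves (2013) for mixture-density outputs, where a bias b sharpens the mixture weights and shrinks each component's standard deviation. The function name and parameters are hypothetical, and nothing here is confirmed to match MelNet's implementation.

```python
import math
import random

def biased_mixture_sample(logits, means, log_stds, bias=0.0, rng=random):
    """Draw one value from a 1-D Gaussian mixture with a sampling bias.

    Assumed formulation (Graves, 2013), not confirmed for MelNet:
      weights = softmax(logits * (1 + bias))
      std_k   = exp(log_std_k - bias)
    bias = 0 recovers ordinary, unbiased sampling.
    """
    scaled = [l * (1.0 + bias) for l in logits]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Pick a component according to the sharpened weights.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]

    # Sample from that component with its shrunken standard deviation.
    std = math.exp(log_stds[k] - bias)
    return rng.gauss(means[k], std)
```

Under this assumption, a larger bias trades diversity for samples closer to the most likely modes, which is why biased samples tend to sound cleaner but less varied.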

Primed Samples

The first 5 seconds of each audio clip are from the dataset and the remaining 5 seconds are generated by the model.
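Priming can be pictured as a minimal sketch for a generic autoregressive model: the real prefix is copied as-is, and the model then continues from it one step at a time. The `next_frame` callable is a hypothetical stand-in for the trained model, and the sequence elements are plain numbers here rather than spectrogram frames.

```python
from typing import Callable, List, Sequence

def primed_sample(
    prime: Sequence[float],
    total_len: int,
    next_frame: Callable[[List[float]], float],
) -> List[float]:
    """Continue an autoregressive sequence from a real prefix.

    `next_frame` is a hypothetical stand-in for the trained model: it receives
    everything produced so far and returns the next element.
    """
    if total_len < len(prime):
        raise ValueError("total_len must be at least the length of the prime")
    out = list(prime)            # the first part is copied from the dataset
    while len(out) < total_len:  # the rest is generated one step at a time
        out.append(next_frame(out))
    return out

# Toy usage: a "model" that always predicts the previous value plus one.
print(primed_sample([1.0, 2.0, 3.0], 6, lambda history: history[-1] + 1.0))
# prints [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```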


Multi-Speaker Speech Generation

Samples generated by MelNet trained on the task of unconditional multi-speaker speech generation using noisy, multi-speaker, multilingual speech data from the VoxCeleb2 dataset.

Samples

Samples from the model without biasing or priming.


Music Generation

Samples generated by MelNet trained on the task of unconditional music generation using recorded piano performances from the MAESTRO dataset.

Samples

Samples from the model without biasing or priming.

Primed Samples

The first 5 seconds of each audio clip are from the dataset and the remaining 5 seconds are generated by the model.


Single-Speaker Text-to-Speech

Samples generated by MelNet trained on the task of single-speaker TTS using professionally recorded audiobook data from the Blizzard 2013 dataset.

Samples

The first audio clip for each text is taken from the dataset and the remaining 3 are samples generated by the model.

“My dear Fanny, you feel these things a great deal too much. I am most happy that you like the chain,”

Looking with a half fantastic curiosity to see whether the tender grass of early spring,

“I like them round,” said Mary. “And they are exactly the color of the sky over the moor.”

Lydia was Lydia still; untamed, unabashed, wild, noisy, and fearless.

“Oh, he has been away from New York—he has been all round the world. He doesn't know many people here, but he's very sociable, and he wants to know every one.”

Primed Samples

Each unlabelled audio clip is taken from the dataset, and the clip that directly follows it is a sample generated by the model primed with that clip.

Write a fond note to the friend you cherish.

Pluck the bright rose without leaves.

Two plus seven is less than ten.

He said the same phrase thirty times.

We frown when events take a bad turn.


Multi-Speaker Text-to-Speech

Samples generated by MelNet trained on the task of multi-speaker TTS using noisy speech recognition data from the TED-LIUM 3 dataset.

Samples

Samples generated by the model conditioned on text and speaker ID. The conditioning text and speaker IDs are taken directly from the validation set (text in the dataset is unnormalized and unpunctuated).

it wasn't like i was asking for the code to a nuclear bunker or anything like that but the amount of resistance i got from this

and what that form is modeling and shaping is not cement

that every person here every decision that you've made today every decision you've made in your life you've not really made that decision but in fact

syria was largely a place of tolerance historically accustomed

and no matter what the rest of the world tells them they should be

the years went by and the princess grew up into a beautiful young woman

i spent so much time learning this language why do i only

and we were down to eating one meal a day running from place to place but wherever we could help we did at a certain point in time in

phrases and words even if you have a phd of chinese language you can't understand them

and when they came back and told us about it we really started thinking about the ways in which we see styrofoam every day

is only a very recent religious enthusiasm it surfaced only in the west

chances are that they are rooted in the productivity crisis

i cannot face your fears or chase your dreams and you can't do that for me but we can be supportive of eachother

the first law of travel and therefore of life you're only as strong

Selected Speakers

Samples generated by the model for selected speakers. Reference audio for each of the speakers can be found on the TED website.

A cramp is no small danger on a swim.

He said the same phrase thirty times.

Pluck the bright rose without leaves.

Two plus seven is less than ten.

The glow deepened in the eyes of the sweet girl.

Bring your problems to the wise chief.

Write a fond note to the friend you cherish.

Clothes and lodging are free to new men.

We frown when events take a bad turn.

Port is a strong wine with a smoky taste.


WaveNet Baseline

For comparison, we train WaveNet on the same three unconditional audio generation tasks used to evaluate MelNet (single-speaker speech generation, multi-speaker speech generation, and music generation).

Single-Speaker Speech Generation

Samples without biasing or priming.

Samples with priming: 5 seconds from the dataset followed by 5 seconds generated by WaveNet.

Multi-Speaker Speech Generation

Samples without biasing or priming.

Music Generation

Samples without biasing or priming.

Samples from a two-stage model that first generates MIDI notes and then uses WaveNet to synthesize audio conditioned on those notes.


Ablation - Multiscale Modelling

The following models were trained on the same data, with each model using a different number of tiers.
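To give a rough sense of what a tier is, the sketch below assumes that a spectrogram is partitioned coarse-to-fine by repeatedly keeping the even-indexed rows or columns and peeling off the odd-indexed ones, alternating between the time and frequency axes. The helper names, the starting axis, and the nested-list representation are illustrative assumptions, not details taken from this page.

```python
from typing import List, Tuple

Grid = List[List[float]]  # rows = time steps, columns = frequency bins

def split(x: Grid, axis: int) -> Tuple[Grid, Grid]:
    """Split a grid into its even- and odd-indexed rows (axis 0) or columns (axis 1)."""
    if axis == 0:
        return x[0::2], x[1::2]
    return [row[0::2] for row in x], [row[1::2] for row in x]

def make_tiers(x: Grid, num_tiers: int) -> List[Grid]:
    """Return tiers ordered coarse to fine; together they cover every entry of x once.

    Assumption: the split axis alternates, starting with time (axis 0).
    """
    tiers: List[Grid] = []
    axis = 0
    for _ in range(num_tiers - 1):
        x, detail = split(x, axis)  # keep the even half, peel off the odd half
        tiers.append(detail)
        axis = 1 - axis
    tiers.append(x)                 # coarsest tier, which would be generated first
    return tiers[::-1]
```

In this picture, a model with fewer tiers starts from a larger, higher-resolution first tier, so more of the spectrogram must be generated without a coarser version to condition on.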

5-Tier Model

4-Tier Model

3-Tier Model

2-Tier Model