IY,
AE, SH…) or single letters
(S, T, L…). Append a digit
1-8 after a vowel to stress it.
No neural network. No recorded samples. No filters. Sammy synthesizes speech from scratch, sample by sample, using one of the oldest tricks in computer audio: drive three oscillators at the frequencies the mouth would resonate at, and reset them all every time the vocal cords would snap shut. That reset is what makes him sound like a little robot instead of a pure tone.
Your vocal tract has resonances - standing waves at specific frequencies. Linguists call these formants. Move your tongue, the resonances move with it, and that's how one sound becomes another.
Sammy fakes them directly. Two sine waves plus one rectangle,
tuned to F1, F2 and F3. No
filters, no mouth - just the frequencies.
A plain sum of those three oscillators sounds nothing like a voice. It sounds like a synthesizer.
The trick: every time the glottal pulse would fire (around a hundred times a second for a male voice), all three phases are reset to zero. You get a comb of harmonics instead of three isolated tones. That comb is what your ears parse as speech.
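A minimal sketch of that reset, not Sammy's actual source: two sines and one rectangle wave, with every phase snapped back to zero once per pitch period. The function name `renderVoiced`, its parameters, and the numbers in the usage line are all illustrative assumptions.

```javascript
// Sketch only: names, parameters and values are assumptions, not Sammy's code.
function renderVoiced({ f1, f2, f3, a1, a2, a3, pitchHz, sampleRate, numSamples }) {
  const out = new Float32Array(numSamples);
  // Samples per glottal cycle (about 100 cycles per second for a male voice).
  const period = Math.max(1, Math.round(sampleRate / pitchHz));
  let p1 = 0, p2 = 0, p3 = 0; // oscillator phases, measured in cycles (0..1)

  for (let n = 0; n < numSamples; n++) {
    if (n % period === 0) {
      // Glottal pulse: reset all three phases at once.
      p1 = p2 = p3 = 0;
    }
    const s1 = Math.sin(2 * Math.PI * p1); // F1: sine
    const s2 = Math.sin(2 * Math.PI * p2); // F2: sine
    const s3 = p3 < 0.5 ? 1 : -1;          // F3: rectangle
    out[n] = a1 * s1 + a2 * s2 + a3 * s3;

    p1 = (p1 + f1 / sampleRate) % 1;
    p2 = (p2 + f2 / sampleRate) % 1;
    p3 = (p3 + f3 / sampleRate) % 1;
  }
  return out;
}

// Hypothetical vowel-ish settings, 50 ms of audio at 8 kHz.
const demoVoiced = renderVoiced({
  f1: 500, f2: 1500, f3: 2500,
  a1: 0.5, a2: 0.3, a3: 0.1,
  pitchHz: 100, sampleRate: 8000, numSamples: 400,
});
```

Because the reset makes the waveform repeat exactly once per period, its spectrum can only contain multiples of the pitch. That is where the comb comes from.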
Consonants like S, F, SH
don't use the oscillators at all. They're just colored
noise: random samples pushed through a cheap bandpass centered
where the hiss should sit in the spectrum. Two running averages, that's the whole filter.
Voiced fricatives (V, Z) do both at
once: the oscillators keep buzzing and the noise mixes in on top.
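One way two running averages can act as a bandpass is to subtract a slow average from a fast one. The text doesn't say how Sammy combines them, so treat this as a guess at the idea: `makeNoiseSource`, `mixVoicedFricative`, and the coefficients are made-up names and values.

```javascript
// Sketch only: the fast-minus-slow combination is an assumption.
function makeNoiseSource(fastCoeff, slowCoeff, random = Math.random) {
  let fast = 0; // short running average: smooths away the very top of the spectrum
  let slow = 0; // long running average: follows only the low-frequency drift
  return function nextSample() {
    const white = random() * 2 - 1; // white noise in [-1, 1)
    fast += fastCoeff * (white - fast);
    slow += slowCoeff * (white - slow);
    return fast - slow; // lows cancel out, leaving a band in between
  };
}

// Voiced fricative (V, Z): keep the oscillator signal and add noise on top.
function mixVoicedFricative(voiced, noise, noiseGain) {
  const out = new Float32Array(voiced.length);
  for (let n = 0; n < voiced.length; n++) {
    out[n] = voiced[n] + noiseGain * noise();
  }
  return out;
}

// Hypothetical coefficients; moving them shifts where the hiss sits.
const hiss = makeNoiseSource(0.5, 0.05);
```

Raising `slowCoeff` cuts more of the low end, lowering `fastCoeff` dulls the top.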
Finally the output is quantized to 4 bits. 16 amplitude levels, no more. That's the staircase you hear in the waveform - the D/A converter of a 1980s home computer, modeled back in.
Without it Sammy sounds like a clinical vocoder. With it, he sounds like home.
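The 4-bit staircase can be sketched in a few lines. This assumes samples in the range -1 to 1 and a hypothetical helper name `quantize4Bit`; the real engine may scale or round differently.

```javascript
// Sketch only: snap every sample to one of 16 evenly spaced levels.
function quantize4Bit(samples) {
  const out = new Float32Array(samples.length);
  for (let n = 0; n < samples.length; n++) {
    const clamped = Math.max(-1, Math.min(1, samples[n]));
    const level = Math.round(((clamped + 1) / 2) * 15); // integer 0..15
    out[n] = (level / 15) * 2 - 1;                      // back to -1..1
  }
  return out;
}
```

Run any smooth waveform through it and only 16 distinct values come out the other side.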
Each phoneme is a set of parameters - three formants, three amplitudes, a pitch. The engine expands a phoneme into 10-millisecond frames, then linearly interpolates the two frames on either side of each boundary so one sound ramps into the next.
Short, crude, mechanical. Smoother interpolation would sound more human. That's exactly why we don't do it.
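The expand-then-blend step could look roughly like this. The field names (`f1`, `a1`, `pitch`, `frames`) and both function names are assumptions for illustration; only the two-frames-either-side window comes from the description above.

```javascript
// Sketch only: field and function names are hypothetical.
const PARAMS = ["f1", "f2", "f3", "a1", "a2", "a3", "pitch"];
const BLEND = 2; // frames blended on each side of a boundary

// Repeat each phoneme's parameters for its length in 10 ms frames.
function expandToFrames(phonemes) {
  const frames = [];
  const boundaries = []; // index of the first frame of each following phoneme
  for (const ph of phonemes) {
    if (ph.frames > 0 && frames.length > 0) boundaries.push(frames.length);
    for (let i = 0; i < ph.frames; i++) {
      const frame = {};
      for (const key of PARAMS) frame[key] = ph[key];
      frames.push(frame);
    }
  }
  return { frames, boundaries };
}

// Ramp linearly across the 2 * BLEND frames around every boundary.
function blendBoundaries(frames, boundaries) {
  const out = frames.map((frame) => ({ ...frame }));
  const steps = 2 * BLEND + 1;
  for (const b of boundaries) {
    const from = frames[b - 1]; // last frame of the outgoing sound
    const to = frames[b];       // first frame of the incoming sound
    for (let k = -BLEND; k < BLEND; k++) {
      const idx = b + k;
      if (idx < 0 || idx >= frames.length) continue;
      const t = (k + BLEND + 1) / steps; // 1/5, 2/5, 3/5, 4/5
      for (const key of PARAMS) {
        out[idx][key] = from[key] + (to[key] - from[key]) * t;
      }
    }
  }
  return out;
}
```

With two four-frame phonemes whose F1 values are 100 and 600, the F1 track comes out as roughly 100, 100, 200, 300, 400, 500, 600, 600. Phonemes shorter than two frames would have overlapping windows, which this sketch does not try to handle.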
The original SAM shipped with a 600-rule English-to-phoneme
converter. Sammy doesn't. You write phonemes directly, like
HE4LOW instead of "hello". Less convenient, way
more fun.
For a chatbot, you phonemize the reply table once. For everything else, you get to spell things how they sound, which is its own small pleasure.
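Reading such a string comes down to a greedy match: try a two-letter code first, fall back to a single letter, and pick up a stress digit 1-8 if one follows a vowel. The sketch below uses a tiny phoneme table pieced together from the examples on this page; it is not Sammy's real table, and `tokenize` is a hypothetical name.

```javascript
// Sketch only: a deliberately small, assumed phoneme table.
const VOWELS = new Set(["IY", "AE", "AY", "EH", "AW", "OW"]);
const CONSONANTS = new Set(["SH", "DH", "S", "T", "L", "V", "Z", "F", "H", "R", "M", "N", "D"]);
const PAUSES = new Set([" ", ".", ",", "?", "!"]);

function tokenize(text) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    if (PAUSES.has(text[i])) {
      tokens.push({ code: "PAUSE", stress: 0 });
      i += 1;
      continue;
    }
    const two = text.slice(i, i + 2);
    const one = text[i];
    let code;
    if (VOWELS.has(two) || CONSONANTS.has(two)) code = two;      // two-letter code wins
    else if (VOWELS.has(one) || CONSONANTS.has(one)) code = one; // single letter
    else throw new Error(`Unknown phoneme at position ${i}: "${text.slice(i)}"`);
    i += code.length;

    let stress = 0;
    if (VOWELS.has(code) && /^[1-8]$/.test(text[i] ?? "")) {
      stress = Number(text[i]); // digit right after a vowel marks stress
      i += 1;
    }
    tokens.push({ code, stress });
  }
  return tokens;
}

// "TAY4NIY" splits into T, AY (stress 4), N, IY.
const demoTokens = tokenize("TAY4NIY");
```

Anything outside the table, including a digit after a consonant, throws instead of guessing.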
4 after one vowel and 1 after another - the first syllable jumps in pitch and the second stays low. That's how SAM did emphasis. Try IH4NTER EH1STIHNX.
BLARGH SNURK DRAGOB is more fun than actual words because Sammy commits equally to both.
AO4L YOHR BEY4S AHR BIHLAONX TUW AH4S. The more idiosyncratic your spelling, the more character he gets.

    // 500 lines, one file, zero dependencies
    import { SammyLike } from './sammy.js';

    const sammy = new SammyLike();

    // speak returns a Promise that resolves when playback ends
    await sammy.speak("HAY DHEHR.");
    await sammy.speak("AY AEM TAY4NIY.", { pitch: 40 });

    // render to WAV (for caching or pre-baking chatbot replies)
    const blob = await sammy.renderToWav("DAW4NLOWD MIY4.");

A toy by gizmo64k.
Pure HTML + CSS + JavaScript. No frameworks, no dependencies, no CDN,
no training data. Three oscillators, a glottal counter, and a bit of
colored noise. The whole thing is a single file.