Top Related Projects
Robust Speech Recognition via Large-Scale Weak Supervision
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
Quick Overview
SAM (Software Automatic Mouth) is an open-source speech synthesizer originally developed for 8-bit computers in the 1980s. This GitHub repository contains a modern C port of the original assembly code, allowing SAM to run on contemporary systems. It provides a unique, retro-style voice synthesis experience with a small footprint.
Pros
- Lightweight and portable, capable of running on various platforms
- Offers a nostalgic, 8-bit era speech synthesis sound
- Highly customizable with adjustable speech parameters
- Simple to use with a straightforward API
Cons
- Limited in terms of natural-sounding speech compared to modern synthesizers
- Lacks support for multiple languages or accents
- May require additional effort to integrate into modern applications
- Documentation could be more comprehensive for advanced usage
Code Examples
- Basic usage to generate speech:
#include "sam.h"
int main() {
SetInput("Hello, world!");
SetSpeed(72);
SetPitch(64);
SetMouth(128);
SetThroat(128);
SAMMain();
return 0;
}
- Adjusting speech parameters:
#include "sam.h"
int main() {
SetInput("This is a test of different speech parameters.");
SetSpeed(120); // Faster speech
SetPitch(80); // Higher pitch
SetMouth(110); // Adjusted mouth shape
SetThroat(150); // Adjusted throat shape
SAMMain();
return 0;
}
- Using phonetic input:
#include "sam.h"
int main() {
EnablePhonemeOutput(1);
SetInput("/HEH4LOW WERLD"); // Phonetic representation of "Hello World"
SAMMain();
return 0;
}
Getting Started
To use SAM in your project:
- Clone the repository:
git clone https://github.com/s-macke/SAM.git
- Navigate to the src directory:
cd SAM/src
- Compile the source (a plain gcc build may also need SDL linker flags; see the Compile section below):
gcc *.c -o sam
- Run SAM:
./sam "Hello, world!"
To integrate SAM into your own C project, include the necessary header files and compile your code together with the SAM sources (or link against the compiled objects). Initialize the speech parameters, set the input text, and call SAMMain() to generate the speech output, as in the sketch below.
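A minimal integration sketch, assuming the sam.h API used in the examples above (SetInput, SetSpeed, SetPitch, SetMouth, SetThroat, SAMMain); the exact list of source files to compile alongside it depends on your checkout:

/* speak.c - hedged sketch, not taken from the repository.
   Build roughly like: gcc speak.c <SAM sources except its own main.c> -o speak
   (add SDL linker flags if SDL output is enabled). */
#include "sam.h"

int main() {
    SetInput("Integration test.");  // text to synthesize
    SetSpeed(72);                   // default SAM voice parameters
    SetPitch(64);
    SetMouth(128);
    SetThroat(128);
    SAMMain();                      // run the synthesizer
    return 0;
}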
Competitor Comparisons
Robust Speech Recognition via Large-Scale Weak Supervision
Pros of Whisper
- Advanced speech recognition capabilities for multiple languages
- Backed by OpenAI's extensive research and development resources
- Regularly updated with new features and improvements
Cons of Whisper
- Requires significant computational resources for optimal performance
- More complex setup and usage compared to simpler audio tools
- Larger model size and longer processing times for some tasks
Code Comparison
SAM (Software Automatic Mouth):
void Output_Phoneme(unsigned char phoneme)
{
    unsigned char s;
    unsigned char phase1;
    unsigned char phase2;
    unsigned char phase3;
    unsigned char mem38;
    /* ... body omitted ... */
}
Whisper:
def transcribe(model, audio, task, language, **decode_options):
    mel = log_mel_spectrogram(audio)
    segment = pad_or_trim(mel, N_FRAMES).to(model.device)
    decode_options["language"] = language
    result = decode(model, segment, **decode_options)
    return result
Key Differences
- SAM focuses on speech synthesis, while Whisper specializes in speech recognition
- SAM is written in C (ported from 6502 assembly), while Whisper is primarily Python-based
- Whisper offers more advanced features and broader language support
- SAM is lighter and simpler, suitable for retro computing and embedded systems
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Pros of vosk-api
- Focuses on speech recognition, offering a more specialized solution for audio processing
- Supports multiple languages and provides pre-trained models
- Designed for real-time speech recognition with low latency
Cons of vosk-api
- More complex setup and integration compared to SAM's simplicity
- Requires more computational resources for speech recognition tasks
- Limited to speech recognition, while SAM addresses the complementary task of speech synthesis
Code Comparison
vosk-api (Python):
from vosk import Model, KaldiRecognizer
import pyaudio
model = Model("model")
rec = KaldiRecognizer(model, 16000)
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
SAM (C):
#include "sam.h"
int main() {
SetInput("Hello World");
SetSpeed(72);
SetPitch(64);
SetMouth(128);
SetThroat(128);
SAMMain();
return 0;
}
The code snippets demonstrate the different focus areas of the two projects. vosk-api is centered around speech recognition, while SAM is geared towards speech synthesis. vosk-api requires more setup for audio input and model initialization, whereas SAM offers a simpler interface for generating speech output.
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Pros of TTS
- More comprehensive and feature-rich text-to-speech solution
- Supports multiple languages and voice models
- Active development with frequent updates and community support
Cons of TTS
- More complex setup and usage compared to SAM
- Requires more computational resources for training and inference
Code Comparison
TTS example:
from TTS.api import TTS
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hello world!", file_path="output.wav")
SAM example:
#include "render.h"
#include "sam.h"
SetInput("Hello world!");
sam_flush();
Key Differences
- TTS is a modern, Python-based library with support for various TTS models
- SAM is a C implementation of a classic 1980s speech synthesizer
- TTS offers more natural-sounding speech output
- SAM provides a unique retro-style voice with lower resource requirements
Use Cases
- TTS: Modern applications requiring high-quality, multi-language speech synthesis
- SAM: Retro-style projects, educational purposes, or resource-constrained environments
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Pros of DeepSpeech
- Utilizes modern deep learning techniques for speech recognition
- Supports multiple languages and can be fine-tuned for specific domains
- Actively maintained with regular updates and improvements
Cons of DeepSpeech
- Requires significant computational resources for training and inference
- More complex to set up and use compared to simpler text-to-speech systems
- May have higher latency for real-time applications
Code Comparison
DeepSpeech (Python):
import deepspeech
model = deepspeech.Model('path/to/model.pbmm')
text = model.stt(audio_buffer)
SAM (C):
void SetMouthThroat(unsigned char mouth, unsigned char throat)
{
    sam->mouth = mouth;
    sam->throat = throat;
}
Key Differences
- DeepSpeech focuses on speech recognition (audio to text), while SAM is a text-to-speech system
- DeepSpeech uses neural networks, whereas SAM employs older formant synthesis techniques
- DeepSpeech is more flexible and accurate for modern applications, but SAM is simpler and requires fewer resources
- DeepSpeech is actively maintained, while SAM is a historical project with limited updates
Use Cases
- DeepSpeech: Modern speech recognition applications, voice assistants, transcription services
- SAM: Retro computing projects, educational purposes, low-resource environments
Silero Models: pre-trained speech-to-text, text-to-speech and text-enhancement models made embarrassingly simple
Pros of silero-models
- Focuses on speech recognition and text-to-speech models
- Supports multiple languages and provides pre-trained models
- Actively maintained with regular updates and improvements
Cons of silero-models
- More complex setup and usage compared to SAM
- Requires more computational resources for running models
- Limited to speech-related tasks, and its PyTorch dependency is far heavier than SAM's tiny standalone binary
Code Comparison
SAM:
#include "sam.h"

int main() {
    SetInput("Hello World");
    SAMMain();
    return 0;
}
silero-models:
import torch
import soundfile as sf
from silero import silero_stt
model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                       model='silero_stt',
                                       language='en')
Summary
SAM is a tiny retro speech synthesizer, while silero-models focuses on modern speech recognition and synthesis. SAM is easier to use for simple, low-resource speech output, while silero-models offers more advanced speech processing capabilities but requires more setup and resources. The choice between them depends on the specific project requirements and the desired level of complexity.
README
SAM
Software Automatic Mouth - Tiny Speech Synthesizer
What is SAM?
SAM is a very small Text-To-Speech (TTS) program written in C that runs on most popular platforms. It is an adaptation to C of the speech software SAM (Software Automatic Mouth) for the Commodore C64, published in 1982 by Don't Ask Software (now SoftVoice, Inc.). It includes a Text-To-Phoneme converter called reciter and a Phoneme-To-Speech routine for the final output. It is so small that it also works on embedded computers. On my computer it takes less than 39KB of disk space (much less on embedded devices, where the executable overhead is not necessary) and is a fully standalone program. For immediate output it uses the SDL library; otherwise it can save .wav files.
An online version and executables for Windows can be found on the web site: http://simulationcorner.net/index.php?page=sam
Compile
Simply type "make" in your command prompt. In order to compile without SDL remove the SDL statements from the CFLAGS and LFLAGS variables in the file "Makefile".
It should compile on every UNIX-like operating system. For Windows you need Cygwin or MinGW( + libsdl).
Fork
Take a look at https://github.com/vidarh/SAM for a more refactored and cleaner version of the code.
Usage
Type
./sam I am Sam
for the first output.
If you have disabled SDL, try
./sam -wav i_am_sam.wav I am Sam
to get a wav file. This file can be played by most media players available for the PC.
You can also try other options like -pitch number, -speed number, -throat number and -mouth number.
Some typical values written in the original manual are:
DESCRIPTION        SPEED  PITCH  THROAT  MOUTH
Elf                  72     64     110    160
Little Robot         92     60     190    190
Stuffy Guy           82     72     110    105
Little Old Lady      82     32     145    145
Extra-Terrestrial   100     64     150    200
SAM                  72     64     128    128
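For example, combining the options above with the "Little Robot" values from the table (the exact flag syntax is assumed to match the options listed above):
./sam -speed 92 -pitch 60 -throat 190 -mouth 190 I am a little robot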
It can even sing; look at the file "sing" for a small example.
For the phoneme input table look in the Wiki.
A description of additional features can be found in the original manual at http://www.retrobits.net/atari/sam.shtml or in the manual of the equivalent Apple II program http://www.apple-iigs.info/newdoc/sam.pdf
Adaptation To C
This program (disassembly at http://hitmen.c02.at/html/tools_sam.html) was converted semi-automatically into C by translating each assembler opcode, e. g.
lda 56     =>  A = mem[56];
jmp 38018  =>  goto pos38018;
inc 38     =>  mem[38]++;
...
Then it was manually rewritten to remove most of the jumps and register variables in the code and to rename the variables to proper names. Most of the description below is a result of this rewriting process.
Unfortunately it is still unreadable, but you should see from where I started :)
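To give a flavour of that intermediate stage, here is a contrived fragment in the style of the machine-translated code (not copied from the repository): a flat memory array, 6502-style register variables and goto labels.

/* Illustrative only; names and addresses are made up. */
unsigned char mem[65536];
unsigned char A, X, Y;

void example(void) {
pos38018:
    A = mem[56];       /* lda 56 */
    mem[38]++;         /* inc 38 */
    if (A != 0)
        goto pos38018; /* a conditional branch such as bne */
}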
Short description
First of all, I will limit myself here to a very coarse description. There are very many exceptions defined in the source code that I will not explain. Also, a lot of the code is still unknown to me, e.g. Code47503. For a complete understanding of the code I need more time, and especially more eyes having a look at the code.
Reciter
It converts the English text to phonemes using the ruleset shown in the wiki.
The rule " ANT(I)", "AY" means that if an "I" is found with the preceding letters " ANT", the I is replaced by the phoneme "AY".
There are some special signs in these rules, like # & @ ^ + : %, which can mean, e.g., that there must be a vowel, a consonant or something else at that position.
With the -debug option you will get the matching rules and the resulting phonemes.
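For intuition only, here is a small self-contained C sketch of that rule idea. It is not the reciter's actual code or data layout (the real reciter uses packed rule tables and the special context signs mentioned above); it just shows a rule with a literal left context, the letter "in parentheses" and a replacement phoneme.

#include <stdio.h>
#include <string.h>

/* Toy reciter-style rule: " ANT(I)" -> "AY".
   The left context must appear immediately before the letter. */
typedef struct {
    const char *left;     /* required text before the letter */
    char        letter;   /* the letter in parentheses */
    const char *phoneme;  /* replacement phoneme */
} Rule;

static const Rule rules[] = {
    { " ANT", 'I', "AY" },
};

int main(void) {
    const char *text = " ANTIDOTE";
    for (size_t i = 0; text[i]; i++) {
        for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
            size_t n = strlen(rules[r].left);
            if (i >= n && text[i] == rules[r].letter &&
                strncmp(text + i - n, rules[r].left, n) == 0) {
                printf("rule \" ANT(I)\" fires at position %zu -> %s\n",
                       i, rules[r].phoneme);
            }
        }
    }
    return 0;
}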
Output
Here is the full tree of subroutine calls:
SAMMain()
    Parser1()
    Parser2()
        Insert()
    CopyStress()
    SetPhonemeLength()
    Code48619()
    Code41240()
        Insert()
    Code48431()
        Insert()
    Code48547
        Code47574
            Special1
        Code47503
        Code48227
SAMMain() is the entry routine and calls all further routines. Parser1 takes the phoneme input and transforms it into three tables: phonemeindex[], stress[] and phonemelength[] (phonemelength[] is still zero at this point).
These tables are then modified further:
- Parser2 exchanges some phonemes for others and inserts new ones.
- CopyStress adds 1 to the stress under some circumstances.
- SetPhonemeLength sets the phoneme lengths.
- Code48619 changes the phoneme lengths.
- Code41240 adds some additional phonemes.
- Code48431 applies some extra rules.
The wiki shows all possible phonemes and some flag fields.
The final content of these tables can be seen with the -debug command.
In the function PrepareOutput() these tables are partly copied into the smaller output tables phonemeindexOutput[], stressOutput[] and phonemelengthOutput[].
Final Output
Except for some special phonemes, the output is built as a linear combination:
A = A1 * sin(f1 * t) +
    A2 * sin(f2 * t) +
    A3 * rect(f3 * t)
where rect is a rectangular function with the same periodicity as sin. It seems strange, but this is really enough for most types of phonemes.
The phonemes are therefore converted, via some tables, into pitches[], frequency1[] = f1, frequency2[] = f2, frequency3[] = f3, amplitude1[] = A1, amplitude2[] = A2 and amplitude3[] = A3.
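As a rough illustration of this formula, here is a plain floating-point C sketch of one output sample (the real routine shown below works on 4-bit sine/rectangle and amplitude tables instead; the values passed in are placeholders):

#include <math.h>
#include <stdio.h>

/* Rectangular wave with the same period as sin: +1 while sin is
   non-negative, -1 otherwise. */
static double rect(double x) {
    return sin(x) >= 0.0 ? 1.0 : -1.0;
}

/* One sample of A = A1*sin(f1*t) + A2*sin(f2*t) + A3*rect(f3*t).
   In SAM the a/f values come from the per-phoneme tables described above
   (amplitude1[], frequency1[], ...). */
static double sample(double t, double a1, double f1,
                     double a2, double f2,
                     double a3, double f3) {
    return a1 * sin(f1 * t) + a2 * sin(f2 * t) + a3 * rect(f3 * t);
}

int main(void) {
    /* made-up values, just to show the call */
    printf("%f\n", sample(0.01, 1.0, 700.0, 0.5, 1200.0, 0.25, 2400.0));
    return 0;
}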
The above formula is calculated in one very well optimized routine. It consists of only 26 instructions:
48087: LDX 43 ; get phase
CLC
LDA 42240,x ; load sine value (high 4 bits)
ORA TabAmpl1,y ; get amplitude (in low 4 bits)
TAX
LDA 42752,x ; multiplication table
STA 56 ; store
LDX 42 ; get phase
LDA 42240,x ; load sine value (high 4 bits)
ORA TabAmpl2,y ; get amplitude (in low 4 bits)
TAX
LDA 42752,x ; multiplication table
ADC Var56 ; add with previous values
STA 56 ; and store
LDX 41 ; get phase
LDA 42496,x ; load rect value (high 4 bits)
ORA TabAmpl3,y ; get amplitude (in low 4 bits)
TAX
LDA 42752,x ; multiplication table
ADC 56 ; add with previous values
ADC #136
LSR A ; get highest 4 bits
LSR A
LSR A
LSR A
STA 54296 ;SID main output command
The rest is handled in a special way; at the moment I cannot figure out how. But it seems that it uses some noise (e.g. for "s"), based on a table of random values.
License
The software is a reverse-engineered version of software published more than 34 years ago by "Don't Ask Software".
The company no longer exists. Any attempt to contact the original authors has failed. Hence S.A.M. can best be described as abandonware (http://en.wikipedia.org/wiki/Abandonware).
As long as this is the case I cannot put my code under any specific open source software license. However, the software might be used under the "Fair Use" act (https://en.wikipedia.org/wiki/FAIR_USE_Act) in the USA.
Contact
If you have questions, don't hesitate to ask me. If you discover some new knowledge about the code, please mail me.
Sebastian Macke Email: sebastian@macke.de