All human speakers, all humans, all sounds, all human and computer language encodings for an open Internet

Beomseok LEE (@beomseok_lee_): How do we bridge Speech Encoders and LLMs?

My reply:

All human speakers, all humans, all sounds, all human and computer language encodings for an open Internet

You can aim to write an encoder for speech sounds from all human languages, so that all spoken sounds are recognized and can be re-encoded in a smaller set of generated voices. That way the content of the speech matches things in the real world.

 
Across all human languages, set global codes for all things common to humans ("the earth", "water", "the sky", "the sun") so that no matter which human language's sounds and sound sequences are used, the meaning is specific.
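A minimal sketch of what such global codes could look like, assuming a shared registry that does not exist yet; the concept codes and word lists below are invented for illustration only:

```python
# Hypothetical registry of global concept codes, illustrating the idea that
# one code can stand behind many different sound/spelling sequences.
# The codes and word lists below are invented examples, not a real standard.
CONCEPT_REGISTRY = {
    "C00001": {"gloss": "the earth", "en": "earth", "es": "tierra", "ko": "땅",  "ar": "أرض"},
    "C00002": {"gloss": "water",     "en": "water", "es": "agua",   "ko": "물",  "ar": "ماء"},
    "C00003": {"gloss": "the sky",   "en": "sky",   "es": "cielo",  "ko": "하늘", "ar": "سماء"},
    "C00004": {"gloss": "the sun",   "en": "sun",   "es": "sol",    "ko": "해",  "ar": "شمس"},
}

# Reverse index: any surface word, in any of the listed languages, maps back
# to the same language-independent concept code.
WORD_TO_CONCEPT = {
    word: code
    for code, entry in CONCEPT_REGISTRY.items()
    for lang, word in entry.items()
    if lang != "gloss"
}

if __name__ == "__main__":
    for w in ["water", "agua", "물"]:
        print(w, "->", WORD_TO_CONCEPT[w])   # all three print the same code, C00002
```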
 
Focus on the efficiencies. If a voice recording is encoded with only 64 characters, that is too small a base and too ambiguous. Your LLM then needs larger and larger context, and the space of possibilities is too large. Get it down from "full audio quality" to "full lossless content" and "can reproduce all human voices efficiently".
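Some rough numbers to show the scale of the gap; the rates below are illustrative assumptions, not measurements of any particular codec:

```python
# Back-of-the-envelope comparison of encoding sizes for one second of speech.
# All rates are illustrative assumptions, not properties of any specific system.

raw_audio_bits   = 16_000 * 16          # 16 kHz, 16-bit mono PCM: 256,000 bits/s
lossy_audio_bits = 64_000               # a typical lossy voice/music bit rate
phonetic_bits    = 15 * 12              # ~15 phone-like symbols per second,
                                        # ~12 bits each for phone + pitch + stress
speaker_overhead = 2_000                # one-time voice parameters, amortized
                                        # over a long recording

print(f"raw PCM:            {raw_audio_bits:>8} bits per second")
print(f"lossy audio:        {lossy_audio_bits:>8} bits per second")
print(f"phonetic + prosody: {phonetic_bits:>8} bits per second (plus ~{speaker_overhead} bits once)")
print(f"ratio raw/phonetic: {raw_audio_bits / phonetic_bits:,.0f}x smaller")
```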
 
It is like a speech synthesizer that takes the natural voices of all humans and encodes them to a level of error that is sufficient for global communication between any of the N*(N-1)/2 human pairs, for N = 8.2 Billion.
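The pair count itself is easy to check:

```python
# Number of distinct speaker-listener pairs among N humans: N*(N-1)/2.
N = 8_200_000_000                            # roughly 8.2 Billion people
pairs = N * (N - 1) // 2
print(f"{pairs:.3e} possible human pairs")   # about 3.36e+19
```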
 
The whole point of writing sounds using alphabets is that the record is permanent. Use computers and sound processing to find a compact speech replication encoding, built from human speech in all languages, that is much smaller than the audio recording. It should aim both to record the sounds for reproduction and sharing across all languages, and to encode meaning for AI listeners (assistive intelligences).
 
There are about 5.4 Billion humans using the Internet, and 2.8 Billion not using it. That divide is partly device access, but it is also partly "has no written speech" access.
 
You are too young to have lived through the decades of language standardization, then variation, driven by television. Now the Internet has made it chaotic again. The efficiency of the Internet is less than 0.1% of what is possible.
 
But there are practical technology solutions that can give humans a way to encode sound, gesture, body language, multispectral information, and a host of physiological, chemical, and physical signals, all of which can be used to improve understanding and reduce ambiguity in human communication.
 
The IPA and the hodgepodge of "fonts" on the Internet hide the fact that characters do not encode meaning for everyone. The sounds came first; the sights and the images in human minds came first; then came the marks on permanent things like paper and mud and stone.
 
I see it as a global optimization problem: reduce the losses and waste that come from groups using paper technologies that were never able to do a complete job of encoding sounds. On top of that, the token correlation methods are churning out character and word sequences divorced from human feelings and from what humans want to communicate.
 
I cannot distill almost three decades of Internet Foundation results on encoding human languages and meanings into a few sentences. But focus on sound, not on the pictures that remind humans of sounds and meanings.
 
King Sejong had a phonetic code created for Korean sounds. It could be learned by the humans of that time in a short while, but it was not capable of encoding actual speakers or spoken and sung sounds. Now we can use cell phones, games, computers, and public resources to take the many variations of human speakers: keep the original, compact it for the audio record, and convert it, using global best practices, into a speech encoding that spans all humans and all languages and makes the Internet and all knowledge accessible by voice.
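A sketch of that pipeline, just to make the stages concrete; every function, field, and type name here is an invented placeholder, and the recognition and voice-modeling stages are left as stubs rather than claiming any particular method:

```python
# Hypothetical pipeline for a universal speech archive: keep the original,
# compact it for the record, and derive shared encodings from it.
# All names and fields here are illustrative placeholders, not a real API.
import zlib
from dataclasses import dataclass, field

@dataclass
class SpeechRecord:
    original_audio: bytes                                  # the untouched recording, kept permanently
    compact_archive: bytes                                 # losslessly compressed copy for storage
    phonetic_stream: list = field(default_factory=list)    # phone-level symbols with timing and prosody
    speaker_profile: dict = field(default_factory=dict)    # compact parameters to re-synthesize this voice
    concept_codes: list = field(default_factory=list)      # language-independent meaning codes

def archive_speech(original_audio: bytes) -> SpeechRecord:
    # Stage 1: never discard the original; compress it losslessly for the archive.
    compact = zlib.compress(original_audio)
    # Stages 2-4 would be real recognizers and voice models; here they are empty
    # stubs so the shape of the record is visible without claiming any method.
    phones = []        # e.g. IPA-like symbols spanning all languages
    voice = {}         # e.g. a few thousand bits of speaker parameters
    concepts = []      # e.g. codes from a shared concept registry
    return SpeechRecord(original_audio, compact, phones, voice, concepts)

if __name__ == "__main__":
    record = archive_speech(b"\x00\x01" * 8000)   # stand-in for one second of PCM samples
    print(len(record.original_audio), "bytes in,", len(record.compact_archive), "bytes archived")
```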

In parallel, take the LLM encodings and code the old text to meaning, particularly for Science Technology Engineering Mathematics Computing Finance Government Organizations Trade Production Issues and the other things on the Internet and in human society.

Make strings that encode dates into unambiguous date codes, place names into place codes, human names into name codes. All of these things exist in databases now, but they are not standard, and so they are ambiguous for data exchange and sharing in a global Internet society, one soon expanding into the solar system and beyond. Take all the values and units, the equations and methods and algorithms, and distill out the core. NOT by creating single points of failure and manipulation and monopolies, but by making all knowledge accessible and usable by all humans and their assistive intelligences (AIs).
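A sketch of that kind of normalization. Only the date handling below relies on a real, existing convention (ISO 8601); the place and person registries are invented placeholders standing in for shared, openly governed code lists:

```python
# Normalizing ambiguous strings into unambiguous codes.
# Dates: ISO 8601 is a real, existing standard. The place and person
# registries below are invented placeholders, not real databases.
from datetime import datetime

def normalize_date(text: str) -> str:
    """Turn several common date spellings into one ISO 8601 form."""
    for fmt in ("%d %B %Y", "%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

# Hypothetical shared registries (illustrative entries only).
PLACE_CODES = {"Seoul": "PLACE:KR-11", "Houston": "PLACE:US-TX-HOU"}
NAME_CODES  = {"Sejong the Great": "PERSON:000000001"}

print(normalize_date("9 October 2025"))     # -> 2025-10-09
print(normalize_date("October 9, 2025"))    # -> 2025-10-09
print(PLACE_CODES["Seoul"], NAME_CODES["Sejong the Great"])
```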

I am trying to do these things, but I have mostly run out of years. It is the right way to go, if one can prevent more monopolies and greedy groups.
 
I am fairly certain it is possible to use physiological and sensor data to encode a much wider bandwidth of data from humans, to use for determining intent and human ideas. Our keyboards at 200 words a minute are not much data, and the sequence processing requires a lot of memory. With recognizable sounds that also encode intent, passion, emphasis, emotion, urgency, and non-vocal feelings, much higher bandwidths are possible now. When those are played back, it might change how we remember and how we evoke memories and feelings.
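Rough arithmetic on that point; every rate below is an assumption chosen only to show the scale:

```python
# Rough comparison of channel rates: typing versus speech with prosody.
# All numbers are assumptions for scale, not measurements.

wpm            = 200                     # a very fast typist
chars_per_word = 5
bits_per_char  = 1.3                     # approximate entropy of English text
typing_bits_per_second = wpm * chars_per_word * bits_per_char / 60

speech_symbol_rate = 15                  # phone-like units per second
bits_per_symbol    = 12                  # phone + pitch + stress + timing
prosody_and_affect = 50                  # extra bits/s for emphasis, urgency, emotion (a guess)
speech_bits_per_second = speech_symbol_rate * bits_per_symbol + prosody_and_affect

print(f"typing : ~{typing_bits_per_second:5.1f} bits/s")
print(f"speech : ~{speech_bits_per_second:5.1f} bits/s")
print(f"ratio  : ~{speech_bits_per_second / typing_bits_per_second:4.1f}x")
```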
 
Richard Collins, The Internet Foundation
 
https://en.wikipedia.org/wiki/Hangul
 
 