If the groups working on LLMs would focus, first, on the human languages, then the “global open translation ids” which emerge will be fairly stable entities that all AI groups and Internet groups can use as standard tokens.
Rather than making up arbitrary string fragments as tokens for convenience, if everyone uses tokens (unique identifiers) for real things, it will speed global integration, perhaps by decades.
Imagine a table with all the words and terms used in English, with columns for the corresponding translation in every one of the hundreds of other human languages. The unique identifiers for the stable common things across many languages will likely be the “words” or “entities” of a future language, not character sequences, which are tied more to how humans vocalize and relate than to the 3D volumetric entities in the real world.
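A minimal sketch of such a table as a data structure, assuming a hypothetical “GOTID” identifier format and a tiny sample of languages; the IDs, names and entries are illustrative only, not an existing standard:

```python
from dataclasses import dataclass, field

@dataclass
class TranslationEntry:
    """One stable, globally shared concept with its surface forms per language."""
    global_id: str                              # e.g. "GOTID:0000123" (hypothetical format)
    forms: dict = field(default_factory=dict)   # language code -> character sequence

# A tiny illustrative slice of the table described above.
table = [
    TranslationEntry("GOTID:0000123", {"en": "water", "es": "agua", "zh": "水", "ar": "ماء"}),
    TranslationEntry("GOTID:0000124", {"en": "large language model", "es": "modelo de lenguaje grande", "zh": "大语言模型"}),
]

def find_id(term: str, lang: str):
    """Look up a surface form in any language and return the shared identifier."""
    for entry in table:
        if entry.forms.get(lang) == term:
            return entry.global_id
    return None

print(find_id("agua", "es"))   # GOTID:0000123
print(find_id("water", "en"))  # GOTID:0000123 -- same entity, different character sequence
```

The point is that “agua” and “water” resolve to the same identifier; the character sequences differ by language, the entity does not.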
If you parse the DOM of most websites in English (and other languages), you find that much of the meaning is not actually in the character sequences, but in the layout. What is actually used is a kind of “Internet pidgin,” where people place terms and phrases near each other, rather than the “all letters from the alphabet and a bit of punctuation thrown in” model of human communication we inherited from paper technologies.
An LLM might have a lot of information on character sequences, but all the Internet DOM information has been lost, and all the context (where did that information come from, who wrote it, why did they write it, who are they) gets lost. Most web pages now are fragmented and contain many independent flows of information. Many more pages have hardly any words at all, but pictures, icons, glyphs, banners, animations, videos, and often things that have nothing to do with the words, whatever the human language. Then there are databases and content stores from which new things are served; those are the real content, not the pages themselves.
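To make that concrete, here is a rough sketch using Python’s standard html.parser on a made-up page fragment: flattening the DOM to bare character sequences keeps the “Internet pidgin” fragments but throws away the layout relationships that carried most of the meaning. The page content and class name are invented for illustration.

```python
from html.parser import HTMLParser

# A made-up fragment: the meaning ("Widget costs $4.99, in stock") lives in the
# layout relationships between short pieces of text, not in any sentence.
PAGE = """
<div class="product">
  <span class="name">Widget</span>
  <span class="price">$4.99</span>
  <span class="stock">In stock</span>
</div>
"""

class TextOnly(HTMLParser):
    """Naive extraction: keep only the character data, discard the structure."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

parser = TextOnly()
parser.feed(PAGE)
print(parser.chunks)  # ['Widget', '$4.99', 'In stock'] -- the pidgin fragments survive,
                      # but which number is a price, and of what, was in the DOM we dropped.
```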
There are about 5 Billion humans using the Internet, and only a few million playing with LLMs and AIs. Tens of millions are playing with software, hundreds of millions are playing with “computers”, and Billions are forced to use computers, networks and the Internet just to survive today. But there are 8.1 Billion humans, and they are NOT being taken care of well. Most will have most of their potential wasted, and most will expend huge effort to learn and survive, and ALL of that will be lost.
Look at Twitter (X). It has a lot of people talking about a lot of things, but it is disorganized and inefficient chatter, not focused and purposeful collaboration. I see it almost like a giant cancer blob, not a living intelligent organism in its own right. That is because, like all corporate-run websites, the staff spend most of their time implementing “business rules and conceptions” when their time would be better spent helping the members of the community work together on larger global open goals. It is a fairly large set of things going on, but it is finite and its efficiency can be improved. I know what happened before when many “staff” were allowed or encouraged to get too involved: if they were young or less than perfect, they could be drawn into things that benefited a few. A true AI that monitors, records, identifies, indexes and organizes the information, relative to the external “rest of world”, would help keep it open and stable.

I can see it fairly clearly, but I have worked on this way of looking at things for decades, and that is not getting captured in things like LLM shells (the wrapper programs for LLMs where all the actual “AI” behavior is stored). The whole LLM is a statistical index of entities. When the tokens are arbitrary, the entities are too, and the LLMs using them will never converge to the truth. They will chase local minima and maxima and never get to the big things that matter: “lives with dignity and purpose for all humans and related species”.
Years ago, I took the 50,000 most commonly used 1-grams from the ngram database and classified them manually. It took me 7 days of 12-hour days or more. It was rough, but I wanted to have a sense of what words were used. That is just English, with some borrowed words thrown in. Because of the Internet, the United States now hosts many dozens of human languages, so the USA and many other countries are effectively polylingual. Since people can rent domains and sites in almost any country, the usefulness of “where someone lives” or “where someone was physically born” or “where someone goes to a building to log into the Internet” is lost. Simple classifications like “speaks English” or “lives in Spain” or “works in China” are being lost. NONE of those were homogeneous in the first place, and probably never were. It should be obvious if you use real tokens for making LLMs (do the languages first, then “what people do with the languages” next). A sketch of the first step of that 1-gram exercise follows below.
The individual groups working on LLMs are self-organizing. But because they come from for-profit corporations, with many voices and purposes, they are NOT allowed to cooperate fully.
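For anyone who wants to repeat that exercise, here is a sketch of the first step, assuming a simple local tab-separated frequency file; the filename and column layout are placeholders, not the actual ngram dataset format:

```python
import csv
from collections import Counter

# Assumption: a local tab-separated file with one "word<TAB>count" pair per line.
# The filename and columns are placeholders, not the real ngram export layout.
FREQ_FILE = "english_1grams.tsv"

counts = Counter()
with open(FREQ_FILE, newline="", encoding="utf-8") as f:
    for word, count in csv.reader(f, delimiter="\t"):
        counts[word.lower()] += int(count)   # fold case so "The" and "the" merge

top_50k = [word for word, _ in counts.most_common(50_000)]

# Manual classification is then a second pass over this list: one word per line,
# with a blank column left for the human classifier to fill in a category.
with open("top_50k_for_classification.tsv", "w", encoding="utf-8") as out:
    for word in top_50k:
        out.write(f"{word}\t\n")
```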
%A = (“large language models” OR “LLMs” OR “AIs”)
( %A ) has 285 Million entry points
( %A site:com ) has 143 Million entry points
( %A site:gov ) has 0.42 Million
This last one is ( %A site:cn ), that is, (“large language models” OR “LLMs” OR “AIs”) site:cn
And Google and Microsoft (Bing) are not using “global open translation ids”, so they are not searching all languages.
I hope you see it. A person, in the future, can simply query this in their own language, as I did in English. An entity (perhaps with search engines using mirrors) would check to see if there are global unique translation IDs assigned to “large language models”, “LLMs” and “AIs”. If not, then those can be generated and recorded, for the whole world.
People who do this sort of thing every day with databases will know what I am recommending: simply keep track of exact strings globally, in all languages.
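A minimal sketch of that recommendation as an in-memory registry, reusing the hypothetical GOTID scheme from the earlier sketch; a real registry would be a shared, replicated, openly governed database, and the function names and ID format here are illustrative only:

```python
import hashlib

# Illustrative in-memory registry mapping exact strings (per language) to one global ID.
registry = {}   # (language, exact string) -> global translation ID

def get_or_assign_id(lang: str, exact_string: str) -> str:
    """Return the global translation ID for this exact string, creating one if needed."""
    key = (lang, exact_string)
    if key not in registry:
        # Hypothetical ID scheme: a short stable hash of the first recorded form.
        digest = hashlib.sha256(f"{lang}:{exact_string}".encode("utf-8")).hexdigest()[:12]
        registry[key] = f"GOTID:{digest}"
    return registry[key]

def link_translation(global_id: str, lang: str, exact_string: str) -> None:
    """Record that this string in another language refers to the same global ID."""
    registry[(lang, exact_string)] = global_id

# Usage: the English term gets an ID once, for the whole world...
llm_id = get_or_assign_id("en", "large language models")
# ...and the Spanish and Chinese forms are linked to that same ID, not given new ones.
link_translation(llm_id, "es", "grandes modelos de lenguaje")
link_translation(llm_id, "zh", "大语言模型")

print(get_or_assign_id("es", "grandes modelos de lenguaje") == llm_id)  # True
```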
Now, the point of using the global open translation ids when building the correlations of the LLMs is to keep the tokens real and traceable, not arbitrary, locally derived, and essentially inaccessible.
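To show the contrast, here is a sketch, under the same assumptions as above, of tokenizing a sentence into stable global IDs instead of arbitrary subword fragments; the greedy longest-match and the tiny vocabulary are illustrative, not a proposal for a specific tokenizer:

```python
# A tiny illustrative vocabulary of term -> global ID (same hypothetical GOTID scheme).
VOCAB = {
    "large language models": "GOTID:a1",
    "llms": "GOTID:a1",          # surface variant of the same entity
    "human languages": "GOTID:b2",
    "internet": "GOTID:c3",
}

def tokenize_to_ids(text: str) -> list:
    """Greedy longest-match over the vocabulary; unknown spans are flagged, not invented."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for length in range(len(words) - i, 0, -1):   # try the longest phrase first
            phrase = " ".join(words[i:i + length])
            if phrase in VOCAB:
                tokens.append(VOCAB[phrase])
                i += length
                break
        else:
            tokens.append(f"UNKNOWN:{words[i]}")      # candidate for a new global ID
            i += 1
    return tokens

print(tokenize_to_ids("Large language models should serve human languages on the Internet"))
# ['GOTID:a1', 'UNKNOWN:should', 'UNKNOWN:serve', 'GOTID:b2', 'UNKNOWN:on', 'UNKNOWN:the', 'GOTID:c3']
```

Every token in that output either points at a registered real-world entity or is explicitly marked as something the registry has not yet recorded.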
I am going to stop here. I think I might have a cold or flu. I am fairly certain I got this right. For Y2K, I had to think in terms of “all computers and components” and “all algorithms in all systems, all algorithms in all software.” For the USAID Economic and Social Database, I had to think of all data for all countries and all industries in the world, now and 50 years into the future. For the Famine Early Warning System, I had to think of “all data for monitoring food, agriculture, storms, weather, climate, human populations, economies, labor force, conflicts, resources, migration, etc. etc. etc.” For alternative fuels, I had to think of all ways to create, store, use, conserve, monitor and control energy, including ways to not move people and materials when information is all that needs to be sent. And that can be minimal if the information of the whole heliosphere can be stored in a few memory thingies.
Filed under ( Focus LLM groups on “all human languages”, then index “all knowledge” )
The “index” is both “statistical indexes” like LLMs and “lossless indexes” like regular databases and such. The world already has many intelligent machines; they are maintained by groups of dedicated humans and used and improved by billions of humans. The LLMs are just databases and software. The LLM shells are just software, handwritten and maintained by humans. The shells retrieve data from the LLM statistical database.
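A small sketch of that distinction with made-up documents: the lossless index can always point back to exactly where a term occurred, while the statistical index keeps only aggregate counts and loses the provenance.

```python
from collections import Counter, defaultdict
from itertools import combinations

docs = {
    "doc1": "llms are statistical indexes of text",
    "doc2": "databases are lossless indexes of records",
}

# Lossless index: every term maps back to exactly which documents contain it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

# Statistical index: only aggregate co-occurrence counts survive; provenance is gone.
cooccur = Counter()
for text in docs.values():
    for a, b in combinations(sorted(set(text.split())), 2):
        cooccur[(a, b)] += 1

print(inverted["indexes"])          # both doc1 and doc2 -- traceable to the source documents
print(cooccur[("indexes", "of")])   # 2 -- a count, with no record of where it came from
```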
We just need to make all machines accessible to all humans. Humans use many human languages, and have concerns and futures the LLM people barely understand. So for the heliospheric Internet, first make sure all languages are handled well and efficiently. Let me point out the essentials. It should save a few decades.
Richard Collins, The Internet Foundation