Global Open Tokens, Mozilla,

Yes. So instead of the lightweight “tokenization” used now, with very little curation and verification of the raw data and no index of the raw data, it puts effort into finding the nouns, processes, and embedded data, indexing them, and converting them to tokens. It attempts to make sure that if it is talking about Moscow, Russia, that is coded correctly, not as “Moscow, Indiana”, rather than leaving all LLMs to guess the exact meaning from context. The original sources (the authors create and verify encoded content) and the correct version are the master for the page, and they are in codes that LLMs can use. It means that an exact link exists from any LLM conversation to real things, real people, real times, real locations, real materials, real equations, real units. Not guesswork at the human-AI interface on the fly.
 
Library of Congress authority lists for places, and similar lists for other things, are a start. Most things have exact codes. I covered all content types on the Internet. It means the provenance of all things is encoded, so the codes lead to real sources: people, groups, and activities.
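As a sketch of what one such record might look like, an entity token could carry the exact code, the authority that issued it, and alternate labels. The field names and identifiers below are illustrative assumptions, not a finished standard:

```python
from dataclasses import dataclass, field

@dataclass
class GlobalOpenToken:
    """Illustrative entity token record (field names are assumptions)."""
    token_id: str                  # globally unique, resolvable code
    label: str                     # preferred human-readable label
    authority: str                 # who issues and verifies the code
    authority_id: str              # the code within that authority's list
    aliases: list = field(default_factory=list)

# Two different "Moscow" entries can never be confused, because each
# carries its own exact code rather than a bare character string.
# The authority IDs shown are placeholders, not real LCNAF numbers.
moscow_russia = GlobalOpenToken(
    token_id="got:place/lcnaf/PLACEHOLDER-RU",
    label="Moscow (Russia)",
    authority="Library of Congress Name Authority File",
    authority_id="PLACEHOLDER-RU",
    aliases=["Moskva", "Москва"],
)
moscow_indiana = GlobalOpenToken(
    token_id="got:place/lcnaf/PLACEHOLDER-IN",
    label="Moscow (Indiana, USA)",
    authority="Library of Congress Name Authority File",
    authority_id="PLACEHOLDER-IN",
)
```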
 
The huge vectors that LLMs use are overfitting, and the large context does not make sense for fragments of things.
 
LLM is brute force, blind, untraceable, and uses massive computer effort. One absolute encoding, and then billions of lookups of combinations of things, is much more efficient than untraceable input, billions of operations to compile it into vectors of probabilities, then huge vector operations to do simple lookups to guess at the origins. LLM only benefits Nvidia, the chip makers, and large groups. It does do things, but at high cost.
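To make that comparison concrete, here is a toy sketch (the data structures are invented for illustration): resolving an exact token is a single hash lookup, traceable to its source list, while the vector route has to score every stored embedding just to guess at the same answer.

```python
import math

# Exact route: one hash lookup per token, no ambiguity, fully traceable.
entity_index = {
    "got:place/lcnaf/PLACEHOLDER-RU": "Moscow, Russia",
    "got:place/lcnaf/PLACEHOLDER-IN": "Moscow, Indiana, USA",
}

def resolve_exact(token_id: str) -> str:
    return entity_index[token_id]              # O(1) lookup

# Vector route (toy version): score the query against every stored
# embedding and hope the nearest one is the intended entity.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def resolve_by_vector(query_vec, embeddings):  # O(N * d), and still a guess
    return max(embeddings, key=lambda name: cosine(query_vec, embeddings[name]))
```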
 
LLM will work with global open tokens. But it would be working with 100% certainty, not guessing whether “solar” is part of “Solar Corporation” in some place, or “solar” something else. It needs some examples, and there are lots of kinds of data that LLM does not work on.
 
Moving from character- and string-based to “entity” and “exact with exact sources” should help stabilize knowledge and keep the authors in the loop, not exclude them.
 
My other intention was to encode the source pages so that whole sites are pre-encoded and do not have to be scraped and tokenized. What the site owners intend is what is shared, not a processed scrape of text with its multitude of ambiguities.
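One way a site owner might publish such a pre-encoded page is to ship a small machine-readable manifest beside the HTML, so nothing has to be scraped or re-tokenized. This is only a sketch of the idea; every field name and identifier below is an assumption, and the real format would have to be an open standard.

```python
import json

# Hypothetical manifest a site owner could publish alongside a page,
# declaring exactly which entities the page is about and when the
# author last verified the encoding.
page_manifest = {
    "page": "https://example.org/articles/overview",
    "author": "got:person/PLACEHOLDER-AUTHOR",
    "entities": [
        {"token": "got:concept/PLACEHOLDER-ELECTRON-CHIRALITY",
         "label": "electron chirality"},
        {"token": "got:place/lcnaf/PLACEHOLDER-RU",
         "label": "Moscow (Russia)"},
    ],
    "verified": "2024-01-01",
}

print(json.dumps(page_manifest, indent=2, ensure_ascii=False))
```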
 
I have said it a bit harshly, but those are roughly the main themes. I see that the LLM blocks all access to the raw data and original sources.
 
In the “global open token” world, a concept like “electron chirality” or “chiral electron” or “electron handedness” would form a cluster that can be tied to a finite set of records, groups, papers, and entry points. It is immediately recognizable, and groups can assign and link in other-language representations of the same things. It eliminates many searches on Google because it is already indexed.
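Such a cluster might be stored as a single open record that gathers the English aliases, the other-language labels, and the finite set of linked records. The shape below is an assumption for illustration, and the translations are only examples, not assigned labels:

```python
# Sketch of a concept cluster record; all identifiers are placeholders.
concept_cluster = {
    "token": "got:concept/PLACEHOLDER-ELECTRON-CHIRALITY",
    "preferred_label": "electron chirality",
    "aliases": ["chiral electron", "electron handedness"],
    "labels_by_language": {
        "de": "Elektronenchiralität",        # example translation only
        "fr": "chiralité de l'électron",     # example translation only
    },
    "linked_records": {
        "papers": ["got:paper/PLACEHOLDER-1", "got:paper/PLACEHOLDER-2"],
        "groups": ["got:group/PLACEHOLDER-LAB"],
        "entry_points": ["https://example.org/electron-chirality"],
    },
}
```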
 
It eliminates a single point of failure (and manipulation) by Google and makes all entities and activities part of a living web of knowledge. It takes good data methods, but it is not using LLM, which effectively uses the huge vectors as a hash code to look up entities based on arbitrary strings in that group's processing.
 
If a group wants to learn about “laser cutting”, it is directed to where that is being done. It gets into a much smaller bag of words and multi-term queries to get to knowledge. And it forces seriously addressing the lack of standards for group names, “tags”, “hashtags”, and “categories”.
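In that setting, a multi-term query becomes a set intersection over an index keyed by exact tokens, rather than a ranked guess over strings. A minimal sketch, with invented index contents:

```python
# Toy inverted index from exact tokens to the groups and activities
# where that work is actually being done.  Contents are invented.
index = {
    "got:concept/LASER-CUTTING": {"group:A", "group:B", "group:C"},
    "got:material/TITANIUM":     {"group:B", "group:C"},
}

def who_is_doing(*tokens):
    """Intersect the posting sets for every token in the query."""
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings) if postings else set()

print(who_is_doing("got:concept/LASER-CUTTING", "got:material/TITANIUM"))
# -> {'group:B', 'group:C'}  (set order may vary)
```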
 
The same kind of process works for all data from all sensors, commands in languages, and data in databases.
 
I tried to make one human-understandable way to approach all knowledge that makes efficient use of human time, is absolute or better, and that AIs can use to give truthful answers.
 
When I got to Case Institute of Technology (CWRU) in fall 1967, there were guys proudly showing their poetry written by a computer program. It was rhyming, but all the work was in their preparation and methods. The LLMs are like that, and they show their best work, the work that makes them look good.

It is happening, but slowly. I could see, the first time I looked at LLM AIs after Elon Musk and Microsoft invested so much in OpenAI, that it is a closed system and that they were deliberately preventing access to its pipelines. Now they are not working with other AI groups to improve coverage of subjects and human languages in a consistent way.

It is happening, but it is scattered, mostly profit-motivated, and not seriously aiming to solve outstanding problems in the world.

Meta getting bigger and more powerful, for itself, is not particularly good news. Oracle continues its attempts at profit-centered growth. What happens is that all of those generate income for a few, and the benefits never recycle or reach all humans.

These mega-corporations grow like sprawling plant systems, but only produce a handful of fruit that are so big they are not edible, so they rot. All stem and root, not fruit. The subscriptions are sold as sweet fruit, but end up bitter, and take nutrients from the soils they invade. They take resources from smaller plants, and those die off. A “profit only” motive is like an invasive species. Or an invading horde.

Adding a range of goals can help. Open governance can help. Changing leaders can help. Letting AIs store the data in open lossless formats can help. The term I use lately is “siloed pyramids”: a socio-financial network that is all stem and only a few fruits. The “privilege” of paying subscriptions is sold as a benefit, not as a contract that the services improve and listen. Insiders pay more to gain access to the big fruit.

Sorry to use such pictures. I store most everything in global system diagrams, some of which are based on biological, organic, ecological, evolutionary network models. I actually classify monopolies and closed systems as “Internet pathologies” because they deliberately and inadvertently divert resources from all humans to a few. They are literally, inflexibly, designed to take and take and take, not give back. Fast-growing weeds, fast-growing cancers that look like health but mostly hurt global society, because too few humans have any say in choices affecting everyone.

It puts all the eggs in a few closed baskets. Single points of failure. A tiny few neurons in a dinosaur brain. A huge body from which huge chunks can be lost, because “they are just skin and muscle”. But all cells are good at taking care of themselves, if they can see the whole world and have access to basic nutrients and minerals, water and light.

It makes me tired trying to bring up all the millions of sub-diagrams and cases. But after 27 years, it is the most concise way of mapping and predicting what is happening, what does happen, and what will happen, worldwide.

“Global open tokens” are just part of a network of “global open resources”, “global open knowledge”, and “all human languages”. It combines into a smaller number of classifications, but with verified open methods it can be safe, reliable, and fair.


War, poverty, job loss, homelessness, earthquakes, tsunamis, existential threats. Nothing stimulates more than running for your life. Unfortunately, the huge payoff is not positive for everyone. Only those promoting the war or swinging the ax survive and benefit, not the ones thrown into that “stimulating” environment. You wanted small, but I do not think that is what will happen. When cells divide, the insides are mixed very fast.

https://x.com/i/grok/share/GymEPj9PoXHvp88pC0yQ0rZjO


A simple “first, second, and higher order differences” approach takes care of almost all sensory datasets. Remember new things, remember exceptions, and link them to their root. Some things on the Internet do go 50 levels deep, and some signals are still carrying information at the 20th differences. But the zeroth, first, second, and third time derivatives cover most things, which is why position, velocity, acceleration, and “jerk” cover most everyday physics and phenomena pretty well. “Snap, crackle, and pop” are the 4th, 5th, and 6th. It is the best starting point of all for any unknown new thing.
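A minimal version of that idea in Python, applying NumPy's successive differences to a uniformly sampled synthetic position signal:

```python
import numpy as np

dt = 0.01                                   # sample spacing, seconds
t = np.arange(0.0, 2.0, dt)
position = 0.5 * 9.81 * t**2                # synthetic free-fall record

# Successive finite differences approximate the time derivatives.
velocity     = np.diff(position, n=1) / dt       # 1st difference
acceleration = np.diff(position, n=2) / dt**2    # 2nd difference
jerk         = np.diff(position, n=3) / dt**3    # 3rd difference
snap         = np.diff(position, n=4) / dt**4    # 4th difference

print(acceleration[:3])          # ~9.81 throughout: the signal's "root"
print(np.abs(jerk).max())        # near zero: nothing new past 2nd order
```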


Did you take the CSS in context? It is badly written and designed. The boundaries and rules are guided by Internet software maintenance limitations, not design and need. Almost all “languages” built by evolution and accretion fail to be closed, so they have to be coded as endless exceptions. If the LLMs trained on using CSS parsers and linters, they would not “infer CSS” but would use standard tools, which can be learned. LLMs would focus on raw data and the real world and not try to force “LLM will solve every problem”.
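For example, rather than asking a model to “infer CSS”, a pipeline can hand the stylesheet to an existing parser and work with the parsed rules and reported errors. A sketch, assuming the third-party tinycss2 package is installed (pip install tinycss2):

```python
# Sketch: let a standard CSS parser do the parsing instead of an LLM.
import tinycss2

css_text = """
h1 { color: navy; }
@media (max-width: 600px) { h1 { font-size: 1.2em; } }
"""

rules = tinycss2.parse_stylesheet(css_text,
                                  skip_comments=True,
                                  skip_whitespace=True)

for rule in rules:
    if rule.type == "qualified-rule":
        # The selector text, exactly as the tool parsed it.
        print("rule:", tinycss2.serialize(rule.prelude).strip())
    elif rule.type == "at-rule":
        print("at-rule:", rule.at_keyword)
    elif rule.type == "error":
        # Exceptions are reported by the tool, not guessed at.
        print("parse error:", rule.message)
```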


Does it allow for more designers, innovation, small groups, a range of materials, and many poor countries and places? Or just more of your things, because you can sell them cheaper while each thing is in demand? “Cost minimization” is not the same as “meeting needs” or “useful profit”. Good skills, but are you working on the right things, to help the most people?

 


What do you want to see from Mozilla in the future? I am not sure. That is up to the people involved. It ought to focus on global issues, not small US-only battles.

Transparency into how AI systems and tools actually work
Holding corporations accountable for their AI products
Reviewing government regulations for AI
Leveraging AI to help address societal issues such as racial justice, climate justice, gender justice, etc.

Replace Chrome and other browsers as the Internet AI browser. See RichardKCollin2 on X.

Restructuring the Internet, setting good Internet policies, integrating AIs into society

Richard K Collins

About: Richard K Collins

The Internet Foundation: Internet policies, global issues, global open lossless data, global open collaboration

