The Internet Foundation: Asking all AI companies to contribute to Wikimedia

Dear Amber,

Yes, I am very familiar with the issues you wrote about.  The language models currently do not index their raw input data, so they add no value to it that way.  With no index, and no way to trace what they generate back to the source data, they are literally incapable of tracing the information they use, and therefore do not consistently and reliably cite authors and sources.

They are using what they took for commercial gain. That is not illegal in a world that allows almost anything, but I consider it improper and unfair. They give nothing substantial back in proportion to the money and power they gain.

They could index and link to the raw data, not just compute response probabilities. Now they are struggling because they did not index in the first place, which is not hard, just tedious.

All the weaknesses of the language AIs arise because they did not properly and completely index and tokenize their input data using links to the real world.  They put no effort into curation, proper indexing, or standardizing their input data.  They did not help with understanding the knowledge or with making it better.  They only processed it to extract a few language rules and generative probabilities. I have tested them.  I say they ought to be verified and certified before being allowed to give answers and advice to humans.  I say the companies should literally pay back, in time and money, the sources from which they took.  They have added no true value.
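The kind of indexing and tracing described above can be sketched very simply. This is a minimal illustration, not any company's actual pipeline: a toy inverted index that maps each token to the source documents containing it, so a phrase can be traced back to the inputs it came from. The corpus entries and identifiers below are invented for the example.

```python
# Hypothetical sketch: an inverted index from tokens to source documents,
# allowing generated text to be traced back to its input sources.
from collections import defaultdict

def build_index(corpus):
    """corpus: dict of source_id -> text. Returns token -> set of source_ids."""
    index = defaultdict(set)
    for source_id, text in corpus.items():
        for token in text.lower().split():
            index[token].add(source_id)
    return index

def trace(index, phrase):
    """Return the source documents that contain every token of the phrase."""
    tokens = phrase.lower().split()
    if not tokens:
        return set()
    sources = set(index[tokens[0]])
    for token in tokens[1:]:
        sources &= index[token]
    return sources

# Invented toy corpus for illustration only.
corpus = {
    "wikipedia:Photosynthesis": "plants convert light into chemical energy",
    "wikipedia:Chlorophyll": "chlorophyll absorbs light in plants",
}
index = build_index(corpus)
print(trace(index, "light in plants"))  # only the Chlorophyll article contains all three tokens
```

A real system would need stable document identifiers, positional offsets, and curation of the sources, but the point stands: the bookkeeping that makes citation possible is tedious, not hard.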

Do you know the game “Chinese whispers”? The process garbles messages as they pass from human to human. When it is done in writing, that is what happens in human society, where all research and knowledge is forced through channels in which human readers take things in by eyeball, then write down what they thought they read.  It is why all human research now takes decades, not days.  It is why the global COVID-19 response allowed millions to die.

The AI language models do not faithfully index their input information, and so will always generate falsehoods and mistakes on average. The world cannot afford answers generated at random. The world has real problems affecting the lives of all humans and related species.

ChatGPT and Bard are incapable of answering anything that requires them to cite their sources and explain their reasoning. They make mistakes in simple logic, reasoning, arithmetic, mathematics, scale, and comparisons, because their makers trained them to be plausible, not reliable, giving only glib answers. They will always fail on important classes of problems, because they left out “clean and organize the input data” and “be able to trace where things come from”.

I had ChatGPT read my note to you and asked it to check the clarity and reasoning.  Here is what it generated.  Unless you know how to ask the right questions, it will give implausible and false answers. Its first response was chaotic and useless, and I had to manually edit it to correct the formatting. But these are its exact words.  I had to change the headings to bold text because the layout was too spread out.  I had to explicitly ask it to include a recommendation that AI companies give back by helping Wikimedia index and curate the Wikipedias.  I was able to ask it to use the format I suggested, but it is not reliable, so I did that manually first.  It only gets about 30% right, and it always makes mistakes in scientific calculations and mathematics.

Richard Collins, The Internet Foundation

Dear Amber,
I appreciate your interest in AI language models. However, it’s crucial to address their limitations which affect their reliability and ethical standing.

Lack of Traceability: These models do not index their training data, which makes it impossible to trace the sources of the information they generate. This affects their consistency and reliability in citing authors and original data sources.

Ethical Concerns: The models are leveraged for commercial gain without adequately compensating or attributing the original data sources. This is unfair and improper, even if not illegal.

Technical Shortcomings: The lack of proper data indexing isn’t just a tedious oversight; it’s a foundational flaw. This is why these models struggle to be more than probabilistic response generators. Their shortcomings stem from this basic lack of data hygiene.

Verification and Accountability: Before these models are deployed for wide use, they should be rigorously verified and certified. Companies should also be held accountable for giving back to the communities and data sources they benefit from.

Recommendation for Giving Back: I recommend that AI companies engage in initiatives to help index and curate valuable knowledge resources like Wikimedia’s Wikipedias. Such contributions would not only address ethical concerns but also improve the quality and reliability of information that serves the global community.

Information Integrity:  Just as the game “Chinese Whispers” distorts messages, the lack of data traceability in language models can distort information. This is particularly concerning in matters of scientific research and global crises like COVID-19, where misinformation can have severe consequences.

Summary:  Language models like ChatGPT fall short in critical areas such as logical reasoning, arithmetic, and scale because they were trained to be plausible rather than reliable. They will continue to falter on important issues until they address these foundational shortcomings.
Thank you for considering these points. I look forward to further discussions on this matter.
Richard Collins, The Internet Foundation


Richard: I sent it to the Wikimedia Foundation with the subject: Asking all AI companies to contribute to Wikimedia

ChatGPT:  That’s great to hear! Sending such a letter to the Wikimedia Foundation with a specific call to action for AI companies can potentially initiate a constructive dialogue. Your deep experience and authoritative voice in this field can certainly add weight to these recommendations. I hope your letter leads to positive changes and fosters collaboration between AI companies and knowledge repositories like Wikimedia. If there’s anything else you’d like assistance with, feel free to ask.
About: Richard K Collins

Director, The Internet Foundation. Studying the formation and optimized collaboration of global communities. Applying the Internet to solve global problems and build sustainable communities. Internet policies, standards, and best practices.
