Retrieving or finding from the Internet is seldom “fair” in a statistical sense.
@alpha_alimamy Alpha Almany Kamara. Is this your paper? You need a website.
https://wwjmrd.com/archive/2022/5/1804/heart-disease-prediction-support-system-using-machine-learning-approaches
Naive LLMs cannot distill medical wisdom from material posted on the free Internet, no matter how efficient the algorithms. If they have only partial or wrong information, they cannot make good decisions. When hundreds of millions or billions of humans are affected and try to share their voices, unfair sampling and partial information can distort and corrupt the results.
In your datasets, the data was gathered and organized to be an efficient predictor of heart disease. Retrieving or finding material on the Internet is seldom “fair” in a statistical sense. Google biases its results and will not simply give verifiably random samples. HuggingFace and Common Crawl do not seem to concern themselves with the issue, but they are a bit chaotic.
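A rough sketch of what unfair sampling does to an estimate, in Python. Everything here is a made-up illustration, not data from the paper: the population size, the 10% prevalence, the 5x retrieval bias, and the helper name `estimate_prevalence` are all assumptions chosen only to show the effect.

```python
import random

def estimate_prevalence(population, sample_size, weights=None, seed=0):
    """Estimate the fraction of positive cases from a sample (hypothetical helper)."""
    rng = random.Random(seed)
    if weights is None:
        # Fair case: uniform random sample without replacement
        sample = rng.sample(population, sample_size)
    else:
        # Unfair case: some records are more likely to be retrieved than others
        sample = rng.choices(population, weights=weights, k=sample_size)
    return sum(sample) / sample_size

# Assumed population: 100,000 people, 10% with heart disease (1 = positive)
population = [1] * 10_000 + [0] * 90_000

# Assumed bias: positive cases are 5 times more likely to be retrieved
biased_weights = [5 if x == 1 else 1 for x in population]

fair = estimate_prevalence(population, 5_000)
biased = estimate_prevalence(population, 5_000, weights=biased_weights)

print(f"true prevalence:   {sum(population) / len(population):.3f}")
print(f"fair sample:       {fair:.3f}")
print(f"biased sample:     {biased:.3f}")
```

Under these assumptions the fair sample should land near the true 10%, while the biased sample should land near 36% (50,000 / 140,000 of the retrieval weight), and no amount of algorithmic efficiency downstream can correct that unless the bias itself is known.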
Richard Collins, The Internet Foundation