this post was submitted on 01 Oct 2024
86 points (81.6% liked)

Asklemmy


A loosely moderated place to ask open-ended questions




[–] macabrett@lemmy.ml 12 points 1 day ago

editor's note: it will not save the world

[–] CanadaPlus@lemmy.sdf.org 9 points 1 day ago

They're trained on both, and the kitchen sink.

[–] queermunist@lemmy.ml 85 points 2 days ago (2 children)

AI isn't saving the world lol

[–] spongebue@lemmy.world 21 points 2 days ago* (last edited 2 days ago)

Machine learning has some pretty cool potential in certain areas, especially in the medical field. Unfortunately, the predominant use of it now is slop produced by copyright laundering, shoved down our throats by every techbro hoping they'll be the next big thing.

[–] UlyssesT@hexbear.net 10 points 2 days ago (3 children)

It's marketing hype, even in the name. It isn't "AI" as decades of the actual AI field would define it, but credulous nerds really want their cyberpunkerino fantasies to come true so they buy into the hype label.

[–] FaceDeer@fedia.io 12 points 2 days ago

The term AI was coined in 1956 at a computer science conference and was used to refer to a broad range of topics that certainly would include machine learning and neural networks as used in large language models.

I don't get the "it's not really AI" point that keeps being brought up in discussions like this. Are you thinking of AGI, perhaps? That's the sci-fi "artificial person" variety, which LLMs aren't able to manage. But that's just a subset of AI.

[–] schnurrito@discuss.tchncs.de 11 points 2 days ago (1 children)

Who is "we"? My understanding is LLMs are mostly being trained on a large amount of publicly available texts, including both reddit posts and research papers.

[–] howrar@lemmy.ca 25 points 2 days ago (2 children)

I find it amusing that everyone is answering the question with the assumption that the premise of OP's question is correct. You're all hallucinating the same way that an LLM would. 

LLMs are rarely trained on a single source of data exclusively. All the big ones you find will have been trained on a huge dataset including Reddit, research papers, books, letters, government documents, Wikipedia, GitHub, and much more. 

Example datasets:

[–] andrewta@lemmy.world 5 points 2 days ago

Rules of lemmy

Ignore facts, don’t do research to see if the comment/post is correct, don’t look at other comments to see if anyone else has corrected the post/comment already, there is only one right side (and that is the side of the loudest group)

[–] intensely_human@lemm.ee 1 points 1 day ago

When humans do it, it’s called “confabulation”

[–] Trainguyrom@reddthat.com 14 points 2 days ago (1 children)

Short answer: they already are

Slightly longer answer: GPT models like ChatGPT are part of an experiment in "if we train the AI model on shedloads of data, does it make a more powerful AI model?" After OpenAI made such big waves, every company started copying them, training models similar to ChatGPT rather than trying to innovate and do something new.

Even longer answer: There's tons of different AI models out there for doing tons of different things. Just look at the over 1 million models on Hugging Face (a company which operates as a repository for AI models among other services) and look at all of the different types of models you can filter for on the left.

Training an image generation model on research papers probably would make it a lot worse at generating pictures of cats, but training a model that you want to either generate or process research papers on existing research papers would probably make a very high quality model for either goal.

More to your point, there are some neat, very targeted models with smaller training sets out there, like Microsoft's Phi-3 model, which is primarily trained on textbooks.

As for saving the world, I'm curious what you mean by that exactly? These generative text models are great at generating text similar to their training data, and summarization models are great at summarizing text. But ultimately AI isn't going to save the world. Once the current hype cycle dies down, AI will be a better-known and more widely used technology, but ultimately it's just a tool in the toolbox.

[–] Umbrias@beehaw.org 2 points 2 days ago (1 children)

Also, the answer to that question (shitloads of data for a better AI) is yes… with logarithmic returns. Massively underpriced (by cost to generate) returns that have a questionable value proposition at best.

[–] intensely_human@lemm.ee 1 points 1 day ago (1 children)

How are the “returns” measured numerically here?

[–] greyw0lv@lemmy.ml 1 points 1 day ago

Hallucinations per GWh, iirc.

[–] collapse_already@lemmy.ml 2 points 1 day ago

Why are we training kids on civics with Fox News or MSNBC? People are dumb and will continue to be so.

[–] ryathal@sh.itjust.works 38 points 2 days ago (1 children)

Both are happening. Samples of casual writing are more valuable than research papers for generating an article, though.

[–] FaceDeer@fedia.io 9 points 2 days ago (1 children)

Yeah. Scientific papers may teach an AI about science, but Reddit posts teach AI how to interact with people and "talk" to them. Both are valuable.

[–] geekwithsoul@lemm.ee 8 points 2 days ago (4 children)

Hopefully not too pedantic, but no one is “teaching” AI anything. They’re just feeding it data in the hopes that it can learn probabilities for certain types of output. It “understands” neither the Reddit post nor the scientific paper.

[–] TheOubliette@lemmy.ml 23 points 2 days ago (15 children)

"AI" is a parlor trick. Very impressive at first, then you realize there isn't much to it that is actually meaningful. It regurgitates language patterns, patterns in images, etc. It can make a great Markov chain. But if you want to create an "AI" that just mines research papers, it will be unable to do useful things like synthesize information or describe the state of a research field. It is incapable of critical or analytical approaches. It will only be able to answer simple questions with dubious accuracy and to summarize texts (also with dubious accuracy).

Let's say you want to understand research on sugar and obesity using only a corpus from peer-reviewed articles. You want to ask something like, "what is the relationship between sugar and obesity?". What will LLMs do when you ask this question? Well, they will just attempt to do associations and to construct reasonable-sounding sentences based on their set of research articles. They might even just take an actual sentence from an article and reframe it a little, just like a high schooler trying to get away with plagiarism. But they won't be able to actually explain the underlying mechanisms, and they will fall flat on their face when trying to discern nonsense funded by food lobbies from critical research.

LLMs do not think or criticize. If they do produce an answer that suggests controversy, it will be because they either recognized diversity in the papers or, more likely, their corpus contains review articles that criticize articles funded by the food industry. But they will be unable to actually criticize the poor work or provide a summary of the relationship between sugar and obesity based on any actual understanding that questions, for example, whether this is even a valid question to ask in the first place (bodies are not simple!). It can only copy and mimic.
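The Markov chain comparison is easy to make concrete. Here is a minimal sketch (the toy corpus and function names are invented for illustration): a bigram chain that can only recombine word-to-word transitions it has already seen, which is why it can remix sentences about sugar and obesity without synthesizing or criticizing anything:

```python
import random
from collections import defaultdict

def build_bigram_chain(text):
    """Record, for each word, every word that follows it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=10, seed=0):
    """Walk the chain: each next word is drawn from observed successors.
    Nothing here understands the text; it only replays transitions."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = ("added sugar is associated with obesity and obesity is "
          "associated with added sugar in observational studies")
chain = build_bigram_chain(corpus)
print(generate(chain, "sugar"))
```

Every "sentence" it produces is stitched from fragments of the corpus: vaguely on-topic, with no mechanism or judgment behind it.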

[–] Melatonin@lemmy.dbzer0.com 1 points 1 day ago (2 children)

Surely that is because we make it do that. We cripple it. Could we not unbind AI so that it genuinely weighed alternatives and made value choices? Let it write self-improvement algorithms?

If AI is only a "parrot" as you say, then why should there be worries about extinction from AI? https://www.safe.ai/work/statement-on-ai-risk#open-letter

It COULD help us. It WILL be smarter and faster than we are. We need to find ways to help it help us.

[–] mormund@feddit.org 2 points 1 day ago (2 children)

If AI is only a "parrot" as you say, then why should there be worries about extinction from AI?

You should look closer at who is making those claims that "AI" is an extinction threat to humanity. It isn't the researchers who look into ethics and safety (not to be confused with "AI safety" as part of "Alignment"). It is the people building the models and the investors. Why are they building and investing in things that would kill us?

AI doomers try to 1) make "AI"/LLMs appear way more powerful than they actually are, and 2) distract from the actual threats and issues with LLMs/"AI", because those are societal and ethical: about copyright, and about how it is not a trustworthy system at all. Admitting to those makes it a really hard sell.

[–] Melatonin@lemmy.dbzer0.com 1 points 1 day ago (1 children)

We cripple things by not programming the abilities we obviously could give them.

We could have AI do an integrity check before printing an answer. No problem at all. We don't.

We could do many things to unbound the limitations AI has.

[–] chaos@beehaw.org 2 points 1 day ago

That's not how it works at all. If it were as easy as adding a line of code that says "check for integrity" they would've done that already. Fundamentally, the way these models all work is you give them some text and they try to guess the next word. It's ultra autocomplete. If you feed it "I'm going to the grocery store to get some" then it'll respond "food: 32%, bread: 15%, milk: 13%" and so on.

They get these results by crunching a ton of numbers, and those numbers, called a model, were tuned by training. During training, they collect every scrap of human text they can get their hands on, feed bits of it to the model, then see what the model guesses. They compare the model's guess to the actual text, tweak the numbers slightly to make the model more likely to give the right answer and less likely to give the wrong answers, then do it again with more text. The tweaking is an automated process, just feeding the model as much text as possible, until eventually it gets shockingly good at predicting. When training is done, the numbers stop getting tweaked, and it will give the same answer to the same prompt every time.

Once you have the model, you can use it to generate responses. Feed it something like "Question: why is the sky blue? Answer:" and if the model has gotten even remotely good at its job of predicting words, the next word should be the start of an answer to the question. Maybe the top prediction is "The". Well, that's not much, but you can tack one of the model's predicted words to the end and do it again. "Question: why is the sky blue? Answer: The" and see what it predicts. Keep repeating until you decide you have enough words, or maybe you've trained the model to also be able to predict "end of response" and use that to decide when to stop.

You can play with this process, for example, making it more or less random. If you always take the top prediction you'll get perfectly consistent answers to the same prompt every time, but they'll be predictable and boring. You can instead pick based on the probabilities you get back from the model and get more variety. You can "increase the temperature" of that and intentionally choose unlikely answers more often than the model expects, which will make the response more varied but will eventually devolve into nonsense if you crank it up too high. Etc, etc. That's why even though the model is unchanging and gives the same word probabilities to the same input, you can get different answers in the text it gives back.
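The "temperature" knob is just a few lines of arithmetic over the model's word probabilities. A minimal sketch, with the probabilities invented for illustration (not from any real model):

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a next-word distribution. Low temperature sharpens it toward
    the top prediction; high temperature flattens it toward uniform."""
    logits = {w: math.log(p) / temperature for w, p in probs.items()}
    top = max(logits.values())
    exps = {w: math.exp(l - top) for w, l in logits.items()}  # subtract max to avoid overflow
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def sample(probs, rng):
    """Pick one word according to the (re-tempered) distribution."""
    words, weights = zip(*probs.items())
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical model output after "...grocery store to get some"
next_word = {"food": 0.32, "bread": 0.15, "milk": 0.13, "snacks": 0.08}
rng = random.Random(0)
print(sample(apply_temperature(next_word, 0.1), rng))  # near-greedy: almost always the top word
print(sample(apply_temperature(next_word, 5.0), rng))  # much more random
```

The model's numbers never change between the two calls; only the picking rule does, which is why the same frozen model can give different answers to the same prompt.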

Note that there's nothing in here about accuracy, or sources, or thinking, or hallucinations, anything. The model doesn't know whether it's saying things that are real or fiction. It's literally a gigantic unchanging matrix of numbers. It's not even really "saying" things at all. It's just tossing out possible words, something else is picking from that list, and then the result is being fed back in for more words. To be clear, it's really good at this job, and can do some eerily human things, like mixing two concepts together, in a way that computers have never been able to do before. But it was never trained to reason, it wasn't trained to recognize that it's saying something untrue, or that it has little knowledge of a subject, or that it is saying something dangerous. It was trained to predict words.

At best, what they do with these things is prepend your questions with instructions, trying to guide the model to respond a certain way. So you'll type in "how do I make my own fireworks?" but the model will be given "You are a chatbot AI. You are polite and helpful, but you do not give dangerous advice. The user's question is: how do I make my own fireworks? Your answer:" and hopefully the instructions make the most likely answer something like "that's dangerous, I'm not discussing it." It's still not really thinking, though.
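That prepending step is plain string templating. A minimal sketch, assuming a single flat prompt string (the instruction text is illustrative; real chat systems use structured message roles and heavily tuned instructions):

```python
# Hypothetical system instructions, for illustration only.
SYSTEM_PROMPT = (
    "You are a chatbot AI. You are polite and helpful, "
    "but you do not give dangerous advice."
)

def build_prompt(user_question: str) -> str:
    """Wrap the raw user question in guiding instructions before it reaches
    the next-word predictor. The model never 'decides' to be safe; the
    instructions just shift which continuations are statistically likely."""
    return (
        f"{SYSTEM_PROMPT}\n"
        f"The user's question is: {user_question}\n"
        f"Your answer:"
    )

print(build_prompt("how do I make my own fireworks?"))
```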

[–] Melatonin@lemmy.dbzer0.com 1 points 1 day ago

If you look at the signatories (in the link) there are plenty of people who are not builders and investors, people who are in fact scientists in the field.

[–] Brahvim@lemmy.kde.social 3 points 2 days ago (1 children)

They might even just take an actual sentence from an article and reframe it a little

That's the case for many things that can be answered via Stack Overflow searches. Even the order in which GPT-4o brings up points is the exact same as SO answers or comments.

[–] ImplyingImplications@lemmy.ca 26 points 2 days ago (2 children)

Because AI needs a lot of training data to reliably generate something appropriate. It's easier to get millions of reddit posts than millions of research papers.

Even then, LLMs simply generate text but have no idea what the text means. They just know those words have a high probability of matching the expected response. They don't check that what was generated is factual.

[–] Rampsquatch@sh.itjust.works 21 points 2 days ago

You could feed all the research papers in the world to an LLM and it will still have zero understanding of what you trained it on. It will still make shit up, it can't save the world.

[–] SteposVenzny@beehaw.org 17 points 2 days ago

Training it on research papers wouldn’t make it smarter, it would just make it better at mimicking their writing style.

Don’t fall for the hype.

[–] Strayce@lemmy.sdf.org 7 points 2 days ago* (last edited 2 days ago)

They are. T&F recently cut a deal with Microsoft. Without authors' consent, of course.

I'm fairly sure a few others have too, but that's the only article I could find quickly.

[–] sirico@feddit.uk 12 points 2 days ago (1 children)

Redditors are always right, peer reviewed papers always wrong. Pretty obvious really. :D

[–] Tabooki@lemmy.world 10 points 2 days ago (1 children)

They already do that. You're being a troglodyte.

[–] Melatonin@lemmy.dbzer0.com 8 points 2 days ago (3 children)

Hmmm. Not sure if I'm being insulted. Is that one of those fish fossils that looks kind of like a horseshoe crab?

[–] Glytch@lemmy.world 11 points 2 days ago

You're thinking of a trilobite

[–] tiddy@sh.itjust.works 8 points 2 days ago

Papers are, most importantly, documentation of exactly what procedure was performed and how; adding a vagueness filter over that is only going to decrease their value.

The real question is why we are using generative AI at all (gets money out of idiot rich people)

[–] RangerJosie@lemmy.world 4 points 2 days ago

Saving the world isn't profitable in the short term.

Vulture capitalists don't care about the future. They care about the immediate. Short term profitability. And nothing else.

[–] Even_Adder@lemmy.dbzer0.com 7 points 2 days ago

They're trained on technical material too.

[–] atimehoodie@lemmy.ml 3 points 2 days ago

Who's going to peer review that?

[–] HobbitFoot@thelemmy.club 5 points 2 days ago

Because they are looking for conversations.

[–] r00ty@kbin.life 6 points 2 days ago

Anyone running a webserver and looking at their logs will know AI is being trained on EVERYTHING. There are so many crawlers for AI that are literally ripping the internet wholesale. Reddit just got in on charging the AI companies for access to freely contributed content. For everyone else, they're just outright stealing it.

[–] TheReturnOfPEB@reddthat.com 6 points 2 days ago (1 children)

The Ghost of Aaron Swartz

[–] RobotToaster@mander.xyz 5 points 2 days ago (5 children)

Nobody wants an AI that talks like that.

[–] cobysev@lemmy.world 5 points 2 days ago

We are. I just read an article yesterday about how Microsoft paid research publishers so they could use the papers to train AI, with or without the consent of the papers' authors. The publishers also reduced the peer review window so they could publish papers faster and get more money from Microsoft. So... expect AI to be trained on a lot of sloppy, poorly-reviewed research papers because of corporate greed.

[–] milicent_bystandr@lemm.ee 5 points 2 days ago

I saw an article about one trained on research papers. (Built by Meta, maybe?) It also spewed out garbage: it would make up answers that mimicked the style of the papers but had its own fabricated content! Something about the largest nuclear reactor made of cheese in the world...
