AI-generated texts are already “polluting” the internet

Tools like DALL-E 2, Stable Diffusion and ChatGPT, which is getting a lot of attention these days, are very impressive. The first two can create images from a text description; the third is a chatbot that can answer almost any question or generate custom text. These technologies are so advanced that it is sometimes hard to believe their output is not the work of a human. However, as Melissa Heikkilä explains in MIT Technology Review, this profusion of “artificial” texts could be more problematic than it seems.

ChatGPT is something like an encyclopedia available 24 hours a day, with an answer to (almost) everything in record time. Mathematics, history, philosophy: nothing escapes it. But where this conversational agent, which is based on OpenAI’s GPT-3 language model, particularly stands out is in generating text. Whether it is a fictional story, an e-mail, a joke or a press article, it can write clear, understandable and credible text on any topic. In less than a month of existence, it has already been used by more than a million people.

While this capability could allow students to write essays effortlessly, it could also have far more serious consequences. Melissa Heikkilä mentions in particular health advice that has not been approved by a real health professional, as well as other important informational content. “AI systems could also mindlessly facilitate the production of masses of misinformation, abuse and spam, distorting the information we consume and even our sense of reality,” she writes.


Increasingly rare “good” training data

Some tools exist to detect AI-generated text, but they prove ineffective against ChatGPT, the journalist notes. The greatest fear today is not so much that we cannot tell whether a text is human or artificial in origin, but that the Web could very quickly fill up with mostly erroneous content. Why? Because AIs are trained on content gleaned from the Internet, some of which other AIs have themselves produced.

Initially, AI models are trained on datasets (texts and images) found on the Internet. These can include quality content, but also misleading and malicious information posted by people. An AI trained on this data in turn produces erroneous content, which spreads across the Web and is then used to train other, even more convincing language models, which humans can use to generate and spread still more false information, and so on.

The phenomenon now extends to images. “The internet is now forever contaminated with images made by AI. The images we made in 2022 will be part of any model that is made from now on,” points out Mike Cook, an AI researcher at King’s College London.

The takeaway from all of this is that it will be increasingly difficult to find quality data that was not generated by AI to train future AI models. “It’s really important to consider whether we need to train on the entire internet or whether there are ways to filter out the high quality stuff that will give us the kind of language model we want,” Daphne Ippolito, a senior research scientist at Google Brain, Google’s research unit dedicated to deep learning, tells MIT Technology Review.

How can AI-generated texts be detected?

It is therefore becoming essential to develop tools to detect AI-generated text, not only to guarantee the quality of future language models, but also to ensure that the information we access every day is accurate. As Melissa Heikkilä points out, people could try to submit AI-written scientific articles for peer review, or use this technology as a tool of disinformation, which would be particularly harmful during elections, for example.

Humans also have a role to play in this fight against artificial content: they must become more discerning and learn to spot texts that were not written by a human. “Humans are messy writers”: a text written by a real person will contain typos or spelling errors, a few slang words and sometimes convoluted turns of phrase, all small signs that an AI cannot reproduce (at least, not yet). Additionally, language models work by predicting the next word in a sentence; as a result, they mostly use the most common words and very few rare ones.
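To make the "common words" clue concrete, here is a minimal Python sketch that measures what share of a text's words fall outside a list of common words. It is only an illustration of the idea, not a real detector: the function name `rare_word_ratio` and the tiny `COMMON_WORDS` list are assumptions made up for this example, and a usable list would have to come from word frequencies in a large corpus.

```python
import re

def rare_word_ratio(text, common_words):
    """Return the fraction of words in `text` that are NOT in `common_words`."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    rare = sum(1 for token in tokens if token not in common_words)
    return rare / len(tokens)

# Hypothetical, deliberately tiny list of "common" words (assumption for the demo).
COMMON_WORDS = {"the", "a", "is", "on", "cat", "sat", "mat", "and", "it", "was"}

plain = "The cat sat on the mat and it was a cat"
quirky = "The moggy plonked itself on the threadbare mat"

print(rare_word_ratio(plain, COMMON_WORDS))   # every word is in the common list
print(rare_word_ratio(quirky, COMMON_WORDS))  # half of the words are rare
```

A very low ratio would only be a weak hint that a text leans on common words; on its own it proves nothing about who, or what, wrote it.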

So much for form. As for substance, it is also important simply to take a step back from what you read on the Internet. Keep in mind, for example, that ChatGPT’s training ended in 2021; the tool therefore relies on data that was on the Web up to that year. As a result, answers that require knowledge from after that date will necessarily be wrong, outdated or made up.

Ippolito says people should be alert to subtle inconsistencies or factual errors in texts presented as fact, Heikkilä reports. We are quite capable of it: the researcher’s work has shown that, with practice, humans get better at identifying artificial content.

Source : MIT Technology Review
