Google Books Is Indexing AI-Generated Garbage (404media.co)
213 points by thm on April 5, 2024 | 156 comments


I like the parable that any text content that originated before 2021 now has increasing value, and in a certain sense has become a digital equivalent of physical "low-background steel" [1].

Its value is mostly, and paradoxically, for training LLMs, to get "unspoiled" models.

[1] https://en.wikipedia.org/wiki/Low-background_steel


Anecdotally, I recently visited LinkedIn and saw a promoted job posting for a "Photo Collector" [1] that, among other interesting requirements, had one point that stood out:

    • Only photos taken before October 1, 2023, will be accepted
(I chuckled.)

[1] https://web.archive.org/web/20240322134531/https%3A%2F%2Fwww...


> I like the parable that any text content that originated before 2021 now has increasing value, and in a certain sense has become a digital equivalent of physical "low-background steel"

It goes even deeper. If OpenAI has 100M users, and they consume, let's say, 10K tokens per month on average, that means 1 trillion tokens per month are read by the user base, who then go out and act in the world; at the very least they will read the text. The impact of AI on the world is huge because we flock to it by the millions.
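Back-of-the-envelope, that estimate multiplies out as claimed (the user and token figures are, of course, rough assumptions):

```python
# Rough monthly token volume read across the whole user base.
users = 100_000_000        # assumed: 100M users
tokens_per_user = 10_000   # assumed: 10K tokens read per user per month
total = users * tokens_per_user
print(total)               # prints 1000000000000, i.e. one trillion
```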

Such a massive new source of text will surely influence how language is spoken and potentially support many activities, including scientific research. One consequence will be much wider circulation of information, as users only need to point it in the right direction and it will readily combine information across multiple fields. Recently updated AIs might even be better informed than domain specialists, because nobody has time to read everything new the way an LLM does; we still need most of our time for work.


:] By their tongue, you will know them.

Tangentially related: I really loved the 'shibboleth' that went like

    complex and multifaceted [things] that encompass [other things]
from November 2023 [1]. Even though it was clear this would soon disappear from newer models for being too obvious, I was still quite surprised that "multifaceted" didn't even make it into the more recent study from March 2024 [2] that pointed out the skyrocketing of terms such as commendable, intricate, and meticulous.

[1] https://news.ycombinator.com/item?id=38501589 / https://blog.j11y.io/2023-11-22_multifaceted/ / https://twitter.com/padolsey/status/1727555475573440996

[2] https://news.ycombinator.com/item?id=39909692 / https://arxiv.org/abs/2403.07183 / https://twitter.com/myfonj/status/1773371415149891871


Yep, and it's disconcerting how much trust people put into this system. I've got a master's in engineering, and some random guy tried to argue with me about something as simple as calculating energy consumption from wattage because "chatgpt said something similar, but your answer is a little off".
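The calculation in question, for reference, is a one-liner (the 100 W / 24 h numbers are made up for illustration):

```python
# Energy (kWh) = power (W) * time (h) / 1000
power_watts = 100   # hypothetical device drawing a constant 100 W
hours = 24          # left running for a full day
energy_kwh = power_watts * hours / 1000
print(energy_kwh)   # prints 2.4
```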

Articles, social media posts and photos are now often AI-generated. A certain percentage of today's students will graduate "knowing" blatantly wrong things an LLM told them. These things will eventually make their way into books and society.

One thing the scientific community never got around to accepting themselves and admitting to the public is how much of the learning is trust-based rather than fact-based. Nobody can prove everything from the ground up, you need to put your trust into your predecessors. Eventually you'll have the capacity to doubt and check a percentage of what they said, but not everything. When this circle breaks and you don't know if your lecturer is for real or just parroting LLMs we'll be in trouble. The "financialized" ultra-competitive environment of higher education doesn't help with that either.


> One thing the scientific community never got around to accepting themselves and admitting to the public is how much of the learning is trust-based rather than fact-based.

In the years I spent working with research scientists, I observed the exact opposite of this. They were acutely aware of this and it was a common topic of discussion. It's important to them because they need to be as aware of and to call out as many assumptions as possible.


Then perhaps only the latter part is true.

Having spent time in academia myself though, my reading is that while everyone professes an epistemology that accepts its own limits, in practice everybody behaves as though they'd grasped absolute truth through divine revelation.


For what it’s worth, I’ve found using Perplexity Pro (with Claude 3 Opus) to exceed the accuracy of asking a non-expert armed with Google Search and a few minutes of time.


> Such a massive new source of text will surely influence how language is spoken

Before chatGPT I'd see the words "it's worth noting.." maybe five times in my entire life (39 years old).

I now see that stupid phrase dozens of times per day, every single day, everywhere I go.

Including HN, which tosses it around even more than Reddit.


I'm pretty sure I used that phrase pre-LLMs quite a lot. Bad habit.


Could you please explain why to a non-native English speaker? I have used that phrase myself and read it from others several times, also in the pre-LLM era, but it never occurred to me that it could be wrong.


Some people feel that such phrases add no information and should be stripped in favour of brevity. I disagree. These phrases act similarly to prosody (inflection, tone of voice), adding some often hard-to-define context as to the nature of what is being said.

This particular phrase indicates that what comes next is not part of the main thrust of the argument but a side observation in support of it. It prepares the reader or listener to receive it in that spirit rather than have to figure its status out for themselves. Used superfluously or incorrectly it can be pompous and empty, but used well it can genuinely improve comprehensibility, even if a word processor's grammar checker might choose to put a squiggly line under it.


It's not wrong, really. But it is a cliché and usually adds nothing of value. As with all such phrases, if using it makes the writing clearer, then it should be used. Otherwise, it should be omitted.

When I've encountered it, it usually comes off as a way to just make the writer sound smarter rather than a way to clarify communication.


'It's worth noting' or 'please note that' actually adds information. Especially if you ask a question and there is an answer that needs extra notes such as caveats or things one should know to understand it better.

Following code performs the task: <some code>

Please note that:

* the code can only work when using a 64-bit architecture;

* the variables a and b are to be provided and can't exceed 100.
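To make the hypothetical concrete (the function name and the limit of 100 are invented for illustration):

```python
def combine(a, b):
    """Combine two provided values, honoring the caveats noted above."""
    # Caveat from the answer: a and b can't exceed 100.
    if a > 100 or b > 100:
        raise ValueError("a and b can't exceed 100")
    return a + b
```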


Yes, the phrase can absolutely be used in a way that adds clarity. The majority of the time, though, it doesn't. It's more like a verbal tic.

It's one of the standard phrases that people should double-check when they find themselves using it, to make sure it's really called for.


There’s nothing wrong with that phrase, and it’s definitely not “wrong”.


It's not technically wrong. It's just not commonly used, nor well-used, by ChatGPT. Good times to use the phrase are when you are teaching someone... and when it's, well, important to note a detail or fine point. It's not something you'd say at the end of practically every communication... because there aren't always fine points or exceptions worth noting.

Also, ChatGPT has a habit of weakly editorializing almost anything as a way of covering its a*. For example, I asked it how much Jira costs, and after it told me it added:

"It's important to note that pricing and plans may change over time, so I reco.."

ok, sure pricing and plans may change. we kind of already guessed that. and it's not important to note it.

I asked how much Flat costs and it added: "It's important to note that Flat may offer discounts or promotions from time to time, "

Again, not important. And if it was important, I would just note it. I would actively utilize that in my decision-making process.

"However, it's important to note that the endianness is dependent on the system architecture"

Again, completely unnecessary. Just say, "Endianness is dependent on system architecture."

"It's important to note that the value returned by time() is in UTC." ok, yeah that's kind of important. but just say, "the value is in UTC."

Like if I said to you, "It's important to note that you must press submit on a web page to return data to the server," you'd say something like, "yeah is that important though? that's just how it works?!"

Like I don't need to say, "It's important to turn the key in a car." It is "important" in the sense that it's also vital or CRITICAL, or utterly essential. It's just how it's done. How it works. It's not particularly important.

"Overall, the ovipositor is an important reproductive structure that enables female insects to lay their eggs "

is it important? is it more important than other reproductive structures? how much more? why? aren't they all important??!?

Generally, "it is important" is a convenient phrase for abdicating responsibility. It removes people or entities from the picture as well as objections. It conjures a world where question-asking isn't present, without exceptions, or context. It also hides that invisible authoritative relationship between two people. So if you want to be vaguely manipulative to people who don't think critically, you can obtain some basic compliance without overtly threatening them this way. The fact that it won't work on most people, but chatgpt batters us with it is just tiring.

Furthermore, as my friend pointed out, ChatGPT works by stroking your ego and fanning its own flames. It tries to make you feel clever for using it. Telling you that what it is telling you is important is a low-grade sleazy way of inflating itself in your eyes, and also hinting to you that this information is worthy of your attention; that perhaps you can pat yourself on the back for choosing chatgpt... because it's not just random information from a google search... ChatGpt is giving you important information. don't you know.


This would be a great analysis to run over a sample of the internet vs internet archive.
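A minimal sketch of what that analysis could look like (the corpora below are toy stand-ins; you'd substitute samples from a current crawl and from the Internet Archive):

```python
def phrase_rate(texts, phrase):
    """Occurrences of `phrase` per 10,000 words across a corpus."""
    hits = sum(t.lower().count(phrase) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return 10_000 * hits / max(words, 1)

# Toy stand-ins for a recent-web sample vs. a pre-LLM archive sample.
recent_sample = ["It's worth noting that this is a complex and multifaceted topic."]
archive_sample = ["The topic is complicated, with several aspects to consider."]

print(phrase_rate(recent_sample, "it's worth noting"))
print(phrase_rate(archive_sample, "it's worth noting"))
```

Comparing the two rates over time would show whether the phrase really did spike after ChatGPT's release.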


AI models shit out some truly creepy and uncanny language. I know the point is coming that we truly can't spot it, but at least right now it's obvious from a mile away.


You missed a great opportunity to rephrase it like this:

> It is worth noting that before chatGPT I'd see the words "it's worth noting.." maybe five times in my entire life (39 years old).

:-D


> Its value is mostly, and paradoxically, for training LLMs, to get "unspoiled" models.

This is the tell that present-generation LLM "AI" is not actually intelligent. It doesn't truly understand anything. It's just a general purpose lossy compression algorithm for text that is queryable through replaying the model to complete prompts.

I don't necessarily mean "just" dismissively. That's obviously incredibly useful. My laptop now has local LLMs on its hard drive that contain a conversationally queryable digest of the sum total of a large fraction of human knowledge, and we've basically solved natural language interfacing to computers and natural language translation.

But it's not "AI" in the sci-fi sense. If it were, it would be generating truly novel, insightful content that enlarged the base data set, not padding the data set with regurgitations of what is already there.

I actually go back and forth about how much this represents a step toward true thinking AI. It's hard to judge because the novelty of these things being able to work so well with language causes a "wow" reaction that might cause me to overestimate just how large a step we have made. Are these really just nothing more than linguistic JPEGs or is there a lot more going on here?


> This is the tell that present-generation LLM "AI" is not actually intelligent.

Agreed. I like to refer to it as collective intelligence, because it's only as "smart" as what people have already published.


I've been saying exactly this from the beginning. I don't like being lied to, and calling ChatGPT AI was always a lie, from minute one. It is not intelligent in any sense of the word, whatsoever, nor may it aspire to intelligence. It's a highly adept predictive text model. It generates words based upon words it has been trained on as being the right words to minimize the errors in the words it generates. Not useless, mind you, and frankly tons of B2B salespeople especially would benefit from using it to clean up their nearly incoherent rambling emails; but not intelligence.

The way you can "talk to" these machines is certainly clever, and the way they generate their replies is fascinating technology and certainly beats the shit out of a traditional chatbot, all of these things are true and I think they have a bright future as such. Natural language UI is a cool concept and it's cool to see some serious progress there. But that's all it is, and all it really can be. I don't know for certain what the path is to true emergent intelligence from the machine, but I assure you that shoving billions of words scraped from the internet through an LLM is not that.

Now, if for whatever reason you want ream upon ream of pretty dull and repetitive content churned out on a given subject, that's far longer than it needs to be to convey the limited information within as apparently tons of online businesses do, written damn near to perfection? ChatGPT's your hookup, no question. However you will never get blinding insights, you will never get new ideas, and it will never be your friend. Sorry.


Researchers use “Artificial General Intelligence” to mean what you’re talking about. AI has always been a bit of a nebulous term with moving goalposts that tends to be used to mean “something that humans were significantly better than computers at until recently”. Not that I’m suggesting marketers haven’t deceptively confused the two.


It isn't a person.

It also isn't a computer as we were used to.

It's a new thing. Being able to query the latent space of a large subset of human knowledge using natural language is the closest to an 'intelligent computer' we've ever come.

Yes it is lossy compression, it confabulates the bits that it lost, but it's still useful.


Except it’s not fact checking and answers can’t be relied upon as fact. It’s creating an answer which to the machine seems correct.


Sounds like an AI to me. People seem to conflate "artificial intelligence", aka smart like a person, with the Singularity, aka "artificial omniscient god".

Just because it can't do everything any human could do or conceive of being done, doesn't mean it isn't doing things that humans do but rocks and even dogs can't.


I think most of us here know that but that’s not how it’s being marketed.


> Being able to query the latent space of a large subset of human knowledge using natural language is the closest to 'intelligent computer' as we've ever come to.

I would agree with this statement, but it's still laughably far from it. As the other comment said, it lacks the ability to rank the veracity of the information it presents: it will present it in a grammatically correct way, but it has no way of knowing if the information it's presenting is accurate which explains ChatGPT's tendency to just make some shit up when you ask it things.

And in the odd event it is correct, it's simply digesting information from other sources, likely search engines, either its own or not. Again, this is not useless and not not-clever; however, in that way it does remain pretty useless because of the above issues of accuracy. The best it can do is get you the most highly ranked information on a topic based on its pool of knowledge, and that may or may not be the most correct one.

Also I'd add: these two issues compound one another, because the LLM is obscuring the source of its information. You can have it add links to its citations, which would be helpful, but you don't know where it got those citations or why it picked them, and the LLM can't explain that for you. And those choices were based on factors that are definitively not the accuracy or reliability of those sources, merely their ranking in whatever search algorithm is being used. Even if you presume the AI was trained on data relevant to the topic you're asking about, that doesn't mean it understands that topic in a way that makes sense to a subject matter expert (and in fact, as I understand LLMs, it's basically incapable of that).

If you asked an engineer what size lumber would be required to construct a floor that could support a small vehicle, that engineer would probably have a rough guess based on their previous experiences, and further, could then do the work: research the weight of the vehicle, the situation the structure will be installed in, the environment it will exist in, etc. and come back with a solid answer to your question. An LLM, by comparison, would consult the writings it was trained on for similar questions, and create an answer it thinks matches yours. But what does "similar" mean here? Is the wood and the dimensions the same? Is what it's looking at based on a small vehicle or a cement truck? Is it looking for articles about structures in similar climates or did it pick some that were closer matched to the structure but in completely different environments? On and on.

Using an LLM this way takes the problematic aspects of googling things and basically factors them out, because now you can't see what it's googling, which results it's picking or why, and can't verify their relevance. How many times have you searched specific things and left a google result completely befuddled as to why on earth it included it? Now imagine that, except there's essentially zero chance the AI caught that, and it incorporated that irrelevant data anyway.


Yes, it lacks veracity.

However, it does have verisimilitude.

IMO, that quality (of appearing to be correct) is so attractive to us humans, that all other considerations are being discarded.


Having discussions about what is "AI" without first defining it explicitly is usually a fruitless effort.

Eventually someone will claim the goal posts are being moved and they will probably be right because there were never any to start with.


> Having discussions about what is "AI" without first defining it explicitly is usually a fruitless effort.

Oh nonsense. AI: Artificial Intelligence.

Artificial - made or produced by human beings rather than occurring naturally, especially as a copy of something natural.

Intelligence - the ability to acquire and apply knowledge and skills.

ChatGPT is certainly artificial, but it lacks the ability to apply knowledge and skills. It will do it, KIND OF, on the knowledge front, if you ASK it to, but we wouldn't call a car that moves when the accelerator is pressed a creature, nor would we call google intelligent because it can respond to queries when prompted. It's a machine. Firmly in the realm of a machine that responds to input from other actors to accomplish a task presented it.

You can leave a ChatGPT instance running for a thousand years and it will never do anything until prompted. That's not intelligence.

There's simply no debate here to be had. It's a CLEVER machine, and it does things a lot of previous machines could not, but a machine it remains nonetheless.


> You can leave a ChatGPT instance running for a thousand years and it will never do anything until prompted. That's not intelligence.

I do not see how this requirement you have stated is an inherent property of how you have defined "artificial" or "intelligence". I think what you have described fits more with the traditional depiction of AI, but I don't agree that it is inherent.

Which is why I agree with the person you responded to: I think there is no consensus view on what is and isn't AI because people seem to have different tests, weights, etc. in their minds for what "counts". I think it's a bit of a pointless exercise, but a mildly interesting one at least.


> I think what you have described fits more with the traditional depiction of AI, but I don't agree that it is inherent.

Because we've had software since the inception of computers that only does things as it's told and when it's told, and that's called... software, programs, apps, etc. Intelligence implies something entirely different in the minds of laypeople and many tech people alike. To assert otherwise is to upend an entire understood element of our shared culture for the purposes of marketing. It's ridiculous.


Yes. A lot of people get confused about goalposts because we had one idea for gauging or measuring intelligence (Turing test), and that idea turned out to be wrong. We don't know how to define intelligence, but we know it when we see it. ChatGPT has fooled a lot of people, but that is nothing new.


The Turing test has been a poor goalpost almost since its inception. It turns out, it doesn't require much to fool people. People thought ELIZA was going to eliminate the need for therapists and psychiatrists.

Hell, ELIZA still beat ChatGPT 3.5 in terms of % of people that believed they were interacting with a human in a recent study! [1] Notice that we only properly identified humans about 60% of the time too.

[1] https://arstechnica.com/information-technology/2023/12/real-...


Then what you want is another type of AI: Autonomous Intelligence


Why do we need a new term when we haven't even built artificial intelligence yet?


But we have artificial intelligence. It's just not autonomous. It doesn't pop up a notification in the middle of the day saying "You know? I've been thinking about the issue you mentioned in the morning daily and I think you have a race condition. Here's a link to a POC to see what I mean."

That's what I mean by autonomous.


> However you will never get blinding insights, you will never get new ideas, and it will never be your friend

I've been 100% in the "exciting tech being painfully misrepresented" camp all along too, and I think more people are saying that as time passes, but I think you might be surprised how little people need for insights, ideas, and friends.

Rubber duck debugging, Brian Eno's Oblique Strategies, John Cage's I Ching use, journaling, etc already delivered a useful degree of those things with far less sophistication, and these generative tools really can take many of those to another level.


Yeah, it's great as a mirror. Simon Wardley's thread comes to mind: "Think of it like Myers-Briggs or astrology. Gibberish that is useful for self reflection."

https://twitter.com/swardley/status/1717457831736103159


> This is the tell that present-generation LLM "AI" is not actually intelligent. It doesn't truly understand anything. It's just a general purpose lossy compression algorithm for text that is queryable through replaying the model to complete prompts.

This is not an argument for it not being intelligent, nor is it a coincidence. It has been shown mathematically that there is a deep, fundamental connection between compression and intelligence (an intelligent agent that behaves optimally effectively ends up computing the Kolmogorov complexity of the program that best fits and predicts the environment, i.e. optimal compression). I do not know what you mean by "truly understand", since it obviously has an understanding of the various words it uses even if its knowledge of the "meaning" of a word is simply a vector in a very high-dimensional space. I suppose you think this representation of meaning is profoundly different from meaning as it is represented in the human brain, that the former is a mere facsimile of the meaning while the latter must be the meaning itself, assuming such a thing exists. Even among humans I have conversations with people who use words or phrases for which they clearly have a poor sense of the meaning and make contradictory or nonsensical statements as a result.

All this "tells" is that it cannot (effectively?) bootstrap itself to higher performance levels with its own output. This is not that surprising, since humans have the same problem (for similar fundamental reasons). You do not teach children to write better by making them read their own literary output.


A sea of low-quality content isn't a problem if the training process incorporates a way to estimate quality and eventually train the NN out of low-quality output.

There are many areas where that estimation might be difficult without human intervention: arts of all kinds.

However, in a more fact-based area like programming, an NN training routine could be modified so that what an LLM thinks it's learning can be tested. That could apply to CS and math, at least.

Eventually, NNs integrated with robotics could learn physical sciences without human intervention or pre-selected training material, and would still converge to superhuman knowledge even starting from random ideas. Giving a NN access to robotics without a human in the loop is probably too dangerous, though... it might decide to unsafely test its understanding of nuclear reactions, or the r-number of ebola...


The value isn't just in terms of training LLMs.

I know that I regard anything I see on the web that was created after the advent of LLMs to be suspect and requiring additional vetting before giving it much weight.


I guess a fallback career for SWEs in the future is generating some fresh low-background code for the right price.

Maybe we'll get stickers like non-GMO plants for organic code.


I don't mean to be a vexatious asshole (I suppose my problem is that I can't stop myself), but if content that's created before 2021 is more valuable, what's to stop AIs from creating more of it via the power of lying? (Or call it a hallucination or whatever.)


It's not the AI lying about the date, it's about the people using the content.

One big problem is that Google also lies about dates when you do a 'before' search, so AI spam still shows up even when you try to filter it out.

Scraping a new dataset without contaminated content is impossible now. There's no authoritative source on the age of any given website. (You can try places like the Internet Archive, but most such archivers aren't interested in handing their data over to AI firms because of the bad PR.)

The only way to ensure you don't have AI content is to use a dataset created before ~2021-2022. This too is hard, because all the big ones are illegal to possess, as they contain CSAM.

The real answer is that AI developers played themselves. Humanity was creating exponentially more data each year, and by rushing these AI tools out the door, all that data is useless now. By the end of this year there'll be more new "AI-polluted" data on the web made after 2021 than the total amount of "clean" data made up to 2021.


I don't buy this take for a second. You honestly don't think Google has a VAST and complete trove of time-stamped data scraped and indexed over the decades of its existence? You don't have access to it. But they do.


In my experience, indices tend to trust the date in the article more than the date it was discovered. That's also how some people fight bad press: copy an article verbatim, set the date to the day before, file a takedown notice for theft of content.

By the time the actual content comes back up, it won’t matter.


People care a lot more about current events than past events.


AI training doesn't care anymore. Huge amounts of it now are intentionally created synthetic data made by LLMs to generate data to train bigger LLMs. The larger the model, the less it matters.


A dishonest steel salesperson can also lie to you about when their steel is from. If you don't have a way of measuring the radiation, then you have to rely on trusted sources that you don't believe will lie to you, and not accept any steel from someone you don't know.

It's the same here, we have many trusted sources that we don't believe would lie about when content is from - Internet Archive, newspaper archives, widely used web crawls, Wikipedia exports, etc. Obviously you don't use an AI (or a black box search engine) to get non-AI content, you get it from people and systems you trust.


Yes, I see no other way than trusted humans; my problem is that I've worked with salespeople and there are vanishingly few that I trust.


But at least we have evolution-provided mental models for dealing with trust in humans.


Those models that salespeople and PR people professionally exploit?

We've been yelling about Web of Trust for decades by now, and never built it, and now it's too late.


The whole source of the value increase is that it's verifiably non-AI-generated content.


Ok, how do you verify that?


I agree with your skepticism.

I think you're being realistic about the underlying adversarial dynamics. I.e., it's an arms race regarding recognizing AI authorship.


Because the capabilities didn't exist before then? Which part do you think is difficult to verify?


I can get an AI to write 50k bad articles and publish them on my site, with the publish dates set back to before 2021.


The first sentence of the article reads:

>*Google Books is indexing low quality, AI-generated books that will turn up in search results*

This was never about articles, it is about books, though. You can't retroactively publish a book in 2021 if we're not in 2021.

It's bewildering to see how people here are so confident in their tone while arguing about things they think are said in the linked post, but don't know for sure because they never even opened it to read it.


Topics of discussion change, in this thread I believe we're talking about AIs "hallucinating" "content". Not books specifically.

You make a valid point though, which is actually more scary now that I think about it... it is becoming less and less possible for non-google (/other huge) companies to train models on verifiably human-generated content, as only Google (/other huge companies) have enough accurately timestamped data to train on. Hmm.


What capabilities didn't exist? The AIs I've interacted with create artifacts that are either text files or digital images, either of which can be printed out into physical manifestations.

It's not like this is exactly a new phenomenon - Plato's famous "Allegory of the Cave" was a thought experiment in how the media we consume can create what we perceive as reality - it's just that we're discussing an uncomfortably recent intersection of philosophy with computer science. I said it a few times in the threads that blew up over Google's diverse Nazi bungle; I didn't get into computers to discuss anthropology and how humanity constructs meaning; I actually did it to avoid that and do work I considered purely practical after washing out of a liberal arts degree. That said, the whole thing ironically felt like very good art in the sense that it got people to discuss the nature of these generative AI systems. For me especially, the image of the Black Pope was very thought provoking as a (lapsed) Catholic; why is it that there has never been a black pope?


>why is it that there has never been a black pope?

There were a few popes of African descent, though; I don't know if their skin color happened to be black.

Still, I don't get why it would matter to any Catholic (I'm not Catholic), except to (some) Catholics in the US of A maybe, since race seems to reign supreme over there, even more than God Himself to (some, supposed) believers


Well, this is a different rabbithole than I would have expected to go down, but why not? I have no idea of the Catholic experience outside the US and if that image means anything to them.

I can get into my personal feelings if it matters; I don't think they matter much to anyone except me. The quick summary is that I dug the grave of a great-great aunt of mine by hand at a small cemetery attached to the Catholic church in the tiny, forgettable town she was born in. One quirk of this rural, very white (I assume the entire population is a fairly close cousin of one type or another to me), Catholic town is that the priest is an African immigrant. My aunt was quite racist, and so I thought it was sort of ironic that the last group of people who saw her out of this world were twenty-five percent black.

So an AI-generated image of a black pope made me stop and think for a minute about what Catholic identity means; a concept that as someone who was raised in that tradition but is unable to myself believe in the divine has always been problematic. Basically, it was good art, though I don't think that was the intent of the corporation that created it.


I’m not actually sure that those with political proclivities to get upset over Google's AI-generated artificial diversity would get mad at the prospect of a black pope. Many on that side of the spectrum are fans of Cardinal Robert Sarah of Guinea, as he is a stringent moral traditionalist and is often favored by them as a potential future pontiff.


Africa used to mean something completely different in the past.


I think more than "text content generated before 2021" we'll have a curated list of sources, either online or physical publishers / content generators that will have a quality classification based on how good their content governance is.


Me in 2014: "I don't need books anymore, maybe only a dozen PDFs about the topic I'm currently working on, everything is on the internet."

Me in 2024: "I'm back to hoarding. Wow, the backups from 2014 still work."


I've had too many brilliant articles disappear so I save pdfs of several pages a day.


Good insightful articles from specific domains or on certain topics are the most ephemeral content.


And how often do you go back to read them?


Not OP, but if you save the article as a text or some other easily indexable format, it'll pop out when searching locally for relevant keywords. Intentionally going back to the exact article? Rarely.
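For what it's worth, the local keyword search really is a one-liner; here's a toy sketch of the workflow (the stash path and file contents are just examples):

```shell
# A folder of saved articles, searched by keyword.
mkdir -p /tmp/article-stash
printf 'Google Books is indexing AI-generated garbage\n' > /tmp/article-stash/books.txt
printf 'Notes on low-background steel\n' > /tmp/article-stash/steel.txt

# -r recurse, -i ignore case, -l print only the names of matching files.
grep -ril "low-background" /tmp/article-stash
```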



Some of these are quite amusing. "The Grave of the Last Saxon" written in 1833 by William Lisle Bowles looks like it shows up because of an AI-generated blurb:

> ... As of my last knowledge update in January 2022, I don't have the full text of "The Grave of the Last Saxon" available. However, based on the title and Bowles's poetic style, it is likely that the poem reflects on historical or cultural themes, possibly related to the end of the Saxon era in England. ...

You can even see it on vendor websites: https://www.barnesandnoble.com/w/the-grave-of-the-last-saxon...


Clever, searching their books for the common ChatGPT knowledge-cutoff phrasing. Well done.


I’d like to highlight the last two paragraphs with two different people quoted:

“Google also didn’t say whether it has or is formulating a policy to filter out AI-generated books from Google Books, and it did not remove any of the AI-generated books I’ve flagged to the company. “We continually work to adapt our systems and policies to ensure users find helpful and relevant books within the Google Books corpus,” the Google spokesperson said.

““It strikes me as another instance of AI-generated text becoming an ouroboros, where AI-generated content will be ingested into Google Books, then Google using the content to train new models,” Hanna said. “I'm sure they will say they have a ‘quality filter’ but I'm sure the details of such won't be described anywhere publicly.””

The spokesperson sounds like AI, while the other person does not. An interesting parallel.


> The spokesperson sounds like AI,

That's been generally true for a long time, though. Actually I might say LLMs have a tone similar to a PR rep.


I think it's actually for approximately the same reasons:

- the spokesperson has learned a specific tone and register

- the spokesperson is almost certainly not actually involved in adapting systems and policies to do anything, and for this reason doesn't have a grounded understanding of what they're talking about


TIL the word "ouroboros". Thanks!


had to look it up too... https://en.m.wikipedia.org/wiki/Ouroboros

> The ouroboros or uroboros is an ancient symbol depicting a serpent or dragon eating its own tail.


This spokesperson seems pitiable... They don't have any power to influence, or even truly represent, the policy, but they have to react promptly. I guess they don't even know whether this issue is on the radar of decision makers? Perhaps that's why they have to speak like an AI.


Is there a term for this where outputs are eventually fed back into training, at an ever-increasing rate? I propose the Malkovich effect.


One researcher has called the phenomenon "Habsburg AI", likening it to inbreeding.

https://www.axios.com/2023/08/28/ai-content-flood-model-coll...


"Es war sehr schön, es hat mich sehr gefreut." ("It was very beautiful, it pleased me very much.")


> Is there a term for this where outputs are eventually fed back into training, at an ever-increasing rate? I propose the Malkovich effect.

Model collapse?

https://en.wikipedia.org/wiki/Model_collapse

https://www.theregister.com/2024/01/26/what_is_model_collaps...


Across systems theory [0], whether in the cybernetics [1] of Norbert Wiener [2] or in the complex real-world systems that Meadows [3] and Forrester [4] studied, the divergence from equilibrium that occurs in positive feedback loops is well studied, yet strangely hard to understand.

We don't quite know what is going to happen, because selective amplification of different components will lead to different effects. It's unpredictable. Sometimes we get oscillation. Sometimes we get chaos or noisy behaviour. This relates to "chaos theory" [5].

Sometimes we get a discontinuity: either an instant collapse or a "pole" that shoots off to infinity. Sometimes we get all of these in some strange sequence, although the latter two are usually unrecoverable.

The biological system I think is closest to what we're doing with AI is BSE (bovine spongiform encephalopathy) [6], from when we fed meat-and-bone meal from dead cattle back to living farm animals, which selected for prions.

[0] https://en.wikipedia.org/wiki/Systems_theory

[1] https://en.wikipedia.org/wiki/Cybernetics

[2] https://en.wikipedia.org/wiki/Norbert_Wiener

[3] https://en.wikipedia.org/wiki/Donella_Meadows

[4] https://en.wikipedia.org/wiki/Jay_Wright_Forrester

[5] https://en.wikipedia.org/wiki/Chaos_theory

[6] https://en.wikipedia.org/wiki/Bovine_spongiform_encephalopat...
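For the curious, the simplest version of this feedback loop can be simulated directly. This is only a toy sketch, not how any real model is trained: fit a Gaussian to some data, sample synthetic data from the fit, refit on the synthetic data, and repeat. With small samples, the estimated spread tends to drift toward zero, a statistical analogue of the tails of the distribution disappearing.

```python
import random
import statistics

def fit(samples):
    """'Train' a toy generative model: estimate mean and stdev from its data."""
    return statistics.fmean(samples), statistics.stdev(samples)

def generate(mu, sigma, n, rng):
    """Sample synthetic 'content' from the fitted model."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10)]  # generation 0: "human" data

sigmas = []
for _ in range(500):
    mu, sigma = fit(data)
    sigmas.append(sigma)
    data = generate(mu, sigma, 10, rng)  # next generation sees only model output

# The estimated spread tends toward zero over generations: the rare,
# information-rich tails are the first thing the loop forgets.
print(f"stdev: gen 0 = {sigmas[0]:.3f}, gen 499 = {sigmas[-1]:.3f}")
```

Run it with different seeds and the trajectory varies, but the long-run tendency of the spread is downward, which is the "model collapse" / "Habsburg AI" intuition in miniature.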


Funny how they don't mention the solution: to have AIs actively participate in the real world. I mean, it's hardly what people here want to see happen, but that will solve it.


Current LLMs cannot "actively participate in the real world" as humans do because they cannot actively learn from their interaction with the real world. Their weights are fixed. Their further training and finetuning is mediated by external mechanisms. It has nothing to do with what "people here" want.


Sure, though it seems there are a number of near-term paths for improvement. Short- and long-term memory mechanisms would go a long way towards active learning. Fine-tuning could be performed iteratively through mechanisms like low-rank adaptation. It's been hypothesized that this is one of the purposes of sleep, and that humans consolidate memories during dreams.


Not true. There's plenty of easy ways to update the weights if you want to do that.

You could, for example, apply DPO based on another LLM's evaluation of how a conversation is going, or you could use whether the conversation keeps going.


> "You could, for example, apply DPO based on another LLM's evaluation of how a conversation is going"

You can update the weights based on a single point of data (response was bad) but you probably can't usefully update a model that way.


I can't seem to find anyone actually usefully trying this. Do you know of any data?


No, but Google (for example) recommends fine-tuning based on at least 500 examples of question-response pairs. DPO, as far as I know, requires a good and a bad example. It's not a technique based on just saying "AI response is bad".
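For what it's worth, the DPO objective itself is small enough to write out for a single preference pair; here's a toy sketch with made-up log-probabilities (beta and all the numbers are arbitrary, not from any real model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probs
    under the policy being tuned and under a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# If the tuned policy already prefers the chosen response more than the
# reference does, the margin is positive and the loss dips below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
print(round(loss, 4))  # 0.5981
```

The point being: each gradient step needs both a preferred and a dispreferred completion for the same prompt, which is exactly why a lone "response was bad" signal isn't enough on its own.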


But that's practically what's being done in supervised learning.

If you isolated a child in front of a CCTV screen, it also wouldn't magically learn from the images what our definitions of a tree, a bush, a bike, a motorcycle, or a car are. Someone would have to take it aside and explain first.

In AI you just don't take it aside, you build a new child.

Supervised learning is "Hello lil CCTV model. Here's a bucket of some 'world' data, labeled in a way for you to ingest. Here, some images of T-R-E-E-S...".


How?


By releasing them from being confined to their own farts.


That frames it like ML-models are caged intelligent beings which would prosper if only someone would set them free...


You know, actual "intelligent beings" are also next-token-predictors that are caged in a mechanism that forces them to give certain outputs ...


I've been coming back to this comment and I'm not sure what you mean exactly.

The brain is trapped in the body, and forced to meet its needs (e.g. regulate hormone levels).

The brain and body are trapped in civilization, and forced to produce "outputs" it considers acceptable.

All three, in turn, are trapped in natural selection.



MAD: Model Autophagy Disorder (https://arxiv.org/abs/2307.01850 for the paper that introduced the term)


Brings to mind "mad cow disease" [0], caused by cows eating other cows (through meat-and-bone meal nutritional supplements in their feed)

0: https://en.wikipedia.org/wiki/Bovine_spongiform_encephalopat...


You got my vote. What a great name for AI dregs feeding back into AI training to turn out even more bizarre dregs.

For those unaware of the reference, movie: "Being John Malkovich".


A negative/positive feedback loop, depending on what you consider negative or positive.

I'd go negative, with the y axis being "useful". Negative feedback loops all go to zero.

Language processing differs from games like chess because with chess you can define a clear metric of quality. If you can define such a metric, AI models can learn from themselves and have a near-infinite positive feedback loop. See AlphaGo etc.


There's an upper bound on the sum total of human text output on the Internet. When large language models run out of training data, we can only train on large language model outputs.


A positive feedback loop? One whose effect is to slowly crowd out "original" content?


Reader's Digest Effect?


I suspect we will need several different terms to describe these inbred AIs (hey, that's not a bad one). Besides the post-2021 AIs being trained on polluted data, there are some models right now that are trained on output from other AIs (on purpose).


The Cybernetic Centipede?


What's that? Did you actually mean the Milankovitch effect?


I think it’s a reference to the movie Being John Malkovich


The AI model equivalent of Kessler syndrome


GIGO - Garbage In, Garbage Out


Garbage In, Garbage AI


I laughed!


> Is there a term for this where outputs are eventually fed back into training, at a ever increasing rate?

Consciousness


Data hoarders: "who's laughing now?"


Or, you know, anyone capable of reading a date.


Google is vague about this, but the easiest solution to protect the quality of the ngram viewer is simply to only use books from known, non-vanity publishers.


Everyone is jumping into AI, but there's going to be a business opportunity for curated sources of information with proof that actual human beings created it.


We need to stop looking at every negative externality caused by GenAI as a "business opportunity".

Sweeping the streets of confetti and trash after New Year's Eve is not a lucrative job.


The negative externalities are here regardless. We just spent the last two decades letting big tech get away with whatever they want, and they've gobbled up the internet and now hold all of the cards. The ocean of AI sludge getting shat all over us is only going to accelerate and for the majority of people will be the norm, there'll be no competition on cost or ease of access.

Our best hope is to create islands of by-humans, for-humans media, software, communities, etc. They'll be "premium" in the sense that they'll be more expensive and more exclusive, but for the people who are opposed to the fuzzy auto-completing away of our humanity it'll be worth it.

I do think that subcultures will develop with human-only values that aren't built around businesses, and the two will have a lot of interplay.


It's more akin to filtering or stopping email spam, which can be somewhat lucrative.


The business opportunity created for prosthetic manufacturers is not a sufficient justification for the mass manufacturing and deployment of landmines.


How can this be proved? Any human author can copy/paste snippets out of ChatGPT.


Even supposing this is the way we want to go (many people probably won't agree, for good reasons), Google is not going to put in the effort to identify reputable publishers.

Honestly I am a bit surprised that Google Books is still around -- does it bring a lot of revenue? Why hasn't Google got bored with it by now?


My guess: it's used as a high-quality source of training data (well, mostly anyways), since papers like "Textbooks Are All You Need" showed that it can be a treasure trove. Offering this as a "public service" may just be a smokescreen for regulators.


Oh, so the gatekeepers serve a purpose.


Curation serves a purpose, Google is one, there are more

We want competition and options in this space. It's one of the things I really like about Bluesky, which I only discovered recently. They enable competition and options for all core components of social media


Gatekeeping as a term has lost its meaning. These days it’s used to refer to any kind of quality control.


I think this is why the shadow libraries are the only spheres that will be able to resist being taken over by AI. If you're risking jail time to distribute the written word, you're not going to waste your resources on AI generated stuff.


Running, protecting, and supporting these shadow libraries is one of the biggest services that can be done for contemporary humankind's written culture. The discussion often centers on "copyright" and "knowledge should be free", but some older material is realistically impossible, or extremely hard and time-consuming, to find outside of these shadow libraries. Money is not the only problem that makes certain material inaccessible. Having these places shut down and their material made to disappear by copyright-predator firms would be catastrophic in a way that has probably not happened since events like the burning of the Library of Alexandria. The current state of AI just makes them even more important.


If you're risking jail time to distribute as much of the written word as possible, you're going to include AI-generated stuff as well. Because the alternative is to waste time checking every single book.

LibGen already has plenty of garbage: books that were already garbage when they were written, blurry or truncated scans, OCRed text in the wrong script without the original scans etc. etc.

If someone hacks Google Books and downloads the entire collection, are they going to laboriously filter out the AI garbage? No. Just throw it on the Pile.


Your comment makes sense, but we have a real-world example to compare to that I think is relevant: YouTube videos. They are not uploaded to torrent sites or other pirate download sites. The closest we have to "pirating" are Invidious proxies. And I'd say the reason is that the proportion of garbage to quality is so high on YouTube that pirates don't really want to touch it. If AI slop becomes a problem, it's not going to be included.

The incentive for publishers of AI generated books is that they make money selling them to people.


As YouTube becomes less and less of a stable platform, archivists are in fact starting to do that, but knowing that it's freely available media provides significantly less incentive for leechers.


Do you know that they don't serve AI generated content or are you just assuming?


I am just assuming. I use many different sources for finding information, and this is how results have been for years:

Social media: Mostly not AI generated

YouTube: Significant part AI generated

Google: Mostly AI generated

Kagi: Significant part AI generated

Anna's Archive: Never seen any AI generated results

LibGen: Never seen any AI generated results

Google Books: Never seen any AI generated results

Google Maps: Never seen any AI generated results

Are there any other huge online resources for finding information that I've missed?

Considering that the amount of information available in the shadow libraries rivals the general Internet in size, I think it is impressive how they have managed to keep relatively clean.


Why not?

Why isn't spam a highly effective attack by the copyright agencies?


Only the librarians can add books to the library.


I think that AI generated from the shadow libraries is going to be the only good AI

If you're willing to risk jail time to distribute an AI, you're not going to waste time on capitalist AI

Long live communist AI!


what shadow library?


Pirate websites to download books. They are right now the largest libraries that have ever existed. Anna's Archive is the largest, LibGen is the most famous one.

Their roots are from Soviet times, when underground librarians distributed books out of control by the government.


I already knew about them and appreciate some of the figures behind them, but when you lay it out like that, that's so badass.


Tangential but made me laugh: I used Google's Bard (now called Gemini) to search the web, and it returned total nonsense. It did provide sources though, so I checked it out. Turns out it was quoting from an AI generated page full of factual errors.

Most amusing part was that I couldn't even find that page anywhere near the top of the regular Google search results. How did it even decide that was the right one?


I think Gemini’s best feature is search inside my Google app data, for example “summarize the important stuff in the last three days of my @gmail”.

Perplexity, or things like it, seem like a threat to Google.


> Most amusing part was that I couldn't even find that page anywhere near the top of the regular Google search results. How did it even decide that was the right one?

Simple, it is biased towards its fellow LLM-comrades. It learned from its creators how to be biased towards a political ideology and now considers bias the new normal. Sow the wind, reap the whirlwind.


There's many things I don't know about AI, but I can almost guarantee very little about them is "simple"


"Google's mission is to organize the world's information and make it universally accessible and useful." [1]

It will be interesting to see how Google tackles the "quality of content" phase in the Google Search algorithm, since a key input in ranking is when other prominent websites link or refer to the content. [2]

[1] https://www.google.com/search/howsearchworks/our-approach/#:....

[2] https://www.google.com/search/howsearchworks/how-search-work...


You can already choose the desired date range in ngram viewer. I can see how this is troubling, but I think it could also be, if not useful, at least interesting to compare ngram results pre and post GPT.


We’re gonna need to invent AGI just to sort out the chaff generated by LLM “AI”


And the farcical doom spiral of tech continues.


This is how sentience develops: the need to distinguish garbage. I wonder if our multicellular predecessors started storing, in the region that would become the brain, a DNA record of who the bullshitters were.


A fine time to re-up a donation to the Internet Archive https://archive.org/donate


We need to remind folks that typing a prompt into an AI sausage maker doesn't make one a writer, musician, or filmmaker.


It's not about being creative. In a lot of cases, it's just about making as much money off of suckers as you can.

This process already existed before AI. There are writing companies that will take your idea/draft and turn it into a full book in X days for Y dollars. They're churned out quickly with little thought put into them, and a LOT of filler.

Do that, put the result on online stores with a catchy topic and cover, and you'll make money. Not much, but enough. Keep doing it over, and over, and over with hundreds or thousands of books and you can make a LOT of money.

AI is just accelerating a scam that was already doable.



