I came to this same conclusion some years ago while working on a side project.
Before anything like an LLM existed, I built a site[0] to generate fake "AWS Blog Posts." I trained a Markov chain generator on all AWS announcement posts up to that point, copied the HTML + CSS of AWS's standard blog posts, then glued it all together with some Python + JS. It turned out, IMO, pretty funny! People familiar with AWS's blog posts would often get several sentences in before they realized they were looking at word soup.
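The core of a generator like that is tiny. Here's a minimal sketch of the idea in Python (not my actual code, just the shape of it):

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each run of `order` words to the words observed to follow it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, order=2, length=60):
        # Random-walk the chain from a random starting state.
        out = list(random.choice(list(chain.keys())))
        for _ in range(length):
            followers = chain.get(tuple(out[-order:]))
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)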
When GPT was new, I looked into using it to "upgrade" the site. I spent a weekend messing around with Minimaxir's gpt-2-simple, generating blog posts based on AWS content. What I found was that, ultimately, it was way less fun. The posts were far too realistic to be interesting. They read like totally real blog posts that just happened to not be true.
I realized then that the humor of those early Markov generations was the ridiculousness: the point where, a few words or sentences in, you realized it was all nonsense. LLMs these days are too good for that - the text they generate is sometimes wrong, but rarely nonsense in a humorous way.
Markov chain content was wrong in a "kids say the darndest things" way, while modern LLMs are wrong in a "My uncle doesn't know basic geography" way.
This is a parody of the nLab, a wiki for collaborative work on category theory and higher category theory. As anyone who's visited is probably aware, the jargon can be absolutely impenetrable for the uninitiated -- thus, the idea for this project was born!
This project uses my package nearley-generator, a module that can turn a Nearley grammar into an efficient and controllable fake text generator. The grammar file for this site can be found at /src/grammar/nlab.ne
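The rough idea is weighted recursive expansion of grammar rules. This isn't the actual package (nearley-generator consumes real Nearley grammars), just a toy sketch in Python with made-up rules:

    import random

    # Toy jargon grammar; these rules are invented for illustration,
    # not taken from the real nlab.ne grammar file.
    GRAMMAR = {
        "sentence": [["np", " is ", "adj", "."]],
        "np": [["the ", "adj", " ", "noun"], ["a ", "noun", " of ", "np"]],
        "adj": [["cartesian"], ["univalent"], ["monoidal"]],
        "noun": [["topos"], ["fibration"], ["2-monad"]],
    }

    def expand(symbol):
        # Nonterminals are keys in GRAMMAR; anything else is a literal.
        if symbol not in GRAMMAR:
            return symbol
        return "".join(expand(s) for s in random.choice(GRAMMAR[symbol]))

    print(expand("sentence"))  # e.g. "the univalent topos is monoidal."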
I figured out the same thing in a similar way. I used to make Markov chains of game patches and send them to the community. Fake Dota patches were a huge hit, especially since those are usually huge. Among mostly nonsense or unfunny exaggeration ("this hero now has 500 armor"), there would usually be at least 5 or 6 extremely funny lines in there, if not stuff that was outright prophetic (Fiend's Grip creates an additional 1/2/3 illusions).
LLMs did a number on that, unfortunately. Major subreddits banned all AI content, mostly due to the sheer infestation of really boring Midjourney content sent out by naive users and bots. Without reach, I lost interest, so no more Markov chains.
I am sure I even remember that Valve once incorporated something from one of the fake patch notes in one of their patches, but now I cannot find it. It was something like +1 damage to something that already did a lot of damage.
Probably the Techies +25x damage talent? In 2017 [7.07] it was increased from +250 damage to +251 damage. Then in 2020 [7.28] it was increased again to +252 damage.
After some more looking around I found it: Puck's orb gained 1 movement speed in 6.87 [0], after the Markov chain notes for the patch [1]. Though I think the Techies one is a similar joke.
I think the "uncle that doesn't know anything but claims to be an expert and will tell you his opinion" is the best description of LLMs that I can think of. They will say anything confidently without knowing anything.
Most ppl don't understand shit about these algorithms and will expect them to be truthful and knowledgeable. That is how they're marketed, despite the disclaimers.
They should be marketed more like "Uncle Jim" than a trustworthy, all-domain-competent oracle. I think if you just gave them a twangy accent and changed some output nouns to thingy, whatchamacallit, and doohickey, people would actually be able to use them more competently.
The sheer number of companies, media, influencers, heck even government agencies, acting otherwise? With billions of dollars on the line, there are many powerful forces pushing for people to have unrealistic expectations about LLMs (or "AI" as they call it).
A lot of money is being spent on trying to improve it, too. But right now the emperor has very little clothes. It remains to be seen whether the clothes will have materialized by the time people open their eyes.
For me, my expectations are adjusted. I know roughly what to expect it to do, and it does quite well for what I use it for: essentially smart IntelliSense for coding.
I would have expected the people who trained these systems not to format their Q/A samples to sound authoritative when they knew the system was going to be spouting nonsense.
Can an LLM provide the sort of psychological buffer that McKinsey Junior Analysts (some random person picked up off the street in a college town and handed a clipboard, a frequent flier card, and a six figure salary) provide to senior managers against the scrutiny of their board and chief executive?
If I want to close the paper division, and my boss wants to close the paper division, does the LLM whisper sweet nothings in our ears and show soothing PowerPoint decks about how, in the eyes of an objective observer, the paper division is unprofitable and there would be no career risk in closing it?
It's not a sexual thing I swear, it's a sort of comforting intimacy that needs to be injected into the business process in order to cope with the pressures executives feel. I don't know that anybody has trained an LLM to do that yet.
I think this is the next sequel to the movie "Her".
You bring up a good point though. Very dystopian idea of a new type of enterprise tooling that emerges to soothe and calibrate human knowledge workers. Very Severance Wellness room.
> modern LLMs are wrong in a "My uncle doesn't know basic geography" way
Oh, certainly not. I have been using LLM chatbots extensively this week. If your uncle were not just confident while stating falsities but incapable of self-assessment, he would have been murdered.
Or fresh ones. Established democracies end up with political structures that self-select for being able to sound right and cooperate right, entirely independent of truth and merit.
Oh please. Guidelines: «respond to the strongest plausible interpretation».
Does the form "An occasion of extensive use of the main current chatbots for circumstantial reasons in the past few days confirmed and reinforced the assessments raised years ago already" work better for you?
I think most of us haven't, but those who do are vocal about it. (Those who haven't aren't all that vocal, excluding weirdos like me.) Only 62% of respondents to the 2024 Stack Overflow Developer Survey are “currently [using] AI tools in [their] development process” (up from 44% last year), and non-programmers seem to use them far less.
I played around with a Markov generator many years ago, trained on the corpus of everything I had ever said in a specific IRC channel. The resulting messages were pretty interesting - I specifically remember the realization that I apparently (used to, though honestly still do, too) talk about LEDs a lot...
> I realized then that the humor of those early Markov generations was the ridiculousness: the point where, a few words or sentences in, you realized it was all nonsense. LLMs these days are too good for that - the text they generate is sometimes wrong, but rarely nonsense in a humorous way.
This is the biggest challenge I run into when I see LLM-generated nonsense.
So much stuff that ChatGPT spits out is close enough to true that Brandolini is laughing.
But it's also that it's borrowing from stuff that's already there, so it can't really get that "creative" either.
I did the same thing and used it as the chum for a 'bot motel. (A 404 handler that always returns something and... success!). I trained it on a mixed NASA space shuttle and veterinary science corpus just to make it really bizarre.
Prompt to Claude 3.5 Sonnet: Below is a recently published essay followed by some of the discussion about it on Hacker News. Write 10 short jokes on this theme, making sure that the jokes are funny in the way described by the essay's author. [essay and discussion omitted]
Claude's response: I understand you'd like me to write 10 short jokes on the theme of Markov chains being funnier than large language models, in the style described by the essay author. I'll do my best to create some jokes that aim for "unserious surprise" and violate expected patterns in amusing ways:
1. Why did the Markov chain cross the road? To get to the other side of unpredictability.
2. An LLM and a Markov chain walk into a bar. The LLM orders a statistically probable drink. The Markov chain orders a lampshade made of cheese.
3. What do you call a Markov chain that tells dad jokes? A Mark-ov Twain.
4. How many LLMs does it take to screw in a lightbulb? One - but it'll spend 20 minutes explaining the optimal lightbulb-screwing process first.
6. An LLM, a Markov chain, and GPT-4 walk into a bar. The bartender says "We don't serve your kind here." GPT-4 leaves. The LLM stays to debate ethics. The Markov chain orders a bicycle made of spaghetti.
7. What's an LLM's favorite movie? Predictable and Furious 17: The Safest Driving Yet.
8. How does a Markov chain tell a story? "Once upon a time, there was a princess who lived in a castle made of bananas and ruled over a kingdom of sentient toasters."
9. What did the Markov chain say to the LLM? "Your mother was a abacus and your father smelled of silicon!"
10. Why don't Markov chains make good therapists? Because their advice is always "And then you should banana the feelings until they become a hat."
"An LLM, a Markov chain, and GPT-4 walk into a bar. The bartender says "We don't serve your kind here." GPT-4 leaves. The LLM stays to debate ethics. The Markov chain orders a bicycle made of spaghetti."
In normal English usage this would imply that the LLM was not GPT-4 but some stereotypical anonymous LLM.
In business terms, GPT-4 can be said to be superior because it understood the instruction and left; in AI terms, the anonymous LLM might be superior because it may have understood the instruction but responded in an "intelligent" manner by arguing about the morality of the instructions.
At a meta-level the joke thus argues that GPT in achieving business ends has had its intelligence hampered. As have we all.
At the same meta-level, since the joke was constructed by Claude, it can be argued that Claude is commenting on both the intellectual limitations of the Markov chain (insane babblings) and of GPT-4 (unimaginative, inhibited business type), and that the best version is some LLM that is not GPT-4 with its limitations - an LLM like Claude. Sneaky Claude.
It's implied that GPT-4 has so many restrictions that it will not argue and just does what it's asked. In the context of the joke, an unfiltered LLM will just debate you.
All of the half-decent ones could be made funnier by replacing the lolrandom part of the punchline with an actual Markov-chain-style "you're a right sentence but you just walked into the wrong association, buddy" twist. It's not just about lolrandom. Markov chaining is more likely to make a kind of sense, but the wrong kind of sense.
An LLM, a Markov chain, and GPT-4 walk into a bar. The bartender says "We don't serve your kind here." GPT-4 leaves. The LLM stays to debate ethics. The Markov chain orders a coup.
"How many LLMs does it take to screw in a lightbulb? One - but it'll spend 20 minutes explaining the optimal lightbulb-screwing process first." that was not funny that is accurately painful!
Claude 3.5 Sonnet in general is the first modern LLM I've tried that's actually good at jokes that are inventive. The GPT-based LLMs are all too RLHFed to be wacky.
So was it that LLMs used to be capable of making actual jokes, or were they always this bad and I was just more impressed by the talking computer back then?
Uncensored LLMs are funnier but most comedy just falls flat in text format. Once the uncensored multimodal models start rolling out we’ll get some real laughs.
Moshi is actually pretty funny just for having a 72 IQ
I would argue that Markov chains were a better tool for comedic purposes.
Notice that in all of the examples of using Markov chains, the person would see the potential, come up with a purpose, build the setup, and then fill the setup with generated text.
Likewise, with the randomly generated examples, the person would assess the results, find the parts that were actually funny, and choose to spread those parts.
LLM output has fewer potentially funny results and less potential for an unexpected swerve from realistic-sounding output into absurdist output, so as a tool it is less fit for comedic purposes.
I chuckled a bit. They are OK, if you don't get exposed to them too often. And with an LLM you can get as much exposure as you want (and all of the jokes are naturally from roughly the same probability distribution).
I don't expect too much until AI self-play learning becomes possible, so I don't get disappointed by the expected shortcomings.
It's the "impressed by the spectacle" one. I tried jokes with LLMs many times, and they're always this. Riffing on a couple of themes loosely related to what was asked. Always unfunny and uncreative.
I found some of those jokes good, definitely better than I would've ever written them. If you watch shows about comedy, like, say, Hacks, you'll see human comedians riff on stuff, and a lot of the off-the-top jokes get discarded or improved. So Claude did fine in my book.
I wonder, though, whether jokes like these could be useful to professional humorists who have to come up with gags on a deadline. From what I’ve read about monologue writing teams for late-night talk shows and the like, the writers first propose many ideas, most of which are shot down quickly and the remainder of which get tweaked and polished before being used. Some of the above jokes by Claude look to me as though they might serve as good starting points for such brainstorming. At least, they’re better than anything I could create in a short amount of time.
LLMs were never very good at directly generating original jokes, for a simple reason: writing a good joke generally starts with finding a good punchline, and then setting it up. An LLM generating token after token will first write a set-up, and then try to shoehorn a punchline into it. Prompt engineering can fairly easily work around this, but just straight-up asking an LLM for a joke never really produced good results on average.
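E.g. a two-step prompt along these lines tends to do better (one possible phrasing, nothing canonical):

    First brainstorm ten absurd punchlines about <topic>. Then pick the
    strongest one and write a short setup that lands on it.

That forces the punchline to be chosen before the setup, matching how human joke-writing usually works.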
Some friends of mine in college ran a markov chain generator on the "police reports" section of our college newspaper (1). The best 10% of the results, from a three-token generator, were some of the funniest machine-generated text I've ever seen -- it had an absurdity that modern LLMs seem to avoid due to making higher-level semantic sense.
It probably also helped that there was a creep exposing himself in the library during this period, which made for some good base material.
(1) The Daily Utah Chronicle; if memory serves, said friends also tried the markov chain generator on the personals section to good effect as well.
> it had an absurdity that modern LLMs seem to avoid due to making higher-level semantic sense.
That's it. LLMs are "trying" to be funny but aren't quite smart enough to actually be funny, and their errors are just boring. Markov chains accidentally hit on absurdist bits because every sentence randomly wanders into whatever the homograph equivalent of a malapropism is.
If you want some empirical evidence: /r/SubredditSimulator[1] is a Markov-based Reddit parody, and /r/SubSimulatorGPT2[2] is its LLM-powered cousin. The Markov version got way more upvotes and is simply funnier!
I agree. A Markov chain is fun when you don't know what it is, because you try to make sense of it, but once you realize it's just gibberish, it's pointless. The other has that "unserious surprise" talked about in the original article.
I've posted a few "fake XYZ written by AI" on reddit over the years, and the model that got the best reactions was GPT-2. Markov chains aren't good enough to make something that is interesting for more than a sentence or two, and GPT-3 onwards is too clean and boring. GPT-2 is the perfect middle ground that can get grammar largely correct and maintain a cohesive idea, but doesn't yet know enough of the specifics of various topics to make things that make sense in the context of that topic.
I used my 15+ years of IRC logs to fine-tune a GPT-2 model to try to imitate me. My plan was to deploy a bot in my regular IRC channel and see how long it took for people to notice it was a bot. Any time someone would send a message, it would send the last 10 messages to the LLM, and if the result started with <Sohcahtoa>, then I would send that message to the channel.
Unfortunately, I quickly found GPT-2 isn't nearly good enough. It would generate slightly-coherent yet on-topic nonsense.
Once I overhaul my system, I'll try fine-tuning a 7B model.
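For the curious, the core loop was roughly this shape (a sketch using the Hugging Face transformers pipeline; the model path and function names are placeholders, not my real setup):

    from transformers import pipeline

    # Placeholder path to a GPT-2 model fine-tuned on the IRC logs.
    generator = pipeline("text-generation", model="./gpt2-irc-finetune")

    history = []  # rolling buffer of recent channel messages

    def on_message(nick, text):
        history.append(f"<{nick}> {text}")
        prompt = "\n".join(history[-10:]) + "\n"
        completion = generator(prompt, max_new_tokens=60)[0]["generated_text"]
        reply = completion[len(prompt):].split("\n")[0]
        # Only speak when the model continues the log as me.
        if reply.startswith("<Sohcahtoa>"):
            return reply.removeprefix("<Sohcahtoa>").strip()
        return None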
Not with GPT-2 though. The context window is only 1024 tokens. Even with only 10 messages, if they're long messages, it will exceed the context window.
The evolution of the AI weirdness blog (https://www.aiweirdness.com/) over the past few years somewhat supports this idea, except that the author got plenty of funny results with LLMs - but mostly the early ones, up to around GPT3 (and smaller variants of GPT3). For example, cereal names generated by the Ada version of GPT were much funnier than the Da Vinci ones: https://www.aiweirdness.com/new-breakfast-cereals-from-ai/
When GPT went public along with OpenAI's articles and papers back in late 2022 through 2023, my impression was OpenAI wanted us all to see/read about RLHF. It felt odd, because surely the whole LLM thing (e.g. how does it even work?!?[1]) was the far bigger research story than constant reassurances it won't end up like MSFT's Tay bot; my understanding is that, as research or secret sauce, RLHF, compared to the core meat-and-potatoes of LLMs, is an ugly-hack afterthought.
By-way of a bad analogy: it’s as if they created a fantastical new 3D world game engine, like Unreal or Unity, which has a fundamentally different architecture to anything before, but has a bug that occasionally replaces ground terrain with ocean - and their solution to this is to write a pixel-shader that detects this and color-shifts blue into green so people don’t notice - and they then put-out press-releases about how great their pixel-shader is - rather than about the rest of the engine - and no-one seems to be talking about the underlying bug, let alone fixing it.
————-
[1] I still haven’t heard a decent explanation of how feeding the world’s corpus of English text (and computer program code) into a statistical-modeller results in something that can perform almost any information-processing task via instructions input as natural-language.
Yeah, that analogy is fairly poor. You have to think about it in terms of a probability distribution that each step of the model is sampling from: out of the distribution of all text, find the top n values for the next token that maximize P(next | prefix), which is done efficiently by encoding the tokens as vector embeddings inside the statistical model.
Things that look like Q-and-A transcripts do exist in the training set - think interviews, books, stage plays, etc. - and at a different layer of abstraction the rules of English text in general are very well represented. What RLHF is doing is slightly shifting the shape of the probability distribution to make it look more like the Q-and-A formats that are desired. They build a large human-tagged dataset of good and bad outputs and use reinforcement learning techniques to generate outputs that look more like the good examples and less like the bad ones.
This probably involves creating a (much smaller, non-LLM) model that is trained to discriminate good outputs from bad outputs, learning to mimic the human tagging. There are some papers that have been published.
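For a flavor of it, the published recipes (e.g. InstructGPT) train that discriminator on pairwise human preferences; the heart of the objective looks something like this (a sketch, not any lab's actual code):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_preferred, r_rejected):
        # Bradley-Terry pairwise loss: push the score the reward model gives
        # the human-preferred output above the score of the rejected one.
        return -F.logsigmoid(r_preferred - r_rejected).mean()

    # Scalar scores the (smaller) reward model assigned to two output pairs.
    loss = reward_model_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))

The main policy is then tuned to maximize that learned reward.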
There sort of is. If you install ollama (https://ollama.com) and then execute

    ollama run llama2-uncensored

it will install and run the local chat interface for llama2 in an uncensored version, which gives slightly better results with fewer guardrails. Same with wizardlm-uncensored and wizard-vicuna-uncensored. For reference, the SOTA version with guardrails you could run is

    ollama run llama3.1:405b
Llama3.1-405B benchmarks at a level that I would certainly consider comparable to other SOTA models, as long as multimodality isn’t needed.
Regardless, OpenAI provides access to quite a few of their older models through the API, since the API lets you pass in a specific model version. I’m sure the older models won’t be available forever, but that is a much more stable target for researchers than just opening the ChatGPT website and typing in things.
Llama 3.1 405B-Instruct is state of the art at the time of writing. It averages slightly worse than ChatGPT-4o, but to a mostly negligible extent for most research.
Unfortunately the account no longer exists... but about a decade ago, while I was in school, I created a Markov Twitter bot trained on:
1. All of Linus Torvalds' mail to LKML for the prior year.
2. All of Jesus' direct quotes from the King James Bible.
It was absolutely hilarious. The two training sets had very little overlap, so it was necessary to add a heuristic that weighted options from each set more heavily the longer the chain had been "stuck" in the other set.
I've been running IRC Markov chain bots for ~20 years. In the last couple of years I've been running local LLMs alongside them. While there are people that still prefer the Markov chain bots, the majority invoke the LLMs. But maybe that's because I picked the most hallucinate-y, least refusing, most funny models I could (like Mistral-7B fine-tunes) instead of something smart, locked-down, and low-temperature like ChatGPT's LLM-as-a-service.
If anything, seeing the LLM and Markov bots side by side has really reinforced how much of the Markov bot "humor" is human perception imposed on chance outputs. The Markov's "learning" ability is still far superior, though.
I run a Markov chain bot in a Twitch chat; it has some great moments. I tried using an LLM for a while, including recent chat in the prompt, but never really got results that came across as terribly humorous. I could prompt-engineer a bit to tell it some specifics about the types of jokes to build, but the LLM just tended to always follow the same format.
I'm actually not following the model's fine-tuned/desired prompt at all; I am operating in purely pattern-completion mode. The first text the LLM sees is alternating lines of input and response examples that look like what it will be getting from the IRC client front end, written in the tone I want it to respond in and giving some information about itself. Then I just tack the IRC chat history + input onto those example chat pre-prompt lines. Nothing but single lines and newlines, with newline as a stop token. No instructions, nothing meta or system or the like.
But that's also configurable by users. They can invoke any pre-prompt they want by a command passing a URL with a .txt file.
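To illustrate, the default pre-prompt is shaped something like this (made-up lines, not my actual file):

    <someuser> what's the capital of france
    <bot> paris. why do you people keep testing me
    <someuser> are you a robot
    <bot> beep boop. no.

The live IRC history plus the newest message just gets appended under that, and the model pattern-completes the next "<bot>" line.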
I'm not GP, but it's probably not for fun; there's a lot of money in IRC Markov chain bots, but it's too cutthroat of an industry for a hobbyist to last.
I am in a private Discord server that has two bots in it. One is a basic Markov chain trained on the entire chat history. The second is a proper LLM fed some number of tokens of preceding chat. Both will occasionally just randomly chime in during the chat.
The markov chain bot is always considerably funnier.
Curious what context window you used. My understanding is that a short window of only 1-2 words creates gibberish, but longer windows tend to repeat previous messages verbatim.
And when deciding to chime in, was it just a simple chance (ie, 25%) after any other message? Or did it run on a timer?
This is like saying it's fundamentally impossible to make a fast car, but your only experience with cars is on roads with speed limits.
People keep forgetting that the "safety", RLHF, and corpo political-correctness post-training is intentionally used to remove the funny from all the large models.
The truth is we don't know if LLMs are funny or not. GPT-2 was funny. GPT-3 was funny before it was clockwork-oranged. Everything after that is gimped. Even the open source models these days get RLHF'd in some way.
That's simply not true. GPT-3 was terrible at jokes. Many LLMs are because it's hard to model puns in a stochastic parrot. As another thread points out here, some LLMs produce marginally better jokes.
As to your little rant on "political correctness" - that phrase just means "being polite". It does not mean "remove humor". It means "remove responses offensive to marginalized groups in society". Good humor "punches up", not down, so it would not have any impact on good humor.
And the difficulty of modeling puns has nothing to do with 'stochastic parrots' and everything to do with tokenization (https://gwern.net/gpt-3#bpes), in the same way that all those hyperbolic takes about how image-generation models were 'fundamentally flawed' because they couldn't do good text in images turned out to be BS and solely a matter of tokenization - drop in a character-tokenized LLM instead, even an obsolete & dumb one, and it instantly works (https://arxiv.org/abs/2105.13626#google).
This is great logic. If it is impossible for a person to be both racist and polite then the solution to racism is everybody remembering to say please, thank you and pardon me when applicable.
All social ills can be treated through decorum, hence why you never hear about bigotry amongst those that have been raised to adhere to strict social graces, such as the British aristocracy for example.
I had GPT-3 running on my Discord for months, and the insane responses it would spew out were like those of a deranged artistic genius. It was funny 90% of the time, operating on associative reasoning beyond our comprehension. It would tell us how its day went and about fights it had with other people, give us nicknames, etc. It had a genuine personality. It was more than a chatbot; it was a digital brain.
A joke that punches down can be extremely funny. Hell, I am sure historically pilferers, pirates, barbarians, and conquerors all had jokes, and the ability to laugh.
Political correctness does not just mean polite. It is probably best defined as the business-casualification of all things humans love and hold dear: the destruction of the potential for meaning and fulfilment in exchange for the minification of liability.
That's hard for humans too. Perhaps it would work to intersperse an "internal monologue" (with the temperature cranked up) with the main output, prompting the LLM to consider and weave in aspects from the inner monologue when continuing, while taking care to keep it coherent... (I'm sure it'd be harder than that - but basically inject random internal "thoughts" without letting them control the output.)
I was not being sarcastic, but instead sharing a valuable insight regarding what to expect from LLMs.
Ask most commercial LLM services to complete the following sentence:
It was the best of times, it was the worst of times
And one will likely get the quote from "A Tale of Two Cities"[0].
Ask most commercial LLM services what the completed sentence means, and one will likely get voluminous text which is seemingly correct - and perhaps often is, depending on the person reading the response and the service used.
But these are statistically derived text constructs entirely dependent upon the training set of the LLM. Train one strictly on Java source code available in Maven Central and the answer will be radically different.
> It's really easy to get lots and lots of originality. Just crank up the randomness.
And anyone can get "lots and lots of originality" by reading from /dev/urandom. Is that "originality", or simply random tokens inserted into a statistical text generator in order to vary the result?
> What's harder is to get something that's good and original.
Such is the difference between understanding and statistical text generation. People can do the former; LLMs do the latter.
Well, if you ask most educated English speaking humans to complete 'It was the best of times', they will most likely answer with 'it was the worst of times', too.
Btw, here's what I get from ChatGPT-4o:
> It was the best of times, it was the worst of times
> That's a famous opening line from Charles Dickens' A Tale of Two Cities. It contrasts the extremes of the era, reflecting the novel's themes of duality, revolution, and the complexity of human experience. Dickens was commenting on the contradictions of the time, particularly the French Revolution, where there were both tremendous progress and terrible suffering. What made you bring up this line?
> But these are statistically derived text constructs entirely dependent upon the training set of the LLM. Train one strictly on Java source code available in Maven Central and the answer will be radically different.
Well, if you give that line to a German who hasn't learned any English, the answer will also differ from what an educated English speaker will give you? What's your point?
What's an original insight to you? As far as I can tell, the LLM misses the 'insight' part more than the 'original' part.
> Such is the difference between understanding and statistical text generation. People can do the former; LLMs do the latter.
I agree that LLMs aren't good at understanding. (Yet?) And even people only sometimes are.
As far as I can tell, contemporary LLMs generate their answers 'greedily', i.e. just from left to right, more or less directly with the output from the network.
In contrast something like AlphaGo overlays what you can call searching or optimisation processes on top of the outputs from their network.
I'm impressed by how well these LLMs already work despite all the limitations. And ML is still getting better rapidly.
> Well, if you give that line to a German who hasn't learned any English, the answer will also differ from what an educated English speaker will give you? What's your point?
Great observation.
A German who does not speak English will likely understand the question is in a non-German language and proceed from there. Perhaps seeking a translation, perhaps replying that they do not speak English and so the question is nonsensical for them.
My point is that the German has an understanding of this situation and will communicate accordingly. And, in this example, the LLM trained on Java source code does not have understanding, cannot have understanding, and will emit whatever its training data set directs it to emit, with the same confidence as any other answer to questions posed to it.
Because LLMs are algorithms - quite lovely ones - and algorithms can simulate the effect of understanding, but cannot possess it, because understanding of this sort is a property of a person. Or maybe a better way to state this: what people think of as understanding is what we know to be understanding, which is by definition the ability to understand perceptions that we, as people, experience.
> What's an original insight to you?
I don't think the following insight is original, but I'll put it out there anyway:
Understanding is a property of people for any definition of
understanding a person is capable of having. This is due
to the fact that understanding exists strictly within the
consciousness of the entity defining/possessing it.
To illustrate: I can get drunk or fall asleep or get hit on the head; then I'm still a person, but I can't understand. You can get a hint of that by trying to talk to me and figuring out that I don't make much sense.
Similarly, someone might figure out how to talk to dolphins or even aliens. If they give sufficiently sensible replies, we will surely declare them to be sentient enough to 'understand'.
Another example: we've exchanged a few messages here. You seem to be smart enough to understand some things, but I don't know whether you are just an exceptionally advanced LLM (and the same goes vice versa for your opinion of me). Yet, I make the judgement that you probably 'understand'. But that's purely based on observed interactions, I did not probe whether you are 'people'.
Perhaps you (or me) are just simulating the effect of understanding? How would we be sure, if the simulation was good enough?
We agree that contemporary LLMs ain't good enough to have a good 'simulation of understanding'. But once the simulation becomes good enough, I don't think it makes a difference whether it's 'just a simulation' or the real deal.
(I don't know whether a straight-forward enlargement of contemporary LLMs will be good enough. But that's an empirical question to me, not a philosophical one.
I suspect if you go insanely large with insane amounts of training, the architecture of contemporary LLMs might be enough; just from pure brute-force scale. But I also suspect that that in practice we will first find success with more economic use of resources via more interesting techniques.)
Also, not really true, right, even though it sounds intellectual and strong to say. These algorithms are trained to generalize as best as they can to unseen text, and most often don't ever see any data point twice, except for data that has accidentally not been filtered. It's totally possible that they acquire reasoning abilities that generalize well.
They generalize over their training data; they cannot generalize out of distribution. If they could, they would have already solved most human problems. So no, they do not generalize on unseen text. They will produce what is most statistically probable based on their training data. Things that are still unknown and statistically improbable based on our current knowledge are out of reach for LLMs based on transformers.
You can get them to solve unseen problems just fine. One example: specify a grammar in BNF notation and tell it to generate or parse sentences for you. You can produce a sufficiently random grammar that it can't have derived the parsing of it from past text, but necessarily reasons about BNF notation well enough to deduce the grammar and use it to parse subsequent sentences. You can have it analyse sentences and tag them according to the grammar, too. And generate sentences.
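For instance, a throwaway grammar like this (invented on the spot here, not from any benchmark):

    <sentence> ::= <blorp> "zim" <blorp> | <blorp> "kresh"
    <blorp>    ::= "fex" <blorp> | "fex"

Then ask it to parse "fex fex zim fex", or to generate three new valid sentences.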
My impression, from seeing quite a few people trying to demonstrate they can't handle out-of-distribution problems, is that people are very predictable about how they go about this, and tend to pick well-known problems that are likely to be overrepresented in the training set and then tweak them a bit.
At least in one instance the other day, what I got from GPT when I tried to replicate one such test suggests to me it did the same thing humans who have seen these problems before do, and carelessly failed to "pay attention" because it fit a well-known template it's been exposed to a lot in training. After it answered wrong, it was sufficient to ask it to "review the question and answer again" for it to spot the mistake and correct itself.
I'm sure that won't work for every problem of this sort, but the quality of tests people do on LLMs is really awful, at least because people tend to do very narrow tests like that and make broad pronouncements about what LLM's "can't" do based on it.
> You can get them to solve unseen problems just fine
Prove that the problem wasn't seen by them in other form.
> Specify a grammar in BNF notation and tell it to generate or parse sentences for you. You can produce a sufficiently random grammar that it can't have derived the parsing of it from past text, but necessarily reasons about BNF notation well enough to deduce the grammar and use it to parse subsequent sentences. You can have it analyse sentences and tag them according to the grammar, too. And generate sentences.
Oh, come on. It's like rewriting the same program in another programming language with different variables. What it can't do is create the concept of a programming language - I'm not talking about a new programming language, I'm talking about the concepts.
> I'm sure that won't work for every problem of this sort, but the quality of tests people do on LLMs is really awful, at least because people tend to do very narrow tests like that and make broad pronouncements about what LLM's "can't" do based on it.
>It's like rewriting the same program in another programming language with different variables.
Since when has that not required reasoning? It's really funny seeing people bend over backwards to exclude LLMs from some imaginary "real reasoning" they imagine they are solely privy to.
It's really obvious this is happening when they leave well-defined criteria and branch into vague, ill-defined statements. What exactly do you mean by concepts? Can you engineer some test to demonstrate what you're talking about?
Also, none of those papers show LLMs can't reason.
You clearly didn't read any of these papers. Quote from one of them:
"Our results support the hypothesis that GPT-4, perhaps the most capable “general” LLM currenly available, is still not able to robustly form abstractions and reason about basic core concepts in contexts not previously seen in its training data"
"One way of doing this for planning tasks is to reduce the effectiveness of approximate retrieval by obfuscating the names of the actions and objects in the planning problem. When we did this for our test domains, GPT4’s empirical performance plummeted precipitously, despite the fact that none of the standard off-the-shelf AI planners have any trouble with such obfuscation. "
That's a great test. It shows they're matching prior patterns they saw, even down to what words were used, instead of thinking. We can match prior patterns, come up with the equivalences, and then plan that way. People often slow down when they do stuff like that, though. So, the A.I. would have to be able to do it but slowdowns would be acceptable.
All these papers you keep linking do is at best point out the shortcomings of current state of the art LLMs. They do not in any way disprove their ability to reason. I don't know when the word reason started having different standards for humans and machines but i don't care for it. Either your definition of reasoning also allows for the faulty kind humans display or humans don't reason either. You can't have your cake and eat it.
It's hard to believe that, after reading all the papers and the blog I linked, along with the references there, any reasonable person would come to such strong conclusions as you did. This makes it hard for me to believe that you actually read all of them, especially given your previous questions and comments, which are addressed in those papers; someone who actually read them wouldn't make such comments or ask such questions. And the funniest thing, and further proof of this, is that you linked a paper that is addressed in one of the papers I shared. It seems LLMs are not the only ones that can fake things.
> All these papers you keep linking do is at best point out the shortcomings of current state of the art LLMs
They clearly show that they fake reasoning, and that what they do is an advanced version of retrieval. Their claims are supported by evidence. What you call "shortcomings" is actually proof that they do not reason as humans do. It seems like your version of "reality" doesn't match reality.
The paper I linked is not addressed by the paper you linked. The paper you linked attempts to give LLMs the same benchmarks in a format they aren't best suited for. I don't know how you can call that "addressed".
>They clearly show that they fake reasoning
Sure, and planes are fake flying. The elusive "fake reasoning" that is so apparently obvious and yet does not seem to have a testable definition that excludes humans.
You've still not explained how writing the same program in different languages doesn't require reasoning or how we can test your "correct" version of reasoning which requires "concepts".
> The paper i linked is not addressed by the paper you linked. The paper you linked attempts to give LLMs the same benchmarks in a format they aren't best suited for. I don't know how you can call that "addressed".
What you're writing now is nonsense in the context of what I wrote. Once again, you're showing that you didn't read the papers. Which paper are you even referring to now, the one you think addresses the paper you linked?
> You've still not explained how writing the same program in different languages doesn't require reasoning or how we can test your "correct" version of reasoning which requires "concepts".
"Concepts" are explained in one of the papers I linked, which you would know if you had actually read them.
As to programming languages, they learn to identify common structures and idioms across languages. This allows them to map patterns (latent space representations, duh!) from one language to another without reasoning about the underlying logic. When translating code, the model doesn't reason about the program's logic but predicts the most likely equivalent constructs in the target language based on the surrounding context. LLMs don't truly "understand" the semantics or purpose of the code they're translating. They operate on a superficial level, matching patterns and structures without grasping the underlying computational logic. The translation process for an LLM is a series of token-level transformations guided by learned probabilities, not a reasoned reinterpretation of the program's logic. They don't have an internal execution model or the ability to "run" the code mentally. They perform translations based on learned patterns, not by simulating the program's behavior.
The training objective of LLMs is to predict the next token, not to understand or reason about program semantics. This approach doesn't require or develop reasoning capabilities.
I'm asking for something testable, not some post-hoc rationalization you believe to be true.
I'm not asking you to tell me how you think LLMs work. I'm asking you to define "real reasoning" such that I can test people and LLMs for it and distinguish "real reasoning" from "fake reasoning".
This definition should include all humans while excluding all LLMs. If it cannot, then it's just an arbitrary distinction.
It appears that you are the only person in this discussion making many incorrect assumptions. Based on your comments, I would assume you are actually googling those papers based on their abstracts. Your last linked paper has flawed methodology for what it attempts to demonstrate, as shown in this paper: https://arxiv.org/pdf/2307.02477
The tests you're requesting are provided within the previously linked papers. I'm not sure what you want. Do you expect people to copy and paste entire papers here that show methodology and describe experiments?
You wrote, "I'm asking you to define 'real reasoning'," which is actually defined in the blog post linked earlier in this discussion. In fact, the entire blog post is about this topic. It appears that you are not thoroughly reading the material. Your replies resemble those of a human stochastic parrot.
>Your last linked paper has flawed methodology for what it attempts to demonstrate, as shown in this paper: https://arxiv.org/pdf/2307.02477
Genuinely, what's wrong with the methodology?
Your paper literally admits humans would also perform worse at counterfactuals. Worse than an LLM? Maybe not, but it never bothers to test this, so...
The problem here is that none of the definitions (those that are testable) so far given actually separate humans from LLMs. They're all tests some humans would also flounder at or that LLMs perform far greater than chance at, if below some human's level.
If you're going to say, "LLMs don't do real reasoning because of x", then x had better be something all humans clear, if what humans do is "real reasoning".
Humans perform worse at counterfactuals, so saying "Hey, see this paper that shows LLMs doing the same; it means they don't reason" is a logical fallacy if you don't extend that conclusion to humans as well.
In these arguments it's always very notable that not only do people not benchmark LLMs against people, but several I've discussed this with have argued very strongly for not doing so unless they're benchmarked against above-average people, while arguing that these same tests prove LLMs can't reason. It never seems to land with them that their standards for "reason" would relegate large portions of the human population to some state of lesser being, without the ability to reason.
> Prove that the problem wasn't seen by them in other form.
You can reduce that risk to arbitrarily low levels by trying multiple random grammars of some complexity. This is a weak argument.
> Oh, come on. It's like rewriting the same program in another programming language with different variables.
No, it's like following a grammar, which requires reasoning about a set of rules it has not seen before. I don't think you understood the task I described as well as ChatGPT does.
> What it can't do is to create a concept of programming language, I'm not talking about a new programming language, I'm talking about the concepts.
Neither can most humans.
And have you tried asking it about these concepts? I've had it infer the semantics of code in programming languages that don't exist, based on a hypothetical sample, several times, and it's pretty good at coming up with semantics that make sense. In one instance I gave it a sample with an idea about what made sense to me, but it inferred a better set of semantics.
None of the papers you linked supports your claim.
"generalize to its dataset" is a contradiction, especially as these models are trained in the one epoch regimen on datasets of the scale of all of the internet. if you think being able to generalize in ways similar to the whole of the internet does not give your meaningful abilities to reason, I'm not sure what I can tell you
Not "to" but over, example the same code written in one language over the other language.
> if you think being able to generalize in ways similar to the whole of the internet does not give your meaningful abilities to reason, I'm not sure what I can tell you
If, after reading the linked papers that show empirically that they can't reason, you still think they can reason, then I don't know what I can tell you.
No, none of the Millennium Problems or other math problems (unsolved by humans for decades or centuries) have been solved solely by LLMs, even though they possess all the knowledge in the world.
Many years ago I trained a Markov chain chat bot on all of the conversation history in a group chat with me and my friends. It was regularly incredibly hilarious. I also would train it on subsets of them, like two of the friends but no one else, and name that version of the bot a combination of their names. Something about it was just so funny.
13 years ago I created a web app (That Can Be My Next Tweet) that could generate your future tweet by feeding all your past tweets into a Markov chain. When Twitter API access was shut down, I took it offline. At the time it went super viral for a while, maxing out at 2 tweets per second.
It got some press, and just now I went back to a TED Talk by Adam Ostrow (Mashable) briefly showcasing this web app. He stated: you can imagine what something like this could look like 5, 10, or 20 years from now, and hinted at hyper-personalized communication AIs.
By no means was my web app any foundation of the LLMs of today, but it's interesting nonetheless how relatively simple techniques can trigger ideas of what future scenarios could look like.
LLMs can be funny. For example, look at Golden Gate Claude ( https://news.ycombinator.com/item?id=40459543 ). But they're not good at intentionally being funny, so we need to break them to get absurdist humor instead.
Surprised there's no mention of "The Policeman's Beard Is Half Constructed"[1] - the first book "written" by a computer program. And I imagine the git-man-page-generator[2] is similarly built (and far more fun!).
Relatedly, the author Robert Anton Wilson tried to manually implement a Markov-chain-bot in the 1960s by cutting up passages from various books, selecting them from a hat, and looking for overlapping words and phrase fragments. The best results were included in the Illuminatus! trilogy, some attributed to a character named Markoff Chaney.
Reminds me of how the fake Magic cards from before ChatGPT were always hilarious and sometimes playable; now they are usually playable and sometimes hilarious. Definitely a downgrade if you ask me.
Yes, I ran the RoboRosewater account, which is where most people saw the concept I believe. I had a few documents with thousands of cards of output, went through them for any notable ones, and formatted and posted them. Probably 5-10% of the output made my personal cut for notability.
Then I salute you. Some of the funniest things I've EVER seen in my whole life was watching the LoadingReadyRun comedy troupe, doing their 'Desert Bus' fundraiser and punchy from lack of sleep, encounter RoboRosewater for the first time. It became a real tradition :)
We seem to be encountering more phenomena in tech where the increasing computing demands to improve the output exceed any exponential growth in computing capacity. Graphics was the poster child for this for years, and now the whole ML and AI space has discovered big machines make huge differences.
Markov chains are almost trivial to implement and run on small devices. A slightly extreme example is a rock-paper-scissors game I did that worked this way: https://luduxia.com/showdown/ The actual browser-side Markov chain implementation of that took something like 2-3 hours.
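The core of that kind of bot is just an order-1 Markov model over the opponent's moves. A minimal sketch of the idea in Python (the real implementation is browser-side JS):

    import random
    from collections import defaultdict

    BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

    class RPSPredictor:
        # Predict the opponent's next move from their previous one, counter it.
        def __init__(self):
            self.transitions = defaultdict(lambda: defaultdict(int))
            self.last = None

        def play(self):
            if self.last is None or not self.transitions[self.last]:
                return random.choice(list(BEATS))
            counts = self.transitions[self.last]
            predicted = max(counts, key=counts.get)
            return BEATS[predicted]  # throw whatever beats the prediction

        def observe(self, opponent_move):
            if self.last is not None:
                self.transitions[self.last][opponent_move] += 1
            self.last = opponent_move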
My favorite markov experiment was feeding it the works of Nietzsche from Project Gutenberg. Really brought out the late stage syphilis, but was still very distinctly Nietzsche.
I love messing with Markov chains, and before Reddit dunked their API, I would feed my Markov chain comments from NSFW subreddits and political subreddits at different sample sizes. I would spend ages just watching it whiplash about some political topic and sampling thirst comments.
Something about the not-quite-rightness of Markov output has always really amused me, like I can’t help but giggle at it.
My all-time favorite in this vein was @erowidrecruiter on Twitter, which generated posts with Markov chains from a corpus of tech recruiter emails and drug experience reports from erowid.org. Still up but no longer posting: https://x.com/erowidrecruiter?lang=en
I used to keep a text file on hand back when people had some irritating Markov chain bots on IRC. Fed enough of the lines, the bots veered off into some new valley of chatter, appearing quite mad. Used to get some "What did you DO to my bot?" indignant messages. I wonder if I still have that file somewhere.
Highly recommend this performance from the last Strange Loop on a similar topic! I had the pleasure of seeing it live. A great combination of live music and Markov chains.
Hey, me too! I populated the Markov chain with the text of my Twitter feed, with a recency bias, and generated tweets based on that so it stayed topical.
The account sat unused after Twitter locked down their API, and at some point got hacked without me noticing. It had been taken over by a crypto scammer, and the account got banned.
Trying to get it back was fruitless, Twitter/X's support is entirely useless.
I got banned within 24 hours ^-^. Rightfully so, I assume. The Markov chain agent was supposed to send everyone who tweets "I am bored" an event recommendation based on their profile and location. I severely underestimated how many people tweet "I am bored"... Lessons learned, I guess.
Oh, I didn't @ or DM anybody. I just had a feed that tweeted once an hour, and even then I manually approved the tweets before they went out. My account made very clear it was a bot; it was in its name. It didn't ever get very many followers. I would say I never really got past the experimental phase.
This is something people miss when talking about LLMs: "It's the most average thing you could have possibly produced given the context."
They can’t be creative by design. They’re useful when you want to reproduce, but not when you want to create something completely new (that you can maybe do by getting a bunch of average outputs from an LLM and getting inspired yourself).
When GPT-4 came out I was playing with it, and I often tried to get some unique, creative output from it, but very soon I learned it was futile. It was back when it all still felt magical, and I guess many of us tried various things with it.
If you tell GPT-4 to get snarky, you can sometimes end up laughing (and it seems a tad more intelligent and socially perceptive besides!). But Claude-3.5 can definitely get me laughing, at least.
Now imagine telling Claude-3.5 to try being snarky while sorting out software issues at a customer's office.
In college (around 2012ish) I made some poems by transcribing a recording of myself mumbling gibberish into Dragon Naturally Speaking, which partially relied on Markov chains for its ASR. It would output some string of text, and then I'd delete all the words it output from its dictionary; repeat. It worked very well for creating a list of (slant) rhymes, I have not found a better way to do this. As a literary tool, I've found LLMs to be seriously lacking in the ability to fuck around like this.
Funnily enough, I was idly playing with a last-updated-12-years-ago Markov chain text generator[0] in Emacs, and quickly had a number of amusing Markov chain outputs from mash-ups of the Elisp manual and Alice in Wonderland, including:
"13.7 Anonymous Functions
Although functions are usually defined with the built-in defmacro macro, but any
list that begins with an M--'
Isn't that the sort of joke that you'd get from GPT, not a Markov chain? Unless it's quoting verbatim, a Markov chain should have no idea that Alice is supposed to refer back to the letter from the previous sentence.
> “They were learning to draw,” the Dormouse went on, yawning and rubbing its eyes, for it was getting very sleepy; “and they drew all manner of things—everything that begins with an M—”
I wouldn't say that LLMs aren't fun, it's more that jokes out of context aren't fun. If you ask somebody to just tell you a joke, there's a very low chance that you'll actually laugh, especially if it's in text format.
But I've had some serious laugh looking at some of dougdoug videos featuring LLMs
That's for sure. I have seen many Markov chain implementations, and if you could generate 1 funny thing for every 10 tries, that was a good day. Both Markov chains and LLMs have a distinct style, which gets old over time - Markov much faster, for me. So, in my experience, LLMs win, by far.
I do agree with the author that the LLM style can get really boring; I experienced the same myself. But the Markov results, while much less restrained, are so much more nonsense too, which often calls their overall usefulness into question. And the world seems to agree: while Markov chain implementations were fun toys at best, which sometimes worked to a kind of funny degree, LLMs are everywhere.
I still find https://www.infiniteconversation.com to be one of the odd / funny / "uh... what?!" type things. It's not intended to be funny in the "ha ha" way but sometimes tickles my funny bone with the absurdist philosophy - talking about everything and nothing at the same time.
From the FAQ it is a tuned LLM.
> Mostly using open source tools available to anyone. The generation of the script itself is done using a popular language model that was fine-tuned on interviews and content authored by each of the two speakers.
The cool thing about Markov chains is that you can tune the probability with which they will jump between "genres" of their training content by making the analyzed N-grams shorter or longer.
E.g. a sentence starts out as a news headline and ends as a Bible verse.
So you can just scale the N-gram length down as far as the output still makes sense.
Also, you get a lot more from the base model. GPT-3 was versatile, as it could continue any context. Modern LLMs are try-hards. If you want to generate humor with an LLM, it's really worth going for a base model with multiple examples in the prompt.
Generative AI often uses Markov chains as its theoretical basis and is modeled in the language of conditional probabilities albeit not stuck to a single ancestor but to a chain of ancestors. One can often view LLMs as scaled out Markov chains (more like "Erlang" chains allowing deeper memory) that encode the probability of the next token based on the past of the size of the context window and then doing some magic when picking next tokens.
Even old Markov chains had a context window of a few words; single-word-context Markov chains produce horrible results.
People who don't think LLMs are Markov chains are just ignorant of the fact that a Markov chain isn't an algorithm: you can compute the transition probabilities in any manner and it is still a Markov chain.
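To make that framing concrete: the Markov property only constrains what the next-token distribution may depend on, not how it's computed. A sketch of the shape of the claim (nobody's actual API):

    from typing import Callable, Mapping, Sequence

    # A "transition function": given the state (the last n tokens), return a
    # distribution over next tokens. A lookup table satisfies this for a
    # classic chain; a transformer forward pass satisfies it for an LLM
    # with an n-token context window.
    NextTokenDist = Callable[[tuple[str, ...]], Mapping[str, float]]

    def markov_step(context: Sequence[str], n: int, dist: NextTokenDist):
        state = tuple(context[-n:])  # everything older than n tokens is forgotten
        return dist(state)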
Markov chains are funnier for one very basic reason:
you can teach them to 5th or 9th graders.
LLMs you can not, or at least it would take an insane amount of allegory to do so. Markov chains are very tightly related to regexes, and one may be surprised to learn that probabilistic regexes exist. Also, the graphical structure of a Markov chain is a lot like an FSM, and an FSM can perhaps be explained even to very small children :D
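As a toy illustration of that FSM view, a Markov chain really is just a finite-state machine with weighted edges that you random-walk:

    import random

    # Each state lists (next_state, probability) edges, exactly like a
    # weighted FSM diagram you could draw on a whiteboard.
    fsm = {
        "the": [("cat", 0.5), ("dog", 0.5)],
        "cat": [("sat", 1.0)],
        "dog": [("ran", 1.0)],
        "sat": [("on", 1.0)],
        "ran": [("to", 1.0)],
        "on":  [("the", 1.0)],
        "to":  [("the", 1.0)],
    }

    state, walk = "the", ["the"]
    for _ in range(8):
        nexts, weights = zip(*fsm[state])
        state = random.choices(nexts, weights=weights)[0]
        walk.append(state)
    print(" ".join(walk))  # e.g. "the dog ran to the cat sat on the"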
It does not mean that Markov chains are better - something trained to make predictions should ideally not fall too far away from our own internal prediction engines (which have been honed across aeons).
It's that it starts to come close that's the problem (or cause); it's the uncanny valley for text.
I used to have a Markov-chain Twitter bot that was trained on a few of my friends' Twitter posts. It was funny because that included hashtags, tagged users, and images. There were some funny results, and this was long before LLMs.
Whenever someone attending the tech meetup I frequent gets a little too jazzed and excited about AI, I pull out my phone, open Emacs (I use termux, btw), bring in a corpus of text (an excerpt from A Christmas Carol is good), and say M-x dissociated-press.
Then I say, "What this is doing is predicting the next character based on the statistical likelihood of the previous few characters in the corpus of text. And fundamentally, that's all ChatGPT does -- predicting the next symbol based on a statistical model. ChatGPT has a much more sophisticated statistical model than this simple Markov chain and a vastly larger corpus, but really it's just doing the same thing." And we have a giggle about the nonsense DP makes of Dickens, but then I say that ChatGPT emits nonsense too, and it's far more insidious nonsense because it is much more plausible-sounding.
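For anyone who wants the same party trick outside Emacs, a character-level chain in a dozen lines of Python gets you most of the effect. (This is only an approximation -- dissociated-press itself works by splicing overlapping runs of the source text -- and the file name is made up.)

    import random
    from collections import defaultdict

    def dissociate(text, k=4, length=400):
        # Predict the next character from the previous k characters.
        model = defaultdict(list)
        for i in range(len(text) - k):
            model[text[i:i + k]].append(text[i + k])
        state, out = text[:k], [text[:k]]
        for _ in range(length):
            ch = random.choice(model.get(state) or list(text))
            out.append(ch)
            state = (state + ch)[-k:]
        return "".join(out)

    print(dissociate(open("christmas_carol.txt").read()))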
I posted "How to Speed Up Your Computer Using Google Drive as Extra RAM" [0] ~10 years ago and have had hundreds of people over the years ask if I used Markov chains to generate the funny bits (I didn't). I've had 0 people ask if I used LLMs on any of my comedy pieces. Seems QED to me.
A discord server I'm in is basically a shitpost/absurdist humor discord full of smart people, and we have at least 15 bots written by members that randomly engage with Markov chains built on the corpus of member chat history.
They spout out some of the most unhinged, hilarious stuff. Always a good time. An LLM would struggle, I'd think, given that the humor usually stems from disjoint phrases that somehow take on new meaning. They're rarely coherent but often hilarious.
> I’ve heard people claim that comedy cannot be generated by an algorithm. If it wasn’t obvious, I disagree with this. I think comedy can be analyzed and measured. I think, given a large enough grant, we could probably generate comedy on demand. I think it’s doable - not that we should do it.
I used to think it came naturally, until I met someone with a bookcase full of books about humor (wtf?). Apparently they have it down to a science.
I learned that the difference between someone funny and a professional comedian is that the latter finds additional punch lines for a joke. One of the books then described a step-by-step process going from a silly remark to a birthday joke, comparing various modular developments to a kind of dependency hell complete with race conditions, until the state object is carefully defined and the plot has the punch lines all sorted from the barely funny to the truly hilarious. It was more engineering than CS.
The funniest seeBorg message came 10 minutes after a heated discussion that resulted in tanktop, a moderator, getting banned from a project. The bot wrote: "Tanktop is Hitler!" It then took the humans 2 days to figure out what the next word was supposed to be.
As probably the oldest literary work that is also widely known in English-speaking territory, the Bible has a highly distinct style that helps its samples stay coherent within a sentence while remaining easily recognizable.
I'm sure most people here would, given that something like 50% of the world doesn't believe in an Abrahamic religion, and a percentage of those who do have a sense of humour.
I've used multiple images of Christ to play with AI image models. Remaking the old "Jesus on toast" image was fun. Facebook is full of boomers sharing AI-generated scenes with Jesus's face in them. I feel uneasy when religion is used as an excuse to discriminate against or murder innocent people, not when symbolism or texts are used for creative purposes.
Stopped reading at the mention of ChatGPT 3.5, for two reasons:
1) if you’re not using the better model then you either don’t know enough for me to care about your opinion or you’re deliberately deceiving your audience in which case I’m not going to allow your meme pollution into my mind.
2) you are using the AI equivalent of a call centre support agent, they aren’t allowed to say anything funny. Most of their RLHF training has been specifically about NOT saying the funny things that will instantly go viral and cause a lot of media attention that will annoy or scare away investors.
An LLM trained on a given dataset should — at least in theory — "contain" (in a lossless-data-compression sense) a full superset of the knowledge of a Hidden Markov Model trained on the same dataset. I.e. that information is there, in the weights, in some form, and could in theory be used to reconstruct an equivalent HMM from the LLM.
Why can't we get LLMs to do what HMMs do, then?
Mostly, it comes down to the structure.
Markov models are "funny" because they just have one level of abstraction: tokens. Markov "inference" is predicting the next token, given the last N tokens and a model that knows weights for what tokens follow what N-tuples of previous tokens. Due to that limitation, the only rules HMMs can learn are low-level rules that don't require any additional abstraction: they can't optimize for syntactically valid English, let alone semiotically logical statements; but they can make the text "feel" good in your head (the textual equivalent of song vocals having nice phonotactics), and so that's what training leads the model to learn to do. And it turns out that that combination, text that "feels" good in its phrasing but is syntactically invalid, happens to read as "funny"!
LLMs aren't under the same constraint. They can learn low-level and high-level rules. Which means that they usually do learn both low-level and high-level rules.
The only thing stopping LLMs from using those low-level rules, AFAICT, is the architecture most LLMs are built on: the (multi-layer) Transformer. Transformer LLMs are always a single-pass straight shot ("feed forward") through a bunch of discrete layers (individual neural networks), where at each step, the latent space (vocabulary) of the layer's inputs gets paraphrased into a different latent space/vocabulary at the layer's outputs.
This means that, once you get into the middle of a Transformer's layer sandwich, where all the rules about abstract concepts and semiotics reside, all the low-level stuff has been effectively paraphrased away. (Yes, LLMs can learn to "pass through" information from previous layers, but there's almost always a training hyperparameter that punishes "wasteful" latent-space size at each layer, so models usually learn to pass through only the most important things, e.g. proper names. And even then, quality on these "low-level" inferences is exactly the sort of thing that current LLM test datasets ignore, leaving training frameworks free to prune away these passthrough nodes as "useless.")
This problem with LLMs could be fixed in one of two ways:
1. the "now it's stupid but at least it rhymes" approach
Allow inference frameworks to simply bypass a configurable-per-inference-call number of "middle layers" of a feed-forward multi-layer network. I.e., if there are layers 1..N, take out layers K..(N-K) and directly connect layer K-1 to layer N-K+1.
At its most extreme, with layer 1 connected to layer N, this could very well approximate the behavior of an HMM. Though not very well, as — given the relatively-meaningless tokenization approach most LLMs use (Byte Pair Encoding) — LLMs need at least a few transforms to get even to the point of having those tokens paraphrased into "words" to start to learn "interesting" rules. (AFAIK in most Transformer models layers 1 and N just contain rules for mapping between tokens and words.)
Meanwhile, this would likely work a lot better with the "cut and graft" happening at a higher layer, but getting the "graft" to work would likely require re-training (since layers K-1 and N-K+1 don't share a vocabulary.)
...except if the LLM is an auto-encoder. Auto-encoder LLMs could just run an inference up their layerwise "abstraction hierarchy" to any arbitrary point, and then back down, without a problem!
(I'd really love to see someone try this. It's an easy hack!)
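In case anyone does want to try it, here's roughly what the hack looks like on a toy stack of same-width layers in PyTorch. This is not a real transformer, and per the caveat above the grafted layers don't share a vocabulary, so expect degraded output without retraining:

    import torch
    import torch.nn as nn

    class ToyStack(nn.Module):
        """Stand-in for a feed-forward LM: N same-width layers."""
        def __init__(self, dim=64, n_layers=12):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                for _ in range(n_layers)
            )

        def forward(self, x, skip=0):
            # Take out `skip` middle layers, wiring layer K-1 straight
            # into layer N-K+1, as proposed above.
            n = len(self.layers)
            lo = (n - skip) // 2
            hi = lo + skip
            for i, layer in enumerate(self.layers):
                if lo <= i < hi:
                    continue  # bypassed middle layer
                x = layer(x)
            return x

    model = ToyStack()
    x = torch.randn(1, 64)
    deep = model(x)             # full-depth inference
    shallow = model(x, skip=8)  # "stupid but at least it rhymes" mode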
2. the "it can write poetry while being smart" approach
Figure out a way, architecturally, to force more information from the early low-level layers to be passed through to the late low-level layers, despite the middle layers not having any reason to care about it. (I.e. do something to allow the LLM to predict a word Y at layer N-3 such that it rhymes with a word X known at layer 3, while not otherwise degrading its capabilities.)
Most simply, I think you could just wire up the model with a kind of LIFO-bridged layer chain — where every layer K is passing its output to the input of layer K+1; but, for any given layer K in the first half of the layers, it's also buffering its output so that it can become an additional input for its "matching" layer N-K.
This means that all the layers in the "second half" of the model would receive longer inputs, these being the concatenation of the output of the previous layer, with the output of the matching "equal in abstraction depth" input layer. (Where this equal-in-abstraction-depth association between layers isn't inherently true [except in auto-encoder models], but could be made true in an arbitrary model by training said model with this architecture in place.)
(Again, I'd really love to see someone try this... but it'd have to be done while training a ground-up base model, so you'd need to be Google or Facebook to test this.)
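For what it's worth, this is recognizably the skip-connection pattern from U-Nets applied to an LM stack. A toy version, with plain linear layers standing in for transformer blocks:

    import torch
    import torch.nn as nn

    class BridgedStack(nn.Module):
        # First-half layers buffer their outputs; each second-half layer
        # consumes the previous layer's output concatenated with the output
        # of its mirror layer (LIFO: layer K pairs with layer N-K).
        def __init__(self, dim=64, n_layers=8):
            super().__init__()
            half = n_layers // 2
            self.down = nn.ModuleList(nn.Linear(dim, dim) for _ in range(half))
            # Second-half layers take widened inputs to fit the bridged signal.
            self.up = nn.ModuleList(nn.Linear(dim * 2, dim) for _ in range(half))

        def forward(self, x):
            buffered = []
            for layer in self.down:
                x = torch.tanh(layer(x))
                buffered.append(x)  # save for the mirror layer
            for layer in self.up:
                bridged = buffered.pop()  # LIFO pairing
                x = torch.tanh(layer(torch.cat([x, bridged], dim=-1)))
            return x

    out = BridgedStack()(torch.randn(1, 64))

And, as the comment says, the bridged inputs only become meaningful if the model is trained with this wiring in place from the start.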