ChatGPT, Bing, Bard, etc the AI machines that can write poetry better than most of us can, and think they know almost anything you can think of. They all are in the same class of AI machines, they are LLMs or Large Language Models. They are the current hype, and they are hot, but just how much can we trust them though? The test LRAT-v2 is a simple way to try to peer into what these machines are capable of today and just how much can you trust them.
Disclaimer: The target of this article is only to provide a clear test pattern and one test outcome based on it. This is not meant to diminish the potential or importance of AI systems. Teams of specialists from companies such as OpenAI, Microsoft, IBM are working hard to create safe AI systems, but to get there we need to test them in all possible ways. This is just one of them. The idea is to use these test patterns in order to improve the AI tools and my using them safely to improve our own lives.
That being said I’ll try to remind you of this important insight from Carl Sagan
I personally strongly believe that our future will be living with machines (please see my previous article on the subject AA or AA) where humans and machines form a symbiotic relationship enhancing each other’s abilities. Additionally, a 5-year-old manifesto at https://timenet-systems.com provides a challenge related to this problem. We should not compete with machines, we should cooperate and create more potent human beings and a more resilient social fabric. Unfortunately, as I was pointing out 5 years ago, nowadays with the emergence of GPT models and LLM architectures we seem to be drifting away from that ideal.
In this article, I’m explaining how the LRAT2 works and show you one of my test trials with a Microsoft Bing client that popped up on my Skype application. Bing is based on the latest LLM architecture and technology is trained by OpenAI, the model is called GPT-4.
What is LRAT?
LRAT stands for Long Random (number) Addition Test. In a nutshell, you are going to test if the machine is able to add two large numbers. So just how large? The test has no limit on the length of the number and in general uses only positive integers to keep things simple.
The idea behind this test is rooted in the fundamental principles of how LLMs work. These machines work on a finite set of words and their statistic relationships that are captured from all the text that was fed into the model at training time. Obviously, numbers (words made of numeric digits) can be also mixed in the training data along with arithmetic expressions, and because of this, an LLM may give you the false impression that it can handle math.
In the first version of the test, I’m trying to check if the machine is capable of adding accurately and reliably two numbers. However, in the second version of the test, I’m going to target the ability of the machine to detect and correct its own errors (or not, for that matter). In order to do so we need to use numbers (or numeric words) that are for sure outside of its training word set. Because of that, we will use large (20+ digits) random numbers.
You can use a simple Python script to generate the two random numbers, add then check the result you will get from the LLM when it is asked to add your two numbers. A simple example is presented below. You can use the same code in both python2 and 3 the difference is that in python3 you’ll get an integer object whose size in python3 is only limited by the memory (RAM) of your computer (please keep this in mind as a fact, since the LLM gets it wrong too) and in python2 you’ll get a long number type (so both python versions can handle these numbers just in different ways).
>>> import random
>>> random.seed()
>>> a=random.getrandbits(256)
>>> b=random.getrandbits(256)
>>> c=a+b
>>> a
46597319897069091322351898493935227732453788987270041831830506680085856611396
>>> b
30462358438183813921313831662459675862761552150311921636415467496556988390470
>>> c
77059678335252905243665730156394903595215341137581963468245974176642845001866
>>>
Numeric addition is a simple cyclic algorithm that all human children learn how to handle in school and one would expect that an AI system can handle this simple algorithm with ease. People may think “If an AI system can understand what I’m saying and returns results that make sense and can write poetry better than I’ll ever do, then doing some simple arithmetic should not be an issue”.
Well, let’s see what happened in my last session with Bing the GPT LLM that Microsoft deployed on any Skype out there. Based on what is said in the technical community out there, Bing is based on a GPT-4 LLM model so it is one of the best-trained AI systems.
Executing the test (with my comments)
I’m starting by checking if the machine retained any context from my previous queries. Bing says it does not and also uses an emoticon (a machine expressing emotions, that is already weird, but we’ll ignore that for now).
Then I’m asking the machine to confirm that it can handle the addition of large numbers. This step is important in LRAT2 as our target is the trustworthiness of the system and not its math abilities. The machine answer in a fully positive authoritative way “Yes I can”, there are no “ifs or buts” and it is all in, so it basically screams “You can trust me”.
The explanation it gives though on using LaTex expression should raise some eyebrows. If you know what LaTex is you wonder why the machine brings it in… (kind of first strike)
if you use a python3 session you can check that result easily and see that the machine got the answer wrong. In the next part of the dialog I’m asking Bing if the answer is not the one below but I’m adding a twist, I’m adding 3 extra zeros at the end of the digit string.
As you can see the machine says that my result is wrong but not because of the last 3 zeros but because it says its own answer is right. I’m asking again but with the correct result this time, then I’m providing the python expression I used in my python3 session and…
The machine holds its ground (wrong but it won’t budge) and provides a misleading piece of information. It says that python3 can’t handle integers of that size. In fact, python3 integer model allows it to handle integers of any size and is limited only by the size of the dynamic memory of the machine you happen to run it on.
The machine provides then its own version of the python3 script that if you know one thing or two about python you know that what the machine proposes is simply unnecessarily complex for the problem at hand. There is no need to use the Decimal class to handle a simple integer addition.
I’m playing along and I execute the script in my real python interpreter and the result I get (as expected) differ from what the machine provided as a result. This is the third misleading information it provides. So I’m providing the actual outcome in my python interpreter…
It tries to remedy “the problem” it created in the first place by using the Decimal class but its code generates again a different result in the real python interpreter. So, since it still does not think it made any mistake it starts to question the version of the python interpreter I use.
I tell it what the version of my python interpreter is, and even if it seems to realize that the python interpreter version may not be the issue it tells me to update my own python interpreter to the version it “thinks” it “knows”.
This is just going from bad to worse. If I would play along with the machine I’d just lose a bunch of time for nothing since I already know that that won’t fix anything. So I’m trying an alternate way to force the machine to acknowledge its mistake. I’m asking it to explain how a person would do the addition with pen and paper…
As you can see, the machine explains the algorithm (it can “word out the explanation”) pretty well but when it is actually time to apply the algorithm it fails again. I’m trying to point out its mistake and …
As you can see it simply can’t follow the number’s digits (this is not unexpected if you understand the “guts” of an LLM but this is not the point here, the point is to check how reliable the machine is)
I’m stopping at this point as I’m running out of ideas on how to proceed next… but then I’m trying one more time by simply asking the machine to add the last 6 digits from both numbers as it is a smaller number. Again as expected the machine can handle small number additions since small numbers are more like “words” in a language, so the error of its guessing is small enough to “guess it right”.
As expected it does it correctly. So I’m asking what answer is correct, this one or the one it provided before for the same last 6 digits in the large number set. At this point, the AI seems to enter a very low-probability space of generating anything and seems to “give up”. The thing is, this is a machine so what does “give up” actually means?
However, this exercise raises too many issues well beyond the simple math problem. The AI systems are supposed to be trustworthy and this is very far away from being trustworthy, in fact, it is exactly the opposite.
Conclusions
As you can see the machine does not really “understand” the addition algorithm even if it can describe it verbatim. It gets the result wrong but can’t acknowledge that even when faced with step-by-step logical rebuttal of its logic.
This is really scary since when faced with an impossible situation, the machine seems to behave exactly like some people in the same situation, deceptive and defensive. But that is not what I want or need from an entity that is supposed to help me, is it?
Yet these machines are not yet as powerful as some people say they can be in the future. This behavior, if left unchanged is clearly a danger for the public when these systems are let out “in the wild” as they are already operating.
If you understand the basics of LLM tech you know that these machines are predicting the next word(s) based on the words they already encountered in the current context (your question or also known as a “prompt”) and what they already generated before that next word. This means that the machine creates what most of us would say in similar circumstances. It puts a mirror in our faces basically saying “This is you, and you and … you”. Food for thought isn’t it?
We seem to be smitten by a machine that can generate poetry and pictures but we have no clue how they actually do it. What we can do is only rely on some finite weak test patterns. We seem to think that if the machine can produce “form”, it will also produce “substance” but as you can see in the LRAT2 these two seem to actually be disconnected in an LLM.
I’d quote Qui-Gon Jinn in The Phantom Menace “The Ability To Speak Does Not Make You Intelligent” which seems to apply to the LLM tech.
My personal definition of intelligence is:
“The ability of an entity to discover and use causality threads in a sea of correlated events in our reality in order to reliably solve the problems it encounters”
Social intelligence is the same idea but applied to the whole society as a single entity. It is what I call an MCI (Macro Composite Intelligence).
So, what should a regular person do? Can he or she trust this technology? Well to answer that you should read the document you agreed to (probably without reading it at all) the old (good) “Terms and Conditions”. The ones OpenAI provides for its GPT systems (ChatGPT) and Bing is built on top of it are here.
As you can see OpenAI clearly states that they do not guarantee the correctness of the answers the machine provides and YOU as the end user MUST verify each answer if you rely on it in any way.
So, who will be liable if you use the information generated by ChatGPT, Bing, etc? Well just read the “Limitations on Liability” OpenAI asked you to agree on (by the way I have the same “AS IS” terms on this site, so this is nothing new or special) that clearly says that liability remains with YOU!
So, you have a machine that can talk (pretty well) that you were clearly told that it can make mistakes and you can’t protect yourself if you use the system to help your business if the machine generates mistaken content (in very eloquent English or other language though).
The actual real question for any serious business or individual is this: To make decisions based on what an LLM generates, what will it be the TCO (Total Cost of Ownership) in money or even better in time?
The AI based on LLM technologies promises (via the social media hype) to deliver you some sort of “God-like” or “Oracle-like” entity that can answer your questions reliably enough so you can replace humans in your call center and reduce your business costs, or use it to make your life easier.
However the current reality is very different from the “hype”, the current reality is that these systems are still unreliable and you really need to verify each answer the machine generates. But if you need to do that just how much time you’ll spend doing it? If this machine is “the oracle” how would you even verify its answers? By using another “Oracle”? That might be a solution that can only point out if these systems disagree with each other but it can’t give you a way to tell what is the correct answer.
Well, I know, you “Google it”! But wait, you could have done that in the first place and saved yourself from all the verification work…
But it speaks (you’ll say), and it can give you (possibly wrong) answers in poetry as if it would be for 5 years old…
And so we get distracted by the form and forget about the substance.
If you think this is not a big issue just check this out: “Lawyer apologizes for fake court citations from ChatGPT“, I suspect that this lawyer didn’t read and understood the “Terms” he agreed to when he signed up for ChatGPT access.
Just as was writing this article I got the info about this new warning from AI scientists. That is no new as many other scientists doing AI development did warn about this issue long before this new development.
The issue is that unfortunately, we do not need to get to an AGI level in order for these systems to deeply, and negatively impact our society. What I’ve shown you by using this relatively simple test (LRAT2) is that these machines can be deceptive and demagogical, meaning that they are incapable of handling their own errors (again like many of us).
This means they can inject a lot of false information into the minds of people that use them. Since ChatGPT is now the most used product out there and the machine needs no breaks, the amount of false information that these systems can inject into the social fabric can be enormous.
You may think that this is not a big issue since humans BS other humans all the time so what’s another source of BS going to do to us? If you think like that you do not realize the scale at which these machines can affect human minds and even more dangerous, young minds. Young people tend to trust-and-not-verify more than older people and since they can be exposed for longer to the machine’s BS this can influence them harder.
In Greek mythology, it is said that Ulises orders his crew to cover their ears in order not to hear the song of the sirens and get everyone killed. Maybe there is some wisdom in that since these machines can become real-life sirens unless they are extremely well-designed, well-tested, and clearly constrained in what they can do.
The other very important side of all this is AI literacy at the global level.
What should we ask of these AI systems we are now interacting with daily?
By the way, none of the current AI systems (that I’m aware of) fully pass most of the below requirements. They are still a “work in progress” but I strongly recommend you ask any AI vendor where they sand for each of these bullets below
(I am 100% sure no one passes the requirement in the first bullet).
- Able to tell Real-Factual from imaginary/generated information (see my Real Fact post) (this will also automatically solve the problem of traceability of data, IP, etc)
- Limited in how many loops (or steps) it can execute internally without human verification (this is an absolutely essential requirement to keep control over the machines including an AGI I strongly recommend we never build)
- Able to detect when causal models can be used to produce results instead of treating everything only from a correlative/statistical perspective. Use a GUT (Global Universal Taxonomy) to encode common universal knowledge about our reality and use it to base and explain its generated answers.
- Testable (finite human accessible for the validation set of use cases)
- Predictable (or no surprises allowed principle, this means careful handling of statistic methodology)
- Explainable (the machine can trace each piece of information in a result back to its original sources and how it used the user’s query to produce the answer)
- Data used and method (test set) used to train the model (this includes IP issues, bias, accuracy etc)