The KHIT Blog: "Training Data," Human Neurobiological Wetware vs Generative Digital AI

Friday, September 8, 2023

Begs a question or two, perhaps?

One of the most troubling issues around generative AI is simple: It’s being made in secret. To produce humanlike answers to questions, systems such as ChatGPT process huge quantities of written material. But few people outside of companies such as Meta and OpenAI know the full [sic] extent of the texts these programs have been trained on.

Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet—that is, it requires the kind found in books. In a lawsuit filed in California last month, the writers Sarah Silverman, Richard Kadrey, and Christopher Golden allege that Meta violated copyright laws by using their books to train LLaMA, a large language model similar to OpenAI’s GPT-4—an algorithm that can generate text by mimicking the word patterns it finds in sample texts. But neither the lawsuit itself nor the commentary surrounding it has offered a look under the hood: We have not previously known for certain whether LLaMA was trained on Silverman’s, Kadrey’s, or Golden’s books, or any others, for that matter.

Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.

Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. In addition to work by Silverman, Kadrey, and Golden, nonfiction by Michael Pollan, Rebecca Solnit, and Jon Krakauer is being used, as are thrillers by James Patterson and Stephen King and other fiction by George Saunders, Zadie Smith, and Junot Díaz. These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet. A Meta spokesperson declined to comment on the company’s use of Books3…

OK. Very interesting Atlantic Monthly article. Worth your time.

Been quite the hot topic of late. Recall my post last December: "Malign Technologies Update: AI Natural Language Generation (NLG). Who might have copyright ownership claims to AI-generated human-readable text?"

Comment I posted on another blog yesterday.

What do y’all think about the recently reported (IMO inadequately defined) “pirated use” of the books of numerous notable authors as Generative AI “training data?”
I now have about 700 Kindle format eBooks in my iPad. I read them all carefully, marking them up profusely. They are part of my “unsupervised” neural cognitive wetware “training data”—“NI” (Neurological Intelligence).
Does that differ materially? (In an ethical sense?)

Screenshot from my Kindle

I've been a fanatic book reader my entire life (which now spans a creaky 77+ years). I've averaged 2-3 books a week for many decades, in addition to all of my periodicals (and, increasingly, online stuff spanning the gamut from the frivolous to the profound).

My Unsupervised Training Data.

Below, my Las Vegas loft library Data Warehouse in 2013.

I still miss that pad. We had crammed floor-to-ceiling bookcases all over the place. We've since given away about 90% of our hardcopy books. I'm getting to where I do much better with eBook & online reading as my old coot eyesight atrophies.

After getting my Master's, I taught Critical Thinking and Argument Analysis as adjunct evening faculty at UNLV from 1999 to 2004 (during my bank risk analyst days). Back then, academic plagiarism concerns were mostly focused on the ease of Ctrl-C / Ctrl-V Copy & Paste in Microsoft Word and the like. Now we have to wring our hands as students can use ChatGPT to simply ghostwrite their assignments for them.

Beyond academia, the uneasy prospect now looms of AI quietly writing our news stories, magazine articles, books, screenplays, political speeches, etc. Well, in the case of Donald Trump, though, it'd certainly immeasurably add coherence:

Yeah, it's not funny.

So, back to the original riff here. Issues of IP "piracy," "copyright," "fair use" aside, how does my lifetime of accrued "wetware" "training data" differ ethically? For the sake of argument, let's assume that these AI companies paid retail for every title available on Amazon (neutralizing the "piracy" beef) and then used the authors' prose simply as "training data," not publishing and disseminating verbatim "unauthorized copies."

OK, backing up a tad in paraphrase: "To produce humanlike answers to questions, people like BobbyG process huge quantities of written material. But few people ... know the extent of the texts he has been trained on."

Now, I could never keep up with the computers. My consumption of "training data" would be nanoscopically puny by comparison, in terms of sheer volume. So, perhaps AI will soon be able to kick my nominally formidable verbal butt, in terms of both topical analytic acumen and creative elegance of rhetorical flourish (to the extent that I can be said to possess the latter competence).

I guess we'll know before long. Maybe. Color me a bit skeptical as yet.

Just wondering. What do you think? (LOL, it'd be funny if I got AI "responses" generated by people using ChatGPT.)

CODA

I guess I'm kinda strange. Unremarkable B student in high school in NJ (albeit a voracious reader from early on prior to HS). Left home at 18 in 1964 to go on the road with a bar band in lieu of college. Traipsed all over the U.S. and Canada. Got politicized in 1967 when I hit California for the first time. I joke that I "was the only rock & roll guitar player in the country with subscriptions to Harper's, The New Yorker, the Atlantic, Ramparts, The New Republic, The Washington Post, and The Washington Monthly." Just about all of my fellow musicians wanted only to jaw about other musicians and bands and their recordings, and axes and equipment.

I didn't really fit in. There was more important stuff out there.

Then, after going White Collar in the wake of finally getting my undergrad at the age of 39, I found it difficult to fully fit in with the "suits." The cubicle crowd didn't get me either.

I now just refer to myself as a "life-long unlearner."

Running outa time. There's just too much to reconsider. Sometimes the current relentless vulgar media absurdity gets away with me.

SATURDAY ERRATUM

Ugh. Morocco death toll will continue to mount. Terrible.

More to come...

__________

Another important read (pdf)

I love this kind of stuff. It sustains and humbles me. "As politicians, advertisers, salesmen, and propagandists for various political, economic, moral, religious, psychic, environmental, dietary, and artistic doctrinaire positions know only too well, fallible human minds are easily tricked, by clever verbiage... Common language—or at least, the English language—has an almost universal tendency to disguise epistemological statements by putting them into a grammatical form which suggests to the unwary an ontological statement. A major source of error in current probability theory arises from an unthinking failure to perceive this."

Quotes

"An economist is a person who sees something that works in practice and tries to figure out whether it will work in theory."

- J.D. Kleinke, medical economist
___

"The only person who enjoys change is a baby with a wet diaper."

"Every misspent dollar in our health care system is part of somebody's paycheck."

- Brent James, M.D., M.Stat

“We could do healthcare, at markedly higher quality, for everyone in this country, without rationing or denying anybody the care that they need, without having the government dictate how doctors practice or whether hospitals could expand, at half the cost we do it now.”

- Health Care Futurist Joe Flower

"Most of the sciences, unlike parts of medical science, are not concerned with the impossible. There is not complementary and alternative physics, or chemistry, or biochemistry, or engineering. These disciplines compare their ideas against reality, and, if the ideas are found wanting, abandoned."

- Mark A. Crislip, MD

"Q: How much alcohol is too much?
A: More than your doctor drinks."

- a physician I once heard speak during a CME presentation

“Just because science doesn’t know everything, doesn’t mean you get to fill in the gaps with whatever fairy tale most appeals to you.”

- Dara O’Briain

'[I]t is one small step from using the computer for "helping" doctors to monitoring them, judging them, dictating to them what to do, and withdrawing payment for computer non-compliance. The use of computer data is a multi-edged sword. It can be used for the "good," facilitating diagnosis and treatment and making it more accurate and up-to-date, and for “evil,” invading privacy, inviting security breaches, and making decisions based on the opinions of remote authorities rather than those present at the patient-doctor encounter.'

- Richard Reece, MD

“[T]here ARE statistics which are non-political. Just because The Washington Post/Fox News reports the temperature is 75 degrees doesn’t mean it’s really snowing and sunscreen is a liberal/conservative plot. Even if you earn a living being ideological.”

- Michael L. Millenson

"It is generally a fairly convincing argument that people shouldn’t have to be subsidized to undertake a change which is in their best interest.

The reconciliation seems to be that EHR is not supposed to make a doctor’s practice more efficient and higher quality. It is supposed to make the system of care more efficient and higher quality, which is not the same thing. Those of you who took calc recall that maximizing the total of variables is not achieved by maximizing any one variable and this is a perfect example of that.

Those of you who have served in combat certainly noticed that too: if everyone works as a team the unit takes fewer casualties. If you try to save your own hide, you might, but at the expense of more casualties overall."

- Al Lewis

"There are two ideas to keep in mind about Bayesian reasoning and how we tend to mess it up. The first is that base rates matter, even in the presence of evidence about the case at hand. This is often not intuitively obvious. The second is that intuitive impressions of the diagnosticity of evidence are often exaggerated."

- Daniel Kahneman, "Thinking, Fast and Slow"

"Physicians apply advanced scientific knowledge, but they must do so without the favorable conditions that experimental scientists create for themselves. Multitasking is forced on physicians, often in chaotic environments and under severe time and resource constraints."

- Lawrence and Lincoln Weed, "Medicine in Denial"

"It’s time to stop the whining about Obamacare and acknowledge we already have universal health care. We just pay for it in the stupidest way possible that ensures problems are that much more disastrous and complicated when they’re finally treated."

- Mark Hoofnagle, MD, PhD

"Every act of conscious learning requires the willingness to suffer an injury to one's self-esteem. That is why young children, before they are aware of their own self-importance, learn so easily."

- Thomas Szasz, MD
___

"Of course, one reason that process metrics are so popular is that processes are much easier to define and measure than outcomes."

- The Skeptical Scalpel
___

"There is an “illusion of validity” for any random data point, a seductive sense that is colored by what we hope will be true. Mountains of pharmaceutical claims are often made from mere molehills of data."

- Danielle Ofri, MD
___

"Joy empowers people. It is a source of energy that enables people to hope and plan and change their lives for the better. Spend some time around someone who is relentlessly negative and how do you feel–drained, right? More and more research shows that joy is not something that just happens to you, like a bolt of lightning out of the blue. Joy is, instead, a habit to cultivate. Negative thinking and despair are the crabgrass of our souls–weeds that take root and spread, sometimes to all areas of life. Joy, in contrast, is a soul’s rose–hardy when cared for, able to put down roots over time and withstand disease and extremes. Like a rose, however, your joy can become blighted from neglect or harsh conditions. We all need to tend to our joy–to prune away the badness, and to know that, even though it may look like a prickly bare root, if you invest time in a joyous outlook, gorgeous things will bloom, even in the harshest conditions."

- Dr. Jan Gurley
___

"'Solutions' exist only in mathematics."

- Karen Martin
___

"The issue of how to regulate clinical software is, in the long run, indistinguishable from the issue of how to regulate medicine. The only difference is that medicine is practiced in the open, without secrecy, subject to peer review, and under a merit-based state license."

- Adrian Gropper, MD
___

"Economist, rope, tree: some assembly required."

- Source unknown

DISCLAIMER:

I write this blog wholly on my own time and my own dime. The views proffered are expressly my own as a concerned and active citizen/taxpayer (in addition to being the result of my substantive experience in the various IT fields), and in no way reflect any policy views of my former employer, notwithstanding that some of the thinking has indeed obviously been spurred by the implications of the work I have been doing for them.

FAIR USE POLICY
I cite a ton of news and web sources spanning the breadth of relevant technical and policy domains, sometimes at substantial length. I believe I remain well within the bounds of "Fair Use," as [1] I am not doing any of this for profit, [2] I always provide attribution and links -- which, [3] far from negatively impacting any copyright holders' commercial interests, might actually increase traffic to and interest in their offerings.

Nonetheless, should I post anything of yours regarding which you have any objection, just let me know and I will remove it forthwith.
