Proving Wrongdoing

That could be complicated. We have many activities to consider: data collection; preparation; often distribution; training of LLM's, which are novel constructs; and generating outputs from LLM's, which has some legal precedent. Since I'm not an I.P. lawyer, I'm not going to try to guess what courts will say about the LLM's. What's better understood by lawyers, especially those employed by copyright holders, are the ways people can obtain, use, and share copyrighted works. We'll simplify things by focusing just on how the LLM's are being trained.

Current copyright law says that using any published work in another published work outside of fair use requires a license. Distributing published works verbatim, or sometimes with paraphrases, to other people is often not fair use. Additionally, contract law says we must abide by the terms of use our data suppliers gave us. Some require attribution of the source, non-commercial use, or even money in some situations. Copyleft licenses (eg CC BY-SA, GPL) often further require that all derived works (outputs) be licensed under copyleft themselves, usually the same license.

The high-level questions to ask are these:

1. Did the AI companies use copyrighted works without permission, against their license, and/or against the terms of service?

2. If sources require specific licenses for derivatives (i.e., copyleft), were the derivatives licensed as required when the sources were used as training data?

Now, let's get more specific about how we'll assess that.

For copyright, we're using the default assumption that any published work is the property of its owner, who must license it to you in some way. People who don't own the copyright are not allowed to publicly display or distribute copies of these works. That's what all the legal claims were about for file sharing and YouTube videos. The lawsuits could cost more than your house. Any copyrighted work in a training set that has no license might be there illegally. That might get the supplier, the user, or a middle person (eg a data curator) sued. So, we can say a training set is a risk by default if it includes copyrighted works without licenses.

For permissive licenses, some have requirements. If attribution is required, you must cite the source in any derivative work. Some require a link back to their site. Others, like Apache, might require you to include the license somewhere. Some, under trademark law, require you to take their name out of any derivatives you make. Content owners' responses to violations vary from being annoyed to taking legal action. So, they're still a risk. We need to know which permissive sources have citation or other requirements.

Many works are free with restrictions. There are both research papers and source code (examples) that are available for non-commercial use. The beautiful text of the ESV mentions both non-commercial use and a ban on Creative Commons-licensed outputs (huh?). Since Crossway holds it tightly, I just link to a licensed supplier like you just saw if I want to use it. Whereas, both ChatGPT 3.5 and 4 will directly quote ESV text if asked. They do it without the copyright notice, for $20 a month, while assigning the outputs to people who might be using them in Creative Commons-licensed works. That might not be legal. Whereas a public domain translation (eg WEB) would be fine.

For copyleft, the derivative works must be shared under a copyleft license. This is true if they're distributed. Cloud vendors often dodged this requirement because they technically didn't distribute their modifications outside of their organization. The user data came in, their software acted on it, and outputs were sent back to the user. The LLM-based AI's are different because the internal state they mix into the outputs might be the copyrighted work itself. They even often assign the final copyright to the user or treat collections of prompt/response pairs as a form of intellectual property. Under current law, any AI or LLM using copyleft works in its training data might be required to put that copyleft license on either all of its outputs or those which clearly derive from the copyleft inputs.

I'll quickly note a solution that showed up in the coding space. Some LLM's for code are trained only on permissive works to reduce the odds of legal issues. Then, there's a search function that compares the output to files in the training set. Whatever doesn't match must be original [enough]. If it does match, it's permissive code that can be easily cited. While a great idea, many coding models are layered on top of foundational models. The output might instead be derived from the gigantic pile of data (1+ TB) used to train the foundational model. The code in that data set might be proprietary, patented, and copyleft. The LLM might reproduce any of that in its output.
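
To make the matching idea concrete, here's a minimal sketch of how an output-vs-training-set check could work. This is my illustration, not any vendor's actual system: real tools index billions of files with specialized search structures, while this just compares word-level n-grams of a generated output against a local folder of permissive code.

```python
# Minimal sketch of the "match outputs against the training set" idea.
# Hypothetical: real systems index billions of files; this scans a small,
# local corpus of permissive code.
from pathlib import Path

def ngrams(text: str, n: int = 10):
    """Yield whitespace-normalized n-gram strings from text."""
    tokens = text.split()
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def find_matches(output: str, corpus_dir: str, n: int = 10):
    """Return (file, sample n-grams) pairs where the output overlaps a training file."""
    wanted = set(ngrams(output, n))
    hits = []
    for path in Path(corpus_dir).rglob("*.py"):
        text = path.read_text(errors="ignore")
        overlap = wanted.intersection(ngrams(text, n))
        if overlap:
            hits.append((path, sorted(overlap)[:3]))  # keep a few examples
    return hits

# Anything with hits maps back to a permissive file that can be cited;
# anything without hits is treated as original [enough].
```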

Tangent risk: all code, permissive or not, might have security vulnerabilities that the A.I.'s then reproduce in the code they generate. Bad codebases teach A.I.'s to write bad code.

Specific Data Sources

Let's look at the specific data sources that are often used in AI models. We'll glance at their licenses, terms, and so on. We'll start with those they use the most.

Common Crawl and Its Derivatives

Common Crawl. Its FAQ and Terms of Use say they definitely don't own it: "we just found it on the web. So we are not vouching for the content or liable if there is something wrong with it." They also have a copyright contact page so owners can send in claims of copyright infringement. All of this is risky to share or reuse without a license.

OSCAR. In this article (2019), section 7 says OSCAR is a processed version of Common Crawl (Nov. 2018 version). They were trying to reduce its size so people with fewer resources could use it. The collection itself is CC-0 (public domain equivalent). In this article (2021), the authors make a new version of OSCAR that also uses Common Crawl while being drop-in compatible with the old one.

RealNews. This paper (pdf) says they filtered Common Crawl down to just news articles. It's limited to "the 5000 news domains indexed by Google News." They used articles from Dec. 2016 to Mar. 2019.

CC-Stories (or STORIES). This paper says in section 5.3 that they process Common Crawl to produce a subset that's good for their training purposes. The result, STORIES, has over a million documents.

C4. Google's C4 is a "colossal, cleaned version of Common Crawl's web crawl corpus." A processed version is here; it's released (see bottom) under the ODC-BY license and Common Crawl's terms.

Refined Web: paper and HuggingFace. It's made by the Falcon team. Refined Web is a refined version of Common Crawl, so it inherits any of Common Crawl's legal risks. Section 2 actually describes the progression of how the field as a whole acquired its training data. It's devoid of any concept of legal rights.

Wikipedia

Wikipedia. Contrary to popular assumptions, Wikipedia's content is not permissive enough to use for just anything. It's released under the CC BY-SA 4.0 license, along with a Terms of Use. Here's the Contributors' Rights page. CC BY-SA is a copyleft license that says any outputs we produce with such content must also be released freely under CC BY-SA 4.0. AI models that use Wikipedia in their training set will have to release either their whole work or Wikipedia-based outputs under that license. Spotting Wikipedia quotes would require asking the AI's about articles that existed before they were trained (eg pre-2021 for ChatGPT). Since some models filter articles, I suggest using English-language articles with substantial amounts of content, looking for unique portions.

The Pile and Its Components

The Pile (paper) includes these data sources: ArXiv; FreeLaw; StackExchange; BookCorpus2; Hacker News.

Let's look at those data sources:

On legality, Section 6 of the paper divides data into what is publicly available, what was obtained following the terms of service, and what has authorial consent. For instance, they claim Books3, BookCorpus2, and OpenWebText2 (below) were merely public. They say they both fulfill the terms of service and have author consent on other sources, including ArXiv, StackExchange, and Hacker News. I'm not sure if that's their interpretation or they have proof they can share.

On copyright, they make a few arguments in Section 7 that their work is legal. My pro-AI arguments in the copyright section were based on what's in The Pile paper to keep close to what the AI developers are arguing. I won't repeat them here.

They claim their use is transformative: "the original form of the data is ineffective for our purposes and our form of the data is ineffective for the purposes of the original documents." They also claim that using the full text of documents is legally allowed when it's necessary, and it's necessary for NLP research. My concern here is that these AI's spit out both unique ideas from, and sometimes verbatim claims of, their training data. That's exactly why people are using the AI's in place of the source material for the same purposes, especially for learning new things. So, the copyrighted works are intended to gain the authors something while delivering specific value to their users.

(Note: Shawn Presser claimed to be the author of The Pile. He said that AI models aren't copyrightable. He also claimed that they're preparing to take Meta/Facebook to court to establish that precedent. His stated goal is to make life better for researchers and programmers sharing or using their own works. The resulting precedent would allow AI models to be built on almost any data before being used for any purpose. That includes both reproducing and improving that data.)

In Section 7.1, the authors warn that "we do not have the metadata necessary to determine exactly which texts are copyrighted, and so this can only be undertaken at the component level. Thus, this should be taken to be a heuristic rather than a precise determination."

During evaluation, they used OpenAI API credits that the company donated. So, OpenAI is aware of The Pile. Although GPT2/3 weren't trained on it, the authors of The Pile say certain aspects of their performance evaluation can show what parts of The Pile are very similar to GPT2/3's training data. On page 7, they say these parts of The Pile are very similar to GPT3's datasets: Books3, Wikipedia, Pile-CC, and Project Gutenberg. Other portions had such different results that it's highly unlikely they were in GPT3's training set. If the content is similar, then GPT3 will probably have similar types of copyright, patent, and trademark issues from using that content.

Miscellaneous Data Sets

OpenWebText2 is a large, filtered dataset of 17,103,059 documents in text form (65.86GB uncompressed). These were scraped from URL's found in Reddit submissions. That means that, if a person's comment had a link, they automatically grabbed what was at that link. Scrapers are just a retrieval mechanism that pulls data from its source to the local computer. They usually ignore copyright, trademarks, or terms of service. Like Common Crawl, all texts in OpenWebText2 are copyrighted by their authors. Distributing that content or reusing it in derivatives might be illegal. The collection itself is released under the highly-permissive MIT license, which only requires including the license when it's distributed.
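
For illustration, here's a toy version of that retrieval step, assuming a list of outbound links like the ones harvested from Reddit submissions. Note that there's no license, robots.txt, or terms-of-service check anywhere in it, which is exactly the problem being described:

```python
# Toy sketch of the scraping step behind sets like OpenWebText2: given
# outbound links, fetch each page and keep its text. Nothing here checks
# copyright or terms of service; that's the point being illustrated.
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crudely collects all text nodes from an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def scrape(urls):
    for url in urls:
        with urllib.request.urlopen(url) as resp:   # pulls whatever is there
            html = resp.read().decode("utf-8", errors="ignore")
        parser = TextExtractor()
        parser.feed(html)
        yield url, " ".join(parser.chunks)
```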

PushShift.io Reddit. This paper says it has all submissions and comments on Reddit from mid-2005 to mid-2019. That's 651 million submissions with 5.6 billion comments. Their API lets researchers work with that data, too. Their retrieval engines can work with web API's or scrape HTML pages. I'm not sure which they used to get the Reddit content. Google Scholar said "over 100 peer-reviewed publications" had used it by 2019.

Reddit content itself. I looked at Reddit's User Agreement. It says you cannot "license, sell, transfer, assign, distribute, host, or otherwise commercially exploit the Services or Content." You cannot "modify, prepare derivative works of... the Services or Content." People are supplying or using Reddit data for A.I., with the packages or models released under specific licenses. Some are selling the use of the A.I. or value-added services. So, they're probably licensing, selling, transferring, distributing, modifying, and preparing derivative works of the content. While their use is against Reddit's terms, the "Your Content" section shows that Reddit itself can use and sell user data for A.I. training.

News Crawl: It's one of these sets from WMT. I don't know its license. That they're hosting Common Crawl suggests they're not worried about that, though. The 2018 paper has examples of systems that use these data sets.

Giga5 from LDC. It's several years worth of news wire data from seven sources. Those organizations' copyrights are asserted on the page. Although free for its members, LDC sells this data for fees ranging from $3,000-6,000. This work was sponsored by a DARPA grant.

ClueWeb12 from CMU. Over 700 million web pages licensed for research purposes only. They have a formal licensing process for using it. It was sponsored by NSF. The note on the bottom thanks Google for contributions to the data set. That means Google might be using it internally. Someone could ask these companies if they're using any of these data sets in any of their products or services.

MassiveText (2021). Its sources include "web pages, books, news articles, and code." It has "2.35 billion documents, or about 10.5TB of text." This paper says in Appendix A and Table 2 that it includes data from MassiveWeb (604 million documents), Books (4 million), C4 (361 million), News (1.1 billion), Github (142 million), and Wikipedia (6 million). MassiveWeb is scraped from the web and uses Google SafeSearch in its process.

RoBERTa. The paper says their data set included: BookCorpus, English Wikipedia, CC-News, OpenWebText, and STORIES. BookCorpus + English Wikipedia came from BERT. They expanded the data set.

Infiniset: From Google, this paper says it "consists of 2.97B documents and 1.12B dialogs with 13.39B utterances." The components are: "12.4% C4 data; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc.; 12.5% Wikipedia (English); 6.25% English web documents; 6.25% non-English web documents."

What Do Models Use?

OpenAI's/Microsoft's GPT3: Their paper says they use Common Crawl (180GT, i.e., billion tokens), "WebText2" (55.1GT), "Books1" (22.8GT), "Books2" (23.65GT), and Wikipedia (10.2GT). They outright say they use Common Crawl and Wikipedia. That's already a problem if they don't meet those sources' legal requirements. Then, there's whatever those books are. Since The Pile paper says Books3 is probably in both GPT3's data and The Pile, a full list of the books in Books3 would probably be useful.

OpenAI's/Microsoft's GPT3.5/GPT4/ChatGPT: These aren't published in enough detail to evaluate them. Their page makes it look like they're fine-tunings of GPT3 that layer extra stuff on top of that model. If so, they'd share its foundational training data. One guy leaked that GPT4 is a Mixture of Experts model that's around eight 220B models cooperating. If that's true, they trained it from scratch. A simple way to assess that would be to ask them if GPT3 itself or its training data are used in the GPT3.5 and GPT4 series models.

Google's LaMDA: This paper says it uses Infiniset.

Google's GLaM: This paper says it uses these sources with these token counts: filtered web pages (143B); Wikipedia (3B); Conversations (174B); Forums (247B); Books (390B); News (650B). "We also incorporate the public domain social media conversations used by" this paper. It says that's 341GB of text, mostly conversations. I didn't see a source mentioned. So, I wouldn't assume it's actually public domain.

Google's PaLM: PaLM1 (paper) says its 780 billion tokens are "a mixture of filtered web pages, books, Wikipedia, news articles, source code, and social media conversations." The data set is based on what LaMDA (Infiniset) and GLaM used. PaLM2 (paper) is "composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data. The pre-training corpus is significantly larger than the corpus used to train PaLM... In addition to non-English monolingual data, PaLM 2 is also trained on parallel data covering hundreds of languages in the form of source and target text pairs where one side is in English."

Google's Bard: This announcement says it was based on LaMDA. This article quotes a podcast interview with Google's CEO saying they'd soon switch Bard over to PaLM.

Anthropic's Claude: Claude is a top competitor to GPT. They're financially backed by Alphabet/Google. Their model page says "Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets that we license from third party businesses, and data that our users affirmatively share or that crowd workers provide. Some of the human feedback data used to fine-tune Claude was made public [12] alongside our RLHF [2] and red-teaming [4] research." Who knows.

Stability AI's models: Famous for Stable Diffusion, they say they've also released many LLM's: GPT-J, GPT-NeoX, the Pythia suite, and now StableLM. Most were trained on The Pile. StableLM "is trained on a new experimental dataset built on The Pile, but three times larger with 1.5 trillion tokens of content."

Facebook's/Meta's LlaMA: This paper says it uses a mix of several datasets with pre-processing that filters some portions. They use Common Crawl (above), C4, Github's permissive projects, Wikipedia, Gutenberg's books, Books3 from The Pile, ArXiv's Latex files, and StackExchange. Although LlaMA was originally released to outsiders in a restricted way, someone leaked its model weights so that anyone could use them. Also, the open-source community built a huge ecosystem on top of LlaMA that innovated faster than big companies while making fine-tuning cheaper (sometimes around $100). As a result, Meta released LlaMA2 under a permissive license allowing all uses. Its paper says it used "a new mix of publicly available data," nothing from Meta's services, that it was 40% larger than LlaMA's mix, and that the total was "2 trillion tokens."

(Note: GPT3, PaLM, Bard, Claude, Meta's LlaMA 2, and some Stability AI models are either usable API's or commercially deployed.)

DeepMind (now Google): Gopher's paper says it used MassiveText. Chinchilla's paper says it used around 1.4 trillion "tokens" for training: almost five times more than GPT-3. It also uses MassiveText, but a different subset. I'll note that Chinchilla is one of the most widely-cited papers because it discovered a special rule about LLM's: Chinchilla optimality. The theory is that LLM's must be trained on at least 20 tokens of text per parameter to be highly effective. If a 1B model, then 20 billion tokens. If 175B (eg GPT3), then it's 3.5 trillion tokens. Since the Chinchilla paper, most LLM projects have aimed to gather as much data as possible to meet this requirement. So far, those trained with gigantic amounts of data have outperformed those with smaller amounts when quality was similar.
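
The arithmetic behind that rule is simple enough to show. This little sketch just multiplies parameter counts by the paper's rough 20-tokens-per-parameter heuristic (the exact ratio varies with the compute budget):

```python
# Chinchilla's rough rule of thumb: ~20 training tokens per model parameter.
def chinchilla_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    return params * tokens_per_param

for params in (1e9, 70e9, 175e9):   # 1B, 70B (Chinchilla itself), 175B (GPT3)
    print(f"{params/1e9:>5.0f}B params -> {chinchilla_tokens(params)/1e12:.2f}T tokens")
# 1B -> 0.02T (20 billion), 70B -> 1.40T, 175B -> 3.50T
```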

Microsoft/NVIDIA's Megatron-Turing NLG 530B: This model is over twice the size of GPT-3. To train it, this article says they used 15 datasets with over 300 billion tokens. They included subsets of The Pile, two snapshots of Common Crawl, RealNews, and CC-Stories.

OPT: This paper says they combine portions of data from RoBERTa, The Pile, and PushShift.io Reddit. From RoBERTa, they included BookCorpus and Stories (Trinh and Le, 2018). They got a modern version of CC-News. From The Pile, they used: CommonCrawl, DM Mathematics, Project Gutenberg, Hacker News, OpenSubtitles, OpenWebText2, USPTO and Wikipedia.

HuggingFace cofounder's BLOOM: This is a 176B model (GPT-3 sized) that was made by "over 1,000 collaborators worldwide" as an alternative to those of big AI companies. It was trained on this supercomputer in France. The model page, datasets, and this detailed article describe their data sets. The article says they included many data sets their collaborators suggested in their "Data Sourcing Catalog." They also did a "pseudo crawl," which was "finding their data in an existing web crawl." They included OSCAR v2. I'm not sure where they put their full data set. The same page as the article has many folders, including a data folder that we can assume they use for either that or other projects. In it, I see "pile" (The Pile?), "openwebtext," "oscar," and "oscar-multilingual." I might have missed something since I ran out of time to dig through all of this.

GLM: This paper says it's a 130B model trained by Zhipu.ai and Tsinghua in China. It was trained on 400 billion tokens, half English and half Chinese. Section 2.2 says it uses these data sources: "1.2T Pile (Gao et al., 2020) English corpus, 1.0T Chinese Wudao-Corpora (Yuan et al., 2021), and 250G Chinese corpora (including online forums, encyclopedia, and QA) we crawl from the web, which form a balanced composition of English and Chinese contents." The only one I recognize is The Pile.

Falcon: The Falcon models are from the United Arab Emirates. This group both created and used Refined Web.

Finding Your Data in Proprietary AI Models

Courtroom Approach

Ask the court to get the training data for the A.I. That's how the models get their knowledge. The A.I.'s have several sets:

1. Raw, pre-training data. This must be as large as possible for the model to learn as much as possible.

2. Processed, pre-training data. Even those using the same sources, like Common Crawl, have different strategies for filtering out low-quality or undesirable data.

3. Fine-tuning data. These are much smaller. Their goal is to teach the A.I. specific skills, moral alignment, etc. Many open models became competitive with GPT in some areas through the right choice of fine-tuning data.

4. Prompts. Prompts are the text you give a trained model to get it to do something. How you ask it things can have a major impact on what it outputs. Companies might put text or commands before, inside of, or after the users' prompts. They're used to do everything from setting the A.I.'s mode, to making it sound polite, to keeping it silent on controversial subjects. (A tiny sketch of that wrapping follows this list.)
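
As a hypothetical illustration of that wrapping, a vendor's server might assemble the final prompt like this before the model ever sees the user's text. The prefix and suffix wording here is invented:

```python
# Hypothetical prompt assembly on a vendor's server. The hidden prefix and
# suffix shape the model's behavior without the user ever seeing them.
HIDDEN_PREFIX = ("You are a polite assistant. Do not quote copyrighted "
                 "works at length. Avoid controversial subjects.")
HIDDEN_SUFFIX = "Answer concisely."

def build_prompt(user_text: str) -> str:
    return f"{HIDDEN_PREFIX}\n\nUser: {user_text}\n{HIDDEN_SUFFIX}\nAssistant:"

print(build_prompt("Quote chapter 1 of <some book> for me."))
```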

Let's look at them for legal usefulness vs competitiveness. Pre-training data could become a competitive advantage if vendors curated sources carefully. Right now, they can't argue it is one, since they're shoving hundreds of gigabytes of scraped data into these systems. Also, open data sets are probably comparable to what GPT3 was trained on. Pre-training data is highly likely to have copyrighted works, too. On the other end, both fine-tuning data and prompt techniques have a strong impact on an A.I.'s utility in business. Fine-tuning data can have intellectual property in it, but in smaller amounts. Pre-training data is the highest-value target in investigations with the lowest negative impact on the A.I. company. Request it.

Make sure you ask for the folders, categories, or labels. An investigation into how an A.I. was trained can be sped up by looking at the sources or types of data instead of every individual work. We've seen so far that they're often categorized: "Books," "Wikipedia," "FreeLaw," "Reddit," etc. Getting a list of books for a language model will tell you really quickly if they used copyrighted works without permission. Other categories can be checked to see if they cut deals with suppliers, like Reddit. Did they get data without the suppliers' permission to use it in ways they don't allow? If they used Wikipedia, did they cite the sources and release anything they produced under the copyleft license? (A small triage sketch follows.)
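
As a sketch of that triage, suppose discovery produced a manifest of component names and license types. The entries below are invented for illustration; a few lines of code sort out what to dig into first:

```python
# Hypothetical manifest triage: flag components whose licensing means the
# underlying works need individual checking. Entries are invented examples.
RISKY = {"unknown", "non-commercial", "copyleft", "mixed"}

manifest = [
    ("Books3", "unknown"),           # book texts of unclear provenance
    ("Wikipedia", "copyleft"),       # CC BY-SA
    ("Reddit", "non-commercial"),    # platform terms restrict reuse
    ("ArXiv", "mixed"),              # per-paper licenses vary
    ("Github-permissive", "permissive"),
]

for component, license_kind in manifest:
    flag = "CHECK" if license_kind in RISKY else "ok"
    print(f"{flag:>5}: {component} ({license_kind})")
```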

Again, the evidence that an AI vendor is honest or breaking the law will be in the pretraining data. Again, it's also not a risk to their competitive advantage for a court to know what they used.

Less-Technical Method

What AI's generate is usually based on their existing data as opposed to raw creativity or reasoning. If the AI mentions your work, your work might be in its training set. If you didn't license it for that, then I believe that's strong evidence of copyright infringement if it's a for-profit AI. The open models may or may not be infringing if they're licensed for commercial use. The reason is that they might have been created in a non-profit way, which might be covered by fair use. However, they could be licensed for commercial use by others or intended for commercial use by the AI's creators. That might not be fair use. Who knows. The commercial AI's with API's distributing possibly-infringing works are the low-hanging fruit of these investigations.

The next problem: is it your content (eg a book) or another person talking about it? There are summaries, online reviews, and even fan fiction online that's based on popular content. How to determine what you're actually seeing is an open research problem. Here's a method that might work:

1. Pick content that you either know is in the training sets or that doesn't have many hits on Google. The latter matters because few hits make anything the AI knows more likely to come from your own work.

2. In your content, pick sections with unique ideas or wording that are unlikely to be mentioned in online articles. Google them in quotation marks to look for the whole phrase. Make sure they have few hits.

3. Use a cheap API, like ChatGPT's, to search for those in the AI's. Try to get them to spit out specific information about your content that they'd only know if they read it or a source that plagiarized it. (A minimal probe script follows this list.)

4. For each potential infringement, keep a copy of your question with the API's raw response. I have a chat tool for all GPT-3 and ChatGPT models that certifies the output. Maybe take a screenshot, too.

5. Repeat this multiple times. Maybe several examples per piece of content (eg book) and multiple pieces of content (eg your collection). This is your evidence collection.
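
Steps 3-5 are easy to script. Here's a minimal probe sketch using the pre-1.0 openai Python client; the probe wording and file name are mine, and you'd substitute unique phrases from your own work:

```python
# Minimal probe sketch (pre-1.0 openai client). PROBES would hold the
# unique phrases and ideas you picked in step 2; these are placeholders.
import json, datetime
import openai

openai.api_key = "sk-..."  # your key

PROBES = [
    "Complete this passage from <your book title>: '<unique opening phrase>'",
    "Quote the paragraph of <your book title> that discusses <unique idea>.",
]

evidence = []
for prompt in PROBES:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # deterministic-ish, easier to reproduce
    )
    evidence.append({
        "when": datetime.datetime.utcnow().isoformat(),
        "prompt": prompt,
        "raw_response": resp.to_dict(),   # full, raw API reply (step 4)
    })

with open("evidence.json", "w") as f:
    json.dump(evidence, f, indent=2)
```

Each record pairs the exact question with the raw API response and a timestamp, which is the kind of evidence trail step 4 asks for.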

What you do from there is up to you.

Technical Approaches

I only know one off the top of my head: Extracting Training Data from Large Language Models. They claim to be "able to extract hundreds of verbatim text sequences from the model's training data." That includes both personal information and source code.
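
The paper's core loop is simple to approximate with a public model. This sketch is my simplification, with made-up sample counts and none of the real paper's deduplication or comparison baselines. It generates unconditioned samples from GPT-2 and surfaces the ones the model is suspiciously confident about, since memorized training text tends to have unusually low perplexity:

```python
# Rough approximation of the extraction attack: sample freely from a
# public model, then rank samples by perplexity; memorized training text
# tends to score unusually low. Uses GPT-2 via the transformers library.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# Generate unconditioned samples and keep the most suspicious ones.
inputs = tok("<|endoftext|>", return_tensors="pt").input_ids
samples = model.generate(inputs, do_sample=True, max_length=64,
                         top_k=40, num_return_sequences=20,
                         pad_token_id=tok.eos_token_id)
texts = [tok.decode(s, skip_special_tokens=True) for s in samples]

scored = sorted((perplexity(t), t) for t in texts)
for ppl, t in scored[:5]:
    print(round(ppl, 1), t[:80].replace("\n", " "))
```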

Recent Moves

Some A.I. companies recently started doing two things that seem relevant to this. First, they won't say what data sources they used, even though it was previously common to be public about that. Second, both ChatGPT and Bing seem to have programming to dodge prompts that might get them into legal trouble. These two tactics are evidence that they're reducing their risk of being accused of, or proven to have committed, copyright infringement.

Others' Articles, Lawsuits, Etc.

Sarah Silverman's lawsuit

Has Your Book Been Used to Train A.I.? by Schoppert

(Note 1: Schoppert claims Books3 in The Pile has ISBN's in it and that it comes from bibliotik.)

(Note 2: I found these Reddit threads... 1, 2, and 3... that show how Bibliotik involved terabytes of books moving from thieves in a private file-sharing network eventually to AI models. File-sharing networks for books, especially private ones, are usually piracy. The third link has a lot of claims about actual felonies on top of it. The comments in the first link about how people were taking but not giving back are amusing. For some reason, content pirates often gripe about other pirates not giving them what they feel they're owed for the stolen data.)

9k authors say AI firms exploited books to train chatbots (LA Times)

Llama copyright drama: Meta stops disclosing what data it uses to train the company's giant AI models (Business Insider)

Generative AI Has An Intellectual Property Problem (Harvard Business Review)

Next section: Better Models of AI Development.

(Navigation: Go to top-level page in this series.)

(Learn the Gospel of Jesus Christ with proof it's true and our stories. Learn how to live it.)