Understanding ‘Why’ Training ChatGPT Transcends the Contours of Copyright



In a fantastic two-part post teasing apart how Large Language Models actually work, Shivam Kaushik gets down and dirty with the technical details of the algorithmic training process and explains why there are copyright implications beyond the ‘obvious’. In this first part, Shivam explains what each of the “G”, “P” and “T” in ChatGPT actually does, and then takes forward why this is relevant. Shivam is an LL.M. candidate at NUS Law specializing in IP and Tech Laws and a Research Assistant at the Lumens Machine Learning Project. He is interested in exploring the legal issues posed by emerging technologies. His previous posts can be accessed here.

Part I- Applying Natural Intelligence (NI) to Artificial Intelligence (AI): Understanding ‘why’ training ChatGPT transcends the contours of copyright

Shivam Kaushik

The hearing in ANI v. OpenAI is gathering steam before the Delhi High Court. The amici have made some really insightful and contentious arguments before the Court, and we are awaiting the arguments from the parties. In the many discussions I’ve read and heard surrounding the matter, though, it appears there is a lack of clarity within legal circles on the technical side and inner workings of Large Language Models (LLMs). Though much ink has been spilled on the issue, and a broad conceptual understanding is present, certain essential features of LLMs are often ignored. In a self-declared valiant attempt (it is the intent that counts, the results not so much!), this post tries to fill that void by taking a peek into the model caught in the eye of the storm: our lord and saviour, the multi (if not omni) potent ChatGPT!

As an initial disclaimer: the Court in the OpenAI case is examining 1) whether the ‘storage’ of the Plaintiff’s data by ChatGPT for ‘training’ amounts to infringement, and 2) whether the ‘generation’ of responses by ChatGPT for the user amounts to infringement. This post deals only with one aspect of training, called ‘pre-training’. After breaking down pre-training, I argue that this understanding lends itself to seeing the pre-training process as one that does not violate copyright. This should not be understood to mean that I am trying to draw an iron curtain between the training and output sides of LLMs. This is more a logistical bifurcation than a logical one, imposed to keep the post short(er) and sweet(er).

From a bird’s eye view, Large Language Models (LLMs) are the incarnation of something known as language modelling, which seeks to predict the next word in any given text using conditional generation. Conditional generation is the task of generating text conditioned on an input piece of text called a prompt. At the peril of oversimplification, it can be said that LLMs are Neural Network Models (NNMs) that use learned probabilities to complete this task. That is to say, instead of predicting one word with certainty, the model assigns a probability to every possible word (a probability distribution) that can succeed the given text. The NNM has multiple layers, each dedicated to a specific function. The input layer takes the input, the hidden layers (plural) perform the computation, and the output layer produces the probability distribution of the next predicted token. Here, it is crucial to understand the correlation between ‘prediction’ and ‘generation’. The model repeatedly predicts the next word, thereby generating a coherent sentence, paragraph or even longer text. So basically, what a language model does is fill in the blanks. Let us say someone enters the following prompt:

“Copyright law protects authors by granting them _________”

The model tries to predict the next word and ‘might’ output the following probabilities:

Next Word    Probability
exclusive    40%
rights       30%
control      15%
access       10%

This entire process can be seen in action in the Transluce Model Investigator, where, for each word in the output, you can see the candidate words the model considered and the probability assigned to each.
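For the technically curious, here is a minimal Python sketch of this idea: the model’s output layer produces a raw score (a ‘logit’) for every token in its vocabulary, and a function called softmax converts those scores into a probability distribution. The words and numbers below are invented for illustration and are not taken from any actual model:

```python
import numpy as np

# A made-up four-word 'vocabulary' with made-up raw scores (logits)
vocab  = ["exclusive", "rights", "control", "access"]
logits = np.array([2.0, 1.7, 1.0, 0.6])

# Softmax: exponentiate the scores and normalise so they sum to 1
probs = np.exp(logits) / np.exp(logits).sum()

for word, p in zip(vocab, probs):
    print(f"{word:>10}: {p:.0%}")   # prints roughly 42%, 31%, 16%, 10%

# To generate text, the model picks (or samples) a token from this
# distribution, appends it to the prompt, and repeats the prediction step:
# repeated prediction is what produces fluent generation.
```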

Deciphering the ‘GPT’

With that, let’s now dive into the specifics. First and foremost, it needs to be understood that ChatGPT is not an LLM; it is a chatbot based on an LLM like GPT-4, which is known as a foundation model. GPT stands for ‘Generative Pre-Trained Transformer’. Of these three words, the most important in terms of functionality is ‘transformer’, and I will deal with it first. A transformer is a model architecture that enables LLMs to work as we see them. This breakthrough architecture was introduced in 2017, when a group of scientists working at Google published a paper called ‘Attention is All You Need’. The importance of transformers can be inferred from the fact that before they came into being, OpenAI was working on things like a robot butler that could set and clear a table, robot arms that could solve a Rubik’s cube one-handed, and bots that could play Dota 2!! (hear it for yourself on this Bloomberg Podcast from 13:20 onwards). But this paper caught the attention of Ilya Sutskever, co-founder and chief scientist at OpenAI and the brain behind ChatGPT, which led OpenAI to reorient itself (sadly!). An AI robot butler would have been super-cool.

The transformer architecture introduced a concept called ‘attention’, which allows the algorithm to capture contextual information and understand which words influence which. The model assigns ‘attention scores’ to different words based on their importance in the sentence. For example, in the sentence:

Lokesh is a prolific academic. He has won several essay competitions.

If I ask who has won several essay competitions, the transformer model will return ‘Lokesh’ as the answer. It recognises that ‘Lokesh’ and ‘He’ are closely related despite not occurring in the same sentence. For a transformer-based LLM to work as intended, it has to be ‘pre-trained’. What I mean is that a chatbot like ChatGPT is not trained from scratch; instead, it is built on a pre-trained LLM. A pre-trained model has already been trained on a large dataset of books, articles, web pages, etc., to learn general language patterns, and can then be adapted to more specific tasks, such as powering a chatbot like ChatGPT. This adaptation is called ‘fine-tuning’, and uses reinforcement learning from human feedback (RLHF) to improve alignment with human preferences.
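To make ‘attention’ a little more concrete, here is a minimal, heavily simplified Python sketch of the scaled dot-product attention from ‘Attention is All You Need’. The token embeddings below are invented for illustration, and a real transformer uses separate learned projections (queries, keys and values) rather than the raw embeddings used here:

```python
import numpy as np

def self_attention(X):
    """Simplified scaled dot-product attention: softmax(X X^T / sqrt(d)) X.
    In a real transformer, Q, K and V are learned projections of X;
    here Q = K = V = X to keep the sketch minimal."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # token-to-token relevance
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax row by row
    return weights @ X, weights

# Hypothetical 4-dimensional embeddings for four tokens (values made up,
# with 'He' deliberately placed close to 'Lokesh' in the vector space)
tokens = ["Lokesh", "prolific", "academic", "He"]
X = np.array([[0.9, 0.1, 0.3, 0.8],
              [0.2, 0.7, 0.5, 0.1],
              [0.3, 0.6, 0.6, 0.2],
              [0.8, 0.2, 0.3, 0.9]])

_, weights = self_attention(X)
for tok, row in zip(tokens, np.round(weights, 2)):
    print(f"{tok:>10}: {row}")  # the 'He' row attends most to itself and 'Lokesh'
```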

Data, data everywhere, not a byte to bite

This is the first contentious part from the copyright perspective. The data that goes into training the model, and how the training is done, has enormous implications for the copyright infringement question. Back in the day when OpenAI used to be ‘open’ (pun intended), it published a paper in 2020 titled ‘Language Models are Few-Shot Learners’ giving the details of the training dataset for GPT-3 (p. 8 onwards). The training dataset consisted of the free and open Common Crawl dataset (a trillion words, terabytes of data), WebText (collected by OpenAI by scraping links), two internet-based books corpora, Books1 and Books2 (COPYRIGHT ALERT!!! These made up 16% of the training dataset, 50 billion words), and Wikipedia. There has been quite some discussion on the Books1 and Books2 datasets (here and here). In the Authors Guild copyright infringement suit brought against OpenAI, documents stated that OpenAI later deleted the Books1 and Books2 datasets. Currently, the website of the now not-so-OpenAI provides only the following information with respect to the data used for ‘teaching’ ChatGPT:

“What type of information is used to teach ChatGPT?

As noted above, ChatGPT and our other services are developed using (1) information that is ‘publicly available’ on the internet, (2) information that we partner with third parties to access, and (3) information that our users or human trainers and researchers provide or generate. This article focuses on the first set: information that is publicly available on the internet.

For this set of information, we only use publicly available information that is freely and openly available on the Internet – for example, we do not seek information that we know is behind paywalls or from the “dark web.” We apply filters and remove information that we do not want our models to learn from or output, such as hate speech, adult content, sites that primarily aggregate personal information, and spam. We then use the information to teach our models.”

One key takeaway of the discussion would be that ‘collection of data’ per se is not an integral part of training LLMs, as argued before the Court. Often, models are trained on datasets that are openly and freely available. This distinction is important from the Technological Protection Measures (TPM) perspective, which adds an extra layer of protection. Actively circumventing paywalls to collect information is a separate offence under Section 65A of the Copyright Act, so long as the purpose is violative of copyright; simply using openly available information is not. However, it is also important to highlight that the public availability of data does not mean that it is necessarily publici juris, i.e., it is not necessarily part of the public domain, free to be used by the public without the permission of the copyright owner.

“A Word is Characterised by the Company it Keeps”

Now comes one of the most consequential questions, on which a lot will turn in terms of copyright infringement: how is the model trained? It is common knowledge that text is broken down into tokens for training. It is often presumed that tokens are words. But that is not the case; it is a convenient lie. Tokens are frequently sub-words, as can be seen in the following sentence consisting of 7 words:

[Image: the sentence “Praharsh is a great editor at SpicyIP” (7 words, 37 characters), highlighted to show how it splits into 11 tokens.]

As a rule of thumb, a token is about 4 characters long, and roughly 75 words make up 100 tokens (you can play around with the GPT Tokenizer here). Once tokenization is done, each token is assigned a Token ID. So, in this pre-training phase, the algorithm ‘looks’ at the text and creates tokens based on it; instead of storing the tokens themselves, the model works with numeric IDs, which act as a bridge between tokens and vector representations (discussed a little later). For illustration, the following are the Token IDs for the above sentence:

[Image: the Token IDs for the sentence above: 88444, 8665, 1116, 382, 261, 2212, 9836, 540, 2856, 3869, 3799.]
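You can reproduce this yourself with OpenAI’s open-source tiktoken library, which exposes the tokenizers its models use. A minimal sketch (the exact token count and IDs depend on which encoding you pick; “o200k_base”, the encoding used by GPT-4o, is taken here as an example):

```python
# pip install tiktoken
import tiktoken

# Load one of the encodings tiktoken ships with
enc = tiktoken.get_encoding("o200k_base")

ids = enc.encode("Praharsh is a great editor at SpicyIP")
print(len(ids), ids)                    # number of tokens and their Token IDs
print([enc.decode([i]) for i in ids])   # the sub-word piece behind each ID
```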

Once tokenization is done and Token IDs are generated, tokens are converted into numerical vectors in a high-dimensional space (it can, for example, be 300-dimensional) called a vector space. This is where the ‘intelligence’ in LLMs comes from. This is where the magic happens. In 2013, a groundbreaking paper titled “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al. from Google introduced the Word2Vec technique. It demonstrated that word embeddings (vector representations of words) that the model learned from a massive dataset encoded syntactic and semantic information. To illustrate this, consider the 150-dimensional word embedding of the word ‘judge’:

[Image: a series of 150 numbers representing the 150-dimensional word embedding of ‘judge’.]

In the above image, what appears to be a string of meaningless random numbers (there are 150 entries, as the model was set to 150 dimensions) signifies dimensions in a vector space, each capturing a different aspect of the word. So, in the vector space, words like ‘justice’, ‘court’ and ‘law’ will have word embeddings close to ‘judge’. Theoretically, this was already understood in linguistics. It is widely known as the distributional hypothesis, according to which words appearing in similar contexts have similar meanings: “a word is characterised by the company it keeps”.

[Image: a chart showing the degree of similarity between the words ‘helicopter’, ‘drone’ and ‘rocket’ on a 3D axis of ‘sky’, ‘engine’ and ‘wings’.]

However, Mikolov et al. “somewhat surprisingly” found that the similarity goes beyond simple syntactic regularities. They found, for example, that if you subtract the vector for ‘Man’ from ‘King’ and add ‘Woman’, you get a vector close to ‘Queen’, i.e., Vector(King) - Vector(Man) + Vector(Woman) ≈ Vector(Queen). But a picture is worth a thousand words, so:

[Image: a chart plotting the vectors of ‘queen’, ‘king’, ‘woman’ and ‘man’, showing that the gaps between the corresponding pairs are similar.]
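The famous analogy can be tried out with the gensim library and Google’s original pretrained Word2Vec vectors; a sketch under the assumption that you are happy to download roughly 1.6 GB of vectors (“word2vec-google-news-300” is gensim’s identifier for that dataset):

```python
import gensim.downloader as api

# Google's original pretrained Word2Vec embeddings (large one-time download)
wv = api.load("word2vec-google-news-300")

# Vector(king) - Vector(man) + Vector(woman) -> nearest neighbours
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically comes out on top, just as Mikolov et al. reported.
```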

This mapping allows ChatGPT to understand words beyond keyword matching; it captures conceptual relationships. These conceptual relationships are what scholars colloquially refer to as the “statistical patterns” that the model learns during pre-training. Word similarity can now be computed by analysing the words’ vectors (something known as cosine similarity). Fascinating stuff, right? The next part will look into the copyright issues stemming from here, and in the meantime, thoughts, comments and corrections, if any, are welcome!
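For the curious, cosine similarity is simply the cosine of the angle between two vectors: close to 1 for vectors pointing the same way (similar meaning), close to 0 for unrelated ones. A minimal sketch with made-up 3-dimensional ‘embeddings’ (real models use hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: ~1 = similar, ~0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional 'embeddings' for illustration only
judge  = np.array([0.90, 0.80, 0.10])
court  = np.array([0.85, 0.75, 0.20])
banana = np.array([0.05, 0.10, 0.90])

print(cosine_similarity(judge, court))   # ~0.99 -> semantically close
print(cosine_similarity(judge, banana))  # ~0.20 -> unrelated
```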

[The author would like to thank Swaraj and Lokesh for their round-the-clock availability and comments on the draft]



