Copyright Infringement in AI Training Data Sets

The proliferation of AI programs has led to increased scrutiny of the information that is being used to train AI models. Specifically, there is growing concern over whether using copyrighted materials or other intellectual property to train AI models can give rise to copyright infringement or other intellectual property claims.

The question of whether the use of copyrighted works to train AI models is copyright infringement is not just a theoretical one. For example, on July 7, 2023, comedian and author Sarah Silverman (along with two other authors) filed lawsuits against OpenAI and Meta in federal court alleging copyright infringement. Late last year, a lawsuit was also filed against Microsoft's Copilot, an AI code-writing assistant.

These class action suits allege that OpenAI's ChatGPT and Meta's LLaMA were trained on illegally-acquired datasets containing the plaintiffs' works. The complaints further allege that OpenAI and Meta accessed these works via "shadow library" websites, noting the books are "available in bulk via torrent systems."

Silverman and the other authors claim that they "did not consent to the use of their copyrighted books as training material" for the companies' AI models. The lawsuits each contain six counts, alleging various copyright violations, negligence, unjust enrichment, and unfair competition. The plaintiffs seek restitution of profits as well as statutory and other damages.

The plaintiffs’ arguments rest on new legal theories, which is not surprising given the recency of the AI technology involved. These cases are among the earliest to test these theories, and it will be interesting to see how the courts evaluate and develop them going forward.

In the meantime, how can you avoid setting yourself up to be a defendant in one of the initial cases testing these theories? Training an AI large language model (LLM) requires hundreds of terabytes of training data. But as a developer, where do you get all this data? And once you've built your model, how can you be sure you have not unknowingly used copyrighted material?

The obvious first step is to ensure you have researched and verified the source of all your training data. This is easier said than done. Hyperscalers like Amazon or Microsoft train their own models with mountains of user data collected from their businesses. But for a start-up looking to train a new model, collecting a similar volume of data while dodging copyrighted material can seem impossible.

First, make sure you have the necessary permissions or licenses to access and use the datasets you have selected, that those licenses cover how you intend to use the data, and that you have established rules governing your collection and storage of user data.

Using only public domain or appropriately Creative Commons-licensed works can be helpful. Public domain works are not protected by copyright, so you can use them freely. Creative Commons-licensed works can also be free to use as long as you comply with the terms of the license (for example, you may have to credit the source of the data). Be careful, however, when dealing with Creative Commons licenses that only permit "non-commercial" uses.
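As a practical matter, this kind of license vetting can be automated. The sketch below is purely illustrative, assuming a hypothetical dataset whose entries carry SPDX-style license identifiers; the set of "allowed" licenses is an assumption you would adjust with counsel, not legal advice.

```python
# Hypothetical sketch: screening dataset entries by license before ingestion.
# The license identifiers and dataset structure are illustrative assumptions.

# SPDX-style identifiers for licenses assumed safe for commercial training.
ALLOWED_LICENSES = {
    "CC0-1.0",       # public domain dedication
    "CC-BY-4.0",     # attribution required
    "CC-BY-SA-4.0",  # attribution plus share-alike
}

def is_usable(entry: dict, commercial: bool = True) -> bool:
    """Return True if the entry's license permits the intended use."""
    license_id = entry.get("license", "")
    if not commercial:
        # A non-commercial project can also accept NC-restricted works.
        return license_id in ALLOWED_LICENSES or "-NC" in license_id
    # Commercial training: reject anything with a non-commercial clause.
    return license_id in ALLOWED_LICENSES

dataset = [
    {"title": "Essay A", "license": "CC-BY-4.0"},
    {"title": "Essay B", "license": "CC-BY-NC-4.0"},  # non-commercial only
    {"title": "Essay C", "license": "CC0-1.0"},
]

usable = [e["title"] for e in dataset if is_usable(e)]
```

Note that the non-commercial Essay B is dropped for commercial use but would pass with `commercial=False`; encoding that distinction in code makes the "be careful" caveat above enforceable at ingestion time.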

You may also consider using a smaller dataset to train your model, or fine-tuning an existing open-source alternative. This makes it easier to collect enough data and verify its origins. While such a model may have narrower applicability than a ChatGPT or Bard, it may offer greater reliability for a specific domain or industry.

Many programmers in the AI community see synthetic training data as the preferred option. Using synthetic data helps skirt many issues plaguing organic training data beyond copyright, such as reliability and bias. If you can synthesize data for a particular problem, it is possible to train models to a far higher degree of accuracy while avoiding copyright issues altogether. 
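To make the synthetic-data idea concrete, here is a minimal sketch that generates question/answer pairs for simple arithmetic. The task and record format are invented for illustration; the point is that every record is produced programmatically rather than scraped, so no copyrighted text enters the dataset.

```python
# Minimal sketch of synthesizing training data: arithmetic Q&A pairs.
# Generated records contain no third-party text, sidestepping copyright.
import random

def make_examples(n: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    examples = []
    for _ in range(n):
        a, b = rng.randint(1, 99), rng.randint(1, 99)
        examples.append({
            "prompt": f"What is {a} + {b}?",
            "completion": str(a + b),
        })
    return examples

data = make_examples(1000)
```

Because the generator also knows the correct answer, every label is guaranteed accurate, which is the reliability advantage noted above.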

Filtering your data to remove copyrighted content is another option. A number of filtering tools can screen training data for copyrighted material, which can help reduce the risk of copyright infringement claims.

A final solution, perhaps the most elegant one, is for AI researchers to simply create databases where there is no possibility of copyright infringement, either because the material has been properly licensed or because it has been created for the specific purpose of AI training. One such example is "The Stack," a dataset for training AI designed specifically to avoid accusations of copyright infringement. It includes only code with the most permissive open-source licensing and offers developers an easy way to remove their data on request. Its creators say their approach could be adopted throughout the industry.

In the end, especially with the high-profile Silverman cases, it is becoming clear that the risks of using copyrighted works to train AI are real. Developers who apply careful thought and process to building the AI models we call smart can feel even smarter for keeping their training data free of infringing material.
