Using Copyrighted Works in AI Training Data May Infringe Even if the AI Output Doesn’t

David Rabinowitz & Milton Springut, Partners, Moses & Singer LLP

17 Jan 2024

Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc., decided in the federal District Court in Delaware on September 25, 2023, asks whether a company can train its AI on a competitor’s copyrighted works in order to help it compete. What if the AI output does not infringe the competitor’s copyrights?

The answer so far is “maybe,” and the fact that the court did not rule out infringement adds to the hazards of using copyrighted works as training data.

Thomson Reuters v. Ross is a trial court decision denying summary judgment on the points at issue, and is therefore not a definitive declaration of law. However, by denying summary judgment on copyright infringement to the AI builder and user, the decision opens the door to the kind of lengthy, expensive and uncertain litigation that could deter builders and users of AI from using copyrighted works as training data.

Ross’s AI

Thomson Reuters owns Westlaw, the computer-era descendant of West Publishing Company, which has been publishing judicial decisions and much other legal matter since 1872. Ross Intelligence is a new legal research company seeking to create a natural language search engine, using machine learning and artificial intelligence to “avoid human intermediated materials.”

Ross poses a competitive threat to Westlaw because Ross proposes to eliminate the need for human commentary on cases. Westlaw creates exactly such commentary in the headnotes that it writes for each case. Westlaw’s headnotes are organized by its key number system into categories and subcategories of legal issues, enabling systematic legal research. If Ross is successful, however, users will be able to enter ordinary English questions into Ross’s AI search engine and receive back relevant quotations from judicial opinions. Ross’s system would thereby find case authority directly, bypassing Westlaw’s headnotes and possibly eliminating the need for Westlaw’s entire key number system.

To train its AI, Ross, through a subcontractor, created and used about 25,000 legal questions and answers. The questions were meant to be those that a lawyer would ask, and the answers were direct quotations from legal opinions.

Westlaw, however, alleges that the “created” legal questions are actually nothing more than Westlaw headnotes with a question mark appended. If true, Ross’s training data could include or constitute infringing copies of Westlaw’s headnotes.

Does Copying Copyrighted Works for Use as Training Data Constitute Infringement?

Westlaw moved for summary judgment on Ross’s liability for copyright infringement, claiming that 2,830 of the Ross “created” legal questions infringed Westlaw’s headnotes. (Westlaw claimed that many more such questions infringed, but made its motion on only 2,830 of them.) Westlaw alleged that reproducing copyrighted matter for AI training data constitutes copyright infringement. Westlaw argued that Ross merely translated the headnotes into numerical data and that translation is a “paradigmatic derivative work.”

Copying alone constitutes copyright infringement. 17 U.S.C. § 106(1). However, the doctrine of fair use sometimes allows copying. Broadly speaking, copying copyrighted works for training data is fair use if the ultimate use of the training data is a sufficiently transformative use that does not invade the natural copyright market of the author. To judge whether copying works for training data is a fair use, we must turn to the output.

The Google Books case is an example where copying copyrighted works into a computer constituted fair use. In Authors Guild, Inc. v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), the Court of Appeals held that Google’s copying of millions of books, many still in copyright, was fair use. Once the books were on Google’s servers, users could search the books and, for books still in copyright, see tiny snippets of the books. This function enabled users to find books of interest but, in the court’s view, did not supplant the market for the books. The court said, “Google’s making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about Plaintiffs’ books without providing the public with a substantial substitute for matter protected by the Plaintiffs’ copyright interests in the original works or derivatives of them.” The Supreme Court declined to review the ruling. 578 U.S. 15 (2016).

What makes Thomson Reuters v. Ross particularly interesting is that Ross says that Westlaw’s headnotes, even if copied, disappear into Ross’s AI, never to be seen again, even as snippets. Ross’s AI outputs only excerpts from judicial opinions, which are in the public domain.

Yet, the court rejected Ross’s attempt to dismiss the copyright claims. The court held that the issue of fair use presented fact questions sufficiently unclear to require a trial. Why?

Intermediate Copying and Fair Use

Judge Bibas, a judge on the Third Circuit Court of Appeals sitting by designation in the District of Delaware, applied the common fair use test, balancing the transformativeness of the use against the commercial nature of the use. Noting that Ross’s use was commercial and, indeed, intended to compete with Westlaw, Judge Bibas looked to previous cases involving “intermediate” computer copying and competition. He said,

The idea is that the artificial intelligence will be able to recognize patterns in the question-answer pairs. It can then use those patterns to find answers not just to the exact questions fed into it, but to all sorts of legal questions users might ask.

Ross says that the caselaw on “intermediate copying” most appropriately reflects its use. In those cases, the users copied material to discover unprotectable information or as a minor step towards developing an entirely new product. So the final output—despite using copied material as an input—was transformative. In Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992), the defendant copied Sega’s copyrighted software. But it did so only to figure out the functional requirements to make games compatible with Sega’s gaming console. Id. at 1522. That functional information was unprotected, so the copying was fair use. Id. at 1522-23.

Similarly, in Sony Computer Entertainment Inc. v. Connectix Corp., 203 F.3d 596 (9th Cir. 2000), the defendant used a copy of Sony’s software to reverse engineer it and create a new gaming platform on which users could play games designed for Sony’s gaming system. Id. at 601. The court concluded that this was fair use for two reasons: the defendant created “a wholly new product, notwithstanding the similarity of uses and functions” between it and Sony’s system, and the “final product [did] not itself contain infringing material.” Id. at 606. The Supreme Court has cited these intermediate copying cases favorably, particularly in the context of “adapt[ing] the doctrine of fair use . . . in light of rapid technological change.” (emphasis added)

The competing products in the two cases discussed in the quotation above, which were made using intermediate copying, are at least superficially similar to Ross’s AI as Ross describes it. In both cases, the defendant copied the plaintiff’s software to create a product that competed with the plaintiff – in one case games that competed with Sega’s games on Sega’s console, in the other case a competing gaming platform where Sony’s games could be played. If Ross’s description of its AI is accurate, why would the Ross case be different?

Westlaw said that Ross used the untransformed text of headnotes to get its AI to replicate and reproduce the creative drafting done by Westlaw’s attorney-editors. Judge Bibas cited Westlaw’s argument in refusing summary judgment:

It was transformative intermediate copying if Ross’s AI only studied the language patterns in the headnotes to learn how to produce judicial opinion quotes. But if Thomson Reuters is right that Ross used the untransformed text of headnotes to get its AI to replicate and reproduce the creative drafting done by Westlaw’s attorney-editors, then Ross’s comparisons to cases like Sega and Sony are not apt. Again, this is a material question of fact that the jury needs to decide.

(emphasis added)

Yet, Westlaw did not argue and Judge Bibas did not find that Ross’s AI was going to output any of Westlaw’s “creative drafting” in the form of text generated by the AI. Westlaw did not deny and Judge Bibas did not find that Ross’s AI was going to output anything other than excerpts from judicial opinions.

It appears that Westlaw and Judge Bibas were referring to the incorporation of that creative drafting in Ross’s AI itself when they spoke of the AI replicating and reproducing Westlaw’s creative drafting. That is, Ross’s AI was a kind of copy of Westlaw’s “creative drafting” and its output would be a product of that copy.

The question thereby raised is reminiscent of White-Smith Music Pub. Co. v. Apollo Co., 209 U.S. 1 (1908), in which the Supreme Court was faced with another new technology: piano rolls for player pianos. The question there was whether piano rolls, scrolls with holes punched in them that would be read by player pianos which would play music, were “copies” of the copyrighted music. The Supreme Court said no, being unable to imagine that this medium, unreadable by humans, constituted a copy, although the piano rolls contained all of the information present in sheet music:

These perforated rolls are parts of a machine which, when duly applied and properly operated in connection with the mechanism to which they are adapted, produce musical tones in harmonious combination. But we cannot think that they are copies within the meaning of the Copyright Act.

Id. at 18.

Justice Holmes, concurring specially, saw past the trouble that the majority had with the mechanical aspects of the new medium, saying, “[o]n principle, anything that mechanically reproduces that collocation of sounds ought to be held a copy …” Id. at 20.

Judge Bibas did not expand on why he thought Ross’s AI might somehow be replicating and reproducing Westlaw’s “creative drafting.” But just as unreadable piano rolls might be considered copies of musical compositions because when accessed they reproduce the sounds of the music, so might a court find that unreadable AI is no less a copy of the creative thinking in its training data when it reproduces that creative thinking in the form of selected excerpts of judicial opinions.

Because this decision was a denial of summary judgment, leaving further development of this idea and the relevant facts for trial, it is unwise to read too much into it. But the court seems to be suggesting that Ross’s AI itself can invade Westlaw’s rightful copyright market in a fair use analysis, even if its output doesn’t contain Westlaw’s copyrighted headnotes. That is, the array of Westlaw copyrighted headnotes resides as a sort of copy in Ross’s AI, ready to take the inquiring lawyer to judicial opinions the same way that Westlaw’s key number system does.

“The Turk” and Frankenstein

In 1770, Wolfgang von Kempelen built “The Turk.” The Turk was a chess-playing machine. For 84 years it played strong chess until it was destroyed by fire. Of course, such a machine was beyond the technology of 1770; The Turk actually had a cleverly concealed chamber for a human chess player who made the moves.

Fast forward to 2023. Judge Bibas apparently envisions a real mechanical legal researcher residing in the AI machine, mimicking Westlaw’s human legal researchers and copying their creative thinking. His conclusion is that this kind of intermediate copying, which produces something that competes in this way with the copyright owner, is different from the Sega and Sony situations. Possibly the distinction is that the Ross AI contains and uses a kind of copy of Westlaw’s headnotes. Perhaps the Thomson Reuters v. Ross trial will explore (1) in what form, if at all, training data like case headnotes live on as a recognizable copy within AI, and (2) if they do, whether that constitutes copyright infringement.

If Westlaw wins after trial, the rationale will be of great interest to the AI industry. If the reason is that fair use doesn’t allow AI to use Westlaw’s copyrightable creativity to provide essentially the same service as Westlaw – connecting lawyers to relevant cases – and thereby to compete directly with Westlaw, the result will be of limited impact. Ross, after all, allegedly took most or all of its training data from one source – Westlaw – with which it will compete directly.

On the other hand, if Judge Bibas finds that Ross created an AI Frankenstein with a Westlaw brain, that is, that the AI itself is a species of forbidden copy, the door will be open to a wide-scale limitation of AI under copyright. The solution to this problem for AI developers might be simply to draw training data from many sources rather than just one. Doing so might result in an AI brain that doesn’t imitate any single source. Or it might result in an AI that imitates many sources and thereby infringes many sources.

In any event, this case is just one of many cases working their way through the judicial system that could leave a big mark on the artificial intelligence field.

About

David Rabinowitz & Milton Springut are partners at Moses & Singer LLP, a firm that believes in investing heavily in understanding its clients’ businesses and developing close working relationships with them. David focuses on financial industry litigation, including corporate trusts and letters of credit, as well as trusts and estates, intellectual property, contracts and employment. Milton focuses on intellectual property litigation and counseling. He litigates and prosecutes patents in the scientific disciplines, including electrical and electronic systems, computer hardware and software, and business systems.

This article is intended for informational purposes only, and doesn't constitute tax, accounting, or legal advice. Everyone's situation is different! For advice in light of your unique circumstances, consult a tax advisor, accountant, or lawyer.
