Sourcing: A Thorny Problem for AI

By Jim Hamilton on April 12, 2024

The term “generative artificial intelligence” describes technology that creates text, images, or other data in response to prompts. Building these systems requires extensive computing power to process immense amounts of training data, which the resulting model then draws on to respond to a given prompt. Businesses are now figuring out how to apply generative artificial intelligence (AI) across a range of functions including art & design, cloud computing, customer service, data management & processing, entertainment, fashion, finance, healthcare, product design, sales & marketing, software development, and writing.

Potential Copyright Issues

AI image generators have run into trouble for borrowing from existing images found in the large databases of content they draw on. Key to this problem (which relates as much to text as to images) is that there is rarely any indication of what sources were used to produce the resulting AI images. What if clearly identifiable images are co-opted for the purpose of responding to a generative AI prompt? Do the copyright holders have any recourse?

A deep-learning, text-to-image model called Stable Diffusion is at the heart of some of the controversy surrounding image copyrights. First released in 2022, this model generates detailed (though relatively low-resolution) images called “synthographs” based on a text prompt. The start-up company behind Stable Diffusion, Stability AI, has run into legal issues for a variety of reasons, including:

Exposure of private or sensitive information gleaned from the data that the Stable Diffusion model was trained on.
Unauthorized use of copyrighted images, personal likenesses, or commercial brands.
Creation of violent or sexually explicit imagery, including depictions of underage individuals.

Lawsuits from artists, photographers, and image licensors such as Getty Images are ongoing as a result.

The Importance of Legitimate Sourcing

The lack of clear sourcing for AI-developed text and images poses significant challenges regarding copyright as well as credibility. Any college student writing a term paper knows very well that all of their sources must be accurately cited. A well-known example of what can happen when sources go unchecked comes from the legal field. In a 2023 US court case against Colombia-based Avianca Airlines, the lawyer for the injured plaintiff used the AI system ChatGPT to conduct research and ended up citing bogus cases to show precedent. Unfortunately, these cases did not exist; they were made up by ChatGPT. The lawyer believed that the cases provided in response to a ChatGPT prompt were real. He was under the impression that ChatGPT was a search tool and could be counted on to source actual legal cases.

This legal case gets to the core of the sourcing issue and hints at how AI tools could legitimately be used. What if the lawyer in this case had been using an AI system that drew its sources only from a legal publisher’s library of documents? Assuming that the AI tool was operating correctly, the sources would be legitimate.
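As a rough illustration of that idea, the sketch below checks every citation in a draft against a trusted library before anything is filed. It is a minimal sketch under stated assumptions, not a description of any real legal research product: the library contents, the case names, and the function name `check_citations` are all hypothetical placeholders.

```python
# Hypothetical curated library: citation text -> internal document ID.
# The entries are invented placeholders, not real court cases.
TRUSTED_LIBRARY = {
    "Example v. Sample Airlines (2019)": "doc-001",
    "Placeholder v. Demo Corp. (2021)": "doc-002",
}


def check_citations(citations, library=TRUSTED_LIBRARY):
    """Split citations into those found in the library and those that are not."""
    verified = [c for c in citations if c in library]
    unverified = [c for c in citations if c not in library]
    return verified, unverified


# Citations as they might appear in an AI-assisted draft (also hypothetical).
draft_citations = [
    "Example v. Sample Airlines (2019)",
    "Invented v. Nonexistent Air (2008)",
]

verified, unverified = check_citations(draft_citations)
print("Verified:", verified)
print("Needs manual review:", unverified)
```

An exact-match lookup like this is deliberately strict: anything the library cannot confirm is flagged for a person to review rather than assumed to be real.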

The Possibility for Bias

When AI tools draw on broad databases of text and images, is there a chance that those databases include some inherent bias or overlook some key factors? For example, you might ask an AI tool for information on how to price your products, but what if the underlying data stopped at a certain date? Would you want to base your pricing strategy on 20th-century data, or even on data that only went up to 2020?

In a similar sense, gender and racial biases are likely to be present in many databases and will, therefore, be reflected in the responses from generative AI tools. Non-curated databases may include disturbing, pornographic, or violent imagery that unsuspecting users may draw out unwittingly through their prompts. Such content might even be drawn out intentionally.

An Opportunity for Print Service Providers

Print service providers (PSPs) that have control over libraries of documents or images have an opportunity to help their clients as the use of AI technologies expands. For example, publishing in-plants or commercial printers that are responsible for the brand materials of their client companies are in a position to avoid the pitfalls of AI tools that draw from non-curated databases. In this case, the source material is curated.

If you’re a business executive, you’ve probably received promotional e-mails from companies that promise to generate content in your voice, tone, and style based on the library of documents that you have about your company, products, and services. This is a promising use of tools like ChatGPT because the library of content they are drawing from is, in a sense, pre-approved. If the information within your database is accurate, AI can generate accurate iterations. If there are errors in the content, however, those errors have the potential to proliferate.
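To make the "pre-approved library" idea concrete, here is a minimal sketch of a lookup that returns only approved content and always reports which document it came from. It assumes a simple word-overlap match rather than the more sophisticated search a commercial tool would use, and the library entries, source IDs, and function names are hypothetical.

```python
import re

# Hypothetical approved content library: source ID -> approved text.
APPROVED_LIBRARY = {
    "brand-guide-2024": "Our logo must always appear in blue on a white background.",
    "product-sheet-a": "The A-series press prints up to 100 pages per minute.",
}


def tokenize(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, library=APPROVED_LIBRARY, min_overlap=2):
    """Return (source_id, text) pairs that share at least min_overlap words
    with the query, best match first. Returns an empty list if nothing
    in the approved library matches well enough."""
    query_words = tokenize(query)
    scored = []
    for source_id, text in library.items():
        overlap = len(query_words & tokenize(text))
        if overlap >= min_overlap:
            scored.append((overlap, source_id, text))
    # Highest overlap first; ties broken alphabetically by source ID.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(source_id, text) for _, source_id, text in scored]


results = retrieve("How many pages per minute does the press print?")
for source_id, text in results:
    print(f"[{source_id}] {text}")
```

Because every result carries its source ID, a reviewer can trace any generated statement back to an approved document. The same traceability also makes it easier to find and correct an error in the library before it proliferates.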

The Bottom Line

Generative AI brings up moral issues surrounding copyright, bias, and fairness that PSPs cannot sweep under the rug. One way to avoid copyright issues is to be sure that the AI system clearly identifies its sources and draws from approved materials. These could be a legal or other publishing database, approved marketing materials from a brand owner, or properly licensed content. Until AI systems develop a more transparent method of identifying their source materials, PSPs should be cautious about relying too heavily on these tools. PSPs do, however, have an opportunity to leverage AI when they manage an appropriate content database.

Source: Jim Hamilton, Consultant Emeritus at Keypoint Intelligence

Author bio: Jim Hamilton of Green Harbor Publications is an industry analyst, market researcher, writer, and public speaker. For many years, he was Group Director in charge of Keypoint Intelligence’s (formerly InfoTrends’) Production Digital Printing & Publishing consulting services. He has a BA in German from Amherst College and a Master’s in Printing Technology from the Rochester Institute of Technology.
