Artist finds private medical record photos in popular AI training data set


Censored medical images found in the LAION-5B dataset used to train AI. Black bars and distortion have been added. (Credit: Ars Technica)

Late last week, a California-based artist who goes by the name Lapine discovered that private medical record photos taken by her doctor in 2013 were referenced in the LAION-5B image dataset, a scrape of publicly available images from the web. AI researchers download a subset of that data to train AI image synthesis models such as Stable Diffusion and Google Imagen.

Lapine discovered her medical images on a site called Have I Been Trained, which lets artists see whether their work appears in the LAION-5B dataset. Rather than performing a text search, Lapine uploaded a recent photo of herself using the site's reverse image search feature. She was surprised to find a set of two before-and-after medical photos of her face, which her doctor had been authorized to use only privately, as reflected in the consent form Lapine signed and also provided to Ars.

Lapine has an inherited condition called dyskeratosis congenita. "It affects everything from my skin to my bones and teeth," Lapine told Ars Technica in an interview. "In 2013, I underwent a small set of procedures to restore facial features after having been through several rounds of oral and maxillofacial surgeries. These pictures are from my last set of procedures with this surgeon."

The surgeon who was in possession of the medical images died of cancer in 2018, according to Lapine, and she suspects the images somehow left his practice after that. "It's the digital equivalent of receiving stolen property," says Lapine. "Someone stole the image from my deceased doctor's files and it ended up somewhere online, and then it was scraped into this dataset."

Lapine prefers to remain anonymous for medical privacy reasons. With records and images provided by Lapine, Ars confirmed that her medical images are referenced in the LAION dataset. While searching for Lapine's images, we also discovered thousands of other patients' medical record photos in the dataset, each of which may have a similarly questionable ethical or legal status, and many of which have likely been incorporated into popular image synthesis models that companies like Midjourney and Stability AI offer as a commercial service.

That's not to say that anyone can suddenly create an AI version of Lapine's face (as the technology stands at the moment), and her name is not associated with the images, but it bothers her that private medical photos have been baked into a product without any form of consent or recourse to remove them. "It's bad enough to have a photo leaked, but now it's part of a product," says Lapine. "And this goes for anyone's photos, medical record or not. And the future abuse potential is really high."

Who watches the watchers?

LAION describes itself as a nonprofit organization with members worldwide, "aiming to make large-scale machine learning models, datasets and related code available to the general public." Its data can be used in a variety of projects, from facial recognition to computer vision to image synthesis.

For example, after an AI training process, some of the images in the LAION dataset became the basis of Stable Diffusion's amazing ability to generate images from text descriptions. Since LAION is a set of URLs pointing to images on the web, LAION does not host the images themselves. Instead, LAION says that researchers must download the images from various locations when they want to use them in a project.

The LAION dataset is full of potentially sensitive images collected from the Internet, such as these, which are now being integrated into commercial machine learning products. Black bars have been added by Ars for privacy purposes. (Credit: Ars Technica)

In this way, responsibility for a particular image's inclusion in the LAION set becomes a game of buck-passing. A friend of Lapine's posed an open question on LAION's #safety-and-privacy Discord channel last Friday, asking how to remove her images from the set. "The best way to remove an image from the Internet is to ask the hosting website to stop hosting it," replied LAION engineer Romain Beaumont. "We are not hosting any of these images."

In the US, scraping publicly available data from the Internet appears to be legal, as the results of a 2019 court case affirm. Is it mostly the deceased doctor's fault, then? Or the fault of the site that hosts Lapine's illicit images on the web?

Ars contacted LAION for comment on these questions but did not receive a response by press time. LAION's website does provide a form where EU citizens can request that information be removed from its database to comply with the EU's GDPR laws, but only if a photo of a person is associated with a name in the image's metadata. Thanks to services such as PimEyes, however, it has become trivial to associate someone's face with a name through other means.

Ultimately, Lapine understands how the chain of custody over her private images failed, but she still would like to see her images removed from the LAION dataset. "I would like to have a way for anyone to ask to have their image removed from the dataset without sacrificing personal information. Just because they scraped it from the web doesn't mean it was supposed to be public information, or even on the web at all."
