OpenAI's models were trained on ebooks from a private ebook torrent tracker, leeched en masse during a free leech event by people who hated private torrent trackers and wanted to destroy their "economy."
The books were all in epub format, converted, cleaned to plain text, and hosted on a public data hoarder site.
The NYT claims that OpenAI trained on their material. They allege copyright infringement, although I think another argument might be breach of TOS in scraping the material from their website or archive.
The complaint references some of the other training material used by OpenAI, but I didn't dig deeply into what all of it was: