Harvard open-sources AI training dataset 'Institutional Books 1.0', covering 983,000 books in its collection

With the support of Microsoft and OpenAI, the Harvard Law School Library has officially open-sourced its first open AI training dataset, "Institutional Books 1.0". The dataset reportedly contains 983,000 books from the Harvard University collection, covers 245 languages, and totals 242 billion tokens. The project address is https://huggingface.co/datasets/institutional/institutional-books-1.0.

According to the report, about 40% of the books in the dataset are in English. The dataset covers books published in the 19th and 20th centuries and is divided into 20 topics. It also provides complete metadata for each book, including "author, year of publication, language, and original source".

According to the Harvard Law School Library, the researchers will continue to expand the dataset, and members of the project team are already working with the Boston Public Library to digitize "millions" of historical newspapers to add to it.

The Harvard Law School Library also plans to develop a series of AI tools to make organizing and opening up its collections more efficient and to promote "responsible data use practices".
