With the support of Microsoft and OpenAI, the Harvard Law School Library has officially open-sourced its first open AI training dataset, "Institutional Books 1.0". The dataset reportedly contains 983,000 books from the Harvard University collection, covers 245 languages, and totals 242 billion tokens. The project is available on Hugging Face (https://huggingface.co/datasets/institutional/institutional-books-1.0).

According to the report, 40% of the books in the dataset are in English, the books date from the 19th and 20th centuries, and they are divided into a total of 20 topics. The dataset also provides complete metadata for each book, including information on "author, year of publication, language, and original source".
According to the Harvard Law School Library, the researchers will continue to expand the dataset, and members of the project team are already working with the Boston Public Library to digitize "millions" of historical newspapers to add to it.
Going forward, the Harvard Law School Library also plans to develop a series of AI tools to make organizing and opening up its collections more efficient and to promote "responsible data use practices."