Dataset

10 trillion tokens! NVIDIA contributes the world's largest open-source dataset and launches four open-source AI models

On January 6, in his keynote at CES 2026, NVIDIA CEO Jensen Huang announced a large-scale expansion of the company's open-source model library, with new models and datasets covering four main areas (language, robotics, autonomous driving, and medicine) to further accelerate AI innovation across industries. NVIDIA contributed an open-source training framework and the world's largest open multimodal dataset, which includes 10 trillion language training tokens, 500,000 robot trajectories, 455,000 protein structures, and 100 TB of vehicle sensor data.
Information
- 1.9k
1/6
Harvard open-sources AI training dataset 'Institutional Books 1.0', covering 983,000 books in its collection

With the support of Microsoft and OpenAI, the Harvard Law School Library last week officially open-sourced its first open dataset for AI training, "Institutional Books 1.0." The dataset is said to contain 983,000 books from Harvard's collection, covering 245 languages and a total of 242 billion tokens. 1AI has attached the project address (https://huggingface.co/datasets/institutional/institutional-). ...
Information
- 4.3k
25/6/17
Yandex releases Yambda, the largest open source dataset for music recommendations

Russian search engine giant Yandex released Yambda on May 30. Billed as the world's largest open-source dataset for music recommendation systems, it contains 4.79 billion anonymized user interactions and is designed to help developers build smart music services. Yandex collected the data from nearly 28 million monthly Yandex Music users over a ten-month period. The interactions cover 9.39 million songs and include key listener feedback on whether a song was liked or disliked, and every interaction is timestamped for greater accuracy. Ya...
Information
- 3.7k
25/5/31
Chinese Internet Corpus AI Resource Platform Released: 27 Datasets, Total 2.7T

January 11: The China Association for Cyberspace Security issued a notice on January 9 releasing the Chinese Internet corpus resource platform to the public. The platform supports classification by labels such as industry sector, content modality, and data volume, making the corpora easy for users to download and use. The Association said that, under the guidance of the Central Internet Information Office and together with the National Internet Emergency Response Center, it built on the previously released Chinese Internet Basic Corpus 1.0 and on the corpus co-construction and sharing mechanism established by its ad hoc committee to bring together a batch of new high-quality, trustworthy data. After a series of rigorous and detailed processing steps, such as source screening, content filtering, and data deduplication, the...
Information
- 6.3k
25/1/11
China's first general-purpose embodied intelligent robot dataset released, covering more than two hundred tasks across multiple types of scenarios

According to CCTV News, the National and Local Co-construction of Embodied Intelligent Robotics Innovation Center and the School of Computer Science of Peking University recently launched China's first general-purpose open-source dataset for training embodied intelligent robots. The dataset collects data from multiple forms of robot body and covers more than two hundred different tasks across multiple types of scenarios. At the innovation center's robot data collection site, reporters saw engineers operate a robotic arm to capture, in a virtual world, motion data of the robot completing actions. Using remote control equipment, the robot can learn movements and grasping; nearby, an engineer wearing a full-body motion capture suit can teach...
Headlines
- 4.8k
25/1/4
World's First: AgiBot Announces Open-Source AgiBot World Million-Scale Real-Robot Dataset, Dramatically Outperforms Google's Open X-Embodiment

December 30, 2024 - AgiBot today announced the launch of AgiBot World, which it describes as the world's first open-source million-scale real-robot dataset built on real-world scenarios, an all-purpose hardware platform, and end-to-end quality control. AgiBot said, "This milestone open-source project marks the arrival of the 'ImageNet moment' in the field of embodied intelligence." AgiBot will upload the data in batches to HuggingFace, GitHub, and agibot-world.com as planned, with the following addresses: Huggin...
Information
- 7.3k
24/12/30
Harvard, Google release 1 million public domain books to provide legitimate data for AI training

December 13, 2024 - Harvard University and Google announced the joint release of 1 million public domain books as an AI training dataset, TechCrunch reported on December 12. The data required for AI training is costly, which makes it accessible mainly to well-funded tech companies. In response, Harvard plans to release a dataset of about 1 million public domain books covering a wide range of genres, languages, and authors, including classic authors such as Dickens, Dante, and Shakespeare that are no longer under copyright, due to the fact that the copyrights on these works...
Information
- 7.8k
24/12/13
Wuhan University and China Mobile's Jiutian AI team jointly open-sourced the audio and video speaker recognition dataset VoxBlink2

Wuhan University, China Mobile's Jiutian AI team, and Duke Kunshan University have jointly released VoxBlink2, an open-source audio and video speaker recognition dataset of more than 110,000 hours based on YouTube data. The dataset contains 9,904,382 high-quality audio clips and their corresponding video clips from 111,284 users on YouTube. It is currently the largest publicly available audio and video speaker recognition dataset. The release of the dataset aims to enrich the open-source speech corpus and support the training of large voiceprint models. The VoxBlink2 dataset is mined through the following steps: Candidate…
Information
- 12k
24/7/26
The world's largest oracle bone inscription dataset is open source

The "Digital Oracle Bone Co-creation Center" today officially opened the world's largest multimodal dataset of oracle bone inscriptions. It contains a total of 10,000 oracle bone rubbings and hand copies, together with the positions of the individual characters, their corresponding head characters, their interpretations, and the grouping and reading order of the inscriptions. It is reported that all researchers can use the dataset to develop algorithms for oracle bone character detection, recognition, copy generation, glyph matching, and interpretation, accelerating the application of AI to oracle bone research. The Digital Oracle Bone Co-creation Center is composed of the Ministry of Education Oracle Bone Information Processing Laboratory of Anyang Normal University, Tencent SSV Digital Culture Laboratory, Tencent Youtu Laboratory, and the Chinese Academy of Social Sciences Oracle Bone Research Center.
Information
- 11.6k
24/7/6
