{"id":16485,"date":"2024-07-26T09:36:32","date_gmt":"2024-07-26T01:36:32","guid":{"rendered":"https:\/\/www.1ai.net\/?p=16485"},"modified":"2024-07-26T09:36:32","modified_gmt":"2024-07-26T01:36:32","slug":"%e6%ad%a6%e6%b1%89%e5%a4%a7%e5%ad%a6%e8%81%94%e5%90%88%e4%b8%ad%e5%9b%bd%e7%a7%bb%e5%8a%a8%e4%b9%9d%e5%a4%a9%e4%ba%ba%e5%b7%a5%e6%99%ba%e8%83%bd%e5%9b%a2%e9%98%9f%e5%bc%80%e6%ba%90%e9%9f%b3%e8%a7%86","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/16485.html","title":{"rendered":"Wuhan University and China Mobile&#039;s Jiutian AI team jointly open-sourced the audio and video speaker recognition dataset VoxBlink2"},"content":{"rendered":"<p data-pm-slice=\"0 0 []\"><a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%ad%a6%e6%b1%89%e5%a4%a7%e5%ad%a6\" title=\"Look at the article with the tag\" target=\"_blank\" >Wuhan University<\/a>joint<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e4%b8%ad%e5%9b%bd%e7%a7%bb%e5%8a%a8\" title=\"[See articles with [China Moves] labels]\" target=\"_blank\" >China Mobile<\/a>The Jiutian AI team and Duke Kunshan University used YouTube data to<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%bc%80%e6%ba%90\" title=\"[View articles tagged with [open source]]\" target=\"_blank\" >Open Source<\/a>More than 110,000 hours of audio and video speaker recognition<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%95%b0%e6%8d%ae%e9%9b%86\" title=\"[See articles with [data set] labels]\" target=\"_blank\" >Dataset<\/a><a href=\"https:\/\/www.1ai.net\/en\/tag\/voxblink2\" title=\"_Other Organiser\" target=\"_blank\" >VoxBlink2<\/a>The dataset contains 9,904,382 high-quality audio clips and their corresponding video clips from 111,284 users on YouTube. It is currently the largest publicly available audio and video speaker recognition dataset. 
The release of the dataset aims to enrich open-source speech corpora and support the training of large voiceprint models.<\/p>\n<div class=\"pgc-img\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-16486\" title=\"get-823\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2024\/07\/get-823.jpg\" alt=\"get-823\" width=\"920\" height=\"441\" \/><\/div>\n<p data-track=\"89\">The VoxBlink2 dataset was mined through the following steps:<\/p>\n<blockquote>\n<p data-track=\"90\">Candidate preparation: collect multilingual keyword lists, retrieve user videos, and select the first minute of each video for processing.<\/p>\n<p data-track=\"91\">Face extraction &amp; detection: extract video frames at a high frame rate and use MobileNet to detect faces, keeping only video tracks that contain a single speaker.<\/p>\n<p data-track=\"92\">Face recognition: a pre-trained face recognizer verifies each frame to ensure that the audio and video clips come from the same person.<\/p>\n<p data-track=\"93\">Active speaker detection: using lip-movement sequences together with the audio, a multimodal active speaker detector outputs speech segments, and overlap detection removes multi-speaker segments.<\/p>\n<\/blockquote>\n<p data-track=\"94\">To further improve label accuracy, a bypass step with an in-set face recognizer was also introduced: through coarse face extraction, face verification, face sampling, and training, accuracy was improved from 72% to 92%.<\/p>\n<p data-track=\"95\">VoxBlink2 also open-sources voiceprint models of different sizes, including a ResNet-based 2D convolutional model, an ECAPA-TDNN-based temporal model, and a very large ResNet293 model built with the Simple Attention Module (SimAM). 
With post-processing, these models achieve an equal error rate (EER) of 0.17% and a minDCF of 0.006 on the Vox1-O evaluation set.<\/p>\n<p data-track=\"96\"><strong>Dataset website:<\/strong>https:\/\/VoxBlink2.github.io<\/p>\n<p data-track=\"97\"><strong>Download scripts:<\/strong>https:\/\/github.com\/VoxBlink2\/ScriptsForVoxBlink2<\/p>\n<p data-track=\"98\"><strong>Meta files and models:<\/strong>https:\/\/drive.google.com\/drive\/folders\/1lzumPsnl5yEaMP9g2bFbSKINLZ-QRJVP<\/p>\n<p data-track=\"99\"><strong>Paper:<\/strong>https:\/\/arxiv.org\/abs\/2407.11510<\/p>\n<p>&nbsp;<\/p>","protected":false},"excerpt":{"rendered":"<p>Wuhan University, together with China Mobile's Jiutian AI team and Duke Kunshan University, has open-sourced VoxBlink2, an audio\/video speaker recognition dataset of over 110,000 hours built from YouTube data. The dataset consists of 9,904,382 high-quality audio clips and their corresponding video clips from 111,284 YouTube users, making it the largest publicly available audio and video speaker recognition dataset. The release is intended to enrich open-source speech corpora and support the training of large voiceprint models. The VoxBlink2 dataset was mined through the following steps: Candidate preparation: collect multilingual keyword lists, retrieve user videos, and select the first minute of each video for processing. 
Face extraction &amp; detection: extract video frames at a high frame rate and use M<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[3713,2072,219,3355,3712],"collection":[],"class_list":["post-16485","post","type-post","status-publish","format-standard","hentry","category-news","tag-voxblink2","tag-2072","tag-219","tag-3355","tag-3712"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/16485","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=16485"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/16485\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=16485"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=16485"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=16485"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=16485"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}