Tuesday, June 3, 2025
LBNN

New technique can automate data curation for self-supervised pre-training of AI datasets

by Simon Osuji
June 3, 2024
in Artificial Intelligence


An overview of the data curation pipeline. Large data pools often exhibit a long-tailed distribution of concepts. We apply hierarchical k-means to obtain clusters that spread uniformly over the concepts. Data points are then sampled from the clusters to form a curated dataset that has a better balance of concepts. Credit: arXiv (2024). DOI: 10.48550/arxiv.2405.15613

A team of computer scientists and AI researchers from FAIR at Meta, INRIA, Université Paris Saclay and Google has developed a possible means of automating the curation of datasets used for self-supervised pre-training of AI models.

The group has written a paper describing their development process, the technique they developed and how well it has worked thus far during testing. It is posted on the arXiv preprint server.

As developers and users alike have been learning over the past year, the quality of the data used to train AI systems is tied very closely to the accuracy of their results. Currently, the best results are obtained with systems trained on manually curated data and the worst with systems trained on uncurated data.

Unfortunately, manually curating data takes a lot of time and effort, so computer scientists have been looking for ways to automate the process. In this new study, the research team has developed a technique that does just that, with results on a par with manual curation.

The new technique starts with a large dataset, and then carries out a three-step process that results in data that is both more diverse and more balanced.

The first step uses a feature-extraction model to compute a high-quality embedding for each data point. An embedding here is a vector of numbers that represents the features of a piece of data, such as text, audio, or an image.
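
To make the idea of an embedding concrete, here is a minimal Python sketch. It is not the feature extractor used in the paper: the `embed` function, its character-trigram hashing scheme, and the 256-dimension default are all assumptions chosen so that the example runs with the standard library alone. It only illustrates that each item becomes a fixed-length vector of numbers, and that similar items tend to receive similar vectors.

```python
import math
import zlib


def embed(text, dim=256):
    """Toy stand-in for a feature extractor (an assumption, not the paper's model).

    Hashes character trigrams into a fixed-size vector, then L2-normalises it.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    cat = embed("a photo of a cat")
    cat2 = embed("a photo of a cat sleeping")
    report = embed("quarterly earnings report")
    # The two cat captions share most trigrams, so their similarity
    # should usually be the higher of the two numbers printed.
    print(cosine(cat, cat2), cosine(cat, report))
```

In a real pipeline a trained neural network would play the role of `embed`; the clustering steps that follow only need the resulting vectors.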

The second step involves the use of successive k-means clustering, where data points are assigned to a group based on their similarity to other data points.
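
As a rough illustration of what k-means clustering does, the sketch below implements the textbook version of the algorithm (Lloyd's algorithm) in plain Python. It is a generic sketch, not the research team's code: the `kmeans` function name, the iteration count, and the toy data are assumptions made for the example.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical long-tailed pool: one common concept and two rare ones.
    sizes = {(0.0, 0.0): 900, (5.0, 5.0): 90, (-5.0, 5.0): 10}
    pool = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for (cx, cy), n in sizes.items() for _ in range(n)]
    _, labels = kmeans(pool, k=3)
    print("cluster sizes:", [labels.count(c) for c in range(3)])
```

Each point ends up grouped with the points it most resembles, which is the "similarity to other data points" described above.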

The third step uses multi-step hierarchical k-means clustering to ensure that the data clusters are balanced. This is achieved by building trees of data clusters in a bottom-up fashion.
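
The bottom-up tree building and the sampling mentioned in the figure caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the paper: the function names (`hierarchical_kmeans`, `build_children`, `balanced_sample`), the two-level tree, the cluster counts, and sampling with replacement are all choices made for the example.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def hierarchical_kmeans(points, ks, seed=0):
    """Bottom-up tree: cluster the points, then cluster the centroids, and so on.

    levels[t][i] is the parent (at level t+1) of node i at level t;
    level-0 nodes are the data points themselves.
    """
    levels, nodes = [], list(points)
    for k in ks:
        nodes, labels = kmeans(nodes, k, seed=seed)
        levels.append(labels)
    return levels


def build_children(levels):
    """children[t][p] lists the non-empty level-t nodes whose parent is p."""
    children, alive = [], None
    for labels in levels:
        kids = {}
        for child, parent in enumerate(labels):
            if alive is None or child in alive:
                kids.setdefault(parent, []).append(child)
        children.append(kids)
        alive = set(kids)  # parents that contain at least one data point
    return children


def balanced_sample(levels, n, seed=0):
    """Draw n data-point indices (with replacement) by walking the tree top-down,
    choosing uniformly among the children at every level."""
    rng = random.Random(seed)
    children = build_children(levels)
    picks = []
    for _ in range(n):
        node = rng.choice(sorted(children[-1]))  # uniform over top-level clusters
        for kids in reversed(children):
            node = rng.choice(kids[node])
        picks.append(node)
    return picks


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical long-tailed pool: one common concept and two rare ones.
    sizes = {(0.0, 0.0): 900, (5.0, 5.0): 90, (-5.0, 5.0): 10}
    pool = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for (cx, cy), n in sizes.items() for _ in range(n)]
    levels = hierarchical_kmeans(pool, ks=[20, 3])
    curated = balanced_sample(levels, 300)
    # Indices 900 and above belong to the two rare concepts (10% of the pool).
    print("rare share in curated sample:", sum(i >= 900 for i in curated) / len(curated))
```

Because every branch of the tree is chosen with equal probability regardless of how many points sit beneath it, rare concepts are drawn more often than their share of the raw pool would suggest, which is the balancing effect the step aims for.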

The research team tested their technique on vision models that had been trained on various types of datasets. They found that models trained on data curated with their technique outperformed those trained on uncurated data and were as good as, or sometimes better than, those trained on manually curated data.

More testing will have to be done to find out how well their technique works on real-world data and different kinds of AI systems.

More information:
Huy V. Vo et al, Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach, arXiv (2024). DOI: 10.48550/arxiv.2405.15613

Journal information:
arXiv

© 2024 Science X Network

Citation:
New technique can automate data curation for self-supervised pre-training of AI datasets (2024, June 3)
retrieved 3 June 2024
from https://techxplore.com/news/2024-06-technique-automate-curation-pre-ai.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.





© 2023 LBNN - All rights reserved.
