Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

by Simon Osuji
December 12, 2024
in Artificial Intelligence


Harvard University announced Thursday it’s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble. “It’s gone through rigorous review,” he says.

Leppert believes the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models. “I think about it a bit like the way that Linux has become a foundational operating system for so much of the world,” he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.

Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was in line with its broader beliefs about the value of creating “pools of accessible data” for AI startups to use that are “managed in the public’s interest.” In other words, Microsoft isn’t necessarily planning to swap out all of the AI training data it has used in its own models with public domain alternatives like the books in the new Harvard database. “We use publicly available data for the purposes of training our models,” Davis says.

As dozens of lawsuits filed over the use of copyrighted data for training AI wind their way through the courts, the future of how artificial intelligence tools are built hangs in the balance. If AI companies win their cases, they’ll be able to keep scraping the internet without needing to enter into licensing agreements with copyright holders. But if they lose, AI companies could be forced to overhaul how their models get made. A wave of projects like the Harvard database are plowing forward under the assumption that—no matter what happens—there will be an appetite for public domain datasets.

In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line. The exact way the books dataset will be released is not settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the search giant hasn’t publicly agreed to host it yet, though Harvard says it’s optimistic it will. (Google did not respond to WIRED’s requests for comment.)




© 2026 LBNN – All rights reserved.
