Basement AI lab captures 10,000 hours of brain scans to train thought-to-text AI models — largest known neural dataset collected from thousands of humans over six months
Conduit built a multimodal “AI helmet” system and a large-scale data operation to train models that decode semantic content from brain activity.
A San Francisco start-up has spent the past six months running one of the more unusual data projects in AI. Conduit says it has collected roughly 10,000 hours of non-invasive neural data from “thousands of unique individuals” in a basement studio, forming what it believes is the largest neuro-language dataset assembled to date. The company is using the recordings to train thought-to-text AI models that attempt to decode semantic content from brain activity in the seconds before a participant speaks or types.
Participants sit for two-hour sessions in small booths and converse freely with an LLM through speech or typing on “simplified” keyboards. Early sessions relied on rigid tasks, but Conduit shifted to personalized back-and-forth conversation after noticing that engagement strongly influenced data quality. The goal is to maximize the amount of natural language produced during each recording while maintaining tight time alignment between text, audio, and neural signals.
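The "tight time alignment" between typed text and neural signals described above can be pictured as slicing out the window of brain activity recorded just before each keystroke or utterance. A minimal illustrative sketch (not Conduit's code; sampling rate, channel count, and event format are all assumptions):

```python
"""Illustrative sketch: pairing typed-text events with the neural
activity recorded in the seconds before each one. The sampling rate,
channel count, and event log format here are assumptions."""
import numpy as np

EEG_HZ = 500  # assumed EEG sampling rate, samples per second

# Simulated 10-second EEG recording: (samples, channels)
eeg = np.random.randn(10 * EEG_HZ, 64)

# Keystroke events logged as (session_time_seconds, character)
keystrokes = [(3.20, "h"), (3.35, "i"), (4.80, "!")]

def window_before_event(event_time_s, pre_s=2.0):
    """Return the EEG samples in the pre_s seconds before an event,
    i.e. the activity preceding what the participant typed."""
    hi = int(round(event_time_s * EEG_HZ))
    lo = max(0, hi - int(round(pre_s * EEG_HZ)))
    return eeg[lo:hi]

# Build (neural window, label) training pairs for a decoder
pairs = [(window_before_event(t), ch) for t, ch in keystrokes]
print(pairs[0][0].shape)  # (1000, 64): 2 s of 500 Hz EEG, 64 channels
```

Free-flowing conversation maximizes how many such labeled windows a two-hour session yields, which is why engagement matters so much for data quality.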
Conduit built the hardware itself after finding that no commercial multimodal headset met its requirements. The team combined best-in-class EEG, fNIRS, and other sensors into custom 3D-printed shells and created separate designs for training and inference. Training headsets are dense, heavy four-pound rigs intended to maximize signal coverage, while inference headsets will be shaped by ablation studies conducted after the models mature. All data now flows through the Zarr v3 format, which unifies input from the different sensor types under a single array framework.
The company initially treated electrical interference as the primary threat to data quality. Staff wrapped equipment in rubber, experimented with power conditioners, and eventually shut off main power entirely, relying on battery packs to eliminate the 60 Hz spike typical in EEG recordings. That approach created its own problems, including dropped frames and a steady rotation of heavy batteries, but Conduit later restored normal power after discovering that scale changed the tradeoffs. Once the dataset crossed roughly 4,000 to 5,000 hours, the model began to generalize across people, booths, and setups, reducing the value of aggressive noise reduction.
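The 60 Hz interference the company fought by cutting mains power is the same component an EEG pipeline would normally attenuate in software with a notch filter. A minimal sketch using SciPy (sampling rate and signal content are invented for illustration; the article does not describe Conduit's actual filtering):

```python
"""Illustrative sketch: attenuating 60 Hz mains interference in an
EEG-like signal with a notch (band-stop) filter. The sampling rate
and synthetic signal are assumptions, not Conduit's pipeline."""
import numpy as np
from scipy.signal import iirnotch, filtfilt

FS = 500.0        # assumed sampling rate, Hz
MAINS_HZ = 60.0   # US mains frequency
QUALITY = 30.0    # notch quality factor: higher = narrower notch

# Synthetic channel: 10 Hz "brain" oscillation plus strong 60 Hz hum
t = np.arange(0, 5, 1 / FS)
clean = np.sin(2 * np.pi * 10 * t)
noisy = clean + 2.0 * np.sin(2 * np.pi * MAINS_HZ * t)

# Design the notch and apply it forward-backward (zero phase shift)
b, a = iirnotch(MAINS_HZ, QUALITY, fs=FS)
filtered = filtfilt(b, a, noisy)

# Compare error away from the edges, where the filter has settled
err_before = np.abs(noisy - clean)[500:-500].max()
err_after = np.abs(filtered - clean)[500:-500].max()
print(err_before, err_after)  # the 60 Hz hum is strongly attenuated
```

Conduit's finding is that at sufficient scale the model itself learns to see through this kind of structured noise, making hardware-level suppression less critical.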
Operating costs fell as the process scaled. Conduit cut the marginal cost per usable hour of data by about 40% between May and October by redesigning its backend to catch corrupted sessions in real time and allowing session managers to monitor multiple booths through cameras. A custom booking system introduced dynamic pricing and overbooking to keep its headsets filled during a 20-hour daily schedule.
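The dynamic-pricing and overbooking logic could work along these lines: pay more for hard-to-fill slots, and accept slightly more bookings than there are booths to absorb no-shows. A hypothetical sketch (all numbers and function names are invented; the article gives no details of Conduit's booking system):

```python
"""Hypothetical sketch of dynamic pricing and overbooking for a
booth-booking system. All constants and names are invented, not
taken from Conduit's actual system."""

BASE_PAY = 30.0       # assumed base pay per two-hour session, USD
NO_SHOW_RATE = 0.15   # assumed historical fraction of no-shows

def slot_pay(projected_fill: float) -> float:
    """Scale pay inversely with a slot's projected fill (0..1)."""
    # Hard-to-fill slots pay up to 50% more; full slots pay base.
    return round(BASE_PAY * (1.0 + 0.5 * (1.0 - projected_fill)), 2)

def bookings_to_accept(booths: int) -> int:
    """Overbook so expected show-ups roughly match booth capacity."""
    return int(booths / (1.0 - NO_SHOW_RATE))

print(slot_pay(0.2))          # 42.0: unpopular slot pays a premium
print(bookings_to_accept(8))  # 9: accept 9 bookings for 8 booths
```

Over a 20-hour daily schedule, rules like these keep expensive training headsets occupied rather than idle.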
Conduit says it is now focused almost entirely on model training, and that it plans to detail its decoding system in a later release.

Luke James is a freelance writer and journalist. Although his background is in law, he has a personal interest in all things tech, especially hardware and microelectronics, and anything regulatory.