Dark Web ChatGPT Unleashed: Meet DarkBERT

(Image: a data center with rows of rack-mounted servers. Credit: Shutterstock)

We're still early in the snowball effect unleashed by the release of Large Language Models (LLMs) like ChatGPT into the wild. Paired with the open-sourcing of other GPT (Generative Pre-trained Transformer) models, the number of applications employing AI is exploding, and as we know, ChatGPT itself can be used to create highly advanced malware.

As time passes, the number of applied LLMs will only increase, each specializing in its own area, trained on carefully curated data for a specific purpose. One such application just dropped, trained on data from the dark web itself. DarkBERT, as its South Korean creators called it, has arrived; the release paper gives the details, along with an overall introduction to the dark web itself.

DarkBERT is based on the RoBERTa architecture, an AI approach developed back in 2019. RoBERTa has seen a renaissance of sorts, with researchers discovering it had more performance to give than was extracted from it at release: the model was severely undertrained, leaving substantial headroom for improvement.
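
For readers who want to poke at the underlying architecture, here is a minimal sketch of loading a stock RoBERTa checkpoint with the Hugging Face transformers library and querying its masked-language-modeling head; nothing here is specific to DarkBERT's actual weights or training setup, which are described in the paper.

```python
# Minimal sketch: load the stock RoBERTa architecture that DarkBERT builds on
# and query its masked-language-modeling head (RoBERTa's pretraining objective).
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# RoBERTa is pretrained to predict tokens hidden behind <mask>.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Find the masked position and take the highest-scoring token there.
mask_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))  # e.g. " Paris"
```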

To train the model, the researchers crawled the Dark Web through the anonymizing firewall of the Tor network, then filtered the raw data (applying techniques such as deduplication, category balancing, and data pre-processing) to generate a Dark Web database. DarkBERT is the result of feeding that database to RoBERTa, producing a model that can analyze a new piece of Dark Web content, written in its own dialects and heavily coded messages, and extract useful information from it.
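
As an illustration only, here is a hypothetical sketch of the kind of filtering the paper describes, i.e. deduplication and category balancing over crawled pages. Every function name, record shape, and cap value below is an assumption made for illustration, not the researchers' actual pipeline.

```python
# Hypothetical sketch of corpus filtering: deduplication and category
# balancing over crawled Dark Web pages. All names and values here are
# illustrative assumptions, not the paper's actual pipeline.
import hashlib
import random
from collections import defaultdict

def deduplicate(pages):
    """Drop pages whose normalized text hashes to an already-seen digest."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha256(page["text"].strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

def balance_categories(pages, cap):
    """Cap each topical category so no single one dominates the corpus."""
    buckets = defaultdict(list)
    for page in pages:
        buckets[page["category"]].append(page)
    balanced = []
    for items in buckets.values():
        random.shuffle(items)
        balanced.extend(items[:cap])
    return balanced

# Hypothetical usage on crawled records of the form {"text": ..., "category": ...}:
raw_pages = [
    {"text": "Example page one.", "category": "forum"},
    {"text": "Example page one.", "category": "forum"},  # exact duplicate
    {"text": "Example page two.", "category": "marketplace"},
]
corpus = balance_categories(deduplicate(raw_pages), cap=10_000)
print(len(corpus))  # 2
```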

Saying that English is the business language of the Dark Web wouldn't be entirely correct, but its mix of languages and jargon is a specific enough concoction that the researchers believed a dedicated LLM had to be trained on it. In the end, they were right: DarkBERT outperformed other large language models on Dark Web tasks, which should allow security researchers and law enforcement to penetrate deeper into the recesses of the web. That is, after all, where most of the action is.
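
To make the security-research use case concrete: a plausible (and entirely hypothetical) workflow would fine-tune a DarkBERT-style encoder to classify crawled pages by activity. The sketch below uses the public roberta-base checkpoint as a stand-in for the DarkBERT weights, purely to show the mechanics; the label set is invented.

```python
# Hypothetical sketch: classify a Dark Web page by activity with a
# RoBERTa-family encoder. "roberta-base" is a stand-in for the DarkBERT
# weights, and the label set is invented for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["drug market", "hacking forum", "data leak site", "benign"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(labels)
)

page_text = "Fresh database dumps available, escrow accepted."
inputs = tokenizer(page_text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Note: the classification head is randomly initialized here, so the
# prediction is meaningless until the model is fine-tuned on labeled pages.
print(labels[logits.argmax().item()])
```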

As with other LLMs, that doesn't mean DarkBERT is finished; further training and tuning can continue to improve its results. How it will be used, and what knowledge can be gleaned, remains to be seen.

Francisco Pires
Freelance News Writer

Francisco Pires is a freelance news writer for Tom's Hardware with a soft spot for quantum computing.

  • Metal Messiah.
Interesting take on the dark web. I hope this is successful and they crack the whole dark web; to be honest, it's very hard to understand how it really works.

Btw, the same group of researchers worked on another paper last year, 'Shedding New Light on the Language of the Dark Web,' in which they introduced CoDA (a text corpus of the dark web collected from various onion services and divided into topical categories).

By their definition, CoDA is a publicly available Dark Web dataset consisting of 10,000 web documents tailored toward text-based Dark Web analysis. Here is the paper:

    https://aclanthology.org/2022.naacl-main.412.pdf
  • PlaneInTheSky
    Ah yes..."AI" language models...because it's working out so great right now /s

    https://i.postimg.cc/FszL0wjF/Screenshot-2023-05-16-at-10-25-45-AM.jpg
  • bwana
    I guess the lives of cybercriminals, child traffickers and drug dealers just got a little more interesting. But since the CIA and NSA have already been processing petabytes of data for actionable intel using AI, I wonder how their model compares.

    And when action is taken against these targets, it gives the idea of a 'loss function' more meaning.
  • Neoony
    PlaneInTheSky said:
    Ah yes..."AI" language models...because it's working out so great right now /s

    https://i.postimg.cc/FszL0wjF/Screenshot-2023-05-16-at-10-25-45-AM.jpg
That's just an example of what you should not use LLMs for xD (they're generally bad at counting words/characters)

    Bard is just awful anyways
  • Vanderlindemedia
But if one discloses a website on the dark web, how does an indexer get past that?
  • Metal Messiah.
The paper also details how much data they fed DarkBERT, including a table that lists every site and the category it was filed under.

    https://www.dexerto.com/_ipx/w_1080,q_75/https%3A%2F%2Feditors.dexerto.com%2Fwp-content%2Fuploads%2F2023%2F05%2F17%2Fstats-darkbert-1024x576.jpg?url=https%3A%2F%2Feditors.dexerto.com%2Fwp-content%2Fuploads%2F2023%2F05%2F17%2Fstats-darkbert-1024x576.jpg&w=1080&q=75
  • SquishyShark
I'm not sure how I feel about the phrase "the anonymizing firewall of the Tor network".

Isn't it the world's worst-kept secret that Tor is operated by the CIA, or am I just blindly believing rumors?

    I'm not being pedantic, this is a genuine question for anyone here who knows more about this than me, which is likely the majority of you guys.
  • Avro Arrow
DarkBERT 2077: DarkBERT 2088: DarkBERT 2099:
    It starts here... :ROFLMAO:
  • Metal Messiah.
More like DarkTERMINATOR 3077! 😈
  • rickentick44
    You might find the actual history of the Tor Project to be interesting. It is now a non-profit organization. As to privacy and anonymity, the engineering works. It does get attacked by nation-states, but mostly holds its own such that it is extremely difficult to pick a target and deanonymize it. Bridges solved the problem of countries blocking Tor server IPs.

    Carelessness is the main way lawbreakers get caught. It is more effective when more people use Tor for access to legitimate onion sites, such as newspapers, blogs, wikis, and so on. It is so effective that it is the main reason people have said that censorship is dead. Even China's GFW cannot defeat onion routing as deployed with bridges.