Project to Simulate the Entire Web

Tsubasa Kato
3 min read · Jun 16, 2023


I asked ChatGPT (GPT-4 with Bing) the following question:

Q. What is the project by an organization in the USA that simulates the entire web?

Here is the reply from ChatGPT (GPT-4 with Bing):

A. The project you’re referring to is led by researchers at the Massachusetts Institute of Technology (MIT). Using a supercomputing system, they have developed a model that simulates what global web traffic might look like on a given day. This includes all types of interactions, from common pings to tech giants like Google and Facebook, to rare, brief connections that still seem to impact overall web traffic.

The model can take any massive network dataset and generate statistical measurements about how all connections in the network affect each other. It can reveal insights about peer-to-peer filesharing, nefarious IP addresses and spamming behavior, the distribution of attacks in critical sectors, and traffic bottlenecks. This can then be used to better allocate computing resources and keep data flowing.
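
To make the idea of “statistical measurements about connections” more concrete, here is a minimal, hypothetical Python sketch (not the MIT code; the toy IP addresses are my own) that counts fan-out and fan-in for each node in an anonymized edge list:

```python
from collections import Counter

# Toy anonymized edge list: (source IP, destination IP) pairs.
# In the real dataset these would be billions of packet headers.
edges = [
    ("10.0.0.1", "192.0.2.8"),
    ("10.0.0.1", "192.0.2.9"),
    ("10.0.0.2", "192.0.2.8"),
    ("10.0.0.3", "192.0.2.8"),
]

# Fan-out: how many distinct destinations each source talks to.
fan_out = Counter(src for src, _ in set(edges))
# Fan-in: how many distinct sources reach each destination.
fan_in = Counter(dst for _, dst in set(edges))

print("fan-out:", dict(fan_out))  # e.g. {'10.0.0.1': 2, ...}
print("fan-in:", dict(fan_in))    # e.g. {'192.0.2.8': 3, ...}
```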

The researchers gathered the world’s largest publicly available internet traffic dataset, which comprised 50 billion data packets exchanged in different locations across the globe over several years. They then ran this data through a neural network pipeline operating across 10,000 processors of the MIT SuperCloud. This system combines computing resources from the MIT Lincoln Laboratory and across the Institute.

The researchers collaborated with the Widely Integrated Distributed Environment (WIDE) project, founded by several Japanese universities, and the Center for Applied Internet Data Analysis (CAIDA) in California. Together, they captured the world’s largest packet-capture dataset for internet traffic. This anonymized dataset contains nearly 50 billion unique source and destination data points between consumers and various apps and services, collected on random days at various locations in Japan and the U.S., dating back to 2015.

To process the dataset, they used software they created, called Dynamic Distributed Dimensional Data Model (D4M), to efficiently compute and sort hypersparse data (data that contains far more empty space than data points). The data was broken down into units of about 100,000 packets across 10,000 MIT SuperCloud processors, creating compact matrices of billions of rows and columns of interactions between sources and destinations.
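
The D4M code itself is not shown in the reply, so the following is only a rough sketch of the underlying idea in Python, assuming scipy is available: packets are processed in chunks of roughly 100,000 and accumulated into a sparse source-by-destination count matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix

def packets_to_hypersparse(sources, destinations, n_ids, chunk=100_000):
    """Accumulate (source, destination) packet pairs into one sparse
    count matrix, processing roughly `chunk` packets at a time.
    `sources` and `destinations` are integer IDs after anonymization."""
    total = coo_matrix((n_ids, n_ids), dtype=np.int64).tocsr()
    for start in range(0, len(sources), chunk):
        s = sources[start:start + chunk]
        d = destinations[start:start + chunk]
        block = coo_matrix((np.ones(len(s), dtype=np.int64), (s, d)),
                           shape=(n_ids, n_ids))
        total = total + block.tocsr()  # sparse + sparse stays sparse
    return total

# Tiny example: 6 packets among 4 anonymized IDs.
src = np.array([0, 0, 1, 2, 3, 3])
dst = np.array([1, 2, 2, 1, 0, 0])
M = packets_to_hypersparse(src, dst, n_ids=4, chunk=3)
print(M.toarray())  # rows = sources, columns = destinations, values = packet counts
```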

The neural network used to process the matrices fitted models to the entirety of the data, creating a probability distribution of potentially accurate models. The researchers modified a traditional error-correction technique to further refine the parameters of each model to capture as much data as possible. This allowed the model to consider all outlying data, such as isolated links, as significant to the overall measurements.
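
The reply does not say which model family was fitted, so purely as an illustration, assuming the per-node traffic counts follow a heavy-tailed distribution, a simple fit might look like a least-squares line on the log-log degree histogram:

```python
import numpy as np

def fit_power_law(degrees):
    """Fit log(count) ~ a - b*log(degree) by least squares.
    A stand-in for the (unspecified) model fitting described above."""
    degrees = np.asarray(degrees)
    values, counts = np.unique(degrees[degrees > 0], return_counts=True)
    x = np.log(values)
    y = np.log(counts)
    # Ordinary least squares for slope and intercept on the log-log histogram.
    b, a = np.polyfit(x, y, 1)
    return a, -b  # intercept, estimated power-law exponent

# Toy degree sample: many small-degree nodes, a few hubs.
rng = np.random.default_rng(0)
degrees = np.round(rng.pareto(2.0, size=5_000) + 1).astype(int)
intercept, alpha = fit_power_law(degrees)
print(f"estimated exponent: {alpha:.2f}")
```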

The neural network essentially generates a simple model that describes the internet traffic dataset, “from really popular nodes to isolated nodes, and the complete spectrum of everything in between”. The research is groundbreaking in the efficient processing of a “firehose stream of traffic” to identify meaningful patterns and web activity. Beyond the internet, the neural network pipeline can be used to analyze any hypersparse network, such as biological and social networks.

The researchers involved in this project include Jeremy Kepner, Kenjiro Cho of the Internet Initiative Japan, KC Claffy of the Center for Applied Internet Data Analysis at the University of California at San Diego, Vijay Gadepally and Peter Michaleas of Lincoln Laboratory’s Supercomputing Center, and Lauren Milechin, a researcher in MIT’s Department of Earth, Atmospheric and Planetary Sciences.

