Project to Simulate the Entire Web

Tsubasa Kato
3 min read · Jun 16, 2023


I asked ChatGPT (GPT-4 with Bing) the following question:

Q. What is the project by an organization in the USA that simulates the entire web?

Here is the reply from ChatGPT (GPT-4 with Bing):

A. The project you’re referring to is led by researchers at the Massachusetts Institute of Technology (MIT). Using a supercomputing system, they have developed a model that simulates what global web traffic might look like on a given day. This includes all types of interactions, from common pings to tech giants like Google and Facebook, to rare, brief connections that still seem to impact overall web traffic.

The model can take any massive network dataset and generate statistical measurements about how all connections in the network affect each other. It can reveal insights about peer-to-peer filesharing, nefarious IP addresses and spamming behavior, the distribution of attacks in critical sectors, and traffic bottlenecks. This can then be used to better allocate computing resources and keep data flowing.
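
To make the idea of “statistical measurements about connections” more concrete, here is a minimal, hypothetical Python sketch (not the MIT code; the toy IP addresses are my own) that counts fan-out and fan-in for each node in an anonymized edge list:

```python
from collections import Counter

# Toy anonymized edge list: (source IP, destination IP) pairs.
# In the real dataset these would be billions of packet headers.
edges = [
    ("10.0.0.1", "192.0.2.8"),
    ("10.0.0.1", "192.0.2.9"),
    ("10.0.0.2", "192.0.2.8"),
    ("10.0.0.3", "192.0.2.8"),
]

# Fan-out: how many distinct destinations each source talks to.
fan_out = Counter(src for src, _ in set(edges))
# Fan-in: how many distinct sources reach each destination.
fan_in = Counter(dst for _, dst in set(edges))

print("fan-out:", dict(fan_out))  # e.g. {'10.0.0.1': 2, ...}
print("fan-in:", dict(fan_in))    # e.g. {'192.0.2.8': 3, ...}
```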

The researchers gathered the world’s largest publicly available internet traffic dataset, which comprised 50 billion data packets exchanged in different locations across the globe over several years. They then ran this data through a neural network pipeline operating across 10,000 processors of the MIT SuperCloud. This system combines computing resources from the MIT Lincoln Laboratory and across the Institute.

The researchers collaborated with the Widely Integrated Distributed Environment (WIDE) project, founded by several Japanese universities, and the Center for Applied Internet Data Analysis (CAIDA) in California. Together, they captured the world’s largest packet-capture dataset for internet traffic. This anonymized dataset contains nearly 50 billion unique source and destination data points between consumers and various apps and services, collected on random days at various locations in Japan and the U.S., dating back to 2015.

To process the dataset, they used software they created, called Dynamic Distributed Dimensional Data Model (D4M), to efficiently compute and sort hypersparse data (data that contains far more empty space than data points). The data was broken down into units of about 100,000 packets across 10,000 MIT SuperCloud processors, creating compact matrices of billions of rows and columns of interactions between sources and destinations.
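
The D4M code itself is not shown in the reply, so the following is only a rough sketch of the underlying idea in Python, assuming scipy is available: packets are processed in chunks of roughly 100,000 and accumulated into a sparse source-by-destination count matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix

def packets_to_hypersparse(sources, destinations, n_ids, chunk=100_000):
    """Accumulate (source, destination) packet pairs into one sparse
    count matrix, processing roughly `chunk` packets at a time.
    `sources` and `destinations` are integer IDs after anonymization."""
    total = coo_matrix((n_ids, n_ids), dtype=np.int64).tocsr()
    for start in range(0, len(sources), chunk):
        s = sources[start:start + chunk]
        d = destinations[start:start + chunk]
        block = coo_matrix((np.ones(len(s), dtype=np.int64), (s, d)),
                           shape=(n_ids, n_ids))
        total = total + block.tocsr()  # sparse + sparse stays sparse
    return total

# Tiny example: 6 packets among 4 anonymized IDs.
src = np.array([0, 0, 1, 2, 3, 3])
dst = np.array([1, 2, 2, 1, 0, 0])
M = packets_to_hypersparse(src, dst, n_ids=4, chunk=3)
print(M.toarray())  # rows = sources, columns = destinations, values = packet counts
```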

The neural network used to process the matrices fitted models to the entirety of the data, creating a probability distribution of potentially accurate models. The researchers modified a traditional error-correction technique to further refine the parameters of each model to capture as much data as possible. This allowed the model to consider all outlying data, such as isolated links, as significant to the overall measurements.
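
The reply does not say which model family was fitted, so purely as an illustration, assuming the per-node traffic counts follow a heavy-tailed distribution, a simple fit might look like a least-squares line on the log-log degree histogram:

```python
import numpy as np

def fit_power_law(degrees):
    """Fit log(count) ~ a - b*log(degree) by least squares.
    A stand-in for the (unspecified) model fitting described above."""
    degrees = np.asarray(degrees)
    values, counts = np.unique(degrees[degrees > 0], return_counts=True)
    x = np.log(values)
    y = np.log(counts)
    # Ordinary least squares for slope and intercept on the log-log histogram.
    b, a = np.polyfit(x, y, 1)
    return a, -b  # intercept, estimated power-law exponent

# Toy degree sample: many small-degree nodes, a few hubs.
rng = np.random.default_rng(0)
degrees = np.round(rng.pareto(2.0, size=5_000) + 1).astype(int)
intercept, alpha = fit_power_law(degrees)
print(f"estimated exponent: {alpha:.2f}")
```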

The neural network essentially generates a simple model that describes the internet traffic dataset, “from really popular nodes to isolated nodes, and the complete spectrum of everything in between”. The research is groundbreaking in the efficient processing of a “firehose stream of traffic” to identify meaningful patterns and web activity. Beyond the internet, the neural network pipeline can be used to analyze any hypersparse network, such as biological and social networks.

The researchers involved in this project include Jeremy Kepner, Kenjiro Cho of the Internet Initiative Japan, KC Claffy of the Center for Applied Internet Data Analysis at the University of California at San Diego, Vijay Gadepally and Peter Michaleas of Lincoln Laboratory’s Supercomputing Center, and Lauren Milechin, a researcher in MIT’s Department of Earth, Atmospheric and Planetary Sciences.

