Privacy in file sharing networks
Peer-to-peer (P2P) file sharing systems such as Gnutella, KaZaA, and eDonkey/eMule have become extremely popular in recent years, with an estimated user population in the millions. Measurements show that about 50% of file exchanges are illegal copies of multimedia files such as AVI and MP3.
An academic research paper pointed out weaknesses of two popular P2P networks in protecting users' privacy. The research analyzed the Gnutella and eMule protocols and found weaknesses in both; many of the issues found in these networks are fundamental and are probably common to other P2P networks.
Users of file sharing networks such as eMule and Gnutella do not enjoy privacy and are subject to possible surveillance. Users may be tracked by IP address, DNS name, the software version they use, the files they share, the queries they initiate, and the queries they answer.
Much is known about the network structure, routing schemes, performance load, and fault tolerance of P2P systems in general and Gnutella in particular. This document concentrates on the user information that is revealed by the Gnutella and eMule networks.
It might be surprising, but the eMule protocol does not provide much privacy to its users, even though it is a P2P protocol and is therefore supposed to be decentralized.
The eMule protocol
eMule is one of the clients that implement the eDonkey network. The eMule protocol consists of more than 75 types of messages. When an eMule client connects to the network, it first gets a list of known eMule servers, which can be obtained from the Internet. Although there are millions of eMule clients, there are only several hundred servers. The client connects to a server over a TCP connection, which stays open as long as the client is connected to the network. Upon connecting, the client sends a list of its shared files to the server. From this list the server builds a database of the files that reside on the client. The server also returns a list of other known servers, as well as an ID, which is a unique client identifier within the system. The server can only generate query replies for clients that are directly connected to it. A download is done by dividing the file into parts and requesting a part from each client.
Gnutella protocol v0.4
In Gnutella protocol v0.4 all nodes are identical, and every node may choose to connect to any other node. The Gnutella protocol consists of five message types. Query messages are used for file search and rely on a flooding mechanism, i.e. each node that receives a query forwards it on all of its links. A node that receives a query and has the appropriate file replies with a Query Hit message. A hop count field in the header limits the message lifetime. Ping and Pong messages are used for detecting new nodes that can be linked to. The actual file download is performed by opening a TCP connection and using the HTTP GET mechanism.
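The flooding mechanism with a bounded message lifetime can be illustrated with a small simulation. This is a minimal sketch and not the wire protocol: the overlay graph, the node names, and the function `flood_query` are hypothetical, and the lifetime is modeled as a simple cap on the number of forwarding steps.

```python
from collections import deque


def flood_query(graph, origin, max_hops):
    """Return {node: hops} for every node that a flooded query reaches.

    graph:    dict mapping a node to the nodes it is directly linked to
              (a hypothetical overlay, not real network data).
    max_hops: number of forwarding steps allowed before the message is dropped.
    """
    reached = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        hops = reached[node]
        if hops == max_hops:
            continue  # lifetime exhausted: this node does not forward further
        for neighbor in graph[node]:
            if neighbor not in reached:  # a node ignores a message it has already seen
                reached[neighbor] = hops + 1
                queue.append(neighbor)
    return reached


# Hypothetical five-node overlay.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

print(flood_query(overlay, "A", 2))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

With a limit of two hops the query from A never reaches E. The same flooding that bounds a query's reach also means that every node within that reach sees the query content.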
Gnutella protocol v0.6
Gnutella protocol v0.6 includes several modifications. A node has one of two operational modes: "leaf node" or "ultrapeer". Initially each node starts in leaf node mode, in which it can only connect to ultrapeers. A leaf node sends a query to an ultrapeer; the ultrapeer forwards the query and waits for the replies. When a node has enough bandwidth and uptime, it may become an ultrapeer. Ultrapeers periodically ask their leaves to send a list of the files they share. If a query arrives with a search string that matches one of the files on the leaves, the ultrapeer replies and points to the specific leaf.
Tracking initiators and responders
Gnutella (version 0.4): An ultrapeer that receives a message from a leaf node with hop count zero knows for certain that the message originated from that leaf node.
Gnutella (version 0.6): If an ultrapeer receives a message with hop count zero from another ultrapeer, it knows that the message originated either from that ultrapeer or from one of its leaves (the average number of leaf nodes connected to an ultrapeer is 200).
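The two cases above amount to a simple inference rule on the hop count. The sketch below is illustrative only: the function `possible_originators`, the node names, and the `leaves_of` mapping are hypothetical, and in practice an observer may know only how many leaves a neighbor has rather than their identities.

```python
def possible_originators(sender, hop_count, leaves_of):
    """Return the set of nodes that could have originated a message.

    sender:    the directly connected node the message arrived from.
    hop_count: value of the hop count field when the message arrived.
    leaves_of: hypothetical mapping from an ultrapeer to its leaf nodes;
               a plain leaf node has no entry.
    Returns None when the hop count is non-zero and nothing can be concluded
    from this rule alone.
    """
    if hop_count != 0:
        return None
    return {sender} | set(leaves_of.get(sender, ()))


leaves_of = {"ultrapeer-1": ["leaf-a", "leaf-b", "leaf-c"]}

# Message received directly from a leaf: the originator is certain.
print(possible_originators("leaf-x", 0, leaves_of))  # {'leaf-x'}

# Message received from another ultrapeer: the ultrapeer or one of its leaves.
print(len(possible_originators("ultrapeer-1", 0, leaves_of)))  # 4
```

With an average of 200 leaves per ultrapeer, the second case narrows the originator down to roughly 201 candidates.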
Tracking a single node
Many Gnutella clients have an HTTP monitor feature. This feature allows information about the node to be sent to any node that issues an empty HTTP request, which receives the information in the response.
Research shows that a simple crawler connected to the Gnutella network can obtain from an initial entry point a list of the IP addresses connected to that entry point. The crawler can then continue to query those addresses for further IP addresses.
An academic research team performed the following experiment:
At NYU, a regular Gnucleus software client was connected to the Gnutella network as a leaf node, listening on a distinctive TCP port, 44121.
At the Hebrew University, Jerusalem, Israel, a crawler ran looking for a client listening on port 44121. In less than 15 minutes the crawler found the IP address of the Gnucleus client at NYU with the unique port.
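The crawl in this experiment can be pictured as a breadth-first search over neighbor lists. The following is a minimal sketch over in-memory data and does not contact any network: the `neighbors_of` table stands in for whatever a real crawler would learn from each peer, the addresses are reserved documentation addresses, and `crawl_for_port` is a hypothetical helper, not the researchers' tool.

```python
from collections import deque


def crawl_for_port(neighbors_of, entry_point, target_port):
    """Breadth-first crawl; return the first (ip, port) peer using target_port.

    neighbors_of: hypothetical mapping from a peer to the peers it reports
                  being connected to.
    Returns None if no reachable peer listens on target_port.
    """
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        peer = queue.popleft()
        if peer[1] == target_port:
            return peer
        for neighbor in neighbors_of.get(peer, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return None


# Simulated overlay: most peers use port 6346, one uses the distinctive port.
neighbors_of = {
    ("192.0.2.1", 6346): [("192.0.2.2", 6346), ("192.0.2.3", 6346)],
    ("192.0.2.2", 6346): [("198.51.100.7", 44121)],
    ("192.0.2.3", 6346): [],
}

print(crawl_for_port(neighbors_of, ("192.0.2.1", 6346), 44121))
# ('198.51.100.7', 44121)
```

Any attribute that is rare in the population, such as an unusual listening port, works as a search key in the same way.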
IP address harvesting
If a user has been connected to the Gnutella network within, say, the last 24 hours, that user's IP address can easily be harvested by hackers using the HTTP monitoring feature, which can collect about 300,000 unique addresses within 10 hours.
Tracking nodes by GUID creation
A globally unique identifier (GUID) is a 16-byte field in the Gnutella message header that uniquely identifies every Gnutella message. The protocol does not specify how to generate the GUID.
Gnucleus on Windows uses the Ethernet MAC address as the 6 lower bytes of the GUID. Therefore, Windows clients reveal their MAC address when sending queries.
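Recovering a MAC address from such a GUID needs nothing more than slicing the field. The sketch below assumes that "the 6 lower bytes" means the last six bytes of the 16-byte field as it appears in the header; that byte position, the sample GUID, and the helper `mac_from_guid` are assumptions made for illustration.

```python
def mac_from_guid(guid: bytes) -> str:
    """Format the last six bytes of a 16-byte GUID as a MAC address.

    Assumes the MAC address occupies the final six bytes of the field;
    the text only says "6 lower bytes", so this position is an assumption.
    """
    if len(guid) != 16:
        raise ValueError("a Gnutella GUID is 16 bytes long")
    return ":".join(f"{byte:02x}" for byte in guid[-6:])


# Hypothetical GUID: ten arbitrary bytes followed by a made-up MAC address.
guid = bytes(range(10)) + bytes.fromhex("001a2b3c4d5e")

print(mac_from_guid(guid))  # 00:1a:2b:3c:4d:5e
```

Because the MAC address stays the same across sessions and IP address changes, every message carrying it can be attributed to the same machine.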
In the JTella 0.7 client software the GUID is created using the Java random number generator without initialization. Therefore, in each session the client creates a sequence of queries with the same repeating IDs. Over time, a correlation between the user's queries can be found.
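The effect of a generator that starts from the same state in every session can be reproduced in a few lines. This is a sketch of the general weakness, not of JTella's code: it uses Python rather than Java, and the fixed seed of 0 and the helper `session_guids` are assumptions standing in for a generator that is never re-initialized.

```python
import random


def session_guids(count, seed=0):
    """Generate `count` 16-byte GUIDs from a generator with a fixed seed.

    The fixed seed models a client whose random number generator starts
    from the same state in every session (an assumption for illustration).
    """
    rng = random.Random(seed)
    return [rng.getrandbits(128).to_bytes(16, "big") for _ in range(count)]


first_session = session_guids(3)
second_session = session_guids(3)

# Both sessions emit exactly the same GUID sequence.
print(first_session == second_session)  # True
```

An observer who has recorded the sequence once can recognize the same client in later sessions, even if its IP address has changed.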
Collecting miscellaneous information on users
The monitoring facility of Gnutella reveals an abundance of valuable information about its users. It is possible to collect information about the software vendor and the version that the clients use. Other statistical information about the client is available as well: capacity, uptime, local files, etc.
In Gnutella v0.6, information about the client software can be collected even if the client does not support HTTP monitoring. The information is found in the first two messages of the connection handshake.
Tracking users by partial information
Some Gnutella users have a small look-alike set, which makes it easier to track them when only this very partial information is known.
Tracking users by queries
An academic research team performed the following experiment: the team ran five Gnutella clients as ultrapeers (in order to listen to other nodes' queries). The team revealed about 6% of the queries.
Usage of hash functions
SHA-1 hashes refer to the SHA-1 of files, not of search strings.
Half of the search queries are strings and half of them are the output of a hash function (SHA-1) applied to the string. Although the use of a hash function is intended to improve privacy, academic research showed that the query content can easily be exposed by a dictionary attack: collaborating ultrapeers can gradually collect common search strings, calculate their hash values, and store them in a dictionary. When a hashed query arrives, each collaborating ultrapeer can check it against the dictionary and expose the original string accordingly.
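The dictionary attack described above can be sketched with Python's standard `hashlib` module. The class `QueryDictionary`, its method names, and the sample search strings are hypothetical; real queries may also be normalized or tokenized before hashing, which this sketch ignores.

```python
import hashlib


def sha1_hex(text):
    """Return the SHA-1 digest of a search string as a hex string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class QueryDictionary:
    """Maps hashes of previously observed plaintext queries back to the text."""

    def __init__(self):
        self._by_hash = {}

    def observe_plaintext(self, query):
        # Each plaintext query seen by a collaborating ultrapeer is recorded.
        self._by_hash[sha1_hex(query)] = query

    def reveal(self, hashed_query):
        # Returns the original string if the hash is known, otherwise None.
        return self._by_hash.get(hashed_query)


dictionary = QueryDictionary()
for query in ["free music", "holiday photos"]:
    dictionary.observe_plaintext(query)

incoming = sha1_hex("free music")            # a hashed query arriving later
print(dictionary.reveal(incoming))           # free music
print(dictionary.reveal(sha1_hex("other")))  # None
```

The attack works because popular search strings are drawn from a small, predictable set, so an unsalted hash hides them only from observers who have never seen the plaintext.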
Countermeasures
In order not to reveal one's IP address when downloading or uploading content, anonymizing networks such as I2P (The Anonymous Network) have emerged. There, all data is end-to-end encrypted and no direct connections are established between the peers that exchange the data.
Thus all traffic is anonymized and encrypted. Unfortunately, anonymity and safety come at the price of much lower speeds, and because these networks are internal networks, they currently still offer less content. This may change once there are more users.
See also
- Gnutella2, a reworked network based on Gnutella
- Bitzi, an open content file catalog integrated with some Gnutella clients
- Torrent poisoning