User Traffic Categorization through On-Device Clustering: Empowering Privacy and Control
When you browse the internet or use an app on your phone, a significant portion of your traffic is not related to the content or services you’re accessing. Websites and apps often include elements from third-party services, such as analytics providers, content delivery networks, and advertising networks. These third parties can track your behavior, collect data on your browsing habits, and even influence where your data is routed. Often, users aren’t fully aware of this data-sharing, which can impact privacy and online experiences.
In this article, we aim to tackle this issue by exploring an approach that categorizes users' web traffic by the website or app in use, as well as whether any third parties are involved. This gives users a clear view of who is interacting with their data and lets them make informed choices about how it flows, promoting transparency and giving control back to the user.
Understanding The Problem and Constraints
To start, let’s look at what we’re trying to solve and the data we can work with. As a VPN, the URnetwork app has access to all of a user’s internet traffic. The goal is to use this traffic data to help users better understand how their data is being accessed and enable them to control this flow of data effectively.
To give users a clear picture of their traffic patterns, we need to identify the applications they use and categorize any third-party connections, such as ads, analytics, or other services. This information enables users to make informed choices about their data, for example, blocking specific third parties. It could also allow them to save profiles for apps, such as "when I'm using this app, always route my traffic through a certain country." Additionally, we want to guarantee that user traffic is never shared with unauthorized parties. An easy way to achieve this is by developing a solution that works entirely on-device. This ensures privacy by keeping the user's information confidential and under their control.
We also chose to avoid deep packet inspection (DPI): users generate a vast amount of data daily, and processing it all on-device would be computationally expensive, potentially infeasible under a time constraint, and would certainly raise privacy concerns. Avoiding DPI means we need a way to categorize traffic without looking into what users are actually sending or receiving. Hence, we consider only the headers of packets as available information.
This defines our problem and constraints: identify the applications users use and categorize any third-party connections, entirely on-device, using only packet headers. Since URnetwork is written in Go, we decided to build our solution in Go. The source code can be found here.
First Steps
First, we need to define the input data. Since this project is meant to be used within URnetwork, which already parses a user's packets for routing, we can get the input data in any format. The main idea is that packets are part of transport sessions (TCP, UDP, etc.), so we can try to group them in this manner. For this, we employ Protocol Buffers (protobuf), as they are a language-neutral way of serializing data in a compact manner.
We define four transport messages that correspond to the different packet types in a session:
TransportOpen - the opening message; if the session is a TLS one, we can also extract the name of the server the user is communicating with (the domain name);
TransportClose - the closing message; it also records why the connection was closed (error, timeout, etc.);
WriteDataChunk - application data sent by the client;
ReadDataChunk - application data received by the client.
Hence, we now represent a transport session as a collection of the four message types, where we expect each session to have at least a TransportOpen. All other message types can be missing, e.g., if the session has not finished or there is no flow of application data. We also add an ID to each session so we can query it easily. For this, we use ULIDs, as they are monotonically ordered and can be sorted even when generated within the same millisecond. Furthermore, they are case-insensitive and do not use special characters.
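To make the structure concrete, here is a minimal sketch of these records as Python dataclasses. The field names are illustrative (the protobuf definitions in the repository are authoritative), and the session ID shown is just an example ULID string.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Kind(Enum):
    TRANSPORT_OPEN = "TransportOpen"
    TRANSPORT_CLOSE = "TransportClose"
    WRITE_DATA_CHUNK = "WriteDataChunk"
    READ_DATA_CHUNK = "ReadDataChunk"

@dataclass
class TransportMessage:
    kind: Kind
    timestamp: float              # seconds since capture start
    domain: Optional[str] = None  # only set on a TLS TransportOpen

@dataclass
class TransportSession:
    session_id: str               # a ULID in the real implementation
    messages: list = field(default_factory=list)

    def times(self):
        # The packet times later used for overlap calculation.
        return [m.timestamp for m in self.messages]

# Example: a short TLS session with one request/response pair.
session = TransportSession("01ARZ3NDEKTSV4RRFFQ69G5FAV", [
    TransportMessage(Kind.TRANSPORT_OPEN, 0.000, "api.urnetwork.com"),
    TransportMessage(Kind.WRITE_DATA_CHUNK, 0.120),
    TransportMessage(Kind.READ_DATA_CHUNK, 0.180),
    TransportMessage(Kind.TRANSPORT_CLOSE, 0.500),
])
```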
As already mentioned, in the final version URnetwork will provide the protobuf transport messages we work with. However, for this preliminary implementation (and for simpler testing), we need an easier way to obtain packets. For this, we use pcap files, as they are a standardized packet-capture format across platforms, and implement a function that parses pcap files into our transport records. From now on, we can safely assume that we work only with transport messages, as any pcap file can be converted into them.
Interpreting the Data
Currently, we have a collection of transport sessions, each consisting of different transport messages, but how can we interpret this data for our needs? Our main goal is to relate the sessions to each other to figure out which ones belong to the same application. This is a perfect use case for clustering. Most clustering algorithms operate on the distance between input points. We, however, have sessions (and their packets), so we need to define our own relation that can serve as a distance.
The first idea that comes to mind is to define each session by the times at which the packets were sent, as it can be expected that sessions relating to the same application will be active during the same intervals of time. So we extract, for each session, a list of times (corresponding to the send/receive time of the packets) and compare them.
Sessions Comparison
Now, the question is, how can we compare two sessions? Let's say that sessionA was active at timeA, and sessionB was active at timeB. Do we count this as a match between the sessions only if timeA = timeB, if the two times are within a second of each other, or something else entirely? For now, let's take the simplest approach: two times count as overlapping if they are within 1 second of each other. Then, we define the total overlap as the number of pairs of times between the two sessions that overlap. For example, if we have sessions with times [10, 15, 17, 18] and [11, 16], then the total overlap is 3, because there are 3 pairs that overlap (<10, 11>, <15, 16>, <17, 16>). We will later explore other approaches that might be better for calculating overlap.
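This fixed-margin overlap takes only a few lines. The real implementation is in Go; this illustrative Python version reproduces the example from the text:

```python
def total_overlap(times_a, times_b, margin=1.0):
    """Count pairs (a, b) whose times are within `margin` seconds."""
    return sum(
        1
        for a in times_a
        for b in times_b
        if abs(a - b) <= margin
    )

# The example from the text: pairs <10, 11>, <15, 16>, <17, 16> overlap.
print(total_overlap([10, 15, 17, 18], [11, 16]))  # 3
```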
Domain Interpretation
Another piece of information that we don't fully use is the domain name of each session. As each domain can be broken down into hierarchical levels (top-, second-, third-level domains), we can group sessions by parent domains. For instance, ipv4-abc.1.api.urnetwork.com has api.urnetwork.com, urnetwork.com, and .com as parent domains. So, for example, if we encounter api.urnetwork.com and vpn.urnetwork.com, we store their times separately as expected, but also group them under urnetwork.com. This lets us better relate sessions originating from the same parent domain. Furthermore, if two domains that share a parent domain turn out to be unrelated, this may indicate that one (or both) of them is a third party that cannot be correlated with the larger parent-domain group. Note that we intentionally ignore top-level domains (like .com), as they are too general to provide meaningful relations. Thus, we split each domain into a second- and a third-level domain and aggregate times that share the same parent domain.
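The parent-domain split can be sketched as follows. Again, the real implementation is in Go; this helper is illustrative and reproduces the example above:

```python
def parent_domains(domain):
    """Return the third- and second-level parent domains, skipping the
    top-level domain, which is too general to be meaningful."""
    labels = domain.split(".")
    parents = []
    # Walk from the third-level domain down to the second-level one.
    for depth in (3, 2):
        if len(labels) >= depth:
            parents.append(".".join(labels[-depth:]))
    return parents

print(parent_domains("ipv4-abc.1.api.urnetwork.com"))
# ['api.urnetwork.com', 'urnetwork.com']
```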
Test Data
Visualizing the packet times of our test capture, we see several groups of times that correlate. For example, in green we see domain names for Wikipedia; in purple, traffic that appears to belong to WhatsApp (which was running in the background while recording the test session); in orange, Netflix-related domains; and in blue, the New York Times. Later, we can use these related times to verify our clustering by checking whether they are grouped into the same cluster.
Overlap and Distance
Now, we see that the most prominent region is the WhatsApp one (in purple), with the rest being somewhat recognizable, so we have not yet made any real improvements to our distance relation.
Clustering
At long last, we have everything we need to cluster the data. The question now is which clustering algorithm to choose. For this, we need to define the characteristics of our data. It is non-linear, with most values near 1 (the maximum distance). The formed clusters are heterogeneous (differing in shape, size, and density), their number is unknown, and some points may remain unclustered. Based on these characteristics, we decide that the best options are HDBSCAN and OPTICS, as both are not sensitive to outliers, can form heterogeneous clusters of different densities, and leave some points unclustered. Additionally, the two methods have quite similar parameters, with the main ones being relatively easy to choose. This page shows a nice visualization of different clustering approaches to get an intuition for their pros and cons.
Now, let's try to cluster using the two approaches. To build the distance matrix, we use Gaussian overlap with a standard deviation of 10ms, per the results of the last section. We also need to choose the constant λ for the exponential decay. After some testing, we find that OPTICS needs a large value (we use λ=150) to exaggerate the differences between the smaller overlap values. For HDBSCAN, the decay value does not matter (as long as it is bigger than 1): HDBSCAN builds upon DBSCAN by efficiently finding the optimal value of one of its parameters (ε), which is normally quite unintuitive to choose. Changing λ changes the optimal ε value, but HDBSCAN will still choose the correct one, ultimately producing the same clustering. The main advantage of a bigger λ in general is that differences are exaggerated, making it easier to visualize the relationships between the sessions.
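The two steps can be sketched as below. The exact per-pair weighting and the lack of normalization here are assumptions on our part (the Go implementation may differ), but the shape is the important part: each packet time is smeared into a Gaussian with σ = 10ms, and exponential decay maps large overlaps to small distances.

```python
import math

SIGMA = 0.010   # 10 ms standard deviation
LAMBDA = 150.0  # decay rate; large, to exaggerate small overlaps

def gaussian_overlap(times_a, times_b, sigma=SIGMA):
    """Pairs of close times contribute close to 1, distant pairs
    contribute close to 0."""
    return sum(
        math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
        for a in times_a
        for b in times_b
    )

def decay_distance(times_a, times_b, lam=LAMBDA):
    """Exponential decay maps overlap to a distance in (0, 1]:
    no overlap -> 1 (max distance), large overlap -> close to 0."""
    return math.exp(-lam * gaussian_overlap(times_a, times_b))

# Sessions active at the same moments end up near distance 0;
# completely disjoint sessions sit at the maximum distance 1.
print(decay_distance([0.000, 1.000], [0.001, 1.001]))  # close to 0
print(decay_distance([0.0], [60.0]))                   # 1.0
```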
Next, to cluster, we need to choose the main (intuitive) parameter for each algorithm: the minimum size at which a cluster will form. Normally, this would be 2, as any two related sessions should be clustered, but since we also include parent domains, we make it 3. Also, both approaches put the unclustered domains together in a cluster with ID=-1 to indicate that they are unclustered. Now, running the clustering with OPTICS, we get 6 clusters:
Running HDBSCAN, we get 8 clusters:
From these first results, we see that both methods form good clusters based on our distance matrix. Comparing the two, it seems that OPTICS is better at finding closely related sessions, whereas HDBSCAN is a bit more lax in what it considers a cluster. However, the more pressing concern is why some clusters, e.g., the Wikipedia one, were not formed. There are several possible causes for this:
The data is too small to form the clusters (it is only ~4 minutes of traffic after all);
The overlap function used is still not good enough;
The clustering algorithms' other parameters can be tuned further to achieve better clustering.
Regarding the first issue, we generate a second dataset that is 3 times bigger and cluster it as we did the first one. The results are better, with more of the meaningful clusters forming and HDBSCAN even outperforming OPTICS. So we can partially attribute the "lackluster" first results to the size of the data. As for the second, we believe that the overlap function currently used is good at extracting the prominent features of the data and is sufficient for this preliminary implementation.
As to the final concern, we tried changing the other parameters. For completeness, we report the results without going into much detail on exactly why certain values work best. For OPTICS, there are some improvements when changing max_eps, but nothing compared to the improvements in HDBSCAN. In HDBSCAN, reducing the hyperparameter alpha to a small value (around 0.001) causes more meaningful clusters to form. Additionally, cluster_selection_epsilon=0.001 works better for the bigger dataset. This page provides some intuition on how to choose values for HDBSCAN's parameters. Our results with these values (using the initial smaller dataset) are as follows:
Conclusion
In this article, we outlined our approach to clustering transport sessions of packets so that similar sessions are grouped together. We defined a method to relate sessions using the time of each packet. We further analyzed several approaches to calculating the relationships between sessions, including fixed-margin overlap and Gaussian overlap. Both approaches represent the time of each packet as a continuous interval and calculate the total overlap between pairs of times of different sessions. We then proposed a method, exponential decay, to convert the overlap values into distance values that can be used to cluster the sessions.
Finally, we evaluated two algorithms, HDBSCAN and OPTICS, by using them to cluster a small test dataset. We conclude that the two methods are quite similar in results, producing well-defined and meaningful clusters. When tweaking the hyperparameters, we find that HDBSCAN outperforms OPTICS in terms of finding more and smaller clusters. This, however, can be due to our relatively small test data, so a final conclusion cannot be drawn until further analysis is performed using more traffic data.
Discussion and Future Work
Our main goal was to "identify the applications users use and categorize any third-party connections, entirely on-device, using only packet headers." We showed that identifying the applications/websites used can be achieved by grouping sessions into clusters. However, we have not directly addressed how to identify third parties; hence, we theorize an approach that uses the clustering results to achieve this. As we split the domain of each session into parent domains, we can use these parent domains and the unclustered sessions to identify third parties. If a session has sufficient overlap with other sessions but is still left unclustered, we might consider it a third party. Additionally, if there is a small cluster where most sessions have parent domains that are either unclustered or assigned to another cluster, we can consider the small cluster a third-party cluster. Thus, in a future implementation of our approach, we believe there is enough information in the current results to extract the third parties present in the dataset.
Another aspect we did not consider during our exploration is the memory this approach uses: with more data, more space is needed, which might not be feasible, especially on certain mobile devices. Thus, we briefly discuss a technique that can be used in the future to reduce the space needed to store our distance matrix. Currently, the distance matrix is stored as a sparse matrix. With this approach, however, we do not know the actual size the sparse matrix will take before we start parsing the transport sessions and calculating the overlap. One method that can help is the count-min sketch. With a count-min sketch, we can build a fixed-size matrix that is probabilistic in nature, with guarantees on how close the estimated values are to the real ones (this data structure can only overestimate values). After a small test, we believe this technique has potential for achieving what we strive for.
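A minimal count-min sketch looks like the following. The width, depth, and hashing scheme are illustrative choices, not the ones from our test; the key property is that estimates never undercount.

```python
import hashlib

class CountMinSketch:
    """Fixed-size probabilistic counter: estimates can only
    overestimate, never undercount."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Each row uses an independently salted hash of the key.
        digest = hashlib.blake2b(
            key.encode(), salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Taking the minimum over rows bounds the overestimation.
        return min(
            self.table[row][self._index(row, key)]
            for row in range(self.depth)
        )

sketch = CountMinSketch()
sketch.add("api.urnetwork.com", 5)
sketch.add("wikipedia.org", 2)
print(sketch.estimate("api.urnetwork.com"))  # >= 5, usually exactly 5
```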
Finally, we compose a short list of the directions where this project can go in the future:
More efficient implementation of the clustering algorithm;
Use more of the data in the packet headers, e.g., size of the packet;
Optimize memory usage of clustering;
A technique for extracting third parties;
Compose bigger test data.
The next section outlines some specific challenges we encountered during development as well as how we use tests in our approach.
Challenges and Testing
During the development of our solution, we tried and tested different approaches for many parts of the project. As already mentioned, there are 2 implementations of overlap functions, and we tried other distance functions but ultimately decided on exponential decay. Additionally, as Go did not provide many options for clustering algorithms, we initially used Python to figure out which approach is best, so all the clustering in this article was done using the scikit-learn library. However, after settling on HDBSCAN as our clustering method, we found a Go implementation, and that is what is currently used in our code. This Go implementation does not have all the parameters the Python one has, so its clustering results are not quite as good (but not by a lot). Our code can be found here, where we have kept the Python clustering scripts in case they are needed in the future. The README further explains all of the functionality of the created module. There are several other Python scripts, used to create the visualizations in this article, that can help in understanding the data and the steps of the solution.
Lastly, we developed 2 testing methodologies and an evaluation technique to help assess the two clustering algorithms. First, we developed a genetic hill-climbing algorithm to find the best parameter values for OPTICS and HDBSCAN. Second, we defined test cases for our test sessions, consisting of clusters we expect the clustering to find with high accuracy. Third, we defined a rough evaluation function that gives us an idea of how the clustering is performing without the need to examine it manually every time. All of these methods are further explained in the README of the source code.
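The parameter search can be illustrated with a stripped-down hill climb (without the genetic component, and with a toy score function standing in for our actual evaluation function):

```python
import random

def hill_climb(score, initial, steps=200, seed=42):
    """Randomly mutate one parameter at a time, keeping a mutation
    only when it improves the score."""
    rng = random.Random(seed)
    best, best_score = dict(initial), score(initial)
    for _ in range(steps):
        candidate = dict(best)
        key = rng.choice(list(candidate))
        candidate[key] += rng.uniform(-0.5, 0.5)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

# Stand-in score with a single peak at min_cluster_size=3,
# epsilon=0.001; the real search scores actual clustering runs.
def toy_score(params):
    return -((params["min_cluster_size"] - 3) ** 2
             + (params["epsilon"] - 0.001) ** 2)

start = {"min_cluster_size": 5.0, "epsilon": 0.5}
best, best_score = hill_climb(toy_score, start)
print(best, best_score)
```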