Authors: Ahmed Bouhoula, Karel Kubicek, Amit Zac, Carlos Cotrini, and David Basin
Abstract: Privacy regulations such as the General Data Protection Regulation (GDPR) require websites to inform EU-based users about non-essential data collection and to request their consent to this practice. Previous studies have documented widespread violation of these regulations. However, these studies provide a limited view of the general compliance picture: they are either restricted to a subset of notice types, detect only simple violations using prescribed patterns, or analyze notices manually. Thus, they are restricted both in their scope and in their ability to analyze violations at scale.
We present the first general, large-scale, automated analysis of cookie notice compliance. Our method is capable of interacting with cookie notices, e.g., by navigating through their settings. It observes declared processing purposes and available consent options using Natural Language Processing and compares the actual use of cookies with the declared usage. By virtue of the generality and scale of our analysis, we correct for the selection bias present in previous studies focusing on specific Consent Management Platforms (CMP). We also provide a more general view of the overall compliance picture using a set of 97k websites popular in the EU. We report, in particular, that 65.4% of websites offering a cookie rejection option likely collect user data despite explicit negative consent.
@inproceedings{bouhoula2024automated,
author = {Ahmed Bouhoula and Karel Kubicek and Amit Zac and Carlos Cotrini and David Basin},
title = {Automated Large-Scale Analysis of Cookie Notice Compliance},
booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
year = {2024},
month = aug,
pages = {TBA},
isbn = {TBA},
publisher = {USENIX Association},
url = {https://www.usenix.org/conference/usenixsecurity24/presentation/bouhoula},
address = {Philadelphia, PA},
}
To deter privacy-intrusive practices and address the ubiquitous tracking on the internet, the EU introduced privacy regulations such as the General Data Protection Regulation (GDPR) and the ePrivacy Directive. These laws mandate, in particular, that websites inform users about the explicit purposes for which their data is collected. This has led to the global adoption of cookie notices, which are now unavoidable when browsing the web.
Several studies, including CookieBlock, showed high levels of non-compliance with the regulations. However, these studies are significantly constrained, leading to biased measurements. They either depend on specific technologies present in a subset of consent notices or they are manual and therefore cannot be scaled up. Our work addresses these constraints using a crawler that interacts with the cookie notice and machine learning models that classify declared and observed practices of the website.
Previous automated studies relied on the APIs of specific Consent Management Platforms (CMPs) and thus covered only the websites using them. We created a crawler that detects cookie notices using a heuristic based on 1) the crowd-sourced EasyList Cookie list, 2) the z-index of the cookie notice, and 3) sentence segmentation models.
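One way such signals could be combined is sketched below. This is a simplified illustration, not the crawler's actual logic: the `Candidate` record, the keyword check (a crude stand-in for the text models), and the order in which the signals are applied are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical summary of one DOM element considered as a notice."""
    selector: str
    z_index: int
    matches_easylist: bool  # hit by an EasyList Cookie selector (assumed precomputed)
    text: str


def pick_notice(candidates: Iterable[Candidate],
                keywords: tuple = ("cookie", "consent")) -> Optional[Candidate]:
    """Return the most likely cookie notice, or None if nothing qualifies."""
    candidates = list(candidates)
    # Signal 1: elements matched by the crowd-sourced filter list.
    listed = [c for c in candidates if c.matches_easylist]
    if listed:
        return max(listed, key=lambda c: c.z_index)
    # Signals 2 and 3: overlays (positive z-index) whose text mentions cookies.
    overlays = [
        c for c in candidates
        if c.z_index > 0 and any(k in c.text.lower() for k in keywords)
    ]
    return max(overlays, key=lambda c: c.z_index, default=None)
```

In this sketch the filter list takes priority and the z-index only breaks ties or serves as a fallback; a real detector could weigh the signals differently.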
The crawler is not limited to the front page of the cookie notice as it navigates the notice with a DFS approach. This allows us to perceive the cookie notices as users do, not as the API specifies them.
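The depth-first navigation can be illustrated with a minimal sketch. It assumes the notice is modeled as a graph whose nodes are notice pages and whose edges are the buttons or links leading to other pages; the page names and the `transitions` mapping are hypothetical, and a real crawler would discover the edges by clicking elements in the browser.

```python
from typing import Dict, List


def explore_notice(start: str, transitions: Dict[str, List[str]]) -> List[str]:
    """Visit every reachable notice page depth-first and return the visit order."""
    order: List[str] = []
    seen = set()
    stack = [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push in reverse so the first link on a page is explored first.
        for nxt in reversed(transitions.get(page, [])):
            if nxt not in seen:
                stack.append(nxt)
    return order


# Hypothetical notice: the front page links to settings and a vendor list.
notice = {
    "front": ["settings", "vendors"],
    "settings": ["purposes"],
    "purposes": ["front"],  # a "back" link creates a cycle
}
visit_order = explore_notice("front", notice)
```

The `seen` set keeps the traversal from looping on "back" links, which are common in multi-page notices.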
The crawler browses the websites after performing several actions: rejecting or accepting the cookies, saving default settings, dismissing the notice with a close button, or not interacting at all. For each action, we extract the set of cookies resulting from interacting with it, which we rely on to predict whether websites honor user choices.
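As a rough illustration of the per-action data, the sketch below stores one set of cookie names per action and computes what each action adds relative to not interacting at all. The action labels, cookie names, and the choice of baseline are assumptions made for the example, not the crawler's actual data format.

```python
from typing import Dict, Set


def cookies_added_by_action(cookies_by_action: Dict[str, Set[str]],
                            baseline: str = "no_interaction") -> Dict[str, Set[str]]:
    """For each action, return the cookies observed beyond the baseline visit."""
    base = cookies_by_action.get(baseline, set())
    return {
        action: cookies - base
        for action, cookies in cookies_by_action.items()
        if action != baseline
    }


# Made-up observations for a single website.
observed = {
    "no_interaction": {"session"},
    "accept": {"session", "_ga", "ad_id"},
    "reject": {"session", "_ga"},
    "close": {"session"},
}
added = cookies_added_by_action(observed)
```

Here `added["reject"]` is non-empty, so this hypothetical site sets an additional cookie even after the user rejects; whether that cookie requires consent is a separate classification question.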
Using machine translation, we support websites in the following languages: Danish, Dutch, English, Finnish, French, German, Italian, Portuguese, Polish, Spanish, and Swedish.
Below is an interactive cookie notice example. Hover over any element in a blue dotted box to see how our ML models would classify it. The demo requires JavaScript.
The crawler extracts the declared behavior as described in the cookie notice (its text and interactive elements), as well as the observed behavior (cookies that were used by the website depending on the action performed on the cookie notice). We reason about the collected data using three ML models.
We crawl 97k websites selected using the Chrome UX Report, which better represents real browsing patterns than other lists (Ruth et al.). To ensure that the websites target users under the GDPR, we select the EU (+UK) countries in which one of the supported languages is spoken.
The crawled data allows us to reason about the differences between declared and observed behavior. The outputs of the crawler and machine learning models serve as parameters for a decision tree, which outputs ten privacy violations or dark patterns.
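A heavily reduced sketch of such rule-based reasoning follows. The input fields and the violation labels are illustrative assumptions; they are not the ten categories or the exact branching used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteObservation:
    """Hypothetical per-site inputs derived from the crawl and the ML models."""
    has_notice: bool
    has_reject_option: bool
    aa_cookies_without_interaction: bool  # AA = Analytics or Advertising
    aa_cookies_after_reject: bool


def potential_violations(site: SiteObservation) -> List[str]:
    """Walk a small decision tree and collect every potential violation."""
    found: List[str] = []
    if not site.has_notice:
        if site.aa_cookies_without_interaction:
            found.append("AA cookies without any notice")
        return found
    if site.aa_cookies_without_interaction:
        found.append("AA cookies before consent")
    if not site.has_reject_option:
        found.append("no reject option")
    elif site.aa_cookies_after_reject:
        found.append("AA cookies after reject")
    return found


# Made-up site that offers a reject button but ignores it.
example = SiteObservation(True, True, False, True)
labels = potential_violations(example)
```

Because the function accumulates labels instead of returning at the first match, one website can be flagged for several potential violations at once.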
Decision tree that takes as input the outputs of the crawl and ML models. It outputs all types of potential violations present on the website. “AA cookies” stands for Analytics or Advertising cookies. Such cookies require consent under EU regulations.
On the left, the decision tree of dark patterns. On the right, observed statistics of potential violations and dark patterns.
We can parametrize website selection based on the country, popularity rank, or the consent provider used by the website. We list the most important observations, all of which are statistically significant:
On the left, we present violations per rank. On the right, we compare our results with other studies.
Q: What is the probability that your violation decision tree produces false positives?
A: We tuned the models to be conservative, so they err toward false negatives rather than false positives. Our evaluation on 500 random websites quantifies both false positives and false negatives, showing that our results are conservative. For more details, see Section 7 of the paper.