Automated Large-Scale Analysis of Cookie Notice Compliance

Authors: Ahmed Bouhoula, Karel Kubicek, Amit Zac, Carlos Cotrini, and David Basin

Abstract: Privacy regulations such as the General Data Protection Regulation (GDPR) require websites to inform EU-based users about non-essential data collection and to request their consent to this practice. Previous studies have documented widespread violation of these regulations. However, these studies provide a limited view of the general compliance picture: they are either restricted to a subset of notice types, detect only simple violations using prescribed patterns, or analyze notices manually. Thus, they are restricted both in their scope and in their ability to analyze violations at scale.

We present the first general, large-scale, automated analysis of cookie notice compliance. Our method is capable of interacting with cookie notices, e.g., by navigating through their settings. It observes declared processing purposes and available consent options using Natural Language Processing and compares the actual use of cookies with the declared usage. By virtue of the generality and scale of our analysis, we correct for the selection bias present in previous studies focusing on specific Consent Management Platforms (CMP). We also provide a more general view of the overall compliance picture using a set of 97k websites popular in the EU. We report, in particular, that 65.4% of websites offering a cookie rejection option likely collect user data despite explicit negative consent.

BibTeX

@inproceedings{bouhoula2024automated,
  author = {Ahmed Bouhoula and Karel Kubicek and Amit Zac and Carlos Cotrini and David Basin},
  title = {Automated Large-Scale Analysis of Cookie Notice Compliance},
  booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
  year = {2024},
  month = aug,
  pages = {TBA},
  isbn = {TBA},
  publisher = {USENIX Association},
  url = {https://www.usenix.org/conference/usenixsecurity24/presentation/bouhoula},
  address = {Philadelphia, PA},
}

Motivation

To deter privacy-intrusive practices and address the ubiquitous tracking on the internet, the EU introduced privacy regulations such as the General Data Protection Regulation (GDPR) and the ePrivacy Directive. These laws mandate, in particular, that websites inform users about the explicit purposes for which their data is collected. This has led to the global adoption of cookie notices, which are now unavoidable when browsing the web.

Several studies, including our own CookieBlock, showed high levels of non-compliance with the regulations. However, these studies are significantly constrained, giving us biased measurements. They either depend on specific technologies present in a subset of consent notices or they are manual and therefore cannot be scaled up. Our work addresses these constraints using a crawler that interacts with the cookie notice and machine learning models that classify declared and observed practices of the website.

Previous automated studies relied on APIs exposed by specific Consent Management Platforms (CMPs), which restricted them to websites using those platforms. We created a crawler that detects cookie notices using a heuristic based on 1) the crowd-sourced EasyList Cookie list, 2) the z-index of the cookie notice, and 3) sentence segmentation models.

The crawler is not limited to the front page of the cookie notice: it navigates through the notice using depth-first search (DFS). This allows us to perceive cookie notices as users do, not as the API specifies them.
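The depth-first navigation can be illustrated with a minimal sketch. It assumes a toy model in which each screen of a notice is a `NoticeView` whose clickable elements lead to other screens; the class, its fields, and `explore_notice` are hypothetical names chosen for illustration, not the crawler's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NoticeView:
    """One screen of a cookie notice (hypothetical model for illustration)."""
    name: str
    text: str
    # Maps the label of a clickable element to the view it opens.
    links: dict[str, "NoticeView"] = field(default_factory=dict)


def explore_notice(root: NoticeView) -> list[str]:
    """Visit every reachable view depth-first; return view names in visit order."""
    visited: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        view = stack.pop()
        if view.name in seen:
            continue  # Views can link back to earlier ones; avoid loops.
        seen.add(view.name)
        visited.append(view.name)
        # Push children in reverse so the first link is explored first.
        for child in reversed(list(view.links.values())):
            stack.append(child)
    return visited


# Toy notice: front page -> settings -> partner list, with a "Back" link.
vendors = NoticeView("vendors", "List of partners")
settings = NoticeView("settings", "Choose what types of cookies you consent to.",
                      {"Partners": vendors})
front = NoticeView("front", "We use cookies ...", {"Settings": settings})
settings.links["Back"] = front

print(explore_notice(front))
```

The `seen` set matters in practice because settings pages commonly link back to the front page, which would otherwise make the traversal loop forever.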

The crawler browses the websites after performing several actions: rejecting or accepting the cookies, saving default settings, dismissing the notice with a close button, or not interacting at all. For each action, we extract the set of cookies resulting from interacting with it, which we rely on to predict whether websites honor user choices.
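As a rough sketch of this per-action bookkeeping, the snippet below stores the cookies seen after each action and filters those whose purpose would require consent. The cookie names, the purpose labels, and the helper `consent_required_cookies` are all made up for illustration; in the study, purposes are predicted by ML models rather than looked up in a table.

```python
# Hypothetical crawl output: names of cookies observed after each notice action.
observed = {
    "no_interaction": {"session_id"},
    "reject": {"session_id", "_ad_tracker"},
    "accept": {"session_id", "_ad_tracker", "_analytics"},
}

# Hypothetical purpose labels, hard-coded here for illustration only.
purpose = {
    "session_id": "essential",
    "_ad_tracker": "advertising",
    "_analytics": "analytics",
}

# Purposes assumed to require consent in this sketch.
CONSENT_REQUIRED = {"analytics", "advertising"}


def consent_required_cookies(action: str) -> list[str]:
    """Return the sorted cookies seen after `action` whose label requires consent."""
    cookies = observed.get(action, set())
    return sorted(c for c in cookies if purpose.get(c) in CONSENT_REQUIRED)


for action in observed:
    print(action, consent_required_cookies(action))
```

In this toy data, an advertising cookie still appears after the reject action, which is the kind of signal used to judge whether a website honors the user's choice.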

Using machine translation and multilingual models, we support websites in the following languages: Danish, Dutch, English, Finnish, French, German, Italian, Portuguese, Polish, Spanish, and Swedish.

Below is an example cookie notice. Each element is annotated with the prediction our ML models would make for it (on the original page, these annotations appear when hovering over the elements in blue dotted boxes).

Interactive element prediction: close.

Example cookie notice

We use cookies to enhance your browsing experience. NLP prediction: Essential purpose. We also use them to serve personalized ads and to analyze our traffic. NLP prediction: Non-essential purpose.

Choose what types of cookies you consent to. A limitation of our study is that we do not inspect which checkbox or toggle values are selected. We do, however, submit the default values, and if we then observe cookies, it means that consent was not active.

NLP prediction: Essential purpose. NLP prediction: Non-essential purpose.

Now the crawler would browse 5 random links and scroll, observing the cookies set after the consent choice.

Interactive element prediction: reject. Interactive element prediction: settings. Interactive element prediction: save. Interactive element prediction: accept.

ML and NLP models

The crawler extracts the declared behavior as described in the cookie notice (its text and interactive elements), as well as the observed behavior (the cookies the website used, depending on the action performed on the cookie notice). We reason about the collected data using three ML models.

Violation detection

We crawl 97k websites selected using the Chrome UX report, which better represents real browsing patterns than other lists (Ruth et al.). We select the EU (+UK) countries that speak one of the supported languages to ensure that the websites target users under the GDPR.

The crawled data allows us to reason about the differences between declared and observed behavior. The outputs of the crawler and the machine learning models serve as inputs to a decision tree, which detects ten types of potential privacy violations and dark patterns.

Decision tree for violations. The decision tree takes as input the classifications by our models and outputs all types of potential violations present on the website. “AA cookies” stands for Analytics or Advertising cookies. Such cookies require consent under EU regulations.
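To make the idea of such a decision procedure concrete, here is a small sketch with a handful of rules. The field names, the rule order, and the violation labels are assumptions made for illustration; they cover only a few cases and do not reproduce the actual tree or its ten outputs.

```python
from dataclasses import dataclass


@dataclass
class SiteObservation:
    """Per-website inputs to the rules (hypothetical field names)."""
    has_notice: bool
    has_reject_option: bool
    aa_cookies_without_interaction: bool
    aa_cookies_after_reject: bool


def potential_violations(site: SiteObservation) -> list[str]:
    """Apply a few illustrative rules and return the labels that fire."""
    found: list[str] = []
    if not site.has_notice:
        # Without a notice, any AA cookie is set without asking for consent.
        if site.aa_cookies_without_interaction:
            found.append("AA cookies without a notice")
        return found
    if site.aa_cookies_without_interaction:
        found.append("AA cookies before any interaction")
    if not site.has_reject_option:
        found.append("no reject option")
    elif site.aa_cookies_after_reject:
        found.append("AA cookies after rejection")
    return found


example = SiteObservation(
    has_notice=True,
    has_reject_option=True,
    aa_cookies_without_interaction=False,
    aa_cookies_after_reject=True,
)
print(potential_violations(example))
```

Each rule only reports a potential violation, since the inputs are themselves model predictions that can be wrong.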

Left: decision tree for dark patterns. Right: observed statistics of potential violations and dark patterns.

Aggregated results

We can parametrize website selection based on the country, popularity rank, and the consent notice technology the website uses. We list the most important observations, all of which are statistically significant:

Left: violations per rank from the Chrome UX report. Right: comparison of our results with other studies, investigating whether their website selection caused any bias.

Q&A