CLAUDETTE: Automating Legal Evaluation of Terms of Service and Privacy Policies using Machine Learning

It is possible to teach machines to read and evaluate terms of service and privacy policies for you.

Have you ever actually read the privacy policies and terms of service you accept? If so, you’re an exception. Consumers do not read these documents. They are too long, too complex, and there are too many of them. And even if consumers did read the documents, they would have no way to change them.

Regulators around the world, acknowledging this problem, have put in place rules on what these documents must and must not contain. For example, the EU enacted rules on unfair contractual terms and, more recently, the General Data Protection Regulation (GDPR). The latter, applicable since 25 May 2018, makes clear what information must be presented in privacy policies, and in what form. And yet, our research has shown that, despite the substantive and procedural rules in place, online platforms largely do not abide by the norms concerning terms of service and privacy policies. Why? Among other reasons, there is just too much for the enforcers to check. With thousands of platforms and services out there, the task is overwhelming. NGOs and public agencies might have the competence to verify terms of service (ToS) and privacy policies (PPs), but lack the actual capacity to do so. Consumers have rights, civil society has its mandate, but no one has the time and resources to put them into practice. Battle lost? Not necessarily. We can use AI for this good cause.

The ambition of the CLAUDETTE Project, hosted at the Law Department of the European University Institute in Florence, and supported by engineers from the University of Bologna and the University of Modena and Reggio Emilia, is to automate the legal evaluation of terms of service and privacy policies of online platforms, using machine learning. The project’s philosophy is to empower consumers and civil society using artificial intelligence. Currently, artificial intelligence tools are used mostly by large corporations and the state. However, we believe that, with the efforts of academia and civil society, AI-powered tools for consumers and NGOs can and should be created. Our most technically advanced tool, described in our recent paper, CLAUDETTE: an Automated Detector of Potentially Unfair Clauses in Online Terms of Service, can detect potentially unfair contractual clauses with 80%-90% accuracy. Such tools can be used both to increase consumers’ autonomy (by telling them what they accept) and to increase the efficiency and effectiveness of civil society’s work, by automating large parts of it.

Our most recent work has been an attempt to automate the analysis of privacy policies under the GDPR. This project, funded and supported by the European Consumer Organization, has led to the publication of the report Claudette Meets GDPR: Automating the Evaluation of Privacy Policies Using Artificial Intelligence. Our findings indicate that the task can indeed be automated once a significantly larger learning dataset is created. The learning process was interrupted by major changes to privacy policies made by the majority of online platforms around 25 May 2018, the date on which the GDPR became applicable. Nevertheless, the project led us to interesting conclusions.

Doctrinally, we have outlined what requirements a GDPR-compliant privacy policy should meet (comprehensive information, clear language, fair processing), as well as the ways in which these documents can be unlawful (if required information is insufficient, language is unclear, or potentially unfair processing is indicated). Anyone, whether researcher, policy drafter, or journalist, can use these “golden standards” to help them assess existing policies, or draft new ones that comply with the GDPR.

Empirically, we have analyzed the contents of the privacy policies of Google, Facebook (and Instagram), Amazon, Apple, Microsoft, WhatsApp, Twitter, Uber, AirBnB, Booking.com, Skyscanner, Netflix, Steam and Epic Games. Our normative study indicates that none of the analyzed privacy policies meets the requirements of the GDPR. The evaluated corpus, comprising 3658 sentences (80,398 words), contains 401 sentences (11.0%) which we marked as containing unclear language and 1240 sentences (33.9%) which we marked as potentially unlawful clauses, i.e. either a “problematic processing” clause or an “insufficient information” clause (under Articles 13 and 14 of the GDPR). Hence, there is significant room for improvement on the side of business, as well as for action on the side of consumer organizations and supervisory authorities.
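For readers who want to check the proportions quoted above, here is a minimal sketch that recomputes them from the reported sentence counts. The constant and function names are illustrative only and do not come from the CLAUDETTE code base.

```python
# Sentence counts as reported in the post for the evaluated corpus.
TOTAL_SENTENCES = 3658
UNCLEAR_LANGUAGE = 401
POTENTIALLY_UNLAWFUL = 1240


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Hypothetical variable names, used here only for illustration.
unclear_pct = share(UNCLEAR_LANGUAGE, TOTAL_SENTENCES)
unlawful_pct = share(POTENTIALLY_UNLAWFUL, TOTAL_SENTENCES)

print(f"unclear language: {unclear_pct}%")
print(f"potentially unlawful: {unlawful_pct}%")
```

Both results match the figures given in the text (11.0% and 33.9%).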

This post originally appeared on the Machine Lawyering blog of the Centre for Financial Regulation and Economic Development at the Chinese University of Hong Kong.
