The Italian Data Protection Authority has issued an informational note regarding potential countermeasures that internet site operators and online platform managers, operating in Italy as controllers, could implement in order to prevent the collection of data by third parties for the purpose of training Artificial Intelligence models (known as web scraping).
Web scraping is a computer technique that allows the massive and indiscriminate collection of personal data on the web. Information and data can be systematically collected through automated programs (web robots, or simply bots) operated by third parties, which simulate human browsing, provided that the resources visited by these bots (e.g., websites, content, etc.) are freely accessible to the public online and not subject to access controls. The collected data is then aggregated into databases and used to train Generative Artificial Intelligence (GAI) systems, which constantly require large amounts of information for their training.
The Italian Data Protection Authority does not express judgments on the legitimacy and/or lawfulness of web scraping activities, but rather offers suggestions for those entities, both public and private, that manage websites and online platforms operating in Italy as data controllers of personal data made available online, regarding the possible precautions that could be adopted to mitigate the effects of collections carried out through third-party bots, using the web scraping technique, for the purpose of training Generative Artificial Intelligence (GAI) systems.
In the informational note, the Italian Data Protection Authority suggests the following solutions as possible alternatives to protect the data published on their portals from web scraping:
• creating restricted areas for the data that needs to be protected, accessible only upon registration, thereby removing such information from public availability;
• including specific clauses in the Terms of Service (ToS) that expressly prohibit the use of web scraping techniques. This precaution, relatively effective from a data protection point of view, is widely used as a tool for protecting intellectual property against contractual breaches;
• monitoring web traffic to detect abnormal flows of data to and from the web pages;
• implementing specific technical interventions against bots performing web scraping activities. Although it is impossible to prevent their operation in absolute terms, the Italian Authority indicates several technical measures, including: CAPTCHA verification, which requires an action executable only by humans; periodic modification of HTML markup; embedding data within multimedia elements (such as images) to make extraction by bots extremely complex; monitoring log files (generated by software and containing information about server or system operations); modifying the robots.txt file (based on the Robots Exclusion Protocol, which indicates rules for data access on websites).
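By way of illustration, a robots.txt directive of the kind mentioned above could look like the following sketch. The user-agent tokens shown (GPTBot for OpenAI's crawler, CCBot for Common Crawl) are examples of publicly documented AI-related crawlers; operators should verify the current tokens in each crawler's own documentation, and should bear in mind that compliance with robots.txt is voluntary on the bot's part.

```
# Illustrative robots.txt excerpt: disallow known AI-training crawlers
# while leaving the site open to other crawlers.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers remain subject to the default policy
User-agent: *
Allow: /
```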
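The log-monitoring measure can likewise be sketched in code. The following is a minimal, hypothetical example (the log format, field positions, and threshold are illustrative assumptions, not part of the Authority's note): it counts requests per client IP in a batch of access-log lines and flags those whose volume suggests automated scraping rather than human browsing.

```python
from collections import Counter

# Illustrative threshold: requests per observation window above which
# a client is treated as a likely bot. Real values would be tuned to
# the site's normal traffic.
REQUEST_THRESHOLD = 100


def flag_suspect_ips(log_lines, threshold=REQUEST_THRESHOLD):
    """Count requests per IP (assumed to be the first whitespace-separated
    field of each log line, as in the common log format) and return a
    dict of the IPs whose request count exceeds the threshold."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip: n for ip, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # Simulated log window: one high-volume client, one ordinary one.
    sample = ['203.0.113.7 - - "GET /page HTTP/1.1" 200'] * 150 \
           + ['198.51.100.2 - - "GET /home HTTP/1.1" 200'] * 5
    print(flag_suspect_ips(sample))  # flags only the high-volume IP
```

In practice such a check would feed a rate limiter or block list rather than a printout, but the principle, i.e. detecting abnormal request volumes from the server's own logs, is the one the note refers to.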
The adoption of the aforementioned measures is not mandatory for controllers managing websites and online platforms operating in Italy, but their implementation may be necessary based on an independent assessment by the controller, in accordance with the accountability principle. This can help mitigate the effects of web scraping and the potential unauthorized use of personal data published online by third parties.
GAI systems can provide benefits to society, but their operation requires massive amounts of data, including personal data, with significant consequences for all those data subjects. In compliance with the principle of accountability, each controller is required to assess the compatibility of the web scraping activity concerning the purposes and legal basis underlying the processing and, based on the specific case conditions, determine the necessity of implementing the measures suggested by the Italian Data Protection Authority.
Avv. Simona Lanna and Dott. Lorenzo Maione