Balancing Innovation and Privacy: Generative AI, Data Scraping, and the Path Forward


Joseph Morgan


22 August, 2023


Generative AI models such as ChatGPT are both a boon and a curse. They have the potential to dramatically enhance business productivity and automation but also pose significant risks, particularly regarding content and data privacy. Consider this hypothetical situation: What if your entire business model revolves around content, and success hinges on the consistent value, visibility, and accessibility of that content to the largest possible audience of unique visitors? This is where the complex issue of content scraping arises.

The Benefits of Content Scraping

Content scraping involves the use of bots to capture and store content, offering distinct advantages. For instance, it can help combat news bias: bots amass vast amounts of data from different websites, and machine learning then assesses both the content’s accuracy and tone.

Moreover, content scraping methods can aggregate information rapidly, enabling cost savings through automation that reduces data extraction time and minimises human intervention. However, the approach is not without its pitfalls.

The Drawbacks of Content Scraping

The darker side of content scraping became evident when we began collaborating with a global e-commerce platform. A staggering 75% of the site’s traffic was generated by bots, predominantly scraping bots. These bots duplicated data that could then be marketed on the Dark Web or used for malicious purposes, such as crafting false identities or spreading misinformation or disinformation.

Another concerning instance is the emergence of fake “Googlebots” – scraper bots that can evade detection on websites, mobile apps, and application programming interfaces (APIs) by masquerading as SEO-friendly crawlers. Recognising that websites depend on good Google rankings, unscrupulous actors create bots that resemble Googlebots but engage in malicious activities once they access the websites, apps, or APIs.
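One common way to tell a genuine Googlebot from an impostor is a reverse DNS lookup followed by a forward confirmation, which is the approach Google describes for verifying its crawlers. The sketch below is illustrative only: the function name and the injectable `reverse_lookup` and `forward_lookup` parameters are assumptions made for this example, and the domain suffixes should be checked against Google's current guidance before relying on them.

```python
import socket

# Reverse DNS suffixes associated with Google's crawlers.
# Assumption for this sketch: confirm against Google's current documentation.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if `ip` passes a reverse and forward DNS check.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without network access; by default they use the socket module.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        # Step 1: reverse lookup, IP address to hostname.
        host = reverse_lookup(ip)
        # Step 2: the hostname must belong to a Google crawler domain.
        if not host.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES):
            return False
        # Step 3: forward lookup must map the hostname back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # Failed lookups (no PTR record, unknown host) count as unverified.
        return False
```

A request whose user agent claims to be Googlebot but whose IP address fails this check can then be treated like any other unidentified bot. The user-agent string alone proves nothing, since any client can send it.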

The Ambiguity Between

ChatGPT is trained on an extensive dataset scraped from the internet, empowering it to address a wide spectrum of queries. ChatGPT primarily utilises Common Crawl, a legitimate nonprofit organisation that produces and maintains an open repository of web crawl data, granting access to vast amounts of information for large language models (LLMs). However, because Common Crawl's crawler bot (CCBot) collects any content not explicitly protected, ChatGPT and other LLMs can utilise that content too.

This raises serious questions. Consider a journalist who invests time interviewing experts, conducting research, and crafting a perfect article, only to have it scraped by ChatGPT without acknowledgment. The journalist’s efforts are rendered futile by a web scraping bot. Furthermore, readers no longer access the original website where the article was published, resulting in lost website traffic, domain authority, and potentially ad revenue.

A similar scenario occurred when AI was employed to replicate rapper Drake’s voice in a song he did not write or participate in, which subsequently went viral on TikTok. This incident raises legal and copyright concerns, as well as broader discussions about AI and the future of music.

Are these actions malicious, or do they raise questions of ethics and business practice? While much of this falls outside the typical “fair use” definition, AI innovation is outstripping the pace of legal and regulatory developments, placing much of this scraping activity in a grey area. This leaves companies with a decision: block content or not?

The Road Ahead

To prevent ChatGPT or other generative AI models from using your data, a first step could be to block traffic from the Common Crawl bot, CCBot, either through a rule in your robots.txt file or by blocking the CCBot user agent. However, some traffic from the ChatGPT plug-in now comes from sophisticated bots that mimic human traffic, so blocking CCBot alone is insufficient. Notably, LLMs like ChatGPT employ other, subtler methods to scrape content, which are likewise difficult to counteract.
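Both blocking options can be sketched briefly, assuming the first takes the form of a robots.txt rule, which a well-behaved crawler such as CCBot is expected to honour. The helper names `robots_allows` and `is_blocked_agent` and the blocklist below are hypothetical, chosen for illustration. In practice the user-agent check would sit in a web server or firewall rule rather than in application code.

```python
from urllib.robotparser import RobotFileParser

# Option 1: a robots.txt rule asking CCBot not to crawl any path.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""

# Option 2: a hypothetical blocklist of user-agent tokens to refuse outright.
BLOCKED_AGENT_TOKENS = ("ccbot",)


def robots_allows(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def is_blocked_agent(user_agent):
    """Return True if the request's User-Agent header matches the blocklist."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


if __name__ == "__main__":
    ccbot = "CCBot/2.0 (https://commoncrawl.org/faq/)"
    browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    for agent in (ccbot, browser):
        allowed = robots_allows(agent, "https://example.com/article")
        print(agent, "| robots.txt allows:", allowed, "| blocked:", is_blocked_agent(agent))
```

Neither option stops a bot that ignores robots.txt or sends a browser-like user agent, which is why these measures are only a first step.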

Another approach is to place content behind a paywall, which will deter scraping unless the scraper pays for access. However, this limits organic views on media websites and may frustrate human readers. Given the rapid advancement of AI technology, will this be an effective solution in the future?

If a significant number of websites begin blocking web scrapers from accessing data supplied to Common Crawl or used by ChatGPT and similar tools, developers may stop sharing their crawler identity in user agents, forcing companies to adopt increasingly advanced techniques to detect and block scrapers.

Additionally, companies like OpenAI and Google may choose to build data sets that train their AI models using Bing and Google search engine scraper bots. This would make opting out of data collection difficult for online businesses that rely on Bing and Google to index their content and drive website traffic.

The future of AI and content scraping remains uncertain, but the technology, rules, and regulations surrounding it will continue to evolve. Companies must determine if they want their data to be scraped and what constitutes acceptable use for AI chatbots. Content creators seeking to avoid web scraping must be prepared to bolster their defences as scraping technology advances and the generative AI market expands.