What Are AI Web Crawlers, and Why Are Businesses Blocking Them?

Experts warn that AI crawlers hurt news publishers by using copyrighted content without attribution

Enterprises are increasingly blocking artificial intelligence (AI) web crawlers that scrape data from websites and disrupt their performance, industry experts reveal. These bots, designed to collect data for training AI models, have raised concerns about ethical practices, website functionality, and intellectual property (IP) rights.

Web Crawlers

AI crawlers like GPTBot, Bytespider, ClaudeBot, and PerplexityBot are widely used to train large language models. Unlike conventional search engine bots such as GoogleBot and BingBot, which follow ethical scraping rules, many AI bots ignore content guidelines. Their activities overload servers, increase costs, and create security risks for websites.

A recent analysis by Cloudflare shows nearly 40% of the top internet domains have moved to block AI crawlers. Content delivery network companies report that bots from major players like TikTok (Bytespider) and OpenAI (GPTBot) dominate internet scraping activities. While some bots adhere to rules, most website owners are still choosing to block them, citing performance degradation and data misuse.
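In practice, many site owners signal these blocks through their robots.txt file. A minimal sketch of such a policy, using the crawler user-agent names mentioned above (the exact directives any given site uses will vary), might look like this:

```
# Block common AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Allow everything else (e.g. search engine bots)
User-agent: *
Allow: /
```

Note that robots.txt is advisory, not an enforcement mechanism: as the article notes, many AI bots simply ignore it, which is why operators also turn to network-level blocking through providers such as Cloudflare or Akamai.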

Raj Shekhar, Responsible AI lead at Nasscom, warns that AI crawlers hurt news publishers by using copyrighted content without attribution. He points to the ongoing legal battle between ANI Media and OpenAI as a cautionary tale. "AI developers must respect IP laws and ensure compliant data collection to avoid liabilities," Shekhar stated.

Reuben Koh, director of security at Akamai Technologies, explains how AI scrapers harm websites. "These bots interact with websites intensively, scraping every piece of content and affecting performance," he said. Unlike conventional crawlers, AI bots use advanced intelligence to classify and prioritize data, making them harder to manage.

Traditionally, web crawlers follow the robots.txt protocol to respect website owners' preferences. Search engine bots like GoogleBot adhere to predictable indexing schedules, allowing websites to prepare for their impact. However, newer AI crawlers operate unpredictably, often ignoring established guidelines.
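A compliant crawler checks robots.txt before fetching any page. The sketch below, using Python's standard urllib.robotparser module, shows how a rule-following bot would apply the kind of policy described above (the user-agent names and URL are illustrative):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt that blocks GPTBot but allows other bots.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant search engine bot may fetch the page...
print(parser.can_fetch("GoogleBot", "https://example.com/article"))  # True
# ...while a compliant GPTBot would see it is disallowed and stop.
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
```

The protocol only works when crawlers volunteer to run a check like this; a bot that skips it faces no technical barrier, which is the gap the newer AI crawlers are accused of exploiting.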

Akamai's "State of the Internet" report highlights an alarming trend—40% of all internet traffic now comes from bots, with 65% of these classified as malicious. Koh warns of crawlers designed for fraudulent purposes, further complicating the issue.

Despite the challenges, experts caution against outright blocking all AI crawlers. Websites rely on search engine visibility to attract traffic and customers. With AI search gaining popularity, balancing legitimate and harmful bot activities is crucial. "Enterprises must carefully assess whether they are blocking revenue-generating bots or allowing malicious ones," Koh emphasized.

The debate on AI crawlers is far from settled. While they offer valuable tools for advancing AI technologies, their unchecked growth poses risks to website security, performance, and intellectual property rights. As the internet evolves, businesses face the challenge of balancing discovery with protection.
