Bad bots might have tarnished the reputation of automated scripts, but good bot management is actually highly nuanced. We break down everything from differentiating bot and human traffic to the essential strategies you can start using today.
It might come as a surprise that today, more than 40% of all internet traffic comes from bots. So, if you own a website, there’s a high likelihood that a significant share of its traffic doesn’t come from legitimate human users.
The thing is, there are bots that are going to be beneficial for your website, like Google’s crawler, and there are also bad bots that are built with malicious intent. We wouldn’t want to accidentally block the useful good bots, but at the same time not blocking these harmful bots can cause significant damage to your website and (potentially) your reputation.
This is why having a bot management strategy in place is now crucial for any business that’s serious about its cybersecurity. A good bot management practice should include two different layers:
1. Differentiating between legitimate human users and bot traffic to avoid false positives.
2. Differentiating between good bots and bad bots, managing traffic coming from good bots while blocking activities from malicious bots.
Here, we will discuss how we can do it, but first, let us start from the very beginning.
What are Internet Bots?
A bot, or to be exact, an Internet bot, is a software programme that operates on the Internet to perform automated, repetitive tasks. They are called ‘bots’ because they can automatically perform their tasks without the intervention of human users, and they can do their tasks much faster than a human ever could.
As discussed, while bots can be useful and beneficial, there are also bots developed and used by cybercriminals to carry out malicious tasks.
What are Good Bots?
We define good bots as bots that are beneficial to the website and/or to the website users. Google, Bing, and other search engines, for example, are made possible with the help of search crawler/spider bots such as Googlebot and Bingbot.
One of the key factors in recognising good bots is their sources: good bots typically are deployed by reputable companies, and in most cases will make themselves identifiable while conforming to the rules/policies set by the website owners (typically in a robots.txt file).
An important thing to note is that although good bots are beneficial in most cases, not all of them are going to be useful for your website’s objective. For example, if you are not serving the Chinese market, then allowing Baiduspider (Baidu’s crawler bot) to crawl your site might translate into an unnecessary waste of resources. So, we might want to block its activity by configuring our robots.txt file.
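As a rough illustration of the robots.txt approach, the sketch below defines a hypothetical robots.txt that disallows Baiduspider (the user-agent token of Baidu’s crawler) while leaving the site open to other crawlers, and checks the rules with Python’s standard urllib.robotparser module. The example.com URL and the helper name build_parser are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Baidu's crawler, allow every other crawler.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# example.com is a placeholder domain used only for this sketch.
baidu_allowed = parser.can_fetch("Baiduspider", "https://example.com/products")
google_allowed = parser.can_fetch("Googlebot", "https://example.com/products")

print("Baiduspider allowed:", baidu_allowed)
print("Googlebot allowed:", google_allowed)
```

Keep in mind that robots.txt is only a request: well-behaved crawlers honour it, but it does nothing to stop a bot that chooses to ignore it.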
Other examples of good bots include:
Monitoring bots: bots deployed to monitor website metrics and provide information for users. For example, there are bots that monitor whether a site is down, and there are also bots from SEO tools/solutions designed to monitor the site’s link profile.
Chatbots: bots deployed on the site to provide automated chat while imitating human conversation. Today’s chatbots are pretty sophisticated and can answer pretty complex questions while maintaining lengthy conversations.
Copyright bots: as the name suggests, these bots crawl websites/platforms looking for content that might violate copyright law. These bots are typically deployed by companies that own copyrighted assets (e.g. music publishers).
Feed bots: these bots are deployed to look for relevant, valuable content to add to a platform/aggregator’s feeds. Social media networks, content aggregator sites, and other platforms may deploy these bots.
Partner bots: bots deployed by vendors of solutions you use on your site. For example, if you use SEMRush, then it might deploy a bot to crawl and monitor your site’s SEO performance.
eCommerce bots/Shopbots: bots that crawl the internet looking for the most affordable products available.
What are Bad Bots?
Bad bots are bots that are specifically designed and deployed to perform malicious tasks. As we’ve mentioned, if a bot is coming from an unknown source (or if it masks its identity), then most likely it’s a malicious bot.
These deceptive bots won’t follow policies set by your robots.txt file, and they work in an evasive manner by masking their user-agents (UAs), rotating between hundreds if not thousands of different IP addresses, and so on.
While these bots are usually deployed by cybercriminals, scammers, fraudsters, and other parties dealing with illegal activities, they can also be sent by your business’s competitors to steal content from your website, launch a Layer 7 DDoS attack, and so on.
Even when we’ve successfully blocked a malicious task attempted by a bad bot (e.g. content scraping), its presence can still strain your web servers and consume your available bandwidth, slowing your site for legitimate users.
The thing about bad bots is that they are now pretty easy to develop and pretty affordable to purchase/rent. You can, for example, purchase basic malicious bots to generate fake reviews or commit ad fraud for below USD 5. Of course, cheap/basic bots are pretty easy to defend against with today’s various bot mitigation solutions, but there are also advanced, sophisticated bots that can be really effective in mimicking human behaviours. They can be really difficult to detect and mitigate.
Here are some common tasks performed by bad bots:
DDoS attacks: in a DDoS (Distributed Denial-of-Service) attack, perpetrators flood the target website with requests to slow it down or cause complete failure. However, we have to differentiate bad bots used to launch DDoS attacks from botnets. A botnet is a network of PCs or IoT devices (owned by real people) that have been infected by malware, so attackers can control these devices remotely and use them to launch a DDoS attack.
Web/content scraping: while web scraping might not be 100% malicious (web crawling by Googlebot, for example, is a form of web scraping), there are bad bots that are specifically designed to steal and extract hidden and/or copyrighted content within the website. Stolen data can include product prices (which can then be leaked to your competitors), secret information, intellectual property, etc. Ticketing websites and similar websites are vulnerable to this type of bot.
Click/ad fraud: in this case, a bot clicks on an ad to boost the ad revenue for the website. These bots can cost advertisers a lot of money because they end up paying for fraudulent clicks that don’t come from their target audiences and never turn into revenue/conversions. Click fraud bots can also be deployed by your competitors to deliberately drive up your PPC ad cost.
Credential stuffing: these bots are used to launch automated brute force attacks using the billions of stolen credentials circulating on various dark web forums and marketplaces. They exploit the fact that many of us reuse the same usernames and passwords across our accounts, which is why credential stuffing attacks have a relatively high success rate.
Spambot: this is an umbrella term covering any type of spammy behaviour performed by bad bots. Examples include sending spam emails with malicious links, spamming blog comment sections/forums, generating fake/biased reviews, and generating fake page views or fake followers on social media. On a larger scale, these bots can be deployed massively to rig election votes, launch political propaganda, etc.
Credit card fraud: bots are deployed to look for credentials related to credit card accounts (also debit card and gift card accounts) so that attackers can create counterfeit cards and steal the cash value of the cards. Bots are typically used to test stolen card credentials by performing small transactions (below USD 1), and will automatically make large purchases once a test succeeds.
The Challenges of Detecting Bad Bots
As we’ve mentioned, there are two layers of challenges in detecting bad bots: differentiating bots from legitimate human traffic and differentiating good bots from bad bots.
Differentiating good bots from legitimate human traffic might not be too difficult, since most good bots will tell you that they are indeed bots. However, bad bots are built with the intention of disguising themselves as legitimate users.
On top of that, bad bots today are evolving rapidly, with perpetrators now using the latest technologies like AI and machine learning to create more advanced bots, so they are now much more effective in evading the bot mitigation systems. Bad bots have evolved dramatically in the past decade, and now we can classify these bots into four different ‘generations’:
Gen-1: the most basic form of malicious bots, they’re built with basic scripting tools and are mainly deployed to perform straightforward and repetitive tasks like web scraping and spamming. They tend to use inconsistent UAs, making the detection fairly manageable. We can use IP-based blacklist/whitelist against them since gen-1 bots tend to use only one or two IP addresses to make a lot of requests.
Gen-2: mainly operate in web applications and tend to use headless browsers like PhantomJS and headless Chrome/Firefox. They are often used to launch DDoS attacks, ad/click fraud, and web scraping. Effective mitigation includes identification of the traffic’s browser and device characteristics (fingerprinting).
Gen-3: these bots use real web browsers that are infected by malware, so they are much more difficult to detect, requiring challenge tests and fingerprinting. They can also simulate basic human behaviours like simple mouse movements and keystrokes but are not yet very advanced in this regard. Interaction-based behavioural analysis is required to detect gen-3 bots.
Gen-4: the newest generation of bots features randomised human-like behaviours like nonlinear mouse movements, randomised keystrokes, etc. They can also change their UAs periodically while rotating between thousands of IP addresses. Mitigating gen-4 bots requires AI-based bot management solutions that can perform advanced behavioural analytics.
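To make the gen-1 case more concrete, here is a minimal sketch of flagging an IP address that sends a lot of requests in a short period, using a sliding time window. The class name IpRateMonitor, the thresholds, and the sample IP address are all assumptions for illustration; a real deployment would tune the limits against its own traffic, and this technique alone will not catch bots that rotate between many IP addresses.

```python
from collections import defaultdict, deque


class IpRateMonitor:
    """Flags an IP that sends more than max_requests within window_seconds."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        # Both limits are illustrative assumptions, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request and return True if the IP is over the limit."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Discard timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


monitor = IpRateMonitor(max_requests=3, window_seconds=10.0)
# Four requests from the same (documentation-range) IP within ten seconds.
flags = [monitor.record("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(flags)
```

An IP flagged this way is a candidate for an IP-based blacklist, or for one of the softer responses such as throttling.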
Bad Bot Management Best Practices
1. Manage, don’t block
It’s important to not use a one-size-fits-all approach in mitigating bot traffic since, as we’ve discussed, we wouldn’t want to accidentally block the good bots that are beneficial to our site.
Basic bot management solutions tend to block bot traffic without further analysing whether it is a good or bad bot, which can lead to several issues:
• If we block without discrimination, it can lead to low or even no search engine visibility, which will affect your site’s overall traffic, among other disadvantages.
• Blocking returns an explicit rejection to whoever sent the bot, letting the attacker know that our defence system has detected it. The attacker can then use this information to modify the bot and attack our site again in the future.
Thus, a proper bot management solution that applies behavioural and signature-based analysis before it blocks incoming traffic is the preferable approach: it can let good bots and legitimate traffic visit the website while effectively blocking only the bad ones.
2. Invest in good bot management solutions
To effectively manage advanced gen-4 bots, an AI-based bot management solution with advanced behavioural analysis capability is required. Datadome, for example, utilises AI and machine learning to perform real-time behavioural and fingerprinting-based bot protection to effectively detect and manage unknown bots.
After the traffic is detected as a bad bot, we have several different options:
Block the source altogether: blocking the traffic might seem like the most effective and cost-efficient approach since we don’t need to process and monitor traffic while applying protection rules. However, as we’ve discussed above, a persistent attacker can then use this information to learn about your defensive measures and use it to update the bot. This approach is still useful in a lot of cases, especially if you also utilise a self-learning management solution (via machine learning). We can also perform what’s called silent deny, which is blocking the request without returning any error code to the client, masking our blocking response.
Feed fake data: another effective approach is to keep the bot active but reply with fake information, which is useful in countering web scraper bots. You can also redirect the bot to another web app where the content has been modified/reduced to prevent the bot from accessing your original content.
Throttling: instead of blocking the bot altogether, we can slow down its activities by deliberately lowering bandwidth, with the hope that the attacker will give up since the task is performed slowly. We can, for instance, allow a request to the site after inserting an 8 to 30-second delay.
Challenge: present the suspected bot with a test, for example a sudden CAPTCHA or an invisible challenge such as a mandatory form field that requires typed input.
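The options above can be combined into a simple decision function. The sketch below is only one possible arrangement: it assumes a hypothetical bot_score between 0.0 (almost certainly human) and 1.0 (almost certainly a bad bot) produced by some upstream detection step, and the score thresholds are invented for illustration. The 8 to 30-second throttling delay comes from the text; scaling it linearly with the score is our own assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"
    THROTTLE = "throttle"
    FAKE_DATA = "fake_data"
    SILENT_DENY = "silent_deny"


@dataclass(frozen=True)
class Decision:
    action: Action
    delay_seconds: float = 0.0


MIN_DELAY, MAX_DELAY = 8.0, 30.0  # throttling range mentioned in the text


def decide(bot_score: float, is_scraping_content: bool = False) -> Decision:
    """Map a hypothetical bot score (0.0 to 1.0) to a response."""
    if not 0.0 <= bot_score <= 1.0:
        raise ValueError("bot_score must be between 0.0 and 1.0")
    if bot_score < 0.3:  # assumed threshold: probably a legitimate visitor
        return Decision(Action.ALLOW)
    if bot_score < 0.6:  # assumed threshold: uncertain, so ask for proof
        return Decision(Action.CHALLENGE)
    if bot_score < 0.9:  # assumed threshold: likely a bot, slow it down
        fraction = (bot_score - 0.6) / 0.3
        return Decision(Action.THROTTLE, MIN_DELAY + fraction * (MAX_DELAY - MIN_DELAY))
    if is_scraping_content:
        # Keep the scraper busy with modified content instead of blocking it.
        return Decision(Action.FAKE_DATA)
    # Drop the request without an error code that would reveal the block.
    return Decision(Action.SILENT_DENY)


for score in (0.1, 0.5, 0.75, 0.95):
    print(score, decide(score))
```

Ordering the responses from least to most disruptive keeps the cost of a false positive low: a human who is misjudged at a middling score sees a challenge or a delay rather than a hard block.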
3. Minimise human error
Contrary to popular belief, many security vulnerabilities that are exploited by these bad bots are caused by (often simple) human error. A common example is when employees use weak passwords.
So, educating your team about common cybersecurity practices can be very effective for strengthening your defences around your infrastructure.
On the other hand, to minimise human error you can also use API management layers or other approaches that apply security precautions like encryption and authentication centrally, rather than relying on individual developers to get them right. Always aim to protect exposed APIs and mobile apps (as well as your website) and share blocking information between systems as fast as you can.
4. Blacklist known sources
A pretty basic but still useful approach to defend against gen-1 and gen-2 bots is to block traffic from sources known to send bot traffic. For example, bots are often run from hosting services like Digital Ocean, OVH Hosting, Gignet, etc.
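A blacklist of this kind is usually expressed as network ranges rather than individual addresses. The following sketch uses Python’s standard ipaddress module to test whether a client IP falls inside a blocked range. The ranges shown are reserved documentation networks standing in for real entries, and the name is_blocklisted is hypothetical; the actual ranges of any provider would have to come from that provider’s published data.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not real provider ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocklisted(ip: str) -> bool:
    """Return True if the address belongs to any blocked network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)


print(is_blocklisted("192.0.2.44"))   # inside the first placeholder range
print(is_blocklisted("203.0.113.5"))  # outside both placeholder ranges
```

Because legitimate services can also run on the same hosting providers, a match is better treated as one signal among several than as proof of a bad bot.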
Monitor traffic sources carefully for any suspicious signs such as high bounce rates, sudden spikes and/or sudden decrease of any metrics, failed login attempts, and so on.
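One simple way to spot a sudden spike or drop in a metric is to compare the latest value with the recent baseline. The sketch below flags a value that lies more than a chosen number of standard deviations from the mean of recent observations. The function name, the threshold of 3.0, and the sample figures (failed login attempts per hour) are assumptions for illustration, not measured data.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True if latest deviates strongly from the recent baseline."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is worth a look.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


# Hypothetical failed login attempts per hour over the last five hours.
failed_logins = [100, 104, 98, 101, 97]

print(is_anomalous(failed_logins, 250))  # sudden spike
print(is_anomalous(failed_logins, 102))  # within the normal range
print(is_anomalous(failed_logins, 20))   # sudden decrease
```

The same check can be applied to bounce rate, page views, or any other metric you already track, with the threshold adjusted to how noisy that metric normally is.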
Bad bots are becoming more advanced and sophisticated than ever before, and it’s much harder to differentiate them from legitimate human users. At the same time, we can’t simply block all bots (for example, with CAPTCHA), not only because there are useful good bots we wouldn’t want to block, but also because this kind of practice can hurt your site’s user experience.
So, having the correct approach in managing these bots, as we’ve discussed above, is very important: we should aim to manage incoming traffic carefully to avoid false positives while maximising security. Therefore, investing in a professional-grade, AI-based bot management system, such as Datadome, remains one of the best practices for any business serious about its cybersecurity.