50% OF WEB TRAFFIC IS BOT ACTIVITY

Bots are proliferating across the web at a rapid pace. A recent report from cybersecurity firm Norton by Symantec revealed that in 2016 no fewer than 6.7 million additional bots joined the ranks of these automated computer programmes. Some are benevolent, others are web burglars, and others are basically sales agents. The general public knows little or nothing about them, yet they account for 50% of global web traffic and cause disruptions to online business to the tune of $24.5 billion a year. We met up with Fabien Grenier, a rather special bot hunter who is already looking at the second phase of the process – how to take advantage of this automated traffic – and asked him some questions.


DO YOU REALLY KNOW WHO VISITS YOUR WEBSITE?

L’Atelier BNP Paribas: You’ve said that 50% of all web traffic is bot activity. Where does this statistic come from?

Fabien Grenier: First and foremost, we need to understand that we’re talking here about invisible traffic. Or it would perhaps be clearer to say that bot activity is another 100% on top of the people-generated web traffic. If you take a website like LeMonde.fr, there’s another 100% invisible web traffic on top generated by bots. Why ‘invisible’? Because most robots don’t use JavaScript and JavaScript is what Google Analytics, Médiamétrie, and other web analytics companies such as AT Internet use to measure traffic on a website. So how do we arrive at this figure? It’s the average observed at our customers’ and users’ sites. We’ve audited hundreds of websites, media sites, e-commerce sites and classified advertising sites, i.e. sites for small ads and online directories, and we’ve worked out that on average an extra 100% of traffic is generated by bots.
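
To make the "invisible traffic" point concrete: JavaScript tags only count visitors whose browsers execute them, whereas every request, human or bot, leaves a line in the web server's access log. Purely as an illustration (the log path, log format and user-agent keywords below are assumptions, not Datadome's method), here is a minimal sketch that counts how many logged requests openly declare a bot-like user agent:

```python
import re
from collections import Counter

# Assumed Apache/Nginx "combined" log format; the path is a placeholder.
LOG_PATH = "access.log"
# Very rough keyword list; many bots spoof ordinary browser user agents.
BOT_KEYWORDS = ("bot", "crawler", "spider", "scraper", "curl", "python-requests")

# In the combined log format, the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_PATTERN.search(line)
        user_agent = match.group(1).lower() if match else ""
        is_bot = any(keyword in user_agent for keyword in BOT_KEYWORDS)
        counts["declared_bot" if is_bot else "other"] += 1

total = sum(counts.values()) or 1
print(f"Requests declaring a bot user agent: {counts['declared_bot']} "
      f"({100 * counts['declared_bot'] / total:.1f}% of {total} logged requests)")
```

A check like this only catches bots that announce themselves; more sophisticated robots hide behind browser user agents, which is why the behavioural analysis Grenier describes next matters.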

When you’re hunting bots on the web, how do you differentiate between people, ‘good bots’, and ‘bad bots’?

Well, we install our module as far upstream as possible on the web servers. We use artificial intelligence and machine learning to scan and intercept in real time all the traffic coming in to our customers' sites. We trace the origin of each request and analyse the hits, in other words the digital fingerprints. We also analyse behaviour, because a robot clearly doesn't behave like a person. For example, a bot never goes back to the home page, never backtracks, and refuses to accept session cookies. So by correlating these technical and behavioural criteria we can detect 99% of all bots in play.
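
As a purely illustrative sketch of what "correlating technical and behavioural criteria" can mean in practice (the signals, weights and threshold below are assumptions drawn from the behaviours Grenier lists, not Datadome's actual model), a naive scoring function might look like this:

```python
from dataclasses import dataclass

@dataclass
class Session:
    """Aggregated view of one visitor's requests (all fields hypothetical)."""
    visited_home_page: bool        # bots rarely return to the home page
    backtracked: bool              # bots rarely revisit pages they have seen
    accepted_session_cookie: bool  # many bots refuse session cookies
    executed_javascript: bool      # most bots never run the analytics tag
    requests_per_minute: float

def bot_score(session: Session) -> float:
    """Combine technical and behavioural signals into a 0..1 suspicion score."""
    score = 0.0
    if not session.visited_home_page:
        score += 0.2
    if not session.backtracked:
        score += 0.2
    if not session.accepted_session_cookie:
        score += 0.25
    if not session.executed_javascript:
        score += 0.2
    if session.requests_per_minute > 120:  # far faster than a human reader
        score += 0.15
    return min(score, 1.0)

# Example: a fast visitor that refuses cookies and never runs JavaScript.
suspect = Session(False, False, False, False, 300.0)
print(f"suspicion score: {bot_score(suspect):.2f}")  # -> 1.00
```

A real system would learn such weights from labelled traffic rather than hard-coding them; the point is simply that signals which are weak on their own become discriminating once combined.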

Once we’ve detected them, we classify them in large groups on a real-time dashboard: the good bots, the bad bots, and the sales bots. The good bots are all those that raise your profile and bring you traffic – search engines such as Google, Bing, Yahoo, and social networks such as Facebook, Twitter, and Pinterest – so by default we let them pass. The bad bots are the ones that hack into your systems, steal your identity, spam you, run fraudulent advertising or steal your content. By default, we block these. Behind the bad bots are mainly either hackers or companies that wish to mask their own identity.

Where do you categorise sales bots?

Our third category consists of sales bots. When we’re installing our solution for a client, we draw up a strategy with the client: we need to understand which companies they are looking at for potential business and which they wish to block. For example, there are advertising companies that crawl websites on a permanent basis, there are marketing tools, data suppliers, price comparison sites, security firms, media tracking companies, media tracking systems on the social networks, or business intelligence companies. If I’m a major media site, for instance, rather than letting media monitoring companies access my content, take it over and sell it without my authorisation – especially since it’s subject to copyright – I’ll identify them and then allow them to work with me, enabling them to access the information they need, to use my data, but via an API and on my terms. The advantage of an API is that they’ll be able to benefit from legal, structured information available in real time. So in fact what we’ll be doing there is lead generation. Our customers will receive emails from companies which create these robots, with a view to setting up partnerships.

In short, Datadome enables you to detect robots and then block them or generate new business opportunities by putting those who need the data in touch with those who produce it. 

Everyone’s now talking about the sacred fire that is AI, but in the end what is AI worth without its fuel, i.e. the data?

That’s right, there is a huge amount of talk about artificial intelligence and machine learning. But these techniques only work if you have the fuel, i.e. the data. Algorithms are able to learn but they need case information in order to do so. What do I mean by ‘cases’? Well, it’s the web. So I want to hoover up the entire web with my huge tubes to feed the AI, but I need the agreement of the people whose sites I’m crawling, and that’s where Datadome comes in, by offering to standardise the process and create synergies. The idea is not to build walls, but rather to create platforms, bridges, so that everyone can make good use of the data. We want to create a situation where all these Big Data companies, which need fuel, can access this strategic data more easily, on conditions set by the publishers.

So we’re talking about a Data Marketplace connecting content publishers and Big Data players? Nothing of the sort existed previously?

I’d talk about an API rather than a Data Marketplace. Up to now, in the same way that content publishers don’t have any tools to enable them to control and monetise the data they produce, the Big Data players have no choice but to turn to ‘web scraping’ – a range of techniques for extracting data content from a website – to get hold of the data they need. Behind the robots there are very often Big Data firms who want to access the content but neither the technical processes nor sales channels exist to enable them to do so. If you’re a business intelligence firm, for instance, and you want to analyse the reputation of a brand, you need to gather comments from major media websites, but they just aren’t for sale, there’s no API available. So you have no choice but to develop a robot to get hold of them. In the same vein, you’re Samsung, say, and you need to track the prices being charged for your products. You have no alternative but to call on an agency that’ll do the web-crawling for you and give you the information in dashboard format. 

Does your technology guarantee that a bot can’t slip through into a client’s system?

With machine learning we can detect 99% of all bots operating. We have the means of making our algorithms more intelligent, getting them to evolve by using artificial intelligence. Each time they process a new case, they become a bit ‘smarter’. It’s really like the antivirus industry, it’s a cat and mouse game. There’ll always be new opportunities for robots, new technologies to improve, so we need to be able to grow at their speed and even try to stay one step ahead of them. That’s why I was talking about the importance of automated rules in real time. If you’re still working with manual rules, you’ll always be running behind, because you’ll need to undergo an attack in order to detect the threat and catch it, but that’ll already be too late as the robot will have already reached a thousand IP addresses. So the only thing you can do is to use artificial intelligence and machine learning in order to detect fresh suspicious behaviour in real time. 

On the other hand, what’s the likelihood that you’ll block a human visitor and spoil his/her user experience?

When our algorithms detect suspicious behaviour, they post a Captcha, you know, like Google’s reCaptcha functionality where you tick a box to prove that you’re not a robot. These Captchas stop the system from blocking a user displaying compulsive human behaviour which might be likened to robot behaviour. And we give our client the option of viewing in real time on their dashboard the number of Captchas that have been solved so that they will see that Datadome never allows the user experience to be spoiled. Of course at Datadome we monitor what’s happening for all our customers and alerts go to our data analysts whenever there’s an abnormal number of Captchas going through the system. Then we work on the algorithms so we won’t find ourselves in that situation again. But a user will never get blocked by Datadome technology; that can’t happen.
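
For readers wondering what ‘posting a Captcha’ involves technically, the sketch below shows a generic server-side verification of a Google reCAPTCHA token via the public siteverify endpoint. This is a simplified illustration, not a description of Datadome’s internals; the secret key and the allow/block handler are placeholders.

```python
import requests  # third-party HTTP library

RECAPTCHA_VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"
RECAPTCHA_SECRET = "your-secret-key"  # placeholder, issued by Google per site

def captcha_passed(recaptcha_token: str, client_ip: str) -> bool:
    """Ask Google whether the token submitted by the browser is a valid, solved Captcha."""
    response = requests.post(
        RECAPTCHA_VERIFY_URL,
        data={
            "secret": RECAPTCHA_SECRET,
            "response": recaptcha_token,
            "remoteip": client_ip,
        },
        timeout=5,
    )
    return response.json().get("success", False)

def handle_suspicious_request(recaptcha_token: str, client_ip: str) -> str:
    # A solved Captcha lets a human with "bot-like" behaviour continue unharmed;
    # an unsolved one keeps the suspected robot out.
    return "allow" if captcha_passed(recaptcha_token, client_ip) else "block"
```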

Is a Captcha sufficient to distinguish between a bot and a human being?


A robot won’t be able to get past a Captcha, but a person can, and will then continue his/her visit to the website. There is a tiny proportion of robots that might be able to pass a Captcha, but it’s marginal. A more likely scenario is the use of Captcha farms in India or Madagascar: thousands of people sitting behind computers and spending their days solving Captchas so that robots can continue through to websites without being impeded. If a robot meets a Captcha, the challenge will be sent to Madagascar, and an employee at the Captcha farm will tick the box or copy the code and pass the test. So the robot will then be able to carry on. At Datadome, we’ve learnt to detect this type of farm automatically. For those wishing to get their robots through, it costs a lot less to ask a firm in India, for example, to solve the Captchas than to develop more sophisticated robots capable of solving them automatically, which would require a huge amount of investment.

So what’s your business model? 

Our business model is very simple. On the protection side, we have a SaaS (Software as a Service) offering; a publisher pays a subscription that varies according to the traffic, the volume of hits that needs protecting. Then there’s a second offering: data monetisation. When we help our content publishers to get paid by Big Data players, we take a percentage of the revenue generated.

And who are your clients? 

Basically, we work for websites providing content, and we have references from three types of organisation: media companies, including Le Figaro, Ouest France and Le Parisien; e-commerce websites such as Price Minister and Blablacar; and classified sites like Cairn.info, which is a directory of scientific publications, and Yellow Pages France. We have 20 active clients and, on the basis of yearly subscriptions, we analyse ten billion hits a month to ensure their security. We make 12% of our turnover from foreign clients – in Australia, the United States, the United Arab Emirates and Eastern Europe.

Do you work with banks?

We’re currently at the POC [Proof of Concept] stage with one major bank. Wherever there’s a database, we’ll protect it. We protect, for example, the private areas of the Blablacar site. As regards hijacking accounts, websites such as Instagram, Yahoo and LinkedIn have had their users’ private data stolen. Login-password combinations and information like that change hands for a handful of bitcoins on the Dark Web. So you can buy them and then programme robots that will try out all these combinations so as to connect to any and every website you can think of. All users who have the same login-password combination will be hacked like that. Except that the publisher will be under the impression that it’s a user who’s visiting the website, because there’s no sign of brute force, as the robots are working across thousands of IP addresses. And as there’s no alert system, if you have no technical assistance such as Datadome, you won’t be aware that a robot is clearing out your account or making a transfer to an account abroad. The bank won’t be aware either, until a user makes a complaint or reports a fraudulent transaction.
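
This kind of credential stuffing is hard to spot precisely because each individual IP address behaves gently. As a hedged illustration (the window size, threshold and data structures are assumptions, not a description of Datadome’s detection), one classic countermeasure is to count how many distinct IP addresses fail to log in to the same account within a short window, something a per-IP rate limit would never catch:

```python
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 600        # ten-minute sliding window (illustrative)
DISTINCT_IP_THRESHOLD = 5   # illustrative: few humans mistype from 5+ networks

# account -> deque of (timestamp, ip) for recent failed logins
failed_logins = defaultdict(deque)

def record_failed_login(account: str, ip: str, now: Optional[float] = None) -> bool:
    """Record a failed login; return True if the account looks under credential stuffing."""
    now = now if now is not None else time.time()
    events = failed_logins[account]
    events.append((now, ip))
    # Drop events that have fallen out of the sliding window.
    while events and now - events[0][0] > WINDOW_SECONDS:
        events.popleft()
    distinct_ips = {event_ip for _, event_ip in events}
    return len(distinct_ips) >= DISTINCT_IP_THRESHOLD

# Example: the same account attacked from many addresses triggers an alert.
for i in range(6):
    alert = record_failed_login("alice@example.com", f"203.0.113.{i}")
print("credential stuffing suspected:", alert)  # -> True
```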

The new EU Payment Services Directive (PSD2) will provide Datadome with a new opportunity to win over the banks, won’t it?


With some bancassurance firms, we can spot robots connecting to their customers’ accounts. There are money management products such as Bankin, account aggregators to which you can give your access codes so that they go and find information. And how do they get that information? By using robots. But at the moment they aren’t always authorised to obtain all the information they actually come up with. However, from January 2018, under the new EU Payment Services Directive (PSD2), they will have the right to do so, and some bancassurance companies want to know which aggregators are looking for their customers’ data, in order to monitor and deal with them.

The General Data Protection Regulation (GDPR), with its potential fine of up to 4% of a company’s annual worldwide turnover, will force digital companies to take the threat of personal data theft very seriously. Will this provide another opportunity for Datadome?

Well, this is a real issue. Some online sports betting websites use our services because on this type of website you do have account balances, and some robots are able, if they have your ID information, to log in just as if they were you, change your bank details, and clean out your accounts, transferring the funds to a tax haven somewhere. Once again this is a problem of identity theft. And for all these site publishers it’s also a real reputational issue. If these data leaks get splashed across the media, that can have a very negative impact on them in terms of trust, which would then have an immediate effect on their turnover. Under the GDPR, which is due to come into force in May 2018, a company’s Data Controller will be under a legal obligation to report any data breach to the Supervisory Authority immediately, which makes it very likely that the media will hear about it, in turn increasing the risk of reputational damage. And then there’s the potential fine of up to 4% of annual worldwide turnover payable by any company found to be at fault. However, people are not yet making the link between the problems of data theft and the requirement to protect data on the one hand, and the threat from bots, especially in terms of account hijacking, on the other.

What are your targets for 2018? 

In 2018 we’d like to scale up. Since January we’ve been growing at a double-digit rate. We hope to continue developing at this pace in France and sign up our first flagship companies elsewhere in Europe. That’s our first goal. Today our main income is from data protection and our second objective is to grow our monetisation activity, as we think it’s this area that will enable us to turn Datadome into a very large company. That’s the second stage of our rocket, if you like. This will be an upsell vis-à-vis all our clients who have gone for our protection software. They will have time to start understanding the potential of their data when they observe those commercial robots on their dashboards. So, in this second stage we’ll be inviting them to draw benefit from this automated traffic. Our aim is to transform a bot threat into a business opportunity.

End of the interview with Fabien Grenier 

We then asked our specialist Yoni Abittan, who is a strategic analyst at L'Atelier BNP Paribas, what he thought about this issue. His reply was: 


Nowadays, cyberattacks are increasing exponentially. ‘Bad bots’ cause damage and incur huge costs for major corporations, state organisations and small or medium-sized companies alike; they can paralyse their activities and hurt their business and their reputation. In addition to the startups that are doing an amazing job to counter bot activity, academic research can also help to design solutions based on studying hackers’ various behavioural models and scenarios so as to detect them proactively. One might for instance envisage ‘secure by design’ solutions being jointly developed by entrepreneurs, researchers and designers. Given the sophisticated methods used by hackers, the solutions developed by startups to counter bad bots will not be sufficient on their own.


By Oriane Esposito
Editorial Manager