Big Data good, Smart Data better! Given the oceans of data which we soak up a little more of each day, a number of companies have embarked on what might be described as the work of a jeweler or goldsmith: painstakingly transforming the precious raw material that is Big Data into highly polished tools to aid decision-making through the use of Artificial Intelligence (AI). The real value of this new ‘black gold’ lies not in its quantity but in its quality, its reliability, and the potential correlations with similar data, i.e. alternative, heterogeneous, non-structured data: the hidden, submerged face of Big Data. L’Atelier recently met up with one of these pioneers, Thanh-Long Huynh, co-founder of QuantCube Technology, a French startup specializing in the predictive analysis of Big Data as applied to finance and the economy. Over the last four years, he has refined and aggregated no fewer than eight billion units of data and channelled the results to assist banks, private investors and other institutions. This Frenchman of Vietnamese origin is a real 'data whisperer' who can coax results out of the 'data horse'.
Thanh-Long Huynh's AI tools are able to spot or forecast financial, economic and political trends or occurrences one to six months in advance. He predicted Brexit, the election of President Trump fifteen days before it happened, and the Macron-Le Pen electoral duel. He has also proved capable of assessing the growth potential of a city or a country, taking into account, for example, disruptions in social stability caused by natural disasters, based on satellite image analysis and oceanographic data. So we went and quizzed data clairvoyant Thanh-Long Huynh about his model, the range of possibilities, and existing and dreamed-of applications in the Smart City of tomorrow, not forgetting of course the ethical considerations involved.
smart data : the hidden face of big data
L'Atelier : The starting point for QuantCube was predictive analysis models using Big Data to build investment strategies. Can you tell us more about that?
Thanh-Long Huynh : When we founded the company in 2013, it was basically with a view to creating a new generation of FinTech investment strategies on the basis of data analysis, which is today generally known as Big Data. At the time, we took data from the social networks, including Twitter, Facebook and LinkedIn, and this data gave us a 'sentiment score' for each share, so we could recommend the shares whose price was likely to rise. Our recommendations were given on the basis of natural-language processing (NLP) – i.e. the systematic analysis of texts, from which we can determine whether a given message is positive or negative in tone. By analyzing and aggregating all these messages posted on the social networks, we came up with a number of financial indicators for each share and we set up a portfolio containing the shares with the highest sentiment scores. At that stage we weren't doing any trading; the purpose was to demonstrate the validity of our models and analyses. We tracked the data on the social networks without having any process for investment decision-making. That was the first iteration.
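To make the first iteration described above more concrete, here is a minimal sketch of how messages might be turned into a per-share sentiment score and a portfolio of the highest-scoring shares. It is an illustration only, not QuantCube's method: the word lists, the tickers `AAA`, `BBB` and `CCC`, and the function names are all hypothetical, and a simple word count stands in for a real NLP model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy lexicon (assumption); a real system would use a trained NLP model.
POSITIVE = {"beat", "strong", "growth", "upgrade", "record"}
NEGATIVE = {"miss", "weak", "lawsuit", "downgrade", "recall"}


def message_sentiment(text):
    """Score one message in [-1, 1] by counting lexicon words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def sentiment_scores(messages):
    """Average the message scores per share; messages are (ticker, text) pairs."""
    by_ticker = defaultdict(list)
    for ticker, text in messages:
        by_ticker[ticker].append(message_sentiment(text))
    return {ticker: mean(values) for ticker, values in by_ticker.items()}


def top_shares(scores, n):
    """Return the n shares with the highest sentiment score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Made-up messages for three made-up tickers.
messages = [
    ("AAA", "Record quarter, strong growth!"),
    ("AAA", "Analysts upgrade after earnings beat"),
    ("BBB", "Product recall and a lawsuit"),
    ("CCC", "Weak guidance but record sales"),
]
scores = sentiment_scores(messages)
portfolio = top_shares(scores, 2)
print(scores, portfolio)
```

In this toy run the portfolio holds the two shares with the most positive messages. As in the interview, nothing is traded: the output is only a ranking that could later be checked against price movements.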
What strategy did you adopt after that?
Today our core business lies in active strategies: every day we can put shares up for sale or decide to purchase them depending on our analysis of Big Data. But we don't restrict ourselves to financial forecasting. We also create macroeconomic predictive indicators. From the very beginning we decided to use only alternative data, to take an interest in a wide range of data from social networks, e-commerce sites, satellite images, air and sea traffic, etc. This is what differentiates us from most asset managers, who essentially rely on market data and business and financial reports. Our aim is both to build up a diversified data warehouse from heterogeneous and statistically independent sources, and to develop a second, more analytical, layer which is our company's core business. We have around 20 data scientists specializing in Artificial Intelligence, in areas ranging from textual analysis (Natural-Language Processing [NLP]) to image analysis (Deep Learning) and graph analysis, working closely with experts in macroeconomics, finance and insurance. We have to master this data analysis layer in order to produce the third layer – the real applications layer. We can then produce the indicators we call 'smart data', which come out of the analysis we carry out on a daily basis. This smart data is our product, consisting of predictive indicators which will be used to take investment decisions in real time.
How do you measure the performance of your solution?
The solution we've put in place isn’t just a real-time, 'live' solution. In the space of three years it has also generated a Sharpe ratio (a way of examining the performance of an investment by adjusting for its risk) of 1.8. This means that we've done four times better in real time than all the other solutions studied in backtesting (testing a predictive model on historical data). This is how our solution is compared in terms of performance on the financial markets, especially by asset managers. How does our performance measure up in terms of macroeconomic prediction? How do we correlate with the official figures? We correlate at between 85% and 95%, and in addition we're one to six months in advance of the official figures. This performance has been measured independently by the French Central Bank. Lastly, as regards the performance of our algorithms, we're also way ahead. If we take the example of NLP, between the first versions of the algorithms we developed five or six years ago, and those we have now… At the very beginning we counted words; the following year we assessed sentiment per word count; then later on it was all about how we take into account the meaning of the grammar, the meaning of the sentence, and now we've even managed to take into account punctuation, emoticons and emotions. If you achieve 90%, that's exceptional; if you achieve 80%, that's already very good. We're currently between 75 and 87% depending on the language – Arabic, Chinese or Russian.
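The two measures mentioned above, the Sharpe ratio and the correlation with official figures at a lead of several months, can be sketched as follows. This is a minimal illustration under stated assumptions: the sample series are made up, the function names are hypothetical, and the annualization convention (252 trading periods, a per-period risk-free rate) is one common choice rather than anything the interview specifies.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its standard deviation.

    risk_free is the per-period risk-free rate (assumption).
    """
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lead_correlation(indicator, official, lead):
    """Correlate the indicator at month t with the official figure at month t + lead."""
    x = indicator[: len(indicator) - lead]
    y = official[lead:]
    return pearson(x, y)


# Made-up monthly series: the official figure repeats the indicator two months later.
indicator = [1.0, 1.4, 0.9, 1.8, 2.1, 1.5, 2.4, 2.0]
official = [0.5, 0.7] + indicator[:-2]

# Search for the lead (in months) at which the two series line up best.
best_lead = max(range(4), key=lambda k: lead_correlation(indicator, official, k))
print(best_lead, lead_correlation(indicator, official, best_lead))
```

Scanning several candidate leads and keeping the one with the highest correlation is a simple way to express "we correlate at X%, N months in advance" as a single pair of numbers.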
How do you build up your data warehouse? What sort of partnerships have you forged?
You can obtain data via APIs either free of charge or by paying for it. We have a budget for data but the most difficult thing is to find out how to access non-public data. For example, satellite data is quite difficult to obtain and the kind of thing that happens is the French National Centre for Space Studies (CNES) offering to make available all the data they have as part of our strategic collaboration, which means thirty years of historical observation data, mainly on planet Earth. We've done the same thing for other types of data and we never cease adding to our data warehouse.
Does 'alternative data' constitute a major differentiating factor vis-à-vis other Big Data companies?
Yes, that's right. The differentiating factor as far as we're concerned is everything that comes under the heading 'alternative data': on the one hand raw data drawn from the social networks, data generated by individuals – consumer reviews, tweets, etc.; and on the other hand, data generated by public entities, everything referred to as ‘open data’. Then you have the data generated by machines, satellite data for instance. We call all of this 'raw data'. And there are a lot of organizations that supply it: Twitter for the social networks, governments for all sorts of open data, Planet Labs and even Airbus for satellite data. When we used raw data, we found that the performance factor generated was systematically around 0%. In other words, we weren't generating any financial performance whatsoever from it. Then you have another type of data, known as 'semi-processed data'. Every week somewhere in the world a new startup is founded which specializes in natural language processing. When it comes to processed data, there are companies which do satellite imaging analysis, like Orbital Insight for example. You also have companies such as CargoMetrics that track all the ships in the world and analyze them. But each of these companies is working in a very narrow field: either natural language processing, which gives you the sentiment index, or satellite data analysis, or, when it comes to ships, graph analysis.
You're working on a real-time economic growth indicator. What sort of data are you aggregating?
The first example I gave you, the social networks, relates to short-term strategies. It's what happens day-to-day. Users are just as likely to take positions regarding the shares they own as a result of social network sentiment as they are to follow the risk factors associated with their share portfolio. Then there are strategies we call ‘global macro’, i.e. macroeconomic strategies. One of the most complex indicators we produce is a real-time economic growth indicator. How can you estimate a country's economic growth in real time? Well, you obviously have to take into account a large number of factors: employment, tourism, goods transport, import-export, etc. For each of these factors you’re going to need a specific indicator. For instance, one of the sub-sets of economic growth is the hotel business. Why is that? Well, you can calculate the occupancy rate at business hotels based on the price of their rooms, which means that this is an advance indicator of local economic growth. In order to estimate economic growth in the United States, you track New York for the finance sector, San Francisco for technology, Boston for health and Houston for the energy sector.
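As a rough illustration of how sub-indicators like these might be combined, the sketch below turns business-hotel room prices in the four cities into standardized signals and averages them into one composite figure. Everything specific here is an assumption made for the example: the prices, the equal weights, the use of a z-score as the price signal, and the function names. The interview does not describe the actual model.

```python
from statistics import mean, pstdev


def price_signal(prices):
    """Z-score of the latest room price against its own history.

    Assumes at least two distinct historical prices, so the deviation is non-zero.
    """
    history, latest = prices[:-1], prices[-1]
    return (latest - mean(history)) / pstdev(history)


def composite_indicator(signals, weights):
    """Weighted average of the per-city signals."""
    total_weight = sum(weights[city] for city in signals)
    return sum(signals[city] * weights[city] for city in signals) / total_weight


# Made-up average business-hotel room prices (oldest first, latest last).
room_prices = {
    "New York": [100, 120, 130],       # finance
    "San Francisco": [140, 160, 160],  # technology
    "Boston": [95, 105, 100],          # health
    "Houston": [90, 110, 90],          # energy
}
# Hypothetical equal weights; a real model would weight sectors by economic importance.
weights = {city: 0.25 for city in room_prices}

signals = {city: price_signal(prices) for city, prices in room_prices.items()}
growth_signal = composite_indicator(signals, weights)
print(signals, growth_signal)
```

A positive composite value would suggest room prices, and by proxy occupancy, are running above their recent history across the tracked cities; the same pattern extends to other factors such as employment or goods transport, each with its own sub-indicator.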
How does satellite data help you work out a city's potential growth several months ahead of the official figures?
the search for precision data
We do our macroeconomic forecasting in real time. We have data between one and six months ahead of the official figures, so that gives us a substantial lead. This means that we're seen a little bit as a benchmark for macroeconomic indicators such as real-time inflation and economic growth. It's by aggregating and analyzing data and continually adding new types of data – first from social networks, then job vacancies (for forecasting business cycles), and now using satellite data to look closely at everything that serves as an indicator of economic stability – that we've achieved such a level of precision. And we’re planning to acquire our own drones in the third quarter of 2018. We’ll have two: one that we fly in visual line-of-sight over farming land, for which we'll specify ourselves the type of equipment we need – thermal imaging, heat detection, photos, etc. – and another with ten hours of autonomy that will criss-cross France during the day. And if you criss-cross France five times in a day, you've covered the whole of the country, taking account of the urban areas, the farming regions and so on. In the longer term, our objective is to have a fleet of drones pretty much everywhere in the world. This is like the satellite data component but done systematically. And then you have to put in all the links. It's a lot more difficult than textual analysis!
What degree of information granularity can machine learning enable you to achieve?
The servers that we set up in-house to do the 'deep learning' are even faster than Amazon's and Microsoft’s cloud machines. To give you an idea, the latest data we're analyzing corresponds to the equivalent of close to a million images processed simultaneously. This is the sort of data that we’re processing, today, right now, as we speak. A year ago, we already had very good scores when segmenting the data: are those images crop fields, streams, buildings, bridges, roads? That was a year ago. Now we've moved on; we have sixty types of classification. As regards buildings, we can already determine what they're used for: schools, hospitals, shopping centres, and so on. We can actually get down to that level of granularity.
How do you train yourselves to analyse these images to that level of precision?
state of the art of predictive analysis
We’re now busy developing the 'real-time satellite data analysis' component, so that we can analyse all the images in real time, something which the army, for instance, still can't do systematically. In fact, we find ourselves using military technology for civil applications. The problem is to put in place all the data pipes, because data from drones is high-precision data. Recently we took part in a competition which challenged the Kaggle community to develop algorithms to automatically detect and classify certain species of sea-life so as to assist with conservation. The goal was to make a systematic analysis of aerial images of marine life and to spot and count the seals in the images. This was done as part of the development of our predictive analysis model, with a view to working out where we stand in terms of state-of-the-art technology, and then to go beyond. If you can count the seals and distinguish them from sea-lions, you can count vehicles and even classify them into three categories: a motorbike, a car, a truck. It’s exactly the same problem for the different types of seals: baby seals, females and males.
Can you predict natural disasters and their macro consequences for a given country from oceanographic data?
predicting natural disasters
Just before joining you here, I was on the phone with the largest insurer in Europe, talking about new insurance applications. On the team we have a 21-year-old who suggested this product, based on the observation that we have all the oceanographic data necessary to model waves, i.e. to analyse wave behavior and in particular the force of the waves. Last summer we gathered oceanographic data on grids from two to ten kilometers, down to a depth of 20 meters. From this data, we can predict climatic phenomena such as the droughts in south-east Asia in 2016, and also last summer's hurricanes. Indonesia was one of the big producers of palm oil, and so the droughts not only led to a sharp rise in the price of palm oil, but also created social unrest.
What sort of practical applications do you see for retail banking?
What might be useful for retail banking is to work out where a bank ought to situate its branches. You could do that using telecoms data to look at people flows. Given that we don't yet have access to telecoms data, we use Vélib [a large-scale public bicycle sharing system in France] data, Vélib journeys and [French electric car sharing service] Autolib' journeys as a proxy, though this does give us a bit less. Anyway, it gives you an idea of the flows of people passing by and you know full well that turnover will depend on the number of people walking by the branches. That's a direct application of the data.
Are you working on health data?
Within two or three years we’ll be entering the era of connected objects, so there's a huge amount to do. For the moment we’re already putting in place everything to do with analysis, all the necessary links, but I wouldn't be surprised if in two or three years' time we moved into that field. In France, we're rather skeptical when it comes to that sort of data; that’s less the case in other countries.
Where do you stand on data ethics?
We're very much aware of this social issue, and our conviction is reflected in a joint work published in France on 13 October last year, Le Manifeste du Crapaud fou (the 'Crazy Toad Manifesto'). It’s a collective manifesto which in part deals with the ethics of data and the positive social impact we ought to be making. AI is becoming so powerful that we have to ask ourselves how we can have a positive social impact and take care that not just data but also artificial intelligence should be used for good. The manifesto arose from an initiative by French engineer and business consultant Thanh Nghiem and French mathematician Cédric Villani. They wanted to draw up a collaborative document to reflect on the way Big Data can be used for public institutions which do not perhaps have large budgets, such as the Red Cross. Thanh Nghiem, who at 30 years old was the youngest partner at McKinsey, feels that social impact is extremely important.
Where did the title 'Crazy Toad' come from?
It's because toads have a rather experimental nature, and there's always one toad which, when they’re building a motorway for instance, goes and sacrifices its life in order to find the best crossing point for the community. There are two aspects: how can we use data to serve the common good? And then, in order to do the right thing, how can we take the byways, experiment and all learn together?
So there are no technological limits to your predictive models, just ethical limits?
Well, there's more to it than that. We can predict natural disasters, that's why we gather oceanographic data. On the other hand, it's far more difficult to predict a tsunami, everything to do with seismic issues. That's far more difficult. Or, if you do manage to predict it, it will be just a few seconds before it happens, so it’s not worth it. Nor are we able to predict something that's inherently unpredictable, for example everything to do with cyber-risks – we can't predict how widespread they’ll be. We can only predict that something unforeseen will happen somewhere!
When you think about data, you think about transparency and anonymity, and consequently about the blockchain. What’s your thinking on this?
the blockchain revolution
There are two sides to your question. First, data transparency. Smart data transparency is very important to us; we need to be able to trace the sources of our data. So we're not just transparent when it comes to our algorithms. That’s why we publish policy discussion papers and we’re in regular contact with bodies that issue rules and regulations. We all know that the Blockchain can be used for many different purposes. What interests us is to understand the concept, because it's not as easy as all that. And since we specialize in financial forecasting, cryptocurrencies are of particular interest, so we keep a close eye on the latest trends. The big market, just a few months ago, was in China, then China closed its cryptocurrency market down so the entire market has moved to Japan. Japan is now busy regulating the market, and we're waiting for China to bring out its own cryptocurrency. This is what we're expecting to see in future trends.
Who exactly are your customers?
We currently have around a dozen customers. They're financial institutions – US investment banks, sovereign funds and even international institutions. But we're now also being approached by strategic management at large corporations. Let me give you an example: if you're able to assess the economic growth of all the countries in Africa or China, and you're a player in the real estate business or the renewable energy sector, this will be enormously useful to you. Though we hadn't specifically anticipated this kind of market, customers from these fields have got in touch with us.
So what's your business model?
Economic growth, real-time inflation, job vacancies: these are all what we call 'smart data components'. For example, a sovereign fund in the Middle East is interested in overall economic growth because it's investing everywhere in the world. This is what we call 'global macro smart data'. A pension fund in Canada, for example, already has a macroeconomics team, but they're lacking the 'real-time job vacancies' component which would enable them to track the job market. Clients will be able to choose the type of data they’re interested in. Our model is to provide 'smart data series', available in the form of 'Platform as a Service' (PaaS) via a licence. You connect via your phone and you see economic growth in real time, or you connect via the web through an interface and you see the global macroeconomic figures for all the regions in the world or country by country, and that’s a very big market. So our business model is basically issuing licences to our clients. And then you also have financial smart data. That's a limited market because the product is used only for the stock market and when it’s used it can have market impact. For example, we were talking a moment ago about the sentiment index: you work with the sentiment index on a specific share, and sentiment is something that is highly emotional, which means that the impact lasts just a few days. And so we'll be selling these indicators as an exclusive product to a very small number of clients.
What are your targets for 2018? Which markets are you aiming at?
In commercial terms, we're first going to open three subsidiaries abroad: one in New York, another in the Middle East, plus an Asian research hub, i.e. the equivalent of what we're doing in Paris, but in Tokyo. For our business the market tomorrow – in two or three years' time – is Greater China. In HR terms, we'll be scaling up from 27 to 50 people, recruiting two to three new people every month. We're planning 20 to 25 recruitments in the next 12 to 18 months. As regards technology, we'll be purchasing our own drones and as far as AI is concerned we continue to recruit with a view to systematizing everything. We're targeting first of all the markets in the United States and Japan: the US because it’s a real wealth management market, and Japan because it's a savings market. The average Japanese household has savings amounting to $500,000. Last summer we were invited to join the City of Tokyo's FinTech accelerator and at the end of November we presented our solution to the city governor. This has opened a number of doors for us in Japan.