Normalisation: a dirty word or a differentiator for success?
The risks and impacts of poor data normalisation and how to avoid them.
By Dr Elliot Banks, Chief Product Officer, BMLL
In today’s competitive markets, market participants rely on historical data for everything from building and back-testing trading strategies to advanced TCA and execution analysis. To meet these needs, trading practitioners and quants need access to high-quality data that captures all the available information and, just as importantly, is consistent and easy to use. Put simply, they need ‘normalised’ data. In this article I take a deeper look at what good data normalisation really means, and why it matters.
What is meant by data ‘normalisation’?
The term ‘normalisation’ in the context of data simply means organising and structuring the data so that it’s easy to query, analyse and understand.
Normalisation is often seen as problematic in the market data world, viewed as an issue that users either have to work around or avoid altogether. There are two main reasons for this.
The first is legacy data products, which often contain inconsistent fields that vary over time and cannot be repaired, because the original source data has long since been lost.
The second (and far worse) is that the process of normalisation is often a low priority, done at the convenience of engineering teams rather than for the benefit of end users. This leads to a myriad of classic problems: normalisation design choices that vary by region (or developer), ever-growing lists of fields bolted onto the normalisation scheme instead of a strict, clear model, and support teams unable to explain the normalisation approach. The result is that data science and quant teams spend their valuable time “re-normalising” vendor market data, rather than using the data to find value.
What defines true normalisation? And why does it matter?
A good normalisation model means that data users can understand each market, and the nuances of what is in the data, without having to trawl through the details of every individual venue. To do this, users need:
A well-defined, global data model. This means datasets are consistent across regions and over time, without local variations. Market features such as iceberg orders, cancellations and implied orders also need to be handled consistently.
Clear, concise mappings and documentation. There will always be situations where it's necessary to dive into the nuances of a market. The ability to go from normalised to raw data is critical here.
A focus on end-user requirements. That might mean joining multiple datasets together (such as the Xetra level 3 feed and off-book trades), or including sequence numbers so users can join across different datasets (see the sketch after this list).
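To make the first two requirements concrete, the sketch below shows one way a venue-agnostic event model, together with its normalised-to-raw mapping, might look. It is illustrative only: the class names, field names and raw message keys (NormalisedEvent, normalise_xetra_delete and so on) are hypothetical and are not BMLL’s actual schema.

```python
# Illustrative sketch only: names and fields are hypothetical, not BMLL's schema.
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    """Canonical event types shared by every venue in the global model."""
    ADD = "add"
    MODIFY = "modify"
    CANCEL = "cancel"
    TRADE = "trade"


@dataclass(frozen=True)
class NormalisedEvent:
    """One level 3 event expressed in the global data model."""
    venue: str            # e.g. "XETR"
    instrument_id: str    # identifier consistent across venues
    event_type: EventType
    price: float
    size: int
    timestamp_ns: int     # exchange timestamp, nanoseconds since epoch
    sequence_number: int  # preserved so users can join across datasets
    raw_ref: str          # pointer back to the original raw message


def normalise_xetra_delete(raw: dict) -> NormalisedEvent:
    """Map one venue-specific raw message into the canonical model.

    Each venue gets its own adapter, but every adapter emits the same
    NormalisedEvent, so downstream research code never branches by market.
    """
    return NormalisedEvent(
        venue="XETR",
        instrument_id=raw["isin"],
        event_type=EventType.CANCEL,   # the venue's "delete" becomes canonical CANCEL
        price=float(raw["px"]),
        size=int(raw["qty"]),
        timestamp_ns=int(raw["exch_ts"]),
        sequence_number=int(raw["seq_num"]),
        raw_ref=raw["msg_id"],         # keeps the normalised-to-raw mapping auditable
    )
```

The raw_ref field in this sketch is what the second requirement is really about: any normalised event can be traced back to the exact raw message it came from.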
Vendors that place a strong emphasis on true normalisation of trades and order book data, and on smooth integration between research and production, enable their clients to spend less time on data formatting and more time on their business.
Good normalisation and what market participants should expect:
To achieve good normalisation, we at BMLL focus on a number of steps:
We start from the raw data. We build our data products directly from raw captures or exchange data, and always store that data.
We build a global data model, normalised down to level 3. This means a small, stable set of values for each data field, consistent over time. For example, in the BMLL normalised datasets there are five auction types; by comparison, some vendors have 31 or more (including both Unknown and Undefined as possible values). A simplified illustration of this kind of consolidation appears after this list.
We provide the data in two formats, allowing you to choose the right solution for the right problem:
a) Normalised data. A small number of fields per venue, built from our global data model.
b) Harmonised data. This format makes it easy to rebuild and replay the order book consistently across multiple markets, but preserves all the raw fields that come from the exchange (a minimal replay sketch follows below).
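As a simplified illustration of the consolidation described in the second step, the sketch below collapses a long tail of raw auction codes onto a small, closed set. The canonical labels and raw codes here are invented for illustration; they are not BMLL’s actual taxonomy.

```python
# Illustrative sketch only: labels and raw codes are hypothetical, not BMLL's taxonomy.
from enum import Enum


class AuctionType(Enum):
    """A deliberately small, closed set of auction types."""
    OPENING = "opening"
    CLOSING = "closing"
    INTRADAY = "intraday"
    VOLATILITY = "volatility"
    OTHER = "other"


# Many venue-specific raw codes collapse onto the small canonical set.
RAW_TO_CANONICAL = {
    "OPENING_CALL": AuctionType.OPENING,
    "OPEN_AUCTION": AuctionType.OPENING,
    "CLOSING_CALL": AuctionType.CLOSING,
    "CLOSE_AUCTION": AuctionType.CLOSING,
    "MIDDAY_CALL": AuctionType.INTRADAY,
    "VOLATILITY_INTERRUPTION": AuctionType.VOLATILITY,
}


def canonical_auction(raw_code: str) -> AuctionType:
    """Map a raw auction code onto the closed canonical set.

    Anything unmapped lands in OTHER rather than spawning a new value,
    so the model stays small and stable over time.
    """
    return RAW_TO_CANONICAL.get(raw_code, AuctionType.OTHER)
```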
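And to show why the harmonised format matters, here is a minimal, hypothetical replay loop. Because every venue’s messages share a small common core, one piece of code can rebuild a price-level book for any market, while the venue’s original raw fields travel alongside untouched. The field names below are assumptions for illustration, not the actual harmonised schema.

```python
# Illustrative sketch only: field names are assumed, not the actual harmonised schema.
from collections import defaultdict


def replay_book(messages):
    """Rebuild a simple price-level book from harmonised level 3 messages.

    Each message is a dict carrying a small common core (action, side, price,
    size, order_id) plus whatever raw venue fields came with it; the raw
    fields are preserved but simply ignored by this loop.
    """
    resting = {}                                               # order_id -> (side, price, size)
    book = {"bid": defaultdict(int), "ask": defaultdict(int)}  # price -> total resting size

    for msg in messages:
        if msg["action"] == "add":
            resting[msg["order_id"]] = (msg["side"], msg["price"], msg["size"])
            book[msg["side"]][msg["price"]] += msg["size"]
        elif msg["action"] == "cancel":          # modifies and partial fills omitted for brevity
            side, price, size = resting.pop(msg["order_id"])
            book[side][price] -= size
            if book[side][price] <= 0:
                del book[side][price]

    return {side: dict(levels) for side, levels in book.items()}


# The same loop runs unchanged for any venue, because the common core is shared.
sample = [
    {"action": "add", "order_id": 1, "side": "bid", "price": 100.0, "size": 200, "raw_flag": "A"},
    {"action": "add", "order_id": 2, "side": "ask", "price": 100.5, "size": 150, "raw_flag": "A"},
    {"action": "cancel", "order_id": 1, "raw_flag": "D"},
]
print(replay_book(sample))  # {'bid': {}, 'ask': {100.5: 150}}
```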
Re-normalising data should not be the norm.
When evaluating market data vendors for normalised data, it is important to ask how well the normalisation has actually been done. Pick good normalisation, and your quants save time and effort and can focus on business problems rather than data issues. Choose a vendor whose normalisation is an afterthought, and you can expect to end up doing the normalisation yourself.