Choosing the Right Market Data for Scalable Insight
Maximising analytical value while avoiding unnecessary cost and complexity
Author: Gavin Carey, Head of Content, BMLL
Whether your firm is a hedge fund, proprietary trading firm, market maker or investment bank, market data drives strong convictions in quantitative trading. PCAP (Packet Capture) data is often considered the most complete and accurate record of market activity. That view isn’t wrong, but it’s not the whole story. The key question isn’t whether data is raw or curated, but whether its fidelity fits the problem being solved.
PCAP historical data is essential for certain use cases, such as analysing exchange connectivity, network behaviour, or protocol-level issues. However, in most systematic trading research, where historical data is aggregated over minutes, retaining every packet adds little analytical value while incurring high processing and storage costs.
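To make this concrete, here is a minimal sketch, assuming a hypothetical normalised trade tape with timestamp, price and size columns (real schemas vary by provider), of how research at minute granularity aggregates away packet-level detail:

import pandas as pd

# Illustrative only: a tiny, hypothetical normalised trade tape.
trades = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-02 09:30:01", "2024-01-02 09:30:45", "2024-01-02 09:31:10"]
        ),
        "price": [100.10, 100.12, 100.08],
        "size": [200, 150, 300],
    }
).set_index("timestamp")

# One-minute OHLCV bars: once research operates at this horizon,
# anything happening below the bar boundary is aggregated away.
bars = trades["price"].resample("1min").ohlc()
bars["volume"] = trades["size"].resample("1min").sum()
print(bars)

At this level of aggregation, whether the input began life as packets or as a normalised feed makes no difference to the result; what matters is that the inputs are consistent and correct.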
Defaulting to maximum fidelity (data accuracy, reliability and precision) can create unnecessary complexity and cost. For research and strategy testing, normalised historical market data often provides a more practical, scalable foundation, delivering consistency, reproducibility, and insight without compromising analytical integrity.
Choosing the right market data for the problem at hand
Market data is often framed simply: raw data is considered “pure,” while processed data is seen as “compromised.” In reality, there is no absolute right or wrong choice; selecting the right data depends on the specific problem you need to solve.
Different problems require different approaches. More detail isn’t always better. It’s helpful to think of market data as a spectrum of sources, each designed to address particular types of questions effectively.
Market data granularity spectrum
Broadly, this spectrum runs from raw PCAP capture at one end, through normalised order-by-order and price-level views, to top-of-book quotes, trades and aggregated or derived datasets at the other. None of these layers is inherently superior; each is more or less appropriate depending on the question at hand.
When maximum fidelity is essential
There are some use cases for which PCAP data is not just useful but essential. For example, investigating exchange network behaviour, diagnosing feed anomalies, and analysing packet timing or protocol edge cases all require full packet-level visibility.
Teams building real-time infrastructure from the ground up also need raw data as a source of ground truth for the abstractions they build on top of it. In these scenarios, alternatives to PCAP are inadequate, and teams should rely on their own network capture timestamps. Such use cases are limited, however, and often expensive to pursue: networking, A/B feed arbitration and hardware decisions all contribute significantly to the overall cost.
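As an illustration of the kind of packet-level question described above, the sketch below uses the dpkt library and a hypothetical capture file to measure inter-packet gaps directly from a PCAP, something no normalised dataset can answer:

import dpkt

# Minimal sketch of a task where PCAP really is the right tool:
# measuring inter-packet gaps on a captured feed to investigate
# bursts or outages at the network level. 'feed_capture.pcap' is
# a hypothetical file name.
with open("feed_capture.pcap", "rb") as f:
    timestamps = [ts for ts, _buf in dpkt.pcap.Reader(f)]

gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
if gaps:
    print(f"packets: {len(timestamps)}, max inter-packet gap: {max(gaps):.6f}s")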
Research does not fail because historical data isn’t raw enough
In research and backtesting, failures rarely stem from insufficient fidelity. They stem from inconsistent data processing, misalignment, silent data errors and irreproducible results, not from a lack of packet-level detail. Starting from PCAP increases these risks.
Rebuilding historical order books from raw data requires complex pipelines, deep knowledge of exchanges, and ongoing maintenance. Small interpretation errors can silently corrupt large datasets and undermine research long before discovery. For research, the binding constraint is not granularity. It is data consistency and accuracy at scale.
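To see why, consider even a heavily simplified book-builder over hypothetical order-level events. In a production version of this loop, every omitted or misread message type propagates silently into every derived metric:

from collections import defaultdict

# Deliberately simplified sketch of order-book reconstruction from
# normalised order-level events (hypothetical message format). Real
# feeds add dozens of message types, venue-specific semantics and
# edge cases, which is where silent interpretation errors creep in.
def build_book(events):
    orders = {}                      # order_id -> (side, price, size)
    depth = defaultdict(float)       # (side, price) -> resting size
    for ev in events:
        if ev["type"] == "add":
            orders[ev["id"]] = (ev["side"], ev["price"], ev["size"])
            depth[(ev["side"], ev["price"])] += ev["size"]
        elif ev["type"] == "cancel":
            side, price, size = orders.pop(ev["id"])
            depth[(side, price)] -= size
        # 'execute', 'modify', auction and status messages omitted

    return depth

book = build_book([
    {"type": "add", "id": 1, "side": "bid", "price": 100.10, "size": 200},
    {"type": "add", "id": 2, "side": "ask", "price": 100.12, "size": 150},
    {"type": "cancel", "id": 2},
])
print(dict(book))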
The cost of over-fidelity
Processing PCAPs carries hidden costs that many firms recognise but rarely articulate. Engineering teams often spend too much time maintaining pipelines rather than enabling research. Multiple internal datasets emerge, each with subtle differences. Backtests diverge. Insight velocity slows, not because the questions are hard, but because the data is needlessly complex. Using PCAP data for analytics is possible, but it is rarely the right tool for the task.
Redefining accuracy for analytics
Advocates of PCAP data often equate accuracy with completeness. However, analytical accuracy means capturing what matters in a consistent, transparent way. For analytics and research, accuracy means temporal alignment, cross-venue consistency, deterministic reconstruction, and reproducibility over time. These properties are built into higher-level data representations; they do not emerge from raw data on their own.
A more useful decision framework
Rather than defaulting to maximum fidelity, firms benefit from asking simple questions:
Do we need packet-level data or market-level outcomes?
Will multiple teams consume and compare results?
Do tests need to run consistently months or years later and still produce comparable results?
Is this a one-off investigation or a production research workflow?
Are we optimising for completeness or insight velocity?
In many cases, the optimal answer is not raw data but a purpose-built dataset.
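Purely as an illustration, the questions above can be encoded as a crude triage helper. The tier names and rules here are assumptions for the example, not a prescription:

# Illustrative sketch only: a rough mapping from the questions above
# to a recommended data tier. Real decisions weigh cost, latency and
# team structure alongside these flags.
def recommend_data_tier(needs_packet_level: bool,
                        shared_across_teams: bool,
                        must_reproduce_later: bool) -> str:
    if needs_packet_level:
        return "raw PCAP (protocol / network-level investigation)"
    if shared_across_teams or must_reproduce_later:
        return "normalised, versioned historical data"
    return "purpose-built derived dataset for the specific question"

print(recommend_data_tier(False, True, True))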
Selecting the right data fidelity to maximise research efficiency
PCAP data remains essential for certain use cases, but treating it as a universal solution reflects an outdated view of how modern quantitative research operates.
The firms that scale insight most effectively are not those that capture the most data, but those that choose the right level of fidelity for each problem, avoid paying for unnecessary precision, and work with providers that deliver consistent, well-documented, normalised data that can be directly mapped to the PCAP when required.
The key to using market data effectively is matching data fidelity to the specific problem at hand: reserve maximum fidelity for the cases that genuinely require it, focus on consistency and accuracy, and prioritise purpose-built solutions for research and analytics. The right data choice maximises insight while avoiding unnecessary cost and complexity.