Does bigger data lead to better predictions?

If you are in a hurry, the short answer to this question is no, it does not.

Current enthusiasm about big data is based on an extreme inductivist optimism. Hykel Hosni and Angelo Vulpiani explain why in a readable note(1).

There are two equally radical, yet opposite, methodologies: a reductionist one, based on deduction from first principles, and a naïve-inductivist one, based only on data. The availability of unprecedented amounts of data and increasingly sophisticated algorithmic techniques (machine learning) has led some luminaries and interested parties to claim that we can dispense with theory, modelling or even hypothesising. Chris Anderson famously announced “the end of theory” in Wired in 2008, arguing that the data deluge makes the scientific method obsolete.

All this enthusiasm is loosely rooted in two presuppositions:

  • First, that big data will lead to much better forecasts.
  • Second, that it will do so across the board, from scientific discovery to medical, financial, commercial and political applications.

Hosni and Vulpiani challenge both:

  • More data may lead to worse predictions.
  • A suitably specified context is crucial for forecasts to be scientifically meaningful.

They use the representative example of weather forecasting, the mother of all approaches to prediction, where the early attempts at arriving at a quantitative solution turned out to be unsuccessful precisely because they took into account too much data.

This is the takeaway for would-be big data practitioners: Big data constitutes a great opportunity for scientific and technological advance, with a potential for considerable socio-economic impact. The role of modelling cannot be discounted: not only larger data-sets, but also the lack of an appropriate level of description may make useful forecasting practically impossible.

In spite of a persistent emphasis on a fourth paradigm (beyond the traditional ones, i.e. experiment, theory and computation) based only on data, there is as yet no evidence data alone can bring about scientifically meaningful advance. To the contrary, (…) up to now it seems that the unique way to understand some non-trivial scientific or technological problem, is following the traditional approach based on a clever combination of data, theory (and/or computations), intuition and wise use of previous knowledge.

In other words, don’t throw out your science textbooks just yet.


(1) Hosni, Hykel, and Angelo Vulpiani. 2017. ‘Forecasting in the Light of Big Data’, May. doi:10.1007/s13347-017-0265-3.



  1. I have not read the article, but could this be because weather is chaotic (meaning that small changes in the initial conditions lead to big changes in the output)? Would the same apply to a well-behaved system with (too?) many parameters but a very large database?

    • You can read the concluding remarks in the paper. They elaborate on it, e.g. “…even in the most optimistic conditions, if the state vector of the system were known with arbitrary precision, the amount of data necessary to make the meaningful predictions would grow exponentially with the effective number of degrees of freedom, independently of the presence of chaos.”
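
The exponential growth mentioned in that quote can be made concrete with a toy calculation. The Python sketch below is only an illustration under assumptions of my own, not the paper's method: past states are drawn uniformly at random from a unit cube, and the names `cells_to_cover` and `nearest_analog_distance` are hypothetical. It counts how many boxes are needed to tile the state space at a fixed resolution, and measures how far a new state is from its closest match in an archive of fixed size.

```python
import math
import random


def cells_to_cover(cells_per_axis: int, dof: int) -> int:
    """Number of boxes needed to tile a unit cube with `dof` dimensions
    when each axis is split into `cells_per_axis` intervals."""
    return cells_per_axis ** dof


def nearest_analog_distance(n_records: int, dof: int, seed: int = 0) -> float:
    """Euclidean distance from a random query state to its closest match
    among `n_records` past states (assumed uniform in the unit cube)."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dof)]
    best = math.inf
    for _ in range(n_records):
        past = [rng.random() for _ in range(dof)]
        best = min(best, math.dist(query, past))
    return best


if __name__ == "__main__":
    # Same archive size throughout; only the number of degrees of freedom changes.
    for dof in (1, 2, 5, 10, 20):
        boxes = cells_to_cover(10, dof)
        gap = nearest_analog_distance(10_000, dof)
        print(f"dof={dof:2d}  boxes={boxes}  nearest past state at distance {gap:.3f}")
```

With the archive size held fixed, the box count is a power of the number of degrees of freedom, and the closest past state should drift further away as that number grows. No chaos is involved anywhere in this sketch, which is the point of the quoted remark.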

  2. The problem seems to be in considerations out of context or in failing to observe a good, old principle: “the absence of evidence is not the evidence of absence”.
    So … the fact that some used “big data” hoping to improve forecast accuracy, and failed, does not lead to the conclusion that “big data does not improve forecast accuracy”.
    The posted paper concludes that the best approach is … “the traditional approach based on a clever combination of data, theory (and/or computations), intuition and wise use of previous knowledge”.
    Let’s see what it really means.
    1. “clever combination” cannot be attributed exclusively to any approach in particular, and a computer system’s “cleverness” is limited only by its creator’s capabilities. Thus, computer systems can be just as “clever” as humans.
    2. “data” can be handled much better by computer systems than humans … I’m afraid.
    3. “theory” comes from … discovery of a model, which can be expressed using some form of formality (often mathematics) from … observed and recorded sets of data … lots of data (big data?).
    4. Without opening an argument about semantics, let’s agree that “knowledge” comes in two forms: formal (“theory”) and informal (all other noted experiences). The Informal Knowledge is all about relationships (sometimes complex), between the Data elements, representing real Life.
    5. “wise use” … since I know some humans whom I could not accuse of “wisdom”, I will not hold the lack of it against computer systems.
    The Big-data Systems can infer rules and possibly theories from lots of Data … possibly better than humans. The same applies to the Informal Knowledge.
    So there are no inherent, objective factors limiting, for instance, the accuracy of forecast produced by a Big-data System. The only limiting factor here could be an individual building and/or using such a System.
