Published on INFORMS OR/MS Today (Joseph Byrum)
The future of data analytics and operations research lies in establishing causation, not correlation.
The statistics we rely upon to power operations research would be familiar to the likes of Ronald Fisher. His pioneering work in the field was firmly rooted in correlation, which has served us well for decades. Public policy today is almost entirely guided by correlation, as seen in studies showing that people who smoke also tend to die of lung cancer more frequently than those who don’t [1]. Drivers who don’t wear a seat belt are more likely to die in traffic collisions [2]. And so on.
Establishing a loose association is good enough for many purposes, but the artificial intelligence applications of the future will require a much more rigorous analysis. That analysis aims at establishing causation.
Case Study: Information Privacy
Consider the growing public concern about the privacy of our personal data. In today’s world, such information is plentiful. Its availability is not always known to those who should care the most – the subjects themselves. Hardly a day goes by without some report of a successful hack, a leak or the misuse of personal information by companies that were supposed to safeguard such data. That’s often when the topic enters the public consciousness for the first time.
Companies entrusted with sensitive information largely recognize the potential for backlash. Most have established privacy policies that restrict access to data provided by customers and third parties, in part to limit their exposure to lawsuits.
Of course, cutting off all access to data isn’t a solution to the information privacy problem. Customers and other stakeholders are willing to hand over private information with the expectation of gaining some kind of benefit in return. One has to hand over financial records to the bank to obtain a loan, for instance. While unauthorized use of private data as a result of hacking is a criminal offense, the line between the legitimate expected use of data and the undesirable, but legal use of the data is increasingly blurred.
Potential Pitfalls of Profiling
An example of the blurring between intended and unintended use of customer data is the growing practice of profiling. Profiling consists of analyzing data about individuals in a population to create similarity classes, which are in turn used to classify individuals. Typical examples are found in marketing and advertising, where an individual's classification may determine which direct-mail offers that person receives, or feed estimates of demand for particular products by that individual or family unit.
Profiling is a statistical approach to structuring populations of individuals into useful similarity groups or subpopulations. Most profiling approaches are correlational, that is, the similarity classes are discovered based on the co-occurrence of observed values of attributes from the data describing the individuals in the population. The more data available to the profiling process, the better it can generally define the characteristics of the similarity groups. As with any statistical process, however, there is residual error. This error results in the misclassification of some number of individuals in the population.
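The correlational mechanism described above can be sketched in a few lines of code. This is a minimal illustration, not any vendor's actual system: individuals are assigned to similarity classes purely by proximity of their observed attribute values, and the invented data at the end shows how residual error misclassifies an atypical individual.

```python
# Minimal sketch of correlational profiling: individuals are grouped into
# similarity classes by co-occurrence of attribute values alone, with no
# model of *why* the attributes cluster. All data here are invented.
from math import dist

# Each individual: (monthly_spend, visits_per_month) -- hypothetical attributes
individuals = [(120, 2), (130, 3), (125, 2),     # resemble "occasional" shoppers
               (900, 20), (950, 22), (870, 19)]  # resemble "frequent" shoppers

# Two centroids act as the prototypes of the discovered similarity classes
centroids = [(125, 2), (900, 20)]

def classify(person):
    """Assign a person to the nearest similarity class (0 or 1)."""
    return min(range(len(centroids)), key=lambda i: dist(person, centroids[i]))

labels = [classify(p) for p in individuals]
print(labels)  # [0, 0, 0, 1, 1, 1] -- the grouping reflects correlation only

# Residual error in action: a frugal bulk-buyer (one visit, large spend) is
# misclassified into the "frequent shopper" class.
print(classify((850, 1)))  # 1
```

The misclassified final case is the point: nothing in a purely correlational grouping explains the behavior, so atypical individuals land in the wrong class.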
Usually, that just means facing the minor annoyance of skipping past a handful of unwanted junk emails or tossing out irrelevant advertising fliers from a physical mailbox. But profiling can do much more. It can serve as a risk assessment tool for law enforcement, or it can be used as an excuse to deny services to individuals an algorithm sees as potentially troublesome. As such, misclassification can cause real harm, triggering lawsuits and negative publicity in some cases, while completely missing real threats in others.
Better Results Through Causal Analysis
Profiling’s statistical basis is the root of its limitation. While statistical models are of great value in practice, their value lies in summarizing a population rather than characterizing any individual within it. The alternative to a statistical model is a causal model, in which the mechanisms that govern the observed outcomes or behaviors are explicitly represented. The great advantage of causal models is their accuracy, both for broad sets of cases and specific individual observations. The obvious example is Newton’s laws of motion, which unify the behaviors of rigid bodies across a very large range of practical situations.
Causal models are not limited to physical behaviors of inanimate objects. Skilled human investigators often apply mental causal models in pursuing analysis of crimes and accidents. Investigators look for the mechanisms underlying the behaviors, such as motive, means and opportunity for a crime, or causal chains of failure that cause traffic accidents.
One of the greatest strengths of causal models is their ability to explain the conclusions and future predictions of the model in terms of observations that are testable and analytic. Take one of those automobile collisions, where one can measure the length of the skid marks to calculate how fast the vehicle was going before applying the brakes. This information can be used to determine whether excessive speed was a causal factor in the accident. This is markedly different from the statistical observation that speeding was associated with 27 percent of fatal traffic accidents [3].
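The skid-mark calculation above can be made concrete with the standard friction model, v = sqrt(2·μ·g·d), which follows from equating braking kinetic energy to friction work. The friction coefficient and skid length below are illustrative assumptions, not data from any real investigation.

```python
# Worked example of the causal skid-mark calculation: estimate pre-braking
# speed from skid length via v = sqrt(2 * mu * g * d). Values are illustrative.
from math import sqrt

MU = 0.7   # assumed tire-road friction coefficient (dry asphalt)
G = 9.81   # gravitational acceleration, m/s^2

def speed_from_skid(skid_length_m: float) -> float:
    """Estimate pre-braking speed in m/s from skid mark length in meters."""
    return sqrt(2 * MU * G * skid_length_m)

v = speed_from_skid(30.0)               # 30 m of skid marks
print(f"{v:.1f} m/s = {v * 3.6:.0f} km/h")  # 20.3 m/s = 73 km/h
```

An investigator can then compare the estimated speed against the posted limit to test, causally rather than statistically, whether excessive speed contributed to the crash.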
Similarly, consider a causal model of accounts payable fraud that explicitly represents the motives, means and opportunities for this form of crime, and that can connect specific observations with specific goals and activities. Such a model is far more accurate than the general statistical fact that accounts payable fraud occurs in 19 percent of companies with similar business structures.
Probable Cause and Information Access
The causal model points to establishing probable cause as a way out of the information privacy quandary. Government regulators can and should focus on probable cause – that is, showing the causal relationships between the needs of the public and the private information that is sought when weighing information privacy cases. Doing so will provide the information needed to protect the privacy of the individual against unintended and unjustified infringement.
When an individual’s private data are used to compute aggregate statistics, the process normally masks that person’s unique identity at the time the summaries are created. Profiling is different: it requires access to all of a person’s private data, even in the absence of probable cause. Not only does profiling produce misclassifications that can be adverse to the individual, but the very act of creating the profile undermines that individual’s privacy.
The use of causal models for analysis of individual behaviors by computer allows the analysis of information to focus on those aspects of the case that are causally connected, not merely correlational. Database queries can be highly directed and precisely authorized so that the principle of probable cause is enforced. At any time in the collection of information about a case, the reasons for data requests and uses of the data can be expressed. Information requests can be individually reviewed for appropriateness based on the findings in the case to date so that probable cause claims can be evaluated by regulators, as needed.
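One way to picture the review process described above is a request log in which every data query must state the causal hypothesis linking the requested fields to the case, and an empty justification is rejected. This is a hypothetical sketch of the idea; all class and field names are illustrative, not a real API or regulatory system.

```python
# Hypothetical sketch of probable-cause-gated data access: each request must
# articulate a causal link to the case, and every request is logged so that a
# regulator can later audit it. Names and policy are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DataRequest:
    case_id: str
    fields: list            # the specific attributes sought
    causal_hypothesis: str  # why these fields bear causally on the case
    approved: bool = False

@dataclass
class AccessLog:
    requests: list = field(default_factory=list)

    def submit(self, req: DataRequest) -> bool:
        # A reviewer (human or policy engine) approves only requests that
        # state a causal justification; blank justifications are refused.
        req.approved = bool(req.causal_hypothesis.strip())
        self.requests.append(req)  # logged either way, for later audit
        return req.approved

log = AccessLog()
ok = log.submit(DataRequest("case-42", ["payment_history"],
                            "vendor payments route to an account controlled "
                            "by the approving employee"))
bad = log.submit(DataRequest("case-42", ["full_browsing_history"], ""))
print(ok, bad, len(log.requests))  # True False 2
```

The design choice is that the justification, not the data, is what gets reviewed first: a narrow, causally motivated query can be approved while a broad correlational sweep is refused, yet both remain on the audit record.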
The causal model for the analysis of human behaviors promises to provide a powerful tool for security and the public interest without crippling the right to privacy inherent in the principle of probable cause. The same causal approach useful in the information privacy debate also happens to be the foundation of machine learning, artificial intelligence and the optimization approaches of the future.
References
1. https://academic.oup.com/ije/article/29/6/963/659223
2. https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812369
3. https://crashstats.nhtsa.dot.gov/Api/Public/Publication/812409

Joseph Byrum is an accomplished executive leader, innovator, and cross-domain strategist with a proven track record of success across multiple industries.
