Second Edinburgh Conference on Risk: Analysis, Assessment and Management, September 1997, Edinburgh, UK
Conventional Probabilistic Risk Analysis (PRA) operates, perhaps tacitly, on the assumption that the past is a reliable predictor of the future. This prevailing philosophy is largely a result of the necessity of making inferences about the likely future behaviour of a system from historical experience. However, too little attention is generally given to establishing the stability and predictability of the system of interest. In order to improve this aspect of PRA, ideas from Statistical Process Control (SPC) are introduced into the risk assessment process. In particular, the distinction between common (predictable) and special (unpredictable) causes of failure is discussed in a PRA context. Introducing SPC into PRA demands a commitment to continual system measurement once in operation, but this itself brings further benefits in terms of improving efficiency and effectiveness.
The common paradigm of Probabilistic Risk Analysis (PRA) is to analyse a system so as to express a complex hazard as a logical function (structure function) of a set of elementary events whose probabilities can be inferred, so as to derive a probability for the occurrence of the hazard. Probabilities for elementary events typically come from historical data, test data or expert opinion. This procedure has been criticised from a number of angles in recent years: from the effectiveness and efficiency of the statistical procedures (Singpurwalla, 1988; Clarotti, 1993) to the difficulties of predicting future behaviour of systems influenced by human factors (Reason, 1990; King, 1993; Bernstein, 1996).
The twin issues of statistical rigour and future predictability formed a key part of the business model proposed by Deming (1982) for improving products and services. In particular, Shewhart’s concept of a stable and predictable process proves to be a fundamental premise of PRA activities. Of course, the statistical concept of independent and identically distributed events is a familiar one within conventional PRA, though perhaps one that is seldom very heavily scrutinised. However, in risk assessment, we are making inferences about an uncertain future, not about a fixed population of events. Though we may be able to investigate the properties of our historical data, the real uncertainty lies in our need to postulate that future events will have a similar probabilistic behaviour. A stable and predictable process is one in which we feel able to make such a judgment and can justify it with scientific evidence. However, as we shall see, this judgment is not made once and for all time but must be continually monitored and reviewed on a routine basis. In this paper, such monitoring is proposed as a strategy for improving risk management along with conventional PRA methods.
Deming’s work also emphasised the importance of thinking about all management tasks in terms of improving the performance of a whole system. This will be a familiar idea to PRA practitioners. Statistics of hardware failure have no meaning outside their context within some wider system where they are influenced by human factors and duty cycle (see for example Singpurwalla, 1988). It is the system performance that we need to assess, as it is the whole process that determines the risk of system failure. However, Perrow (1984) describes a number of accidents that were not predicted owing to failures of management to understand the nature of the systems within which their processes operated. King (1993) highlights a number of ways in which hierarchical organisations fail to appreciate the systems implications of the way they operate or the limits in their ability to foretell the future. Deming’s systems thinking offers key insights and remedies for these deficiencies, in particular:
The barriers to the enlightened drawing of system boundaries are evident in many capital-goods businesses. Project managers seek to externalise risks by passing them on to their customer, through onerous maintenance and operating requirements, and to their sub-system suppliers, through demands for indemnities and accreditation to standards. Such behaviour makes systems less reliable than could be the case were there a climate of trust and cooperation between supplier, project manager and customer, aimed at providing a safe and reliable system to the public.
Deming (1982) drew a distinction between two types of statistical study: enumerative and analytic. In an enumerative study, the investigator is interested in some fixed, extant population. Typical examples of enumerative studies include: morbidity among a population of antelopes, the percentage of the national adult population who prefer margarine to butter or the fraction of last January’s production that was defective. In these situations there are only two sources of uncertainty: the uncertainty in the measurement method itself and the uncertainty arising from sampling less than 100% of the population, the sampling error. Given an accurate and precise measurement procedure and a 100% sample, there is no error or uncertainty in the study. The tools that are used to elucidate and quantify uncertainty in enumerative studies are the familiar devices of the elementary and advanced statistics courses.
However, in most industrial statistics, and particularly in risk analysis, we are not dealing with a fixed population of events that is to hand and that can be sampled in a statistically sound and unbiased manner. We are interested in the output of a process: a population of events, much of which lies in the future. For example, when assessing a cooling system, we have historical data on the failure of circulating pumps in the past and wish to infer the probability of such failures in the future. In this case, in addition to the uncertainties arising from measurement and finite sample size, there is the uncertainty arising from extrapolating our results into the future, to pumps as yet not realised. Changes in duty cycle, human factors, environment, installation procedure, component manufacture or materials can all occur after the data were collected. In general, the uncertainty arising from our inability to truly predict the future is far more important than that arising from the sampling error. Hence, much of our conventional statistics is of limited relevance. Effort spent in agonising over classical statistical issues for analysing data is misplaced. For example, a claim such as that of Clarotti (1993), as to the superiority of bayesian over frequentist PRA methods, is irrelevant. The uncertainty arising from the need to extrapolate historical data (or even expert opinion) into the future is far greater than the divergence between competing approaches to statistical inference. Of course, this is not an invitation to wholesale neglect of important base rates. Bernstein (1996) has observed that PRA techniques used in market analysis fail as unexpected social and supply-side changes frustrate the stability of econometric models predicting the future. This precisely illustrates the importance of recognising when we are involved in an analytic study. The difficulty was well known to Keynes (1935, p 152) who observed that, in many cases, "...
our existing knowledge does not provide a sufficient basis for a calculated mathematical expectation."
A key element in understanding how historical data can be extrapolated to the future is the distinction between what I shall call common and special causes of failure. Shewhart (1931) observed that there are two classes of variation in measured data. Common causes of variation are those that form part of a stable system. Special causes lie outside the system and are intrinsically unpredictable, in frequency and in severity, as future behaviour is in no way encoded in past behaviour. In a similar way, we can talk about common and special causes of failures and faults. For example, failures of a circulating pump owing to wear-out might form a stable system where the pumps were the product of a manufacturing process that was itself stable. On such a system of failures, we can use statistical tools to predict future events, at least in some approximately probabilistic manner. However, failures arising from extreme environmental events, misuse by maintenance technicians or unusual damage during installation are special causes of failure. Shewhart (1931) described processes that reveal no special causes as being in statistical control or, in modern nomenclature, stable and predictable. He observed that stable and predictable processes are not states of nature and are only achieved after extensive work on the process to eliminate the special causes. As Deming (1975) observed, special causes of failure fundamentally compromise our ability to make predictions about future behaviour, even in a probabilistic sense.
Perrow (1984, p5) coined the term normal accidents for those incidents that will inexorably arise from the manner in which a system is designed and operated. They are not the result of any failure in hardware or operation but inevitable consequences of the system’s nature. Neither are they necessarily foreseen in the system design and analysis. Highly complicated processes that are prone to common-mode failures among their safety systems, or which are sensitive to human error, are well known to feature emergent behaviours that defeat the current technologies for hazard analysis. Perrow (1984) describes many accidents where such has been the case. Furthermore, Petroski (1994) details the difficulties in mapping historical engineering-design experience onto new technologies and more demanding environments. Normal accidents coincide with our notion of common causes of failure. They are part of the system. Like an incapable manufacturing process, no amount of tampering (correcting on a case-by-case basis) with the process will improve the situation. In fact, tampering can only make things worse, as vividly illustrated by the funnel experiment described by Deming (1982, p327). To rid a system of common-cause failure (normal accidents), we need to reengineer the system.
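The deterioration caused by tampering is easily demonstrated numerically. The following is a sketch only, not taken from Deming's own demonstration: the drop model, the case-by-case adjustment rule and all parameters are illustrative assumptions.

```python
import random

def funnel(tamper, n=10000, sigma=1.0, seed=1):
    """Simulate n drops from a funnel aimed at a target of 0.
    Without tampering, the funnel is left alone.  With tampering,
    after each drop the aim is moved opposite to the observed error,
    in a case-by-case attempt to 'correct' the process.
    Returns the sample variance of the hit positions."""
    random.seed(seed)
    aim = 0.0
    hits = []
    for _ in range(n):
        hit = aim + random.gauss(0.0, sigma)  # common-cause variation
        hits.append(hit)
        if tamper:
            aim -= hit  # adjust aim opposite to the last error
    mean = sum(hits) / len(hits)
    return sum((x - mean) ** 2 for x in hits) / (len(hits) - 1)
```

Under these assumptions the tampered process shows roughly twice the variance of the untouched one, since each hit becomes the difference of two independent errors.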
Road accidents offer an instructive insight into normal accidents. Road accidents are an inevitable consequence of mass, high-speed, individualised road-transport. No degree of exhortation of the public to drive more safely, through for example leaving adequate braking distance, will succeed in reducing road deaths. As in a manufacturing plant, no degree of exhortation of the work force to greater efforts will improve quality. To reduce deaths, the whole road transport system demands reengineering. Some impact can be made through police supervision of poor driving but this is ultimately as ineffective as improving the quality of a manufactured product through inspection. In fact, most measurable reduction in road accidents is through reengineering the most hazardous road layouts and control systems and improvement of vehicle safety.
The idea of using standards to manage quality was pioneered by Charles Dudley in the late nineteenth century. Dudley introduced metallurgical standards to reduce variation in steel rails purchased by the Pennsylvania Railroad Company and his innovation was effective in eliminating the gross variations in product from which the railroad had been suffering. Metallurgical variation resulted in widely disparate and unpredictable performance between various suppliers and manufacturing batches. As such, Dudley’s work played an important part in engineering industry’s transition from craft production to mass production as described in Womack et al. (1990).
The existence of standards is often invoked as the rationale for allowing historical or test data to be extrapolated into the future. However, standards do not guarantee uniformity of product, only conformance within stipulated limits. It is quite possible that our test data were obtained from a batch of parts that were particularly durable; perhaps they were manufactured with material from a particularly defect-free cast. Future parts could be manufactured from inferior material but still fall within the limits specified by the appropriate standard. Drifts in quality can arise from any one of many process changes during manufacture, installation and operation. In general, few processes are managed with sufficient diligence to eliminate such a danger. The standard controls the poorest performance of the product but there is no practical way we can be sure that the test data truly embrace the worst-case behaviour. Even prototype parts manufactured to maximum and minimum material condition cannot account for changes in material, manufacturing process or assembly operations.
We have seen that we cannot make useful predictions about the future unless we have a stable and predictable process of system operation. In order to achieve this, we need to discriminate between special and common causes of variation and failure. Once special causes are identified, work on the process is necessary to eliminate them. Once all special causes are eliminated and we have a stable and predictable process, we need to continue monitoring to give ourselves confidence that special causes have not reappeared through changing circumstances.
Shewhart (1931) proposed the control chart as a tool for distinguishing between common and special causes of variation. A control chart (Figure 1) is a graph of the quantity of interest against time, with three lines added: the centre line and the upper and lower control limits. The centre line is drawn at the sample mean of the observed data and the upper and lower control limits at three standard deviations above and below the centre line, respectively. Detailed explanation of the construction and use of control charts is given by Wheeler and Chambers (1992).
There are a number of rules for identifying suspected special causes, one of which is a point falling outside the upper or lower control limit. Any such point needs to be investigated to identify a possible cause for its occurrence. It is not certain that it represents a special cause, but it makes economic sense to investigate. A process that is stable and predictable will vary, more or less randomly, between the control limits. As Shewhart (1931) observed, such a state is not one of nature: extensive work on the process is generally needed to rid the system of special causes.
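The construction described above can be sketched in a few lines. This is a minimal illustration only, applying just the single out-of-limits detection rule mentioned in the text; charting in practice should follow Wheeler and Chambers (1992).

```python
import statistics

def control_limits(data):
    """Centre line at the sample mean; upper and lower control limits
    at three sample standard deviations above and below it."""
    centre = statistics.mean(data)
    sigma = statistics.stdev(data)
    return centre - 3 * sigma, centre, centre + 3 * sigma

def special_cause_signals(data):
    """Return the indices of points outside the control limits --
    suspected special causes, each of which warrants investigation."""
    lcl, _, ucl = control_limits(data)
    return [i for i, x in enumerate(data) if x < lcl or x > ucl]
```

For instance, twenty similar observations followed by one gross outlier would flag only the outlier for investigation; a stable and predictable process would flag nothing.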
Constructing a control chart of hazardous incidents seems to be of limited value. Such events are, we trust, extremely rare and, moreover, we only identify a deterioration in safety after hazardous incidents have actually occurred. Measurements such as the time between hazardous events are known as results measures (R-measures) because they monitor exactly what is experienced by the customer. By observing R-measures, we can only identify deteriorations once the customer has also identified them. We also need process measures (P-measures) to observe the factors that influence, and critically determine, safety. In a system that needed a dual circulating-pump failure to initiate a hazardous event, relevant P-measures might include:
Any of these can help us to see a deterioration in pump quality or in operating environment before there is any threat of a hazard. Perrow (1984, p79) observes that concentration on R-measures frequently neglects evidence of emergent and unanticipated behaviour in complicated systems and that P-measures are needed to give advance warning of the unexpected. All the measures we choose must indicate a stable and predictable process, free from special causes of variation. If not, we are exposed to unknown and unquantifiable risks.
Of course, even a stable and predictable process may suffer from more normal accidents than society finds acceptable. This is the equivalent of a manufacturing process that is stable and predictable but not capable of producing parts that conform to its customers’ requirements. We know well from the manufacturing environment that tampering with the process, by continually adjusting whenever we see a part out of specification, makes matters worse. Deming (1982, p327) illustrates this through the Nelson funnel experiment. Similarly, to react to every normal accident by investigating its perceived causes and tampering with the system will make things worse. Normal accidents can only be eradicated by reviewing all the data, from satisfactory operation in addition to accidents, and reengineering the system.
Such diligent process monitoring at first seems onerous. However, the whole apparatus of measurement and control charting was developed to drive organisations in improving performance and reducing operating costs. Collecting data on the manufacture, installation, operation and replacement of parts captures new and sound knowledge about system costs and performance. Such learning can be used in the design and risk assessment of future systems and the increasingly safe, effective and efficient operation of existing systems. Shewhart characterised such a process in his plan-do-study-act (PDSA) cycle. The measurement discipline itself creates value that far outweighs its cost.
The present common approach to PRA is (Figure 2):
Everything in steps 1-3 is unexceptional. However, as we have seen, step 4 may not be possible. If there are special causes of failure, and in general there will be, then we can make no reliable prediction of future frequency. From the bayesian standpoint, the future occurrence of special causes is not encoded in our historical experience and we must not expect our inferences to be well-calibrated.
An alternative approach (Figure 3) would be, following step 3, to work back from step 6, starting from an assessment of what degree of risk would be acceptable, to a specification of an acceptable probability for the hazardous event. At this stage, we now do something very similar to the old step 3. We use our historical data and expert knowledge to judge whether the specified probability is achievable. However, we do not stop there, as in a conventional analysis, because we know that such estimates are, in general, unreliable owing to likely special causes of failure and to the unpredicted normal accidents of complicated systems. Here, I propose that we can then use our conventional analysis to set up a regime of control-charting that would, during the life of the system under assessment:
A key element of this approach is that, when we identify a special cause, we work to identify its root cause and to eliminate it from further occurrence in the system. We do not try to infer statistically how often it will happen in the future. The presence of special causes fundamentally compromises our ability to predict the future behaviour of the system and hence to assess the risks in its operation. Perrow (1984) notes many cases where accidents have arisen because surprises are not tracked to their true root cause, in a climate of operator passivity.
The need to detect unacceptably frequent failures as quickly as possible may, in part, be answered by cusum charts but more work is needed in this area.
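As one starting point for such work, a one-sided tabular cusum for detecting an upward shift in a monitored measure might be sketched as follows. The reference value k, decision interval h and the data are illustrative assumptions only, not a recommendation.

```python
def cusum_upper(data, target, k, h):
    """One-sided tabular cusum for detecting an upward shift.
    target: the in-control mean of the monitored measure.
    k: allowance (often half the shift one wishes to detect).
    h: decision interval triggering an alarm.
    Returns the index of the first alarm, or None if no alarm."""
    s = 0.0
    for i, x in enumerate(data):
        # Accumulate deviations above target, less the allowance k;
        # the sum is reset to zero whenever it would go negative.
        s = max(0.0, s + (x - target) - k)
        if s > h:
            return i
    return None
```

Because the cusum accumulates small sustained deviations, it can signal an unacceptably raised failure frequency sooner than a Shewhart chart waiting for a single extreme point.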
PRA’s ability to assess the safety of complicated and novel systems is limited by our knowledge of the future. Some future system behaviours may be unknown, even in a probabilistic sense. To address this lacuna, control charts can be used to learn about processes and to establish, and monitor, a stable and predictable regime.
Effective control-charting requires a change of focus from a results-driven organisation to one committed to continual improvement of all the processes within the engineering system and its environment.
Bernstein, P L (1996) The new religion of risk management, Harvard Business Review, March-April, 47-51
Clarotti, C A (1993) Making decisions via PRA: the frequentist vs the bayesian approach, in Barlow, R E et al. (eds) Reliability and Decision Making, Chapman and Hall, London, 323-346
Deming, W E (1975) On probability as a basis for action. The American Statistician 29, 146-152
Deming, W E (1982) Out of the Crisis: Quality, Productivity and Competitive Position, Cambridge University Press, New York
Keynes, J M (1935) The General Theory of Employment, Interest and Money. Macmillan, London
King, J B (1993) Learning to solve the right problems: the case of nuclear power in America. Journal of Business Ethics 12, 105-116
Perrow, C (1984) Normal Accidents: Living with High-Risk Technologies, Basic Books
Petroski, H (1994) Design Paradigms: Case Histories of Error of Judgement in Engineering, Cambridge University Press
Reason, J (1990) The contribution of latent human failures to the breakdown of complex systems, Philosophical Transactions of the Royal Society of London (Series B) 327, 475-484
Shewhart, W A (1931) Economic Control of Quality of Manufactured Product, Van Nostrand
Singpurwalla, N D (1988) Foundational issues in reliability and risk analysis. SIAM Review 30, 264-282
Wheeler, D J & Chambers, D S (1992) Understanding Statistical Process Control, second edition, SPC Press
Womack, J P et al. (1990) The Machine that Changed the World, Rawson Associates, New York
This page last updated 19th November 2000
copyright ©2000 by A N Cutler