The environment and disease: association or causation? - Observational Studies

Observational Studies 6 (2020) 1-9               Submitted 1965; Published reprinted, 1/20

    The environment and disease: association or causation?

Sir Austin Bradford Hill

    Editor’s Note: Sir Austin Bradford Hill was Professor of Medical Statistics, University
of London, United Kingdom. This article was originally published in the Proceedings of the
Royal Society of Medicine, May 1965, 58, 295-300. This paper is reprinted with permission of
the copyright holder, Sage Publications. New comments by the following researchers follow:
Peter Armitage; Mike Baiocchi; Samantha Kleinberg; James O’Malley; Chris Phillips and
Joel Greenhouse; Kenneth Rothman; Herb Smith; Tyler VanderWeele; Noel Weiss; and
William Yeaton.

    Among the objects of this newly founded Section of Occupational Medicine are: first,
‘to provide a means, not readily afforded elsewhere, whereby physicians and surgeons with
a special knowledge of the relationship between sickness and injury and conditions of work
may discuss their problems, not only with each other, but also with colleagues in other
fields, by holding joint meetings with other Sections of the Society’; and second, ‘to make
available information about the physical, chemical and psychological hazards of occupation,
and in particular about those that are rare or not easily recognized’.
    At this first meeting of the Section and before, with however laudable intentions, we
set about instructing our colleagues in other fields, it will be proper to consider a problem
fundamental to our own. How in the first place do we detect these relationships between
sickness, injury and conditions of work? How do we determine what are physical, chemical
and psychological hazards of occupation, and in particular those that are rare and not easily
recognised?
    There are, of course, instances in which we can reasonably answer these questions from
the general body of medical knowledge. A particular, and perhaps extreme, physical
environment cannot fail to be harmful; a particular chemical is known to be toxic to man
and therefore suspect on the factory floor. Sometimes, alternatively, we may be able to
consider what might a particular environment do to man, and then see whether such
consequences are indeed to be found. But more often than not we have no such guidance, no
such means of proceeding; more often than not we are dependent upon our observation and
enumeration of defined events for which we then seek antecedents. In other words, we see
that the event B is associated with the environmental feature A, that, to take a specific
example, some form of respiratory illness is associated with a dust in the environment. In
what circumstances can we pass from this observed association to a verdict of causation?
Upon what basis should we proceed to do so?
    I have no wish, nor the skill, to embark upon a philosophical discussion of the meaning
of ‘causation’. The ‘cause’ of illness may be immediate and direct, it may be remote and
indirect underlying the observed association. But with the aims of occupational, and almost

© 2020 Sage Publications.

synonymously preventive, medicine in mind, the decisive question is whether the frequency
of the undesirable event B will be influenced by a change in the environmental feature A.
How such a change exerts that influence may call for a great deal of research. However,
before deducing ‘causation’ and taking action, we shall not invariably have to sit around
awaiting the results of that research. The whole chain may have to be unravelled or a few
links may suffice. It will depend upon circumstances.
Disregarding then any such problem in semantics we have this situation. Our
observations reveal an association between two variables, perfectly clear-cut and beyond what we
would care to attribute to the play of chance. What aspects of that association should we
especially consider before deciding that the most likely interpretation of it is causation?

1. Strength: First upon my list, I would put the strength of the association. To take a
very old example, by comparing the occupations of patients with scrotal cancer with
the occupations of patients presenting with other diseases, Percival Pott could reach a
correct conclusion because of the enormous increase of scrotal cancer in the chimney
sweeps. ‘Even as late as the second decade of the twentieth century’, writes Richard
Doll, ‘the mortality of chimney sweeps from scrotal cancer was some 200 times that
of workers who were not specially exposed to tar or mineral oils and in the eighteenth
century the relative difference is likely to have been much greater’ (Doll, 1964).
To take a more modern and more general example upon which I have now reflected for
over 15 years, prospective inquiries into smoking have shown that the death rate from
cancer of the lung in cigarette smokers is nine to 10 times the rate in non-smokers and
the rate in heavy cigarette smokers is 20 to 30 times as great. On the other hand, the
death rate from coronary thrombosis in smokers is no more than twice, possibly less,
the death rate in non-smokers. Though there is good evidence to support causation
it is surely much easier in this case to think of some features of life that may go hand-
in-hand with smoking – features that might conceivably be the real underlying cause
or, at the least, an important contributor, whether it be lack of exercise, nature
of diet or other factors. But to explain the pronounced excess in cancer of the lung
in any other environmental terms requires some feature of life so intimately linked
with cigarette smoking and with the amount of smoking that such a feature should
be easily detectable. If we cannot detect it or reasonably infer a specific one, then in
such circumstances, I think we are reasonably entitled to reject the vague contention
of the armchair critic ‘you can’t prove it, there may be such a feature’.
Certainly in this situation, I would reject the argument sometimes advanced that
what matters is the absolute difference between the death rates of our various groups
and not the ratio of one to other. That depends upon what we want to know. If
we want to know how many extra deaths from cancer of the lung will take place
through smoking (i.e. presuming causation), then obviously we must use the absolute
differences between the death rates – 0.07 per 1000 per year in non-smoking doctors,
0.57 in those smoking 1–14 cigarettes daily, 1.39 for 15–24 cigarettes daily and 2.27
for 25 or more daily. But it does not follow here, or in more specifically occupational
problems, that this best measure of the effect upon mortality is also the best measure
in relation to aetiology. In this respect, the ratios of 8, 20 and 32 to 1 are far more
informative. It does not, of course, follow that the differences revealed by ratios are of
any practical importance. Maybe they are, maybe they are not; but that is another
point altogether.
We may recall John Snow’s classic analysis of the opening weeks of the cholera
epidemic of 1854 (Snow, 1855). The death rate that he recorded in the customers supplied
with the grossly polluted water of the Southwark and Vauxhall Company was in truth
quite low – 71 deaths in each 10,000 houses. What stands out vividly is the fact that
the small rate is 14 times the figure of five deaths per 10,000 houses supplied with the
sewage-free water of the rival Lambeth Company.
In thus putting emphasis upon the strength of an association, we must, nevertheless,
look at the obverse of the coin. We must not be too ready to dismiss a cause-and-effect
hypothesis merely on the grounds that the observed association appears to be slight.
There are many occasions in medicine when this is in truth so. Relatively few persons
harbouring the meningococcus fall sick of meningococcal meningitis. Relatively few
persons occupationally exposed to rat’s urine contract Weil’s disease.
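
The contrast drawn above between absolute differences and ratios can be checked with a few lines of arithmetic. The following is only an illustrative sketch, not part of the original lecture: the rates are those quoted in the text, while the function name `compare_to_baseline` and the dictionary labels are assumptions introduced here.

```python
# Lung cancer death rates per 1000 per year among doctors, as quoted in the text.
rates_per_1000 = {
    "non-smokers": 0.07,
    "1-14 cigarettes daily": 0.57,
    "15-24 cigarettes daily": 1.39,
    "25+ cigarettes daily": 2.27,
}


def compare_to_baseline(rates, baseline_key):
    """Return {group: (rate_difference, rate_ratio)} relative to the baseline group.

    Hypothetical helper written for this illustration only.
    """
    baseline = rates[baseline_key]
    return {
        group: (rate - baseline, rate / baseline)
        for group, rate in rates.items()
        if group != baseline_key
    }


smoking = compare_to_baseline(rates_per_1000, "non-smokers")
for group, (difference, ratio) in smoking.items():
    print(f"{group}: excess {difference:.2f} per 1000 per year, ratio {ratio:.0f} to 1")

# Snow's cholera figures quoted in the text: deaths per 10,000 houses by water company.
snow = compare_to_baseline({"Lambeth": 5, "Southwark and Vauxhall": 71}, "Lambeth")
difference, ratio = snow["Southwark and Vauxhall"]
print(f"Southwark and Vauxhall: excess {difference} per 10,000 houses, ratio {ratio:.0f} to 1")
```

The rounded ratios reproduce the figures of 8, 20 and 32 to 1 for smoking and 14 to 1 for the two water companies. The absolute excess answers the question of how many extra deaths occur; the ratio is the quantity argued here to bear on aetiology.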

2. Consistency: Next on my list of features to be specially considered, I would place the
consistency of the observed association. Has it been repeatedly observed by different
persons, in different places, circumstances and times?
This requirement may be of special importance for those rare hazards singled out in the
Section’s terms of reference. With many alert minds at work in industry today many
an environmental association may be thrown up. Some of them on the customary tests
of statistical significance will appear to be unlikely to be due to chance. Nevertheless,
whether chance is the explanation or whether a true hazard has been revealed may
sometimes be answered only by a repetition of the circumstances and the observations.
Returning to my more general example, the Advisory Committee to the Surgeon-
General of the United States Public Health Service found the association of smoking
with cancer of the lung in 29 retrospective and seven prospective inquiries (US
Department of Health, Education, and Welfare, 1964). The lesson here is that broadly the
same answer has been reached in quite a wide variety of situations and techniques. In
other words, we can justifiably infer that the association is not due to some constant
error or fallacy that permeates every inquiry. And we have indeed to be on our guard
against that.
Take, for instance, an example given by Heady (Heady, 1958). Patients admitted to
hospital for operation for peptic ulcer are questioned about recent domestic anxieties
or crises that may have precipitated the acute illness. As controls, patients admitted
for operation for a simple hernia are similarly quizzed. But, as Heady points out,
the two groups may not be in pari materia. If your wife ran off with the lodger last
week, you still have to take your perforated ulcer to hospital without delay. But with
a hernia you might prefer to stay at home for a while – to mourn (or celebrate) the
event. No number of exact repetitions would remove or necessarily reveal that fallacy.
We have, therefore, the somewhat paradoxical position that the different results of a
different inquiry certainly cannot be held to refute the original evidence; yet the same
results from precisely the same form of inquiry will not invariably greatly strengthen
the original evidence. I would myself put a good deal of weight upon similar results
reached in quite different ways, e.g. prospectively and retrospectively.
Once again looking at the obverse of the coin, there will be occasions when repetition
is absent or impossible and yet we should not hesitate to draw conclusions. The
experience of the nickel refiners of South Wales is an outstanding example. I quote
from the Alfred Watson Memorial Lecture that I gave in 1962 to the Institute of
Actuaries:

The population at risk, workers and pensioners, numbered about one
thousand. During the ten years 1929 to 1938, sixteen of them had died from
cancer of the lung, eleven of them had died from cancer of the nasal sinuses.
At the age specific death rates of England and Wales at that time, one might
have anticipated one death from cancer of the lung (to compare with the
16), and a fraction of a death from cancer of the nose (to compare with the
11). In all other bodily sites cancer had appeared on the death certificate
11 times and one would have expected it to do so 10-11 times. There had
been 67 deaths from all other causes of mortality and over the ten years’
period 72 would have been expected at the national death rates. Finally
division of the population at risk in relation to their jobs showed that the
excess of cancer of the lung and nose had fallen wholly upon the workers
employed in the chemical processes.
More recently my colleague, Dr Richard Doll, has brought this story a stage
further. In the nine years 1948 to 1956 there had been, he found, 48 deaths
from cancer of the lung and 13 deaths from cancer of the nose. He assessed
the numbers expected at normal rates of mortality as, respectively 10 and
0.1. In 1923, long before any special hazard had been recognized, certain
changes in the refinery took place. No case of cancer of the nose has been
observed in any man who first entered the works after that year, and in
these men there has been no excess of cancer of the lung. In other words,
the excess in both sites is uniquely a feature in men who entered the refinery
in, roughly, the first 23 years of the present century.
No causal agent of these neoplasms has been identified. Until recently no
animal experimentation had given any clue or any support to this wholly
statistical evidence. Yet I wonder if any of us would hesitate to accept it as
proof of a grave industrial hazard? (Hill, 1962)

In relation to my present discussion, I know of no parallel investigation. We have (or
certainly had) to make up our minds on a unique event; and there is no difficulty in
doing so.
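
The nickel refiners' evidence rests on comparing observed deaths with the numbers expected at national rates. As a rough sketch of that comparison (not part of the original lecture), the observed-to-expected ratio can be computed as below. The counts are those quoted in the passage; the value 10.5 for "10-11" expected deaths is an assumed midpoint, the nasal cancer figures are omitted because the first expected number is given only as "a fraction of a death", and the function name `observed_to_expected` is hypothetical.

```python
# (observed deaths, expected deaths at national rates), as quoted in the text.
# ASSUMPTION: the text gives "10-11" expected for other cancers; 10.5 is a midpoint chosen here.
nickel_refiners = {
    "lung cancer, 1929-38": (16, 1.0),
    "other cancers, 1929-38": (11, 10.5),
    "all other causes, 1929-38": (67, 72.0),
    "lung cancer, 1948-56": (48, 10.0),
}


def observed_to_expected(observed, expected):
    """Ratio of observed to expected deaths (hypothetical helper for illustration)."""
    if expected <= 0:
        raise ValueError("expected deaths must be positive")
    return observed / expected


ratios = {
    cause: observed_to_expected(observed, expected)
    for cause, (observed, expected) in nickel_refiners.items()
}
for cause, ratio in ratios.items():
    print(f"{cause}: observed/expected = {ratio:.2f}")
```

Ratios close to 1 for other cancers and for all other causes, set against a large excess for cancer of the lung, are what give this single investigation its force.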

3. Specificity: One reason, needless to say, is the specificity of the association, the third
characteristic which invariably we must consider. If, as here, the association is
limited to specific workers and to particular sites and types of disease and there is no
association between the work and other modes of dying, then clearly that is a strong
argument in favour of causation.
We must not, however, overemphasise the importance of the characteristic. Even in
my present example there is a cause-and-effect relationship with two different sites
of cancer – the lung and the nose. Milk as a carrier of infection and, in that sense,
the cause of disease can produce such a disparate galaxy as scarlet fever, diphtheria,
tuberculosis, undulant fever, sore throat, dysentery and typhoid fever. Before the
discovery of the underlying factor, the bacterial origin of disease, harm would have
been done by pushing too firmly the need for specificity as a necessary feature before
convicting the dairy.
Coming to modern times, the prospective investigations of smoking and cancer of
the lung have been criticised for not showing specificity – in other words, the death
rate of smokers is higher than the death rate of non-smokers from many causes of
death (though in fact the results of Doll and Hill (1964) do not show that). But here
surely one must return to my first characteristic, the strength of the association. If
other causes of death are raised 10, 20 or even 50% in smokers whereas cancer of the
lung is raised 900–1000% we have specificity – a specificity in the magnitude of the
association.
We must also keep in mind that diseases may have more than one cause. It has always
been possible to acquire a cancer of the scrotum without sweeping chimneys or taking
to mule-spinning in Lancashire. One-to-one relationships are not frequent. Indeed,
I believe that multi-causation is generally more likely than single causation though
possibly if we knew all the answers we might get back to a single factor.
In short, if specificity exists we may be able to draw conclusions without hesitation; if
it is not apparent, we are not thereby necessarily left sitting irresolutely on the fence.

4. Temporality: My fourth characteristic is the temporal relationship of the association –
which is the cart and which the horse? This is a question which might be particularly
relevant with diseases of slow development. Does a particular diet lead to disease
or do the early stages of the disease lead to those peculiar dietetic habits? Does a
particular occupation or occupational environment promote infection by the tubercle
bacillus or are the men and women who select that kind of work more liable to contract
tuberculosis whatever the environment – or, indeed, have they already contracted it?
This temporal problem may not arise often but it certainly needs to be remembered,
particularly with selective factors at work in industry.

5. Biological gradient: Fifth, if the association is one which can reveal a biological
gradient, or dose-response curve, then we should look most carefully for such evidence.
For instance, the fact that the death rate from cancer of the lung rises linearly with
the number of cigarettes smoked daily adds a very great deal to the simpler evidence
that cigarette smokers have a higher death rate than non-smokers. That comparison
would be weakened, though not necessarily destroyed, if it depended upon, say, a much
heavier death rate in light smokers and a lower rate in heavier smokers. We should
then need to envisage some much more complex relationship to satisfy the cause-and-
effect hypothesis. The clear dose-response curve admits of a simple explanation and
obviously puts the case in a clearer light.
The same would clearly be true of an alleged dust hazard in industry. The dustier
the environment the greater the incidence of disease we would expect to see. Often
the difficulty is to secure some satisfactory quantitative measure of the environment
which will permit us to explore this dose-response. But we should invariably seek it.
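
A first, crude look at a biological gradient is simply to ask whether the rate rises at every step of exposure. The sketch below, which is not part of the original lecture, applies that check to the lung cancer death rates quoted under Strength and to a made-up reversed pattern of the kind said above to weaken the comparison. The function name `rises_with_dose` and the reversed figures are assumptions for illustration; the check tests ordering only, not linearity.

```python
def rises_with_dose(rates_by_dose):
    """True if each rate is strictly higher than the rate at the previous dose level."""
    return all(lower < higher for lower, higher in zip(rates_by_dose, rates_by_dose[1:]))


# Lung cancer death rates per 1000 per year, ordered by daily cigarettes:
# none, 1-14, 15-24, 25 or more (figures quoted in the text).
observed_rates = [0.07, 0.57, 1.39, 2.27]

# HYPOTHETICAL pattern: heavier death rate in light smokers, lower in heavier smokers.
reversed_pattern = [0.07, 2.27, 1.39, 0.57]

print("observed rates rise with dose:", rises_with_dose(observed_rates))
print("reversed pattern rises with dose:", rises_with_dose(reversed_pattern))
```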

6. Plausibility: It will be helpful if the causation we suspect is biologically plausible. But
this is a feature I am convinced we cannot demand. What is biologically plausible
depends upon the biological knowledge of the day.
To quote again from my Alfred Watson Memorial Lecture (Hill, 1962), there was

...no biological knowledge to support (or to refute) Pott’s observation in
the 18th century of the excess of cancer in chimney sweeps. It was lack of
biological knowledge in the 19th that led a prize essayist writing on the value
and the fallacy of statistics to conclude, amongst other ‘absurd’ associations,
that ‘it could be no more ridiculous for the stranger who passed the night
in the steerage of an emigrant ship to ascribe the typhus, which he there
contracted, to the vermin with which bodies of the sick might be infected’.
And coming to nearer times, in the 20th century there was no biological
knowledge to support the evidence against rubella.

In short, the association we observe may be one new to science or medicine and we
must not dismiss it too light-heartedly as just too odd. As Sherlock Holmes advised
Dr Watson, ‘when you have eliminated the impossible, whatever remains, however
improbable, must be the truth’.

7. Coherence: On the other hand the cause-and-effect interpretation of our data should
not seriously conflict with the generally known facts of the natural history and biology
of the disease – in the expression of the Advisory Committee to the Surgeon-General
it should have coherence.
Thus in the discussion of lung cancer, the Committee finds its association with
cigarette smoking coherent with the temporal rise that has taken place in the two
variables over the last generation and with the sex difference in mortality – features
that might well apply in an occupational problem. The known urban/rural ratio of
lung cancer mortality does not detract from coherence, nor the restriction of the effect
to the lung.
Personally, I regard as greatly contributing to coherence the histopathological evidence
from the bronchial epithelium of smokers and the isolation from cigarette smoke of
factors carcinogenic for the skin of laboratory animals. Nevertheless, while such
laboratory evidence can enormously strengthen the hypothesis and, indeed, may determine
the actual causative agent, the lack of such evidence cannot nullify the epidemiological
observations in man. Arsenic can undoubtedly cause cancer of the skin in man but
it has never been possible to demonstrate such an effect on any other animal. In a
wider field, John Snow’s epidemiological observations on the conveyance of cholera by
the water from the Broad Street pump would have been put almost beyond dispute
if Robert Koch had been then around to isolate the vibrio from the baby’s nappies,
the well itself and the gentleman in delicate health from Brighton. Yet the fact that
Koch’s work was to be awaited another 30 years did not really weaken the
epidemiological case though it made it more difficult to establish against the criticisms of the
day – both just and unjust.

8. Experiment: Occasionally, it is possible to appeal to experimental, or semi-experimental,
evidence. For example, because of an observed association some preventive action is
taken. Does it in fact prevent? The dust in the workshop is reduced, lubricating oils
are changed, persons stop smoking cigarettes. Is the frequency of the associated events
affected? Here the strongest support for the causation hypothesis may be revealed.

9. Analogy: In some circumstances, it would be fair to judge by analogy. With the effects
of thalidomide and rubella before us, we would surely be ready to accept slighter but
similar evidence with another drug or another viral disease in pregnancy.

Here, then, are nine different viewpoints from all of which we should study association
before we cry causation. What I do not believe – and this has been suggested – is that
we can usefully lay down some hard-and-fast rules of evidence that must be obeyed before
we accept cause and effect. None of my nine viewpoints can bring indisputable evidence
for or against the cause-and-effect hypothesis and none can be required as a sine qua non.
What they can do, with greater or less strength, is to help us to make up our minds on the
fundamental question – is there any other way of explaining the set of facts before us, is
there any other answer equally, or more, likely than cause and effect?

Tests of Significance
No formal tests of significance can answer those questions. Such tests can, and should,
remind us of the effects that the play of chance can create, and they will instruct us in the
likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our
hypothesis.
Nearly 40 years ago, among the studies of occupational health that I made for the
Industrial Health Research Board of the Medical Research Council was one that concerned
the workers in the cotton-spinning mills of Lancashire (Hill, 1930). The question that I had
to answer, by the use of the National Health Insurance records of that time, was this: Do
the workers in the cardroom of the spinning mill, who tend the machines that clean the
raw cotton, have a sickness experience in any way different from that of other operatives
in the same mills who are relatively unexposed to the dust and fibre that were features of
the cardroom? The answer was an unqualified ‘Yes’. From age 30 to age 60, the cardroom
workers suffered over three times as much from respiratory causes of illness whereas from
non-respiratory causes their experience was not different from that of the other workers.
This pronounced difference with the respiratory causes was derived not from abnormally
long periods of sickness but rather from an excessive number of repeated absences from
work of the cardroom workers.
All this has rightly passed into the limbo of forgotten things. What interests me today is
this: My results were set out for men and women separately and for half a dozen age groups
in 36 tables. So there were plenty of sums. Yet I cannot find that anywhere I thought it
necessary to use a test of significance. The evidence was so clear-cut, the differences between
the groups were mainly so large, the contrast between respiratory and non-respiratory causes
of illness so specific, that no formal tests could really contribute anything of value to the
argument. So why use them?
Would we think or act that way today? I rather doubt it. Between the two world wars
there was a strong case for emphasising to the clinician and other research workers the
importance of not overlooking the effects of the play of chance upon their data. Perhaps
too often generalities were based upon two men and a laboratory dog while the treatment
of choice was deduced from a difference between two bedfuls of patients and might easily
have no true meaning. It was therefore a useful corrective for statisticians to stress, and to
teach the need for, tests of significance merely to serve as guides to caution before drawing
a conclusion, before inflating the particular to the general.
I wonder whether the pendulum has not swung too far – not only with the attentive
pupils but even with the statisticians themselves. To decline to draw conclusions without
standard errors can surely be just as silly? Fortunately, I believe we have not yet gone so
far as our friends in the USA where, I am told, some editors of journals will return an article
because tests of significance have not been applied. Yet there are innumerable situations in
which they are totally unnecessary – because the difference is grotesquely obvious, because
it is negligible, or because, whether it be formally significant or not, it is too small to be
of any practical importance. What is worse the glitter of the t table diverts attention from
the inadequacies of the fare. Only a tithe, and an unknown tithe, of the factory personnel
volunteer for some procedure or interview, 20% of patients treated in some particular way
are lost to sight, 30% of a randomly-drawn sample are never contacted. The sample may,
indeed, be akin to that of the man who, according to Swift, ‘had a mind to sell his house and
carried a piece of brick in his pocket, which he showed as a pattern to encourage purchasers’.
The writer, the editor and the reader are unmoved. The magic formulae are there.
Of course, I exaggerate. Yet too often I suspect we waste a deal of time, we grasp
the shadow and lose the substance, we weaken our capacity to interpret data and to take
reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’
from ‘no significant difference’. Like fire, the χ² test is an excellent servant and a bad
master.
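
The warning against deducing ‘no difference’ from ‘no significant difference’ can be illustrated with an interval estimate. The sketch below is not from the original lecture: the counts (6 of 20 against 3 of 20) are entirely hypothetical, the function name `risk_difference_ci` is an assumption, and the interval uses the simple normal approximation, which is rough at sample sizes this small.

```python
import math


def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Difference in proportions (a minus b) with an approximate 95% interval.

    Uses the normal (Wald) approximation; hypothetical helper for illustration.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


# HYPOTHETICAL data: 6 of 20 exposed and 3 of 20 unexposed workers fall ill.
diff, low, high = risk_difference_ci(6, 20, 3, 20)
print(f"difference {diff:.2f}, approximate 95% interval {low:.2f} to {high:.2f}")
```

With these made-up numbers the interval includes zero, so a formal test would report no significant difference, yet it also includes differences large enough to matter. The data are compatible with no effect and with a substantial one, which is not the same as showing that there is none.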

The case for action
Finally, in passing from association to causation I believe in ‘real life’, we shall have to
consider what flows from that decision. On scientific grounds, we should do no such thing.
The evidence is there to be judged on its merits and the judgment (in that sense) should
be utterly independent of what hangs upon it – or who hangs because of it. But in another
and more practical sense we may surely ask what is involved in our decision. In
occupational medicine our object is usually to take action. If this be operative cause and that be
deleterious effect, then we shall wish to intervene to abolish or reduce death or disease.
While that is a commendable ambition it almost inevitably leads us to introduce
differential standards before we convict. Thus, on relatively slight evidence, we might decide to
restrict the use of a drug for early-morning sickness in pregnant women. If we are wrong
in deducing causation from association no great harm will be done. The good lady and the
pharmaceutical industry will doubtless survive.
On fair evidence, we might take action on what appears to be an occupational hazard,
e.g. we might change from a probably carcinogenic oil to a non-carcinogenic oil in a limited
environment and without too much injustice if we are wrong. But we should need very
strong evidence before we made people burn a fuel in their homes that they do not like
or stop smoking the cigarettes and eating the fats and sugar that they do like. In asking
for very strong evidence I would, however, repeat emphatically that this does not imply
crossing every ‘t’, and swords with every critic, before we act.
All scientific work is incomplete – whether it be observational or experimental. All
scientific work is liable to be upset or modified by advancing knowledge. That does not
confer upon us a freedom to ignore the knowledge we already have, or to postpone the
action that it appears to demand at a given time. Who knows, asked Robert Browning,
but the world may end tonight? True, but on available evidence most of us make ready to
commute on the 8.30 next day.

References
Doll, R. (1964). Cancer. In: Witts, L.J. (ed.) Medical Surveys and Clinical Trials, 2nd ed.
London: Oxford University Press, p. 333.
Doll, R. and Hill, A.B. (1964). Mortality in relation to smoking: ten years’ observations of
British doctors. British Medical Journal, 1, 1399-1410.
Heady, J.A. (1958). False figuring: statistical method in medicine. Med World Lond, 89,
305.
Hill, A.B. (1930). Sickness Amongst Operatives in Lancashire Spinning Mills. Industrial
Health Research Board Report No. 59. London: HMSO.
Hill, A.B. (1962). The statistician in medicine. (Alfred Watson Memorial Lecture). J Inst
Actuar, 88, 178–191.
Snow, J. (1855). On the Mode of Communication of Cholera. 2nd ed. London: John
Churchill (reprinted 1936, New York).
US Department of Health, Education, and Welfare (1964). Smoking and Health. Public
Health Service Publication No. 1103. Washington.

Observational Studies 6 (2020) 10 Submitted 7/19; Published 1/20

Commentary on Lecture by Sir Austin Bradford Hill

Peter Armitage armitage55@btinternet.com
Wallingford, Oxfordshire, U.K.

This lecture is quoted widely as the source for Bradford Hill’s principles, or criteria,
for favouring causality in epidemiological studies as distinct from mere association between
environmental factor and disease. The criteria were also proposed in an edition of his
textbook on medical statistics published later (Hill, 1977). A careful analysis of the arguments
is provided by Rothman and Greenland (1998). The lecture was given a few years after
Hill’s retirement, and perhaps indicates a return to the themes that would have
permeated his thoughts in the pre-war period, when he was actively involved in occupational
epidemiology, before his important advocacy of randomized trials which occupied much of
his time in the 1950s. The abundance in this paper of examples of contentious issues in
occupational medicine suggests that these had never been far from his mind. Is it also,
perhaps, an indication that he thought of himself primarily as an epidemiologist rather than a
statistician?
It is interesting that statistical analysis does not occur as the central theme of any of
the nine criteria, but there is an important final section on significance tests, which he
regards as being irrelevant to the main issue. This may have been an offshoot of his desire
to discuss the problem as one of common sense, avoiding technical detail. One wonders,
though, whether significance tests can be wholly ignored. If all the other criteria are satisfied
the results of a test may be sufficiently ambiguous as to affect the conclusions and perhaps
lead the investigators to extend the study before publicizing the results.
Bradford Hill was a persuasive and perceptive advocate, and his writings are a joy to
read. I frequently read in the press of large studies which demonstrate that people who
have particular habits suffer more from some disease, without any explanation of the thorny
distinction between association and causation. Let us hope that this reissue of Bradford
Hill’s classic lecture will achieve some of its author’s intentions.

References
Hill, A.B. (1977). A Short Textbook of Medical Statistics. London: Hodder and Stoughton.
Rothman, K.J. and Greenland, S. (1998). Hill’s Criteria for Causality. In Encyclopedia of
Biostatistics (ed. P. Armitage and T. Colton), Vol. 3. London: Wiley.

© 2020 Peter Armitage.

Observational Studies 6 (2020) 11-16                           Submitted 12/19; Published 1/20

                              Following Bradford Hill

Mike Baiocchi                                                              baiocchi@stanford.edu
Department of Epidemiology and Population Health
Stanford University
Stanford, CA 94305, USA

                                            Abstract
    In 1965, Sir Austin Bradford Hill offered his thoughts on: “What aspects of [an] associ-
    ation should we especially consider before deciding that the most likely interpretation of
    it is causation?” He proposed nine means for reasoning about the association, which he
    named as: strength, consistency, specificity, temporality, biological gradient, plausibility,
    coherence, experiment, and analogy. In this paper, we look at what motivated Bradford
    Hill to propose we focus on these nine features. We contrast Bradford Hill’s approach
    with a currently fashionable framework for reasoning about statistical associations – the
    Common Task Framework. And then suggest why following Bradford Hill, 50+ years on,
    is still extraordinarily reasonable.
    Keywords: Causality, Bradford Hill Criteria, Common Task Framework

1. Reading with context
It feels odd writing about a paper that is more than 50 years old, particularly inside of
a discipline that is currently undergoing extraordinary growth and innovation. But the
“Bradford Hill criteria” (Bradford Hill, 1965) occupy a particularly prominent peak for
those of us interested in making decisions that hinge on causal claims. To some, Bradford
Hill laid out friendly signposts that suggest safer paths to achieving solid inferences about
causal connections. To more, the “Bradford Hill criteria” are only stood up in order to be
knocked right back down; the nine “criteria” are introduced and logical holes are punched
through until students are left with the impression that hemming in causality with rules
is a fool’s errand. And yet, this paper persists. In fact, I was discussing this paper the
other day with a colleague who told a fascinating story about when she served as an expert
witness for a defense team in some legal case or another. The defense attorneys wanted her
to work through the plausibility of each of the criteria. By report, it sounded like a deeply
interesting (and lucrative) exercise in careful thinking. Her story got me curious so I dug
into the legal world – and lo and behold – there are citations, guides, and warnings
about deploying the Bradford Hill criteria in one’s arguments before the court, as well
as detailed guides on how to counter opposing counsel’s expert witness’s use of the criteria.
These ideas seem to have sprouted legs and scurried out of our exclusive domain and into
others, even while still kicking up heated exchanges in our own academic literature (Phillips
and Goodman (2004), Höfler (2005), Phillips and Goodman (2006), Höfler (2006) – as you
might be able to guess from the alternation of authors, that’s a fun exchange). So what’s
going on here?

© 2020 Mike Baiocchi.

If you encountered Bradford Hill’s ideas in a setting disconnected from the original
manuscript then it may help unlock a bit more of his meaning by considering his audience.
Bradford Hill first gave his remarks to the newly formed Section of Occupational Medicine.
In context, these ideas were offered to medical practitioners charged with making complex
decisions about health but with an eye toward bottom line economics, employment dynam-
ics, and (to some degree) consumer tastes. His audience was not the usual data-analysts
we think of, concerned with uncertainty intervals and p-values. Bradford Hill tells us this
directly as he sets the table: “[Suppose] our observations reveal an association between two
variables, perfectly clear-cut and beyond what we would care to attribute to the play of
chance. What aspects of that association should we especially consider before deciding that
the most likely interpretation of it is causation?” Using modern terminology, we might say
Bradford Hill is quite a bit less interested in statistical inference and more interested in
study design considerations. Though even that terminology does not get quite at the nub
of his line of reasoning.
To get closer to the flow of his arguments, let’s jump over all the important bits in the
middle and pull from his concluding section: The Case for Action. Again, letting the man
speak for himself: “Finally, in passing from association to causation I believe in ‘real life’ we
shall have to consider what flows from that decision... In occupational medicine our object
is usually to take action. . . While that is a commendable ambition it almost inevitably
leads us to introduce differential standards before we convict.” What follows is a discussion
of balances – how does the strength of evidence enter into the decision to forbid/compel
people to take actions? The size of the effect, levels of certainty, chains of consequences
that may arise from our (in)actions – these all need to be weighed out. If he stopped his
reasoning at that level of thought then this manuscript likely would have resolved into some
kind of call for a better decision-theoretic framework. But that’s not where Bradford Hill
took the argument, and this is why this paper is still fascinating all these years later. As
far as I can discern, there are two additional tensions he is tracking. The first, which
he highlights repeatedly, is the need for action right now, set against
the academic’s slower, more careful building of evidence toward solid, scientific conclusions.
The second, and more interesting (at least to me and my contemporary eyes), is his focus on
convincing people.

2. Reasoning with context
There are several ways to think about what we – those of us who are interested in making
empirically rigorous, positive change in the world – are doing. Perhaps we are mathematiz-
ing the scientific method, providing crisp, quantified boundaries on what can be known and
how best to empirically know it. Maybe we merely clear away the rubbish others bring into
the Ivory Tower. These are recognizable roles we play in academic settings. But Bradford
Hill is reminding us of a more fundamental role: we do all this to convince people. We
may believe that rigorous evidence will compel, but it won’t. Look at the absurdity of
climate-change denial, or the rates of anti-vax. Change does not – exclusively – arise from
rigorous empirical conclusions.
One way to understand the challenge of convincing people is to unpack why we tend
to formalize and create decision rules. The first reason we formalize is to make discovery
more productive. When a new causal discovery emerges it often feels a bit shocking and
it’s natural to marvel at its departure from what has come before. For us, data-analysts,
we tend to focus on the methods by which the discovery was achieved – which can be even
more novel than the underlying causal discovery. But quickly, what used to seem novel and
cutting-edge in science gets pulled into the core – what used to be “art” becomes codified and
reproducible. This process is wildly productive, allowing many researchers to explore new
directions that previously only the cleverest could. The second reason we tend to formalize
requires some insight about how humans make decisions and assign responsibility: if we can
formalize these kinds of causal-discovery methods, then we shift the burden of responsibility
for declaring discovery outside of the individual (e.g., located in the idiosyncrasies of both
the situation under investigation and the researcher making the claim) and into the general
(e.g., rules that are recognizable across settings but also rules that have buy-in from the
researchers or policy makers or stakeholders in our domain who will be impacted by our
empirical investigations). Formalized decision rules that standardize discovery and regulate
our claims on the strength of conclusions – in a manner that is much like laws – help
communities set standards and make planning and settling disagreements easier and less
arbitrary. In fact, looking at several of Bradford Hill’s “criteria” with modern eyes, you can
see how his insights have been formalized with statistical methodologies (to pick a few): (i)
“strength of effect” looks quite a bit like the thinking behind modern sensitivity analysis
addressing unobserved confounders; (ii) “consistency” looks like meta-analysis, replicability,
and transportability of effect; (iii) even the initially surprising appeal to “analogy” that
he suggests has found some formalization in the work on “transfer learning” in machine
learning. These additions to our formalized tool set are great.
But, again, formalization is not what Bradford Hill is interested in. In fact, he takes some
wonderfully cheeky shots at formalized decision making. He suggests that using statistical
processes allows decision-makers to obscure and shirk their critical responsibilities. The
tension that keeps Bradford Hill’s argument fresh is the one that makes many of us excited
about doing our jobs: figuring out how to bring new discoveries to the larger community.
When a debate is vital and complex, when the stakes really matter, then how do we reach
solid conclusions that will be strong enough to win over our colleagues and those impacted
by our conclusions? If you’re in the position Bradford Hill was, talking to a room full of
physicians interested in Occupational Medicine, then you’ll understand the tension between
the kinds of formalized rules that rigorous statistical analysis provides and the kinds of
arguments used to convince and debate in the larger (less technical) community. A concrete
example: given our analysis, we believe we should order a popular agricultural product
removed from the market. If we decide to act on this belief then a new rule will come into
existence. There will be many “losers” in this new regime, and a number of them will need
to change their behaviors. How do we explain this rule? How do we get buy-in for this
rule? The less familiar the logic used to create the rule is to the people on the receiving
end, the less likely they are to engage in the required change in behavior.
For a moment, let’s pause this unpacking of Bradford Hill’s manuscript and move for-
ward to approximately contemporary time. One of the dominant modes of reasoning in
contemporary data analysis – the Common Task Framework (CTF) – serves as an extraor-
dinary contrast to Bradford Hill’s ideas; it’s worth exploring the CTF to better understand
Bradford Hill.

3. Reasoning without context
The productivity of, and explosive improvements in, statistical prediction (“machine learning”)
have rightly stood out in fields touched by data science. (If we’re being honest then to some
degree it has also caused some feelings of anxiety inside those of us less inclined to flashy
prediction, and more enamored of the slower accumulation of information in fields interested
in causal inference.) If, like me, you are less familiar with the field of prediction then you’re
likely even less familiar with the epistemological engine that has powered its growth. The
Common Task Framework (Liberman (2015); Donoho (2017)) provides a fast, low-barriers-
to-entry way for analysts to debate which algorithm performs “best” on a given data set.
The CTF is an alternative way of assessing the suitability of an algorithm; it stands in
contrast to the more traditional methods like mathematical theorems or simulations from a
given data generating function. Even if the CTF name is unfamiliar you’ve likely heard of
this dynamic; the Netflix Prize (Bennett and Lanning, 2007) was an excellent example of
this framework. The key features of the CTF are (slightly modified from Donoho (2017)):

1. A publicly available training dataset involving, for each observation, a list of (possibly
many) feature measurements, and an outcome for that observation.

2. A set of enrolled competitors (analysts) whose common task is to infer a prediction
rule from the training data.

3. A scoring referee, to which competitors can submit their prediction rule. The referee
runs the prediction rule against a testing dataset which is sequestered behind a Chinese
wall. The referee objectively and automatically reports the score (prediction accuracy)
achieved by the submitted rule.

All competitors share the common task of creating prediction rules which will receive
a good score; hence the phrase common task framework. The performance metric provides
an ordering that gives analysts permission to claim “this algorithm provides useful insights
when used on this data set” – such claims are strongest when framed relative to other
algorithms. This is where the “leaderboard” style of algorithmic development came from.
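The three ingredients listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synthetic data, the `Referee` class, and the two toy competitors (`always_one`, `fit_threshold`) are hypothetical names invented here, not part of any real CTF platform or of Donoho's description.

```python
import random


def make_data(n, seed):
    """Hypothetical synthetic data: one feature x, outcome 1 when noisy x exceeds 0.5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x + rng.gauss(0, 0.1) > 0.5 else 0
        rows.append((x, y))
    return rows


class Referee:
    """Holds the sequestered test set and reports only a prediction accuracy."""

    def __init__(self, test_rows):
        self._test_rows = test_rows  # competitors never see these rows
        self.leaderboard = {}

    def submit(self, name, rule):
        correct = sum(1 for x, y in self._test_rows if rule(x) == y)
        score = correct / len(self._test_rows)
        self.leaderboard[name] = score
        return score

    def ranking(self):
        # Best score first: this ordering is the "leaderboard".
        return sorted(self.leaderboard.items(), key=lambda kv: kv[1], reverse=True)


def always_one(x):
    """A deliberately naive competitor that ignores the feature."""
    return 1


def fit_threshold(rows):
    """A competitor that picks the cutoff with the best training accuracy."""
    def n_correct(cutoff):
        return sum(1 for x, y in rows if (1 if x > cutoff else 0) == y)

    best = max((i / 100 for i in range(101)), key=n_correct)
    return lambda x: 1 if x > best else 0


train = make_data(500, seed=1)             # 1. public training data
referee = Referee(make_data(500, seed=2))  # 3. hidden test data held by the referee

# 2. enrolled competitors submit prediction rules inferred from the training data
referee.submit("always_one", always_one)
referee.submit("threshold", fit_threshold(train))

for name, score in referee.ranking():
    print(f"{name}: {score:.3f}")
```

Note what the referee never asks: why a rule works, or whether it would survive a change in how the data were generated. The only claim the ranking supports is relative performance on this one data set.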
Obviously, predictive models are not causal models. But it is not hard to find colleagues
who have become a bit too enamored of the predictive power of this or that fantastical black
box – believing a bit too much in its ability to accurately describe all possible dynamics
of the data. For these folks, it is a small leap of faith to use a model like this to try
answering questions about what will happen if we intervene. (Dear reader, I assure you: it
pains me too.) How? Perhaps they use something like predictive margins (see Graubard and
Korn, 1999) – first, setting every observational unit’s exposure level to unexposed; second, setting
every observational unit’s exposure level to exposed; and then contrasting the two hypothetical
groups’ outcomes. Hidden behind almost all actions taken after consulting a black box is a
confidence in its ability to faithfully describe all potential configurations of the data. But
where did their confidence in the model come from?
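To make the mechanics concrete, here is a minimal sketch of the predictive-margins recipe just described. Everything in it is an assumption made for illustration: the eight records are invented, and the "model" is simply a table of cell means standing in for whatever black box an analyst might prefer. It is not the survey-weighted procedure of Graubard and Korn.

```python
from collections import defaultdict

# Hypothetical records: (exposed, age_group, outcome)
records = [
    (1, "old", 1), (1, "old", 1), (1, "old", 0), (1, "young", 0),
    (0, "old", 1), (0, "young", 0), (0, "young", 0), (0, "young", 1),
]


def fit_cell_means(rows):
    """Stand-in prediction model: mean outcome within each (exposure, age_group) cell."""
    totals = defaultdict(lambda: [0, 0])
    for exposed, group, outcome in rows:
        totals[(exposed, group)][0] += outcome
        totals[(exposed, group)][1] += 1
    means = {cell: s / n for cell, (s, n) in totals.items()}
    return lambda exposed, group: means[(exposed, group)]


def predictive_margin(model, rows, exposure):
    """Set every unit's exposure to the given level, predict, and average."""
    predictions = [model(exposure, group) for _, group, _ in rows]
    return sum(predictions) / len(predictions)


model = fit_cell_means(records)
margin_unexposed = predictive_margin(model, records, exposure=0)
margin_exposed = predictive_margin(model, records, exposure=1)
contrast = margin_exposed - margin_unexposed

print(f"unexposed: {margin_unexposed:.3f}")
print(f"exposed:   {margin_exposed:.3f}")
print(f"contrast:  {contrast:.3f}")
```

The arithmetic is trivial, which is rather the point. The contrast can be read as the effect of intervening only if the model faithfully describes outcomes at exposure levels the units never experienced, and nothing in the code checks that.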
The CTF allows fantastically complex, “black box” algorithms to be developed and (in
a particular sense) evaluated. Without the CTF, complex algorithms – so complicated that
they cannot be described mathematically – would have much weaker evidence to be trusted
and thus deployed. With the CTF, we can see the performance of any algorithm vis-à-vis
any other algorithm on the same data set. The CTF has allowed algorithm developers to be
extremely productive, principally by letting them avoid both slow-moving math and
the deeper engagement with the nuanced issues that gave rise to the data, the kind of
engagement that traditional causal inference analysts undertake. Algorithms that grow up in the CTF aren’t required to be
accountable to slow-moving, tradition-bound, coherence-seeking people. In fact, there’s a
very explicit line of argument inside some data communities that “human experts need
to be removed from the decision-making process” – rather, the machines should do the
learning because they are more likely to produce optimal results. The thinking
goes: Humans are slow. Humans are hard to understand. Let’s remove humans from this
process.
But here’s the thing: when we stand with Bradford Hill, humans are the point. We’re
trying to convince humans to change. Rules that are (in a particular sense) “optimal” are
not the same as rules that are useful for effecting change. In fact, it’s easy to imagine
that rules generated by “black box” algorithms (i.e., literally inexplicable) are less likely
to be complied with than rules reached through consensus building and through reasoning
accessible by those who are being asked to have their lives shaped by the rules. Do not
mistake what I’m saying as arguing against well-reasoned, formal, quantified rules that
come out of statistical analyses. Our rigorous statistical methods are the strong bones of
the beast, but they don’t provide the heart, mind, and muscles that animate and make these
decisions human. We are better now that we have statistical procedures that formalize many
of Bradford Hill’s criteria. But these new statistical procedures do not solve the principal
issue Bradford Hill was engaging: how to convince and change.

4. Following Bradford Hill
I didn’t introduce the CTF to either bury or praise it, but rather because it is a perfect
example of how one might reason about data in a way that is about as far removed as possible
from the way Bradford Hill advocates we reason about data. The contrast here, I hope,
helps illuminate the point that Bradford Hill was making. When I read this manuscript,
I see someone making tough, impactful decisions in the presence of uncertainty. He is
steeped in the particulars of the situation. While formalizing Bradford Hill’s criteria is
useful, and will produce better decision-making, it is also beside the point. (And, in the
most extreme, can lead to a type of blindness about the role of experts, stakeholders, and
consequences for our analyses.) The criteria are paths of reasoning about causality that
resonate and reassure. In the kinds of questions epidemiologists, health policy researchers,
economists, criminologists. . . engage, the ultimate audience is a community of people whom
our conclusions impact. Statistical reasoning can be like mathematics at times, but in
answering these kinds of questions it is better to think of statistical reasoning as a form of
rigorous, quantitative argumentation – meant to guide thought and shift beliefs.
I have a friend who keeps a copy of Bradford Hill’s criteria pinned to his office wall. He
uses it to remind himself of the paths he might take. I like that. If you follow Bradford Hill
then I think you’ll have an easier time reaching your audience.

Acknowledgments

I would like to thank Jordan Rodu for conversations about the CTF and the cheese plate.

References
Bennett, J. and Lanning, S. (2007). The Netflix Prize. Proceedings of KDD Cup and
  Workshop, 2(3):35.

Bradford Hill, A. (1965). The environment and disease: association or causation? Proceed-
  ings of the Royal Society of Medicine, 58(2):295–300.

Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical
  Statistics, 26(4):745–766.

Graubard, B. and Korn, E. (1999). Predictive margins with survey data. Biometrics,
  55(2):652–9.

Höfler, M. (2005). The Bradford Hill considerations on causality: a counterfactual
  perspective. Emerging Themes in Epidemiology, 2(1).

Höfler, M. (2006). Getting causal considerations back on the right track. Emerging Themes
  in Epidemiology, 3(1).

Liberman, M. (2015). Reproducible research and the common task method. Simons
  Foundation Lecture.

Phillips, C. and Goodman, K. (2004). The missed lessons of Sir Austin Bradford Hill.
  Epidemiologic Perspectives & Innovations, 1(1).

Phillips, C. and Goodman, K. (2006). Causal criteria and counterfactuals; nothing more
  (or less) than scientific common sense. Emerging Themes in Epidemiology, 3(1).

Observational Studies 6 (2020) 17-19 Submitted 9/19; Published 1/20

On the use and abuse of Hill’s viewpoints on causality

Samantha Kleinberg samantha.kleinberg@stevens.edu
Computer Science Department
Stevens Institute of Technology
Hoboken, NJ, USA 07030

Here, then, are nine different viewpoints from all of which we should study
association before we cry causation. What I do not believe — and this has
been suggested — is that we can usefully lay down some hard-and-
fast rules of evidence that must be obeyed before we accept cause
and effect. None of my nine viewpoints can bring indisputable evidence for or
against the cause-and-effect hypothesis and none can be required as a sine qua
non. What they can do, with greater or less strength, is to help us to make up
our minds on the fundamental question — is there any other way of explaining
the set of facts before us, is there any other answer equally, or more, likely than
cause and effect? (emphasis added, italics original) Hill (1965)

Not since Fisher¹ suggested p < 0.05 is often convenient has such a clear statement
by a statistician been so misunderstood. Hill’s sensible advice has been transformed,
like Samsa in Kafka’s Metamorphosis, into what his article warned against: a checklist.
Google Scholar returns over 100,000 articles using the phrase “Bradford Hill Criteria,” the phrase
has been growing in usage in books since the 1990s (see Figure 1),² and even the Wikipedia
page on the topic is titled “Bradford Hill Criteria.”³ And yet Hill wrote that there are no
“hard-and-fast rules” for causality.
This is not just a marketing problem. How we talk influences how we think (Boroditsky,
2011) and the mutation of considerations into criteria is in fact part of their misuse. Hill
referred to the pieces of evidence we may wish to examine as “aspects of [an] association
[to] consider before deciding that the most likely interpretation of it is causation” (p. 295)
and “viewpoints from [which] to study association before we cry causation” (p. 299). Con-
siderations may influence our decisions, such as whether to believe a causal relationship
exists, but they are also things we may evaluate and ignore if they’re not relevant. Criteria,
in contrast, are a benchmark against which we test something. In the case of causality,
criteria provide a tantalizing yet misleading shortcut: check off these boxes and you can
claim causality. Yet, there is no such checklist for causality and Hill’s considerations are
neither necessary nor sufficient to establish a causal relationship.⁴
1. Fisher (1925) said about a p-value threshold of 0.05 that “it is convenient to take this point as a limit in
judging whether a deviation is to be considered significant or not.” The message people heard appears
to be “p < 0.05 or it didn’t happen.”
2. Variations of the phrase involving viewpoints and considerations are so rare that no Ngrams are found.
3. https://en.wikipedia.org/wiki/Bradford_Hill_criteria
4. See Kleinberg (2015) and (Rothman et al., 2008, p. 26) for a few examples detailing just why this is.

© 2020 Samantha Kleinberg.

Figure 1: Google Ngram results for usage of “Bradford Hill Criteria” in books.

Checklists can be a powerful tool in safety-critical domains where cognitive load is
high and time is short (Gawande, 2010). However, the settings where Hill’s views are most
useful are not like that. They are cases where experiments are difficult or impossible and we
must cobble together piecemeal evidence for causal claims. These are cases where we also
must assess whether a consideration is relevant to the topic. For example, Hill along with
Richard Doll identified a link between smoking and lung cancer at a time when little was
known about the etiology of lung cancer (Doll and Hill, 1950). It is not possible to conduct
randomized experiments to test the hypothesis that smoking is responsible for cancer, but
it is of great public health significance to know what causes cancer so it can be prevented.
From this experience Hill distilled his views on how we can gain such causal knowledge into
his famous article.
Yet rather than providing a starting point, Hill’s viewpoints have been widely and
repeatedly used as a standard of evidence, the same way the majority of researchers use a
p-value cutoff of 0.05. The precise danger of conventions is that one need not justify them,
whereas a p-value threshold of 0.04 or 0.06 would invite significant scrutiny.⁵ However,
many other factors, such as effect size, matter in determining whether a result is
actually important. When Hill’s views are treated as criteria, they similarly become
a causal inference fig leaf. If these reasonable but still unvalidated pieces of evidence can be
provided,⁶ then congratulations, you can claim causality.
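The gap between clearing a p-value cutoff and having an effect that matters can be seen in a small worked example. The sketch below uses a textbook two-sided z-test for a difference in proportions; the function name `two_proportion_test` and both sets of counts are hypothetical, chosen only to illustrate the point.

```python
import math


def two_proportion_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions (pooled standard error).

    Returns (difference in proportions, p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p1 - p2, p_value


# Hypothetical study A: a tiny difference measured in a very large sample.
diff_a, p_a = two_proportion_test(101_500, 1_000_000, 100_000, 1_000_000)

# Hypothetical study B: a large difference measured in a very small sample.
diff_b, p_b = two_proportion_test(8, 20, 4, 20)

print(f"A: difference = {diff_a:.4f}, significant at 0.05: {p_a < 0.05}")
print(f"B: difference = {diff_b:.4f}, significant at 0.05: {p_b < 0.05}")
```

Under these made-up counts, study A clears the conventional threshold with a difference of a fraction of a percentage point, while study B does not despite a difference of twenty percentage points. The cutoff alone says nothing about which result deserves attention.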
So if I’m suggesting researchers quit the causal criteria cold turkey, what will replace
them? It is perfectly fine to refer to Hill’s considerations as a starting point when thinking
about what evidence one might gather when evaluating an association. The part that is
not fine is making the leap from these pieces of evidence to a definitive claim of causality
– and both failing to consider other types of evidence and forcing these considerations to
fit scenarios where they do not apply. These are two critical areas for future research to
explore. First, our methods and data have evolved in the years since Hill’s article, yet the
considerations remain static. It is worth exploring whether there are other evidence types
that may prove useful as well as updating how current methods might support the existing
considerations (e.g. how do big data and simulations fit in?). Second, while Hill focused
on epidemiology, the considerations have been used more broadly and it is important to
examine how the needs for and standards of evidence vary across domains. If we allow
them to evolve, Hill’s considerations will hopefully meet a better end than poor Samsa.

5. Editorial guidelines for the journal Cognition explicitly state that only effects with p < 0.05 can be
described as statistically significant, stating that “for better or for worse, this is the current convention,”
which it seems even journals are powerless to change.
6. This is all leaving aside the question of what it means to satisfy each criterion, which surely requires more
nuance than present/absent.

Acknowledgments
Thanks to Dylan Small for providing an outlet for these viewpoints.

References
Boroditsky, L. (2011). How language shapes thought. Scientific American, 304(2):62–65.

Doll, R. and Hill, A. B. (1950). Smoking and carcinoma of the lung. British Medical Journal,
2(4682):739.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.

Gawande, A. (2010). The Checklist Manifesto. Henry Holt and Company.

Hill, A. (1965). The environment and disease: association or causation? Proceedings of the
Royal Society of Medicine, 58(2):295–300.

Kleinberg, S. (2015). Why: A Guide to Finding and Using Causes. O’Reilly Media.

Rothman, K. J., Greenland, S., Lash, T. L., et al. (2008). Modern Epidemiology, 3rd edition.
Wolters Kluwer Health/Lippincott Williams & Wilkins, Philadelphia.
