- Indian J Dermatol
- v.61(2); Mar-Apr 2016
Methodology Series Module 2: Case-control Studies
Maninder Singh Setia
Epidemiologist, MGM Institute of Health Sciences, Navi Mumbai, Maharashtra, India
Case-control study design is a type of observational study. In this design, participants are selected for the study based on their outcome status. Thus, some participants have the outcome of interest (referred to as cases), whereas others do not have the outcome of interest (referred to as controls). The investigator then assesses the exposure in both these groups. The investigator should define the cases as specifically as possible. Sometimes, the definition of a disease may be based on multiple criteria; thus, all these points should be explicitly stated in the case definition. An important aspect of selecting controls is that they should be from the same ‘study base’ as the cases. We can select controls from a variety of groups, including the general population, relatives or friends, and hospital patients. Matching is often used in case-control studies to ensure that the cases and controls are similar in certain characteristics, and it is a useful technique to increase the efficiency of the study. Case-control studies can usually be conducted relatively quickly and inexpensively, particularly when compared with prospective cohort studies. The design is useful for studying rare outcomes and outcomes with long latent periods, but it is not very useful for studying rare exposures. Furthermore, case-control studies may be prone to certain biases, notably selection bias and recall bias.
Case-Control study design is a type of observational study design. In an observational study, the investigator does not alter the exposure status. The investigator measures the exposure and outcome in study participants, and studies their association.
In a case-control study, participants are selected for the study based on their outcome status. Thus, some participants have the outcome of interest (referred to as cases), whereas others do not have the outcome of interest (referred to as controls). The investigator then assesses the exposure in both these groups. Thus, by design, in a case-control study the outcome has to occur in some of the participants that have been included in the study.
As seen in Figure 1 , at the time of entry into the study (sampling of participants), some of the study participants have the outcome (cases) and others do not have the outcome (controls). During the study procedures, we will examine the exposure of interest in cases as well as controls. We will then study the association between the exposure and outcome in these study participants.
Figure 1: Example of a case-control study
Examples of Case-Control Studies
Smoking and lung cancer study
In their landmark study, Doll and Hill (1950) evaluated the association between smoking and lung cancer. They included 709 patients with lung carcinoma (defined as cases). They also included 709 controls from general medical and surgical patients. The selected controls were similar to the cases with respect to age and sex. Thus, they included 649 males and 60 females among cases as well as controls.
They found that only 0.3% of males were non-smokers among cases. However, the proportion of non-smokers among male controls was 4.2%; the difference was statistically significant (P = 0.00000064). Similarly, they found that about 31.7% of females were non-smokers among cases compared with 53.3% among controls; this difference was also statistically significant (0.01 < P < 0.02).
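The percentages above can be turned into a rough measure of association. The sketch below is only illustrative: the non-smoker counts are back-calculated from the reported percentages (an assumption, since the text does not give raw counts), and the crude odds ratio ignores the age and sex matching used in the study. The function names are hypothetical.

```python
def non_smokers(pct, n):
    """Approximate number of non-smokers from a reported percentage of n people."""
    return round(pct / 100 * n)


def crude_odds_ratio(n_cases, ns_cases, n_controls, ns_controls):
    """Crude odds ratio for smoking (exposed) versus non-smoking (unexposed)."""
    smokers_cases = n_cases - ns_cases
    smokers_controls = n_controls - ns_controls
    # Odds of smoking among cases divided by odds of smoking among controls.
    return (smokers_cases * ns_controls) / (smokers_controls * ns_cases)


# Males: 649 cases and 649 controls; 0.3% and 4.2% non-smokers respectively.
male_or = crude_odds_ratio(649, non_smokers(0.3, 649), 649, non_smokers(4.2, 649))

# Females: 60 cases and 60 controls; 31.7% and 53.3% non-smokers respectively.
female_or = crude_odds_ratio(60, non_smokers(31.7, 60), 60, non_smokers(53.3, 60))

print(f"Crude OR, males:   {male_or:.1f}")
print(f"Crude OR, females: {female_or:.1f}")
```

Under these assumptions, the odds of smoking are roughly 14 times higher among male cases than male controls, and roughly 2.5 times higher among female cases than female controls. These are back-of-the-envelope figures, not values reported in the text above.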
Melanoma and tanning (Lazovic et al., 2010)
The authors conducted a case-control study to assess the association between melanoma and tanning. The 1167 cases (individuals with invasive cutaneous melanoma) were selected from the Minnesota Cancer Surveillance System. The 1101 controls were selected randomly from the Minnesota State Driver's License list; they were matched for age (+/- 5 years) and sex.
The data were collected by self-administered questionnaires and telephone interviews. The investigators assessed the use of tanning devices (using photographs), the number of years of use, and the frequency of use of these devices. They also collected information on other variables (such as sun exposure; presence of freckles and moles; and colour of skin and hair, among other exposures).
They found that the risk of melanoma was higher in individuals who used UVB-enhanced and primarily UVA-emitting devices. The risk of melanoma also increased with the number of years of use, hours of use, and sessions.
Risk factors for erysipelas (Pitché et al, 2015)
Pitché et al (2015) conducted a case-control study to assess the factors associated with leg erysipelas in sub-Saharan Africa. This was a multi-centre study; the cases and controls were recruited from eight countries in sub-Saharan Africa.
They recruited cases of acute leg cellulitis in these eight countries. They recruited two controls for each case; these were matched for age (+/- 5 years) and sex. Thus, the final study had 364 cases and 728 controls. They found that leg erysipelas was associated with obesity, lymphoedema, neglected traumatic wound, toe-web intertrigo, and voluntary cosmetic depigmentation.
We have provided details of all the three studies in the bibliography. We strongly encourage the readers to read the papers to understand some practical aspects of case-control studies.
Selection of Cases and Controls
Selection of cases and controls is an important part of this design. Wacholder and colleagues (1992 a, b, and c) have published wonderful manuscripts on the design and conduct of case-control studies in the American Journal of Epidemiology. The discussion in the next few sections is based on these manuscripts.
Selection of case
The investigator should define the cases as specifically as possible. Sometimes, definition of a disease may be based on multiple criteria; thus, all these points should be explicitly stated in case definition.
For example, in the above-mentioned Melanoma and Tanning study, the researchers defined their cases as any histologic variety of invasive cutaneous melanoma. However, they added another important criterion: these individuals should have a driver's license or State identity card. This is probably not directly related to the clinical condition, so why did they add this criterion? We will discuss this in detail in the next few paragraphs.
Selection of a control
The next important point in designing a case-control study is the selection of control patients.
In fact, Wacholder and colleagues have extensively discussed aspects of design of case control studies and selection of controls in their article.
According to them, an important aspect of selecting a control is that they should be from the same ‘study base’ as that of the cases. Thus, the pool of population from which the cases and controls will be enrolled should be same. For instance, in the Tanning and Melanoma study, the researchers recruited cases from Minnesota Cancer Surveillance System; however, it was also required that these cases should either have a State identity card or Driver's license. This was important since controls were randomly selected from Minnesota State Driver's license list (this also included the list of individuals who have the State identity card).
Another important aspect of a case-control study is that we should measure the exposure similarly in cases and controls. For instance, if we design a research protocol to study the association between metabolic syndrome (exposure) and psoriasis (outcome), we should ensure that we use the same criteria (clinically and biochemically) for evaluating metabolic syndrome in cases and controls. If we use different criteria to measure the metabolic syndrome, then it may cause information bias.
Types of Controls
We can select controls from a variety of groups. Some of them are: General population; relatives or friends; or hospital patients.
An important source of controls is patients attending the hospital for diseases other than the outcome of interest. These controls are easy to recruit and are more likely to have similar quality of medical records.
However, we have to be careful while recruiting these controls. In the above example of metabolic syndrome and psoriasis, we recruit psoriasis patients from the Dermatology department of the hospital as cases. We recruit patients who do not have psoriasis and present to the Dermatology department as controls. Some of these individuals have presented to the Dermatology department with tinea pedis. Do we recruit these individuals as controls for the study? What is the problem if we recruit these patients? Some studies have suggested that diabetes mellitus and obesity are predisposing factors for tinea pedis. As we know, fasting plasma glucose of >100 mg/dl and raised triglycerides (>=150 mg/dl) are criteria for the diagnosis of metabolic syndrome. Thus, it is quite likely that if we recruit many of these tinea pedis patients, the exposure of interest may turn out to be similar in cases and controls; this result may not reflect the truth in the population.
Relative and friend controls
Relative controls are relatively easy to recruit. They can be particularly useful when we are interested in trying to ensure that some of the measurable and non-measurable confounders are relatively equally distributed in cases and controls (such as home environment, socio-economic status, or genetic factors).
Another source of controls is a list of friends referred by the cases. These controls are easy to recruit and they are also more likely to be similar to the cases in socio-economic status and other demographic factors. However, they are also more likely to have similar behaviours (alcohol use, smoking etc.); thus, it may not be prudent to use these as controls if we want to study the effect of these exposures on the outcome.
Population controls can be recruited easily if a list of all individuals is available (for example, a list of state identity card holders or a voter registration list). In the Tanning and Melanoma study, the researchers used population controls, identified from the Minnesota State driver's license list.
We may have to use sampling methods (such as random digit dialing or multistage sampling methods) to recruit controls from the population. A main advantage is that these controls are likely to satisfy the ‘study-base’ principle (described above) as suggested by Wacholder and colleagues. However, they can be expensive and time consuming. Furthermore, many of these controls will not be inclined to participate in the study; thus, the response rate may be very low.
Matching in a Case-Control Study
Matching is often used in case-control studies to ensure that the cases and controls are similar in certain characteristics. For example, in the smoking and lung cancer study, the authors selected controls that were similar in age and sex to the carcinoma cases. Matching is a useful technique to increase the efficiency of the study.
‘Individual matching’ is one common technique used in case-control studies. For example, in the above-mentioned metabolic syndrome and psoriasis study, we can decide that for each case enrolled in the study, we will enroll a control that is matched for sex and age (+/- 2 years). Thus, if a 40-year-old male patient with psoriasis is enrolled in the study as a case, we will enroll a 38-42-year-old male patient without psoriasis (who is not excluded for any other reason) as a control.
If the study has used ‘individual matching’ procedures, then the data should also reflect the same. For instance, if you have 45 males among cases, you should also have 45 males among controls. If you show 60 males among controls, you should explain the discrepancy.
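The individual matching rule described above (same sex, age within +/- 2 years) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recruitment protocol: the `Person` record, the `find_matched_control` helper, and the sample identifiers are all hypothetical, and a real study would also apply its exclusion criteria before matching.

```python
from dataclasses import dataclass


@dataclass
class Person:
    pid: str  # hypothetical participant identifier
    sex: str  # "M" or "F"
    age: int  # age in years


def find_matched_control(case, candidates, used_ids, age_window=2):
    """Return the first unused candidate of the same sex within +/- age_window years."""
    for candidate in candidates:
        if candidate.pid in used_ids:
            continue  # each control is matched to at most one case
        if candidate.sex == case.sex and abs(candidate.age - case.age) <= age_window:
            used_ids.add(candidate.pid)
            return candidate
    return None  # no eligible control in the pool; keep screening


case = Person("case-1", "M", 40)
pool = [Person("ctrl-1", "F", 40), Person("ctrl-2", "M", 45), Person("ctrl-3", "M", 38)]
used = set()
match = find_matched_control(case, pool, used)
print(match.pid if match else "no match found")
```

A `None` result corresponds to the practical difficulty of matching: more potential enrollees have to be screened before a suitable control is found.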
Even though matching is used to increase the efficiency of case-control studies, it may have its own problems. It may be difficult to find an exactly matching control for the study; we may have to screen many potential enrollees before we are able to recruit one control for each case recruited. Thus, it may increase the time and cost of the study.
Nonetheless, matching may be useful to control for certain types of confounders. For instance, environment variables may be accounted for by matching controls for neighbourhood or area of residence. Household environment and genetic factors may be accounted for by enrolling siblings as controls.
If we use controls from the past (a time period when cases did not occur), then the controls are sometimes referred to as historic controls. Such controls may be recruited from past hospital records.
Strengths of a Case-Control Study
- Case-control studies can usually be conducted relatively quickly and inexpensively, particularly when compared with prospective cohort studies
- The design is useful for studying rare outcomes and outcomes with long latent periods. For example, if we wish to study the factors associated with melanoma in India, it will be useful to conduct a case-control study. We will recruit patients with melanoma as cases at one study site or multiple study sites. If we were to conduct a cohort study for this research question, we may have to follow individuals (with the exposure under study) for many years before the occurrence of the outcome
- It is also useful for studying multiple exposures for the same outcome. For example, in the metabolic syndrome and psoriasis study, we can study other factors such as Vitamin D levels or genetic markers
- Case-control studies are useful for studying the association of risk factors and outcomes in outbreak investigations. For instance, Freeman and colleagues (2015) conducted a case-control study to evaluate the role of proton pump inhibitors in an outbreak of non-typhoidal salmonellosis.
Limitations of a Case-control Study
- The design, in general, is not useful to study rare exposures. It may be prudent to conduct a cohort study for rare exposures
- Since the investigator chooses the number of cases and controls, the proportion of cases in the study may not be representative of the proportion in the population. For instance, if we choose 50 cases of psoriasis and 50 controls, the proportion of psoriasis cases in our study will be 50%. This is not the true prevalence. If we had chosen 50 cases of psoriasis and 100 controls, then the proportion of cases would be 33%.
- The design is not useful to study multiple outcomes. Since the cases are selected based on the outcome, we can only study the association between exposures and that particular outcome
- Sometimes the temporality of the exposure and outcome may not be clearly established in case-control studies
- The case-control studies are also prone to certain biases
If the cases and controls are not selected similarly from the study base, then it will lead to selection bias.
- Odds Ratio: We are able to calculate the odds ratio (OR) from a case-control study. Since we are not able to measure incidence in a case-control study, the odds ratio is a reasonable estimate of the relative risk (under some assumptions). Additional details about the OR will be discussed in the biostatistics section.
The OR in the above study is 3.5. Since the OR is greater than 1, the outcome is more likely in those exposed (those who are diagnosed with metabolic syndrome) compared with those who are not exposed (those who are not diagnosed with metabolic syndrome). However, we will require confidence intervals to comment further on the interpretation of the OR (this will be discussed in detail in the biostatistics section).
- Other analysis : We can use logistic regression models for multivariate analysis in case-control studies. It is important to note that conditional logistic regressions may be useful for matched case-control studies.
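The odds ratio calculation mentioned above can be sketched as follows. The 2x2 counts are made up for illustration and were simply chosen so that the OR comes out at 3.5; they are not the counts from the article's hypothetical table. The confidence interval uses the standard log-based (Woolf) approximation, which is an assumption on our part, as the text does not say which method it has in mind. The function name is hypothetical.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from an unmatched 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR), Woolf's approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustrative counts only: metabolic syndrome (exposure) and psoriasis (outcome).
or_value, ci_lower, ci_upper = odds_ratio_with_ci(a=30, b=20, c=15, d=35)
print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
```

With these illustrative counts, the interval lies entirely above 1, which is the kind of information needed before interpreting an OR. This unmatched calculation would not be appropriate for individually matched data, where conditional methods are generally preferred.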
Calculating an Odds Ratio (OR)
Hypothetical study of metabolic syndrome and psoriasis
Additional Points in A Case-Control Study
How many controls can I have for each case?
The optimal case-to-control ratio is 1:1. Jewell (2004) has suggested that for a fixed sample size, the chi square test for independence is most powerful if the number of cases is the same as the number of controls. However, in many situations we may not be able to recruit a large number of cases, and it may be easier to recruit more controls for the study. It has been suggested that, if we have a limited number of cases, we can increase the number of controls to increase the statistical power of the study. If data are available at no extra cost, then we may recruit multiple controls for each case. However, if it is expensive to collect exposure and outcome information from cases and controls, then the optimal ratio is 4 controls: 1 case. It has been argued that beyond this ratio, the gain in statistical power from additional controls is small compared with the cost of recruiting them.
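The diminishing return from extra controls can be illustrated with a commonly quoted rule of thumb that is not stated in the text above: with a fixed number of cases, the efficiency of using k controls per case, relative to having an unlimited number of controls, is approximately k / (k + 1). The sketch below assumes this approximation (which holds roughly near the null hypothesis and with similar costs per participant); the function name is hypothetical.

```python
def relative_efficiency(k):
    """Approximate efficiency of k controls per case relative to unlimited controls."""
    return k / (k + 1)


previous = 0.0
for k in (1, 2, 3, 4, 5, 6):
    efficiency = relative_efficiency(k)
    # The gain column shows how little each additional control adds beyond about four.
    print(f"{k} control(s) per case: efficiency {efficiency:.3f}, gain {efficiency - previous:.3f}")
    previous = efficiency
```

Under this approximation, moving from one to two controls per case adds far more than moving from four to five, which is consistent with the suggestion that ratios beyond 4:1 are rarely worth the cost.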
I have conducted a randomised controlled trial. I have included a group which received the intervention and another group which did not receive the intervention. Can I call this a case-control study?
A randomised controlled trial is an experimental study. In contrast, case-control studies are observational studies. These are two different groups of studies. One should not use the term case-control study for a randomised controlled trial (even though there is a control group in the study). Not every study with a control group is a case-control study. For a study to be classified as a case-control study, it should be an observational study and the participants should be recruited based on their outcome status (some have the disease and some do not).
Should I call case-control studies prospective or retrospective studies?
In ‘The Dictionary of Epidemiology’ by Porta (2014), the authors have suggested that even though the term ‘retrospective’ was used for case-control studies, the study participants are often recruited prospectively. In fact, the study on risk factors for erysipelas (Pitché et al., 2015) was a prospective case-control study. Thus, it is important to remember that the nature of the study (case-control or cohort) depends on the sampling method. If we sample the study participants based on exposure and move towards the outcome, it is a cohort study. However, if we sample the participants based on the outcome (some with the outcome and some without) and study the exposures in both these groups, it is a case-control study.
In case-control studies, participants are recruited on the basis of disease status. Thus, some of the participants have the outcome of interest (referred to as cases), whereas others do not have the outcome of interest (referred to as controls). The investigator then assesses the exposure in both these groups. Case-control studies are less expensive and quicker to conduct (compared with prospective cohort studies at least). The measure of association in this type of study is an odds ratio. This type of design is useful for rare outcomes and those with long latent periods. However, case-control studies may also be prone to certain biases, notably selection bias and recall bias.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
What Is a Case-Control Study? | Definition & Examples
Published on February 4, 2023 by Tegan George. Revised on June 22, 2023.
A case-control study is an observational study design that compares a group of participants possessing a condition of interest to a very similar group lacking that condition. Here, the participants possessing the attribute of study, such as a disease, are called the “cases,” and those without it are the “controls.”
It’s important to remember that the case group is chosen because they already possess the attribute of interest. The point of the control group is to facilitate investigation, e.g., studying whether the case group systematically exhibits that attribute more than the control group does.
Case-control studies are a type of observational study often used in fields like medical research, environmental health, or epidemiology. While most observational studies are qualitative in nature, case-control studies can also be quantitative, and they often are in healthcare settings. Case-control studies can be used for both exploratory and explanatory research, and they are a good choice for studying research topics like disease exposure and health outcomes.
A case-control study may be a good fit for your research if it meets the following criteria.
- Data on exposure (e.g., to a chemical or a pesticide) are difficult to obtain or expensive.
- The disease associated with the exposure you’re studying has a long incubation period or is rare or under-studied (e.g., AIDS in the early 1980s).
- The population you are studying is difficult to contact for follow-up questions (e.g., asylum seekers).
Retrospective cohort studies use existing secondary research data, such as medical records or databases, to identify a group of people with a common exposure or risk factor and to observe their outcomes over time. Case-control studies conduct primary research , comparing a group of participants possessing a condition of interest to a very similar group lacking that condition in real time.
Case-control studies are common in fields like epidemiology, healthcare, and psychology.
You would then collect data on your participants’ exposure to contaminated drinking water, focusing on variables such as the source of said water and the duration of exposure, for both groups. You could then compare the two to determine if there is a relationship between drinking water contamination and the risk of developing a gastrointestinal illness.
Example: Healthcare case-control study
You are interested in the relationship between the dietary intake of a particular vitamin (e.g., vitamin D) and the risk of developing osteoporosis later in life. Here, the case group would be individuals who have been diagnosed with osteoporosis, while the control group would be individuals without osteoporosis.
You would then collect information on dietary intake of vitamin D for both the cases and controls and compare the two groups to determine if there is a relationship between vitamin D intake and the risk of developing osteoporosis.
Example: Psychology case-control study
You are studying the relationship between early-childhood stress and the likelihood of later developing post-traumatic stress disorder (PTSD). Here, the case group would be individuals who have been diagnosed with PTSD, while the control group would be individuals without PTSD.
Case-control studies are a solid research method choice, but they come with distinct advantages and disadvantages.
Advantages of case-control studies
- Case-control studies are a great choice if you have any ethical considerations about your participants that could preclude you from using a traditional experimental design.
- Case-control studies are time efficient and fairly inexpensive to conduct because they require fewer subjects than other research methods.
- If there were multiple exposures leading to a single outcome, case-control studies can incorporate that. As such, they truly shine when used to study rare outcomes or outbreaks of a particular disease.
Disadvantages of case-control studies
- Case-control studies, like other observational studies, run a high risk of research biases. They are particularly susceptible to observer bias, recall bias, and interviewer bias.
- When the exposure being studied is very rare, attempting to conduct a case-control study can be very time consuming and inefficient.
- Case-control studies in general have low internal validity and are not always credible.
Case-control studies by design focus on one singular outcome. This makes them very rigid and not generalizable, as no extrapolation can be made about other outcomes like risk recurrence or future exposure threat. This leads to less satisfying results than other methodological choices.
A case-control study differs from a cohort study because cohort studies are more longitudinal in nature and do not necessarily require a control group.
While one may be added if the investigator so chooses, members of the cohort are primarily selected because of a shared characteristic among them. In particular, retrospective cohort studies are designed to follow a group of people with a common exposure or risk factor over time and observe their outcomes.
Case-control studies, in contrast, require both a case group and a control group, as suggested by their name, and usually are used to identify risk factors for a disease by comparing cases and controls.
A case-control study differs from a cross-sectional study because case-control studies are naturally retrospective in nature, looking backward in time to identify exposures that may have occurred before the development of the disease.
On the other hand, cross-sectional studies collect data on a population at a single point in time. The goal here is to describe the characteristics of the population, such as their age, gender identity, or health status, and understand the distribution and relationships of these characteristics.
Cases and controls are selected for a case-control study based on their inherent characteristics. Participants already possessing the condition of interest form the “case,” while those without form the “control.”
Keep in mind that by definition the case group is chosen because they already possess the attribute of interest. The point of the control group is to facilitate investigation, e.g., studying whether the case group systematically exhibits that attribute more than the control group does.
The strength of the association between an exposure and a disease in a case-control study is usually measured with the odds ratio (OR). Because participants are sampled on their outcome status, the relative risk (RR) cannot be calculated directly; the OR approximates it under some assumptions (for example, when the outcome is rare).
No, case-control studies cannot establish causality on their own.
As observational studies, they can suggest associations between an exposure and a disease, but they cannot prove without a doubt that the exposure causes the disease. In particular, issues arising from timing, research biases like recall bias, and the selection of variables lead to low internal validity and the inability to determine causality.
- PMID: 27057012
- PMCID: PMC4817437
- DOI: 10.4103/0019-5154.177773
Keywords: Case-control studies; design; limitations; strengths.
What Is A Case Control Study?
Editor at Simply Psychology
BA (Hons) Psychology, Princeton University
Julia Simkus is a graduate of Princeton University with a Bachelor of Arts in Psychology. She is currently studying for a Master's Degree in Counseling for Mental Health and Wellness in September 2023. Julia's research has been published in peer reviewed journals.
Saul Mcleod, PhD
BSc (Hons) Psychology, MRes, PhD, University of Manchester
Saul Mcleod, Ph.D., is a qualified psychology teacher with over 18 years experience of working in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.
A case-control study is a research method where two groups of people are compared – those with the condition (cases) and those without (controls). By looking at their past, researchers try to identify what factors might have contributed to the condition in the ‘case’ group.
A case-control study looks at people who already have a certain condition (cases) and people who don’t (controls). By comparing these two groups, researchers try to figure out what might have caused the condition. They look into the past to find clues, like habits or experiences, that are different between the two groups.
The “cases” are the individuals with the disease or condition under study, and the “controls” are similar individuals without the disease or condition of interest.
The controls should have similar characteristics (i.e., age, sex, demographic, health status) to the cases to mitigate the effects of confounding variables .
Case-control studies identify any associations between an exposure and an outcome and help researchers form hypotheses about a particular population.
Researchers will first identify the two groups, and then look back in time to investigate which subjects in each group were exposed to the suspected risk factor.
If the exposure is found more commonly in the cases than the controls, the researcher can hypothesize that the exposure may be linked to the outcome of interest.
Figure: Schematic diagram of case-control study design. Kenneth F. Schulz and David A. Grimes (2002). Case-control studies: research in reverse. The Lancet, 359(9304), 431-434.
Quick, inexpensive, and simple
Because these studies use already existing data and do not require any follow-up with subjects, they tend to be quicker and cheaper than other types of research. Case-control studies also do not require large sample sizes.
Beneficial for studying rare diseases
Researchers in case-control studies start with a population of people known to have the target disease instead of following a population and waiting to see who develops it. This enables researchers to identify current cases and enroll a sufficient number of patients with a particular rare disease.
Useful for preliminary research
Case-control studies are beneficial for an initial investigation of a suspected risk factor for a condition. The information obtained from case-control studies then enables researchers to conduct further data analyses to explore any relationships in more depth.
Subject to recall bias
Participants might be unable to remember when they were exposed, or might omit other details that are important for the study. In addition, those with the outcome are likely to recall and report exposures more completely than those without the outcome.
Difficulty finding a suitable control group
It is important that the case group and the control group have almost the same characteristics, such as age, gender, demographics, and health status.
Forming an accurate control group can be challenging, so sometimes researchers enroll multiple control groups to bolster the strength of the case-control study.
Do not demonstrate causation
Case-control studies may show an association between exposures and outcomes, but they cannot demonstrate causation.
A case-control study is an observational study in which researchers analyze two groups of people (cases and controls) to look at factors associated with particular diseases or outcomes.
Below are some examples of case-control studies:
- Investigating the impact of exposure to daylight on the health of office workers (Boubekri et al., 2014).
- Comparing serum vitamin D levels in individuals who experience migraine headaches with their matched controls (Togha et al., 2018).
- Analyzing correlations between parental smoking and childhood asthma (Strachan and Cook, 1998).
- Studying the relationship between elevated concentrations of homocysteine and an increased risk of vascular diseases (Ford et al., 2002).
- Assessing the magnitude of the association between Helicobacter pylori and the incidence of gastric cancer (Helicobacter and Cancer Collaborative Group, 2001).
- Evaluating the association between breast cancer risk and saturated fat intake in postmenopausal women (Howe et al., 1990).
Frequently asked questions
1. What’s the difference between a case-control study and a cross-sectional study?
Case-control studies are different from cross-sectional studies in that case-control studies compare groups retrospectively while cross-sectional studies analyze information about a population at a specific point in time.
In cross-sectional studies, researchers are simply examining a group of participants and depicting what already exists in the population.
2. What’s the difference between a case-control study and a longitudinal study?
Case-control studies compare groups retrospectively, while longitudinal studies can compare groups either retrospectively or prospectively.
In a longitudinal study, researchers monitor a population over an extended period of time. Such studies can be used to study developmental shifts and to understand how certain things change as we age.
In addition, case-control studies select participants by outcome status and assess their past exposure at a single point in time, whereas longitudinal studies observe the same subjects repeatedly over time.
3. What’s the difference between a case-control study and a retrospective cohort study?
Case-control studies are retrospective as researchers begin with an outcome and trace backward to investigate exposure; however, they differ from retrospective cohort studies.
In a retrospective cohort study, researchers use existing records to examine a group as it was before any of the subjects had developed the disease, then examine any factors that differed between the individuals who went on to develop the condition and those who did not.
Thus, retrospective cohort studies start from exposure and then ascertain the outcome, whereas case-control studies start from the outcome and then ascertain the exposure.
Boubekri, M., Cheung, I., Reid, K., Wang, C., & Zee, P. (2014). Impact of windows and daylight exposure on overall health and sleep quality of office workers: a case-control pilot study. Journal of Clinical Sleep Medicine: JCSM: Official Publication of the American Academy of Sleep Medicine, 10 (6), 603-611.
Ford, E. S., Smith, S. J., Stroup, D. F., Steinberg, K. K., Mueller, P. W., & Thacker, S. B. (2002). Homocyst(e)ine and cardiovascular disease: a systematic review of the evidence with special emphasis on case-control studies and nested case-control studies. International Journal of Epidemiology, 31 (1), 59-70.
Helicobacter and Cancer Collaborative Group. (2001). Gastric cancer and Helicobacter pylori: a combined analysis of 12 case control studies nested within prospective cohorts. Gut, 49 (3), 347-353.
Howe, G. R., Hirohata, T., Hislop, T. G., Iscovich, J. M., Yuan, J. M., Katsouyanni, K., … & Shunzhang, Y. (1990). Dietary factors and risk of breast cancer: combined analysis of 12 case-control studies. JNCI: Journal of the National Cancer Institute, 82 (7), 561-569.
Lewallen, S., & Courtright, P. (1998). Epidemiology in practice: case-control studies. Community eye health, 11 (28), 57–58.
Strachan, D. P., & Cook, D. G. (1998). Parental smoking and childhood asthma: longitudinal and case-control studies. Thorax, 53 (3), 204-212.
Tenny, S., Kerndt, C. C., & Hoffman, M. R. (2021). Case Control Studies. In StatPearls. StatPearls Publishing.
Togha, M., Razeghi Jahromi, S., Ghorbani, Z., Martami, F., & Seifishahpar, M. (2018). Serum Vitamin D Status in a Group of Migraine Patients Compared With Healthy Controls: A Case-Control Study. Headache, 58 (10), 1530-1540.
Schulz, K. F., & Grimes, D. A. (2002). Case-control studies: research in reverse. The Lancet, 359(9304), 431-434.
Study Design 101: Case Control Study
A study that compares patients who have a disease or outcome of interest (cases) with patients who do not have the disease or outcome (controls), and looks back retrospectively to compare how frequently the exposure to a risk factor is present in each group to determine the relationship between the risk factor and the disease.
Case control studies are observational because no intervention is attempted and no attempt is made to alter the course of the disease. The goal is to retrospectively determine the exposure to the risk factor of interest from each of the two groups of individuals: cases and controls. These studies are designed to estimate odds.
Case control studies are also known as "retrospective studies" and "case-referent studies."
Advantages
- Good for studying rare conditions or diseases
- Less time needed to conduct the study because the condition or disease has already occurred
- Lets you simultaneously look at multiple risk factors
- Useful as initial studies to establish an association
- Can answer questions that could not be answered through other study designs
Disadvantages
- Retrospective studies have more problems with data quality because they rely on memory and people with a condition will be more motivated to recall risk factors (also called recall bias).
- Not good for evaluating diagnostic tests because it's already clear that the cases have the condition and the controls do not
- It can be difficult to find a suitable control group
Design pitfalls to look out for
Care should be taken to avoid confounding, which arises when an exposure and an outcome are both strongly associated with a third variable. Controls should be subjects who might have been cases in the study but are selected independent of the exposure. Cases and controls should also not be "over-matched."
Is the control group appropriate for the population? Does the study use matching or pairing appropriately to avoid the effects of a confounding variable? Does it use appropriate inclusion and exclusion criteria?
There is a suspicion that zinc oxide, the white non-absorbent sunscreen traditionally worn by lifeguards, is more effective at preventing sunburns that lead to skin cancer than absorbent sunscreen lotions. A case-control study was conducted to investigate whether exposure to zinc oxide is a more effective skin cancer prevention measure. The study involved comparing a group of former lifeguards who had developed cancer on their cheeks and noses (cases) with a group of lifeguards without this type of cancer (controls) and assessing their prior exposure to zinc oxide or absorbent sunscreen lotions.
This study would be retrospective in that the former lifeguards would be asked to recall which type of sunscreen they used on their face and approximately how often. This could be either a matched or unmatched study, but efforts would need to be made to ensure that the former lifeguards are of the same average age, and lifeguarded for a similar number of seasons and amount of time per season.
Boubekri, M., Cheung, I., Reid, K., Wang, C., & Zee, P. (2014). Impact of windows and daylight exposure on overall health and sleep quality of office workers: a case-control pilot study. Journal of Clinical Sleep Medicine : JCSM : Official Publication of the American Academy of Sleep Medicine, 10 (6), 603-611. https://doi.org/10.5664/jcsm.3780
This pilot study explored the impact of exposure to daylight on the health of office workers (measuring well-being and sleep quality subjectively, and light exposure, activity level and sleep-wake patterns via actigraphy). Individuals with windows in their workplaces had more light exposure, longer sleep duration, and more physical activity. They also reported better scores in the areas of vitality and role limitations due to physical problems, better sleep quality and fewer sleep disturbances.
Togha, M., Razeghi Jahromi, S., Ghorbani, Z., Martami, F., & Seifishahpar, M. (2018). Serum Vitamin D Status in a Group of Migraine Patients Compared With Healthy Controls: A Case-Control Study. Headache, 58 (10), 1530-1540. https://doi.org/10.1111/head.13423
This case-control study compared serum vitamin D levels in individuals who experience migraine headaches with those of their matched controls. Studied over a period of thirty days, higher levels of serum vitamin D were associated with lower odds of migraine headache.
- Odds ratio in an unmatched study
- Odds ratio in a matched study
Case: A patient with the disease or outcome of interest.
Confounding: When an exposure and an outcome are both strongly associated with a third variable.
Control: A patient who does not have the disease or outcome.
Matched design: Each case is matched individually with a control according to certain characteristics such as age and gender. It is important to remember that the concordant pairs (pairs in which the case and control are either both exposed or both not exposed) tell us nothing about the risk of exposure separately for cases or controls.
Observed assignment: The method of assignment of individuals to study and control groups in observational studies when the investigator does not intervene to perform the assignment.
Unmatched design: The controls are a sample from a suitable non-affected population.
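The two odds ratio formulas listed above can be sketched in a few lines of Python. This is a minimal illustration, not the guide's own worked formulas: the function names and all counts are hypothetical. It assumes the usual cross-product estimate for an unmatched 2x2 table and, for 1:1 matched pairs, the ratio of the two kinds of discordant pairs (which is why concordant pairs contribute nothing).

```python
def unmatched_odds_ratio(exposed_cases, unexposed_cases,
                         exposed_controls, unexposed_controls):
    """Cross-product odds ratio from an unmatched 2x2 table."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls


def matched_odds_ratio(pairs_only_case_exposed, pairs_only_control_exposed):
    """Odds ratio for 1:1 matched pairs, using discordant pairs only."""
    return pairs_only_case_exposed / pairs_only_control_exposed


# Hypothetical counts, chosen only for illustration.
or_unmatched = unmatched_odds_ratio(30, 70, 15, 85)
or_matched = matched_odds_ratio(20, 10)

print(f"Unmatched odds ratio: {or_unmatched:.2f}")
print(f"Matched odds ratio:   {or_matched:.2f}")
```

An odds ratio above 1 suggests the exposure is more common among cases than controls; a value near 1 suggests no association.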
Now test yourself!
1. Case Control Studies are prospective in that they follow the cases and controls over time and observe what occurs.
a) True b) False
2. Which of the following is an advantage of Case Control Studies?
a) They can simultaneously look at multiple risk factors.
b) They are useful to initially establish an association between a risk factor and a disease or outcome.
c) They take less time to complete because the condition or disease has already occurred.
d) b and c only
e) a, b, and c
- Last Updated: Sep 25, 2023 10:59 AM
- URL: https://guides.himmelfarb.gwu.edu/studydesign101
Case Control Studies
Pearce N. Classification of Epidemiological Studies. Int J Epidemiol (2012) 41 (2): 393-397. (This article argues that there are really only four types of epidemiological studies: incidence studies, prevalence studies, incidence case-control studies and prevalence case-control studies. They differ in the type of outcome and in whether or not sampling is based on the outcome.)
Karlijn J. van Stralen, Friedo W. Dekker, Carmine Zoccali, Kitty J. Jager; Case-Control Studies – An Efficient Observational Study Design. Nephron Clinical Practice 1 February 2010; 114 (1): c1–c4. https://doi.org/10.1159/000242442
Case-control studies are an efficient research method for investigating risk factors of a disease. The method involves comparing the odds of exposure in a patient group with the odds of exposure in a control group. As only a minority of the population is included in the study, less time needs to be devoted to those who remain free of the disease of interest. The design of a case-control study can be complex because appropriate cases and controls must be selected. Cases can be identified prospectively or retrospectively from various sources. Controls can be obtained via the patient, by random-digit dialing or in a hospital, and at different points in the time period of the study. All options have their own advantages and disadvantages. Furthermore, different forms of bias, such as recall bias and selection bias, can occur. When appropriately designed, case-control studies can provide the same information as a cohort study, in a more rapid and efficient manner.
In the previous paper, we discussed the cohort study design [ 1 ]. In cohort studies, much effort is devoted to the follow-up of subjects who remain free of the disease of interest, as usually only a minority of the study population actually develop the disease. Case-control studies aim to obtain the same information as cohort studies: the difference in exposure to risk factors in subjects in relation to the development of the disease. But, unlike in cohort studies, only a minority of the population is included in the study, namely all cases (patients) and a selected number of control subjects. Furthermore, as data on exposure are being collected in retrospect, case-control studies can be much more efficient. Nevertheless, case-control studies have, besides the ‘general types of bias’, their own specific sources of bias. Also, the selection of cases and controls may be quite complex. In the current paper, we will discuss the basic aspects of the design and the conduct of a case-control study and we will touch upon its major difficulties.
Like case reports and case series, case-control studies can be seen as the natural expansion of the daily practice of physicians [ 2 ]. Cases are selected on the basis of the presence of the disease of interest. By asking questions to the patients or by obtaining data from medical records about the period prior to the occurrence of the disease, these cases are then classified into either exposed or unexposed to a particular risk factor. In a case series, one considers it sufficient to observe that the risk factor is present in many more cases than anticipated (some people call this ‘expected’ exposure to the risk factor a kind of ‘mental control group’). However, one could easily imagine that this mental control group is dependent on a physician’s experience and practice and that the responses of cases could depend on the way of questioning. Therefore, in a case-control study, a formal comparison with the exposure rate in controls is made. The controls are usually a representative sample of the population from which the cases originated (‘source population’; fig. 1 ). As only a sample of the population under study is included, it is not possible to determine the incidence rates in the exposed and unexposed groups, and therefore, no relative risk can be determined. However, the odds of exposure can be assessed in both groups, resulting in an odds ratio. This odds ratio can be interpreted as a relative risk, under conditions that will be explained in detail in a future paper.
Both cases and controls are sampled from a (hypothetical) source population, free of disease at the start. While the cases are sampled over time, controls can be selected in 3 different fashions: (1) the traditional case-control study with controls sampled at the end of the time period; (2) the incidence density case-control study, in which controls are sampled each time a case occurs, and (3) the case-base study, in which the controls are sampled at the (hypothetical) start of the study. When using methods 2 and 3, controls can be included in the study multiple times, both as control and as a case. When using method 1, all controls are still free of disease at the end of the time period and can therefore never be included as case.
An example of a case-control study is the one by Fored et al. [ 3 ] on the relationship between aspirin use and chronic kidney disease (CKD). Cases were patients with newly diagnosed CKD, as determined by increased serum creatinine levels, who were identified in the Swedish population register. Out of the source population of 5.3 million individuals, 926 cases were identified who had developed CKD in the preceding 2 years. The 998 controls were a random sample from the same register. Cases more often used aspirin (37%) compared with controls (19%) resulting in a 2.5 times higher risk of CKD in aspirin users compared with nonusers. If one would study this same relation between an exposure and CKD in a cohort study, to determine such an effect, one would need to include over 150,000 subjects and follow them for 1 year, because CKD is such a rare outcome. Therefore a case-control study design seems, indeed, the most efficient study design to investigate this relationship.
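The arithmetic behind the aspirin example can be checked with a short sketch. It uses only the two rounded exposure percentages quoted above (37% of cases, 19% of controls) and computes a crude, unadjusted odds ratio; the function names are illustrative, and the published estimate may have been adjusted for other factors.

```python
def odds(proportion):
    """Convert a proportion exposed into the odds of exposure."""
    return proportion / (1 - proportion)


def odds_ratio_from_proportions(p_exposed_cases, p_exposed_controls):
    """Crude odds ratio from the proportion exposed in each group."""
    return odds(p_exposed_cases) / odds(p_exposed_controls)


# Rounded percentages quoted in the text: 37% of cases and 19% of
# controls used aspirin.
aspirin_or = odds_ratio_from_proportions(0.37, 0.19)
print(f"Crude odds ratio for aspirin use: {aspirin_or:.2f}")
```

The result is close to the 2.5-fold figure reported in the text, which shows how the odds ratio follows directly from the exposure frequencies in the two groups.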
After the formulation of a research question and the decision to make use of the case-control study design, one must select eligible cases and controls. As a first step, one needs to define what a ‘case’ is. To avoid misclassification, objective criteria for the diagnosis of the disease under study are required. Moreover, a statement of eligibility criteria is essential in order to restrict the study to those who are potentially at risk for exposure. So, for example, in a study regarding the relation between oral contraceptive use and systemic lupus erythematosus, women who are postmenopausal or pregnant will need to be excluded, as they are not likely to use oral contraceptives [ 4 ]. Naturally, the same eligibility criteria should be applied to the controls, as applying different criteria could lead to erroneous results [ 5 ].
Cases can be sampled from different sources, such as hospitals, medical records and disease or treatment registries, and can be collected in 2 different ways. First, one could collect all prevalent cases, that is, all incident cases who developed the disease in the years before the start of the study, as in a retrospective cohort study. This is a very efficient and quick method. However, if the disease under study has an unfavorable prognosis, such as end-stage renal disease, it is likely that some cases who would have been eligible died in the preceding years, resulting in a selection of only surviving cases. The second way of including cases is to collect all incident patients prospectively, i.e. to wait for a case to occur and include it in the study. This reduces the chances of including only a selection of cases (the survivors). The drawback is that this method is more time consuming and therefore less efficient.
In a case-control study, deciding which controls to select often proves to be more difficult than the selection of cases. If cases are difficult to sample, one could increase the power of the study by collecting more controls than cases. The first aspect needing careful attention is the choice of the population from which the controls will be derived. In principle, though not compulsory, controls should be derived from the same source population as the cases. Controls can be selected from the general population by random-digit dialing (phone numbers), they can be obtained via the patient, such as partners, friends or neighbors, or they can be sampled from the same hospital as the case. Usually, there will be advantages for one type of control group that are missing in the other, and vice versa. For example, when studying risk factors associated with renal agenesis after a live birth [ 6 ], controls should be sampled from all live births. If patients with renal agenesis were born more often in academic hospitals, it would not be adequate to sample controls only from babies born in the same hospital as the cases, as in this situation one would over-sample babies born in academic hospitals. This may result in too high a number of babies and mothers with other diseases, leading to an underestimation of the effect of potential exposures like diabetes.
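The remark about gaining power by enrolling more controls than cases can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the original paper: it uses the standard large-sample (Woolf) standard error of the log odds ratio, a fixed number of 100 cases, and hypothetical exposure proportions of 40% in cases and 25% in controls.

```python
import math


def log_or_standard_error(n_cases, controls_per_case,
                          p_exposed_cases, p_exposed_controls):
    """Large-sample standard error of ln(odds ratio) for expected cell counts."""
    n_controls = n_cases * controls_per_case
    exposed_cases = n_cases * p_exposed_cases
    unexposed_cases = n_cases * (1 - p_exposed_cases)
    exposed_controls = n_controls * p_exposed_controls
    unexposed_controls = n_controls * (1 - p_exposed_controls)
    return math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                     + 1 / exposed_controls + 1 / unexposed_controls)


# Hypothetical scenario: 100 cases, 40% exposed; controls 25% exposed.
standard_errors = {
    ratio: log_or_standard_error(100, ratio, 0.40, 0.25)
    for ratio in (1, 2, 4, 8)
}

for ratio, se in standard_errors.items():
    print(f"{ratio} control(s) per case: SE of ln(OR) = {se:.3f}")
```

Under these assumptions the standard error shrinks as controls are added, but each doubling helps less than the previous one, because the precision is eventually limited by the fixed number of cases.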
A second point to address is the sampling scheme of controls. As in a cohort study, the source population must be free of the disease of interest at the start of inclusion in the study. A person is eligible to be selected as a control as long as they are free of this disease [ 7,8 ]. Therefore, theoretically, they can be included in the study multiple times, both as control and as a case. One could sample controls in 3 different manners (fig. 1 ). First, one could sample controls at the end of inclusion in the study (traditional case-control study). A second option is to sample controls each time a case occurs (incidence density sampling), and the third option is to sample controls before the first diagnosis of a case, i.e. at the (hypothetical) start of inclusion in the study (case-base study). Each sampling method has different consequences for the interpretation of the odds ratio but this is beyond the scope of this paper. In short, the case-base study and the incidence density sample case-control study will provide an accurate estimate of the relative risk. The traditional case-control study provides an accurate estimate if the disease is sufficiently rare, i.e. the ‘rare disease assumption’.
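The ‘rare disease assumption’ mentioned above can be made concrete with a small calculation. In the traditional design, where controls are drawn from those still free of disease at the end of the period, the exposure odds ratio corresponds to the disease odds ratio in the source population. The sketch below, using hypothetical risks, compares that odds ratio with the risk ratio for a rare and a common disease.

```python
def risk_ratio(risk_exposed, risk_unexposed):
    """Relative risk: risk in the exposed divided by risk in the unexposed."""
    return risk_exposed / risk_unexposed


def disease_odds_ratio(risk_exposed, risk_unexposed):
    """Odds ratio computed from the same two risks."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical risks: both scenarios have a true risk ratio of 2.
rare_risks = (0.002, 0.001)    # rare disease
common_risks = (0.40, 0.20)    # common disease

rr_rare = risk_ratio(*rare_risks)
or_rare = disease_odds_ratio(*rare_risks)
rr_common = risk_ratio(*common_risks)
or_common = disease_odds_ratio(*common_risks)

print(f"Rare disease:   RR = {rr_rare:.3f}, OR = {or_rare:.3f}")
print(f"Common disease: RR = {rr_common:.3f}, OR = {or_common:.3f}")
```

With these numbers the odds ratio is almost identical to the risk ratio when the disease is rare, but noticeably overstates it when the disease is common.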
As in other types of study designs, in a case-control study, multiple types of bias can occur. If cases and controls are not derived from a similar population or if they have different chances of being exposed, selection bias may arise. Another important source of bias, which is particularly characteristic of case-control studies, is recall bias. This occurs when a subject is interviewed to obtain exposure information after the disease has occurred. As the case is suffering from the disease, they may search their memory for any history of exposure. Controls do not have such a stimulus, which possibly results in less accurate or more socially acceptable answers. For example, in a study of risk factors for acute pyelonephritis, women were asked to give information about the number of urinary tract infections (UTI) in their mothers [ 9 ]. Women with pyelonephritis may have asked their mothers about their history of UTI, resulting in accurate estimates of the percentage exposed. However, controls may be unaware of the history of UTI in their mothers, simply because they have not asked them. Therefore, even if there were no association between the exposure and outcome, it would still be possible to find an effect. This bias could be avoided in several ways. One could obtain information regarding exposure from sources that documented the information before the outcome was known, such as medical records. However, such records may not be available for healthy individuals. It is also possible to send the questionnaire by mail so that cases and controls both have enough time to obtain accurate answers. Finally, it is important to use trained interviewers, preferably unaware (‘blind’) of the subject’s disease status, and to use standardized questionnaires. Another solution is to obtain information from persons who have had a similar stimulus, such as women with a different disease, resulting in an equal motivation to recall potentially relevant exposures. However, if both diseases have ‘joint’ exposures, this would result in an underestimation of the effect of the exposure on the disease under study.
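How recall bias can create an apparent effect where none exists can be shown with a small calculation. This is a simplified sketch with hypothetical numbers: the true exposure prevalence is assumed to be 30% in both groups, cases are assumed to report 90% of true exposures and controls only 60%, and nobody is assumed to report an exposure that did not occur.

```python
def reported_prevalence(true_prevalence, recall_sensitivity):
    """Proportion reporting exposure when only true exposures can be reported."""
    return true_prevalence * recall_sensitivity


def odds_ratio(p_exposed_cases, p_exposed_controls):
    """Odds ratio from the proportion exposed in cases and in controls."""
    odds_cases = p_exposed_cases / (1 - p_exposed_cases)
    odds_controls = p_exposed_controls / (1 - p_exposed_controls)
    return odds_cases / odds_controls


TRUE_PREVALENCE = 0.30  # hypothetical, identical in cases and controls

true_or = odds_ratio(TRUE_PREVALENCE, TRUE_PREVALENCE)

# Hypothetical recall: cases remember 90% of exposures, controls 60%.
cases_reported = reported_prevalence(TRUE_PREVALENCE, 0.90)
controls_reported = reported_prevalence(TRUE_PREVALENCE, 0.60)
observed_or = odds_ratio(cases_reported, controls_reported)

print(f"True odds ratio:     {true_or:.2f}")
print(f"Observed odds ratio: {observed_or:.2f}")
```

Although the exposure is unrelated to the disease in this scenario, the difference in recall alone produces an observed odds ratio well above 1.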
Although textbooks on epidemiology agree on the use of the terminology regarding case-control studies, authors of many papers have applied the term incorrectly. Recent reviews of articles in the fields of pediatrics and surgery showed that 25–65% of the articles describing themselves as case-control studies in fact had another study design [ 10,11 ]. In these papers, the terms ‘cases’ and ‘controls’ were used to denote subjects who were affected or unaffected by a given risk factor, for example diabetic versus nondiabetic patients. Subjects were then followed forward in time to assess development of another outcome, for example CKD. However, although patients are compared with non-patients, this study design is a cohort study, as subjects are followed in time to assess the effect of an exposure (in this case diabetes) on an outcome (in this case CKD). Although these studies may be of good quality, incorrect naming of the study design may lead to confusion among readers [ 12 ].
In summary, case-control studies are an efficient design to study associations in a relatively cheap and rapid manner. When appropriately designed, they can provide the same information as obtained in a cohort study. Nevertheless, selection of both cases and controls is complex and also case-control studies are subject to multiple sources of bias. Therefore, before starting a case-control study, it is advisable to consult an epidemiologist or statistician.
Cohort studies have an intuitive logic to them, but they can be very problematic when one is investigating outcomes that only occur in a small fraction of exposed and unexposed individuals. They can also be problematic when it is expensive or very difficult to obtain exposure information from a cohort. In these situations a case-control design offers an alternative that is much more efficient. The goal of a case-control study is the same as that of cohort studies, i.e., to estimate the magnitude of association between an exposure and an outcome. However, case-control studies employ a different sampling strategy that gives them greater efficiency.
After completing this module, the student will be able to:
- Define and explain the distinguishing features of a case-control study
- Describe and identify the types of epidemiologic questions that can be addressed by case-control studies
- Define what is meant by the term "source population"
- Describe the purpose of controls in a case-control study
- Describe differences between hospital-based and population-based case-control studies
- Describe the principles of valid control selection
- Explain the importance of using specific diagnostic criteria and explicit case definitions in case-control studies
- Estimate and interpret the odds ratio from a case-control study
- Identify the potential strengths and limitations of case-control studies
Overview of Case-Control Design
In the module entitled Overview of Analytic Studies it was noted that Rothman describes the case-control strategy as follows:
"Case-control studies are best understood by considering as the starting point a source population, which represents a hypothetical study population in which a cohort study might have been conducted. The source population is the population that gives rise to the cases included in the study. If a cohort study were undertaken, we would define the exposed and unexposed cohorts (or several cohorts) and from these populations obtain denominators for the incidence rates or risks that would be calculated for each cohort. We would then identify the number of cases occurring in each cohort and calculate the risk or incidence rate for each. In a case-control study the same cases are identified and classified as to whether they belong to the exposed or unexposed cohort. Instead of obtaining the denominators for the rates or risks, however, a control group is sampled from the entire source population that gives rise to the cases. Individuals in the control group are then classified into exposed and unexposed categories. The purpose of the control group is to determine the relative size of the exposed and unexposed components of the source population. Because the control group is used to estimate the distribution of exposure in the source population, the cardinal requirement of control selection is that the controls be sampled independently of exposure status."
To illustrate this consider the following hypothetical scenario in which the source population is the state of Massachusetts. Diseased individuals are red, and non-diseased individuals are blue. Exposed individuals are indicated by a whitish midsection. Note the following aspects of the depicted scenario:
- The disease is rare.
- There is a fairly large number of exposed individuals in the state, but most of these are not diseased.
If we somehow had exposure and outcome information on all of the subjects in the source population and looked at the association using a cohort design, we might find the data summarized in the contingency table below.
In this hypothetical example, we have data on all 6,000,000 people in the source population, and we could compute the probability of disease (i.e., the risk or incidence) in both the exposed group and the non-exposed group, because we have the denominators for both the exposed and non-exposed groups.
The table above summarizes all of the necessary information regarding exposure and outcome status for the population and enables us to compute a risk ratio as a measure of the strength of the association. Intuitively, we compute the probability of disease (the risk) in each exposure group and then compute the risk ratio as follows:
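The risk ratio calculation described above can be sketched in a few lines of Python. The case counts (700 exposed, 600 unexposed) are the ones given later in this section. The group denominators of 1,000,000 exposed and 5,000,000 unexposed are an assumption, inferred from the 1:5 exposure split and the total of 6,000,000 people mentioned in the text. The variable and function names are illustrative only.

```python
# Counts for the hypothetical Massachusetts source population.
# Case counts come from the text; the group denominators are ASSUMED
# from the 1:5 exposure split of the 6,000,000 residents.
exposed_cases, unexposed_cases = 700, 600
exposed_total, unexposed_total = 1_000_000, 5_000_000


def risk(cases: int, total: int) -> float:
    """Cumulative incidence: cases divided by everyone in the group."""
    return cases / total


risk_exposed = risk(exposed_cases, exposed_total)
risk_unexposed = risk(unexposed_cases, unexposed_total)
risk_ratio = risk_exposed / risk_unexposed

print(f"Risk in exposed:   {risk_exposed:.5f}")
print(f"Risk in unexposed: {risk_unexposed:.5f}")
print(f"Risk ratio:        {risk_ratio:.2f}")
```

Under these assumed denominators the risk ratio comes out at roughly 5.8, which is the value the case-control calculation later in this section is meant to reproduce.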
The problem, of course, is that we usually don't have the resources to get the data on all subjects in the population. If we took a random sample of even 5-10% of the population, we would have few diseased people in our sample, certainly not enough to produce a reasonably precise measure of association. Moreover, we would expend an inordinate amount of effort and money collecting exposure and outcome data on a large number of people who would not develop the outcome.
We need a method that allows us to retain all the people in the numerator of disease frequency (diseased people or "cases") but allows us to collect information from only a small proportion of the people that make up the denominator (population, or "controls"), most of whom do not have the disease of interest. The case-control design allows us to accomplish this. We identify and collect exposure information on all the cases, but identify and collect exposure information on only a sample of the population. Once we have the exposure information, we can assign subjects to the numerator and denominator of the exposed and unexposed groups. This is what Rothman means when he says,
"The purpose of the control group is to determine the relative size of the exposed and unexposed components of the source population."
In the above example, we would have identified all 1,300 cases, determined their exposure status, and ended up categorizing 700 as exposed and 600 as unexposed. We might have randomly sampled 6,000 members of the population (instead of 6 million) in order to determine the exposure distribution in the total population. If our sampling method was random, we would expect that about 1,000 would be exposed and 5,000 unexposed (the same ratio as in the overall population). We then calculate a measure similar to the risk ratio above, but substitute a sample of the population ("controls") for the whole population in the denominator:
Note that when we take a sample of the population, we no longer have a measure of disease frequency, because the denominator no longer represents the population. Therefore, we can no longer compute the probability or rate of disease incidence in each exposure group. We also can't calculate a risk or rate difference measure for the same reason. However, as we have seen, we can compute the relative probability of disease in the exposed vs. unexposed group. The term generally used for this measure is an odds ratio, described in more detail later in the module.
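To make the substitution concrete, here is a minimal sketch of the calculation using the numbers from the text: 700 exposed and 600 unexposed cases, and an expected 1,000 exposed and 5,000 unexposed among 6,000 randomly sampled controls. The function name `odds_ratio` is illustrative, and the full-population split (1,000,000 vs. 5,000,000) is an assumption implied by the 1:5 ratio.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Exposure odds among cases divided by exposure odds among controls."""
    return (exp_cases / unexp_cases) / (exp_controls / unexp_controls)


# All 1,300 cases, classified by exposure (from the text).
exp_cases, unexp_cases = 700, 600

# Expected exposure split among 6,000 randomly sampled controls (from the text).
or_sample = odds_ratio(exp_cases, unexp_cases, 1_000, 5_000)

# Same calculation using the whole population in place of the controls
# (1,000,000 / 5,000,000 is an ASSUMED split implied by the 1:5 ratio).
or_population = odds_ratio(exp_cases, unexp_cases, 1_000_000, 5_000_000)

print(f"Ratio with 6,000 sampled controls: {or_sample:.2f}")
print(f"Ratio with the full population:    {or_population:.2f}")
```

The two results agree because only the ratio of exposed to unexposed controls enters the calculation, so the sampling fraction cancels out as long as sampling is independent of exposure.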
Consequently, when the outcome is uncommon, as in this case, the risk ratio can be estimated much more efficiently by using a case-control design. One would focus first on finding an adequate number of cases in order to determine the ratio of exposed to unexposed cases. Then, one only needs to take a sample of the population in order to estimate the relative size of the exposed and unexposed components of the source population. Note that if one can identify all of the cases that were reported to a registry or other database within a defined period of time, then it is possible to compute an estimate of the incidence of disease if the size of the population is known from census data. While this is conceptually possible, it is rarely done, and we will not discuss it further in this course.
A Nested Case-Control Study
Suppose a prospective cohort study were conducted among almost 90,000 women for the purpose of studying the determinants of cancer and cardiovascular disease. After enrollment, the women provide baseline information on a host of exposures, and they also provide baseline blood and urine samples that are frozen for possible future use. The women are then followed, and, after about eight years, the investigators want to test the hypothesis that past exposure to pesticides such as DDT is a risk factor for breast cancer. Eight years have passed since the beginning of the study, and 1,439 women in the cohort have developed breast cancer. Since they froze blood samples at baseline, they have the option of analyzing all of the blood samples in order to ascertain exposure to DDT at the beginning of the study before any cancers occurred. The problem is that there are almost 90,000 women and it would cost $20 to analyze each of the blood samples. If the investigators could have analyzed all 90,000 samples, they would have found the results in the table below.
Table of Breast Cancer Occurrence Among Women With or Without DDT Exposure
If they had been able to afford analyzing all of the baseline blood specimens in order to categorize the women as having had DDT exposure or not, they would have found a risk ratio = 1.87 (95% confidence interval: 1.66-2.10). The problem is that this would have cost almost $1.8 million, and the investigators did not have the funding to do this.
While 1,439 breast cancers is a disturbing number, it is only 1.6% of the entire cohort, so the outcome is relatively rare, and it is costing a lot of money to analyze the blood specimens obtained from all of the non-diseased women. There is, however, another more efficient alternative, i.e., to use a case-control sampling strategy. One could analyze all of the blood samples from women who had developed breast cancer, but only a sample of the whole cohort in order to estimate the exposure distribution in the population that produced the cases.
If one were to analyze the blood samples of 2,878 of the non-diseased women (twice as many as the number of cases), one would obtain results that would look something like those in the next table.
Odds of Exposure: 360/1,079 in the cases versus 432/2,446 in the non-diseased controls.
Total Samples analyzed = 1,439 + 2,878 = 4,317
Total Cost = 4,317 x $20 = $86,340
With this approach a similar estimate of risk was obtained after analyzing blood samples from only a small sample of the entire population at a fraction of the cost with hardly any loss in precision. In essence, a case-control strategy was used, but it was conducted within the context of a prospective cohort study. This is referred to as a case-control study "nested" within a cohort study.
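As a quick check on the claim that the nested sample gives a similar estimate, the odds ratio can be computed directly from the exposure counts quoted above (360 exposed and 1,079 unexposed cases; 432 exposed and 2,446 unexposed controls). This is only an illustrative sketch; the variable names are not from the original module.

```python
# 2x2 counts from the nested case-control sample described in the text.
cases_exposed, cases_unexposed = 360, 1_079        # breast cancer cases
controls_exposed, controls_unexposed = 432, 2_446  # sampled non-diseased women

odds_cases = cases_exposed / cases_unexposed
odds_controls = controls_exposed / controls_unexposed
odds_ratio = odds_cases / odds_controls

full_cohort_rr = 1.87  # risk ratio reported in the text for the full cohort

print(f"Exposure odds in cases:    {odds_cases:.3f}")
print(f"Exposure odds in controls: {odds_controls:.3f}")
print(f"Odds ratio: {odds_ratio:.2f} (full-cohort risk ratio: {full_cohort_rr})")
```

The odds ratio from the sample is about 1.89, close to the full-cohort risk ratio of 1.87 reported above.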
Rothman states that one should look upon all case-control studies as being "nested" within a cohort. In other words the cohort represents the source population that gave rise to the cases. With a case-control sampling strategy one simply takes a sample of the population in order to obtain an estimate of the exposure distribution within the population that gave rise to the cases. Obviously, this is a much more efficient design.
It is important to note that, unlike cohort studies, case-control studies do not follow subjects through time. Cases are enrolled at the time they develop disease and controls are enrolled at the same time. The exposure status of each is determined, but they are not followed into the future for further development of disease.
As with cohort studies, case-control studies can be prospective or retrospective. At the start of the study, all cases might have already occurred and then this would be a retrospective case-control study. Alternatively, none of the cases might have already occurred, and new cases will be enrolled prospectively. Epidemiologists generally prefer the prospective approach because it has fewer biases, but it is more expensive and sometimes not possible. When conducted prospectively, or when nested in a prospective cohort study, it is straightforward to select controls from the population at risk. However, in retrospective case-control studies, it can be difficult to select from the population at risk, and controls are then selected from those in the population who didn't develop disease. Using only the non-diseased to select controls as opposed to the whole population means the denominator is not really a measure of disease frequency, but when the disease is rare, the odds ratio using the non-diseased will be very similar to the estimate obtained when the entire population is used to sample for controls. This phenomenon is known as the rare-disease assumption. When case-control studies were first developed, most were conducted retrospectively, and it is sometimes assumed that the rare-disease assumption applies to all case-control studies. However, it actually only applies to those case-control studies in which controls are sampled only from the non-diseased rather than the whole population.
The difference between sampling from the whole population and only the non-diseased is that the whole population contains people both with and without the disease of interest. This means that a sampling strategy that uses the whole population as its source must allow for the fact that people who develop the disease of interest can be selected as controls. Students often have a difficult time with this concept. It is helpful to remember that it seems natural that the population denominator includes people who develop the disease in a cohort study. If a case-control study is a more efficient way to obtain the information from a cohort study, then perhaps it is not so strange that the denominator in a case-control study also can include people who develop the disease. This topic is covered in more detail in EP813 Intermediate Epidemiology.
Retrospective and Prospective Case-Control Studies
Students usually think of case-control studies as being only retrospective, since the investigators enroll subjects who have developed the outcome of interest. However, case-control studies, like cohort studies, can be either retrospective or prospective. In a prospective case-control study, the investigator still enrolls based on outcome status, but the investigator must wait for the cases to occur.
When is a Case-Control Study Desirable?
Given the greater efficiency of case-control studies, they are particularly advantageous in the following situations:
- When the disease or outcome being studied is rare.
- When the disease or outcome has a long induction and latent period (i.e., a long time between exposure and the eventual causal manifestation of disease).
- When exposure data is difficult or expensive to obtain.
- When the study population is dynamic.
- When little is known about the risk factors for the disease, case-control studies provide a way of testing associations with multiple potential risk factors. (This isn't really a unique advantage to case-control studies, however, since cohort studies can also assess multiple exposures.)
Another advantage of their greater efficiency, of course, is that they are less time-consuming and much less costly than prospective cohort studies.
The DES Case-Control Study
A classic example of the efficiency of the case-control approach is the study (Herbst et al., N. Engl. J. Med. 1971;284:878-81) that linked in-utero exposure to diethylstilbestrol (DES) with subsequent development of vaginal cancer 15-22 years later. In the late 1960s, physicians at MGH identified a very unusual cancer cluster. Eight young women between the ages of 15 and 22 were found to have cancer of the vagina, an uncommon cancer even in elderly women. The cluster of cases in young women was initially reported as a case series, but there were no strong hypotheses about the cause.
In retrospect, the cause was in-utero exposure to DES. After World War II, DES started being prescribed for women who were having troubles with a pregnancy -- if there were signs suggesting the possibility of a miscarriage, DES was frequently prescribed. It has been estimated that between 1945-1950 DES was prescribed for about 20% of all pregnancies in the Boston area. Thus, the unborn fetus was exposed to DES in utero, and in a very small percentage of cases this resulted in development of vaginal cancer when the child was 15-22 years old (a very long latent period). There were several reasons why a case-control study was the only feasible way to identify this association: the disease was extremely rare (even in subjects who had been exposed to DES), there was a very long latent period between exposure and development of disease, and initially they had no idea what was responsible, so there were many possible exposures to consider.
In this situation, a case-control study was the only reasonable approach to identify the causative agent. Given how uncommon the outcome was, even a large prospective study would have been unlikely to have more than one or two cases, even after 15-20 years of follow-up. Similarly, a retrospective cohort study might have been successful in enrolling a large number of subjects, but the outcome of interest was so uncommon that few, if any, subjects would have had it. In contrast, a case-control study was conducted in which eight known cases and 32 age-matched controls provided information on many potential exposures. This strategy ultimately allowed the investigators to identify a highly significant association between the mother's treatment with DES during pregnancy and the eventual development of adenocarcinoma of the vagina in their daughters (in-utero at the time of exposure) 15 to 22 years later.
For more information see the DES Fact Sheet from the National Cancer Institute.
An excellent summary of this landmark study and the long-range effects of DES can be found in a Perspective article in the New England Journal of Medicine. A cohort of both mothers who took DES and their children (daughters and sons) was later formed to look for more common outcomes. Members of the faculty at BUSPH are on the team of investigators that follow this cohort for a variety of outcomes, particularly reproductive consequences and other cancers.
Selecting & Defining Cases and Controls
The "case" definition.
Careful thought should be given to the case definition to be used. If the definition is too broad or vague, it is easier to capture people with the outcome of interest, but a loose case definition will also capture people who do not have the disease. On the other hand, if an overly restrictive case definition is employed, fewer cases will be captured, and the sample size may be limited. Investigators frequently wrestle with this problem during outbreak investigations. Initially, they will often use a somewhat broad definition in order to identify potential cases. However, as an outbreak investigation progresses, there is a tendency to narrow the case definition to make it more precise and specific, for example by requiring confirmation of the diagnosis by laboratory testing. In general, investigators conducting case-control studies should thoughtfully construct a definition that is as clear and specific as possible without being overly restrictive.
Investigators studying chronic diseases generally prefer newly diagnosed cases, because they tend to be more motivated to participate, may remember relevant exposures more accurately, and because it avoids complicating factors related to selection of longer duration (i.e., prevalent) cases. However, it is sometimes impossible to have an adequate sample size if only recent cases are enrolled.
Sources of Cases
Typical sources for cases include:
- Patient rosters at medical facilities
- Death certificates
- Disease registries (e.g., cancer or birth defect registries; the SEER Program [Surveillance, Epidemiology and End Results] is a federally funded program that identifies newly diagnosed cases of cancer in population-based registries across the US)
- Cross-sectional surveys (e.g., NHANES, the National Health and Nutrition Examination Survey)
Selection of the Controls
As noted above, it is always useful to think of a case-control study as being nested within some sort of a cohort, i.e., a source population that produced the cases that were identified and enrolled. In view of this there are two key principles that should be followed in selecting controls:
- The comparison group ("controls") should be representative of the source population that produced the cases.
- The "controls" must be sampled in a way that is independent of the exposure, meaning that their selection should not be more (or less) likely if they have the exposure of interest.
If either of these principles is not adhered to, selection bias can result (as discussed in detail in the module on Bias).
Note that in the earlier example of a case-control study conducted in the Massachusetts population, we specified that our sampling method was random so that exposed and unexposed members of the population had an equal chance of being selected. Therefore, we expected that about 1,000 would be exposed and 5,000 unexposed (the same ratio as in the whole population), and we came up with an odds ratio that was the same as the hypothetical risk ratio we would have had if we had collected exposure information from the whole population of six million:
What if we had instead been more likely to sample those who were exposed, so that we instead found 1,500 exposed and 4,500 unexposed among the 6,000 controls? Then the odds ratio would have been:
This odds ratio is biased because it differs from the true odds ratio. In this case, the bias stemmed from the fact that we violated the second principle in selection of controls. Depending on which category is over or under-sampled, this type of bias can result in either an underestimate or an overestimate of the true association.
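The effect of violating the second principle can be illustrated numerically. The sketch below uses the case counts and the two control samples from the text (1,000/5,000 for random sampling and 1,500/4,500 when exposed people are over-sampled). The third scenario, in which exposed people are under-sampled (500/5,500), is a hypothetical addition included only to show that the bias can run in either direction.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Exposure odds among cases divided by exposure odds among controls."""
    return (exp_cases / unexp_cases) / (exp_controls / unexp_controls)


exp_cases, unexp_cases = 700, 600  # all 1,300 cases, from the text

# Controls sampled independently of exposure (from the text).
or_unbiased = odds_ratio(exp_cases, unexp_cases, 1_000, 5_000)

# Exposed people over-represented among the 6,000 controls (from the text).
or_oversampled = odds_ratio(exp_cases, unexp_cases, 1_500, 4_500)

# HYPOTHETICAL: exposed people under-represented among the 6,000 controls.
or_undersampled = odds_ratio(exp_cases, unexp_cases, 500, 5_500)

print(f"Independent sampling:       OR = {or_unbiased:.2f}")
print(f"Exposed over-sampled:       OR = {or_oversampled:.2f}")
print(f"Exposed under-sampled (hypothetical): OR = {or_undersampled:.2f}")
```

Over-sampling exposed controls pulls the odds ratio down to 3.5, an underestimate of the value of about 5.8 obtained with independent sampling, while the hypothetical under-sampling pushes it well above that value.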
A hypothetical case-control study was conducted to determine whether lower socioeconomic status (the exposure) is associated with a higher risk of cervical cancer (the outcome). The "cases" consisted of 250 women with cervical cancer who were referred to Massachusetts General Hospital for treatment. They were referred from all over the state. The cases were asked a series of questions relating to socioeconomic status (household income, employment, education, etc.). The investigators identified control subjects by going door-to-door in the community around MGH from 9:00 AM to 5:00 PM. Many residents were not home, but the investigators persisted and eventually enrolled enough controls. The problem is that the controls were selected by a different mechanism than the cases, AND the selection mechanism may have tended to select individuals of different socioeconomic status, since women who were at home may have been somewhat more likely to be unemployed. In other words, the controls were more likely to be enrolled (selected) if they had the exposure of interest (lower socioeconomic status).
Sources for "Controls"
A population-based case-control study is one in which the cases come from a precisely defined population, such as a fixed geographic area, and the controls are sampled directly from the same population. In this situation cases might be identified from a state cancer registry, for example, and the comparison group would logically be selected at random from the same source population. Population controls can be identified from voter registration lists, tax rolls, driver's license lists, and telephone directories or by "random digit dialing". Population controls may be more difficult to obtain, however, because of lack of interest in participating, and there may be recall bias, since population controls are generally healthy and may remember past exposures less accurately.
Example of a Population-based Case-Control Study: Rollison et al. reported on a "Population-based Case-Control Study of Diabetes and Breast Cancer Risk in Hispanic and Non-Hispanic White Women Living in US Southwestern States". (Citation: Am J Epidemiol 2008;167:447–456).
"Briefly, a population-based case-control study of breast cancer was conducted in Colorado, New Mexico, Utah, and selected counties of Arizona. For investigation of differences in the breast cancer risk profiles of non-Hispanic Whites and Hispanics, sampling was stratified by race/ethnicity, and only women who self-reported their race as non-Hispanic White, Hispanic, or American Indian were eligible, with the exception of American Indian women living on reservations. Women diagnosed with histologically confirmed breast cancer between October 1999 and May 2004 (International Classification of Diseases for Oncology codes C50.0–C50.6 and C50.8–C50.9) were identified as cases through population-based cancer registries in each state."
"Population-based controls were frequency-matched to cases in 5-year age groups. In New Mexico and Utah, control participants under age 65 years were randomly selected from driver's license lists; in Arizona and Colorado, controls were randomly selected from commercial mailing lists, since driver's license lists were unavailable. In all states, women aged 65 years or older were randomly selected from the lists of the Centers for Medicare and Medicaid Services (Social Security lists). Of all women contacted, 68 percent of cases and 42 percent of controls participated in the study."
"Odds ratios and 95% confidence intervals were calculated using logistic regression, adjusting for age, body mass index at age 15 years, and parity. Having any type of diabetes was not associated with breast cancer overall (odds ratio = 0.94, 95% confidence interval: 0.78, 1.12). Type 2 diabetes was observed among 19% of Hispanics and 9% of non-Hispanic Whites but was not associated with breast cancer in either group."
In this example, it is clear that the controls were selected from the source population (principle 1), but less clear that they were enrolled independent of exposure status (principle 2), both because drivers' licenses were used for selection and because the participation rate among controls was low. These factors would only matter if they impacted on the estimate of the proportion of the population who had diabetes.
Hospital or Clinic Controls:
When controls are drawn from patients at the same hospital or clinic as the cases, two criteria should be met:
- First, control patients should have diseases that are unrelated to the exposure being studied. For example, for a study examining the association between smoking and lung cancer, it would not be appropriate to include patients with cardiovascular disease as controls, since smoking is a risk factor for cardiovascular disease. To include such patients as controls would result in an underestimate of the true association.
- Second, control patients in the comparison should have diseases with similar referral patterns as the cases, in order to minimize selection bias. For example, if the cases are women with cervical cancer who have been referred from all over the state, it would be inappropriate to use controls consisting of women with diabetes who had been referred primarily from local health centers in the immediate vicinity of the hospital. Similarly, it would be inappropriate to use patients from the emergency room, because the selection of a hospital for an emergency is different than for cancer, and this difference might be related to the exposure of interest.
The advantages of using controls who are patients from the same facility are:
- They are easier to identify
- They are more likely to participate than general population controls.
- They minimize selection bias because they generally come from the same source population (provided referral patterns are similar).
- Recall bias would be minimized, because they are sick, but with a different diagnosis.
Example: Several years ago the vascular surgeons at Boston Medical Center wanted to study risk factors for severe atherosclerosis of the lower extremities. The cases were patients who were referred to the hospital for elective surgery to bypass severe atherosclerotic blockages in the arteries to the legs. The controls consisted of patients who were admitted to the same hospital for elective joint replacement of the hip or knee. The patients undergoing joint replacement were similar in age and they also were following the same referral pathways. In other words, they met the "would" criterion: if one of the joint replacement surgery patients had developed severe atherosclerosis in their leg arteries, they would have been referred to the same hospital.
Friend, Neighbor, Spouse, and Relative Controls:
Occasionally investigators will ask cases to nominate controls who are in one of these categories, because they have similar characteristics, such as genotype, socioeconomic status, or environment, i.e., factors that can cause confounding, but are hard to measure and adjust for. By matching cases and controls on these factors, confounding by these factors will be controlled. However, one must be careful that the controls satisfy the two fundamental principles. Often, they do not.
How Many Controls?
Since case-control studies are often used for uncommon outcomes, investigators often have a limited number of cases but a plentiful supply of potential controls. In this situation the statistical power of the study can be increased somewhat by enrolling more controls than cases. However, the additional power that is achieved diminishes as the ratio of controls to cases increases, and ratios greater than 4:1 have little additional impact on power. Consequently, if it is time-consuming or expensive to collect data on controls, the ratio of controls to cases should be no more than 4:1. However, if the data on controls is easily obtained, there is no reason to limit the number of controls.
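The diminishing return from additional controls can be illustrated with a commonly cited approximation: with r controls per case, the statistical efficiency relative to having an unlimited number of controls is roughly r / (r + 1). This approximation is not stated in the module itself and holds best when the odds ratio is close to 1, so the sketch below should be read as a rough guide rather than a power calculation. The function name is illustrative.

```python
def relative_efficiency(controls_per_case: float) -> float:
    """Approximate efficiency relative to unlimited controls: r / (r + 1).

    ASSUMPTION: textbook approximation, most accurate for odds ratios near 1.
    """
    return controls_per_case / (controls_per_case + 1)


previous = 0.0
for r in (1, 2, 3, 4, 5, 10):
    eff = relative_efficiency(r)
    gain = eff - previous
    print(f"{r:>2}:1 ratio -> efficiency {eff:.0%} (gain of {gain:.0%})")
    previous = eff
```

Going from one to four controls per case raises the approximate efficiency from 50% to 80%, whereas going from four to ten adds only about another 11 percentage points, which is consistent with the 4:1 rule of thumb described above.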
Methods of Control Sampling
There are three strategies for selecting controls that are best explained by considering the nested case-control study described earlier in this module:
- Survivor sampling: This is the most common method. Controls consist of individuals from the source population who do not have the outcome of interest.
- Case-base sampling (also known as "case-cohort" sampling): Controls are selected from the population at risk at the beginning of the follow-up period in the cohort study within which the case-control study was nested.
- Risk Set Sampling: In the nested case-control study a control would be selected from the population at risk at the point in time when a case was diagnosed.
The Rare Outcome Assumption
It is often said that an odds ratio provides a good estimate of the risk ratio only when the outcome of interest is rare, but this is only true when survivor sampling is used. With case-base sampling or risk set sampling, the odds ratio will provide a good estimate of the risk ratio regardless of the frequency of the outcome, because the controls will provide an accurate estimate of the exposure distribution in the whole source population (i.e., not just in non-diseased people).
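A small numeric sketch may help show why the rare outcome assumption matters only for survivor sampling. The cohort below is entirely hypothetical and deliberately uses a common outcome. Rather than drawing random samples, the code uses the exposure odds that a large control sample would be expected to show under each strategy; only survivor sampling and case-base sampling are compared.

```python
# HYPOTHETICAL closed cohort with a common outcome (all numbers invented).
exposed_total, unexposed_total = 1_000, 1_000
exposed_cases, unexposed_cases = 400, 200

risk_ratio = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

# Exposure odds among the cases (identical under both strategies).
case_odds = exposed_cases / unexposed_cases

# Survivor sampling: controls come only from people who stayed disease-free,
# so their expected exposure odds are 600 / 800.
survivor_control_odds = (exposed_total - exposed_cases) / (
    unexposed_total - unexposed_cases
)
or_survivor = case_odds / survivor_control_odds

# Case-base sampling: controls come from everyone at risk at baseline,
# so their expected exposure odds are 1,000 / 1,000.
case_base_control_odds = exposed_total / unexposed_total
or_case_base = case_odds / case_base_control_odds

print(f"Risk ratio in the full cohort: {risk_ratio:.2f}")
print(f"OR with survivor sampling:     {or_survivor:.2f}")
print(f"OR with case-base sampling:    {or_case_base:.2f}")
```

With this common outcome, survivor sampling gives an odds ratio of about 2.67 against a true risk ratio of 2.0, while case-base sampling recovers 2.0. If the outcome were rare, the non-diseased group would be nearly the whole cohort and the two odds ratios would almost coincide.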
More on Selection Bias
Always consider the source population for case-control studies, i.e. the "population" that generated the cases. The cases are always identified and enrolled by some method or a set of procedures or circumstances. For example, cases with a certain disease might be referred to a particular tertiary hospital for specialized treatment. Alternatively, if there is a database or a disease registry for a geographic area, cases might be selected at random from the database. The key to avoiding selection bias is to select the controls by a similar, if not identical, mechanism in order to ensure that the controls provide an accurate representation of the exposure status of the source population.
Example 1: In the first example above, in which cases were randomly selected from a geographically defined database, the source population is also defined geographically, so it would make sense to select population controls by some random method. In contrast, if one enrolled controls from a particular hospital within the geographic area, one would have to at least consider whether the controls were inherently more or less likely to have the exposure of interest. If so, they would not provide an accurate estimate of the exposure distribution of the source population, and selection bias would result.
Example 2: In the second example above, the source population was defined by the patterns of referral to a particular hospital for a particular disease. In order for the controls to be representative of the "population" that produced those cases, the controls should be selected by a similar mechanism, e.g., by contacting the referring health care providers and asking them to provide the names of potential controls. By this mechanism, one can ensure that the controls are representative of the source population, because if they had had the disease of interest they would have been just as likely as the cases to have been included in the case group (thus fulfilling the "would" criterion).
Example 3: A food handler at a delicatessen who is infected with hepatitis A virus is responsible for an outbreak of hepatitis which is largely confined to the surrounding community from which most of the customers come. Many (but not all) of the infected cases are identified by passive and active surveillance. How should controls be selected? In this situation, one might guess that the likelihood of people going to the delicatessen would be heavily influenced by their proximity to it, and this would to a large extent define the source population. In a case-control study undertaken to identify the source, the delicatessen is one of the exposures being tested. Consequently, even if the cases were reported to the state-wide surveillance system, it would not be appropriate to randomly select controls from the state, the county, or even the town where the delicatessen is located. In other words, the "would" criterion doesn't work here, because anyone in the state with clinical hepatitis would end up in the surveillance system, but someone who lived far from the deli would have a much lower likelihood of having the exposure. A better approach would be to select controls who were matched to the cases by neighborhood, age, and gender. These controls would have similar access to go to the deli if they chose to, and they would therefore be more representative of the source population.
Analysis of Case-Control Studies
The computation and interpretation of the odds ratio in a case-control study has already been discussed in the modules on Overview of Analytic Studies and Measures of Association. Additionally, one can compute the confidence interval for the odds ratio, and statistical significance can also be evaluated by using a chi-square test (or a Fisher's Exact Test if the sample size is small) to compute a p-value. These calculations can be done using the Case-Control worksheet in the Excel file called EpiTools.XLS.
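For readers without the spreadsheet at hand, the same quantities can be sketched with the Python standard library. This is not the EpiTools worksheet; it is an illustrative alternative that assumes a log-based (Woolf) confidence interval and a Pearson chi-square test without continuity correction, and the function name `analyze_2x2` is hypothetical. The example data are the exposure counts from the nested DDT case-control sample earlier in this module.

```python
import math


def analyze_2x2(a, b, c, d, z=1.96):
    """Odds ratio, approximate 95% CI, chi-square statistic and p-value.

    a, b = exposed / unexposed cases; c, d = exposed / unexposed controls.
    ASSUMES a Woolf (log-based) interval and an uncorrected Pearson chi-square.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)

    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, lower, upper, chi2, p_value


# Exposure counts from the nested DDT example in this module.
or_, lo, hi, chi2, p = analyze_2x2(360, 1_079, 432, 2_446)
print(f"Odds ratio: {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
print(f"Chi-square: {chi2:.1f}, p-value: {p:.2g}")
```

For small samples, where expected cell counts are low, an exact test such as Fisher's would be preferred over the chi-square approximation used here, as the text notes.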
Advantages and Disadvantages of Case-Control Studies
Advantages:
- They are efficient for rare diseases or diseases with a long latency period between exposure and disease manifestation.
- They are less costly and less time-consuming; they are advantageous when exposure data is expensive or hard to obtain.
- They are advantageous when studying dynamic populations in which follow-up is difficult.
Disadvantages:
- They are subject to selection bias.
- They are inefficient for rare exposures.
- Information on exposure is subject to observation bias.
- They generally do not allow calculation of incidence (absolute risk).