Enago Academy

Unraveling Research Population and Sample: Understanding their role in statistical inference

' src=

Research population and sample serve as the cornerstones of any scientific inquiry. They hold the power to unlock the mysteries hidden within data. Understanding the dynamics between the research population and sample is crucial for researchers. It ensures the validity, reliability, and generalizability of their findings. In this article, we uncover the profound role of the research population and sample, unveiling their differences and importance that reshapes our understanding of complex phenomena. Ultimately, this empowers researchers to make informed conclusions and drive meaningful advancements in our respective fields.

Table of Contents

What Is Population?

The research population, also known as the target population, refers to the entire group or set of individuals, objects, or events that possess specific characteristics and are of interest to the researcher. It represents the larger population from which a sample is drawn. The research population is defined based on the research objectives and the specific parameters or attributes under investigation. For example, in a study on the effects of a new drug, the research population would encompass all individuals who could potentially benefit from or be affected by the medication.

When Is Data Collection From a Population Preferred?

In certain scenarios where a comprehensive understanding of the entire group is required, it becomes necessary to collect data from a population. Here are a few situations when one prefers to collect data from a population:

1. Small or Accessible Population

When the research population is small or easily accessible, it may be feasible to collect data from the entire population. This is often the case in studies conducted within specific organizations, small communities, or well-defined groups where the population size is manageable.

2. Census or Complete Enumeration

In some cases, such as government surveys or official statistics, a census or complete enumeration of the population is necessary. This approach aims to gather data from every individual or entity within the population. This is typically done to ensure accurate representation and eliminate sampling errors.

3. Unique or Critical Characteristics

If the research focuses on a specific characteristic or trait that is rare and critical to the study, collecting data from the entire population may be necessary. This could be the case in studies related to rare diseases, endangered species, or specific genetic markers.

4. Legal or Regulatory Requirements

Certain legal or regulatory frameworks may require data collection from the entire population. For instance, government agencies might need comprehensive data on income levels, demographic characteristics, or healthcare utilization for policy-making or resource allocation purposes.

5. Precision or Accuracy Requirements

In situations where a high level of precision or accuracy is necessary, researchers may opt for population-level data collection. By doing so, they mitigate the potential for sampling error and obtain more reliable estimates of population parameters.

What Is a Sample?

A sample is a subset of the research population that is carefully selected to represent its characteristics. Researchers study this smaller, manageable group to draw inferences that they can generalize to the larger population. The selection of the sample must be conducted in a manner that ensures it accurately reflects the diversity and pertinent attributes of the research population. By studying a sample, researchers can gather data more efficiently and cost-effectively compared to studying the entire population. The findings from the sample are then extrapolated to make conclusions about the larger research population.

What Is Sampling and Why Is It Important?

Sampling refers to the process of selecting a sample from a larger group or population of interest in order to gather data and make inferences. The goal of sampling is to obtain a sample that is representative of the population, meaning that the sample accurately reflects the key attributes, variations, and proportions present in the population. By studying the sample, researchers can draw conclusions or make predictions about the larger population with a certain level of confidence.

Collecting data from a sample, rather than the entire population, offers several advantages and is often necessary due to practical constraints. Here are some reasons to collect data from a sample:

what is population and sample in research with example

1. Cost and Resource Efficiency

Collecting data from an entire population can be expensive and time-consuming. Sampling allows researchers to gather information from a smaller subset of the population, reducing costs and resource requirements. It is often more practical and feasible to collect data from a sample, especially when the population size is large or geographically dispersed.

2. Time Constraints

Conducting research with a sample allows for quicker data collection and analysis compared to studying the entire population. It saves time by focusing efforts on a smaller group, enabling researchers to obtain results more efficiently. This is particularly beneficial in time-sensitive research projects or situations that necessitate prompt decision-making.

3. Manageable Data Collection

Working with a sample makes data collection more manageable . Researchers can concentrate their efforts on a smaller group, allowing for more detailed and thorough data collection methods. Furthermore, it is more convenient and reliable to store and conduct statistical analyses on smaller datasets. This also facilitates in-depth insights and a more comprehensive understanding of the research topic.

4. Statistical Inference

Collecting data from a well-selected and representative sample enables valid statistical inference. By using appropriate statistical techniques, researchers can generalize the findings from the sample to the larger population. This allows for meaningful inferences, predictions, and estimation of population parameters, thus providing insights beyond the specific individuals or elements in the sample.

5. Ethical Considerations

In certain cases, collecting data from an entire population may pose ethical challenges, such as invasion of privacy or burdening participants. Sampling helps protect the privacy and well-being of individuals by reducing the burden of data collection. It allows researchers to obtain valuable information while ensuring ethical standards are maintained .

Key Steps Involved in the Sampling Process

Sampling is a valuable tool in research; however, it is important to carefully consider the sampling method, sample size, and potential biases to ensure that the findings accurately represent the larger population and are valid for making conclusions and generalizations. While the specific steps may vary depending on the research context, here is a general outline of the sampling process:

what is population and sample in research with example

1. Define the Population

Clearly define the target population for your research study. The population should encompass the group of individuals, elements, or units that you want to draw conclusions about.

2. Define the Sampling Frame

Create a sampling frame, which is a list or representation of the individuals or elements in the target population. The sampling frame should be comprehensive and accurately reflect the population you want to study.

3. Determine the Sampling Method

Select an appropriate sampling method based on your research objectives, available resources, and the characteristics of the population. You can perform sampling by either utilizing probability-based or non-probability-based techniques. Common sampling methods include random sampling, stratified sampling, cluster sampling, and convenience sampling.

4. Determine Sample Size

Determine the desired sample size based on statistical considerations, such as the level of precision required, desired confidence level, and expected variability within the population. Larger sample sizes generally reduce sampling error but may be constrained by practical limitations.

5. Collect Data

Once the sample is selected using the appropriate technique, collect the necessary data according to the research design and data collection methods . Ensure that you use standardized and consistent data collection process that is also appropriate for your research objectives.

6. Analyze the Data

Perform the necessary statistical analyses on the collected data to derive meaningful insights. Use appropriate statistical techniques to make inferences, estimate population parameters, test hypotheses, or identify patterns and relationships within the data.

Population vs Sample — Differences and examples

While the population provides a comprehensive overview of the entire group under study, the sample, on the other hand, allows researchers to draw inferences and make generalizations about the population. Researchers should employ careful sampling techniques to ensure that the sample is representative and accurately reflects the characteristics and variability of the population.

what is population and sample in research with example

Research Study: Investigating the prevalence of stress among high school students in a specific city and its impact on academic performance.

Population: All high school students in a particular city

Sampling Frame: The sampling frame would involve obtaining a comprehensive list of all high schools in the specific city. A random selection of schools would be made from this list to ensure representation from different areas and demographics of the city.

Sample: Randomly selected 500 high school students from different schools in the city

The sample represents a subset of the entire population of high school students in the city.

Research Study: Assessing the effectiveness of a new medication in managing symptoms and improving quality of life in patients with the specific medical condition.

Population: Patients diagnosed with a specific medical condition

Sampling Frame: The sampling frame for this study would involve accessing medical records or databases that include information on patients diagnosed with the specific medical condition. Researchers would select a convenient sample of patients who meet the inclusion criteria from the sampling frame.

Sample: Convenient sample of 100 patients from a local clinic who meet the inclusion criteria for the study

The sample consists of patients from the larger population of individuals diagnosed with the medical condition.

Research Study: Investigating community perceptions of safety and satisfaction with local amenities in the neighborhood.

Population: Residents of a specific neighborhood

Sampling Frame: The sampling frame for this study would involve obtaining a list of residential addresses within the specific neighborhood. Various sources such as census data, voter registration records, or community databases offer the means to obtain this information. From the sampling frame, researchers would randomly select a cluster sample of households to ensure representation from different areas within the neighborhood.

Sample: Cluster sample of 50 households randomly selected from different blocks within the neighborhood

The sample represents a subset of the entire population of residents living in the neighborhood.

To summarize, sampling allows for cost-effective data collection, easier statistical analysis, and increased practicality compared to studying the entire population. However, despite these advantages, sampling is subject to various challenges. These challenges include sampling bias, non-response bias, and the potential for sampling errors.

To minimize bias and enhance the validity of research findings , researchers should employ appropriate sampling techniques, clearly define the population, establish a comprehensive sampling frame, and monitor the sampling process for potential biases. Validating findings by comparing them to known population characteristics can also help evaluate the generalizability of the results. Properly understanding and implementing sampling techniques ensure that research findings are accurate, reliable, and representative of the larger population. By carefully considering the choice of population and sample, researchers can draw meaningful conclusions and, consequently, make valuable contributions to their respective fields of study.

Now, it’s your turn! Take a moment to think about a research question that interests you. Consider the population that would be relevant to your inquiry. Who would you include in your sample? How would you go about selecting them? Reflecting on these aspects will help you appreciate the intricacies involved in designing a research study. Let us know about it in the comment section below or reach out to us using  #AskEnago  and tag  @EnagoAcademy  on  Twitter ,  Facebook , and  Quora .

' src=

Thank you very much, this is helpful

Very impressive and helpful and also easy to understand….. Thanks to the Author and Publisher….

Rate this article Cancel Reply

Your email address will not be published.

what is population and sample in research with example

Enago Academy's Most Popular Articles

Gender Bias in Science Funding

  • Diversity and Inclusion
  • Trending Now

The Silent Struggle: Confronting gender bias in science funding

In the 1990s, Dr. Katalin Kariko’s pioneering mRNA research seemed destined for obscurity, doomed by…

Content Analysis vs Thematic Analysis: What's the difference?

  • Reporting Research

Choosing the Right Analytical Approach: Thematic analysis vs. content analysis for data interpretation

In research, choosing the right approach to understand data is crucial for deriving meaningful insights.…

Addressing Biases in the Journey of PhD

Addressing Barriers in Academia: Navigating unconscious biases in the Ph.D. journey

In the journey of academia, a Ph.D. marks a transitional phase, like that of a…

Cross-sectional and Longitudinal Study Design

Comparing Cross Sectional and Longitudinal Studies: 5 steps for choosing the right approach

The process of choosing the right research design can put ourselves at the crossroads of…

Networking in Academic Conferences

  • Career Corner

Unlocking the Power of Networking in Academic Conferences

Embarking on your first academic conference experience? Fear not, we got you covered! Academic conferences…

Choosing the Right Analytical Approach: Thematic analysis vs. content analysis for…

Comparing Cross Sectional and Longitudinal Studies: 5 steps for choosing the right…

Research Recommendations – Guiding policy-makers for evidence-based decision making

what is population and sample in research with example

Sign-up to read more

Subscribe for free to get unrestricted access to all our resources on research writing and academic publishing including:

  • 2000+ blog articles
  • 50+ Webinars
  • 10+ Expert podcasts
  • 50+ Infographics
  • 10+ Checklists
  • Research Guides

We hate spam too. We promise to protect your privacy and never spam you.

I am looking for Editing/ Proofreading services for my manuscript Tentative date of next journal submission:

what is population and sample in research with example

As a researcher, what do you consider most when choosing an image manipulation detector?

  • How it works

researchprospect post subheader

Population vs Sample – Definitions, Types & Examples

Published by Alvin Nicolas at September 20th, 2021 , Revised On July 19, 2023

Wondering who wins in the Population vs. Sample battle? Don’t know which one to choose for your survey?

If you are hunting similar questions, congratulations, you have come to the right place.

The Sample and Population sections tend to be a stumbling block for most students, if not all. And if you are one of those people, now is the perfect time to seize an opportunity. This guide contains all the information in the world to sweep through the methodology section of your dissertation proficiently.

Sounds interesting? Let’s get started then!

What is Population in Research?

Population in the research market comprises all the members of a defined group that you generalize to find the results of your study. This means the exact population will always depend on the scope of your respected study. Population in research is not limited to assessing humans; it can be any data parameter, including events, objects, histories, and more possessing a common trait. The measurable quality of the population is called a parameter .

For instance…

If you are to evaluate findings for Health Concerns of Women , you might have to consider all the women in the world that are dead, alive, and will live in the future.

Types of Population

Though there are different types and sub-categories of population, below are the four most common yet important ones to consider.

Types of Population

Countable Population

As the term itself explains, this type of population is one that can be numbered and calculated. It is also known as  finite population . An example of a finite or countable population would be all the students in a college or potential buyers of a brand. A countable population in statistical analysis is thought to be of more benefit than other types.

Uncountable Population

The uncountable population, primarily known as an infinite population, is where the counting units are beyond one’s consideration and capabilities. For instance, the number of rice grains in the field. Or the total number of protons and electrons on a blank page. The fact that this type of population cannot be calculated often leaves room for error and uncertainty.

Hypothetical Population 

This is the population whose unit is not available in a tangible form. Although the population in research analysis includes all sets of possible observations, events, and objects, there still are situations that can only be hypothetical. The perfect example to explain this would be the population of the world. You can give an estimated and hypothetical value gathered by different governments, but can you count all humans existing on the planet? Certainly, no! Another example would be the outcome of rolling dice.

Existent Population

The existent population is the opposite of a hypothetical population, i.e., everything is countable in a concrete form. All the notebooks and pens of students of a particular class could be an example of an existent population.

Is all clear?

Let us move on to the next important term of this guide.

What is Sample in Research?

In quantitative research methodology , the sample is a set of collected data from a defined procedure. It is basically a much smaller part of the whole, i.e., population. The sample depicts all the members of the population that are under observation when conducting research surveys . It can be further assessed to find out about the behavior of the entire population data. The measurable quality of the sample is called a statistic .

Say you send a research questionnaire to all the 200 contacts on your phone, and 42 of them end up filling up the forms. Your sample here is the 42 contacts that participated in the study. The rest of the people who did not participate but were sent invitations become part of your  sampling frame . The sampling frame is the group of people who could possibly be in your research or can be a good fit, which here are the 158 people on your phone.

Can you think of more examples? 

Before we start with the sampling types, here are a few other terminologies related to sampling for a better understanding.

Sample Size : the total number of people selected for the survey/study

Sample Technique : The technique you use in order to get your desired sample size.

Pro Tip: Use a sample for your research when you have a larger population, and you want to generalize your findings for the entire population from this sample.

What data collection best suits your research?

  • Find out by hiring an expert from ResearchProspect today!
  • Despite how challenging the subject may be, we are here to help you.

data collection

Types of Sampling Methods

There are two major types of sampling; Probability Sampling and Non-probability Sampling.

Probability Sampling

In this type of sampling, the researcher tends to set a selection of a few criteria and selects members of a population randomly. This means all the members have an equal chance to be a part of the study.

For example, you are to examine a bag containing rice or some other food item. Now any small portion or part you take for observation will be a true representative of the whole food bag.

It is further divided into the following five types:

Probability Sampling

  • Simple Random Sampling

In this type of probability sampling, the members of the study are chosen by chance or randomly. Wondering if this affects the overall quality of your research? Well, it does not. The fact that every member has an equal chance of being selected, this random selection will do just as fine and speak well for the whole group. The only thing you need to make sure of is that the population is  homogenous , like the bag of rice.

  • Systematic Sampling

In systematic sampling, the researcher will select a member after a fixed interval of time. The member selected for the study after this fixed interval is known as the  Kth element.  

For example, if the researcher decides to select a member occurring after every 30 members, the Kth element here would be the 30th element.

  • Stratified Random Sampling

If you know the meaning of strata, you might have guessed by now what stratified random sampling is. So, in this type of sampling, the population is first divided into sub-categories. There is no hard and fast rule for it; it is all done randomly.

So, when do we need this kind of sampling?

Stratified random sampling is adopted when the population is not homogenous. It is first divided into groups and categories based on similarities, and later members from each group are randomly selected. The idea is to address the problem of less homogeneity of the population to get a truly representative sample.

  • Cluster Sampling

This is where researchers divide the population into clusters that tend to represent the whole population. They are usually divided based on demographic parameters , such as location, age, and sex. It can be a little difficult than the ones earlier mentioned, but cluster sampling is one of the most effective ways to derive interface from the feedback.

For example, suppose the United States government wishes to evaluate the number of people taking the first dose of the COVID-19 vaccine. In that case, they can divide it into groups based on various country estates. Not only will the results be accurate using this sampling method, but it will also be easier for future diagnoses.

  • Multi-stage Sampling

Multi-stage sampling is similar to cluster sampling, but let’s say, a complex form of it. In this type of cluster sampling, all the clusters are further divided into sub-clusters. It involves multiple stages, thus the name. Initially, the naturally occurring categories in a population are chosen as clusters, then each cluster is categorized into smaller clusters, and lastly, members are selected from each smaller cluster.

How many stages are enough?

Well, that depends on the nature of your study/research. For some, two to three would be more than enough, while others can take up to 10 rounds or more.

Non-Probability Sampling

Non-probability sampling is the other sampling type where you cannot calculate the probability or chances of any members selected for research. In other words, it is everything the probability sampling is NOT. We just figured out that probability sampling includes selection by chance; this one depends on the subjective judgment of the researcher.

For example, one member might have a 20 percent chance of getting selected in non-probability sampling, while another could have a 60 percent chance.

Get statistical analysis help at an affordable price

  • An expert statistician will complete your work
  • Rigorous quality checks
  • Confidentiality and reliability
  • Any statistical software of your choice
  • Free Plagiarism Report

statistical analysis

Which type of sampling do you think is better?

The debate on this might prevail forever because there is no correct answer for this. Both have their advantages and disadvantages. While non-probability sampling cannot be reliable, it does save your time and costs. Similarly, if probability sampling yields accurate results, it also is not easy to use and sometimes impossible to be conducted, especially when you have a small population at hand.

Types of Non-Probability Sampling

The Four types of non-probability sampling are:

  • Convenience Sampling

Convenience sampling relies on the ease of access to specific subjects such as students in the college café or pedestrians on the road. If the researcher can conveniently get the sample for their study, it will fall under this type of sampling. This type of sampling is usually effective when researchers lack time, resources, and money. They have almost zero authority to choose the sample elements and are purely done on immediacy. You send your questionnaire to random contacts on your phone would be convenience sampling as you did not walk extra miles to get the job done.

  • Purposive Sampling

Purposive sampling is also known as judgmental sampling because researchers here would effectively consider the study’s purpose and some understanding of what to expect from the target audience. In other words, the target audience is defined here. For instance, if a study is conducted exclusively for Coronavirus patients, all others not affected by the virus will automatically be rejected or excluded from the study.

  • Quota Sampling

For quota sampling, you need to have a pre-set standard of sample selection. What happens in quota sampling is that the sample is formed on the basis of specific attributes so that the qualities of this sample can be found in the total population. Slightly complex but worth the hassle.

  • Snowball Sampling

Lastly, this type of non-probability sampling is applied when the subjects are rare and difficult to get. For example, if you are to trace and research drug dealers, it would be almost impossible to get them interviewed for the study. This is where snowball sampling comes into play. Similarly, writing a paper on the mental health of rape victims would also be a hard row to hoe. In such a situation, you will only tract a few sources/members and base the rest of your research on it.

To put it briefly, your sample is the group of people participating in the study, while the population is the total number of people to whom the results will apply. As an analogy, if the sample is the garden in your house, the population will be the forests out there.

Now that you have all the details on these two,  can you spot three differences between population and sample ?

Well, we are sure you can give more than just three.

Here are a few differences in case you need a quick revision.

Differences between Population and Sample

This brings us to the end of this guide. We hope you are now clear on these topics and have made up your mind to use a sample for your research or population. The final choice is yours; however, make sure to keep all the above-mentioned facts and particulars in mind and see what works best for you.

Meanwhile, if you have questions and queries or wish to add to this guide, please drop a comment in the comments section below.

FAQs About Population vs. Sample

How can you identify a sample and population.

Sample is the specific group you collect data from, and the population is the entire group you deduce conclusions about. The population is the bigger sample size.

What is a population parameter?

Parameter is some characteristic of the population that cannot be studied directly. It is usually estimated by numbers and figures calculated from the sample data.

Is it better to use a sample instead of a population?

Yes, if you looking for a cost-effective and easier way, a sample is the better option.

What is an example of statistics?

If one office is the sample of the population of all offices in a building, then the average of salaries earned by all employees in the sample office annually would be an example of a statistic .

Does a sample represent the entire population?

Not always. Only a representative sample reflects the entire population of your study. It is an unbiased reflection of what the population is actually like. For instance, you can evaluate the effectiveness by dividing your population on the basis of gender, education, profession, and so on. It depends on how much information is available about your population and the scope of your study. Not to mention how detailed you want your study to be.

You May Also Like

Standard error, abbreviated as SE, is a mathematical tool used to assess the variability in statistics.

Confidence intervals tell us how confident we are in our results. It’s a range of values, bounded above, and below the statistical mean, that is likely to be where our true value lies.

T-distribution is a probability distribution that predicts the population parameters when the population standard deviation is unknown, and the sample size is small.

USEFUL LINKS

LEARNING RESOURCES

researchprospect-reviews-trust-site

COMPANY DETAILS

Research-Prospect-Writing-Service

  • How It Works

Introduction to Research Methods

7 samples and populations.

So you’ve developed your research question, figured out how you’re going to measure whatever you want to study, and have your survey or interviews ready to go. Now all your need is other people to become your data.

You might say ‘easy!’, there’s people all around you. You have a big family tree and surely them and their friends would have happy to take your survey. And then there’s your friends and people you’re in class with. Finding people is way easier than writing the interview questions or developing the survey. That reaction might be a strawman, maybe you’ve come to the conclusion none of this is easy. For your data to be valuable, you not only have to ask the right questions, you have to ask the right people. The “right people” aren’t the best or the smartest people, the right people are driven by what your study is trying to answer and the method you’re using to answer it.

Remember way back in chapter 2 when we looked at this chart and discussed the differences between qualitative and quantitative data.

One of the biggest differences between quantitative and qualitative data was whether we wanted to be able to explain something for a lot of people (what percentage of residents in Oklahoma support legalizing marijuana?) versus explaining the reasons for those opinions (why do some people support legalizing marijuana and others not?). The underlying differences there is whether our goal is explain something about everyone, or whether we’re content to explain it about just our respondents.

‘Everyone’ is called the population . The population in research is whatever group the research is trying to answer questions about. The population could be everyone on planet Earth, everyone in the United States, everyone in rural counties of Iowa, everyone at your university, and on and on. It is simply everyone within the unit you are intending to study.

In order to study the population, we typically take a sample or a subset. A sample is simply a smaller number of people from the population that are studied, which we can use to then understand the characteristics of the population based on that subset. That’s why a poll of 1300 likely voters can be used to guess at who will win your states Governor race. It isn’t perfect, and we’ll talk about the math behind all of it in a later chapter, but for now we’ll just focus on the different types of samples you might use to study a population with a survey.

If correctly sampled, we can use the sample to generalize information we get to the population. Generalizability , which we defined earlier, means we can assume the responses of people to our study match the responses everyone would have given us. We can only do that if the sample is representative of the population, meaning that they are alike on important characteristics such as race, gender, age, education. If something makes a large difference in people’s views on a topic in your research and your sample is not balanced, you’ll get inaccurate results.

Generalizability is more of a concern with surveys than with interviews. The goal of a survey is to explain something about people beyond the sample you get responses from. You’ll never see a news headline saying that “53% of 1250 Americans that responded to a poll approve of the President”. It’s only worth asking those 1250 people if we can assume the rest of the United States feels the same way overall. With interviews though we’re looking for depth from their responses, and so we are less hopefully that the 15 people we talk to will exactly match the American population. That doesn’t mean the data we collect from interviews doesn’t have value, it just has different uses.

There are two broad types of samples, with several different techniques clustered below those. Probability sampling is associated with surveys, and non-probability sampling is often used when conducting interviews. We’ll first describe probability samples, before discussing the non-probability options.

The type of sampling you’ll use will be based on the type of research you’re intending to do. There’s no sample that’s right or wrong, they can just be more or less appropriate for the question you’re trying to answer. And if you use a less appropriate sampling strategy, the answer you get through your research is less likely to be accurate.

7.1 Types of Probability Samples

So we just hinted at the idea that depending on the sample you use, you can generalize the data you collect from the sample to the population. That will depend though on whether your sample represents the population. To ensure that your sample is representative of the population, you will want to use a probability sample. A representative sample refers to whether the characteristics (race, age, income, education, etc) of the sample are the same as the population. Probability sampling is a sampling technique in which every individual in the population has an equal chance of being selected as a subject for the research.

There are several different types of probability samples you can use, depending on the resources you have available.

Let’s start with a simple random sample . In order to use a simple random sample all you have to do is take everyone in your population, throw them in a hat (not literally, you can just throw their names in a hat), and choose the number of names you want to use for your sample. By drawing blindly, you can eliminate human bias in constructing the sample and your sample should represent the population from which it is being taken.

However, a simple random sample isn’t quite that easy to build. The biggest issue is that you have to know who everyone is in order to randomly select them. What that requires is a sampling frame , a list of all residents in the population. But we don’t always have that. There is no list of residents of New York City (or any other city). Organizations that do have such a list wont just give it away. Try to ask your university for a list and contact information of everyone at your school so you can do a survey? They wont give it to you, for privacy reasons. It’s actually harder to think of popultions you could easily develop a sample frame for than those you can’t. If you can get or build a sampling frame, the work of a simple random sample is fairly simple, but that’s the biggest challenge.

Most of the time a true sampling frame is impossible to acquire, so researcher have to settle for something approximating a complete list. Earlier generations of researchers could use the random dial method to contact a random sample of Americans, because every household had a single phone. To use it you just pick up the phone and dial random numbers. Assuming the numbers are actually random, anyone might be called. That method actually worked somewhat well, until people stopped having home phone numbers and eventually stopped answering the phone. It’s a fun mental exercise to think about how you would go about creating a sampling frame for different groups though; think through where you would look to find a list of everyone in these groups:

Plumbers Recent first-time fathers Members of gyms

The best way to get an actual sampling frame is likely to purchase one from a private company that buys data on people from all the different websites we use.

Let’s say you do have a sampling frame though. For instance, you might be hired to do a survey of members of the Republican Party in the state of Utah to understand their political priorities this year, and the organization could give you a list of their members because they’ve hired you to do the reserach. One method of constructing a simple random sample would be to assign each name on the list a number, and then produce a list of random numbers. Once you’ve matched the random numbers to the list, you’ve got your sample. See the example using the list of 20 names below

what is population and sample in research with example

and the list of 5 random numbers.

what is population and sample in research with example

Systematic sampling is similar to simple random sampling in that it begins with a list of the population, but instead of choosing random numbers one would select every kth name on the list. What the heck is a kth? K just refers to how far apart the names are on the list you’re selecting. So if you want to sample one-tenth of the population, you’d select every tenth name. In order to know the k for your study you need to know your sample size (say 1000) and the size of the population (75000). You can divide the size of the population by the sample (75000/1000), which will produce your k (750). As long as the list does not contain any hidden order, this sampling method is as good as the random sampling method, but its only advantage over the random sampling technique is simplicity. If we used the same list as above and wanted to survey 1/5th of the population, we’d include 4 of the names on the list. It’s important with systematic samples to randomize the starting point in the list, otherwise people with A names will be oversampled. If we started with the 3rd name, we’d select Annabelle Frye, Cristobal Padilla, Jennie Vang, and Virginia Guzman, as shown below. So in order to use a systematic sample, we need three things, the population size (denoted as N ), the sample size we want ( n ) and k , which we calculate by dividing the population by the sample).

N= 20 (Population Size) n= 4 (Sample Size) k= 5 {20/4 (kth element) selection interval}

what is population and sample in research with example

We can also use a stratified sample , but that requires knowing more about the population than just their names. A stratified sample divides the study population into relevant subgroups, and then draws a sample from each subgroup. Stratified sampling can be used if you’re very concerned about ensuring balance in the sample or there may be a problem of underrepresentation among certain groups when responses are received. Not everyone in your sample is equally likely to answer a survey. Say for instance we’re trying to predict who will win an election in a county with three cities. In city A there are 1 million college students, in city B there are 2 million families, and in City C there are 3 million retirees. You know that retirees are more likely than busy college students or parents to respond to a poll. So you break the sample into three parts, ensuring that you get 100 responses from City A, 200 from City B, and 300 from City C, so the three cities would match the population. A stratified sample provides the researcher control over the subgroups that are included in the sample, whereas simple random sampling does not guarantee that any one type of person will be included in the final sample. A disadvantage is that it is more complex to organize and analyze the results compared to simple random sampling.

Cluster sampling is an approach that begins by sampling groups (or clusters) of population elements and then selects elements from within those groups. A researcher would use cluster sampling if getting access to elements in an entrie population is too challenging. For instance, a study on students in schools would probably benefit from randomly selecting from all students at the 36 elementary schools in a fictional city. But getting contact information for all students would be very difficult. So the researcher might work with principals at several schools and survey those students. The researcher would need to ensure that the students surveyed at the schools are similar to students throughout the entire city, and greater access and participation within each cluster may make that possible.

The image below shows how this can work, although the example is oversimplified. Say we have 12 students that are in 6 classrooms. The school is in total 1/4th green (3/12), 1/4th yellow (3/12), and half blue (6/12). By selecting the right clusters from within the school our sample can be representative of the entire school, assuming these colors are the only significant difference between the students. In the real world, you’d want to match the clusters and population based on race, gender, age, income, etc. And I should point out that this is an overly simplified example. What if 5/12s of the school was yellow and 1/12th was green, how would I get the right proportions? I couldn’t, but you’d do the best you could. You still wouldn’t want 4 yellows in the sample, you’d just try to approximiate the population characteristics as best you can.

what is population and sample in research with example

7.2 Actually Doing a Survey

All of that probably sounds pretty complicated. Identifying your population shouldn’t be too difficult, but how would you ever get a sampling frame? And then actually identifying who to include… It’s probably a bit overwhelming and makes doing a good survey sound impossible.

Researchers using surveys aren’t superhuman though. Often times, they use a little help. Because surveys are really valuable, and because researchers rely on them pretty often, there has been substantial growth in companies that can help to get one’s survey to its intended audience.

One popular resource is Amazon’s Mechanical Turk (more commonly known as MTurk). MTurk is at its most basic a website where workers look for jobs (called hits) to be listed by employers, and choose whether to do the task or not for a set reward. MTurk has grown over the last decade to be a common source of survey participants in the social sciences, in part because hiring workers costs very little (you can get some surveys completed for penny’s). That means you can get your survey completed with a small grant ($1-2k at the low end) and get the data back in a few hours. Really, it’s a quick and easy way to run a survey.

However, the workers aren’t perfectly representative of the average American. For instance, researchers have found that MTurk respondents are younger, better educated, and earn less than the average American.

One way to get around that issue, which can be used with MTurk or any survey, is to weight the responses. Because with MTurk you’ll get fewer responses from older, less educated, and richer Americans, those responses you do give you want to count for more to make your sample more representative of the population. Oversimplified example incoming!

Imagine you’re setting up a pizza party for your class. There are 9 people in your class, 4 men and 5 women. You only got 4 responses from the men, and 3 from the women. All 4 men wanted peperoni pizza, while the 3 women want a combination. Pepperoni wins right, 4 to 3? Not if you assume that the people that didn’t respond are the same as the ones that did. If you weight the responses to match the population (the full class of 9), a combination pizza is the winner.

what is population and sample in research with example

Because you know the population of women is 5, you can weight the 3 responses from women by 5/3 = 1.6667. If we weight (or multiply) each vote we did receive from a woman by 1.6667, each vote for a combination now equals 1.6667, meaning that the 3 votes for combination total 5. Because we received a vote from every man in the class, we just weight their votes by 1. The big assumption we have to make is that the people we didn’t hear from (the 2 women that didn’t vote) are similar to the ones we did hear from. And if we don’t get any responses from a group we don’t have anything to infer their preferences or views from.

Let’s go through a slightly more complex example, still just considering one quality about people in the class. Let’s say your class actually has 100 students, but you only received votes from 50. And, what type of pizza people voted for is mixed, but men still prefer peperoni overall, and women still prefer combination. The class is 60% female and 40% male.

We received 21 votes from women out of the 60, so we can weight their responses by 60/21 to represent the population. We got 29 votes out of the 40 for men, so their responses can be weighted by 40/29. See the math below.

what is population and sample in research with example

53.8 votes for combination? That might seem a little odd, but weighting isn’t a perfect science. We can’t identify what a non-respondent would have said exactly, all we can do is use the responses of other similar people to make a good guess. That issue often comes up in polling, where pollsters have to guess who is going to vote in a given election in order to project who will win. And we can weight on any characteristic of a person we think will be important, alone or in combination. Modern polls weight on age, gender, voting habits, education, and more to make the results as generalizable as possible.

There’s an appendix later in this book where I walk through the actual steps of creating weights for a sample in R, if anyone actually does a survey. I intended this section to show that doing a good survey might be simpler than it seemed, but now it might sound even more difficult. A good lesson to take though is that there’s always another door to go through, another hurdle to improve your methods. Being good at research just means being constantly prepared to be given a new challenge, and being able to find another solution.

7.3 Non-Probability Sampling

Qualitative researchers’ main objective is to gain an in-depth understanding on the subject matter they are studying, rather than attempting to generalize results to the population. As such, non-probability sampling is more common because of the researchers desire to gain information not from random elements of the population, but rather from specific individuals.

Random selection is not used in nonprobability sampling. Instead, the personal judgment of the researcher determines who will be included in the sample. Typically, researchers may base their selection on availability, quotas, or other criteria. However, not all members of the population are given an equal chance to be included in the sample. This nonrandom approach results in not knowing whether the sample represents the entire population. Consequently, researchers are not able to make valid generalizations about the population.

As with probability sampling, there are several types of non-probability samples. Convenience sampling , also known as accidental or opportunity sampling, is a process of choosing a sample that is easily accessible and readily available to the researcher. Researchers tend to collect samples from convenient locations such as their place of employment, a location, school, or other close affiliation. Although this technique allows for quick and easy access to available participants, a large part of the population is excluded from the sample.

For example, researchers (particularly in psychology) often rely on research subjects that are at their universities. That is highly convenient, students are cheap to hire and readily available on campuses. However, it means the results of the study may have limited ability to predict motivations or behaviors of people that aren’t included in the sample, i.e., people outside the age of 18-22 that are going to college.

If I ask you to get find out whether people approve of the mayor or not, and tell you I want 500 people’s opinions, should you go stand in front of the local grocery store? That would be convinient, and the people coming will be random, right? Not really. If you stand outside a rural Piggly Wiggly or an urban Whole Foods, do you think you’ll see the same people? Probably not, people’s chracteristics make the more or less likely to be in those locations. This technique runs the high risk of over- or under-representation, biased results, as well as an inability to make generalizations about the larger population. As the name implies though, it is convenient.

Purposive sampling , also known as judgmental or selective sampling, refers to a method in which the researcher decides who will be selected for the sample based on who or what is relevant to the study’s purpose. The researcher must first identify a specific characteristic of the population that can best help answer the research question. Then, they can deliberately select a sample that meets that particular criterion. Typically, the sample is small with very specific experiences and perspectives. For instance, if I wanted to understand the experiences of prominent foreign-born politicians in the United States, I would purposefully build a sample of… prominent foreign-born politicians in the United States. That would exclude anyone that was born in the United States or and that wasn’t a politician, and I’d have to define what I meant by prominent. Purposive sampling is susceptible to errors in judgment by the researcher and selection bias due to a lack of random sampling, but when attempting to research small communities it can be effective.

When dealing with small and difficult to reach communities researchers sometimes use snowball samples , also known as chain referral sampling. Snowball sampling is a process in which the researcher selects an initial participant for the sample, then asks that participant to recruit or refer additional participants who have similar traits as them. The cycle continues until the needed sample size is obtained.

This technique is used when the study calls for participants who are hard to find because of a unique or rare quality or when a participant does not want to be found because they are part of a stigmatized group or behavior. Examples may include people with rare diseases, sex workers, or a child sex offenders. It would be impossible to find an accurate list of sex workers anywhere, and surveying the general population about whether that is their job will produce false responses as people will be unwilling to identify themselves. As such, a common method is to gain the trust of one individual within the community, who can then introduce you to others. It is important that the researcher builds rapport and gains trust so that participants can be comfortable contributing to the study, but that must also be balanced by mainting objectivity in the research.

Snowball sampling is a useful method for locating hard to reach populations but cannot guarantee a representative sample because each contact will be based upon your last. For instance, let’s say you’re studying illegal fight clubs in your state. Some fight clubs allow weapons in the fights, while others completely ban them; those two types of clubs never interreact because of their disagreement about whether weapons should be allowed, and there’s no overlap between them (no members in both type of club). If your initial contact is with a club that uses weapons, all of your subsequent contacts will be within that community and so you’ll never understand the differences. If you didn’t know there were two types of clubs when you started, you’ll never even know you’re only researching half of the community. As such, snowball sampling can be a necessary technique when there are no other options, but it does have limitations.

Quota Sampling is a process in which the researcher must first divide a population into mutually exclusive subgroups, similar to stratified sampling. Depending on what is relevant to the study, subgroups can be based on a known characteristic such as age, race, gender, etc. Secondly, the researcher must select a sample from each subgroup to fit their predefined quotas. Quota sampling is used for the same reason as stratified sampling, to ensure that your sample has representation of certain groups. For instance, let’s say that you’re studying sexual harassment in the workplace, and men are much more willing to discuss their experiences than women. You might choose to decide that half of your final sample will be women, and stop requesting interviews with men once you fill your quota. The core difference is that while stratified sampling chooses randomly from within the different groups, quota sampling does not. A quota sample can either be proportional or non-proportional . Proportional quota sampling refers to ensuring that the quotas in the sample match the population (if 35% of the company is female, 35% of the sample should be female). Non-proportional sampling allows you to select your own quota sizes. If you think the experiences of females with sexual harassment are more important to your research, you can include whatever percentage of females you desire.

7.4 Dangers in sampling

Now that we’ve described all the different ways that one could create a sample, we can talk more about the pitfalls of sampling. Ensuring a quality sample means asking yourself some basic questions:

  • Who is in the sample?
  • How were they sampled?
  • Why were they sampled?

A meal is often only as good as the ingredients you use, and your data will only be as good as the sample. If you collect data from the wrong people, you’ll get the wrong answer. You’ll still get an answer, it’ll just be inaccurate. And I want to reemphasize here wrong people just refers to inappropriate for your study. If I want to study bullying in middle schools, but I only talk to people that live in a retirement home, how accurate or relevant will the information I gather be? Sure, they might have grandchildren in middle school, and they may remember their experiences. But wouldn’t my information be more relevant if I talked to students in middle school, or perhaps a mix of teachers, parents, and students? I’ll get an answer from retirees, but it wont be the one I need. The sample has to be appropriate to the research question.

Is a bigger sample always better? Not necessarily. A larger sample can be useful, but a more representative one of the population is better. That was made painfully clear when the magazine Literary Digest ran a poll to predict who would win the 1936 presidential election between Alf Landon and incumbent Franklin Roosevelt. Literary Digest had run the poll since 1916, and had been correct in predicting the outcome every time. It was the largest poll ever, and they received responses for 2.27 million people. They essentially received responses from 1 percent of the American population, while many modern polls use only 1000 responses for a much more populous country. What did they predict? They showed that Alf Landon would be the overwhelming winner, yet when the election was held Roosevelt won every state except Maine and Vermont. It was one of the most decisive victories in Presidential history.

So what went wrong for the Literary Digest? Their poll was large (gigantic!), but it wasn’t representative of likely voters. They polled their own readership, which tended to be more educated and wealthy on average, along with people on a list of those with registered automobiles and telephone users (both of which tended to be owned by the wealthy at that time). Thus, the poll largely ignored the majority of Americans, who ended up voting for Roosevelt. The Literary Digest poll is famous for being wrong, but led to significant improvements in the science of polling to avoid similar mistakes in the future. Researchers have learned a lot in the century since that mistake, even if polling and surveys still aren’t (and can’t be) perfect.

What kind of sampling strategy did Literary Digest use? Convenience, they relied on lists they had available, rather than try to ensure every American was included on their list. A representative poll of 2 million people will give you more accurate results than a representative poll of 2 thousand, but I’ll take the smaller more representative poll than a larger one that uses convenience sampling any day.

7.5 Summary

Picking the right type of sample is critical to getting an accurate answer to your reserach question. There are a lot of differnet options in how you can select the people to participate in your research, but typically only one that is both correct and possible depending on the research you’re doing. In the next chapter we’ll talk about a few other methods for conducting reseach, some that don’t include any sampling by you.

Statology

Statistics Made Easy

Population vs. Sample: What’s the Difference?

Often in statistics we’re interested in collecting data so that we can answer some research question.

For example, we might want to answer the following questions:

1. What is the median household income in Miami, Florida?

2. What is the mean weight of a certain population of turtles?

3. What percentage of residents in a certain county support a certain law?

In each scenario, we are interested in answering some question about a  population , which represents every possible individual element that we’re interested in measuring.

However, instead of collecting data on every individual in a population we instead collect data on a sample of the population, which represents a portion of the population.

Population : Every possible individual element that we are interested in measuring.   Sample: A portion of the population.

Here is an example of a population vs. a sample in the three intro examples.

Example 1: What is the median household income in Miami, Florida?

The entire population might include 500,000 households, but we might only collect data on a sample of 2,000 total households.

Population vs. sample

The entire population might include 800 turtles, but we might only collect data on a sample of 30 turtles.

Difference between population and sample

The entire population might include 50,000 residents, but we might only collect data on a sample of 1,000 residents.

Example of population vs sample

Why Use Samples?

There are several reasons that we typically collect data on samples instead of entire populations, including:

1 . It is too time-consuming to collect data on an entire population . For example, if we want to know the median household income in Miami, Florida, it might take months or even years to go around and gather income for each household. By the time we collect all of this data, the population may have changed or the research question of interest might no longer be of interest.

2. It is too costly to collect data on an entire population. It is often too expensive to go around and collect data for every individual in a population, which is why we instead choose to collect data on a sample instead.

3. It is unfeasible to collect data on an entire population. In many cases it’s simply not possible to collect data for  every individual in a population. For example, it may be extraordinarily difficult to track down and weigh every turtle in a certain population that we’re interested in. 

By collecting data on samples, we’re able to gather information about a given population much faster and cheaper.

And if our sample is  representative of the population , then we can generalize the findings from a sample to the larger population with a high level of confidence.

The Importance of Representative Samples

When we collect a sample from a population, we ideally want the sample to be like a “mini version” of our population.

For example, suppose we want to understand the movie preferences of students in a certain school district that has a population of 5,000 total students. Since it would take too long to survey every individual student, we might instead take a sample of 100 students and ask them about their preferences. 

If the overall student population is composed of 50% girls and 50% boys, our sample would not be representative if it included 90% boys and only 10% girls.

Representative sample of a population

Or if the overall population is composed of equal parts freshman, sophomores, juniors, and seniors, then our sample would not be representative if it only included freshman. 

what is population and sample in research with example

A sample is representative of a population if the characteristics of the individuals in the sample closely matches the characteristics of the individuals in the overall population.

When this occurs, we can generalize the findings from the sample to the overall population with confidence. 

How to Obtain Samples

There are many different methods we can use to obtain samples from populations. 

To maximize the chances that we obtain a representative sample, we can use one of the three following methods:

Simple random sampling: Randomly select individuals through the use of a random number generator or some means of random selection.

Systematic random sampling: Put every member of a population into some order. Choose a random starting point and select every n th member to be in the sample.

Stratified random sampling: Split a population into groups. Randomly select some members from each group to be in the sample. 

In each of these methods, every individual in the population has an equal probability of being included in the sample. This maximizes the chances that we obtain a sample that is a “mini version” of the population.

Featured Posts

7 Common Beginner Stats Mistakes and How to Avoid Them

Hey there. My name is Zach Bobbitt. I have a Masters of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike.  My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.

3 Replies to “Population vs. Sample: What’s the Difference?”

It is nice, clear, and understandable. Thank you!

simple but very accurate explanation

Great piece. Tank you very much, Mr Zach.

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Join the Statology Community

Sign up to receive Statology's exclusive study resource: 100 practice problems with step-by-step solutions. Plus, get our latest insights, tutorials, and data analysis tips straight to your inbox!

By subscribing you accept Statology's Privacy Policy.

3. Populations and samples

Populations, unbiasedness and precision, randomisation, variation between samples, standard error of the mean.

what is population and sample in research with example

Types of samples

There are two categories of sampling generally used – probability sampling and non-probability sampling :

  • Probability sampling , also known as random sampling, is a kind of sample selection where randomization is used instead of deliberate choice.
  • Non-probability sampling techniques involve the researcher deliberately picking items or individuals for the sample based on their research goals or knowledge

These two sampling techniques have several methods:

Probability sampling types include:

  • Simple random sampling Every element in the population has an equal chance of being selected as part of the sample. Find out more about simple random sampling.
  • Systematic sampling Also known as systematic clustering, in this method, random selection only applies to the first item chosen. A rule then applies so that every nth item or person after that is picked. Find out more about systematic sampling .
  • Stratified random sampling Sampling uses random selection within predefined groups. Find out more about stratified random sampling .
  • Cluster sampling Groups rather than individual units of the target population are selected at random.

Non-probability sampling types include:

  • Convenience sampling People or elements in a sample are selected based on their availability.
  • Quota sampling The sample is formed according to certain groups or criteria.
  • Purposive sampling Also known as judgmental sampling. The sample is formed by the researcher consciously choosing entities, based on the survey goals.
  • Snowball sampling Also known as referral sampling. The sample is formed by sample participants recruiting connections.

Find out more about sampling methods with our ultimate guide to sampling methods and best practices

Calculating sample size

Worried about sample sizes? You can also use our sample size calculator to determine how many responses you need to be confident in your data.

what is population and sample in research with example

Go to sample size calculator

When to use sampling

As mentioned, sampling is useful for dealing with population data that is too large to process as a whole or is inaccessible. Sampling also helps to keep costs down and reduce time to insight.

Advantages of using sampling to collect data

  • Provide researchers with a representative view of the population through the sample subset.
  • The researcher has flexibility and control over what kind of sample they want to make, depending on their needs and the goals of the research.
  • Reduces the volume of data, helping to save time.
  • With proper methods, researchers can achieve a higher level of accuracy
  • Researchers can get detailed information on a population with a smaller amount of resource
  • Significantly cheaper than other methods
  • Allows for deeper study of some aspects of data — rather than asking 15 questions to every individual, it’s better to use 50 questions on a representative sample

Disadvantages of using sampling to collect data

  • Researcher bias can affect the quality and accuracy of results
  • Sampling studies require well-trained experts
  • Even with good survey design, there’s no way to eliminate sampling errors entirely
  • People in the sample may refuse to respond
  • Probability sampling methods can be less representative in favor of random allocation.
  • Improper selection of sampling techniques can affect the entire process negatively

How can you use sampling in business?

Depending on the nature of your study and the conclusions you wish to draw, you’ll have to select an appropriate sampling method as mentioned above. That said, here are a few examples of how you can use sampling techniques in business.

Creating a new product

If you’re looking to create a new product line, you may want to do panel interviews or surveys with a representative sample for the new market. By showing your product or concept to a sample that represents your target audience (population), you ensure that the feedback you receive is more reflective of how that customer segment will feel.

Average employee performance

If you wanted to understand the average employee performance for a specific group, you could use a random sample from a team or department (population). As every person in the department has a chance of being selected, you’ll have a truly random — yet representative sample. From the data collected, you can make inferences about the team/department’s average performance.

Store feedback

Let’s say you want to collect feedback from customers who are shopping or have just finished shopping at your store. To do this, you could use convenience sampling. It’s fast, affordable and done at a point of convenience. You can use this to get a quick gauge of how people feel about your store’s shopping experience — but it won’t represent the true views of all your customers.

Manage your population and sample data easily

Whatever the sample size of your target audience, there are several things to consider:

  • How can you save time in conducting the research?
  • How do you analyze and compare all the responses?
  • How can you track and chase non-respondents easily?
  • How can you translate the data into a usable presentation format?
  • How can you share this easily?

These questions can make the task of supporting internal teams and management difficult.

This is where the Qualtrics CoreXM technology solution can help you progress through research with ease.

It includes:

  • Advanced AI and machine learning tools to easily analyze data from open-text responses and data, giving you actionable insights at scale.
  • Intuitive drag-and-drop survey building with powerful logic, 100+ question types, and pre-built survey templates . For more information on how to get started on your survey creation, visit our complete guide on creating a survey.
  • Stylish, accessible and easy-to-understand reporting that automatically updates in real time, so everyone in your organization has the latest insights at their fingertips.
  • Powerful automation to get up and running quickly with out-of-the-box workflows, including guided setup and proactive recommendations to help you connect with other teams and react fast to changes.

Also, the Qualtrics online research panels and samples help you to:

  • Choose a target audience and get access to a representative sample
  • Boost the accuracy of your research with a sample methodology that’s 47% more consistent than standard sampling methods
  • Get dedicated support at every stage, from launching your survey to reporting on the results.

Want to learn more?

Related resources

Panels & Samples

Representative Samples 13 min read

Reward survey participants 15 min read, panel management 14 min read, what is a research panel 10 min read.

Analysis & Reporting

Data Saturation In Qualitative Research 8 min read

How to determine sample size 12 min read.

Market Segmentation

User Personas 14 min read

Request demo.

Ready to learn more about Qualtrics?

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Ind Psychiatry J
  • v.19(1); Jan-Jun 2010

Statistics without tears: Populations and samples

Amitav banerjee.

Department of Community Medicine, D Y Patil Medical College, Pune, India

Suprakash Chaudhury

1 Department of Psychiatry, RINPAS, Kanke, Ranchi, India

Research studies are usually carried out on sample of subjects rather than whole populations. The most challenging aspect of fieldwork is drawing a random sample from the target population to which the results of the study would be generalized. In actual practice, the task is so difficult that some sampling bias occurs in almost all studies to a lesser or greater degree. In order to assess the degree of this bias, the informed reader of medical literature should have some understanding of the population from which the sample was drawn. The ultimate decision on whether the results of a particular study can be generalized to a larger population depends on this understanding. The subsequent deliberations dwell on sampling strategies for different types of research and also a brief description of different sampling methods.

Research workers in the early 19th century endeavored to survey entire populations. This feat was tedious, and the research work suffered accordingly. Current researchers work only with a small portion of the whole population (a sample) from which they draw inferences about the population from which the sample was drawn.

This inferential leap or generalization from samples to population, a feature of inductive or empirical research, can be full of pitfalls. In clinical medicine, it is not sufficient merely to describe a patient without assessing the underlying condition by a detailed history and clinical examination. The signs and symptoms are then interpreted against the total background of the patient's history and clinical examination including mental state examination. Similarly, in inferential statistics, it is not enough to just describe the results in the sample. One has to critically appraise the real worth or representativeness of that particular sample. The following discussion endeavors to explain the inputs required for making a correct inference from a sample to the target population.

TARGET POPULATION

Any inferences from a sample refer only to the defined population from which the sample has been properly selected. We may call this the target population. For example, if in a sample of lawyers from Delhi High Court it is found that 5% are having alcohol dependence syndrome, can we say that 5% of all lawyers all over the world are alcoholics? Obviously not, as the lawyers of Delhi High Court may be an institution by themselves and may not represent the global lawyers′ community. The findings of this study, therefore, apply only to Delhi High Court lawyers from which a representative sample was taken. Of course, this finding may nevertheless be interesting, but only as a pointer to further research. The data on lawyers in a particular city tell us nothing about lawyers in other cities or countries.

POPULATIONS IN INFERENTIAL STATISTICS

In statistics, a population is an entire group about which some information is required to be ascertained. A statistical population need not consist only of people. We can have population of heights, weights, BMIs, hemoglobin levels, events, outcomes, so long as the population is well defined with explicit inclusion and exclusion criteria. In selecting a population for study, the research question or purpose of the study will suggest a suitable definition of the population to be studied, in terms of location and restriction to a particular age group, sex or occupation. The population must be fully defined so that those to be included and excluded are clearly spelt out (inclusion and exclusion criteria). For example, if we say that our study populations are all lawyers in Delhi, we should state whether those lawyers are included who have retired, are working part-time, or non-practicing, or those who have left the city but still registered at Delhi.

Use of the word population in epidemiological research does not correspond always with its demographic meaning of an entire group of people living within certain geographic or political boundaries. A population for a research study may comprise groups of people defined in many different ways, for example, coal mine workers in Dhanbad, children exposed to German measles during intrauterine life, or pilgrims traveling to Kumbh Mela at Allahabad.

GENERALIZATION (INFERENCES) FROM A POPULATION

When generalizing from observations made on a sample to a larger population, certain issues will dictate judgment. For example, generalizing from observations made on the mental health status of a sample of lawyers in Delhi to the mental health status of all lawyers in Delhi is a formalized procedure, in so far as the errors (sampling or random) which this may hazard can, to some extent, be calculated in advance. However, if we attempt to generalize further, for instance, about the mental statuses of all lawyers in the country as a whole, we hazard further pitfalls which cannot be specified in advance. We do not know to what extent the study sample and population of Delhi is typical of the larger population – that of the whole country – to which it belongs.

The dilemmas in defining populations differ for descriptive and analytic studies.

POPULATION IN DESCRIPTIVE STUDIES

In descriptive studies, it is customary to define a study population and then make observations on a sample taken from it. Study populations may be defined by geographic location, age, sex, with additional definitions of attributes and variables such as occupation, religion and ethnic group.[ 1 ]

Geographic location

In field studies, it may be desirable to use a population defined by an administrative boundary such as a district or a state. This may facilitate the co-operation of the local administrative authorities and the study participants. Moreover, basic demographic data on the population such as population size, age, gender distribution (needed for calculating age- and sex-specific rates) available from census data or voters’ list are easier to obtain from administrative headquarters. However, administrative boundaries do not always consist of homogenous group of people. Since it is desirable that a modest descriptive study does not cover a number of different groups of people, with widely differing ways of life or customs, it may be necessary to restrict the study to a particular ethnic group, and thus ensure better genetic or cultural homogeneity. Alternatively, a population may be defined in relation to a prominent geographic feature, such as a river, or mountain, which imposes a certain uniformity of ways of life, attitudes, and behavior upon the people who live in the vicinity.

If cases of a disease are being ascertained through their attendance at a hospital outpatient department (OPD), rather than by field surveys in the community, it will be necessary to define the population according to the so-called catchment area of the hospital OPD. For administrative purposes, a dispensary, health center or hospital is usually considered to serve a population within a defined geographic area. But these catchment areas may only represent in a crude manner with the actual use of medical facilities by the local people. For example, in OPD study of psychiatric illnesses in a particular hospital with a defined catchment area, many people with psychiatric illnesses may not visit the particular OPD and may seek treatment from traditional healers or religious leaders.

Catchment areas depend on the demography of the area and the accessibility of the health center or hospital. Accessibility has three dimensions – physical, economic and social.[ 2 ] Physical accessibility is the time required to travel to the health center or medical facility. It depends on the topography of the area (e.g. hill and tribal areas with poor roads have problems of physical accessibility). Economic accessibility is the paying capacity of the people for services. Poverty may limit health seeking behavior if the person cannot afford the bus fare to the health center even if the health services may be free of charge. It may also involve absence from work which, for daily wage earners, is a major economic disincentive. Social factors such as caste, culture, language, etc. may adversely affect accessibility to health facility if the treating physician is not conversant with the local language and customs. In such situations, the patient may feel more comfortable with traditional healers.

Ascertainment of a particular disease within a particular area may be incomplete either because some patient may seek treatment elsewhere or some patients do not seek treatment at all. Focus group discussions (qualitative study) with local people, especially those residing away from the health center, may give an indication whether serious underreporting is occurring.

When it is impossible to relate cases of a disease to a population, perhaps because the cases were ascertained through a hospital with an undefined catchment area, proportional morbidity rates may be used. These rates have been widely used in cancer epidemiology where the number of cases of one form of cancer is expressed as a proportion of the number of cases of all forms of cancer among patients attending the same hospital during the same period.

POPULATIONS IN ANALYTIC STUDIES

Case control studies.

As opposed to descriptive studies where a study population is defined and then observations are made on a representative sample from it, in case control studies observations are made on a group of patients. This is known as the study group , which usually is not selected by sampling of a defined larger group. For instance, a study on patients of bipolar disorder may include every patient with this disorder attending the psychiatry OPD during the study period. One should not forget, however, that in this situation also, there is a hypothetical population consisting of all patients with bipolar disorder in the universe (which may be a certain region, a country or globally depending on the extent of the generalization intended from the findings of the study). Case control studies are often carried out in hospital settings because this is more convenient and accessible group than cases in the community at large. However, the two groups of cases may differ in many respects. At the outset of the study, it should be deliberated whether these differences would affect the external validity (generalization) of the study. Usually, analytic studies are not carried out in groups containing atypical cases of the disorder, unless there is a special indication to do so.

Populations in cohort studies

Basically, cohort studies compare two groups of people (cohorts) and demonstrate whether or not there are more cases of the disease among the cohort exposed to the suspected cause than among the cohort not exposed. To determine whether an association exists between positive family history of schizophrenia and subsequent schizophrenia in persons having such a history, two cohorts would be required: first, the exposed group, that is, people with a family history of mental disorders (the suspected cause) and second, the unexposed group, that is, people without a family history of mental disorders. These two cohorts would need to be followed up for a number of years and cases of schizophrenia in either group would be recorded. If a positive family history is associated with development of schizophrenia, then more cases would occur in the first group than in the second group.

The crucial challenges in a cohort study are that it should include participants exposed to a particular cause being investigated and that it should consist of persons who can be followed up for the period of time between exposure (cause) and development of the disorder. It is vital that the follow-up of a cohort should be complete as far as possible. If more than a small proportion of persons in the cohort cannot be traced (loss to follow-up or attrition), the findings will be biased , in case these persons differ significantly from those remaining in the study.

Depending on the type of exposure being studied, there may or may not be a range of choice of cohort populations exposed to it who may form a larger population from which one has to select a study sample. For instance, if one is exploring association between occupational hazard such as job stress in health care workers in intensive care units (ICUs) and subsequent development of drug addiction, one has to, by the very nature of the research question, select health care workers working in ICUs. On the other hand, cause effect study for association between head injury and epilepsy offers a much wider range of possible cohorts.

Difficulties in making repeated observations on cohorts depend on the length of time of the study. In correlating maternal factors (pregnancy cohort) with birth weight, the period of observation is limited to 9 months. However, if in a study it is tried to find the association between maternal nutrition during pregnancy and subsequent school performance of the child, the study will extend to years. For such long duration investigations, it is wise to select study cohorts that are firstly, not likely to migrate, cooperative and likely to be so throughout the duration of the study, and most importantly, easily accessible to the investigator so that the expense and efforts are kept within reasonable limits. Occupational groups such as the armed forces, railways, police, and industrial workers are ideal for cohort studies. Future developments facilitating record linkage such as the Unique Identification Number Scheme may give a boost to cohort studies in the wider community.

A sample is any part of the fully defined population. A syringe full of blood drawn from the vein of a patient is a sample of all the blood in the patient's circulation at the moment. Similarly, 100 patients of schizophrenia in a clinical study is a sample of the population of schizophrenics, provided the sample is properly chosen and the inclusion and exclusion criteria are well defined.

To make accurate inferences, the sample has to be representative. A representative sample is one in which each and every member of the population has an equal and mutually exclusive chance of being selected.

Sample size

Inputs required for sample size calculation have been dealt from a clinical researcher's perspective avoiding the use of intimidating formulae and statistical jargon in an earlier issue of the journal.[ 1 ]

Target population, study population and study sample

A population is a complete set of people with a specialized set of characteristics, and a sample is a subset of the population. The usual criteria we use in defining population are geographic, for example, “the population of Uttar Pradesh”. In medical research, the criteria for population may be clinical, demographic and time related.

  • Clinical and demographic characteristics define the target population, the large set of people in the world to which the results of the study will be generalized (e.g. all schizophrenics).
  • The study population is the subset of the target population available for study (e.g. schizophrenics in the researcher's town).
  • The study sample is the sample chosen from the study population.

METHODS OF SAMPLING

Purposive (non-random samples).

  • Volunteers who agree to participate
  • Snowball sample, where one case identifies others of his kind (e.g. intravenous drug users)
  • Convenient sample such as captive medical students or other readily available groups
  • Quota sampling, at will selection of a fixed number from each group
  • Referred cases who may be under pressure to participate
  • Haphazard with combination of the above methods

Non-random samples have certain limitations. The larger group (target population) is difficult to identify. This may not be a limitation when generalization of results is not intended. The results would be valid for the sample itself (internal validity). They can, nevertheless, provide important clues for further studies based on random samples. Another limitation of non-random samples is that statistical inferences such as confidence intervals and tests of significance cannot be estimated from non-random samples. However, in some situations, the investigator has to make crucial judgments. One should remember that random samples are the means but representativeness is the goal. When non-random samples are representative (compare the socio-demographic characteristics of the sample subjects with the target population), generalization may be possible.

Random sampling methods

Simple random sampling.

A sample may be defined as random if every individual in the population being sampled has an equal likelihood of being included. Random sampling is the basis of all good sampling techniques and disallows any method of selection based on volunteering or the choice of groups of people known to be cooperative.[ 3 ]

In order to select a simple random sample from a population, it is first necessary to identify all individuals from whom the selection will be made. This is the sampling frame. In developing countries, listings of all persons living in an area are not usually available. Census may not catch nomadic population groups. Voters’ and taxpayers’ lists may be incomplete. Whether or not such deficiencies are major barriers in random sampling depends on the particular research question being investigated. To undertake a separate exercise of listing the population for the study may be time consuming and tedious. Two-stage sampling may make the task feasible.

The usual method of selecting a simple random sample from a listing of individuals is to assign a number to each individual and then select certain numbers by reference to random number tables which are published in standard statistical textbooks. Random number can also be generated by statistical software such as EPI INFO developed by WHO and CDC Atlanta.

Systematic sampling

A simple method of random sampling is to select a systematic sample in which every n th person is selected from a list or from other ordering. A systematic sample can be drawn from a queue of people or from patients ordered according to the time of their attendance at a clinic. Thus, a sample can be drawn without an initial listing of all the subjects. Because of this feasibility, a systematic sample may have some advantage over a simple random sample.

To fulfill the statistical criteria for a random sample, a systematic sample should be drawn from subjects who are randomly ordered. The starting point for selection should be randomly chosen. If every fifth person from a register is being chosen, then a random procedure must be used to determine whether the first, second, third, fourth, or fifth person should be chosen as the first member of the sample.

Multistage sampling

Sometimes, a strictly random sample may be difficult to obtain and it may be more feasible to draw the required number of subjects in a series of stages. For example, suppose we wish to estimate the number of CATSCAN examinations made of all patients entering a hospital in a given month in the state of Maharashtra. It would be quite tedious to devise a scheme which would allow the total population of patients to be directly sampled. However, it would be easier to list the districts of the state of Maharashtra and randomly draw a sample of these districts. Within this sample of districts, all the hospitals would then be listed by name, and a random sample of these can be drawn. Within each of these hospitals, a sample of the patients entering in the given month could be chosen randomly for observation and recording. Thus, by stages, we draw the required sample. If indicated, we can introduce some element of stratification at some stage (urban/rural, gender, age).

It should be cautioned that multistage sampling should only be resorted to when difficulties in simple random sampling are insurmountable. Those who take a simple random sample of 12 hospitals, and within each of these hospitals select a random sample of 10 patients, may believe they have selected 120 patients randomly from all the 12 hospitals. In statistical sense, they have in fact selected a sample of 12 rather than 120.[ 4 ]

Stratified sampling

If a condition is unevenly distributed in a population with respect to age, gender, or some other variable, it may be prudent to choose a stratified random sampling method. For example, to obtain a stratified random sample according to age, the study population can be divided into age groups such as 0–5, 6–10, 11–14, 15–20, 21–25, and so on, depending on the requirement. A different proportion of each group can then be selected as a subsample either by simple random sampling or systematic sampling. If the condition decreases with advancing age, then to include adequate number in the older age groups, one may select more numbers in older subsamples.

Cluster sampling

In many surveys, studies may be carried out on large populations which may be geographically quite dispersed. To obtain the required number of subjects for the study by a simple random sample method will require large costs and will be cumbersome. In such cases, clusters may be identified (e.g. households) and random samples of clusters will be included in the study; then, every member of the cluster will also be part of the study. This introduces two types of variations in the data – between clusters and within clusters – and this will have to be taken into account when analyzing data.

Cluster sampling may produce misleading results when the disease under study itself is distributed in a clustered fashion in an area. For example, suppose we are studying malaria in a population. Malaria incidence may be clustered in villages having stagnant water collections which may serve as a source of mosquito breeding. In villages without such water stagnation, there will be lesser malaria cases. The choice of few villages in cluster sampling may give erroneous results. The selection of villages as a cluster may be quite unrepresentative of the whole population by chance.[ 5 ]

Lot quality assurance sampling

Lot quality assurance sampling (LQAS), which originated in the manufacturing industry for quality control purposes, was used in the nineties to assess immunization coverage, estimate disease prevalence, and evaluate control measures and service coverage in different health programs.[ 6 ] Using only a small sample size, LQAS can effectively differentiate between areas that have or have not met the performance targets. Thus, this method is used not only to estimate the coverage of quality care but also to identify the exact subdivisions where it is deficient so that appropriate remedial measures can be implemented.

The choice of sampling methods is usually dictated by feasibility in terms of time and resources. Field research is quite messy and difficult like actual battle. It may be sometimes difficult to get a sample which is truly random. Most samples therefore tend to get biased. To estimate the magnitude of this bias, the researcher should have some idea about the population from which the sample is drawn. In conclusion, the following quote cited by Bradford Hill[ 4 ] elegantly sums up the benefit of random sampling:

…The actual practice of medicine is virtually confined to those members of the population who either are ill, or think they are ill, or are thought by somebody to be ill, and these so amply fill up the working day that in the course of time one comes unconsciously to believe that they are typical of the whole. This is not the case. The use of a random sample brings to light the individuals who are ill and know they are ill but have no intention of doing anything about it, as well as those who have never been ill, and probably never will be until their final illness. These would have been inaccessible to any other method of approach but that of the random sample… . J. H. Sheldon

Source of Support: Nil.

Conflict of Interest: None declared.

  • Student Program
  • Sign Up for Free

Population vs sample in research: What’s the difference?

Data Collection Methods

Population vs sample in research: What’s the difference?

Population and sample are two important terms in research. Having a thorough understanding of these terms is important if you want to conduct effective research — and that’s especially true for new researchers. If you need a primer on population vs sample, this article covers everything you need to know, including how to collect data from either group.

What is a population?

Outside the research field, population refers to the number of people living in a place at a particular time. In research, however, a population is a well-defined group of people or items that share the same characteristics. It’s the group that a researcher is interested in studying.

Arvind Sharma , an assistant professor at Boston College, explains that a population isn’t limited to people: “It can be any unit from which you obtain data to carry out your research.” This group could consist of humans, animals, or objects.

Below are some examples of population:

  • Male adults in the United States
  • World Cup football matches
  • Insects in American rainforests

As you can see from the examples above, populations are usually large, so it’s often difficult to survey an entire population. That’s where sampling comes in.

What is a sample?

A sample is a select group of individuals from the research population. A sample is only a subset or a subgroup of the population and, by definition, is always smaller than the population. However, well-selected samples accurately represent the entire population.

Below are some examples to illustrate the differences between population vs sample:

The sample a researcher choses from any population will depend on their research goals and objectives. For example, if you’re researching employees in a large corporation, you may be interested in C-level executives, junior-level employees, or even external contractors.

What are the differences between population and sample?

Below are the main differences between a population and a sample, as pointed out by Sharma:

What are some reasons for sampling?

Collecting data from an entire population isn’t always possible. “In fact,” explains Sharma, “99 percent of the time, we can’t survey the entire population. Other times, it is not even necessary.

“A representative sample drawn using appropriate sampling techniques will provide results that are representative of the entire population. So, it would be unnecessary to survey every member of the population.”

Below are the other most important reasons for using sampling.

Population studies are more expensive than sample surveys. For example, researching the entire population of adult male Americans would be too costly. It’s more cost-effective to work with a representative sample.

2. Practicality

Consider the adult male American research example. Even if a researcher had the resources to survey all the males in that population, it may be difficult or impossible to obtain responses from all participants. For example, the researchers may not even be able to contact all members of this population.

3. Manageability

It’s easier to manage time, costs, and resources when working with samples. Also, it’s easier to manage the data you collect from a sample vs a population. For example, it’s easier to analyze data from a sample of 1,000 adult males than a sample of all adult males in the U.S. or even a specific state.

How can you collect data from a population?

Collecting data from an entire population requires a census. A census is a collection of information from all sections of the population. It’s a complete enumeration of the population, and it requires considerable resources, which is why researchers often work with a sample.

If the target population is small, however, then you can collect data from every member of the population. For example, you can survey the performance of the members of the customer service team in a bank branch. The number is likely to be more manageable, so you can access and collect data from this population.

What methods can you use to collect data from a sample?

There are so many approaches for collecting data from samples. Some of the more commonly used methods are listed below.

1. Simple random sampling

In simple random sampling, researchers select individuals at random from the population. In this method, every member of the population has an equal chance of being selected.

For example, suppose you want to select a sample of 50 employees from a population of 500 employees. You could write down all the names of the employees, place them in a hat or container, and pick employee names at random like you would in a lottery. That’s an example of simple random sampling. It works best when the population isn’t too large.

2. Systematic sampling

This is a sampling technique that selects every k th item from the population. It’s a type of probability sampling researchers use to select items from a population randomly. A researcher may want to use this technique if they’re working with a large population and need to sample only a small number of items in order to study them in detail.

For example, to apply systematic sampling in a performance survey of 1,000 customer service team members, we can choose every fifth member — i.e., the fifth, 10th, 15th customer service rep, and so on.

For more details on  what is systematic sampling , check out our guide

3. Stratified sampling

In this probability sampling method, researchers divide members of the population into groups based on age, race, ethnicity, or sex. Researchers select individuals randomly from those groups to form a sample. This ensures that every group is equally represented.

What is a sampling error?

A sampling error is the difference between the value obtained from a sample and the true population value. It’s the difference between an estimate from a sample and the true population value.

A sampling error can occur if you don’t have enough people in your sample or if you select people who aren’t representative of the population. This can impact the accuracy of your survey. For example, if you want to know what percentage of adults are vegetarian but only ask vegetarians in a specific city, then this would be an example of selecting people who aren’t representative of the population.

According to Sharma, you can reduce sampling errors by increasing the sample size . He also notes that sample design and variation within a population affect sampling errors.

How can Jotform make the research process easier?

Whether you’re surveying a small or large sample or even an entire population, Jotform gives you the right tools to make your research easier. With Jotform’s free online survey maker, you can create engaging surveys and collect responses online. You can easily customize any of our 10,000-plus free survey templates to suit your research purposes. Get started with Jotform today.

Photo by Stanley Dai on Unsplash

Thank you for helping improve the Jotform Blog. 🎉

RECOMMENDED ARTICLES

Data Collection Methods

5 of the top data analytics tools for your business

When to use focus groups vs surveys

When to use focus groups vs surveys

What is a double-barreled question, and how do you avoid it?

What is a double-barreled question, and how do you avoid it?

Why is data important to your business?

Why is data important to your business?

The 12 best Jotform integrations for managing collected data

The 12 best Jotform integrations for managing collected data

Understanding manual data entry

Understanding manual data entry

What are focus groups, and how do you conduct them?

What are focus groups, and how do you conduct them?

10 of the best data analysis tools

10 of the best data analysis tools

How to use the questionnaire method of data collection

How to use the questionnaire method of data collection

How to create a fillable form in Microsoft Word

How to create a fillable form in Microsoft Word

Qualitative data-collection methods

Qualitative data-collection methods

A guide on primary and secondary data-collection methods

A guide on primary and secondary data-collection methods

River sampling in market research: Definitions and examples

River sampling in market research: Definitions and examples

Types of sampling methods

Types of sampling methods

How to get started with business data collection

How to get started with business data collection

What is purposive sampling? An introduction

What is purposive sampling? An introduction

Quantitative data-collection methods

Quantitative data-collection methods

11 best voice recording software options

11 best voice recording software options

Benefits of data-collection: What makes a good data-collection form?

Benefits of data-collection: What makes a good data-collection form?

A comprehensive guide to types of research

A comprehensive guide to types of research

Qualitative vs quantitative data

Qualitative vs quantitative data

The 5 best data collection tools of 2024

The 5 best data collection tools of 2024

How small businesses can solve data-collection challenges

How small businesses can solve data-collection challenges

What is systematic sampling?

What is systematic sampling?

How to conduct an oral history interview

How to conduct an oral history interview

Automated data entry for optimized workflows

Automated data entry for optimized workflows

How to be GDPR compliant while collecting data

How to be GDPR compliant while collecting data

Send Comment :

 width=

Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, automatically generate references for free.

  • Knowledge Base
  • Methodology
  • Population vs Sample | Definitions, Differences & Examples

Population vs Sample | Definitions, Differences & Examples

Published on 3 May 2022 by Pritha Bhandari . Revised on 5 December 2022.

Population vs sample

A population is the entire group that you want to draw conclusions about.

A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.

In research, a population doesn’t always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organisations, countries, species, or organisms.

Table of contents

Collecting data from a population, collecting data from a sample, population parameter vs sample statistic, practice questions: populations vs samples, frequently asked questions about samples and populations.

Populations are used when your research question requires, or when you have access to, data from every member of the population.

Usually, it is only straightforward to collect data from a whole population when it is small, accessible and cooperative.

For larger and more dispersed populations, it is often difficult or impossible to collect data from every individual. For example, every 10 years, the federal US government aims to count every person living in the country using the US Census. This data is used to distribute funding across the nation.

However, historically, marginalised and low-income groups have been difficult to contact, locate, and encourage participation from. Because of non-responses, the population count is incomplete and biased towards some groups, which results in disproportionate funding across the country.

In cases like this, sampling can be used to make more precise inferences about the population.

Prevent plagiarism, run a free check.

When your population is large in size, geographically dispersed, or difficult to contact, it’s necessary to use a sample. With statistical analysis , you can use sample data to make estimates or test hypotheses about population data.

Ideally, a sample should be randomly selected and representative of the population. Using probability sampling methods (such as simple random sampling or stratified sampling ) reduces the risk of sampling bias and enhances both internal and external validity .

For practical reasons, researchers often use non-probability sampling methods . Non-probability samples are chosen for specific criteria; they may be more convenient or cheaper to access. Because of non-random selection methods, any statistical inferences about the broader population will be weaker than with a probability sample.

Reasons for sampling

  • Necessity : Sometimes it’s simply not possible to study the whole population due to its size or inaccessibility.
  • Practicality : It’s easier and more efficient to collect data from a sample.
  • Cost-effectiveness : There are fewer participant, laboratory, equipment, and researcher costs involved.
  • Manageability : Storing and running statistical analyses on smaller datasets is easier and reliable.

When you collect data from a population or a sample, there are various measurements and numbers you can calculate from the data. A parameter is a measure that describes the whole population. A statistic is a measure that describes the sample.

You can use estimation or hypothesis testing to estimate how likely it is that a sample statistic differs from the population parameter.

Sampling error

A sampling error is the difference between a population parameter and a sample statistic. In your study, the sampling error is the difference between the mean political attitude rating of your sample and the true mean political attitude rating of all undergraduate students in the Netherlands.

Sampling errors happen even when you use a randomly selected sample. This is because random samples are not identical to the population in terms of numerical measures like means and standard deviations .

Because the aim of scientific research is to generalise findings from the sample to the population, you want the sampling error to be low. You can reduce sampling error by increasing the sample size.

Samples are used to make inferences about populations . Samples are easier to collect data from because they are practical, cost-effective, convenient, and manageable.

Populations are used when a research question requires data from every member of the population. This is usually only feasible when the population is small and easily accessible.

A statistic refers to measures about the sample , while a parameter refers to measures about the population .

A sampling error is the difference between a population parameter and a sample statistic .

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the ‘Cite this Scribbr article’ button to automatically add the citation to our free Reference Generator.

Bhandari, P. (2022, December 05). Population vs Sample | Definitions, Differences & Examples. Scribbr. Retrieved 31 May 2024, from https://www.scribbr.co.uk/research-methods/population-versus-sample/

Is this article helpful?

Pritha Bhandari

Pritha Bhandari

Other students also liked, sampling methods | types, techniques, & examples, a quick guide to experimental design | 5 steps & examples, what is quantitative research | definition & methods.

Library homepage

  • school Campus Bookshelves
  • menu_book Bookshelves
  • perm_media Learning Objects
  • login Login
  • how_to_reg Request Instructor Account
  • hub Instructor Commons

Margin Size

  • Download Page (PDF)
  • Download Full Book (PDF)
  • Periodic Table
  • Physics Constants
  • Scientific Calculator
  • Reference & Cite
  • Tools expand_more
  • Readability

selected template will load here

This action is not available.

Statistics LibreTexts

1.1.5.1: Collecting Data- More Practice with Populations and Samples

  • Last updated
  • Save as PDF
  • Page ID 22003

  • Foster et al.
  • University of Missouri-St. Louis, Rice University, & University of Houston, Downtown Campus via University of Missouri’s Affordable and Open Access Educational Resources Initiative

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}}      % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}}      % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

We are usually interested in understanding a specific group of people. This group is known as the population of interest, or simply the population. The population is the collection of all people who have some characteristic in common; it can be as broad as “all people” if we have a very general research question about human psychology, or it can be extremely narrow, such as “all freshmen psychology majors at Midwestern public universities” if we have a specific group in mind.

Populations and samples

In statistics, we often rely on a sample --- that is, a small subset of a larger set of data --- to draw inferences about the larger set. The larger set is known as the population from which the sample is drawn.

Example \(\PageIndex{1}\)

You have been hired by the National Election Commission to examine how the American people feel about the fairness of the voting procedures in the U.S. Who will you ask?

It is not practical to ask every single American how he or she feels about the fairness of the voting procedures. Instead, we query a relatively small number of Americans, and draw inferences about the entire country from their responses. The Americans actually queried constitute our sample of the larger population of all Americans.

A sample is typically a small subset of the population. In the case of voting attitudes, we would sample a few thousand Americans drawn from the hundreds of millions that make up the country. In choosing a sample, it is therefore crucial that it not over-represent one kind of citizen at the expense of others. For example, something would be wrong with our sample if it happened to be made up entirely of Florida residents. If the sample held only Floridians, it could not be used to infer the attitudes of other Americans. The same problem would arise if the sample were comprised only of Republicans. Inferences from statistics are based on the assumption that sampling is representative of the population. If the sample is not representative, then the possibility of sampling bias occurs. Sampling bias means that our conclusions apply only to our sample and are not generalizable to the full population.

Example \(\PageIndex{2}\)

We are interested in examining how many math classes have been taken on average by current graduating seniors at American colleges and universities during their four years in school.

Whereas our population in the last example included all US citizens, now it involves just the graduating seniors throughout the country. This is still a large set since there are thousands of colleges and universities, each enrolling many students. (New York University, for example, enrolls 48,000 students.) It would be prohibitively costly to examine the transcript of every college senior. We therefore take a sample of college seniors and then make inferences to the entire population based on what we find. To make the sample, we might first choose some public and private colleges and universities across the United States. Then we might sample 50 students from each of these institutions. Suppose that the average number of math classes taken by the people in our sample were 3.2. Then we might speculate that 3.2 approximates the number we would find if we had the resources to examine every senior in the entire population. But we must be careful about the possibility that our sample is non-representative of the population. Perhaps we chose an overabundance of math majors, or chose too many technical institutions that have heavy math requirements. Such bad sampling makes our sample unrepresentative of the population of all seniors.

To solidify your understanding of sampling bias, consider the following example. Try to identify the population and the sample, and then reflect on whether the sample is likely to yield the information desired.

Example \(\PageIndex{3}\)

A substitute teacher wants to know how students in the class did on their last test. The teacher asks the 10 students sitting in the front row to state their latest test score. He concludes from their report that the class did extremely well. What is the sample? What is the population? Can you identify any problems with choosing the sample in the way that the teacher did?

The population consists of all students in the class. The sample is made up of just the 10 students sitting in the front row. The sample is not likely to be representative of the population. Those who sit in the front row tend to be more interested in the class and tend to perform higher on tests. Hence, the sample may perform at a higher level than the population.

Example \(\PageIndex{4}\)

A coach is interested in how many cartwheels the average college freshmen at his university can do. Eight volunteers from the freshman class step forward. After observing their performance, the coach concludes that college freshmen can do an average of 16 cartwheels in a row without stopping.

The population is the class of all freshmen at the coach's university. The sample is composed of the 8 volunteers. The sample is poorly chosen because volunteers are more likely to be able to do cartwheels than the average freshman; people who can't do cartwheels probably did not volunteer! In the example, we are also not told of the gender of the volunteers. Were they all women, for example? That might affect the outcome, contributing to the non-representative nature of the sample (if the school is co-ed).

Simple Random Sampling

Researchers adopt a variety of sampling strategies. The most straightforward is simple random sampling. Such sampling requires every member of the population to have an equal chance of being selected into the sample. In addition, the selection of one member must be independent of the selection of every other member. That is, picking one member from the population must not increase or decrease the probability of picking any other member (relative to the others). In this sense, we can say that simple random sampling chooses a sample by pure chance. To check your understanding of simple random sampling, consider the following example. What is the population? What is the sample? Was the sample picked by simple random sampling? Is it biased?

Example \(\PageIndex{5}\)

A research scientist is interested in studying the experiences of twins raised together versus those raised apart. She obtains a list of twins from the National Twin Registry, and selects two subsets of individuals for her study. First, she chooses all those in the registry whose last name begins with Z. Then she turns to all those whose last name begins with B. Because there are so many names that start with B, however, our researcher decides to incorporate only every other name into her sample. Finally, she mails out a survey and compares characteristics of twins raised apart versus together.

The population consists of all twins recorded in the National Twin Registry. It is important that the researcher only make statistical generalizations to the twins on this list, not to all twins in the nation or world. That is, the National Twin Registry may not be representative of all twins. Even if inferences are limited to the Registry, a number of problems affect the sampling procedure we described. For instance, choosing only twins whose last names begin with Z does not give every individual an equal chance of being selected into the sample. Moreover, such a procedure risks over-representing ethnic groups with many surnames that begin with Z. There are other reasons why choosing just the Z's may bias the sample. Perhaps such people are more patient than average because they often find themselves at the end of the line! The same problem occurs with choosing twins whose last name begins with B. An additional problem for the B's is that the “every-other-one” procedure disallowed adjacent names on the B part of the list from being both selected. Just this defect alone means the sample was not formed through simple random sampling.

Sample size matters

Recall that the definition of a random sample is a sample in which every member of the population has an equal chance of being selected. This means that the sampling procedure rather than the results of the procedure define what it means for a sample to be random. Random samples, especially if the sample size is small, are not necessarily representative of the entire population. For example, if a random sample of 20 subjects were taken from a population with an equal number of males and females, there would be a nontrivial probability (0.06) that 70% or more of the sample would be female. Such a sample would not be representative, although it would be drawn randomly. Only a large sample size makes it likely that our sample is close to representative of the population. For this reason, inferential statistics take into account the sample size when generalizing results from samples to populations. In later chapters, you'll see what kinds of mathematical techniques ensure this sensitivity to sample size.

More complex sampling

Sometimes it is not feasible to build a sample using simple random sampling. To see the problem, consider the fact that both Dallas and Houston are competing to be hosts of the 2012 Olympics. Imagine that you are hired to assess whether most Texans prefer Houston to Dallas as the host, or the reverse. Given the impracticality of obtaining the opinion of every single Texan, you must construct a sample of the Texas population. But now notice how difficult it would be to proceed by simple random sampling. For example, how will you contact those individuals who don’t vote and don’t have a phone? Even among people you find in the telephone book, how can you identify those who have just relocated to California (and had no reason to inform you of their move)? What do you do about the fact that since the beginning of the study, an additional 4,212 people took up residence in the state of Texas? As you can see, it is sometimes very difficult to develop a truly random procedure. For this reason, other kinds of sampling techniques have been devised. We now discuss two of them.

Stratified Sampling

Since simple random sampling often does not ensure a representative sample, a sampling method called stratified random sampling is sometimes used to make the sample more representative of the population. This method can be used if the population has a number of distinct “strata” or groups. In stratified sampling, you first identify members of your sample who belong to each group. Then you randomly sample from each of those subgroups in such a way that the sizes of the subgroups in the sample are proportional to their sizes in the population. Let's take an example: Suppose you were interested in views of capital punishment at an urban university. You have the time and resources to interview 200 students. The student body is diverse with respect to age; many older people work during the day and enroll in night courses (average age is 39), while younger students generally enroll in day classes (average age of 19). It is possible that night students have different views about capital punishment than day students. If 70% of the students were day students, it makes sense to ensure that 70% of the sample consisted of day students. Thus, your sample of 200 students would consist of 140 day students and 60 night students. The proportion of day students in the sample and in the population (the entire university) would be the same. Inferences to the entire population of students at the university would therefore be more secure.

Convenience Sampling

Not all sampling methods are perfect, and sometimes that’s okay. For example, if we are beginning research into a completely unstudied area, we may sometimes take some shortcuts to quickly gather data and get a general idea of how things work before fully investing a lot of time and money into well-designed research projects with proper sampling. This is known as convenience sampling, named for its ease of use. In limited cases, such as the one just described, convenience sampling is okay because we intend to follow up with a representative sample. Unfortunately, sometimes convenience sampling is used due only to its convenience without the intent of improving on it in future work.

  • Privacy Policy

Research Method

Home » Sampling Methods – Types, Techniques and Examples

Sampling Methods – Types, Techniques and Examples

Table of Contents

Sampling Methods

Sampling refers to the process of selecting a subset of data from a larger population or dataset in order to analyze or make inferences about the whole population.

In other words, sampling involves taking a representative sample of data from a larger group or dataset in order to gain insights or draw conclusions about the entire group.

Sampling Methods

Sampling methods refer to the techniques used to select a subset of individuals or units from a larger population for the purpose of conducting statistical analysis or research.

Sampling is an essential part of the Research because it allows researchers to draw conclusions about a population without having to collect data from every member of that population, which can be time-consuming, expensive, or even impossible.

Types of Sampling Methods

Sampling can be broadly categorized into two main categories:

Probability Sampling

This type of sampling is based on the principles of random selection, and it involves selecting samples in a way that every member of the population has an equal chance of being included in the sample.. Probability sampling is commonly used in scientific research and statistical analysis, as it provides a representative sample that can be generalized to the larger population.

Type of Probability Sampling :

  • Simple Random Sampling: In this method, every member of the population has an equal chance of being selected for the sample. This can be done using a random number generator or by drawing names out of a hat, for example.
  • Systematic Sampling: In this method, the population is first divided into a list or sequence, and then every nth member is selected for the sample. For example, if every 10th person is selected from a list of 100 people, the sample would include 10 people.
  • Stratified Sampling: In this method, the population is divided into subgroups or strata based on certain characteristics, and then a random sample is taken from each stratum. This is often used to ensure that the sample is representative of the population as a whole.
  • Cluster Sampling: In this method, the population is divided into clusters or groups, and then a random sample of clusters is selected. Then, all members of the selected clusters are included in the sample.
  • Multi-Stage Sampling : This method combines two or more sampling techniques. For example, a researcher may use stratified sampling to select clusters, and then use simple random sampling to select members within each cluster.

Non-probability Sampling

This type of sampling does not rely on random selection, and it involves selecting samples in a way that does not give every member of the population an equal chance of being included in the sample. Non-probability sampling is often used in qualitative research, where the aim is not to generalize findings to a larger population, but to gain an in-depth understanding of a particular phenomenon or group. Non-probability sampling methods can be quicker and more cost-effective than probability sampling methods, but they may also be subject to bias and may not be representative of the larger population.

Types of Non-probability Sampling :

  • Convenience Sampling: In this method, participants are chosen based on their availability or willingness to participate. This method is easy and convenient but may not be representative of the population.
  • Purposive Sampling: In this method, participants are selected based on specific criteria, such as their expertise or knowledge on a particular topic. This method is often used in qualitative research, but may not be representative of the population.
  • Snowball Sampling: In this method, participants are recruited through referrals from other participants. This method is often used when the population is hard to reach, but may not be representative of the population.
  • Quota Sampling: In this method, a predetermined number of participants are selected based on specific criteria, such as age or gender. This method is often used in market research, but may not be representative of the population.
  • Volunteer Sampling: In this method, participants volunteer to participate in the study. This method is often used in research where participants are motivated by personal interest or altruism, but may not be representative of the population.

Applications of Sampling Methods

Applications of Sampling Methods from different fields:

  • Psychology : Sampling methods are used in psychology research to study various aspects of human behavior and mental processes. For example, researchers may use stratified sampling to select a sample of participants that is representative of the population based on factors such as age, gender, and ethnicity. Random sampling may also be used to select participants for experimental studies.
  • Sociology : Sampling methods are commonly used in sociological research to study social phenomena and relationships between individuals and groups. For example, researchers may use cluster sampling to select a sample of neighborhoods to study the effects of economic inequality on health outcomes. Stratified sampling may also be used to select a sample of participants that is representative of the population based on factors such as income, education, and occupation.
  • Social sciences: Sampling methods are commonly used in social sciences to study human behavior and attitudes. For example, researchers may use stratified sampling to select a sample of participants that is representative of the population based on factors such as age, gender, and income.
  • Marketing : Sampling methods are used in marketing research to collect data on consumer preferences, behavior, and attitudes. For example, researchers may use random sampling to select a sample of consumers to participate in a survey about a new product.
  • Healthcare : Sampling methods are used in healthcare research to study the prevalence of diseases and risk factors, and to evaluate interventions. For example, researchers may use cluster sampling to select a sample of health clinics to participate in a study of the effectiveness of a new treatment.
  • Environmental science: Sampling methods are used in environmental science to collect data on environmental variables such as water quality, air pollution, and soil composition. For example, researchers may use systematic sampling to collect soil samples at regular intervals across a field.
  • Education : Sampling methods are used in education research to study student learning and achievement. For example, researchers may use stratified sampling to select a sample of schools that is representative of the population based on factors such as demographics and academic performance.

Examples of Sampling Methods

Probability Sampling Methods Examples:

  • Simple random sampling Example : A researcher randomly selects participants from the population using a random number generator or drawing names from a hat.
  • Stratified random sampling Example : A researcher divides the population into subgroups (strata) based on a characteristic of interest (e.g. age or income) and then randomly selects participants from each subgroup.
  • Systematic sampling Example : A researcher selects participants at regular intervals from a list of the population.

Non-probability Sampling Methods Examples:

  • Convenience sampling Example: A researcher selects participants who are conveniently available, such as students in a particular class or visitors to a shopping mall.
  • Purposive sampling Example : A researcher selects participants who meet specific criteria, such as individuals who have been diagnosed with a particular medical condition.
  • Snowball sampling Example : A researcher selects participants who are referred to them by other participants, such as friends or acquaintances.

How to Conduct Sampling Methods

some general steps to conduct sampling methods:

  • Define the population: Identify the population of interest and clearly define its boundaries.
  • Choose the sampling method: Select an appropriate sampling method based on the research question, characteristics of the population, and available resources.
  • Determine the sample size: Determine the desired sample size based on statistical considerations such as margin of error, confidence level, or power analysis.
  • Create a sampling frame: Develop a list of all individuals or elements in the population from which the sample will be drawn. The sampling frame should be comprehensive, accurate, and up-to-date.
  • Select the sample: Use the chosen sampling method to select the sample from the sampling frame. The sample should be selected randomly, or if using a non-random method, every effort should be made to minimize bias and ensure that the sample is representative of the population.
  • Collect data: Once the sample has been selected, collect data from each member of the sample using appropriate research methods (e.g., surveys, interviews, observations).
  • Analyze the data: Analyze the data collected from the sample to draw conclusions about the population of interest.

When to use Sampling Methods

Sampling methods are used in research when it is not feasible or practical to study the entire population of interest. Sampling allows researchers to study a smaller group of individuals, known as a sample, and use the findings from the sample to make inferences about the larger population.

Sampling methods are particularly useful when:

  • The population of interest is too large to study in its entirety.
  • The cost and time required to study the entire population are prohibitive.
  • The population is geographically dispersed or difficult to access.
  • The research question requires specialized or hard-to-find individuals.
  • The data collected is quantitative and statistical analyses are used to draw conclusions.

Purpose of Sampling Methods

The main purpose of sampling methods in research is to obtain a representative sample of individuals or elements from a larger population of interest, in order to make inferences about the population as a whole. By studying a smaller group of individuals, known as a sample, researchers can gather information about the population that would be difficult or impossible to obtain from studying the entire population.

Sampling methods allow researchers to:

  • Study a smaller, more manageable group of individuals, which is typically less time-consuming and less expensive than studying the entire population.
  • Reduce the potential for data collection errors and improve the accuracy of the results by minimizing sampling bias.
  • Make inferences about the larger population with a certain degree of confidence, using statistical analyses of the data collected from the sample.
  • Improve the generalizability and external validity of the findings by ensuring that the sample is representative of the population of interest.

Characteristics of Sampling Methods

Here are some characteristics of sampling methods:

  • Randomness : Probability sampling methods are based on random selection, meaning that every member of the population has an equal chance of being selected. This helps to minimize bias and ensure that the sample is representative of the population.
  • Representativeness : The goal of sampling is to obtain a sample that is representative of the larger population of interest. This means that the sample should reflect the characteristics of the population in terms of key demographic, behavioral, or other relevant variables.
  • Size : The size of the sample should be large enough to provide sufficient statistical power for the research question at hand. The sample size should also be appropriate for the chosen sampling method and the level of precision desired.
  • Efficiency : Sampling methods should be efficient in terms of time, cost, and resources required. The method chosen should be feasible given the available resources and time constraints.
  • Bias : Sampling methods should aim to minimize bias and ensure that the sample is representative of the population of interest. Bias can be introduced through non-random selection or non-response, and can affect the validity and generalizability of the findings.
  • Precision : Sampling methods should be precise in terms of providing estimates of the population parameters of interest. Precision is influenced by sample size, sampling method, and level of variability in the population.
  • Validity : The validity of the sampling method is important for ensuring that the results obtained from the sample are accurate and can be generalized to the population of interest. Validity can be affected by sampling method, sample size, and the representativeness of the sample.

Advantages of Sampling Methods

Sampling methods have several advantages, including:

  • Cost-Effective : Sampling methods are often much cheaper and less time-consuming than studying an entire population. By studying only a small subset of the population, researchers can gather valuable data without incurring the costs associated with studying the entire population.
  • Convenience : Sampling methods are often more convenient than studying an entire population. For example, if a researcher wants to study the eating habits of people in a city, it would be very difficult and time-consuming to study every single person in the city. By using sampling methods, the researcher can obtain data from a smaller subset of people, making the study more feasible.
  • Accuracy: When done correctly, sampling methods can be very accurate. By using appropriate sampling techniques, researchers can obtain a sample that is representative of the entire population. This allows them to make accurate generalizations about the population as a whole based on the data collected from the sample.
  • Time-Saving: Sampling methods can save a lot of time compared to studying the entire population. By studying a smaller sample, researchers can collect data much more quickly than they could if they studied every single person in the population.
  • Less Bias : Sampling methods can reduce bias in a study. If a researcher were to study the entire population, it would be very difficult to eliminate all sources of bias. However, by using appropriate sampling techniques, researchers can reduce bias and obtain a sample that is more representative of the entire population.

Limitations of Sampling Methods

  • Sampling Error : Sampling error is the difference between the sample statistic and the population parameter. It is the result of selecting a sample rather than the entire population. The larger the sample, the lower the sampling error. However, no matter how large the sample size, there will always be some degree of sampling error.
  • Selection Bias: Selection bias occurs when the sample is not representative of the population. This can happen if the sample is not selected randomly or if some groups are underrepresented in the sample. Selection bias can lead to inaccurate conclusions about the population.
  • Non-response Bias : Non-response bias occurs when some members of the sample do not respond to the survey or study. This can result in a biased sample if the non-respondents differ from the respondents in important ways.
  • Time and Cost : While sampling can be cost-effective, it can still be expensive and time-consuming to select a sample that is representative of the population. Depending on the sampling method used, it may take a long time to obtain a sample that is large enough and representative enough to be useful.
  • Limited Information : Sampling can only provide information about the variables that are measured. It may not provide information about other variables that are relevant to the research question but were not measured.
  • Generalization : The extent to which the findings from a sample can be generalized to the population depends on the representativeness of the sample. If the sample is not representative of the population, it may not be possible to generalize the findings to the population as a whole.

About the author

' src=

Muhammad Hassan

Researcher, Academic Writer, Web developer

You may also like

Probability Sampling

Probability Sampling – Methods, Types and...

Quota Sampling

Quota Sampling – Types, Methods and Examples

Simple Random Sampling

Simple Random Sampling – Types, Method and...

Convenience Sampling

Convenience Sampling – Method, Types and Examples

Purposive Sampling

Purposive Sampling – Methods, Types and Examples

Systematic Sampling

Systematic Sampling – Types, Method and Examples

  • Math Article
  • Population And Sample

Population and Sample

Class Registration Banner

In statistics as well as in quantitative methodology, the set of data are collected and selected from a statistical population with the help of some defined procedures. There are two different types of data sets namely, population and sample . So basically when we calculate the mean deviation, variance and standard deviation , it is necessary for us to know if we are referring to the entire population or to only sample data. Suppose the size of the population is denoted by ‘n’ then the sample size of that population is denoted by n -1. Let us take a look of population data sets and sample data sets in detail.

It includes all the elements from the data set and measurable characteristics of the population such as mean and standard deviation are known as a parameter . For example, All people living in India indicates the population of India.

There are different types of population. They are:

Finite Population

Infinite population, existent population, hypothetical population.

Let us discuss all the types one by one.

The finite population is also known as a countable population in which the population can be counted. In other words, it is defined as the population of all the individuals or objects that are finite. For statistical analysis, the finite population is more advantageous than the infinite population. Examples of finite populations are employees of a company, potential consumer in a market.

The infinite population is also known as an uncountable population in which the counting of units in the population is not possible. Example of an infinite population is the number of germs in the patient’s body is uncountable.

The existing population is defined as the population of concrete individuals. In other words, the population whose unit is available in solid form is known as existent population. Examples are books, students etc.

The population in which whose unit is not available in solid form is known as the hypothetical population. A population consists of sets of observations, objects etc that are all something in common. In some situations, the populations are only hypothetical. Examples are an outcome of rolling the dice, the outcome of tossing a coin.

Also, read:

It includes one or more observations that are drawn from the population and the measurable characteristic of a sample is a statistic. Sampling is the process of selecting the sample from the population. For example, some people living in India is the sample of the population.

Basically, there are two types of sampling. They are:

  • Probability sampling
  • Non-probability sampling

Probability Sampling

In probability sampling, the population units cannot be selected at the discretion of the researcher. This can be dealt with following certain procedures which will ensure that every unit of the population consists of one fixed probability being included in the sample. Such a method is also called random sampling. Some of the techniques used for probability sampling are:

  • Simple random sampling
  • Cluster sampling
  • Stratified Sampling
  • Disproportionate sampling
  • Proportionate sampling
  • Optimum allocation stratified sampling
  • Multi-stage sampling

Non Probability Sampling

In non-probability sampling, the population units can be selected at the discretion of the researcher. Those samples will use the human judgements for selecting units and has no theoretical basis for estimating the characteristics of the population. Some of the techniques used for non-probability sampling are

  • Quota sampling
  • Judgement sampling
  • Purposive sampling

Population and Sample Examples

  • All the people who have the ID proofs is the population and a group of people who only have voter id with them is the sample.
  • All the students in the class are population whereas the top 10 students in the class are the sample.
  • All the members of the parliament is population and the female candidates present there is the sample.

Population and Sample Formulas

We will demonstrate here the formulas for mean absolute deviation (MAD), variance and standard deviation based on population and given sample. Suppose  n denotes the size of the population and n-1 denotes the sample size, then the formulas for mean absolute deviation, variance and standard deviation are given by;

Difference between Population and Sample

Some of the key differences between population and sample are clearly given below:

For more information on statistics, register with BYJU’S – The Learning App and also watch some interesting videos to learn with ease.

Quiz Image

Put your understanding of this concept to test by answering a few MCQs. Click ‘Start Quiz’ to begin!

Select the correct answer and click on the “Finish” button Check your score and answers at the end of the quiz

Visit BYJU’S for all Maths related queries and study materials

Your result is as below

Request OTP on Voice Call

Leave a Comment Cancel reply

Your Mobile number and Email id will not be published. Required fields are marked *

Post My Comment

what is population and sample in research with example

What is the word that defines the results of the population that is not the average population? For example, you have the average height of a group of males as 6’8 but when you look closer, the group of males are all professional basket ball players so it’s not really an average of all males.

It would be – Population standard deviation. Because standard deviation gives the measure for a group that actually spread out from the average(mean).

In case, if it is from Sample then the term used is – Sample Standard Deviation.

what is population and sample in research with example

Register with BYJU'S & Download Free PDFs

Register with byju's & watch live videos.

  • Open access
  • Published: 26 May 2024

The sense of coherence scale: psychometric properties in a representative sample of the Czech adult population

  • Martin Tušl 1 ,
  • Ivana Šípová 2 ,
  • Martin Máčel 2 ,
  • Kristýna Cetkovská 2 &
  • Georg F. Bauer 1  

BMC Psychology volume  12 , Article number:  293 ( 2024 ) Cite this article

279 Accesses

Metrics details

Sense of coherence (SOC) is a personal resource that reflects the extent to which one perceives the world as comprehensible, manageable, and meaningful. Decades of empirical research consistently show that SOC is an important protective resource for health and well-being. Despite the extensive use of the 13-item measure of SOC, there remains uncertainty regarding its factorial structure. Additionally, a valid and reliable Czech version of the scale is lacking. Therefore, the present study aims to examine the psychometric properties of the SOC-13 scale in a representative sample of Czech adults.

An online survey was completed by 498 Czech adults (18–86 years old) between November 2021 and December 2021. We used confirmatory factor analysis to examine the factorial structure of the scale. Further, we examined the variations in SOC based on age and gender, and we tested the criterion validity of the scale using the short form of the Mental Health Continuum (MHC) scale and the Generalized Anxiety Disorder (GAD) scale as mental health outcomes.

SOC-13 showed an acceptable one- and three-factor fit only with specified residual covariance between items 2 and 3. We tested alternative short versions by systematically removing poorly performing items. The fit significantly improved for all shorter versions with SOC-9 having the best psychometric properties with a clear one-factorialstructure. We found that SOC increases with age and males score higher than females. SOC showed a moderately strong positive correlation with MHC, and a moderately strong negative correlation with GAD. These findings were similar for all tested versions supporting the criterion validity of the SOC scale.

Our findings suggest that shortened versions of the SOC-13 scale have better psychometric properties than the original 13-item version in the Czech adult population. Particularly, SOC-9 emerges as a viable alternative, showing comparable reliability and validity as the 13-item version and a clear one-factorial structure in our sample.

Peer Review reports

Sense of coherence (SOC) was introduced by the sociologist Aaron Antonovsky as the main pillar of his salutogenic theory, which explains how individuals cope with stressors and stay healthy even in case of adverse life situations [ 1 ]. SOC is a personal resource defined as a global orientation to life determining the degree to which one perceives life as comprehensible, manageable, and meaningful [ 2 ]. A strong SOC enables individuals to cope with stressors and manage tension, thus moving to the ease-end of the ease/disease continuum [ 2 , 3 ]. A person’s strength of SOC can be measured with the Orientation to Life Questionnaire commonly referred to as the SOC scale [ 4 ]. The original version is composed of 29 items (SOC-29) and Antonovsky recommended 13 items for the short version of the scale (SOC-13). To date, both versions of the scale have been used across diverse populations in at least 51 languages and 51 countries [ 5 ]. Studies have consistently shown that SOC correlates strongly with different health and well-being outcomes [ 6 , 7 ] and quality of life measures [ 8 ]. In the context of the recent COVID-19 pandemic, SOC has been identified as the most important protective resource in relation to mental health [ 9 ]. Regarding individual differences, SOC has been shown to strengthen over the life course [ 10 ], males usually score higher than females [ 11 ], and some studies indicate that SOC increases with the level of education [ 12 ]. However, despite the extensive evidence on the criterion validity of the scale, there is still a lack of clarity about its underlying factor structure and dimensionality.

The SOC scale was conceptualized as unidimensional suggesting that SOC in its totality, as a global orientation, influences the movement along the ease/dis-ease continuum [ 2 ]. However, the structure of the scale is rather multidimensional as each item is composed of multiple elements. Antonovsky developed the scale according to the facet theory [ 13 , 14 ] which assumes that social phenomena are best understood when they are seen as multidimensional. Facet theory involves the construction of a mapping sentence which consists of the facets and the sentence linking the facets together [ 15 ]. The SOC scale is composed of five facets: (i) the response mode (comprehensibility, manageability, meaningfulness); (ii) the modality of stimulus (instrumental, cognitive, affective), (iii) its source (internal, external, both), (iv) the nature of the demand it poses (concrete, diffuse, affective), (v) and its time reference (past, present, future). For example, item 3 “Has it happened that people whom you counted on disappointed you?” is a manageability item that can be described with the mapping sentence as follows: "Respondent X responds to an instrumental stimulus (“counted on”), which originated from the external environment (“people”), and which poses a diffuse demand (“disappointed”) being in the past (“has it happened”)." Although each item can be categorized along the SOC component comprehensibility, manageability, or meaningfulness, the items also share elements from the other four facets with items within the same, but also within the other SOC components (see 2, Chap. 4 for details). As Antonovsky states [ 2 , p. 87]: “The SOC facet pulls the items apart; the other facets push them together.”

Thus, the multi-facet nature of the scale can create difficulties in identifying the three theorized SOC components using statistical methods such as factor analysis. In fact, both the unidimensional and the three-dimensional SOC-13 rarely yield an acceptable fit without specifying residual covariance between single items (see 5 for an overview). This has been further exemplified in a recent study which examined the dimensionality of SOC-13 using a network perspective. The authors were unable to identify a clear structure and concluded that SOC is composed of multiple elements that are deeply linked and not necessarily distinct [ 16 ]. As a result, several researchers have suggested modified [ 17 ] or abbreviated versions of the scale, such as SOC-12 [ 18 , 19 ], SOC-11 [ 20 , 21 , 22 ], or SOC-9 [ 23 ], which have empirically shown a better factorial structure. This prompts the general question, whether an alternative short version should be preferred over the 13-item version. In fact, looking into the original literature [ 2 ], it is not clear why Antonovsky chose specifically these 13 items from the 29-item scale. We will address this question with the Czech version of the SOC-13 scale.

Salutogenesis in the Czech Republic

Salutogenesis and the SOC scale were introduced to the Czech audience in the early 90s by a Czech psychologist Jaro Křivohlavý. His work included the Czech translation of the SOC-29 scale [ 24 ] and the application of the concept in research on resilience [ 25 ] and behavioral medicine [ 26 ]. Unfortunately, the early Czech translation of the scale by Křivohlavý is not available electronically, nor could we locate it in library repositories. Later studies examined SOC-29 in relation to resilience [ 27 , 28 ] and self-reported health [ 29 , 30 ], however, it is not clear which translation of SOC-29 the authors used in the studies. A new Czech translation of the SOC-13 scale has recently been developed by the authors of this paper to examine the protective role of SOC for mental health during the COVID-19 crisis [ 31 ]. In line with earlier studies [ 9 ], SOC was identified as an important protective resource for individual mental health. This recent Czech translation of the SOC-13 scale [ 31 ] is the subject of the present study.

Present study

Our study aims to investigate the psychometric properties of the SOC-13 scale within a representative sample of the Czech adult population. Specifically, we will examine the factorial structure of the SOC-13 scale to understand its underlying dimensions and evaluate its internal consistency to ensure its reliability as a measure of SOC. Additionally, we aim to assess criterion validity by examining the scale’s association with established measures of positive and negative mental health outcomes - the Mental Health Continuum [ 32 ] and Generalized Anxiety Disorder [ 33 ]. We anticipate a strong correlation between these measures and the SOC construct [ 6 ]. Furthermore, we will investigate demographic variations in SOC, considering factors such as age, gender, and education. Understanding these variations will provide valuable insights into the applicability of the SOC-13 scale across different population subgroups. Finally, we will explore whether alternative short versions of the SOC scale should be preferred over the 13-item version. This analysis will help determine the most efficient version of the SOC scale for future research.

Study design and data collection

Our study design is a cross-sectional online survey of the Czech adult population. We contracted a professional agency DataCollect ( www.datacollect.cz ) to collect data from a representative sample for our study. Participants were recruited using quota sampling. The inclusion criteria were: being of adult age (18+), speaking the Czech language, and having permanent residence in the Czech Republic. Exclusion criteria related to study participation were predetermined to minimize the risk of biases in the collected data. The order of items in all measures was randomized and we implemented two attention checks in the questionnaire (e.g. “Please, choose option number 2”). Participants were excluded if they did not finish the survey, completed the survey in less than five minutes, did not pass the attention checks, or gave the same answer to more than 10 consecutive items. Data collection was conducted via the online platform Survey Monkey between November 2021 and December 2021.

Translation into the Czech language

Translation of the SOC scale was carried out by the authors of the paper with the help of a qualified translator. We followed the translation guidelines provided on the website of the Society for Research and Theory on Salutogenesis ( www.stars-society.org ), where the original English version of the SOC scale is available for download. Two translations were conducted independently, then compared and checked for differences. Based on this comparison, the agreed version of the scale was back translated into English by a Czech-English translator. The final version was checked for resemblance to the original version in content and in form. Although we used only the short version of the scale in our study (i.e., SOC-13), the translation included the full SOC-29 scale. The Czech translation of the full SOC scale is available as supplementary material.

Sense of coherence. We used the short version of the Orientation to Life Questionnaire [ 3 ] to assess SOC. The measure consists of 13 items evaluated on a 7-point Likert-type scale with different response options. Five items measure comprehensibility (e.g., “Does it happen that you experience feelings that you would rather not have to endure?”), four items measure manageability (e.g., “Has it happened that people whom you counted on disappointed you?”), and four items measure meaningfulness (e.g., “Do you have the feeling that you really don’t care about what is going on around you?”). In our sample, Cronbach’s alpha for the full scale was α = 0.88, for comprehensibility α = 0.76, manageability α = 0.72, and meaningfulness α = 0.70.

Mental health continuum - short form (MHC-SF; 32). This scale consists of 14 items that capture three dimensions of well-being: (i) emotional (e.g. “During the past month, how often did you feel interested in life?”); (ii) social (e.g. “During the past month, how often did you feel that the way our society works makes sense to you?”); (iii) psychological (e.g. “During the past month, how often did you feel confident to think or express your own ideas and opinions?”). The items assess the experiences the participants had over the past two weeks, the response options ranged from 1 (never) to 6 (every day). Internal consistency of the scale was α = 0.90.

Generalized anxiety disorder (GAD; 33). The scale consists of seven items that measure symptoms of anxiety over the past two weeks. Sample items include, e.g. “Over the past two weeks, how often have you been bothered by the following problems?” (i) “feeling nervous, anxious, or on edge”, (ii) “worrying too much about different things”, (iii) “becoming easily annoyed or irritable”. The response options ranged from 0 (not at all) to 3 (almost every day). Internal consistency of the scale was α = 0.92.

Sociodemographic characteristics included age, gender, and level of education (i.e., primary/vocational, secondary, tertiary).

Analytical procedure

Data analysis was conducted in R [ 34 ]. For confirmatory factor analysis, we used the cfa function of the lavaan package 0.6–16 [ 35 ]. We compared a one-factor model of SOC-13 to a correlated three-factor model (correlated latent factors comprehensibility, manageability, and meaningfulness) and a bi-factor model (general SOC dimension and specific dimensions comprehensibility, manageability, meaningfulness). Based on the empirical findings we further assessed the fit of alternative shorter versions of the SOC scale. We assessed the model fit using the comparative-fit index (CFI), Tucker-Lewis index (TLI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR) with the conventional cut-off values. The goodness-of-fit values for CFI and TLI surpassing 0.90 indicate an acceptable fit and exceeding 0.95 a good fit [ 36 ]. A value under 0.08 for RMSEA and SRMR indicates a good fit [ 37 ]. Nested models were compared using chi-square difference tests and the Bayesian Information Criterion (BIC). Models with lower BIC values should be preferred over models with higher BIC values [ 38 ]. All models were fitted using maximum likelihood estimation.

Further, we used the cor function of the stats package 4.3.2 [ 34 ] for Pearson correlation analysis to explore the association between SOC-13 and age, the t.test function of the same package for between groups t-test for differences based on gender, and the aov function with posthoc tests of the same package for one-way between-subjects ANOVA to test for differences based on level of education. To examine the criterion validity of the scale, we used the cor function for Pearson correlation analysis to examine the associations between SOC-13, MHC-SF, and GAD. We conducted the same analyses for the alternative short versions of the scale.

Participants

The median survey completion time was 11 min. In total, 676 participants started the survey and 557 completed it. Of those, 56 were excluded due to exclusion criteria. One additional respondent was excluded because of dubious responses on demographic items (e.g., 100 years old and a student), and two respondents were excluded for not meeting the inclusion criteria (under 18 years old). The final sample included N  = 498 participants. Of those, 53.4% were female, the average age was 49 years ( SD  = 16.6; range = 18–86), 43% had completed primary, 35% secondary, and 22% tertiary education. The sample is a good representation of the Czech adult population Footnote 1 with regard to gender (51% females), age ( M  = 50 years), and education level (44% primary, 33% secondary, 18% tertiary). Representativeness was tested using chi-squared test which yielded non-significant results for all domains.

Descriptive statistics

In Table  1 , we present an inter-item correlation matrix along with skewness, kurtosis, means and standard deviations of single items for SOC-13. Item correlations ranged from r  = 0.07 (items 2 and 4) to r  = 0.67 (items 8 and 9). Strong and moderately strong correlations were found also across the three SOC dimensions (e.g., r  = 0.77 comprehensibility and manageability).

  • Confirmatory factor analysis

A one-factor model showed inadequate fit to the data [χ2(65) = 338.2, CFI = 0.889, TLI = 0.867, RMSEA = 0.092, SRMR = 0.062]. Based on existing evidence [ 6 ], we specified residual covariance between items 2 and 3 and tested a modified one-factor model. The model showed an acceptable fit to the data [χ2(64) = 242.6, CFI = 0.927, TLI = 0.911, RMSEA = 0.075, SRMR = 0.050], and it was superior to the one-factor model (Δχ2 = 95.5, Δ df  = 1, p  < 0.001).

A correlated three-factor model showed an acceptable fit considering CFI and SRMR [χ2(63) = 286.6, CFI = 0.909, TLI = 0.885, RMSEA = 0.085, SRMR = 0.058]. The model was superior to the one-factor model (Δχ2 = 51.5, Δ df  = 2, p  < 0.001), however, it was inferior to the modified one-factor model (ΔBIC = -56). We further tested a modified three-factor model with residual covariance between items 2 and 3 which showed an acceptable fit to the data based on CFI and TLI and a good fit based on RMSEA and SRMR [χ2(62) = 191.7, CFI = 0.947, TLI = 0.932, RMSEA = 0.066, SRMR = 0.046]. The model was superior to the three-factor model (Δχ2 = 97.1, Δ df  = 1, p  < 0.001) as well as to the modified one-factor model (Δχ2 = 50.9, Δ df  = 3, p  < 0.001). See Fig.  1 for a detailed illustration of the model.

Finally, we tested a bi-factor model with one general SOC factor and three specific factors (comprehensibility, manageability, meaningfulness), however, the model was not identified.

figure 1

Correlated three-factor model of SOC-13 with residual covariance between item 2 and item 3

Alternative short versions of the SOC scale

We further tested the fit of alternative shorter versions of the SOC scale by systematically removing poorly performing items. In SOC-12, item 2 was excluded (“Has it happened in the past that you were surprised by the behavior of people whom you thought you knew well?”). This item measures comprehensibility, hence SOC-12 has even distribution of items for each dimension (i.e., comprehensibility, manageability, meaningfulness). Item 2 has previously been identified as problematic [ 6 ] and also in our sample it did not perform well in any of the fitted SOC-13 models (i.e., low factor loading and explained variance). A one-factor SOC-12 model showed an acceptable fit to the data based on CFI and TLI and a good fit based on RMSEA and SRMR [χ2(54) = 221.1, CFI = 0.927, RMSEA = 0.079, SRMR = 0.048]. A correlated three-factor model showed an acceptable fit based on CFI and TLI and a good fit based on RMSEA and SRMR [χ2(52) = 171.1, CFI = 0.948, TLI = 0.932, RMSEA = 0.069 SRMR = 0.043]. The model was superior to the one-factor model (Δχ2 = 50, Δ df  = 3, p  < 0.001). Bi-factor model was not identified.

In SOC-11, we removed item 3 (“Has it happened that people whom you counted on disappointed you?”), which measures manageability. The item had the lowest factor loading and the lowest explained variance in the one-factor SOC-12. A one-factor SOC-11 model showed a good fit to the data [χ2 (44) = 138.5, CFI = 0.955, TLI = 0.944, RMSEA = 0.066, SRMR = 0.038]. A correlated three-factor model was identified but not acceptable due to covariance between comprehensibility and manageability higher than 1 (i.e., Heywood case; 39).

In SOC-10, we removed item 1 (“Do you have the feeling that you don’t really care about what goes on around you?”), which measures meaningfulness. The item had the lowest factor loading and the lowest explained variance in one-factor SOC-11. A one-factor SOC-10 model showed a good fit to the data [χ2 (35) = 126.6, CFI = 0.956, TLI = 0.943, RMSEA = 0.072, SRMR = 0.039]. As in the case of SOC-11, a correlated three-factor model was identified but not acceptable due to covariance between comprehensibility and manageability higher than 1.

Finally, in SOC-9, we removed item 11 (“When something happened, have you generally found that… you overestimated or underestimated its importance / you saw the things in the right proportion”), which measures comprehensibility. The item had the lowest factor loading and the lowest explained variance in one-factor SOC-10. SOC-9 has an even distribution of three items for each dimension. A one-factor model showed a good fit to the data [χ2 (27) = 105.6, CFI = 0.959, TLI = 0.946, RMSEA = 0.076, SRMR = 0.038]. As in the previous models, a correlated three-factor model was identified but not acceptable due to covariance between comprehensibility and manageability higher than 1. See Fig.  2 for an illustration of one-factor SOC-9 model. Detailed results of the confirmatory factor analysis are shown in Table  2 . In Table 3 , we present the items of the SOC-13 (and SOC-9) scale with details about their facet structure.

figure 2

One-factor model of SOC-9

Differences by gender, age, and education

Correlation analysis indicated that SOC-13 increases with age ( r  = 0.32, p  < 0.001), this finding was identical for all alternative short versions of the SOC scale (see Table  2 ). Further, the results of the two-tailed t-test showed that males ( M  = 4.8, SD  = 1.08) had a significantly higher SOC-13 score [ t (497) = 3.06, p  = 0.002, d  = 0.27] than females ( M  = 4.5, SD  = 1.07). A one-way between-subjects ANOVA did not show any significant effect of level of education on SOC-13 score [F(2, 497) = 1.78, p  = 0.169, η p 2  = 0.022]. These results were similar for all alternative short versions of the SOC scale.

Criterion validity

We found a moderately strong positive correlation ( r  = 0.61, p  < 0.001) between SOC-13 and the positive mental health measure MHC, and a moderately strong negative correlation between SOC-13 and the negative mental health measure GAD ( r = -0.68, p  < 0.001). These findings were similar for all alternative short versions of the SOC scale (see Table  4 ).

Our study examined the psychometric properties of the SOC-13 scale and its alternative short versions SOC-12, SOC-11, SOC-10, and SOC-9 in a representative sample of the Czech adult population. In line with existing studies [ 40 ], we found that SOC increases with age and that males score higher than females. In contrast to some prior findings [ 12 ], we did not find any significant differences in SOC based on the level of education. Further, we tested criterion validity using both positive and negative mental health outcomes (i.e., MHC and GAD). SOC had a strong positive correlation with MHC and a strong negative correlation with GAD, thus adding to the evidence about the criterion validity of the scale [ 6 , 40 ].

Analysis of the factor structure showed that a one-factor SOC-13 had an inadequate fit to our data, however, an acceptable fit was achieved for a modified one-factor model with specified residual covariance between item 2 (“Has it happened in the past that you were surprised by the behavior of people whom you thought you knew well?”) and item 3 (“Has it happened that people whom you counted on disappointed you?”). A correlated three factor model with latent factors comprehensibility, manageability, and meaningfulness showed a better fit than the one factor-model. However, it was also necessary to specify residual covariance between item 2 and item 3 to reach an acceptable fit for all fit indices. A recent Slovenian study [ 41 ] found a similar result and several prior studies (see 6 for an overview) have noted that items 2 and 3 of the SOC-13 scale are problematic. Although the items pertain to different SOC dimensions (item 2 to comprehensibility, item 3 to manageability), multiple studies [e.g., 20 , 42 , 43 ] have reported moderately strong correlation between them and this is also the case in our study ( r  = 0.5, p  < 0.001). The two items aptly illustrate the facet theory behind the scale construction as the SOC component represents only one building block of each item. Although items 2 and 3 theoretically pertain to different SOC components, they share the same elements from the other four facets (i.e., modality, source, demand, and time) which is reflected in the similarity of their wording. Therefore, they will necessarily share residual variance and this needs to be specified to achieve a good model fit. Drageset and Haugan [ 18 ] explain this similarity in that the people whom we know well are usually the ones that we count on, and feeling disappointed and surprised by the behavior of people we know well is closely related. Therefore, it should be theoretically justifiable to specify residual covariance between item 2 and item 3 as a possible solution to improve the fit. As we could show in our sample, the model fit significantly improved for both one-factor and three-factor solutions.

In addition, we examined the fit of alternative short versions of the SOC scale by systematically removing single items that performed poorly. First, in line with previous studies [ 6 ], we addressed the issue of residual covariance in SOC-13 by removing item 2, examining the factor structure of SOC-12. The remaining 12 items were equally distributed within the three SOC components with four items per each component. Interestingly, a one-factor model reached an acceptable fit and the fit further improved for a correlated three-factor model with latent factors of comprehensibility, manageability, and meaningfulness. Although correlated three-factor models were superior to one-factor models, we observed extreme covariances between latent variables, especially in case of comprehensibility and manageability (cov = 0.98). This suggests that the SOC components are not empirically separable and that, indeed, SOC is rather a one-dimensional global orientation with multiple components that are dynamically interrelated as Antonovsky proposed [ 2 ]. This notion was supported in a recent study that explored the dimensionality of the scale using a network perspective [ 16 ]. Our examination of SOC-11, SOC-10 and SOC-9 provided further support for a one-factor structure of the scale. All shorter versions yielded a good one-dimensional fit, however, we could not identify a correlated three-factor model fit due to the Heywood case. This refers to the situation when a solution that otherwise is satisfactory produces communality greater than one explained by the latent factor, which implies that the residual variance of the variable is negative [ 39 ]. In our case, this was true for the latent factors comprehensibility and manageability. However, we demonstrated that we could attain a good one-dimensional fit for all alternative short versions of SOC, and, importantly, they all showed comparable reliability and validity metrics to their longer counterpart SOC-13. In particular, SOC-9 shows very good fit indices and it performs equally well in validity analyses as SOC-13. Given these findings and existing evidence [ 5 ], we propose that future investigations may consider utilizing the SOC-9 scale instead of the SOC-13. It is interesting to point out that the majority of items that were removed for the shorter versions of the scale are negatively worded or reverse-scored (expect for item 11). This is in line with the latest research suggesting that such items can cause problems in model identification as they create additional method factors [ 44 , 45 , 46 ].

Finally, it is important to highlight that Antonovsky did not provide any information about the selection of the 13 items for the short version of the SOC scale [ 2 ]. For example, a detailed examination of the facet structure reveals that none of the items included in SOC-13 refers to future which is part of facet referring to time (i.e., past, present, future). Hence, considering the absence of explicit criteria for item selection in the SOC-13 scale, it would be interesting to gather data from diverse populations utilizing the full SOC-29 scale. Subsequently, through exploratory factor analysis, researchers could derive a new, theory- and empirical-driven, short version of the SOC scale.

Strengths and limitations

A clear strength of our study is that our findings are based on a representative sample that accurately reflects the Czech adult population. Moreover, we implemented rigorous data cleaning procedures, meticulously excluding participants who provided potentially careless or low-quality responses. By doing so, we ensured that our conclusions are based on high-quality data and that they are generalizable to our target population of Czech adults. Finally, we conducted a thorough back-translation procedure to achieve an accurate Czech version of the SOC scale and we carried out systematic testing of different short versions of the SOC scale.

However, our study also has some limitations. First, our conclusions are based on data from a culturally specific country and they may not be generalizable to other populations. It is important to note, however, that most of our findings are in line with multiple existing studies which supports the validity of our conclusions. Second, the data were collected during a later stage of the COVID-19 pandemic, which may have impacted particularly the mental health outcomes we used for criterion validity. It would be worthwhile to investigate whether the data replicate in our population outside of this exceptional situation. Third, it should be noted that we did not examine test-retest reliability of the scale due to the cross-sectional design of our study. Finally, self-reported data are subject to common method biases such as social desirability, recall bias, or consistency motive [ 47 ]. We aimed to minimize this risk by implementing various strategies in the questionnaire, such as randomization of items and the use of disqualifying items (e.g. “Please, choose option number 2”) to disqualify careless answers.

Our study contributes to decades of ongoing research on SOC, the main pillar of the theory of salutogenesis. In line with existing research, we found evidence for the validity of the SOC as a construct, but we could not identify a clear factorial structure of the SOC-13 scale. However, following Antonovsky’s conception of the scale, we believe it is theoretically sound to aim for a one-factor solution of the scale and we could show that this is possible with shorter versions of the SOC scale. We particularly recommend using the SOC-9 scale in future research which shows an excellent one-factor fit and validity indices comparable to SOC-13. Finally, since Antonovsky does not explain how he selected the items of the SOC-13 scale, it would be interesting to examine the possibility of developing a new one-dimensional short version based on exploratory factor analysis of the original SOC-29 scale.

Data availability

The datasets used and analyzed during the current study and the R code used for the statistical analysis are available as supplementary material.

www.czso.cz .

Antonovsky A. The salutogenic model as a theory to guide health promotion. Health Promot Int. 1996;11(1).

Antonovsky A. Unraveling the mystery of Health how people manage stress and stay well. Jossey-Bass; 1987.

Antonovsky A. Health stress and coping. Jossey-Bass; 1979.

Antonovsky A. The structure and Properties of the sense of coherence scale. Soc Sci Med. 1993;36(6):125–733.

Article   Google Scholar  

Eriksson M, Contu P. The sense of coherence: Measurement issues. The Handbook of Salutogenesis. Springer International Publishing; 2022. pp. 79–91.

Eriksson M. The sense of coherence: the Concept and its relationship to Health. The Handbook of Salutogenesis. Springer International Publishing; 2022. pp. 61–8.

Eriksson M, Lindström B. Antonovsky’s sense of coherence scale and the relation with health: a systematic review. J Epidemiol Community Health (1978). 2006;60(5):376–81.

Eriksson M, Lindström B. Antonovsky’s sense of coherence scale and its relation with quality of life: a systematic review. J Epidemiol Community Health. 2007;61(11):938–44.

Article   PubMed   PubMed Central   Google Scholar  

Mana A, Super S, Sardu C, Juvinya Canal D, Moran N, Sagy S. Individual, social and national coping resources and their relationships with mental health and anxiety: A comparative study in Israel, Italy, Spain, and the Netherlands during the Coronavirus pandemic. Glob Health Promot [Internet]. 2021;28(2):17–26.

Silverstein M, Heap J. Sense of coherence changes with aging over the second half of life. Adv Life Course Res. 2015;23:98–107.

Article   PubMed   Google Scholar  

Rivera F, García-Moya I, Moreno C, Ramos P. Developmental contexts and sense of coherence in adolescence: a systematic review. J Health Psychol. 2013;18(6):800–12.

Volanen SM, Lahelma E, Silventoinen K, Suominen S. Factors contributing to sense of coherence among men and women. Eur J Public Health [Internet]. 2004;14(3):322–30.

Guttman L. Measurement as structural theory. Psychometrika. 1971;3(4):329–47.

Guttman R, Greenbaum CW. Facet theory: its development and current status. Eur Psychol. 1998;3(1):13–36.

Shye S. Theory Construction and Data Analysis in the behavioral sciences. San Francisco: Jossey-Bass; 1978.

Google Scholar  

Portoghese I, Sardu C, Bauer G, Galletta M, Castaldi S, Nichetti E, Petrocelli L, Tassini M, Tidone E, Mereu A, Contu P. A network perspective to the measurement of sense of coherence (SOC): an exploratory graph analysis approach. Current Psychology. 2024;12:1-3.

Bachem R, Maercker A. Development and psychometric evaluation of a revised sense of coherence scale. Eur J Psychol Assess. 2016;34(3):206–15.

Drageset J, Haugan G. Psychometric properties of the orientation to Life Questionnaire in nursing home residents. Scand J Caring Sci. 2016;30(3):623–30.

Kanhai J, Harrison VE, Suominen AL, Knuuttila M, Uutela A, Bernabé E. Sense of coherence and incidence of periodontal disease in adults. J Clin Periodontol. 2014;41(8):760–5.

Naaldenberg J, Tobi H, van den Esker F, Vaandrager L. Psychometric properties of the OLQ-13 scale to measure sense of coherence in a community-dwelling older population. Health Qual Life Outcomes. 2011;9.

Luyckx K, Goossens E, Apers S, Rassart J, Klimstra T, Dezutter J et al. The 13-item sense of coherence scale in Dutch-speaking adolescents and young adults: structural validity, age trends, and chronic disease. Psychol Belg. 2012;52(4):351–68.

Lerdal A, Opheim R, Gay CL, Moum B, Fagermoen MS, Kottorp A. Psychometric limitations of the 13-item sense of coherence scale assessed by Rasch analysis. BMC Psychol. 2017;5(1).

Klepp OM, Mastekaasa A, Sørensen T, Sandanger I, Kleiner R. Structure analysis of Antonovsky’s sense of coherence from an epidemiological mental health survey with a brief nine-item sense of coherence scale. Int J Methods Psychiatr Res. 2007;16(1):11–22.

Křivohlavý J. Sense of coherence: methods and first results. II. Sense of coherence and cancer. Czechoslovak Psychol. 1990;34:511–7.

Křivohlavý J. Nezdolnost v pojetí SOC. Czechoslovak Psychol. 1990;34(6).

Křivohlavý J. Salutogenesis and behavioral medicine. Cas Lek Cesk. 1990;126(36):1121–4.

Kebza V, Šolcová I. Hlavní Koncepce psychické odolnosti. Czechoslovak Psychol. 2008;52(1):1–19.

Šolcová I, Blatný M, Kebza V, Jelínek M. Relation of toddler temperament and perceived parenting styles to adult resilience. Czechoslovak Psychol. 2016;60(1):61–70.

Šolcová I, Kebza V, Kodl M, Kernová V. Self-reported health status predicting resilience and burnout in longitudinal study. Cent Eur J Public Health. 2017;25(3):222–7.

Šolcová I, Kebza V. Subjective health: current state of knowledge and results of two Czech studies. Czechoslovak Psychol. 2006;501:1–15.

Šípová I, Máčel M, Zubková A, Tušl M. Association between coping resources and mental health during the COVID-19 pandemic: a cross-sectional study in the Czech Republic. Int J Environ Health Res. 2022;1–9.

Keyes CLM. The Mental Health Continuum: from languishing to flourishing in life. J Health Soc Behav. 2002;43(2):207–22.

Löwe B, Decker O, Müller S, Brähler E, Schellberg D, Herzog W, et al. Validation and standardization of the generalized anxiety disorder screener (GAD-7) in the General Population. Med Care. 2008;46(3):266–74.

R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2022.

Rosseel Y. Lavaan: an R Package for Structural equation modeling. J Stat Softw. 2012;48(2):1–36.

Bentler PM, Bonett DG. Significance tests and goodness of fit in the analysis of covariance structures. Psychol Bull. 1980;88(3):588–606.

Beauducel A, Wittmann WW. Simulation study on fit indexes in CFA based on data with slightly distorted simple structure. Struct Equ Model. 2005;12(1):41–75.

Raftery AE. Bayesian model selection in Social Research. Sociol Methodol. 1995;25:111–63.

Farooq R. Heywood cases: possible causes and solutions. Int J Data Anal Techniques Strategies. 2022;14(1):79.

Eriksson M, Lindström B. Validity of Antonovsky’s sense of coherence scale: a systematic review. J Epidemiol Community Health (1978). 2005;59(6):460–6.

Stern B, Socan G, Rener-Sitar K, Kukec A, Zaletel-Kragelj L. Validation of the Slovenian version of short sense of coherence questionnaire (SOC-13) in multiple sclerosis patients. Zdr Varst. 2019;58(1):31–9.

PubMed   PubMed Central   Google Scholar  

Bernabé E, Tsakos G, Watt RG, Suominen-Taipale AL, Uutela A, Vahtera J, et al. Structure of the sense of coherence scale in a nationally representative sample: the Finnish Health 2000 survey. Qual Life Res. 2009;18(5):629–36.

Sardu C, Mereu A, Sotgiu A, Andrissi L, Jacobson MK, Contu P. Antonovsky’s sense of coherence scale: cultural validation of soc questionnaire and socio-demographic patterns in an Italian Population. Clin Pract Epidemiol Mental Health. 2012;8:1–6.

Chyung SY, Barkin JR, Shamsy JA. Evidence-based Survey Design: the Use of negatively worded items in surveys. Perform Improv. 2018;57(3):16–25.

Suárez-Alvarez J, Pedrosa I, Lozano LM, García-Cueto E, Cuesta M, Muñiz J. Using reversed items in likert scales: a questionable practice. Psicothema. 2018;30(2):149–58.

PubMed   Google Scholar  

van Sonderen E, Sanderman R, Coyne JC. Ineffectiveness of reverse wording of questionnaire items: let’s learn from cows in the rain. PLoS ONE. 2013;8(7).

Podsakoff PM, MacKenzie SB, Lee JY, Podsakoff NP. Common method biases in behavioral research: a critical review of the literature and recommended remedies. J Appl Psychol. 2003;88(5):879–903.

Download references

Acknowledgements

The authors would like to thank to the team of Center of Salutogenesis at the University of Zurich for their helpful comments on the adapted version of the SOC scale.

MT received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 801076, through the SSPH + Global PhD Fellowship Program in Public Health Sciences (GlobalP3HS) of the Swiss School of Public Health. Data collection was supported by the Charles University Strategic Partnerships Fund 2021. The University of Zurich Foundation supported the contribution of GB.

Author information

Authors and affiliations.

Division of Public and Organizational Health, Center of Salutogenesis, Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Hirschengraben 84, Zurich, 8001, Switzerland

Martin Tušl & Georg F. Bauer

Department of Psychology, Faculty of Arts, Charles University, Prague, Czech Republic

Ivana Šípová, Martin Máčel & Kristýna Cetkovská

You can also search for this author in PubMed   Google Scholar

Contributions

All authors contributed to the conception and design of the study. MT wrote the manuscript, conducted data analysis, and contributed to data collection. MM and IS conducted data collection, contributed to data analysis, interpretation of results, edited and commented on the manuscript. KC and GB contributed to interpretation of results, edited and commented on the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Martin Tušl .

Ethics declarations

Ethics approval and consent to participate.

The study was conducted in accordance with the general principles of the Declaration of Helsinki and with the ethical principles defined by the university and by the national law ( https://cuni.cz/UK-5317.html ). Informed consent was obtained from all participants prior to the completion of the survey. Participation was voluntary and participants could withdraw from the study at any time without any consequences. For anonymous online surveys in adult population no ethical review by an ethics committee was necessary under national law and university rules. See: https://www.muni.cz/en/about-us/organizational-structure/boards-and-committees/research-ethics-committee/evaluation-request .

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary material 2, supplementary material 3, rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Tušl, M., Šípová, I., Máčel, M. et al. The sense of coherence scale: psychometric properties in a representative sample of the Czech adult population. BMC Psychol 12 , 293 (2024). https://doi.org/10.1186/s40359-024-01805-7

Download citation

Received : 22 March 2023

Accepted : 21 May 2024

Published : 26 May 2024

DOI : https://doi.org/10.1186/s40359-024-01805-7

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Salutogenesis
  • Sense of coherence
  • Psychometrics
  • Czech adult population
  • Mental health

BMC Psychology

ISSN: 2050-7283

what is population and sample in research with example

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • My Account Login
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • Open access
  • Published: 17 October 2023

The impact of founder personalities on startup success

  • Paul X. McCarthy 1 , 2 ,
  • Xian Gong 3 ,
  • Fabian Braesemann 4 , 5 ,
  • Fabian Stephany 4 , 5 ,
  • Marian-Andrei Rizoiu 3 &
  • Margaret L. Kern 6  

Scientific Reports volume  13 , Article number:  17200 ( 2023 ) Cite this article

60k Accesses

2 Citations

305 Altmetric

Metrics details

  • Human behaviour
  • Information technology

An Author Correction to this article was published on 07 May 2024

This article has been updated

Startup companies solve many of today’s most challenging problems, such as the decarbonisation of the economy or the development of novel life-saving vaccines. Startups are a vital source of innovation, yet the most innovative are also the least likely to survive. The probability of success of startups has been shown to relate to several firm-level factors such as industry, location and the economy of the day. Still, attention has increasingly considered internal factors relating to the firm’s founding team, including their previous experiences and failures, their centrality in a global network of other founders and investors, as well as the team’s size. The effects of founders’ personalities on the success of new ventures are, however, mainly unknown. Here, we show that founder personality traits are a significant feature of a firm’s ultimate success. We draw upon detailed data about the success of a large-scale global sample of startups (n = 21,187). We find that the Big Five personality traits of startup founders across 30 dimensions significantly differ from that of the population at large. Key personality facets that distinguish successful entrepreneurs include a preference for variety, novelty and starting new things (openness to adventure), like being the centre of attention (lower levels of modesty) and being exuberant (higher activity levels). We do not find one ’Founder-type’ personality; instead, six different personality types appear. Our results also demonstrate the benefits of larger, personality-diverse teams in startups, which show an increased likelihood of success. The findings emphasise the role of the diversity of personality types as a novel dimension of team diversity that influences performance and success.

Similar content being viewed by others

what is population and sample in research with example

Predicting success in the worldwide start-up network

what is population and sample in research with example

The personality traits of self-made and inherited millionaires

what is population and sample in research with example

The nexus of top executives’ attributes, firm strategies, and outcomes: Large firms versus SMEs

Introduction.

The success of startups is vital to economic growth and renewal, with a small number of young, high-growth firms creating a disproportionately large share of all new jobs 1 , 2 . Startups create jobs and drive economic growth, and they are also an essential vehicle for solving some of society’s most pressing challenges.

As a poignant example, six centuries ago, the German city of Mainz was abuzz as the birthplace of the world’s first moveable-type press created by Johannes Gutenberg. However, in the early part of this century, it faced several economic challenges, including rising unemployment and a significant and growing municipal debt. Then in 2008, two Turkish immigrants formed the company BioNTech in Mainz with another university research colleague. Together they pioneered new mRNA-based technologies. In 2020, BioNTech partnered with US pharmaceutical giant Pfizer to create one of only a handful of vaccines worldwide for Covid-19, saving an estimated six million lives 3 . The economic benefit to Europe and, in particular, the German city where the vaccine was developed has been significant, with windfall tax receipts to the government clearing Mainz’s €1.3bn debt and enabling tax rates to be reduced, attracting other businesses to the region as well as inspiring a whole new generation of startups 4 .

While stories such as the success of BioNTech are often retold and remembered, their success is the exception rather than the rule. The overwhelming majority of startups ultimately fail. One study of 775 startups in Canada that successfully attracted external investment found only 35% were still operating seven years later 5 .

But what determines the success of these ‘lucky few’? When assessing the success factors of startups, especially in the early-stage unproven phase, venture capitalists and other investors offer valuable insights. Three different schools of thought characterise their perspectives: first, supply-side or product investors : those who prioritise investing in firms they consider to have novel and superior products and services, investing in companies with intellectual property such as patents and trademarks. Secondly, demand-side or market-based investors : those who prioritise investing in areas of highest market interest, such as in hot areas of technology like quantum computing or recurrent or emerging large-scale social and economic challenges such as the decarbonisation of the economy. Thirdly, talent investors : those who prioritise the foundation team above the startup’s initial products or what industry or problem it is looking to address.

Investors who adopt the third perspective and prioritise talent often recognise that a good team can overcome many challenges in the lead-up to product-market fit. And while the initial products of a startup may or may not work a successful and well-functioning team has the potential to pivot to new markets and new products, even if the initial ones prove untenable. Not surprisingly, an industry ‘autopsy’ into 101 tech startup failures found 23% were due to not having the right team—the number three cause of failure ahead of running out of cash or not having a product that meets the market need 6 .

Accordingly, early entrepreneurship research was focused on the personality of founders, but the focus shifted away in the mid-1980s onwards towards more environmental factors such as venture capital financing 7 , 8 , 9 , networks 10 , location 11 and due to a range of issues and challenges identified with the early entrepreneurship personality research 12 , 13 . At the turn of the 21st century, some scholars began exploring ways to combine context and personality and reconcile entrepreneurs’ individual traits with features of their environment. In her influential work ’The Sociology of Entrepreneurship’, Patricia H. Thornton 14 discusses two perspectives on entrepreneurship: the supply-side perspective (personality theory) and the demand-side perspective (environmental approach). The supply-side perspective focuses on the individual traits of entrepreneurs. In contrast, the demand-side perspective focuses on the context in which entrepreneurship occurs, with factors such as finance, industry and geography each playing their part. In the past two decades, there has been a revival of interest and research that explores how entrepreneurs’ personality relates to the success of their ventures. This new and growing body of research includes several reviews and meta-studies, which show that personality traits play an important role in both career success and entrepreneurship 15 , 16 , 17 , 18 , 19 , that there is heterogeneity in definitions and samples used in research on entrepreneurship 16 , 18 , and that founder personality plays an important role in overall startup outcomes 17 , 19 .

Motivated by the pivotal role of the personality of founders on startup success outlined in these recent contributions, we investigate two main research questions:

Which personality features characterise founders?

Do their personalities, particularly the diversity of personality types in founder teams, play a role in startup success?

We aim to understand whether certain founder personalities and their combinations relate to startup success, defined as whether their company has been acquired, acquired another company or listed on a public stock exchange. For the quantitative analysis, we draw on a previously published methodology 20 , which matches people to their ‘ideal’ jobs based on social media-inferred personality traits.

We find that personality traits matter for startup success. In addition to firm-level factors of location, industry and company age, we show that founders’ specific Big Five personality traits, such as adventurousness and openness, are significantly more widespread among successful startups. As we find that companies with multi-founder teams are more likely to succeed, we cluster founders in six different and distinct personality groups to underline the relevance of the complementarity in personality traits among founder teams. Startups with diverse and specific combinations of founder types (e. g., an adventurous ‘Leader’, a conscientious ‘Accomplisher’, and an extroverted ‘Developer’) have significantly higher odds of success.

We organise the rest of this paper as follows. In the Section " Results ", we introduce the data used and the methods applied to relate founders’ psychological traits with their startups’ success. We introduce the natural language processing method to derive individual and team personality characteristics and the clustering technique to identify personality groups. Then, we present the result for multi-variate regression analysis that allows us to relate firm success with external and personality features. Subsequently, the Section " Discussion " mentions limitations and opportunities for future research in this domain. In the Section " Methods ", we describe the data, the variables in use, and the clustering in greater detail. Robustness checks and additional analyses can be found in the Supplementary Information.

Our analysis relies on two datasets. We infer individual personality facets via a previously published methodology 20 from Twitter user profiles. Here, we restrict our analysis to founders with a Crunchbase profile. Crunchbase is the world’s largest directory on startups. It provides information about more than one million companies, primarily focused on funding and investors. A company’s public Crunchbase profile can be considered a digital business card of an early-stage venture. As such, the founding teams tend to provide information about themselves, including their educational background or a link to their Twitter account.

We infer the personality profiles of the founding teams of early-stage ventures from their publicly available Twitter profiles, using the methodology described by Kern et al. 20 . Then, we correlate this information to data from Crunchbase to determine whether particular combinations of personality traits correspond to the success of early-stage ventures. The final dataset used in the success prediction model contains n = 21,187 startup companies (for more details on the data see the Methods section and SI section  A.5 ).

Revisions of Crunchbase as a data source for investigations on a firm and industry level confirm the platform to be a useful and valuable source of data for startups research, as comparisons with other sources at micro-level, e.g., VentureXpert or PwC, also suggest that the platform’s coverage is very comprehensive, especially for start-ups located in the United States 21 . Moreover, aggregate statistics on funding rounds by country and year are quite similar to those produced with other established sources, going to validate the use of Crunchbase as a reliable source in terms of coverage of funded ventures. For instance, Crunchbase covers about the same number of investment rounds in the analogous sectors as collected by the National Venture Capital Association 22 . However, we acknowledge that the data source might suffer from registration latency (a certain delay between the foundation of the company and its actual registration on Crunchbase) and success bias in company status (the likeliness that failed companies decide to delete their profile from the database).

The definition of startup success

The success of startups is uncertain, dependent on many factors and can be measured in various ways. Due to the likelihood of failure in startups, some large-scale studies have looked at which features predict startup survival rates 23 , and others focus on fundraising from external investors at various stages 24 . Success for startups can be measured in multiple ways, such as the amount of external investment attracted, the number of new products shipped or the annual growth in revenue. But sometimes external investments are misguided, revenue growth can be short-lived, and new products may fail to find traction.

Success in a startup is typically staged and can appear in different forms and times. For example, a startup may be seen to be successful when it finds a clear solution to a widely recognised problem, such as developing a successful vaccine. On the other hand, it could be achieving some measure of commercial success, such as rapidly accelerating sales or becoming profitable or at least cash positive. Or it could be reaching an exit for foundation investors via a trade sale, acquisition or listing of its shares for sale on a public stock exchange via an Initial Public Offering (IPO).

For our study, we focused on the startup’s extrinsic success rather than the founders’ intrinsic success per se, as its more visible, objective and measurable. A frequently considered measure of success is the attraction of external investment by venture capitalists 25 . However, this is not in and of itself a good measure of clear, incontrovertible success, particularly for early-stage ventures. This is because it reflects investors’ expectations of a startup’s success potential rather than actual business success. Similarly, we considered other measures like revenue growth 26 , liquidity events 27 , 28 , 29 , profitability 30 and social impact 31 , all of which have benefits as they capture incremental success, but each also comes with operational measurement challenges.

Therefore, we apply the success definition initially introduced by Bonaventura et al. 32 , namely that a startup is acquired, acquires another company or has an initial public offering (IPO). We consider any of these major capital liquidation events as a clear threshold signal that the company has matured from an early-stage venture to becoming or is on its way to becoming a mature company with clear and often significant business growth prospects. Together these three major liquidity events capture the primary forms of exit for external investors (an acquisition or trade sale and an IPO). For companies with a longer autonomous growth runway, acquiring another company marks a similar milestone of scale, maturity and capability.

Using multifactor analysis and a binary classification prediction model of startup success, we looked at many variables together and their relative influence on the probability of the success of startups. We looked at seven categories of factors through three lenses of firm-level factors: (1) location, (2) industry, (3) age of the startup; founder-level factors: (4) number of founders, (5) gender of founders, (6) personality characteristics of founders and; lastly team-level factors: (7) founder-team personality combinations. The model performance and relative impacts on the probability of startup success of each of these categories of founders are illustrated in more detail in section  A.6 of the Supplementary Information (in particular Extended Data Fig.  19 and Extended Data Fig.  20 ). In total, we considered over three hundred variables (n = 323) and their relative significant associations with success.

The personality of founders

Besides product-market, industry, and firm-level factors (see SI section  A.1 ), research suggests that the personalities of founders play a crucial role in startup success 19 . Therefore, we examine the personality characteristics of individual startup founders and teams of founders in relationship to their firm’s success by applying the success definition used by Bonaventura et al. 32 .

Employing established methods 33 , 34 , 35 , we inferred the personality traits across 30 dimensions (Big Five facets) of a large global sample of startup founders. The startup founders cohort was created from a subset of founders from the global startup industry directory Crunchbase, who are also active on the social media platform Twitter.

To measure the personality of the founders, we used the Big Five, a popular model of personality which includes five core traits: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Emotional stability. Each of these traits can be further broken down into thirty distinct facets. Studies have found that the Big Five predict meaningful life outcomes, such as physical and mental health, longevity, social relationships, health-related behaviours, antisocial behaviour, and social contribution, at levels on par with intelligence and socioeconomic status 36 Using machine learning to infer personality traits by analysing the use of language and activity on social media has been shown to be more accurate than predictions of coworkers, friends and family and similar in accuracy to the judgement of spouses 37 . Further, as other research has shown, we assume that personality traits remain stable in adulthood even through significant life events 38 , 39 , 40 . Personality traits have been shown to emerge continuously from those already evident in adolescence 41 and are not significantly influenced by external life events such as becoming divorced or unemployed 42 . This suggests that the direction of any measurable effect goes from founder personalities to startup success and not vice versa.

As a first investigation to what extent personality traits might relate to entrepreneurship, we use the personality characteristics of individuals to predict whether they were an entrepreneur or an employee. We trained and tested a machine-learning random forest classifier to distinguish and classify entrepreneurs from employees and vice-versa using inferred personality vectors alone. As a result, we found we could correctly predict entrepreneurs with 77% accuracy and employees with 88% accuracy (Fig.  1 A). Thus, based on personality information alone, we correctly predict all unseen new samples with 82.5% accuracy (See SI section  A.2 for more details on this analysis, the classification modelling and prediction accuracy).

We explored in greater detail which personality features are most prominent among entrepreneurs. We found that the subdomain or facet of Adventurousness within the Big Five Domain of Openness was significant and had the largest effect size. The facet of Modesty within the Big Five Domain of Agreeableness and Activity Level within the Big Five Domain of Extraversion was the subsequent most considerable effect (Fig.  1 B). Adventurousness in the Big Five framework is defined as the preference for variety, novelty and starting new things—which are consistent with the role of a startup founder whose role, especially in the early life of the company, is to explore things that do not scale easily 43 and is about developing and testing new products, services and business models with the market.

Once we derived and tested the Big Five personality features for each entrepreneur in our data set, we examined whether there is evidence indicating that startup founders naturally cluster according to their personality features using a Hopkins test (see Extended Data Figure  6 ). We discovered clear clustering tendencies in the data compared with other renowned reference data sets known to have clusters. Then, once we established the founder data clusters, we used agglomerative hierarchical clustering. This ‘bottom-up’ clustering technique initially treats each observation as an individual cluster. Then it merges them to create a hierarchy of possible cluster schemes with differing numbers of groups (See Extended Data Fig.  7 ). And lastly, we identified the optimum number of clusters based on the outcome of four different clustering performance measurements: Davies-Bouldin Index, Silhouette coefficients, Calinski-Harabas Index and Dunn Index (see Extended Data Figure  8 ). We find that the optimum number of clusters of startup founders based on their personality features is six (labelled #0 through to #5), as shown in Fig.  1 C.

To better understand the context of different founder types, we positioned each of the six types of founders within an occupation-personality matrix established from previous research 44 . This research showed that ‘each job has its own personality’ using a substantial sample of employees across various jobs. Utilising the methodology employed in this study, we assigned labels to the cluster names #0 to #5, which correspond to the identified occupation tribes that best describe the personality facets represented by the clusters (see Extended Data Fig.  9 for an overview of these tribes, as identified by McCarthy et al. 44 ).

Utilising this approach, we identify three ’purebred’ clusters: #0, #2 and #5, whose members are dominated by a single tribe (larger than 60% of all individuals in each cluster are characterised by one tribe). Thus, these clusters represent and share personality attributes of these previously identified occupation-personality tribes 44 , which have the following known distinctive personality attributes (see also Table  1 ):

Accomplishers (#0) —Organised & outgoing. confident, down-to-earth, content, accommodating, mild-tempered & self-assured.

Leaders (#2) —Adventurous, persistent, dispassionate, assertive, self-controlled, calm under pressure, philosophical, excitement-seeking & confident.

Fighters (#5) —Spontaneous and impulsive, tough, sceptical, and uncompromising.

We labelled these clusters with the tribe names, acknowledging that labels are somewhat arbitrary, based on our best interpretation of the data (See SI section  A.3 for more details).

For the remaining three clusters #1, #3 and #4, we can see they are ‘hybrids’, meaning that the founders within them come from a mix of different tribes, with no one tribe representing more than 50% of the members of that cluster. However, the tribes with the largest share were noted as #1 Experts/Engineers, #3 Fighters, and #4 Operators.

To label these three hybrid clusters, we examined the closest occupations to the median personality features of each cluster. We selected a name that reflected the common themes of these occupations, namely:

Experts/Engineers (#1) as the closest roles included Materials Engineers and Chemical Engineers. This is consistent with this cluster’s personality footprint, which is highest in openness in the facets of imagination and intellect.

Developers (#3) as the closest roles include Application Developers and related technology roles such as Business Systems Analysts and Product Managers.

Operators (#4) as the closest roles include service, maintenance and operations functions, including Bicycle Mechanic, Mechanic and Service Manager. This is also consistent with one of the key personality traits of high conscientiousness in the facet of orderliness and high agreeableness in the facet of humility for founders in this cluster.

figure 1

Founder-Level Factors of Startup Success. ( A ), Successful entrepreneurs differ from successful employees. They can be accurately distinguished using a classifier with personality information alone. ( B ), Successful entrepreneurs have different Big Five facet distributions, especially on adventurousness, modesty and activity level. ( C ), Founders come in six different types: Fighters, Operators, Accomplishers, Leaders, Engineers and Developers (FOALED) ( D ), Each founder Personality-Type has its distinct facet.

Together, these six different types of startup founders (Fig.  1 C) represent a framework we call the FOALED model of founder types—an acronym of Fighters, Operators, Accomplishers, Leaders, Engineers and D evelopers.

Each founder’s personality type has its distinct facet footprint (for more details, see Extended Data Figure  10 in SI section  A.3 ). Also, we observe a central core of correlated features that are high for all types of entrepreneurs, including intellect, adventurousness and activity level (Fig.  1 D).To test the robustness of the clustering of the personality facets, we compare the mean scores of the individual facets per cluster with a 20-fold resampling of the data and find that the clusters are, overall, largely robust against resampling (see Extended Data Figure  11 in SI section  A.3 for more details).

We also find that the clusters accord with the distribution of founders’ roles in their startups. For example, Accomplishers are often Chief Executive Officers, Chief Financial Officers, or Chief Operating Officers, while Fighters tend to be Chief Technical Officers, Chief Product Officers, or Chief Commercial Officers (see Extended Data Fig.  12 in SI section  A.4 for more details).

The ensemble theory of success

While founders’ individual personality traits, such as Adventurousness or Openness, show to be related to their firms’ success, we also hypothesise that the combination, or ensemble, of personality characteristics of a founding team impacts the chances of success. The logic behind this reasoning is complementarity, which is proposed by contemporary research on the functional roles of founder teams. Examples of these clear functional roles have evolved in established industries such as film and television, construction, and advertising 45 . When we subsequently explored the combinations of personality types among founders and their relationship to the probability of startup success, adjusted for a range of other factors in a multi-factorial analysis, we found significantly increased chances of success for mixed foundation teams:

Initially, we find that firms with multiple founders are more likely to succeed, as illustrated in Fig.  2 A, which shows firms with three or more founders are more than twice as likely to succeed than solo-founded startups. This finding is consistent with investors’ advice to founders and previous studies 46 . We also noted that some personality types of founders increase the probability of success more than others, as shown in SI section  A.6 (Extended Data Figures  16 and 17 ). Also, we note that gender differences play out in the distribution of personality facets: successful female founders and successful male founders show facet scores that are more similar to each other than are non-successful female founders to non-successful male founders (see Extended Data Figure  18 ).

figure 2

The Ensemble Theory of Team-Level Factors of Startup Success. ( A ) Having a larger founder team elevates the chances of success. This can be due to multiple reasons, e.g., a more extensive network or knowledge base but also personality diversity. ( B ) We show that joint personality combinations of founders are significantly related to higher chances of success. This is because it takes more than one founder to cover all beneficial personality traits that ‘breed’ success. ( C ) In our multifactor model, we show that firms with diverse and specific combinations of types of founders have significantly higher odds of success.

Access to more extensive networks and capital could explain the benefits of having more founders. Still, as we find here, it also offers a greater diversity of combined personalities, naturally providing a broader range of maximum traits. So, for example, one founder may be more open and adventurous, and another could be highly agreeable and trustworthy, thus, potentially complementing each other’s particular strengths associated with startup success.

The benefits of larger and more personality-diverse foundation teams can be seen in the apparent differences between successful and unsuccessful firms based on their combined Big Five personality team footprints, as illustrated in Fig.  2 B. Here, maximum values for each Big Five trait of a startup’s co-founders are mapped; stratified by successful and non-successful companies. Founder teams of successful startups tend to score higher on Openness, Conscientiousness, Extraversion, and Agreeableness.

When examining the combinations of founders with different personality types, we find that some ensembles of personalities were significantly correlated with greater chances of startup success—while controlling for other variables in the model—as shown in Fig.  2 C (for more details on the modelling, the predictive performance and the coefficient estimates of the final model, see Extended Data Figures  19 , 20 , and 21 in SI section  A.6 ).

Three combinations of trio-founder companies were more than twice as likely to succeed than other combinations, namely teams with (1) a Leader and two Developers , (2) an Operator and two Developers , and (3) an Expert/Engineer , Leader and Developer . To illustrate the potential mechanisms on how personality traits might influence the success of startups, we provide some examples of well-known, successful startup founders and their characteristic personality traits in Extended Data Figure  22 .

Startups are one of the key mechanisms for brilliant ideas to become solutions to some of the world’s most challenging economic and social problems. Examples include the Google search algorithm, disability technology startup Fingerwork’s touchscreen technology that became the basis of the Apple iPhone, or the Biontech mRNA technology that powered Pfizer’s COVID-19 vaccine.

We have shown that founders’ personalities and the combination of personalities in the founding team of a startup have a material and significant impact on its likelihood of success. We have also shown that successful startup founders’ personality traits are significantly different from those of successful employees—so much so that a simple predictor can be trained to distinguish between employees and entrepreneurs with more than 80% accuracy using personality trait data alone.

Just as occupation-personality maps derived from data can provide career guidance tools, so too can data on successful entrepreneurs’ personality traits help people decide whether becoming a founder may be a good choice for them.

We have learnt through this research that there is not one type of ideal ’entrepreneurial’ personality but six different types. Many successful startups have multiple co-founders with a combination of these different personality types.

To a large extent, founding a startup is a team sport; therefore, diversity and complementarity of personalities matter in the foundation team. It has an outsized impact on the company’s likelihood of success. While all startups are high risk, the risk becomes lower with more founders, particularly if they have distinct personality traits.

Our work demonstrates the benefits of personality diversity among the founding team of startups. Greater awareness of this novel form of diversity may help create more resilient startups capable of more significant innovation and impact.

The data-driven research approach presented here comes with certain methodological limitations. The principal data sources of this study—Crunchbase and Twitter—are extensive and comprehensive, but there are characterised by some known and likely sample biases.

Crunchbase is the principal public chronicle of venture capital funding. So, there is some likely sample bias toward: (1) Startup companies that are funded externally: self-funded or bootstrapped companies are less likely to be represented in Crunchbase; (2) technology companies, as that is Crunchbase’s roots; (3) multi-founder companies; (4) male founders: while the representation of female founders is now double that of the mid-2000s, women still represent less than 25% of the sample; (5) companies that succeed: companies that fail, especially those that fail early, are likely to be less represented in the data.

Samples were also limited to those founders who are active on Twitter, which adds additional selection biases. For example, Twitter users typically are younger, more educated and have a higher median income 47 . Another limitation of our approach is the potentially biased presentation of a person’s digital identity on social media, which is the basis for identifying personality traits. For example, recent research suggests that the language and emotional tone used by entrepreneurs in social media can be affected by events such as business failure 48 , which might complicate the personality trait inference.

In addition to sampling biases within the data, there are also significant historical biases in startup culture. For many aspects of the entrepreneurship ecosystem, women, for example, are at a disadvantage 49 . Male-founded companies have historically dominated most startup ecosystems worldwide, representing the majority of founders and the overwhelming majority of venture capital investors. As a result, startups with women have historically attracted significantly fewer funds 50 , in part due to the male bias among venture investors, although this is now changing, albeit slowly 51 .

The research presented here provides quantitative evidence for the relevance of personality types and the diversity of personalities in startups. At the same time, it brings up other questions on how personality traits are related to other factors associated with success, such as:

Will the recent growing focus on promoting and investing in female founders change the nature, composition and dynamics of startups and their personalities leading to a more diverse personality landscape in startups?

Will the growth of startups outside of the United States change what success looks like to investors and hence the role of different personality traits and their association to diverse success metrics?

Many of today’s most renowned entrepreneurs are either Baby Boomers (such as Gates, Branson, Bloomberg) or Generation Xers (such as Benioff, Cannon-Brookes, Musk). However, as we can see, personality is both a predictor and driver of success in entrepreneurship. Will generation-wide differences in personality and outlook affect startups and their success?

Moreover, the findings shown here have natural extensions and applications beyond startups, such as for new projects within large established companies. While not technically startups, many large enterprises and industries such as construction, engineering and the film industry rely on forming new project-based, cross-functional teams that are often new ventures and share many characteristics of startups.

There is also potential for extending this research in other settings in government, NGOs, and within the research community. In scientific research, for example, team diversity in terms of age, ethnicity and gender has been shown to be predictive of impact, and personality diversity may be another critical dimension 52 .

Another extension of the study could investigate the development of the language used by startup founders on social media over time. Such an extension could investigate whether the language (and inferred psychological characteristics) change as the entrepreneurs’ ventures go through major business events such as foundation, funding, or exit.

Overall, this study demonstrates, first, that startup founders have significantly different personalities than employees. Secondly, besides firm-level factors, which are known to influence firm success, we show that a range of founder-level factors, notably the character traits of its founders, significantly impact a startup’s likelihood of success. Lastly, we looked at team-level factors. We discovered in a multifactor analysis that personality-diverse teams have the most considerable impact on the probability of a startup’s success, underlining the importance of personality diversity as a relevant factor of team performance and success.

Data sources

Entrepreneurs dataset.

Data about the founders of startups were collected from Crunchbase (Table  2 ), an open reference platform for business information about private and public companies, primarily early-stage startups. It is one of the largest and most comprehensive data sets of its kind and has been used in over 100 peer-reviewed research articles about economic and managerial research.

Crunchbase contains data on over two million companies - mainly startup companies and the companies who partner with them, acquire them and invest in them, as well as profiles on well over one million individuals active in the entrepreneurial ecosystem worldwide from over 200 countries and spans. Crunchbase started in the technology startup space, and it now covers all sectors, specifically focusing on entrepreneurship, investment and high-growth companies.

While Crunchbase contains data on over one million individuals in the entrepreneurial ecosystem, some are not entrepreneurs or startup founders but play other roles, such as investors, lawyers or executives at companies that acquire startups. To create a subset of only entrepreneurs, we selected a subset of 32,732 who self-identify as founders and co-founders (by job title) and who are also publicly active on the social media platform Twitter. We also removed those who also are venture capitalists to distinguish between investors and founders.

We selected founders active on Twitter to be able to use natural language processing to infer their Big Five personality features using an open-vocabulary approach shown to be accurate in the previous research by analysing users’ unstructured text, such as Twitter posts in our case. For this project, as with previous research 20 , we employed a commercial service, IBM Watson Personality Insight, to infer personality facets. This service provides raw scores and percentile scores of Big Five Domains (Openness, Conscientiousness, Extraversion, Agreeableness and Emotional Stability) and the corresponding 30 subdomains or facets. In addition, the public content of Twitter posts was collected, and there are 32,732 profiles that each had enough Twitter posts (more than 150 words) to get relatively accurate personality scores (less than 12.7% Average Mean Absolute Error).

The entrepreneurs’ dataset is analysed in combination with other data about the companies they founded to explore questions about the nature and patterns of personality traits of entrepreneurs and the relationships between these patterns and company success.

For the multifactor analysis, we further filtered the data in several preparatory steps for the success prediction modelling (for more details, see SI section  A.5 ). In particular, we removed data points with missing values (Extended Data Fig.  13 ) and kept only companies in the data that were founded from 1990 onward to ensure consistency with previous research 32 (see Extended Data Fig.  14 ). After cleaning, filtering and pre-processing the data, we ended up with data from 25,214 founders who founded 21,187 startup companies to be used in the multifactor analysis. Of those, 3442 startups in the data were successful, 2362 in the first seven years after they were founded (see Extended Data Figure  15 for more details).

Entrepreneurs and employees dataset

To investigate whether startup founders show personality traits that are similar or different from the population at large (i. e. the entrepreneurs vs employees sub-analysis shown in Fig.  1 A and B), we filtered the entrepreneurs’ data further: we reduced the sample to those founders of companies, which attracted more than US$100k in investment to create a reference set of successful entrepreneurs (n \(=\) 4400).

To create a control group of employees who are not also entrepreneurs or very unlikely to be of have been entrepreneurs, we leveraged the fact that while some occupational titles like CEO, CTO and Public Speaker are commonly shared by founders and co-founders, some others such as Cashier , Zoologist and Detective very rarely co-occur seem to be founders or co-founders. To illustrate, many company founders also adopt regular occupation titles such as CEO or CTO. Many founders will be Founder and CEO or Co-founder and CTO. While founders are often CEOs or CTOs, the reverse is not necessarily true, as many CEOs are professional executives that were not involved in the establishment or ownership of the firm.

Using data from LinkedIn, we created an Entrepreneurial Occupation Index (EOI) based on the ratio of entrepreneurs for each of the 624 occupations used in a previous study of occupation-personality fit 44 . It was calculated based on the percentage of all people working in the occupation from LinkedIn compared to those who shared the title Founder or Co-founder (See SI section  A.2 for more details). A reference set of employees (n=6685) was then selected across the 112 different occupations with the lowest propensity for entrepreneurship (less than 0.5% EOI) from a large corpus of Twitter users with known occupations, which is also drawn from the previous occupational-personality fit study 44 .

These two data sets were used to test whether it may be possible to distinguish successful entrepreneurs from successful employees based on the different patterns of personality traits alone.

Hierarchical clustering

We applied several clustering techniques and tests to the personality vectors of the entrepreneurs’ data set to determine if there are natural clusters and, if so, how many are the optimum number.

Firstly, to determine if there is a natural typology to founder personalities, we applied the Hopkins statistic—a statistical test we used to answer whether the entrepreneurs’ dataset contains inherent clusters. It measures the clustering tendency based on the ratio of the sum of distances of real points within a sample of the entrepreneurs’ dataset to their nearest neighbours and the sum of distances of randomly selected artificial points from a simulated uniform distribution to their nearest neighbours in the real entrepreneurs’ dataset. The ratio measures the difference between the entrepreneurs’ data distribution and the simulated uniform distribution, which tests the randomness of the data. The range of Hopkins statistics is from 0 to 1. The scores are close to 0, 0.5 and 1, respectively, indicating whether the dataset is uniformly distributed, randomly distributed or highly clustered.

To cluster the founders by personality facets, we used Agglomerative Hierarchical Clustering (AHC)—a bottom-up approach that treats an individual data point as a singleton cluster and then iteratively merges pairs of clusters until all data points are included in the single big collection. Ward’s linkage method is used to choose the pair of groups for minimising the increase in the within-cluster variance after combining. AHC was widely applied to clustering analysis since a tree hierarchy output is more informative and interpretable than K-means. Dendrograms were used to visualise the hierarchy to provide the perspective of the optimal number of clusters. The heights of the dendrogram represent the distance between groups, with lower heights representing more similar groups of observations. A horizontal line through the dendrogram was drawn to distinguish the number of significantly different clusters with higher heights. However, as it is not possible to determine the optimum number of clusters from the dendrogram, we applied other clustering performance metrics to analyse the optimal number of groups.

A range of Clustering performance metrics were used to help determine the optimal number of clusters in the dataset after an apparent clustering tendency was confirmed. The following metrics were implemented to evaluate the differences between within-cluster and between-cluster distances comprehensively: Dunn Index, Calinski-Harabasz Index, Davies-Bouldin Index and Silhouette Index. The Dunn Index measures the ratio of the minimum inter-cluster separation and the maximum intra-cluster diameter. At the same time, the Calinski-Harabasz Index improves the measurement of the Dunn Index by calculating the ratio of the average sum of squared dispersion of inter-cluster and intra-cluster. The Davies-Bouldin Index simplifies the process by treating each cluster individually. It compares the sum of the average distance among intra-cluster data points to the cluster centre of two separate groups with the distance between their centre points. Finally, the Silhouette Index is the overall average of the silhouette coefficients for each sample. The coefficient measures the similarity of the data point to its cluster compared with the other groups. Higher scores of the Dunn, Calinski-Harabasz and Silhouette Index and a lower score of the Davies-Bouldin Index indicate better clustering configuration.

Classification modelling

Classification algorithms.

To obtain a comprehensive and robust conclusion in the analysis predicting whether a given set of personality traits corresponds to an entrepreneur or an employee, we explored the following classifiers: Naïve Bayes, Elastic Net regularisation, Support Vector Machine, Random Forest, Gradient Boosting and Stacked Ensemble. The Naïve Bayes classifier is a probabilistic algorithm based on Bayes’ theorem with assumptions of independent features and equiprobable classes. Compared with other more complex classifiers, it saves computing time for large datasets and performs better if the assumptions hold. However, in the real world, those assumptions are generally violated. Elastic Net regularisation combines the penalties of Lasso and Ridge to regularise the Logistic classifier. It eliminates the limitation of multicollinearity in the Lasso method and improves the limitation of feature selection in the Ridge method. Even though Elastic Net is as simple as the Naïve Bayes classifier, it is more time-consuming. The Support Vector Machine (SVM) aims to find the ideal line or hyperplane to separate successful entrepreneurs and employees in this study. The dividing line can be non-linear based on a non-linear kernel, such as the Radial Basis Function Kernel. Therefore, it performs well on high-dimensional data while the ’right’ kernel selection needs to be tuned. Random Forest (RF) and Gradient Boosting Trees (GBT) are ensembles of decision trees. All trees are trained independently and simultaneously in RF, while a new tree is trained each time and corrected by previously trained trees in GBT. RF is a more robust and straightforward model since it does not have many hyperparameters to tune. GBT optimises the objective function and learns a more accurate model since there is a successive learning and correction process. Stacked Ensemble combines all existing classifiers through a Logistic Regression. Better than bagging with only variance reduction and boosting with only bias reduction, the ensemble leverages the benefit of model diversity with both lower variance and bias. All the above classification algorithms distinguish successful entrepreneurs and employees based on the personality matrix.

Evaluation metrics

A range of evaluation metrics comprehensively explains the performance of a classification prediction. The most straightforward metric is accuracy, which measures the overall portion of correct predictions. It will mislead the performance of an imbalanced dataset. The F1 score is better than accuracy by combining precision and recall and considering the False Negatives and False Positives. Specificity measures the proportion of detecting the true negative rate that correctly identifies employees, while Positive Predictive Value (PPV) calculates the probability of accurately predicting successful entrepreneurs. Area Under the Receiver Operating Characteristic Curve (AUROC) determines the capability of the algorithm to distinguish between successful entrepreneurs and employees. A higher value means the classifier performs better on separating the classes.

Feature importance

To further understand and interpret the classifier, it is critical to identify variables with significant predictive power on the target. Feature importance of tree-based models measures Gini importance scores for all predictors, which evaluate the overall impact of the model after cutting off the specific feature. The measurements consider all interactions among features. However, it does not provide insights into the directions of impacts since the importance only indicates the ability to distinguish different classes.

Statistical analysis

T-test, Cohen’s D and two-sample Kolmogorov-Smirnov test are introduced to explore how the mean values and distributions of personality facets between entrepreneurs and employees differ. The T-test is applied to determine whether the mean of personality facets of two group samples are significantly different from one another or not. The facets with significant differences detected by the hypothesis testing are critical to separate the two groups. Cohen’s d is to measure the effect size of the results of the previous t-test, which is the ratio of the mean difference to the pooled standard deviation. A larger Cohen’s d score indicates that the mean difference is greater than the variability of the whole sample. Moreover, it is interesting to check whether the two groups’ personality facets’ probability distributions are from the same distribution through the two-sample Kolmogorov-Smirnov test. There is no assumption about the distributions, but the test is sensitive to deviations near the centre rather than the tail.

Privacy and ethics

The focus of this research is to provide high-level insights about groups of startups, founders and types of founder teams rather than on specific individuals or companies. While we used unit record data from the publicly available data of company profiles from Crunchbase , we removed all identifiers from the underlying data on individual companies and founders and generated aggregate results, which formed the basis for our analysis and conclusions.

Data availability

A dataset which includes only aggregated statistics about the success of startups and the factors that influence is released as part of this research. Underlying data for all figures and the code to reproduce them are available on GitHub: https://github.com/Braesemann/FounderPersonalities . Please contact Fabian Braesemann ( [email protected] ) in case you have any further questions.

Change history

07 may 2024.

A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-61082-7

Henrekson, M. & Johansson, D. Gazelles as job creators: A survey and interpretation of the evidence. Small Bus. Econ. 35 , 227–244 (2010).

Article   Google Scholar  

Davila, A., Foster, G., He, X. & Shimizu, C. The rise and fall of startups: Creation and destruction of revenue and jobs by young companies. Aust. J. Manag. 40 , 6–35 (2015).

Which vaccine saved the most lives in 2021?: Covid-19. The Economist (Online) (2022). noteName - AstraZeneca; Pfizer Inc; BioNTech SE; Copyright - Copyright The Economist Newspaper NA, Inc. Jul 14, 2022; Last updated - 2022-11-29.

Oltermann, P. Pfizer/biontech tax windfall brings mainz an early christmas present (2021). noteName - Pfizer Inc; BioNTech SE; Copyright - Copyright Guardian News & Media Limited Dec 27, 2021; Last updated - 2021-12-28.

Grant, K. A., Croteau, M. & Aziz, O. The survival rate of startups funded by angel investors. I-INC WHITE PAPER SER.: MAR 2019 , 1–21 (2019).

Google Scholar  

Top 20 reasons start-ups fail - cb insights version (2019). noteCopyright - Copyright Newstex Oct 21, 2019; Last updated - 2022-10-25.

Hochberg, Y. V., Ljungqvist, A. & Lu, Y. Whom you know matters: Venture capital networks and investment performance. J. Financ. 62 , 251–301 (2007).

Fracassi, C., Garmaise, M. J., Kogan, S. & Natividad, G. Business microloans for us subprime borrowers. J. Financ. Quantitative Ana. 51 , 55–83 (2016).

Davila, A., Foster, G. & Gupta, M. Venture capital financing and the growth of startup firms. J. Bus. Ventur. 18 , 689–708 (2003).

Nann, S. et al. Comparing the structure of virtual entrepreneur networks with business effectiveness. Proc. Soc. Behav. Sci. 2 , 6483–6496 (2010).

Guzman, J. & Stern, S. Where is silicon valley?. Science 347 , 606–609 (2015).

Article   ADS   CAS   PubMed   Google Scholar  

Aldrich, H. E. & Wiedenmayer, G. From traits to rates: An ecological perspective on organizational foundings. 61–97 (2019).

Gartner, W. B. Who is an entrepreneur? is the wrong question. Am. J. Small Bus. 12 , 11–32 (1988).

Thornton, P. H. The sociology of entrepreneurship. Ann. Rev. Sociol. 25 , 19–46 (1999).

Eikelboom, M. E., Gelderman, C. & Semeijn, J. Sustainable innovation in public procurement: The decisive role of the individual. J. Public Procure. 18 , 190–201 (2018).

Kerr, S. P. et al. Personality traits of entrepreneurs: A review of recent literature. Found. Trends Entrep. 14 , 279–356 (2018).

Hamilton, B. H., Papageorge, N. W. & Pande, N. The right stuff? Personality and entrepreneurship. Quant. Econ. 10 , 643–691 (2019).

Salmony, F. U. & Kanbach, D. K. Personality trait differences across types of entrepreneurs: A systematic literature review. RMS 16 , 713–749 (2022).

Freiberg, B. & Matz, S. C. Founder personality and entrepreneurial outcomes: A large-scale field study of technology startups. Proc. Natl. Acad. Sci. 120 , e2215829120 (2023).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Kern, M. L., McCarthy, P. X., Chakrabarty, D. & Rizoiu, M.-A. Social media-predicted personality traits and values can help match people to their ideal jobs. Proc. Natl. Acad. Sci. 116 , 26459–26464 (2019).

Article   ADS   CAS   PubMed   PubMed Central   Google Scholar  

Dalle, J.-M., Den Besten, M. & Menon, C. Using crunchbase for economic and managerial research. (2017).

Block, J. & Sandner, P. What is the effect of the financial crisis on venture capital financing? Empirical evidence from us internet start-ups. Ventur. Cap. 11 , 295–309 (2009).

Antretter, T., Blohm, I. & Grichnik, D. Predicting startup survival from digital traces: Towards a procedure for early stage investors (2018).

Dworak, D. Analysis of founder background as a predictor for start-up success in achieving successive fundraising rounds. (2022).

Hsu, D. H. Venture capitalists and cooperative start-up commercialization strategy. Manage. Sci. 52 , 204–219 (2006).

Blank, S. Why the lean start-up changes everything (2018).

Kaplan, S. N. & Lerner, J. It ain’t broke: The past, present, and future of venture capital. J. Appl. Corp. Financ. 22 , 36–47 (2010).

Hallen, B. L. & Eisenhardt, K. M. Catalyzing strategies and efficient tie formation: How entrepreneurial firms obtain investment ties. Acad. Manag. J. 55 , 35–70 (2012).

Gompers, P. A. & Lerner, J. The Venture Capital Cycle (MIT Press, 2004).

Shane, S. & Venkataraman, S. The promise of entrepreneurship as a field of research. Acad. Manag. Rev. 25 , 217–226 (2000).

Zahra, S. A. & Wright, M. Understanding the social role of entrepreneurship. J. Manage. Stud. 53 , 610–629 (2016).

Bonaventura, M. et al. Predicting success in the worldwide start-up network. Sci. Rep. 10 , 1–6 (2020).

Schwartz, H. A. et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE 8 , e73791 (2013).

Plank, B. & Hovy, D. Personality traits on twitter-or-how to get 1,500 personality tests in a week. In Proceedings of the 6th workshop on computational approaches to subjectivity, sentiment and social media analysis , pp 92–98 (2015).

Arnoux, P.-H. et al. 25 tweets to know you: A new model to predict personality with social media. In booktitleEleventh international AAAI conference on web and social media (2017).

Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A. & Goldberg, L. R. The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspect. Psychol. Sci. 2 , 313–345 (2007).

Article   PubMed   PubMed Central   Google Scholar  

Youyou, W., Kosinski, M. & Stillwell, D. Computer-based personality judgments are more accurate than those made by humans. Proc. Natl. Acad. Sci. 112 , 1036–1040 (2015).

Soldz, S. & Vaillant, G. E. The big five personality traits and the life course: A 45-year longitudinal study. J. Res. Pers. 33 , 208–232 (1999).

Damian, R. I., Spengler, M., Sutu, A. & Roberts, B. W. Sixteen going on sixty-six: A longitudinal study of personality stability and change across 50 years. J. Pers. Soc. Psychol. 117 , 674 (2019).

Article   PubMed   Google Scholar  

Rantanen, J., Metsäpelto, R.-L., Feldt, T., Pulkkinen, L. & Kokko, K. Long-term stability in the big five personality traits in adulthood. Scand. J. Psychol. 48 , 511–518 (2007).

Roberts, B. W., Caspi, A. & Moffitt, T. E. The kids are alright: Growth and stability in personality development from adolescence to adulthood. J. Pers. Soc. Psychol. 81 , 670 (2001).

Article   CAS   PubMed   Google Scholar  

Cobb-Clark, D. A. & Schurer, S. The stability of big-five personality traits. Econ. Lett. 115 , 11–15 (2012).

Graham, P. Do Things that Don’t Scale (Paul Graham, 2013).

McCarthy, P. X., Kern, M. L., Gong, X., Parker, M. & Rizoiu, M.-A. Occupation-personality fit is associated with higher employee engagement and happiness. (2022).

Pratt, A. C. Advertising and creativity, a governance approach: A case study of creative agencies in London. Environ. Plan A 38 , 1883–1899 (2006).

Klotz, A. C., Hmieleski, K. M., Bradley, B. H. & Busenitz, L. W. New venture teams: A review of the literature and roadmap for future research. J. Manag. 40 , 226–255 (2014).

Duggan, M., Ellison, N. B., Lampe, C., Lenhart, A. & Madden, M. Demographics of key social networking platforms. Pew Res. Center 9 (2015).

Fisch, C. & Block, J. H. How does entrepreneurial failure change an entrepreneur’s digital identity? Evidence from twitter data. J. Bus. Ventur. 36 , 106015 (2021).

Brush, C., Edelman, L. F., Manolova, T. & Welter, F. A gendered look at entrepreneurship ecosystems. Small Bus. Econ. 53 , 393–408 (2019).

Kanze, D., Huang, L., Conley, M. A. & Higgins, E. T. We ask men to win and women not to lose: Closing the gender gap in startup funding. Acad. Manag. J. 61 , 586–614 (2018).

Fan, J. S. Startup biases. UC Davis Law Review (2022).

AlShebli, B. K., Rahwan, T. & Woon, W. L. The preeminence of ethnic diversity in scientific collaboration. Nat. Commun. 9 , 1–10 (2018).

Article   CAS   Google Scholar  

Żbikowski, K. & Antosiuk, P. A machine learning, bias-free approach for predicting business success using crunchbase data. Inf. Process. Manag. 58 , 102555 (2021).

Corea, F., Bertinetti, G. & Cervellati, E. M. Hacking the venture industry: An early-stage startups investment framework for data-driven investors. Mach. Learn. Appl. 5 , 100062 (2021).

Chapman, G. & Hottenrott, H. Founder personality and start-up subsidies. Founder Personality and Start-up Subsidies (2021).

Antoncic, B., Bratkovicregar, T., Singh, G. & DeNoble, A. F. The big five personality-entrepreneurship relationship: Evidence from slovenia. J. Small Bus. Manage. 53 , 819–841 (2015).

Download references

Acknowledgements

We thank Gary Brewer from BuiltWith ; Leni Mayo from Influx , Rachel Slattery from TeamSlatts and Daniel Petre from AirTree Ventures for their ongoing generosity and insights about startups, founders and venture investments. We also thank Tim Li from Crunchbase for advice and liaison regarding data on startups and Richard Slatter for advice and referrals in Twitter .

Author information

Authors and affiliations.

The Data Science Institute, University of Technology Sydney, Sydney, NSW, Australia

Paul X. McCarthy

School of Computer Science and Engineering, UNSW Sydney, Sydney, NSW, Australia

Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia

Xian Gong & Marian-Andrei Rizoiu

Oxford Internet Institute, University of Oxford, Oxford, UK

Fabian Braesemann & Fabian Stephany

DWG Datenwissenschaftliche Gesellschaft Berlin, Berlin, Germany

Melbourne Graduate School of Education, The University of Melbourne, Parkville, VIC, Australia

Margaret L. Kern

You can also search for this author in PubMed   Google Scholar

Contributions

All authors designed research; All authors analysed data and undertook investigation; F.B. and F.S. led multi-factor analysis; P.M., X.G. and M.A.R. led the founder/employee prediction; M.L.K. led personality insights; X.G. collected and tabulated the data; X.G., F.B., and F.S. created figures; X.G. created final art, and all authors wrote the paper.

Corresponding author

Correspondence to Fabian Braesemann .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this Article was revised: The Data Availability section in the original version of this Article was incomplete, the link to the GitHub repository was omitted. Full information regarding the corrections made can be found in the correction for this Article.

Supplementary Information

Supplementary information., rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

McCarthy, P.X., Gong, X., Braesemann, F. et al. The impact of founder personalities on startup success. Sci Rep 13 , 17200 (2023). https://doi.org/10.1038/s41598-023-41980-y

Download citation

Received : 15 February 2023

Accepted : 04 September 2023

Published : 17 October 2023

DOI : https://doi.org/10.1038/s41598-023-41980-y

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

what is population and sample in research with example

COMMENTS

  1. Population vs. Sample

    A population is the entire group that you want to draw conclusions about.. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population. In research, a population doesn't always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries ...

  2. Population vs Sample: Uses and Examples

    Population and Sample Examples. For an example of population vs sample, researchers might be studying U.S. college students. This population contains about 19 million students and is too large and geographically dispersed to study fully. However, researchers can draw a subset of a manageable size to learn about its characteristics.

  3. Population vs. Sample

    Example 1: Research Study: Investigating the prevalence of stress among high school students in a specific city and its impact on academic performance. Population: All high school students in a particular city Sampling Frame: The sampling frame would involve obtaining a comprehensive list of all high schools in the specific city. A random selection of schools would be made from this list to ...

  4. Population vs Sample

    Definition. In quantitative research methodology, the sample is a set of collected data from a defined procedure. It is basically a much smaller part of the whole, i.e., population. The sample depicts all the members of the population that are under observation when conducting research surveys.

  5. 7 Samples and Populations

    So if you want to sample one-tenth of the population, you'd select every tenth name. In order to know the k for your study you need to know your sample size (say 1000) and the size of the population (75000). You can divide the size of the population by the sample (75000/1000), which will produce your k (750).

  6. 1.3: Populations and Samples

    Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data. Quantitative data may be either discrete or continuous.

  7. Population vs. Sample: What's the Difference?

    Here is an example of a population vs. a sample in the three intro examples. Example 1: What is the median household income in Miami, Florida? ... By the time we collect all of this data, the population may have changed or the research question of interest might no longer be of interest. 2. It is too costly to collect data on an entire population.

  8. 8.1: Samples, Populations and Sampling

    Defining a population. A sample is a concrete thing. You can open up a data file, and there's the data from your sample. A population, on the other hand, is a more abstract idea.It refers to the set of all possible people, or all possible observations, that you want to draw conclusions about, and is generally much bigger than the sample. In an ideal world, the researcher would begin the ...

  9. Sampling Methods

    The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally, it should include the entire target population (and nobody who is not part of that population). Example: Sampling frame You are doing research on working conditions at a social media marketing company. Your population is all 1000 employees of the ...

  10. 3. Populations and samples

    Answers Chapter 3 Q3.pdf. Populations In statistics the term "population" has a slightly different meaning from the one given to it in ordinary speech. It need not refer only to people or to animate creatures - the population of Britain, for instance or the dog population of London. Statisticians also speak of a population.

  11. Population and Samples: the Complete Guide

    In statistical methods, a sample consists of a smaller group of entities, which are taken from the entire population. This creates a subset group that is easier to manage and has the characteristics of the larger population. This smaller subset is then surveyed to gain information and data. The sample should reflect the population as a whole ...

  12. Statistics without tears: Populations and samples

    A population is a complete set of people with a specialized set of characteristics, and a sample is a subset of the population. The usual criteria we use in defining population are geographic, for example, "the population of Uttar Pradesh". In medical research, the criteria for population may be clinical, demographic and time related.

  13. Populations, Parameters, and Samples in Inferential Statistics

    Example of a Population with Important Subpopulations. ... In both cases, your sample or population is defined by the scope of your research question or area of interest. The distinction between a sample and a population isn't a fixed, objective attribute of a set of data, but rather a perspective that depends on the particular context and ...

  14. Samples & Populations in Research

    Population and sample in research are often confused with one another, so it is important to understand the differences between the terms population and sample. A population is an entire group of ...

  15. Population vs sample in research: What's the difference?

    A sample is a select group of individuals from the research population. A sample is only a subset or a subgroup of the population and, by definition, is always smaller than the population. However, well-selected samples accurately represent the entire population. Below are some examples to illustrate the differences between population vs sample:

  16. Population vs Sample

    A population is the entire group that you want to draw conclusions about. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population. In research, a population doesn't always refer to people. It can mean a group containing elements of anything you want to study ...

  17. What Is the Big Deal About Populations in Research?

    A population is a complete set of people with specified characteristics, while a sample is a subset of the population. 1 In general, most people think of the defining characteristic of a population in terms of geographic location. However, in research, other characteristics will define a population.

  18. PDF Describing Populations and Samples in Doctoral Student Research

    When selecting a sample, there are two primary considerations: how many units must be in the sample (sample size) and how will these units be selected (sampling methods). Figure 1 depicts how a researcher identifies a sample from the population of interest and the target population within the sampling frame.

  19. 1.1.5.1: Collecting Data- More Practice with Populations and Samples

    Random samples, especially if the sample size is small, are not necessarily representative of the entire population. For example, if a random sample of 20 subjects were taken from a population with an equal number of males and females, there would be a nontrivial probability (0.06) that 70% or more of the sample would be female.

  20. Research Fundamentals: Study Design, Population, and Sample Size

    design, population of interest, study setting, recruit ment, and sampling. Study Design. The study design is the use of e vidence-based. procedures, protocols, and guidelines that provide the ...

  21. Sampling Methods

    This is often used to ensure that the sample is representative of the population as a whole. Cluster Sampling: In this method, the population is divided into clusters or groups, and then a random sample of clusters is selected. Then, all members of the selected clusters are included in the sample. Multi-Stage Sampling: This method combines two ...

  22. Population and Sample

    Sample. It includes one or more observations that are drawn from the population and the measurable characteristic of a sample is a statistic. Sampling is the process of selecting the sample from the population. For example, some people living in India is the sample of the population. Basically, there are two types of sampling.

  23. PDF 84 CHAPTER 3 Research design, research method and population

    population. 3.5.4 Sample size A general rule of the thumb is to always use the largest sample possible. The larger the sample the more representative it is going to be, smaller samples produce less accurate results because they are likely to be less representative of the population (LoBiondo-Wood & Haber 1998:263-264). In this

  24. (PDF) CONCEPT OF POPULATION AND SAMPLE

    A part of population that repre sents it completely is known as sample. It means, the units, selected from the population as a sample, must represent all kind of characteristics of different ...

  25. The sense of coherence scale: psychometric properties in a

    Sense of coherence (SOC) is a personal resource that reflects the extent to which one perceives the world as comprehensible, manageable, and meaningful. Decades of empirical research consistently show that SOC is an important protective resource for health and well-being. Despite the extensive use of the 13-item measure of SOC, there remains uncertainty regarding its factorial structure.

  26. Understanding Population and Sample in Research Studies

    For example, if research seeks to determine the prevalence of a specific illness in a society, the population will include all persons living in that country. Shukla (2020) further states that a sample, on the other hand, is a portion of the population that has been chosen for in-depth research or observation.

  27. Sample Catalog Records and Illustrations

    The following catalog records and illustrations demonstrate how TGM I and TGM II terms, TGM I appendix subdivisions, and headings from the LC authority files are used in combination to provide subject and genre/form access to images in Library of Congress collections.The examples (shown with MARC format field tags) represent both item-level and group-level cataloging and a variety of material ...

  28. The impact of founder personalities on startup success

    Here, we show that founder personality traits are a significant feature of a firm's ultimate success. We draw upon detailed data about the success of a large-scale global sample of startups (n ...