Beyond the Boxes, Part 4: Complications in Coding Race and Ethnicity

Natalie Smith, Rae Anne Martinez, Nafeesa Andrabi, Andrea (Andi) Goodwin, Rachel Wilbur, Paul Zivich

This post is the fourth in a series about the use of race and ethnicity in population health research. Our previous posts have outlined our thoughts around conceptualizing and measuring race and ethnicity. Generally, researchers will then have to code those race and ethnicity measurements for use in analyses. This post details how we approach coding these variables.

As we’ve discussed, issues of race and ethnicity measurement are directly linked to analysis and interpretation. But there is another layer of complexity here: coding. The practice of coding variables—collapsing groups together or merging different variables—can fundamentally alter the results of data analysis, and ultimately, the interpretations of those results. As a part of our larger project studying how population researchers incorporate race and ethnicity into their work, we examined how population health studies code race and ethnicity. You can see some of our results here. In sum, we find:

Emphasis on binary coding schemes oriented around whiteness (i.e., “white,” “non-white”)
Broad use of “white,” “Black,” “Hispanic,” and “other,” where “Hispanic” is used as a de-facto racial category, and everyone else is aggregated into the ambiguous “other”
Slight variations on the above
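
To make these schemes concrete, here is a minimal sketch of how a binary recode and a four-group recode each reduce a more detailed set of responses. The category labels, the sample responses, and the function names are hypothetical and chosen for illustration only; they do not come from any study in our sample.

```python
from collections import Counter

# Hypothetical self-reported categories; real survey instruments differ.
responses = [
    "White", "Black", "Hispanic", "Asian", "American Indian or Alaska Native",
    "White", "Middle Eastern or North African", "Black", "Hispanic", "Asian",
]

def code_binary(category):
    """Binary scheme oriented around whiteness."""
    return "white" if category == "White" else "non-white"

def code_four_group(category):
    """Four-group scheme; any category not named falls into 'other'."""
    named = {"White": "white", "Black": "Black", "Hispanic": "Hispanic"}
    return named.get(category, "other")

detailed = Counter(responses)
binary = Counter(code_binary(r) for r in responses)
four_group = Counter(code_four_group(r) for r in responses)

print(len(detailed), "detailed categories")
print(dict(binary))
print(dict(four_group))
```

In this toy data, six detailed categories become two under the binary scheme, and three distinct groups are folded into a single “other” under the four-group scheme.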

These findings drive our guiding questions:

How will you code race or ethnicity in statistical analyses?
Have you collapsed race and ethnicity into an ethno-racial construct?
If you aggregate different racial or ethnic groups, what are the potential implications for your findings?
In your manuscript, have you communicated which groups you collapsed together, why these decisions were made, and what implications they have for interpreting your findings (biases, limitations)?

The underlying assumption when we engage with race and ethnicity variables is that these groups are meaningful in terms of history, privilege, access to resources, cultural similarities, and so on. When we collapse groups together, we implicitly make decisions about whose history, power, privilege, and so on are more or less similar and important. Sometimes, depending on the context and specific study question, collapsing might be advisable. Racial or ethnic groups may share similar histories along various axes. But oftentimes they do not. In such instances, collapsing groups is not advisable, particularly in the context of the highly ambiguous “other” category.

We are still considering how to approach the aggregation of groups—and really, we should all push ourselves to think about this more. How can we meaningfully interpret a coefficient that represents people with vastly different contexts and backgrounds? Why did we bother including a group at all if we can’t produce findings with respect to their unique lived experience?
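
One way to see the interpretation problem is with simple arithmetic. The sketch below uses made-up group sizes and prevalences (they are assumptions, not estimates from any dataset) to show that an estimate for an aggregated group is a size-weighted average of its subgroups, and may describe none of them well.

```python
# Hypothetical subgroup sizes and outcome prevalences, for illustration only.
subgroups = {
    "group A": {"n": 400, "prevalence": 0.10},
    "group B": {"n": 100, "prevalence": 0.30},
}

total_n = sum(g["n"] for g in subgroups.values())

# The pooled prevalence weights each subgroup by its sample size.
pooled = sum(g["n"] * g["prevalence"] for g in subgroups.values()) / total_n

print(f"pooled prevalence: {pooled:.2f}")
for name, g in subgroups.items():
    print(f"{name}: {g['prevalence']:.2f} (n={g['n']})")
```

Here the pooled value sits much closer to the larger subgroup, and the smaller subgroup's markedly higher prevalence is no longer visible in the aggregated result.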

We also want to note the differences between using an ethno-racial construct and using separate ethnicity and race constructs. These two approaches to coding carry very different embedded assumptions. When race and ethnicity are kept separate for analyses, we assume that each captures distinct information and could be related to population health outcomes in distinct ways. Conversely, when they are combined into an ethno-racial construct (e.g., non-Hispanic white, non-Hispanic Black, Hispanic, other), we assume that race and ethnicity capture similar information and have similar relationships to health outcomes.
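
The difference between the two approaches can be sketched in code. This is only an illustration under stated assumptions: the records, the field names `race` and `hispanic`, and the `ethno_racial` function are hypothetical, and the rule that Hispanic ethnicity overrides reported race is one common convention rather than the only possible one.

```python
from collections import Counter

# Hypothetical records with race and Hispanic ethnicity asked separately.
records = [
    {"race": "White", "hispanic": False},
    {"race": "White", "hispanic": True},
    {"race": "Black", "hispanic": False},
    {"race": "Black", "hispanic": True},  # e.g., an Afro-Latino respondent
    {"race": "American Indian or Alaska Native", "hispanic": True},
    {"race": "Asian", "hispanic": False},
]

def ethno_racial(record):
    """Combined construct: Hispanic ethnicity overrides reported race."""
    if record["hispanic"]:
        return "Hispanic"
    if record["race"] == "White":
        return "non-Hispanic white"
    if record["race"] == "Black":
        return "non-Hispanic Black"
    return "other"

combined = Counter(ethno_racial(r) for r in records)
separate = Counter((r["race"], r["hispanic"]) for r in records)

print(dict(combined))
print(len(separate), "distinct race-by-ethnicity cells when kept separate")
```

Under the combined construct, the three Hispanic respondents share one category regardless of reported race, while keeping the variables separate preserves all six race-by-ethnicity combinations.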

As researchers, we must be aware of the differences and similarities between race and ethnicity as constructs. Your research question might demand an ethno-racial perspective, or perhaps you’re dealing with data limitations.

Some additional guiding questions on this point:

Did you intentionally end up with an ethno-racial construct?
Do you agree with the assumptions behind this position?

Racial boundaries and identities may overlap with ethnic boundaries and identities. Clear delineation can be challenging. On that note, we believe it is important to name those we did not see represented, discussed, or acknowledged in the studies we sampled.

We never saw Middle Eastern or North African (MENA) people, Black or Afro-Latino people, or Indigenous Latino people represented. We also very rarely saw the diversity within Native American/Alaska Native, Asian, or Native Hawaiian and Pacific Islander populations explored in health scholarship. These groups are frequently relegated to the “other” category. For example, when we treat “Hispanic” or “Latino/a/x” as a racial group, Afro-Latinos and Indigenous Latinos typically fall into the same group, masking potentially important differences between them. Most authors fail to state who even constitutes the “other” category. As such, we are unable to tell whether folks are present in our studies but masked by coding practices, or whether they aren’t included at all. Regardless, more work is needed to explore the unique health concerns of these erased populations.
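
Stating who constitutes the “other” category is straightforward when the detailed responses are still available. The following is a minimal sketch, assuming hypothetical response labels and a hypothetical set of named groups (`NAMED`); it tabulates the composition of “other” so that it can be reported alongside the results.

```python
from collections import Counter

# Hypothetical detailed responses; labels are illustrative only.
responses = [
    "White", "Black", "Hispanic", "Asian", "Asian",
    "Middle Eastern or North African",
    "Native Hawaiian or Pacific Islander",
    "American Indian or Alaska Native",
]

# Hypothetical set of groups that keep their own category in the analysis.
NAMED = {"White", "Black", "Hispanic"}

# Everyone not in a named group would be coded as "other".
other_composition = Counter(r for r in responses if r not in NAMED)
n_other = sum(other_composition.values())

for group, count in other_composition.most_common():
    print(f"{group}: {count} of {n_other} in 'other'")
```

A table like this does not resolve whether aggregation was appropriate, but it lets readers see which groups are present and how many respondents each contributes.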

It is critical that we justify why we chose a particular action and that we understand the assumptions those actions entail. Over the last few posts, we have highlighted that our conceptualization of race and ethnicity, and how we measure these social constructs, influence the assumptions we make when we collapse groups together. To be clear: We are all making assumptions that influence our methodological choices. We recommend that authors make their assumptions and choices clear in their scientific communication.

Our next post will discuss our guiding questions around interpretation of race and ethnicity in analyses.

1 Comment

Sam Sellers
November 24, 2020 @ 12:25 pm
As a CPC Alum, I really appreciate the careful, nuanced discussion of a very delicate and complicated topic. Kudos to the authors on a job well done!
One issue that I hope the authors can speak to a bit more is the desire among scholars and advocates to make compelling claims about social disparities, and how this understandable desire intersects with challenges related to statistical power. Limited power can motivate those who do research that uses racial and/or ethnic categories to lump together groups with potentially very different histories or current challenges. However, this can create challenges by “erasing” the identities of smaller groups in published research. As shown very nicely by Pew in its work on income inequality (https://www.pewsocialtrends.org/2018/07/12/appendix-b-additional-tables-4/), ethnic groups that are commonly combined in many demographic surveys, such as Burmese or Chinese under “Asian”, or Puerto Rican and Argentinian under “Hispanic” or “Latino/a/x”, experience very different socioeconomic realities, yet members of these groups may be captured in tiny numbers in surveys where the sample size numbers in the hundreds or thousands.
A similar set of challenges relates to gender classification, where there is a need for more research to understand disparities associated with transgender and non-binary populations. As these groups are a small share of the overall population, it can be difficult to make statistically precise claims about these groups based on data collected from standard population-based surveys unless there are efforts made to oversample these groups. Not measuring disparities can lead to those disparities being ignored in policymaking that relies on population health research. On the other hand, researchers have an ethical responsibility to report the caveats and uncertainties associated with their findings, of which there are more when sample sizes are smaller. The limitations inherent in current methods of capturing racial and ethnic (and, I would argue, gender) classification necessitate creative strategies to measure the lived experiences and disparities experienced by identity groups that are smaller in size, which are often not captured in population-level surveys.
Another issue that would be interesting to probe further is changing social expectations regarding how individuals “should” identify when providing demographic information to employers or researchers based on changing social norms. As the cases of Rachel Dolezal and Elizabeth Warren, among others, show, identifying with a racial or ethnic group that does not reflect one’s own visible identity characteristics can lead to social condemnation, even if an individual has a sincerely held belief that he or she is a member of such a group. On the other hand, there are groups of individuals who share visible identity characteristics with a particular racial or ethnic group, but who have reservations about identifying with this group. This is the case for many Jews in the United States, as nicely discussed recently in the New York Times (https://www.nytimes.com/2020/10/13/magazine/im-jewish-and-dont-identify-as-white-why-must-i-check-that-box.html), some of whom will identify as white in surveys, but who may privately identify as Jewish (as a group distinct from “white”). As noted in the Part 3 blog post, gaps between “public” and “private” identities are not uncommon, but can be difficult to measure.
