Biomedical researchers often feel pressed to represent continuous or ordinal dependent variables in binary terms: For example, is the patient’s response above some criterion threshold that is used to define a clinically noteworthy condition? If “Yes,” then certain category labels and associated treatment recommendations may apply. This situation often seems to justify dichotomizing dependent variables at predetermined or invented thresholds, and subsequently adopting logistic regression as the data-analytic technique. Yet biostatisticians would urge researchers not to tread so quickly down this path.
The first reason is the loss of information that comes from collapsing a variable with fairly precisely measured values, which reflect the genuine heterogeneity of patients, into just two values. Any hope of using other variables to adjust or condition fits of the outcome (“predictions”) is undermined, because the correlations those variables have with the dichotomized response are much lower. If the goal is to test a treatment factor, it can take roughly a third more patients to find a significant effect when the response has been dichotomized. Don’t dichotomize because it reduces statistical power.
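The power cost can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the outcome is normally distributed, the treatment shifts it by half a standard deviation, there are 50 patients per arm, and the cut-off sits at the control-group mean. All of these numbers, and the function names, are invented for the example. Each simulated trial is analyzed twice, once on the continuous scale and once after dichotomizing, using simple large-sample z-tests.

```python
import math
import random
from statistics import mean, variance


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_power(n_per_arm=50, effect_sd=0.5, cutoff=0.0,
                   n_trials=2000, alpha=0.05, seed=1):
    """Estimate power for the continuous and the dichotomized analysis.

    All settings are hypothetical; the data are simulated, not real.
    """
    rng = random.Random(seed)
    hits_continuous = 0
    hits_dichotomized = 0
    n = n_per_arm
    for _ in range(n_trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(effect_sd, 1.0) for _ in range(n)]

        # Continuous analysis: large-sample z-test on the difference in means.
        se_mean = math.sqrt(variance(control) / n + variance(treated) / n)
        z_mean = (mean(treated) - mean(control)) / se_mean
        if two_sided_p(z_mean) < alpha:
            hits_continuous += 1

        # Dichotomized analysis: recode to 0/1 at the cut-off, then run a
        # two-proportion z-test with a pooled standard error.
        p_control = sum(y > cutoff for y in control) / n
        p_treated = sum(y > cutoff for y in treated) / n
        pooled = (p_control + p_treated) / 2.0
        se_prop = math.sqrt(2.0 * pooled * (1.0 - pooled) / n)
        if se_prop > 0 and two_sided_p((p_treated - p_control) / se_prop) < alpha:
            hits_dichotomized += 1

    return hits_continuous / n_trials, hits_dichotomized / n_trials


if __name__ == "__main__":
    power_continuous, power_dichotomized = simulate_power()
    print(f"power, continuous outcome:   {power_continuous:.2f}")
    print(f"power, dichotomized outcome: {power_dichotomized:.2f}")
```

Under these assumptions the dichotomized analysis detects the same treatment effect noticeably less often. The size of the gap depends on the effect size, the sample size, and where the cut-off falls.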
A second reason not to dichotomize continuous responses is that doing so creates a biologically implausible precipice between values just below and just above the cut-off (recoded 0 and 1, which are the maximum distance apart), when in fact the values should be almost the same. When displayed in a scatterplot against other continuous biological measures, responses typically vary more or less smoothly. Don’t dichotomize because it misrepresents the biology.
Another concern is that the cut-off values suggested in the literature often have not been chosen in a way that optimizes decisions and outcomes. Indeed, when one really investigates the cut-offs, there are often competing suggestions, and their justifications are “informal.” Don’t dichotomize because the evidence for the thresholds is often slender.
A desire for simplicity of presentation impels some researchers to dichotomize responses. For example, with the right coding of predictor variables, negative logistic regression coefficients indicate “protective” factors whereas positive coefficients indicate “risk” factors, thereby making the findings easy to grasp. Such an approach seems to aid communication, but it risks communicating faulty conclusions. A better approach is to analyze the data in its original continuous or ordinal form, and then summarize results using descriptive statistics for predictions obtained from a more nearly correct statistical model. The idea is to model the data accurately before seeking simplified presentations of results. Avoid premature dichotomization.
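As a minimal sketch of “model first, simplify later,” the example below fits a straight line to a continuous outcome and only afterwards reports the estimated chance of exceeding a clinical threshold. Everything in it is assumed for illustration: the data are simulated, the variable names (age, systolic blood pressure), the threshold of 140, and the helper functions are hypothetical, and the residuals are taken to be normal with constant spread. The probabilities are plug-in estimates that ignore uncertainty in the fitted coefficients.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_patients(n=300, seed=2):
    """Simulated (not real) ages and systolic blood pressures."""
    rng = random.Random(seed)
    ages = [rng.uniform(30, 80) for _ in range(n)]
    sbp = [100 + 0.6 * a + rng.gauss(0, 12) for a in ages]
    return ages, sbp


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns the intercept, the slope, and the residual standard deviation.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(ss_res / (len(x) - 2))
    return intercept, slope, sigma


def prob_above(threshold, x_value, intercept, slope, sigma):
    """Model-based estimate of P(outcome > threshold) at a given x."""
    predicted = intercept + slope * x_value
    return 1.0 - NormalDist(predicted, sigma).cdf(threshold)


if __name__ == "__main__":
    ages, sbp = simulate_patients()
    intercept, slope, sigma = fit_line(ages, sbp)
    for age in (40, 55, 70):
        p = prob_above(140, age, intercept, slope, sigma)
        print(f"age {age}: estimated P(SBP > 140) = {p:.2f}")
```

The threshold enters only at the reporting stage, so the full continuous information is used for estimation, and a different cut-off can be summarized later without refitting the model.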
So far we’ve been discussing dependent variables, but most of the caveats apply to independent variables as well. For example, body mass index (BMI) is often di- or trichotomized into categories such as “normal,” “overweight,” and “obese,” which then enter a regression as one or two indicator variables. Doing so means that the researcher never examines the presumably smooth relationship between response and predictor, because the predictor is prematurely used to create groups of patients. All of the heterogeneity within the created groups goes unused in the analysis, and relationships are thereby masked. After applying an appropriate model using continuous variables, which will typically provide stronger and clearer results, descriptive statistics for grouped predictions can be shown if necessary.
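The masking effect can be seen in a small simulated example. The sketch below assumes a smooth, linear relationship between BMI and a hypothetical outcome, generates artificial data, and compares the share of variance explained (R²) by continuous BMI with the share explained by the three groups formed at the conventional cut-offs of 25 and 30. The data-generating numbers and function names are invented for illustration.

```python
import random
from statistics import mean


def r_squared(y, fitted):
    """Proportion of the variance in y explained by the fitted values."""
    my = mean(y)
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_resid / ss_total


def bmi_group(bmi):
    """Three-level grouping at the conventional cut-offs of 25 and 30."""
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def compare_fits(n=500, seed=3):
    """Return (R² with continuous BMI, R² with grouped BMI) on simulated data."""
    rng = random.Random(seed)
    bmi = [rng.gauss(27, 4.5) for _ in range(n)]
    y = [0.5 * b + rng.gauss(0, 3) for b in bmi]  # smooth linear truth plus noise

    # Continuous predictor: ordinary least-squares straight line.
    mb, my = mean(bmi), mean(y)
    slope = (sum((b - mb) * (yi - my) for b, yi in zip(bmi, y))
             / sum((b - mb) ** 2 for b in bmi))
    intercept = my - slope * mb
    fitted_continuous = [intercept + slope * b for b in bmi]

    # Grouped predictor: every patient is predicted by the group mean, which
    # is what a regression on indicator variables produces.
    by_group = {}
    for b, yi in zip(bmi, y):
        by_group.setdefault(bmi_group(b), []).append(yi)
    group_means = {g: mean(values) for g, values in by_group.items()}
    fitted_grouped = [group_means[bmi_group(b)] for b in bmi]

    return r_squared(y, fitted_continuous), r_squared(y, fitted_grouped)


if __name__ == "__main__":
    r2_continuous, r2_grouped = compare_fits()
    print(f"R², continuous BMI: {r2_continuous:.3f}")
    print(f"R², grouped BMI:    {r2_grouped:.3f}")
```

The grouped model spends two parameters on BMI instead of one yet explains less of the outcome in this setting, because all variation in BMI within each group is discarded.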
To conclude, the best approach is to analyze dependent and independent variables in their most precisely measured forms. Dichotomization, the most drastic and implausible of all transformations, is at best a final resort.
The CCTS's Design and Analysis Core provides consultative services to clinical-translational investigators in the conceptualization, design, conduct, and analysis of their research studies.