Single cell cluster naming

I know the Allen Institute generally puts a lot of careful thought into naming single-cell clusters, but I am seeing a general trend in a lot of single-cell papers from other places, and I am trying to get a better understanding of best practices and considerations.

It seems like a lot of single-cell papers name clusters based on “canonical markers”: they basically cherry-pick a cluster based on the expression of these markers, many of which are neuropeptides. This is done even for clusters where only a handful of the thousands of cells express these markers and the rest show sparse to no expression. I’ve even seen papers where a different cluster shows higher expression of one of these markers, but they name the cluster with lower expression after that marker. Additionally, many of these clusters often express multiple “markers”, not just the one chosen as the cluster name.

Can someone help me make sense of the logic behind this? Is it basically that other papers have shown the existence of these cells, so they must exist? Even though we don’t have any clusters that show high expression of these marker genes, are we just going to assume that, because the other cells in the cluster share gene expression levels, the cluster should still be called this? If so, how do we ignore that these clusters often express many of these markers? Why doesn’t anyone ever do RNAscope with these markers and some of the top genes that are exclusively expressed in the same cluster to show that these cells actually exist?

Can someone help me make sense of this? Is anyone aware of any white papers, blog posts, or publications from prominent people in the field that discuss the logic behind this and how one should think about cluster naming?

Hello! I have reached out to our scientists and they provided the answer below. I hope it helps.

Unfortunately, best practices for annotating single-cell RNA-seq (scRNA-seq) clusters vary across biological fields. In hematology, for example, historical precedent with canonical markers and established references allows for less biased annotation using software tools like AUCell, SingleR, Azimuth, and Viewmaster. These tools rely on high-quality, regularly updated references and work best for studying normal development. Cancer cells become a little funky: they express a multitude of lineage markers, so they are hard to annotate with known references and often need manual annotation.

Another approach is to use gene signatures from resources like the Molecular Signatures Database (MSigDB) that have been experimentally validated and describe certain cell types or states. While these signatures are not complete reference atlases, they can complement cell type annotation with canonical marker genes. Yet manual annotation remains necessary when experimental conditions differ from those behind the available gene signatures.

Many labs that investigate biological questions (and do not develop scRNA-seq methods) use orthogonal techniques such as bulk RNA-seq, spatial transcriptomics, flow cytometry, and immunofluorescence to learn about their specific experimental systems and validate scRNA-seq results. These methods, combined with field-specific expertise, enhance cluster annotation. Papers should provide clearer descriptions of their annotation methods to aid interpretation.

A newer notion in the field is that many cell types share markers, and that marker expression can be noisy or overlap across clusters because of shared lineage or differentiation states. This makes single-cell clustering inherently a simplification of biological reality. Techniques like pseudotime analysis and RNA velocity offer more nuanced views, treating cell states as a continuum rather than as discrete clusters.

Best practices for scRNA-seq annotation should include:

  1. Integration: Compare to high-quality reference datasets if available.
  2. Multi-marker Signatures: Define clusters based on comprehensive, experimentally validated gene expression profiles.
  3. Contextualization: Consider experimental conditions and biological context.
  4. Validation: Use orthogonal methods to confirm findings (always!).
  5. Neutral Labeling: Use descriptive labels (e.g., “X high/Y low”) rather than prematurely assigning cell types.
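
To make points 2 and 5 concrete, here is a minimal Python sketch of one way to check a proposed marker before naming a cluster after it: summarize, per cluster, the fraction of cells with detectable expression and the mean expression, then build a neutral "X high / Y low" label. This is an illustration only, not the workflow of any particular lab or tool. The function names (summarize_markers, neutral_label, top_cluster_for), the 50% and 10% cutoffs, and the toy expression values are all assumptions made for the example.

```python
from collections import defaultdict


def summarize_markers(expr, clusters, markers, detect_threshold=0.0):
    """Per cluster and marker, return (fraction of cells detected, mean expression).

    expr:     dict of cell id -> dict of gene -> normalized expression
    clusters: dict of cell id -> cluster id
    """
    cells_by_cluster = defaultdict(list)
    for cell, cluster in clusters.items():
        cells_by_cluster[cluster].append(cell)

    summary = {}
    for cluster, cells in cells_by_cluster.items():
        summary[cluster] = {}
        for gene in markers:
            # Genes missing from a cell's record are treated as not detected.
            values = [expr[cell].get(gene, 0.0) for cell in cells]
            frac = sum(v > detect_threshold for v in values) / len(values)
            mean = sum(values) / len(values)
            summary[cluster][gene] = (frac, mean)
    return summary


def neutral_label(summary, cluster, high_frac=0.5, low_frac=0.1):
    """Build a descriptive label such as 'cluster A: Sst high / Vip mid'.

    The cutoffs are arbitrary illustration values, not field standards.
    """
    parts = []
    for gene, (frac, _mean) in sorted(summary[cluster].items()):
        if frac >= high_frac:
            parts.append(f"{gene} high")
        elif frac <= low_frac:
            parts.append(f"{gene} low")
        else:
            parts.append(f"{gene} mid")
    return f"cluster {cluster}: " + " / ".join(parts)


def top_cluster_for(summary, gene):
    """Return the cluster with the highest mean expression of a gene."""
    return max(summary, key=lambda c: summary[c][gene][1])


# Toy data (invented values): six cells in two clusters, two candidate markers.
expr = {
    "c1": {"Sst": 2.0},
    "c2": {"Sst": 1.0},
    "c3": {"Sst": 3.0, "Vip": 0.5},
    "c4": {"Vip": 4.0},
    "c5": {"Vip": 2.0},
    "c6": {},
}
clusters = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B", "c6": "B"}

summary = summarize_markers(expr, clusters, markers=["Sst", "Vip"])
for cluster in sorted(summary):
    print(neutral_label(summary, cluster))
print("Highest mean Vip:", top_cluster_for(summary, "Vip"))
```

A table like this makes it easy to spot the situations described in the question: a cluster named after a marker that only a small fraction of its cells express, or a marker whose highest expression sits in a differently named cluster. It does not replace orthogonal validation (point 4).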