Troels Holger Vaaben

Bulk RNA-seq: from biological sample to interpretable signal

A practical reference for moving from bulk RNA-seq experiment design to differential expression and layered biological interpretation.

19 min read · Intermediate
What you'll learn
  • How to get from sample planning to a trustworthy count table
  • How to interpret DESeq2 outputs without mixing inference and visualization
  • When to move from genes to gene sets, sample states, and regulatory inference
  • When transcript-aware or genome-aware analyses are worth the extra complexity

Executive summary

This guide is intentionally split into two parts:

  • Upstream: everything from biological sample and library design through QC, quantification, and the construction of a trustworthy count table.
  • Downstream: everything after the count table, starting with core differential-expression analysis and then moving into the extra layers that help you squeeze more biology out of the dataset.

If you want one default path that works for most human or mouse bulk RNA-seq projects, make it this:

  1. Plan the biology first: replication, RNA quality, library type, strandedness, and annotation version.
  2. Treat FastQC + MultiQC as the universal first stop after FASTQ generation.
  3. For expression-first work, quantify with Salmon and import with tximport, ideally keeping provenance with tximeta.
  4. Use DESeq2 as the default differential-expression engine.
  5. Only after you trust the counts and the design should you add enrichment, pathway activity, TF activity, cell-context inference, and more advanced transcript-aware analyses.

This page is not trying to list every transcriptomics method ever published. It is a practical guide to the core analysis up to DEGs, and then to the downstream analyses that are most useful when you want to get more biological signal out of bulk RNA-seq without drifting into method tourism. The mainline recommendation is:

Clean mammalian tissue, expression-first question, reasonable RNA quality

poly(A) enrichment -> stranded library -> FASTQ QC -> Salmon or nf-core/rnaseq -> tximport/tximeta -> DESeq2 -> enrichment + pathway activity + TF activity + cell-context inference

Switch away from that default when the question changes:

  • Choose rRNA depletion when RNA is degraded, FFPE-like, intronic signal matters, or non-polyadenylated transcripts matter.
  • Choose STAR when you need genome alignments for genome-browser coverage, exon-aware counting, or genome-position-aware follow-up.
  • Choose paired-end and longer reads when genome-aware follow-up is an important part of the study rather than an afterthought.

Companion analysis repo

If you want a small runnable example that follows the downstream logic in this guide, use the companion airway walkthrough repo, Transcriptomics-extended. It walks through QC, DESeq2, ORA, GSEA, ssGSEA, PROGENy, and DoRothEA + VIPER on the Bioconductor airway dataset.

Who this is for

This guide stays strictly in bulk RNA-seq. If your core question depends on rare cell populations, lineage mixtures, or spatial localization, bulk RNA-seq may still be useful, but a single-cell, spatial, or long-read assay may answer the real question better.

Where are you in the process?

Pipeline overview

Upstream

Ends at count table

1. Biological planning

Sample quality, replication, annotation choice, poly(A) versus rRNA depletion, strandedness, read geometry.

2. FASTQ QC

FastQC + MultiQC first, then RNA-specific checks such as RSeQC, RNA-SeQC, or Picard when needed.

3. Trim only if needed

Trim Galore or direct Cutadapt when adapters or quality tails are a real problem, not by reflex.

4. Quantify or align

Salmon for expression-first work. STAR when you need genome-aware follow-up such as positional signal, regional analysis, or browser-style coverage inspection.

5. Count table

This is the handoff point. If you do not trust this object, do not move downstream yet.

Downstream

Starts from count table

1. Gene-level

Start with import, QC, design, normalization, and differential expression. This is where you establish which genes change.

2. Gene set-level

Move from individual genes to pathways or processes with ORA or GSEA once the DE result is trustworthy.

3. Sample-level pathway or state activity

Use GSVA, ssGSEA, or related signature scoring when the question is how programs vary across samples rather than just across groups.

4. Regulatory or signaling inference

Use PROGENy or DoRothEA plus VIPER when you want a more explicit hypothesis about upstream drivers of the expression pattern.

5. Tissue composition

Stress-test whether apparent biology may instead reflect changing cell-type mixture in the bulk sample.

6. Genome-aware or transcript-level

Reserve DEXSeq, DRIMSeq, rMATS, or PREDA for questions that depend on transcript structure or genomic position rather than gene-level abundance alone.

Upstream is about generating trustworthy inputs. Downstream is about asking the right biological question of those inputs. The count table is the boundary between the two.

Upstream: from sample to count table

The upstream half of the workflow is about getting from biological material to a count table you believe. If this part is weak, no downstream method will save it.

Before sequencing

The best bulk RNA-seq analysis begins before the sequencer sees a library. Bad experimental structure is harder to rescue than bad code.

Planning checklist

| Decision | Default | Switch when | Why it matters later |
| --- | --- | --- | --- |
| RNA capture | Stranded poly(A) enrichment | Use rRNA depletion for degraded RNA, FFPE-like input, non-polyadenylated transcripts, intronic signal, or broader transcriptome coverage | Changes what molecules you are measuring and what quantifiers and interpretations make sense |
| Strandedness | Stranded | Only tolerate unstranded if the library is already made and you cannot change it | Resolves overlapping antisense transcription and improves counting in gene-dense regions |
| Read geometry | Single-end for simple gene-level expression only | Use paired-end when genome-aware follow-up or more structure-aware counting matters | Affects how confidently you can interpret structure-aware or coordinate-aware signal |
| Read length | Moderate lengths are fine for gene-level DE | Go longer when positional follow-up or more structure-aware interpretation will matter later | Short reads compress ambiguity into multi-mapping and transcript assignment problems |
| Annotation | Lock one version and keep it fixed | Switch only with intent and document it clearly | Annotation drift breaks joins, gene IDs, length corrections, and reproducibility |

Library design choices

Recommended default

Library design for most bulk RNA-seq studies

Default: Stranded poly(A)-enriched bulk RNA-seq

Use when

You have reasonably intact human or mouse RNA and the main question is gene-level expression in coding transcripts.

Switch to an alternative when

Use stranded rRNA depletion when RNA is degraded, non-polyadenylated RNA matters, or you care about intronic signal and broader transcriptome coverage. Use paired-end and longer reads once genome-aware follow-up becomes a real objective.

Inputs

  • High-quality RNA from tissue, sorted cells, or cultured cells
  • Clear biological question and replicate structure
  • Human or mouse reference and annotation plan

Outputs

  • Libraries enriched for mature polyadenylated transcripts
  • Cleaner coding-focused signal
  • Straightforward downstream gene-level analysis

Alternatives

  • Stranded rRNA depletion
  • Unstranded protocols if legacy only
  • Paired-end longer-read libraries when coordinate-aware follow-up matters

What good output looks like

Replicates cluster mainly by biology, strandedness checks agree with the expected protocol, and gene-body coverage looks plausible for the library chemistry rather than erratic or heavily degraded.

Pros

  • Simple and efficient for gene-level DE
  • Strong fit for Salmon plus tximport or tximeta workflows
  • Strandedness helps in antisense and overlapping loci

Cons

  • Misses non-polyadenylated biology
  • Less forgiving with degraded RNA
  • Single-end short reads are weak for transcript-structure questions

Practical note

People often choose cheap single-end short reads and then later expect confident genome-aware follow-up. That is usually a design mismatch, not an analysis failure.

Poly(A) versus rRNA depletion

Poly(A) enrichment is the cleanest default for coding-focused mammalian expression studies with decent RNA integrity.

rRNA depletion is the better default for degraded RNA, FFPE-like material, non-polyadenylated transcripts, nascent or intronic signal, and broader transcriptome coverage. It often trades interpretive simplicity for coverage breadth.

Human and mouse annotation note

For computation, prefer stable IDs such as Ensembl gene identifiers. Use gene symbols for figures and readable tables later. Strip version suffixes only intentionally, because careless version stripping can silently merge or drop rows.
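As a base-R illustration of that failure mode (the version suffixes here are invented for the example, only the Ensembl-style formatting is real), check for collisions before committing to stripped IDs:

```r
# Strip Ensembl version suffixes, but verify nothing collapses first
ids <- c("ENSG00000141510.17", "ENSG00000141510.18", "ENSG00000012048.23")
stripped <- sub("\\.\\d+$", "", ids)
dup <- any(duplicated(stripped))
dup  # TRUE: two versioned rows would silently merge into one gene
```

If `dup` is TRUE, decide explicitly how to aggregate or deduplicate before joining on the stripped IDs.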

If you map mouse genes into human-centric signatures, treat ortholog mapping as a convenience layer. It is not ground truth, especially for immune, developmental, and rapidly evolving gene families.

Annotation and reference strategy

Use one reference bundle and freeze it:

  • GENCODE: strong default for human and mouse when you want well-curated gene and transcript models.
  • Ensembl: strong default when your ecosystem or annotation tooling is already Ensembl-centric.
  • NCBI RefSeq: use when a collaborator, institution, or downstream clinical context is already standardized on RefSeq.

Do not mix:

  • GENCODE transcriptome index with RefSeq gene annotation in downstream joins
  • Ensembl IDs from one release with symbols from another release
  • human-centric signature resources onto mouse data without explicitly documenting ortholog conversion

| Reference choice | Best for | Watch out for |
| --- | --- | --- |
| GENCODE | General human and mouse bulk RNA-seq with transcript-level work | Version changes will shift transcript membership and gene models |
| Ensembl | Ensembl-centric pipelines, BioMart-heavy workflows, consistent Ensembl IDs | Symbol updates and archived release handling can break later joins |
| RefSeq | Clinical, NCBI-standardized, or collaborator-constrained projects | Transcript model differences can change summarization and exon-level counts |

Replication, power, and controls

Biological replication matters more than squeezing one more downstream algorithm out of underpowered data.

  • Use RNASeqPower or powsimR when you can still influence the experiment.
  • If you are already underpowered, be honest about effect-size uncertainty and treat pathway-level summaries as supportive, not rescuing.
  • ERCC-style spike-ins can be useful in selected designs, but they do not replace good biological replication and they can create a false sense of security if used as a generic fix.
# Simple planning logic, not a substitute for study-specific assumptions
library(RNASeqPower)
est_power <- rnapower(
  depth = 20,
  n = 4,
  cv = 0.3,
  effect = 1.5,
  alpha = 0.05
)
est_power

Planning pitfall

Do not treat batch as an afterthought. If all treated samples were processed on a different day, by a different person, or with a different kit lot, that is a design problem first and a statistics problem second.
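A thirty-second base-R sanity check makes full aliasing obvious before any modeling. The metadata here is a toy sample sheet, not a real design:

```r
# Cross-tabulate condition against batch: empty off-diagonal cells = aliasing
meta <- data.frame(condition = c("ctrl", "ctrl", "trt", "trt"),
                   batch     = c("day1", "day1", "day2", "day2"))
table(meta$condition, meta$batch)
# Here every treated sample sits in day2: batch and condition are fully
# confounded, and no statistical correction can separate them afterwards
```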

From FASTQ to trusted quantification

Keep the data type explicit at each step:

  • FASTQ: raw reads and base qualities
  • BAM/CRAM: genome alignments
  • quant.sf or equivalent: transcript-level abundance estimates
  • gene count matrix: integer counts for count-based DE frameworks
  • normalized matrix: transformed expression for visualization and sample structure
  • ranked vector: gene-wise statistics for GSEA
  • splice-event or exon tables: event-aware follow-up
  • regional scores or genome-aware summaries: specialized outputs

Universal QC comes first

Recommended default

Initial sequencing QC

Default: FastQC + MultiQC

Use when

Always, immediately after FASTQ generation or after receiving raw data from a core facility.

Switch to an alternative when

Move beyond FastQC when you need RNA-specific sanity checks: strandedness inference, ribosomal contamination context, exon-intron read distribution, or gene-body bias. Those require RNA-aware tools and often alignments.

Inputs

  • FASTQ files
  • Sample sheet or sample naming convention
  • Knowledge of expected read length, library chemistry, and strandedness

Outputs

  • Per-sample FastQC reports
  • One MultiQC summary across the project
  • A first-pass decision on whether trimming or deeper RNA-specific QC is necessary

Alternatives

  • RNA-SeQC for RNA-process metrics
  • RSeQC for strandedness, read distribution, and gene-body coverage
  • Picard CollectRnaSeqMetrics for alignment-aware RNA metrics

What good output looks like

Adapters are absent or modest, quality tails are not catastrophic, replicate libraries look broadly similar, and any suspicious sample is obvious before downstream work begins.

Pros

  • Fast and universal
  • Easy to compare many samples
  • MultiQC makes bad samples obvious early

Cons

  • FastQC alone is not RNA-aware
  • Some warnings are expected and not actionable
  • Cannot tell you if strandedness matches the library without deeper checks

Practical note

Overreacting to every FastQC warning is a rookie move. RNA-seq libraries often show non-random sequence content and uneven positional composition for legitimate protocol reasons.

mkdir -p qc/fastqc
fastqc data/fastq/*.fastq.gz --outdir qc/fastqc
multiqc qc -o qc/multiqc

For RNA-specific QC after a genome-alignment step, check:

  • RSeQC: infer_experiment.py, read_distribution.py, geneBody_coverage.py
  • RNA-SeQC: broad process metrics, coverage and expression summaries
  • Picard CollectRnaSeqMetrics: coding, UTR, intronic, intergenic fractions, ribosomal bases
infer_experiment.py -r refs/hg38.bed -i align/sample.bam
geneBody_coverage.py -r refs/hg38_housekeeping.bed -i align/sample.bam -o qc/rseqc/sample
picard CollectRnaSeqMetrics \
  I=align/sample.bam \
  O=qc/picard/sample.rna_metrics.txt \
  REF_FLAT=refs/gencode.v44.refFlat.txt \
  STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND

Trimming: useful, but do not do it reflexively

Recommended default

Adapter and quality trimming

Default: Trim Galore for simple projects; direct Cutadapt when you need control

Use when

Adapters or severe quality tails are real problems, or when a core facility note and FastQC make trimming clearly worthwhile.

Switch to an alternative when

Use direct Cutadapt when you need fine control over adapter logic, overlaps, minimum length thresholds, or unusual library structures. Skip trimming when the quantifier already handles modest imperfections and there is no actual problem to solve.

Inputs

  • FASTQ files
  • Expected adapter sequences
  • FastQC evidence or protocol knowledge

Outputs

  • Trimmed FASTQ files
  • Trim reports
  • Cleaner input for aligners or quantifiers when problems were genuine

Alternatives

  • Direct Cutadapt
  • No trimming if adapters and tails are not meaningfully harming downstream quantification
  • Pipeline-managed trimming inside nf-core/rnaseq

What good output looks like

Adapter content drops when it was truly present, short-read collapse is limited, and downstream mapping or quantification improves rather than merely looking cleaner.

Pros

  • Trim Galore is simple and wraps Cutadapt
  • Fixes genuine adapter contamination cleanly
  • Easy to explain and automate

Cons

  • Unnecessary trimming can discard useful sequence
  • Aggressive trimming can distort read-length distributions
  • People often trim out of habit rather than evidence

Practical note

Do not trim because a generic tutorial told you to. Trim because your data or protocol gives you a reason.

trim_galore --paired \
  --cores 4 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  --output_dir trimmed/

Quantification strategy

Recommended default

Expression-first quantification

Default: Salmon -> tximport -> tximeta (optional but recommended)

Use when

The main goal is robust expression analysis and you do not need BAM-centric genome-aware follow-up as the primary output.

Switch to an alternative when

Move to STAR when you need genome-browser coverage, exon-level counting, or genome-position-aware analyses. Use aligned-read counting when the follow-up depends on BAMs rather than just expression estimates.

Inputs

  • Transcriptome index tied to a frozen annotation release
  • FASTQ files
  • Sample metadata and library type knowledge

Outputs

  • Transcript abundances
  • Gene-level matrices after tximport
  • Annotation-linked SummarizedExperiment objects if using tximeta

Alternatives

  • STAR plus featureCounts or Rsubread for BAM-centric work
  • kallisto for lightweight pseudoalignment
  • HISAT2 when genome alignment is needed but STAR is not your choice

What good output looks like

Sample-level library sizes and composition look plausible, imported gene-level counts align with biology, and no sample is clearly broken by unexpected strandedness or annotation mismatches.

Pros

  • Fast and accurate for expression-first workflows
  • Transcript-level estimation can improve gene-level inference when imported correctly
  • tximeta preserves annotation provenance and reduces metadata drift

Cons

  • Not the right primary output for browser coverage or coordinate-aware analysis
  • Requires care with transcriptome and annotation matching
  • Pseudoalignment does not replace true alignments for all questions

Practical note

The biggest failure mode is annotation mismatch: Salmon index built from one transcript set, tx2gene built from another, and downstream symbols from a third.

salmon quant \
  -i refs/gencode.v44.salmon_index \
  -l A \
  -1 trimmed/sample_R1.fq.gz \
  -2 trimmed/sample_R2.fq.gz \
  -p 8 \
  --validateMappings \
  -o quants/sample

library(tximport)
library(tximeta)

coldata <- read.csv("metadata/samples.csv")
se <- tximeta(coldata)
gse <- summarizeToGene(se)

When STAR should be your default instead

Choose STAR when you need true genome alignments for exon-aware counting, genome-browser coverage, or genome-position-aware follow-up. Salmon is the default for expression-first questions, not for all questions.

| Tool path | Use this when | Main output | Main drawback |
| --- | --- | --- | --- |
| Salmon + tximport | Fast, expression-first quantification | Transcript abundances and imported gene-level counts | Not a BAM-centric workflow |
| STAR + featureCounts or Rsubread | Coverage, browser tracks, exon-aware counting, region-aware follow-up | BAMs, junctions, gene or exon counts | Heavier compute and storage |
| kallisto | Lightweight expression workflows and fast iteration | Pseudoalignment quantifications | Less often the first choice in mixed BAM-centric labs |
| HISAT2 + counts | Alternative splice-aware aligner path | BAMs and countable alignments | Not usually my first choice over STAR for bulk mammalian work |

Downstream: from count table to biological interpretation

The downstream half starts once you have a count table and credible metadata. The first job is still boring and essential: import, QC, design, normalization, and differential expression. Only then does it make sense to start asking more ambitious biological questions.

Core analysis

Once you have a count-like object and trustworthy metadata, the work becomes statistical and interpretive rather than purely computational.

Import, normalization, and sample structure

For most users, the best default is DESeq2 on imported gene-level counts. If you quantified with Salmon, do not throw away transcript-level work by flattening it carelessly. Use tximport, and use tximeta when you can.

library(DESeq2)
library(tximeta)

se <- tximeta(read.csv("metadata/samples.csv"))
gse <- summarizeToGene(se)

dds <- DESeqDataSet(gse, design = ~ batch + condition)
dds <- DESeq(dds)

vsd <- vst(dds, blind = FALSE)
plotPCA(vsd, intgroup = c("condition", "batch"))

Differential expression

Recommended default

Gene-level differential expression

Default: DESeq2

Use when

Standard bulk RNA-seq differential expression with biological replicates and integer gene counts or tximport-derived gene-level counts.

Switch to an alternative when

Stay with DESeq2 for most standard bulk RNA-seq comparisons. Switch to edgeR when you still want a count-based method but want more hands-on control over filtering and model setup, or when you already work in an edgeR-style pipeline. Switch to limma-voom when the project naturally lives in the limma ecosystem, especially if you have many contrasts or want a linear-model workflow built around weighted log-expression values rather than DESeq2's count model.

Inputs

  • Gene-level count matrix or SummarizedExperiment
  • Sample metadata with condition, batch, and covariates
  • Clear contrasts and reference levels

Outputs

  • Normalized counts
  • Dispersion estimates
  • Log2 fold changes, adjusted p-values, shrunken effect sizes

Alternatives

  • edgeR
  • limma-voom
  • Likelihood-ratio testing for multi-level questions

What good output looks like

Replicates cluster sensibly, dispersion fits are stable, MA plots are not dominated by one broken sample, and shrunken log2 fold changes tell a coherent biological story.

Pros

  • Default choice for most users
  • Strong documentation and community practice
  • lfcShrink gives more stable effect-size interpretation

Cons

  • Not magic against bad design
  • Can be over-trusted when metadata are weak
  • Requires careful factor coding and contrast specification

Practical note

People often batch-correct the counts first and then fit DESeq2. Usually the right move is to model batch in the design, not to mutate the counts blindly upstream.

res <- results(dds, contrast = c("condition", "treated", "control"))
res_shrunk <- lfcShrink(dds, coef = "condition_treated_vs_control", type = "apeglm")

sig <- subset(as.data.frame(res_shrunk), padj < 0.05 & abs(log2FoldChange) > 1)
head(sig[order(sig$padj), ])
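One pitfall listed above is factor coding: DESeq2 reads contrast direction from the factor's reference level. A base-R reminder with toy labels:

```r
# Make "control" the explicit reference so log2FC means treated vs control
condition <- factor(c("treated", "control", "treated", "control"))
condition <- relevel(condition, ref = "control")
levels(condition)  # "control" "treated": coefficients compare against control
```

Set this before building the DESeqDataSet; relying on alphabetical factor ordering is how contrasts silently flip direction.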

How to read the main DESeq2 outputs

pvalue / padj: the statistical significance output from the DESeq2 model. This is the main inference result.

log2FoldChange: the estimated effect size, including both magnitude and direction of change.

lfcShrink output: a shrunken, more stable version of log2 fold change that is usually better for ranking genes and practical interpretation.

vst() output: variance-stabilized expression values for visualization such as PCA or clustering, not for statistical testing.

DESeq2 performs inference by modeling raw count data, with normalization handled internally. Transformed outputs such as VST and normalized counts are for visualization and exploratory analysis, not for the main hypothesis test.
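The log2 scale trips people up more often than it should. A quick numeric reminder with invented group means:

```r
# log2FoldChange lives on the log2 scale: 1 = doubling, 2 = four-fold
lfc <- log2(400 / 100)  # treated mean 400 vs control mean 100
lfc                     # 2, i.e. a four-fold increase
2^1.5                   # and a log2FC of 1.5 is roughly a 2.8-fold change
```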

What is statistically detectable vs what is biologically meaningful?

A small adjusted p-value tells you that a change is statistically detectable under the model. It does not tell you that the change is biologically important, cell-intrinsic, or worth building a story around.

In many real bulk RNA-seq studies, effect size, consistency across replicates, and biological interpretability matter more than a binary significance line. Shrunken log2 fold change and overall coherence are often more useful for prioritization than asking whether a gene sits just above or below padj < 0.05.
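Since padj thresholds get so much attention, it helps to see what Benjamini-Hochberg adjustment actually does to a small set of p-values (toy numbers, base R):

```r
# BH adjustment is rank-dependent, not a per-gene property
pvals <- c(gene1 = 0.001, gene2 = 0.010, gene3 = 0.020, gene4 = 0.400)
padj  <- p.adjust(pvals, method = "BH")
padj
# gene3's adjusted value (~0.027) depends on the p-values around it, one more
# reason not to treat padj < 0.05 as a bright line between real and not real
```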

Count models versus voom

DESeq2 is the clean default for most ordinary bulk RNA-seq comparisons: count data in, design formula specified, differential expression out.

edgeR is a good switch when you still want a count-based framework but want more explicit control over filtering and model choices, or when the rest of the project already uses edgeR.

limma-voom is a good switch when the real benefit is the limma workflow itself, especially for many contrasts or broader linear-model analyses. The key difference is that voom converts counts into weighted log-expression values before modeling rather than fitting the DESeq2 count model directly.

Batch effects and unwanted variation

Recommended default

Batch handling and unwanted variation

Default: Model batch in the design first; use sva, svaseq, or RUVSeq when needed

Use when

You have clear technical structure, latent confounding, or hidden variation that is not fully captured by your known metadata.

Switch to an alternative when

Use ComBat-style adjustment only when you understand why a corrected expression matrix is needed for a specific downstream task. Do not use it as a ritual before DE. Use svaseq or RUVSeq when hidden structure is suspected and a count-aware framework is still desired.

Inputs

  • Count matrix
  • Sample metadata
  • Biological design you are trying not to destroy
  • A realistic hypothesis about what the unwanted variation might be

Outputs

  • Adjusted design terms
  • Surrogate variables or unwanted-variation factors
  • Cleaner interpretation of residual technical structure

Alternatives

  • ComBat for selected corrected-expression tasks
  • svaseq when counts are the starting point
  • RUVSeq when control genes or replicate logic supports it

What good output looks like

Known technical structure becomes less dominant, but the biological contrast of interest remains visible and interpretable rather than flattened into noise.

Pros

  • Prevents technical structure from masquerading as biology
  • Works well when the design is honest
  • Can stabilize PCA and downstream interpretations

Cons

  • Easy to over-correct
  • Hidden-factor methods can absorb real biology
  • Badly confounded designs remain badly confounded

Practical note

Never try to 'correct away' a batch effect before asking whether batch and biology are partially or fully aliased. If they are, statistics cannot manufacture information that was never collected.

library(sva)

mod  <- model.matrix(~ condition, data = colData(dds))
mod0 <- model.matrix(~ 1, data = colData(dds))
sv   <- svaseq(counts(dds), mod = mod, mod0 = mod0)$sv

colData(dds)$sv1 <- sv[, 1]
design(dds) <- ~ sv1 + batch + condition
dds <- DESeq(dds)

What to inspect before believing any DE table

  • PCA or MDS on transformed counts
  • Sample-sample distances
  • Library-size spread and composition bias
  • Gene-body coverage and strandedness sanity checks
  • Outlier samples in MultiQC and RNA-specific QC
  • Whether low-count filtering was reasonable
  • Whether the design formula matches the actual experiment
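For the low-count filtering point, one widely used heuristic is to keep genes with at least roughly 10 counts in a minimum number of samples. The thresholds are a judgment call; the matrix below is a toy example:

```r
# Keep genes with >= 10 counts in at least 3 samples (toy 3-gene matrix)
counts_mat <- matrix(c( 0,  0, 12, 15, 20,
                        3,  1,  0,  2,  1,
                       30, 25, 40, 18, 22),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("geneA", "geneB", "geneC"), NULL))
keep <- rowSums(counts_mat >= 10) >= 3
counts_mat[keep, , drop = FALSE]  # geneA and geneC survive, geneB is dropped
```

Whatever rule you pick, record it: filtering changes the multiple-testing burden and therefore the padj values downstream.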

Slow down before interpretation

A DEG table is not yet a biological conclusion

Before you move into enrichment or more advanced downstream layers, stop and ask what the DE pattern might actually reflect. Expression changes can represent specific biology, but they can also reflect response, compensation, injury, timing effects, generic stress programs, or differences in cell mixture.

Treat the DEG table as evidence about RNA abundance, not as direct proof of protein abundance or cellular function. The next job is to decide whether the pattern is robust, biologically coherent, and worth deeper follow-up at the pathway, regulatory, or composition level.

Downstream layers after DE

Once the DE result is credible, move upward in interpretation one layer at a time. Start with individual genes, then ask whether coherent gene sets shift, then whether individual samples carry programs, then whether those programs imply signaling or regulatory activity, and finally whether the bulk profile may reflect changing tissue composition.

Gene-level

This is the anchor layer. Differential expression tells you which genes move and by how much. Everything that follows should be read as an interpretation layered on top of that gene-level result, not as a substitute for it.

Stay here until you trust:

  • the contrast and design formula
  • the effect sizes and shrinkage behavior
  • the sample structure and batch handling
  • a few manually inspected genes that fit the biology you think you are seeing

Once you have a ranked gene-level result you believe, the next shift is from individual genes to coordinated programs.

Gene set-level

This is the first abstraction step. Instead of asking whether one gene is up or down, ask whether a pathway or process moves as a coherent set.

| Method | Main question | Best input | Use it when | Main limitation |
| --- | --- | --- | --- | --- |
| ORA | Which processes are over-represented in my chosen DEG list? | A thresholded gene list | You want a simple summary of a clear DEG set | Strongly depends on the threshold used to define DEGs |
| GSEA / fgsea | Do genes from a pathway shift across the full ranked result even if few genes pass a hard cutoff? | A ranked vector across all tested genes | You want to preserve sub-threshold structure and avoid arbitrary thresholding | Still depends on the quality and relevance of the gene-set collection |

library(fgsea)

# apeglm-shrunk tables drop the Wald statistic, so rank from the unshrunk res
stats <- res$stat
names(stats) <- rownames(res)
stats <- sort(stats[!is.na(stats)], decreasing = TRUE)
# note: the Hallmark .gmt below is symbol-keyed; map Ensembl IDs to symbols
# first if your rownames are Ensembl

hallmark <- gmtPathways("refs/h.all.v2024.1.Hs.symbols.gmt")
fg <- fgsea(pathways = hallmark, stats = stats, minSize = 15, maxSize = 500)
head(fg[order(fg$padj), c("pathway", "NES", "padj")])

library(clusterProfiler)
library(org.Hs.eg.db)

ora <- enrichGO(
  gene = rownames(subset(sig, log2FoldChange > 1)),
  universe = rownames(res_shrunk),
  OrgDb = org.Hs.eg.db,
  keyType = "ENSEMBL",
  ont = "BP"
)

Gene-set dependence is real

Hallmark, Reactome, GO, and cell-type signature resources can all produce different interpretations from the same ranked list. That does not automatically mean one is wrong; it means the gene-set universe is part of the analysis.

Frame enrichment results as interpretations conditional on the chosen resource, and look for convergence across more than one sensible collection before making strong claims.

ORA and GSEA still summarize the group-level contrast. If the next question is whether individual samples carry a program strongly or weakly, move from gene sets to per-sample scores.

Sample-level pathway activity

Here the unit of interpretation shifts from the contrast to the sample. GSVA and ssGSEA turn a normalized expression matrix into per-sample pathway scores, which is useful when heterogeneity, subtype structure, responder patterns, or outlier biology matter.

| Method | Main question | Best input | Use it when | Main limitation |
| --- | --- | --- | --- | --- |
| GSVA | Which pathways vary smoothly across samples? | A normalized expression matrix | You want continuous per-sample pathway scores for clustering, correlation, or phenotype association | Scores are relative summaries, not direct proof of pathway activity |
| ssGSEA | Which samples look signature-high or signature-low? | A normalized expression matrix | You want a simple rank-based per-sample score for pathway or signature burden | Depends on signature quality and can be sensitive to broad expression shifts |

Use this same layer for cell state or functional state scoring. In practice that means scoring signatures such as:

  • cell cycle
  • proliferation
  • hypoxia
  • EMT
  • interferon response
  • stress or injury programs

These are still sample-level summaries. They tell you which samples carry a program, not yet which upstream regulator or pathway driver caused it.

library(GSVA)

# vst_mat: the variance-stabilized matrix from earlier, e.g. assay(vsd)
vst_mat <- assay(vsd)

# note: recent GSVA releases prefer gsva(ssgseaParam(vst_mat, hallmark));
# the method = "ssgsea" form below works in older releases
ssgsea_scores <- gsva(
  as.matrix(vst_mat),
  hallmark,
  method = "ssgsea"
)
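To demystify rank-based per-sample scoring, here is a deliberately naive base-R sketch of the idea. This is not the actual ssGSEA algorithm, and the expression values and signature are toy data:

```r
# One sample, five genes: does the toy signature sit near the top of the ranks?
expr <- c(geneA = 5.2, geneB = 1.1, geneC = 7.9, geneD = 3.3, geneE = 6.0)
sig  <- c("geneA", "geneC")
score <- mean(rank(expr)[sig]) / length(expr)
score  # 0.8: signature genes rank high in this sample
```

Real implementations use weighted running sums over the full ranking, but the intuition is the same: the score is relative to the other genes in that sample.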

Once you have sample-level programs, the next shift is from descriptive scoring to inferred upstream drivers.

Regulatory / signaling inference

This layer asks what might be driving the expression pattern rather than just describing it. PROGENy estimates signaling pathway activity from downstream responsive genes, while DoRothEA + VIPER estimates transcription factor activity from coordinated target-gene behavior.

| Method | Main question | Best input | Use it when | Main limitation |
| --- | --- | --- | --- | --- |
| PROGENy | Do the data look like downstream consequences of a signaling pathway being active? | A normalized expression matrix | You care about pathway activity rather than simple overlap with pathway members | Limited to the pathways modeled by the resource |
| DoRothEA + VIPER | Which transcription factors best explain the observed expression pattern? | A normalized expression matrix plus a TF-target network | You want plausible upstream regulatory drivers rather than TF expression alone | Depends heavily on regulon quality and context relevance |

library(progeny)

# Pathway activity from downstream-responsive genes; `vst_mat` is the
# variance-stabilized expression matrix (genes x samples).
pathway_scores <- progeny(vst_mat, scale = TRUE, organism = "Human", top = 500)

library(dorothea)
library(viper)

# Keep only the higher-confidence regulons (A-C).
data(dorothea_hs, package = "dorothea")
regulon <- subset(dorothea_hs, confidence %in% c("A", "B", "C"))

# run_viper() converts the dorothea table into the regulon format that
# viper expects before scoring TF activity per sample.
tf_scores <- run_viper(
  vst_mat, regulon,
  options = list(method = "scale", minsize = 4, eset.filter = FALSE)
)

These methods still assume the bulk profile is mostly reflecting state within the sampled mixture. If the mixture itself is changing, apparent pathways or TFs may partly be composition effects rather than cell-intrinsic regulation.

Tissue composition

This is the mixture layer. Here the question shifts from what programs are active to who is present in the sample. In bulk RNA-seq, that distinction matters because apparent inflammation, stromal biology, or differentiation can reflect changing proportions rather than changed state within one cell type.

| Method | Best framed as | Needs reference? | Best use case | Main caution |
| --- | --- | --- | --- | --- |
| xCell2 | Enrichment or tissue-context score | No matched single-cell reference required | Quick context readout when you want a broad enrichment-style estimate | Better for relative context than absolute proportions |
| EPIC | Reference-based deconvolution | Yes | Human bulk data where the reference assumptions fit the tissue reasonably well | Reference mismatch can dominate the result |
| CIBERSORTx | Proportion-style deconvolution | Usually yes | When a compatible signature matrix exists and transfers plausibly to your tissue | Portability across contexts is limited |
| MuSiC | Composition estimation | Yes | You have a genuinely matched single-cell reference from similar tissue and condition | A mismatched reference can make the output look precise but wrong |

Recommended practical default:

  • start with xCell2 for a conservative context readout
  • use EPIC or CIBERSORTx when the reference assumptions fit the biological system
  • use MuSiC when you have a truly matched single-cell reference and want proportion-aware estimation
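The core idea behind the reference-based tools above can be sketched in a few lines: treat the bulk profile as a non-negative mixture of cell-type signature profiles and solve for the mixing weights. This toy example uses made-up numbers and plain least squares rather than any listed tool's actual estimator; it only illustrates the principle.

```python
import numpy as np

# Reference profiles: genes x cell types (hypothetical signature matrix).
ref = np.array([
    [10.0, 0.5],  # marker gene for cell type A
    [0.5, 8.0],   # marker gene for cell type B
    [3.0, 3.0],   # gene expressed by both types
])

# Simulate a bulk mixture that is 70% type A and 30% type B.
true_props = np.array([0.7, 0.3])
bulk = ref @ true_props

# Plain least squares, clipped to non-negative, renormalized to proportions.
coefs, *_ = np.linalg.lstsq(ref, bulk, rcond=None)
coefs = np.clip(coefs, 0.0, None)
props = coefs / coefs.sum()
print(props.round(2))  # roughly [0.7, 0.3]
```

Real tools differ in how they build the reference, constrain the fit, and handle genes that do not behave like clean markers, which is exactly where reference mismatch starts to dominate.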

Cell composition versus cell state

A higher immune, stromal, or EMT-like signal can mean more cells of that type, the same cells in a different state, or both. Use deconvolution to stress-test your interpretation, not to declare exact ground truth from bulk alone.

At this point you should know whether your story is mainly gene-level, program-level, sample-state-level, regulatory, or compositional. If the remaining question is specifically about isoforms, splicing, or genomic position, move into a more structure-aware layer.

Genome-aware / transcript-level analysis

This is the advanced or optional layer. Once you collapse signal to gene-level counts, you lose exon structure, isoform usage, and genomic position. These methods are for cases where that lost structure is the actual biological question.

| Method | What it keeps that gene-level DE loses | Main question | Use it when |
| --- | --- | --- | --- |
| DEXSeq / DRIMSeq | Exon-level or transcript-usage structure | Do transcript isoforms or exon usage change even when gene-level expression does not tell the full story? | You care about differential transcript usage, isoform shifts, or exon-level regulation |
| rMATS | Splicing-event structure | Are specific splicing events such as exon skipping or intron retention changing between conditions? | You have replicate-aware RNA-seq and the biology plausibly depends on splicing regulation |
| PREDA | Genomic position | Are expression changes spatially clustered along chromosomes or genomic regions rather than scattered gene by gene? | You suspect regional regulation, domain-level effects, or copy-number-linked patterns |

Use this layer when the unresolved question is about transcript structure or genomic organization, not when the real question is pathway membership, sample state, or tissue composition.

# Differential transcript or exon usage (R)
library(DEXSeq)   # exon-level differential usage from exon bin counts
library(DRIMSeq)  # differential transcript usage from transcript counts

# Splicing events (command line, rMATS turbo); read type and length
# must match your libraries
rmats.py --b1 treated_bams.txt --b2 control_bams.txt \
  --gtf refs/genes.gtf \
  --od rmats_out --tmp rmats_tmp \
  -t paired --readLength 101 --nthread 8

# Positional / regional expression patterns (R)
library(PREDA)

Targeted panels and ortholog sanity checks

Do not let an enrichment package decide your biology for you. Always return to a few hand-curated gene panels:

  • interferon and inflammatory genes
  • ECM and adhesion genes
  • mitochondrial and stress genes
  • lineage markers relevant to your tissue
  • known perturbation targets
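As a concrete sketch of the panel idea: intersect a hand-curated panel with your DE results before trusting any enrichment call. The gene names, thresholds, and the results dict below are illustrative stand-ins for a real DE table.

```python
# Hand-curated interferon panel (illustrative subset).
ifn_panel = ["ISG15", "IFIT1", "IFIT3", "MX1", "OAS1", "STAT1"]

# Hypothetical DE results: gene -> (log2 fold change, adjusted p-value).
de_results = {
    "ISG15": (2.8, 1e-8),
    "MX1": (2.1, 3e-6),
    "STAT1": (1.4, 2e-4),
    "ACTB": (0.1, 0.9),
}

# Keep panel genes that are both measured and significant.
panel_hits = {
    g: de_results[g]
    for g in ifn_panel
    if g in de_results and de_results[g][1] < 0.05
}
print(sorted(panel_hits))  # ['ISG15', 'MX1', 'STAT1']
```

If the panel disagrees with the enrichment output, trust neither until you understand why.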

For mouse studies pushed through human-centric resources, create an explicit ortholog table and keep both original and mapped identifiers side by side.
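A minimal version of such a table might look like this; the mappings and the unmapped placeholder gene are hypothetical examples, and in practice the mapping would come from an ortholog resource rather than a hand-written dict.

```python
# Hypothetical mouse-to-human ortholog mapping (illustrative only).
mouse_to_human = {
    "Ifit1": "IFIT1",
    "Stat1": "STAT1",
    "Col1a1": "COL1A1",
}

mouse_de_genes = ["Ifit1", "Stat1", "Gm0000"]  # last one: no ortholog

# Keep both identifiers side by side instead of overwriting one.
ortholog_table = [
    {"mouse": g, "human": mouse_to_human.get(g)}
    for g in mouse_de_genes
]

# Unmapped genes stay visible as None rather than silently vanishing.
unmapped = [row["mouse"] for row in ortholog_table if row["human"] is None]
print(unmapped)  # ['Gm0000']
```

The point of the two-column layout is auditability: any downstream human-centric enrichment result can be traced back to the original mouse identifier, including the genes that were dropped.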

Common pitfalls

This section is intentionally blunt. Most bulk RNA-seq pain comes from a small set of repeated mistakes.

Technical and analytical pitfalls

  1. No real replication. Statistical sophistication does not replace biological replication.
  2. Annotation drift. If the index, tx2gene mapping, and visualization symbols came from different annotation releases, joins will break or silently lie.
  3. Wrong strandedness assumption. Use infer_experiment.py or equivalent checks. Counting with the wrong strand setting can silently discard most of the reads or assign them to the wrong gene at overlapping loci.
  4. Blind trimming. Trim because there is a problem, not because it feels hygienic.
  5. Ignoring multi-mapping. Repetitive genes, paralogs, pseudogenes, and short-read ambiguity are biological and computational realities.
  6. Over-correcting batch. Batch correction can remove signal of interest if design and biology are entangled.
  7. Short reads for coordinate-aware questions. If the goal is genome-aware interpretation or positional follow-up, cheap short single-end data are often the real limitation.
  8. rRNA contamination denial. High rRNA fractions are not just annoying; they waste depth and can indicate library problems.
  9. Assuming weak signal means no biology. It can also mean wrong tissue compartment, weak perturbation, bad timing, or overwhelming composition changes.
  10. Treating every downstream method as mandatory. Most projects do not need every specialized scoring, regional, and network layer all at once.
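For pitfall 3, infer_experiment.py reports the fraction of reads consistent with each strand convention; a small helper like this makes the call explicit. The thresholds below are an illustrative heuristic, not part of RSeQC.

```python
def call_strandedness(frac_fr: float, frac_rf: float,
                      threshold: float = 0.8) -> str:
    """Classify library strandedness from the two explained fractions
    reported by infer_experiment.py (illustrative decision rule)."""
    if frac_fr >= threshold:
        return "forward-stranded"
    if frac_rf >= threshold:
        return "reverse-stranded"
    return "unstranded"

# e.g. a typical dUTP (reverse-stranded) library
print(call_strandedness(0.02, 0.97))  # reverse-stranded
print(call_strandedness(0.49, 0.50))  # unstranded
```

Whatever rule you use, record the decision and pass the matching setting consistently to the counter or quantifier, since a mismatch here propagates into every downstream number.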

What to do when signal is weak or libraries look bad

  • Re-check sample labels and metadata first.
  • Re-check strandedness and annotation consistency.
  • Inspect gene-body coverage and read distribution.
  • Ask whether the perturbation timing or tissue compartment was wrong.
  • Look at known positive-control genes or pathways before declaring failure.
  • Consider that composition shifts may dominate weak within-cell-state changes.
  • If everything depends on rare populations or spatial structure, bulk may not be the right assay.

Interpretation pitfalls

Interpretation principle

Expression changes reflect what cells are responding to or attempting, not necessarily what is functionally occurring.

This final list focuses on interpretation traps that still matter even after you have already checked for technical structure and composition effects.

  1. Compensatory versus functional change. Upregulation does not imply more function. It may reflect compensation after stress, damage, or failed homeostasis.
  2. Structure versus expression mismatch. Higher mRNA does not guarantee correct protein structure, localization, or tissue architecture.
  3. Stress and injury responses are nonspecific. Hypoxia, inflammation, interferon, EMT-like programs, and immediate-early responses are reused across many contexts.
  4. Pathway redundancy. Overlapping gene sets can make several pathways look independently significant when they really reflect one underlying program.
  5. Directionality ambiguity. Up or down does not automatically define activation, inhibition, or biological outcome.
  6. mRNA is not protein and not function. Translation, post-translational regulation, localization, and turnover matter.
  7. Temporal ambiguity. A single time point cannot cleanly separate cause from consequence.
  8. Reverse causality. Especially in tumor and inflammatory settings, observed changes may be downstream responses rather than drivers.
  9. Signatures are not mechanisms. Enrichment indicates similarity to a known program, not mechanistic activation.
  10. Cross-species limitations. Mouse-to-human ortholog mapping loses nuance and sometimes meaning.
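Pitfall 4 is easy to check directly: compute the pairwise overlap between enriched gene sets before treating them as independent findings. The gene sets below are tiny illustrative examples.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two gene sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

# Two "independently significant" sets that share a metabolic core.
hypoxia = {"VEGFA", "SLC2A1", "CA9", "PGK1", "LDHA"}
glycolysis = {"SLC2A1", "PGK1", "LDHA", "PKM", "HK2"}

overlap = jaccard(hypoxia, glycolysis)
print(round(overlap, 2))  # 0.43
```

A high overlap does not make either hit wrong, but it does mean you should report one underlying program rather than two separate pathway findings.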

If you want the shortest honest recommendation:

  1. Design the experiment around the biological question, not around whichever downstream package you hope to use.
  2. For most human and mouse bulk RNA-seq projects, use stranded poly(A) if RNA quality is good and the question is coding-gene expression.
  3. Start every dataset with FastQC + MultiQC, then add RSeQC, RNA-SeQC, or Picard when RNA-specific sanity checks matter.
  4. Use Salmon + tximport + tximeta for expression-first work.
  5. Use STAR when you genuinely need genome-aware analyses.
  6. Use DESeq2 for gene-level DE, then add log-fold-change (LFC) shrinkage, the right flavor of gene-set analysis, one sensible regulatory inference layer, and a conservative cell-context layer.
  7. Treat genome-aware regional analysis as a targeted follow-up when the biological question justifies it, not as a default decoration.
  8. Pick downstream methods based on the biological question they answer, not because they are popular on other people's RNA-seq figures.