Troels Holger Vaaben

Bulk RNA-seq: from biological sample to interpretable signal

A practical reference for moving from bulk RNA-seq experiment design to differential expression and layered biological interpretation.

19 min read · Intermediate
What you'll learn
  • How to get from sample planning to a trustworthy count table
  • How to interpret DESeq2 outputs without mixing inference and visualization
  • When to move from genes to gene sets, sample states, and regulatory inference
  • When transcript-aware or genome-aware analyses are worth the extra complexity

Executive summary

This guide is intentionally split into two parts:

  • Upstream: everything from biological sample and library design through QC, quantification, and the construction of a trustworthy count table.
  • Downstream: everything after the count table, starting with core differential-expression analysis and then moving into the extra layers that help you squeeze more biology out of the dataset.

If you want one default path that works for most human or mouse bulk RNA-seq projects, make it this:

  1. Plan the biology first: replication, RNA quality, library type, strandedness, and annotation version.
  2. Treat FastQC + MultiQC as the universal first stop after FASTQ generation.
  3. For expression-first work, quantify with Salmon and import with tximport, ideally keeping provenance with tximeta.
  4. Use DESeq2 as the default differential-expression engine.
  5. Only after you trust the counts and the design should you add enrichment, pathway activity, TF activity, cell-context inference, and more advanced transcript-aware analyses.

This page is not trying to list every transcriptomics method ever published. It is a practical guide to the core analysis up to DEGs, and then to the downstream analyses that are most useful when you want to get more biological signal out of bulk RNA-seq without drifting into method tourism. The mainline recommendation is:

Clean mammalian tissue, expression-first question, reasonable RNA quality

poly(A) enrichment -> stranded library -> FASTQ QC -> Salmon or nf-core/rnaseq -> tximport/tximeta -> DESeq2 -> enrichment + pathway activity + TF activity + cell-context inference

Switch away from that default when the question changes:

  • Choose rRNA depletion when RNA is degraded, FFPE-like, intronic signal matters, or non-polyadenylated transcripts matter.
  • Choose STAR when you need genome alignments for genome-browser coverage, exon-aware counting, or genome-position-aware follow-up.
  • Choose paired-end and longer reads when genome-aware follow-up is an important part of the study rather than an afterthought.

Companion analysis repo

If you want a small runnable example that follows the downstream logic in this guide, use the companion airway walkthrough repo, Transcriptomics-extended. It walks through QC, DESeq2, ORA, GSEA, ssGSEA, PROGENy, and DoRothEA + VIPER on the Bioconductor airway dataset.

Who this is for

This guide stays strictly in bulk RNA-seq. If your core question depends on rare cell populations, lineage mixtures, or spatial localization, bulk RNA-seq may still be useful, but a single-cell, spatial, or long-read assay may answer the real question better.

Where are you in the process?

Pipeline overview

Upstream

Ends at count table

1. Biological planning

Sample quality, replication, annotation choice, poly(A) versus rRNA depletion, strandedness, read geometry.

2. FASTQ QC

FastQC + MultiQC first, then RNA-specific checks such as RSeQC, RNA-SeQC, or Picard when needed.

3. Trim only if needed

Trim Galore or direct Cutadapt when adapters or quality tails are a real problem, not by reflex.

4. Quantify or align

Salmon for expression-first work. STAR when you need genome-aware follow-up such as positional signal, regional analysis, or browser-style coverage inspection.

5. Count table

This is the handoff point. If you do not trust this object, do not move downstream yet.

Downstream

Starts from count table

1. Gene-level

Start with import, QC, design, normalization, and differential expression. This is where you establish which genes change.

2. Gene set-level

Move from individual genes to pathways or processes with ORA or GSEA once the DE result is trustworthy.

3. Sample-level pathway or state activity

Use GSVA, ssGSEA, or related signature scoring when the question is how programs vary across samples rather than just across groups.

4. Regulatory or signaling inference

Use PROGENy or DoRothEA plus VIPER when you want a more explicit hypothesis about upstream drivers of the expression pattern.

5. Tissue composition

Stress-test whether apparent biology may instead reflect changing cell-type mixture in the bulk sample.

6. Genome-aware or transcript-level

Reserve DEXSeq, DRIMSeq, rMATS, or PREDA for questions that depend on transcript structure or genomic position rather than gene-level abundance alone.

Upstream is about generating trustworthy inputs. Downstream is about asking the right biological question of those inputs. The count table is the boundary between the two.

Upstream: from sample to count table

The upstream half of the workflow is about getting from biological material to a count table you believe. If this part is weak, no downstream method will save it.

Before sequencing

The best bulk RNA-seq analysis begins before the sequencer sees a library. Bad experimental structure is harder to rescue than bad code.

Planning checklist

| Decision | Default | Switch when | Why it matters later |
| --- | --- | --- | --- |
| RNA capture | Stranded poly(A) enrichment | Use rRNA depletion for degraded RNA, FFPE-like input, non-polyadenylated transcripts, intronic signal, or broader transcriptome coverage | Changes what molecules you are measuring and what quantifiers and interpretations make sense |
| Strandedness | Stranded | Only tolerate unstranded if the library is already made and you cannot change it | Resolves overlapping antisense transcription and improves counting in gene-dense regions |
| Read geometry | Single-end for simple gene-level expression only | Use paired-end when genome-aware follow-up or more structure-aware counting matters | Affects how confidently you can interpret structure-aware or coordinate-aware signal |
| Read length | Moderate lengths are fine for gene-level DE | Go longer when positional follow-up or more structure-aware interpretation will matter later | Short reads compress ambiguity into multi-mapping and transcript assignment problems |
| Annotation | Lock one version and keep it fixed | Switch only with intent and document it clearly | Annotation drift breaks joins, gene IDs, length corrections, and reproducibility |

Library design choices

Recommended default

Library design for most bulk RNA-seq studies

Default: Stranded poly(A)-enriched bulk RNA-seq

Use when

You have reasonably intact human or mouse RNA and the main question is gene-level expression in coding transcripts.

Switch to an alternative when

Use stranded rRNA depletion when RNA is degraded, non-polyadenylated RNA matters, or you care about intronic signal and broader transcriptome coverage. Use paired-end and longer reads once genome-aware follow-up becomes a real objective.

Inputs

  • High-quality RNA from tissue, sorted cells, or cultured cells
  • Clear biological question and replicate structure
  • Human or mouse reference and annotation plan

Outputs

  • Libraries enriched for mature polyadenylated transcripts
  • Cleaner coding-focused signal
  • Straightforward downstream gene-level analysis

Alternatives

  • Stranded rRNA depletion
  • Unstranded protocols if legacy only
  • Paired-end longer-read libraries when coordinate-aware follow-up matters

What good output looks like

Replicates cluster mainly by biology, strandedness checks agree with the expected protocol, and gene-body coverage looks plausible for the library chemistry rather than erratic or heavily degraded.

Pros

  • Simple and efficient for gene-level DE
  • Strong fit for Salmon plus tximport or tximeta workflows
  • Strandedness helps in antisense and overlapping loci

Cons

  • Misses non-polyadenylated biology
  • Less forgiving with degraded RNA
  • Single-end short reads are weak for transcript-structure questions

Practical note

People often choose cheap single-end short reads and then later expect confident genome-aware follow-up. That is usually a design mismatch, not an analysis failure.

Poly(A) versus rRNA depletion

Poly(A) enrichment is the cleanest default for coding-focused mammalian expression studies with decent RNA integrity.

rRNA depletion is the better default for degraded RNA, FFPE-like material, non-polyadenylated transcripts, nascent or intronic signal, and broader transcriptome coverage. It often trades interpretive simplicity for coverage breadth.

Human and mouse annotation note

For computation, prefer stable IDs such as Ensembl gene identifiers. Use gene symbols for figures and readable tables later. Strip version suffixes only intentionally, because careless version stripping can silently merge or drop rows.
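As a base-R illustration of that failure mode (the version suffixes here are invented for the example, only the Ensembl-style formatting is real), check for collisions before committing to stripped IDs:

```r
# Strip Ensembl version suffixes, but verify nothing collapses first
ids <- c("ENSG00000141510.17", "ENSG00000141510.18", "ENSG00000012048.23")
stripped <- sub("\\.\\d+$", "", ids)
dup <- any(duplicated(stripped))
dup  # TRUE: two versioned rows would silently merge into one gene
```

If `dup` is TRUE, decide explicitly how to aggregate or deduplicate before joining on the stripped IDs.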

If you map mouse genes into human-centric signatures, treat ortholog mapping as a convenience layer. It is not ground truth, especially for immune, developmental, and rapidly evolving gene families.

Annotation and reference strategy

Use one reference bundle and freeze it:

  • GENCODE: strong default for human and mouse when you want well-curated gene and transcript models.
  • Ensembl: strong default when your ecosystem or annotation tooling is already Ensembl-centric.
  • NCBI RefSeq: use when a collaborator, institution, or downstream clinical context is already standardized on RefSeq.

Do not mix:

  • GENCODE transcriptome index with RefSeq gene annotation in downstream joins
  • Ensembl IDs from one release with symbols from another release
  • human-centric signature resources onto mouse data without explicitly documenting ortholog conversion

| Reference choice | Best for | Watch out for |
| --- | --- | --- |
| GENCODE | General human and mouse bulk RNA-seq with transcript-level work | Version changes will shift transcript membership and gene models |
| Ensembl | Ensembl-centric pipelines, BioMart-heavy workflows, consistent Ensembl IDs | Symbol updates and archived release handling can break later joins |
| RefSeq | Clinical, NCBI-standardized, or collaborator-constrained projects | Transcript model differences can change summarization and exon-level counts |

Replication, power, and controls

Biological replication matters more than squeezing one more downstream algorithm out of underpowered data.

  • Use RNASeqPower or powsimR when you can still influence the experiment.
  • If you are already underpowered, be honest about effect-size uncertainty and treat pathway-level summaries as supportive, not rescuing.
  • ERCC-style spike-ins can be useful in selected designs, but they do not replace good biological replication and they can create a false sense of security if used as a generic fix.
# Simple planning logic, not a substitute for study-specific assumptions
library(RNASeqPower)
est_power <- rnapower(
  depth = 20,
  n = 4,
  cv = 0.3,
  effect = 1.5,
  alpha = 0.05
)
est_power

Planning pitfall

Do not treat batch as an afterthought. If all treated samples were processed on a different day, by a different person, or with a different kit lot, that is a design problem first and a statistics problem second.
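A thirty-second base-R sanity check makes full aliasing obvious before any modeling. The metadata here is a toy sample sheet, not a real design:

```r
# Cross-tabulate condition against batch: empty off-diagonal cells = aliasing
meta <- data.frame(condition = c("ctrl", "ctrl", "trt", "trt"),
                   batch     = c("day1", "day1", "day2", "day2"))
table(meta$condition, meta$batch)
# Here every treated sample sits in day2: batch and condition are fully
# confounded, and no statistical correction can separate them afterwards
```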

From FASTQ to trusted quantification

Keep the data type explicit at each step:

  • FASTQ: raw reads and base qualities
  • BAM/CRAM: genome alignments
  • quant.sf or equivalent: transcript-level abundance estimates
  • gene count matrix: integer counts for count-based DE frameworks
  • normalized matrix: transformed expression for visualization and sample structure
  • ranked vector: gene-wise statistics for GSEA
  • splice-event or exon tables: event-aware follow-up
  • regional scores or genome-aware summaries: specialized outputs

Universal QC comes first

Recommended default

Initial sequencing QC

Default: FastQC + MultiQC

Use when

Always, immediately after FASTQ generation or after receiving raw data from a core facility.

Switch to an alternative when

Move beyond FastQC when you need RNA-specific sanity checks: strandedness inference, ribosomal contamination context, exon-intron read distribution, or gene-body bias. Those require RNA-aware tools and often alignments.

Inputs

  • FASTQ files
  • Sample sheet or sample naming convention
  • Knowledge of expected read length, library chemistry, and strandedness

Outputs

  • Per-sample FastQC reports
  • One MultiQC summary across the project
  • A first-pass decision on whether trimming or deeper RNA-specific QC is necessary

Alternatives

  • RNA-SeQC for RNA-process metrics
  • RSeQC for strandedness, read distribution, and gene-body coverage
  • Picard CollectRnaSeqMetrics for alignment-aware RNA metrics

What good output looks like

Adapters are absent or modest, quality tails are not catastrophic, replicate libraries look broadly similar, and any suspicious sample is obvious before downstream work begins.

Pros

  • Fast and universal
  • Easy to compare many samples
  • MultiQC makes bad samples obvious early

Cons

  • FastQC alone is not RNA-aware
  • Some warnings are expected and not actionable
  • Cannot tell you if strandedness matches the library without deeper checks

Practical note

Overreacting to every FastQC warning is a rookie move. RNA-seq libraries often show non-random sequence content and uneven positional composition for legitimate protocol reasons.

mkdir -p qc/fastqc
fastqc data/fastq/*.fastq.gz --outdir qc/fastqc
multiqc qc -o qc/multiqc

For RNA-specific QC after a genome-alignment step, check:

  • RSeQC: infer_experiment.py, read_distribution.py, geneBody_coverage.py
  • RNA-SeQC: broad process metrics, coverage and expression summaries
  • Picard CollectRnaSeqMetrics: coding, UTR, intronic, intergenic fractions, ribosomal bases
infer_experiment.py -r refs/hg38.bed -i align/sample.bam
geneBody_coverage.py -r refs/hg38_housekeeping.bed -i align/sample.bam -o qc/rseqc/sample
picard CollectRnaSeqMetrics \
  I=align/sample.bam \
  O=qc/picard/sample.rna_metrics.txt \
  REF_FLAT=refs/gencode.v44.refFlat.txt \
  STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND

Trimming: useful, but do not do it reflexively

Recommended default

Adapter and quality trimming

Default: Trim Galore for simple projects; direct Cutadapt when you need control

Use when

Adapters or severe quality tails are real problems, or when a core facility note and FastQC make trimming clearly worthwhile.

Switch to an alternative when

Use direct Cutadapt when you need fine control over adapter logic, overlaps, minimum length thresholds, or unusual library structures. Skip trimming when the quantifier already handles modest imperfections and there is no actual problem to solve.

Inputs

  • FASTQ files
  • Expected adapter sequences
  • FastQC evidence or protocol knowledge

Outputs

  • Trimmed FASTQ files
  • Trim reports
  • Cleaner input for aligners or quantifiers when problems were genuine

Alternatives

  • Direct Cutadapt
  • No trimming if adapters and tails are not meaningfully harming downstream quantification
  • Pipeline-managed trimming inside nf-core/rnaseq

What good output looks like

Adapter content drops when it was truly present, short-read collapse is limited, and downstream mapping or quantification improves rather than merely looking cleaner.

Pros

  • Trim Galore is simple and wraps Cutadapt
  • Fixes genuine adapter contamination cleanly
  • Easy to explain and automate

Cons

  • Unnecessary trimming can discard useful sequence
  • Aggressive trimming can distort read-length distributions
  • People often trim out of habit rather than evidence

Practical note

Do not trim because a generic tutorial told you to. Trim because your data or protocol gives you a reason.

trim_galore --paired \
  --cores 4 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  --output_dir trimmed/

Quantification strategy

Recommended default

Expression-first quantification

Default: Salmon -> tximport -> tximeta (optional but recommended)

Use when

The main goal is robust expression analysis and you do not need BAM-centric genome-aware follow-up as the primary output.

Switch to an alternative when

Move to STAR when you need genome-browser coverage, exon-level counting, or genome-position-aware analyses. Use aligned-read counting when the follow-up depends on BAMs rather than just expression estimates.

Inputs

  • Transcriptome index tied to a frozen annotation release
  • FASTQ files
  • Sample metadata and library type knowledge

Outputs

  • Transcript abundances
  • Gene-level matrices after tximport
  • Annotation-linked SummarizedExperiment objects if using tximeta

Alternatives

  • STAR plus featureCounts or Rsubread for BAM-centric work
  • kallisto for lightweight pseudoalignment
  • HISAT2 when genome alignment is needed but STAR is not your choice

What good output looks like

Sample-level library sizes and composition look plausible, imported gene-level counts align with biology, and no sample is clearly broken by unexpected strandedness or annotation mismatches.

Pros

  • Fast and accurate for expression-first workflows
  • Transcript-level estimation can improve gene-level inference when imported correctly
  • tximeta preserves annotation provenance and reduces metadata drift

Cons

  • Not the right primary output for browser coverage or coordinate-aware analysis
  • Requires care with transcriptome and annotation matching
  • Pseudoalignment does not replace true alignments for all questions

Practical note

The biggest failure mode is annotation mismatch: Salmon index built from one transcript set, tx2gene built from another, and downstream symbols from a third.

salmon quant \
  -i refs/gencode.v44.salmon_index \
  -l A \
  -1 trimmed/sample_R1.fq.gz \
  -2 trimmed/sample_R2.fq.gz \
  -p 8 \
  --validateMappings \
  -o quants/sample

library(tximport)
library(tximeta)

coldata <- read.csv("metadata/samples.csv")
se <- tximeta(coldata)
gse <- summarizeToGene(se)

When STAR should be your default instead

Choose STAR when you need true genome alignments for exon-aware counting, genome-browser coverage, or genome-position-aware follow-up. Salmon is the default for expression-first questions, not for all questions.

| Tool path | Use this when | Main output | Main drawback |
| --- | --- | --- | --- |
| Salmon + tximport | Fast, expression-first quantification | Transcript abundances and imported gene-level counts | Not a BAM-centric workflow |
| STAR + featureCounts or Rsubread | Coverage, browser tracks, exon-aware counting, region-aware follow-up | BAMs, junctions, gene or exon counts | Heavier compute and storage |
| kallisto | Lightweight expression workflows and fast iteration | Pseudoalignment quantifications | Less often the first choice in mixed BAM-centric labs |
| HISAT2 + counts | Alternative splice-aware aligner path | BAMs and countable alignments | Not usually my first choice over STAR for bulk mammalian work |

Downstream: from count table to biological interpretation

The downstream half starts once you have a count table and credible metadata. The first job is still boring and essential: import, QC, design, normalization, and differential expression. Only then does it make sense to start asking more ambitious biological questions.

Core analysis

Once you have a count-like object and trustworthy metadata, the work becomes statistical and interpretive rather than purely computational.

Import, normalization, and sample structure

For most users, the best default is DESeq2 on imported gene-level counts. If you quantified with Salmon, do not throw away transcript-level work by flattening it carelessly. Use tximport, and use tximeta when you can.

library(DESeq2)
library(tximeta)

se <- tximeta(read.csv("metadata/samples.csv"))
gse <- summarizeToGene(se)

dds <- DESeqDataSet(gse, design = ~ batch + condition)
dds <- DESeq(dds)

vsd <- vst(dds, blind = FALSE)
plotPCA(vsd, intgroup = c("condition", "batch"))

Differential expression

Recommended default

Gene-level differential expression

Default: DESeq2

Use when

Standard bulk RNA-seq differential expression with biological replicates and integer gene counts or tximport-derived gene-level counts.

Switch to an alternative when

Stay with DESeq2 for most standard bulk RNA-seq comparisons. Switch to edgeR when you still want a count-based method but want more hands-on control over filtering and model setup, or when you already work in an edgeR-style pipeline. Switch to limma-voom when the project naturally lives in the limma ecosystem, especially if you have many contrasts or want a linear-model workflow built around weighted log-expression values rather than DESeq2's count model.

Inputs

  • Gene-level count matrix or SummarizedExperiment
  • Sample metadata with condition, batch, and covariates
  • Clear contrasts and reference levels

Outputs

  • Normalized counts
  • Dispersion estimates
  • Log2 fold changes, adjusted p-values, shrunken effect sizes

Alternatives

  • edgeR
  • limma-voom
  • Likelihood-ratio testing for multi-level questions

What good output looks like

Replicates cluster sensibly, dispersion fits are stable, MA plots are not dominated by one broken sample, and shrunken log2 fold changes tell a coherent biological story.

Pros

  • Default choice for most users
  • Strong documentation and community practice
  • lfcShrink gives more stable effect-size interpretation

Cons

  • Not magic against bad design
  • Can be over-trusted when metadata are weak
  • Requires careful factor coding and contrast specification

Practical note

People often batch-correct the counts first and then fit DESeq2. Usually the right move is to model batch in the design, not to mutate the counts blindly upstream.

res <- results(dds, contrast = c("condition", "treated", "control"))
res_shrunk <- lfcShrink(dds, coef = "condition_treated_vs_control", type = "apeglm")

sig <- subset(as.data.frame(res_shrunk), padj < 0.05 & abs(log2FoldChange) > 1)
head(sig[order(sig$padj), ])
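One pitfall listed above is factor coding: DESeq2 reads contrast direction from the factor's reference level. A base-R reminder with toy labels:

```r
# Make "control" the explicit reference so log2FC means treated vs control
condition <- factor(c("treated", "control", "treated", "control"))
condition <- relevel(condition, ref = "control")
levels(condition)  # "control" "treated": coefficients compare against control
```

Set this before building the DESeqDataSet; relying on alphabetical factor ordering is how contrasts silently flip direction.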

How to read the main DESeq2 outputs

pvalue / padj: the statistical significance output from the DESeq2 model. This is the main inference result.

log2FoldChange: the estimated effect size, including both magnitude and direction of change.

lfcShrink output: a shrunken, more stable version of log2 fold change that is usually better for ranking genes and practical interpretation.

vst() output: variance-stabilized expression values for visualization such as PCA or clustering, not for statistical testing.

DESeq2 performs inference by modeling raw count data, with normalization handled internally. Transformed outputs such as VST and normalized counts are for visualization and exploratory analysis, not for the main hypothesis test.
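The log2 scale trips people up more often than it should. A quick numeric reminder with invented group means:

```r
# log2FoldChange lives on the log2 scale: 1 = doubling, 2 = four-fold
lfc <- log2(400 / 100)  # treated mean 400 vs control mean 100
lfc                     # 2, i.e. a four-fold increase
2^1.5                   # and a log2FC of 1.5 is roughly a 2.8-fold change
```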

What is statistically detectable vs what is biologically meaningful?

A small adjusted p-value tells you that a change is statistically detectable under the model. It does not tell you that the change is biologically important, cell-intrinsic, or worth building a story around.

In many real bulk RNA-seq studies, effect size, consistency across replicates, and biological interpretability matter more than a binary significance line. Shrunken log2 fold change and overall coherence are often more useful for prioritization than asking whether a gene sits just above or below padj < 0.05.
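Since padj thresholds get so much attention, it helps to see what Benjamini-Hochberg adjustment actually does to a small set of p-values (toy numbers, base R):

```r
# BH adjustment is rank-dependent, not a per-gene property
pvals <- c(gene1 = 0.001, gene2 = 0.010, gene3 = 0.020, gene4 = 0.400)
padj  <- p.adjust(pvals, method = "BH")
padj
# gene3's adjusted value (~0.027) depends on the p-values around it, one more
# reason not to treat padj < 0.05 as a bright line between real and not real
```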

Count models versus voom

DESeq2 is the clean default for most ordinary bulk RNA-seq comparisons: count data in, design formula specified, differential expression out.

edgeR is a good switch when you still want a count-based framework but want more explicit control over filtering and model choices, or when the rest of the project already uses edgeR.

limma-voom is a good switch when the real benefit is the limma workflow itself, especially for many contrasts or broader linear-model analyses. The key difference is that voom converts counts into weighted log-expression values before modeling rather than fitting the DESeq2 count model directly.

Batch effects and unwanted variation

Recommended default

Batch handling and unwanted variation

Default: Model batch in the design first; use sva, svaseq, or RUVSeq when needed

Use when

You have clear technical structure, latent confounding, or hidden variation that is not fully captured by your known metadata.

Switch to an alternative when

Use ComBat-style adjustment only when you understand why a corrected expression matrix is needed for a specific downstream task. Do not use it as a ritual before DE. Use svaseq or RUVSeq when hidden structure is suspected and a count-aware framework is still desired.

Inputs

  • Count matrix
  • Sample metadata
  • Biological design you are trying not to destroy
  • A realistic hypothesis about what the unwanted variation might be

Outputs

  • Adjusted design terms
  • Surrogate variables or unwanted-variation factors
  • Cleaner interpretation of residual technical structure

Alternatives

  • ComBat for selected corrected-expression tasks
  • svaseq when counts are the starting point
  • RUVSeq when control genes or replicate logic supports it

What good output looks like

Known technical structure becomes less dominant, but the biological contrast of interest remains visible and interpretable rather than flattened into noise.

Pros

  • Prevents technical structure from masquerading as biology
  • Works well when the design is honest
  • Can stabilize PCA and downstream interpretations

Cons

  • Easy to over-correct
  • Hidden-factor methods can absorb real biology
  • Badly confounded designs remain badly confounded

Practical note

Never try to 'correct away' a batch effect before asking whether batch and biology are partially or fully aliased. If they are, statistics cannot manufacture information that was never collected.

library(sva)

mod  <- model.matrix(~ condition, data = colData(dds))
mod0 <- model.matrix(~ 1, data = colData(dds))
sv   <- svaseq(counts(dds), mod = mod, mod0 = mod0)$sv

colData(dds)$sv1 <- sv[, 1]
design(dds) <- ~ sv1 + batch + condition
dds <- DESeq(dds)

What to inspect before believing any DE table

  • PCA or MDS on transformed counts
  • Sample-sample distances
  • Library-size spread and composition bias
  • Gene-body coverage and strandedness sanity checks
  • Outlier samples in MultiQC and RNA-specific QC
  • Whether low-count filtering was reasonable
  • Whether the design formula matches the actual experiment
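For the low-count filtering point, one widely used heuristic is to keep genes with at least roughly 10 counts in a minimum number of samples. The thresholds are a judgment call; the matrix below is a toy example:

```r
# Keep genes with >= 10 counts in at least 3 samples (toy 3-gene matrix)
counts_mat <- matrix(c( 0,  0, 12, 15, 20,
                        3,  1,  0,  2,  1,
                       30, 25, 40, 18, 22),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("geneA", "geneB", "geneC"), NULL))
keep <- rowSums(counts_mat >= 10) >= 3
counts_mat[keep, , drop = FALSE]  # geneA and geneC survive, geneB is dropped
```

Whatever rule you pick, record it: filtering changes the multiple-testing burden and therefore the padj values downstream.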

Slow down before interpretation

A DEG table is not yet a biological conclusion

Before you move into enrichment or more advanced downstream layers, stop and ask what the DE pattern might actually reflect. Expression changes can represent specific biology, but they can also reflect response, compensation, injury, timing effects, generic stress programs, or differences in cell mixture.

Treat the DEG table as evidence about RNA abundance, not as direct proof of protein abundance or cellular function. The next job is to decide whether the pattern is robust, biologically coherent, and worth deeper follow-up at the pathway, regulatory, or composition level.

Downstream layers after DE

Once the DE result is credible, move upward in interpretation one layer at a time. Start with individual genes, then ask whether coherent gene sets shift, then whether individual samples carry programs, then whether those programs imply signaling or regulatory activity, and finally whether the bulk profile may reflect changing tissue composition.

Gene-level

This is the anchor layer. Differential expression tells you which genes move and by how much. Everything that follows should be read as an interpretation layered on top of that gene-level result, not as a substitute for it.

Stay here until you trust:

  • the contrast and design formula
  • the effect sizes and shrinkage behavior
  • the sample structure and batch handling
  • a few manually inspected genes that fit the biology you think you are seeing

Once you have a ranked gene-level result you believe, the next shift is from individual genes to coordinated programs.

Gene set-level

This is the first abstraction step. Instead of asking whether one gene is up or down, ask whether a pathway or process moves as a coherent set.

| Method | Main question | Best input | Use it when | Main limitation |
| --- | --- | --- | --- | --- |
| ORA | Which processes are over-represented in my chosen DEG list? | A thresholded gene list | You want a simple summary of a clear DEG set | Strongly depends on the threshold used to define DEGs |
| GSEA / fgsea | Do genes from a pathway shift across the full ranked result even if few genes pass a hard cutoff? | A ranked vector across all tested genes | You want to preserve sub-threshold structure and avoid arbitrary thresholding | Still depends on the quality and relevance of the gene-set collection |

library(fgsea)

# apeglm-shrunk tables drop the Wald statistic, so rank from the unshrunk res
stats <- res$stat
names(stats) <- rownames(res)
stats <- sort(stats[!is.na(stats)], decreasing = TRUE)
# note: the Hallmark .gmt below is symbol-keyed; map Ensembl IDs to symbols
# first if your rownames are Ensembl

hallmark <- gmtPathways("refs/h.all.v2024.1.Hs.symbols.gmt")
fg <- fgsea(pathways = hallmark, stats = stats, minSize = 15, maxSize = 500)
head(fg[order(fg$padj), c("pathway", "NES", "padj")])

library(clusterProfiler)
library(org.Hs.eg.db)

ora <- enrichGO(
  gene = rownames(subset(sig, log2FoldChange > 1)),
  universe = rownames(res_shrunk),
  OrgDb = org.Hs.eg.db,
  keyType = "ENSEMBL",
  ont = "BP"
)

Gene-set dependence is real

Hallmark, Reactome, GO, and cell-type signature resources can all produce different interpretations from the same ranked list. That does not automatically mean one is wrong; it means the gene-set universe is part of the analysis.

Frame enrichment results as interpretations conditional on the chosen resource, and look for convergence across more than one sensible collection before making strong claims.

ORA and GSEA still summarize the group-level contrast. If the next question is whether individual samples carry a program strongly or weakly, move from gene sets to per-sample scores.

Sample-level pathway activity

Here the unit of interpretation shifts from the contrast to the sample. GSVA and ssGSEA turn a normalized expression matrix into per-sample pathway scores, which is useful when heterogeneity, subtype structure, responder patterns, or outlier biology matter.

| Method | Main question | Best input | Use it when | Main limitation |
| --- | --- | --- | --- | --- |
| GSVA | Which pathways vary smoothly across samples? | A normalized expression matrix | You want continuous per-sample pathway scores for clustering, correlation, or phenotype association | Scores are relative summaries, not direct proof of pathway activity |
| ssGSEA | Which samples look signature-high or signature-low? | A normalized expression matrix | You want a simple rank-based per-sample score for pathway or signature burden | Depends on signature quality and can be sensitive to broad expression shifts |

Use this same layer for cell state or functional state scoring. In practice that means scoring signatures such as:

  • cell cycle
  • proliferation
  • hypoxia
  • EMT
  • interferon response
  • stress or injury programs

These are still sample-level summaries. They tell you which samples carry a program, not yet which upstream regulator or pathway driver caused it.

library(GSVA)

# vst_mat: the variance-stabilized matrix from earlier, e.g. assay(vsd)
vst_mat <- assay(vsd)

# note: recent GSVA releases prefer gsva(ssgseaParam(vst_mat, hallmark));
# the method = "ssgsea" form below works in older releases
ssgsea_scores <- gsva(
  as.matrix(vst_mat),
  hallmark,
  method = "ssgsea"
)
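To demystify rank-based per-sample scoring, here is a deliberately naive base-R sketch of the idea. This is not the actual ssGSEA algorithm, and the expression values and signature are toy data:

```r
# One sample, five genes: does the toy signature sit near the top of the ranks?
expr <- c(geneA = 5.2, geneB = 1.1, geneC = 7.9, geneD = 3.3, geneE = 6.0)
sig  <- c("geneA", "geneC")
score <- mean(rank(expr)[sig]) / length(expr)
score  # 0.8: signature genes rank high in this sample
```

Real implementations use weighted running sums over the full ranking, but the intuition is the same: the score is relative to the other genes in that sample.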

Once you have sample-level programs, the next shift is from descriptive scoring to inferred upstream drivers.

Regulatory / signaling inference

This layer asks what might be driving the expression pattern rather than just describing it. PROGENy estimates signaling pathway activity from downstream responsive genes, while DoRothEA + VIPER estimates transcription factor activity from coordinated target-gene behavior.

| Method | Main question | Best input | Use it when | Main limitation |
| --- | --- | --- | --- | --- |
| PROGENy | Do the data look like downstream consequences of a signaling pathway being active? | A normalized expression matrix | You care about pathway activity rather than simple overlap with pathway members | Limited to the pathways modeled by the resource |
| DoRothEA + VIPER | Which transcription factors best explain the observed expression pattern? | A normalized expression matrix plus a TF-target network | You want plausible upstream regulatory drivers rather than TF expression alone | Depends heavily on regulon quality and context relevance |

library(progeny)

# Pathway activity from downstream-responsive genes; `vst_mat` is the
# variance-stabilized expression matrix (genes x samples).
pathway_scores <- progeny(vst_mat, scale = TRUE, organism = "Human", top = 500)

library(dorothea)
library(viper)

# Keep only the higher-confidence regulons (A-C).
data(dorothea_hs, package = "dorothea")
regulon <- subset(dorothea_hs, confidence %in% c("A", "B", "C"))

# run_viper() converts the dorothea table into the regulon format that
# viper expects before scoring TF activity per sample.
tf_scores <- run_viper(
  vst_mat, regulon,
  options = list(method = "scale", minsize = 4, eset.filter = FALSE)
)

These methods still assume the bulk profile is mostly reflecting state within the sampled mixture. If the mixture itself is changing, apparent pathways or TFs may partly be composition effects rather than cell-intrinsic regulation.

Tissue composition

This is the mixture layer. Here the question shifts from what programs are active to who is present in the sample. In bulk RNA-seq, that distinction matters because apparent inflammation, stromal biology, or differentiation can reflect changing proportions rather than changed state within one cell type.

| Method | Best framed as | Needs reference? | Best use case | Main caution |
| --- | --- | --- | --- | --- |
| xCell2 | Enrichment or tissue-context score | No matched single-cell reference required | Quick context readout when you want a broad enrichment-style estimate | Better for relative context than absolute proportions |
| EPIC | Reference-based deconvolution | Yes | Human bulk data where the reference assumptions fit the tissue reasonably well | Reference mismatch can dominate the result |
| CIBERSORTx | Proportion-style deconvolution | Usually yes | When a compatible signature matrix exists and transfers plausibly to your tissue | Portability across contexts is limited |
| MuSiC | Composition estimation | Yes | You have a genuinely matched single-cell reference from similar tissue and condition | A mismatched reference can make the output look precise but wrong |

Recommended practical default:

  • start with xCell2 for a conservative context readout
  • use EPIC or CIBERSORTx when the reference assumptions fit the biological system
  • use MuSiC when you have a truly matched single-cell reference and want proportion-aware estimation
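The core idea behind the reference-based tools above can be sketched in a few lines: treat the bulk profile as a non-negative mixture of cell-type signature profiles and solve for the mixing weights. This toy example uses made-up numbers and plain least squares rather than any listed tool's actual estimator; it only illustrates the principle.

```python
import numpy as np

# Reference profiles: genes x cell types (hypothetical signature matrix).
ref = np.array([
    [10.0, 0.5],  # marker gene for cell type A
    [0.5, 8.0],   # marker gene for cell type B
    [3.0, 3.0],   # gene expressed by both types
])

# Simulate a bulk mixture that is 70% type A and 30% type B.
true_props = np.array([0.7, 0.3])
bulk = ref @ true_props

# Plain least squares, clipped to non-negative, renormalized to proportions.
coefs, *_ = np.linalg.lstsq(ref, bulk, rcond=None)
coefs = np.clip(coefs, 0.0, None)
props = coefs / coefs.sum()
print(props.round(2))  # roughly [0.7, 0.3]
```

Real tools differ in how they build the reference, constrain the fit, and handle genes that do not behave like clean markers, which is exactly where reference mismatch starts to dominate.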

Cell composition versus cell state

A higher immune, stromal, or EMT-like signal can mean more cells of that type, the same cells in a different state, or both. Use deconvolution to stress-test your interpretation, not to declare exact ground truth from bulk alone.

At this point you should know whether your story is mainly gene-level, program-level, sample-state-level, regulatory, or compositional. If the remaining question is specifically about isoforms, splicing, or genomic position, move into a more structure-aware layer.

Genome-aware / transcript-level analysis

This is the advanced or optional layer. Once you collapse signal to gene-level counts, you lose exon structure, isoform usage, and genomic position. These methods are for cases where that lost structure is the actual biological question.

| Method | What it keeps that gene-level DE loses | Main question | Use it when |
| --- | --- | --- | --- |
| DEXSeq / DRIMSeq | Exon-level or transcript-usage structure | Do transcript isoforms or exon usage change even when gene-level expression does not tell the full story? | You care about differential transcript usage, isoform shifts, or exon-level regulation |
| rMATS | Splicing-event structure | Are specific splicing events such as exon skipping or intron retention changing between conditions? | You have replicate-aware RNA-seq and the biology plausibly depends on splicing regulation |
| PREDA | Genomic position | Are expression changes spatially clustered along chromosomes or genomic regions rather than scattered gene by gene? | You suspect regional regulation, domain-level effects, or copy-number-linked patterns |

Use this layer when the unresolved question is about transcript structure or genomic organization, not when the real question is pathway membership, sample state, or tissue composition.

# Differential transcript or exon usage (R)
library(DEXSeq)   # exon-level differential usage from exon bin counts
library(DRIMSeq)  # differential transcript usage from transcript counts

# Splicing events (command line, rMATS turbo); read type and length
# must match your libraries
rmats.py --b1 treated_bams.txt --b2 control_bams.txt \
  --gtf refs/genes.gtf \
  --od rmats_out --tmp rmats_tmp \
  -t paired --readLength 101 --nthread 8

# Positional / regional expression patterns (R)
library(PREDA)

Targeted panels and ortholog sanity checks

Do not let an enrichment package decide your biology for you. Always return to a few hand-curated gene panels:

  • interferon and inflammatory genes
  • ECM and adhesion genes
  • mitochondrial and stress genes
  • lineage markers relevant to your tissue
  • known perturbation targets
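As a concrete sketch of the panel idea: intersect a hand-curated panel with your DE results before trusting any enrichment call. The gene names, thresholds, and the results dict below are illustrative stand-ins for a real DE table.

```python
# Hand-curated interferon panel (illustrative subset).
ifn_panel = ["ISG15", "IFIT1", "IFIT3", "MX1", "OAS1", "STAT1"]

# Hypothetical DE results: gene -> (log2 fold change, adjusted p-value).
de_results = {
    "ISG15": (2.8, 1e-8),
    "MX1": (2.1, 3e-6),
    "STAT1": (1.4, 2e-4),
    "ACTB": (0.1, 0.9),
}

# Keep panel genes that are both measured and significant.
panel_hits = {
    g: de_results[g]
    for g in ifn_panel
    if g in de_results and de_results[g][1] < 0.05
}
print(sorted(panel_hits))  # ['ISG15', 'MX1', 'STAT1']
```

If the panel disagrees with the enrichment output, trust neither until you understand why.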

For mouse studies pushed through human-centric resources, create an explicit ortholog table and keep both original and mapped identifiers side by side.
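A minimal version of such a table might look like this; the mappings and the unmapped placeholder gene are hypothetical examples, and in practice the mapping would come from an ortholog resource rather than a hand-written dict.

```python
# Hypothetical mouse-to-human ortholog mapping (illustrative only).
mouse_to_human = {
    "Ifit1": "IFIT1",
    "Stat1": "STAT1",
    "Col1a1": "COL1A1",
}

mouse_de_genes = ["Ifit1", "Stat1", "Gm0000"]  # last one: no ortholog

# Keep both identifiers side by side instead of overwriting one.
ortholog_table = [
    {"mouse": g, "human": mouse_to_human.get(g)}
    for g in mouse_de_genes
]

# Unmapped genes stay visible as None rather than silently vanishing.
unmapped = [row["mouse"] for row in ortholog_table if row["human"] is None]
print(unmapped)  # ['Gm0000']
```

The point of the two-column layout is auditability: any downstream human-centric enrichment result can be traced back to the original mouse identifier, including the genes that were dropped.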

Common pitfalls

This section is intentionally blunt. Most bulk RNA-seq pain comes from a small set of repeated mistakes.

Technical and analytical pitfalls

  1. No real replication. Statistical sophistication does not replace biological replication.
  2. Annotation drift. If the index, tx2gene mapping, and visualization symbols came from different annotation releases, joins will break or silently lie.
  3. Wrong strandedness assumption. Use infer_experiment.py or equivalent checks. Counting with the wrong strand setting can silently discard most of the reads or assign them to the wrong gene at overlapping loci.
  4. Blind trimming. Trim because there is a problem, not because it feels hygienic.
  5. Ignoring multi-mapping. Repetitive genes, paralogs, pseudogenes, and short-read ambiguity are biological and computational realities.
  6. Over-correcting batch. Batch correction can remove signal of interest if design and biology are entangled.
  7. Short reads for coordinate-aware questions. If the goal is genome-aware interpretation or positional follow-up, cheap short single-end data are often the real limitation.
  8. rRNA contamination denial. High rRNA fractions are not just annoying; they waste depth and can indicate library problems.
  9. Assuming weak signal means no biology. It can also mean wrong tissue compartment, weak perturbation, bad timing, or overwhelming composition changes.
  10. Treating every downstream method as mandatory. Most projects do not need every specialized scoring, regional, and network layer all at once.
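For pitfall 3, infer_experiment.py reports the fraction of reads consistent with each strand convention; a small helper like this makes the call explicit. The thresholds below are an illustrative heuristic, not part of RSeQC.

```python
def call_strandedness(frac_fr: float, frac_rf: float,
                      threshold: float = 0.8) -> str:
    """Classify library strandedness from the two explained fractions
    reported by infer_experiment.py (illustrative decision rule)."""
    if frac_fr >= threshold:
        return "forward-stranded"
    if frac_rf >= threshold:
        return "reverse-stranded"
    return "unstranded"

# e.g. a typical dUTP (reverse-stranded) library
print(call_strandedness(0.02, 0.97))  # reverse-stranded
print(call_strandedness(0.49, 0.50))  # unstranded
```

Whatever rule you use, record the decision and pass the matching setting consistently to the counter or quantifier, since a mismatch here propagates into every downstream number.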

What to do when signal is weak or libraries look bad

  • Re-check sample labels and metadata first.
  • Re-check strandedness and annotation consistency.
  • Inspect gene-body coverage and read distribution.
  • Ask whether the perturbation timing or tissue compartment was wrong.
  • Look at known positive-control genes or pathways before declaring failure.
  • Consider that composition shifts may dominate weak within-cell-state changes.
  • If everything depends on rare populations or spatial structure, bulk may not be the right assay.

Interpretation pitfalls

Interpretation principle

Expression changes reflect what cells are responding to or attempting, not necessarily what is functionally occurring.

This final list focuses on interpretation traps that still matter even after you have already checked for technical structure and composition effects.

  1. Compensatory versus functional change. Upregulation does not imply more function. It may reflect compensation after stress, damage, or failed homeostasis.
  2. Structure versus expression mismatch. Higher mRNA does not guarantee correct protein structure, localization, or tissue architecture.
  3. Stress and injury responses are nonspecific. Hypoxia, inflammation, interferon, EMT-like programs, and immediate-early responses are reused across many contexts.
  4. Pathway redundancy. Overlapping gene sets can make several pathways look independently significant when they really reflect one underlying program.
  5. Directionality ambiguity. Up or down does not automatically define activation, inhibition, or biological outcome.
  6. mRNA is not protein and not function. Translation, post-translational regulation, localization, and turnover matter.
  7. Temporal ambiguity. A single time point cannot cleanly separate cause from consequence.
  8. Reverse causality. Especially in tumor and inflammatory settings, observed changes may be downstream responses rather than drivers.
  9. Signatures are not mechanisms. Enrichment indicates similarity to a known program, not mechanistic activation.
  10. Cross-species limitations. Mouse-to-human ortholog mapping loses nuance and sometimes meaning.
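Pitfall 4 is easy to check directly: compute the pairwise overlap between enriched gene sets before treating them as independent findings. The gene sets below are tiny illustrative examples.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two gene sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

# Two "independently significant" sets that share a metabolic core.
hypoxia = {"VEGFA", "SLC2A1", "CA9", "PGK1", "LDHA"}
glycolysis = {"SLC2A1", "PGK1", "LDHA", "PKM", "HK2"}

overlap = jaccard(hypoxia, glycolysis)
print(round(overlap, 2))  # 0.43
```

A high overlap does not make either hit wrong, but it does mean you should report one underlying program rather than two separate pathway findings.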

If you want the shortest honest recommendation:

  1. Design the experiment around the biological question, not around whichever downstream package you hope to use.
  2. For most human and mouse bulk RNA-seq projects, use stranded poly(A) if RNA quality is good and the question is coding-gene expression.
  3. Start every dataset with FastQC + MultiQC, then add RSeQC, RNA-SeQC, or Picard when RNA-specific sanity checks matter.
  4. Use Salmon + tximport + tximeta for expression-first work.
  5. Use STAR when you genuinely need genome-aware analyses.
  6. Use DESeq2 for gene-level DE, then add log-fold-change (LFC) shrinkage, the right flavor of gene-set analysis, one sensible regulatory inference layer, and a conservative cell-context layer.
  7. Treat genome-aware regional analysis as a targeted follow-up when the biological question justifies it, not as a default decoration.
  8. Pick downstream methods based on the biological question they answer, not because they are popular on other people's RNA-seq figures.