Stability in feature selection for RNA-seq data

Objective

The objective is to achieve stable and interpretable feature selection on RNA-seq data for classifying kidney disease progression stages. The project emphasizes reproducibility in biomedical machine learning pipelines by assessing how consistently features are selected under different sampling conditions.

What is Stability in Feature Selection?

Stability refers to how consistently a feature (gene) is selected when feature selection is repeated on different subsamples of the dataset.

Why Stability Matters

**Reliability:** Ensures selected genes truly reflect the underlying biology, not noise or random chance.

**Interpretability:** Builds trust in the features used for predictions.

**Reproducibility:** Key for scientific research and clinical decision-making.

Dataset

   **Domain:** Gene expression (RNA-seq)

   **Use Case:** Classification of kidney disease into four subtypes, each combining a stage (early or late) with a course (progressive or stable):

   Early Progressive

   Early Stable

   Late Progressive

   Late Stable

Problem

Standard feature selection methods may identify different gene sets each time they're run on slightly different data, which undermines model reliability.
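
One way to put a number on this instability is to compare the gene sets returned by two runs. The sketch below uses the Jaccard index (size of the intersection divided by size of the union), which is one common overlap measure and an assumption here, not necessarily the metric used in this project. The gene names are placeholders, not results from the dataset.

```python
def jaccard(selected_a, selected_b):
    """Overlap between two selected gene sets: |A & B| / |A | B|."""
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


# Placeholder gene sets standing in for two runs on slightly different data.
run_1 = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
run_2 = {"GENE_A", "GENE_B", "GENE_E", "GENE_F"}

overlap = jaccard(run_1, run_2)
print(f"Jaccard overlap between runs: {overlap:.2f}")
```

A value near 1 means the two runs agree on almost every gene; a value near 0 means the selection depends heavily on which samples happened to be included.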

Goal: Find genes that are consistently informative, not just occasionally selected.

Proposed Solution: Stability-Based Feature Selection

1. Choose a base feature selection method.
2. Repeatedly draw random subsamples of the dataset (e.g., 50% of the samples each time).
3. Apply the base method to each subsample and record which features are selected.
4. For each gene, calculate the fraction of subsamples in which it was selected; this is its selection probability.
5. Apply a threshold (e.g., 70%) to keep only the consistently selected features.
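
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the base selector `top_k_by_mean_difference`, the function `stability_selection`, its parameter names, and the synthetic data are all hypothetical. Any base method that returns a set of feature indices could be passed in instead.

```python
import random
from collections import Counter
from statistics import mean


def top_k_by_mean_difference(X, y, k):
    """Hypothetical base selector: rank genes by the spread of per-class means."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        class_means = [
            mean(row[j] for row, label in zip(X, y) if label == c) for c in classes
        ]
        scores.append(max(class_means) - min(class_means))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def stability_selection(X, y, select, n_rounds=100, fraction=0.5, threshold=0.7, seed=0):
    """Return (selection frequency per gene, genes at or above the threshold)."""
    rng = random.Random(seed)
    n_samples = len(X)
    subsample_size = max(1, int(fraction * n_samples))
    counts = Counter()
    for _ in range(n_rounds):
        # Draw a random subsample without replacement and run the base selector on it.
        idx = rng.sample(range(n_samples), subsample_size)
        counts.update(select([X[i] for i in idx], [y[i] for i in idx]))
    freq = {j: counts[j] / n_rounds for j in range(len(X[0]))}
    stable = sorted(j for j, f in freq.items() if f >= threshold)
    return freq, stable


# Synthetic stand-in for expression data: 40 samples, 20 genes, 4 classes.
data_rng = random.Random(42)
labels = ["Early Progressive", "Early Stable", "Late Progressive", "Late Stable"]
y = [labels[i % 4] for i in range(40)]
X = []
for label in y:
    row = [data_rng.gauss(0, 1) for _ in range(20)]
    row[0] += 5 * labels.index(label)  # gene 0 separates all four classes
    if label.startswith("Late"):
        row[1] += 5  # gene 1 separates early from late
    X.append(row)

freq, stable = stability_selection(
    X, y, lambda Xs, ys: top_k_by_mean_difference(Xs, ys, k=3)
)
print("Stable genes (by column index):", stable)
```

In this toy setup the base selector always returns three genes, but only two carry real signal, so the third slot is filled by whichever noise gene happens to look best in a given subsample. Thresholding on selection frequency is what separates the consistently chosen genes from those occasional picks.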