RNA-seq Analysis

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Introduction

RNA sequencing (RNA-seq) is a powerful technique for quantifying gene expression across the transcriptome. It enables the identification of differentially expressed genes (DEGs), detection of novel transcripts, and characterization of transcript isoforms under various biological conditions.

In this analysis, we investigate transcriptomic differences between HS vs. CL lesions using publicly available RNA-seq data from GSE127831, which is part of the study by Amorim et al., 2019. The dataset comprises 28 samples,from human skin biopsies of healthy individuals(n=7) and from lesions of patients infected with Leishmania braziliensis (n=21).

Instead of starting from raw FASTQ files, we retrieved pre-aligned gene-level expression data from the ARCHS4 database. ARCHS4 provides uniformly processed RNA-seq data for human and mouse samples, enabling efficient reuse and reanalysis of public datasets.

Our primary goals are: - To perform quality control and normalization of RNA-seq count data - To identify genes that are significantly differentially expressed between the experimental groups -To interpret biological significance using gene set enrichment or pathway analysis

We use the limma-voom and edgeR frameworks for normalization and differential expression analysis, both of which are well-established methods for RNA-seq data with complex experimental designs.

This document provides a reproducible workflow combining code, output, and interpretation. All steps are performed in R and rendered using R Markdown and the Knitr package.

R packages used

A variety of R packages was used for this analysis. All graphics and data wrangling were handled using the tidyverse suite of packages. All packages used are available from the Comprehensive R Archive Network (CRAN), Bioconductor.org, or Github.

Importing count data into R

Pre-aligned gene-level expression data from the ARCHS4 database was retreived for this dataset. ARCHS4 provides uniformly processed RNA-seq data for human and mouse samples, enabling efficient reuse and reanalysis of public datasets.

library(tidyverse)#for data wrangling
library(rhdf5)#for retreiving data from ARCHS4 database
library(edgeR)#for DGE analysis
library(gplots)#for heatmap.2
library(enrichplot)

archs4.human <- "human_gene_v2.latest.h5"
# use the h5 list (h5ls) function from the rhdf5 package to look at the contents of these databases
hdim <- h5ls(archs4.human)

# data for 67,186 HUMAN genes across 819,856 samples
all.samples.human <- h5read(archs4.human, name="meta/samples/geo_accession")
dim(all.samples.human)

## [1] 993191

#the sample identifiers are obtained from the GEO database for GSE 127831
mySamples <- c( "GSM3639530",    #skin_HS1
               "GSM3639531",    #skin_HS2
               "GSM3639532",    #skin_HS3
               "GSM3639533",    #skin_HS4
               "GSM3639534",    #skin_HS5
               "GSM3639535",    #skin_HS6
               "GSM3639536",    #skin_HS7
               "GSM3639537",    #skin_CL1
               "GSM3639538",    #skin_CL2
               "GSM3639539",    #skin_CL3
               "GSM3639540",    #skin_CL4
               "GSM3639541",    #skin_CL5
               "GSM3639542",    #skin_CL6
               "GSM3639543",    #skin_CL7
               "GSM3639544",    #skin_CL8
               "GSM3639545",    #skin_CL9
               "GSM3639546",    #skin_CL10
               "GSM3639547",    #skin_CL11
               "GSM3639548",    #skin_CL12
               "GSM3639549",    #skin_CL13
               "GSM3639550",    #skin_CL14
               "GSM3639551",    #skin_CL15
               "GSM3639552",    #skin_CL16
               "GSM3639553",    #skin_CL17
               "GSM3639554",    #skin_CL18
               "GSM3639555",    #skin_CL19
               "GSM3639556",    #skin_CL20
               "GSM3639557")    #skin_CL21
               
# Identify columns to be extracted from ARCHS4 database
my.sample.locations <- which(all.samples.human %in% mySamples) # first time you've seen the %in% operator.

# extract gene symbols from the metadata
genes <- h5read(archs4.human, "meta/genes/symbol")

# Extract expression data from ARCHS4 ----
expression <- h5read(archs4.human, "data/expression",
                     index=list(my.sample.locations, NULL))
# transpose to get genes as rows and samples as columns
expression <- t(expression)

rownames(expression) <- genes
colnames(expression) <- all.samples.human[my.sample.locations]
colSums(expression) #this shows the sequencing depth for each of the samples you've extracted

## GSM3639530 GSM3639531 GSM3639532 GSM3639533 GSM3639534 GSM3639535 GSM3639536 
##   53481826   64940474   52745249   20848486    2837407   51722656   37390057 
## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543 
##   50779924   35415106  101406579   18945879   66455004   84689579   40410575 
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550 
##   73600594   74039897   39497323   44325744   52241136   88109937   89112596 
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557 
##   84331463   64685799   56634215   56572087   69779095   38525553   78194883

archs4.dgelist <- DGEList(expression)
archs4.cpm <- cpm(archs4.dgelist)
colSums(archs4.cpm)

## GSM3639530 GSM3639531 GSM3639532 GSM3639533 GSM3639534 GSM3639535 GSM3639536 
##      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06 
## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543 
##      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06 
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550 
##      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06 
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557 
##      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06

Preprocessing

Impact of filtering and normalization

table(rowSums(archs4.dgelist$counts==0)==28)

## 
## FALSE  TRUE 
## 55449 11737

keepers <- rowSums(archs4.cpm>1)>=7
archs4.dgelist.filtered <- archs4.dgelist[keepers,]
dim(archs4.dgelist.filtered)

## [1] 21135    28

archs4.dgelist.filtered.norm <- calcNormFactors(archs4.dgelist.filtered, method = "TMM")

archs4.filtered.norm.log2.cpm <- cpm(archs4.dgelist.filtered.norm, log=TRUE)

# Extract sample metadata from ARCHS4 to create a study design file ----
# extract the sample source
sample_source_name <- h5read(archs4.human, "meta/samples/source_name_ch1")
# extract sample title
sample_title <- h5read(archs4.human, name="meta/samples/title")
# extract sample characteristics
sample_characteristics<- h5read(archs4.human, name="meta/samples/characteristics_ch1")

# let's try putting this all together in a study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
                      Sample_source = sample_source_name[my.sample.locations],
                      Sample_characteristics = sample_characteristics[my.sample.locations])

#based on what we extracted from ARCHS4 above, lets customize and clean-up this study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
                      genotype = rep(c("HS", "CL"), times= c(7,21)),
                      treatment = rep(c("NInf", "Inf"), times= c(7,21)))

#capture experimental variables as factors from this study design
genotype <- factor(studyDesign$genotype)
treatment <- factor(studyDesign$treatment)
sampleName <- studyDesign$Sample_title

Filtering was carried out to remove lowly expressed genes. Genes with less than 1 count per million (CPM) in at least 7 or more samples filtered out. This reduced the number of genes from 67186 to 21135.

table of filtered and normalized data

library(tidyverse)
library(DT)

# use dplyr 'mutate' function to add new columns based on existing data
targets <- read_tsv("GSE127831_studydesign.txt")
sampleLabels <- targets$sample
colnames(archs4.filtered.norm.log2.cpm) <- c(sampleLabels)
archs4.filtered.norm.log2.cpm.df <- as_tibble(archs4.filtered.norm.log2.cpm, rownames = "geneID")
archs4.filtered.norm.log2.cpm.df

## # A tibble: 21,135 × 29
##    geneID     HS1   HS2   HS3   HS4   HS5   HS6   HS7   CL1   CL2   CL3   CL4
##    <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 TSPAN6   4.42   4.57 4.52   4.49  4.17  4.05 4.51   2.85  2.37  3.58  3.55
##  2 TNMD     3.18   2.62 3.61   2.79  3.25  2.06 2.57  -4.82 -4.82 -3.49 -4.82
##  3 DPM1     5.16   5.24 5.02   5.10  4.76  5.01 4.84   5.28  4.76  5.53  5.32
##  4 SCYL3    4.85   4.64 4.93   4.67  4.25  4.58 4.73   4.77  4.52  4.48  4.32
##  5 C1ORF112 2.83   2.77 3.01   3.05  2.31  2.97 3.05   3.13  3.39  2.72  3.22
##  6 FGR      0.574  2.43 0.610  1.86  1.51  2.50 0.792  5.87  6.57  5.80  5.68
##  7 CFH      7.14   7.87 7.68   8.58  9.14  8.66 8.12   8.19  7.26  8.31  8.35
##  8 FUCA2    3.95   5.01 4.56   5.20  4.73  5.13 4.76   5.46  4.73  5.61  6.44
##  9 GCLC     5.89   5.74 5.56   5.80  5.18  4.66 5.39   4.73  4.19  5.02  4.97
## 10 NFYA     4.55   4.46 4.35   4.88  4.69  4.62 5.23   4.86  4.64  4.65  5.12
## # ℹ 21,125 more rows
## # ℹ 17 more variables: CL5 <dbl>, CL6 <dbl>, CL7 <dbl>, CL8 <dbl>, CL9 <dbl>,
## #   CL10 <dbl>, CL11 <dbl>, CL12 <dbl>, CL13 <dbl>, CL14 <dbl>, CL15 <dbl>,
## #   CL16 <dbl>, CL17 <dbl>, CL18 <dbl>, CL19 <dbl>, CL20 <dbl>, CL21 <dbl>

mydata.df <- archs4.filtered.norm.log2.cpm.df %>% 
  mutate(healthy.AVG = (HS1 + HS2 + HS3 + HS4 + HS5 + HS6 +HS7)/7,
         disease.AVG = (CL1 + CL2 + CL3 + CL4 + CL5 + CL6 + CL7 + CL8 + CL9 + CL10 + CL11 + CL12 + CL13 + CL14 + CL15 + CL16 + CL17 + CL18 + CL19 + CL20 + CL21)/21,
         #now make columns comparing each of the averages above that you're interested in
         LogFC = (disease.AVG - healthy.AVG)) %>% 
  mutate_if(is.numeric, round, 2)

#now look at this modified data table
mydata.df

## # A tibble: 21,135 × 32
##    geneID     HS1   HS2   HS3   HS4   HS5   HS6   HS7   CL1   CL2   CL3   CL4
##    <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 TSPAN6    4.42  4.57  4.52  4.49  4.17  4.05  4.51  2.85  2.37  3.58  3.55
##  2 TNMD      3.18  2.62  3.61  2.79  3.25  2.06  2.57 -4.82 -4.82 -3.49 -4.82
##  3 DPM1      5.16  5.24  5.02  5.1   4.76  5.01  4.84  5.28  4.76  5.53  5.32
##  4 SCYL3     4.85  4.64  4.93  4.67  4.25  4.58  4.73  4.77  4.52  4.48  4.32
##  5 C1ORF112  2.83  2.77  3.01  3.05  2.31  2.97  3.05  3.13  3.39  2.72  3.22
##  6 FGR       0.57  2.43  0.61  1.86  1.51  2.5   0.79  5.87  6.57  5.8   5.68
##  7 CFH       7.14  7.87  7.68  8.58  9.14  8.66  8.12  8.19  7.26  8.31  8.35
##  8 FUCA2     3.95  5.01  4.56  5.2   4.73  5.13  4.76  5.46  4.73  5.61  6.44
##  9 GCLC      5.89  5.74  5.56  5.8   5.18  4.66  5.39  4.73  4.19  5.02  4.97
## 10 NFYA      4.55  4.46  4.35  4.88  4.69  4.62  5.23  4.86  4.64  4.65  5.12
## # ℹ 21,125 more rows
## # ℹ 20 more variables: CL5 <dbl>, CL6 <dbl>, CL7 <dbl>, CL8 <dbl>, CL9 <dbl>,
## #   CL10 <dbl>, CL11 <dbl>, CL12 <dbl>, CL13 <dbl>, CL14 <dbl>, CL15 <dbl>,
## #   CL16 <dbl>, CL17 <dbl>, CL18 <dbl>, CL19 <dbl>, CL20 <dbl>, CL21 <dbl>,
## #   healthy.AVG <dbl>, disease.AVG <dbl>, LogFC <dbl>

datatable(mydata.df[,c(1,30:32)], 
          extensions = c('KeyTable', "FixedHeader"), 
          filter = 'top',
          options = list(keys = TRUE, 
                         searchHighlight = TRUE, 
                         pageLength = 10, 
                         lengthMenu = c("10", "25", "50", "100")))

The table shown below includes expression data for 21135 genes. You can sort and search the data directly from the table.

PCA plot

library(plotly)

# Principal component analysis (PCA) -------------
pca.res <- prcomp(t(archs4.filtered.norm.log2.cpm), scale.=F, retx=T)
#look at pca.res in environment
ls(pca.res)

## [1] "center"   "rotation" "scale"    "sdev"     "x"

summary(pca.res) # Prints variance summary for all principal components.

## Importance of components:
##                             PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     110.1039 41.57249 38.68737 36.96260 31.13845 28.89612
## Proportion of Variance   0.4786  0.06823  0.05909  0.05394  0.03828  0.03296
## Cumulative Proportion    0.4786  0.54680  0.60589  0.65983  0.69810  0.73107
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     24.92572 22.74729 21.30897 21.18538 20.21495 19.44713
## Proportion of Variance  0.02453  0.02043  0.01793  0.01772  0.01613  0.01493
## Cumulative Proportion   0.75559  0.77602  0.79394  0.81166  0.82779  0.84272
##                           PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     18.8979 18.39168 17.97927 17.46455 17.46077 16.93905
## Proportion of Variance  0.0141  0.01335  0.01276  0.01204  0.01204  0.01133
## Cumulative Proportion   0.8568  0.87018  0.88294  0.89498  0.90701  0.91834
##                            PC19     PC20     PC21     PC22     PC23     PC24
## Standard deviation     16.53302 16.19797 15.68047 15.36846 15.16967 14.96179
## Proportion of Variance  0.01079  0.01036  0.00971  0.00932  0.00908  0.00884
## Cumulative Proportion   0.92913  0.93949  0.94920  0.95852  0.96761  0.97644
##                            PC25     PC26     PC27      PC28
## Standard deviation     14.30138 14.05055 13.95698 1.001e-13
## Proportion of Variance  0.00807  0.00779  0.00769 0.000e+00
## Cumulative Proportion   0.98452  0.99231  1.00000 1.000e+00

pc.var<-pca.res$sdev^2 #sdev^2 gives you the eigenvalues
pc.per<-round(pc.var/sum(pc.var)*100, 1)
pc.per

##  [1] 47.9  6.8  5.9  5.4  3.8  3.3  2.5  2.0  1.8  1.8  1.6  1.5  1.4  1.3  1.3
## [16]  1.2  1.2  1.1  1.1  1.0  1.0  0.9  0.9  0.9  0.8  0.8  0.8  0.0

# Visualize your PCA result ------------------
#lets first plot any two PCs against each other
#We know how much each sample contributes to each PC (loadings), so let's plot
pca.res.df <- as_tibble(pca.res$x)
pca.plot <- ggplot(pca.res.df) +
  aes(x=PC1, y=PC2, color=treatment, shape = genotype) +
  geom_point(size=4) +
  #geom_label() +
  # stat_ellipse() +
  xlab(paste0("PC1 (",pc.per[1],"%",")")) +
  ylab(paste0("PC2 (",pc.per[2],"%",")")) +
  labs(title="PCA plot",
       caption=paste0("produced on ", Sys.time())) +
  coord_fixed() +
  theme_bw()

ggplotly(pca.plot)

Volcano plot

#reading in the study design file 
design <- read.csv(gzfile("GSE127831_studydesign.csv.gz"))
write.table(design, "GSE127831_studydesign.txt", sep = "\t", row.names = FALSE, quote = FALSE)


#HS vs. CL lesions as a factor 
disease <- factor(targets$disease)

#preparing a model matrix for CL vs HS 
design1 <- model.matrix(~0 + disease)
colnames.design1 <- levels(disease)

# Model mean-variance trend and fit linear model to data ----
# Use VOOM function from Limma package to model the mean-variance relationship
v.DEGList.filtered.norm <- voom(archs4.dgelist.filtered.norm, design1, plot = TRUE)

# fit a linear model to your data
fit <- lmFit(v.DEGList.filtered.norm, design1)

# Contrast matrix ----
contrast.matrix.disease <- makeContrasts("infection vs. control"= diseasecutaneous - diseasecontrol,
                                           levels=design1)

# extract the linear model fit -----
fits <- contrasts.fit(fit, contrast.matrix.disease)
#get bayesian stats for your linear model fit
ebFit <- eBayes(fits)
#write.fit(ebFit, file="lmfit_results.txt")

# TopTable to view DEGs -----
myTopHits <- topTable(ebFit, adjust ="BH", coef=1, number=40000, sort.by="logFC")

# convert to a tibble
myTopHits.df <- myTopHits %>%
  as_tibble(rownames = "geneID")

# now plot
vplot <- ggplot(myTopHits.df) +
  aes(y=-log10(adj.P.Val), x=logFC, text = paste("Symbol:", ID)) +
  geom_point(size=2) +
  geom_hline(yintercept = -log10(0.01), linetype="longdash", colour="grey", linewidth=1) +
  geom_vline(xintercept = 1, linetype="longdash", colour="#BE684D", linewidth=1) +
  geom_vline(xintercept = -1, linetype="longdash", colour="#2C467A", linewidth=1) +
  annotate("rect", xmin = 1, xmax = 12, ymin = -log10(0.01), ymax = 7.5, alpha=.2, fill="#BE684D") +
  annotate("rect", xmin = -1, xmax = -12, ymin = -log10(0.01), ymax = 7.5, alpha=.2, fill="#2C467A") +
  labs(title="Volcano plot",
       subtitle = "Cutaneous leishmaniasis",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw()

ggplotly(vplot)

Table of DEGs

To identify differentially expressed genes, precision weights were first applied to each gene based on its mean-variance relationship using VOOM, then data was normalized using the TMM method in EdgeR. Linear modeling and bayesian stats were employed via Limma to find genes that were up- or down-regulated in leishmania patients by 4-fold or more, with a false-discovery rate (FDR) of 0.01.

#extract the DEGs
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value=0.01, lfc=1)
sampleLabels <- targets$sample
head(v.DEGList.filtered.norm$E)

##          GSM3639530 GSM3639531 GSM3639532 GSM3639533 GSM3639534 GSM3639535
## TSPAN6    4.4215045   4.567633  4.5217375   4.493129   4.181086   4.048499
## TNMD      3.1711747   2.609734  3.6070871   2.786308   3.275945   2.050838
## DPM1      5.1582223   5.242238  5.0148733   5.096451   4.770710   5.004531
## SCYL3     4.8530178   4.634597  4.9248901   4.670525   4.262770   4.582371
## C1ORF112  2.8228494   2.765199  3.0018776   3.045262   2.352567   2.969108
## FGR       0.5488926   2.423070  0.5843514   1.857862   1.582048   2.488291
##          GSM3639536 GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541
## TSPAN6    4.5128208   2.844244   2.367165   3.577585   3.552950   3.183080
## TNMD      2.5661884  -6.653608  -6.404325  -4.086458  -5.205274  -5.469167
## DPM1      4.8399398   5.277499   4.758696   5.531426   5.319268   4.991289
## SCYL3     4.7276518   4.773181   4.515283   4.482361   4.316327   4.559199
## C1ORF112  3.0429112   3.122825   3.381945   2.712676   3.216791   3.218500
## FGR       0.7724368   5.871668   6.568476   5.800485   5.675075   5.940400
##          GSM3639542 GSM3639543 GSM3639544 GSM3639545 GSM3639546 GSM3639547
## TSPAN6     2.967780  4.1684385   4.015413   2.710141   3.551772   3.793119
## TNMD      -7.463718  0.7829547  -2.421751  -3.098569  -1.468298  -2.233914
## DPM1       5.093985  5.4922622   5.303504   5.078982   4.705077   5.418648
## SCYL3      4.692049  4.4714853   4.688136   4.612419   4.518426   4.980149
## C1ORF112   3.263351  2.6594248   2.276274   3.176085   3.406736   3.681808
## FGR        5.972343  4.7753634   5.800151   5.828897   5.926682   6.276590
##          GSM3639548 GSM3639549 GSM3639550 GSM3639551 GSM3639552  GSM3639553
## TSPAN6     3.278385   3.366171   3.437352   3.004041   3.304448 3.528899789
## TNMD      -4.352429  -3.798349  -1.557289  -3.868563  -4.703420 0.004827957
## DPM1       5.208668   5.135795   5.097529   4.807394   4.899836 4.983038471
## SCYL3      4.914826   4.596278   4.643880   4.365056   4.743250 4.939981026
## C1ORF112   3.346623   3.023793   2.953459   2.577392   3.240267 3.049577464
## FGR        6.136014   5.930487   6.383336   6.156954   6.447216 5.098143524
##          GSM3639554 GSM3639555 GSM3639556 GSM3639557
## TSPAN6     2.760758   3.050437   4.604296   4.694747
## TNMD      -6.989111  -4.758205  -6.029607  -1.451383
## DPM1       5.078659   5.016911   5.669531   4.915910
## SCYL3      4.881639   4.420211   4.578648   4.445319
## C1ORF112   3.515708   2.875516   3.044534   2.727420
## FGR        6.328160   6.593948   5.390879   2.974136

colnames(v.DEGList.filtered.norm$E) <- sampleLabels

diffGenes <- v.DEGList.filtered.norm$E[results[,1] !=0,]
head(diffGenes)

##              HS1      HS2       HS3      HS4      HS5      HS6       HS7
## TSPAN6 4.4215045 4.567633 4.5217375 4.493129 4.181086 4.048499 4.5128208
## TNMD   3.1711747 2.609734 3.6070871 2.786308 3.275945 2.050838 2.5661884
## FGR    0.5488926 2.423070 0.5843514 1.857862 1.582048 2.488291 0.7724368
## SEMA3F 6.4256798 5.729345 5.7213567 5.965981 5.821015 5.597115 5.9773292
## CFTR   4.7814568 3.834730 4.4752119 3.718728 4.033744 4.264076 3.5245915
## LAP3   6.1812705 6.351662 5.5855963 5.872332 6.349454 6.191206 6.6269640
##              CL1        CL2       CL3       CL4        CL5        CL6       CL7
## TSPAN6  2.844244  2.3671647  3.577585  3.552950  3.1830801  2.9677795 4.1684385
## TNMD   -6.653608 -6.4043248 -4.086458 -5.205274 -5.4691673 -7.4637181 0.7829547
## FGR     5.871668  6.5684756  5.800485  5.675075  5.9404004  5.9723428 4.7753634
## SEMA3F  4.607899  4.7662263  5.726011  3.841850  5.6955301  6.0136607 5.2318355
## CFTR   -1.099019 -0.7318994 -1.817970 -1.117811  0.3804984 -0.8638052 2.8762362
## LAP3    8.917641  8.5193277  9.413784  9.324462  8.7676193  9.0871640 8.8607082
##              CL8       CL9      CL10       CL11       CL12       CL13
## TSPAN6  4.015413  2.710141  3.551772  3.7931190  3.2783846  3.3661709
## TNMD   -2.421751 -3.098569 -1.468298 -2.2339139 -4.3524286 -3.7983493
## FGR     5.800151  5.828897  5.926682  6.2765900  6.1360141  5.9304870
## SEMA3F  4.779627  4.479655  4.459173  3.4414860  3.8907454  4.1428112
## CFTR    2.118082  1.024191  0.639505  0.8310415  0.5640481  0.6044988
## LAP3    9.119156  9.093318  8.553639  9.6550896  9.8108594  9.1633091
##              CL14       CL15      CL16        CL17       CL18       CL19
## TSPAN6  3.4373524  3.0040415  3.304448 3.528899789  2.7607584  3.0504371
## TNMD   -1.5572886 -3.8685634 -4.703420 0.004827957 -6.9891110 -4.7582053
## FGR     6.3833360  6.1569537  6.447216 5.098143524  6.3281603  6.5939485
## SEMA3F  4.3467463  5.1042858  4.326144 4.818452750  5.1411378  5.2886458
## CFTR    0.5544217 -0.7888362  1.046114 2.515584086 -0.1436209 -0.4802206
## LAP3    9.5488085  9.2699556  9.551564 9.442892888  8.8670470  9.5120170
##              CL20       CL21
## TSPAN6  4.6042962  4.6947469
## TNMD   -6.0296073 -1.4513833
## FGR     5.3908794  2.9741357
## SEMA3F  5.7845746  5.3132672
## CFTR    0.3102427  0.2366727
## LAP3    8.7840236  6.5096483

dim(diffGenes)

## [1] 5836   28

#convert your DEGs to a dataframe using as_tibble
diffGenes.df <- as_tibble(diffGenes, rownames = "geneID")
datatable(diffGenes.df, 
          extensions = c('KeyTable', "FixedHeader"), 
          caption = 'Table 1: DEGs in cutaneous leishmaniasis',
          options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>%
  formatRound(columns=c(2:11), digits=2)

Heatmaps and modules

Pearson correlation was used to cluster 5836 differentially expressed genes, which were then represented as heatmap with the data scaled by Zscore for each row.

#clustering - using unsupervised method: correlation
clustRows <- hclust(as.dist(1-cor(t(diffGenes), method="pearson")), method="complete") 
clustColumns <- hclust(as.dist(1-cor(diffGenes, method="spearman")), method="complete")

#we'll look at these clusters in more detail later
module.assign <- cutree(clustRows, k=2)

#now assign a color to each module (makes it easy to identify and manipulate)
module.color <- rainbow(length(unique(module.assign)), start=0.1, end=0.9) 
module.color <- module.color[as.vector(module.assign)] 
myheatcolors2 <- colorRampPalette(c("purple", "white", "darkgreen"))(50)
myheatcolors1 <- bluered(75)
# Produce a static heatmap of DEGs ----
#plot the hclust results as a heatmap
heatmap.2(diffGenes, 
          Rowv=as.dendrogram(clustRows), 
          Colv=as.dendrogram(clustColumns),
          RowSideColors=module.color,
          col=rev(myheatcolors1), scale='row', labRow=rownames(diffGenes),
          density.info="none", trace="none",  
          cexRow = 0.8,                         # Slightly smaller gene labels
          cexCol = 0.7,                         # Smaller column labels
          srtCol = 45,                          # Rotate column labels
          adjCol = c(1, 1),                     # Right-justify rotated labels
          offsetCol = 0.5,                      # Better spacing
          margins = c(10, 20))                  # Increase bottom margin)

# View modules of co-regulated genes ----
# view your color assignments for the different clusters
names(module.color) <- names(module.assign) 

module.assign.df <- tibble(
  geneID = names(module.assign),
  module = as.vector(module.assign)
)

module.assign.df <- module.assign.df %>%
  mutate(moduleColor = case_when(
    module == 1 ~ "#FF9900",
    module == 2 ~ "#FF0099"))


ggplot(module.assign.df) +
  aes(module) +
  geom_bar(aes(fill=moduleColor)) +
  theme_bw()

#choose a cluster(s) of interest by selecting the corresponding number based on the previous graph
modulePick <- 2 #use 'c()' to grab more than one cluster from the heatmap.  e.g., c(1,2)
#now we pull out the genes from this module using a fancy subsetting operation on a named vector
myModule <- diffGenes[names(module.assign[module.assign %in% modulePick]),] 
hrsub <- hclust(as.dist(1-cor(t(myModule), method="pearson")), method="complete") 

# Create heatmap for chosen sub-cluster.
heatmap.2(myModule, 
          Rowv=as.dendrogram(hrsub), 
          Colv=NA, 
          labRow = NA,
          col=rev(myheatcolors1), scale="row", 
          density.info="none", trace="none", 
          RowSideColors=module.color[module.assign%in%modulePick], 
          cexRow = 0.8,                         # Slightly smaller gene labels
          cexCol = 0.7,                         # Smaller column labels
          srtCol = 45,                          # Rotate column labels
          adjCol = c(1, 1),                     # Right-justify rotated labels
          offsetCol = 0.5,                      # Better spacing
          margins = c(10, 20))                  # Increase bottom margin)

modulePick <- 1 
myModule_down <- diffGenes[names(module.assign[module.assign %in% modulePick]),] 
hrsub_down <- hclust(as.dist(1-cor(t(myModule_down), method="pearson")), method="complete") 

heatmap.2(myModule_down, 
          Rowv=as.dendrogram(hrsub_down), 
          Colv=NA, 
          labRow = NA,
          col=rev(myheatcolors1), scale="row", 
          density.info="none", trace="none", 
          RowSideColors=module.color[module.assign%in%modulePick],
          cexRow = 0.8,                         # Slightly smaller gene labels
          cexCol = 0.7,                         # Smaller column labels
          srtCol = 45,                          # Rotate column labels
          adjCol = c(1, 1),                     # Right-justify rotated labels
          offsetCol = 0.5,                      # Better spacing
          margins = c(10, 20))                  # Increase bottom margin

sortedupFC <- myTopHits[order(-myTopHits$logFC),]
sortedupFC <- sortedupFC[!duplicated(sortedupFC$ID) & !is.na(sortedupFC$ID), ]  # avoid duplicates
rownames(sortedupFC) <- sortedupFC$ID
sorteduptop <- sortedupFC[-grep("IG", sortedupFC$ID),]
#Top 100 genes upregulated:
top100up <- sorteduptop[1:100,]$ID
TopUPtable100 <- archs4.filtered.norm.log2.cpm[c(top100up),]
TopUPmatrixcoding100 <- as.matrix(TopUPtable100)
colormapX <- colorRampPalette(colors=c("darkgreen","white","purple"))(50)
HeatmapUP100 <- heatmap.2(TopUPmatrixcoding100,
                          scale = "row", key=TRUE,
                          keysize = 1, key.title = NA,
                          col=colormapX, dendrogram = "none", Rowv = F,
                          margins=c(5,25),
                          labCol = NA, labRow = rownames(TopUPtable100),
                          main = "",
                          density.info="none", trace="none",
                          cexRow=0.8, cexCol=1)

# Save heatmap of top 100 upregulated coding genes to PNG file
png("Top100_UP_heatmap.png", width=1200, height=1800) #width/height in pixels
heatmap.2(TopUPmatrixcoding100,
          scale = "row", key=TRUE,
          keysize = 1, key.title = NA,
          col=colormapX, dendrogram = "none", Rowv = F,
          margins=c(5,25),
          labCol = NA, labRow = rownames(TopUPmatrixcoding100),
          main = "",
          density.info="none", trace="none",
          cexRow=0.7, cexCol=1)
dev.off()

## png 
##   2

GO enrichment

GO enrichment for the 3028 genes induced by infection

library(tidyverse)
library(limma)
library(gplots) #for heatmaps
library(DT) #interactive and searchable tables of our GSEA results
library(GSEABase) #functions and methods for Gene Set Enrichment Analysis
library(Biobase) #base functions for bioconductor; required by GSEABase
library(gprofiler2) #tools for accessing the GO enrichment results using g:Profiler web resources
library(clusterProfiler) # provides a suite of tools for functional enrichment analysis
library(msigdbr) # access to msigdb collections directly within R
library(enrichplot) # great for making the standard GSEA enrichment plots
gost.res_up <- gost(rownames(myModule), organism = "hsapiens", correction_method = "fdr")
gostplot(gost.res_up, interactive = T, capped = T) #set interactive=FALSE to get plot for publications

GO enrichment for the 2808 genes downregulated due to infection

gost.res_down <- gost(rownames(myModule_down), organism = "hsapiens", correction_method = "fdr")
gostplot(gost.res_down, interactive = T, capped = T) #set interactive=FALSE to get plot for publications

GSEA

msigdbr_species()

## # A tibble: 20 × 2
##    species_name                    species_common_name                          
##    <chr>                           <chr>                                        
##  1 Anolis carolinensis             Carolina anole, green anole                  
##  2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
##  3 Caenorhabditis elegans          <NA>                                         
##  4 Canis lupus familiaris          dog, dogs                                    
##  5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
##  6 Drosophila melanogaster         fruit fly                                    
##  7 Equus caballus                  domestic horse, equine, horse                
##  8 Felis catus                     cat, cats, domestic cat                      
##  9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
## 10 Homo sapiens                    human                                        
## 11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica           gray short-tailed opossum                    
## 13 Mus musculus                    house mouse, mouse                           
## 14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes                 chimpanzee                                   
## 16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
## 17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
## 18 Schizosaccharomyces pombe 972h- <NA>                                         
## 19 Sus scrofa                      pig, pigs, swine, wild boar                  
## 20 Xenopus tropicalis              tropical clawed frog, western clawed frog

hs_gsea <- msigdbr(species = "Homo sapiens") #gets all collections/signatures with human gene IDs
#take a look at the categories and subcategories of signatures available to you
hs_gsea %>%
  dplyr::distinct(gs_collection, gs_subcollection) %>%
  dplyr::arrange(gs_collection, gs_subcollection)

## # A tibble: 25 × 2
##    gs_collection gs_subcollection 
##    <chr>         <chr>            
##  1 C1            ""               
##  2 C2            "CGP"            
##  3 C2            "CP"             
##  4 C2            "CP:BIOCARTA"    
##  5 C2            "CP:KEGG_LEGACY" 
##  6 C2            "CP:KEGG_MEDICUS"
##  7 C2            "CP:PID"         
##  8 C2            "CP:REACTOME"    
##  9 C2            "CP:WIKIPATHWAYS"
## 10 C3            "MIR:MIRDB"      
## # ℹ 15 more rows

# choose a specific msigdb collection/subcollection
# since msigdbr returns a tibble, we'll use dplyr to do a bit of wrangling
hs_gsea_c2 <- msigdbr(species = "Homo sapiens", # change depending on species your data came from
                      category = "C2") %>% # choose your msigdb collection of interest
  dplyr::select(gs_name, gene_symbol) #just get the columns corresponding to signature name and gene symbols of genes in each signature

# Now that you have your msigdb collections ready, prepare your data
# grab the dataframe you made in and pull out just the columns corresponding to gene symbols and LogFC for at least one pairwise comparison for the enrichment analysis
# Pull out just the columns corresponding to gene symbols and LogFC for at least one pairwise comparison for the enrichment analysis
mydata.df.sub <- dplyr::select(mydata.df, geneID, LogFC)
# construct a named vector
mydata.gsea <- mydata.df.sub$LogFC
names(mydata.gsea) <- as.character(mydata.df.sub$geneID)
mydata.gsea <- sort(mydata.gsea, decreasing = TRUE)
mydata.gsea <- mydata.gsea[!duplicated(names(mydata.gsea))]
# run GSEA using the 'GSEA' function from clusterProfiler
set.seed(123) #set a random seed so that we can reproducible ordering for our GSEA results below
myGSEA.res <- GSEA(mydata.gsea, TERM2GENE=hs_gsea_c2, verbose=FALSE) #could replace C2CP with hs_gsea_c2 object you retrieved from msigdb above
myGSEA.df <- as_tibble(myGSEA.res@result) #using @ symbol - a slot in an S4 class object

# view results as an interactive table
datatable(myGSEA.df,
          extensions = c('KeyTable', "FixedHeader"),
          caption = 'Signatures enriched in leishmaniasis',
          options = list(keys = TRUE, searchHighlight = TRUE, pageLength = 10, lengthMenu = c("10", "25", "50", "100"))) %>%
  formatRound(columns=c(2:10), digits=2)

# create enrichment plots using the enrichplot package
gseaplot2(myGSEA.res,
          geneSetID = c(94,91), #can choose multiple signatures to overlay in this plot
          pvalue_table = FALSE, #can set this to FALSE for a cleaner plot
          #title = myGSEA.res$Description[c(94,90)]
) #can also turn off this title

# add a variable to this result that matches enrichment direction with phenotype
myGSEA.df <- myGSEA.df %>%
  mutate(phenotype = case_when(
    NES > 0 ~ "disease",
    NES < 0 ~ "healthy"))

# create 'bubble plot' to summarize y signatures across x phenotypes
myGSEA.sig <- myGSEA.df %>% 
       filter(p.adjust < 0.01)
 
# Get top 10 pathways per phenotype
top_10_each <- myGSEA.sig %>%
       group_by(phenotype) %>%
       arrange(p.adjust) %>%
       slice_head(n = 10)
# Now: get unique pathways (to keep common y-axis)
top_IDs <- unique(top_10_each$ID)
# Subset full data for those IDs (to get both phenotypes for each)
top_df <- myGSEA.df %>%
       filter(ID %in% top_IDs)
 
# Optional: shorten pathway names for display
top_df$ID <- gsub("REACTOME_", "", top_df$ID)
top_df$ID <- gsub("_", " ", top_df$ID)

# Plot
  ggplot(top_df, aes(x = phenotype, y = ID)) +
       geom_point(aes(size = setSize, color = NES, alpha = -log10(p.adjust))) +
      scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
      scale_alpha(range = c(0.4, 1)) +
      theme_bw() +
      labs(
             title = "Top GSEA Pathways by Phenotype",
              x = "Phenotype", y = "Pathway"
       ) +
      theme(
             axis.text.y = element_text(size = 7),
             axis.text.x = element_text(size = 10),
            plot.title = element_text(hjust = 0.5)
)

Variability in gene expression

# Identification of highly variable gene expression signatures in CL lesions.
  exprs_voom <- v.DEGList.filtered.norm$E
  # 1. Get group-wise means
  group_means <- data.frame(
    gene = rownames(exprs_voom),
    disease_mean = rowMeans(exprs_voom[, targets$disease == "cutaneous"]),
    healthy_mean = rowMeans(exprs_voom[, targets$disease == "control"])
  )

  # 2. Compute log2 fold change
  group_means$logFC <- group_means$disease_mean - group_means$healthy_mean
  
  
  # 3. Filter for genes:
  # - high disease expression (e.g. > 5)
  # - low healthy expression (e.g. < 1)
  # - large FC (e.g. logFC > 4)
  filtered_genes <- group_means %>%
    filter(disease_mean > 5, healthy_mean < 1, logFC > 4) %>%
    arrange(desc(logFC))
  
  # View top candidates
  head(filtered_genes, 10)

##        gene disease_mean healthy_mean     logFC
## 1  IGLV1-51     5.756022    -5.964370 11.720392
## 2     IGHG3     7.840361    -3.124648 10.965009
## 3   IGHV3-7     5.531642    -5.380446 10.912088
## 4  IGHV3-21     6.176464    -4.491587 10.668051
## 5  IGHV1-18     5.732944    -4.912038 10.644982
## 6  IGHV4-34     5.389532    -4.848115 10.237647
## 7  IGHV3-74     5.519295    -4.685615 10.204910
## 8   IGLV3-1     5.999906    -3.833910  9.833816
## 9     IGHG2     5.562514    -3.981531  9.544045
## 10  IGKV1-5     6.717677    -2.617405  9.335082

  #removing the genes that start with "IG" to identify other differentially expressed genes that can be potential treatment targets 
  # Remove genes that start with "IG"
  filtered_genes_noIG <- filtered_genes %>% filter(!str_detect(gene,"^IG"))
  
  #subset the expression matrix for genes of interest
  # Keep only the first matching row for each gene
  top5_genes <- head(filtered_genes_noIG, 5)
  expr_subset <- exprs_voom[rownames(exprs_voom) %in% top5_genes$gene, ]
  
  # Step 3: Convert to long format
  df_long <- as.data.frame(expr_subset) %>%
    rownames_to_column(var = "gene") %>%
    pivot_longer(-gene, names_to = "sample", values_to = "expression")
  
  # Step 4: Add sample metadata
  df_long <- df_long %>%
    left_join(targets, by = "sample")
  
  fc_table <- df_long %>%
    group_by(gene, disease) %>%
    summarise(mean_expr = mean(expression), .groups = "drop") %>%
    pivot_wider(names_from = disease, values_from = mean_expr) %>%
    mutate(log2FC = cutaneous - control,
           FC = round(2^log2FC)) %>%
    arrange(desc(FC))  # Optional: sort
  
  fc_labels <- paste0("FC ", fc_table$FC)
  
  label_df <- df_long %>%
    filter(disease == "cutaneous") %>%
    group_by(gene) %>%
    summarise(y = max(expression) + 0.5) %>%
    mutate(label = fc_labels)
  
  pd <- position_dodge(width = 0.75)
  ggplot(df_long, aes(x = gene, y = expression, fill = disease)) +
    # Boxplot with consistent width and dodge
    geom_boxplot(position = pd, outlier.shape = NA, alpha = 0.8, width = 0.5, color = "black") +
    
    # Jitter points, aligned using jitterdodge
    geom_jitter(aes(color = disease),
                size = 1.8, alpha = 0.7,
                position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.75)) +
    
    # FC labels above the boxes
    geom_text(data = label_df, aes(x = gene, y = y, label = label),
              inherit.aes = FALSE,
              vjust = -0.5, size = 4.5, fontface = "bold") +
    
    # Color & fill mapping
    scale_fill_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
    scale_color_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
    
    # Labels and theme settings
    labs(y = "CPM (log2 scale)", x = NULL) +
    geom_hline(yintercept = 0, linetype = "dotted", color = "gray30") +
    theme_minimal(base_size = 14) +
    theme(
      legend.position = "right",
      legend.title = element_blank(),
      axis.text.x = element_text(face = "italic", size = 13),
      panel.grid.major.x = element_blank(),
      panel.grid.minor = element_blank(),
      axis.title.y = element_text(size = 14, face = "bold")
    ) +
    labs( title = "Top 5 most variable gene expression in CL vs. HS")

expr_matrix <- v.DEGList.filtered.norm$E
# Define list of known cytotoxicity genes
    cytotoxic_genes <- c("GZMA", "GZMB", "PRF1", "NKG7", "GNLY", 
                        "CD8A", "CD8B", "FASLG", "IFNG", "TNF", 
                         "KLRK1", "CASP3", "BID", "IL2", "TRAIL","IL1B")
  
# Subset those present in your filtered gene list
    cytotoxic_targets <- filtered_genes_noIG %>%
                    filter(gene %in% cytotoxic_genes)

# View them sorted by logFC (descending)
    cytotoxic_targets <- cytotoxic_targets %>%
           arrange(desc(logFC))
    top5_cytotoxic_genes <- cytotoxic_targets %>%
           arrange(desc(logFC)) %>%
           slice(1:5) %>%
           pull(gene)

    expr_subset_cyto <- expr_matrix[rownames(expr_matrix) %in% top5_cytotoxic_genes, ] 
  
    df_long_cyto <- as.data.frame(expr_subset_cyto) %>%
           rownames_to_column(var = "gene") %>%
           pivot_longer(-gene, names_to = "sample", values_to = "expression")
  
    # Step 4: Add sample metadata
    df_long_cyto <- df_long_cyto %>%
      left_join(targets, by = "sample")
    
    # Calculate mean expression per gene per group
    fc_table_cyto <- df_long_cyto %>%
           group_by(gene, disease) %>%
           summarise(mean_expr = mean(expression), .groups = "drop") %>%
           pivot_wider(names_from = disease, values_from = mean_expr) %>%
           mutate(log2FC = cutaneous - control,
                               FC = round(2^log2FC)) %>%
           arrange(desc(FC))  # Optional: sort
    
    fc_labels_cyto <- paste0("FC ", fc_table_cyto$FC)
    label_df_cyto <- df_long_cyto %>%
           filter(disease == "cutaneous") %>%
           group_by(gene) %>%
           summarise(y = max(expression) + 0.5) %>%
           mutate(label = fc_labels_cyto)
  
    pd <- position_dodge(width = 0.75)
    ggplot(df_long_cyto, aes(x = gene, y = expression, fill = disease)) +
          # Boxplot with consistent width and dodge
           geom_boxplot(position = pd, outlier.shape = NA, alpha = 0.8, width = 0.5, color = "black") +
           
           # Jitter points, aligned using jitterdodge
           geom_jitter(aes(color = disease),
                       size = 1.8, alpha = 0.7,
                      position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.75)) +
           
           # FC labels above the boxes
           geom_text(data = label_df_cyto, aes(x = gene, y = y, label = label),
                                     inherit.aes = FALSE,
                                     vjust = -0.5, size = 4.5, fontface = "bold") +
           
           # Color & fill mapping
           scale_fill_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
           scale_color_manual(values = c("control" = "gray30", "cutaneous" = "red")) +
           
           # Labels and theme settings
           labs(y = "CPM (log2 scale)", x = NULL) +
           geom_hline(yintercept = 0, linetype = "dotted", color = "gray30") +
           theme_minimal(base_size = 14) +
           theme(
                 legend.position = "right",
                 legend.title = element_blank(),
               axis.text.x = element_text(face = "italic", size = 13),
                 panel.grid.major.x = element_blank(),
                 panel.grid.minor = element_blank(),
                 axis.title.y = element_text(size = 14, face = "bold")
             )+
    labs( title = "Top 5 most variable cytotoxic genes expressed in CL vs. HS")

# Identify columns to be extracted from ARCHS4 database
my.sample.locations1 <- which(all.samples.human %in% mySamples[8:28]) # first time you've seen the %in% operator.

# extract gene symbols from the metadata
genes <- h5read(archs4.human, "meta/genes/symbol")

# Extract expression data from ARCHS4 ----
expression1 <- h5read(archs4.human, "data/expression",
                     index=list(my.sample.locations1, NULL))
# transpose to get genes as rows and samples as columns
expression1 <- t(expression1)

rownames(expression1) <- genes
colnames(expression1) <- all.samples.human[my.sample.locations1]
colSums(expression1) #this shows the sequencing depth for each of the samples you've extracted

## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543 
##   50779924   35415106  101406579   18945879   66455004   84689579   40410575 
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550 
##   73600594   74039897   39497323   44325744   52241136   88109937   89112596 
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557 
##   84331463   64685799   56634215   56572087   69779095   38525553   78194883

archs4.dgelist_outcomes <- DGEList(expression1)
archs4.cpm_outcomes <- cpm(archs4.dgelist_outcomes)
colSums(archs4.cpm_outcomes)

## GSM3639537 GSM3639538 GSM3639539 GSM3639540 GSM3639541 GSM3639542 GSM3639543 
##      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06 
## GSM3639544 GSM3639545 GSM3639546 GSM3639547 GSM3639548 GSM3639549 GSM3639550 
##      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06 
## GSM3639551 GSM3639552 GSM3639553 GSM3639554 GSM3639555 GSM3639556 GSM3639557 
##      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06      1e+06

# Filter and normalize the extracted data ----
table(rowSums(archs4.dgelist_outcomes$counts==0)==21)

## 
## FALSE  TRUE 
## 54659 12527

keepers1 <- rowSums(archs4.cpm_outcomes>1)>=7
archs4.dgelist.filtered_outcomes <- archs4.dgelist_outcomes[keepers1,]
dim(archs4.dgelist.filtered_outcomes)

## [1] 19963    21

archs4.dgelist.filtered.norm_outcomes <- calcNormFactors(archs4.dgelist.filtered_outcomes, method = "TMM")

archs4.filtered.norm.log2.cpm_outcomes <- cpm(archs4.dgelist.filtered.norm_outcomes, log=TRUE)

# Extract sample metadata from ARCHS4 to create a study design file ----
# extract the sample source
sample_source_name <- h5read(archs4.human, "meta/samples/source_name_ch1")
# extract sample title
sample_title <- h5read(archs4.human, name="meta/samples/title")
# extract sample characteristics
sample_characteristics<- h5read(archs4.human, name="meta/samples/characteristics_ch1")

# let's try putting this all together in a study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
                      Sample_source = sample_source_name[my.sample.locations],
                      Sample_characteristics = sample_characteristics[my.sample.locations])

#based on what we extracted from ARCHS4 above, lets customize and clean-up this study design file
studyDesign <- tibble(Sample_title = sample_title[my.sample.locations],
                      genotype = rep(c("HS", "CL"), times= c(7,21)),
                      treatment = rep(c("NInf", "Inf"), times= c(7,21)))

#capture experimental variables as factors from this study design
genotype <- factor(studyDesign$genotype)
treatment <- factor(studyDesign$treatment)
sampleName <- studyDesign$Sample_title

targets.onlypatients <- targets[8:28,]

#failure vs cure as a factor 
outcome <- factor(targets.onlypatients$Treat..outcome)

#preparing a model matrix for CL vs HS 
design1 <- model.matrix(~0 + outcome)
colnames.design1 <- levels(outcome)

# Model mean-variance trend and fit linear model to data ----
# Use VOOM function from Limma package to model the mean-variance relationship
v.DEGList.filtered.norm_outcomes <- voom(archs4.dgelist.filtered.norm_outcomes, design1, plot = TRUE)

# fit a linear model to your data
fit1 <- lmFit(v.DEGList.filtered.norm_outcomes, design1)

# Contrast matrix ----
contrast.matrix.outcomes <- makeContrasts("failure vs. cure"= outcomefailure - outcomecure,
                                         levels=design1)

# extract the linear model fit -----
fits1 <- contrasts.fit(fit1, contrast.matrix.outcomes)
#get bayesian stats for your linear model fit
ebFit1 <- eBayes(fits1)
#write.fit(ebFit, file="lmfit_results.txt")

# TopTable to view DEGs -----
myTopHitsOutcomes <- topTable(ebFit1, adjust ="BH", coef=1, number=40000, sort.by="logFC")

# convert to a tibble
myTopHitsOutcomes.df <- myTopHitsOutcomes %>%
  as_tibble(rownames = "geneID")

# Add p-value significance categories
myTopHitsOutcomes_sig.df <- subset(myTopHitsOutcomes.df, P.Value < 0.05)
sigfailure <- myTopHitsOutcomes_sig.df$ID

myTopHitsOutcomes_sig.df$FC <- 2^(myTopHitsOutcomes_sig.df$logFC)
myTopHitsOutcomes_sig.df$neglog10P <- -log10(myTopHitsOutcomes_sig.df$P.Value)

cytotoxic_subset <- subset(myTopHitsOutcomes_sig.df, ID %in% cytotoxic_genes)

# View as table
# Display as interactive table
datatable(cytotoxic_subset,
          options = list(pageLength = 10, scrollX = TRUE),
          rownames = FALSE)

# Make sure this is the full DEG result (not just p < 0.05)
# myTopHitsOutcomes.df must contain columns: ID, FC, P.Value
myTopHitsOutcomes.df$neglog10P <- -log10(myTopHitsOutcomes.df$P.Value)
myTopHitsOutcomes.df$FC <- 2^(myTopHitsOutcomes.df$logFC)

library(ggplot2)

signature <- ggplot(myTopHitsOutcomes.df, aes(x = FC, y = neglog10P)) +
  # Plot all genes (gray)
  geom_point(color = "gray70") +
  
  # Highlight cytotoxic genes with p < 0.05 in blue
  geom_point(data = subset(myTopHitsOutcomes_sig.df, ID %in% cytotoxic_genes),
             aes(x = FC, y = neglog10P), color = "blue", size = 3) +
  
  # Label the cytotoxic genes
  geom_text(data = subset(myTopHitsOutcomes_sig.df, ID %in% cytotoxic_genes),
            aes(label = ID), color = "red", vjust = -1, size = 3.5) +
  
  # Axes limits and p-value threshold lines
  xlim(0, 4) + ylim(0, 4) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  
  labs(
    x = "Fold Change (Failure vs. Cure)",
    y = "-log10(P.Value)",
    title = "Volcano Plot Highlighting Cytotoxic Genes (P < 0.05)"
  ) +
  theme_minimal(base_size = 14)

ggplotly(signature)

Conclusions

Describe the results in your own words. Some things to think about:

What are the key takeaways from the analysis?

Distinct transcriptional signatures differentiate HS and CL lesions.
Within CL, gene expression profiles correlate with treatment outcomes (cure vs. failure).
Cytotoxic genes (e.g., GZMB, PRF1) are highly expressed in failure cases.
Pathway analyses reveal immune activation and suppression patterns relevant to disease state.

What types of analyses would you want to do next?

Perform single-cell RNA-seq or deconvolution to resolve cell-type contributions.
Explore longitudinal changes in gene expression over treatment time points.
Apply machine learning to classify responders vs. non-responders.

Based on your analysis, are there any wet-lab experiments would might priortize?

Validate cytotoxic gene expression (GZMB, GNLY, PRF1) using qPCR or flow cytometry.
Use immunohistochemistry to confirm spatial localization of effector T cells.
Knockdown or overexpression studies in relevant cell lines to test function.
Cytokine assays to test IFNG-driven responses in patient-derived cells.

Session info

The output from running ‘sessionInfo’ is shown below and details all packages and version necessary to reproduce the results in this report.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.