Chapter 4 Overview of Omics Data Types

4.1 Gene expression (bulk RNA-seq)

On the day of sample collection, typically the shipment day (see section 3.5 above), 150 islets are hand picked and transferred to 1 ml of Trizol reagent. After vortexing, the sample is stored at -80°C until shipping or RNAseq at either Oxford or Stanford.

At destination, RNA is extracted from Trizol as per manufacturer instructions and concentration and quality is assessed by NanoDrop. RNA integrity is further checked by Bioanalyzer (RNA 6000 chips) and 20 uL of RNA samples at 20 ng/uL minimum is submitted to for library prep and sequencing to Novogene, on NovaSeq 6000 PE150 illumina platform, at 20M paired end reads per sample (6 G of raw data). After quality control via FastqQC (v0.11.9), the raw reads were aligned to the human genome reference (GRCh38) using STAR (v2.7.9a) with the gene annotation downloaded from the ENSEMBL database (v110). Gene expression levels were counted using featureCounts (v2.0.3). Raw counts were normalized using the TPM method to adjust sequencing depth.

Ensembl gene IDs were mapped to Entrez IDs and counts from any duplicated Entrez IDs were summed. Transcripts with over 80% zeros were filtered out. The RNAseq dataset contains seven batches, and significant batch effects were observed. Batch effects were corrected for using the Combat-Seq function in the sva R package (version 3.44.0), which preserves the count nature of the data. Counts were converted to log-counts-per-million (logCPM) using the relative log expression (RLE) method in the edgeR R package (version 3.38.4). Transcripts with logCPM variance in the lowest 20th percentile were filtered out. Since the sequencing depth was different across different batches, a substantial number of transcripts had zero counts in some batches and low but robust counts in other batches. These batch-related drop-out patterns led to large numbers of significant differentially expressed features that disappeared if individual batches were held out. To address this problem, counts with a value of zero or one were replaced with NAs, excluding these individual values from all downstream statistical analyses and leading to more stable differential expression analysis results.

The raw RNA sequencing data has been deposited in the European Genome-phenome Archive (EGA) under study number EGAS00001007241.

4.2 Gene expression (Nanostring)

On the day of sample collection (see section 3.5, above) 50 islets are hand-picked at the ADI IsletCore, washed with PBS, then lysed in 100ul of RNeasy Lysis Buffer (RLT, Qiagen) containing 1% beta-mercaptoethanol. Samples are stored at -80oC before running on Nanostring gene expression assay at BC Children Hospital (Vancouver). The assay was updated partway through data collection, where a small number of genes were removed and replaced with others, resulting in two different gene codesets. Briefly, 1ul cell lysate is used for hybridization with customized gene code sets that target 142 pancreatic islet genes and 6 reference genes. Hybridized complex is then loaded on cartridge and Nanostring nCounter to measure the target gene expression. The raw data is analyzed with nSolver4.0 software from Nanostring, producing a table of intensity values where larger values correspond to higher gene expression values. Full details are found here.

Data Processing: Significant batch effects were present across codesets, and so transcripts measured in both Nanostring arrays were batch corrected using the Combat function from the sva R package (version 3.44.0). Gene expression intensities were normalized by sample median (each value divided by the median intensity of all genes for that donor) and log2-transformed. Given the much smaller number of transcripts compared to the RNAseq data, no abundance or variance filters were applied.

4.3 Protein expression (proteomics)

Human islets are shipped to the University of British Columbia, hand-picked and cultured in RPMI for 24-72hrs before the experiment to allow the islets to recover from shipping. After recovery, 300 islets are hand-picked, washed with PBS and the pellet is snap frozen and stored at -80oC. LC-MS/MS analysis is performed with n = 3 technical replicates, using a NanoElute UHPLC system (Bruker Daltonics) with Aurora Series Gen2 (CSI) analytical column coupled to a Trapped Ion Mobility – Time of Flight mass spectrometer (timsTOF Pro; Bruker Daltonics, Germany) operated in Data-Independent Acquisition - Parallel Accumulation- Serial Fragmentation (DIA-PASEF) mode. Details of lysis and analysis are found here.

Protein abundances were filtered to remove any that had more than 50% missing values, median normalized, and log2 transformed. Missing values were imputed with the missForest method (random forest-based) using the imp4p R package (version 1.2.0). Uniprot IDs were converted to Entrez IDs, and rows with duplicate IDs were aggregated by taking the mean. Proteins with variance in the lowest 20th percentile were filtered out.

Raw proteomics data has been deposited to ProteomExchange via MASSive under dataset number PXD045422.

4.4 Gene expression (pseudobulk RNA-seq)

The single-cell counts from patch-seq data (section 3.5, below) were summarized at the donor by separating counts by cell type (alpha and beta). Only cells from donors with at least five cells were retained. Pseudobulk profiles were created by summing counts from all cells derived from the same donor. Ensembl gene IDs were mapped to Entrez IDs and counts from any duplicated Entrez IDs were summed. Each matrix was filtered to remove transcripts with greater than 80% zeros. Counts were converted to logCPM values using the TMM method from the edgeR R package (version 3.38.4). Due to the large variations in sequencing depth across batches, individual transcripts that had zero or one count were replaced with NA values.

4.5 Single-cell gene expression (patch-seq)

The collection and analysis of pancreas patch-seq data is described here and here. Reads from FASTQ files were trimmed using TrimGalore version 0.6.5 with auto-detection of adapter sequence, then aligned and counted using STAR version 2.7.9a to the GRCh38 human genome (Ensembl release 104). Cell count matrices were loaded into R and a Seurat object was generated using the Seurat R package (version 4.3.0). Only cells with patch-clamp data were included; data for additional cells are available from the listed accession numbers below. Cell types were identified by projection, using the MapQuery command from the R package Seurat, onto a large set of human islet single cell RNAseq data generated from publically available data (see below). Cells were filtered to remove any that were not alpha or beta cells, had fewer than 500 total counts across all transcripts, had fewer than 300 unique transcripts with non-zero count values, or had greater than 20% mitochondrial DNA. Ensembl gene IDs were mapped to Entrez IDs and counts from any duplicated Entrez IDs were summed. Counts were separated by cell type (alpha and beta) and glucose concentration (1, 5, and 10mM), resulting in six different matrices. Each matrix was filtered to remove transcripts with greater than 80% zeros. Counts were converted to logCPM values using the trimmed mean of M-values (TMM) method from the edgeR R package (version 3.38.4). TMM has been shown to perform better for single-cell RNAseq data compared to the RLE method because it is robust to the high drop-out rate. Individual transcripts that had zero counts were replaced with NA.

Raw sequencing reads are available in the NCBI Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) under accession numbers GSE124742 and GSE164875 and GSE270484 at PancDB (https://hpap.pmacs.upenn.edu/).

4.6 Single-cell RNA-seq

Reads from FASTQ files were trimmed using TrimGalore version 0.6.5 with auto-detection of adapter sequence, then aligned to the GRCh38 human genome, Ensembl release 104, and counted using STAR version 2.7.9a except for files from GSE84133, which were aligned and counted using the inDrops protocol; files from GSE81076 and GSE85241, which were aligned and counted using the R package scruff (version 1.10.1) with Rsubread (version 1.22.2); and files from GSE269204, which were aligned and counted with Cell Ranger (10X Genomics Cell Ranger v6.1.2). Cell count matrices were loaded into R and objects generated using the R package Seurat (version 4.3.0), restricted to transcripts detected in more than 10 cells and cells with between 200 and 12,500 unique transcripts with non-zero count values; cells with percent mitochondrial transcripts in the top 25% for each dataset were also dropped. Transcript names were maintained as Ensembl identifiers to preserve the greatest number of unique transcripts. After combining all donors, transcripts expressed in fewer than 0.1% of cells were filtered. Donor HPAP-027 was dropped from the data set because there were very few cells (32 cells). The reciprocal PCA integration with references (HPAP-037, HPAP-040; HPAP-045, HPAP-050, HPAP-052, and HPAP-072) workflow from the R package Seurat (version 4.3.0) was followed, with k.anchor = 5, then FindNeighbors and FindCluster (resolution = 1.2). Cell types were first manually annotated based on the unambiguous, high level expression of marker genes in distinct clusters in UMAP space, except cytotoxic T cells, which did not form a distinct cluster but had restricted expression of TRAC. The remaining ambiguous cells were annotated using the MapQuery function by projecting unknown cells onto the manually annotated cells. Although these data include cells from donors with type 1 diabetes and type 2 diabetes, only cells from donors with no diabetes (ND) are visualised in the web tool for cell-type localization of called features, and are not downloadable .

Data are from the NCBI Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) under accession numbers GSE81076, GSE81547, GSE81608, GSE83139, GSE84133, GSE85241, GSE86469, GSE154126; from the European Molecular Biology Laboratory European Biology Institute (EMBL-EBI) under accession number E-MTAB-5061; from the Human Pancreas Analysis Program at PancDB up to donor HPAP-109, and from unpublished data (GSE269204).