Integrated Clustering (n = 3) | Larval Testis X Inactivation

Integrated Clustering (n = 3; Seurat v3)

After testing both Seurat and Monocle (v2 and v3), I have decided to move forward with Seurat v3. All method give similar clusters, but Seurat v3 provides more advanced data integration and differential expression testing. Here I describe the integration of replicates 1, 2, and 3. This is followed by clustering of cells and annotation of cell type based on expression of known marker genes.

Individual Replicate Clustering

I started by clustering each replicate separately. This gives me an idea of overall data structure and a feeling for how many clusters to expect. Each replicate had very similar clustering patterns, with very distinct germline and somatic clusters. See individual clustering documentation for more details.

**Figure 1. Individual clusters from three replicates (Testis 1, 2, and 3 respectively).**

Integration and Clustering

Replicates are not common in scRNA-Seq. We could cluster each replicate separately and then map clusters between replicates. Or we can combine cells from all replicates and cluster once. The later would incorporate more data into the clustering callings process, which may improve overall clustering results.

Here we combine replicates and then cluster. First, we remove replicate batch effects. To do this, Seurat provides an alignment method based on canonical correlation analysis (CCA) which tries to improve correlation among samples. They have made some slight improvements over Seurat v2. I performed CCA in 30 dimensional space and selected the first 9 dimensions to run KNN clustering. For community detection (Louvain algorithm) I used a resolution of 0.3 which gives 10 clusters.

The 2D projections using uniform manifold approximation and projection (UMAP) shows a similar structure as individual replicates (Figure 1, 2, 3). Each replicate is represented in each cluster (Figure 2A, 3) and clusters range from 853 (9) to 3,217 (0) cells (Table 1). The contribution of cells to each cluster is based on the proportion of cells from each replicate. With clusters (0, 1, 2) having proportionally more cells from replicate 3 than other clusters.

Figure 2. UMAP projection of integrated clusters. Each cell (n = 18,792) is plotted in UMAP space. Colors represent either different replciates (panel A) or different cell types after cluster calling using community detection (panel B). Cluster numbering is random. — **Figure 2. UMAP projection of integrated clusters.** Each cell (n = 18,792) is plotted in UMAP space. Colors represent either different replciates (panel A) or different cell types after cluster calling using community detection (panel B). Cluster numbering is random.

Figure 3. UMAP projection of integrated clusters split by replicate. Cells from each replicate are plotted separately in UMAP space. Colors represent different cell types after clustering. — **Figure 3. UMAP projection of integrated clusters split by replicate.** Cells from each replicate are plotted separately in UMAP space. Colors represent different cell types after clustering.

Table 1. Number of cells per cluster.

Cluster	Num. Cells Per Cluster
0	3,217
1	3,149
2	2,652
3	2,388
4	1,510
5	1,466
6	1,401
7	1,376
8	870
9	853

Figure 4. Number of Cells per Replicate per Cluster. The number of cells from each replicate are plotted by cluster. — **Figure 4. Number of Cells per Replicate per Cluster.** The number of cells from each replicate are plotted by cluster.

Cluster Annotation

To annotate clusters I used a set of 73 marker genes from the literature (Figure 5). Based on patterns of gene expression, we identified 5 germ cell clusters and 5 somatic cell clusters.

Spermatogonia (SP: cluster 9)
Early Primary Spermatocytes (EPS: cluster 5)
Primary Spermatocytes (PS1: cluster 6)
Primary Spermatocytes (PS2: cluster 2)
Primary Spermatocytes (PS3: cluster 0)
Early Cyst Cells (ECY: cluster 4)
Cyst Cells (CY1: cluster 1)
Cyst Cells (CY2: cluster 3)
Terminal Epithelium (TE: cluster 7)
Pigment Cells (PC: cluster 8)

We are able to separate the germ cell lineage into spermatogonia (SP), early primary spermatocytes (EPS), and later primary spermatocytes (PS1, PS2, PS3). The ordering of the later primary spermatocytes is not clear. Similarly, in the somatic cyst cell lineage we identified an early cyst cell (ECY) based on tj expression and two later cyst cell types (CY1 and CY2) with CY2 showing the highest expression of eya.

Figure 5. Expression patterns of marker genes from the literature by putative cell type. We calculated z-scores for each gene by replicate by cluster after TPM normalization. We plot 73 marker genes from the literature as rows and cluster * replicate as columns. Each marker gene is known to describe a specific cell type (labeled on the right). We used these expression patterns to annotate and order clusters. White dashed lines separate clusters (columns) and gene groups (rows). — **Figure 5. Expression patterns of marker genes from the literature by putative cell type.** We calculated z-scores for each gene by replicate by cluster after TPM normalization. We plot 73 marker genes from the literature as rows and cluster * replicate as columns. Each marker gene is known to describe a specific cell type (labeled on the right). We used these expression patterns to annotate and order clusters. White dashed lines separate clusters (columns) and gene groups (rows).

Figure 6. UMAP projection after annotation. Each cluster was annotated with a putative cell type based on marker gene expression. — **Figure 6. UMAP projection after annotation.** Each cluster was annotated with a putative cell type based on marker gene expression.

To give a sense of the specificity and localization I plot the various marker genes below. On the first tab is a Zoom view of the genes I focused on when making annotation decisions. On the second tab (UMAP) I plot all marker genes for a given cell type. On the third tab (Violin) I plot the cell distribution for each gene in each cluster.

Spermatogonia (Cluster 9)

UMAP (Zoom)

UMAP

Violin

Spermatocytes (Clusters 0, 2, 5, 6)

UMAP (Zoom)

UMAP

Violin

CySC (Clusters 1, 3, 4)

UMAP (Zoom)

UMAP

Violin

TE (Cluster 7)

UMAP

Violin

Pigmet Cells (Cluster 8)

UMAP

Violin

Hub (??)

UMAP

Violin

Differential Expression

Next I look for differentially expressed genes among clusters. I use a One vs Rest approach to get a list of genes (biomarkers) that are up regulated in each cluster. Some biomarkers are up regulated in a single cluster (Figure 7) while others were up regulated in multiple cluster (Figure 8). Interestingly, some clusters of the germline and somatic cyst cell lineage have very few uniquely up regulated genes (PS2, PS3, and CY1), while the remaining clusters have more than 100. Perhaps these cells types are intermediates in the developmental program. Marker genes expressed in multiple cell types tend to be found within the same lineage, with only 367 genes up regulated in the germline and soma.

Figure 8. Biomarkers up regulated in multiple clusters. Using differential expression we identified a set of genes that are significantly up regulated in multiple clusters. We group genes that were up regulated in the germ line lineage only (Germ only), in the somatic lineage only (Soma only) and in both germ and somatic cells (Germ and Soma). Within gene group we used hierarchical clustering to sort genes. White dashed lines separate clusters (columns) and gene groups (rows). — **Figure 8. Biomarkers up regulated in multiple clusters.** Using differential expression we identified a set of genes that are significantly up regulated in multiple clusters. We group genes that were up regulated in the germ line lineage only (Germ only), in the somatic lineage only (Soma only) and in both germ and somatic cells (Germ and Soma). Within gene group we used hierarchical clustering to sort genes. White dashed lines separate clusters (columns) and gene groups (rows).

Biomarkers on the Literature Gene List

Using the biomakers list I looked to see what fraction of genes overlap our literature gene list. Every putative cell type had at least a few genes from the literature gene list. CY1 and CY2 had the lowest representation of literature genes on the biomaker list (7% and 19% respectively). TE and PC had the best representation (100%), but also had the smallest number of genes. As expected, the EPS cluster had the early tTAF (aly, can), while the later staged germ cells PS1, PS2, and PS3 had the later stage tTAFs (dj, ocn, fzo).

43% Overlap

		Literature Gene
		False	True
Biomarker	False	16,528	9
Biomarker	True	1,227	7

Lit Gene In Biomarker List: bam, Rbp4, Phf7, p53, AGO3, Prosalpha6, vas
Not In Biomarker List: esg, neur, nos, bgcn, spi, Dad, fest, dpr17, tut

EPS

43% Overlap

		Literature Gene
		False	True
Biomarker	False	16,626	13
Biomarker	True	1,122	10

Lit Gene In Biomarker List: aub, sa, aly, can, Taf12L, tomb, kmg, CG3927, tbrd-1, nht
Not In Biomarker List: CycA, CycB, Mst87F, bol, fzo, mia, dj, Reepl1, oys, topi, d-cup, bb8, ocn

PS1

43% Overlap

		Literature Gene
		False	True
Biomarker	False	16,058	13
Biomarker	True	1,690	10

Lit Gene In Biomarker List: CycB, Mst87F, bol, fzo, dj, Reepl1, CG3927, d-cup, bb8, ocn
Not In Biomarker List: aub, CycA, sa, aly, can, mia, Taf12L, tomb, kmg, oys, topi, tbrd-1, nht

PS2

47% Overlap

		Literature Gene
		False	True
Biomarker	False	16,295	12
Biomarker	True	1,453	11

Lit Gene In Biomarker List: CycB, Mst87F, bol, fzo, dj, Reepl1, Taf12L, CG3927, d-cup, bb8, ocn
Not In Biomarker List: aub, CycA, sa, aly, can, mia, tomb, kmg, oys, topi, tbrd-1, nht

PS3

21% Overlap

		Literature Gene
		False	True
Biomarker	False	17,370	18
Biomarker	True	378	5

Lit Gene In Biomarker List: Mst87F, dj, Taf12L, CG3927, ocn
Not In Biomarker List: aub, CycA, CycB, sa, aly, bol, can, fzo, mia, Reepl1, tomb, kmg, oys, topi, d-cup, bb8, tbrd-1, nht

EYC

36% Overlap

		Literature Gene
		False	True
Biomarker	False	17,083	17
Biomarker	True	662	9

Lit Gene In Biomarker List: tj, bnb, ImpL2, vn, Nrt, piwi, Wnt4, fax, kek1
Not In Biomarker List: eya, EcR, robo2, sev, so, usp, zfh1, fng, gbb, spict, sano, Cht5, Efa6, Nlg3, rdo, puc, br

CY1

7% Overlap

		Literature Gene
		False	True
Biomarker	False	17,495	24
Biomarker	True	250	2

Lit Gene In Biomarker List: bnb, fax
Not In Biomarker List: eya, EcR, tj, ImpL2, robo2, sev, so, usp, vn, Nrt, zfh1, piwi, Wnt4, fng, kek1, gbb, spict, sano, Cht5, Efa6, Nlg3, rdo, puc, br

CY2

19% Overlap

		Literature Gene
		False	True
Biomarker	False	16,929	21
Biomarker	True	816	5

Lit Gene In Biomarker List: bnb, fax, kek1, Nlg3, rdo
Not In Biomarker List: eya, EcR, tj, ImpL2, robo2, sev, so, usp, vn, Nrt, zfh1, piwi, Wnt4, fng, gbb, spict, sano, Cht5, Efa6, puc, br

100% Overlap

		Literature Gene
		False	True
Biomarker	False	17,147	0
Biomarker	True	618	6

Lit Gene In Biomarker List: abd-A, Abd-B, cv-2, N, nord, Piezo

100% Overlap

		Literature Gene
		False	True
Biomarker	False	17,054	0
Biomarker	True	715	2

Lit Gene In Biomarker List: ems, Sox100B

Potentially Miss Expressed Literature Genes

Next I wanted to look for genes that appear as a different marker on the literature gene list. This is complicated to look at because many of the markers are in multiple cell types. I only include clusters that had marker gene mapping to another cell type. There are only a few instances of a germline marker gene being found on the biomarker list of a somatic cell type. It is hard to know if this is because of cell contamination or if the gene is expressed in that cell type.

aub (ES, MS)
Taf12L (ES, MS, LS)

EPS

AGO3 (SP)

CY2

cv-2 (TE)
Dad (SP)
Prosalpha6 (ES)

ImpL2 (C, EC)
robo2 (H, C)
zfh1 (C, EC)
spi (SP, ES)
fax (C, EC, MC)
kek1 (C, EC, MC, LC)
Dad (SP)
gbb (H, C, EC)
Socs36E (H, C)
puc (C, EC)
br (C, EC)

abd-A (H, C, EC)
EcR (C, EC)
robo2 (H, C)
vn (C, EC)
zfh1 (C, EC)
N (H, C, TE)
spi (SP, ES)
fax (C, EC, MC)
kek1 (C, EC, MC, LC)
spict (EC)
Socs36E (H, C)
puc (C, EC)
br (C, EC)

Note abbreviations in parentheses are based on Miriam’s literature gene table.

Biomarker Gene Ontology

To get a better understanding of the underlying biological processes driving the differences among clusters I looked at gene ontology. SP and EPS have genes associated with transcription and translation. EPS was also significant for chromatin organization. PS1, PS2, and P3 show genes associated with sperm development. With genes up regulated in PS3 almost entirely associated with sperm production. The cyst cell lineage (ECY, CY1, and CY2) show evidence of genes associated with translation. The TE and PC clusters show genes associated with morphogenesis and growth.