Bi-Specific SL DDR shRNA Libraries – Project Details

RNAi Synthetic Lethal Screen in Cancer Cell Models

PI: Alex Chenchik
Company: Cellecta, Inc.
SBIR Topic 290 – Phase II
Contract No. HHSN261201200065C
Period Covered: 09/28/2012-09/27/2014

Background & Significance

DNA repair is crucial to an organism’s ability to maintain its genome integrity and fidelity. This task is particularly daunting due to constant assault on the DNA by genotoxic agents, nucleotide misincorporation during DNA replication, and the intrinsic biochemical instability of the DNA itself. Failure to repair DNA lesions may result in blockages of transcription and replication, mutagenesis, and/or cellular cytotoxicity. Correspondingly, DNA damage signaling and checkpoint control pathways are amongst the most commonly mutated networks in cancer. Because cancer has at its origin DNA mutation, it is by definition a disease of DNA repair. Yet for cancer cells to successfully proliferate, multiple DNA repair pathways are required. To overcome this difficulty, cancer cells often become addicted to DNA repair pathways and differentially vulnerable to targeted attack on individual gene expression within their DNA repair networks, especially those pathways that are closely associated with DNA replication. The therapeutic exploitation of the overreliance of a cancer cell on a specific DNA repair pathway is based on the concept of synthetic lethality. Two genes are synthetic lethal (SL) if mutation of either gene alone is compatible with viability but mutation of both leads to cell death or sickness. For example, BRCA1- and BRCA2-deficient breast and ovarian cancer cells display a deficiency in the homologous recombination repair mechanism for double-strand breaks (DSBs), which results in increased genomic instability. Treatment of BRCA1- or BRCA2-deficient cells with PARP1 inhibitors induces DSBs and dramatically increases loss of viability due to loss of homologous recombination repair in these cells. In addition, synthetic lethal interactions have also been demonstrated for cancer cells with loss of p53 and the protein kinases MK2, ATM, and Chk2, which are involved in regulation of the DDR and G2/M cell cycle checkpoint control, in the context of DNA-damaging chemotherapy.
Unfortunately, such lethal combinations are generally not obvious or easily predicted a priori based on in silico analysis.

Rationale.

High-throughput loss-of-function genetic screens in mammalian cells using RNA interference are ideally suited to expediently explore the molecular basis of cancer development and/or progression. To this end, we have performed systematic RNAi screens aimed at unambiguously delineating synthetic lethal gene pairs using a panel of reference cancer cell lines, in order to establish the generality of synthetic lethality for DNA-repair-related genes by ablating gene pairs en masse.

Specific Aims & Objectives of the SBIR Phase II Research Contract.

The primary objective of the 290 SBIR contract project was to develop a human pooled SL shRNA library targeting all annotated DDR genes (approximately 360) and to validate it functionally and experimentally in reference cancer cell models. In addition, this project required the development of supporting tools, including a public SL DDR database, protocols, reagents, and software tools. The text below describes the studies and results, organized by the specific tasks of the contract. The following specific aims/milestones were pursued and successfully completed:

Aim 1. Construction of a 185K bi-specific DDR shRNA library.

Development and validation of bi-specific SL lentiviral vector.

As a first step in the construction of the 185K bi-specific shRNA library, we developed and validated a set of three novel SL vectors that could provide equal expression of two shRNAs from combinations of independent U6M, U6, or H1 promoters. These three novel vectors are derived from the pRSL1 vector employed in the Phase I contract and have an additional 800-bp spacer between the different RNA-polymerase II and III promoters in order to reduce promoter interference:

pRSL18-H1-cPPT-U6-UbiC-TagRFP-2A-Puro,
pRSL16-U6M-cPPT-Spacer-U6-UbiC-TagRFP-2A-Puro, and
pRSL17-U6M-cPPT-U6-Spacer-UbiC-TagRFP-2A-Puro,

wherein U6M, U6, and H1 are different RNA-polymerase III promoters for expression of shRNAs, and Spacer is an approximately 800bp non-functional DNA (tagGFP) fragment. As expected, the set of novel pRSL vectors was characterized by a reduced recombination rate (10-15%) similar to the original pRSL1 vector. The novel set of developed vectors was packaged, transduced in the MDA-231 cell line (selected for SL screens with a pooled DDR shRNA library), and analyzed for tagRFP expression level by FACS. All three SL vector types (pRSL16, pRSL17, and pRSL18) demonstrated similar packaging efficiency, and a high level of reporter (tagRFP) expression.

In order to test knockdown activity from the two independent RNA-polymerase III promoters, we developed, for each pRSL vector, five different dual shRNA constructs comprising “toxic” DDR shRNAs for RAD51 (two different RAD51 shRNAs with low (RAD51L) and high (RAD51H) knockdown activities) and negative control “non-toxic” luciferase (Luc1 and Luc2) shRNAs. The dual shRNA constructs were developed using two-step cloning (Figure 3): in the first step we cloned the shRNA1-shRNA2 cassette under the first (U6M or H1) promoter, and in the second step we cloned the cPPT-Spacer-U6 cassette between shRNA1 and shRNA2. The resulting shRNA constructs, with the structure U6M(H1)-shRNA1-cPPT-Spacer-U6-shRNA2, were validated by sequence analysis and packaged in HEK293 cells. The bi-specific RAD51-Luc shRNA constructs were transduced in the MDA-231 cell line at MOI=1, selected with puro (for 2 days), and analyzed for toxicity level by measuring cell number (staining with propidium iodide) after 3 days of additional growth. As shown in Figure 1, the pRSL16 vector constructs show the highest knockdown activity (general toxicity level) and a similar toxicity level for RAD51H and RAD51L shRNAs expressed from both the first (U6M) and second (U6) promoters. Based on the results of these validation studies, we selected the pRSL16 vector (see map in Figure 2) for the 185K bi-specific DDR shRNA library construction.

 

Bar chart depicting toxicity of Luc-RAD51 and RAD51-Luc shRNA constructs in MDA-231 shows that the pRSL16 vector provides the highest knockdown and similar toxicity for shRNA under either first or second promoter

 

Figure 1. Toxicity of Luc-RAD51 and RAD51-Luc shRNA constructs after transduction in MDA-231 cell line. The height of each bar corresponds to cell number of MDA-231 cells selected with puro and grown for 5 days after transduction step.

pRSL16-U6M3-sh1-cPPT-TagGFP2-U6w-sh2-UbiC-TagRFP-2A-Puro Vector Map for expression of two shRNA from different promoters

Figure 2. Map of pRSL16 vector developed for construction of bi-specific shRNA libraries.

Design of 185K bi-specific DDR shRNA library.

Using the database of DDR genes developed during the Phase I contract studies (Aim 3 studies), we selected a complete set of 360 known human DDR genes, including a set of positive control gene targets with known, well-characterized SL interactions.

Since the initial discovery of RNAi, a number of technical challenges pertaining to small RNA design, specificity, and stability in RNAi-based research have arisen. For example, besides suppressing the intended target gene, synthetic RNAi can evoke off-target effects. Accordingly, with NIH SBIR 5R44HG003355 grant funding, we successfully developed an open-access RNAi-based functional genomics resource (http://www.decipherproject.net/). To maximize target gene knockdown and minimize off-target effects, RNAi guide sequences were selected from experimentally validated hairpin sequences from the DECIPHER Project and subsequently filtered and ranked using a support vector machine (SVM) off-target search classifier developed by Cellecta.

Next, we designed a redundant set of 3 shRNAs targeting all 360 DDR-FIN genes using on-chip oligonucleotide synthesis and a bar-coding strategy for quantitative phenotyping. The selected 360 DDR gene set comprising validated shRNAs included several “positive control” targets with SL interactions that are well characterized in in vitro and clinical studies (PARP, BRCA, RAD51, XRCC2, XRCC6, ATM, and CHK2), as well as one “negative control” luciferase target. For the approximately 65,000 (360 × 360 / 2) possible pairwise combinations of the 360 DDR genes (including 360 “single” DDR-Luc combinations), we designed approximately 185,000 (three sets of ~65,000) double shRNA expression cassettes.
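The pair-count estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the combinations are unordered gene pairs plus one gene-luciferase "single" per gene, which is our reading of the text rather than a stated design rule, and the function name is hypothetical.

```python
from math import comb

N_GENES = 360  # DDR gene targets, as stated in the text


def pair_counts(n_genes: int) -> dict:
    """Count unordered gene-gene pairs plus the gene-Luc 'single' controls."""
    gene_pairs = comb(n_genes, 2)  # distinct unordered gene-gene pairs
    singles = n_genes              # assumption: one gene-Luc combination per gene
    return {
        "gene_pairs": gene_pairs,
        "singles": singles,
        "total": gene_pairs + singles,
        "n_squared_over_2": n_genes * n_genes // 2,  # the 360 x 360 / 2 shortcut
    }


counts = pair_counts(N_GENES)
print(counts)
```

Both the exact count and the 360 × 360 / 2 shortcut land close to the ~65,000 figure quoted above.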

Construction of 185K bi-specific DDR shRNA library.

The set of 185,000 barcoded double shRNA oligonucleotides (shRNA1-Barcode-shRNA2) was synthesized on the surfaces of custom microarrays (Agilent; Santa Clara, CA) through the Agilent Early Technology Access program available to Cellecta. The dual shRNA cassette oligonucleotide pool was amplified by PCR using common flanking sequences and cloned into the lentiviral pRSL16-U6M-cPPT-Spacer-U6-UbiC-TagRFP-2A-Puro vector (Figure 2) using a two-step cloning protocol (Figure 3) developed in the course of the Phase I contract studies. The developed 185K DDR shRNA library incorporates the improved shRNA design (with a destabilized 25-bp stem-loop structure) that was optimized in the preliminary studies for equal representation of the shRNA constructs in the library and for the highest knockdown efficacy in pooled-format screening. Each construct in the library has similar levels of expression of the first shRNA from the U6M promoter and the second shRNA from the U6 promoter (see Subtask 1A studies). Furthermore, each dual shRNA combination is barcoded to facilitate its identification by amplification of the barcodes from the genomic DNA with flanking primers, followed by representational HT sequence analysis using the Illumina (San Diego, CA) platform (HiSeq 2500). The developed 185K DDR shRNA libraries also express an RFP reporter and a puromycin selection marker (separated by a self-cleavable 2A peptide) for easy titering and selection of the transduced cells in combinatorial RNAi screening.

Diagram showing how the dual shRNA expression vector is constructed

Figure 3. Map of dual shRNA expression libraries (bottom). The dual shRNA expression constructs are generated by cloning the cPPT-Spacer-U6 cassette (top) between shRNA1-barcode and shRNA2 sequences in the “precursor” construct (middle) that was generated by cloning the shRNA1-Barcode-shRNA2 cassette downstream of the U6M promoter in the pRSL16 vector.

Quality Control Analysis of 185K DDR bi-specific shRNA library.

Primary quality control analysis of the constructed 185K DDR shRNA library was performed by conventional sequence analysis of 24 randomly selected clones. All 24 analyzed bi-specific shRNA clones had the correct structure U6M-1st shRNA-stuffer-U6-2nd shRNA, i.e. at least 98% of the constructs in the 185K DDR library have bi-specific shRNA constructs. Sequence analysis also revealed that, from a total of 45 shRNA inserts, 9 inserts had 1-nucleotide deletions, i.e. the mutation rate is approximately 0.35% (1 mutation in 300 nucleotides). Considering that only deletions in the antisense portion of the shRNA could reduce shRNA knockdown activity, we estimate that in approximately 93% of bi-specific constructs, both shRNA inserts are functional.

Additional quality control analysis for the representation of the bi-specific shRNA constructs in the 185K DDR shRNA plasmid library was performed by high-throughput (HT) sequencing using Illumina’s HiSeq 2500 platform. shRNA-specific barcodes were amplified from the plasmid library, and the abundance level of each construct was measured by copy number (number of reads) for each specific barcode. We found that more than 99% of the designed shRNA-specific barcodes are present in the 185K DDR library, and that at least 80% of the bi-specific shRNA constructs are present with less than a 10-fold difference from the average abundance level.
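A representation check of this kind can be sketched as follows. This is a minimal illustration, not the analysis software used in the contract studies; the function name, the toy barcode IDs, and the read counts are hypothetical, and "within 10-fold of the average" is interpreted here as lying between mean/10 and mean × 10.

```python
from statistics import mean


def representation_qc(designed, read_counts, fold=10.0):
    """Summarize barcode representation in a plasmid library.

    designed    : iterable of designed barcode IDs
    read_counts : dict mapping barcode ID -> number of reads
    fold        : allowed fold difference from the average abundance
    """
    designed = list(designed)
    detected = [b for b in designed if read_counts.get(b, 0) > 0]
    avg = mean(read_counts.get(b, 0) for b in designed)
    within = [b for b in detected if avg / fold <= read_counts[b] <= avg * fold]
    return {
        "fraction_detected": len(detected) / len(designed),
        "fraction_within_fold": len(within) / len(designed),
    }


# Toy data (hypothetical): five designed barcodes, one missing, one under-represented
toy_counts = {"bc1": 100, "bc2": 120, "bc3": 80, "bc4": 5, "bc5": 0}
qc = representation_qc(["bc1", "bc2", "bc3", "bc4", "bc5"], toy_counts)
print(qc)
```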

Accomplishments:

The new improved pRSL16 vector with equal knockdown activity of two shRNAs has been developed and validated. The 185K bi-specific DDR shRNA library has been constructed, and its quality was validated by conventional and HT sequencing. In the developed 185K SL DDR shRNA library, at least 93% of bi-specific shRNA constructs are functional, and at least 80% of the constructs are present in the library with less than a 10-fold difference from the average abundance level.

Aim 2. RNAi screening and validation of the SL DDR interactions.

Primary RNAi screening.

As a first step, we packaged the 185K DDR shRNA library developed in Aim 1 studies in preparative scale in HEK293T cells using the protocol developed in Phase I contract studies. The packaged 185K DDR shRNA library was titered in HEK293, A549, C4-2, and MDA-231 cells in order to define the amount of virus necessary to infect 30-50% of the cells.

The packaged 185K DDR shRNA library was used for negative-selection primary viability screens aimed at the discovery of a comprehensive set of SL genes essential for the proliferation and/or survival of the A549, C4-2, and MDA-231 cell lines, which are derived from lung (non-small-cell), prostate, and breast tumors, respectively. Each cancer line (200 × 10^6 cells) was transduced (at MOI=0.5) and treated with puromycin (for 3 days) 3 days after transduction to select the transduced cells. We did not use triplicates in the RNAi screen because the 185K DDR library was built in the newest generation of vector with clonal barcodes. Clonal barcodes allow monitoring of the proliferation rate for every transduced cell and provide more comprehensive statistics in viability screens than triplicates. At an infection efficiency of 50%, this protocol yields approximately 100 × 10^6 transduced cells, a number that is sufficient to cover the complexity of the library with approximately 500-fold redundancy. The transduced cells were grown (with a 4:1 split at 800 × 10^6 cells) and collected after 8 cell divisions post-transduction to reveal low-to-highly efficient SL interactions. The genomic DNA was isolated, and the representation of all the integrated shRNA proviral constructs was determined using PCR amplification of construct-specific 18-nucleotide barcodes followed by HT sequencing with at least 100M reads per sample (averaging 2,000 reads per DDR gene combination) on an Illumina HiSeq 2500 machine.
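The coverage figure quoted above follows from simple arithmetic, sketched below. The Poisson relationship between MOI and the fraction of infected cells is a standard modeling assumption that we add for illustration; it is not stated in the protocol itself, and the helper names are hypothetical.

```python
import math


def transduced_cells(cells: float, infection_efficiency: float) -> float:
    """Number of transduced cells for a given starting population."""
    return cells * infection_efficiency


def fold_coverage(n_transduced: float, library_size: int) -> float:
    """Average number of independently transduced cells per library construct."""
    return n_transduced / library_size


def poisson_fraction_infected(moi: float) -> float:
    """Fraction of cells with at least one integration, assuming Poisson infection."""
    return 1.0 - math.exp(-moi)


# Values taken from the text: 200 x 10^6 cells, 50% infection, 185K constructs
primary = fold_coverage(transduced_cells(200e6, 0.5), 185_000)
print(f"primary screen coverage: ~{primary:.0f}-fold")
print(f"Poisson fraction infected at MOI 0.5: {poisson_fraction_infected(0.5):.2f}")
```

Under the Poisson assumption, an MOI of 0.5 corresponds to roughly 39% of cells infected, which falls inside the 30-50% titering target described for the packaged library.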

Synthetic Lethality Screen Data Analysis.

We used a modified version of the public R-based statistical package (www.r-project.org) to rank the cytotoxic single and bi-specific shRNAs identified in the primary viability screens. This software, which was developed in the Phase I contract studies, allows for the detection of depleted (cytotoxic) shRNAs by non-parametric ranking of the shRNAs according to differences in the barcode abundance level between the control (plasmid) and treatment (growth) samples.

Cytotoxic “single” shRNA controls (with constructs that express single DDR shRNA and Luc) were identified by a decrease in the abundance level (number of reads) in the cells grown for 8 divisions compared with the plasmid control. The SL bi-specific shRNA constructs with additive or synergistic toxicity were revealed by at least two-fold higher toxicity in comparison with the respective single control shRNA constructs. The internal negative controls (designed against luciferase mRNA) allowed us to estimate the cut-off rank value for the cytotoxic shRNA candidates and exclude outliers from downstream analyses.
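The ranking and SL-calling logic described above can be sketched as follows. This is a simplified illustration, not the modified R package used in the contract studies: the read-depth normalization, the pseudocount, the interpretation of "two-fold higher toxicity" as two-fold stronger depletion than the more toxic single control, and all construct names and counts are our assumptions.

```python
import math


def normalize(counts):
    """Scale read counts to counts per million to correct for sequencing depth."""
    total = sum(counts.values())
    return {k: v / total * 1e6 for k, v in counts.items()}


def rank_depleted(plasmid, grown, pseudocount=1.0):
    """Return constructs ranked from most to least depleted, with log2 fold changes."""
    p, g = normalize(plasmid), normalize(grown)
    lfc = {
        k: math.log2((g.get(k, 0.0) + pseudocount) / (p[k] + pseudocount))
        for k in p
    }
    return sorted(lfc, key=lfc.get), lfc


def is_sl_candidate(lfc, pair, single_a, single_b, min_fold=2.0):
    """Flag a pair whose depletion is at least min_fold stronger than either single."""
    strongest_single = min(lfc[single_a], lfc[single_b])
    return lfc[pair] <= strongest_single - math.log2(min_fold)


# Toy counts (hypothetical): plasmid control versus cells grown for 8 divisions
plasmid = {"A+Luc": 1000, "B+Luc": 1000, "A+B": 1000, "Luc+Luc": 1000}
grown = {"A+Luc": 800, "B+Luc": 700, "A+B": 100, "Luc+Luc": 1400}

ranked, lfc = rank_depleted(plasmid, grown)
print(ranked)
print(is_sl_candidate(lfc, "A+B", "A+Luc", "B+Luc"))
```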

As a result of the RNAi SL screening data analysis, we were able to identify bi-specific SL shRNAs for approximately 8,000 gene combinations with at least a 1.5-fold increase in the synergistic toxicity level in at least one of the cancer cell lines used for the RNAi genetic screening. Notably, known synthetic lethal interactions (e.g. [PARP1/BRCA1], [p53/MK2|ATM|Chk2] and [XRCC2/RAD51|RAD52]) were successfully identified, giving us confidence that the screens had identified novel SL gene interactions.

Potential Problems and Solutions: We did not anticipate any serious problems in the primary RNAi viability screening, but we noticed some level of non-specific toxicity in MDA-231 cells. Furthermore, the analysis of the RNAi genetic screen in MDA-231 cells revealed a higher level of noise and difficulties in setting a reliable cut-off value for identifying SL shRNA combinations. In order to address the problem with non-specific toxicity, we replaced MDA-231 cells with the similar MDA-468 cell line in the follow-up validation screens.

Construction of bi-specific DDR shRNA sublibraries for validation screen.

The two pooled, bi-specific 55K shRNA sublibraries were constructed and comprise all identified SL gene pairs outlined above (Aim 1). Briefly, we selected sets of functionally validated shRNAs to obtain a redundant set of 8 shRNAs (not included in libraries generated in Aim 1) for ~8,000 SL gene interactions. Importantly, using a redundant set of rationally selected shRNAs for each candidate SL DDR gene pair allowed us to rapidly confirm SL gene interactions en masse and exclude any candidate genetic interactions identified during the primary screen that were due to the off-target activity of the shRNAs. Internal negative controls (designed against luciferase mRNA) allowed genetic drift to be estimated and outliers to be excluded from downstream analyses. The set of validated shRNAs was selected from the public DECIPHER database and designed using the most recent version of the prediction algorithm we developed under SBIR HG003355 funding.

The set of approximately 110,000 barcoded double shRNA oligonucleotides (in both forward shRNA1-Barcode-shRNA2 and reverse shRNA2-Barcode-shRNA1 orientations) was synthesized on the surfaces of custom microarrays (Agilent; Santa Clara, CA). The dual shRNA cassette oligonucleotide pool was amplified by PCR using common flanking sequences and cloned into the lentiviral pRSL16-U6M-cPPT-Spacer-U6-UbiC-tagRFP-2A-Puro vector (Figure 2) using a two-step cloning protocol (Figure 3) optimized in Task 2 contract studies. Primary quality control analysis of the two constructed 55K DDR shRNA libraries was performed by conventional sequence analysis of 24 randomly selected clones. Of the 24 analyzed bi-specific shRNA clones, 23 had the correct structure U6M-1st shRNA-stuffer-U6-2nd shRNA, i.e. at least 95% of the constructs in the 2x55K DDR libraries have bi-specific shRNA constructs. Sequence analysis also revealed that the mutation rate is approximately 0.3% (1 mutation in 350 nucleotides). Considering that only mutations in the antisense portion of the shRNA could reduce shRNA knockdown activity, we estimate that in approximately 95% of bi-specific constructs, both shRNA inserts are functional.

Secondary Validation SL RNAi screen.

The two pooled, bi-specific 55K DDR shRNA sublibraries, developed in Subtask 3A studies, were packaged at preparative scale and titered in the A549, C4-2 and MDA-MB-468 cell lines. In order to reveal the cancer-specific SL DDR gene interactions, we performed the secondary validation screen using a panel of A549, C4-2 and MDA-MB-468 cell lines. Each cancer line (60 × 10^6 cells) was transduced (at MOI=0.5) with one of the packaged 55K DDR libraries and treated with puromycin (for 3 days) to select the transduced cells. At a multiplicity of infection of 0.5, this protocol yields approximately 30 × 10^6 transduction events, a number that is sufficient to cover the complexity of the library with approximately 500-fold redundancy. The transduced cells were grown (with a 4:1 split at 200 × 10^6 cells) and collected after 8 cell divisions post-transduction to reveal highly efficient SL interactions. The remaining cells were grown and collected after an additional 4 cell divisions to allow the cells that express low-to-medium activity SL shRNA combinations to develop lethal or growth-inhibitory phenotypes. The genomic DNA was isolated, and the representation of all the integrated shRNA proviral constructs was determined using PCR amplification of construct-specific 18-nucleotide barcodes followed by HT sequencing with at least 100M reads per sample (averaging 2,000 reads per DDR gene combination) on a NextSeq500 machine. Cytotoxic “single” shRNA controls (with constructs that express a single DDR shRNA and Luc) were identified by a decrease in the abundance level (number of reads) in the cells grown for 8 or 12 divisions compared with the plasmid control, using a software package that we developed in Task 2 studies. The SL bi-specific shRNA constructs with additive or synergistic toxicity were revealed by at least two-fold higher toxicity in comparison with the respective single control shRNA constructs.
The internal negative controls (designed against luciferase mRNA) allowed the genetic drift to be estimated and outliers to be excluded from downstream analyses. We were able to confirm approximately 1000 SL DDR gene interactions for the follow-up validation studies using a bioinformatics approach.

Integrating functional genetic and genomic datasets to predict novel SL DDR interactions.

To achieve this specific aim, we used rapidly accumulating cancer genomic data (TCGA) to identify candidate SL DDR interactions by applying several separate statistical inference procedures and a novel ensemble classifier algorithm (see below). Specifically, each procedure has its own inputs and outputs for a set of candidate SL pairs. Gene pairs that were identified as candidate SL by different data types (anatomical expression patterns, phenotypes, functional annotations, microarray co-expression, and protein interactions) were computationally weighted according to the ability of each data type to distinguish known genetic interactions from gene pairs that are known not to interact, and the weighted evidence was used to predict novel SL DDR interactions. In addition to the SL DDR interactions functionally determined using RNAi (Aims 1-2, above), three key inference procedures/metrics were investigated. First, genomic survival of the fittest is based on the observation that cancer cells that have lost two SL-paired genes do not survive and are strongly selected against. Accordingly, as cells harboring SL co-inactivation are eliminated from the cancer cell population, SL interactions can be identified by analyzing somatic copy number alteration (SCNA) and somatic mutation data and detecting events of gene co-inactivation that occur significantly less often than expected. Second, pairwise gene co-expression is based on the notion that SL pairs tend to participate in closely related biological processes and hence are likely to be co-expressed. Finally, we mapped experimentally confirmed SL interactions identified in model organisms to their human DDR orthologs.
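The "survival of the fittest" criterion can be illustrated with a one-sided hypergeometric test for depletion of co-inactivation. This is a sketch under our own assumptions: the report does not name the statistical test that was used, and the function names and sample counts below are hypothetical.

```python
from math import comb


def expected_coinactivation(n_samples: int, n_a: int, n_b: int) -> float:
    """Expected number of samples with both genes inactivated if events are independent."""
    return n_a * n_b / n_samples


def coinactivation_depletion_p(n_samples: int, n_a: int, n_b: int, n_both: int) -> float:
    """Lower-tail hypergeometric p-value: P(co-inactivated samples <= n_both).

    n_samples : tumor samples analyzed
    n_a, n_b  : samples with gene A (respectively gene B) inactivated
    n_both    : samples with both genes inactivated
    """
    lowest = max(0, n_a + n_b - n_samples)  # smallest possible overlap
    favorable = sum(
        comb(n_a, k) * comb(n_samples - n_a, n_b - k)
        for k in range(lowest, n_both + 1)
    )
    return favorable / comb(n_samples, n_b)


# Hypothetical cohort: 1000 tumors, A lost in 200, B lost in 150, both lost in 10
print(expected_coinactivation(1000, 200, 150))
print(coinactivation_depletion_p(1000, 200, 150, 10))
```

A small p-value indicates that co-inactivation is observed less often than independence would predict, which is the signature expected of an SL pair under this criterion.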

Figure 4 summarizes the major SL DDR pathways identified and confirmed by the bioinformatics analysis. Importantly, SL DDR interactions can occur either between two components of a single biochemical pathway or between components of two parallel pathways that can functionally compensate for each other. As summarized in Figure 4, ~1048 novel SL interactions were identified, with the majority mapping to canonical DNA damage repair pathways/components. Interestingly, within-pathway DDR interactions accounted for a significant minority of the interactions in our analyses. For example, the within-pathway interactions identified include interactions among components of the homologous recombination repair pathway and interactions among components of nucleotide excision repair. We surmise that for interactions between hypomorphic alleles or partial loss-of-function mutations, within-pathway models may predominate.

Pie chart showing the distribution of major SL DDR pathway interactions identified in the RNAi screens

Figure 4. Summary of major DDR pathway SL interactions & distributions functionally identified and experimentally confirmed in reference cancer cell lines

SL DDR ensemble classifier algorithm development.

Unlike many other machine-learning algorithms, decision trees are adaptable, easy to interpret, and produce highly accurate models using both categorical and continuous data variables. Accordingly, we used a modified tree classification algorithm to predict cancer-specific DDR synthetic lethality. Briefly, the predictive models for this study were produced using the TreeBoost predictive modeling program. TreeBoost DDR SL models were generated, and their accuracy was compared using 10-fold cross-validation. TreeBoost had a misclassification rate of 17.4%, with sensitivity and specificity of 82.5% and 82.6%, respectively. TreeBoost is an implementation of Stochastic Gradient Boosting, but it is specialized for a decision tree base learner, and it introduces a novel stochastic approach to selecting training rows during the boosting iterations. Each training sample presented to Stochastic Gradient Boosting consists of a set of predictor (independent) variable values $x = \{x_1, x_2, \ldots, x_n\}$ and a corresponding “target” variable value, $y$, that is to be fitted by the model. Hence the model is trained on $\{y_i, x_i\}_{i=1}^{N}$. The goal of the training procedure is to generate a function $F(x)$ that maps the $x$ values to the associated target values, $y$. The form of the function is an additive series of (small) decision trees:

$$F(x) = \sum_{m=0}^{M} \beta_m \, h(x'; a_m)$$

where the “base learner”, $h$, is a decision tree, the parameters $a_m = \{a_1, a_2, \ldots\}$ define the decision tree splits for the $m$-th tree in the series, $\beta_m$ is a weighting coefficient for that tree, and $x'$ is a modified set of training samples. Each decision tree in the series was trained using a subset of the training samples selected randomly without replacement. The target variable values, $y'_m$, are “pseudo-residuals” consisting of the gradient of the loss function from the previous step.
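The additive-tree scheme described above can be illustrated with a deliberately small sketch. It is not TreeBoost and makes several simplifying assumptions of our own: a single predictor, one-split regression stumps as the base learner, squared-error loss (so the pseudo-residuals are simply y minus the current prediction), and a constant learning rate in place of per-tree weights. All function names are hypothetical.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) by least squares on one feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # all sampled x values identical: constant prediction
        m = sum(residuals) / len(residuals)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with squared-error loss."""
    rng = random.Random(seed)
    f0 = sum(ys) / len(ys)  # initial constant model
    preds = [f0] * len(xs)
    trees = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), k)      # rows drawn without replacement
        resid = [ys[i] - preds[i] for i in idx]  # pseudo-residuals (negative gradient)
        stump = fit_stump([xs[i] for i in idx], resid)
        trees.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return f0, trees, learning_rate


def predict(model, x):
    f0, trees, lr = model
    return f0 + lr * sum(predict_stump(t, x) for t in trees)


# Toy data (hypothetical): the label switches from 0 to 1 between x = 4 and x = 5
xs = [float(i) for i in range(10)]
ys = [0.0] * 5 + [1.0] * 5
model = boost(xs, ys, n_trees=200)
print(predict(model, 1.0), predict(model, 8.0))
```

Each iteration fits the next small tree to the residual error of the current ensemble on a random subset of rows, which is the stochastic element referred to above.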

We have accomplished large-scale SL RNAi screens using a panel of A549 (non-small-cell lung), C4-2 (prostate), and MDA-231 (breast) cancer cell lines. HT sequencing barcode enumeration data have been generated and analyzed.

We have designed and constructed two 55K SL DDR sublibraries and performed six large-scale SL RNAi validation screens with the 55K SL DDR sublibraries using a panel of A549 (non-small-cell lung), C4-2 (prostate), and MDA-468 (breast) cancer cell lines. We have identified and validated approximately 1,000 SL DDR interactions using a combination of an HT multiplex viability assay and integration with public resource SL data sets.

The results of the integrative experimental functional and bioinformatics validation studies are described on the Project Summary page. The data are no longer available from Cellecta.

Aim 3. Developing a SL DDR RNAi technological platform.

Development of SL DDR Relational Database.

There are different types of DNA damage and therefore different molecular pathways of DNA repair to correct them (including DNA strand cross-link repair, homologous recombination, non-homologous end joining, mismatch repair and nucleotide excision repair). In addition, numerous proteins and signaling cascades are involved in these processes. For example, the ATM/ATR kinases, as well as DNA-PK, are key for detection of DNA lesions. In addition, repair factors such as Rad51, RPA and the Fanconi anemia proteins act directly in repairing the DNA lesions. The p53 signaling network, the RAS GTPase superfamily, and the ubiquitin system are also involved in different aspects of the DNA damage response.

Since functional interactions (FI) within the human DDR pathway, and cross talk with other signaling pathways, are context-dependent and remain largely unknown, we generated a computational network model that represents an ensemble of potentially significant interactions and genetic linkage/association events. Specifically, to construct a DDR FI network and relational database, we collected genes encoding proteins with evidence of involvement in the DNA Damage Response (DDR) from five independent sources:

(i) published DDR canonical pathway maps,
(ii) human protein-protein interaction (PPI) data from public databases,
(iii) human orthologs of PPIs and genetic interactions extracted from several model species,
(iv) literature extracted DDR genes and
(v) transcriptome microarray data obtained after treatment of mammalian cells with stimulators of DNA repair pathways.

After assembly of a ‘master’ gene list, genes were parsed into two bins, a high-confidence set and a lower-confidence set, on the basis of the confidence criteria described herein. Importantly, all core components from each category were included in the final library, as were lower-confidence genes that appeared in a minimum of two independent search categories. Collectively, these data nominated approximately 360 DDR genes encoding proteins linked by at least one independent search/prioritization criterion.

(I.) For the data from pathway sources, HUGO gene symbols were extracted from the following maps focused on the DDR and associated signaling cascades:

REPAIRtoire (http://repairtoire.genesilico.pl);
repairGENES (http://www.repairgenes.org);
Human DNA Repair Genes (http://sciencepark.mdanderson.org/labs/wood/DNA_Repair_Genes.html);
Repair-FunMap;
Pathway Commons (http://www.pathwaycommons.org);
KEGG (http://www.genome.jp/kegg);
BioCarta (http://www.biocarta.com);
Pathguide (http://www.pathguide.org); and
Reactome (http://www.reactome.org).

Gene names were manually inspected and, where necessary, converted to the corresponding official (NCBI) symbols. Genes included in a minimum of four DDR pathways were designated as ‘core components.’
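The ‘core component’ rule can be expressed as a simple membership count across pathway sources. The sketch below is illustrative only: the source-to-gene assignments are toy data invented for the example (they are not the actual contents of these resources), and the function name is hypothetical.

```python
from collections import Counter


def classify_genes(sources, core_min=4):
    """Count how many pathway sources list each gene and flag core components.

    sources  : dict mapping source name -> set of official gene symbols
    core_min : minimum number of sources for a gene to be called 'core'
    """
    membership = Counter(g for genes in sources.values() for g in set(genes))
    core = {g for g, n in membership.items() if n >= core_min}
    return core, membership


# Toy assignments (hypothetical, for illustration only)
toy_sources = {
    "REPAIRtoire": {"ATM", "BRCA1", "RAD51", "XRCC2"},
    "repairGENES": {"ATM", "BRCA1", "RAD51"},
    "KEGG": {"ATM", "BRCA1", "XRCC2"},
    "Reactome": {"ATM", "BRCA1", "RAD51"},
    "BioCarta": {"ATM"},
}

core, membership = classify_genes(toy_sources)
print(sorted(core))
print(membership["XRCC2"])
```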

(II.) We used an orthology-based method in which Smith–Waterman searches were run for the human genome against all proteins in the DIP, BIND, IntAct, MIPS, MINT, HPRD, and BioGRID mammalian protein-protein interaction databases. For each pair of interacting proteins, these files contain information about the source database, the experimental system employed to determine the interaction, and the corresponding PubMed reference(s) where available. We analyzed the putative interactions, assigning confidence scores based on the level of homology to proteins found experimentally to interact and the amount of experimental data available. After ROC curve analysis, with a sensitivity of 83% and specificity of 81%, the human interactome consisted of approximately 183,000 non-redundant binary gene-gene interactions. Core genes (168 in total) encoding proteins involved in DNA repair were used as seeds for PPI searches with the aforementioned data file. Data for first-rank (i.e., direct) interactors were collected by export from the corresponding databases and import into Pajek, followed by cross-comparison of the results. All the PPI information was subjected to further processing and dynamic manipulation by conversion into visualizable PPI networks using a spring-embedded layout.

(III.) To extract a set of DNA repair-centered interactions potentially conserved between humans and non-mammalian species, we used information assembled by the Michigan Proteomics Consortium on genes that genetically interact with the evolutionarily conserved core DDR genes described in (I.) above. Specifically, a PPI network can be represented as a graph in which proteins are nodes (or vertices) and interactions are edges. We therefore describe the DDR core network of the seven model organisms as $G_B = (V_B, E_B)$. For $G_B$ we calculated the following graph measures: mean degree, diameter, index of aggregation, connectivity, clustering coefficient, and assortative mixing coefficient. The graph measures were calculated with previously reported formulas, with partial incorporation into the JUNG graph framework (http://jung.sourceforge.net/). As a control, we generated 1000 networks from $G_B$ with random sets of an equal number of query vertices (nodes) from $V_B$, with $|V_B| = |V_H|$, and subsequent extraction of edges from $G_B$.
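Two of the graph measures named above (mean degree and clustering coefficient), together with the random-subnetwork control, can be sketched in plain Python. This is an illustration under our own assumptions rather than the JUNG-based implementation: the clustering coefficient is computed as the average local coefficient with low-degree nodes counted as zero, the control is an induced subgraph on randomly sampled nodes, and the toy network is hypothetical.

```python
import random
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def mean_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering_coefficient(adj):
    """Average local clustering coefficient (nodes with degree < 2 count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def random_subnetwork(adj, n_nodes, rng):
    """Sample n_nodes vertices and keep only the edges among them."""
    nodes = set(rng.sample(sorted(adj), n_nodes))
    return {u: adj[u] & nodes for u in nodes}


# Toy network (hypothetical): a triangle A-B-C with a pendant node D attached to C
g = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(mean_degree(g), clustering_coefficient(g))

rng = random.Random(0)
controls = [random_subnetwork(g, 3, rng) for _ in range(1000)]
print(sum(mean_degree(c) for c in controls) / len(controls))
```

Comparing a measure on the query network against its distribution over the random controls indicates whether the observed value is unusual for a subnetwork of that size.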

(IV.) The Single Gene Biological Term Mapper feature of CoPub Mapper was used to extract all of the genes co-occurring in PubMed publications with the biological process keywords ‘DNA’ & ‘Damage’ & ‘Response’. In total, 987 genes have been co-published with ‘DDR’ two or more times. Of these genes, 317 co-occur with a relative score greater than zero. The relative score is a measure of the frequency of co-occurrence adjusted for the frequency of occurrence of each item individually. This set of 317 genes, those known to be involved in DDR, comprises the ‘Literature DDR Gene Set’.
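To make the idea of a frequency-adjusted co-occurrence score concrete, the sketch below uses a log-ratio of observed to expected co-occurrence (a pointwise-mutual-information style score). This is an assumption chosen for illustration; it is not claimed to be the exact formula implemented in CoPub Mapper, and the function name and counts are hypothetical.

```python
import math


def relative_score(n_both, n_gene, n_term, n_total):
    """Log10 ratio of observed to expected co-occurrence.

    n_both:  publications mentioning both the gene and the term
    n_gene:  publications mentioning the gene
    n_term:  publications mentioning the term
    n_total: publications in the corpus
    A score above zero means the pair co-occurs more often than expected
    if gene and term were mentioned independently.
    """
    if min(n_gene, n_term, n_total) <= 0:
        raise ValueError("gene, term and corpus counts must be positive")
    if n_both == 0:
        return float("-inf")
    # observed / expected = (n_both / N) / ((n_gene / N) * (n_term / N))
    ratio = (n_both * n_total) / (n_gene * n_term)
    return math.log10(ratio)


# Hypothetical counts, for illustration only.
enriched = relative_score(n_both=5, n_gene=10, n_term=100, n_total=1000)
independent = relative_score(n_both=1, n_gene=10, n_term=100, n_total=1000)
```

Under this scoring, filtering for a score greater than zero keeps only pairs that co-occur more often than chance would predict.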

(V.) Several published DDR Affymetrix microarray datasets were obtained from NCBI’s Gene Expression Omnibus (GEO) portal. GeneSpring GX was used to analyze the data. Signal intensity values were normalized as follows. Values below 0.01 were set to 0.01. Each measurement was divided by the 50th percentile of all measurements in that sample. Each gene was then divided by the median of its measurements in all samples. If the median of the raw values was below 10, each measurement for that gene was instead divided by 10 if the numerator was above 10; otherwise the measurement was discarded. In total, approximately 4,200 differentially expressed genes (>1.5-fold change) were identified. Finally, the InterConnectedNess algorithm was used to prioritize these candidate genes based on whether they directly interact with key (hub) DDR genes mapping to the canonical DDR pathways: nucleotide excision repair, base excision repair, mismatch repair, DNA damage signaling, direct reversal repair, non-homologous end joining, homologous recombination repair, and translesion synthesis. The closeness between genes in a network was quantified by considering not only the direct interaction of two genes but also the number of connectors between them.
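The normalization steps above can be sketched in Python as follows. This is an illustration only, not the GeneSpring GX implementation. The description of the low-signal rule is ambiguous, so the sketch assumes that the cutoff tests are applied to the raw values and that the division by 10 is applied to the per-sample-normalized value; the function name and the toy data are hypothetical.

```python
from statistics import median

FLOOR = 0.01    # values below this are raised to it
CUTOFF = 10.0   # raw-signal cutoff used in the per-gene step


def normalize(raw):
    """raw: {gene: [signal per sample]} -> {gene: [normalized or None]}.

    None marks a measurement that was discarded.
    """
    genes = list(raw)
    n_samples = len(raw[genes[0]])

    # Step 1: set values below the floor to the floor.
    floored = {g: [max(v, FLOOR) for v in raw[g]] for g in genes}

    # Step 2: divide each measurement by its sample's 50th percentile.
    sample_medians = [
        median(floored[g][s] for g in genes) for s in range(n_samples)
    ]
    per_sample = {
        g: [floored[g][s] / sample_medians[s] for s in range(n_samples)]
        for g in genes
    }

    # Step 3: per-gene normalization with the low-signal rule.
    result = {}
    for g in genes:
        if median(floored[g]) >= CUTOFF:
            gene_median = median(per_sample[g])
            result[g] = [v / gene_median for v in per_sample[g]]
        else:
            # Assumed reading: divide by the cutoff when the raw value is
            # above it, otherwise discard the measurement.
            result[g] = [
                per_sample[g][s] / CUTOFF if floored[g][s] > CUTOFF else None
                for s in range(n_samples)
            ]
    return result


# Hypothetical toy data: three genes measured in two samples.
toy_raw = {"g1": [100, 200], "g2": [50, 100], "g3": [2, 12]}
toy_normalized = normalize(toy_raw)
```

After this step, a value near 1 indicates a gene at its typical level across samples, so a fold-change threshold such as 1.5 can be applied directly to the normalized values.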

Finally, we extended our FI network to include non-PPI and protein-DNA interactions relevant to the DDR, allowing us to detect crosstalk among pathways based on protein interactions. To reduce the complexity of our network model while maximizing relevant protein information, we used Dijkstra’s algorithm to calculate the optimal set of interactions by computing the shortest paths between all candidate pathway members.
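A minimal Python sketch of this reduction step is shown below: Dijkstra's algorithm is run from each candidate pathway member, and only the edges lying on shortest paths between members are kept. The edge weights, function names, and toy network are assumptions made for illustration; the text does not specify how interactions were weighted.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances and predecessors from source.

    graph: {node: {neighbor: non-negative edge weight}}
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, weight in graph.get(node, {}).items():
            new_d = d + weight
            if new_d < dist.get(nbr, float("inf")):
                dist[nbr] = new_d
                prev[nbr] = node
                heapq.heappush(heap, (new_d, nbr))
    return dist, prev


def shortest_path_edges(graph, members):
    """Edges on one shortest path between every pair of members."""
    members = list(members)
    kept = set()
    for source in members:
        dist, prev = dijkstra(graph, source)
        for target in members:
            if target == source or target not in dist:
                continue  # skip self and unreachable members
            node = target
            while node != source:
                kept.add(frozenset((node, prev[node])))
                node = prev[node]
    return kept


# Hypothetical toy network with assumed weights (lower = stronger support).
toy_network = {
    "A": {"B": 1.0, "C": 5.0},
    "B": {"A": 1.0, "C": 1.0, "E": 1.0},
    "C": {"A": 5.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
    "E": {"B": 1.0},
}
toy_kept = shortest_path_edges(toy_network, ["A", "D"])
toy_dist, _ = dijkstra(toy_network, "A")
```

In the toy network, the direct but weakly supported edge A-C and the side branch B-E are dropped, while the chain A-B-C-D connecting the two members is retained.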

Integration of SL Functional Genetic Data with DDR Relational Database.

During Phase II contract studies, we developed an expanded version of the DDR Functional Interaction Database by including the reference-normalized SL RNAi genetic interaction data from a panel of model cancer cell lines (see the Aim 1-2 studies above). Toward this goal, we developed and implemented SL DataViewer, an open-source bioinformatics software platform for visualizing DDR molecular interaction networks and biological pathways and for integrating these networks with annotations and gene expression profile data. The system was developed with PHP5 (using development frameworks including Zend Framework 1.0 and Smarty 2.6), with MySQL5 for DDR gene information storage. Specifically, SL DataViewer, using the Cytoscape interface shown in Figure 5, allows registered users to more effectively visualize the functional relationships among genes identified in large-scale RNAi SL experiments. SL DataViewer will facilitate data mining, cross-database analysis, and large-scale analysis of gene function. Importantly, our database includes the following types of open-access information: (i) DNA damage linked to environmental mutagenic and cytotoxic agents, (ii) pathways comprising individual processes and enzymatic reactions involved in the removal of damage, (iii) proteins participating in DNA repair, and (iv) diseases correlated with mutations in genes encoding DNA repair proteins, together with a summary of experimentally and computationally predicted SL DDR interactions. In addition, we plan to include in the DDR database the drug interaction, mutation, CNV, and transcriptome status for all genes included in our DDR-FIN. For this purpose, we will use data from CCLE (the Broad-Novartis Cancer Cell Line Encyclopedia), Project Achilles, and the Cancer Therapeutic Portal.

Accomplishments:

We have successfully developed the SL DDR Relational Database and integrated SL-related data sets from functional screening and from public bioinformatics resources. This resource will facilitate access to knowledge about human DDR signaling, the correlation of human diseases with mutations in genes responsible for DNA integrity and stability, and information about the toxic and mutagenic agents that cause DNA damage.

Cytoscape portal depicting sub-networks determined from RNAi screens with bi-specific SL DDR shRNA libraries

Figure 5. Cytoscape portal for visualization of the experimentally determined SL DDR sub-network. Each node represents a gene included in the DDR Functional Interaction Network, and each edge represents an SL interaction identified functionally by RNAi in reference cancer cell lines. Node size is proportional to the number of SL pairs a gene has, and the main DDR pathway clusters are summarized.

 
