Poster Abstracts

AR-BIC First annual conference - March 11-12, 2015

Advancing Regulatory Science through Bioinformatics
Huixiao Hong, Roger Perkins and Weida Tong
Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, US

In 2010, the US FDA launched its Advancing Regulatory Science (ARS) initiative, aimed at developing new tools, standards, and approaches for assessing the safety, efficacy, quality, and performance of FDA-regulated products. The initiative identifies eight scientific areas that affect multiple regulated product domains or human populations and in which bioinformatics plays a paramount role. The Division of Bioinformatics and Biostatistics (DBB) at FDA’s National Center for Toxicological Research (NCTR) engages in bioinformatics applicable to such areas as biomarker development and validation, drug safety and repurposing, and personalized medicine. This poster will highlight selected bioinformatics research as well as selected databases and software tools, developed both in past years and more recently, in support of FDA regulatory science. For the past eight years, the DBB has led a large international consortium that has assessed the reliability of clinical and toxicological biomarkers derived from emerging microarray and next-generation sequencing technologies. Knowledge bases have been developed that aggregate diverse data associated with a disease, toxicity or phenotype, providing a means for mechanistic studies and the development of predictive models. The Liver Toxicity Knowledge Base integrates in vitro, in vivo, gene expression and textual data. The Endocrine Disruptor Knowledge Base contains in vitro and in vivo data for thousands of chemicals, which are used to build models that predict endocrine activity mediated by estrogen and androgen receptors based solely on chemical structure. The Food-Borne Pathogen Genomics Knowledge Base provides tools to detect and characterize microbial isolates from gene expression data during pathogen outbreaks. ArrayTrack is a genomics tool widely used within FDA as well as by the public, private and academic research communities worldwide. ArrayTrack provides an integrated means to manage, analyze and interpret omics data. It contains many statistical and visualization tools as well as libraries for gene and protein function and biological pathways. FDALabel is a web-based database containing the entire set of 40,000 FDA-approved drug labels. It offers a powerful and flexible search capability and much other functionality valuable to researchers, regulators, drug developers and clinicians. FDALabel will provide an improved bridge for transparent drug safety knowledge exchange between the public and FDA. A common element of the databases and bioinformatics tools cited above is that they either are or will be openly available on the Internet, including on an FDA external Cloud when available, thus advancing FDA data liberation. Many of NCTR’s bioinformatics tools can be accessed through the following link: FDA Bioinformatics Tools.
AR-BIC-1

Discovery of Novel MicroRNAs in Rat Kidney Using Next Generation Sequencing, Microarray and Bioinformatics Technologies
Tao Chen1, Fanxue Meng1, Michael Hackenberg2, Zhiguang Li1, Jian Yan1,
1Division of Genetic and Molecular Toxicology, National Center for Toxicological Research, Food and Drug Administration, Jefferson.
2Dpto. de Genetica, Facultad de Ciencias, Universidad de Granada, Granada, Spain

Exploring the impact of miRNA-seq pipelines on downstream analysis
Halil Bisgin, Binsheng Gong, Yuping Wang, Weida Tong
Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration

The development of Liver Toxicity Knowledge Base (LTKB) for research and review of drug-induced liver injury

Genome-wide comparison of four toxicogenomics assay systems

Detecting Copy Number Variations via a Bayesian Approach Adapting to Both Whole Genome and Targeted Exome NGS

Structural Identification of Unknown Protein Structures

The Effects of Carbon Emissions on Coral Reefs

Phylogenetic analysis enzymes of amino acid biosynthetic pathways

Down-regulation of genes involved in lignin biosynthesis and a genomic approach to deciphering lignin biosynthesis in rice

Sexual dimorphism in the expression of genes encoding drug metabolizing enzymes/transporters may influence a drug’s disposition in adult F344 rats

Data Mining for Signal Detection from Adverse Event Reporting System Database

Weizhong Zhao1,2, Zhichao Liu1, Yuping Wang1, James J. Chen1, Wen Zou1*
1Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, US.
2College of Information Engineering, Xiangtan University, Xiangtan, Hunan Province, China.

The FDA centers receive reports from consumers, health care professionals, manufacturers, and others regarding the safety of various regulated products, such as drugs, vaccines, artificial hearts, surgical lasers, and nutritional supplements. Extracting the information in these reports to better assess product safety and rapidly detect adverse event signals is a challenge. In this study, we collected adverse event reports from the FDA Adverse Event Reporting System (FAERS) covering the first quarter of 2004 through the first quarter of 2014. In the preprocessing step, we cleaned the dataset and normalized the drug names with RxNorm, a standard nomenclature developed by the United States National Library of Medicine (NLM). The empirical Bayes geometric mean (EBGM) approach was used to identify safety signals in the adverse event reports related to 996 FDA-approved drugs. New safety signals were identified by comparison with the information currently available in various sources. The outcome of this study is expected to enhance the information available to the decision-making process for drug safety detection and postmarketing surveillance.
Key words: drug safety signal, adverse event, EBGM, data mining, postmarketing surveillance.
AR-BIC-17
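
The EBGM score mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the usual gamma-Poisson shrinker formulation: the observed count N for a drug-event pair is Poisson with mean lambda*E, E is the count expected if drug and event were reported independently, and lambda has a two-component gamma mixture prior. The prior values in PRIOR are hypothetical placeholders; in practice they are estimated from all drug-event pairs, and the abstract does not report them. The function names are illustrative and are not taken from the authors' software.

```python
import math

# Hypothetical prior (mixing weight, shape1, rate1, shape2, rate2); NOT from the abstract.
PRIOR = (0.3, 0.2, 0.1, 2.0, 4.0)


def digamma(x: float) -> float:
    """Digamma function for x > 0 (upward recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift the argument to where the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def expected_count(drug_total: int, event_total: int, grand_total: int) -> float:
    """Expected number of reports if the drug and the event were independent."""
    return drug_total * event_total / grand_total


def log_marginal(n: int, e: float, shape: float, rate: float) -> float:
    """Log probability of n reports when lambda ~ Gamma(shape, rate)."""
    return (math.lgamma(shape + n) - math.lgamma(shape) - math.lgamma(n + 1)
            + shape * math.log(rate / (rate + e))
            + n * math.log(e / (rate + e)))


def ebgm(n: int, e: float, prior=PRIOR) -> float:
    """Geometric mean of the posterior distribution of lambda = true rate / expected rate."""
    p, a1, b1, a2, b2 = prior
    l1 = math.log(p) + log_marginal(n, e, a1, b1)
    l2 = math.log(1 - p) + log_marginal(n, e, a2, b2)
    m = max(l1, l2)  # subtract the maximum for numerical stability
    q = math.exp(l1 - m) / (math.exp(l1 - m) + math.exp(l2 - m))
    mean_log = (q * (digamma(a1 + n) - math.log(b1 + e))
                + (1 - q) * (digamma(a2 + n) - math.log(b2 + e)))
    return math.exp(mean_log)
```

With many reports the score stays close to the raw ratio N/E, while a single report against a tiny expected count is pulled strongly toward the prior; this shrinkage is why EBGM is preferred over the raw ratio for sparse drug-event pairs.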

Systematically identifying and annotating long non-coding RNAs

Dan Li, Mary Yang
University of Arkansas at Little Rock, Little Rock, Arkansas

Long non-coding RNAs (lncRNAs) have been shown to play important roles in various biological processes and have been implicated in disease. Although lncRNAs have gained substantial attention in recent years, their regulatory mechanisms remain to be elucidated. High-throughput RNA sequencing (RNA-Seq) provides an unprecedented ability to annotate lncRNA transcripts, which can potentially advance our understanding of their biological functions. Here, we built and compared two RNA-Seq processing pipelines using reference-guided and de novo strategies. By analyzing RNA-Seq data from brain tissues of 14 individuals, we identified 33,038 and 41,241 novel transcripts using TopHat-Cufflinks and Trinity-GMAP, respectively. Over 94% of the novel transcripts identified by both approaches are predicted to be lncRNAs according to their canonical features. Notably, 61.1% and 75.2% of the lncRNAs obtained from the two approaches are unique, suggesting that a combination of reference-based and de novo assembly may lead to more comprehensive discovery of novel transcripts. Furthermore, we found that 2,268 and 2,436 lncRNAs (from reference-based and de novo assembly, respectively) were transcribed from bidirectional promoters shared with their closest protein-coding genes. The majority of lncRNAs (99.8%) are under-expressed in our dataset. Of these protein-coding genes, 2,011/2,268 (88.7%) are under-expressed, whereas 257/2,268 (11.3%) are up-regulated. Applying DAVID analysis to the up-regulated genes, we found 12 genes implicated in Alzheimer’s disease, 23 with cell cycle functions and 9 with anti-apoptosis functions. Our results demonstrate that lncRNAs regulate genes associated with Alzheimer’s disease and cancer.
AR-BIC-18
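
The bidirectional-promoter criterion used in the abstract above can be sketched as a simple coordinate test. The abstract does not state the distance cutoff; the 1,000 bp window below is an assumption based on the conventional definition (two genes on opposite strands, arranged head to head, with nearby transcription start sites). The Transcript class and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    chrom: str
    start: int   # leftmost coordinate, inclusive
    end: int     # rightmost coordinate, inclusive
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        """Transcription start site: left end on "+", right end on "-"."""
        return self.start if self.strand == "+" else self.end


def shares_bidirectional_promoter(a: Transcript, b: Transcript, max_gap: int = 1000) -> bool:
    """True if a and b are divergent (head to head) with TSSs within max_gap bp.

    max_gap = 1000 is an assumed cutoff, not a value given in the abstract.
    """
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    minus, plus = (a, b) if a.strand == "-" else (b, a)
    gap = plus.tss - minus.tss  # positive only when the pair is head to head
    return 0 < gap <= max_gap
```

For example, a minus-strand lncRNA ending at position 2,000 and a plus-strand gene starting at position 2,500 on the same chromosome would be reported as sharing a promoter, whereas a convergent (tail to tail) pair would not.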

Bioinformatics challenges for research centers in Arkansas: ACNC case

Horacio Gomez-Acevedo1,2, Brian D. Piccolo1,2, Sudeepa Bhattacharyya2
1Arkansas Children’s Nutrition Center, 2Department of Pediatrics, UAMS

The use of high-throughput technologies in basic and translational research has grown steadily at the Arkansas Children’s Nutrition Center (ACNC). Genomic research has drawn on a broad spectrum of technologies, including RNA-seq, ChIP-Seq, Methyl-Seq, the Human Methylation 450 BeadChip and traditional Affymetrix microarrays. The center has also acquired a UHPLC-Q Orbitrap to carry out metabolomics and lipidomics analyses. This increase in big data collection has created bioinformatics challenges at several levels, namely keeping up with changes in software, methodologies, statistical approaches, data management, data storage, and data presentation. Based on our experience at ACNC, we highlight some of our current solutions to these challenges in genomics and metabolomics. We also present bottleneck areas in which closer collaboration with other centers or institutes in the region may synergistically increase research quality in the life sciences as well as in bioinformatics.
AR-BIC-19

Poster Number	Presenter	Affiliation	Title
AR-BIC-1	Hong, Huixiao	NCTR	Advancing Regulatory Science through Bioinformatics
AR-BIC-2	Chen, Tao	NCTR	Discovery of Novel MicroRNAs in Rat Kidney Using Next Generation Sequencing, Microarray and Bioinformatics Technologies
AR-BIC-3	Bisgin, Halil	NCTR	Exploring the impact of miRNA-seq pipelines on downstream analysis
AR-BIC-4	Ng, Huiwen	NCTR	Development of a competitive molecular docking approach for predicting estrogen receptor agonists and antagonists
AR-BIC-5	Luo, Heng	NCTR	Collection and molecular docking identification of associations between drugs and class I human leukocyte antigens for predicting idiosyncratic drug reactions
AR-BIC-6	Ye, Hao	NCTR	Deciphering adverse outcome pathways through network analysis of ToxCast data
AR-BIC-7	Chen, Yu-Chuan	NCTR	Ensemble Survival Trees for Identifying Subpopulations in Personalized Medicine
AR-BIC-8	Chen, Minjun	NCTR	The development of Liver Toxicity Knowledge Base (LTKB) for research and review of drug-induced liver injury
AR-BIC-9	Liu, Zhichao	NCTR	Genome-wide comparison of four toxicogenomics assay systems
AR-BIC-10	Wei, Yu-Chung	NCTR	Detecting Copy Number Variations via a Bayesian Approach Adapting to Both Whole Genome and Targeted Exome NGS
AR-BIC-11	Beger, Richard	NCTR	3D-SDAR analysis of a diverse dataset of 180 hERG inhibitors: Structural factors determining the binding potential
AR-BIC-12	Walker, Cameron	NCTR	Structural identification of unknown protein structures
AR-BIC-13	Hunt, Adrian	NCTR	The Effects of Carbon Emissions on Coral Reefs
AR-BIC-14	Machooka, Daniel	NCTR	Phylogenetic analysis enzymes of amino acid biosynthetic pathways
AR-BIC-15	Shang, Zhenhua	UAPB	Down-regulation of genes involved in lignin biosynthesis and a genomic approach to deciphering lignin biosynthesis in rice
AR-BIC-16	Vikrant, Vijay	UAPB	Sexual dimorphism in the expression of genes encoding drug metabolizing enzymes/transporters may influence a drug’s disposition in adult F344 rats
AR-BIC-17	Zhao, Weizhong	NCTR	Data Mining for Signal Detection from Adverse Event Reporting System Database
AR-BIC-18	Li, Dan	UALR	Systematically identifying and annotating long non-coding RNAs
AR-BIC-19	Gomez-Acevedo, Horacio	UAMS	Bioinformatics challenges for research centers in Arkansas: ACNC case
AR-BIC-20	Crabtree, Nathan	NCTR	Building a computational evolution system to identify genes of interest in multi-class, RNA-seq data
AR-BIC-21	Gokulan, Kuppan	NCTR	Structure and specificity of L,D-Transpeptidase from Mycobacterium tuberculosis
AR-BIC-22	Barabote, Ravi D.	UAF	Omics analyses of microbial response to environment
AR-BIC-23	Wu, Leihong	NCTR	A novel clustering approach to find biomarkers in breast cancer subtypes based on expression network profiles
AR-BIC-24	Yu, Ke	NCTR	Making Tractable the Use of Vast Quantities of Regulatory-Related Textual Data
AR-BIC-25	Smith, Sidney	UAPB	Peptide Sequence Patterns Related to Omega Angles in Cis Conformation

Abstracts

Discovery of Novel MicroRNAs in Rat Kidney Using Next Generation Sequencing, Microarray and Bioinformatics Technologies
Tao Chen1, Fanxue Meng1, Michael Hackenberg2, Zhiguang Li1, Jian Yan1,
1Division of Genetic and Molecular Toxicology, National Center for Toxicological Research, Food and Drug Administration, Jefferson.
2Dpto. de Genetica, Facultad de Ciencias, Universidad de Granada, Granada, Spain

MicroRNAs (miRNAs) are small non-coding RNAs that regulate a variety of biological processes. Release 18 of the miRBase database includes 1,157 mouse and 680 rat mature miRNAs. Only one new rat mature miRNA was added to the rat miRNA database between versions 16 and 18 of miRBase, suggesting that many rat miRNAs remain to be discovered. Given the importance of the rat as a model organism, discovery of the complete set of rat miRNAs is necessary for understanding rat miRNA regulation. In this study, next generation sequencing (NGS), microarray analysis and bioinformatics technologies were applied to discover novel miRNAs in rat kidneys. MiRanalyzer was used to analyze the sequences of the small RNAs generated from NGS analysis of rat kidney samples. Hundreds of novel miRNA candidates were examined according to the mapping of their reads to the rat genome, the presence of sequences that can form a miRNA hairpin structure around the mapped locations, Dicer cleavage patterns, and their expression levels as determined by both NGS and microarray analyses. Nine novel rat hairpin precursor miRNAs (pre-miRNAs) were discovered with high confidence. Five of the novel pre-miRNAs have also been reported in other species, while four are rat specific. In summary, 9 novel pre-miRNAs and 14 novel mature miRNAs were identified via a combination of NGS, microarray and bioinformatics high-throughput technologies.
AR-BIC-2

Exploring the impact of miRNA-seq pipelines on downstream analysis
Halil Bisgin, Binsheng Gong, Yuping Wang, Weida Tong
Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration

Background: The development of next-generation sequencing (NGS) techniques opened a new era in genomic research and led to several studies in RNA-Seq. Despite the excitement, concerns have arisen about profiling tools and the definition of standards. In recent years, the FDA SEQC consortium took the initiative to address technical and statistical challenges in RNA-seq. However, similar issues have not been extensively studied for miRNA-Seq in the research community.

Method: We investigated the effect of the parameter space on downstream analysis by exploring four miRNA-Seq profiling tools (miRDeep2, miRExpress, miRNAkey, sRNAbench). Using miRNA-seq data generated from rat liver samples treated with thioacetamide at four time points and three dose levels, we first compared the variance in the number of differentially expressed miRNAs (DEMs) for each tool across its own parameters. miRDeep2 and sRNAbench were then run with common parameters (genome mapping, windowing, quantification) to study detection sensitivity and DEM variability along with the choice of normalization.

Results: Under the same parameters, sRNAbench detected more miRNAs, most of which were also detected by miRDeep2. Under the same normalization method, miRDeep2 yielded more DEMs, and the overlap ratio with sRNAbench was higher for DEMs than for detected miRNAs. While windowing introduced more variance in detection, genome mapping also contributed to the variability of DEMs. For higher doses and longer durations, miRDeep2 was less sensitive to parameter changes, which resulted in more agreement among its own DEM lists. Profiling parameters accounted for no more than 8% of the variance when time, dose, and time-dose interaction were also considered. A change in the normalization step affected the DEMs to an extent close to that of the treatment factors, but the trend across time and dose remained similar.

Conclusion: For a given normalization method, profiling parameters had limited impact on the downstream analysis. Normalization, on the other hand, considerably changed the number of DEMs, but under any choice of normalization the time-dose pattern followed similar trends, which is the sign of a treatment effect.
AR-BIC-3
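
The overlap between the DEM lists of two tools, as discussed in the Results above, can be quantified in several ways. The abstract does not define its overlap ratio, so the sketch below uses the Jaccard index as one plausible choice; the miRNA names in the usage note are hypothetical.

```python
def overlap_ratio(dems_a, dems_b) -> float:
    """Jaccard index of two DEM lists: shared miRNAs divided by all miRNAs in either list."""
    a, b = set(dems_a), set(dems_b)
    if not a and not b:
        return 0.0  # two empty lists: define the overlap as zero
    return len(a & b) / len(a | b)
```

For instance, lists {miR-1, miR-2, miR-3} and {miR-2, miR-3, miR-4} share two of four distinct miRNAs, giving a ratio of 0.5.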

Development of a competitive molecular docking approach for predicting estrogen receptor agonists and antagonists
Hui Wen Ng, Weida Tong and Huixiao Hong
Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079

Molecular docking is a well-established molecular modeling technique commonly used in ligand screening and drug design. The method attempts to predict the binding mode and molecular interactions between a protein and a ligand, and to rank the predicted poses with scoring functions. Protein-ligand association in vivo is a dynamic process in which binding is accompanied by a conformational change in the complex, a phenomenon commonly referred to as “induced fit”. However, because of its high computational cost, fully flexible docking remains impractical, so rigid docking and limited flexible docking have become the most commonly practiced methods. The estrogen receptors (ERs) adopt distinctly different conformations upon binding to agonists and antagonists. Using the agonist and antagonist conformations of ER subtype α, we designed an in silico approach that more closely mimics the biological process and used it to differentiate the agonist versus antagonist status of potential binders. The approach was first evaluated using true agonists and antagonists extracted from crystal structures available in the Protein Data Bank (PDB), and then further validated using a larger set of ligands from the literature. Its usefulness was demonstrated by enrichment analysis in data sets with a large number of decoy ligands. The performance of the individual agonist and antagonist docking conformations was found to be comparable to that of similar models in the literature. When combined in a competitive docking approach, they discriminated agonists from antagonists with good accuracy and efficiently selected true agonists and antagonists from decoys during enrichment analysis. In conclusion, this approach has potential applications not only in drug discovery projects in the pharmaceutical industry but also in the screening of potential endocrine disrupting compounds (EDCs) by regulatory authorities for risk assessment.
AR-BIC-4
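
The competitive step described above can be sketched as a comparison of two docking scores per ligand. The abstract does not give the decision rule, so the sketch makes three assumptions: each ligand has already been docked into both the agonist and the antagonist receptor conformation, lower scores mean better predicted binding, and an optional cutoff (hypothetical) separates binders from decoys.

```python
def classify_ligand(agonist_score: float, antagonist_score: float, binder_cutoff=None) -> str:
    """Label a ligand by the receptor conformation it docks into more favorably.

    Assumes lower score = better binding. Ties are labeled "antagonist" here,
    which is an arbitrary choice for the sketch.
    """
    best = min(agonist_score, antagonist_score)
    if binder_cutoff is not None and best > binder_cutoff:
        return "non-binder"  # neither conformation docks well enough
    return "agonist" if agonist_score < antagonist_score else "antagonist"
```

A ligand scoring -10.2 in the agonist conformation and -7.5 in the antagonist conformation would be labeled an agonist; with a cutoff of -6.0, a ligand scoring -4.0 and -5.0 would be set aside as a likely decoy.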

Collection and molecular docking identification of associations between drugs and class I human leukocyte antigens for predicting idiosyncratic drug reactions
Heng Luo1,2, Huixiao Hong1
1National Center for Toxicological Research, US Food and Drug Administration
2University of Arkansas at Little Rock/University of Arkansas for Medical Sciences joint Bioinformatics program
Correspondence to: Huixiao.Hong@fda.hhs.gov

Idiosyncratic drug reactions (IDRs) are rare, somewhat dose-independent, patient-specific and hard to predict. Human leukocyte antigens (HLAs), the major histocompatibility complex (MHC) in humans, are highly polymorphic and are associated with specific IDRs. It is therefore important to identify potential drug-HLA associations so that individuals who would develop IDRs can be identified before drug exposure. We harvested associations between drugs and HLAs from the literature and built a database named HLADR. Molecular docking was used to explore the known associations. Analysis of the docking scores between 17 drugs and 74 class I HLAs showed that the significantly associated drug-HLA pairs had significantly lower docking scores than pairs not reported to be significantly associated (t-test, p < 0.05). This indicates that molecular docking can be used to screen drug-HLA interactions and predict potential IDRs, and may improve drug safety and the implementation of personalized medicine. Examination of the binding modes of drugs in the docked HLAs suggested several distinct binding sites inside class I HLAs, expanding our knowledge of the underlying interaction mechanisms between drugs and HLAs.
AR-BIC-5

Deciphering adverse outcome pathways through network analysis of ToxCast data
Hao Ye1, Heng Luo2, Hui Wen Ng1, Weigong Ge1, Weida Tong1, Huixiao Hong1*
1Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079
2University of Arkansas at Little Rock/University of Arkansas for Medical Sciences Bioinformatics Graduate Program, Little Rock, AR 72204
Correspondence should be addressed to Dr. Huixiao Hong at huixiao.hong@fda.hhs.gov

ToxCast data have been demonstrated to be efficient in characterizing the toxicological profiles of environmental chemicals. An adverse outcome pathway (AOP) is a group of molecular events, related at higher levels of biological organization (e.g., cell or tissue), that ultimately lead to an adverse outcome. Network analysis is frequently used to investigate the group properties of networks such as social networks, electronic commerce networks, and biological networks. We first constructed a network in which the assays and the chemicals assayed in ToxCast were treated as nodes and the positive assay results were used to connect the nodes. We then applied network analysis to aid the understanding of ToxCast data and to identify potential AOPs. We also demonstrated that the activity of untested chemicals in the ToxCast assays could be predicted using network analysis. We found that the compound-assay network could be decomposed into seven densely connected modules based on its topological properties. Moreover, each of the seven modules was associated with different AOPs. For example, most of the ER-, AR-, and GR-related assays were significantly enriched in module one. We will present our results and discuss the implications, limitations and perspectives of network analysis of ToxCast data.
AR-BIC-6*
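
The network construction described above (chemicals and assays as nodes, positive results as edges) can be sketched with plain dictionaries. The prediction step is an assumption: the abstract does not say how untested activities were predicted, so the sketch scores a chemical-assay pair by the mean Jaccard similarity between the chemical's assay profile and the profiles of chemicals already active in that assay. All chemical and assay names used with it are hypothetical.

```python
from collections import defaultdict


def build_network(positive_hits):
    """Build a bipartite chemical-assay network from (chemical, assay) positive results."""
    chem_to_assays = defaultdict(set)
    assay_to_chems = defaultdict(set)
    for chem, assay in positive_hits:
        chem_to_assays[chem].add(assay)
        assay_to_chems[assay].add(chem)
    return chem_to_assays, assay_to_chems


def link_score(chem, assay, chem_to_assays, assay_to_chems) -> float:
    """Score an untested chemical-assay pair (assumed neighbor-similarity heuristic)."""
    own = chem_to_assays.get(chem, set()) - {assay}
    actives = assay_to_chems.get(assay, set()) - {chem}
    if not own or not actives:
        return 0.0  # nothing to compare against
    sims = []
    for other in actives:
        theirs = chem_to_assays[other] - {assay}
        sims.append(len(own & theirs) / len(own | theirs))
    return sum(sims) / len(sims)
```

A chemical that hits the same assays as the known actives of a target assay receives a score near 1, while a chemical with an unrelated profile receives a score near 0.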

Ensemble Survival Trees for Identifying Subpopulations in Personalized Medicine
Yu-Chuan Chen, James J. Chen

Personalized medicine has recently received great attention as a way to improve safety and effectiveness in drug development. It aims to provide medical treatment tailored to a patient’s characteristics, such as genomic biomarkers and disease history, so that the benefit of treatment can be optimized. Subpopulation identification divides patients into several subgroups, each corresponding to an optimal treatment. For two subgroups with a survival-time endpoint, a multivariate Cox proportional hazards model is traditionally fitted and used to calculate a risk score, and the median is commonly chosen as the cutoff value to separate patients. Here we propose a novel tree-based method that adopts the relative risk tree algorithm to identify patient subgroups. After growing a relative risk tree, we apply k-means clustering to group the terminal nodes based on their averaged covariates. Because the performance of a single tree is well known to be quite unstable, we adopt an ensemble bagging method to improve it. A simulation study is conducted to compare the performance of the proposed method with that of the multivariate Cox model. The proposed method is also applied to three public cancer data sets for illustration.
AR-BIC-7
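
The traditional baseline described above, a Cox risk score split at the median, can be sketched as follows. Model fitting is not shown: the coefficients are assumed to come from an already fitted Cox proportional hazards model, and the values in the usage note are hypothetical. This illustrates only the baseline, not the proposed tree-based method.

```python
from statistics import median


def risk_scores(covariates, coefficients):
    """Linear predictor of a Cox model: sum of coefficient * covariate per patient."""
    return [sum(b * x for b, x in zip(coefficients, row)) for row in covariates]


def median_split(scores):
    """Assign each patient to the high- or low-risk subgroup using the median cutoff."""
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]
```

With coefficients (0.5, 1.0) and four patients with covariates (1, 0), (0, 1), (1, 1) and (0, 0), the risk scores are 0.5, 1.0, 1.5 and 0.0, the median is 0.75, and the second and third patients fall in the high-risk subgroup.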

The development of Liver Toxicity Knowledge Base (LTKB) for research and review of drug-induced liver injury
Minjun Chen, Eileen E Navarro Almario, Guangxu Zhou, Ruyi He, Chuchu Hu, Marc Stone, Tina M Burgess, Shashi Amur, Victor Crentsi, Hong Fang, Weida Tong
National Center for Toxicological Research

Drug-induced liver injury (DILI) presents a significant challenge to drug development and regulatory application. The Liver Toxicity Knowledge Base (LTKB) aims to provide literature data and regulatory information about DILI to support research and review of drug safety. The LTKB contains ~3000 unique prescription drugs, including ~1400 drugs approved by the FDA, ~1300 drugs approved by other agencies such as the EMA, and 210 drugs withdrawn from the worldwide market. The following data are available for most drugs in the LTKB: chemical structure, therapeutic use, PD/PK, DILI types and severity, DILI mechanisms, histopathology, drug targets, side effects, etc. The LTKB can serve as 1) a reference database when drug/DILI-related data need to be queried; 2) a tool for assessing DILI risk in humans for new chemical entities in the review process; and 3) a tool to support biomarker studies using emerging technologies (e.g., genomics, in vitro studies).
AR-BIC-8

Genome-wide comparison of four toxicogenomics assay systems
Zhichao Liu1, Hong Fang2, Joshua Xu1, Weida Tong1*
1Division of Bioinformatics and Biostatistics, 2Office of Scientific Coordination, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, 72079, USA
*To whom correspondence should be addressed at Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA. Telephone: (870) 543-7142. Fax: (870) 543-7854. E-mail: weida.tong@fda.hhs.gov
Disclaimer: The views presented in this article do not necessarily reflect current or future opinion or policy of the US Food and Drug Administration. Any mention of commercial products is for clarification and not intended as endorsement.

Assessing genome-wide differences and similarities between in vitro and in vivo responses to drug treatment is essential for choosing relevant toxicogenomics assays in drug safety studies. We used the Japanese Toxicogenomics Project dataset, which profiles 131 compounds (most of them drugs) with microarrays in four assay systems for liver: two in vitro methods (rat and human primary hepatocytes) and two in vivo experiments (single-dose and repeat-dose studies). For each testing system, a drug-drug similarity score was calculated for every pair of drugs based on their shared gene expression patterns, and the pairs were ranked from most to least similar. The testing systems were then compared pairwise by ROC curve analysis to quantify how well the ranking was preserved between two ordered similarity lists. The two in vivo systems (AUC=0.90) and the two in vitro systems (AUC=0.77) scored highest, indicating that the experimental platform (i.e., in vitro or in vivo) is the most important factor affecting the assay results. The results also implied that (a) an expensive assay system (i.e., an in vivo repeat-dose study) could be replaced by an inexpensive one (i.e., a short-term in vivo single-dose study) and (b) species difference (i.e., rat in vitro versus human in vitro) was less pronounced within the same testing system. We also found good concordance (AUC=0.70) between the rat in vitro and in vivo repeat-dose studies, indicating the potential to replace an animal-based testing method with an animal-free in vitro assay. Furthermore, we correlated the ranking preservation between assays against various liver-related toxicological endpoints. For all of the endpoints examined, the concordance between rat and human was significantly improved (by over 10% on average), highlighting that the extrapolation of rat data to humans is endpoint dependent. The proposed method has many advantages over traditional approaches, such as insensitivity to the batch effects that are common in microarray data.
AR-BIC-9*
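
The ranking-preservation comparison described above can be sketched as a ROC analysis over drug pairs. The abstract does not spell out how positives were defined, so the sketch assumes that the top fraction of drug pairs in one assay system is treated as the positive set and that the similarity scores from the second system are used to rank all pairs. The top_fraction value and the pair scores are hypothetical.

```python
def auc(pos_scores, neg_scores) -> float:
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def ranking_preservation(sim_a: dict, sim_b: dict, top_fraction: float = 0.1) -> float:
    """AUC measuring how well system B recovers the most similar drug pairs of system A.

    sim_a and sim_b map the same drug pairs to similarity scores.
    top_fraction is an assumed cutoff; at least one pair must fall outside the top set.
    """
    pairs = sorted(sim_a, key=sim_a.get, reverse=True)
    k = max(1, int(len(pairs) * top_fraction))
    positives = set(pairs[:k])
    pos = [sim_b[p] for p in pairs if p in positives]
    neg = [sim_b[p] for p in pairs if p not in positives]
    return auc(pos, neg)
```

An AUC of 1.0 means the second system ranks the first system's most similar pairs above all others, 0.5 means no better than chance, and values such as the 0.90 and 0.77 reported above fall in between.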

Detecting Copy Number Variations via a Bayesian Approach Adapting to Both Whole Genome and Targeted Exome NGS
Yu-Chung Wei1,2, Ching-Wei Chang1, Guan-Hua Huang2*
1Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA
2Institute of Statistics, National Chiao Tung University, Hsinchu, Taiwan 30010, ROC
*Corresponding author: E-mail: ghuang@stat.nctu.edu

Copy number variations (CNVs) are genomic structural mutations involving abnormal numbers of copies of gene fragments. Current CNV detection algorithms for next generation sequencing (NGS) are developed for specific genome targets, either whole genome sequencing or targeted exome sequencing, based on the different data types and their corresponding assumptions. Many whole genome tools assume continuity of the search space and uniform read coverage across the genome. However, these assumptions break down in exome capture because of discontinuous segments and exome-specific functional biases. To develop a method that adapts to both data types, we specify the large unconsidered genomic fragments as gaps in order to preserve the true location information. A Bayesian hierarchical model was built, and an efficient reversible jump Markov chain Monte Carlo inference algorithm was used to incorporate the gap information. The performance of the gap settings for the Bayesian procedure was evaluated and compared with competing approaches using both simulations and real data from the 1000 Genomes Project. The proposed approach outperforms existing methods in accuracy for both whole genome and targeted exome data.
Keywords: Bayesian inference, copy number variations, next generation sequencing, reversible jump Markov chain Monte Carlo.
AR-BIC-10

3D-SDAR analysis of a diverse dataset of 180 hERG inhibitors: Structural factors determining the binding potential
Iva Slavova1, Svetoslav H. Slavov1, Dan A. Buzatu1, Jon G. Wilkes1, Richard D. Beger1
1 NCTR, Jefferson, AR, United States

3D-SDAR is a three-dimensional spectral-data activity relationship (3D-QSDAR) approach utilizing fingerprints constructed from 13C and 15N NMR chemical shifts augmented with interatomic distances. 3D-QSDAR was used to model a diverse dataset of human Ether-a-go-go-Related Gene (hERG) blockers, some of which were drugs that can cause heartbeat arrhythmia. After setting a commonly accepted IC50 threshold of 1 µM, the 180 chemicals forming the initial dataset were split into two classes: 67 were defined as hERG blockers (hERG+) while the remaining 113 compounds were labeled as hERG-. A simple IC50 distribution-based rule was used to split the initial set of 180 compounds into a balanced modeling set (61 hERG+ and 57 hERG-) and an external test set (6 hERG+ and 56 hERG-). A total of 100 randomized PLS models were built, each splitting the modeling set into training (80%) and hold-out test (20%) sets. At each step, the compounds randomly assigned to the hold-out test set and those in the external test set were predicted. At the end, the quantitative predictions for each compound were averaged and a threshold of 0.5 was used to categorize them into hERG+ and hERG-. Different grid granularities and fixed ratios (derived from the gyromagnetic ratios of C and N) for the bin sizes in the C-C, C-N and N-N regions were explored. A model with 4 latent variables (LVs) based on 6 ppm x 6 ppm x 1 Å bins for the C-C region, 6 ppm x 30 ppm x 1 Å bins for the C-N region and 30 ppm x 30 ppm x 1 Å bins for the N-N region performed best. The model correctly classified 84% of the 62 compounds in the external test set (sensitivity = 1.00, specificity = 0.82 and area under the curve = 0.91).
The bins with the highest frequencies of occurrence from the top two LVs of the randomized PLS model allowed the construction of a hERG toxicophore consisting of an aromatic ring and an amino group. It was demonstrated that a second aromatic ring would increase the hERG blocking potential. AR-BIC-11
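
The external-test figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes a confusion matrix of 6 true positives, 0 false negatives, 46 true negatives and 10 false positives. These counts are not stated in the abstract; they are an assumption chosen only because they are consistent with the reported class sizes (6 hERG+ and 56 hERG-) and reproduce the reported sensitivity, specificity and overall accuracy.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # fraction of hERG+ compounds found
    specificity = tn / (tn + fp)              # fraction of hERG- compounds found
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for the external test set (6 hERG+, 56 hERG-).
sens, spec, acc = classification_metrics(tp=6, fn=0, tn=46, fp=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

Because the external set is strongly imbalanced, accuracy is dominated by the hERG- class, which is why sensitivity and specificity are reported alongside it.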

Structural Identification of Unknown Protein Structures
Cameron Walker and Karl A. Walker
University of Arkansas at Pine Bluff, Pine Bluff, Arkansas

This research involves the analysis of unknown protein structures. The purpose of this research is to develop improved algorithms that aid in the prediction of protein structure. We hope to provide alternatives to existing algorithmic approaches in bioinformatics, specifically protein threading. By matching the sequences of proteins of unknown structure to those of known structures stored in our database, we can determine the longest common subsequences among proteins. Once data have been generated from our protein-threading algorithm, we perform statistical analysis on those data to draw inferences that could lead to the identification of new, useful enzymes. AR-BIC-12

The Effects of Carbon Emissions on Coral Reefs
Adrian Hunt, Britney Bolar and Karl Walker
University of Arkansas at Pine Bluff, Pine Bluff, Arkansas

Coral Zooxanthellae (coral reefs) are one of the most diverse ecosystems in the world. A major problem that is killing this species is carbon emission. Comparing the DNA sequences of affected and unaffected coral reefs may help us to understand how some species have developed resistance to these emissions, thus providing further knowledge on how to protect and save this special species. AR-BIC-13

Phylogenetic analysis of enzymes of amino acid biosynthetic pathways
Daniel Machooka1, Andrea Carpenter1, Joseph Onyilagha1, Richard Walker1, Karl Walker1, Stephen Freeland2, Serhan Dagtas3
1 University of Arkansas at Pine Bluff, Pine Bluff, AR; 2 University of Maryland Baltimore, Baltimore, MD; 3 University of Arkansas at Little Rock, Little Rock, AR

Understanding how life formed from only a few molecules remains a great mystery. The biosynthetic pathways of the standard twenty amino acids may hold the key to solving this puzzle. In this research, we have closely examined the enzymes involved in these pathways and have conducted phylogenetic analysis of them in order to better understand how each amino acid evolved to become a part of the genetic code of life. AR-BIC-14

Down-regulation of genes involved in lignin biosynthesis and a genomic approach to deciphering lignin biosynthesis in rice
Zhenhua Shang1, Sathish Kumar Ponniah1, Vibha Srivastava2, and Muthusamy Manoharan1*
1 Department of Agriculture, University of Arkansas at Pine Bluff, AR 71601, USA; 2 Department of Crop, Soil & Environmental Sciences, University of Arkansas, Fayetteville, AR 72701, USA
* Corresponding author: manoharanm@uapb.edu

The objective of this project was to reduce lignin by down-regulating genes involved in lignin biosynthesis in rice. A strategy of down-regulating the lignin biosynthetic genes cinnamate 4-hydroxylase (C4H), hydroxycinnamoyl CoA: shikimate hydroxycinnamoyl transferase (HCT), coumarate 3-hydroxylase (C3'H), cinnamoyl CoA reductase (CCR), and cinnamyl alcohol dehydrogenase (CAD) has been used to decrease lignin content in rice. A novel binary vector (TL), in which the truncated lignin gene(s) are driven only by the promoter with no terminator, was constructed and transferred to Agrobacterium tumefaciens for infecting rice calli. Putative transgenic rice plants were regenerated after selection in regeneration medium (N6 medium containing 2.0 mg/L kinetin, 0.02 mg/L NAA, 100 mg/L geneticin (G418) and 500 mg/L carbenicillin) and confirmed by polymerase chain reaction (PCR). Seeds were collected and germinated on MS medium containing 200 mg/L geneticin for segregation analysis. RNA was isolated from the segregated plants and real-time qPCR was conducted. The results indicated a 50% reduction in the expression of some of the genes (such as CAD) involved in lignin biosynthesis, which may potentially lead to reduced lignin for efficient conversion of rice straw to cellulosic biofuel. In addition, a combination of protein sequence phylogeny on several genes was applied.
Several genes that were strongly supported through bioinformatics analysis as involved in lignin biosynthesis were confirmed by gene silencing studies, in which lignin levels were reduced as a result of targeting a single gene. AR-BIC-15*

Sexual dimorphism in the expression of genes encoding drug metabolizing enzymes/transporters may influence a drug’s disposition in adult F344 rats
Vikrant Vijay, Kejian Wang, Qiang Shi, James C Fuscoe
Personalized Medicine Branch, Division of Systems Biology, National Center for Toxicological Research, USFDA, Jefferson, AR

A crucial step in developing a safe and effective drug is assessing how the body processes the drug. During non-clinical drug development, a drug candidate is often evaluated in adult animals of a single sex (males). If there are age- and/or sex-differences in the enzymes that metabolize the drug, there may be unrecognized age- and/or sex-related differences in the disposition, safety, and efficacy of the drug. Drug metabolizing enzymes and transporters (DME/T) play a major role in a drug’s detoxification, excretion, and/or activation, and thus differences in DME/T expression profiles may play a key role in drug safety. Therefore, a rat (F344) model was used to identify differences in the basal hepatic transcriptional profiles of DME/T genes in adult males and females at 4 different ages (8, 15, 21 and 52 weeks). A comprehensive list of 336 rat DME/T genes was prepared using Pharmapendium as a key resource. In-house rat liver gene expression data (normalized) were obtained for 298 of the 336 DME/T genes. Genes were considered to be significantly differentially expressed between females (F) and males (M) at any one of the four ages if the t-test p-value was <0.05 and the fold ratio (F/M or M/F) was >2. A total of 112 genes were significantly differentially expressed between the sexes at one or more ages, and 29 genes at all four ages. All 29 genes showed consistently higher expression in either females (12 genes) or males (17 genes) at all four ages. The genes with the highest expression in females compared to males were Abcc3, Cyp3a9, Sult2a1, Adh6 and Cyp2c12, with a range of fold-differences of ~3-378.
Genes with the highest expression in males compared to females were Cyp2a2, Sult1e1, Cyp2c11, Cyp2c13 and Cyp3a2 with a range of fold-differences of ~127-3876. The 29 enzymes encoded by these differentially expressed genes metabolize more than 600 drugs. Based on these findings, the disposition of these drugs may be different in the two sexes. In vivo studies in rats will be conducted to confirm these predictions of differential disposition of selected drugs metabolized by the differentially expressed enzymes. Once confirmed by in vivo studies, the results could be translated to humans in order to identify potential sexually dimorphic drug safety issues. AR-BIC-16
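
The selection rule described in this abstract (t-test p-value <0.05 and fold ratio F/M or M/F >2) can be written as a small filter. This is a minimal sketch, not the authors' pipeline: it assumes expression values on a linear, strictly positive scale and takes the p-value as a precomputed input. The function names and the example values are hypothetical.

```python
def is_sex_dimorphic(mean_female, mean_male, p_value,
                     p_cutoff=0.05, fold_cutoff=2.0):
    """Apply the stated criteria: p < 0.05 and fold ratio (F/M or M/F) > 2."""
    fold = max(mean_female / mean_male, mean_male / mean_female)
    return p_value < p_cutoff and fold > fold_cutoff


def ages_with_difference(per_age):
    """Return the ages (in weeks) at which a gene passes the filter.

    per_age maps age -> (mean_female, mean_male, p_value).
    """
    return [age for age, (f, m, p) in sorted(per_age.items())
            if is_sex_dimorphic(f, m, p)]


# Hypothetical gene measured at the four study ages (8, 15, 21, 52 weeks).
example_gene = {
    8: (120.0, 40.0, 0.001),   # 3-fold higher in females, significant
    15: (90.0, 50.0, 0.010),   # only 1.8-fold, fails the fold cutoff
    21: (150.0, 30.0, 0.200),  # 5-fold but not significant
    52: (30.0, 200.0, 0.004),  # higher in males, significant
}
print(ages_with_difference(example_gene))
```

A gene would count toward the "all four ages" group only if this list contained every age and the direction of the difference were the same at each one; the direction check is left out of the sketch.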

Data Mining for Signal Detection from Adverse Event Reporting System Database
Weizhong Zhao1,2, Zhichao Liu1, Yuping Wang1, James J. Chen1, Wen Zou1*
1 Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, US; 2 College of Information Engineering, Xiangtan University, Xiangtan, Hunan Province, China

The FDA centers receive reports from consumers, health care professionals, manufacturers, and others regarding the safety of various regulated products, such as drugs, vaccines, artificial hearts, surgical lasers, and nutritional supplements. It is a challenge to extract the information in these reports for better assessment of product safety and rapid detection of adverse event signals. In this study, we collected adverse event reports in the FDA Adverse Event Reporting System (FAERS) from the first quarter of 2004 to the first quarter of 2014. In the preprocessing step, we cleaned the dataset and normalized the drug names using RxNorm, a standard nomenclature developed by the United States National Library of Medicine (NLM). The empirical Bayes geometric mean (EBGM) approach was used to identify safety signals in the adverse event reports related to 996 FDA-approved drugs. New safety signals were identified by comparison with the currently available information in various sources. The outcome of this study is expected to enhance the information available to the decision-making process for drug safety detection and postmarketing surveillance. Key words: drug safety signal, adverse event, EBGM, data mining, postmarketing surveillance. AR-BIC-17
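
EBGM is an empirical Bayes shrinkage of the observed-to-expected reporting ratio for a drug-event pair. The sketch below computes only that raw ratio, under the usual independence assumption for the expected count. It does not fit the prior or compute the EBGM itself, and it is not the authors' implementation. The function name and all counts are hypothetical.

```python
def relative_reporting_ratio(n_drug_event, n_drug, n_event, n_total):
    """Observed / expected count for one drug-event pair.

    The expected count assumes the drug and the event are reported
    independently: E = (reports with drug) * (reports with event) / (all reports).
    """
    expected = n_drug * n_event / n_total
    return n_drug_event / expected


# Hypothetical counts: 20 reports mention both the drug and the event,
# 200 mention the drug, 1,000 mention the event, out of 100,000 reports.
rr = relative_reporting_ratio(n_drug_event=20, n_drug=200,
                              n_event=1000, n_total=100000)
print(rr)
```

The raw ratio is unstable when the expected count is small, which is the motivation for shrinking it toward a prior as the EBGM approach does.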

Systematically identifying and annotating long non-coding RNAs
Dan Li, Mary Yang
University of Arkansas at Little Rock, Little Rock, Arkansas

Long non-coding RNAs (lncRNAs) have been shown to play important roles in various biological processes and have been implicated in disease. Although lncRNAs have gained substantial attention in recent years, their regulatory mechanisms remain to be elucidated. High-throughput RNA sequencing (RNA-Seq) provides an unprecedented ability to annotate lncRNA transcripts, which can potentially advance our understanding of their biological functions. Here, we built and compared two RNA-Seq processing pipelines using reference-guided and de novo strategies. By analyzing RNA-Seq data from brain tissues of 14 individuals, we identified 33,038 and 41,241 novel transcripts using TopHat-Cufflinks and Trinity-GMAP, respectively. Over 94% of the novel transcripts identified by both approaches are predicted to be lncRNAs according to their canonical features. Notably, 61.1% and 75.2% of the lncRNAs obtained from the two approaches are unique, suggesting that a combination of reference-based and de novo assembly may lead to more comprehensive discovery of novel transcripts. Furthermore, we found that 2,268 and 2,436 lncRNAs (from reference-based and de novo assembly, respectively) were transcribed from bidirectional promoters shared with their closest protein-coding genes. The majority of lncRNAs (99.8%) are under-expressed in our dataset. Of these protein-coding genes, 2,011 of 2,268 (88.7%) are under-expressed, whereas 257 of 2,268 (11.3%) are up-regulated. Applying DAVID analysis to the up-regulated genes, we found 12 genes implicated in Alzheimer’s disease, 23 with cell cycle functions and 9 with anti-apoptosis functions. Our results demonstrate that lncRNAs regulate genes associated with Alzheimer’s disease and cancer. AR-BIC-18

Bioinformatics challenges for research centers in Arkansas: ACNC case
Horacio Gomez-Acevedo1,2, Brian D. Piccolo1,2, Sudeepa Bhattacharyya2
1 Arkansas Children’s Nutrition Center; 2 Department of Pediatrics, UAMS

The use of high-throughput technologies in basic and translational research has grown steadily at the Arkansas Children’s Nutrition Center (ACNC). Genomic research has included a broad spectrum of technologies, including RNA-seq, ChIP-Seq, Methyl-Seq, the HumanMethylation450 BeadChip and traditional Affymetrix microarrays. The center has also acquired a UHPLC-Q Orbitrap to carry out metabolomics and lipidomics analyses. This increase in big data collection has concomitantly created bioinformatics challenges at different levels, namely keeping up with changes in software, methodologies, statistical approaches, data management, data storage, and data presentation. Based on our experience at ACNC, we highlight some of our current solutions to these challenges in genomics and metabolomics. We also present bottleneck areas in which a more integrative collaboration with other centers or institutes in the region may synergistically increase research quality in the life sciences as well as in bioinformatics. AR-BIC-19

Building a computational evolution system to identify genes of interest in multi-class, RNA-seq data
Nathan Crabtree1, John Bowyer1, Nysia George1, Jason Moore2
1 National Center for Toxicological Research; 2 Dartmouth College

Computational evolution systems (CESs) are knowledge discovery engines that use post-processing, Pareto optimization, and expert knowledge to identify novel, unexpected, and interesting relationships in large datasets. CESs have been developed to identify single nucleotide polymorphisms (SNPs) that are associated with prostate cancer. Existing CESs discriminate between binary-class datasets, e.g. treatment vs control or healthy vs diseased. Although previous work provides a great foundation, technological advancements and complex experimental designs have made it necessary to accommodate datasets with multiple classes. Multi-class discrimination can be done using a one-versus-one or one-versus-all approach, in which the dataset is broken down into multiple binary datasets. The other, better approach is to discriminate between all classes simultaneously. In this study, we develop a CES for multi-class data using the simultaneous approach. We demonstrate the performance of the proposed CES on a multi-class RNA sequencing (RNA-seq) dataset that was generated from blood samples harvested from rats in five different treatment groups. Results of the multi-class CES were compared to other machine learning approaches, including random forests (RF) and support vector machines (SVM). Methods were evaluated based on pre-processing strategies such as minimum redundancy maximum relevancy and expert knowledge. Classifiers were assessed based on accuracy and their ability to identify discriminant genes that play a role in immune system function. AR-BIC-20
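
The two decomposition strategies named in this abstract, one-versus-all and one-versus-one, can be illustrated by showing how a multi-class label vector is broken into binary problems. This is a minimal sketch of the decomposition step only, not of the CES or of the simultaneous approach the authors adopt. The group names G1 to G5 and the function names are hypothetical.

```python
from itertools import combinations


def one_vs_all(labels):
    """One binary label vector per class: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def one_vs_one(labels):
    """One binary problem per pair of classes, restricted to their samples.

    Returns {(a, b): (sample_indices, binary_labels)} with 1 meaning class a.
    """
    classes = sorted(set(labels))
    tasks = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        tasks[(a, b)] = (idx, [1 if labels[i] == a else 0 for i in idx])
    return tasks


# Hypothetical labels: two samples from each of five treatment groups.
labels = ["G1", "G2", "G3", "G4", "G5"] * 2
ova = one_vs_all(labels)
ovo = one_vs_one(labels)
print(len(ova), "one-versus-all problems;", len(ovo), "one-versus-one problems")
```

With five classes the decomposition yields 5 one-versus-all problems or 10 one-versus-one problems, each needing its own classifier, whereas the simultaneous approach trains a single model over all classes.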

Structure and specificity of L,D-Transpeptidase from Mycobacterium tuberculosis
Kuppan Gokulan1, Sangeeta Khare1, Carl E. Cerniglia1, Steven L. Foley1, and Kottayil I. Varughese2
1* Division of Microbiology, National Center for Toxicological Research, US-FDA, 3900 NCTR Road, Jefferson, AR 72079, USA; 2* Department of Physiology & Biophysics, University of Arkansas for Medical Sciences, #750, 4301 W. Markham St., Little Rock, AR 72205-7199, USA

The final step of peptidoglycan (PG) synthesis in all bacteria is the formation of cross-linkages between PG stems. The cross-linking between amino acids in different PG chains gives the peptidoglycan cell wall a 3-dimensional structure and adds strength and rigidity to it. There are two distinct types of cross-linkages in bacterial cell walls. D,D-transpeptidase (D,D-TP) generates the classical 4→3 linkages and L,D-transpeptidase (L,D-TP) generates the non-classical 3→3 peptide cross-linkages. The percentage of 3→3 cross-linkages is higher in non-replicating and multi-drug resistant bacteria than in replicating and drug-susceptible bacteria. The penicillin and cephalosporin classes of β-lactams cannot inhibit L,D-TP function; however, carbapenems inactivate it. We analyzed the structure of L,D-TP in the apo form and in complex with meropenem and imipenem. The periplasmic region of L,D-TP folds into three domains. The catalytic residues are situated in the C-terminal domain. The acylation reaction occurs between carbapenem antibiotics and the catalytic Cys-354, forming a covalent complex. This adduct formation mimics the acylation of L,D-TP with the donor PG stem. A novel aspect of this study is that in the crystal structures of the apo form and the carbapenem complexes, the N-terminal domain has a muropeptide unit non-covalently bound to it. Another interesting observation is that the calcium complex crystallized as a dimer through head and tail interactions between the monomers.
Importance: Tuberculosis continues to be a major global health problem due to the emergence of drug-resistant, multi-drug-resistant and persistent bacteria. The present study is aimed at understanding the nature of drug resistance associated with L,D-TP and gaining insights for designing novel antibiotics against multi-drug resistant bacteria. Most of the β-lactam antibiotics effective against D,D-TP are ineffective against L,D-TP. Of the two carbapenem antibiotics, imipenem is more effective than meropenem. Our crystal structures show that imipenem has stronger interactions at the active site of L,D-TP than meropenem, and our analysis provides clues for designing more potent antibiotics. AR-BIC-21

AR-BIC-22

A novel clustering approach to find biomarkers in breast cancer subtypes based on expression network profiles
Leihong Wu1, Zhichao Liu1, Joshua Xu1, Minjun Chen1, Hong Fang2, Weida Tong1, Wenming Xiao1
1 Division of Bioinformatics and Biostatistics, 2 Office of Scientific Coordination, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, 72079, USA

Introduction: Identifying biomarkers in breast cancer subtypes for improved clinical prognosis and precision treatment is a major goal of breast cancer research. Gene expression analysis has long been applied to find biomarkers but remains challenging in breast cancer owing to the disease's high heterogeneity. Recent advances in network-based methodologies offer an enhanced approach to systematically studying breast cancer subtypes and identifying biomarkers at a whole-genome scale. Result: A network-based clustering algorithm was developed for breast cancer subtyping analysis with gene expression data. When the algorithm was applied to the MAQC-II breast cancer datasets, the resulting clusters of breast cancer samples were highly enriched for clinical receptor status. In detail, two clusters were enriched with HER2+ and triple negative breast cancer (TNBC), respectively, while the other three clusters mostly contained ER+ tumor samples. In addition, we found that PSMD3, STARD3 and GRB7 were highly up-regulated in the HER2+ cluster, while UCHL1, MUC16, NRTN, ART3, CXCL10, CISD1, DBF4, KRT16, MSLN and NCAPD2 were highly up-regulated in the TNBC cluster; these genes could be potential biomarkers in these breast cancer subtypes. These potential biomarkers were then verified with independent breast cancer datasets generated using either microarray or RNA-seq platforms. Conclusion: The network-based clustering algorithm can provide an enhanced capability to identify biomarkers in breast cancer subtypes, which may improve breast cancer prognosis and precision treatment. AR-BIC-23

Making Tractable the Use of Vast Quantities of Regulatory-Related Textual Data
Ke Yu, Yijun Ding, Weizhong Zhao, Shi-Heng Wang, Wen Zou, James J. Chen, Roger Perkins, and Weida Tong*
Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079

For FDA to carry out its regulatory mission, prodigious quantities of largely unstructured or poorly structured textual information must be digested and interpreted. Agency lore describes new drug applications before the digital age arriving in 18-wheel tractor trailers. Post-market safety surveillance of digital media constitutes untold terabytes of unstructured textual data. Where expert human eyes are too few, slow and/or expensive, the means to sift out information germane to regulatory questions is paramount. Probabilistic topic modeling offers a viable approach, where unstructured documents are characterized as probability distributions of latent topic themes that, in turn, are probability distributions of words. With such a model, the untenable process of searching and reading for answers to a regulatory question in a vast corpus reduces to more careful scrutiny of a small set of documents thematically related to the question. To test the effectiveness and validity of topic modeling, we constructed a ground truth data set with 59,201 abstracts from PubMed that contained 39 tobacco use-related themes and two entirely unrelated negative control themes. Latent Dirichlet allocation (LDA) and Pachinko Allocation Model (PAM) algorithms were separately applied to build topic models with the ground truth data set. Both approaches segregated documents into the proper thematic truth categories, even those containing small fractions (<0.1%) of the documents, demonstrating high specificity and sensitivity of thematic characterization.
We found that the sub-topics in PAM are highly aligned with the LDA topics, and the differentiation of super-topics in PAM is not shown in this study, which might be data-dependent. The findings demonstrate the applicability of topic modeling in exploring FDA textual data, which provides a promising way to facilitate the processing of accumulated documents at FDA. AR-BIC-24
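
The two-level structure described in this abstract, documents as distributions over topics and topics as distributions over words, implies that the probability of a word in a document is a topic-weighted mixture. The sketch below illustrates only that mixture. It does not fit LDA or PAM, and the topic names, words and probabilities are made up for illustration.

```python
def word_probability(word, doc_topics, topic_words):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(p_topic * topic_words[topic].get(word, 0.0)
               for topic, p_topic in doc_topics.items())


# Hypothetical fitted model with one tobacco-related theme and one control theme.
topic_words = {
    "smoking_cessation": {"nicotine": 0.5, "quit": 0.5},
    "negative_control": {"galaxy": 1.0},
}
# Hypothetical document that is 75% about cessation and 25% about the control theme.
doc_topics = {"smoking_cessation": 0.75, "negative_control": 0.25}

p_nicotine = word_probability("nicotine", doc_topics, topic_words)
print(p_nicotine)
```

In practice the document-topic weights are what make retrieval tractable: documents can be ranked by their weight on the topics related to a regulatory question rather than read in full.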

Peptide Sequence Patterns Related to Omega Angles in Cis Conformation
Sidney Smith1, Adrian Hunt1, Karl Walker1, Jerry Darsey2
1 Department of Mathematics and Computer Science, University of Arkansas at Pine Bluff; 2 Chemistry Department, University of Arkansas at Little Rock

Predicting the backbone conformation of proteins involves estimating the configurations of three torsion angles: phi, psi, and omega. Most of the flexibility in protein backbones is accounted for by the torsion angles phi and psi because they correspond to covalent single bonds. Due to the partial double-bond character of peptide bonds, the omega torsion angles of a protein are largely found in the trans conformation (close to 180°). The cis-trans isomerization of omega dihedral angles is directly involved in the folding of proteins and in many functional aspects of proteins, such as auto-inhibition control, channel gating, membrane binding, and dimerization interfaces (Craveur et al., 2013). In this study, we have analyzed amino acid sequences relative to omega torsion angles found in the cis conformation (close to 0°) in order to better understand the mechanism of isomerization and to improve the prediction of omega dihedrals in the cis conformation. AR-BIC-25
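
As an illustration of the quantity being analyzed, omega is the dihedral angle defined by the four backbone atoms CA(i), C(i), N(i+1) and CA(i+1). The sketch below computes a dihedral from four 3D points and labels it cis or trans. The 30° tolerance around 0° is an assumption made for this example, not a cutoff taken from the abstract, and the coordinates are idealized planar points rather than real protein data.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle about the p1-p2 bond, in degrees."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def omega_conformation(omega, cis_tolerance=30.0):
    """Label omega as 'cis' when within the (assumed) tolerance of 0 degrees."""
    return "cis" if abs(omega) < cis_tolerance else "trans"


# Idealized planar geometries for CA(i), C(i), N(i+1), CA(i+1).
cis_omega = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans_omega = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
print(omega_conformation(cis_omega), omega_conformation(trans_omega))
```

Applied to every consecutive residue pair in a structure, a classifier like this would give the cis positions whose flanking sequences can then be examined.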