First 2 days: Introduction

Biomedical applications on high-performance computing platforms
Acceleration of biomedical applications on high-performance graphics platforms
Oswaldo Trelles
PROGRAMA CAPES/DGU EDITAL No 040/2012
PROGRAMA HISPANO-BRASILEÑO DE COOPERACION INTERUNIVERSITARIA
LNCC, Petrópolis, Brasil, 2013
O.Trelles, PhD

Contents
Days: 1 and 2
Contents description:
• Course presentation (15 mins)
• General overview
• DNA sequencing / Assembly / Annotation
• Sequence analysis concepts
• Algorithms' internals: BLAST, MUMmer, HSPs identification
• Introduction to gene expression data analysis
• Sequential code internals
Speakers: ATR. Vasconcellos, Oswaldo Trelles
This document provides an overview of the introductory session (first two days).

Contents: headlines
• Contents presentation
• Basic concepts: biology, bioinformatics, HPC
• Sequence analysis internals
• Hands-on: sequential code

Basic concepts on Biology
• All living organisms are made up of cells.
• Each cell contains the full genetic material (the genome: a DNA sequence over {A, C, G, T}).
• The genome is organised in chromosomes.
• Chromosomes contain genes.
• Genes are the instructions to synthesize proteins (genetic code).
• The amount of each protein is regulated in response to changes in the environment.
• Metabolism can involve dozens or hundreds of catalysed reactions in pathways.

DNA sequencing
DNA is a long linear string of nucleotides.
Sequence: a partial or the full genome (a string of a given length).
Sequencing: the process of determining the exact order of the letters in the sequence.
Assembly: rebuild the original sequence from the "reads".
Solutions: an open issue.
• De novo: a new genome is sequenced.
• Mapping: for re-sequenced genomes.
Copy number variations, SNPs, etc. make the problem even more interesting (if possible).

Gene identification & function
Problems:
• Prokaryote & eukaryote cells
• Intergenic regions
• Coding regions
• Small portion of genome
• Exons and introns
• Conservation (mutations)
• Transposons and repeats
• Alternative splicing
Figure: the coding exons of the sequence "prostate-specific antigen promoter RT isolated from a patient with prostate cancer" are shown in pink.

Functional annotation
Biological sequence annotation is the process of finding, recovering and incorporating relevant biological information available in public databases in relation to an individual sequence or a massive collection of sequences.
It gives new insights about function, cellular location, phylogeny, biological process and/or protein structure, etc.
In general, it is the next step in genome and EST sequencing.

Transcriptomics
Transcripts (RNA) data.
Genes modify their expression levels in response to environmental stimuli, tissue location, time course...
Variations in gene expression patterns can have profound effects on biological functions, and are at the core of altered physiologic and pathologic processes.
Large-scale technologies are changing our view of biological processes, including their dynamics.
Goal: identify genes that share expression patterns. Such genes might be regulated together and are assumed to be in the same genetic pathway.

Gene-expression data analysis
Error removal: for reproducibility, reliability, compatibility and standardization of data.
Differential expression: identify over- and under-expressed genes.
The gene "expression profile" represents the different levels of expression across different experiments.
Each gene has its own particular expression profile, but it can be quite similar to the profiles of other genes.
Clustering: identify genes that share a similar expression profile. This requires:
• a distance measure (Euclidean, correlation, ...)
• a method to proceed (hierarchical, k-means, partitional, etc.)

Basic concepts on Bioinformatics
Bioinformatics: computer science applied to the processing of biological data.
Different areas are associated with the different types of data.
Figure (areas shown): gene identification, protein sequences, sequence comparison, cloning, comparative statistics, sequencing, metabolic pathways, sequence / structure / function, phylogeny, databases, new technologies, DNA sequence, DNA structure, molecular modelling, structure comparison, statistical analysis, protein sequence, protein structure, computer programming, evolutionary studies, gene expression, web servers, service integration, metabolites.

Basis of parallel programming
• Parallel architecture taxonomy
• Parallel programming models
• Sources of inefficiency in parallel programming
• Performance measurements
• Hands-on: bioinformatics applications' internals

Sequence Analysis Internals
The essentials of biological sequence analysis, with a focus on its computational aspects.
Formal definition: a sequence is a string of characters representing DNA nucleotides or protein amino acids.
DNA: A = {a, c, g, t|u}
• Why compare sequences
• How to compare sequences
• Computational aspects

NEW-SEQ   --SARGDFLNAA YALFFMRSHN FGHSDVLPVL
            ||||||||   |||  |||||   ||||||||
KNOWN-SEQ MMSARGDFLN-- YALSLMRSHN DEHSDVLPVL

NEW-SEQ   --CSLKHVAY WDAYQALIYW IKAMNQQTDTSI
            ||||||||     |||||| ||||||||
KNOWN-SEQ DVCSLKHVAY --VFQALIYW IKAMNQQTTLDT

NEW-SEQ   --RPPDDQAF GHHHLPQAMH --SRLYVPS-SK
            |||      |   || |     ||||||| ||
KNOWN-SEQ TIRPPA---- GAFGLPTANT CISRLYVPSMSK

Hands-on: sequential codes (1)
● K-mer frequencies
● Codon usage
● Qnormalization

K-mer analysis: sequential pseudocode (example)

    int main(int ac, char **av) {
        checkParameters(file, K);                     // validate the input file name and K
        seq  = malloc(SEQSIZE);                       // buffer for the whole sequence
        freq = calloc(1UL << (2 * K), sizeof *freq);  // 4^K counters, zero-initialised
        f = fopen(file, "r");
        if (seq == NULL || freq == NULL || f == NULL) return 1;
        Tot = readSeq(f, seq);                        // load the sequence into memory
        fclose(f);
        for (i = 0; i < Tot - (K - 1); i++) {         // one window per k-mer start position
            n = kmerIndex(seq, i, K);                 // index of the k-mer starting at i
            if (n != -1) freq[n]++;                   // -1 means no valid index: skip
        }
        printKmerFreqs(freq, K);
        free(seq);
        free(freq);
        return 0;
    }

Hands-on: sequential codes (1)
Qnormalization
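The Qnormalization exercise is only named in these notes, not described. Assuming it refers to quantile normalization of a gene expression matrix (genes in rows, samples in columns), a minimal sequential sketch in C could look like the one below. The function name quantile_normalize, the Cell struct and the row-major layout are illustrative assumptions, not the course code. Ties are ranked by row order here rather than averaged, which is a simplification.

```c
#include <assert.h>
#include <stdlib.h>

/* One cell of a column: its value and the row (gene) it came from. */
typedef struct { double value; size_t row; } Cell;

static int cmp_cell(const void *a, const void *b) {
    const Cell *x = a, *y = b;
    if (x->value < y->value) return -1;
    if (x->value > y->value) return 1;
    /* Break ties by row so the result does not depend on qsort's ordering. */
    return (x->row > y->row) - (x->row < y->row);
}

/* Quantile-normalise data in place (hypothetical helper).
 * data is row-major: data[g * n_samples + s] holds gene g in sample s.
 * Returns 0 on success, -1 on allocation failure. */
int quantile_normalize(double *data, size_t n_genes, size_t n_samples) {
    assert(data != NULL && n_genes > 0 && n_samples > 0);
    Cell *cells = malloc(n_genes * n_samples * sizeof *cells);
    double *rank_mean = calloc(n_genes, sizeof *rank_mean);
    if (cells == NULL || rank_mean == NULL) {
        free(cells);
        free(rank_mean);
        return -1;
    }

    /* Sort each sample (column) and accumulate the value found at each rank. */
    for (size_t s = 0; s < n_samples; s++) {
        Cell *col = cells + s * n_genes;
        for (size_t g = 0; g < n_genes; g++) {
            col[g].value = data[g * n_samples + s];
            col[g].row = g;
        }
        qsort(col, n_genes, sizeof *col, cmp_cell);
        for (size_t r = 0; r < n_genes; r++)
            rank_mean[r] += col[r].value;
    }
    for (size_t r = 0; r < n_genes; r++)
        rank_mean[r] /= (double)n_samples;

    /* Replace every value by the mean of its rank across samples. */
    for (size_t s = 0; s < n_samples; s++) {
        const Cell *col = cells + s * n_genes;
        for (size_t r = 0; r < n_genes; r++)
            data[col[r].row * n_samples + s] = rank_mean[r];
    }

    free(cells);
    free(rank_mean);
    return 0;
}
```

For example, for three genes and two samples with rows (5, 1), (2, 3), (4, 6), the sorted columns are (2, 4, 5) and (1, 3, 6), so the rank means are 1.5, 3.5 and 5.5, and the normalised rows become (5.5, 1.5), (1.5, 3.5), (3.5, 5.5). After the call, every sample has the same distribution of values.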