Training Courses

The training courses on April 22 and 23 are free of charge for both registered and non-registered participants.

April 22, 2015 (Wednesday)
Conference Hall, Beijing Institute of Genomics, CAS
09:00 - 10:00 WormBase literature curation workflow [Video (575M)]
Xiaodong Wang, WormBase & California Institute of Technology, USA

Synopsis: WormBase is a model organism database containing data about C. elegans and other nematodes. The WormBase literature curation workflow typically begins with downloading bibliographic information from PubMed, followed by PDF acquisition, data type flagging, entity recognition, and fact extraction. Using a combination of manual, semi-automated, and fully automated approaches, including community curation, Perl scripts, Support Vector Machines (SVMs), and the Textpresso information retrieval system, WormBase curators annotate over 30 different data types to support C. elegans-based biomedical research.

10:00 - 11:00 PubChem: a case study for managing big data [PDF]
Yanli Wang, National Center for Biotechnology Information, USA

Synopsis: Chemical genomics (Chemogenomics) systematically screens small molecule libraries to identify drug candidates and chemical probes to characterize protein and gene functions. Advances in RNAi screening technology have enabled genome-wide functional screens for discovering new cellular pathways and therapeutic targets. The PubChem BioAssay database, hosted by the National Center for Biotechnology Information (NCBI) at NIH, serves as a public repository for information generated by Chemogenomics and RNAi research. The integration of PubChem with the other resources at NCBI provides a unique annotation service for NCBI's genomic information. This presentation will describe how this effort enables the retrieval of drug and chemical modulators, as well as biological and therapeutic relevance, for many GenBank records.

11:00 - 12:00 Overview of annotation tools and curation workflow for the GENCODE gene sets [PDF] [Video (448M)]
Mark Thomas, Wellcome Trust Sanger Institute, UK

Synopsis: Combining computational analysis and manual annotation, with experimental validation, GENCODE provides the most comprehensive gene set for human and mouse. With the aim of identifying all gene features, we annotate coding genes, non-coding genes and pseudogenes, with an emphasis on alternative splicing. We utilize a wide range of next-generation sequencing sources including RNAseq data, CAGE and polyAseq analyses, together with proteomic data and comparative analysis. In this workshop, I will provide an insight into the annotation tools we use, highlighting the curational processes required to maintain the integrity of our data. We will discuss the merge process, detailed QC pipelines and the integration of controlled metadata, such as sequence ontology terms, extending to the different biotypes annotated.

April 23, 2015 (Thursday)
Function Room, Building 1, Beijing Friendship Hotel
09:00 - 12:00 Biocuration of GenBank & RefSeq
Ilene Mizrachi and Kim Pruitt, National Center for Biotechnology Information, USA

Synopsis: Curating sequence and literature data for RefSeq and Gene [PDF] [Video (551M)]
The National Center for Biotechnology Information (NCBI) reference sequence (RefSeq) project provides sequence standards for proteins, transcripts, genes, and genomes. The database is generated through processes that leverage computational analysis, collaboration, and curation. This presentation will focus on the vertebrate RefSeq collection. Several annotation groups have built genome annotation pipelines, but no other international database provides the depth and scope of sequence curation that is reflected in the vertebrate RefSeq database. This is because RefSeq curators primarily focus on the accuracy of the transcript and protein sequence, which becomes a high-quality reagent that is used by international genome annotation pipelines. In addition to carrying out deep sequence analysis, RefSeq curators work with collaborators to improve gene and protein names, expand the bibliography that is available in NCBI's Gene resource, add brief functionally relevant summaries, and apply feature annotation to sequence records that provides functional information about regions of the sequence. The presentation will include background information about the RefSeq database, information on the type of sequence review that RefSeq curators engage in, an overview of the analysis tools used, and examples of the biological data content that is added by curation.

Synopsis: NCBI Sequence Repositories: SRA and GenBank [PDF] [Video (644M)]
The National Center for Biotechnology Information (NCBI) hosts the sequence data repositories Sequence Read Archive (SRA) and GenBank for the archiving of DNA and RNA sequence data, annotations and associated metadata. This talk will provide an overview of how sequence data is submitted, how it is processed and how it is retrieved. The SRA and GenBank repositories have experienced significant growth in data deposition over the past few years. New submission systems are being developed to streamline the submission of sequence data and facilitate deposition of rich descriptive information about the biological sample and experimental metadata. The submitted data undergoes a number of quality assurance checks before being released in the archive to ensure that the data and annotations are of high quality. Access to sequence data from the NCBI website will be described, highlighting new resources that aid users in navigating through the vast amount of sequence data and related information. This presentation will describe how the curation staff processes sequence submissions and how the data can be searched for and retrieved for further analysis.

13:30 - 16:30 Biocuration of UniProt & Proteomics Databases
Claire O'Donovan and Sandra Orchard, EMBL-EBI

Synopsis: An introduction to UniProt Knowledgebase curation [PDF] [Video (285M)]
I will present UniProt Knowledgebase curation, including both manual and automatic approaches, and how controlled vocabularies, ontologies, evidence attribution SOPs and QA all ensure data quality and richness for the users and interoperability with other resources.

Synopsis: The importance of Controlled Vocabularies and Data Standards in Biocuration [PDF] [Video (433M)]
Using examples from the IntAct molecular interaction database, the Complex Portal and the Reactome pathways database, I will show how extra value can be added to experimental data by manual annotation. The use of controlled vocabulary terms enables standardised and flexible querying over multiple data resources, and the development of data standards has enabled databases in different countries to collaborate and present the user with an integrated, user-friendly resource.