Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (presentation)

Slide 1: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

May 14, 2014

CBIIT

Slides: slideshare.net/andrewsu

Citizen Science!

Slide 2: Few genes are well annotated…
Data: NCBI, February 2013


Slides 3-4: … because the literature is sparsely curated?


Slide 5: 311,696 articles (1.5% of PubMed) have been cited by GO annotations


Slide 6: Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Slide 7: The Long Tail is a prolific source of content
News: Newspapers; Blogs
Video: TV/Hollywood; YouTube
Product reviews: Consumer reports; Amazon reviews
Food reviews: Food critics; Yelp
Talent judging: Olympics; American Idol


Slide 8: Wikipedia is reasonably accurate


Slide 9: Wikipedia has breadth and depth
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008


Slide 10: We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Slide 11: From crowdsourcing to structured data


Slide 12: Filtering, extracting, and summarizing PubMed
Documents
Concepts
Review article


Slide 13: Filtering, extracting, and summarizing PubMed
Documents
Concepts


Slide 14: Wiki success depends on positive feedback
[Diagram labels: Gene Wiki page utility; number of users; number of contributors. Values shown: 100, 1, 200, 2.]


Slide 15: 10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references

Huss, PLoS Biol, 2008


Slide 16: Gene Wiki has a critical mass of readers
Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011


Slide 17: Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles

Good, NAR, 2011


Slide 18: A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts

Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002


Slide 19: Making the Gene Wiki more computable
Structured annotations
Free text


Slide 20: Filling the gaps in gene annotation
6319 novel GO annotations
2147 novel DO annotations

Slide 21: Gene Wiki content improves enrichment analysis
[Diagram labels: GO term, gene list, concept recognition, PubMed abstracts, enrichment analysis]
Example: axon guidance (GO:0007411)

264 genes

Linked genes through PubMed

P = 1.55 E-20

811 articles


Slide 22: Gene Wiki content improves enrichment analysis
[Diagram labels: GO term, gene list, concept recognition, PubMed abstracts, enrichment analysis]
Example: muscle contraction (GO:0006936)

87 genes

Linked genes through PubMed

Linked genes through PubMed + Gene Wiki

P = 1.0

P = 1.22 E-09

251 articles

87 articles
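
The slides report enrichment p-values but do not say which test produced them. As a minimal sketch, assuming a one-sided hypergeometric test (a common choice for gene-list enrichment, not confirmed by the source), the calculation could look like this. The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, overlap: int) -> float:
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: total number of genes considered
    annotated:  genes in the population linked to the GO term
    sample:     size of the gene list being tested
    overlap:    genes in the list that are linked to the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of every outcome at least as extreme as the one seen.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, chosen only to show the call.
p = hypergeom_enrichment_p(population=20000, annotated=250, sample=100, overlap=12)
print(f"p = {p:.3g}")
```

Adding Gene Wiki links changes the `annotated` and `overlap` counts for a term, which is one way a term could move from non-significant to significant.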


Slide 23: Gene Wiki content improves enrichment analysis
[Plot axes: p-value (PubMed only) versus p-value (PubMed + GW); highlighted point: muscle contraction]

More significant PubMed + GW

More significant PubMed only

Good BM et al., BMC Genomics, 2011


Slide 24: Making the Gene Wiki more computable
Structured annotations
Free text
Analyses


Slide 25: Making the Gene Wiki more computable
Structured annotations
Free text
Databases


Slide 26: Expansion through outreach and incentives


Slide 27: Cardiovascular Gene Wiki Portal
CAMK2D -- CaM kinase II subunit delta
CSRP3 -- Cysteine and glycine-rich protein 3
GJA1 -- Gap junction alpha-1 protein / Connexin-43
MAPK14 -- Mitogen-activated protein kinase 14 / p38-α
MYL7 -- Myosin regulatory light chain 2, atrial isoform
MYL2 -- Myosin regulatory light chain 2, ventricular/cardiac isoform
PECAM1 -- Platelet endothelial cell adhesion molecule/CD31
RYR2 -- Ryanodine receptor 2
ATP2A2 -- Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 / SERCA2
TNNI3 -- Troponin I, cardiac muscle
TNNT2 -- Troponin T, cardiac muscle

Peipei Ping
UCLA


Slide 28: The Long Tail of scientists is a valuable source of information on gene function


Slide 29: From crowdsourcing to structured data


Slide 30: Gene databases are numerous and overlapping
… and hundreds more …


Slide 31: Why is there so much redundancy?
Users
Requests
Resources
Time
Community development
BioGPS emphasizes community extensibility


Slide 32: Why do developers define the gene report view?
BioGPS emphasizes user customizability


Slide 33: Community extensibility and user customizability


Slides 34-39: Utility: A simple and universal plugin interface
Total of > 540 gene-centric online databases registered as BioGPS plugins

Slide 40: Users: BioGPS has critical mass
Daily pageviews


Slide 41: Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains

Slide 42: Gene Annotation Query as a Service
http://mygene.info
High performance
3M hits/month
Highly scalable
13k species
16M genes
Weekly data updates
JSON output
REST interface
Python/R/JS libraries
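
As an illustration of the REST interface and JSON output, here is a minimal sketch using only the Python standard library rather than the dedicated client libraries. The `/v3/query` path and the `q`, `species` and `fields` parameters reflect the public MyGene.info API as I understand it today; the slide itself only gives `http://mygene.info`, so treat the exact endpoint as an assumption. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; the slide only names http://mygene.info.
MYGENE_QUERY_URL = "https://mygene.info/v3/query"


def build_query_url(term, species="human",
                    fields=("symbol", "name", "entrezgene")):
    """Build the GET URL for a gene query without contacting the service."""
    params = {"q": term, "species": species, "fields": ",".join(fields)}
    return f"{MYGENE_QUERY_URL}?{urlencode(params)}"


def query_genes(term, **kwargs):
    """Send the query and return the list under the "hits" key of the JSON reply."""
    with urlopen(build_query_url(term, **kwargs), timeout=30) as response:
        return json.load(response).get("hits", [])


if __name__ == "__main__":
    # Needs network access; CDK9 is the gene named on a later slide.
    for hit in query_genes("CDK9"):
        print(hit.get("entrezgene"), hit.get("symbol"), hit.get("name"))
```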

Slide 43: The Long Tail of bioinformaticians can collaboratively build a gene portal.


Slide 44: From crowdsourcing to structured data


Slide 45: The biomedical literature is growing fast


Slide 46: Information Extraction
Find mentions of high level concepts in text

Map mentions to specific terms in ontologies

Identify relationships between concepts
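
The first two steps (finding mentions and mapping them to ontology terms) can be sketched with a plain dictionary lookup. This is a toy illustration, not the method used in the talk: real concept recognizers handle synonyms, abbreviations and ambiguity. The lexicon holds only the two GO terms that appear elsewhere in this deck, and `find_concepts` is a hypothetical name.

```python
import re

# Toy lexicon: phrase -> ontology term identifier.
LEXICON = {
    "axon guidance": "GO:0007411",
    "muscle contraction": "GO:0006936",
}


def find_concepts(text, lexicon=LEXICON):
    """Return sorted (start, end, mention, term_id) tuples for lexicon phrases in text."""
    hits = []
    for phrase, term_id in lexicon.items():
        # Whole-phrase, case-insensitive match.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), term_id))
    return sorted(hits)


sentence = "Netrin signalling controls axon guidance in the developing spinal cord."
for start, end, mention, term_id in find_concepts(sentence):
    print(f"{mention!r} at {start}-{end} -> {term_id}")
```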

Slide 47: Disease mentions in PubMed abstracts
NCBI Disease corpus
793 PubMed abstracts (100 development, 593 training, 100 test)
12 expert annotators (2 annotate each abstract)

6,900 “disease” mentions

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.


Slide 48: Four types of disease mentions
Specific Disease: “Diastrophic dysplasia”
Disease Class: “Cancers”
Composite Mention: “prostatic , skin , and lung cancer”
Modifier: ..the “familial breast cancer” gene , BRCA2..

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.


Slide 49: Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

Slides 50-51: The Turk
http://en.wikipedia.org/wiki/The_Turk


Slide 52: Amazon Mechanical Turk (AMT)
For each task, specify:
a qualification test
how many workers per task
how much we will pay per task

Manages:
parallel execution of jobs
worker access to tasks via qualification tests
payments
task advertising

1. Create tasks

2. Execute

3. Aggregate


Slide 53: Instructions to workers
Highlight all diseases and disease abbreviations
“...are associated with Huntington disease ( HD )... HD patients received...”
“The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
Highlight the longest span of text specific to a disease
“... contains the insulin-dependent diabetes mellitus locus …”
Highlight disease conjunctions as single, long spans.
“... a significant fraction of familial breast and ovarian cancer , but undergoes…”
Highlight symptoms - physical results of having a disease
“XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Slide 54: Qualification test
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”

Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”

Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”

26 yes / no questions


Slide 55: Qualification test results


Slide 56: Simple annotation interface
Click to see instructions
Highlight disease mentions


Slide 57: Experimental design
Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
$0.06 per Human Intelligence Task (HIT)
HIT = annotate one abstract from PubMed
5 workers annotate each abstract
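
A small worked calculation of what this design implies in worker rewards, assuming each of the 5 workers receives the $0.06 reward for every abstract. It covers rewards only; platform fees and any other costs are not modeled, so the result need not match a reported total. Amounts are kept in integer cents to avoid floating-point rounding.

```python
ABSTRACTS = 593
WORKERS_PER_ABSTRACT = 5
REWARD_CENTS_PER_HIT = 6  # $0.06


def reward_budget_cents(abstracts, workers_per_abstract, reward_cents):
    """Total worker rewards in cents: one reward per worker per abstract."""
    return abstracts * workers_per_abstract * reward_cents


assignments = ABSTRACTS * WORKERS_PER_ABSTRACT
total_cents = reward_budget_cents(ABSTRACTS, WORKERS_PER_ABSTRACT, REWARD_CENTS_PER_HIT)
print(f"{assignments} annotations, ${total_cents / 100:.2f} in worker rewards")
```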


Slide 58: Aggregation function based on simple voting
1 or more votes (K=1)
K=2
K=3
K=4
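
The voting rule can be sketched as follows: a highlighted span is kept if at least K workers marked it. This is a minimal sketch that assumes votes are counted on exact (start, end) character spans; the slide does not say how partially overlapping highlights were handled. The function name and the sample spans are hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Keep every span highlighted by at least k workers.

    worker_spans: one collection of (start, end) spans per worker.
    """
    # set(spans) ensures a worker votes at most once per span.
    votes = Counter(span for spans in worker_spans for span in set(spans))
    return {span for span, count in votes.items() if count >= k}


# Five hypothetical workers annotating the same abstract.
workers = [
    {(0, 10), (20, 30)},
    {(0, 10)},
    {(0, 10), (20, 30)},
    {(40, 45)},
    {(0, 10)},
]
for k in range(1, 6):
    print(k, sorted(aggregate_by_votes(workers, k)))
```

Raising K trades recall for precision: K=1 keeps anything any worker marked, while higher K keeps only spans with broad agreement.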


Slide 59: Comparison to gold standard
593 documents
7 days
17 workers
$192.90
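
The comparison to the gold standard is reported as an F measure. A minimal sketch of that score, assuming the balanced F1 and exact span matching (the slides do not state the matching criterion); the function name and the sample spans are hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F measure over exactly matching spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical aggregated crowd spans versus expert spans for one abstract.
crowd = {(0, 10), (20, 30), (40, 45)}
expert = {(0, 10), (20, 30), (50, 60), (70, 80)}
p, r, f = precision_recall_f(crowd, expert)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```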


Slides 60-63: Comparison to gold standard
[Figure: F measure against the gold standard. Labels recovered from the plot: Max F = 0.69, 0.79, 0.82, 0.85, 0.85, 0.85; k = 1 to 8; N = 3, 6, 9, 12, 15, 18.]

Slide 64: Comparisons to text-mining algorithms


Slide 65: Comparisons to human annotators
Average level of agreement between expert annotators (stage 1)

F = 0.76


Slide 66: Comparisons to human annotators
Average level of agreement between expert annotators (stage 1): F = 0.76
Average level of agreement between expert annotators (stage 2): F = 0.87

Slide 67: In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.

Slide 68: Information Extraction
Find mentions of high level concepts in text

Map mentions to specific terms in ontologies

Identify relationships between concepts

Slide 69: Annotating the relationships
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

therapeutic target

subject

predicate

object

GENE

DISEASE
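
The subject / predicate / object labels suggest a simple data structure for a captured relationship. A minimal sketch follows; the class names are hypothetical, and choosing "hematologic malignancies" as the object is just one plausible reading of the passage, not an annotation taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    text: str  # the mention as it appears in the passage
    kind: str  # concept type, e.g. "GENE" or "DISEASE"


@dataclass(frozen=True)
class Relation:
    subject: Concept
    predicate: str
    obj: Concept

    def as_triple(self):
        """Return the relation as a plain (subject, predicate, object) tuple."""
        return (self.subject.text, self.predicate, self.obj.text)


relation = Relation(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target",
    obj=Concept("hematologic malignancies", "DISEASE"),  # illustrative choice
)
print(relation.as_triple())
```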


Slide 70: Citizen Science at Mark2Cure.org


Slide 71: The Long Tail of citizen scientists can collaboratively annotate biomedical text.


Slide 72: Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Lynn Schriml, U Maryland
Paul Pavlidis, U British Columbia
Peipei Ping, UCLA
Many Wikipedia editors
WP:MCB Project

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Citizen Science logo based on http://thenounproject.com/term/teamwork/39543/


Slide 73: Related AMT work
[1] Zhai et al. 2013 used a similar protocol to tag medication names in clinical trial descriptions. F = 0.88 compared to gold standard.
[2] Burger et al. used microtask workers to identify relationships between genes and mutations.
[3] Aroyo & Welty used workers to identify relations between concepts in medical text.

[1] Zhai H. et al. (2013) “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing.” J Med Internet Res
[2] Burger, John, et al. (2014) “Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing.” Mitre technical report
[3] Aroyo, Lora, and Chris Welty. (2013) “Harnessing disagreement in crowdsourcing a relation extraction gold standard.” Tech. Rep. RC25371 (WAT1304-058), IBM Research

