Research and applications
Editor's choice

Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium

Jyotishman Pathak,1 Kent R Bailey,1 Calvin E Beebe,1 Steven Bethard,2 David S Carrell,3 Pei J Chen,4 Dmitriy Dligach,4 Cory M Endle,1 Lacey A Hart,1 Peter J Haug,5 Stanley M Huff,5 Vinod C Kaggal,1 Dingcheng Li,1 Hongfang Liu,1 Kyle Marchant,6 James Masanz,1 Timothy Miller,4 Thomas A Oniki,5 Martha Palmer,2 Kevin J Peterson,1 Susan Rea,5 Guergana K Savova,4 Craig R Stancl,1 Sunghwan Sohn,1 Harold R Solbrig,1 Dale B Suesse,1 Cui Tao,7 David P Taylor,5 Les Westberg,6 Stephen Wu,1 Ning Zhuo,5 Christopher G Chute1

1 Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
2 Department of Linguistics, University of Colorado, Boulder, Colorado, USA
3 Group Health Research Institute, Seattle, Washington, USA
4 Boston Children's Hospital, Harvard University, Boston, Massachusetts, USA
5 Homer Warner Center for Informatics Research, Intermountain Healthcare, Salt Lake City, Utah, USA
6 Agilex Technologies, Chantilly, Virginia, USA
7 School of Biomedical Informatics, University of Texas Health Sciences Center, Houston, Texas, USA

Correspondence to: Dr Jyotishman Pathak, Mayo Clinic College of Medicine, Rochester, MN 55902, USA; [email protected]

Received 17 April 2013; Revised 7 October 2013; Accepted 11 October 2013; Published Online First 5 November 2013

To cite: Pathak J, Bailey KR, Beebe CE, et al. J Am Med Inform Assoc 2013;20:e341–e348.
ABSTRACT
Research objective To develop scalable informatics
infrastructure for normalization of both structured and
unstructured electronic health record (EHR) data into a
unified, concept-based model for high-throughput
phenotype extraction.
Materials and methods Software tools and
applications were developed to extract information from
EHRs. Representative and convenience samples of both
structured and unstructured data from two EHR systems
—Mayo Clinic and Intermountain Healthcare—were used
for development and validation. Extracted information
was standardized and normalized to meaningful use (MU)
conformant terminology and value set standards using
Clinical Element Models (CEMs). These resources were
used to demonstrate semi-automatic execution of MU
clinical-quality measures modeled using the Quality Data
Model (QDM) and an open-source rules engine.
Results Using CEMs and open-source natural language
processing and terminology services engines—namely,
Apache clinical Text Analysis and Knowledge Extraction
System (cTAKES) and Common Terminology Services
(CTS2)—we developed a data-normalization platform
that ensures data security, end-to-end connectivity, and
reliable data flow within and across institutions. We
demonstrated the applicability of this platform by executing, on a randomly selected cohort of 273 Mayo Clinic patients, a QDM-based MU quality measure that determines the percentage of patients between 18 and 75 years of age with diabetes whose most recent low-density lipoprotein cholesterol test result during the measurement year was <100 mg/dL. The platform identified 21 and
18 patients for the denominator and numerator of the
quality measure, respectively. Validation results indicate
that all identified patients meet the QDM-based criteria.
Conclusions End-to-end automated systems for
extracting clinical information from diverse EHR systems
require extensive use of standardized vocabularies and
terminologies, as well as robust information models for
storing, discovering, and processing that information. This
study demonstrates the application of modular and open-source resources for enabling secondary use of EHR data through normalization into a standards-based, comparable, and consistent format for high-throughput phenotyping to identify patient cohorts.


INTRODUCTION
The Office of the National Coordinator for Health
Information Technology (HIT) in 2010 established
the Strategic Health IT Advanced Research Projects (SHARP) program
to address research and development challenges in
wide-scale adoption of HIT tools and technologies
for improved patient care and a cost-effective healthcare ecosystem. SHARP has four areas of focus1:
security in HIT (SHARPs2; led by University of
Illinois, Urbana-Champaign), patient-centered cognitive support (SHARPc3; led by University of Texas,
Houston), healthcare applications and network
design (SMART4; led by Harvard University), and
secondary use of electronic health records (EHRs)
(SHARPn5; led by Mayo Clinic).
The Mayo Clinic-led SHARPn program aims to
enhance patient safety and improve patient medical
outcomes by enabling the use of standards-based
EHR data in research and clinical practice. A core
component for achieving this goal is the ability to
transform heterogeneous patient health information,
typically stored in multiple clinical and health IT
systems, into standardized, comparable, consistent,
and queryable data.6 7 To this end, over the last
3 years, SHARPn has developed a suite of open-source and publicly available applications and
resources that support ubiquitous exchange, sharing
and reuse of EHR data collected during routine clinical processes in the delivery of patient care. In particular, the research and development efforts have
focused on four main areas: (1) clinical information
modeling and terminology standards; (2) a framework for normalization and standardization of clinical data—both structured and unstructured data
extracted from clinical narratives; (3) a platform for
representing and executing patient cohort identification and phenotyping logic; and (4) evaluation of
data quality and utility of the SHARPn resources.
In this article, we describe current research progress and future plans for these foci of the SHARPn program. We illustrate our work via use cases in high-throughput phenotype extraction from EHR data,
specifically for meaningful use (MU) clinical quality
measures (CQMs), and discuss the challenges to
shared, secondary use of EHR data relative to the
work presented.
METHODS

SHARPn organizational and functional framework
The SHARPn vision is to develop and foster robust, scalable, and pragmatic open-source informatics resources and applications that can facilitate large-scale use of heterogeneous EHR data for secondary purposes. Over the past 3 years, through this highly collaborative project with team members from seven different organizations, we have developed an organizational and functional framework for SHARPn that is structured around informatics infrastructure for (1) information modeling and terminology standards, (2) clinical natural language processing (NLP), (3) clinical phenotyping, and (4) data quality. We describe these infrastructural components briefly in the following sections.

Clinical information modeling and terminology standards
Clinical concepts must be normalized if decision support and analytic applications are to operate reliably on heterogeneous EHR data. To achieve this goal, we adopted Clinical Element Models (CEMs)8 as target information models; they are designed to provide a consistent architecture for representing clinical information in EHR systems. The goal of CEMs is to specify granular, computable models of all elements that might be stored in an EHR system. The intent is to use these models to validate data during persistence and to develop the services and applications that exchange data, thus promoting interoperability between applications and systems. For the secondary use needs of SHARPn, a set of generic CEMs for capturing core clinical information, such as 'Medication', 'Labs', and 'Diagnosis', has been defined. CEMs are distributed in the following formats: Tree, Constraint Definition Language (CDL, the language GE Healthcare developed for authoring CEMs), and Extensible Markup Language (XML) Schema Definitions (XSDs). Figure 1 shows the demographics model, SecondaryUsePatient, in Tree format, where the bracket shows the cardinality of the corresponding feature. For example, a SecondaryUsePatient instance has at least one entry for PersonName ('unbounded' indicates that a demographics instance can have multiple entries for PersonName), while all other entries are optional. The CEM is authored in CDL and processed by a compiler, which validates the CDL and can generate various outputs, one of which is a SHARPn-specific XSD structure used to convey patient data instances conforming to CEMs.

Figure 1 SecondaryUsePatient Clinical Element Model.

A key component in defining and instantiating the CEMs is structured codes and terminologies. In particular, the CEMs used by SHARPn adopt MU standards and value sets (use-case-specific subsets of standard vocabularies) for specifying diagnoses, medications, laboratory results, and other classes of data. Terminology services, especially mapping services, are essential, and we have adopted Common Terminology Services 2 (CTS2)9 as our core terminology infrastructure.
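To make the CEM structure and cardinality constraints described above concrete, the sketch below models a SecondaryUsePatient-like record in plain Java. It is illustrative only: the field names are assumptions drawn from figure 1, and the real CEMs are authored in CDL and compiled to XSDs rather than hand-written classes.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch of a SecondaryUsePatient-like record (field names assumed from figure 1).
public final class SecondaryUsePatientExample {

    private final List<String> personNames;               // cardinality 1..* (at least one required)
    private final Optional<String> administrativeGender;  // optional, drawn from an MU value set
    private final Optional<String> birthDate;              // optional

    public SecondaryUsePatientExample(List<String> personNames,
                                      Optional<String> administrativeGender,
                                      Optional<String> birthDate) {
        if (personNames == null || personNames.isEmpty()) {
            // Enforce the 1..* constraint that the Tree-format bracket expresses for PersonName.
            throw new IllegalArgumentException("At least one PersonName is required");
        }
        this.personNames = List.copyOf(personNames);
        this.administrativeGender = administrativeGender;
        this.birthDate = birthDate;
    }

    public List<String> getPersonNames() { return personNames; }
    public Optional<String> getAdministrativeGender() { return administrativeGender; }
    public Optional<String> getBirthDate() { return birthDate; }
}
```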

Clinical data normalization and standardization
Syntactic and semantic data normalization
Figure 2 shows the data normalization pipeline architecture. The SHARPn data normalization pipeline adopts Mirth Connect, an open-source healthcare integration engine, as the interface engine, and the Apache Unstructured Information Management Architecture (UIMA) as the software platform.10 It takes advantage of Mirth's ability to support the creation of interfaces between disparate systems, and of UIMA's resource configuration to enable the transformation of heterogeneous EHR data sources (including clinical narratives) into common clinical models and standard value sets. The behavior of the normalization pipeline is driven through UIMA's resource configuration: the source and target information models (syntactic), their associated value sets (semantic), and the mapping information from sources to targets. As discussed in the previous section, we adopt the CEMs and the CTS2 infrastructure for defining the information models and accessing terminologies and value sets, respectively, to enable syntactic and semantic normalization. Syntactic normalization specifies where (in the source data or another location) to obtain the values that will fill the target model; these are 'structural' mappings from the input structure (eg, an HL7 message) to the output structure (a CEM). For semantic normalization, we draw value sets for problems, diagnoses, laboratory observations, medications, and other classes of data from MU terminologies. The entire normalization process relies on the creation or identification of syntactic and semantic mappings, and we adopt the CTS2 Mapping Service (http://www.omg.org/spec/CTS2/1.0/) to specify these mappings. Additional details about the normalization process, along with examples, are provided by Kaggal et al.11
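As a minimal illustration of the two mapping steps, the sketch below fills a CEM-like target record from a simplified pipe-delimited source segment (syntactic mapping) and translates a local laboratory code to a standard code through a hard-coded map standing in for the CTS2 Mapping Service (semantic mapping). The segment layout, local code, and mapping entry are hypothetical.

```java
import java.util.Map;

// Toy normalization sketch: a simplified pipe-delimited observation segment and a
// hard-coded local-to-standard code map that stands in for the CTS2 Mapping Service.
public final class ToyNormalizer {

    // Semantic mapping: local laboratory code -> standard (LOINC-style) code. Hypothetical entry.
    private static final Map<String, String> LOCAL_TO_STANDARD = Map.of("LOCAL-LDL", "13457-7");

    // Target "CEM-like" record produced by the normalizer.
    public record LabObservation(String code, String codeSystem, double value, String unit) {}

    // Syntactic mapping: pick values out of the source structure and fill the target model.
    public static LabObservation normalize(String segment) {
        String[] fields = segment.split("\\|");   // eg "OBX|LOCAL-LDL|92|mg/dL"
        String localCode = fields[1];
        double value = Double.parseDouble(fields[2]);
        String unit = fields[3];

        // Semantic mapping: translate the local code to the standard value set.
        String standardCode = LOCAL_TO_STANDARD.getOrDefault(localCode, localCode);
        return new LabObservation(standardCode, "LOINC", value, unit);
    }

    public static void main(String[] args) {
        System.out.println(normalize("OBX|LOCAL-LDL|92|mg/dL"));
    }
}
```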

Normalization of the clinical narrative using NLP
The clinical narrative within the EHR consists primarily of care
providers’ notes describing the patient’s status, disease or tissue/
image. The normalization of textual notes requires NLP methods
that deal with the variety and complexity of human language. We
have developed sophisticated information extraction methods to
discover a relevant set of normalized summary information for a
given patient in a disease- and use-case agnostic way. We have
defined six 'templates', abstractions of CEMs, which are populated by processing the textual information and then mapped to the models. Table 1 lists the six templates along with their attributes. The anchors for the templates are a Medication, a Sign/symptom, a Disease/disorder, a Procedure, a Laboratory result, and an Anatomic site. Some attributes are relevant to all templates (for example, 'negation_indicator'); others are specific to a particular template (for example, 'dosage' is specific to Medication).
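As an illustration of the template idea, the sketch below models a Medication mention with a few common and template-specific attributes from table 1. The attribute names follow the table, but the class itself is hypothetical; the actual templates are realized as UIMA type-system types rather than hand-written classes, and the code values shown are placeholders.

```java
// Illustrative sketch of the Medication template: common attributes shared by all six
// templates plus Medication-specific ones (attribute names follow table 1).
public record MedicationMention(
        // Common attributes
        String associatedCode,        // eg an RxNorm-style code (placeholder value below)
        boolean negationIndicator,    // eg "patient denies taking ..."
        boolean uncertaintyIndicator,
        // Medication-specific attributes
        String dosage,
        String frequency,
        String route) {

    public static void main(String[] args) {
        MedicationMention m = new MedicationMention(
                "197361", false, false, "1 tablet", "daily", "oral"); // hypothetical values
        System.out.println(m);
    }
}
```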
The main methods used for the normalization of the clinical narrative are rules and supervised machine learning. As with any supervised machine learning technique, the algorithms require labeled data points from which to learn the patterns and evaluate the output.

Figure 2 SHARPn clinical data normalization pipeline. (1) Data to be normalized are read from the file system. These data can also be transmitted over NwHIN via TCP/IP from an external entity. (2) Mirth Connect invokes the normalization pipeline using one of its predefined channels and passes the data (eg, HL7, CCD, tabular data) to be normalized. (3) The normalization pipeline initializes its components (including loading resources from the file system or other predefined resources such as Common Terminology Services 2 (CTS2)) and then performs syntactic parsing and semantic normalization to generate normalized data in the form of a Clinical Element Model (CEM). (4) Normalized data are handed back to Mirth Connect. (5) Mirth Connect uses one of the predefined channels to serialize the normalized CEM data to CouchDB or MySQL, based on the configuration. cTAKES, clinical Text Analysis and Knowledge Extraction System; DB, database; NLP, natural language processing; UIMA, Unstructured Information Management Architecture.

Therefore, a corpus representative of the
EHR from two SHARPn institutions (Mayo Clinic and Seattle
Group Health) was sampled and deidentified following Health
Insurance Portability and Accountability Act guidelines. The deidentification process was a combination of automatic output
from MITRE Identification Scrubber Tool (MIST)12 and manual
review. The corpus comprises 500K words, which we intend to share with the research community under data use agreements with the originating institutions. The corpus was annotated by
experts for several layers following carefully extended or newly
developed annotation guidelines conformant with established
standards and conventions in the clinical NLP community to
allow interoperability. The annotated layers (syntactic and
semantic) allow learning structures over increasingly complex
language representations, thus enabling state-of-the-art information extraction from the clinical narrative. A detailed description of the syntactic and select semantic layers is provided in Albright et al.13

Table 1 Template generalizations of the Clinical Element Model used for the normalization of the clinical narrative

Common attributes (all templates): AssociatedCode, Conditional, Generic, Negation_indicator, Subject, Uncertainty_indicator

Template-specific attributes:
Medication: Change_status, Dosage, Duration, End_date, Form, Relative_temporal_context, Frequency, Route, Start_date, Strength
Procedure: Body_laterality, Body_location, Body_side, Device, Duration, End_date, Method, Relative_temporal_context, Start_date
Sign/symptom: Alleviating_factor, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Severity, Start_date
Laboratory results: Abnormal_interpretation, Delta_flag, End_date, Lab_value, Ordinal_interpretation, Reference_range_narrative, Start_date
Disease/disorder: Alleviating_factor, Associated_sign_symptom, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Start_date
Anatomic site: Body_laterality, Body_side
The best performing NLP methods are implemented as part
of the Apache clinical Text Analysis and Knowledge Extraction
System (cTAKES),14 which is built using UIMA. It comprises a variety of modules, including sentence boundary detection, syntactic parsers, named entity recognition, negation/uncertainty discovery, and coreference resolution, to name a few. For method
details and formal evaluations, see Wu et al,15 Albright et al,13
Choi and Palmer,16 17 Clark et al,18 Savova et al,19 Sohn et al,20
and Zheng et al.21 Additional details on cTAKES are available
from http://ctakes.apache.org.
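For readers unfamiliar with UIMA, the following minimal uimaFIT sketch shows the shape of an analysis engine of the kind cTAKES modules are built from. The toy annotator only scans for a single negation cue word and prints a message; it is not an actual cTAKES component, and real modules add typed annotations to the CAS rather than printing.

```java
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

// Minimal UIMA/uimaFIT analysis engine sketch; a toy stand-in, not a cTAKES module.
public class ToyNegationAnnotator extends JCasAnnotator_ImplBase {

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        String text = jcas.getDocumentText();
        if (text != null && text.toLowerCase().contains("denies")) {
            System.out.println("Possible negation cue found in: " + text);
        }
    }

    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("Patient denies chest pain.");
        AnalysisEngine engine = AnalysisEngineFactory.createEngine(ToyNegationAnnotator.class);
        SimplePipeline.runPipeline(jcas, engine);   // run the toy engine over the document text
    }
}
```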

Data transfer, processing, storage and retrieval
infrastructure
Data transfer/processing
We have primarily leveraged two industry standard, open-source
tools for data exchange and transfer: Mirth Connect and
Aurion (NwHIN). Aurion provides gateway-to-gateway data
exchange using the NwHIN XDR (Document Submission)
protocol, enabling participating partners to push clinical documents in a variety of forms (HL7 2.x, Clinical Document Architecture (CDA), and CEM). Mirth Connect channels have been developed to provide Sender, Receiver, Transformation, and Persistence functions in support of the various use cases and to enable interconnectivity among the SHARPn systems.

Data storage/retrieval
The SHARPn teams have developed two different technology
solutions for the storage and retrieval of normalized data. The
first is an open-source SQL-based solution. This solution
enables each individual CEM instance record to be stored in a
standardized SQL data model. Data can be queried directly
from the SQL database tables and extracted as individual fields
or as complete XML records. The second is a document storage solution that leverages the open-source CouchDB22 repository to store the CEM XML data as individual JavaScript Object Notation (JSON) documents, providing a more document-centric view of the data. Both solutions leverage toolsets built around the CEM data models for a variety of clinical data, including 'NotedDrugs', 'Labs', 'Administrative Diagnosis', and other clinically relevant data.
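The document-centric option can be illustrated with CouchDB's plain HTTP API, in which a PUT to /database/document-id stores one JSON document. The sketch below is illustrative only: the database name, document id, and JSON fields are hypothetical, and in SHARPn the serialization is performed by Mirth Connect channels rather than a hand-written client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: persist one normalized CEM-like instance as a JSON document via CouchDB's HTTP API.
public class CouchDbPutExample {

    public static void main(String[] args) throws Exception {
        String json = """
                {"cem": "NotedDrug", "patientId": "12345",
                 "code": {"system": "RxNorm", "value": "197361"},
                 "startDate": "2013-01-15"}""";   // hypothetical document content

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/cem_store/noteddrug-12345-001"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))   // PUT /db/docid creates the document
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```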

High-throughput cohort identification and phenotype
extraction
SHARPn overloads the term ‘phenotyping’ to imply the algorithmic recognition of any cohort within an EHR for a defined
purpose, including case–control cohorts for genome-wide association studies, clinical trials, quality metrics, and clinical decision support. In the recent past, several projects, including

eMERGE,23 PGRN,24 and i2b2,25 have developed tools and
technologies for identifying patient cohorts using EHRs. A key
aspect of this process is to define inclusion and exclusion criteria
involving EHR data fields (eg, diagnoses, procedures, laboratory
results, and medications) and logical operators. We refer to
them as ‘phenotyping algorithms’, and typically represent them
as pseudocode with varying degrees of formality and structure. CQMs, for instance in the MU program, are examples of phenotyping algorithms maintained by the National Quality Forum (NQF), based on the Quality Data Model (QDM)26 and represented in the HL7 Health Quality Measures Format (HQMF,27 or eMeasure). The QDM is an information model and grammar
intended to represent data collected during routine clinical care
in EHRs as well as the basic logic required to articulate the algorithmic criteria for phenotype definitions.
While the NQF, individual measure developers, the National
Library of Medicine, and others have made improvements in the
clarity of eMeasure logic and coded value sets for the 2014 set of
CQMs28 for MU stage 2, at least in MU stage 1, these measures
often required human interpretation and translation into local
queries in order to extract necessary data and produce calculations. There have been two challenges in particular: (1) local data
elements in an EHR may not be natively represented in a format
consistent with the QDM including the required code systems
and value sets; and (2) an EHR typically does not natively have the
capability to automatically consume and execute eMeasure logic.
In other words, work has been needed locally to translate the
logic (eg, using an approach such as SQL) and to map local data
elements and codes (eg, mapping an element using a proprietary
code to SNOMED). This greatly erodes the advantages of the
eMeasure approach, but has been the reality for many vendors
and institutions in the first stage of MU.29
To address these challenges, the SHARPn project has investigated aligning the representation of data with the QDM along
with automatic interpretation and execution of eMeasures using
the open-source JBoss Drools30 rules management system.
Specifically, using the Apache UIMA platform, we have developed a translator tool that converts QDM-defined phenotyping
algorithm criteria into executable Drools rules scripts. In
essence, the tool takes as input the HQMF XML file along with the relevant value sets, maps the QDM categories and
criteria to CEMs, and, at run-time, generates JavaScript queries
to the JSON-based CouchDB CEM datastore containing patient
information. Figure 3 shows the overall architecture of the conversion tool, with additional details presented in Li et al.31
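The following sketch illustrates the translation idea only: a small map from QDM data elements to target CEMs (cf table 3) and a Drools-style (DRL) rule emitted as text for one criterion. The rule body, fact-type properties, and mapping entries are hypothetical and are not the output of the actual QDM-to-Drools translator, which consumes HQMF XML and generates queries against the CouchDB CEM store at run time.

```java
import java.util.Map;

// Sketch of QDM-to-rule translation: map a QDM data element to its target CEM and
// emit a hypothetical Drools-style rule as text for one denominator criterion.
public class QdmToDroolsSketch {

    // QDM data element -> target CEM (illustrative subset, cf table 3)
    private static final Map<String, String> QDM_TO_CEM = Map.of(
            "Diagnosis, active: diabetes", "AdministrativeDiagnosis",
            "Laboratory test, result: LDL test", "SecondaryUseStandardLabObsQuantitative");

    static String emitRule(String qdmElement) {
        String cem = QDM_TO_CEM.get(qdmElement);
        // Hypothetical DRL text; property names and the 'denominator' collection are assumptions.
        return """
                rule "Denominator: %s"
                when
                    $d : %s( codeInValueSet == true )
                then
                    denominator.add($d.getPatientId());
                end
                """.formatted(qdmElement, cem);
    }

    public static void main(String[] args) {
        System.out.println(emitRule("Diagnosis, active: diabetes"));
    }
}
```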

Data quality and consistency
Accurate and representative EHR data are required for effective
secondary use in health improvement measures and research.32
There are often multiple sources with the same information in
the EHR, such as medications from an application that supports

the printing or messaging of prescriptions versus medication data that were processed from progress notes using NLP.

Figure 3 Architecture of the Quality Data Model (QDM) to Drools translator system. AE, annotation engine; CAS, common analysis system; CEM, Clinical Element Model; DB, database; UIMA, Unstructured Information Management Architecture; XSLT, extensible stylesheet language transformations.
Identical data elements from hospital, ambulatory, and nursing
home care, for example, have variations in their original use
context. We take a two-pronged view toward the requirement of
providing accurate and correct processing of source data. The
first is to validate correct functionality of the SHARPn components and services. The second is to develop methods and components to monitor and report end-to-end data validity for
intended users of the SHARPn deliverables. Our approach to
validating the normalization pipeline was to construct an environment in which representative samples of various types of clinical data are transmitted across the internet, the data are
persisted as normalized CEM-based data instances, and the
CEM-based instances are reconciled with the originally transmitted data. Figure 4 illustrates this approach.
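A minimal sketch of that reconciliation step, assuming hypothetical field names and values: compare a source record field by field against the corresponding normalized CEM instance and print any mismatches for manual review.

```java
import java.util.Map;

// Sketch of reconciliation: compare source values with the normalized CEM instance
// and print inconsistencies for manual review (field names and values are hypothetical).
public class CemReconciliation {

    public static void main(String[] args) {
        Map<String, String> source = Map.of(
                "ingredient", "salmeterol", "strength", "50 mcg", "route", "inhalation");
        Map<String, String> normalizedCem = Map.of(
                "ingredient", "salmeterol", "strength", "50 mcg", "route", "oral"); // deliberate mismatch

        source.forEach((field, expected) -> {
            String actual = normalizedCem.get(field);
            if (!expected.equals(actual)) {
                System.out.printf("MISMATCH %s: source='%s' cem='%s'%n", field, expected, actual);
            }
        });
    }
}
```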
In particular, we have validated a stream of 1000 HL7 V.2.5
medication messages end-to-end: from the source data submitted through an NwHIN gateway at Intermountain Healthcare to
normalized, Secondary Use, Noted Drug CEM instances
retrieved from the platform datastores at Mayo Clinic. Once a
validation procedure had been successfully tested in the SQL
datastore environment, this evaluation process was implemented
on output from CouchDB. The contents of all messages were matched correctly between input and output data, with the exception of one medication RxNorm code that was not populated in the CTS2 terminology server (more details in the
Results section). Other SHARPn CEMs are in various stages of
validation as the pipeline components continue to evolve. Our
continuing objective is to develop standard procedures to identify transmission/translation errors as data from new sites are added to a centralized repository. Such errors expose the modifications to underlying system resources within the normalization pipeline that are necessary to convert new data accurately into the canonical CEM standard.

RESULTS
Medication CEMs and terminology mappings
As discussed above, for testing our data normalization pipeline,
1000 ambulatory medication order instances were evaluated for
the correctness of structural and semantic mapping from HL7
V.2.3 Medication Order messages to CEMs. The medication data compared included ingredient, strength, frequency, route, dose, and dose unit. Specifically, 266 distinct First Data Bank GCN sequence codes (GCN-SEQNO) were used in the test messages. These generic clinical medication terms were a random selection from Intermountain's EHR data. All but one mapped to an RxNorm ingredient code in the pipeline; the exception (GCN-SEQNO 031417, Serevent Diskus (Salmeterol) 50 μg/Dose Inhalation Disk with Device) failed to map. It has been noted in previous evaluations of RxNorm coverage of actual medication data that metered dose inhalers are, in general, challenging to map.33 34

MU quality measures
To demonstrate the applicability of our phenotyping infrastructure, we experimented with one of the MU phase 1 eMeasures:
NQF 0064 (Diabetes: Low Density Lipoprotein (LDL)
Management and Control35; figure 5). NQF 0064 determines the
percentage of patients between 18 and 75 years with diabetes
whose most recent LDL-cholesterol (LDL-C) test result during the
measurement year was <100 mg/dL. Thus, the denominator criterion identifies patients between 18 and 75 years of age who had
a diagnosis of diabetes (type 1 or type 2), and the numerator criterion identifies patients with an LDL-C measurement of <100 mg/
dL. The patient is not numerator-compliant if the result for the
most recent LDL-C test during the measurement period is
≥100 mg/dL, or is missing, or if an LDL-C test was not performed
during the measurement time period. NQF 0064 also excludes
patients with diagnoses of polycystic ovaries, gestational diabetes
or steroid-induced diabetes.
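The denominator and numerator logic can be sketched in plain Java as follows. This is only an illustration of the criteria just described, not the Drools rules generated by the SHARPn translator, and the exclusion criteria are omitted for brevity.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Plain-Java sketch of NQF 0064: denominator = age 18-75 with a diabetes diagnosis;
// numerator = most recent LDL-C during the measurement period <100 mg/dL.
public class Nqf0064Sketch {

    record LdlResult(String date, double valueMgPerDl) {}
    record Patient(int age, boolean hasDiabetes, List<LdlResult> ldlResults) {}

    static boolean inDenominator(Patient p) {
        return p.age() >= 18 && p.age() <= 75 && p.hasDiabetes();
    }

    static boolean inNumerator(Patient p) {
        Optional<LdlResult> mostRecent = p.ldlResults().stream()
                .max(Comparator.comparing(LdlResult::date));   // ISO dates sort lexicographically
        // A missing result or a most recent result >=100 mg/dL is not numerator-compliant.
        return mostRecent.map(r -> r.valueMgPerDl() < 100.0).orElse(false);
    }

    public static void main(String[] args) {
        Patient p = new Patient(62, true,
                List.of(new LdlResult("2012-03-01", 118), new LdlResult("2012-11-20", 92)));
        System.out.println("denominator=" + inDenominator(p) + ", numerator=" + inNumerator(p));
    }
}
```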
As the goal is to execute such an algorithm in a semi-automatic fashion, we first invoked the SHARPn data normalization and NLP pipelines in UIMA to populate the CEM database.
Specifically, we extracted demographics, diagnoses, medications,
clinical procedures and laboratory measurements for 273
patients from Mayo’s EHR systems (table 2) and normalized the
data to HL7 value sets, International Classification of Diseases,
Ninth Revision, Clinical Modification (ICD-9-CM), RxNorm,
Current Procedural Terminology (CPT)-4 and Logical
Observation Identifiers Names and Codes (LOINC) codes,
respectively. The NLP pipeline was invoked particularly for processing the medication data, as demonstrated in prior work.36 For translating the QDM criteria into Drools rules, a key component was the mapping from QDM data elements to CEMs. This enables mapping and extraction from disparate EHRs to this normalized representation of the source data.
Additionally, CEMs contain attributes of provenance, such as date/time, originating clinician, clinic, and entry application, and other model-specific attributes, such as the order status of a medication. These care process-related attributes are required in QDM data specifications. CEMs are bound to a standard value set. The 'key' code and the 'data' code are generally the CEM 'qualifiers' that enable a mapping from a QDM data element specification to one or more CEMs. Table 3 shows a sample of the mappings developed for NQF 0064. Two QDM data categories, 'Diagnosis' and 'Encounter', may be satisfied by one or more CEMs. Both map to an 'AdministrativeDiagnosis' CEM: the QDM data element is instantiated based on matching the 'data' qualifier to the QDM value set. All QDM data specifications for NQF 0064 were successfully mapped to CEMs.

Executing NQF 0064 on this population of 273 patients identified 21 and 18 patients for the denominator and numerator, respectively (table 4). Nineteen patients were excluded because they did not meet either the denominator or numerator criterion for NQF 0064. Furthermore, we have implemented an open-source platform (http://phenotypeportal.org) that provides a library of such cohort identification algorithms, as well as visualization, reporting, and analysis of algorithm results.

Figure 4 Conceptual diagram of validation processes. (1) Use-case data are translated to HL7 and submitted to the Mirth interface engine, and then (2) processed and stored as normalized data objects. (3) A Java application pulls data from the source database and the Clinical Element Model (CEM) database, compares them, then (4) prints inconsistencies for manual review.

Figure 5 Denominator, numerator and exclusion criteria for NQF 0064: Diabetes: Low Density Lipoprotein (LDL) Management and Control.

Table 2 Basic demographics for patients (N=273) evaluated for NQF 0064
Gender: Male 132 (48%); Female 141 (52%)
Race: White 236 (85%); Black 8 (3%); Alaskan Indian 10 (4%); Asian 5 (2%); Pacific Islander 4 (2%); Other 10 (4%)
Ethnicity: Hispanic, Latino or Spanish origin 7 (4%); Non-Hispanic, Latino or Spanish origin 235 (85%); Unknown 31 (11%)
Age (years): ≤18, 31 (11%); 19–30, 32 (11%); 31–50, 70 (26%); 51–75, 105 (38%); >75, 35 (14%)

Table 4 Criteria results for patients (N=273) evaluated for NQF 0064
Initial patient population: 273
Denominator: 21
Numerator: 18
Exclusion: 19

DISCUSSION

We have briefly presented the SHARPn project whose core
mission is to develop methods and modular open-source
resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent
formats. A key aspect of achieving this is normalization of both structured and unstructured clinical data into a unified, concept-based formalism. Using CEMs and Apache UIMA, we have
developed a data-normalization platform that ensures data
security, end-to-end connectivity, and reliable data flow within
and across institutions. This platform leverages and builds on
our existing work on cTAKES and CTS2 open-source tools
which have been widely adopted in the health informatics community. With phenotyping as SHARPn’s major use case, we
have collaborated, and continue to collaborate, with NQF on
further development and adoption of QDM and HQMF for
standardized representation of cohort definition criteria. As illustrated in the previous section, our goal is to create a national
library of standardized phenotyping algorithms, and provide
publicly available infrastructure that facilitates their execution in
a semi-automatic fashion. Finally, this article describes a
systematic implementation of an informatics infrastructure. While there are alternative commercial (eg, Oracle Cohort Explorer37) and open-source (eg, popHealth38) tools that achieve similar functionality, to our knowledge, ours is the first system that leverages an emerging national standard, namely the QDM, for representing phenotypic criteria, as well as a rules-based system (Drools) for executing such criteria.

Table 3 Sample QDM category to CEM mapping for NQF 0064
QDM data element | QDM code system | CEM | CEM qualifier(s)
Diagnosis, active: diabetes | ICD-9-CM; SNOMED-CT | Administrative Diagnosis; Secondary Use Assertion | Data, Encounter Date; Data, Date Of Onset or Observed Start Time
Encounter: non-acute inpatient and outpatient | CPT; ICD-9-CM | Administrative Procedure; Administrative Diagnosis | Data, Encounter Date; Data, Encounter Date
Laboratory test, result: LDL test | LOINC | Secondary Use Standard LabObs Quantitative | Key, data
Medication, active: Meds indicative of diabetes | RxNorm | Secondary Use Noted Drug | data, Start Time
Patient characteristic: birth date | LOINC | Secondary Use Patient | Birth Date
Race | CDCREC | Secondary Use Patient | Administrative Race
ONC administrative Sex | Administrative Sex | Secondary Use Patient | Administrative Gender
CDCREC, US Centers for Disease Control and Prevention Race and Ethnicity Code Set; CEM, Clinical Element Model; CPT, Current Procedural Terminology; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; LDL, low-density lipoprotein; LOINC, Logical Observation Identifiers Names and Codes; ONC, Office of the National Coordinator; QDM, Quality Data Model.
Several limitations and challenges remain to full achievement
of the SHARPn vision. First, to achieve any form of practical
semantic normalization, we are faced with the unavoidable
requirement for well-curated semantic mapping tables among
native data representations and their target standards. While
several semi-automated mapping methods are being investigated,39–42 terminology mapping remains a largely labor-intensive process. Second, within the context of secondary use, metadata and other means of understanding the provenance and meaning of source EHR data are highly pertinent. While CEMs do accommodate provenance details, such information is typically either not readily available or, in a few cases, not adequately processed by the data normalization framework, suggesting the need for future enhancement. Furthermore,
at present, for structured clinical data, we use HL7 2.x and
CCD messages as the payload of an NwHIN Document
Submission message, which are then transformed and normalized using the UIMA pipeline. However, the emergence of more
comprehensive standards, such as Consolidated Clinical
Document Architecture,43 will warrant further evaluation,
including investigation of the use of Mirth Connect and Aurion
NwHIN for transferring the message payload. Further, our
existing evaluations with eMeasures have been limited to MU
phase 1 as well as to criteria with core logical operators (eg,
OR, AND). However, QDM defines a comprehensive list of
operators including temporal logic (eg, NOW, WEEK,
CURRTIME), mathematical functions (eg, MEAN, MEDIAN),

and qualifiers (eg, FIRST, SECOND, RELATIVE FIRST) that
require a deeper understanding of the underlying data semantics. Our plan is to incorporate these additional capabilities as
well as 2014 MU phase 2 CQMs in future releases of the
QDM-to-Drools translator.

CONCLUSION
End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and ontologies, as well as robust information
models for storing, discovering, and processing that information. The main objective of SHARPn is to develop methods and
modular open-source resources for enabling secondary use of
EHR data through normalization to standards-based, comparable, and consistent formats for high-throughput phenotyping.
In this article, we present our current research progress and
plans for future work.
Correction notice This paper has been corrected since it was published Online
First. The middle initial of the fifth author has been corrected.
Contributors JP and CGC designed the study and wrote the manuscript. All other
authors checked the individual sections in the manuscript and discussion. All
authors contributed to and approved the final manuscript.
Funding This research and the manuscript was made possible by funding from the
Strategic Health IT Advanced Research Projects (SHARP) Program (90TR002)
administered by the Office of the National Coordinator for Health Information
Technology.
Competing interests None.
Patient consent Obtained.
Ethics approval This study was approved by the Mayo Clinic Institutional Review
Board (IRB#: 12-003424) as a minimal risk protocol.
Provenance and peer review Not commissioned; externally peer reviewed.

REFERENCES
1 Office of the National Coordinator. Strategic Health IT Advanced Research Projects: SHARP. 2010. http://healthit.hhs.gov/sharp (accessed 7 Sep 2010).
2 SHARPS: Strategic Health IT Advanced Research Project on Security. http://sharps.org (accessed 18 Dec 2012).
3 SHARPC: Strategic Health IT Advanced Research Project on Cognitive Informatics and Decision Making. http://sharpc.org (accessed 18 Dec 2012).
4 Mandl KD, Mandel JC, Murphy SN, et al. The SMART Platform: early experience enabling substitutable applications for electronic health records. J Am Med Inform Assoc 2012;19:597–603.
5 Rea S, Pathak J, Savova G, et al. Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: the SHARPn project. J Biomed Inform 2012;45:763–71.
6 Kern LM, Malhotra S, Barrón Y, et al. Accuracy of electronically reported "meaningful use" clinical quality measures: a cross-sectional study. Ann Intern Med 2013;158:77–83.
7 Blumenthal D, Tavenner M. The meaningful use regulation for electronic health records. N Engl J Med 2010;363:501–4.
8 Oniki T, Coyle J, Parker C, et al. Lessons learned in detailed clinical modeling at Intermountain Healthcare. American Medical Informatics Association (AMIA) Annual Symposium; 2012:1638–9.
9 Common Terminology Services 2 (CTS2). http://informatics.mayo.edu/cts2 (accessed 25 Dec 2012).
10 Ferrucci D, Lally A. Building an example application with the Unstructured Information Management Architecture. IBM Syst J 2004;43:455–75.
11 Kaggal V, Oniki T, Marchant K, et al. SHARPn data normalization pipeline. AMIA Annual Symposium; 2013. (Accepted for publication).
12 Aberdeen J, Bayer S, Yeniterzi R, et al. The MITRE Identification Scrubber Toolkit: design, training, and assessment. Int J Med Inform 2010;79:849–59.
13 Albright D, Lanfranchi A, Fredriksen A, et al. Towards syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc 2013;20:922–30.
14 Savova G, Masanz J, Ogren P, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.
15 Wu S, Kaggal V, Dligach D, et al. A common type system for clinical Natural Language Processing. J Biomed Semantics 2013. In press.
16 Choi JD, Palmer M. Getting the most out of transition-based dependency parsing. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, Volume 2; Portland, Oregon: 2011.
17 Choi JD, Palmer M. Transition-based semantic role labeling using predicate argument clustering. Proceedings of the ACL 2011 Workshop on Relational Models of Semantics; Portland, Oregon: 2011.
18 Clark C, Aberdeen J, Coarr M, et al. MITRE system for clinical assertion status classification. J Am Med Inform Assoc 2011;18:563–7.
19 Savova G, Ogren P, Duffy P, et al. Mayo Clinic NLP system for patient smoking status identification. J Am Med Inform Assoc 2008;15:25–8.
20 Sohn S, Murphy SN, Masanz J, et al. Classification of medication status change in clinical narratives. American Medical Informatics Association (AMIA) Annual Symposium; 2010.
21 Zheng J, Chapman WW, Miller TA, et al. A system for coreference resolution for the clinical narrative. J Am Med Inform Assoc 2012;19:660–7.
22 Apache CouchDB. http://couchdb.apache.org (accessed 12 Dec 2012).
23 McCarty C, Chisholm R, Chute C, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011;4:13.
24 Long RM, Berg JM. What to expect from the pharmacogenomics research network. Clin Pharmacol Ther 2011;89:339–41.
25 Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc 2012;19:181–5.
26 National Quality Forum (NQF) Quality Data Model (QDM). http://www.qualityforum.org/Projects/h/QDS_Model/Quality_Data_Model.aspx (accessed 23 May 2012).
27 HL7 Health Quality Measure Format (HQMF). http://code.google.com/p/hqmf/ (accessed 23 May 2012).
28 Center for Medicare and Medicaid Services: 2014 Clinical Quality Measures. http://www.cms.gov/Regulations-and-Guidance/Legislation/EHRIncentivePrograms/2014_ClinicalQualityMeasures.html (accessed 25 Dec 2012).
29 Fu PC Jr, Rosenthal D, Pevnick JM, et al. The impact of emerging standards adoption on automated quality reporting. J Biomed Inform 2012;45:772–81.
30 JBoss Drools Business Logic Integration Platform. http://www.jboss.org/drools (accessed 23 May 2012).
31 Li D, Shrestha G, Murthy S, et al. Modeling and executing electronic health records driven phenotyping algorithms using the NQF Quality Data Model and JBoss Drools engine. American Medical Informatics Association (AMIA) Annual Symposium; 2012.
32 Hammond WE, Bailey C, Boucher P, et al. Connecting information to improve health. Health Aff (Millwood) 2010;29:284–8.
33 Bell DS, O'Neill SM, Reynolds KA, et al. Evaluation of RxNorm in ambulatory electronic prescribing. Santa Monica, California: RAND Health, 2011.
34 O'Neill SM, Bell DS. Evaluation of RxNorm for representing ambulatory prescriptions. AMIA Annu Symp Proc 2010;2010:562–6.
35 NQF 0064: Diabetes: Low Density Lipoprotein (LDL) Management and Control. http://ushik.ahrq.gov/ViewItemDetails?system=mu&itemKey=122574000 (accessed 18 Dec 2012).
36 Pathak J, Murphy SN, Willaert B, et al. Using RxNorm and NDF-RT to classify medication data extracted from electronic health records: experiences from the Rochester Epidemiology Project. American Medical Informatics Association (AMIA) Annual Symposium; 2011.
37 Oracle Health Sciences Cohort Explorer. http://docs.oracle.com/cd/E24441_01/doc.10/e25021/toc.htm (accessed 6 Sep 2013).
38 Project popHealth. http://projectpophealth.org/ (accessed 13 Mar 2012).
39 Ghazvinian A, Noy N, Jonquet C, et al. What four million mappings can tell you about two hundred ontologies. 8th International Semantic Web Conference; 2009.
40 Ghazvinian A, Noy N, Jonquet C, et al. Creating mappings for ontologies in biomedicine: simple methods work. American Medical Informatics Association (AMIA) Annual Symposium; 2009.
41 Wang Y, Patrick J, Miller G, et al. Linguistic mapping of terminologies to SNOMED-CT. Semantic Mining Conference on SNOMED CT; 2006.
42 Choi N, Song I, Han H. A survey on ontology mapping. ACM SIGMOD Record 2006;35:34–41.
43 HL7 Consolidated Clinical Document Architecture (CCDA). http://www.hl7.org/implement/standards/product_brief.cfm?product_id=258 (accessed 11 Dec 2012).
