>>Welcome to the Health IT Certificate program.  My name is Herbert Chase, and I'm going to be 
talking today about knowledge representation.  You can read my bio when you have nothing better to 
do.  I've been here at Columbia for many years and I've been in the Department of Biomedical 
Informatics since 2006. 
 As mentioned, we're going to talk about knowledge representation, and let's go over our learning 
objectives.  First, we're going to define knowledge representation.  We'll describe the different 
data in the patient record: what needs to be represented and what doesn't.  What is the 
information that would be important to represent, so that a computer might find it?  We're going to 
describe how information is stored and utilized by a computer program to benefit healthcare.  We're 
going to go over terminologies and how terminologies help us represent medical knowledge, and we'll 
go over the ones that are most commonly used.  There'll be an additional discussion of the UMLS 
and SNOMED codes, which are very interesting terminologies that provide more information than just 
standard definitions and categorization of concepts.  And we'll touch upon the role of natural 
language processing in extracting important information.
Our talk today will be divided into four parts.  First, we'll discuss representation in medical data.  In part 
two, we'll spend a little time on the terminologies.  Part three, we'll talk about representation of 
unstructured data, and you'll have a better idea of what that is, as we proceed through this talk.  And 
lastly, in part four, we'll talk about representation of higher levels of knowledge, such as guidelines, 
which are quite challenging.
Let's first go over the definition of knowledge representation.  Knowledge representation simply 
means how we represent our domain knowledge, in this case medicine and what we know about 
medicine, in a language that a computer can make use of.  We would like to represent it in a formal 
language so that, with the right programs in place, a computer can access this information and make 
use of it, perhaps to make some decisions, and we'll talk about that in a second.  Obviously, the 
definition is a little bit abstract at this point, but, in a general sense, I think you can 
appreciate that a computer has to access information, and we have to represent the medical 
information in a way that would enable a computer to get to it and understand it.
Here are some of the goals requiring knowledge representation in medicine.  Obviously, there is 
patient care: we might want to produce a summary of all the diseases a patient has and how the 
patient is doing.  We want to construct expert systems, and we're going to talk about two expert 
systems in a few minutes.  Can we write a program that actually figures out if the patient is doing 
well or not?  Obviously, we need to make reports, both financial and medical.  We'd like to manage 
the patient's care; all the knowledge about the patient, including what clinic they go to and who 
their doctor is, has to be represented in some way that a computer can access.  We are going to 
want to transfer information from one place to another, even within a hospital, and perhaps across 
town or maybe cross-country.  We'd also like to find information: when was the last time this 
patient had their sugar measured, or had their insulin dose changed?  And, lastly, knowledge 
representation benefits clinical research.  We could ask, "Who are all the patients with diabetes 
mellitus?" and we'd have to have represented the diabetes mellitus in the patient's record in order 
to answer that question.
So, let's talk a little bit about exactly what kind of knowledge we would want to represent in the 
patient's record; that is to say, what are the key elements of the patient record that need to be 
represented?  If we look at a typical patient, it gives us some idea of what we need to represent.  
The patient is a 75-year-old man (we'd want to represent the age and the gender) with a long 
history of diabetes mellitus (certainly the condition diabetes mellitus), hypertension, depression 
and chronic kidney disease; they would all have to be represented in some way.  He's cared for by 
many physicians (we'd want to know who his physicians are) who practice in a variety of clinical 
settings.  Where has this patient been seen?  In what clinics has this patient been seen?  What is 
the primary care physician's name, and where does that physician practice?  The patient is on 
various medicines that control diabetes, high blood pressure and depression, so certainly the 
medicines, when they were started, and what their doses are would be essential to have represented 
in the patient record.  The patient is routinely sent to have his blood drawn to measure tests 
indicative of his current disease status, such as hemoglobin A1C; of course, we would want all the 
lab data to be represented.  The slide summarizes the many different pieces of the patient's puzzle 
that we would want to represent: medical conditions (I've mentioned a couple of them), diagnoses 
and medicines, the various reports that we'd want to access, some general information, and overall 
care.  We might want to represent the data in a way that we could construct an alert, or provide 
information on the sign-out, or check whether the patient's care was matching a guideline.  This 
just scratches the surface; there are many more categories of information that would be essential 
to represent.
I thought that the simplest way to drill down into the nitty gritty of knowledge representation would 
be to use a couple of examples, build a system and look to see what exactly is the knowledge that we 
need to represent in order to carry out the task, so I thought of two simple, straightforward examples 
for us to discuss.  The first is a Computerized Provider Order Entry system, CPOE, and the one that 
we're going to build will check for drug contraindications and adjust dosage, if necessary.  The 
second will be a clinical decision support system that optimizes management of diabetes.  
So, let's first talk about CPOE.  What would be the essential elements of this system?  The computer 
would have to know, of course, that one drug should not be given if a patient is on another; so if 
drug A and drug B interact, and one should not give drug A and drug B together, the computer has to 
know this in some manner, and we'll figure out how that's going to be stored.  How could the 
knowledge be represented so that a computer program could use it?  Where would that knowledge be 
stored so that a computer could actually carry out this task?  The simplest way to represent the 
knowledge is in tables.  All of you, I'm certain, have filled out Excel tables for one reason or 
another, and you could imagine one table for the patient's prescribed drugs and another table for 
drug interactions, and then a set of rules.  The computer would obviously need a set of rules, 
programmed like many simple programs: if the patient is on drug A, do not prescribe drug B. 
So, here is a very simple example of a CPOE.  Consider that the patient is currently on the three 
drugs listed in the top table on the right: an MAO inhibitor (that's a monoamine oxidase 
inhibitor), and perhaps the patient is also on insulin and morphine.  Imagine that the doctor 
decides that he or she wants to order Demerol, so the task here is: are there any interactions 
between any of the drugs the patient is already on and the Demerol?  The machine could consult the 
patient's table of drugs and see the three drugs they're on, and then consult an interaction table.  
The interaction table might reveal that the Demerol and the MAO inhibitor are contraindicated, that 
is, there's an interaction between the two and the Demerol should not be given, and you can imagine 
the machine sending back, based on a simple rule, the warning, "You may not order Demerol."  This 
very combination was responsible for the death of Libby Zion in the '80s, and this started a 
revolution in medicine, if you will, where interns' work hours were reduced, and the impetus to 
develop systems to protect patients from drug interactions reached a new level not seen previously.  
So, this is a perfect example of how the machine can protect the doctor and the patient from the 
prescribing of two drugs that interact.  
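As a rough sketch of the two tables and the one rule just described (this is not the actual 
system; the names patient_drugs, interactions and check_order are made up for illustration, and the 
only interaction listed is the one from the lecture), the check could look like this in Python:

```python
# Table 1: the patient's current medications (the three drugs from the example).
patient_drugs = {
    "patient01": ["MAO inhibitor", "insulin", "morphine"],
}

# Table 2: pairs of drugs that must not be given together.
# Only the Demerol / MAO inhibitor pair comes from the lecture.
interactions = {
    frozenset({"Demerol", "MAO inhibitor"}),
}


def check_order(patient_id, new_drug):
    """Return one warning per current drug that interacts with new_drug."""
    warnings = []
    for current in patient_drugs.get(patient_id, []):
        # The rule: if the patient is on drug A, do not prescribe drug B.
        if frozenset({current, new_drug}) in interactions:
            warnings.append(
                f"You may not order {new_drug}: interacts with {current}."
            )
    return warnings


print(check_order("patient01", "Demerol"))
```

Storing each pair as a frozenset means the order of the two drugs does not matter, so the same row 
covers ordering Demerol for a patient on an MAO inhibitor and the reverse.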
Another simple modification of the CPOE is to ask the system to determine whether or not a drug 
dose should be modified.  In this case, the doctor is ordering gentamicin.  The system consults a 
table of the patient's lab data, and we see that the patient has various lab values: a serum sodium 
of 135, a creatinine of 1.4.  A conversion can be made internally by the computer, which converts 
the creatinine, which is a measurement of kidney function, into an estimated glomerular filtration 
rate (GFR), which is a much more accurate reflection of kidney function, and a simple rule could be 
programmed: if the estimated GFR is less than 60, then the patient has reduced kidney function.  Now 
we can consult the adjustment table.  We see that the doctor has ordered gentamicin, and we see the 
rule: if reduced kidney function, adjust, yes or no?  Yes, and the message can be sent back: please 
adjust the gentamicin dose.  These are very, very effective simple tools, and you can see here that 
the knowledge was represented merely in tables (lab, medicine or interaction) with one set of rules.  
And, in this case, it is a very simple set of rules that basically any high school kid learning 
Java, probably any kindergarten kid now learning Java, could write.
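To make the dose-adjustment rule concrete, here is a minimal Python sketch.  Several things in it 
are assumptions and not from the lecture: the lecture does not say which equation converts 
creatinine to an estimated GFR, so the 4-variable MDRD equation is used here (without its race 
term); the patient's age and sex are borrowed from the 75-year-old man described earlier; and all 
table and function names are hypothetical.

```python
def estimate_gfr_mdrd(creatinine_mg_dl, age_years, female=False):
    """Estimated GFR (mL/min/1.73 m^2), 4-variable MDRD equation.

    Assumption: the lecture does not name an equation; the race
    coefficient of the original equation is omitted here.
    """
    egfr = 175.0 * creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    return egfr


# Lab table: values from the lecture's example.
lab_results = {"patient01": {"sodium": 135, "creatinine": 1.4}}

# Demographics: assumed, taken from the "typical patient" earlier in the talk.
patient_info = {"patient01": {"age": 75, "female": False}}

# Adjustment table: does this drug need adjusting when kidney function is reduced?
adjust_if_reduced_kidney_function = {"gentamicin": True}

REDUCED_FUNCTION_EGFR = 60.0  # rule from the lecture: eGFR < 60 means reduced function


def check_dose(patient_id, drug):
    """Return an adjustment message, or None if no adjustment is needed."""
    labs = lab_results[patient_id]
    info = patient_info[patient_id]
    egfr = estimate_gfr_mdrd(labs["creatinine"], info["age"], info["female"])
    reduced = egfr < REDUCED_FUNCTION_EGFR
    if reduced and adjust_if_reduced_kidney_function.get(drug, False):
        return f"Please adjust the {drug} dose."
    return None


print(check_dose("patient01", "gentamicin"))
```

The point of the sketch is the shape of the logic: one lookup in the lab table, one conversion, one 
threshold rule, and one lookup in the adjustment table.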
Our clinical decision support system will be a little more challenging.  The CPOE system that we 
just discussed was really quite straightforward, but the clinical decision support is going to be a 
little more sophisticated.  The computer must be able to, one, identify patients with medical 
conditions, and the one that we're going to focus on is diabetes mellitus.  Two, it needs to consult 
a set of rules that represent treatment guidelines.  Now, treatment guidelines can be quite complex, 
and at the very end of this session we're going to talk about representation of guidelines, but for 
the time being, we'll just say that the decision support tool must consult the guidelines to see if 
the patient's care has followed them.
So, let's discuss our diabetes clinical decision support tool.  First, as mentioned, the tool needs 
to figure out which patients have diabetes mellitus, and that will obviously take place in the 
electronic health record, and we can ask, now, how is diabetes mellitus represented in the EHR so 
that a computer can find it?  The second element would be following the guidelines, and we're going 
to use hemoglobin A1C, which reflects diabetes control; that's a lab test.  How is the lab test 
represented, again, so that a computer can find it and make a decision as to whether or not the 
patient's values are within the preferred range?  And, lastly, we can consider having the computer 
send a message if control is not optimal.  
A schematic of a typical decision support tool would characterize the elements that we just 
discussed: the patient has diabetes mellitus; if so, and the hemoglobin A1C is above a range that we consider.
[Silence]
By and large, the diseases that a patient has are represented by codes in the medical record, and 
for several decades the codes have been the ICD9 codes (International Classification of Diseases, 
9th version), in which every disease has an ICD9 code.  We're going to talk more about ICD9 codes a 
little bit later on, when I discuss terminologies, but for the time being, and for the purpose of 
this clinical decision support tool, let's just say that, if the patient has diabetes mellitus, then 
the code, in this case 249 for diabetes, appears in their record someplace.  So, imagine a table 
once again (a table's a very effective way of storing structured information that a computer can 
find): patient 01 has three different codes, one of which is diabetes mellitus.  Now, a key issue to 
explore is how that code actually gets into the patient's record, and for years it has been done 
manually; that is to say, a coder or a physician actually types in or, before typing, wrote in the 
code 249.0 if the patient had diabetes mellitus.  
I want you to keep this in mind, because we're going to talk a little bit later on about automated 
ways to code patients, but for the most part, codes end up in the patient's electronic health 
record because a human has made the determination that the patient has diabetes mellitus.  This is 
largely, if not entirely, for billing purposes.  ICD9 codes are used for billing purposes.  If you 
want to get reimbursed as a physician or a hospital or healthcare provider, you need to have the 
patient's ICD9 codes listed; they need to be appropriate, and they need to explicitly represent what 
the patient actually has.  You'll see in a few minutes that there's a huge amount of granularity 
when it comes to ICD9 coding; for diabetes, for example, there are many variations, but for the time 
being, let's just do plain vanilla diabetes mellitus.  Now, you could imagine another way that the 
code would get there: the physician or the provider checks off, in a box, diabetes mellitus (see the 
little radio button and the check box), and that would automatically populate the table.
One brief word on the International Classification of Diseases, ICD9.  It's a terminology, which is 
a finite enumerated set of terms intended to convey information unambiguously.  So diabetes is one 
term and should have one code.  Each condition has a specific code; as we say, diabetes is 249.  The 
important point, and this will become, I think, a little more obvious later on, is that no condition 
should have more than one code.  For example, you could imagine the situation where a patient has 
chronic kidney disease because they have diabetes, or we could imagine characterizing that patient 
in a different way: we could say the patient has diabetes and the diabetic complication of chronic 
kidney disease.  Now, is this under the diabetes umbrella, or is this chronic kidney disease?  
Whatever we decide, there should only be one code for that.  We're going to see in a second that 
that is the case; there is only one code.  
We'll get back to this later on, but I did want to give you some idea of what the ICD9 terminology 
looks like.  We see that 249 is diabetes mellitus, and 249 includes diabetes mellitus due to, or 
secondary to, drug-induced or chemical-induced causes or infection, but it excludes, for example, 
diabetes during pregnancy, gestational diabetes.  If a patient has diabetes mellitus during 
pregnancy, they're supposed to get a different code, and if a newborn has diabetes, neonatal 
diabetes mellitus (again, it's still the same disease, diabetes mellitus), then it should be 
characterized as 775.  
So, on one hand, it seemed pretty straightforward: diabetes is 249.  Well, it ends up being far more 
complicated than that, and I think this next slide helps us understand both the granularity and the 
complexity.  250, not 249, is the code that would be used for patients with diabetes and renal 
manifestations (that's kidney manifestations), and, as I had just mentioned about chronic kidney 
disease, it says in the ICD9 coding book to use an additional code to identify the manifestation, 
such as chronic kidney disease.  In other words, they're saying that if the patient has diabetes and 
chronic kidney disease, they should get two codes.  So, we're going to talk more about this later 
on, but the main point is that patients with diabetes have a code, and a computer can figure out if 
a patient has diabetes by looking for the code 249.  We could program the computer to look for 250, 
we could program it to look for all the variations of diabetes mellitus, and it would be looking for 
an explicit number.
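A small Python sketch of that lookup follows.  It assumes a diagnosis table with one row per 
patient and code; the table layout, the patient identifiers, the non-diabetes codes and the 
function names are all illustrative, and the codes 249 and 250 are used the way the lecture uses 
them.

```python
# Diagnosis table: one (patient, ICD9 code) row per coded condition.
# Patient 01 has three codes, one of which is the diabetes code 249.0;
# the other rows are illustrative fill-ins, not data from the lecture.
diagnoses = [
    ("patient01", "401.9"),
    ("patient01", "249.0"),
    ("patient01", "311"),
    ("patient02", "250.40"),
    ("patient03", "401.9"),
]

# Three-digit categories the program is told to treat as diabetes mellitus.
DIABETES_CATEGORIES = {"249", "250"}


def is_diabetes_code(code):
    """True if the part of the code before the decimal point is a diabetes category."""
    return code.split(".")[0] in DIABETES_CATEGORIES


def patients_with_diabetes():
    """Return the sorted IDs of all patients with at least one diabetes code."""
    return sorted({pid for pid, code in diagnoses if is_diabetes_code(code)})


print(patients_with_diabetes())
```

Matching on the category before the decimal point is one way to "look for all the variations" 
without listing every sub-code; a real system would more likely keep an explicit list.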
Now, our imaginary clinical decision support system also needs to consult the lab data: is the 
patient's hemoglobin A1C above a certain value?  And we could ask the same question: how would a 
computer know that a patient has an elevated hemoglobin A1C?  So, we need to discuss representation 
of lab data.  Each lab test has a number code, not surprisingly, and you're getting the point here 
that we store things with numbers, just like your social security number or your telephone number.  
Results are stored in a table along with the patient's medical record number and the date.  You want 
to know the date of that test, and you want to know what the value of that test was.  But, 
unfortunately, lab coding, like ICD9 coding, is not that simple.  For example, the patient might 
have gone to have the test, the same old hemoglobin A1C, at Columbia, and there's a code for that.  
But then, maybe several weeks later, they went to Cornell for follow-up and had exactly the same 
test, the hemoglobin A1C, but it's a different lab with different materials, maybe the [inaudible] 
were slightly different, and the test at Cornell has its own code.  A computer that was trying to 
assist physicians in managing diabetes, and had to consult the hemoglobin A1C value, would have to 
recognize both codes.  Here at Columbia and Cornell, and this is probably true of most systems, we 
have a hierarchy of codes; this is called the Medical Entities Dictionary.  We have, on the left, in 
the middle, 60231, hemoglobin A1C test.  We see parents above and children below; these tests are 
all organized as a hierarchy (I'm going to show you that slide in one second), and we'll see in a 
second why that helps the computer find them.  So, if we look at all the hemoglobin A1C tests, we 
see that there's a hierarchy where, at the very top, are glycosylated hemoglobin tests.  There's 
nothing more general than that; that's exactly what we're looking for, since hemoglobin A1C is 
glycosylated hemoglobin, and that's what we're measuring.  But we see that there are branch points: 
there are two children, one is glycosylated hemoglobin and the other one is hemoglobin A1C test, and 
they are slightly different.  If we continue down to the children, we see on the left side the 
Columbia test, 35826, and on the right side the Cornell test, 57902.  My point here is a simple one: 
a computer has to find both of them, and the way the computer does it is by working its way down 
the tree.  It goes to the top test and works its way down, and any test that is a child of the 
60230 at the top is going to be considered a hemoglobin A1C test and could be consulted by the 
computer.  
This hierarchy has very specific qualities, and it's called an ontology.  An ontology is a formal 
representation of knowledge: a set of concepts within a domain (in this case, the domain is lab 
tests), with relationships between those concepts, such as "is a" (and we'll get back to the 
hierarchy in a second), and we use the ontology to reason about the entities of that domain.  We 
reason that if the test is a child of the top parent, then it must be a hemoglobin A1C test.  This 
is just a restatement of that ontology: the Columbia test on the bottom left is a total glycosylated 
hemoglobin test, which is a glycosylated hemoglobin test, and the same goes on the right side for 
the New York Hospital test.  This ontology helps us find all the tests that measure hemoglobin A1C, 
wherever they were done: at Cornell, at the Allen Pavilion, at four o'clock in the morning in the 
emergency room, by a different method.  All of those are going to be found by the computer if it 
marches its way down this hierarchy.
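Working down the hierarchy can be sketched in a few lines of Python.  This is only an illustration 
of the idea, not the Medical Entities Dictionary itself: the tree below is a simplified reading of 
the slide as described, the placement of 60230 at the top and 60231 on the right branch is an 
interpretation of the transcript, and "TOTAL_GHB" is a made-up placeholder because no code is given 
for that node.

```python
# Parent -> children ("is a" links read downward).
children = {
    "60230": ["TOTAL_GHB", "60231"],  # glycosylated hemoglobin tests (top)
    "TOTAL_GHB": ["35826"],           # total glycosylated hemoglobin test (placeholder ID)
    "60231": ["57902"],               # hemoglobin A1C test
    "35826": [],                      # Columbia test
    "57902": [],                      # Cornell test
}


def descendants(code):
    """Walk down the tree and return every concept below `code`."""
    found = []
    stack = [code]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            found.append(child)
            stack.append(child)
    return found


def is_a(code, ancestor):
    """True if `code` is the ancestor itself or sits anywhere beneath it."""
    return code == ancestor or code in descendants(ancestor)


print(sorted(descendants("60230")))
print(is_a("35826", "60230"), is_a("57902", "60230"))
```

A program that needs "any hemoglobin A1C test" asks for everything under the top concept instead of 
hard-coding each site's test code, so a new site's test only has to be added to the tree.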
And now, we have everything we need for our clinical decision support tool.  We have a way to 
identify if a patient has diabetes mellitus using the ICD9 codes.  We have a way of identifying if a 
patient has had a hemoglobin A1C test, no matter where they had it: the Allen Pavilion, Columbia, 
the emergency room.  And, with a simple set of rules, we can determine whether or not the patient is 
within the guidelines and, if necessary, send a little message to the provider saying that the 
patient is not within the guidelines.
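Putting those pieces together, a minimal Python sketch of the diabetes rule might look like the 
following.  The threshold of 7.0 is an assumption (the lecture does not state the preferred range), 
the dates and values are made-up sample data, and the table and function names are hypothetical.

```python
from datetime import date

# Diagnosis table: ICD9 codes per patient (codes as used in the lecture).
diagnoses = {"patient01": ["249.0"], "patient02": ["401.9"]}

# Test codes that count as a hemoglobin A1C test (the Columbia and Cornell codes).
HBA1C_TEST_CODES = {"35826", "57902"}

# Lab table: (patient, test code, date, value). Sample data, invented for this sketch.
lab_results = [
    ("patient01", "35826", date(2012, 1, 10), 7.2),
    ("patient01", "57902", date(2012, 2, 1), 8.4),
    ("patient02", "35826", date(2012, 1, 15), 9.0),
]

HBA1C_TARGET = 7.0  # assumed threshold; not given in the lecture


def has_diabetes(patient_id):
    """True if any of the patient's codes falls in category 249 or 250."""
    codes = diagnoses.get(patient_id, [])
    return any(code.split(".")[0] in {"249", "250"} for code in codes)


def latest_hba1c(patient_id):
    """Return (date, value) of the most recent HbA1c result at any site, or None."""
    results = [
        (when, value)
        for pid, code, when, value in lab_results
        if pid == patient_id and code in HBA1C_TEST_CODES
    ]
    return max(results, key=lambda r: r[0]) if results else None


def diabetes_alert(patient_id):
    """Return a message for the provider if control is not optimal, else None."""
    if not has_diabetes(patient_id):
        return None
    latest = latest_hba1c(patient_id)
    if latest is None:
        return None
    when, value = latest
    if value > HBA1C_TARGET:
        return f"{patient_id}: hemoglobin A1C {value} on {when} is above target."
    return None


print(diabetes_alert("patient01"))
print(diabetes_alert("patient02"))
```

Patient 02 has a high value but no diabetes code, so no message is sent; the rule fires only when 
both the diagnosis and the lab condition hold.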
You can see, from these two examples, that representing the knowledge was pretty straightforward 
and not that complicated.  We had tables: the patient's medicines, a medicine interaction table, and 
lab data tables.  We also used codes from terminologies and ontologies: we had the lab ontology code 
that we needed for the hemoglobin A1C, and we had the diabetes code from the ICD9.  We had a set of 
rules.  We had rules for kidney function.  We had rules for drug-to-drug interaction.  We had rules 
for diabetic control or outcome, in terms of the hemoglobin A1C.  
We also discussed extraction: how did we get those codes?  We had humans enter the codes for the 
most part, and we had some automation.  
The knowledge representation can be summarized in this next slide, where we see various additional 
information and how it might be represented.  On the left, in blue, we see lab data, blood pressure, 
temperature; these are all numbers.  This is what we call structured information: it's information 
that you can type into an Excel cell, if you will.  It's numeric, or it might be strings of words, 
but a computer knows how to find it; it's been doing so for decades. 
We also have unstructured information, in red, and that is far more complicated for a computer to 
deal with.  We have the unstructured information of the free-text note, the doctor writing about the 
patient in the note, and we also have the note from the radiologist looking at the image, or the 
electrocardiographer saying that the patient's electrocardiogram was abnormal.  The unstructured 
data is text, and that is not easily computable, if you will.  We're going to spend time, later on 
in this talk, discussing the role of natural language processing in converting what we call 
unstructured text data into a structured format.  That could also, obviously, be done manually: as 
we discussed, a human coder reads the free-text note and says, "Aha, the doctor said that the 
patient had diabetes mellitus, and now we're going to assign a code."  At the end of the day, either 
through structured numeric data or processed unstructured data, either manually or through NLP, we 
end up giving the patient codes for all sorts of things: disease codes and drug codes and medicine 
codes, etc.  We're going to talk all about those terminologies in a second.
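To give a feel for what "converting text into a code" means, here is a deliberately naive Python 
sketch.  It is plain keyword matching with a crude negation check, not real natural language 
processing and not the method discussed later in the talk; the term table, the negation words and 
the function name are all assumptions made for illustration.

```python
import re

# Term -> code table. Only the diabetes entry reflects the lecture's example.
TERM_TO_CODE = {"diabetes mellitus": "249"}

# If one of these words appears earlier in the same sentence, skip the match.
NEGATION = re.compile(r"\b(no|not|denies|without)\b", re.IGNORECASE)


def extract_codes(note):
    """Return the sorted codes for terms that are mentioned and not negated."""
    codes = set()
    # Split the note into sentences at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        lowered = sentence.lower()
        for term, code in TERM_TO_CODE.items():
            position = lowered.find(term)
            if position == -1:
                continue
            # Crude negation check on the text before the term.
            if NEGATION.search(lowered[:position]):
                continue
            codes.add(code)
    return sorted(codes)


print(extract_codes("The patient has a long history of diabetes mellitus. No chest pain."))
print(extract_codes("The patient denies diabetes mellitus."))
```

Even this toy shows why free text is harder than a table: the same words can assert or deny a 
condition, which is the kind of problem real NLP systems have to handle.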
So, in summary, there are different data types: data that a computer can access immediately, such 
as structured data, and data it can access through some intervention that converts unstructured to 
structured.  At the end of the day, we would like all the information that we need to be coded; 
this is probably not really plausible, but we aspire to have as much information about a patient as 
we can stored as codes, perhaps in a huge set of tables.  And this completes part 1, and now we'll 
go on to part 2.
[Silence]
Now we're going to start part 2, and we'll spend some time talking about terminologies.  We have 
already seen the ICD9 code for diabetes mellitus, and terminology.