>>Welcome to the Health IT Certificate program. My name is Herbert Chase, and I'm going to be talking today about knowledge representation. You can read my bio when you have nothing better to do. I've been here at Columbia for many years and I've been in the Department of Biomedical Informatics since 2006. As mentioned, we're going to talk about knowledge representation, so let's go over our learning objectives. First, we're going to discuss the definition of knowledge representation. We'll describe the different data in the patient record: what needs to be represented, what doesn't, and what information would be important to represent so that a computer might find it. We're going to describe how information is stored and utilized by a computer program to benefit healthcare. We're going to go over terminologies, how terminologies help us represent medical knowledge, and we'll go over the ones that are most commonly used. There'll be an additional discussion of the UMLS and SNOMED codes, which are very interesting terminologies that provide more information than just standard definitions and categorization of concepts. And we'll touch upon the role of natural language processing in extracting important information. Our talk today will be divided into four parts. First, we'll discuss representation in medical data. In part two, we'll spend a little time on the terminologies. In part three, we'll talk about representation of unstructured data, and you'll have a better idea of what that is as we proceed through this talk. And lastly, in part four, we'll talk about representation of higher levels of knowledge, such as guidelines, which are quite challenging. Let's first go over the definition of knowledge representation. Knowledge representation simply means how we represent our domain knowledge, in this case medicine and what we know about medicine, in a language that a computer can make use of.
We would like to represent it in a formal language, a computer program, and with the right programs in place, a computer can access this information and make use of it; make some decisions perhaps, and we'll talk about that in a second. Obviously, the definition is a little bit abstract at this point, but in a general sense, I think you can appreciate that a computer has to access information, and we have to represent the medical information in a way that would enable a computer to get to it and understand it. Here are some of the goals that require knowledge representation in medicine. Obviously, patient care: we might want to produce a summary, what are all the diseases a patient has, how is the patient doing. We want to construct expert systems, and we're going to talk about two expert systems in a few minutes. Can we write a program that actually figures out if the patient is doing well or not? Obviously, we need to make reports, both financial as well as medical reports. We'd like to manage the patient's care; all this knowledge about the patient, including what clinic they go to and who their doctor is, has to be represented in some way that a computer can access. We are going to want to transfer information from one place to another, even within a hospital, and perhaps across town or maybe cross-country. We'd also like to find information: when was the last time this patient had their sugar measured, or had their insulin dose changed? And lastly, knowledge representation benefits clinical research. We could ask, "Who are all the patients with diabetes mellitus?" and we'd have to represent the diabetes mellitus in the patient's record in order to answer that question. So, let's talk a little bit about what exactly is the kind of knowledge that we would want to represent in the patient's record; that is to say, what are the key elements of the patient record that need to be represented?
If we look at a typical patient, it gives us some idea of what we need to represent. The patient is a 75-year-old man, so we'd want to represent the age and the gender, with a long history of diabetes mellitus, so certainly the condition diabetes mellitus, and hypertension, depression and chronic kidney disease; they would all have to be represented in some way. He's cared for by many physicians, and we'd want to know who his physicians are, who practice in a variety of clinical settings. Where has this patient been seen? In what clinics has this patient been seen? What is the primary care physician's name, and where does that physician practice? The patient is on various medicines that control diabetes, high blood pressure and depression, so certainly the medicines, when they were started and what their doses are, would be essential to have represented in the patient record. The patient is routinely sent to have his blood drawn to measure tests indicative of his current disease status, such as hemoglobin A1C; of course, we would want all the lab data to be represented. The slide summarizes the many different pieces of the patient's puzzle that we would want to represent: medical conditions, I've mentioned a couple of them, diagnoses and medicines, the various reports that we'd want to access, some general information, and overall care. We might want to represent the data in a way that we could construct an alert, or provide information on the sign-out, or have a guideline look at the patient's record to see if the patient's care was matching a guideline. This just scratches the surface; there are many more categories of information that would be essential to represent.
I thought that the simplest way to drill down into the nitty-gritty of knowledge representation would be to use a couple of examples: build a system and look to see what exactly is the knowledge that we need to represent in order to carry out the task. So I thought of two simple, straightforward examples for us to discuss. The first is a Computerized Provider Order Entry system, CPOE, and the one that we're going to build will check for drug contraindications and adjust dosage, if necessary. The second will be a clinical decision support system that optimizes management of diabetes. So, let's first talk about CPOE. What would be the essential elements of this system? The computer would have to know, of course, that one drug should not be given if a patient is on another. So if drug A and drug B interact, and one should not give drug A and B together, the computer has to know this in some manner, and we'll figure out how that's going to be stored. How could the knowledge be represented so that a computer program could use it? Where would that knowledge be stored so that a computer could actually carry out this task? The simplest way to represent the knowledge is in tables. All of you, I'm certain, have filled out Excel tables for one reason or another, and you could imagine one table for the patient's prescribed drugs, another table for drug interactions, and then a set of rules. The computer would obviously need a set of rules, simply programmed like many simple programs: if the patient's on drug A, do not prescribe drug B.
So, here is a very simple example of a CPOE. Consider that the patient is currently on the three drugs listed in the top table on the right: an MAO inhibitor, that's a monoamine oxidase inhibitor, and the patient is also on insulin and morphine. Imagine that the doctor decides that he or she wants to order Demerol. The task here is: are there any interactions between any of the drugs the patient is already on and the Demerol? The machine could consult the patient's table of drugs and see the three drugs they're on, and then consult an interaction table, and the interaction table might reveal that the Demerol and the MAO inhibitor are contraindicated, that there's an interaction between the two and the Demerol should not be given. You can imagine the machine sending back, based on a simple rule, the warning, "You may not order Demerol." This very combination was responsible for the death of Libby Zion in the '80s, and this started a revolution in medicine, if you will, where interns' work hours were reduced, and the impetus to develop systems to protect patients from drug interactions reached a new level, not seen previously. So, this is a perfect example of how the machine can protect the doctor and the patient from the prescribing of two drugs that interact. Another simple modification of the CPOE is to ask the system to determine whether or not a drug dose should be modified.
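The interaction check just described can be sketched in a few lines of Python. This is only an illustration under assumptions, not the system on the slide: the table layouts, the patient identifier `patient01`, and the function name `check_order` are all hypothetical, and a real interaction table would hold far more than one pair.

```python
# Table 1 (hypothetical layout): drugs each patient is currently on.
patient_drugs = {
    "patient01": ["MAO inhibitor", "insulin", "morphine"],
}

# Table 2 (hypothetical layout): pairs of drugs that must not be combined.
# A frozenset makes the pair order-independent (A+B is the same as B+A).
interactions = {
    frozenset({"Demerol", "MAO inhibitor"}),
}


def check_order(patient_id, new_drug):
    """Apply the rule 'if the patient is on drug A, do not prescribe drug B'.

    Returns a list of warning messages; an empty list means no interaction
    was found in the table.
    """
    warnings = []
    for current in patient_drugs.get(patient_id, []):
        if frozenset({current, new_drug}) in interactions:
            warnings.append(
                f"You may not order {new_drug}: interacts with {current}."
            )
    return warnings


for message in check_order("patient01", "Demerol"):
    print(message)
```

The knowledge lives entirely in the two tables; the rule itself is a single lookup, which is why adding a new interaction means adding a row rather than changing the program.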
In this case, the doctor is ordering gentamicin. The system consults a table of the patient's lab data, and we see that the patient has various lab values: a serum sodium of 135, a creatinine of 1.4. A conversion can be made internally by the computer, which converts the creatinine, which is a measurement of kidney function, into an actual filtration rate estimate, which is a much more accurate reflection of kidney function. A simple rule could be programmed: if the estimated GFR is less than 60, then the patient has reduced kidney function. Now we can consult the adjustment table. We see that the doctor has ordered gentamicin, and we see the rule: if reduced kidney function, adjust, yes or no? Yes. And the message can be sent back, "Please adjust the gentamicin level." These are very, very effective simple tools, and you can see here that the knowledge was represented merely in tables, either lab, medicine or interaction, with one set of rules. And in this case, it's a very simple set of rules that basically any high school kid learning Java, probably any kindergarten kid now learning Java, could write. Our clinical decision support system will be a little more challenging. The CPOE system that we just discussed was really quite straightforward, but the clinical decision support is going to be a little more sophisticated. The computer must be able to, one, identify patients with medical conditions, and the one that we're going to focus on is diabetes mellitus. It also needs to consult a set of rules that represent treatment guidelines. Now, treatment guidelines can be quite complex, and at the very end of this session, we're going to talk about representation of guidelines, but for the time being, we'll just say that the decision support tool must consult the guidelines to see if the patient's care has followed them. So, let's discuss our diabetes clinical support tool.
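To make the gentamicin dose check above concrete, here is a minimal Python sketch. The talk does not say which conversion the computer uses, so the Cockcroft-Gault creatinine clearance formula stands in for it here as an assumption; the 70 kg body weight, the table layout, and the function names are likewise hypothetical. Only the creatinine of 1.4, the patient's age of 75, and the threshold of 60 come from the talk.

```python
def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female=False):
    """Estimate creatinine clearance in mL/min (Cockcroft-Gault).

    Used here only as a stand-in for the unspecified creatinine-to-GFR
    conversion mentioned in the talk.
    """
    clearance = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    return clearance * 0.85 if female else clearance


# Rule from the talk: an estimate below 60 means reduced kidney function.
REDUCED_FUNCTION_THRESHOLD = 60

# Adjustment table (hypothetical layout): does this drug need a dose
# adjustment when kidney function is reduced?
adjust_if_reduced = {"gentamicin": True}


def dose_message(drug, kidney_estimate):
    """Return a message if the ordered drug needs adjusting, else None."""
    reduced = kidney_estimate < REDUCED_FUNCTION_THRESHOLD
    if reduced and adjust_if_reduced.get(drug, False):
        return f"Please adjust the {drug} dose."
    return None


# 75-year-old man with a creatinine of 1.4; the 70 kg weight is assumed.
estimate = cockcroft_gault(age_years=75, weight_kg=70, serum_creatinine_mg_dl=1.4)
print(round(estimate, 1), dose_message("gentamicin", estimate))
```

With these inputs the estimate comes out at about 45 mL/min, below the threshold, so the adjustment message is produced.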
First, as mentioned, the tool needs to figure out which patients have diabetes mellitus, and that will obviously take place in the electronic record, and we can ask, now, how is diabetes mellitus represented in the EHR so that a computer can find it? The second element would be following the guidelines, and we're going to use hemoglobin A1C, which reflects diabetes control; that's a lab test. How is the lab test represented, again, so a computer can find it and make a decision as to whether or not the patient's values are within the preferred range? And lastly, we can consider having the computer send a message if control's not optimal. A schematic of a typical decision support tool would characterize the elements that we just discussed: does the patient have diabetes mellitus, and if so, and the hemoglobin A1C is above a range that we consider.
By and large, the diseases that a patient has are represented by codes in the medical record, and for several decades, the codes have been the ICD9 codes, International Classification of Diseases, 9th version, in which every disease has an ICD9 code. We're going to talk more about ICD9 codes a little bit later on, when I discuss terminologies, but for the time being, and for the purpose of the clinical decision support tool, let's just say that if the patient has diabetes mellitus, then the code, in this case 249 for diabetes, appears in their record someplace. So, imagine a table once again; a table's a very effective way of storing structured information that a computer can find. Patient 01 has three different codes, one of which is diabetes mellitus. Now, a key issue to explore is how that code actually gets into the patient's record, and for years, it has been done manually; that is to say, a coder or a physician actually types in, or before typing, wrote in, the code 249.0 if the patient had diabetes mellitus.
I want you to keep this in mind, because we're going to talk a little bit later on about automated ways to code patients, but for the most part, codes end up in the patient's electronic health record because a human has made the determination that the patient has diabetes mellitus. This is largely, if not entirely, for billing purposes. ICD9 codes are used for billing purposes. If you want to get reimbursed, as a physician or a hospital or healthcare provider, you need to have the patient's ICD9 codes listed; they need to be appropriate, they need to explicitly represent what the patient actually has, and you'll see in a few minutes that there's a huge amount of granularity when it comes to ICD9 coding. For diabetes, for example, there are many variations, but for the time being, let's just do plain vanilla diabetes mellitus. Now, you could imagine another way that the code would get there: the physician or the provider checks off, in a box, diabetes mellitus, and you see the little radio button and the check box, and that would automatically populate the table. One brief word on the International Classification of Diseases, ICD9: it's a terminology, which is a finite enumerated set of terms intended to convey information unambiguously. So diabetes is one term and should have one code. Each condition has a specific code; as we say, diabetes is 249. And the important point, and this will become, I think, a little more obvious later on, is that no condition could have or should have more than one code. For example, you could imagine the situation where a patient has chronic kidney disease because they have diabetes, or we could imagine characterizing that patient in a different way: we could say the patient has diabetes and the diabetic complication of chronic kidney disease.
Now, is this diabetes, under the diabetes umbrella, or is this chronic kidney disease? Whatever we decide, there should only be one code for that. We're going to see in a second that that is the case; there is only one code. We'll get back to this later on, but I did want to give you some idea of what the ICD9 terminology looks like. We see that 249 is diabetes mellitus, and 249 includes diabetes mellitus due to, or secondary to, drug-induced or chemical-induced causes or infection, but it excludes, for example, diabetes during pregnancy, gestational diabetes. If a patient has diabetes mellitus during pregnancy, they're supposed to get a different code, and if a newborn has diabetes, neonatal diabetes mellitus, again, it's still the same disease, diabetes mellitus, but it should be characterized as 775. So, on one hand, it seemed pretty straightforward, diabetes is 249; well, it ends up being far more complicated than that, and I think this next slide helps us understand both the granularity and the complexity. The code 250, not 249, is the code that would be used for patients with diabetes and renal manifestations, that's kidney manifestations, and I had just mentioned chronic kidney disease. It says in the ICD9 coding book, use an additional code to identify the manifestation as chronic kidney disease. In other words, they're saying that if the patient has diabetes and chronic kidney disease, they should get two codes. We're going to talk more about this later on, but the main point is that patients with diabetes have a code, and a computer can figure out if a patient has diabetes by looking for the code 249. We could program the computer to look for 250, we could program it to look for all the variations of diabetes mellitus, and it would be looking for an explicit number. Now, our imaginary clinical decision support system also needs to consult the lab data: is the patient's hemoglobin A1C above a certain value?
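Looking for an explicit number, as just described, can be sketched as follows. This is an illustrative Python fragment under assumptions: the table layout, the patient identifiers, and the non-diabetes codes in it are made up for the example, and the set of diabetes categories contains only the two codes mentioned in the talk (249 and 250), not every variation.

```python
# Hypothetical diagnosis table: patient identifier -> ICD9 codes on record.
# Patient 01 has three codes, one of which (249.0) is diabetes mellitus;
# the other codes are illustrative placeholders.
patient_codes = {
    "patient01": ["401.9", "249.0", "585.9"],
    "patient02": ["401.9"],
}

# Three-digit ICD9 categories the program treats as diabetes mellitus.
# Only the two categories named in the talk are listed here.
DIABETES_CATEGORIES = {"249", "250"}


def has_diabetes(patient_id):
    """True if any code on record falls under a diabetes category.

    The part before the decimal point is the category, so 249.0 and 250.4
    both count, whatever the more granular digits say.
    """
    return any(
        code.split(".")[0] in DIABETES_CATEGORIES
        for code in patient_codes.get(patient_id, [])
    )


print(has_diabetes("patient01"), has_diabetes("patient02"))
```

Matching on the category rather than the full code is one simple way to cope with the granularity of ICD9 coding; a production system would need an explicit, reviewed list of codes.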
And we could ask the same question: how would a computer know that a patient has an elevated hemoglobin A1C? So, we need to discuss representation of lab data. Each lab test has a number code, not surprisingly, and you're getting the point here that we store things with numbers, just like your social security number or your telephone number. Results are stored in a table along with the patient's medical record number and the date. You want to know the date of that test, and you want to know what the value of that test was. But, unfortunately, lab coding, like ICD9 coding, is not that simple. For example, the patient might have gone to have the test at Columbia, the same old hemoglobin A1C, and there's a code for that, but then maybe several weeks later they went to Cornell for follow-up and they had exactly the same test, the hemoglobin A1C, but it's a different lab, different materials, maybe the [inaudible] were slightly different, and the test at Cornell has its own code. A computer that was trying to assist physicians in managing diabetes, if it had to consult the hemoglobin A1C value, would have to recognize both codes. And here at Columbia and Cornell, and this is probably true of most systems, we have a hierarchy of codes; this is called the Medical Entities Dictionary. We have, on the left, in the middle, 60231, hemoglobin A1C test. We see parents above and children below, and these tests are all organized as a hierarchy. We're going to talk about the hierarchy in one second, I'm going to show you that slide, and we'll see why that helps the computer find the test.
So, if we look at all the hemoglobin A1C tests, we see that there's a hierarchy where, at the very top, are glycosylated hemoglobin tests. There's nothing more general than that, and that's exactly what we're looking for; with hemoglobin A1C, it's glycosylated hemoglobin that we're measuring. But we see that there are branch points: there are two children, one is glycosylated hemoglobin and the other one is hemoglobin A1C test, and they are slightly different. If we continue down to the children, we see that on the left side there is the Columbia test, 35826, and on the right side is the Cornell test, 57902. My point here is a simple one: a computer has to find both of them, and the way the computer does it is by working its way down the tree. It goes to the top test and works its way down, and any test that is a child of the 60230 at the top is going to be considered a hemoglobin A1C test and could be consulted by the computer. This hierarchy has very specific qualities, and it's called an ontology. An ontology is a formal representation of knowledge as a set of concepts within a domain; in this case, the domain is lab tests. There are relationships between those concepts, such as is-a, and we'll get back to the hierarchy in a second, and we use the ontology to reason about the entities of that domain.
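Working down the tree can be sketched as a small traversal over parent-to-child links. The following Python is an illustration under assumptions, not the Medical Entities Dictionary itself: the codes 60230, 60231, 35826 and 57902 are the ones mentioned in the talk, but exactly which node each one labels is my reading of the slide description, and `TOTAL_GHB` is a made-up placeholder for the left-hand intermediate node, whose code is not given.

```python
# Parent -> children links of a tiny lab-test hierarchy (an "is-a" tree).
# TOTAL_GHB is a hypothetical placeholder; the placement of the numeric
# codes is an assumption based on the description in the talk.
children = {
    "60230": ["TOTAL_GHB", "60231"],  # top: glycosylated hemoglobin tests
    "TOTAL_GHB": ["35826"],           # left branch -> Columbia test
    "60231": ["57902"],               # hemoglobin A1C test -> Cornell test
}


def descendants(code):
    """Collect every test below `code` by walking down the tree."""
    found = []
    stack = list(children.get(code, []))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found


def is_a(code, ancestor):
    """True if `code` is `ancestor` itself or sits anywhere beneath it."""
    return code == ancestor or code in descendants(ancestor)


# Both site-specific tests are found from the single top-level concept.
print(is_a("35826", "60230"), is_a("57902", "60230"))
```

A decision support rule can then ask for "anything that is a 60230" and never needs to be updated when another site's test is added beneath it.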
We reason that if the test is a child of the top parent, then it must be a hemoglobin A1C test, and this is just a restatement of that ontology: the Columbia test on the bottom left is a total glycosylated hemoglobin test, and is a glycosylated hemoglobin test, and the same goes on the right side, for that New York hospital. This ontology helps us find all the tests that measure hemoglobin A1C, wherever they're done; if they're done at Cornell, at the Allen Pavilion, at four o'clock in the morning in the emergency room, by a different method, all of those are going to be found by the computer if it marches its way down this hierarchy. And now we have everything we need for our clinical decision support tool. We have a way to identify if a patient has diabetes mellitus, using the ICD9 codes. We have a way of identifying if a patient has had a hemoglobin A1C test, no matter where they had it: the Allen Pavilion, Columbia, the emergency room. And with a simple set of rules, we can determine whether or not the patient is within the guidelines and, if necessary, send a little message to the provider saying that the patient is not within the guidelines. You can see from these two examples that representing the knowledge was pretty straightforward and not that complicated. We had tables: the patient's medicines, the medicine interaction table, the lab data tables. We also used codes from terminologies and ontologies: we had the lab ontology code that we needed for the hemoglobin A1C, and we had the diabetes code from the ICD9. We had a set of rules. We had rules for kidney function. We had rules for drug-to-drug interaction. We had rules for diabetic control or outcome, in terms of the hemoglobin A1C. We also discussed extraction: how did we get those codes? We had humans enter the codes for the most part, and we had some automation.
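Putting the pieces together, the diabetes decision support rule might look like the Python sketch below. Everything in it is illustrative: the table layouts, dates and lab values are invented for the example, and the cutoff of 7.0 is an assumed placeholder, since the talk does not state the range it considers optimal. The two test codes are the Columbia and Cornell codes from the talk, listed directly here rather than looked up in the hierarchy.

```python
from datetime import date

DIABETES_CATEGORIES = {"249", "250"}   # ICD9 categories named in the talk
A1C_TEST_CODES = {"35826", "57902"}    # Columbia and Cornell test codes
A1C_THRESHOLD = 7.0                    # assumed cutoff, not from the talk

# Hypothetical tables: diagnosis codes and lab results (invented values).
diagnoses = {"patient01": ["249.0"], "patient02": ["401.9"]}
lab_results = [
    {"patient": "patient01", "test_code": "35826",
     "date": date(2012, 1, 10), "value": 6.8},
    {"patient": "patient01", "test_code": "57902",
     "date": date(2012, 2, 3), "value": 8.1},
]


def a1c_alert(patient_id):
    """Return a message if a diabetic patient's latest A1C is above range."""
    codes = diagnoses.get(patient_id, [])
    if not any(c.split(".")[0] in DIABETES_CATEGORIES for c in codes):
        return None  # rule applies only to patients coded with diabetes
    results = [
        r for r in lab_results
        if r["patient"] == patient_id and r["test_code"] in A1C_TEST_CODES
    ]
    if not results:
        return None  # no hemoglobin A1C on record at any site
    latest = max(results, key=lambda r: r["date"])
    if latest["value"] > A1C_THRESHOLD:
        return (f"{patient_id}: hemoglobin A1C {latest['value']} on "
                f"{latest['date']} is not within the guideline range.")
    return None


print(a1c_alert("patient01"))
```

Note that the most recent result comes from the second site; the rule only sees it because both test codes are recognized as hemoglobin A1C.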
The knowledge representation can be summarized in this next slide, where we see various additional information and how that might be represented. On the left, in blue, we see lab data, blood pressure, temperature; these are all numbers, and this is what we call structured information. It's information that you can type into an Excel cell, if you will; it's numeric, or it might be strings of words, but a computer knows how to find it, and it's been doing so for decades. We also have unstructured information, in red, and that is far more complicated for a computer to deal with. We have the unstructured information of the free text note, the doctor writing about the patient in the note, and we also have the note from the radiologist looking at the image, or the electrocardiographer saying that the patient's electrocardiogram was abnormal. The unstructured data is text, and that is not easily computable, if you will. We're going to spend time, later on in this talk, discussing the role of natural language processing in converting what we call unstructured data, text, into a structured format. That could also, obviously, be done manually; as we discussed, a human coder reads the free text note and says, "Aha, the doctor said that the patient had diabetes mellitus, and now we're going to assign a code." At the end of the day, either from structured numeric data or from processed unstructured data, either manually or through NLP, we end up giving the patient codes for all sorts of things: disease codes and drug codes and medicine codes, etc.; we're going to talk all about those terminologies in a second. So, in summary, there are different data types: some that a computer can access immediately, such as structured data, and some that it can access only through some intervention that converts unstructured data to structured.
At the end of the day, we'd like all the information that we would need, and I'll bet that this is not really plausible, but we aspire to have as much information about a patient as we can stored as codes, perhaps in a huge set of tables. And this completes part 1, and now we'll go on to part 2.
Now we're going to start part 2, and we'll spend some time talking about terminologies. We have already seen the ICD9 code for diabetes mellitus, and terminology.