>>Hello, welcome again to the Health IT Certificate program and lecture series. I'm Adam Wilcox and this is the second lecture on data entry and storage in clinical computing systems. For this lecture I'm going to go through and example of an object oriented event model approach which is slightly different from the event model approach that I talked about before. It's an interesting implementation just because it kind of exposes the rationale for the event model and the benefits of the approach. It's not necessarily something you would come across regularly but if you can understand it as an example and why it was done, you can understand the consequences of other decisions around data entry and storage. I'll also talk about data normalization and denormalization which hopefully will help you understand why columns and tables may sometimes be arranged the way that they are and the consideration if you're ever designing your own data tables. I'll also discuss data vocabulary issues briefly because that's talked about in other lectures but I still talk about it just because it's an issue for data entry and storage. And finally, I'm going to talk about clinical data warehouses which are a type of clinical repository use for data analysis generally. So the object oriented approach, the example that I'm pulling from is one that was done at Intermountain Healthcare where they had an event model and they moved it to a more expansive clinical event model where they created, they packaged the entire event as an object and they had the event stored and encapsulated in what was called binary and coding rules or burr strings [ phonetic spelling ]. Now binary and coding rules is not that important for you understand, just think of it as a way that you've compressed and created a string of all of the information that maybe in the event. The reason they did these and their motivation was because it became faster to read the entire event then to be moving around to find its different parts and then they had a secondary relational database structure to index the clinical events. So, this is similar to the event model and here's more detail of the event model that I mentioned before about that was described on [inaudible] where you have the event and you have the, you know, you can events around visits and laboratory results. But then you can have these different actions which may be a history of physical examination or hemoglobin tests. These are all [inaudible] relationships in children supertypes and subtypes, etc. In the clinical event model, now this is information taken from Craig Parker's thesis when he was working in [inaudible] healthcare, you have a result and a result contains all the components like a lab result contains the testing, the value and then these lab results with the, could also be children of events that would be more encapsulating this. Now, this information, it's kind of stored in a nomenclature here or a structure but looks a lot like a computing language, it was called ASN1, other forms in which things can be stored or XML but that's really just the syntax for how information is stored. Now, why would they choose a model that's more object oriented where you have the whole event encapsulated in one string rather then the other event model that I talked about where you have events and then multiple components in a table? The reason is, is because it was following the mechanics actually of the data storage around the disks, so you have here a picture of a disk and when a disk, the way that the disk gets stored is you have these spinning disks and this head moves around along these disks and it writes the information into magnetic form to this disk. And I think probably most of you are familiar with some type about how this works. It's not completely as relevant anymore because a lot of the information or a lot of the data storage mechanics have moved to solid state. But even in a solid state storage while you may not have the moving parts of a disk you do have some ways of traversing that structure that are optimized in some way that are not optimized. So you might ask, well what are the optimal ways of data storage and retrieval within a spinning disk? The thing that moves the fastest in this mechanical disk is that is the spinning of the disk. That moves really, really fast at thousands of rotations per minute. So you've got this thing just spinning around really, really quickly and this head is moving around to different parts and when it gets over a certain part of the disk that's where it's picking up the data as it moves underneath it as the disk is spinning underneath it and that's where it can be read from. The thing that takes a long time relatively compared to the spinning is moving this disk head to the right place. So if the disk is spinning really fast, if you're in one spot but you've got, and you want to move across many different tables to gather information each of those tables will be stored in a different spot and likely wouldn't be close together. And so you have to move this disk head around the disk to the different tables. The advantage to this object oriented approach is what they're trying to do is they're trying to package the whole thing all together underneath, you know, in one table at a time. So, if you go to the event you've got an event that represents the whole form, the whole data entry form or a whole lab test or even you may have what something that represents the entire visit. And you move over it and you can read all of the data at once and it's actually easier to just grab all of the data and then in memory throw out almost everything that you don't need or throw out everything that you don't need which may be almost everything except for that one value in a test, then it is around and find all the components of that test. So that's kind of why they did this model is because it was following the mechanics of the data. Just to review and when you're doing this clinical event model you must have the data packaged together so that you can read it all at once. The problem is is that when you have a really complex form or really complex data type, it can take a long time to store because you can't store the thing until you've got everything gathered together and it can be compressed all at once and then thrown into this data place. Another problem is that sometimes the vocabulary that's used which is used to both optimize the storage but also to facilitate the retrieval and deal with the synonymy may get very large. And in some cases that vocabulary can actually exceed what can be stored kind of in the ram or in the memory of the machine. And if that's the case then you have to keep going back to the disk for the translation and that's been shown to be huge performance cost. What's done with this clinical event model is, you know, you could ask the question well how do you know what event is storing what piece of information? How do you know if they're, if when you're looking for a lab test that that clinical visit has that type of lab value? Why would you be going to, because otherwise it can be really inefficient because if you have to keep searching through each clinical event and saying, oh was the lab test in this one? No. Was the lab test in this one? No. So what is done is you have both a relational database and a relational database that kind of gives the location of different types of data in the clinical events. And one of the issues with this clinical event model is this relational database that's used to navigate to the clinical events can be difficult to model. You know, when it's modeled correctly it doesn't effect performance poorly but there are times when arbitrary decisions about the modeling unknowingly can effect performance. Those are just some of the disadvantages. But the real advantage again, is going back to that image of the disk is that it's so much easier to read a lot of data at once and then throw away a large part of that that you don't need then it is to have to move that disk head to just gather a precise pieces of information around the disk. And that was the advantage to the clinical event model. Now the reason I'm telling you about this is not because it's the type of system that you're going to likely see but rather if you think about that it kind of explains some of the advantage to the event model versus the relational table model where this whole disadvantage of moving around the disk heads is happening to a lesser degree. In the event model we have two tables and the data can be physically organized around the patient. So we really don't have to move that disk head very far when we're looking at one patient's information and we're staying on one table. Where as in the relational model you can have the data for a visit or the data even for a [inaudible] is spread across multiple tables and to retrieve that you're really moving and traversing back and forth across this disk head. The second issue that I want to talk about is a normalization problem or what's called the normalization problem. This happens when a database structure becomes normalized which means that the structure is made to fit a database normal form. Now, if you've taken database classes you may know what the first normal form or second normal form are. What it really means is that you model the [inaudible] structure in an optimal way. It's a standard practice, it's not just a standard it's actually a recommended practice whenever databases are collected. It's important to optimize this for storage efficiency both in terms of the efficiency of storing data but in making the data as compact as possible. But what can happen in a database when it is normalized perfectly or actually even normalized too well is that the querying can result in multiple joins that can become highly inefficient especially when you've got tables with a lot of data or in other words, very large tables with many, many rows. Let me kind of describe in more detail how this happens. So here we have a normalized data structure. This is actually a normalized data structure that follows somewhat of an event model. Then we have three tables here rather then just two, two tables representing the components and the observation and values. So you can have an event table that represents, you know, the patient identifying and the date and the event ID. Think of that either as a visit or a form or a lab test. And then you have the observation table which are the specific components underneath that and then among those observations you could have different values. So in this structure you have a few events that within those events you may have multiple observations and then within the observations you have the values mapping that. And the challenge here is that when it's normalized correctly, you wouldn't be storing redundant data across the different tables so, the patient ID and the date are pieces of information that don't need to be stored on the observation table because they're stored in the event table and they don't need in the values table either. Because that data can be retrieved by linking back to it using the event ID mapping from the observation table to the event table or using the event ID and observation ID in sequence mapping back to the values. The challenge is if you ever want to get a value out of something you have to joint these three tables together and in an event model these tables will become very, very large. You'll have many, many clinical results and many clinical events for all of the different patients and so what the piece of information you're looking at is just one row among thousands or millions of events and thousands or millions of observations and thousands or millions of values. And so these quickly become the largest tables in your database and just to find relevant information you have to join them all together. If you wanted to say, show me all the patient X's potassium laboratory values, you have to find all the potassium values in the value table but you don't which patient it is or if you want to do it within a range of dates, you've got to join these tables back to the event table. And that joining of really large tables can be really inefficient and really difficult to do. So, how do you fix this problem? You denormalize it which is not necessarily perfect form a database but, and it has a disadvantage that you need to make sure that the information are kept consistent across the data but when done correctly, can have huge impacts in terms of the efficiency of data retrieval. So in this case what I've added to the observation table and to the values table is the patient identifier and the date that the report only stored in the event table. So now imagine that query that I talked about before where you want to get all of the potassium values, which is a lab test, for a certain patient that within a certain range of time. Now, because I have the patient ID and I have the date to columns that were known to be very important that were only stored in the event table are not stored in the values table. I just have to query the values table. So what was a three table join, a three large table join, is now just running through one table and it can be high efficient. And I've seen this again, in a couple of places where the denormalization had a huge impact on performance. Now, why am I showing you this? I'm showing necessarily that you're going to be in this situation where you're going to be identifying the data and saying, oh I figured out what are performance problem here where you're going to be informing the vendor that the issue is that they have to denormalize the table but rather if you're ever involved in looking at data structures behind the systems and you notice that there seem to be a lot of duplicated columns in certain ways that you can understand the reason that may be the case is for the efficiency and the denormalization issue is one of the big gains in efficiency that can be had. Now, I want to talk just briefly, just one slide, about some vocabulary issues that are related to data storage and data access. I know that vocabulary is talked about in a lot more detail in some other lectures and so I don't want to go into that level of detail about all of the ways to manage vocabulary with the different techniques to use. But I do want to talk about its relevance in data storage and data access. Now there's three concepts that are important to understand. Data dictionary or terminology and those two are pretty much synonymous and then a vocabulary. So a data dictionary and terminology are just kind of terms that maybe used within specific table just a mapping of what the term is to what it actually means and whereas the vocabulary is a structure that can manage that terminology. Normally with a simple database all you really need is a good data dictionary that says what the different codes actually mean and that's because the definitions are usually just specific to a table that it's just stored in that one table. However, when you get to clinical information systems that usually isn't the case because medicine is complex and you have concepts that would be represented in one table could also somewhat synonymous to concepts that are in another table. An example, just a simple example is where you could have blood pressure that would be stored or stored from different data entry forms, one for a physician, one for a nurse or one for a medical assistant and these would all be stored in different tables but the concept of blood pressure even though it might be stored by a different person or it may be done in a different way is still needs to be related to each other. So if the data models are spread across multiple tables in this way, just having the terminology really is going to be insufficient because it doesn't manage the synonymy. So vocabularies are really critical in order to have a comprehensive view of the data. For example, if you wanted to write a query that says, give me all of the blood pressures that have been stored for this patient because I want to see how well they're doing in that management of the hypertension, I would need a vocabulary to be able to tell me that the concept of blood pressure is represented in each of these different tables. And so, while they may be stored with a different code the vocabulary would link those columns together and I could, and I would know how to query them that takes advantage of that link. Now, the challenge is that often where people get in trouble in vocabularies is that they haven't managed them as the, in the initial configuration of the forms. So the warning is that vocabulary is likely to be very important. That terminology alone is probably going to be insufficient if you're configuring forms and so you need to manage the vocabulary throughout the configuration period. And the other challenge, the related challenge is is that you usually don't know how valuable the vocabulary is until you deal with problems where it's too late. So this is just a warning when you're doing the form configuration to be aware that vocabulary management is important throughout the configuration period. The last topic I want to discuss is clinical data warehouses. And clinical data warehouses are different from a clinical repository in that the clinical warehouse is usually used just for an analytic purpose. It's not typically used for active patient care. If you wanted to pull up, show me the data that our available for a certain patient that I'm seeing right now, that's usually done through an electronic health record and that data are pulled directly from the clinical repository or the clinical database. The data warehouse is something more used for a backend analysis where you wanted to compare different patients, often happens after the patient goes home. But there are some exceptions but for the most part, it's done for post analytics about what you can do better next time. So it's another example of the secondary store of information but it's never the primary store and the role of a data warehouse to provide access for ad hoc queries reporting and analytics. Now, when these ad hoc queries reporting and analytics are done it's usually done across patients. The repositories and the clinical databases are usually more optimized around just the information for a specific patient which is why data may be physically located, physically stored on a per patient level. But that's not going to be effective when you do it from, when you're trying to do an analytic query across many different patients. Relating back to that picture of the disk that I showed you and the disk kept moving because you've got to move across all of these different patients, so sometimes it's helpful to localize a bunch of different patient information in specific tables and that's really what a data warehouse is doing. So a data warehouse structure usually has identifying information, it could have a table that has information about patients, then you would have some data that represent the main data that would be in a clinical repository and an event model. Those would be the event tables and a relational model, these would just be copies of the relational tables or they may be groupings, they may represent groups of relational tables together. You could also have look up tables that are vocabulary translation tables and often these contain patient data. But then what's interesting about a data warehouse is that you have these specialty data marks or these tables that are built specifically around an individual analytic task or a group of analytic tasks that use similar information. Let me give an example, so let me give an example, here we have part of the schema from the clinical data warehouse at Columbia University. Now, on the right side we have these different data marks that are specific to a clinical area. So you'd have a data mark around neurology or a data mark around OBGYN where we have extracted the data and optimized it for the types of queries that the analysts for neurology or the analysts for OBGYN might be getting. The visit table which is on the left side is also a type of data mark. When the data is stored in the event model and the clinical database here, we don't store it as a visit, we don't store the concept of, you know, just one line per visit where the admission date and the discharge date are all stored together in one row. We instead store the concept of an admission and a discharge. Now that's really efficient for clinical repository and for viewing the patient data because you don't want to wait to store information that the patient was admitted until after the patient's discharged. You know, you want to be storing the events that happened but in an analytics it's actually really useful to know, you know, what the timeframe is between a visit. And so we've created this data mark around visits where we've taken the admissions and the discharges and we've linked them together and then we can say, oh here's the admit date and here's the discharge date and this represents one visit. And then we gather different relevant information for that. At the bottom you just see examples of the pulmonology event tables that would be stored. These would be components of the relational or of the event tables that are in the database but then just filtered down to just the pulmonolgy ones because they could be, so while they represent the exact data they're in the repository, they have been filtered somewhat so that they only represent a certain area of so that they can be used more efficiently for [inaudible]. Now, because of this, because we have these very specific tables within groups, the clinical data warehouse schema, which is shown here, becomes pretty complex. Whereas the repository representation that where the data are initially coming from may contain only 10 to 15 tables when we create the warehouse we're going to create little tables off to the side in different areas that are optimized under specific tasks and so the tables can multiply a lot. This isn't necessarily a problem because the data are not, it's not a single store of data. All of this data are secondary stores and you can have the data duplicated across different places. So for example, you can have one table which represents all of the visits and you can have another table which may have slightly different information which is optimized around OBGYN visits and so that's in some sense a subset of the other tables of the visit table though much of the information is duplicative. But that's not really a problem in a data warehouse because it's designed to facilitate an analysis and so the duplicative data across different places isn't a problem as long as the people writing the queries can understand that and know that when they're querying something these are tables that they need to access in different ways. And you may in your career see a data warehouse and know that certain characteristics of the data warehouse do not fit the characteristics of a repository and what you wouldn't do in a repository is commonly done in the warehouse. I just want to just kind of explain that the reason this is done is it's optimized really around retrieval, around cross patients and sometimes in doing that it's efficient to create almost a separate table for a separate group of queries.