How complex mentions of pathologies are detected at IOMED
Natural language processing (NLP) has attracted a great deal of attention in recent times in the clinical domain. The reason behind this interest is that textual data - consisting of clinical notes containing patients' medical history, diagnoses, medications, etc. - is a highly abundant resource in the clinical setting. One of the most frequent applications of NLP methods in clinical research consists in finding instances of a certain medical concept (a specific disease or treatment) in a huge corpus of clinical notes (e.g. in a whole hospital).
This task usually consists of two parts: Named Entity Recognition (NER), recognizing the entity in text, and Named Entity Linking (NEL), assigning to each entity a code which specifies the meaning of the entity under some medical terminology. These two tasks are the basis for many use cases of clinical NLP, such as recruiting patients for clinical studies, finding underdiagnosed patients, pharmacovigilance, etc.
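To make the two tasks concrete, here is a minimal sketch of what a NER + NEL output might look like for a single note. The note text, the character spans and the terminology codes are illustrative assumptions for this post, not IOMED's actual data model (the code for Psoriasis is its SNOMED CT concept identifier; the one for hands is likewise illustrative).

```python
# Hypothetical NER + NEL output for one clinical note.
# Spans and codes below are illustrative, not a real system's output.

note = "Patient presents with psoriasis on both hands."

# NER: locate entity mentions as (start, end) character spans.
ner_output = [
    {"text": "psoriasis", "start": 22, "end": 31},
    {"text": "hands", "start": 40, "end": 45},
]

# NEL: attach a terminology code to each recognized entity.
nel_output = [
    {"text": "psoriasis", "start": 22, "end": 31, "code": "9014002"},  # SNOMED CT: Psoriasis
    {"text": "hands", "start": 40, "end": 45, "code": "85562004"},     # SNOMED CT: Hand structure
]

# Sanity check: each span points back at the mention it names.
for ent in nel_output:
    assert note[ent["start"]:ent["end"]] == ent["text"]
```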
At IOMED, we develop specific systems responsible for the NER and NEL tasks. However, no model is perfect, and some clinical mentions are missed. These instances may go undetected because they are rare mentions, abbreviations of a disease that the model has never seen, or misspellings of the target terms.
Why is it important to detect these missed mentions? Imagine, for example, that we want to identify all the patients in a hospital who have Psoriasis. Our model detects every mention of the word Psoriasis in the clinical notes, but it misses the instances where the pathology is written as PSO, even though physicians frequently use abbreviations like PSO to refer to medical terms. As a consequence, we would miss patients with this pathology.
Moreover, detecting these overlooked entities also matters for knowing how well our models perform and, more importantly, for improving them by learning from the examples they failed to identify. NER-NEL systems are classically evaluated using the notions of True Positives (TP, entities found by the system that are correct), False Positives (FP, detected entities that are incorrect) and False Negatives (FN, entities that were not found); the latter are the ones we are interested in throughout this post.
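From these three counts, the usual evaluation metrics follow directly. A small sketch, with made-up counts, of how TP/FP/FN translate into precision and recall (the metric that False Negatives hurt):

```python
# Precision and recall from TP/FP/FN counts. The counts are made up.

def precision(tp: int, fp: int) -> float:
    """Fraction of detected entities that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of true entities that the system found."""
    return tp / (tp + fn)

tp, fp, fn = 90, 10, 30
print(f"precision = {precision(tp, fp):.2f}")  # 0.90
print(f"recall    = {recall(tp, fn):.2f}")     # 0.75
```

Undetected mentions only show up in the recall: every False Negative we manage to surface and correct raises it.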
An example of a NER-NEL system is illustrated in the image above. In the first scenario, the model detects two entities in the note: psoriasis and hands. Let's focus on the psoriasis mention for now. This prediction can be validated by experts who assess the model's output: if it is correct, it counts as a True Positive; otherwise, it is a False Positive. In the second scenario, as in the example presented before, the model has not detected any entity in a note where PSO is mentioned. This unidentified mention of the pathology Psoriasis would therefore be considered a False Negative.
When evaluating the performance of a NER-NEL system, True Positives and False Positives can be estimated fairly easily by manually annotating a representative sample of the clinical documents we are working with. The same approach does not work for estimating False Negatives, due to the huge amount of clinical text in a hospital (often in the order of tens of millions of notes). We therefore needed a system capable of detecting the mentions our model is missing. Although the perfect system would manually review every clinical note to find the entities the model missed, there are far smarter and more efficient ways of finding intricate mentions.
In order to simplify this problem, let's go back to our Psoriasis example. Psoriasis and PSO are two different ways of mentioning the same medical concept, so we can assume that these mentions appear in similar contexts in the clinical notes. Keeping this in mind, to find notes where PSO is mentioned we can reformulate the problem as finding notes similar to those where Psoriasis has already been found. This reduces the number of notes human reviewers need to inspect in order to find False Negatives of the target pathology.
Therefore, what we propose for this problem is an intermediate setting: instead of automatically finding FN entities, we automate the search for notes with a high probability of containing them. Given an entity whose overlooked mentions we want to find, we use the clinical notes containing positive mentions found by our NER/NEL system to search for similar notes where this system has found no positive cases. The solution we came up with is an extra step performed after the model's execution, and the proposed methodology consists of three simple steps:
- Transform all notes into a numeric representation via your favorite algorithm

For the sake of simplicity, we show how notes can be represented numerically via the simplest algorithm, Bag of Words. Every note is represented by a sequence of 0s and 1s, where each position in the sequence corresponds to one word found across the notes. If the represented note contains that word, the position is set to 1. More recent algorithms tend to perform better: TF-IDF, word embedding models and transformers can all do a great job at this task.
- Select the notes where the model has found a mention of the target entity and average their numeric representations into a single representation

Once again, let's say our model is rather naive and only detects Psoriasis when it finds the literal word Psoriasis, even though there are several other ways of referring to this pathology, for example PSO. This step consists of selecting all notes where Psoriasis has been detected and averaging their sequences column-wise. The resulting vector is the query vector.
- Rank the remaining notes by their cosine distance to the query vector and send them to validation

There is something pretty amazing about sequences of numbers: the distance between two of them can be measured as the cosine of the angle they form in space, so closer sequences mean more similar notes. In this example, using the Bag of Words algorithm, closer sequences correspond to notes that share a higher number of words. Consequently, notes whose sequences are closer to the target entity's representation have a higher probability of containing a reference to the entity.
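The three steps above can be sketched end to end with a toy corpus. Everything here is an illustrative assumption: the note contents are invented, the "detector" is the naive literal-match model from the example, and we use the binary Bag of Words and cosine similarity described above, implemented with the standard library only.

```python
# Toy end-to-end sketch of the three-step methodology.
# Notes, vocabulary and the naive "detector" are illustrative assumptions.
import math

notes = [
    "patient diagnosed with psoriasis lesions on both hands",
    "psoriasis treated with topical corticosteroids",
    "patient with pso lesions treated with topical corticosteroids",
    "fracture of the left femur after a fall",
]

# Naive model: a note is a positive detection iff it contains the literal word.
detected = [n for n in notes if "psoriasis" in n.split()]
candidates = [n for n in notes if "psoriasis" not in n.split()]

# Step 1: binary Bag of Words representation over the corpus vocabulary.
vocab = sorted({w for n in notes for w in n.split()})

def bow(note: str) -> list[int]:
    words = set(note.split())
    return [1 if w in words else 0 for w in vocab]

# Step 2: average the vectors of the detected notes into one query vector.
vectors = [bow(n) for n in detected]
query = [sum(col) / len(vectors) for col in zip(*vectors)]

# Step 3: rank the remaining notes by cosine similarity to the query.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

ranked = sorted(candidates, key=lambda n: cosine(query, bow(n)), reverse=True)
print(ranked[0])  # the "pso" note: it shares the most context with psoriasis notes
```

The note containing PSO ranks first because it shares words like "lesions", "treated" and "corticosteroids" with the detected Psoriasis notes, while the fracture note shares none, so reviewers would be sent the most promising candidate first.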
And that is it: with this three-step system, notes with a high probability of containing mentions of a pathology can be identified and prioritized for validation. It is worth noting that we have described the simplest procedure for every stage; the system is flexible and allows swapping each component for more powerful methods.