The importance of not losing sight of the quality of your data

February 15, 2021

“Without a systematic way to start and keep data clean, bad data wll happen.” - Donato Diorio.

You have probably noticed that a letter is missing in the quote above, although a small mistake like this is easy to overlook. Here the missing letter is not a big problem, since the sentence is still understandable, but if I kept dropping letters or even whole words, this text would become unintelligible and would cease to serve its purpose. This is a very simple example of what losing control over the state of the data can entail, and it can be extrapolated to any process that manipulates data.

Identifying these types of errors and applying actions to correct and prevent them, so that the data is guaranteed to meet certain requirements, is the main objective of data quality processes.

How can I measure the quality of the data? 

The criteria used to assess whether data meets a given level of quality vary depending on the context, the purpose, or the uses of the data. Furthermore, a single measurement can hardly be generalized to all the data available to an organization. This means that, for each situation, it is necessary to identify which measurements give an accurate view of the state of the data in question. Some of the dimensions most frequently used to measure data quality are the following:

  • Accuracy: the similarity or closeness of the data with its representation in the real world or in its origin.
  • Completeness: all the necessary data are present.
  • Consistency: the data does not contradict itself across records, datasets, or systems.
  • Temporality: the data meets the required update and availability conditions.
  • Uniqueness: each entity is recorded only once and there is no duplication.
  • Validity: the data conforms to the business, standard, format, or range requirements established.

The combination of these (or other) dimensions is what determines the quality of a dataset. Having a quality measurement allows you to track it and ensure that quality is preserved over time.
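As a rough illustration of how some of these dimensions can be turned into numbers, the following minimal sketch computes completeness, uniqueness, and validity scores over a small in-memory dataset. The records, field names, and the valid weight range are hypothetical and chosen only for the example; real measurements depend on the context and requirements of each dataset.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "birth_date": date(1980, 5, 1), "weight_kg": 72.5},
    {"id": 2, "birth_date": None, "weight_kg": 64.0},
    {"id": 2, "birth_date": date(1975, 3, 9), "weight_kg": -5.0},
    {"id": 3, "birth_date": date(1990, 11, 23), "weight_kg": 81.2},
]


def completeness(rows, field):
    """Share of rows in which `field` is present and not None."""
    if not rows:
        return 1.0
    return sum(r.get(field) is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Number of distinct `key` values divided by the number of rows."""
    if not rows:
        return 1.0
    return len({r[key] for r in rows}) / len(rows)


def validity(rows, field, is_valid):
    """Share of non-missing values of `field` that satisfy `is_valid`."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(is_valid(v) for v in values) / len(values)


report = {
    "completeness(birth_date)": completeness(records, "birth_date"),
    "uniqueness(id)": uniqueness(records, "id"),
    # Assumed business rule: a weight must be between 0 and 500 kg.
    "validity(weight_kg)": validity(records, "weight_kg", lambda w: 0 < w < 500),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each score is a ratio between 0 and 1, which makes it easy to store over time and compare against the requirements agreed for the data.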

What does it mean to use poor quality data?

Today, data is at the heart of many companies. It is used across all areas of a company and is an essential asset, both for making decisions and for obtaining results.

Using erroneous or poor quality data in any of these processes may imply not achieving the expected results or making the wrong decisions, with all the consequences that this may entail. 

In other cases, not having mechanisms that automatically identify and fix problems in the data means that problems are detected late and always require manual intervention to correct or clean up the irregularities. These interventions directly affect efficiency, causing delays and sometimes stopping an operation completely.

All these situations end up generating distrust in the data and, on top of the time invested in correcting the problems, effort will be needed to recover that lost confidence. Therefore, failing to detect a problem in the data in time can become critical.

What can be done to ensure data quality?

There are many techniques to improve and guarantee data quality, and there are more and more solutions on the market that allow you to evaluate the status of your data. In addition to applying the measurements themselves, there are other important aspects to take into account to identify any problem and, above all, anticipate and adapt to problems that may appear in the future:

  • Communication and collaboration. Communicating with all the parties involved and understanding the business logic and the purposes the data must serve is key to having a clear vision of the requirements your data must meet.
  • Analysis of the data. Carefully studying the structure and purpose of the data makes it possible to adjust the measurements implemented to the quality requirements.
  • Monitoring of quality measurements. Constantly monitoring quality measurements gives you a view of their state at all times and lets you detect anomalies.
  • Monitored, robust and tested applications. An error in any application or service that handles data can cause data to be corrupted or lost. If these applications are monitored, robust, and well tested, this is less likely to occur.

Each of these points is key and must be maintained over time, since the data and its uses can change, and the quality processes must adapt to those changes.
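To make the monitoring point more concrete, here is a minimal sketch of a check that compares the latest quality measurements against minimum acceptable scores and reports anything that falls short. The threshold values, measurement names, and the `find_anomalies` helper are assumptions made up for the example, not a description of any particular tool.

```python
# Hypothetical minimum acceptable score per measurement (assumed values).
THRESHOLDS = {"completeness": 0.95, "uniqueness": 1.0, "validity": 0.98}


def find_anomalies(measurements, thresholds):
    """Return (name, score, minimum) for every measurement that is
    missing or below its minimum acceptable score."""
    anomalies = []
    for name, minimum in thresholds.items():
        score = measurements.get(name)
        if score is None or score < minimum:
            anomalies.append((name, score, minimum))
    return anomalies


# Latest scores, e.g. produced by a scheduled quality job (illustrative values).
latest = {"completeness": 0.97, "uniqueness": 0.99, "validity": 0.98}

alerts = find_anomalies(latest, THRESHOLDS)
for name, score, minimum in alerts:
    print(f"ALERT: {name} = {score} (minimum {minimum})")
```

A missing measurement is treated as an anomaly on purpose: if a score stops being reported, that is usually worth investigating as well.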

Challenges

Applying any of the above aspects can become very complex depending on the size and variability of the data. This makes achieving the best possible data quality in an acceptable time, and then maintaining it, a complicated challenge, but one with a very valuable result.

Currently, at IOMED, we are seeing an increasing volume and variability of data with the integration of new hospitals. We work continuously on implementing processes that guarantee the quality of the data, and doing so in this constantly growing scenario is a challenge that we face with great enthusiasm.




Sandra Pulido

Data Engineer