The 7 Most Common Dataset Errors

At GeoDirectory we pride ourselves on the exceptional data cleansing we provide for our clients' address-based datasets. These datasets share a number of common issues, all of which can be eliminated through data cleaning. Here are the most common issues we encounter when data cleansing:

1.           Spelling errors
As one might suspect, spelling errors are the most common mistakes found in the majority of datasets. They creep in easily, yet are usually difficult to spot visually in tabular data. Thankfully, many spelling errors can be identified swiftly by running the data through a spell-checker. At the end of the day, accurate data should result in more accurate, reliable outcomes, and ultimately that is what we are all striving for.
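As a minimal sketch of the idea (in Python, using the standard library's difflib and an invented reference list), a fuzzy match against known-good values can flag and correct likely misspellings:

    import difflib

    # Invented reference list of valid town names; in practice this would
    # come from an authoritative source.
    VALID_TOWNS = ["Dublin", "Cork", "Galway", "Limerick", "Waterford"]

    def suggest_correction(town):
        """Return the closest valid town name, or None if nothing is close."""
        matches = difflib.get_close_matches(town, VALID_TOWNS, n=1, cutoff=0.8)
        return matches[0] if matches else None

    print(suggest_correction("Dublni"))  # 'Dublin'
    print(suggest_correction("Galwey"))  # 'Galway'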

2.           Redundant or irrelevant data
Datasets that contain redundant or irrelevant data are more common than one might expect. We frequently see datasets that have not been updated in years, so some of the data is essentially redundant. On top of that, random pieces of information, usually irrelevant to the dataset's original purpose, accumulate over the years within the important columns. A common example we see in GeoDirectory is telephone numbers or names sitting in the address columns. Removing this data is ideal; however, if the extra information is genuinely needed, moving it into a separate column or into the comments improves the quality of the dataset.
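As an illustration of that last point, here is a sketch (in Python with pandas, using made-up column names and a deliberately rough phone-number pattern) of moving stray phone numbers out of an address column into their own column:

    import pandas as pd

    df = pd.DataFrame({
        "address": [
            "4 Main Street, Dublin 1",
            "12 High Road, Cork 021 4567890",  # phone number mixed into the address
        ]
    })

    # Rough pattern for an Irish-style phone number; real rules would be stricter.
    PHONE = r"\b0\d{1,3}[ -]?\d{5,7}\b"

    # Pull any phone-like token into its own column, then strip it from the address.
    df["phone"] = df["address"].str.extract("(" + PHONE + ")", expand=False)
    df["address"] = df["address"].str.replace(PHONE, "", regex=True).str.strip(" ,")

    print(df)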

3.           Duplications
Another sneaky, yet common, mistake that regularly goes unnoticed is the unnecessary duplicate record: the same piece of data entered more than once. Duplicates can massively skew outcomes because the record counts become inaccurate. They often appear when datasets are combined, or simply because nobody realised an entry already existed. Thankfully, this issue can usually be rectified quickly using simple functions in Excel.
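The equivalent step in a Python/pandas workflow (a sketch with invented data) is a one-liner:

    import pandas as pd

    df = pd.DataFrame({
        "name":    ["Acme Ltd", "Acme Ltd", "Beta Co"],
        "address": ["1 Main St, Dublin 1", "1 Main St, Dublin 1", "2 High Rd, Cork"],
    })

    # Keep only the first occurrence of each fully identical record.
    deduped = df.drop_duplicates()
    print(len(df) - len(deduped), "duplicate row(s) removed")  # 1 duplicate row(s) removed

    # Records are often duplicated on key columns even when other columns differ:
    deduped_by_key = df.drop_duplicates(subset=["name", "address"], keep="first")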

4.           Syntax errors
Syntax errors are another set of issues that can often be rectified quickly using functions in Excel. In linguistics, "syntax" refers to the rules that govern how words combine to form phrases, clauses, and sentences; in data, it covers the formatting rules values are expected to follow. Removing extra white space is a helpful start. Even padding strings with other characters to a fixed width can pay off dramatically when working with the data later. For example, some numerical codes are written with leading zeros so that they always have the same number of digits, e.g. 123 => 00123 and 9876 => 09876 (both 5 digits). Although generally simple, fixing syntax errors can drastically improve a dataset.
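Both fixes from the example above are one-liners in pandas (a sketch with invented codes):

    import pandas as pd

    codes = pd.Series(["  123 ", "9876 ", " 42"])

    # Trim stray white space, then left-pad with zeros to a fixed width of 5,
    # so 123 becomes 00123 and 9876 becomes 09876, as in the example above.
    clean = codes.str.strip().str.zfill(5)
    print(clean.tolist())  # ['00123', '09876', '00042']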

5.           Multiple representations for the same thing
The issue of multiple representations occurs particularly in databases that have multiple users. People often try to save time when entering data by abbreviating terms, and if those abbreviations are not applied consistently, the same thing ends up recorded in several different forms. Too many cooks really can spoil the broth!
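A common remedy is a lookup table that maps every known variant to one canonical form; here is a minimal sketch in Python, with an invented mapping:

    import pandas as pd

    streets = pd.Series(["Main St", "Main Street", "Main St.", "High Rd"])

    # Invented lookup table mapping known abbreviations to one canonical form.
    CANONICAL = {"st": "Street", "st.": "Street", "rd": "Road", "rd.": "Road"}

    def normalise(value):
        """Rewrite a trailing abbreviation as its canonical spelling."""
        head, _, last = value.rpartition(" ")
        return (head + " " + CANONICAL.get(last.lower(), last)).strip()

    print(streets.map(normalise).tolist())
    # ['Main Street', 'Main Street', 'Main Street', 'High Road']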

6.           Uniformity
Another issue is lack of uniformity: the degree to which the data is specified using the same unit of measure. Dates might follow the European format in some rows and the US format in others; currency amounts might sometimes be in euro and sometimes in pounds. For this reason, the data may need to be converted to a single unit of measure.
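For instance, a sketch in Python with pandas (the exchange rate below is a made-up placeholder, not a real rate):

    import pandas as pd

    # Day-first (European-style) dates normalised to a single ISO 8601 format.
    dates = pd.to_datetime(pd.Series(["05/04/2024", "23/11/2023"]), dayfirst=True)
    print(dates.dt.strftime("%Y-%m-%d").tolist())  # ['2024-04-05', '2023-11-23']

    # Mixed-currency amounts converted to a single unit (euro).
    GBP_TO_EUR = 1.25  # placeholder rate for illustration only
    amounts = pd.DataFrame({"value": [100.0, 100.0], "currency": ["EUR", "GBP"]})
    amounts["value_eur"] = amounts["value"].where(
        amounts["currency"] == "EUR", amounts["value"] * GBP_TO_EUR
    )
    print(amounts["value_eur"].tolist())  # [100.0, 125.0]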

7.           Data within the standardised data entry processes
The final question, "does the data conform to standardised data entry processes?", is probably the key to an excellent database. The majority of the aforementioned issues fall somewhat under this category. If everything is organised in a standardised format, it is easier to pull the information you need at any given time, and overall productivity will increase.
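One way to enforce a standard at the point of entry is to validate each new value against the agreed pattern before accepting it. The sketch below (Python, using a deliberately simplified Eircode-style pattern, not the full official specification) illustrates the idea:

    import re

    # Simplified Eircode shape: a routing key (letter plus two digits), a space,
    # and a four-character unique identifier. A rough approximation for
    # illustration only.
    EIRCODE = re.compile(r"^[A-Z]\d{2} [A-Z0-9]{4}$")

    def is_standard(value):
        """True if the value already matches the agreed entry format."""
        return bool(EIRCODE.match(value.strip().upper()))

    print(is_standard("D01 F5P2"))  # True
    print(is_standard("d01f5p2"))   # False: needs normalising before it conforms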

To conclude, data cleansing is a form of data management: the process of going through a database and updating or removing information that is incorrect, irrelevant, duplicated, improperly formatted or incomplete. Data cleansing matters because it improves your data quality, and in doing so makes the system more efficient, reliable and accurate, ultimately increasing overall productivity. When data is clean, it is easier to combine it with other datasets and gain deeper insights. When it is not, decisions made on the basis of it can be wrong, affecting the products or insights you build from the data and, in some cases, having an enormous negative impact. For more information, feel free to check out our blog on the seven golden rules of data quality.

