
Part 8 - Data Governance for BDA – Data Policies & Processes


Evolution of Data Governance

  • Correct vs. Correction
    • Tools are used to fix data (correction), not to ensure the data is valid or correct in the first place.
  • Data Repurposing
    • “Fitness for purpose” → “Fitness for purpose(s)”.

These shortcomings create the need for oversight (monitored controls) incorporated into the SDLC (Software Development Life Cycle) and across the application infrastructure.

This need for oversight, in turn, gives rise to the discipline of Data Governance.

The objectives of data governance are to institute the right levels of control to achieve the following outcomes (a minimal sketch of this loop follows the list):

  • Alert
    • Identify data issues that might have negative business impact.
  • Triage
    • Prioritize those issues in relation to their corresponding business value drivers.
  • Remediate
    • Have data stewards take the proper actions when alerted to the existence of those issues.
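As a rough illustration of these three outcomes, the sketch below wires an alert-triage-remediate loop together in Python; the `DataIssue` structure, the business-value numbers, and the `notify_steward` stub are hypothetical stand-ins for whatever issue-tracking and stewardship tooling an organization actually runs.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DataIssue:
    description: str
    business_value: float   # value of the business driver put at risk
    severity: float         # share of records affected, 0.0 .. 1.0

def alert(records: List[dict], rule: Callable[[dict], bool],
          description: str, business_value: float) -> List[DataIssue]:
    """Alert: flag a data issue when records violate a validation rule."""
    failures = [r for r in records if not rule(r)]
    if not failures:
        return []
    return [DataIssue(description, business_value, len(failures) / len(records))]

def triage(issues: List[DataIssue]) -> List[DataIssue]:
    """Triage: prioritize issues by the business value they put at risk."""
    return sorted(issues, key=lambda i: i.business_value * i.severity, reverse=True)

def remediate(issues: List[DataIssue], notify_steward: Callable[[DataIssue], None]) -> None:
    """Remediate: hand each prioritized issue to a data steward for action."""
    for issue in issues:
        notify_steward(issue)

# Usage: flag orders missing a customer id, then route them to a steward.
orders = [{"order_id": 1, "customer_id": "C-9"}, {"order_id": 2, "customer_id": None}]
issues = alert(orders, lambda r: r["customer_id"] is not None,
               "missing customer_id", business_value=100_000)
remediate(triage(issues), notify_steward=lambda i: print("steward alerted:", i))
```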

Data governance not only enables a degree of control over data created and shared within an organization; it also empowers data stewards to take corrective action.

Big Data and Data Governance

The key characteristics of big data analytics do not adapt universally to conventional approaches to data quality and data governance.

The Difference with Big Datasets

BDA generally centers on consuming massive amounts of structured and unstructured data, often without considering the business impact of errors or inconsistencies across the different sources from which the data originates, or how frequently it is acquired.

When the acquired datasets and data streams originate outside the organization, there is little facility for control over the input.

With big data systems that absorb massive volumes of data, some of which originates externally, there are limited opportunities to engage process owners to influence modifications or corrections to the source.

Big Data Oversight: Five Key Concepts

Let’s look at the five key concepts that data practitioners and business process owners need to keep in mind.

  1. Managing consumer data expectations.
  2. Identifying the critical data quality dimensions.
  3. Monitoring consistency of metadata and reference data as a basis for entity extraction.
  4. Repurposing and reinterpretation of data.
  5. Data enrichment and enhancement when possible.

Managing Consumer Data Expectations

The quality of information must be directly related to the ways business processes are expected to improve with better-quality data, or to the negative impacts that follow from ignoring data problems.

For the scope of different BDA projects, we must ascertain these collective user expectations by engaging the different consumers of BDA to discuss how quality aspects of the input data might affect the computed results.

Engaging consumers for requirements is a process of discussion with known end users, coupled with some degree of speculation and anticipation of:

  • Who the pool of potential end users are.
  • What they might want to do with a dataset.
  • What their levels of expectation are.
  • How those expectations can be measured and monitored, and what realistic remedial actions can be taken (a sketch of measurable expectations follows this list).
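One lightweight way to make such expectations measurable is to record them per consumer group as explicit, checkable thresholds. The consumer names, metrics, and threshold values below are purely illustrative assumptions.

```python
# Hypothetical registry of consumer expectations, expressed as measurable thresholds.
expectations = {
    "marketing-analytics": {"max_null_rate": 0.05, "max_staleness_hours": 24},
    "fraud-detection":     {"max_null_rate": 0.01, "max_staleness_hours": 1},
}

def meets_expectations(consumer: str, null_rate: float, staleness_hours: float) -> bool:
    """Compare observed dataset measurements against a consumer's recorded expectations."""
    spec = expectations[consumer]
    return (null_rate <= spec["max_null_rate"]
            and staleness_hours <= spec["max_staleness_hours"])

# The same dataset can satisfy one consumer group and fall short for another.
print(meets_expectations("marketing-analytics", null_rate=0.02, staleness_hours=6))  # True
print(meets_expectations("fraud-detection", null_rate=0.02, staleness_hours=6))      # False
```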

Identifying the Critical Data Quality Dimensions

Determine the dimensions of data quality that are relevant to the business and then distinguish those that are only measurable from those that are both measurable and controllable.

Dimensions for measuring the quality of information used for BDA (a small measurement sketch follows the list):

  • Temporal Consistency
    • Measuring timing characteristics of datasets used in BDA to see whether they are aligned from a temporal perspective.
  • Timeliness
    • Measuring whether data streams are delivered according to end-consumer expectations.
  • Currency
    • Measuring whether the datasets are up-to-date.
  • Precision Consistency
    • Assessing whether the units of measure associated with each data source share the same precision and whether those units are properly harmonized.
  • Unique Identifiability
    • Focusing on the ability to uniquely identify entities within datasets and data streams and link those entities to known system of record information.
  • Semantic Consistency
    • This metadata activity may incorporate a glossary of business terms, hierarchies and taxonomies of business concepts, and relationships across concept taxonomies, standardizing the ways that entities identified in structured and unstructured data are tagged in preparation for use, e.g. “maximum temperature” and “highest temperature” → a many-to-one problem.
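To make a few of these dimensions concrete, the sketch below computes simple measurements for currency, precision consistency, and unique identifiability over a toy dataset; the field names, record layout, and age window are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Toy records from a hypothetical source; field names are illustrative only.
now = datetime.now(timezone.utc)
records = [
    {"id": "A-1", "temp_c": 21.5, "updated": now - timedelta(hours=2)},
    {"id": "A-2", "temp_c": 19.0, "updated": now - timedelta(days=3)},
    {"id": "A-1", "temp_c": 21.5, "updated": now - timedelta(hours=1)},
]

def currency(records, max_age=timedelta(days=1)) -> float:
    """Currency: share of records updated within an acceptable age window."""
    return sum(now - r["updated"] <= max_age for r in records) / len(records)

def precision_consistency(records, field="temp_c") -> bool:
    """Precision consistency: do all values of a field carry the same number of decimals?"""
    decimals = {len(str(r[field]).split(".")[-1]) for r in records}
    return len(decimals) == 1

def unique_identifiability(records) -> float:
    """Unique identifiability: share of records whose identifier is unique in the set."""
    ids = [r["id"] for r in records]
    return sum(ids.count(i) == 1 for i in ids) / len(records)

print(currency(records), precision_consistency(records), unique_identifiability(records))
# roughly 0.67, True, 0.33 — two-thirds current, consistent decimals, duplicate id A-1 detected
```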

Consistency of Metadata and Reference Data for Entity Extraction

Analyzing relationships and connectivity in text data is key to entity identification in unstructured text.

Because of the variety of types of data that span both structured and unstructured sources, one must be aware of the degree to which unstructured text is replete with nuances, variation, and double meanings.

Entity identification and extraction depend on differentiating words and phrases that carry high levels of “meaning” from those that are used to establish connections and relationships, mostly embedded within the language of the text, e.g. a name consists of first, middle, and last names → a one-to-many problem.

Since the same terms and phrases may have different meanings depending on the constituency generating the content, this again highlights the need for precise semantics for the concepts extracted from data sources and streams, e.g. a “qualification” may be academic or professional.
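A minimal way to keep extracted entities consistent with metadata and reference data is to normalize surface phrases against a shared glossary before tagging; the glossary entries and the phrase matching below are illustrative assumptions rather than a real entity-extraction pipeline.

```python
import re

# Hypothetical glossary mapping many surface phrases to one canonical concept.
glossary = {
    "maximum temperature": "MAX_TEMPERATURE",
    "highest temperature": "MAX_TEMPERATURE",
    "academic qualification": "QUALIFICATION_ACADEMIC",
    "professional qualification": "QUALIFICATION_PROFESSIONAL",
}

def extract_concepts(text: str) -> list:
    """Tag known glossary phrases in free text with their canonical concept ids."""
    lowered = text.lower()
    return [concept for phrase, concept in glossary.items()
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)]

print(extract_concepts("The highest temperature recorded yesterday was 39C."))
# ['MAX_TEMPERATURE'] — 'highest temperature' and 'maximum temperature' resolve to one concept
```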

Repurposing and Reinterpretation of Data

One of the foundational concepts in the use of data for analytics is the possibility of finding interesting patterns that can lead to actionable insight; consequently, any acquired dataset may be used for any potential purpose at any time in the future.

Governance means establishing some limits around the scheme for repurposing.

New policies may be necessary when it comes to determining the following (see the sketch after this list):

  • What data to acquire and what to ignore.
  • Which concepts to capture.
  • Which ones should be trashed.
  • What volume of data to be retained.
  • For how long.
  • Other qualitative data management and custodianship policies.
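Such policies can also be captured in a simple, machine-checkable form; the policy fields, retention period, and approved purposes below are hypothetical, intended only to show how acquisition, retention, and repurposing limits might be encoded and enforced.

```python
from datetime import date, timedelta

# Hypothetical repurposing and retention policy for one acquired data stream.
policy = {
    "acquire_fields": {"order_id", "sku", "timestamp"},   # what data to acquire
    "ignore_fields": {"free_text_notes"},                 # what to ignore
    "retention": timedelta(days=365),                     # how long to retain it
    "approved_purposes": {"sales-correlation", "demand-forecasting"},
}

def acquisition_filter(record: dict) -> dict:
    """Keep only the fields the policy says to acquire."""
    return {k: v for k, v in record.items() if k in policy["acquire_fields"]}

def retention_expired(acquired_on: date, today: date) -> bool:
    """Check whether a record has passed its retention period."""
    return today - acquired_on > policy["retention"]

def repurposing_allowed(purpose: str) -> bool:
    """Check whether a proposed new use of the data falls within approved purposes."""
    return purpose in policy["approved_purposes"]

print(acquisition_filter({"order_id": 7, "sku": "X1", "free_text_notes": "call me"}))
print(retention_expired(date(2023, 1, 1), date(2024, 6, 1)))  # True: past 365 days
print(repurposing_allowed("credit-scoring"))                  # False: not an approved purpose
```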

Data Enrichment and Enhancement

We cannot control the quality and validity of data acquired from outside the organization.

Validation rules can be used to score the usability of the data based on end-user requirements.

If these scores fall below the level of acceptability, we can (see the sketch after this list):

  • Not use the data at all.
  • Use the data in its “unacceptable” state and modulate your users’ expectations in relation to the validity score.
  • Change the data to a more acceptable form.
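As a sketch of that decision, the code below scores a dataset against hypothetical validation rules and then picks one of the three options using assumed acceptability thresholds; none of the rule names or cut-offs come from the text.

```python
def usability_score(records, rules) -> float:
    """Score usability as the average share of records passing each validation rule."""
    per_rule = [sum(rule(r) for r in records) / len(records) for rule in rules]
    return sum(per_rule) / len(per_rule)

def decide(score, acceptable=0.9, enrichable=0.6) -> str:
    """Choose one of the three options based on assumed thresholds."""
    if score >= acceptable:
        return "use the data as-is"
    if score >= enrichable:
        return "change the data to a more acceptable form, then use it"
    return "do not use the data (or use it with expectations lowered to match the score)"

records = [{"zip": "10001"}, {"zip": ""}, {"zip": "94107"}]
rules = [lambda r: bool(r["zip"])]      # illustrative rule: postal code must be present
score = usability_score(records, rules)
print(round(score, 2), "->", decide(score))   # 0.67 -> change the data ... then use it
```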
Example

A large online retailer wants to drive increased sales through relationship analysis and looks at sales correlations.

When processing millions of transactions a day, a small number of inconsistencies, incomplete records, or errors is likely to be irrelevant.

Where incorrect values would impede the analysis, and making changes does not significantly alter the data from its original form other than in a positive and expected way, data enhancement and enrichment may be a reasonable alternative.

Where address locations may be incomplete or even incorrect, standardizing the address format and applying corrections is a consistent way to improve the data.
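Address standardization is one such enhancement. The sketch below applies a couple of illustrative formatting rules and suffix expansions; a real deployment would rely on a proper address-cleansing service or reference data.

```python
import re

# Hypothetical abbreviation table used to standardize street suffixes.
suffixes = {"st": "Street", "st.": "Street", "ave": "Avenue", "ave.": "Avenue", "rd": "Road"}

def standardize_address(raw: str) -> str:
    """Normalize whitespace and casing, then expand common street-suffix abbreviations."""
    tokens = re.sub(r"\s+", " ", raw.strip()).split(" ")
    expanded = [suffixes.get(t.lower(), t) for t in tokens]
    return " ".join(word.capitalize() for word in expanded)

print(standardize_address("  221b baker ST. "))   # '221b Baker Street'
```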