Guide to Data Catalog Tools and Architecture
- Posted on December 26, 2018
What is a Data Catalog?
A Data Catalog provides a single self-service environment that helps users find, understand, and trust data sources. It also helps users discover new data sources as they become available. Discovering and understanding data sources are the first steps in registering them. Users search for data based on their needs and filter the results to find what is relevant. In enterprises, a Data Lake serves Business Intelligence teams, Data Scientists, and ETL Developers, all of whom need the right data. They use the Data Catalog's discovery features to find the data that fits their needs.
How Does a Data Catalog Work?
Building a Data Catalog starts with collecting metadata from the sources. Once the metadata is obtained, the metadata entities need to be categorized and assigned tags. ML (Machine Learning) and NLP (Natural Language Processing) are used to automate these processes: a Machine Learning model automatically assigns a tag to each metadata entity based on the entity's name. Finally, a data steward reviews the results and adds further value to the Data Catalog.
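The tagging flow described above can be sketched as follows. This is only an illustration under stated assumptions: a hand-written keyword table (`TAG_RULES`) stands in for the Machine Learning model, and the `MetadataEntity`, `auto_tag`, and `steward_review` names are hypothetical, not part of any catalog product.

```python
from dataclasses import dataclass, field

# Hypothetical keyword-to-tag rules. A real catalog would learn these
# associations with an ML/NLP model instead of a fixed table.
TAG_RULES = {
    "email": "pii",
    "phone": "pii",
    "ssn": "pii",
    "amount": "finance",
    "price": "finance",
    "date": "temporal",
}


@dataclass
class MetadataEntity:
    name: str                                # e.g. a column or table name
    tags: set = field(default_factory=set)   # tags assigned so far
    reviewed: bool = False                   # True once a data steward has checked it


def auto_tag(entity: MetadataEntity) -> MetadataEntity:
    """Assign tags based on the tokens found in the entity's name."""
    tokens = entity.name.lower().replace("-", "_").split("_")
    for token in tokens:
        if token in TAG_RULES:
            entity.tags.add(TAG_RULES[token])
    return entity


def steward_review(entity: MetadataEntity, extra_tags=()) -> MetadataEntity:
    """Let a data steward add tags by hand and mark the entity as reviewed."""
    entity.tags.update(extra_tags)
    entity.reviewed = True
    return entity


if __name__ == "__main__":
    for name in ("customer_email", "order_amount", "created_date", "notes"):
        entity = auto_tag(MetadataEntity(name))
        print(entity.name, sorted(entity.tags))
```

Names that match no rule, such as `notes` here, stay untagged, which is exactly the case where the steward's review adds value.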
Data Catalog Benefits
Spend more time using the data, not finding it - As per a Forrester Forbes report, data scientists spend more than 75% of their time finding and understanding data, and more than 75% of them do not like that part of their job. This is due to the questions they need answered before they can start working on queries. The main cause of this problem in an organization is a poor mechanism for handling and tracking its data. A good Data Catalog helps the Data Scientist or Business Analyst understand the data and answer those questions.
To implement Access Control - As an organization grows, role-based policies are needed, because not everybody should be able to modify the data. Access Control should be implemented while building the Data Lake. Users are assigned particular roles, and data access is controlled according to those roles. In the Hadoop ecosystem, this can be implemented using Apache Ranger. For sensitive data in the Data Lake, use encryption for data protection.
To Reduce Cost by Eliminating Data Redundancies - A good Data Catalog helps find data redundancies so that they can be eliminated. This saves storage and data management costs.
To follow Laws - Different data protection laws apply depending on the data, such as GDPR, BASEL, GDSN, HIPAA, and many more. These laws must be followed when dealing with any data they cover. However, each law addresses different use cases and does not apply to every dataset, so knowing which ones apply requires understanding the dataset. A good Data Catalog helps ensure data compliance by providing a view of Data Lineage and by supporting Access Control.
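The role-based Access Control mentioned in the benefits above can be sketched in a few lines. This is a minimal illustration and not how Apache Ranger is configured; the role names, the users, and the `is_allowed` helper are all hypothetical.

```python
# Hypothetical mapping from role to the actions that role may perform.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "etl_developer": {"read", "write"},
    "data_steward": {"read", "write", "manage_tags"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {
    "asha": "analyst",
    "ben": "etl_developer",
}


def is_allowed(user: str, action: str) -> bool:
    """Return True only if the user's role grants the requested action."""
    role = USER_ROLES.get(user)
    # Unknown users and unknown roles fall back to an empty permission set.
    return action in ROLE_PERMISSIONS.get(role, set())


if __name__ == "__main__":
    for user, action in [("asha", "read"), ("asha", "write"), ("ben", "write")]:
        print(user, action, is_allowed(user, action))
```

Denying by default for unknown users keeps the sketch on the safe side: access has to be granted through a role, never assumed.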
Why Does a Data Catalog Matter?
Helps in Understanding the data - A good Data Catalog helps users understand the data. It makes it easier to find relevant data and get to know it, and it provides information about the data, such as where it is being used and where it originates.
Allows users to work with multiple data sources - A Data Catalog covers one or more data sources. It helps users find quality data sources and gain better knowledge of multiple data sources.
To Follow Regulatory compliance - There are several data-related laws, such as HIPAA, BASEL, and GDPR. These laws are driven by different perspectives and use cases, but in the end they all come down to better governance of data, with a focus on Data Lineage and Access Control.
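Data Lineage, meaning where a dataset comes from and where it is used, can be pictured as a small graph. The sketch below assumes lineage is stored as a mapping from each dataset to the datasets it is derived from; the dataset names and the `upstream` and `downstream` helpers are hypothetical.

```python
# Hypothetical lineage edges: each dataset maps to the datasets it is derived from.
LINEAGE = {
    "sales_report": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
}


def upstream(dataset, lineage=LINEAGE):
    """Return every dataset that `dataset` is derived from, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage.get(current, []))
    return seen


def downstream(dataset, lineage=LINEAGE):
    """Return every dataset that uses `dataset`, directly or indirectly."""
    return {name for name in lineage if dataset in upstream(name, lineage)}


if __name__ == "__main__":
    print(sorted(upstream("sales_report")))
    print(sorted(downstream("orders_raw")))
```

With a view like this, a compliance question such as "which reports are built on this raw table?" becomes a simple lookup.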
How to Adopt a Data Catalog?
Building a Data Catalog is a multi-step process which includes -
Metadata Extraction - Metadata extraction is the very first step in building the data catalog. In this step, the metadata of a defined source is collected and stored in the metadata store. It helps in understanding the defined data asset.
Data Sampling - Data sampling is used to understand the schemas, tables, and databases.
Auto-Titling (ML and NLP) - Every organization has its own naming convention that uses abbreviations to define the schema. An NLP model is used to assign each abbreviation a common name that Data Catalog users understand.
Query Log Ingestion - Query logs are ingested to collect additional information about the datasets and give a complete picture of each one, such as its Data Lineage and Data Usability.
Crowd Sourcing & Expert Sourcing - By this stage, the Data Catalog is ready and only needs more value added to it. The NLP model has corrected the names of the data assets collected from the data sources, but Computer-Human Collaboration is also necessary to verify the results.
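The steps above can be sketched end to end against a small in-memory SQLite database. This is a toy illustration under several assumptions: SQLite stands in for the real data source, a fixed `ABBREVIATIONS` dictionary stands in for the NLP model, query log ingestion is reduced to counting table mentions, and all function, table, and column names are hypothetical.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical abbreviation dictionary; the article suggests an NLP model here.
ABBREVIATIONS = {"cust": "customer", "nm": "name", "amt": "amount", "ord": "order"}


def extract_metadata(conn):
    """Metadata extraction: collect table and column names from the source."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    metadata = {}
    for table in tables:
        # PRAGMA table_info yields one row per column: (cid, name, type, ...).
        # Table names come from sqlite_master, so they are safe to interpolate.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [(col[1], col[2]) for col in columns]
    return metadata


def sample_rows(conn, table, limit=5):
    """Data sampling: fetch a few rows to see what the data looks like."""
    return conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()


def auto_title(name):
    """Auto-titling: expand abbreviations in a name into a readable title."""
    words = [ABBREVIATIONS.get(part, part) for part in name.lower().split("_")]
    return " ".join(words).title()


def ingest_query_log(queries, tables):
    """Query log ingestion: count how many logged queries mention each table."""
    usage = Counter()
    for query in queries:
        for table in tables:
            if re.search(rf"\b{re.escape(table)}\b", query, re.IGNORECASE):
                usage[table] += 1
    return usage


def build_catalog(conn):
    """Combine the steps into one catalog entry per table."""
    catalog = {}
    for table, columns in extract_metadata(conn).items():
        catalog[table] = {
            "title": auto_title(table),
            "columns": {name: {"type": ctype, "title": auto_title(name)}
                        for name, ctype in columns},
            "sample": sample_rows(conn, table),
            "verified": False,  # set by a human during crowd or expert sourcing
        }
    return catalog


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust_ord (cust_nm TEXT, ord_amt REAL)")
conn.execute("INSERT INTO cust_ord VALUES ('Ada', 12.5)")
catalog = build_catalog(conn)
usage = ingest_query_log(
    ["SELECT * FROM cust_ord", "select cust_nm from CUST_ORD", "SELECT 1"],
    list(catalog),
)
```

The `verified` flag is left `False` on purpose: the generated titles are only suggestions until a person confirms them, which is the role of the final crowd and expert sourcing step.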
Data Catalog Best Practices
Assigning Ownership for the data set - Ownership of each data set must be defined. There must be a person whom users can contact in case of an issue. A good Data Catalog must also identify the owner of each data set.
Machine-Human Collaboration - After a Data Catalog is built, users should verify the data sets to make it more accurate.
Searchability - A Data Catalog should support search. Searchability enables Data Asset Discovery, so that data consumers can easily find assets that meet their needs.
Data Protection - Define access policies to prevent unauthorized data access.
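Two of the practices above, ownership and searchability, can be illustrated together. The sketch below assumes a catalog is a list of entries that each record an owner, and it uses plain substring matching for search; the `CatalogEntry` fields, the sample entries, and the `search` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    description: str
    owner: str        # the person users contact about this data set
    tags: tuple = ()


def search(entries, query):
    """Return entries whose name, description, or tags contain every query term."""
    terms = query.lower().split()
    results = []
    for entry in entries:
        haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
        if all(term in haystack for term in terms):
            results.append(entry)
    return results


# Hypothetical sample entries.
entries = [
    CatalogEntry("customers", "Master list of customers", "asha@example.com", ("crm",)),
    CatalogEntry("orders", "Daily sales orders", "ben@example.com", ("sales",)),
]

if __name__ == "__main__":
    for entry in search(entries, "sales orders"):
        print(entry.name, "owned by", entry.owner)
```

Because every result carries its owner, a user who finds a data set also learns whom to contact about it.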
Data Catalog Tools
- Alation Data Catalog
- Cloudera Navigator
- Informatica Data Catalog
- Collibra Data Catalog