Big Data Governance: How to Govern Data Outside of Databases
In last week's post, I discussed why you would choose to govern data outside of the database. Today I will discuss how to do it. Most conversations about Big Data Governance focus on why it is necessary but rarely on how it is done. The activities needed to set up a Data Governance program, whether for structured or unstructured data, still include defining an organizational structure, creating a charter, defining common business terms, and identifying the Data Stewards in the organization.
In fact, when you hear Data Governance described by a document-oriented organization, there is almost no difference between document governance and the governance of data found in databases.
Data Steward Responsibilities by Data Store
Data Governance is implemented by Data Stewards, each responsible for one or more data stores, against which they perform the following initial and ongoing activities:
- Assign data steward responsibilities for domains and data stores
- Profile and perform quality assessment of data
- Ensure retention policies are established and creation and change procedures are documented
- Identify and log issues with data in data stores
- Assign analysts to determine source of problems and recommend solutions
- Implement recommended process improvements, data cleanup, system changes
- Document business rules and set up data quality monitoring
- Monitor data and data quality
- Respond to requests for information and reported issues
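As a rough illustration of the profiling, rule-checking, and issue-logging activities above, here is a minimal Python sketch. The record layout, the function names (profile_column, check_rule), and the sample rule are all assumptions made for this example; a real program would typically use a dedicated profiling or data quality tool.

```python
from collections import Counter

def profile_column(rows, column):
    """Return basic profile statistics for one column of a list of dict records."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }

def check_rule(rows, rule_name, predicate):
    """Apply one documented business rule and log each violating record as an issue."""
    return [
        {"rule": rule_name, "record": row}
        for row in rows
        if not predicate(row)
    ]

# Hypothetical sample records standing in for a governed data store.
customers = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "US"},
    {"id": 3, "email": "c@example.com", "country": None},
]

profile = profile_column(customers, "email")
issues = check_rule(customers, "country is required", lambda r: bool(r.get("country")))
```

Running the rule check on a schedule and comparing the issue counts over time is one simple way to turn a documented business rule into ongoing data quality monitoring.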
Organizing Data Stewards in the Organization
Since Data Stewardship is usually organized by data store, and structured and unstructured data are kept in separate data stores, Data Stewards usually end up divided by data type in addition to the other breakdowns of responsibility. For example, the Data Steward focused on the governance of documents in a business area may be different from the Data Steward responsible for customer master data. Data Stewards are usually organized across various dimensions:
- Business / technical
- Producers / Consumers
- Line of Business / Department
- Function / Application (Data Store)
- Data Domain?
- Data Type?
Data Stewardship Tools
The tools used for Data Stewardship include:
- Business Glossary
- Issue Tracking
- Data Profiling
- Metadata Repository
- Data Governance Review and Approval Workflow
The tool sets for unstructured data stewardship may be significantly different from those focused on the management of data in databases. The tools for unstructured data management will also include:
- Ontology and hierarchy management
- Content management
- Scanning and OCR
NoSQL Data Stewardship
Most NoSQL databases (non-relational databases) can be governed in the same way as relational databases, although the profiling tools used on relational databases will usually have to be replaced by utilities specific to the database in question. Text search tools work on document databases and Hadoop data structures.
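To give a sense of what profiling looks like when there is no fixed schema, here is a minimal Python sketch that summarizes which fields appear in a set of documents and with which value types. It assumes the documents have already been exported from the store as plain dictionaries; the profile_documents function and the sample records are hypothetical and do not reflect any particular database's utilities.

```python
from collections import defaultdict

def profile_documents(docs):
    """Summarize how often each field appears and which value types it holds."""
    summary = defaultdict(lambda: {"present": 0, "types": set()})
    for doc in docs:
        for field, value in doc.items():
            summary[field]["present"] += 1
            summary[field]["types"].add(type(value).__name__)
    total = len(docs)
    return {
        field: {
            "coverage": stats["present"] / total,  # share of documents with the field
            "types": sorted(stats["types"]),       # more than one type hints at a quality issue
        }
        for field, stats in summary.items()
    }

# Hypothetical sample exported from a document store.
sample = [
    {"_id": 1, "name": "Acme", "zip": "02110"},
    {"_id": 2, "name": "Globex", "zip": 60601},
    {"_id": 3, "name": "Initech"},
]

report = profile_documents(sample)
```

In this sample, the zip field is missing from one document and stored as both text and a number, which is the kind of finding a Data Steward would log as an issue.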
Managing Data In Motion
Traditional Data Governance programs may not include managing the non-persistent data passing through the organization, or the “data in motion”. Even Data Governance programs that focus on the data in databases, and certainly Big Data Governance programs, should establish responsibility for the rules that govern the movement and transformation of data in the organization. These rules are not just technical “code” but business decisions about how critical data in the organization is transformed and calculated. They include:
- Transformation rules (into and out of data warehouses and marts, MDM hubs)
- Canonical models
- Message layouts
- External data sources
- Matching and merging rules
- Data streams
- Key data extracts and calculations
See my new book on Data Integration for more on “Managing Data In Motion.”
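One way to treat rules for data in motion as governed business decisions rather than anonymous code is to record an accountable steward and a plain-language description next to each rule. The Python sketch below is only an illustration under that assumption; the GovernedRule class, the steward name, and the country normalization rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GovernedRule:
    name: str
    steward: str                    # accountable business owner (hypothetical)
    description: str                # the business decision, in plain language
    apply: Callable[[dict], dict]   # the technical implementation of the rule

def normalize_country(record):
    """Map free-text country values onto a canonical code; leave unknown values as-is."""
    aliases = {"usa": "US", "united states": "US", "u.s.": "US"}
    raw = (record.get("country") or "").strip().lower()
    return {**record, "country": aliases.get(raw, record.get("country"))}

RULES = [
    GovernedRule(
        name="normalize_country",
        steward="Customer Data Steward",
        description="Customer country is stored as a two-letter code.",
        apply=normalize_country,
    ),
]

def run_rules(record, rules):
    """Apply each rule in order and return the result plus an audit trail of rule names."""
    applied = []
    for rule in rules:
        record = rule.apply(record)
        applied.append(rule.name)
    return record, applied

result, applied = run_rules({"id": 7, "country": " USA "}, RULES)
```

Keeping the steward and description alongside the implementation means a review and approval workflow can be run against the rule itself, and the audit trail shows which governed rules touched a given record.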
Data Governance Maturity
For organizations that are focused on the creation or management of unstructured data or documents, such as mortgage companies, publishers, media companies, and pharmaceutical companies that file drug submissions, the governance of unstructured data is crucial to their franchise, and their Data Governance of this data is very mature. Most other organizations probably have some policies and tools for the governance of email and documents, but a full Big Data Governance program is usually limited to organizations with very mature Data Governance capabilities.