Module 4: Canonical Models and Data Quality
Key Terms
-
Entities: Entities are higher-level concepts in data modeling that group together related tables representing a common theme or purpose, such as Customer or Order.
-
Attributes: Attributes are specific data points within an entity, representing common columns across grouped tables, such as FirstName or OrderDate.
Prerequisites
- Before proceeding with canonical data modeling, ensure that data mappings have been generated to create the necessary entities and attributes. Refer to the Generate Mappings section for detailed instructions.
Step 1: Data Modeling with Canonical Loading
In data modeling, canonical loading groups entities from different data sources into a unified structure. Data Quality Recommendations then help improve data quality and compatibility across entities.
-
Canonical Loading:
- Use the "Table Profile and Canonical Load Type" feature to group entities.
- This helps in creating a consistent data structure across different data sources.
-
Generate Business Key:
- Click on the "Generate Business Key" button to create a unique identifier for a given entity.
- This key helps in maintaining consistency and uniqueness across records.
-
Check Data Compatibility:
Click on "Check Data Compatibility" to evaluate:
- Data Compatibility: The similarity in data between attributes from entities across schemas, measured by the overlap percentage and relationship type.
- Schema Compatibility: Whether an attribute in an entity maps to attributes in other source entities.
- The output is a true or false value indicating compatibility.
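As a rough illustration of what such a check might compute, the sketch below measures value overlap between two attributes, infers a relationship type, and tests whether an attribute is mapped in other sources. The function names, the overlap definition (shared distinct values divided by the smaller distinct set), and the 50% threshold are assumptions made for this example, not the product's actual logic.

```python
def overlap_percentage(left, right):
    """Share of distinct non-null values in common, relative to the smaller set."""
    left_set = {v for v in left if v is not None}
    right_set = {v for v in right if v is not None}
    if not left_set or not right_set:
        return 0.0
    shared = left_set & right_set
    return 100.0 * len(shared) / min(len(left_set), len(right_set))


def _is_unique(values):
    """True when no non-null value repeats."""
    non_null = [v for v in values if v is not None]
    return len(non_null) == len(set(non_null))


def relationship_type(left, right):
    """Classify the pair as one-to-one, one-to-many, many-to-one, or many-to-many."""
    left_side = "one" if _is_unique(left) else "many"
    right_side = "one" if _is_unique(right) else "many"
    return f"{left_side}-to-{right_side}"


def is_data_compatible(left, right, threshold=50.0):
    """Hypothetical rule: compatible when the overlap meets the threshold."""
    return overlap_percentage(left, right) >= threshold


def is_schema_compatible(attribute, mappings_by_source):
    """True when the canonical attribute is mapped in at least one source entity."""
    return any(attribute in mapped for mapped in mappings_by_source.values())


# Hypothetical customer IDs from two source schemas.
crm_customer_ids = [101, 102, 103, 104]
erp_customer_ids = [102, 103, 103, 104, 105]

print(overlap_percentage(crm_customer_ids, erp_customer_ids))  # 75.0
print(relationship_type(crm_customer_ids, erp_customer_ids))   # one-to-many
print(is_data_compatible(crm_customer_ids, erp_customer_ids))  # True
```

Dividing by the smaller distinct set keeps a small lookup table from looking incompatible with a much larger fact table; a different denominator would be just as defensible.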
-
Data Lake Load:
- Click on "Data Lake Load" to generate data from the canonical loading of different attributes.
- This feature helps in loading the data into a data lake for further processing and analysis.
-
Data Warehouse Load:
- Click on "Data Warehouse Load" to load the generated Parquet files into a data warehouse.
- This ensures that the data is available for querying and reporting in a structured format.
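To make canonical loading and business keys more concrete, here is a minimal sketch that maps rows from two hypothetical source tables onto one canonical Customer entity and derives a business key from a normalized key attribute. The source names, column mappings, and the SHA-256-based key are illustrative assumptions; the product's own key generation and load steps may work differently.

```python
import hashlib

# Hypothetical mappings from source columns to canonical attributes.
COLUMN_MAPPINGS = {
    "crm.customers": {"first_nm": "FirstName", "last_nm": "LastName", "email_addr": "Email"},
    "erp.clients": {"fname": "FirstName", "lname": "LastName", "email": "Email"},
}

# Assumed choice of attributes that identify a Customer.
BUSINESS_KEY_ATTRIBUTES = ("Email",)


def to_canonical(source, row):
    """Rename one source row's columns to the canonical attribute names."""
    mapping = COLUMN_MAPPINGS[source]
    return {canonical: row.get(column) for column, canonical in mapping.items()}


def business_key(record, key_attributes=BUSINESS_KEY_ATTRIBUTES):
    """Hash the trimmed, lowercased key attributes into a stable identifier."""
    normalized = "|".join(
        str(record.get(attribute) or "").strip().lower() for attribute in key_attributes
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def canonical_load(rows_by_source):
    """Merge rows from all sources, keeping the first record seen per business key."""
    loaded = {}
    for source, rows in rows_by_source.items():
        for row in rows:
            record = to_canonical(source, row)
            key = business_key(record)
            loaded.setdefault(key, {"BusinessKey": key, **record})
    return list(loaded.values())


rows_by_source = {
    "crm.customers": [
        {"first_nm": "Ada", "last_nm": "Lovelace", "email_addr": "Ada@Example.com "},
    ],
    "erp.clients": [
        {"fname": "Ada", "lname": "Lovelace", "email": "ada@example.com"},
        {"fname": "Alan", "lname": "Turing", "email": "alan@example.com"},
    ],
}

customers = canonical_load(rows_by_source)
print(len(customers))  # 2: both "Ada" rows share one business key
```

Normalizing before hashing is what lets the two spellings of the same email address resolve to a single canonical record.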
-
Add Custom Attributes
-
Select Entity Name: Choose the entity for the custom attribute.
-
Define Source Attributes: Specify the source attributes.
-
Enter Custom Attribute Details: Provide name, description, business key status, classification, and datatype.
-
Enter Prompt Text: Describe the purpose of the custom attribute.
-
Generate and Review Code: Click "Generate and Preview Code" and review the generated code.
-
Save the Custom Attribute: Click "Save" to add the attribute.
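The details collected in these steps can be pictured as a small record plus a transformation. The sketch below is an assumption-laden illustration: the field names, the FullName example, the "PII" classification value, and the hand-written `full_name` function (standing in for generated code) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class CustomAttribute:
    """Hypothetical container for the details entered in the steps above."""
    entity: str
    name: str
    source_attributes: Sequence[str]
    description: str
    is_business_key: bool
    classification: str
    datatype: str
    prompt_text: str
    transform: Callable[..., object]

    def apply(self, record: Mapping[str, object]) -> dict:
        """Return a copy of the record with the custom attribute added."""
        inputs = [record.get(attribute) for attribute in self.source_attributes]
        return {**record, self.name: self.transform(*inputs)}


def full_name(first, last):
    """Stand-in for generated code: join the non-empty name parts."""
    parts = [part.strip() for part in (first, last) if part]
    return " ".join(parts)


full_name_attribute = CustomAttribute(
    entity="Customer",
    name="FullName",
    source_attributes=("FirstName", "LastName"),
    description="Customer's first and last name combined",
    is_business_key=False,
    classification="PII",
    datatype="string",
    prompt_text="Combine FirstName and LastName into a single display name.",
    transform=full_name,
)

print(full_name_attribute.apply({"FirstName": "Ada", "LastName": "Lovelace"}))
# {'FirstName': 'Ada', 'LastName': 'Lovelace', 'FullName': 'Ada Lovelace'}
```

Reviewing generated code before saving amounts to checking that a function like `full_name` handles the cases you care about, such as a missing last name.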
Step 2: Data Quality Recommendations
The Data Quality Recommendations step involves defining data structures and relationships for a given entity. During this step, you also specify potential data quality checks to ensure the integrity and accuracy of the data.
-
Select Entity:
- From the list of entities, choose the one you want to model (e.g., Blending Operations).
-
Generate Data Quality Recommendations:
- Click the "Generate Data Quality Recommendations" button.
- The large language model (LLM) provides a list of suggested data quality checks.
-
Review Data Quality Rules:
- Look over the recommended data quality rules for your selected entity.
- Each suggestion includes:
- Category: The type of quality check (e.g., Cleansing, Validation).
- Subcategory: Specific focus of the rule (e.g., Null Handling, Range Checking).
- Attributes: The data attributes the rule applies to (e.g., Blending ID).
- Rule Description: What the rule does (e.g., Replace null values with default values).
- Rule Explanation: Why the rule is important (e.g., Ensures all records have a valid Blending ID).
-
Discretionary Implementation:
- Users have the discretion to accept, modify, or reject the suggested rules.
- Implement the rules that best fit your data quality requirements.
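One way to picture a recommended rule and the accept-or-reject decision is sketched below. The `QualityRule` fields mirror the list above, but the class itself, the `fill_nulls` helper, the "UNKNOWN" default, and the attribute spellings are assumptions for illustration rather than the product's internal representation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class QualityRule:
    """Hypothetical shape of one suggested data quality rule."""
    category: str
    subcategory: str
    attributes: Sequence[str]
    description: str
    explanation: str
    apply: Callable[[dict], dict]
    accepted: bool = False  # set by the user after review


def fill_nulls(attribute, default):
    """Build a cleansing step that replaces a null attribute with a default."""
    def rule(record):
        if record.get(attribute) is None:
            return {**record, attribute: default}
        return record
    return rule


def apply_accepted(rules, records):
    """Run only the rules the user accepted, in order, over every record."""
    accepted = [rule for rule in rules if rule.accepted]
    cleaned = []
    for record in records:
        for rule in accepted:
            record = rule.apply(record)
        cleaned.append(record)
    return cleaned


rules = [
    QualityRule(
        category="Cleansing",
        subcategory="Null Handling",
        attributes=["BlendingID"],
        description="Replace null values with default values",
        explanation="Ensures all records have a valid Blending ID",
        apply=fill_nulls("BlendingID", "UNKNOWN"),
        accepted=True,
    ),
    QualityRule(
        category="Cleansing",
        subcategory="Null Handling",
        attributes=["Operator"],
        description="Replace null values with default values",
        explanation="Ensures every blending operation names an operator",
        apply=fill_nulls("Operator", "N/A"),
        accepted=False,  # rejected during review, so it is never applied
    ),
]

records = [
    {"BlendingID": None, "Operator": None},
    {"BlendingID": "B-7", "Operator": "kim"},
]

cleaned = apply_accepted(rules, records)
print(cleaned[0])  # {'BlendingID': 'UNKNOWN', 'Operator': None}
```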
-
Add Data Quality Recommendations
- Select the entity and attribute you wish to improve.
- Choose a rule category and sub-category (e.g., cleansing transformations, null handling).
- Enter prompt text to guide the LLM in generating data quality rules.
- Review the generated code to ensure it meets your needs.
- Save the recommendations to implement the data quality checks.
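When reviewing generated code for a validation rule, it can help to know roughly what to expect. The sketch below shows what a range-checking rule might look like; the `range_check` helper, the `BlendRatio` attribute, and the 0 to 1 bounds are hypothetical, and nulls are deliberately left to a separate null-handling rule.

```python
def range_check(attribute, low, high):
    """Build a validation step that reports rows whose value falls outside [low, high]."""
    def violations(records):
        failing = []
        for index, record in enumerate(records):
            value = record.get(attribute)
            # Nulls are skipped here; a null-handling rule would cover them.
            if value is not None and not (low <= value <= high):
                failing.append(index)
        return failing
    return violations


check_blend_ratio = range_check("BlendRatio", 0.0, 1.0)

records = [
    {"BlendRatio": 0.4},
    {"BlendRatio": 1.7},
    {"BlendRatio": None},
    {"BlendRatio": -0.1},
]

print(check_blend_ratio(records))  # [1, 3]
```

Returning row positions instead of modifying the data keeps validation separate from cleansing, so you can decide afterwards whether to correct, quarantine, or drop the failing rows.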