How to Ingest and Transform Hive Database Names in DataHub

Original Slack Thread

Hi,
How do I ingest a Hive database schema named schema_test as schema in DataHub?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


To help you ingest the Hive database schema named schema_test into DataHub as schema, please provide the following details:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0, 0.13.3.4)?
  3. What data source(s) are you integrating with DataHub? (e.g., Hive, HDFS)?

This information will help tailor the instructions to your specific setup.

In the meantime, here is a general approach using the CLI for DataHub version 0.13.3.4:

  1. Install the DataHub CLI and Hive Plugin:
    Ensure you have Python 3.8+ installed (recent acryl-datahub releases no longer support 3.6/3.7), then install the DataHub CLI together with the Hive connector plugin:

    python3 -m pip install --upgrade pip wheel setuptools
    python3 -m pip install --upgrade 'acryl-datahub[hive]'
    
  2. Create a Recipe File:
    Create a YAML file (e.g., hive_recipe.yml) that defines the source and sink for metadata ingestion. Here is an example configuration for Hive:

    # hive_recipe.yml
    
    # Hive source configuration
    source:
      type: hive
      config:
        host_port: "localhost:10000"  # Replace with your Hive server host and port
        database: "schema_test"       # Specify the schema to ingest
    
    # Recipe sink configuration
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"  # Replace with your DataHub server URL
        # If using authentication, add your token here
        # token: "<Your API key>"
    
  3. Run the Ingestion:
    Use the DataHub CLI to run the ingestion based on your recipe file (a programmatic alternative is sketched just after this list).

    datahub ingest -c hive_recipe.yml
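
If you prefer to drive the same recipe from Python instead of the shell, the DataHub SDK also exposes a programmatic Pipeline API. A minimal sketch, reusing the same placeholder host/port and server values as the recipe above:

# run_ingestion.py -- programmatic equivalent of `datahub ingest -c hive_recipe.yml`
from datahub.ingestion.run.pipeline import Pipeline

# Same structure as hive_recipe.yml; host/port and server are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "localhost:10000",
                "database": "schema_test",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run reported errors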
    

For more details on configuring the Hive source, refer to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/hive|Hive Ingestion Documentation>. For more information on the datahub ingest command, see the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/cli-ingestion.md|CLI Ingestion Guide>.

Would you like more detailed steps or additional information on any specific part of the process?


<@U06TM7M40RK> how do I replace _test in the schema name before ingesting to DataHub?

To replace _test in the schema name before ingesting it into DataHub, you will need to use a custom transformer. DataHub allows you to write custom transformers to modify metadata during the ingestion process.

Here are the steps to achieve this:

  1. Create a Custom Transformer:
    You need to write a custom transformer that modifies the schema name by removing _test.

  2. Configure the Ingestion Recipe:
    Use the custom transformer in your ingestion recipe.

Step-by-Step Guide

1. Write a Custom Transformer

Create a Python file (e.g., custom_transformer.py) with the following content:

from datahub.ingestion.api.transform import Transformer
from datahub.metadata.schema_classes import DatasetPropertiesClass

class CustomSchemaNameTransformer(Transformer):
    def transform(self, entity):
        if isinstance(entity, DatasetPropertiesClass):
            # Modify the schema name by removing '_test'
            entity.name = entity.name.replace('_test', '')
        return entity

# Instantiate the transformer (this alone does not hook it into the ingestion pipeline)
transformer = CustomSchemaNameTransformer()

2. Configure the Ingestion Recipe

Create a YAML file (e.g., hive_recipe.yml) that includes the custom transformer:

# hive_recipe.yml

# Hive source configuration
source:
  type: hive
  config:
    host_port: "localhost:10000"  # Replace with your Hive server host and port
    database: "schema_test"       # Specify the schema to ingest

# Custom transformer configuration
transformers:
  - type: custom_transformer.CustomSchemaNameTransformer

# Recipe sink configuration
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub server URL
    # If using authentication, add your token here
    # token: "<Your API key>"

3. Run the Ingestion

Use the DataHub CLI to run the ingestion based on your recipe file.

datahub ingest -c hive_recipe.yml

Additional Information

  • Custom Transformer Documentation: For more details on writing custom transformers, refer to the DataHub custom transformer documentation.

Would you like more detailed steps or additional information on any specific part of the process?


<@U06TM7M40RK> getting "NoneType object has no attribute 'transform'"

The error “NoneType object has no attribute ‘transform’” suggests that the custom transformer might not be correctly instantiated or registered. Let’s ensure that the custom transformer is properly defined and integrated into the ingestion pipeline.

Here is a more detailed step-by-step guide to help you resolve this issue:

Step 1: Create the Custom Transformer

Create a Python file (e.g., custom_transformer.py) with the following content. Note that a SingleAspectTransformer subclass must implement entity_types, aspect_name, and transform_aspect, and must expose a create classmethod so the pipeline can instantiate it:

from typing import List, Optional

from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import BaseTransformer, SingleAspectTransformer
from datahub.metadata.schema_classes import DatasetPropertiesClass

class CustomSchemaNameTransformerConfig(ConfigModel):
    # Add any configuration parameters your transformer needs here
    pass

class CustomSchemaNameTransformer(BaseTransformer, SingleAspectTransformer):
    ctx: PipelineContext
    config: CustomSchemaNameTransformerConfig

    def __init__(self, config: CustomSchemaNameTransformerConfig, ctx: PipelineContext):
        super().__init__()
        self.ctx = ctx
        self.config = config

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "CustomSchemaNameTransformer":
        config = CustomSchemaNameTransformerConfig.parse_obj(config_dict)
        return cls(config, ctx)

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        # Required by SingleAspectTransformer; selects the aspect to transform
        return "datasetProperties"

    def transform_aspect(
        self, entity_urn: str, aspect_name: str, aspect: Optional[DatasetPropertiesClass]
    ) -> Optional[DatasetPropertiesClass]:
        if isinstance(aspect, DatasetPropertiesClass) and aspect.name:
            # Remove '_test' from the dataset's display name
            aspect.name = aspect.name.replace("_test", "")
        return aspect
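
Before wiring it into a recipe, you can sanity-check the transformer in isolation. A minimal sketch, assuming custom_transformer.py is importable from the current directory (the dataset URN below is just an illustrative value):

# check_transformer.py -- exercise the aspect logic outside a full ingestion run.
from datahub.ingestion.api.common import PipelineContext
from datahub.metadata.schema_classes import DatasetPropertiesClass

from custom_transformer import CustomSchemaNameTransformer

transformer = CustomSchemaNameTransformer.create({}, PipelineContext(run_id="local-test"))
aspect = DatasetPropertiesClass(name="schema_test")
result = transformer.transform_aspect(
    entity_urn="urn:li:dataset:(urn:li:dataPlatform:hive,schema_test.orders,PROD)",
    aspect_name="datasetProperties",
    aspect=aspect,
)
print(result.name)  # expected output: schema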

Step 2: Configure the Ingestion Recipe

Create a YAML file (e.g., hive_recipe.yml) that includes the custom transformer:

# hive_recipe.yml

# Hive source configuration
source:
  type: hive
  config:
    host_port: "localhost:10000"  # Replace with your Hive server host and port
    database: "schema_test"       # Specify the schema to ingest

# Custom transformer configuration
transformers:
  - type: "custom_transformer.CustomSchemaNameTransformer"
    config: {}  # empty mapping so the config parses cleanly; add any parameters your transformer needs here

# Recipe sink configuration
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub server URL
    # If using authentication, add your token here
    # token: "<Your API key>"

Step 3: Run the Ingestion

Ensure that custom_transformer.py is importable when the ingestion runs, for example by launching the command from the directory that contains it, by adding that directory to PYTHONPATH, or by packaging it as an installable module (see the sketch below). Then use the DataHub CLI to run the ingestion based on your recipe file.

datahub ingest -c hive_recipe.yml
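
If you want the transformer importable from anywhere rather than only from its own directory, one option is to package it. A minimal sketch using setuptools (the package name and version are illustrative):

# setup.py -- optional packaging so `custom_transformer` is importable anywhere.
from setuptools import setup

setup(
    name="custom-datahub-transformers",  # illustrative package name
    version="0.1.0",
    py_modules=["custom_transformer"],
    install_requires=["acryl-datahub"],
)

After installing it with pip (e.g., an editable install during development), the type path custom_transformer.CustomSchemaNameTransformer in the recipe resolves regardless of the working directory.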

Additional Information

  • Custom Transformer Documentation: For more details on writing custom transformers, refer to the DataHub custom transformer documentation.

Would you like more detailed steps or additional information on any specific part of the process?


Can you please provide a recipe to transform the Hive database name before loading it into DataHub?