How to Ingest and Transform Hive Database Names in DataHub

Original Slack Thread

Hi,
How do I ingest a Hive database schema named schema_test as schema in DataHub?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


To help you ingest the Hive database schema named schema_test into DataHub as schema, please provide the following details:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0, 0.13.3.4)?
  3. What data source(s) are you integrating with DataHub? (e.g., Hive, HDFS)?

This information will help tailor the instructions to your specific setup.

In the meantime, here is a general approach using the CLI for DataHub version 0.13.3.4:

  1. Install the DataHub CLI and Hive Plugin:
    Ensure you have Python 3.8+ installed (recent acryl-datahub releases no longer support 3.6/3.7), then install the DataHub CLI together with the Hive connector plugin:

    python3 -m pip install --upgrade pip wheel setuptools
    python3 -m pip install --upgrade 'acryl-datahub[hive]'
    
  2. Create a Recipe File:
    Create a YAML file (e.g., hive_recipe.yml) that defines the source and sink for metadata ingestion. Here is an example configuration for Hive:

    # hive_recipe.yml
    
    # Hive source configuration
    source:
      type: hive
      config:
        host_port: "localhost:10000"  # Replace with your Hive server host and port
        database: "schema_test"       # Specify the schema to ingest
    
    # Recipe sink configuration
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"  # Replace with your DataHub server URL
        # If using authentication, add your token here
        # token: "<Your API key>"
    
  3. Run the Ingestion:
    Use the DataHub CLI to run the ingestion based on your recipe file (a programmatic alternative is sketched just after this list).

    datahub ingest -c hive_recipe.yml
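
If you prefer to drive the same recipe from Python instead of the shell, the DataHub SDK also exposes a programmatic Pipeline API. A minimal sketch, reusing the same placeholder host/port and server values as the recipe above:

# run_ingestion.py -- programmatic equivalent of `datahub ingest -c hive_recipe.yml`
from datahub.ingestion.run.pipeline import Pipeline

# Same structure as hive_recipe.yml; host/port and server are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "localhost:10000",
                "database": "schema_test",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run reported errors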
    

For more details on configuring the Hive source, refer to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/hive|Hive Ingestion Documentation>. For more information on the datahub ingest command, see the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/cli-ingestion.md|CLI Ingestion Guide>.

Would you like more detailed steps or additional information on any specific part of the process?


<@U06TM7M40RK> how do I replace _test in the schema name before ingesting to DataHub?

To replace _test in the schema name before ingesting it into DataHub, you will need to use a custom transformer. DataHub allows you to write custom transformers to modify metadata during the ingestion process.

Here are the steps to achieve this:

  1. Create a Custom Transformer:
    You need to write a custom transformer that modifies the schema name by removing _test.

  2. Configure the Ingestion Recipe:
    Use the custom transformer in your ingestion recipe.

Step-by-Step Guide

1. Write a Custom Transformer

Create a Python file (e.g., custom_transformer.py) with the following content:

from datahub.ingestion.api.transform import Transformer
from datahub.metadata.schema_classes import DatasetPropertiesClass

class CustomSchemaNameTransformer(Transformer):
    def transform(self, entity):
        if isinstance(entity, DatasetPropertiesClass):
            # Modify the schema name by removing '_test'
            entity.name = entity.name.replace('_test', '')
        return entity

# Instantiate the transformer (this alone does not hook it into the ingestion pipeline)
transformer = CustomSchemaNameTransformer()

2. Configure the Ingestion Recipe

Create a YAML file (e.g., hive_recipe.yml) that includes the custom transformer:

# hive_recipe.yml

# Hive source configuration
source:
  type: hive
  config:
    host_port: "localhost:10000"  # Replace with your Hive server host and port
    database: "schema_test"       # Specify the schema to ingest

# Custom transformer configuration
transformers:
  - type: custom_transformer.CustomSchemaNameTransformer

# Recipe sink configuration
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub server URL
    # If using authentication, add your token here
    # token: "<Your API key>"

3. Run the Ingestion

Use the DataHub CLI to run the ingestion based on your recipe file.

datahub ingest -c hive_recipe.yml

Additional Information

  • Custom Transformer Documentation: For more details on writing custom transformers, refer to the DataHub custom transformer documentation.

Would you like more detailed steps or additional information on any specific part of the process?


<@U06TM7M40RK> getting "NoneType object has no attribute 'transform'"

The error “NoneType object has no attribute ‘transform’” suggests that the custom transformer might not be correctly instantiated or registered. Let’s ensure that the custom transformer is properly defined and integrated into the ingestion pipeline.

Here is a more detailed step-by-step guide to help you resolve this issue:

Step 1: Create the Custom Transformer

Create a Python file (e.g., custom_transformer.py) with the following content. Note that a SingleAspectTransformer subclass must implement entity_types, aspect_name, and transform_aspect, and must expose a create classmethod so the pipeline can instantiate it:

from typing import List, Optional

from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import BaseTransformer, SingleAspectTransformer
from datahub.metadata.schema_classes import DatasetPropertiesClass

class CustomSchemaNameTransformerConfig(ConfigModel):
    # Add any configuration parameters your transformer needs here
    pass

class CustomSchemaNameTransformer(BaseTransformer, SingleAspectTransformer):
    ctx: PipelineContext
    config: CustomSchemaNameTransformerConfig

    def __init__(self, config: CustomSchemaNameTransformerConfig, ctx: PipelineContext):
        super().__init__()
        self.ctx = ctx
        self.config = config

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "CustomSchemaNameTransformer":
        config = CustomSchemaNameTransformerConfig.parse_obj(config_dict)
        return cls(config, ctx)

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        # Required by SingleAspectTransformer; selects the aspect to transform
        return "datasetProperties"

    def transform_aspect(
        self, entity_urn: str, aspect_name: str, aspect: Optional[DatasetPropertiesClass]
    ) -> Optional[DatasetPropertiesClass]:
        if isinstance(aspect, DatasetPropertiesClass) and aspect.name:
            # Remove '_test' from the dataset's display name
            aspect.name = aspect.name.replace("_test", "")
        return aspect
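
Before wiring it into a recipe, you can sanity-check the transformer in isolation. A minimal sketch, assuming custom_transformer.py is importable from the current directory (the dataset URN below is just an illustrative value):

# check_transformer.py -- exercise the aspect logic outside a full ingestion run.
from datahub.ingestion.api.common import PipelineContext
from datahub.metadata.schema_classes import DatasetPropertiesClass

from custom_transformer import CustomSchemaNameTransformer

transformer = CustomSchemaNameTransformer.create({}, PipelineContext(run_id="local-test"))
aspect = DatasetPropertiesClass(name="schema_test")
result = transformer.transform_aspect(
    entity_urn="urn:li:dataset:(urn:li:dataPlatform:hive,schema_test.orders,PROD)",
    aspect_name="datasetProperties",
    aspect=aspect,
)
print(result.name)  # expected output: schema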

Step 2: Configure the Ingestion Recipe

Create a YAML file (e.g., hive_recipe.yml) that includes the custom transformer:

# hive_recipe.yml

# Hive source configuration
source:
  type: hive
  config:
    host_port: "localhost:10000"  # Replace with your Hive server host and port
    database: "schema_test"       # Specify the schema to ingest

# Custom transformer configuration
transformers:
  - type: "custom_transformer.CustomSchemaNameTransformer"
    config: {}  # empty mapping so the config parses cleanly; add any parameters your transformer needs here

# Recipe sink configuration
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub server URL
    # If using authentication, add your token here
    # token: "<Your API key>"

Step 3: Run the Ingestion

Ensure that custom_transformer.py is importable when the ingestion runs, for example by launching the command from the directory that contains it, by adding that directory to PYTHONPATH, or by packaging it as an installable module (see the sketch below). Then use the DataHub CLI to run the ingestion based on your recipe file.

datahub ingest -c hive_recipe.yml
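
If you want the transformer importable from anywhere rather than only from its own directory, one option is to package it. A minimal sketch using setuptools (the package name and version are illustrative):

# setup.py -- optional packaging so `custom_transformer` is importable anywhere.
from setuptools import setup

setup(
    name="custom-datahub-transformers",  # illustrative package name
    version="0.1.0",
    py_modules=["custom_transformer"],
    install_requires=["acryl-datahub"],
)

After installing it with pip (e.g., an editable install during development), the type path custom_transformer.CustomSchemaNameTransformer in the recipe resolves regardless of the working directory.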

Additional Information

  • Custom Transformer Documentation: For more details on writing custom transformers, refer to the DataHub custom transformer documentation.

Would you like more detailed steps or additional information on any specific part of the process?


Can you please provide a recipe to transform the Hive database name before loading it into DataHub?