Hi
How do I ingest the Hive database schema named schema_test as schema in DataHub?
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (e.g. 0.12.0)
- What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to react to its responses so other folks in the Community know whether you’re still looking for help on the topic!
To help you ingest the Hive database schema named schema_test into DataHub as schema, please provide the following details:
- Are you using the UI or CLI for ingestion?
- Which DataHub version are you using? (e.g., 0.12.0, 0.13.3.4)
- What data source(s) are you integrating with DataHub? (e.g., Hive, HDFS)
This information will help tailor the instructions to your specific setup.
In the meantime, here is a general approach using the CLI for DataHub version 0.13.3.4:
- Install the DataHub CLI and Hive Plugin:
  Ensure you have Python 3.6+ installed. Then, install the DataHub CLI and the Hive connector plugin:
  python3 -m pip install --upgrade pip wheel setuptools
  python3 -m pip install --upgrade acryl-datahub
  pip install --upgrade 'acryl-datahub[hive]'
- Create a Recipe File:
  Create a YAML file (e.g., hive_recipe.yml) that defines the source and sink for metadata ingestion. Here is an example configuration for Hive:
  # hive_recipe.yml
  # Hive source configuration
  source:
    type: hive
    config:
      host_port: "localhost:10000"  # Replace with your Hive server host and port
      database: "schema_test"       # Specify the schema to ingest

  # Recipe sink configuration
  sink:
    type: "datahub-rest"
    config:
      server: "http://localhost:8080"  # Replace with your DataHub server URL
      # If using authentication, add your token here
      # token: "<Your API key>"
- Run the Ingestion:
  Use the DataHub CLI to run the ingestion based on your recipe file:
  datahub ingest -c hive_recipe.yml
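If you want to sanity-check the setup before writing anything to DataHub, datahub version confirms the CLI and plugin installed correctly, and recent CLI versions also support a dry run that executes the recipe without emitting metadata to the sink (flag availability may vary by version):
datahub version
datahub ingest -c hive_recipe.yml --dry-run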
For more details on configuring the Hive source, refer to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/hive|Hive Ingestion Documentation>. For more information on the datahub ingest command, see the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/cli-ingestion.md|CLI Ingestion Guide>.
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/hive|Hive Ingestion Documentation>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/README.md|DataHub CLI Installation>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/cli-ingestion.md|CLI Ingestion Guide>
<@U06TM7M40RK> how to replace _test in the schema name before ingesting to DataHub?
To replace _test in the schema name before ingesting it into DataHub, you will need to use a custom transformer. DataHub allows you to write custom transformers to modify metadata during the ingestion process.
Here are the steps to achieve this:
- Create a Custom Transformer:
  Write a custom transformer that modifies the schema name by removing _test.
- Configure the Ingestion Recipe:
  Use the custom transformer in your ingestion recipe.
Step-by-Step Guide
1. Write a Custom Transformer
Create a Python file (e.g., custom_transformer.py) with the following content:
from datahub.ingestion.api.transform import Transformer
from datahub.metadata.schema_classes import DatasetPropertiesClass

class CustomSchemaNameTransformer(Transformer):
    def transform(self, entity):
        if isinstance(entity, DatasetPropertiesClass):
            # Modify the schema name by removing '_test'
            entity.name = entity.name.replace('_test', '')
        return entity

# Register the transformer
transformer = CustomSchemaNameTransformer()
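One caveat about the renaming logic itself: str.replace removes every occurrence of the substring, not just a trailing suffix. A quick illustration with hypothetical names:

print("schema_test".replace('_test', ''))          # -> "schema"
print("db_test_schema_test".replace('_test', ''))  # -> "db_schema"

If you only ever want to strip a trailing _test, check for it with endswith and slice it off instead.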
2. Configure the Ingestion Recipe
Create a YAML file (e.g., hive_recipe.yml) that includes the custom transformer:
# hive_recipe.yml
# Hive source configuration
source:
  type: hive
  config:
    host_port: "localhost:10000"  # Replace with your Hive server host and port
    database: "schema_test"       # Specify the schema to ingest

# Custom transformer configuration
transformers:
  - type: custom_transformer.CustomSchemaNameTransformer

# Recipe sink configuration
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub server URL
    # If using authentication, add your token here
    # token: "<Your API key>"
3. Run the Ingestion
Use the DataHub CLI to run the ingestion based on your recipe file.
datahub ingest -c hive_recipe.yml
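One thing to watch with the dotted-path reference above: the datahub CLI has to be able to import the custom_transformer module when it parses the recipe. Assuming custom_transformer.py sits next to the recipe, running from that directory with the current directory on PYTHONPATH is a simple way to ensure this:

PYTHONPATH=. datahub ingest -c hive_recipe.yml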
Additional Information
- Custom Transformer Documentation: For more details on writing custom transformers, refer to the Dataset transformer pages in the DataHub documentation.
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- DataHub Custom Transformer Documentation (Dataset transformers)
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/hive|Hive Ingestion Documentation>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/README.md|DataHub CLI Installation>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/cli-ingestion.md|CLI Ingestion Guide>
<@U06TM7M40RK> getting 'NoneType' object has no attribute 'transform'
The error “NoneType object has no attribute ‘transform’” suggests that the custom transformer might not be correctly instantiated or registered. Let’s ensure that the custom transformer is properly defined and integrated into the ingestion pipeline.
Here is a more detailed step-by-step guide to help you resolve this issue:
Step 1: Create the Custom Transformer
Create a Python file (e.g., custom_transformer.py) with the following content:
from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import BaseTransformer, SingleAspectTransformer
from datahub.metadata.schema_classes import DatasetPropertiesClass

class CustomSchemaNameTransformerConfig(ConfigModel):
    # Add any configuration parameters your transformer needs here
    pass

class CustomSchemaNameTransformer(BaseTransformer, SingleAspectTransformer):
    ctx: PipelineContext
    config: CustomSchemaNameTransformerConfig

    def __init__(self, config: CustomSchemaNameTransformerConfig, ctx: PipelineContext):
        super().__init__()
        self.ctx = ctx
        self.config = config

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "CustomSchemaNameTransformer":
        config = CustomSchemaNameTransformerConfig.parse_obj(config_dict)
        return cls(config, ctx)

    def entity_types(self) -> list:
        return ["dataset"]

    def aspect_name(self) -> str:
        # SingleAspectTransformer requires this: the name of the aspect to operate on
        return "datasetProperties"

    def transform_aspect(self, entity_urn: str, aspect_name: str, aspect: DatasetPropertiesClass) -> DatasetPropertiesClass:
        # Guard against missing aspects/names before modifying
        if isinstance(aspect, DatasetPropertiesClass) and aspect.name:
            # Modify the schema name by removing '_test'
            aspect.name = aspect.name.replace('_test', '')
        return aspect
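As a quick standalone check of the renaming logic (hypothetical values; the URN below is a placeholder, and the PipelineContext is not exercised by this call):

props = DatasetPropertiesClass(name="schema_test")
t = CustomSchemaNameTransformer(CustomSchemaNameTransformerConfig(), ctx=None)
out = t.transform_aspect("urn:li:dataset:(urn:li:dataPlatform:hive,schema_test.my_table,PROD)", "datasetProperties", props)
print(out.name)  # -> "schema"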
Step 2: Configure the Ingestion Recipe
Create a YAML file (e.g., hive_recipe.yml) that includes the custom transformer:
# hive_recipe.yml
# Hive source configuration
source:
  type: hive
  config:
    host_port: "localhost:10000"  # Replace with your Hive server host and port
    database: "schema_test"       # Specify the schema to ingest

# Custom transformer configuration
transformers:
  - type: "custom_transformer.CustomSchemaNameTransformer"
    config: {}  # Add any configuration parameters your transformer needs here

# Recipe sink configuration
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub server URL
    # If using authentication, add your token here
    # token: "<Your API key>"
Step 3: Run the Ingestion
Ensure that the custom_transformer.py file is in the same directory as your ingestion recipe or otherwise importable. Then, use the DataHub CLI to run the ingestion based on your recipe file.
datahub ingest -c hive_recipe.yml
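If the error persists, the usual root cause is that the pipeline never managed to instantiate the transformer, leaving a None in its place. Per the custom transformer docs, the most reliable fix is to package the module and register it under the datahub.ingestion.transformer.plugins entry point. Here is a minimal setup.py sketch, assuming the single-file module from Step 1 (the distribution name and registered alias below are illustrative):

from setuptools import setup

setup(
    name="custom-schema-name-transformer",  # illustrative package name
    version="0.1.0",
    py_modules=["custom_transformer"],      # the module from Step 1
    install_requires=["acryl-datahub"],
    entry_points={
        # Entry point group DataHub scans for transformer plugins
        "datahub.ingestion.transformer.plugins": [
            "custom_schema_name_transformer = custom_transformer:CustomSchemaNameTransformer",
        ],
    },
)

After pip install -e ., the recipe can reference the transformer by its registered alias (type: custom_schema_name_transformer) instead of the dotted module path.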
Additional Information
- Custom Transformer Documentation: For more details on writing custom transformers, refer to the Dataset transformer pages in the DataHub documentation.
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- DataHub Custom Transformer Documentation (Dataset transformers)
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/hive|Hive Ingestion Documentation>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/README.md|DataHub CLI Installation>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/cli-ingestion.md|CLI Ingestion Guide>
Can you please provide a recipe to transform the Hive database name before loading it into DataHub?