Capturing Column-Level Descriptions from Vertica Tables for DataHub Integration

Original Slack Thread

Is it possible to get column level descriptions stored as comments from Vertica tables?

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

To retrieve column-level descriptions stored as comments from Vertica tables and have them reflected in DataHub, you need to ensure that the ingestion process is correctly configured to capture these comments as descriptions.

Based on the <https://github.com/datahub-project/datahub/blob/master/docs/api/datahub-apis.md|DataHub API documentation>, DataHub supports adding descriptions to columns, which can be done through various methods, including using GraphQL mutations. However, the specific process for automatically capturing comments from Vertica and mapping them to descriptions in DataHub is not explicitly detailed in the provided data sources.

If you are using a custom ingestion job, you might need to ensure that the ingestion logic is set up to read the comments from the Vertica metadata and map them to the appropriate fields in DataHub. This might involve customizing the ingestion code or configuration to extract and apply these comments as descriptions.

Would you like more detailed guidance on how to configure or customize the ingestion process to achieve this? If so, please let me know, and I can provide further steps or code examples based on the available data sources.

I would like more detailed guidance on how to configure or customize the ingestion process to achieve that <@U06TM7M40RK>

To configure or customize the ingestion process to capture column-level descriptions stored as comments from Vertica tables and reflect them in DataHub, you can follow these steps:

Step-by-Step Guide

  1. Enable Descriptions in Ingestion Recipe: Ensure that your ingestion recipe is configured to include descriptions. This involves setting the appropriate flags in your configuration file.

  2. Custom Transformer: Implement a custom transformer to extract comments from Vertica and map them to column descriptions in DataHub.

  3. Run the Ingestion: Use the DataHub CLI to execute the ingestion process with the custom configuration.

Step 1: Enable Descriptions in Ingestion Recipe

Create or modify your ingestion recipe YAML file to include descriptions. Here is an example configuration for Vertica:

source:
  type: vertica
  config:
    host_port: "localhost:5433"
    database: "your_database"
    username: "your_username"
    password: "your_password"
    include_tables: true
    include_views: true
    include_columns: true
    include_descriptions: true  # Ensure descriptions are included

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Step 2: Implement a Custom Transformer

Create a custom transformer to extract comments from Vertica and map them to column descriptions. Here is a simplified example:

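from typing import List, Optional

from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import (
    BaseTransformer,
    SingleAspectTransformer,
)
from datahub.metadata.schema_classes import SchemaMetadataClass


# Sketch only: assumes the BaseTransformer / SingleAspectTransformer interface;
# check it against your installed DataHub version.
class VerticaCommentToDescription(BaseTransformer, SingleAspectTransformer):
    def __init__(self, config_dict: dict, ctx: PipelineContext):
        super().__init__()
        self.config = config_dict
        self.ctx = ctx

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "VerticaCommentToDescription":
        return cls(config_dict, ctx)

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        return "schemaMetadata"

    def transform_aspect(self, entity_urn: str, aspect_name: str, aspect):
        # Copy each column's Vertica comment into the field description
        if isinstance(aspect, SchemaMetadataClass):
            for field in aspect.fields:
                comment = self.get_comment_from_vertica(entity_urn, field.fieldPath)
                if comment:
                    field.description = comment
        return aspect

    def get_comment_from_vertica(self, entity_urn: str, field_path: str) -> Optional[str]:
        # Placeholder: implement logic to fetch the comment from Vertica for this column
        return "Sample comment from Vertica"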
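
One possible way to fill in the get_comment_from_vertica placeholder is to key comments by (schema, table, column) and derive the schema and table from the dataset URN. The sketch below is not part of DataHub: the helper names dataset_name_from_urn and comment_for_field are hypothetical, it assumes the dataset name inside the URN ends in schema.table, and it assumes the comments have already been loaded into a dictionary.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key shape: (schema, table, column), all lower-cased.
CommentKey = Tuple[str, str, str]


def dataset_name_from_urn(entity_urn: str) -> str:
    """Return the dataset name from a URN such as
    urn:li:dataset:(urn:li:dataPlatform:vertica,mydb.public.orders,PROD)."""
    prefix = "urn:li:dataset:("
    if not (entity_urn.startswith(prefix) and entity_urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {entity_urn}")
    inner = entity_urn[len(prefix):-1]
    _platform, rest = inner.split(",", 1)  # drop the platform URN
    name, _env = rest.rsplit(",", 1)       # drop the environment (e.g. PROD)
    return name


def comment_for_field(
    comments: Dict[CommentKey, str], entity_urn: str, field_path: str
) -> Optional[str]:
    """Look up the comment for one column, or return None if there is none."""
    # Assumption: the dataset name ends in "<schema>.<table>".
    parts = dataset_name_from_urn(entity_urn).split(".")
    if len(parts) < 2:
        return None
    schema, table = parts[-2], parts[-1]
    return comments.get((schema.lower(), table.lower(), field_path.lower()))


if __name__ == "__main__":
    comments = {("public", "orders", "order_id"): "Primary key of the order"}
    urn = "urn:li:dataset:(urn:li:dataPlatform:vertica,mydb.public.orders,PROD)"
    print(comment_for_field(comments, urn, "order_id"))
```

Returning None for unknown columns lets the transformer leave existing descriptions untouched instead of overwriting them.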

Step 3: Run the Ingestion

  1. Save the Custom Transformer: Save the VerticaCommentToDescription class in a Python file accessible to your ingestion process.

  2. Update the Ingestion Recipe: Add the custom transformer to your ingestion recipe:

transformers:
  - type: path.to.your.VerticaCommentToDescription

  3. Execute the Ingestion: Run the ingestion command using the DataHub CLI:

datahub ingest -c path/to/your/ingestion_recipe.yaml

Additional Considerations

  • Database Access: Ensure that your ingestion process has the necessary permissions to access Vertica metadata, including comments.
  • Testing: Test the ingestion process with a small dataset to verify that comments are correctly mapped to descriptions.

Would you like more detailed steps or further assistance on any specific part of this process?

  - type: path.to.your.VerticaCommentToDescription

Give me an example of what the path would look like. Also, the option include_descriptions: true does not currently exist for Vertica. <@U06TM7M40RK>

To specify the path for a custom transformer in your ingestion recipe, you need to provide the fully qualified Python module and class name. This path should reflect the directory structure where your custom transformer code is located.

Example Path for Custom Transformer

Assuming your custom transformer is defined in a file named custom_transformer.py and is located in a directory named transformers, the path in your ingestion recipe would look like this:

transformers:
  - type: transformers.custom_transformer.VerticaCommentToDescription

Directory Structure Example

Here’s an example of how your directory structure might look:

/your_project_directory
    /transformers
        custom_transformer.py
    ingestion_recipe.yaml
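
Before running the ingestion, it can help to confirm that the dotted path is importable from the directory where you launch the CLI. The sketch below assumes the type string is resolved with a standard Python import, so /your_project_directory must be the working directory or be on PYTHONPATH. The resolve_dotted_path helper is hypothetical and not a DataHub function.

```python
import importlib


def resolve_dotted_path(dotted_path: str):
    """Import "package.module.ClassName" and return the named attribute."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)  # raises ImportError if not importable
    return getattr(module, attr_name)              # raises AttributeError if the class is missing


if __name__ == "__main__":
    # Run from /your_project_directory so that "transformers" is importable.
    cls = resolve_dotted_path("transformers.custom_transformer.VerticaCommentToDescription")
    print(cls)
```

An ImportError here usually points to a wrong working directory or module name, and an AttributeError to a misspelled class name.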

Custom Transformer Code Example

Ensure your custom_transformer.py file contains the VerticaCommentToDescription class:

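from typing import List, Optional

from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import (
    BaseTransformer,
    SingleAspectTransformer,
)
from datahub.metadata.schema_classes import SchemaMetadataClass


# Sketch only: assumes the BaseTransformer / SingleAspectTransformer interface;
# check it against your installed DataHub version.
class VerticaCommentToDescription(BaseTransformer, SingleAspectTransformer):
    def __init__(self, config_dict: dict, ctx: PipelineContext):
        super().__init__()
        self.config = config_dict
        self.ctx = ctx

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "VerticaCommentToDescription":
        return cls(config_dict, ctx)

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        return "schemaMetadata"

    def transform_aspect(self, entity_urn: str, aspect_name: str, aspect):
        # Copy each column's Vertica comment into the field description
        if isinstance(aspect, SchemaMetadataClass):
            for field in aspect.fields:
                comment = self.get_comment_from_vertica(entity_urn, field.fieldPath)
                if comment:
                    field.description = comment
        return aspect

    def get_comment_from_vertica(self, entity_urn: str, field_path: str) -> Optional[str]:
        # Placeholder: implement logic to fetch the comment from Vertica for this column
        return "Sample comment from Vertica"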

Note on include_descriptions

As you mentioned, the include_descriptions: true option does not exist for Vertica in the current DataHub configuration. Instead, you will need to rely on custom logic within your transformer to fetch and apply comments as descriptions.
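
As a rough sketch of that custom logic, the comments could be loaded once through any DB-API cursor and cached in a dictionary. This is an assumption-laden example, not the Vertica source's own behavior: it assumes column comments are exposed in the v_catalog.comments system table with the columns object_schema, object_name, child_object, and comment, and with object_type = 'COLUMN'. Check this against your Vertica version. The function name load_column_comments is hypothetical.

```python
from typing import Dict, Tuple

# Assumption: table and column names below match your Vertica version's
# v_catalog.comments system table; adjust the query if they differ.
COLUMN_COMMENTS_SQL = """
SELECT object_schema, object_name, child_object, comment
FROM v_catalog.comments
WHERE object_type = 'COLUMN'
"""


def load_column_comments(cursor) -> Dict[Tuple[str, str, str], str]:
    """Run the query on a DB-API cursor and index the comments
    by lower-cased (schema, table, column)."""
    cursor.execute(COLUMN_COMMENTS_SQL)
    comments: Dict[Tuple[str, str, str], str] = {}
    for schema, table, column, comment in cursor.fetchall():
        if comment:  # skip NULL or empty comments
            comments[(schema.lower(), table.lower(), column.lower())] = comment
    return comments
```

The cursor would typically come from a Vertica client connection opened in the transformer's constructor, so the query runs once per ingestion rather than once per column.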

Running the Ingestion

Once your transformer is set up and your ingestion recipe is configured, you can run the ingestion process using the DataHub CLI:

datahub ingest -c path/to/your/ingestion_recipe.yaml
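
If you prefer to trigger the same command from a script (for example in a scheduled job), a thin wrapper around the CLI is one option. This is only a sketch: the function names build_ingest_command and run_ingestion are hypothetical, and it assumes the datahub executable is on PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ingest_command(recipe_path: str) -> List[str]:
    """Return the CLI invocation for a recipe, failing early if the file is missing."""
    path = Path(recipe_path)
    if not path.is_file():
        raise FileNotFoundError(f"recipe not found: {recipe_path}")
    return ["datahub", "ingest", "-c", str(path)]


def run_ingestion(recipe_path: str) -> None:
    """Run the ingestion and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_ingest_command(recipe_path), check=True)


if __name__ == "__main__":
    run_ingestion("path/to/your/ingestion_recipe.yaml")
```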

Would you like more detailed steps or further assistance on any specific part of this process?
