Using Python Emitter for Ingesting Database Structure with DataHub

Original Slack Thread

Hello <@U06TM7M40RK>, is it good practice to use the Python Emitter to ingest the whole database structure, including its tables, documents, and so forth? If so, please provide examples of an emitter that does this; if using the Web UI is better, then confirm that.

It seems possible to use DataHub's Python emitter to ingest the complete structure of a database, including its tables and documents. My understanding is that the Python emitter is a viable option for writing metadata and lineage to the DataHub metadata store, especially if you can extract the metadata from local Python scripts, as mentioned in a previous Slack thread.

My understanding is that the Python REST emitter is a suitable option when you want a blocking interface for sending metadata events over HTTP, as described in the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/as-a-library.md|DataHub documentation>. This method is useful when simplicity, and confirmation that the metadata has been persisted to the DataHub metadata store, matter more than metadata emission throughput.

Here is an example of how to create a dataset schema and emit it to DataHub using the Python REST emitter, based on a <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_schema.py|GitHub example>:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    GlobalTagsClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
    TagAssociationClass,
)

# Create a MetadataChangeProposalWrapper object
event = MetadataChangeProposalWrapper(
    entityUrn=builder.make_dataset_urn(platform="hive", name="foodb.barTable", env="PROD"),
    aspect=SchemaMetadataClass(
        schemaName="customer",
        platform=builder.make_data_platform_urn("hive"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema="__insert raw schema here__"),
        fields=[
            SchemaFieldClass(
                fieldPath="address.zipcode",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(100)",
                nullable=True,
                description="This is the zipcode of the address. Specified using extended form and limited to addresses in the United States",
                globalTags=GlobalTagsClass(
                    tags=[TagAssociationClass(tag=builder.make_tag_urn("location"))]
                ),
            )
        ],
    ),
)

# Create the REST emitter and send the event (emit() blocks until the server acknowledges the write)
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
rest_emitter.emit(event)

This example shows how to create a dataset schema and emit it to DataHub using the REST emitter. You can modify the dataset_urn, schema_fields, and other parameters as needed for your specific use case. For more details on using the Python emitter for metadata ingestion, see the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/as-a-library.md|DataHub documentation on using the Python emitter>.

Would you like more details on how to implement an emitter for an entire database structure, including tables and documents?

Why did you answer in Spanish, <@U06TM7M40RK>? Please provide an answer in English.

I’m here to assist with questions related to DataHub. If you have any questions about DataHub, feel free to ask, and I’ll be happy to help!

Thank you. <@U06TM7M40RK>, please provide the same answer that you just gave at the beginning, but in English.

It seems likely that using the Python Emitter to ingest the entire database structure, including tables and documents, is a viable approach with DataHub. The Python Emitter allows you to programmatically emit metadata events to DataHub, which can be particularly useful for custom or complex ingestion scenarios, as mentioned in the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/as-a-library.md|DataHub Python Emitter Documentation>.

Here’s an example of how you can create a dataset schema and emit it to DataHub using the Python REST emitter, based on the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_schema.py|DataHub Example: Dataset Schema>:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    DateTypeClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

# Create a MetadataChangeProposalWrapper object
event = MetadataChangeProposalWrapper(
    entityUrn=builder.make_dataset_urn(platform="hive", name="food_db.sales", env="PROD"),
    aspect=SchemaMetadataClass(
        schemaName="customer",
        platform=builder.make_data_platform_urn("hive"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema="__insert raw schema here__"),
        fields=[
            SchemaFieldClass(
                fieldPath="address.zipcode",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(50)",
                description="This is the zipcode of the address. Specified using extended form and limited to addresses in the United States",
                lastModified=AuditStampClass(time=1640692800000, actor="urn:li:corpuser:ingestion")
            ),
            SchemaFieldClass(
                fieldPath="address.street",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(100)",
                description="Street corresponding to the address",
                lastModified=AuditStampClass(time=1640692800000, actor="urn:li:corpuser:ingestion")
            ),
            SchemaFieldClass(
                fieldPath="last_sold_date",
                type=SchemaFieldDataTypeClass(type=DateTypeClass()),
                nativeDataType="Date",
                description="Date of the last sale date for this property",
                lastModified=AuditStampClass(time=1640692800000, actor="urn:li:corpuser:ingestion")
            )
        ]
    )
)

# Initialize the DataHub REST emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(event)

This code snippet demonstrates how to create a dataset schema and emit it to DataHub using the REST emitter. You can modify the dataset_urn, schema_fields, and other parameters as needed for your specific use case.
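
If you also want to capture table-level documentation in addition to the column schema, you can emit a second aspect against the same dataset URN. The snippet below is a minimal sketch rather than an official example: it assumes the same local GMS endpoint and the hypothetical food_db.sales dataset from the snippet above, and uses the DatasetPropertiesClass aspect to attach a description and a couple of illustrative custom properties:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# URN of the same (hypothetical) table used in the schema example above
dataset_urn = builder.make_dataset_urn(platform="hive", name="food_db.sales", env="PROD")

# DatasetProperties carries the table-level description plus free-form key/value properties
properties_event = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DatasetPropertiesClass(
        description="Sales records, one row per property sale.",
        customProperties={"owner_team": "analytics", "source_system": "hive"},  # illustrative values
    ),
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(properties_event)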

It seems likely that using the Python Emitter is a suitable approach for ingesting the entire database structure, including tables and documents, directly from code without connecting to the actual data source. This can be done by creating and sending Metadata Change Proposals (MCPs) directly to DataHub, as mentioned in the <https://datahubproject.io/docs/0.13.1/metadata-ingestion/as-a-library|DataHub Metadata Ingestion as a Library> guide.
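
As a rough sketch of what "emitting the whole database structure" could look like in code, and assuming you have already extracted the table and column definitions yourself (here represented by a hypothetical TABLES dictionary), you can loop over the tables and emit one SchemaMetadata aspect per table using the same building blocks as above:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    NumberTypeClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

# Hypothetical, hand-extracted description of the database: table -> list of (column, native type)
TABLES = {
    "food_db.sales": [("sale_id", "BIGINT"), ("address", "VARCHAR(200)")],
    "food_db.customers": [("customer_id", "BIGINT"), ("name", "VARCHAR(100)")],
}

def to_field_type(native_type: str) -> SchemaFieldDataTypeClass:
    # Very small native-type -> DataHub-type mapping; extend as needed for your source
    if native_type.startswith(("BIGINT", "INT", "DECIMAL")):
        return SchemaFieldDataTypeClass(type=NumberTypeClass())
    return SchemaFieldDataTypeClass(type=StringTypeClass())

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

for table_name, columns in TABLES.items():
    event = MetadataChangeProposalWrapper(
        entityUrn=builder.make_dataset_urn(platform="hive", name=table_name, env="PROD"),
        aspect=SchemaMetadataClass(
            schemaName=table_name,
            platform=builder.make_data_platform_urn("hive"),
            version=0,
            hash="",
            platformSchema=OtherSchemaClass(rawSchema=""),
            fields=[
                SchemaFieldClass(
                    fieldPath=column_name,
                    type=to_field_type(native_type),
                    nativeDataType=native_type,
                )
                for column_name, native_type in columns
            ],
        ),
    )
    emitter.emit(event)

For a real database you would typically populate something like TABLES from information_schema or a driver-level API, or simply use one of DataHub's built-in source connectors (configured through the UI or CLI), which perform this kind of extraction and emission for you at scale.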

Would you like more detailed steps or additional examples?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps!