Error with 'AvroException' Attribute in DataHub Metadata Ingestion

Original Slack Thread

'AvroException' not found in 'avro.schema' module during metadata ingestion

Description:
I encountered an error while using the DataHub metadata-ingestion framework (version 0.14.0.2). The pipeline execution fails due to an AttributeError, specifically that the 'avro.schema' module has no attribute 'AvroException'.

Steps to Reproduce:

  1. Set up a metadata ingestion pipeline using the File source.
  2. Attempt to run the pipeline with a JSON file containing metadata.
  3. The pipeline fails with the following error:

Error Message:
PipelineExecutionError: ('Source reported errors', FileSourceReport(…))

Traceback:
Traceback (most recent call last):
File "…/batch_pipeline_ingest.py", line 108, in <module>
run_pipeline(config)
File "…/batch_pipeline_ingest.py", line 103, in run_pipeline
pipeline.raise_from_status()
File "…/datahub/ingestion/run/pipeline.py", line 594, in raise_from_status
raise PipelineExecutionError(
datahub.configuration.common.PipelineExecutionError: ('Source reported errors', FileSourceReport(…))

The FileSourceReport contains multiple entries with the same error:
"module 'avro.schema' has no attribute 'AvroException'"

Environment:

  • Operating System: WSL2 (Ubuntu 22.04)
  • Python Version: 3.10
  • DataHub Version: 0.14.0.2
  • Relevant package versions:
    • avro: [version]
    • [any other relevant packages and their versions]

Expected Behavior:
The pipeline should successfully process the metadata JSON file without raising an AttributeError related to 'AvroException'.

Actual Behavior:
The pipeline fails with an AttributeError stating that the 'avro.schema' module has no attribute 'AvroException'.

Additional Context:
This error occurs consistently across multiple runs and affects the processing of various metadata entries in the JSON file.

Possible Related Issues:

  • Is there a version mismatch between the avro library and the version expected by DataHub?
  • Has there been a recent change in the avro library that removed or renamed 'AvroException'?

I would appreciate any insights or suggestions on how to resolve this issue. Let me know if you need any additional information or if there are any specific diagnostic steps I should take.


<@U06TM7M40RK> Can you help with the questions in this thread?


It looks like you’re encountering an issue with the avro.schema module while running a DataHub metadata ingestion pipeline. Specifically, the error indicates that the AvroException attribute is not found in the avro.schema module. This issue is likely related to a version mismatch or a change in the avro library.

Steps to Resolve

  1. Check Avro Library Version:
    Ensure that you are using a compatible version of the avro library. The AvroException class may have been removed from avro.schema or moved to a different module (for example, avro.errors) in newer versions of the library, so code that still references avro.schema.AvroException fails with this AttributeError.

  2. Install a Compatible Version:
    You can try installing a specific version of the avro library that includes the AvroException class. For example, version 1.11.0 is known to work with DataHub.

    pip install avro==1.11.0
    
  3. Reinstall DataHub:
    After installing the compatible version of the avro library, reinstall DataHub to ensure that all dependencies are correctly set up.

    pip uninstall acryl-datahub
    pip install acryl-datahub
    

Example Commands

Here are the commands you can run to address the issue:

# Uninstall existing Avro installation
pip uninstall avro

# Install a specific version of Avro
pip install avro==1.11.0

# Reinstall DataHub
pip uninstall acryl-datahub
pip install acryl-datahub
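
Before and after changing versions, a quick check of where AvroException actually lives can confirm the mismatch. This is a small diagnostic sketch, not part of DataHub; it assumes the installed PyPI distribution is named avro and that newer releases expose exceptions via avro.errors.

# Diagnostic sketch: report the installed avro version and where
# AvroException is exposed (avro.schema in older releases, avro.errors in newer ones).
import importlib.metadata

import avro.schema

print("avro version:", importlib.metadata.version("avro"))
print("avro.schema has AvroException:", hasattr(avro.schema, "AvroException"))

try:
    import avro.errors
    print("avro.errors has AvroException:", hasattr(avro.errors, "AvroException"))
except ImportError:
    print("avro.errors is not available (older avro release)")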

Additional Information

  • Compatibility: The DataHub team is actively working on making the platform compatible with various versions of its dependencies. However, as of now, the safest approach is to use the versions known to work.
  • Documentation: Refer to the <https://datahubproject.io/docs/cli|DataHub documentation> for any additional configuration details.

Would you like more detailed steps on any of these troubleshooting points or further assistance on a specific part of the process?


<@U06TM7M40RK> When ingesting a metadata file named metadata.json into GMS with the source type set to file, what structures should the metadata.json file include for datasetProperties and schemaMetadata?

To ingest a metadata file named metadata.json into GMS with the source type set to file, the metadata.json file should include the following structures for datasetProperties and schemaMetadata.

Example metadata.json Structure

Here is an example of how you can structure your metadata.json file to include datasetProperties and schemaMetadata:

[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "description": "This is a sample dataset",
      "customProperties": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "schemaName": "my-dataset-schema",
      "platform": "urn:li:dataPlatform:mysql",
      "version": 0,
      "created": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "lastModified": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "fields": [
        {
          "fieldPath": "field1",
          "nativeDataType": "VARCHAR",
          "type": {
            "type": "string"
          },
          "description": "This is field1"
        },
        {
          "fieldPath": "field2",
          "nativeDataType": "INT",
          "type": {
            "type": "number"
          },
          "description": "This is field2"
        }
      ]
    }
  }
]

Explanation

  1. datasetProperties:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is datasetProperties.
    • aspect: Contains the properties of the dataset, such as description and customProperties.
  2. schemaMetadata:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is schemaMetadata.
    • aspect: Contains the schema metadata, including schemaName, platform, version, created, lastModified, and fields.

Ingestion Recipe

To ingest this metadata.json file, you can use the following YAML configuration:

source:
  type: file
  config:
    filename: "./metadata.json"
    format: "json"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion

Use the DataHub CLI to run the ingestion command with the created recipe:

datahub ingest -c your-ingestion-recipe.yaml
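
The traceback in this thread comes from a script that drives the pipeline from Python rather than from the CLI. A rough equivalent, sketched with DataHub's programmatic Pipeline API (the filename, server URL, and recipe contents are placeholders, not the exact script from the thread), looks like this:

from datahub.ingestion.run.pipeline import Pipeline

# Build the same file-source recipe as a dict and run it programmatically.
config = {
    "source": {
        "type": "file",
        "config": {"filename": "./metadata.json"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},  # placeholder URL
    },
}

pipeline = Pipeline.create(config)
pipeline.run()
pipeline.pretty_print_summary()
pipeline.raise_from_status()  # raises PipelineExecutionError if the source reported errors

raise_from_status() is the call that surfaces the "Source reported errors" exception shown in the traceback, so once the avro issue is resolved this kind of script should complete cleanly.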

Would you like more detailed steps or additional examples on any specific part of this process?


<@U06TM7M40RK> I'm not using S3. It's a local metadata.json file.

Understood. If you are using a local metadata.json file and not S3, you can follow these steps to ingest the metadata into DataHub using the file source type.

Example metadata.json Structure

Ensure your metadata.json file includes the necessary structures for datasetProperties and schemaMetadata. Here is an example:

[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "description": "This is a sample dataset",
      "customProperties": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "schemaName": "my-dataset-schema",
      "platform": "urn:li:dataPlatform:mysql",
      "version": 0,
      "created": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "lastModified": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "fields": [
        {
          "fieldPath": "field1",
          "nativeDataType": "VARCHAR",
          "type": {
            "type": "string"
          },
          "description": "This is field1"
        },
        {
          "fieldPath": "field2",
          "nativeDataType": "INT",
          "type": {
            "type": "number"
          },
          "description": "This is field2"
        }
      ]
    }
  }
]

Ingestion Recipe

Create a YAML file (e.g., ingestion_recipe.yaml) with the following content to define the ingestion process:

source:
  type: file
  config:
    filename: "./metadata.json"
    format: "json"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion

Use the DataHub CLI to run the ingestion command with the created recipe:

datahub ingest -c ingestion_recipe.yaml

Detailed Steps

  1. Prepare the metadata.json File:

    • Ensure your metadata.json file is structured correctly as shown in the example above.
  2. Create the Ingestion Recipe:

    • Save the YAML configuration in a file named ingestion_recipe.yaml.
  3. Run the Ingestion Command:

    • Execute the following command in your terminal to start the ingestion process:
    datahub ingest -c ingestion_recipe.yaml
    

This command will read the metadata from the metadata.json file and ingest it into your DataHub instance.
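
If ingestion still fails, it can help to first rule out a malformed file. A minimal sanity check in Python (assuming metadata.json sits in the current directory and is a top-level JSON array of aspect entries, as in the example above):

import json

# Parse metadata.json and list the entity URNs and aspects it contains.
with open("metadata.json") as f:
    entries = json.load(f)

assert isinstance(entries, list), "expected a top-level JSON array"
for entry in entries:
    print(entry["entityUrn"], "->", entry["aspectName"])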

Would you like more detailed steps or additional examples on any specific part of this process?


<@U06TM7M40RK> please, show me example.json

Sure! Below is an example of a metadata.json file that includes both datasetProperties and schemaMetadata aspects. This example is designed to be ingested into DataHub using the file source type.

Example metadata.json

[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "description": "This is a sample dataset",
      "customProperties": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "schemaName": "my-dataset-schema",
      "platform": "urn:li:dataPlatform:mysql",
      "version": 0,
      "created": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "lastModified": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "fields": [
        {
          "fieldPath": "field1",
          "nativeDataType": "VARCHAR",
          "type": {
            "type": "string"
          },
          "description": "This is field1",
          "nullable": true,
          "isPartOfKey": false
        },
        {
          "fieldPath": "field2",
          "nativeDataType": "INT",
          "type": {
            "type": "number"
          },
          "description": "This is field2",
          "nullable": false,
          "isPartOfKey": true
        }
      ]
    }
  }
]

Explanation

  1. datasetProperties:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is datasetProperties.
    • aspect: Contains the properties of the dataset, such as description and customProperties.
  2. schemaMetadata:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is schemaMetadata.
    • aspect: Contains the schema metadata, including schemaName, platform, version, created, lastModified, and fields.

Ingestion Recipe

Create a YAML file (e.g., ingestion_recipe.yaml) with the following content to define the ingestion process:

source:
  type: file
  config:
    filename: "./metadata.json"
    format: "json"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion

Use the DataHub CLI to run the ingestion command with the created recipe:

datahub ingest -c ingestion_recipe.yaml

This command will read the metadata from the metadata.json file and ingest it into your DataHub instance.
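
As an alternative to the file source, the same aspects can be emitted straight from Python with the DataHub REST emitter, skipping the intermediate JSON file. This is a sketch covering only the datasetProperties aspect; the server URL and token are placeholders for your environment:

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Emit the datasetProperties aspect for the example dataset URN.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080", token=None)

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    aspect=DatasetPropertiesClass(
        description="This is a sample dataset",
        customProperties={"key1": "value1", "key2": "value2"},
    ),
)
emitter.emit(mcp)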

Would you like more detailed steps or additional examples on any specific part of this process?