Error with 'AvroException' Attribute in DataHub Metadata Ingestion

Original Slack Thread

'AvroException' not found in 'avro.schema' module during metadata ingestion

Description:
I encountered an error while using the DataHub metadata-ingestion framework (version 0.14.0.2). The pipeline execution fails due to an AttributeError, specifically that the 'avro.schema' module has no attribute 'AvroException'.

Steps to Reproduce:

  1. Set up a metadata ingestion pipeline using the File source.
  2. Attempt to run the pipeline with a JSON file containing metadata.
  3. The pipeline fails with the following error:

Error Message:
PipelineExecutionError: ('Source reported errors', FileSourceReport(…))

Traceback:
Traceback (most recent call last):
File "…/batch_pipeline_ingest.py", line 108, in <module>
run_pipeline(config)
File "…/batch_pipeline_ingest.py", line 103, in run_pipeline
pipeline.raise_from_status()
File "…/datahub/ingestion/run/pipeline.py", line 594, in raise_from_status
raise PipelineExecutionError(
datahub.configuration.common.PipelineExecutionError: ('Source reported errors', FileSourceReport(…))

The FileSourceReport contains multiple entries with the same error:
"module 'avro.schema' has no attribute 'AvroException'"

Environment:

  • Operating System: WSL2 (Ubuntu 22.04)
  • Python Version: 3.10
  • DataHub Version: 0.14.0.2
  • Relevant package versions:
    • avro: [version]
    • [any other relevant packages and their versions]

Expected Behavior:
The pipeline should successfully process the metadata JSON file without raising an AttributeError related to 'AvroException'.

Actual Behavior:
The pipeline fails with an AttributeError stating that the 'avro.schema' module has no attribute 'AvroException'.

Additional Context:
This error occurs consistently across multiple runs and affects the processing of various metadata entries in the JSON file.

Possible Related Issues:

  • Is there a version mismatch between the avro library and the version expected by DataHub?
  • Has there been a recent change in the avro library that removed or renamed 'AvroException'?

I would appreciate any insights or suggestions on how to resolve this issue. Let me know if you need any additional information or if there are any specific diagnostic steps I should take.


<@U06TM7M40RK> Can you help with the questions in this thread?


It looks like you’re encountering an issue with the avro.schema module while running a DataHub metadata ingestion pipeline. Specifically, the error indicates that the AvroException attribute is not found in the avro.schema module. This issue is likely related to a version mismatch or a change in the avro library.

Steps to Resolve

  1. Check Avro Library Version:
    Ensure that you are using a compatible version of the avro library. The AvroException class may have been removed from avro.schema or moved to a different module (for example, avro.errors) in newer versions of the library, so code that still references avro.schema.AvroException fails with this AttributeError.

  2. Install a Compatible Version:
    You can try installing a specific version of the avro library that includes the AvroException class. For example, version 1.11.0 is known to work with DataHub.

    pip install avro==1.11.0
    
  3. Reinstall DataHub:
    After installing the compatible version of the avro library, reinstall DataHub to ensure that all dependencies are correctly set up.

    pip uninstall acryl-datahub
    pip install acryl-datahub
    

Example Commands

Here are the commands you can run to address the issue:

# Uninstall existing Avro installation
pip uninstall avro

# Install a specific version of Avro
pip install avro==1.11.0

# Reinstall DataHub
pip uninstall acryl-datahub
pip install acryl-datahub
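
Before and after changing versions, a quick check of where AvroException actually lives can confirm the mismatch. This is a small diagnostic sketch, not part of DataHub; it assumes the installed PyPI distribution is named avro and that newer releases expose exceptions via avro.errors.

# Diagnostic sketch: report the installed avro version and where
# AvroException is exposed (avro.schema in older releases, avro.errors in newer ones).
import importlib.metadata

import avro.schema

print("avro version:", importlib.metadata.version("avro"))
print("avro.schema has AvroException:", hasattr(avro.schema, "AvroException"))

try:
    import avro.errors
    print("avro.errors has AvroException:", hasattr(avro.errors, "AvroException"))
except ImportError:
    print("avro.errors is not available (older avro release)")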

Additional Information

  • Compatibility: The DataHub team is actively working on making the platform compatible with various versions of its dependencies. However, as of now, the safest approach is to use the versions known to work.
  • Documentation: Refer to the <https://datahubproject.io/docs/cli|DataHub documentation> for any additional configuration details.

Would you like more detailed steps on any of these troubleshooting points or further assistance on a specific part of the process?


<@U06TM7M40RK> When ingesting a metadata file named metadata.json into GMS with the source type set to file, what structures should the metadata.json file include for datasetProperties and schemaMetadata?

To ingest a metadata file named metadata.json into GMS with the source type set to file, the metadata.json file should include the following structures for datasetProperties and schemaMetadata.

Example metadata.json Structure

Here is an example of how you can structure your metadata.json file to include datasetProperties and schemaMetadata:

[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "description": "This is a sample dataset",
      "customProperties": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "schemaName": "my-dataset-schema",
      "platform": "urn:li:dataPlatform:mysql",
      "version": 0,
      "created": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "lastModified": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "fields": [
        {
          "fieldPath": "field1",
          "nativeDataType": "VARCHAR",
          "type": {
            "type": "string"
          },
          "description": "This is field1"
        },
        {
          "fieldPath": "field2",
          "nativeDataType": "INT",
          "type": {
            "type": "number"
          },
          "description": "This is field2"
        }
      ]
    }
  }
]

Explanation

  1. datasetProperties:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is datasetProperties.
    • aspect: Contains the properties of the dataset, such as description and customProperties.
  2. schemaMetadata:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is schemaMetadata.
    • aspect: Contains the schema metadata, including schemaName, platform, version, created, lastModified, and fields.

Ingestion Recipe

To ingest this metadata.json file, you can use the following YAML configuration:

source:
  type: file
  config:
    filename: "./metadata.json"
    format: "json"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion

Use the DataHub CLI to run the ingestion command with the created recipe:

datahub ingest -c your-ingestion-recipe.yaml
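
The traceback in this thread comes from a script that drives the pipeline from Python rather than from the CLI. A rough equivalent, sketched with DataHub's programmatic Pipeline API (the filename, server URL, and recipe contents are placeholders, not the exact script from the thread), looks like this:

from datahub.ingestion.run.pipeline import Pipeline

# Build the same file-source recipe as a dict and run it programmatically.
config = {
    "source": {
        "type": "file",
        "config": {"filename": "./metadata.json"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},  # placeholder URL
    },
}

pipeline = Pipeline.create(config)
pipeline.run()
pipeline.pretty_print_summary()
pipeline.raise_from_status()  # raises PipelineExecutionError if the source reported errors

raise_from_status() is the call that surfaces the "Source reported errors" exception shown in the traceback, so once the avro issue is resolved this kind of script should complete cleanly.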

Would you like more detailed steps or additional examples on any specific part of this process?


<@U06TM7M40RK> I'm not using S3. It's a local metadata.json file.

Understood. If you are using a local metadata.json file and not S3, you can follow these steps to ingest the metadata into DataHub using the file source type.

Example metadata.json Structure

Ensure your metadata.json file includes the necessary structures for datasetProperties and schemaMetadata. Here is an example:

[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "description": "This is a sample dataset",
      "customProperties": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "schemaName": "my-dataset-schema",
      "platform": "urn:li:dataPlatform:mysql",
      "version": 0,
      "created": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "lastModified": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "fields": [
        {
          "fieldPath": "field1",
          "nativeDataType": "VARCHAR",
          "type": {
            "type": "string"
          },
          "description": "This is field1"
        },
        {
          "fieldPath": "field2",
          "nativeDataType": "INT",
          "type": {
            "type": "number"
          },
          "description": "This is field2"
        }
      ]
    }
  }
]

Ingestion Recipe

Create a YAML file (e.g., ingestion_recipe.yaml) with the following content to define the ingestion process:

source:
  type: file
  config:
    filename: "./metadata.json"
    format: "json"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion

Use the DataHub CLI to run the ingestion command with the created recipe:

datahub ingest -c ingestion_recipe.yaml

Detailed Steps

  1. Prepare the metadata.json File:

    • Ensure your metadata.json file is structured correctly as shown in the example above.
  2. Create the Ingestion Recipe:

    • Save the YAML configuration in a file named ingestion_recipe.yaml.
  3. Run the Ingestion Command:

    • Execute the following command in your terminal to start the ingestion process:
    datahub ingest -c ingestion_recipe.yaml
    

This command will read the metadata from the metadata.json file and ingest it into your DataHub instance.
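
If ingestion still fails, it can help to first rule out a malformed file. A minimal sanity check in Python (assuming metadata.json sits in the current directory and is a top-level JSON array of aspect entries, as in the example above):

import json

# Parse metadata.json and list the entity URNs and aspects it contains.
with open("metadata.json") as f:
    entries = json.load(f)

assert isinstance(entries, list), "expected a top-level JSON array"
for entry in entries:
    print(entry["entityUrn"], "->", entry["aspectName"])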

Would you like more detailed steps or additional examples on any specific part of this process?


<@U06TM7M40RK> please, show me example.json

Sure! Below is an example of a metadata.json file that includes both datasetProperties and schemaMetadata aspects. This example is designed to be ingested into DataHub using the file source type.

Example metadata.json

[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "description": "This is a sample dataset",
      "customProperties": {
        "key1": "value1",
        "key2": "value2"
      }
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "schemaName": "my-dataset-schema",
      "platform": "urn:li:dataPlatform:mysql",
      "version": 0,
      "created": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "lastModified": {
        "time": 1627847260000,
        "actor": "urn:li:corpuser:datahub"
      },
      "fields": [
        {
          "fieldPath": "field1",
          "nativeDataType": "VARCHAR",
          "type": {
            "type": "string"
          },
          "description": "This is field1",
          "nullable": true,
          "isPartOfKey": false
        },
        {
          "fieldPath": "field2",
          "nativeDataType": "INT",
          "type": {
            "type": "number"
          },
          "description": "This is field2",
          "nullable": false,
          "isPartOfKey": true
        }
      ]
    }
  }
]

Explanation

  1. datasetProperties:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is datasetProperties.
    • aspect: Contains the properties of the dataset, such as description and customProperties.
  2. schemaMetadata:

    • entityType: Specifies the type of entity, which is dataset in this case.
    • entityUrn: The unique identifier for the dataset.
    • changeType: The type of change, which is UPSERT to update or insert the record.
    • aspectName: The name of the aspect, which is schemaMetadata.
    • aspect: Contains the schema metadata, including schemaName, platform, version, created, lastModified, and fields.

Ingestion Recipe

Create a YAML file (e.g., ingestion_recipe.yaml) with the following content to define the ingestion process:

source:
  type: file
  config:
    filename: "./metadata.json"
    format: "json"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required

Running the Ingestion

Use the DataHub CLI to run the ingestion command with the created recipe:

datahub ingest -c ingestion_recipe.yaml

This command will read the metadata from the metadata.json file and ingest it into your DataHub instance.
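
As an alternative to the file source, the same aspects can be emitted straight from Python with the DataHub REST emitter, skipping the intermediate JSON file. This is a sketch covering only the datasetProperties aspect; the server URL and token are placeholders for your environment:

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Emit the datasetProperties aspect for the example dataset URN.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080", token=None)

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:mysql,my-dataset,PROD)",
    aspect=DatasetPropertiesClass(
        description="This is a sample dataset",
        customProperties={"key1": "value1", "key2": "value2"},
    ),
)
emitter.emit(mcp)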

Would you like more detailed steps or additional examples on any specific part of this process?