Troubleshooting JSON Schema Ingestion Error with DataHub CLI

Original Slack Thread

<@U06TM7M40RK> Hello Team
While evaluating DataHub, I’m trying to ingest JSON schemas using the CLI. My recipe file is as follows:
pipeline_name: json_schema_ingestion
source:
  type: json-schema
  config:
    path: https://json.schemastore.org/petstore-v1.0.json # e.g. https://json.schemastore.org/petstore-v1.0.json
    platform: SchemaRegistry # e.g. schemaregistry
    # platform_instance:
    stateful_ingestion:
      enabled: true # recommended to have this turned on

Error log while executing the command:
datahub ingest -c
I’m getting the error shown in the attached log file.
Could you please help me figure out the error?

To troubleshoot the JSON schema ingestion with the DataHub CLI, let’s go through the recipe configuration, the common issues, and the error log in turn.

Recipe File Configuration
Your recipe file seems to be correctly formatted for ingesting JSON schemas. Here is a quick review of the key components:

pipeline_name: json_schema_ingestion
source:
  type: json-schema
  config:
    path: https://json.schemastore.org/petstore-v1.0.json # e.g. https://json.schemastore.org/petstore-v1.0.json
    platform: SchemaRegistry # e.g. schemaregistry
    # platform_instance: <add a platform_instance if there are multiple schema repositories>
    stateful_ingestion:
      enabled: true # recommended to have this turned on

Common Issues and Solutions

  1. Path Configuration: Ensure that the path field is correctly pointing to a valid JSON schema URL. The URL should be accessible and return a valid JSON schema.

  2. Platform Field: According to the JSON Schema Ingestion Documentation (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/json-schema/json-schema.md), the platform field is mandatory. Ensure that the platform name is correctly specified and registered in DataHub.

  3. Stateful Ingestion: The stateful_ingestion configuration is optional but recommended. Ensure that it is correctly configured.
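
The first two checks above can be sketched in Python. This is only an illustration, not DataHub's own validation: it assumes the recipe has already been parsed into a dictionary (for example by a YAML loader), and the helper name check_recipe is hypothetical.

```python
from typing import Any, Dict, List


def check_recipe(recipe: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed json-schema recipe.

    Hypothetical helper for illustration; it only looks at the fields
    discussed in this thread, not everything DataHub validates.
    """
    problems: List[str] = []
    source = recipe.get("source") or {}
    if source.get("type") != "json-schema":
        problems.append("source.type should be 'json-schema'")
    config = source.get("config") or {}
    for field in ("path", "platform"):
        if not config.get(field):
            problems.append(f"source.config.{field} is missing or empty")
    return problems


# The recipe from this thread, written as the dictionary a YAML parser would produce.
recipe = {
    "pipeline_name": "json_schema_ingestion",
    "source": {
        "type": "json-schema",
        "config": {
            "path": "https://json.schemastore.org/petstore-v1.0.json",
            "platform": "SchemaRegistry",
            "stateful_ingestion": {"enabled": True},
        },
    },
}

print(check_recipe(recipe))  # prints []
```

An empty list means the fields covered here are present; it says nothing about whether the platform is registered in DataHub or whether the URL is reachable.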

Error Log Analysis
The error log you provided seems to be an HTML document rather than a JSON or plain text log. This might indicate that the URL provided in the path field is not returning the expected JSON schema but rather an HTML page. This can happen if the URL is incorrect or if there is a redirection happening.

Steps to Resolve

  1. Verify URL: Ensure that the URL in the path field is correct and accessible. You can test this by opening the URL in a web browser to see if it returns the expected JSON schema.

  2. Check Platform Registration: Ensure that the platform specified in the platform field is registered in DataHub. You can use the datahub put platform command to register your custom platform if needed.

  3. Review Logs: If the error persists, review the DataHub CLI logs for more detailed error messages. Ensure that the logs are in a readable format (JSON or plain text) to help identify the issue.
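
As a rough alternative to opening the URL in a browser (step 1), the following Python sketch fetches the URL and reports whether the body looks like a JSON object or an HTML page. It uses only the standard library; the function names classify_body and check_schema_url are hypothetical, and the HTML detection is a simple heuristic rather than a full content check.

```python
import json
import urllib.request


def classify_body(body: bytes) -> str:
    """Classify a response body as 'json-object', 'html', or 'other'."""
    text = body.decode("utf-8", errors="replace").lstrip()
    # Heuristic: an HTML page normally starts with a doctype or <html> tag.
    if text.lower().startswith(("<!doctype html", "<html")):
        return "html"
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return "other"
    # A JSON schema document is normally a JSON object at the top level.
    return "json-object" if isinstance(parsed, dict) else "other"


def check_schema_url(url: str, timeout: float = 10.0) -> str:
    """Fetch url (urllib follows redirects) and classify what it returns."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return classify_body(response.read())


# Example call (requires network access):
# check_schema_url("https://json.schemastore.org/petstore-v1.0.json")
```

A result of 'html' would point to a wrong URL or a redirect to a web page; 'json-object' means the path at least returns parseable JSON.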

Example Command
Here is an example command to run the ingestion:

datahub ingest -c <path_to_recipe_file>

Additional Help
If you need more detailed guidance or if the issue persists, please provide the exact error message from the DataHub CLI logs in a readable format.

Would you like more detailed guidance on any specific part of the process?

The error log file is provided in txt format. The error is as follows: {datahub.entrypoints:218} - Command failed: Failed to configure the source (json-schema): Subscripted generics cannot be used with class and instance checks
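
For context, the last part of this message is a standard Python TypeError raised by the typing module when isinstance() or issubclass() is given a subscripted generic such as List[int]. The snippet below reproduces the message in isolation. It is not DataHub code, and where the json-schema source triggers it is not established in this thread; it only suggests that the failure may originate in Python code while the source is being configured, rather than in the content returned by the URL.

```python
from typing import List


def reproduce_message() -> str:
    """Trigger the TypeError and return its message."""
    try:
        # isinstance() accepts the bare List alias, but not the
        # subscripted form List[int].
        isinstance([1, 2, 3], List[int])
    except TypeError as exc:
        return str(exc)
    return "no error raised"


print(reproduce_message())
```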

<@U07FV3UHE4D> Please help

Using Python version 3.8.10 also gives the same error.