Troubleshooting Iceberg Source Recipe for Ingestion via Glue Catalog

Original Slack Thread

Hi
Can someone please help with the source recipe for Iceberg? I have my Iceberg tables in S3 and Glue as the catalog registry, and I am having difficulty writing a correct source recipe.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?



Sure, I can help with that! To provide the most accurate assistance, could you please answer the following questions?

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub (e.g., BigQuery)?

This information will help tailor the solution to your specific setup.

<@U06TM7M40RK>
I have my Iceberg tables in S3 and Glue as the catalog registry. Could you please help with the ingestion recipe for Iceberg?

Certainly! Based on your setup where you have Iceberg tables in S3 and are using AWS Glue as the catalog registry, you can configure the DataHub ingestion recipe accordingly.

Here is an example of a YAML configuration for ingesting Iceberg tables from S3 with Glue as the catalog:

source:
  type: "iceberg"
  config:
    env: "PROD"
    catalog:
      name: "my_glue_catalog"
      type: "glue"
      config:
        aws_access_key_id: "<your_aws_access_key_id>"
        aws_secret_access_key: "<your_aws_secret_access_key>"
        aws_region: "<your_aws_region>"
        warehouse: "s3a://<your_s3_bucket>/warehouse/"
    platform_instance: "my_glue_catalog_instance"
    table_pattern:
      allow:
        - ".*"  # Adjust this pattern to match the tables you want to include
    profiling:
      enabled: true

sink:
  type: "datahub-rest"
  config:
    server: "<your_datahub_server_url>"

Explanation:

  • source.type: Specifies the source type as “iceberg”.
  • source.config.env: The environment for the assets (e.g., “PROD”).
  • source.config.catalog: Configuration for the Glue catalog.
    • name: Name of your Glue catalog.
    • type: Set to “glue” to indicate that Glue is being used as the catalog.
    • config: AWS configuration details.
      • aws_access_key_id: Your AWS access key ID.
      • aws_secret_access_key: Your AWS secret access key.
      • aws_region: The AWS region where your Glue catalog is located.
      • warehouse: The S3 path to your Iceberg warehouse.
  • platform_instance: The instance name for the platform.
  • table_pattern: Regex patterns to filter the tables to be ingested.
    • allow: List of regex patterns to include in ingestion.
  • profiling: Configuration for data profiling.
    • enabled: Set to true to enable profiling.
  • sink: Configuration for the sink where the ingested metadata will be sent.
    • type: Set to “datahub-rest”.
    • config.server: URL of your DataHub server.

Make sure to replace placeholders like <your_aws_access_key_id>, <your_aws_secret_access_key>, <your_aws_region>, <your_s3_bucket>, and <your_datahub_server_url> with your actual values.
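The table_pattern entries are regular expressions matched against the fully qualified table name (e.g. my_db.my_table). A simplified model of how allow/deny matching behaves (DataHub's actual logic lives in its AllowDenyPattern helper; this sketch is illustrative only):

```python
import re

def table_allowed(name: str, allow: list[str], deny: list[str] = ()) -> bool:
    """Simplified allow/deny check: a table is ingested when it matches
    at least one allow regex and no deny regex (re.match semantics,
    i.e. patterns are anchored at the start of the name)."""
    if any(re.match(pattern, name) for pattern in deny):
        return False
    return any(re.match(pattern, name) for pattern in allow)
```

With allow set to ".*", every table in the catalog is attempted, so tighten the allow list (e.g. "my_db\\..*") or add deny entries if the Glue catalog also holds tables you don't want ingested.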

Would you like more details on any specific part of the configuration?

<@U06TM7M40RK> with the given recipe I get the below error
ERROR {datahub.entrypoints:218} - Command failed: Property table_type missing, could not determine type: s3_access_logs_db.mybucket_logs

The error message indicates that PyIceberg could not classify the Glue table s3_access_logs_db.mybucket_logs as an Iceberg table. Glue stores Iceberg tables with a table_type parameter set to ICEBERG; a table without that parameter (such as one built over raw S3 access logs) fails the check, and with table_pattern.allow set to ".*" the source attempts every table in the catalog, Iceberg or not.

To address this, restrict table_pattern so that only your Iceberg tables are ingested. Also confirm that the Glue catalog and S3 paths are correctly specified and that the necessary permissions are in place.

Here’s an updated example recipe that excludes the non-Iceberg database:

source:
  type: "iceberg"
  config:
    env: "PROD"
    catalog:
      name: "my_glue_catalog"
      type: "glue"
      config:
        aws_access_key_id: "<your_aws_access_key_id>"
        aws_secret_access_key: "<your_aws_secret_access_key>"
        aws_region: "<your_aws_region>"
        warehouse: "s3a://<your_s3_bucket>/warehouse/"
    platform_instance: "my_glue_catalog_instance"
    table_pattern:
      allow:
        - ".*"  # Adjust this pattern to match the tables you want to include
      deny:
        - "s3_access_logs_db\\..*"  # Exclude non-Iceberg tables such as access logs
    profiling:
      enabled: true

sink:
  type: "datahub-rest"
  config:
    server: "<your_datahub_server_url>"

Steps to Troubleshoot:

  1. Verify Glue Catalog Configuration:
    Ensure that the Glue catalog name and AWS configurations are correct. Double-check that the Glue catalog contains the Iceberg tables.

  2. Check AWS Permissions:
    Make sure the AWS credentials provided have the necessary permissions to access the Glue catalog and the S3 bucket.

  3. Exclude Non-Iceberg Tables:
    A Glue catalog often contains non-Iceberg tables (for example, tables over raw S3 access logs), which lack the table_type=ICEBERG property PyIceberg looks for. Use table_pattern allow/deny entries to restrict ingestion to your Iceberg tables.

  4. Enable Debug Logging:
    Run the ingestion with debug logging enabled to get more detailed error messages. You can do this by passing the --debug flag to the datahub command:

    datahub --debug ingest -c your_recipe.yml
    

If the issue persists, please provide the debug logs for further analysis.

Would you like more detailed steps on any of these points?
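For reference, the check behind this error can be modeled in a few lines: PyIceberg's Glue support treats a Glue table as Iceberg only when its Glue Parameters include table_type=ICEBERG. This sketch is illustrative, not PyIceberg's actual internals:

```python
def is_iceberg_table(glue_table: dict) -> bool:
    """True when a Glue GetTable/GetTables entry carries the Iceberg
    marker (Parameters.table_type == "ICEBERG", case-insensitive)."""
    params = glue_table.get("Parameters") or {}
    return params.get("table_type", "").upper() == "ICEBERG"

def iceberg_table_names(glue_tables: list[dict]) -> list[str]:
    """Filter a list of Glue table entries down to Iceberg tables only."""
    return [t["Name"] for t in glue_tables if is_iceberg_table(t)]
```

A table like s3_access_logs_db.mybucket_logs (plain S3 access logs, no table_type parameter) fails this check, which is why excluding it via table_pattern avoids the error.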

<@U06TM7M40RK> please provide the ingestion recipe for Iceberg with the Iceberg tables on a local filesystem

Certainly! If you have Iceberg tables on a local filesystem, you can configure the DataHub ingestion recipe accordingly. Below is an example YAML configuration for ingesting Iceberg tables from a local filesystem:

source:
  type: "iceberg"
  config:
    env: "DEV"
    catalog:
      name: "local_catalog"
      type: "rest"
      config:
        uri: "http://localhost:8181"
        warehouse: "file:///path/to/your/warehouse/"
    platform_instance: "local_iceberg_instance"
    table_pattern:
      allow:
        - ".*"  # Adjust this pattern to match the tables you want to include
    profiling:
      enabled: true

sink:
  type: "datahub-rest"
  config:
    server: "<your_datahub_server_url>"

Explanation:

  • source.type: Specifies the source type as “iceberg”.
  • source.config.env: The environment for the assets (e.g., “DEV”).
  • source.config.catalog: Configuration for the local catalog.
    • name: Name of your local catalog.
    • type: Set to “rest” for a REST-based catalog.
    • config: Configuration details for the catalog.
      • uri: URI of your local Iceberg catalog service.
      • warehouse: Path to your local Iceberg warehouse.
  • platform_instance: The instance name for the platform.
  • table_pattern: Regex patterns to filter the tables to be ingested.
    • allow: List of regex patterns to include in ingestion.
  • profiling: Configuration for data profiling.
    • enabled: Set to true to enable profiling.
  • sink: Configuration for the sink where the ingested metadata will be sent.
    • type: Set to “datahub-rest”.
    • config.server: URL of your DataHub server.

Steps:

  1. Install the DataHub Iceberg Plugin:
    Ensure you have the DataHub Iceberg plugin installed:

    pip install 'acryl-datahub[iceberg]'
    
  2. Run the Ingestion:
    Execute the ingestion command with your YAML configuration:

    datahub ingest -c your_recipe.yml
    

Make sure to replace placeholders like <your_datahub_server_url> with your actual values.
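Before running the ingestion, you can also sanity-check that the file:// warehouse URI in the recipe points at an existing directory. A small stdlib sketch (the URI is whatever you configured; function names are illustrative):

```python
from pathlib import Path
from urllib.parse import urlparse

def local_warehouse_path(uri: str) -> Path:
    """Convert a file:// warehouse URI to a local Path, rejecting
    non-local schemes so misconfigurations fail fast."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// warehouse URI, got: {uri}")
    return Path(parsed.path)

def warehouse_exists(uri: str) -> bool:
    """True when the configured local warehouse directory exists."""
    return local_warehouse_path(uri).is_dir()
```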

Would you like more details on any specific part of the configuration?

<@U06TM7M40RK>
'failures': {'get-catalog': ["Failed to get catalog: HTTPConnectionPool(host='localhost', port=8181): Max retries exceeded with url: /v1/config?warehouse=file%3A%2F%2F%2FUsers%2Fdivyab%2FDesktop%2Fwarehouse%2F (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1093fff40>: Failed to establish a new connection: [Errno 61] Connection refused'))"]}

The error message indicates that the ingestion job is unable to connect to the Iceberg REST catalog service running on localhost:8181. This usually means that the Iceberg REST catalog service is not running or is not accessible at the specified URL.

Steps to Resolve:

  1. Ensure Iceberg REST Catalog Service is Running:
    Make sure that the Iceberg REST catalog service is running and accessible on localhost:8181. You can start the Iceberg REST catalog service using the appropriate command or configuration.

  2. Verify the URI Configuration:
    Double-check the uri configuration in your YAML file to ensure it points to the correct URL where the Iceberg REST catalog service is running.

Example Configuration:
Here’s a revised example configuration assuming you have the Iceberg REST catalog service running locally:

source:
  type: "iceberg"
  config:
    env: "DEV"
    catalog:
      name: "local_catalog"
      type: "rest"
      config:
        uri: "http://localhost:8181"  # Ensure this is the correct URL
        warehouse: "file:///path/to/your/warehouse/"
    platform_instance: "local_iceberg_instance"
    table_pattern:
      allow:
        - ".*"  # Adjust this pattern to match the tables you want to include
    profiling:
      enabled: true

sink:
  type: "datahub-rest"
  config:
    server: "<your_datahub_server_url>"

Starting the Iceberg REST Catalog Service:
If you haven’t already, you need to start the Iceberg REST catalog service. Here’s an example command to start one using Docker (the tabulario/iceberg-rest image is one commonly used implementation; adjust to whichever REST catalog you run):

docker run -d -p 8181:8181 --name iceberg-rest-catalog tabulario/iceberg-rest:latest

Ensure that the service is up and running by checking the logs:

docker logs iceberg-rest-catalog

Debugging Steps:

  1. Check Service Status:
    Ensure the Iceberg REST catalog service is running and accessible at http://localhost:8181.

  2. Test Connectivity:
    You can use tools like curl to test connectivity:

    curl "http://localhost:8181/v1/config?warehouse=file:///path/to/your/warehouse/"
    
  3. Check Firewall/Network Settings:
    Ensure there are no firewall or network settings blocking access to localhost:8181.
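The connectivity test in step 2 can also be scripted with only the Python standard library (the URI is the one from your recipe; function names are illustrative):

```python
import socket
from urllib.parse import urlparse

def catalog_reachable(uri: str, timeout: float = 2.0) -> bool:
    """Try a TCP connection to the catalog's host:port; False means the
    service is down, the port is wrong, or something blocks the route."""
    parsed = urlparse(uri)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False
```

If catalog_reachable("http://localhost:8181") returns False, fix the service before re-running the ingestion; the recipe itself is not the problem.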

Running the Ingestion:
Once the Iceberg REST catalog service is confirmed to be running, you can run the ingestion command:

datahub ingest -c your_recipe.yml

Would you like more detailed steps on starting the Iceberg REST catalog service or any other part of the process?

<@U06TM7M40RK>
when I use the AWS Glue catalog for Iceberg ingestion I get the below error
Command failed: When reading information for key 'output/divyapoc/divyapoctable/metadata/00000-808cac2c-4b7a-4b62-b9ec-5b157d3f569d.metadata.json' in bucket 'divyapoc': AWS Error ACCESS_DENIED during HeadObject operation: No response body

The error message indicates that there is an ACCESS_DENIED error when trying to read a specific metadata file from an S3 bucket. This typically means that the AWS credentials provided do not have the necessary permissions to access the S3 bucket or the specific object within the bucket.

Steps to Resolve:

  1. Verify AWS IAM Permissions:
    Ensure that the AWS IAM role or user has the necessary permissions to access the S3 bucket and the objects within it. Specifically, you need s3:GetObject and s3:ListBucket permissions.

  2. Update IAM Policy:
    Update the IAM policy attached to the role or user to include the required permissions.

Example IAM Policy:
Here’s an example IAM policy that grants the necessary permissions to access the S3 bucket and objects:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::divyapoc",
        "arn:aws:s3:::divyapoc/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabases",
        "glue:GetTables",
        "glue:GetTable",
        "glue:GetDatabase",
        "glue:SearchTables",
        "glue:GetTableVersions",
        "glue:GetTableVersion",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:<your_aws_region>:<your_account_id>:catalog",
        "arn:aws:glue:<your_aws_region>:<your_account_id>:database/*",
        "arn:aws:glue:<your_aws_region>:<your_account_id>:table/*"
      ]
    }
  ]
}

Replace <your_aws_region> and <your_account_id> with your actual AWS region and account ID.
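One quick sanity check on a policy document like the one above is to confirm that every action the connector needs appears in an Allow statement. A pure-Python sketch (the required-action set mirrors this thread, not an official list, and this naive literal check does not expand wildcards like "s3:*"):

```python
import json

# Actions this thread's errors point at; extend as needed (assumption).
REQUIRED_ACTIONS = {"s3:GetObject", "s3:ListBucket", "glue:GetTables", "glue:GetDatabases"}

def allowed_actions(policy: dict) -> set[str]:
    """Collect every action granted by the policy's Allow statements."""
    granted: set[str] = set()
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        granted.update([actions] if isinstance(actions, str) else actions)
    return granted

def missing_actions(policy_json: str) -> set[str]:
    """Return required actions the policy does not grant literally."""
    return REQUIRED_ACTIONS - allowed_actions(json.loads(policy_json))
```

Anything left in the result is a permission that would surface as ACCESS_DENIED at ingestion time.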

Example Configuration for Iceberg Ingestion with Glue Catalog:
Ensure your ingestion recipe is correctly configured to use the Glue catalog and includes the necessary AWS credentials:

source:
  type: "iceberg"
  config:
    env: "PROD"
    catalog:
      name: "my_glue_catalog"
      type: "glue"
      config:
        aws_access_key_id: "<your_aws_access_key_id>"
        aws_secret_access_key: "<your_aws_secret_access_key>"
        aws_region: "<your_aws_region>"
        warehouse: "s3a://divyapoc/"
    platform_instance: "my_glue_catalog_instance"
    table_pattern:
      allow:
        - ".*"  # Adjust this pattern to match the tables you want to include
    profiling:
      enabled: true

sink:
  type: "datahub-rest"
  config:
    server: "<your_datahub_server_url>"

Steps to Troubleshoot:

  1. Check IAM Role/User Permissions:
    Ensure the IAM role or user has the necessary permissions as specified in the IAM policy above.

  2. Test S3 Access:
    Use AWS CLI to test access to the S3 bucket and the specific object:

    aws s3 ls s3://divyapoc/output/divyapoc/divyapoctable/metadata/
    

    aws s3 cp s3://divyapoc/output/divyapoc/divyapoctable/metadata/00000-808cac2c-4b7a-4b62-b9ec-5b157d3f569d.metadata.json .

  3. Ensure Correct AWS Credentials:
    Verify that the AWS credentials provided in the ingestion recipe are correct and have the necessary permissions.

Would you like more detailed steps on any of these points?

Sources:
- [Glue Configuration Details](<https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/glue#config-details>)
- [AWS IAM Policy Reference](<https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html>)