Troubleshooting S3 Recipe Issues with Partitioned Datasets

Original Slack Thread

<@U06TM7M40RK> give me example s3 recipe creating datasets on folder level with partition of year, month, day, hour. Files are in json format. Partition example is year=2024/month=01/day=01/hour=00/

To create datasets on a folder level with partitions for year, month, day, and hour using files in JSON format, you can use the path_specs configuration in your S3 ingestion recipe. According to the S3 / Local Files DataHub documentation on S3 path specs, you can specify the path to include these partitions.

Here’s an example of how you might configure your path_specs:

path_specs:
  - include: s3://your-bucket-name/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json

In this configuration:

  • {table} is a placeholder for the folder level that represents the dataset.
  • {partition[0]}, {partition[1]}, {partition[2]}, and {partition[3]} are placeholders for the partition values, corresponding in this case to year, month, day, and hour.

This setup will scan for JSON files in the specified partitioned structure and create datasets accordingly.
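
If it helps to see where path_specs fits into a complete, runnable recipe, here is a minimal sketch that expresses the same configuration as a Python dict and runs it through DataHub's Pipeline API. The bucket name, AWS region, and DataHub server address are placeholders, and the sketch assumes the acryl-datahub package with the s3 plugin is installed; adjust the values for your environment.

    # Minimal sketch: the same S3 recipe expressed as a dict and run programmatically.
    # Bucket, region, and DataHub server values are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    recipe = {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [
                    {
                        "include": "s3://your-bucket-name/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json"
                    }
                ],
                "aws_config": {"aws_region": "us-east-1"},  # placeholder region
                "env": "PROD",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS address
        },
    }

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
    pipeline.pretty_print_summary()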

Would you like more details on this configuration or any other aspect of DataHub’s ingestion process?

<@U06TM7M40RK> using the given partition I am getting an error Unable to find any files in the folder

The error message “Unable to find any files in the folder” suggests that the path specified in your path_specs might not be correctly matching the structure of your S3 bucket. Here are a few things to check and consider:

  1. Verify the Path Structure: Ensure that the path structure in your path_specs matches the structure of your S3 bucket exactly. Double-check the folder names and partition keys to ensure they are correctly specified.

  2. Check for Typographical Errors: Make sure there are no typos in the bucket name, folder names, or partition keys.

  3. Use Wildcards Appropriately: If your folder structure is more complex or if there are additional folders, you might need to use wildcards (*) to match those levels. For example, if there are additional folders between the table and the year partition, you might need to adjust the path.

  4. Ensure Files Exist: Verify that there are indeed JSON files in the specified path. You can do this by manually checking the S3 bucket, or with the short listing sketch shown after this answer.

  5. Debugging: Run the ingestion in debug mode to get more detailed logs about which paths are being scanned. This can help identify where the mismatch is occurring.

  6. Example Path Spec: Based on the S3 / Local Files DataHub documentation, ensure your path spec is correctly formatted. Here’s an example:

    path_specs:
      - include: s3://your-bucket-name/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json
    

If you continue to face issues, please provide more details about your S3 bucket structure and the exact path spec you are using, and I can help troubleshoot further.
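
Following up on point 4 above, a quick way to confirm that JSON files really exist under one concrete partition prefix is to list that prefix directly with boto3. This is only a local sanity check, not part of DataHub itself; the bucket name and prefix are placeholders, and the snippet assumes boto3 is installed and AWS credentials are configured.

    # List one concrete partition prefix to confirm that JSON files actually exist
    # there before adjusting the path_spec further. Bucket and prefix are placeholders.
    import boto3

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(
        Bucket="your-bucket-name",
        Prefix="some-table/year=2024/month=01/day=01/hour=00/",  # one concrete partition
        MaxKeys=20,
    )
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])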

<@U06TM7M40RK> s3 recipe with partition is not working with error Unable to find any files in the folder

To address the error “Unable to find any files in the folder” when using an S3 recipe with partitions, you should verify a few key aspects of your configuration:

  1. Path Structure: Ensure that the path specified in your path_specs matches the actual structure of your S3 bucket. The placeholders for partitions should align with the folder structure in your bucket.

  2. Correct Use of Placeholders: Make sure you are using the correct placeholders for partitions. For example, if your path is structured as year=2024/month=01/day=01/hour=00, your path spec should look like this:

    path_specs:
      - include: s3://your-bucket-name/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json
    

    Here, {partition[0]}, {partition[1]}, {partition[2]}, and {partition[3]} correspond to year, month, day, and hour, respectively.

  3. File Existence: Verify that there are indeed JSON files in the specified path. You can do this by manually checking the S3 bucket.

  4. Debugging: Run the ingestion in debug mode to get more detailed logs about which paths are being scanned. This can help identify where the mismatch is occurring; a brief sketch for enabling debug logs follows this answer.

  5. Wildcards: If there are additional folders or if the structure is more complex, consider using wildcards (*) to match those levels.

If you continue to face issues, please provide more details about your S3 bucket structure and the exact path spec you are using, and I can help troubleshoot further.
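
On the debugging point, if you run the recipe from Python (as in the Pipeline sketch earlier in this thread), raising the log level to DEBUG makes the ingestion logs, including the S3 source's file scanning, much more detailed; when using the CLI, the global --debug flag (datahub --debug ingest -c recipe.yaml) has a similar effect. A minimal sketch, assuming the programmatic route:

    # Minimal sketch: enable DEBUG logging before creating and running the pipeline
    # so the ingestion output shows far more detail about what is being scanned.
    import logging

    logging.basicConfig(level=logging.DEBUG)

    # ...then build and run the pipeline exactly as in the earlier sketch:
    #   pipeline = Pipeline.create(recipe)
    #   pipeline.run()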

<@U06TM7M40RK> I have this path spec, which is not working

      - include: "<s3://my-bucket/test/group/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json>"
        allow_double_stars: False
        # autodetect_partitions: True
        # default_extension: json
        enable_compression: True
        sample_files: True
        traversal_method: MIN_MAX
        table_name: test_events

To troubleshoot the issue with your path_specs configuration, let’s go through a few key points and suggestions:

  1. Path Structure: Ensure that the path structure in your path_specs matches the structure of your S3 bucket exactly. Double-check the folder names and partition keys to ensure they are correctly specified.

  2. Placeholders: The placeholders {table}, {partition[0]}, {partition[1]}, {partition[2]}, and {partition[3]} should correspond to the actual folder structure in your S3 bucket. Make sure these placeholders are correctly aligned with your bucket’s structure.

  3. Angle Brackets: Remove the angle brackets (< and >) around the S3 path. They are likely Slack link formatting, are not part of the path, and might cause issues. The path should be specified as a string without these brackets.

  4. Configuration Options:

    • allow_double_stars, default_extension, enable_compression, sample_files: These options are not mentioned in the provided data sources, so they may not be supported in your DataHub version. Consider removing them while you debug.
    • autodetect_partitions: If you want partitions to be detected automatically, uncomment this line.
    • traversal_method: The MIN_MAX method is used to find the latest and earliest partitions by ordering the partitions at each level. Ensure this is the desired behavior.
    • table_name: This should be correctly set to the desired table name.
  5. Example Path Spec: Based on the S3 / Local Files DataHub documentation, here’s a revised version of your path spec (a quick local sanity check for it follows after this answer):

    path_specs:
      - include: "s3://my-bucket/test/group/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json"
        traversal_method: MIN_MAX
        table_name: test_events
    

If you continue to face issues, please provide more details about your S3 bucket structure and the exact error message you are receiving, and I can help troubleshoot further.
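
As an additional local sanity check on the include pattern above, you can collapse the template variables into wildcards and glob-match a known-good object key against the result. This only approximates DataHub's matching logic rather than reproducing it, and the bucket, folder names, and file name below are hypothetical examples based on this thread.

    # Rough sanity check: replace every {placeholder} in the include pattern with a
    # wildcard, then glob-match a real S3 key against it. A failure here usually
    # means the template and the actual key layout disagree (an extra folder level,
    # different partition key names, and so on).
    import fnmatch
    import re

    include = (
        "s3://my-bucket/test/group/{table}/year={partition[0]}/month={partition[1]}/"
        "day={partition[2]}/hour={partition[3]}/*.json"
    )
    # A real key copied from the bucket (hypothetical example here).
    sample_key = "s3://my-bucket/test/group/test_events/year=2024/month=01/day=01/hour=00/part-0000.json"

    glob_pattern = re.sub(r"\{[^}]+\}", "*", include)
    print(glob_pattern)
    print(fnmatch.fnmatch(sample_key, glob_pattern))  # False suggests a layout mismatch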

<@U06TM7M40RK> since s3 path_spec is not working, how can I ingest s3 data source without using s3 recipe?

I’m not able to find any data sources to answer your question. You might want to specify the type of data you are trying to ingest or the alternative method you are considering. If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> instead of using S3 recipe, is there a way to manually ingest s3 data source using datahub sdk or rest emitter?

It seems likely that you can manually ingest data from an S3 data source using the DataHub SDK or REST emitter, although the specific details are not fully covered in the provided data sources. Here are some general steps you might consider:

  1. Using DataHub SDK: You can use the DataHub Python SDK to programmatically create and emit metadata to DataHub. This involves creating metadata objects and using the SDK to send them to your DataHub instance.

  2. Using REST Emitter: You can manually construct the metadata payloads and send them to the DataHub REST API. This requires knowledge of the DataHub metadata model and the REST API endpoints.

  3. Programmatic Ingestion: According to a Slack discussion, you can run ingestion programmatically using Pipeline.create(config).run(...). This method allows you to dynamically configure and execute ingestion pipelines.

  4. Example Code: While specific code examples are not provided in the data sources, you would typically start by importing the necessary modules from the DataHub SDK, creating metadata objects (such as datasets, schemas, etc.), and then using the SDK’s methods to emit these objects to DataHub. A hedged sketch of the REST emitter route follows this answer.

If you need more detailed guidance or specific code examples, please let me know, and I can help break down the process further.
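
To make the SDK and REST emitter route above concrete, here is a minimal sketch that registers a single folder-level dataset by hand. The server address, environment, and dataset name are assumptions based on the bucket and table discussed in this thread, and a real replacement for the S3 source would also need to emit schema and other aspects, so treat this as a starting point only.

    # Minimal sketch: emit one S3-backed dataset manually via the DataHub REST emitter.
    # Server URL, env, and dataset name are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS URL

    dataset_urn = make_dataset_urn(
        platform="s3",
        name="my-bucket/test/group/test_events",  # folder-level dataset name
        env="PROD",
    )

    mcp = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(
            description="JSON events partitioned by year/month/day/hour",
            customProperties={"format": "json"},
        ),
    )

    emitter.emit(mcp)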

Sources:

  • Slack discussion

<@U06TM7M40RK> using s3 ingestion source {table} does not work.

When using the {table} placeholder in the S3 ingestion source and encountering issues, there are a few things to consider:

  1. Correct Usage of {table}: The {table} placeholder is used to represent a folder level in your S3 path that should be treated as a dataset. Ensure that the folder structure in your S3 bucket aligns with how you are using {table} in your path spec.

  2. Path Spec Configuration: According to the S3 / Local Files DataHub documentation, the path spec should be correctly formatted to match the structure of your S3 bucket. For example, if your bucket structure is s3://my-bucket/foo/{table}/year=2024/month=01/day=01/hour=00/*.json, ensure that the path spec reflects this structure.

  3. Debugging: Run the ingestion in debug mode to get more detailed logs about which paths are being scanned. This can help identify where the mismatch is occurring.

  4. Example Path Spec: Here’s an example of how you might configure your path_specs:

    path_specs:
      - include: s3://my-bucket/foo/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json
    
  5. Check for Errors: Ensure there are no typographical errors in the path spec and that the bucket and folder names are correct.

If you continue to face issues, please provide more details about your S3 bucket structure and the exact path spec you are using, and I can help troubleshoot further.

<@U06TM7M40RK> I am getting Unable to find any files in the folder error when using {table} in the recipe file

The error “Unable to find any files in the folder” when using {table} in your S3 ingestion recipe suggests that there might be a mismatch between the specified path in your path_specs and the actual structure of your S3 bucket. Here are some steps to troubleshoot and resolve the issue:

  1. Verify Path Structure: Ensure that the path structure in your path_specs matches the structure of your S3 bucket exactly. The {table} placeholder should correspond to a folder level in your bucket.

  2. Check for Typographical Errors: Double-check the bucket name, folder names, and partition keys to ensure they are correctly specified.

  3. Use Wildcards Appropriately: If your folder structure is more complex or if there are additional folders, you might need to use wildcards (*) to match those levels.

  4. Debugging: Run the ingestion in debug mode to get more detailed logs about which paths are being scanned. This can help identify where the mismatch is occurring.

  5. Example Path Spec: Based on the S3 / Local Files DataHub documentation, ensure your path spec is correctly formatted. Here’s an example:

    path_specs:
      - include: s3://your-bucket-name/{table}/year={partition[0]}/month={partition[1]}/day={partition[2]}/hour={partition[3]}/*.json
    
  6. Check for File Existence: Verify that there are indeed JSON files in the specified path. You can do this by manually checking the S3 bucket.

If you continue to face issues, please provide more details about your S3 bucket structure and the exact path spec you are using, and I can help troubleshoot further.

Sources:

  • Slack discussion on S3 ingestion issues

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><