Filtering Glue Databases by S3 Location Using a Custom Transformer

Original Slack Thread

Hi! When using the Glue ingestion source, is there any way to filter the Glue databases based on their S3 location?

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?


I’m not able to find any data sources to answer your question. You might want to specify if you are looking for a way to filter databases by S3 location using a specific configuration or tool. If this doesn’t help, please reach out to the DataHub support team for further assistance.

By the way, I found these links that might be relevant to what you’re looking for. Let me know if this helps:
- python - Filtering DynamicFrame with AWS Glue or PySpark - Stack Overflow
- <https://docs.aws.amazon.com/glue/latest/dg/edit-jobs-source-s3-files.html|Using files in Amazon S3 for the data source - AWS Glue>
- AWS Glue: Managing Data Catalogs and Querying Metadata in S3 … (Medium)
- aws-reference-architectures/datalake: Ingestion using Amazon Glue | aws-reference … - GitBook

<@U06TM7M40RK> Can I use a custom transformer to filter Glue databases based on their location?

It seems likely that you can use a custom transformer to filter Glue databases based on their S3 location. Custom transformers in DataHub allow you to modify metadata before it reaches the ingestion sink, which could include filtering databases based on specific criteria such as their S3 location.

According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/transformer/intro.md|DataHub documentation on transformers>, you can write your own module to transform metadata events. This means you could potentially create a transformer that checks the S3 location of each Glue database and filters them accordingly.
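
As a rough illustration of the location check such a transformer might perform, here is a minimal sketch in plain Python. It assumes each database's S3 location is available as a URI string. The names `DatabaseRecord`, `location_allowed`, and `filter_by_location` are hypothetical and are not part of DataHub's transformer API; wiring the check into an actual transformer class is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Sequence, Tuple
from urllib.parse import urlparse

# Hypothetical stand-in for the metadata of one Glue database.
# This is NOT a DataHub class; it only carries the fields the check needs.
@dataclass
class DatabaseRecord:
    name: str
    location_uri: Optional[str]  # e.g. "s3://my-bucket/warehouse/sales"


def _split_s3_uri(uri: str) -> Tuple[str, str]:
    """Return (bucket, key path with a trailing slash) for an S3-style URI."""
    parsed = urlparse(uri)
    path = parsed.path.strip("/")
    # The trailing slash stops "warehouse" from matching "warehouse2".
    return parsed.netloc, (path + "/") if path else ""


def location_allowed(location_uri: Optional[str],
                     allowed_prefixes: Sequence[str]) -> bool:
    """True if the location sits under one of the allowed S3 prefixes."""
    if not location_uri:
        return False  # databases without a location are dropped
    if urlparse(location_uri).scheme not in ("s3", "s3a", "s3n"):
        return False  # only S3-style locations are considered
    bucket, path = _split_s3_uri(location_uri)
    for prefix in allowed_prefixes:
        prefix_bucket, prefix_path = _split_s3_uri(prefix)
        if bucket == prefix_bucket and path.startswith(prefix_path):
            return True
    return False


def filter_by_location(records: Iterable[DatabaseRecord],
                       allowed_prefixes: Sequence[str]) -> Iterator[DatabaseRecord]:
    """Yield only the records whose location passes the prefix check."""
    for record in records:
        if location_allowed(record.location_uri, allowed_prefixes):
            yield record


if __name__ == "__main__":
    sample = [
        DatabaseRecord("sales", "s3://my-bucket/warehouse/sales"),
        DatabaseRecord("scratch", "s3://my-bucket/warehouse2/scratch"),
        DatabaseRecord("no_location", None),
    ]
    kept = filter_by_location(sample, ["s3://my-bucket/warehouse"])
    print([r.name for r in kept])  # prints ['sales']
```

Comparing bucket and path separately, with a trailing slash on the path, avoids treating `warehouse2` as if it were inside `warehouse`. Whether databases with no location should be kept or dropped is a policy choice; this sketch drops them.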

Would you like more detailed steps or a code example on how to implement this?


Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps!