Masking Sensitive Data in DataHub Stats: A Guide to Using Transformers

Original Slack Thread

Hi <@U06TM7M40RK>
How do I mask sensitive data in DataHub stats?

To mask sensitive data in DataHub stats, you can use transformers to modify the metadata before it is written to DataHub. A transformer runs between the source and the sink of an ingestion pipeline, so it can apply custom changes, such as masking sensitive values, to each record on its way through.

Here are the steps to achieve this:

  1. Identify the Data to Mask: Determine which parts of your data contain sensitive information that needs to be masked.

  2. Create a Custom Transformer: Write a custom transformer that will modify the sensitive data before it is ingested. You can refer to the DataHub Transformer documentation for guidance on creating custom transformers.

  3. Apply the Transformer in Your Ingestion Recipe: Use the custom transformer in your ingestion recipe to ensure that the sensitive data is masked during the ingestion process.
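
For steps 1 and 2, the masking rules themselves can live in plain helper functions that a transformer later calls. The following is a minimal sketch using only the Python standard library. The names `SENSITIVE_NAME_PATTERN`, `is_sensitive_field`, `mask_value`, and `mask_sample_values` are hypothetical and not part of DataHub, and the patterns are assumptions to adapt to your own data:

```python
import re
from typing import List

# Hypothetical column-name hints that mark a field as sensitive; adjust as needed.
SENSITIVE_NAME_PATTERN = re.compile(r"(email|ssn|phone|password)", re.IGNORECASE)
# Loose pattern for email-like substrings; not a full address validator.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_sensitive_field(field_name: str) -> bool:
    """Return True when the column name suggests sensitive content."""
    return SENSITIVE_NAME_PATTERN.search(field_name) is not None


def mask_value(value: str) -> str:
    """Replace email-like substrings with a fixed placeholder."""
    return EMAIL_PATTERN.sub("****@****.com", value)


def mask_sample_values(field_name: str, samples: List[str]) -> List[str]:
    """Fully mask samples of sensitive fields; scrub emails in all other fields."""
    if is_sensitive_field(field_name):
        return ["****" for _ in samples]
    return [mask_value(sample) for sample in samples]
```

Keeping the rules separate from the transformer class means they can be unit tested without a DataHub installation.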

Here is a conceptual example of how you might write a custom transformer to mask sensitive data:

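from typing import Iterable

from datahub.ingestion.api.common import PipelineContext, RecordEnvelope
from datahub.ingestion.api.transform import Transformer
from datahub.metadata.schema_classes import MetadataChangeEventClass

class MaskSensitiveDataTransformer(Transformer):
    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "MaskSensitiveDataTransformer":
        # The ingestion framework calls create() to build the transformer from the recipe config
        return cls()

    def transform(self, record_envelopes: Iterable[RecordEnvelope]) -> Iterable[RecordEnvelope]:
        for envelope in record_envelopes:
            record = envelope.record
            # Example: masking email addresses; records that are not MCEs pass through unchanged
            if isinstance(record, MetadataChangeEventClass):
                for aspect in record.proposedSnapshot.aspects:
                    if hasattr(aspect, 'email'):
                        aspect.email = '****@****.com'
            yield envelope

# No explicit registration is needed: the recipe references the class by its import path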

  4. Configure the Ingestion Recipe: Include the custom transformer in your ingestion recipe configuration.

source:
  type: your_data_source
  config:
    # Your data source configuration
transformers:
  # Reference the custom transformer by its fully qualified class name
  - type: "path.to.MaskSensitiveDataTransformer"
    config: {}
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
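
Before running a real ingestion, the masking loop can be tried locally against stand-in objects. This is only a sketch: `mask_email_aspects` is a hypothetical helper that mirrors the loop in the conceptual transformer, and the `SimpleNamespace` objects are placeholders for DataHub records, not real DataHub classes:

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def mask_email_aspects(records: Iterable[Any]) -> Iterator[Any]:
    """Mask an `email` attribute on every aspect that has one."""
    for record in records:
        snapshot = getattr(record, "proposedSnapshot", None)
        # Records without a snapshot yield an empty aspect list and pass through.
        for aspect in getattr(snapshot, "aspects", []):
            if hasattr(aspect, "email"):
                aspect.email = "****@****.com"
        yield record


# Stand-in record with one aspect that carries an email and one that does not.
record = SimpleNamespace(
    proposedSnapshot=SimpleNamespace(
        aspects=[
            SimpleNamespace(email="jane@example.com"),
            SimpleNamespace(description="no email here"),
        ]
    )
)
masked = list(mask_email_aspects([record]))
```

The same checks can then be repeated against real record types once the transformer is installed alongside DataHub.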

Would you like more detailed steps on how to implement these changes, or do you have any specific questions about this process?
