Removing Tables Starting with "DEV_" During Ingestion in DataHub

Original Slack Thread

Is it possible to omit (not ingest) all tables whose names start with "DEV_"?


For sure, it looks like the Teradata integration has a `table_pattern.deny` field. You can use a regex to match anything starting with `DEV_` and ignore it, which would look like this:

```yaml
table_pattern:
    deny:
        - 'DEV_.*'
```
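In a full recipe this sits under `source.config`; a minimal sketch with placeholder connection details:

```yaml
source:
    type: teradata
    config:
        host_port: 'xxxxxxxx:1025'   # placeholder
        username: xxxxxxx            # placeholder
        password: xxxxxxx            # placeholder
        table_pattern:
            deny:
                - 'DEV_.*'           # skip any table starting with DEV_
```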

Thank you very much <@U05JJ9WESHL>. I've already ingested everything without any restrictions. Now I want to delete the tables whose names start with "DEV_". If I re-ingest with the restriction:

```yaml
table_pattern:
    deny:
        - 'DEV_.*'
```

will the tables whose names start with "DEV_" be removed, or will I need to delete everything and ingest again with the restriction? In the latter case, I would lose the information already inserted into DataHub.

Ingesting with a `deny` table_pattern won't remove already-ingested table metadata. One solution I can think of is to write a small script that retrieves all urns for the platform (via GraphQL) and deletes the ones you don't need.
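A sketch of such a cleanup script, assuming a quickstart GMS at `http://localhost:8080` with the GraphQL endpoint at `/api/graphql`, and dataset urns in the usual `urn:li:dataset:(urn:li:dataPlatform:...,db.table,ENV)` shape (both the address and the token handling are assumptions, not confirmed by the thread):

```python
import json
import re
import urllib.request

GMS_GRAPHQL = "http://localhost:8080/api/graphql"  # assumed quickstart address

SEARCH_QUERY = """
{
  search(input: {type: DATASET, query: "*", start: 0, count: 10000}) {
    searchResults { entity { urn } }
  }
}
"""


def fetch_all_urns(token=""):
    """POST the GraphQL search and return every dataset urn."""
    headers = {"Content-Type": "application/json"}
    if token:  # a personal access token, if authentication is enabled
        headers["Authorization"] = "Bearer " + token
    req = urllib.request.Request(
        GMS_GRAPHQL,
        data=json.dumps({"query": SEARCH_QUERY}).encode(),
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [hit["entity"]["urn"]
            for hit in body["data"]["search"]["searchResults"]]


def dev_urns(urns):
    """Keep only urns whose database or table name starts with DEV_."""
    # DEV_ appears right after "(", "," or "." in a dataset urn
    return [u for u in urns if re.search(r"[,.(]DEV_", u)]
```

Each matching urn can then be removed with the DataHub CLI, e.g. `datahub delete --urn "<urn>"`.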

Thank you <@U0445MUD81W>!
Is there also a way to omit databases whose names start with "DEV_"?

Similar to `table_pattern`; alongside it, add a `schema_pattern`:

```yaml
schema_pattern:
    deny:
        - 'DEV_.*'
```

In newer versions of DataHub it is `database_pattern`.
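Concretely, the patterns can sit side by side in the recipe config; a sketch (which field applies depends on your DataHub version, as noted above):

```yaml
source:
    type: teradata
    config:
        # deny databases whose names start with DEV_ (newer versions)
        database_pattern:
            deny:
                - 'DEV_.*'
        # older versions use schema_pattern for the same purpose
        schema_pattern:
            deny:
                - 'DEV_.*'
```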

Thank you <@U0445MUD81W>!
As for writing a script to retrieve all the urns and delete specific ones, I'll have to study a bit before attempting it.

Another question <@U0445MUD81W>: where does DataHub store its information? I noticed that it uses MySQL, but how can I access this DB?

Here is a GraphQL query to get all urns for a given platform (replace `{platform}` with the platform name, e.g. `teradata`):

```graphql
{
  search(input: {
    type: DATASET,
    query: "*",
    start: 0,
    count: 10000,
    orFilters: [
      {
        and: [
          {
            field: "platform"
            values: ["{platform}"]
            condition: CONTAIN
          }
        ]
      }
    ]
  }) {
    start
    count
    total
    searchResults {
      entity {
        urn
      }
    }
  }
}
```

By default DataHub uses MySQL, Elasticsearch, and Kafka in its persistence layer.
MySQL runs in a container; `docker ps` shows something like:

```
5704b3bb9a18 mariadb:10.5.8 "docker-entrypoint.s…" 2 months ago Up 28 hours (healthy) 0.0.0.0:3306->3306/tcp mysql
```

Thanks <@U0445MUD81W>,
Is there any way to access this DB?

There are two more things I couldn't do. The first is changing the default DataHub user (I followed the documentation on the official website, but I can't find the file to make the change). The second is backup, which gives an error when executing the command `datahub docker quickstart --backup`.

Yes, you can access it with any SQL client using a JDBC connector; the default username is `datahub` and the password is `datahub`.
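For example, from the quickstart host you could connect with the `mysql` CLI (assuming the port mapping shown above and the default `datahub` database name):

```shell
# connect to the quickstart metadata store with the default credentials
mysql -h 127.0.0.1 -P 3306 -u datahub -pdatahub datahub
```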

Thanks! I’ll try it soon

Please take a look at this link to change the default DataHub credentials:
https://datahubproject.io/docs/authentication/changing-default-credentials/

<@U0445MUD81W>, I noticed now that the lineage was not loaded. Was there an option missing in the ingestion configuration?

There are many lineage configs for an ingestion recipe, depending on your needs, e.g.:

```yaml
include_copy_lineage: false
include_table_lineage: true
include_tables: true
include_unload_lineage: false
```

Check out the full options here:
https://datahubproject.io/docs/generated/ingestion/sources/mysql

Thank you <@U0445MUD81W>, later I will try this configuration:

```yaml
source:
    type: teradata
    config:
        host_port: 'xxxxxxxxxxxxxxxx:1025'
        username: xxxxxxxxx
        password: xxxxxxxxxxx
        include_table_lineage: true
        stateful_ingestion:
            enabled: true
        table_lineage_mode: sql_based
        include_copy_lineage: false
        include_tables: true
        include_unload_lineage: false
        schema_pattern:
            deny:
                - 'DEV_.*'
```

Previously I used this:

```yaml
source:
    type: teradata
    config:
        host_port: 'xxxxxxxxxxxxxx'
        username: xxxxxxx
        password: xxxxxxxxx
        include_table_lineage: true
        include_usage_statistics: true
        stateful_ingestion:
            enabled: true
```