Removing Tables Starting with "DEV_" During Ingestion in DataHub

Original Slack Thread

Is it possible to omit (not ingest) all tables whose names start with "DEV_"?


For sure, it looks like the Teradata integration has a `table_pattern.deny` field. You can use a regex to match anything starting with `DEV_` and ignore it, which would look like this:

```yaml
table_pattern:
    deny:
        - 'DEV_.*'
```
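In a full recipe this sits under `source.config`; a minimal sketch with placeholder connection details:

```yaml
source:
    type: teradata
    config:
        host_port: 'xxxxxxxx:1025'   # placeholder
        username: xxxxxxx            # placeholder
        password: xxxxxxx            # placeholder
        table_pattern:
            deny:
                - 'DEV_.*'           # skip any table starting with DEV_
```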

Thank you very much <@U05JJ9WESHL>. I've already ingested everything without any restrictions. Now I want to delete the tables whose names start with "DEV_". If I re-ingest with the restriction:

```yaml
table_pattern:
    deny:
        - 'DEV_.*'
```

will the tables whose names start with "DEV_" be removed, or will I need to delete everything and ingest again with the restriction? In the latter case, I would lose the information already inserted into DataHub.

Ingesting with a `deny` table_pattern won't remove already-ingested table metadata. One solution I can think of is to write a small script that retrieves all urns for the platform (via GraphQL) and deletes the ones you don't need.
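A sketch of such a cleanup script, assuming a quickstart GMS at `http://localhost:8080` with the GraphQL endpoint at `/api/graphql`, and dataset urns in the usual `urn:li:dataset:(urn:li:dataPlatform:...,db.table,ENV)` shape (both the address and the token handling are assumptions, not confirmed by the thread):

```python
import json
import re
import urllib.request

GMS_GRAPHQL = "http://localhost:8080/api/graphql"  # assumed quickstart address

SEARCH_QUERY = """
{
  search(input: {type: DATASET, query: "*", start: 0, count: 10000}) {
    searchResults { entity { urn } }
  }
}
"""


def fetch_all_urns(token=""):
    """POST the GraphQL search and return every dataset urn."""
    headers = {"Content-Type": "application/json"}
    if token:  # a personal access token, if authentication is enabled
        headers["Authorization"] = "Bearer " + token
    req = urllib.request.Request(
        GMS_GRAPHQL,
        data=json.dumps({"query": SEARCH_QUERY}).encode(),
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [hit["entity"]["urn"]
            for hit in body["data"]["search"]["searchResults"]]


def dev_urns(urns):
    """Keep only urns whose database or table name starts with DEV_."""
    # DEV_ appears right after "(", "," or "." in a dataset urn
    return [u for u in urns if re.search(r"[,.(]DEV_", u)]
```

Each matching urn can then be removed with the DataHub CLI, e.g. `datahub delete --urn "<urn>"`.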

Thank you <@U0445MUD81W>!
Is there also a way to omit databases whose names start with "DEV_"?

Similar to `table_pattern`; alongside it, add a `schema_pattern`:

```yaml
schema_pattern:
    deny:
        - 'DEV_.*'
```

In newer versions of DataHub it is `database_pattern`.
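Concretely, the patterns can sit side by side in the recipe config; a sketch (which field applies depends on your DataHub version, as noted above):

```yaml
source:
    type: teradata
    config:
        # deny databases whose names start with DEV_ (newer versions)
        database_pattern:
            deny:
                - 'DEV_.*'
        # older versions use schema_pattern for the same purpose
        schema_pattern:
            deny:
                - 'DEV_.*'
```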

Thank you <@U0445MUD81W>!
As for writing a script to retrieve all the urns and delete specific ones, I'll have to study a bit before attempting it.

Another question <@U0445MUD81W>: where does DataHub store its information? I noticed that it uses MySQL, but how can I access this DB?

Here is a GraphQL query to get all urns for a given platform (replace `{platform}` with the platform name, e.g. `teradata`):

```graphql
{
  search(input: {
    type: DATASET,
    query: "*",
    start: 0,
    count: 10000,
    orFilters: [
      {
        and: [
          {
            field: "platform"
            values: ["{platform}"]
            condition: CONTAIN
          }
        ]
      }
    ]
  }) {
    start
    count
    total
    searchResults {
      entity {
        urn
      }
    }
  }
}
```

By default DataHub uses MySQL, Elasticsearch, and Kafka in its persistence layer.
MySQL runs in a container; `docker ps` shows something like:

```
5704b3bb9a18 mariadb:10.5.8 "docker-entrypoint.s…" 2 months ago Up 28 hours (healthy) 0.0.0.0:3306->3306/tcp mysql
```

Thanks <@U0445MUD81W>,
Is there any way to access this DB?

There are two more things I couldn't do. The first is changing the default DataHub user (I followed the documentation on the official website, but I can't find the file to make the change). The second is backup, which gives an error when executing the command `datahub docker quickstart --backup`.

Yes, you can access it with any SQL client using a JDBC connector; the default username is `datahub` and the password is `datahub`.
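For example, from the quickstart host you could connect with the `mysql` CLI (assuming the port mapping shown above and the default `datahub` database name):

```shell
# connect to the quickstart metadata store with the default credentials
mysql -h 127.0.0.1 -P 3306 -u datahub -pdatahub datahub
```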

Thanks! I’ll try it soon

Please take a look at this link to change the default DataHub credentials:
https://datahubproject.io/docs/authentication/changing-default-credentials/

<@U0445MUD81W>, I noticed now that the lineage was not loaded. Was there an option missing in the ingestion configuration?

There are many lineage configs for an ingestion recipe, depending on your needs, e.g.:

```yaml
include_copy_lineage: false
include_table_lineage: true
include_tables: true
include_unload_lineage: false
```

Check out the full options here:
https://datahubproject.io/docs/generated/ingestion/sources/mysql

Thank you <@U0445MUD81W>, later I will try this configuration:

```yaml
source:
    type: teradata
    config:
        host_port: 'xxxxxxxxxxxxxxxx:1025'
        username: xxxxxxxxx
        password: xxxxxxxxxxx
        include_table_lineage: true
        stateful_ingestion:
            enabled: true
        table_lineage_mode: sql_based
        include_copy_lineage: false
        include_tables: true
        include_unload_lineage: false
        schema_pattern:
            deny:
                - 'DEV_.*'
```

Previously I used this:

```yaml
source:
    type: teradata
    config:
        host_port: 'xxxxxxxxxxxxxx'
        username: xxxxxxx
        password: xxxxxxxxx
        include_table_lineage: true
        include_usage_statistics: true
        stateful_ingestion:
            enabled: true
```