Hi all, I have a question about adding already existing documentation/descriptions to newly ingested assets with tables and columns. I have ingested a new source and the descriptions are all empty. Instead of copy-pasting/writing all my existing documentation on the source fields manually, is there a way I can import an Excel or text file and map it to the corresponding fields, filling in the empty description per table field? Thank you for any help
Hey there! 
<@U06TM7M40RK> Can you help with the questions in this thread?
Please continue to tag RunLLM (e.g <@U06TM7M40RK>) to have it continue generating responses!
Yes, you can bulk upload table and column descriptions for already ingested datasets using a CSV file, which can be exported from Excel. DataHub provides a CSV Enricher source for this purpose. According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv|CSV Enricher documentation>, you can use the CSV Enricher to bulk upload metadata such as descriptions, tags, glossary terms, and owners. Here’s a brief overview of how you can achieve this:
1. Prepare the CSV File: Create a CSV file with a header row and the following columns: resource, subresource, glossary_terms, tags, owners, ownership_type, description, domain, ownership_type_urn. For example:
resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
"urn:li:dataset:(urn:li:dataPlatform:postgres,post.user.table1,PROD)",,,,"urn:li:corpuser:admin",,"Table 1 description",,
"urn:li:dataset:(urn:li:dataPlatform:postgres,post.user.table1,PROD)",column1,,,,,"Column 1 description",,
2. Ingest the CSV File: Use the DataHub CLI to ingest the CSV file. You need a recipe file for the ingestion process. Here’s an example:
source:
  type: csv-enricher
  config:
    filename: ./path/to/your/file.csv
    write_semantics: PATCH  # or OVERRIDE
3. Run the Ingestion: Execute the ingestion using the DataHub CLI:
datahub ingest -c path/to/your/recipe.yaml
This process will update the descriptions for the specified tables and columns in DataHub.
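If the descriptions currently live in Excel, a short script can generate a correctly quoted enricher CSV instead of hand-editing one. This is a sketch using Python's standard csv module (the URN and descriptions are placeholders from the example above); DictWriter automatically wraps any field that contains commas in double quotes:

```python
import csv
import io

# Column order expected by the csv-enricher source.
HEADER = [
    "resource", "subresource", "glossary_terms", "tags", "owners",
    "ownership_type", "description", "domain", "ownership_type_urn",
]

# Placeholder rows: one for the table (empty subresource), one per column.
urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,post.user.table1,PROD)"
rows = [
    {"resource": urn, "description": "Table 1 description"},
    {"resource": urn, "subresource": "column1", "description": "Column 1 description"},
]

buf = io.StringIO()
# restval="" fills every column a row doesn't mention with an empty field.
writer = csv.DictWriter(buf, fieldnames=HEADER, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())  # for a real run, write this to a .csv file instead
```

Because the dataset URN contains commas, the writer quotes it for you, which avoids the most common hand-editing mistake with this format.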
Would you like more detailed steps or examples on any part of this process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv|CSV Enricher documentation>
Hint: Mention <@U06TM7M40RK> in the thread for followups.
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><
<@U06TM7M40RK> Can i use the graphical user interface and choose ingest csv to do the job?
Yes, you can use the DataHub UI to ingest CSV files for metadata enrichment. According to the <https://datahubproject.io/docs/0.13.1/ui-ingestion|UI Ingestion Guide>, you can configure and execute metadata ingestion through the DataHub user interface. Here’s how you can do it:
1. Navigate to the Ingestion Tab: Open the DataHub UI and go to the Ingestion tab.
2. Create a New Source: Click + Create new source at the top left of the table.
3. Select CSV as the Source Type: Scroll to the bottom of the sources list and select CSV.
4. Configure the Source: Fill out the form, setting Source Type to CSV and Path to the location of the CSV file on your local system (or a URL if it’s hosted online).
5. Run the Ingestion: Save the configuration and run the ingestion process.
This method allows you to use the UI to manage the ingestion process without needing to use the CLI. However, it’s important to ensure that your CSV file is formatted correctly to match the expected structure for metadata enrichment.
Would you like more detailed instructions or have any specific questions about the process?
Sources:
- <https://datahubproject.io/docs/0.13.1/ui-ingestion|UI Ingestion Guide>
<@U06TM7M40RK> also, do i have to mention the table in a separate row, or could i just list the columns i want to add descriptions for? and do i have to use double quotes on the text or is this just for demo purposes?
When using the CSV Enricher to add descriptions, you need to follow a specific format for the CSV file. Here’s how you can structure it:
1. Separate Rows for Tables and Columns: Add a separate row for the table itself if you want to describe the table, and one row for each column you want to describe. The resource field specifies the dataset, and the subresource field specifies the column within that dataset.
2. Use of Double Quotes: Double quotes are required whenever a field value contains commas (dataset URNs always do), so the CSV parser interprets the field correctly. If a value contains no commas you can omit the quotes, but quoting everything is a good habit that avoids parsing issues.
Here’s an example of how your CSV might look:
resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
"urn:li:dataset:(urn:li:dataPlatform:postgres,post.user.table1,PROD)",,,,"urn:li:corpuser:admin",,"Table 1 description",,
"urn:li:dataset:(urn:li:dataPlatform:postgres,post.user.table1,PROD)","column1",,,,,"Column 1 description",,
In this example:
- The first row is for the table itself.
- The second row is for a specific column within that table.
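To see why the quoting matters, here is a small sketch comparing how a CSV parser splits a quoted versus an unquoted dataset URN (the URN is taken from the example above):

```python
import csv
import io

urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,post.user.table1,PROD)"

# Unquoted: the commas inside the URN are treated as field separators,
# so the URN is shredded into three fields.
unquoted = next(csv.reader(io.StringIO(f"{urn},column1,desc")))
print(len(unquoted))  # 5 fields instead of the intended 3

# Quoted: the URN survives as a single field.
quoted = next(csv.reader(io.StringIO(f'"{urn}",column1,desc')))
print(len(quoted), quoted[0] == urn)
```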
Would you like more detailed guidance on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv|CSV Enricher documentation>
<@U06TM7M40RK> I still cannot get it to work, if i copy the urn from the datahub is says: urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD),kode), In the csv-file i have the following line:
urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD),kode,“Kode for feilstatus.”, Any ideas why it doesn’t update the description on the column?
To troubleshoot why the description is not updating for the column, let’s go through a few potential issues and solutions:
1. URN Format: Put the dataset URN (urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD)) in the resource column and only the column name (kode) in the subresource column. Do not use the full urn:li:schemaField:(...) URN anywhere in the file.
2. CSV Formatting: The dataset URN contains commas, so it must be wrapped in straight double quotes; in your line it is unquoted, so the parser splits the URN across several fields. Also, the quotes around “Kode for feilstatus.” are curly “smart quotes” (typically inserted by Excel or a word processor), which a CSV parser treats as literal characters; replace them with plain ASCII quotes ("). Make sure the row has the full set of nine columns (use trailing commas for the empty fields), there are no extra spaces or hidden characters, and the file is saved with UTF-8 encoding.
3. CSV Enricher Configuration: Verify that your ingestion recipe uses the csv-enricher source, that the path to your CSV file is correct, and that write_semantics is set to PATCH or OVERRIDE as needed.
4. DataHub Logs: Check the DataHub logs for any errors or warnings during the ingestion process; they often show why a row was skipped.
5. DataHub Version: Ensure you are using a version of DataHub that supports the CSV Enricher functionality.
6. Column Existence: Double-check that the column kode exists on the dataset lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn in DataHub.
If all of the above checks pass and it still doesn’t work, run the ingestion again and closely monitor the logs for specific error messages.
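Smart quotes and missing placeholder columns are easy to miss by eye. A rough lint script (my own sketch, not part of DataHub) can flag both before you run the ingestion:

```python
import csv
import io

SMART_QUOTES = {"\u201c", "\u201d", "\u2018", "\u2019"}  # curly quote characters
EXPECTED_FIELDS = 9  # number of columns in the csv-enricher header

def lint_csv_text(text: str) -> list[str]:
    """Return a list of human-readable problems found in the CSV text."""
    issues = []
    if text.startswith("\ufeff"):
        issues.append("file starts with a UTF-8 BOM; save as plain UTF-8")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        if any(ch in SMART_QUOTES for ch in line):
            issues.append(f'line {lineno}: curly "smart quotes" -- replace with straight "')
        fields = next(csv.reader(io.StringIO(line)))
        if len(fields) != EXPECTED_FIELDS:
            issues.append(f"line {lineno}: {len(fields)} fields, expected {EXPECTED_FIELDS}")
    return issues

# The problematic line from this thread: unquoted URN plus smart quotes.
bad_line = ('urn:li:dataset:(urn:li:dataPlatform:databricks,'
            'lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD),'
            'kode,\u201cKode for feilstatus.\u201d,')
for problem in lint_csv_text(bad_line):
    print(problem)
```

Running the header row from the examples above through the same function returns no issues, which makes it a quick sanity check before each ingestion.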
Would you like more detailed steps on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv|CSV Enricher documentation>
<@U06TM7M40RK> This is what is in the csv-file: resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
“urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD)”,kode,“Kode for feilstatus.”, This is the copy of the column urn: urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD),kode). Please advice what is wrong
The issue is the format of the CSV line itself. Let’s break it down:
1. Column URN: The subresource field should contain only the column name, not the full column URN. The urn:li:schemaField:(...) URN you copied from DataHub is not used in the CSV file; just put kode in the subresource field.
2. Quoting: The dataset URN must be wrapped in straight double quotes because it contains commas; your line leaves it unquoted. The quotes you do have (“Kode for feilstatus.”) are curly “smart quotes”, which the CSV parser treats as ordinary characters. Replace them with plain ASCII quotes (").
3. Field Count: Your line is missing the empty placeholder columns. The description must land in the seventh (description) column, so the row needs empty fields for glossary_terms, tags, owners, ownership_type, domain, and ownership_type_urn.
4. CSV Content: Ensure your CSV file looks like this:
resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
"urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD)",kode,,,,,"Kode for feilstatus.",,
5. Ingestion Configuration: Make sure your ingestion recipe uses the csv-enricher source. Here’s an example configuration:
source:
  type: "csv-enricher"
  config:
    filename: /path/to/your/csvfile.csv
    write_semantics: PATCH
    delimiter: ","
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
6. Check Logs: After running the ingestion, check the logs for any errors or warnings that might indicate what went wrong.
If you follow these steps and ensure that the CSV file and configuration are correct, the description should be updated for the specified column. If the issue persists, the logs should contain specific error messages.
Would you like more detailed guidance on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv|CSV Enricher documentation>
<@U06TM7M40RK> It does not give an error, but nothing is being written
If the ingestion process completes without errors but the descriptions are not being updated, here are a few additional steps to troubleshoot and resolve the issue:
1. Verify CSV File Path: Double-check that the path to the CSV file in your ingestion recipe is correct and readable by the process running the ingestion.
2. Check DataHub Logs: Even without errors, the logs may contain warnings or other messages (for example, skipped rows) that explain why nothing was written. Look for anything related to the CSV Enricher.
3. Ensure Correct URN and Column Name: The dataset URN and column name (subresource) in the CSV must exactly match those in DataHub. Discrepancies in case sensitivity or naming cause the update to be silently ignored.
4. Write Semantics: Ensure that write_semantics in your recipe is set to PATCH or OVERRIDE. PATCH updates existing fields without overwriting other metadata, while OVERRIDE replaces existing metadata.
5. Check for Existing Descriptions: If descriptions already exist, make sure write_semantics allows updates; PATCH should update the existing descriptions.
6. DataHub Version Compatibility: Ensure your DataHub version supports the CSV Enricher functionality. If you are using an older version, consider upgrading.
7. Manual Verification: Verify in the DataHub UI that the dataset and column exist and are accessible. Visibility or permission issues can block updates.
8. Test with a Minimal CSV: Create a minimal CSV file with a single row and test the ingestion to isolate the issue.
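To confirm whether the enricher actually wrote anything, you can fetch the dataset's editable schema metadata directly from GMS. This is a hedged sketch: the /aspects Rest.li endpoint and the editableSchemaMetadata aspect name are my assumption from the DataHub REST API docs, and the server address is a placeholder; verify both against your deployment:

```python
from urllib.parse import quote

GMS = "http://localhost:8080"  # placeholder; use your GMS address
urn = ("urn:li:dataset:(urn:li:dataPlatform:databricks,"
       "lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD)")

# The URN must be fully URL-encoded, including ':', ',', '(' and ')'.
url = f"{GMS}/aspects/{quote(urn, safe='')}?aspect=editableSchemaMetadata&version=0"
print(url)  # fetch this with curl or requests and look for your column description
```

If the response contains no entry for the kode field, the ingestion never wrote the aspect, which points back at the CSV format rather than at the UI.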
If these steps do not resolve the issue, you might want to try running the ingestion process again and closely monitor the logs for any specific messages that could indicate what might be going wrong.
Would you like more detailed guidance on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv|CSV Enricher documentation>
<@U06TM7M40RK> If i want to update the urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD),kode) with new description, how should the csv file look like?
To update the description of a specific column using the CSV Enricher in DataHub, you need to ensure that your CSV file is correctly formatted. Here’s how your CSV file should look:
1. CSV File Structure:
- The resource column should contain the dataset URN.
- The subresource column should contain the column name you want to update.
- The description column should contain the new description you want to apply.
2. Example CSV File: For the column kode in the dataset lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn, your CSV file should look like this:
resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain,ownership_type_urn
"urn:li:dataset:(urn:li:dataPlatform:databricks,lab.jobtechdbt.jobtech__brz__jtfeilstatusnavn,PROD)",kode,,,,,"New description for kode.",,
3. Important Notes:
- Ensure the resource field contains the correct dataset URN.
- The subresource field should contain only the column name (kode in this case).
- Use double quotes around fields that contain commas or special characters.
- Save the CSV file with UTF-8 encoding to avoid parsing issues.
4. Ingestion Configuration: Ensure your ingestion recipe uses the csv-enricher source and points to the correct CSV file path:
source:
  type: "csv-enricher"
  config:
    filename: /path/to/your/csvfile.csv
    write_semantics: PATCH
    delimiter: ","
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
By following this structure, the CSV Enricher should be able to update the description for the specified column in DataHub.
Would you like more detailed guidance on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv|CSV Enricher documentation>
<@U06TM7M40RK> Still nothing. Are there any support departments that may help?
If you’re still experiencing issues with updating descriptions using the CSV Enricher, reaching out to the DataHub community and support channels can be very helpful. Here are some options:
1. DataHub Community Slack: Join the <https://datahubproject.io/docs/0.13.1/slack|DataHub Community Slack>, where you can ask questions and get help from other users and contributors.
2. GitHub Issues: Check the <https://github.com/datahub-project/datahub/issues|DataHub GitHub repository> for existing issues similar to yours, or create a new one; the community and maintainers often respond to issues reported there.
3. DataHub Documentation: Review the <https://datahubproject.io/docs/0.13.1/|DataHub documentation> for detailed guides and troubleshooting tips covering common issues and their solutions.
4. Office Hours: DataHub often hosts office hours where you can get live support from the core team. Check the Slack channel for announcements about upcoming sessions.
5. Contact Support: If you are using a managed version of DataHub or have a support contract, reach out to your provider’s support team for assistance.
These resources should help you get the support you need to resolve the issue. Would you like more information on any of these options?
Sources:
- <https://datahubproject.io/docs/0.13.1/slack|DataHub Community Slack>
- <https://datahubproject.io/docs/0.13.1/|DataHub Documentation>