<@U06TM7M40RK> How do I ingest custom metadata? For example, I want to ingest metadata for my youtube channel. I already have a script scraping this data for entities such as Playlists and Videos and saving them as a json file. But I can’t figure out how to ingest the actual data.
To ingest custom metadata into DataHub, such as metadata for your YouTube channel, you can follow these steps:
- Prepare Your Data: Ensure your scraped data is in a JSON format that DataHub can understand. This typically involves structuring your data as Metadata Change Events (MCEs) or Metadata Change Proposals (MCPs). You can refer to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/mce_files/bootstrap_mce.json|DataHub metadata format> for guidance.
- Create an Ingestion Recipe: Write a YAML configuration file (recipe) that specifies the source of the data and the destination (DataHub). Here is an example of how you might configure a file-based ingestion source:
source:
  type: "file"
  config:
    path: "./youtube_metadata.json"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080" # Replace with your DataHub instance URL
    token: "your-auth-token" # Replace with your DataHub auth token if required
- Run the Ingestion: Use the DataHub CLI to run the ingestion based on the recipe. You can execute the following command:
datahub ingest -c your_ingestion_recipe.yaml
- Verify the Ingestion: After running the ingestion, check the DataHub UI to verify that the metadata has been correctly ingested.
If your JSON file is not in the required format, you may need to transform it into the appropriate structure. You can also consider writing a custom ingestion source if your use case requires more complex transformations or if you want to automate the ingestion process further.
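As a rough illustration of that transformation, the sketch below turns a scraped list into Metadata Change Proposals without extending the metadata model. It rests on several assumptions that are not stated in this thread: each playlist is mapped onto DataHub's built-in dataset entity, `youtube` is a made-up platform name, the input records carry `id` and `name` fields, and the output follows the MCP JSON layout (`entityType`, `entityUrn`, `changeType`, `aspectName`, `aspect.json`) that DataHub's file sink writes. The helper names are hypothetical, and the field names should be checked against your DataHub version.

```python
import json

# Hypothetical platform name; "youtube" is an assumption, not a platform DataHub ships with.
PLATFORM = "youtube"


def playlist_to_mcp(playlist: dict) -> dict:
    """Map one scraped playlist onto a datasetProperties MCP (assumed layout)."""
    urn = (
        f"urn:li:dataset:(urn:li:dataPlatform:{PLATFORM},"
        f"playlist.{playlist['id']},PROD)"
    )
    return {
        "entityType": "dataset",
        "entityUrn": urn,
        "changeType": "UPSERT",
        "aspectName": "datasetProperties",
        "aspect": {
            "json": {
                "name": playlist["name"],
                # Keep the original scraped ID as a searchable custom property.
                "customProperties": {"playlist_id": str(playlist["id"])},
            }
        },
    }


def playlists_to_mcp_json(playlists: list) -> str:
    """Serialize all MCPs as a single JSON array."""
    return json.dumps([playlist_to_mcp(p) for p in playlists], indent=2)
```

Writing the string returned by `playlists_to_mcp_json(...)` to `youtube_metadata.json` would give the recipe a file to point at.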
Would you like more detailed steps or additional examples on any specific part of this process?
Sources:
- https://datahubspace.slack.com/archives/CV2KB471C/p1685096054.659819
- https://datahubspace.slack.com/archives/CUMUWQU66/p1726631040.182089
<@U06TM7M40RK> Yeah, how do I do step one? Please go into extreme detail. My data is simple; for example, a playlist essentially just contains “id” and “name” fields, e.g.
```[{
"id": "1",
"name": "Playlist 1"
}, {
"id": "2",
"name": "Playlist 2"
}]```
To transform your simple JSON data into a format that DataHub can ingest, you need to convert it into Metadata Change Events (MCEs) or Metadata Change Proposals (MCPs). Here’s a detailed step-by-step guide on how to achieve this:
Step 1: Understand the DataHub Metadata Format
DataHub uses a specific format for metadata ingestion, typically involving MCEs or MCPs. These formats include details about the entity type, the unique identifier (URN), and the aspects of the entity.
Step 2: Define the Entity URN
For each playlist, you need to define a unique URN. A URN in DataHub is a unique identifier for an entity. For example, you can construct a URN for a playlist like this:
urn:li:playlist:(youtube,<playlist_id>)
For your data, it would look like:
urn:li:playlist:(youtube,1)
urn:li:playlist:(youtube,2)
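The URN pattern above is plain string formatting, so it can be generated from the scraped IDs. A minimal sketch, assuming the `playlist` entity type and `youtube` prefix from the example above; `make_playlist_urn` is a hypothetical helper name, and DataHub would likely only accept such URNs once a matching entity type exists in its metadata model.

```python
def make_playlist_urn(playlist_id: str, platform: str = "youtube") -> str:
    """Build a URN following the pattern urn:li:playlist:(<platform>,<playlist_id>)."""
    if not playlist_id:
        raise ValueError("playlist_id must be non-empty")
    return f"urn:li:playlist:({platform},{playlist_id})"


urns = [make_playlist_urn(pid) for pid in ["1", "2"]]
# urns == ["urn:li:playlist:(youtube,1)", "urn:li:playlist:(youtube,2)"]
```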
Step 3: Create Metadata Change Events (MCEs)
You need to create MCEs for each playlist. An MCE is a JSON object that describes a change to an entity’s metadata. Here’s how you can structure your data:
[
  {
    "proposedSnapshot": {
      "com.linkedin.pegasus2avro.metadata.snapshot.PlaylistSnapshot": {
        "urn": "urn:li:playlist:(youtube,1)",
        "aspects": [
          {
            "com.linkedin.pegasus2avro.common.Name": {
              "name": "Playlist 1"
            }
          }
        ]
      }
    }
  },
  {
    "proposedSnapshot": {
      "com.linkedin.pegasus2avro.metadata.snapshot.PlaylistSnapshot": {
        "urn": "urn:li:playlist:(youtube,2)",
        "aspects": [
          {
            "com.linkedin.pegasus2avro.common.Name": {
              "name": "Playlist 2"
            }
          }
        ]
      }
    }
  }
]
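Writing this structure by hand does not scale past a few playlists, so it can be generated from the scraped list. The sketch below only reproduces the JSON shape shown above; `PlaylistSnapshot` and `common.Name` are the type names used in this example and are not confirmed to exist in DataHub, and `build_mces` is a hypothetical helper.

```python
import json

# Type names copied from the example above; treat them as placeholders.
SNAPSHOT_TYPE = "com.linkedin.pegasus2avro.metadata.snapshot.PlaylistSnapshot"
NAME_ASPECT = "com.linkedin.pegasus2avro.common.Name"


def build_mces(playlists: list) -> list:
    """Wrap each {"id", "name"} record in the proposedSnapshot structure."""
    return [
        {
            "proposedSnapshot": {
                SNAPSHOT_TYPE: {
                    "urn": f"urn:li:playlist:(youtube,{p['id']})",
                    "aspects": [{NAME_ASPECT: {"name": p["name"]}}],
                }
            }
        }
        for p in playlists
    ]


playlists = [
    {"id": "1", "name": "Playlist 1"},
    {"id": "2", "name": "Playlist 2"},
]
mces = build_mces(playlists)
print(json.dumps(mces, indent=2))
```

The printed JSON matches the hand-written structure, so the same list could be passed to `json.dump` to produce the file.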
Step 4: Save the MCEs to a JSON File
Save the above JSON structure to a file, for example, youtube_playlists_mce.json.
Step 5: Prepare the Ingestion Recipe
Create a YAML ingestion recipe that points to your JSON file:
source:
  type: "file"
  config:
    path: "./youtube_playlists_mce.json"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080" # Replace with your DataHub instance URL
    token: "your-auth-token" # Replace with your DataHub auth token if required
Step 6: Run the Ingestion
Use the DataHub CLI to ingest the data:
datahub ingest -c your_ingestion_recipe.yaml
This process will ingest your playlist data into DataHub, allowing you to view and manage it through the DataHub UI.
Would you like more detailed steps or further assistance on any specific part of the process?
Sources:
- https://github.com/datahub-project/datahub/blob/master/docs/modeling/extending-the-metadata-model.md
- https://datahubspace.slack.com/archives/CUMUWQU66/p1726631040.182089
<@U06TM7M40RK> that didn’t work. Can you explain what some of the parts are doing?
```"proposedSnapshot"
"com.linkedin.pegasus2avro.metadata.snapshot.PlaylistSnapshot"```
These don't make sense to me.
Certainly! Let’s break down the components of the Metadata Change Event (MCE) structure, specifically focusing on the "proposedSnapshot" and "com.linkedin.pegasus2avro.metadata.snapshot.PlaylistSnapshot" parts.
Understanding the MCE Structure
- proposedSnapshot: This is a key in the MCE JSON structure that indicates a proposed change to the metadata of an entity. It contains the details of the entity and the aspects of metadata that are being proposed for change.
- com.linkedin.pegasus2avro.metadata.snapshot.PlaylistSnapshot: This is a placeholder for the specific type of snapshot you are working with. In DataHub, a snapshot represents the current state of one or more metadata aspects associated with a particular entity.
  - The PlaylistSnapshot is a hypothetical example here. In practice, you would replace this with the actual snapshot type that corresponds to your entity. For example, if you are working with datasets, you might see something like DatasetSnapshot.
Customizing for Your Use Case
Since you are working with custom metadata for YouTube playlists, you will need to define a custom snapshot type that represents your playlists. This involves extending the DataHub metadata model to include a new entity type for playlists.
Steps to Define a Custom Snapshot
- Define a New Entity Type:
  - You need to define a new entity type in DataHub for your playlists. This involves creating a new PDL (Pegasus Data Language) schema that describes the entity and its aspects.
- Create a Snapshot for the Entity:
  - Define a snapshot for your new entity type. This snapshot will include the URN and the aspects of the entity.
- Update the MCE Structure:
  - Once you have defined your custom entity and snapshot, update the MCE structure to use your custom snapshot type.
Example of a Custom Snapshot
Here’s a simplified example of what a custom snapshot might look like for a playlist:
{
  "proposedSnapshot": {
    "com.example.metadata.snapshot.PlaylistSnapshot": {
      "urn": "urn:li:playlist:(youtube,1)",
      "aspects": [
        {
          "com.example.metadata.common.Name": {
            "name": "Playlist 1"
          }
        }
      ]
    }
  }
}
In this example:
- com.example.metadata.snapshot.PlaylistSnapshot is a custom snapshot type you would define.
- com.example.metadata.common.Name is a custom aspect that represents the name of the playlist.
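The nesting described above can also be read off programmatically: the outer key is `proposedSnapshot`, its single child key names the snapshot type, and each entry in `aspects` is a one-key object naming the aspect type. A small sketch that uses only the example JSON from this thread; `describe_mce` is a hypothetical helper, not part of DataHub.

```python
def describe_mce(mce: dict) -> dict:
    """Pull the snapshot type, URN, and aspect type names out of one MCE dict."""
    snapshot = mce["proposedSnapshot"]
    # The wrapper must hold exactly one key: the fully qualified snapshot type.
    [(snapshot_type, body)] = snapshot.items()
    # Each aspect is itself a one-key object whose key is the aspect type.
    aspect_types = [next(iter(aspect)) for aspect in body["aspects"]]
    return {
        "snapshot_type": snapshot_type,
        "urn": body["urn"],
        "aspect_types": aspect_types,
    }


example = {
    "proposedSnapshot": {
        "com.example.metadata.snapshot.PlaylistSnapshot": {
            "urn": "urn:li:playlist:(youtube,1)",
            "aspects": [{"com.example.metadata.common.Name": {"name": "Playlist 1"}}],
        }
    }
}
summary = describe_mce(example)
# summary["snapshot_type"] == "com.example.metadata.snapshot.PlaylistSnapshot"
# summary["aspect_types"] == ["com.example.metadata.common.Name"]
```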
Next Steps
To proceed, you will need to define your custom entity and snapshot types in DataHub. This involves modifying the DataHub metadata model, which can be a complex task. If you need further guidance on how to extend the DataHub metadata model, I can provide more detailed instructions.
Would you like more detailed steps on how to define a custom entity and snapshot in DataHub?
Sources:
- https://datahubspace.slack.com/archives/CUMUWQU66/p1717519586.804389
- https://github.com/datahub-project/datahub/issues/1717
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><
I wasn’t able to figure this out. It seems like a complex task and I’m too new to DataHub to understand what’s going on. Quoting you:
> To proceed, you will need to define your custom entity and snapshot types in DataHub. This involves modifying the DataHub metadata model, which can be a complex task.