Exploring DataHub Integration with Stitch for Data Lineage with Postgres and Snowflake Data Movement

Original Slack Thread

I am new to DataHub and am exploring its capabilities against the tools used in our data ecosystem. One of those tools is Stitch (from Talend), which we use to ingest data from Postgres into our data warehouse (Snowflake). We want to build a data catalog for all our DBs and tables, along with data lineage showing which tables are fetched from Postgres into Snowflake.
Can you please suggest how I can use DataHub to ingest the records from Stitch? This would be a custom dataset that we are trying to ingest into DataHub, and I am looking for the steps and pointers needed to accomplish this.
Any suggestions are highly appreciated.

Hey Dev, welcome to DataHub. We don’t yet have a Stitch connector, but this request has come up in the community quite often. Do you mind sharing what version of Stitch you are using and what the official API of that version provides?

We use Stitch as a SaaS product. I assume we are on the latest version, but I need to check whether a version number is exposed.
The API endpoint, however, is: https://api.stitchdata.com/v4/

Any pointers/steps/examples for setting up a new dataset in DataHub would greatly help. I appreciate your/DataHub’s help with this.

I would take a look at how the Fivetran integration was built (https://github.com/datahub-project/datahub/tree/master/metadata-ingestion/src/datahub/ingestion/source/fivetran)

and you should be able to use the same concepts to represent Stitch pipelines

Typically you would integrate with the API to extract each pipeline and its configuration, along with lineage to its source and destination datasets, and then emit the relevant entities (DataFlow, DataJob) and aspects (dataJobInputOutput, etc.).
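
A rough sketch of that shape, using only the Python standard library so it runs without DataHub installed. It is not the Fivetran source's code. The `StitchStream` record, the `build_lineage` helper, the pipeline ID, and the table names are all hypothetical. How the Stitch API returns table mappings is not covered in this thread, so the streams are hard-coded here. The URN strings follow DataHub's usual layout for datasets, DataFlows, and DataJobs. A real source would more likely use the builder helpers in the DataHub Python SDK instead of formatting strings by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StitchStream:
    """Hypothetical record for one table replicated by a Stitch pipeline."""
    source_table: str       # e.g. "appdb.public.orders" in Postgres
    destination_table: str  # e.g. "analytics.public.orders" in Snowflake


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # Dataset URN: platform, fully qualified table name, environment.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def data_flow_urn(orchestrator: str, flow_id: str, cluster: str = "PROD") -> str:
    # A DataFlow stands for the pipeline as a whole.
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def data_job_urn(flow_urn: str, job_id: str) -> str:
    # A DataJob is one unit of work inside a DataFlow.
    return f"urn:li:dataJob:({flow_urn},{job_id})"


def build_lineage(pipeline_id: str, streams: List[StitchStream]) -> List[Dict]:
    """Build one dataJobInputOutput-style record per replicated table.

    Assumption: each replicated table is modeled as its own DataJob, and
    "stitch" is used as the orchestrator name.
    """
    flow = data_flow_urn("stitch", pipeline_id)
    records = []
    for stream in streams:
        records.append({
            "entityUrn": data_job_urn(flow, stream.source_table),
            "aspectName": "dataJobInputOutput",
            "aspect": {
                "inputDatasets": [dataset_urn("postgres", stream.source_table)],
                "outputDatasets": [dataset_urn("snowflake", stream.destination_table)],
            },
        })
    return records


if __name__ == "__main__":
    demo = [StitchStream("appdb.public.orders", "analytics.public.orders")]
    for record in build_lineage("pipeline_1", demo):
        print(record["entityUrn"])
        print("  in: ", record["aspect"]["inputDatasets"])
        print("  out:", record["aspect"]["outputDatasets"])
```

Modeling one DataJob per table keeps the lineage edges table-to-table, which matches the goal of showing which Postgres tables land in Snowflake. The dataset URNs only link up in the lineage graph if they match the names produced when Postgres and Snowflake themselves are ingested.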

Thanks for the input. So does this mean I need to fork the repo, make a source similar to Fivetran, and use it for our purpose?

Pretty much…