Full Report
Data engineers, do you struggle to manage your data pipelines?

As data pipelines become more complex and involve multiple team members, it can be challenging to keep track of changes, collaborate effectively, and deploy pipelines to different environments in a controlled manner.

In this post, we discuss how data engineers can manage their Cloud Data Fusion pipelines across instances and namespaces using the pipeline Git integration feature.

Background & Overview

As enterprises modernize their businesses through digital transformation, they face the challenge of adapting to the volume, velocity, and veracity of data. The only real way to address this challenge is to iterate fast: delivering data projects sustainably, predictably, and quickly. To help customers achieve this goal, Cloud Data Fusion now supports iterative data pipeline design and team-based development with version control system (VCS) integration. This post focuses mainly on the latter, the Git integration feature. To learn more about iterative data pipeline design in Cloud Data Fusion, see Edit Pipelines.

Many developer tools now integrate with version control systems. Such integration improves development efficiency, supports CI/CD, and facilitates team collaboration. With the pipeline Git integration feature in Cloud Data Fusion, ETL developers can manage pipelines using GitHub and implement proper development processes, such as code reviews and the promotion or demotion of pipelines between environments. Below we showcase these user journeys.

Before you begin

- You need a Cloud Data Fusion instance at version 6.9.1 or above.
- Only GitHub is supported as the Git hosting provider.
- Currently, Cloud Data Fusion supports only personal access token (PAT) authentication.
To create a PAT with limited permissions to read and write to the Git repository, see Creating a fine-grained personal access token.

To connect to a GitHub server from a private Cloud Data Fusion instance, you must configure network settings for public source access. For more information, see Create a private instance and Connect to a public source from a private instance.

Link a GitHub repository

The first step is to link the GitHub repository, which can be newly created or an existing one. Cloud Data Fusion lets you link a GitHub repository with a namespace. Once the repository is linked with a namespace, you can push deployed pipelines from the namespace to the repository, or pull and deploy pipelines from the repository to the namespace.

To link a GitHub repository, follow these steps:

1. In the Cloud Data Fusion web interface, click the hamburger Menu > Namespace Admin.
2. On the Namespace Admin page, click the Source Control Management tab.
3. Click Link Repository and fill in the following fields:
   - Repository URL (required)
   - Default branch (optional)
   - Path prefix (optional)
   - Authentication type (optional)
   - Token name (required)
   - Token (required)
   - User name (optional)
   To verify the configuration, click the VALIDATE button; you should see a green banner indicating a valid GitHub connection. Click the SAVE & CLOSE button to save the configuration.
4. You can edit or delete the configuration later as needed. Unlinking the repository from a namespace does not delete the configurations present in GitHub.

Use Case: Use a linked GitHub repository to manage pipelines across instances and namespaces

Imagine Bob is an IT admin at an ecommerce company. The team has already built several data pipelines. Recently the company created a new Cloud Data Fusion instance for a newly opened company branch. Bob wants to replicate the existing pipelines from the existing instance to the new instance. In the past, Bob had to manually export and import those pipelines.
Doing so was cumbersome and error-prone. With the Git integration, let’s see how Bob’s workflow has improved.

Pushing pipelines to GitHub repository

On the same Source Control Management page, after completing the above configuration, Bob can see the configured repository.

To view the deployed pipelines in the current namespace, Bob clicks SYNC PIPELINES. Then, to push the DataFusionQuickStart pipeline config to the linked repository, Bob selects the PUSH TO REMOTE checkbox by the pipeline. A dialog appears where Bob enters a commit message and clicks Push. Bob can then see that the pipeline was pushed successfully.

Bob can now switch to the GitHub repository page and check the pushed pipeline configuration JSON file. Similarly, to see details about the pipeline that was pushed, Bob goes to the REMOTE PIPELINES tab in Cloud Data Fusion.

Pulling pipelines from linked repository

To initialize a new instance with existing pipelines, Bob can link the same repository to a namespace in the new instance.

To deploy the pipelines that Bob pushed to the linked GitHub repository, Bob opens the Source Control Management page, clicks the SYNC PIPELINES button, and switches to the REMOTE PIPELINES tab. Bob can then choose the pipeline of interest and click PULL TO NAMESPACE. In the LOCAL PIPELINES tab, Bob can see the newly deployed pipeline. The new pipeline also appears on the deployed pipeline list page.

Use case: Team-based development

Billie built several pipelines to perform data analytics in different environments, such as test, staging, and prod. One of their pipelines classifies on-time and delayed orders, based on whether shipping takes more than two days. Due to the increased number of customer orders during Black Friday, Billie just received a change request from the business to temporarily increase the expected delivery time. Billie could edit the pipeline and modify it iteratively to find a suitable increased time.
But Billie doesn’t want to risk deploying the new changes to the prod environment without fully testing them. Previously, there was no easy way to carry the changes from the testing environment to staging and finally to prod. With Git integration, let’s see how Billie can solve this problem.

Edit the pipeline

1. Billie opens the deployed pipeline in the Studio page.
2. Billie clicks the cog icon on the right side of the top bar.
3. In the drop-down, Billie selects Edit, which starts the edit flow.
4. Billie makes the necessary changes in the plugin config. Once done, Billie clicks Deploy, and a new version of the pipeline is deployed.

To learn more about iterative data pipeline design in Cloud Data Fusion, see Edit Pipelines.

Push the latest pipeline version to Git

The namespace was already linked with a Git repository, and a previous version of the pipeline had already been pushed. Billie clicks the cog icon on the right side of the top bar and, in the drop-down, selects Push to remote. A dialog prompts for a commit message. Once it is confirmed, the push begins. On success, Billie sees a green banner at the top. Billie can then check in GitHub that the new pipeline config has been synced.

Merge the changes to main

For a proper review flow, we suggest using different branches for different environments. Billie can push the changes to a development branch and then create a pull request to merge the changes into the main branch.

Pulling the latest pipeline version from GitHub

The production namespace has been linked with the main branch of the Git repository, and an older version of the pipeline already exists there. Billie clicks the cog icon on the right side of the top bar and, in the drop-down, selects Pull to namespace. The pull takes some time to complete because it also deploys the new version of the pipeline. Once the pull succeeds, Billie can click the history button in the top bar and see that a new version has been deployed.
Billie can now verify the change in the plugin config.

The steps above show how Billie carries new pipeline changes from the testing environment through to prod with the Git integration feature. To learn more about Cloud Data Fusion features, visit https://cloud.google.com/data-fusion.
Analysis Summary
# Best Practices: Secure Git Integration for Cloud Data Fusion
## Overview
These practices address the security and governance challenges of managing data pipelines as code. By integrating Cloud Data Fusion with GitHub, organizations can move away from manual, error-prone exports toward a structured Version Control System (VCS) that supports peer reviews, environment isolation, and automated deployment auditing.
## Key Recommendations
### Immediate Actions
1. **Implement Fine-Grained PATs:** Only use GitHub "Fine-grained personal access tokens." Configure them with the absolute minimum scope required (read/write access only to specific repositories).
2. **Verify Private Connectivity:** If using a private Cloud Data Fusion instance, configure network settings for public source access so the instance can reach GitHub, and restrict outbound traffic (NAT gateway or proxy rules) to GitHub’s published IP ranges to limit the risk of data exfiltration.
3. **Upgrade Instances:** Ensure all instances run version 6.9.1 or above, the minimum version that supports pipeline Git integration.
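
The version requirement can be checked mechanically across an inventory of instances. The following is a minimal sketch, assuming versions are plain dotted numbers such as `6.9.1`; the inventory dictionary and function names are hypothetical and not part of any Cloud Data Fusion API.

```python
from typing import Dict, List, Tuple

# Minimum version that supports pipeline Git integration (from the prerequisites).
MIN_GIT_INTEGRATION_VERSION: Tuple[int, ...] = (6, 9, 1)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '6.9.1' into a comparable tuple (6, 9, 1)."""
    return tuple(int(part) for part in version.strip().split("."))


def supports_git_integration(version: str) -> bool:
    """Return True if the version is 6.9.1 or above."""
    return parse_version(version) >= MIN_GIT_INTEGRATION_VERSION


def instances_needing_upgrade(inventory: Dict[str, str]) -> List[str]:
    """Return the names of instances below the minimum version, sorted by name.

    `inventory` is a hypothetical mapping of instance name to version string.
    """
    return sorted(
        name
        for name, version in inventory.items()
        if not supports_git_integration(version)
    )
```

Comparing integer tuples rather than raw strings avoids ranking `6.10.0` below `6.9.1`.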
### Short-term Improvements (1-3 months)
1. **Branch-Based Environment Isolation:** Map specific namespaces to specific branches (e.g., `dev` namespace to `develop` branch; `prod` namespace to `main` branch).
2. **Mandatory Code Reviews:** Enable GitHub "Branch Protection Rules" on the `main` branch to require at least one pull request review before changes can be merged and pulled into production.
3. **Audit Logs:** Regularly review Data Fusion and GitHub audit logs to monitor who is pushing/pulling pipeline configurations.
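
The namespace-to-branch mapping above can be kept honest with a small periodic check. This is a sketch under stated assumptions: the `linked` dictionary stands in for whatever record your team keeps of each namespace's configured default branch, and the expected mapping simply mirrors the `dev`/`develop` and `prod`/`main` example.

```python
from typing import Dict, List

# Expected mapping, mirroring the example above; adjust to your own environments.
EXPECTED_BRANCHES: Dict[str, str] = {"dev": "develop", "prod": "main"}


def find_branch_mismatches(
    linked: Dict[str, str],
    expected: Dict[str, str] = EXPECTED_BRANCHES,
) -> List[str]:
    """Compare the branch each namespace is linked to against the expected one.

    `linked` is a hypothetical mapping of namespace name to linked branch.
    Returns one human-readable message per problem, ordered by namespace.
    """
    problems: List[str] = []
    for namespace, branch in sorted(expected.items()):
        actual = linked.get(namespace)
        if actual is None:
            problems.append(f"{namespace}: no repository linked")
        elif actual != branch:
            problems.append(
                f"{namespace}: linked to '{actual}', expected '{branch}'"
            )
    return problems
```

An empty result means every listed namespace points at its intended branch.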
### Long-term Strategy (3+ months)
1. **Automated CI/CD Integration:** Transition from manual "Push/Pull" actions within the UI to automated pipelines that validate JSON configurations against security schemas before allowing merges.
2. **Secrets Management Integration:** Ensure that no sensitive credentials (API keys, DB passwords) are stored in the pipeline JSON files pushed to Git; use Data Fusion macros and secret managers instead.
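
A pre-merge check for literal credentials might look like the sketch below. It walks any JSON document and flags string values stored under sensitive-looking keys unless the value is a macro of the form `${...}` (for example `${secure(key)}`). The key patterns are assumptions to tune for your own plugins, and the check is a heuristic rather than a complete secret scanner.

```python
import json
import re
from typing import Any, Iterator, List, Tuple

# Assumed naming patterns for sensitive properties; extend as needed.
SENSITIVE_KEY = re.compile(r"password|secret|token|api[_-]?key", re.IGNORECASE)
# A value that consists entirely of a macro reference, e.g. ${secure(db-pass)}.
MACRO = re.compile(r"^\$\{.+\}$")


def walk(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every leaf value in a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, f"{path}/{index}")
    else:
        yield path, node


def find_literal_secrets(pipeline_json: str) -> List[str]:
    """Return paths of sensitive-looking keys whose value is not a macro."""
    findings: List[str] = []
    for path, value in walk(json.loads(pipeline_json)):
        key = path.rsplit("/", 1)[-1]
        if (
            SENSITIVE_KEY.search(key)
            and isinstance(value, str)
            and value
            and not MACRO.match(value.strip())
        ):
            findings.append(path)
    return findings
```

A CI job could run this over each changed pipeline file and fail the pull request when the returned list is not empty.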
## Implementation Guidance
### For Small Organizations
- Use a single repository with folders for different pipelines.
- Rely on Fine-grained PATs tied to a dedicated machine-user GitHub account (a "service account" style user) rather than a personal employee account.
### For Medium Organizations
- Implement a 2-tier environment strategy (Staging and Production).
- Use Branch Protection Rules to ensure the IT Admin is the only one authorized to merge to the `main` branch.
### For Large Enterprises
- Use separate GitHub repositories for different business units to ensure data pipeline isolation.
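- Run private instances and route GitHub traffic through a controlled egress path (for example, Cloud NAT configured for public source access) so outbound connections are restricted and auditable. GitHub is a public source, so this limits exposure rather than keeping traffic off the public internet.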
## Configuration Examples
### Secure Repository Linking
When linking your repository in **Namespace Admin > Source Control Management**, use the following security-conscious settings:
- **Repository URL:** Use the HTTPS URL of the specific repository.
- **Authentication Type:** Personal Access Token (PAT).
- **Token:** *[Use a Fine-grained PAT with 1-year max expiration]*
- **Path Prefix:** Use this to isolate pipeline files within a subdirectory (e.g., `/project-alpha/pipelines/`) to prevent repository clutter.
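
Before entering these values in the form, they can be sanity-checked with a short script. The sketch below assumes a hypothetical settings dictionary whose keys mirror the form fields; it is not a Cloud Data Fusion API. It checks the fields marked as required in the linking form, the HTTPS scheme, and the `github_pat_` prefix that GitHub uses for fine-grained tokens.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Fields marked as required in the Link Repository form (hypothetical key names).
REQUIRED_FIELDS = ("repository_url", "token_name", "token")


def check_link_settings(settings: Dict[str, str]) -> List[str]:
    """Return a list of problems found in the proposed link settings."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not settings.get(field)
    ]

    url = settings.get("repository_url", "")
    if url and urlparse(url).scheme != "https":
        problems.append("repository_url should use HTTPS")

    token = settings.get("token", "")
    # Fine-grained PATs start with 'github_pat_'; classic tokens start with 'ghp_'.
    if token and not token.startswith("github_pat_"):
        problems.append("token does not look like a fine-grained PAT")

    return problems
```

Keep the real token out of scripts and shell history; read it from a secret store or an interactive prompt when running a check like this.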
## Compliance Alignment
- **NIST SP 800-53:** Aligns with Configuration Management (CM) and System and Information Integrity (SI) controls.
- **CIS Google Cloud Platform Foundation Benchmark:** Supports the benchmark's logging, monitoring, and IAM recommendations as they apply to Data Fusion projects.
- **SOC 2:** Facilitates "Change Management" criteria by providing a verifiable audit trail of pipeline modifications.
## Common Pitfalls to Avoid
- **Hardcoding PATs:** Never store the Personal Access Token in documentation or shared notes.
- **Over-permissioning:** Avoid using "Classic" PATs with `repo` (full control) scope; use "Fine-grained" tokens instead.
- **Direct Link to Production:** Never link the developer's "Experimental" branch directly to the Production namespace.
- **Manual Overwrites:** Avoid making direct edits in the Production UI; always pull from the `main` branch so that the VCS remains the "Source of Truth."
## Resources
- **Google Cloud Data Fusion Documentation:** `https://cloud.google.com/data-fusion/docs`
- **GitHub PAT Documentation:** `https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens`
- **Private Instance Networking:** `https://cloud.google.com/data-fusion/docs/how-to/create-private-ip`