Full Report
Google Cloud’s Dataproc lets you run native Apache Spark and Hadoop clusters on Google Cloud in a simpler, more cost-effective way. In this blog, we will talk about our newest optional components available in Dataproc’s Component Exchange: Docker and Apache Flink.

Docker container on Dataproc

Docker is a widely used container technology. Now that it is a Dataproc optional component, Docker daemons can be installed on every node of the Dataproc cluster. This gives you the ability to install containerized applications and interact with Hadoop clusters easily on the cluster. In addition, Docker is critical to supporting these features:

- Running containers with YARN
- Portable Apache Beam jobs

Running containers on YARN allows you to manage dependencies of your YARN application separately, and also allows you to create containerized services on YARN. Portable Apache Beam packages jobs into Docker containers and submits them to the Flink cluster. Find more detail about Beam portability. The Docker optional component is also configured to use Google Container Registry, in addition to the default Docker registry. This lets you use container images managed by your organization.

Here is how to create a Dataproc cluster with the Docker optional component:

    gcloud beta dataproc clusters create <cluster-name> \
        --optional-components=DOCKER \
        --image-version=1.5

When you run the Docker application, the log will be streamed to Cloud Logging, using the gcplogs driver.

If your application does not depend on any Hadoop services, check out Kubernetes and Google Kubernetes Engine to run containers natively. For more on using Dataproc, check out our documentation.

Apache Flink on Dataproc

Among streaming analytics technologies, Apache Beam and Apache Flink stand out. Apache Flink is a distributed processing engine using stateful computation. Apache Beam is a unified model for defining batch and streaming processing pipelines.
Using Apache Flink as an execution engine, you can also run Apache Beam jobs on Dataproc, in addition to Google’s Cloud Dataflow service.

Flink and running Beam on Flink are suitable for large-scale, continuous jobs, and provide:

- A streaming-first runtime that supports both batch processing and data streaming programs
- A runtime that supports very high throughput and low event latency at the same time
- Fault-tolerance with exactly-once processing guarantees
- Natural back-pressure in streaming programs
- Custom memory management for efficient and robust switching between in-memory and out-of-core data processing algorithms
- Integration with YARN and other components of the Apache Hadoop ecosystem

Our Dataproc team here at Google Cloud recently announced that Flink Operator on Kubernetes is now available. It allows you to run Apache Flink jobs in Kubernetes, reducing platform dependency and improving hardware efficiency.

Basic Flink Concepts

A Flink cluster consists of a Flink JobManager and a set of Flink TaskManagers. Like similar roles in other distributed systems such as YARN, the JobManager has responsibilities such as accepting jobs, managing resources and supervising jobs. TaskManagers are responsible for running the actual tasks. When running Flink on Dataproc, we use YARN as the resource manager for Flink. You can run Flink jobs in two ways: job cluster and session cluster. For a job cluster, YARN will create a JobManager and TaskManagers for the job and will destroy the cluster once the job is finished. For a session cluster, YARN will create a JobManager and a few TaskManagers. The cluster can serve multiple jobs until it is shut down by the user.

How to create a cluster with Flink

Use this command to get started:

    gcloud beta dataproc clusters create <cluster-name> \
        --optional-components=FLINK \
        --image-version=1.5

How to run a Flink job

After a Dataproc cluster with Flink starts, you can submit your Flink jobs to YARN directly using the Flink job cluster.
After accepting the job, Flink will start a JobManager and slots for this job in YARN. The Flink job will run in the YARN cluster until finished. The JobManager created will then be shut down. Job logs will be available in regular YARN logs. Try this command to run a word-counting example:

    [command not preserved in the source]

The Dataproc cluster will not start a Flink session cluster by default. Instead, Dataproc will create the script `/usr/bin/flink-yarn-daemon`, which will start a Flink session. If you want to start a Flink session when the Dataproc cluster is created, use the metadata key to allow it:

    [...] \
        --optional-components=FLINK \
        --image-version=1.5 \
        --metadata flink-start-yarn-session=true

If you want to start the Flink session after the Dataproc cluster is created, you can run the following command on the master node:

    [command not preserved in the source]

Submit jobs to that session cluster. You’ll need to get the Flink JobManager URL:

    [...] : /usr/lib/flink/examples/batch/WordCount.jar

How to run a Java Beam job

It is very easy to run an Apache Beam job written in Java. As long as you package your Beam jobs into a JAR file, you do not need to configure anything extra to run Beam on Flink. This is the command you can use:

    [command not preserved in the source]

How to run a Beam job written in Python

Beam jobs written in Python use a different execution model. To run them in Flink on Dataproc, you will also need to enable the Docker optional component. Here’s how to create a cluster:

    [...] \
        --optional-components=FLINK,DOCKER

You will also need to install the Python libraries needed by Beam, such as apache_beam and apache_beam[gcp]. You can pass in a Flink master URL to let the job run in a session cluster.
If you leave the URL out, you need to use the job cluster mode to run this job:

    [command not preserved in the source]

After you’ve written your Python job, simply run it to submit:

    [command not preserved in the source]

Learn more about Dataproc.
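The Python submission snippets above did not survive, so the following is only a rough sketch of how the options for such a job might be assembled. The flag names (`--runner=FlinkRunner`, `--flink_master`, `--environment_type=DOCKER`) are assumptions taken from Apache Beam's portable Flink runner, not from this post, and `beam_flink_args` and the host name are hypothetical.

```python
from typing import List, Optional


def beam_flink_args(flink_master: Optional[str] = None) -> List[str]:
    """Assemble option strings for a Python Beam job targeting Flink.

    The flag names are assumptions based on Beam's portable Flink runner;
    the post's own snippet is not available.
    """
    args = [
        "--runner=FlinkRunner",       # assumed: name of Beam's Flink runner
        "--environment_type=DOCKER",  # assumed: run the Python SDK harness in Docker
    ]
    if flink_master:
        # Session cluster: point the job at an existing JobManager address.
        args.append(f"--flink_master={flink_master}")
    # Without a master URL the post says job cluster mode must be used instead.
    return args


if __name__ == "__main__":
    # "jobmanager-host:8081" is a hypothetical address.
    print(beam_flink_args("jobmanager-host:8081"))
    print(beam_flink_args())
```

In a real job these strings would typically be passed to `apache_beam.options.pipeline_options.PipelineOptions` before constructing the pipeline. Building them as a plain list keeps the sketch runnable without Beam installed.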
Analysis Summary
# Industry News: Google Cloud Deepens Data Analytics Ecosystem with Docker and Apache Flink on Dataproc
## Summary
Google Cloud enhanced its Dataproc service by introducing optional components for **Docker** and **Apache Flink**, significantly expanding its capabilities for running contemporary, containerized, and real-time data workloads natively alongside existing Spark/Hadoop clusters. This move directly addresses customer demand for greater dependency management flexibility and advanced streaming analytics within a cost-effective managed service framework.
## Key Details
- Date: October 15, 2020 (based on article publication date)
- Companies Involved: Google Cloud
- Category: Product launch | Feature Update
## The Story
Google Cloud Dataproc, Google's managed service for Apache Spark and Hadoop, gained two major optional components: Docker and Apache Flink. The **Docker** integration allows organizations to install Docker daemons on cluster nodes, enabling the execution of containerized applications, managing application dependencies separately via YARN, and streamlining the packaging of Apache Beam jobs. The **Apache Flink** integration positions Dataproc as a viable platform for stateful, high-throughput, low-latency stream and batch processing workloads, allowing users to run Apache Beam jobs on Flink directly on Dataproc, complementing Google's existing Cloud Dataflow service. This update gives users a choice of deployment mode: ephemeral (job) clusters that exist only for the duration of one job, or persistent (session) clusters that serve multiple jobs, both managed through YARN.
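The difference between the two deployment modes can be illustrated with a toy model. This is a sketch only: `ToyYarn` is a hypothetical stand-in that records cluster lifecycles, and it does not reflect how YARN or Flink are actually implemented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyYarn:
    """Hypothetical stand-in for YARN that only tracks which Flink clusters exist."""
    live_clusters: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def start(self, name: str) -> None:
        self.live_clusters.append(name)
        self.log.append(f"start {name}")

    def stop(self, name: str) -> None:
        self.live_clusters.remove(name)
        self.log.append(f"stop {name}")


def run_job_cluster(yarn: ToyYarn, job: str) -> None:
    """Job mode: a dedicated cluster is created for the job and destroyed afterwards."""
    cluster = f"job-cluster[{job}]"
    yarn.start(cluster)
    yarn.log.append(f"run {job} on {cluster}")
    yarn.stop(cluster)


def run_on_session(yarn: ToyYarn, session: str, jobs: List[str]) -> None:
    """Session mode: jobs share one cluster that stays up until the user stops it."""
    for job in jobs:
        yarn.log.append(f"run {job} on {session}")


if __name__ == "__main__":
    yarn = ToyYarn()
    run_job_cluster(yarn, "wordcount")           # cluster is gone once the job ends
    yarn.start("session")                        # user starts a session cluster
    run_on_session(yarn, "session", ["a", "b"])  # several jobs reuse it
    yarn.stop("session")                         # user shuts it down explicitly
    print("\n".join(yarn.log))
```

The log shows one start/stop pair per job in job mode and a single pair around all jobs in session mode, which is the operational trade-off the story describes.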
## Business Impact
### For the Companies Involved
- **Google Cloud:** Reinforces Dataproc’s value proposition against competing managed Hadoop/Spark services by integrating modern DevOps (Docker) and advanced stream processing (Flink) capabilities directly into the platform, boosting feature parity and developer appeal. It also drives usage of associated services like Google Container Registry (GCR) and Cloud Logging.
### For Competitors
- Competitors in the managed Big Data space (e.g., AWS EMR, Azure HDInsight) face pressure to ensure similar levels of integration for Flink and container support, especially for portability demands driven by Apache Beam adoption.
### For Customers
- **Flexibility and Portability:** Customers gain the ability to use Docker for dependency isolation, critical for complex application stacks, and can now run high-performance Flink jobs without migrating entirely to alternative platforms like GKE or Dataflow for their streaming needs.
- **Cost Optimization:** Running streaming workloads on Dataproc may offer a more cost-effective alternative for organizations heavily invested in the Hadoop ecosystem compared to dedicated services.
### For the Market
- The addition signals a maturation of managed open-source services, where "simple and cost-effective" infrastructure must now fully support containerization and leading-edge stream processing paradigms.
## Technical Implications
The integration layers Flink on top of YARN, which acts as the resource manager within Dataproc. The Docker component is crucial because it underpins portable Apache Beam execution on Flink: Python Beam jobs run by being packaged as containers. Streaming container logs to Cloud Logging via the `gcplogs` driver centralizes monitoring.
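To make the component combination concrete, here is a minimal sketch that assembles a cluster-creation command from the flags quoted in the report (`--optional-components`, `--image-version`, `--metadata flink-start-yarn-session=true`). The helper `dataproc_create_cmd` and the cluster name are hypothetical, and combining all three flags in one command is an assumption rather than something the report shows.

```python
import shlex
from typing import Dict, List, Optional


def dataproc_create_cmd(
    cluster_name: str,
    components: List[str],
    image_version: str = "1.5",
    metadata: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argv for creating a Dataproc cluster with optional components."""
    cmd = [
        "gcloud", "beta", "dataproc", "clusters", "create", cluster_name,
        f"--optional-components={','.join(components)}",
        f"--image-version={image_version}",
    ]
    if metadata:
        # gcloud accepts metadata as comma-separated KEY=VALUE pairs.
        pairs = ",".join(f"{key}={value}" for key, value in metadata.items())
        cmd += ["--metadata", pairs]
    return cmd


if __name__ == "__main__":
    # "demo-cluster" is a hypothetical name.
    cmd = dataproc_create_cmd(
        "demo-cluster",
        ["FLINK", "DOCKER"],
        metadata={"flink-start-yarn-session": "true"},
    )
    print(shlex.join(cmd))
```

Returning a list rather than a string means the result can be passed to `subprocess.run` without shell quoting concerns, while `shlex.join` gives a copy-pasteable form for review.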
## Strategic Analysis
- **Market Positioning:** Google is positioning Dataproc as the flexible, comprehensive big data workbench, capable of handling both traditional batch processing (Hadoop/Spark) and modern streaming requirements (Flink/Beam) within unified infrastructure management.
- **Competitive Advantage:** By making Flink a simple optional component, Google Cloud lowers the barrier to entry for high-performance stateful streaming. The Docker support provides an immediate answer to the dependency hell common in complex distributed deployments.
- **Challenges:** Full adoption depends on the ease of migrating workloads and managing two distinct runtime deployment styles (job vs. session clusters) within YARN.
## Industry Reactions
*(No external reactions are available in the provided text, but it can be inferred that this move is generally seen as positive for ecosystem completeness.)*
## Future Outlook
- Expect further integration of container-native tooling and real-time processing frameworks across Google Cloud's data portfolio to maintain competitive parity with other cloud providers prioritizing portability and low-latency analytics.
- Further simplification of running Flink Session clusters by default or via single-flag enablement will likely be a subsequent roadmap item.
## For Security Professionals
The Docker integration requires rigorous attention to **image provenance and vulnerability scanning**, especially since the component interfaces directly with organizational repositories like Google Container Registry; ensure that container images deployed via these new features adhere to organizational security policies before execution on production clusters. Monitoring integration with Cloud Logging assists in auditing container execution within the YARN environment.
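One way such a policy might be enforced before images reach a cluster is a pre-deployment check like the sketch below. It assumes a policy of "approved repository prefix plus digest pinning", which is an illustration rather than anything the source prescribes. The project name and digest are placeholders.

```python
import re
from typing import Iterable

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def image_allowed(image: str, allowed_prefixes: Iterable[str]) -> bool:
    """Return True if the image is from an approved repository and digest-pinned."""
    from_approved = any(image.startswith(prefix) for prefix in allowed_prefixes)
    pinned = _DIGEST.search(image) is not None
    return from_approved and pinned


if __name__ == "__main__":
    allowed = ["gcr.io/example-project/"]  # hypothetical project
    digest = "a" * 64                      # placeholder digest
    for ref in (
        f"gcr.io/example-project/beam-worker@sha256:{digest}",
        "gcr.io/example-project/beam-worker:latest",
        f"docker.io/library/python@sha256:{digest}",
    ):
        print(ref, image_allowed(ref, allowed))
```

Requiring a digest rather than a tag means the image that was scanned is the image that runs, since tags such as `latest` can be repointed after review.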