
Dataflow- Backbone of Data Analytics with HCL Workload Automation

1/11/2022

Let us begin by understanding what Google Cloud Dataflow is all about before moving on to our GCP Cloud Dataflow plugin and how it benefits Workload Automation users.
Data is generated in real time from websites, mobile apps, IoT devices, and other workloads. Capturing, processing, and analyzing this data is a priority for all businesses. However, the data from these systems is often not in a format that is conducive to analysis or to effective use by downstream systems. That's where Dataflow comes in! Dataflow is used for processing and enriching batch or stream data for use cases such as analytics, machine learning, and data warehousing.
Dataflow offers a collection of pre-built templates, with an option to create your own custom ones!
Here in Workload Automation, you can implement the Text Files on Cloud Storage to BigQuery template.

The Text Files on Cloud Storage to BigQuery pipeline is a batch pipeline that reads text files stored in Cloud Storage, transforms them using a JavaScript User Defined Function (UDF) that you provide, and appends the results to BigQuery.
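To make the transform step concrete, here is a minimal sketch of such a UDF. The function name transform_udf1 matches the example name used later in this post, while the input format and field names (id, name, amount) are purely illustrative assumptions: the template calls the function once per input line and expects back a JSON string that matches your BigQuery schema.

// Minimal sample UDF for the Text Files on Cloud Storage to BigQuery template.
// Assumes a CSV input line such as: 101,alice,42.5
function transform_udf1(line) {
  var values = line.split(',');          // split the CSV line into fields
  var record = {
    id: parseInt(values[0], 10),         // illustrative column names; they must
    name: values[1],                     // match the BigQuery schema you supply
    amount: parseFloat(values[2])
  };
  return JSON.stringify(record);         // one JSON string per output row
}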
Let us now look at the plugin and its job definition parameters:
 
Log in to the Dynamic Workload Console and open the Workload Designer. Create a new job and select the “GCP Cloud Dataflow” job type in the Cloud section.

Figure 1: Job Definition Page

Connection
 
Establishing a connection to the Google Cloud server:
 
Use this section to connect to Google Cloud in one of two ways.
  1. GCP Default Credentials – If the customer VM already resides inside the GCP environment, there is no need to provide credentials explicitly.
  2. GCP Server Credentials – Manually provide the details below.
 
Service Account - The service account associated with your Google Cloud account. Click the Select button to choose the service account in the Cloud console.
Project ID - The project ID is a unique name associated with each project. It is mandatory and unique for each service account.
Test Connection - Click to verify if the connection to the Google Cloud works correctly.
Figure 2: Connection Page

Action

In the Action tab, specify the job details and the operation that you want to perform.
 
  • Job Name - Specify the name of the job; it must be unique among running jobs.
  • Region - Choose a dataflow regional endpoint to deploy worker instances and store job metadata.
  • Template Path – The Cloud Storage path of the Text Files on Cloud Storage to BigQuery template.
Ex. gs://your-project-region/latest/GCS_text_to_BigQuery
This is a batch pipeline that reads text files stored in Cloud Storage, transforms them using a JavaScript user-defined function (UDF), and outputs the result to BigQuery.
  • Javascript UDF path in Cloud Storage - The Cloud Storage path pattern for the JavaScript code containing your user-defined functions.
Ex: gs://your-bucket/your-transforms/*.js
  • JSON Path - The Cloud Storage path to the JSON file that defines your BigQuery schema (a sample schema file is sketched below). Ex: gs://your-bucket/your-schema.json
  • Javascript UDF Name - The name of the function to call from your JavaScript file. Use only letters, digits, and underscores. Ex: transform_udf1.
  • BigQuery Output Table - The location of the BigQuery table in which to store your processed data. If you reuse an existing table, it will be overwritten. Ex: your-project:your-dataset.your-table.
  • Cloud Storage Input Path - The path to the Cloud Storage text to read. Ex: gs://your-bucket/your-file.txt
  • Temporary BigQuery Directory - Temporary directory for the BigQuery loading process. Ex: gs://your-bucket/your-files/temp-dir.
  • Temporary Location - Path and filename prefix for writing temporary files. Ex: gs://your-bucket/temp.
  • Network - Network to which workers will be assigned. If empty or unspecified, the service will use the network "default".
  • Subnetwork - Subnetwork to which workers will be assigned, if desired. Value can be either a complete URL or an abbreviated path. If the subnetwork is located in a Shared VPC network, you must use the complete URL.
Figure 3: Action Page
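For reference, the file pointed to by the JSON Path parameter describes the schema of the BigQuery output table. The sketch below reuses the same illustrative fields as the sample UDF above; the top-level "BigQuery Schema" key follows the layout documented by Google for this template, but check the current template documentation for the exact format it expects.

{
  "BigQuery Schema": [
    { "name": "id",     "type": "INTEGER" },
    { "name": "name",   "type": "STRING"  },
    { "name": "amount", "type": "FLOAT"   }
  ]
}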

Submitting your job
 
It is time to submit your job into the current plan. You can add the job to a job stream that automates your business process flow. Select the action menu in the top-left corner of the job definition panel and click Submit Job into Current Plan. A confirmation message is displayed, and you can switch to the Monitoring view to see what is going on.
Figure 4: Submit Job Into Current Plan Page

Monitor Page
Figure 5: Monitor Page

Users can cancel a running job by clicking the Kill option.
 
Job Log Details
Figure 6: Job Log Page

Workflow Page
Figure 7: Workflow Details Page

Are you curious to try out the GCP Cloud Dataflow plugin? Download the integration from Automation Hub to get started, or drop us a line at mahalakshmi_m@hcl.com.

Author Bios
Suhas H N, Senior Developer at HCL Technologies 

Works as a plugin developer in Workload Automation. Skilled in Java, Spring, Spring Boot, microservices, AngularJS, and JavaScript.

Rabic Meeran K, Senior Engineer at HCL Technologies 

Responsible for designing and developing integration plug-ins for Workload Automation. Has two decades of experience in creating software products for enterprise customers.

Saket Saurav, Tester (Senior Engineer) at HCL Technologies
 
Responsible for performing automation and manual testing of different plugins in Workload Automation using the Java Unified Test Automation Framework. Hands-on experience with the Java programming language and web services, and with databases such as Oracle and SQL Server.

