How to pass dynamic parameters in a Google Cloud Dataflow pipeline

I have written code to ingest a CSV file from GCS into BigQuery with a hardcoded project ID, dataset, table name, and GCS temp and staging locations.
I am looking for code that reads the following parameters dynamically from a BigQuery table:
ProjectID
Dataset
Table name
GCS Temp & Staging location
Code:
public class DemoPipeline {
public static TableReference getGCDSTableReference() {
TableReference ref = new TableReference();
ref.setProjectId("myprojectbq");
ref.setDatasetId("DS_Emp");
ref.setTableId("emp");
return ref;
}
static class TransformToTable extends DoFn<String, TableRow> {
@ProcessElement
public void processElement(ProcessContext c) {
String input = c.element();
String[] s = input.split(",");
TableRow row = new TableRow();
row.set("id", s[0]);
row.set("name", s[1]);
c.output(row);
}
}
public interface MyOptions extends PipelineOptions {
/*
* Param
*
*/
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
options.setTempLocation("gs://demo-xxxxxx/temp");
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("Read From Storage", TextIO.read().from("gs://demo-xxxxxx/student.csv"));
PCollection<TableRow> rows = lines.apply("Transform To Table",ParDo.of(new TransformToTable()));
rows.apply("Write To Table",BigQueryIO.writeTableRows().to(getGCDSTableReference())
//.withSchema(BQTableSemantics.getGCDSTableSchema())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
p.run();
}
}

Even to read the parameters (project ID, dataset, table names) from an initial table, you need to hardcode that table's location somewhere. A properties file, as Haris recommended, is a good approach. Consider the following suggestions:
Java properties file. Use this when parameters have to be changed or tuned, that is, for changes that should not require recompilation. The file has to live alongside, or be attached to, your Java classes. Reading it from GCS is feasible but an unusual choice.
Pipeline execution parameters. Custom options can be a workaround for your question; see the documentation on Creating Custom Options to understand how this can be accomplished.
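The properties-file suggestion can be sketched in plain Java as follows. This is a minimal sketch, not the answerer's code: the class name PipelineSettings and the property keys (projectId, datasetId, tableId, tempLocation) are assumptions chosen to mirror the values hardcoded in the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the values that DemoPipeline currently hardcodes.
final class PipelineSettings {
    final String projectId;
    final String datasetId;
    final String tableId;
    final String tempLocation;

    private PipelineSettings(Properties props) {
        this.projectId = require(props, "projectId");
        this.datasetId = require(props, "datasetId");
        this.tableId = require(props, "tableId");
        this.tempLocation = require(props, "tempLocation");
    }

    // Fails fast on a missing key instead of handing null to the pipeline.
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }

    // Loads settings from any Reader, e.g. a FileReader on a .properties file.
    public static PipelineSettings load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return new PipelineSettings(props);
    }

    // Convenience overload for in-memory text; assumed only for this demo.
    public static PipelineSettings parse(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PipelineSettings s = parse(
            "projectId=myprojectbq\n"
                + "datasetId=DS_Emp\n"
                + "tableId=emp\n"
                + "tempLocation=gs://demo-xxxxxx/temp\n");
        System.out.println(s.projectId + ":" + s.datasetId + "." + s.tableId);
    }
}
```

The loaded fields could then replace the literals in the question, for example ref.setProjectId(settings.projectId) and options.setTempLocation(settings.tempLocation).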

Related

How to count the number of rows in the input file of the Google Dataflow file processing?

I am trying to count the number of rows in an input file and I am using Cloud dataflow Runner for creating the template. In the below code, I am reading the file from a GCS bucket, processing it and then storing the output in a Redis instance.
But I am unable to count the number of lines of the input file.
Main Class
public static void main(String[] args) {
/**
* Constructed StorageToRedisOptions object using the method PipelineOptionsFactory.fromArgs to read options from command-line
*/
StorageToRedisOptions options = PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(StorageToRedisOptions.class);
Pipeline p = Pipeline.create(options);
p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
.apply("Transforming data...",
ParDo.of(new DoFn<String, String[]>() {
@ProcessElement
public void TransformData(@Element String line, OutputReceiver<String[]> out) {
String[] fields = line.split("\\|");
out.output(fields);
}
}))
.apply("Processing data...",
ParDo.of(new DoFn<String[], KV<String, String>>() {
@ProcessElement
public void ProcessData(@Element String[] fields, OutputReceiver<KV<String, String>> out) {
if (fields[RedisIndex.GUID.getValue()] != null) {
out.output(KV.of("firstname:"
.concat(fields[RedisIndex.FIRSTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
out.output(KV.of("lastname:"
.concat(fields[RedisIndex.LASTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
out.output(KV.of("dob:"
.concat(fields[RedisIndex.DOB.getValue()]), fields[RedisIndex.GUID.getValue()]));
out.output(KV.of("postalcode:"
.concat(fields[RedisIndex.POSTAL_CODE.getValue()]), fields[RedisIndex.GUID.getValue()]));
}
}
}))
.apply("Writing field indexes into redis",
RedisIO.write().withMethod(RedisIO.Write.Method.SADD)
.withEndpoint(options.getRedisHost(), options.getRedisPort()));
p.run();
}
Sample Input File
xxxxxxxxxxxxxxxx|bruce|wayne|31051989|444444444444
yyyyyyyyyyyyyyyy|selina|thomas|01051989|222222222222
aaaaaaaaaaaaaaaa|clark|kent|31051990|666666666666
Command to execute the pipeline
mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=my-project-id \
--jobName=dataflow-job \
--inputFile=gs://my-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://pipeline-bucket/stage/ \
--dataflowJobFile=gs://pipeline-bucket/templates/dataflow-template \
--runner=DataflowRunner"
I have tried to use the code below from a Stack Overflow solution, but it doesn't work for me.
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
p.apply(TextIO.Read.from("gs://..."))
.apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);
I have gone through the Apache Beam documentation as well but didn't find anything helpful. Any help on this will be really appreciated.
I resolved this issue by adding the Count.globally() and applying to a PCollection<String> after the pipeline reads the file.
I have added the below code:
PCollection<String> lines = p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()));
lines.apply(Count.globally()).apply("Count the total records", ParDo.of(new RecordCount()));
where I have created a new Class (RecordCount.java) which extends DoFn<Long, Void> which just logs the count.
RecordCount.java
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class RecordCount extends DoFn<Long, Void> {
private static final Logger LOGGER = LoggerFactory.getLogger(RecordCount.class);
@ProcessElement
public void processElement(@Element Long count) {
// SLF4J needs a {} placeholder for the argument to appear in the message
LOGGER.info("The total number of records in the input file is: {}", count);
}
}
The proper way to do this is to write the count to a storage system using a Beam connector (or a Beam ParDo). The pipeline result is not directly available to the main program, because the Beam runner may parallelize the computation and execution may not happen on the same machine.
For example (a sketch using the Beam 2.x Java API):
p.apply(TextIO.read().from("gs://..."))
.apply(Count.<String>globally())
.apply(MapElements.into(TypeDescriptors.strings()).via((Long count) -> Long.toString(count)))
.apply(TextIO.write().to("gs://..."));
If you need to handle output directly in the main program, you can read from GCS using a client library after Beam program ends (make sure to specify p.run().waitUntilFinish() in this case). Alternatively, you can move your computation (that needs the count) into a Beam PTransform and make that part of your pipeline.
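Reading the count back in the main program after the pipeline has finished might look like the sketch below. It makes several assumptions: the local filesystem stands in for GCS (with GCS, the objects would first be fetched with a client library), the class name CountFileReader is hypothetical, and the output is taken to be one or more text shard files sharing a common name prefix, one of which contains the count on a line of its own.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: reads the single count written by the pipeline.
final class CountFileReader {

    // Returns the first non-empty line parsed as a long.
    public static long parseCount(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return Long.parseLong(trimmed);
            }
        }
        throw new IllegalArgumentException("No count found");
    }

    // Scans every file in dir whose name starts with prefix and returns the
    // count from the first shard that has content (other shards may be empty).
    public static long readCount(Path dir, String prefix) throws IOException {
        try (DirectoryStream<Path> shards = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path shard : shards) {
                List<String> lines = Files.readAllLines(shard);
                boolean hasContent = lines.stream().anyMatch(l -> !l.trim().isEmpty());
                if (hasContent) {
                    return parseCount(lines);
                }
            }
        }
        throw new IllegalStateException("No non-empty shard with prefix " + prefix);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: directory holding the output shards, args[1]: file name prefix
        System.out.println(readCount(Paths.get(args[0]), args[1]));
    }
}
```

This only makes sense once the job has completed, which is why the answer insists on p.run().waitUntilFinish() before reading.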

Write into cloud sql using dataflow JdbcIO api

I have a requirement to write a PCollection of String to Cloud SQL using the Cloud Dataflow API.
pipeline.apply(TextIO.read().from("gs://***/sampleBigtable.csv"))
.apply(JdbcIO.write()
.withDataSourceConfiguration(DataSourceConfiguration
.create("org.postgresql.Driver", "jdbc:postgresql://***:5432/test")
.withUsername("**").withPassword("password10"))
.withStatement("insert into person values(?,?)")
.withPreparedStatementSetter(
new JdbcIO.PreparedStatementSetter<Object>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public void setParameters(Object arg0, PreparedStatement query)
throws Exception {
// TODO Auto-generated method stub
query.setString(1, "Hello");
query.setString(1, "Hi");
}
}));
This is the sample code I am trying. A very simple version of what I want to do.
Also, is it feasible to write into Cloud SQL from Dataflow using a ParDo with simple insert statements?
The previous transform outputs a PCollection<String>, so you need to specify String as the input type of JdbcIO.write(), that is, JdbcIO.<String>write().
Something like this:
pipeline
.apply(TextIO.read().from("gs://***/sampleBigtable.csv"))
.apply(JdbcIO.<String>write().withDataSourceConfiguration(
DataSourceConfiguration.create("org.postgresql.Driver","jdbc:postgresql://***:5432/test")
.withUsername("**")
.withPassword("password10"))
.withStatement("insert into person values(?,?)")
.withPreparedStatementSetter((element, query) -> {
query.setInt(1, 1);
query.setString(2, "Hello");
})
);
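The setter above binds constants and ignores the element. A minimal sketch of binding values taken from each CSV line is shown here; it assumes, hypothetically, that every line has the form id,name and that the person table has an integer column followed by a text column. The class PersonRowBinder is not part of the original answer.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper that turns one CSV line into statement parameters.
final class PersonRowBinder {

    // Splits "id,name" into typed values; rejects lines with another shape.
    public static Object[] toParameters(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException(
                "Expected 2 fields but got " + fields.length + ": " + line);
        }
        return new Object[] {Integer.valueOf(fields[0].trim()), fields[1].trim()};
    }

    // Binds the parsed values to "insert into person values(?,?)".
    public static void bind(String line, PreparedStatement query) throws SQLException {
        Object[] params = toParameters(line);
        query.setInt(1, (Integer) params[0]);
        query.setString(2, (String) params[1]);
    }
}
```

With this in place, the lambda passed to withPreparedStatementSetter could delegate to it, for example (element, query) -> PersonRowBinder.bind(element, query).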

Naming Blob Dynamically for WebJob on a Schedule

I have a web job which is creating a blob based on the return value of a WebClient call. This is working fine. But as you can see from the Blob attribute (see code below), the name of the file is static. So, it is getting overwritten every time in blob storage.
Function class:
public class Functions
{
private static int _retryCount;
private static readonly int _retryLimit = int.Parse(ConfigurationManager.AppSettings["retryLimit"]);
private static readonly string _ghostRestfullUri = ConfigurationManager.AppSettings["ghostRestfullUri"];
[NoAutomaticTrigger]
public static void LightUpSite([Blob("ghost/response.json")] out string output, TextWriter logger)
{
_retryCount = 0;
output = string.Empty;
do
{
try
{
using (var request = new WebClient())
{
var response = request.DownloadString(_ghostRestfullUri);
_retryCount++;
output = response;
break;
}
}
catch(Exception exception)
{
logger.WriteLine("Job failed. Retry number:{0}", _retryCount);
}
} while (_retryCount < _retryLimit);
}
}
Main menu:
public class Program
{
static void Main()
{
var host = new JobHost();
host.Call(typeof(Functions).GetMethod("LightUpSite"));
}
}
How can I use placeholders to dynamically name the incoming file?
I have already tried the following:
ghost/{name}
ghost/{BlobName}
Other things to note:
This job is run on a schedule, so the host does not run and block
This job does not get invoked by a trigger, it just wakes up and runs;
Because the source is not coming from a message queue object or a uploaded file, I can’t figure out how I am supposed to name this blob.
Perhaps somehow using the blob storage API directly?
To name an output blob dynamically, use IBinder as shown in this sample.
To name an input blob dynamically, as in a call from Host.Call, just pass the name of the blob as an argument:
static void Main()
{
var host = new JobHost();
host.Call(typeof(Functions).GetMethod("LightUpSite"), new {blobArgumentName= "container/blob"});
}

MapReduce job with mixed data sources: HBase table and HDFS files

I need to implement an MR job that accesses data from both an HBase table and HDFS files. For example, the mappers read data from the HBase table and from the HDFS files; these data share the same primary key but have different schemas. A reducer then joins all columns (from the HBase table and the HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only for multiple HDFS data sources. Please let me know if you have some ideas. Sample code would be great.
After a few days of investigation (and getting help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {
public static class Map extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String s = value.toString();
String[] sa = s.split(",");
if (sa.length == 2) {
context.write(new Text(sa[0]), new Text(sa[1]));
}
}
}
public static class TableMap extends TableMapper<Text, Text> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "c1".getBytes();
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
String key = Bytes.toString(row.get());
String val = new String(value.getValue(CF, ATTR1));
context.write(new Text(key), new Text(val));
}
}
public static class Reduce extends Reducer <Object, Text, Object, Text> {
public void reduce(Object key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String ks = key.toString();
for (Text val : values){
context.write(new Text(ks), val);
}
}
}
public static void main(String[] args) throws Exception {
Path inputPath1 = new Path(args[0]);
Path inputPath2 = new Path(args[1]);
Path outputPath = new Path(args[2]);
String tableName = "test";
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MixMR.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.addFamily(Bytes.toBytes("cf"));
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
TableMap.class, // mapper
Text.class, // mapper output key
Text.class, // mapper output value
job);
job.setReducerClass(Reduce.class); // reducer class
job.setOutputFormatClass(TextOutputFormat.class);
// HDFS text input, handled by Map
MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
// inputPath2 is only a placeholder here: TableInputFormat reads from the HBase table, not from this path
MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.waitForCompletion(true);
}
}
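The Reduce class above writes each value for a key on its own line. If a single joined row per key is wanted, one common approach is to have each mapper tag its values with their source so the reducer can order them. The plain-Java sketch below shows only that merging step; the tags H: and F: and the class name are assumptions for illustration, not part of the original job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical merge logic that a reducer could apply to the values of one key.
final class ReduceSideJoinSketch {
    static final String HBASE_TAG = "H:"; // assumed prefix added by TableMap
    static final String FILE_TAG = "F:";  // assumed prefix added by Map

    // Combines the tagged values of one key into a single comma-separated row,
    // HBase columns first, then HDFS file columns.
    public static String joinValues(Iterable<String> taggedValues) {
        List<String> hbaseColumns = new ArrayList<>();
        List<String> fileColumns = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith(HBASE_TAG)) {
                hbaseColumns.add(value.substring(HBASE_TAG.length()));
            } else if (value.startsWith(FILE_TAG)) {
                fileColumns.add(value.substring(FILE_TAG.length()));
            } else {
                throw new IllegalArgumentException("Untagged value: " + value);
            }
        }
        List<String> row = new ArrayList<>(hbaseColumns);
        row.addAll(fileColumns);
        return String.join(",", row);
    }

    public static void main(String[] args) {
        // Values arrive in no guaranteed order; the tags restore a stable layout.
        System.out.println(joinValues(Arrays.asList("F:fileCol", "H:hbaseCol")));
    }
}
```

In the job itself, the mappers would emit new Text(HBASE_TAG + val) or new Text(FILE_TAG + val), and the reducer would write the joined string once per key.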
There is no out-of-the-box feature that supports this. A possible workaround is to scan your HBase table and write the results to an HDFS file first, and then do the reduce-side join using MultipleInputs. This incurs some additional I/O overhead, however.
A pig script or hive query can do that easily.
sample pig script
tbl = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:* ...', '-loadKey true -limit 5')
AS (id:bytearray, info_map:map[],...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray,...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO ...

Same Instances header ( arff ) for all my database queries

I am using InstanceQuery with SQL queries to construct my Instances. But my query results do not always come back in the same order, as is normal in SQL.
Because of this, Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give the correct header information when constructing Instances?
Is something like the workflow below possible?
Get pre-prepared header information from an ARFF file or another place.
Give this header information to the instance construction.
Call the SQL function and get Instances (header + data).
I am using the following SQL function to get instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql
,String pInstanceRelationName){
try {
DatabaseUtils utils = new DatabaseUtils();
InstanceQuery query = new InstanceQuery();
query.setUsername(username);
query.setPassword(password);
query.setQuery(pSql);
Instances data = query.retrieveInstances();
data.setRelationName(pInstanceRelationName);
if (data.classIndex() == -1)
{
data.setClassIndex(data.numAttributes() - 1);
}
return data;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
I tried various approaches to my problem, but it seems that the Weka internal API does not currently allow a solution. I modified the command-line append code of weka.core.Instances for my purposes. This code is also given in this answer.
Based on this, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values. I read this file with the following code.
public static Instances getSampleInstances() {
Instances data = null;
try {
BufferedReader reader = new BufferedReader(new FileReader(
"datas\\SampleWithKnownHeader.arff"));
data = new Instances(reader);
reader.close();
// setting class attribute
data.setClassIndex(data.numAttributes() - 1);
}
catch (Exception e) {
throw new RuntimeException(e);
}
return data;
}
After that, I use the following code to create instances. I had to use a StringBuilder and the string values of each instance; I then save the resulting string to a file.
public static void main(String[] args) {
Instances SampleInstance = MyUtilsForWeka.getSampleInstances();
DataSource source1 = new DataSource(SampleInstance);
Instances data2 = InstancesFromDatabase
.getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");
DataSource source2 = new DataSource(data2);
Instances structure1;
Instances structure2;
StringBuilder sb = new StringBuilder();
try {
structure1 = source1.getStructure();
sb.append(structure1);
structure2 = source2.getStructure();
while (source2.hasMoreElements(structure2)) {
String elementAsString = source2.nextElement(structure2)
.toString();
sb.append(elementAsString);
sb.append("\n");
}
} catch (Exception ex) {
throw new RuntimeException(ex);
}
MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My save instances to file code is as below.
public static void saveInstancesToFile(String contents,String filename) {
FileWriter fstream;
try {
fstream = new FileWriter(filename);
BufferedWriter out = new BufferedWriter(fstream);
out.write(contents);
out.close();
} catch (Exception ex) {
throw new RuntimeException(ex);
}
}
This solves my problem, but I wonder if a more elegant solution exists.
I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add the correct Attribute, with the proper list of values, to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning)
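The getValues helper used above is not shown in the answer. One possible implementation, given here only as a sketch, joins the labels of a nominal attribute into the comma-separated string that setNominalLabels expects. To keep the example free of Weka dependencies it takes a java.util.Enumeration; the class name NominalLabels is hypothetical, and the sketch assumes that no label itself contains a comma.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.StringJoiner;

// Hypothetical stand-in for the answer's unshown getValues helper.
final class NominalLabels {

    // Joins all labels into one comma-separated string, e.g. "tcp,udp".
    public static String getValues(Enumeration<?> values) {
        StringJoiner joiner = new StringJoiner(",");
        while (values.hasMoreElements()) {
            joiner.add(String.valueOf(values.nextElement()));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Enumeration<String> labels = Collections.enumeration(Arrays.asList("tcp", "udp"));
        System.out.println(getValues(labels));
    }
}
```

With Weka on the classpath, the call in the answer would then become getValues(training.attribute(0).enumerateValues()), assuming the attribute is nominal.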