Running BeamSql Without Coder or Making Coder Dynamic - google-cloud-platform

I am reading data from a file and converting it to BeamRecord, but when I run a query on it, it shows this error:
Exception in thread "main" java.lang.ClassCastException: org.apache.beam.sdk.coders.SerializableCoder cannot be cast to org.apache.beam.sdk.coders.BeamRecordCoder
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.registerTables(BeamSql.java:173)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:153)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:116)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:533)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:465)
at org.apache.beam.sdk.values.PCollectionTuple.apply(PCollectionTuple.java:160)
at TestingClass.main(TestingClass.java:75)
But when I provide it a coder, it runs perfectly.
I am a little confused: since I am reading data from a file and the file's schema changes on every run (because I am using templates), is there any way I can run the query with a default coder, or without a coder at all?
For reference, the code is below. Please check.
PCollection<String> ReadFile1 = PBegin.in(p).apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File1_BeamRecord = ReadFile1.apply(new StringToBeamRecord()).setCoder(new Temp().test().getRecordCoder());
PCollection<String> ReadFile2= p.apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File2_BeamRecord = ReadFile2.apply(new StringToBeamRecord()).setCoder(new Temp().test1().getRecordCoder());
new Temp().test1().getRecordCoder() returns a BeamRecordCoder built from hard-coded values, which I actually need to determine at runtime.
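Presumably the Temp helper boils down to something like the sketch below (an assumption, since Temp is not shown): it builds a BeamRecordSqlType from hard-coded field names and SQL types and hands back its record coder, which has to be known when the pipeline is constructed.
// Hypothetical sketch of what the hard-coded helper amounts to; the
// field names/types are placeholders and in my case change on every run.
List<String> fieldNames = Arrays.asList("R0", "R1", "R2");
List<Integer> fieldTypes = Arrays.asList(Types.VARCHAR, Types.VARCHAR, Types.VARCHAR);
BeamRecordSqlType rowType = BeamRecordSqlType.create(fieldNames, fieldTypes);
BeamRecordCoder coder = rowType.getRecordCoder(); // what setCoder(...) above receives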
The conversion from PCollection<String> to PCollection<BeamRecord> is below:
public class StringToBeamRecord extends PTransform<PCollection<String>,PCollection<BeamRecord>> {
private static final Logger LOG = LoggerFactory.getLogger(StringToBeamRecord.class);
@Override
public PCollection<BeamRecord> expand(PCollection<String> arg0) {
return arg0.apply("Conversion",ParDo.of(new ConversionOfData()));
}
static class ConversionOfData extends DoFn<String,BeamRecord> implements Serializable{
@ProcessElement
public void processElement(ProcessContext c){
String Data = c.element().replaceAll(",,",",blank,");
String[] array = Data.split(",");
List<String> fieldNames = new ArrayList<>();
List<Integer> fieldTypes = new ArrayList<>();
List<Object> Data_Conversion = new ArrayList<>();
int Count = 0;
for(int i = 0; i < array.length; i++){
fieldNames.add("R" + Count);
Count++;
fieldTypes.add(Types.VARCHAR); //Using Schema I can Set it
Data_Conversion.add(array[i]);
}
LOG.info("The Size is : "+Data_Conversion.size());
BeamRecordSqlType type = BeamRecordSqlType.create(fieldNames, fieldTypes);
c.output(new BeamRecord(type,Data_Conversion));
}
}
}
The query is:
PCollectionTuple test = PCollectionTuple.of(
new TupleTag<BeamRecord>("File1_BeamRecord"),File1_BeamRecord)
.and(new TupleTag<BeamRecord>("File2_BeamRecord"), File2_BeamRecord);
PCollection<BeamRecord> output = test.apply(BeamSql.queryMulti(
"Select * From File1_BeamRecord JOIN File2_BeamRecord "));
Is there any way I can make the coder dynamic, or run the query with a default coder?

Related

How to use Isolationforest in weka?

I am trying to use IsolationForest in Weka, but I cannot find an easy example which shows how to use it. Who can help me? Thanks in advance.
import weka.classifiers.misc.IsolationForest;
public class Test2 {
public static void main(String[] args) {
IsolationForest isolationForest = new IsolationForest();
.....................................................
}
}
I strongly suggest you study the implementation of IsolationForest a little.
The following code works by loading a CSV file whose first column is the class. (Note: a single class value will produce only the (1 - anomaly score); if the class is binary you will get the anomaly score too; otherwise it just returns an error.) Note that I skip the second column, which in my case is a uuid that is not needed for anomaly detection.
private static void findOutlier(File in, File out) throws Exception {
CSVLoader loader = new CSVLoader();
loader.setSource(new File(in.getAbsolutePath()));
Instances data = loader.getDataSet();
// setting class attribute if the data format does not provide this information
// For example, the XRFF format saves the class attribute information as well
if (data.classIndex() == -1)
data.setClassIndex(0);
String[] options = new String[2];
options[0] = "-R"; // "range"
options[1] = "2"; // first attribute
Remove remove = new Remove(); // new instance of filter
remove.setOptions(options); // set options
remove.setInputFormat(data); // inform filter about dataset **AFTER** setting options
Instances newData = Filter.useFilter(data, remove); // apply filter
IsolationForest randomForest = new IsolationForest();
randomForest.buildClassifier(newData);
// System.out.println(randomForest);
FileWriter fw = new FileWriter(out);
final Enumeration<Attribute> attributeEnumeration = data.enumerateAttributes();
// write one header cell per attribute
while (attributeEnumeration.hasMoreElements()) {
Attribute e = attributeEnumeration.nextElement();
fw.write(e.name());
fw.write(",");
}
fw.write("(1 - anomaly score),anomaly score\n");
for (int i = 0; i < data.size(); ++i) {
Instance inst = data.get(i);
final double[] distributionForInstance = randomForest.distributionForInstance(inst);
fw.write(inst + ", " + distributionForInstance[0] + "," + (1 - distributionForInstance[0]));
fw.write(",\n");
}
fw.flush();
fw.close();
}
The previous function appends the anomaly values as the last column(s) of the CSV. Please note I'm using a single class, so to get the corresponding anomaly I compute 1 - distributionForInstance[0]; otherwise you can simply use distributionForInstance[1].
A sample input.csv for getting (1-anomaly score):
Class,ignore, feature_0, feature_1, feature_2
A,1,21,31,31
A,2,41,61,81
A,3,61,37,34
A sample input.csv for getting (1-anomaly score) and anomaly score:
Class,ignore, feature_0, feature_1, feature_2
A,1,21,31,31
B,2,41,61,81
A,3,61,37,34
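For completeness, a minimal call of the function above might look like this (the file names are just placeholders; the output CSV is the input rows with the score column(s) appended):
// Hypothetical file names; findOutlier(...) is the method shown above.
findOutlier(new File("input.csv"), new File("scored.csv"));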

Unit Test for Apex Trigger that Concatenates Fields

I am trying to write a test for a before trigger that takes fields from a custom object and concatenates them into a custom Key__c field.
The trigger works in the Sandbox and now I am trying to get it into production. However, whenever I try to do a System.assert/assertEquals after I create a purchase and perform DML, the value of Key__c always comes back null. I am aware I can create a flow/process to do this, but I am trying to solve this with code for my own edification. How can I get the fields to concatenate and return properly in the test? (The commented-out asserts are what I have tried so far; they fail when run.)
trigger Composite_Key on Purchases__c (before insert, before update) {
if(Trigger.isBefore)
{
for(Purchases__c purchase : trigger.new)
{
String eventName = String.isBlank(purchase.Event_name__c)?'':purchase.Event_name__c+'-';
String section = String.isBlank(purchase.section__c)?'':purchase.section__c+'-';
String row = String.isBlank(purchase.row__c)?'':purchase.row__c+'-';
String seat = String.isBlank(String.valueOf(purchase.seat__c))?'':String.valueOf(purchase.seat__c)+'-';
String numseats = String.isBlank(String.valueOf(purchase.number_of_seats__c))?'':String.valueOf(purchase.number_of_seats__c)+'-';
String adddatetime = String.isBlank(String.valueOf(purchase.add_datetime__c))?'':String.valueOf(purchase.add_datetime__c);
purchase.Key__c = eventName + section + row + seat + numseats + adddatetime;
}
}
}
@isTest
public class CompositeKeyTest {
public static testMethod void testPurchase() {
//create a purchase to fire the trigger
Purchases__c purchase = new Purchases__c(Event_name__c = 'test', section__c='test',row__c='test', seat__c=1.0,number_of_seats__c='test',add_datetime__c='test');
Insert purchase;
//System.assert(purchases__c.Key__c.getDescribe().getName() == 'testesttest1testtest');
//System.assertEquals('testtesttest1.0testtest',purchase.Key__c);
}
static testMethod void testbulkPurchase(){
List<Purchases__c> purchaseList = new List<Purchases__c>();
for(integer i=0 ; i < 10; i++)
{
Purchases__c purchaserec = new Purchases__c(Event_name__c = 'test', section__c='test',row__c='test', seat__c= i+1.0 ,number_of_seats__c='test',add_datetime__c='test');
purchaseList.add(purchaserec);
}
insert purchaseList;
//System.assertEquals('testtesttest5testtest',purchaseList[4].Key__c,'Key is not Valid');
}
}
You need to re-query the records after inserting them to get the data the trigger wrote to the database; the in-memory records you inserted are not updated with field values set by the trigger.
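For example, in the single-record test the check could look roughly like this sketch (the expected string is an assumption based on the '-' separators the trigger appends and the field values used above):
// Re-query so the value written by the before-insert trigger is visible to the test;
// the expected key below is assumed from the concatenation logic in the trigger.
Purchases__c inserted = [SELECT Key__c FROM Purchases__c WHERE Id = :purchase.Id];
System.assertEquals('test-test-test-1.0-test-test', inserted.Key__c);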

Using Mockito to test Java Hbase API

This is the method that I am testing. This method gets some bytes from an HBase database based on a specific id, in this case called dtmid. The reason why I want to return some specific values is that I realized there is no way to know if an id will always be in HBase. Also, the column family and column name could change.
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
try {
if (tuple.size() > 0) {
Long dtmid = tuple.getLong(0);
byte[] rowKey = HBaseRowKeyDistributor.getDistributedKey(dtmid);
Get get = new Get(rowKey);
get.addFamily("a".getBytes());
Result result = table.get(get);
byte[] bidUser = result.getValue("a".getBytes(),
"co_created_5076".getBytes());
collector.emit(new Values(dtmid, bidUser));
}
} catch (IOException e) {
e.printStackTrace();
}
}
In my main class, when this method is called, I want the following call to return a specific value. The method should return some bytes:
byte[] bidUser = result.getValue("a".getBytes(),
"co_created_5076".getBytes());
This is what I have on my Unit Test.
@Test
public void testExecute() throws IOException {
long dtmId = 350000000770902930L;
final byte[] COL_FAMILY = "a".getBytes();
final byte[] COL_QUALIFIER = "co_created_5076".getBytes();
//setting a key value pair to put in result
List<KeyValue> kvs = new ArrayList<KeyValue>();
kvs.add(new KeyValue("--350000000770902930".getBytes(), COL_FAMILY, COL_QUALIFIER, Bytes.toBytes("ExpedtedBytes")));
// I create an Instance of result
Result result = new Result(kvs);
// A mock tuple with a single dtmid
Tuple tuple = mock(Tuple.class);
bolt.table = mock(HTable.class);
Result mcResult = mock(Result.class);
when(tuple.size()).thenReturn(1);
when(tuple.getLong(0)).thenReturn(dtmId);
when(bolt.table.get(any(Get.class))).thenReturn(result);
when(mcResult.getValue(any(byte[].class), any(byte[].class))).thenReturn(Bytes.toBytes("Bytes"));
BasicOutputCollector collector = mock(BasicOutputCollector.class);
// Execute the bolt.
bolt.execute(tuple, collector);
ArgumentCaptor<Values> valuesArg = ArgumentCaptor
.forClass(Values.class);
verify(collector).emit(valuesArg.capture());
Values d = valuesArg.getValue();
//casting this object in to a byteArray.
byte[] i = (byte[]) d.get(1);
assertEquals(dtmId, d.get(0));
}
I am using this to return my bytes, but for some reason it is not working:
when(mcResult.getValue(any(byte[].class), any(byte[].class))).thenReturn(Bytes
.toBytes("myBytes"));
For some reason when I capture the values, I still get the bytes that I specified here:
List<KeyValue> kvs = new ArrayList<KeyValue>();
kvs.add(new KeyValue("--350000000770902930".getBytes(),COL_FAMILY, COL_QUALIFIER, Bytes
.toBytes("ExpedtedBytes")));
Result result = new Result(kvs);
How about replacing
when(bolt.table.get(any(Get.class))).thenReturn(result);
with...
when(bolt.table.get(any(Get.class))).thenReturn(mcResult);
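The reason the change works: the getValue(...) stub is set on mcResult, but the bolt never sees mcResult, because table.get(...) is stubbed to return the Result built from kvs, whose real values get emitted. A minimal sketch of the corrected stubbing, using the same mocks as in the test above:
// Make the mocked table hand back the mocked Result, so the getValue(...)
// stub below is the one the bolt actually hits.
when(bolt.table.get(any(Get.class))).thenReturn(mcResult);
when(mcResult.getValue(any(byte[].class), any(byte[].class)))
        .thenReturn(Bytes.toBytes("myBytes"));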

MapReduce job with mixed data sources: HBase table and HDFS files

I need to implement an MR job which accesses data from both an HBase table and HDFS files. E.g., the mapper reads data from the HBase table and from HDFS files; these data share the same primary key but have different schemas. A reducer then joins all columns (from the HBase table and HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only for multiple HDFS data sources. Please let me know if you have some ideas. Sample code would be great.
After a few days of investigation (and getting help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {
public static class Map extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String s = value.toString();
String[] sa = s.split(",");
if (sa.length == 2) {
context.write(new Text(sa[0]), new Text(sa[1]));
}
}
}
public static class TableMap extends TableMapper<Text, Text> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "c1".getBytes();
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
String key = Bytes.toString(row.get());
String val = new String(value.getValue(CF, ATTR1));
context.write(new Text(key), new Text(val));
}
}
public static class Reduce extends Reducer <Object, Text, Object, Text> {
public void reduce(Object key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String ks = key.toString();
for (Text val : values){
context.write(new Text(ks), val);
}
}
}
public static void main(String[] args) throws Exception {
Path inputPath1 = new Path(args[0]);
Path inputPath2 = new Path(args[1]);
Path outputPath = new Path(args[2]);
String tableName = "test";
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MixMR.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.addFamily(Bytes.toBytes("cf"));
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
TableMap.class, // mapper
Text.class, // mapper output key
Text.class, // mapper output value
job);
job.setReducerClass(Reduce.class); // reducer class
job.setOutputFormatClass(TextOutputFormat.class);
// the path passed with TableInputFormat (inputPath2) has no effect: the HBase table name comes from the configuration set by initTableMapperJob
MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.waitForCompletion(true);
}
}
There is no OOTB feature that supports this. A possible workaround could be to Scan your HBase table and write the Results to an HDFS file first, and then do the reduce-side join using MultipleInputs. But this will incur some additional I/O overhead.
A Pig script or Hive query can do that easily.
A sample Pig script:
tbl = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:* ...', '-loadKey true -limit 5')
AS (id:bytearray, info_map:map[],...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray,...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO ...;

Hbase Map/Reduce - How to access individual columns of the table?

I have a table called User with two columns, one called visitorId and the other called friend which is a list of strings. I want to check whether the VisitorId is in the friendlist. Can anyone direct me as to how to access the table columns in a map function?
I'm not able to picture how data is output from a map function in hbase.
My code is as follows:
public class MapReduce {
static class Mapper1 extends TableMapper<ImmutableBytesWritable, Text> {
private int numRecords = 0;
private static final IntWritable one = new IntWritable(1);
private final IntWritable ONE = new IntWritable(1);
private Text text = new Text();
@Override
public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
//What should I do here??
ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
try {
context.write(userKey, ONE);
//context.write(text, ONE);
} catch (InterruptedException e) {
throw new IOException(e);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "CheckVisitor");
job.setJarByClass(MapReduce.class);
Scan scan = new Scan();
Filter f = new RowFilter(CompareOp.EQUAL,new SubstringComparator("mId2"));
scan.setFilter(f);
scan.addFamily(Bytes.toBytes("visitor"));
scan.addFamily(Bytes.toBytes("friend"));
TableMapReduceUtil.initTableMapperJob("User", scan, Mapper1.class, ImmutableBytesWritable.class,Text.class, job);
}
}
So the Result values instance contains the full row from the scanner.
To get the appropriate columns from the Result, I would do something like:
VisitorIdVal = value.getColumnLatest(Bytes.toBytes(columnFamily1), Bytes.toBytes("VisitorId"))
friendlistVal = value.getColumnLatest(Bytes.toBytes(columnFamily2), Bytes.toBytes("friendlist"))
Here VisitorIdVal and friendlistVal are of type KeyValue (http://archive.cloudera.com/cdh/3/hbase/apidocs/org/apache/hadoop/hbase/KeyValue.html); to get their values out you can do Bytes.toString(VisitorIdVal.getValue()).
Once you have extracted the values from columns you can check for "VisitorId" in "friendlist"
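Putting that together, the map body could look roughly like the sketch below. The family/qualifier names ("visitor"/"VisitorId" and "friend"/"friendlist") are assumptions based on the question, and it emits Text to match the mapper's declared output types:
@Override
public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
    // Assumed layout: family "visitor" holds "VisitorId", family "friend" holds "friendlist".
    KeyValue visitorIdVal = values.getColumnLatest(Bytes.toBytes("visitor"), Bytes.toBytes("VisitorId"));
    KeyValue friendListVal = values.getColumnLatest(Bytes.toBytes("friend"), Bytes.toBytes("friendlist"));
    if (visitorIdVal == null || friendListVal == null) {
        return; // row is missing one of the columns
    }
    String visitorId = Bytes.toString(visitorIdVal.getValue());
    String friendList = Bytes.toString(friendListVal.getValue());
    // Emit the row key plus whether the visitor appears in the friend list.
    context.write(new ImmutableBytesWritable(row.get()),
            new Text(visitorId + "," + friendList.contains(visitorId)));
}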