I need to substitute characters of a tuple using a Pig UDF. For example, if I have a line in the file such as "hello world, Hello WORLD, hello\WORLD", it needs to be transformed into "hello_world,hello_world,hello_world". To accomplish this, I tried the UDF below:
package myUDF;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class ReplaceValues extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            str = str.replace(" ", "_");
            str = str.replace("/", "");
            str = str.replace("\\", "");
            TupleFactory tf = TupleFactory.getInstance();
            Tuple t = tf.newTuple();
            t.append(str);
            return t;
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
But when calling this UDF from my Pig script I run into an error. Please help me resolve it:
A = load '/user/cloudera/Stage/ActualDataSet.csv' using PigStorage(',') AS (Rank:chararray,NCTNumber:chararray,Title:chararray,Recruitment:chararray);
B = FILTER A by Rank == 'Rank';
C = FOREACH B GENERATE PigUDF.ReplaceValues(B);
Error: Pig script failed to parse:
Invalid scalar projection: B : A column needs to be projected from a relation for it to be used as a scalar
You have to pass the field that you are trying to modify, not the relation B. Assuming the field you are trying to transform is Title, you would call the UDF like this:
C = FOREACH B GENERATE Rank, NCTNumber, PigUDF.ReplaceValues(Title), Recruitment;
Note that if you are trying to replace characters across the entire record, then your load statement is incorrect: you will have to load each record as a single line:chararray and pass that line to your UDF.
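For instance, a sketch of that whole-record variant:
A = LOAD '/user/cloudera/Stage/ActualDataSet.csv' AS (line:chararray);
B = FOREACH A GENERATE PigUDF.ReplaceValues(line);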
Also, instead of a UDF you can use a regex to match and replace the strings of your choice.
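For example, a sketch using Pig's built-in REPLACE and LOWER on the Title field; further REPLACE calls can strip '/' and '\' the same way:
C = FOREACH B GENERATE REPLACE(LOWER(Title), ' ', '_');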
In your Pig script you are passing the entire relation B to the UDF, while it accepts a tuple as its argument. Instead, pass a single field, such as Title, as shown below.
C = FOREACH B GENERATE PigUDF.ReplaceValues(Title);
You can call this UDF on other fields in the same line as well:
C = FOREACH B GENERATE PigUDF.ReplaceValues(Title), PigUDF.ReplaceValues(Rank);
I am trying to use IsolationForest in Weka, but I cannot find an easy example showing how to use it. Who can help me? Thanks in advance.
import weka.classifiers.misc.IsolationForest;
public class Test2 {
    public static void main(String[] args) {
        IsolationForest isolationForest = new IsolationForest();
        .....................................................
    }
}
I strongly suggest you study the implementation of IsolationForest a little.
The following code works by loading a CSV file whose first column is the Class. (Note: a single class value will produce only the (1 - anomaly score); if the class is binary you will get the anomaly score too; anything else just returns an error.) Also note that I skip the second column, which in my case is a UUID that is not needed for anomaly detection.
// needed imports: java.io.File, java.io.FileWriter, java.util.Enumeration,
// weka.core.{Attribute, Instance, Instances}, weka.core.converters.CSVLoader,
// weka.filters.Filter, weka.filters.unsupervised.attribute.Remove,
// weka.classifiers.misc.IsolationForest
private static void findOutlier(File in, File out) throws Exception {
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File(in.getAbsolutePath()));
    Instances data = loader.getDataSet();

    // set the class attribute if the data format does not provide this information
    // (for example, the XRFF format saves the class attribute information as well)
    if (data.classIndex() == -1)
        data.setClassIndex(0);

    String[] options = new String[2];
    options[0] = "-R"; // "range"
    options[1] = "2";  // the second attribute (the uuid column)
    Remove remove = new Remove();   // new instance of filter
    remove.setOptions(options);     // set options
    remove.setInputFormat(data);    // inform filter about dataset **AFTER** setting options
    Instances newData = Filter.useFilter(data, remove); // apply filter

    IsolationForest isolationForest = new IsolationForest();
    isolationForest.buildClassifier(newData);

    FileWriter fw = new FileWriter(out);
    // write the header; note that enumerateAttributes() skips the class attribute
    Enumeration<Attribute> attributeEnumeration = data.enumerateAttributes();
    while (attributeEnumeration.hasMoreElements()) {
        fw.write(attributeEnumeration.nextElement().name());
        fw.write(",");
    }
    fw.write("(1 - anomaly score),anomaly score\n");

    for (int i = 0; i < data.size(); ++i) {
        // score the filtered instance (the structure the classifier was trained on),
        // but echo the original, unfiltered instance to the output file
        Instance inst = data.get(i);
        double[] distributionForInstance = isolationForest.distributionForInstance(newData.get(i));
        fw.write(inst + "," + distributionForInstance[0] + "," + (1 - distributionForInstance[0]));
        fw.write("\n");
    }
    fw.flush();
    fw.close();
}
The previous function appends the anomaly values as the last columns of the CSV. Please note I'm using a single class, so to get the corresponding anomaly score I compute 1 - distributionForInstance[0]; otherwise you can simply use distributionForInstance[1].
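A minimal, hypothetical invocation (the file names are placeholders; put this in the same class as findOutlier):
public static void main(String[] args) throws Exception {
    findOutlier(new File("input.csv"), new File("input_scored.csv"));
}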
A sample input.csv for getting (1-anomaly score):
Class,ignore,feature_0,feature_1,feature_2
A,1,21,31,31
A,2,41,61,81
A,3,61,37,34
A sample input.csv for getting (1-anomaly score) and anomaly score:
Class,ignore,feature_0,feature_1,feature_2
A,1,21,31,31
B,2,41,61,81
A,3,61,37,34
I've abstracted a very simple situation here, in which I want to pass a list of strings into my cleanLines function and get a list of strings back out. Unfortunately, I'm new to Groovy and I've spent about a day trying to get this to work, to no avail. Here's a stand-alone test that exhibits the problem I'm having:
import static org.junit.Assert.*;
import java.util.List;
import org.junit.Test;
class ConfigFileTest {
    private def tab = '\t'
    private def returnCarriage = '\r'
    private def equals = '='

    List<String> cleanLines(List<String> lines) {
        lines = lines.collect { it.findAll { c -> c != tab && c != returnCarriage } }
        lines = lines.findAll { it.contains(equals) }
        lines = lines.collect { it.trim() }
    }

    @Test
    public void test() {
        List<String> dirtyLines = [" Colour=Red",
                                   "Shape=Square "]
        List<String> cleanedLines = ["Colour=Red",
                                     "Shape=Square"]
        assert cleanLines(dirtyLines) == cleanedLines
    }
}
I believe that I've followed the correct usage for collect(), findAll() and trim(). But when I run the test, it crashes on the trim() line, stating:
groovy.lang.MissingMethodException: No signature of method:
java.util.ArrayList.trim() is applicable for argument types: () values: []
Something's suspicious.
I've been staring at this for too long, and I've noticed that my IDE thinks the type of lines at the start of cleanLines is List<String>, but that by the second line it has become Collection, and by the third it expects List<Object<E>>. I think that String is an Object and so this might be okay, but it certainly hints at a misunderstanding on my part. What am I doing wrong? How can I get my test to pass?
Here's a corrected script:
import groovy.transform.Field

@Field
def tab = '\t'
@Field
def returnCarriage = '\r'
@Field
def equals = '='

List<String> cleanLines(List<String> lines) {
    lines = lines.findAll { it.contains(equals) }
    lines = lines.collect { it.replaceAll('\\s+', '') }
    lines = lines.collect { it.trim() }
}
def dirtyLines = [" Colour=Red",
"Shape=Square "]
def cleanedLines = ["Colour=Red", "Shape=Square"]
assert cleanLines(dirtyLines) == cleanedLines
In general, findAll and collect are not mutually exclusive, but they have different purposes: use findAll to select the elements that match certain criteria, and collect when you need to process or transform every element of the list.
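For instance:
assert [1, 2, 3].findAll { it > 1 } == [2, 3]    // selects the matching elements
assert [1, 2, 3].collect { it * 2 } == [2, 4, 6] // transforms every element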
This line:
lines = lines.collect { it.findAll { c -> c != tab && c != returnCarriage } }
replaces the original list of Strings with a list of lists, hence the MissingMethodException for ArrayList.trim(). You might want to replace findAll{} with find{}.
You can clean the lines like this:
def dirtyLines = [" Colour=Red", "Shape=Square "]
def cleanedLines = ["Colour=Red", "Shape=Square"]
assert dirtyLines.collect { it.trim() } == cleanedLines
If you separate the first line of cleanLines() into two separate lines and print lines after each, you will see the problem.
it.findAll { c -> c != tab && c != returnCarriage }
will return a list of strings that match the criteria. The collect method is called on every string in the list of lines. So you end up with a list of lists of strings. I think what you are looking for is something like this:
def cleanLines(lines) {
return lines.findAll { it.contains(equals) }
.collect { it.replaceAll(/\s+/, '') }
}
I have a method which selects the rows whose rowkey contains the parameter passed in.
HTable table = new HTable(Bytes.toBytes(objectsTableName), connection);

public List<ObjectId> lookUp(String partialId) {
    if (partialId.matches("[a-fA-F0-9]+")) {
        // create a regular expression from partialId, which can
        // match any rowkey that contains partialId as a substring,
        // and then get all the rows with those rowkeys
    } else {
        throw new IllegalArgumentException(
                "query must be done with hexadecimal values only");
    }
}
I don't know how to finish the code above. I just know that the following code can get the row with a specified rowkey in HBase:
String rowkey = "123";
Get get = new Get(Bytes.toBytes(rowkey));
Result result = table.get(get);
You can use a RowFilter with a RegexStringComparator to do that. Or, if it is just to fetch the rows whose keys contain a given substring, you can use a RowFilter with a SubstringComparator. This is how you use HBase filters:
// needed imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.client.{HTable, Scan, ResultScanner, Result},
// org.apache.hadoop.hbase.filter.{Filter, RowFilter, SubstringComparator, CompareFilter.CompareOp},
// org.apache.hadoop.hbase.util.Bytes
public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "demo");
    Scan s = new Scan();
    Filter f = new RowFilter(CompareOp.EQUAL, new SubstringComparator("abc"));
    s.setFilter(f);
    ResultScanner rs = table.getScanner(s);
    for (Result r : rs) {
        System.out.println("RowKey : " + Bytes.toString(r.getRow()));
        // rest of your logic
    }
    rs.close();
    table.close();
}
The above piece of code will give you all the rows which contain abc as a part of their rowkeys.
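If you need the regex variant from your lookUp method instead, here is a sketch along the same lines (partialId as in the question):
Filter f = new RowFilter(CompareOp.EQUAL, new RegexStringComparator(partialId));
s.setFilter(f);
// RegexStringComparator applies java.util.regex matching, so a plain
// hexadecimal substring like partialId matches anywhere in the rowkey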
HTH
Starting to play with Cascading on Amazon EMR, I have managed to get it running, BUT I am falling at a fairly simple hurdle and I was hoping someone could shed some light on it.
My code:
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.tap.SinkMode;
public class Main
{
    public static void main( String[] args )
    {
        String inPath = args[ 0 ];
        String outPath = args[ 1 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create the source tap
        TextLine sourceScheme = new TextLine( new Fields( "line" ) );
        Tap inTap = new Hfs( sourceScheme, inPath );

        // create the sink tap
        TextLine sinkScheme = new TextLine( new Fields( "custid", "movieids" ) );
        Tap outTap = new Hfs( sinkScheme, outPath, SinkMode.REPLACE );

        Fields filmFields = new Fields( "custid", "movieids" );
        String filmRegex = "([0-9]:*[,.]*)";
        RegexParser parser = new RegexParser( filmFields, filmRegex );
        Pipe importPipe = new Each( "import", new Fields( "line" ), parser, Fields.RESULTS );

        // connect the taps, pipes, etc., into a flow
        Flow parsedFlow = flowConnector.connect( inTap, outTap, importPipe );

        // run the flow
        parsedFlow.start();
        parsedFlow.complete();
    }
}
My input (no empty lines):
1:2
2:4
5:1
3:9
My output:
Task TASKID="task_201305241444_0003_m_000000" TASK_TYPE="MAP" TASK_STATUS="FAILED" FINISH_TIME="1369408133954" ERROR="cascading.tuple.TupleException: operation added the wrong number of fields, expected: ['custid', 'movieids'], got result size: 1
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:82)
at cascading.operation.regex.RegexParser.onFoundGroups(RegexParser.java:168)
at cascading.operation.regex.RegexParser.operate(RegexParser.java:151)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:127)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
The regex checks out fine at http://regexpal.com/.
Thanks a lot
Duncan
You get an exception because your regular expression yields one result where two result fields are expected (namely "custid" and "movieids"), since the regular expression contains just a single group (...).
If you just want to split at the colon, either use an expression with 2 groups, for example:
String filmRegex = "(\\d):(\\d)";
or \d+, respectively, if your numbers can have more than one digit.
Or, more easily, just split the input data into its fields automatically when reading from the file by using a TextDelimited input scheme:
Scheme sourceScheme = new TextDelimited(new Fields("custid", "movieids"), ":");
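Wired into the code from the question, that would look roughly like this (a sketch; the RegexParser and the Each pipe are no longer needed because the scheme already splits the fields):
Scheme sourceScheme = new TextDelimited( new Fields( "custid", "movieids" ), ":" );
Tap inTap = new Hfs( sourceScheme, inPath );
Pipe importPipe = new Pipe( "import" ); // a plain pass-through pipe
Flow parsedFlow = new HadoopFlowConnector( properties ).connect( inTap, outTap, importPipe );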
I am generating POCOs (let's say they are subclasses of MyEntityObject) by using a T4 template from an EDMX file.
I have 3 entities, e.g.:
MyTable1 (PrimaryKey: MyTable1ID)
MyTable2 (PrimaryKey: MyTable2ID)
MyTable3 (PrimaryKey: MyTable3ID)
These entities have the following relations:
MyTable1.MyTable1ID <=> MyTable2.MyTable1ID (MyTable1ID is the foreign key to MyTable1)
MyTable2.MyTable2ID <=> MyTable3.MyTable2ID (MyTable2ID is the foreign key to MyTable2)
Or in another view:
MyTable1 <= MyTable2 <= MyTable3
I want to extract all foreign key relations:
NavigationProperty[] foreignKeys = entity.NavigationProperties.Where(np => np.DeclaringType == entity && ((AssociationType)np.RelationshipType).IsForeignKey).ToArray();
foreach (NavigationProperty foreignKey in foreignKeys)
{
    // generate code....
}
My Question: How can I extract the column names that are linked between two entities?
Something like this:
void GetLinkedColumns(MyEntityObject table1, MyEntityObject table2, out string fkColumnTable1, out string fkColumnTable2)
{
// do the job
}
In the example
string myTable1Column;
string myTable2Column;
GetLinkedColumns(myTable1, myTable2, out myTable1Column, out myTable2Column);
the result should be
myTable1Column = "MyTable1ID";
myTable2Column = "MyTable2ID";
The first answer works if your foreign key columns are exposed as properties in your conceptual model. Also, the GetSourceSchemaTypes() method is only available in some of the text templates included with EF, so it is helpful to know what this method does.
If you want to always know the column names, you will need to load the AssociationType from the storage model as follows:
// Obtain a reference to the navigation property you are interested in
var navProp = GetNavigationProperty();
// Load the metadata workspace
MetadataWorkspace metadataWorkspace = null;
bool allMetadataLoaded = loader.TryLoadAllMetadata(inputFile, out metadataWorkspace);
// Get the association type from the storage model
var association = metadataWorkspace
    .GetItems<AssociationType>(DataSpace.SSpace)
    .Single(a => a.Name == navProp.RelationshipType.Name);
// Then look at the referential constraints
var toColumns = String.Join(",",
association.ReferentialConstraints.SelectMany(rc => rc.ToProperties));
var fromColumns = String.Join(",",
association.ReferentialConstraints.SelectMany(rc => rc.FromProperties));
In this case, loader is a MetadataLoader defined in EF.Utility.CS.ttinclude and inputFile is a standard string variable specifying the name of the .edmx file. These should already be declared in your text template.
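If they are not, here is a minimal sketch of those declarations (the .edmx file name is hypothetical; MetadataLoader comes from EF.Utility.CS.ttinclude):
MetadataLoader loader = new MetadataLoader(this);
string inputFile = @"MyModel.edmx"; // hypothetical path to your model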
I'm not sure whether you want to generate code using the columns or not, but this may partly help to answer your question (how can I extract the column names that are linked between two entities?):
NavigationProperty[] foreignKeys = entity.NavigationProperties
.Where(np => np.DeclaringType == entity &&
((AssociationType)np.RelationshipType).IsForeignKey).ToArray();
foreach (NavigationProperty foreignKey in foreignKeys)
{
    foreach (var rc in GetSourceSchemaTypes<AssociationType>()
        .Single(x => x.Name == foreignKey.RelationshipType.Name)
        .ReferentialConstraints)
    {
        foreach (var tp in rc.ToProperties)
            WriteLine(tp.Name);
        foreach (var fp in rc.FromProperties)
            WriteLine(fp.Name);
    }
}
This code works fine in my Visual Studio 2012:
<#@ template language="C#" debug="true" hostspecific="true" #>
<#@ include file="EF.Utility.CS.ttinclude" #>
<#
    string inputFile = @"DomainModel.edmx";
    MetadataLoader loader = new MetadataLoader(this);
    EdmItemCollection ItemCollection = loader.CreateEdmItemCollection(inputFile);

    foreach (EntityType entity in ItemCollection.GetItems<EntityType>().OrderBy(e => e.Name))
    {
        foreach (NavigationProperty navProperty in entity.NavigationProperties)
        {
            AssociationType association = ItemCollection.GetItems<AssociationType>().Single(a => a.Name == navProperty.RelationshipType.Name);
            string fromEntity = association.ReferentialConstraints[0].FromRole.Name;
            string fromEntityField = association.ReferentialConstraints[0].FromProperties[0].Name;
            string toEntity = association.ReferentialConstraints[0].ToRole.Name;
            string toEntityField = association.ReferentialConstraints[0].ToProperties[0].Name;
        }
    }
#>