Cascading - regex parser - wrong number of fields - regex

Starting to play with Cascading on Amazon EMR, have managed to get it running BUT falling at a fairly simple hurdle and I was hoping someone could shed some light on it.
My code:
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.tap.SinkMode;
public class Main
{
public static void
main( String[] args )
{
String inPath = args[ 0 ];
String outPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create the source tap
TextLine sourceScheme = new TextLine(new Fields("line"));
Tap inTap = new Hfs( sourceScheme, inPath );
// create the sink tap
TextLine sinkScheme = new TextLine( new Fields("custid", "movieids"));
Tap outTap = new Hfs( sinkScheme, outPath, SinkMode.REPLACE );
Fields filmFields = new Fields("custid", "movieids");
String filmRegex = "([0-9]:*[,.]*)";
RegexParser parser = new RegexParser(filmFields, filmRegex);
Pipe importPipe = new Each("import", new Fields("line"), parser, Fields.RESULTS );
// connect the taps, pipes, etc., into a flow
Flow parsedFlow = new HadoopFlowConnector(properties).connect(inTap, outTap, importPipe);
// run the flow
parsedFlow.start();
parsedFlow.complete();
}
}
My input (no empty lines):
1:2
2:4
5:1
3:9
My output:
Task TASKID="task_201305241444_0003_m_000000" TASK_TYPE="MAP" TASK_STATUS="FAILED" FINISH_TIME="1369408133954" ERROR="cascading\.tuple\.TupleException: operation added the wrong number of fields, expected: ['custid', 'movieids'], got result size: 1
at cascading\.tuple\.TupleEntryCollector\.add(TupleEntryCollector\.java:82)
at cascading\.operation\.regex\.RegexParser\.onFoundGroups(RegexParser\.java:168)
at cascading\.operation\.regex\.RegexParser\.operate(RegexParser\.java:151)
at cascading\.flow\.stream\.FunctionEachStage\.receive(FunctionEachStage\.java:99)
at cascading\.flow\.stream\.FunctionEachStage\.receive(FunctionEachStage\.java:39)
at cascading\.flow\.stream\.SourceStage\.map(SourceStage\.java:102)
at cascading\.flow\.stream\.SourceStage\.run(SourceStage\.java:58)
at cascading\.flow\.hadoop\.FlowMapper\.run(FlowMapper\.java:127)
at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:441)
at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:377)
at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:255)
at java\.security\.AccessController\.doPrivileged(Native Method)
at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1132)
at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:249)
The reg ex checks out fine at http://regexpal.com/
Thanks a lot
Duncan

You get an exception because your regular expression yields one result, where two result fields are excepted (namely "custid" and "movieids"), because the regular expression contains just a single group (...).
If you just want to split at the colon, either use an expression with 2 groups, for example:
String filmRegex = "(\\d):(\\d)";
or \d+, respectively, if your numbers can have more than one digit.
Or, more easily, just split the input data into its fields automatically when reading from the file by using a TextDelimited input scheme:
Scheme sourceScheme = new TextDelimited(new Fields("custid", "movieids"), ":");

Related

Working with dynamic regex value in dart & flutter

Sometimes we want to implement validation dynamically while taking input from the users. This means you want to store the regex somewhere in the DB & want to add/update the validation accordingly and get it in the API response. But right now, this is not possible:
final regex = new RegExp(r'${regex_value}'); // will raise error
So what can be the solution to work with dynamic regex?
So to work with dynamic regex, we can do this - get the base64 encoded value of regex from the server/Backend/DB and decode it like this:
import 'dart:convert';
void main() {
final base64Decoder = base64.decoder;
const base64Bytes = 'XlxkezAsM30oPzpbLixdXGR7MSwyfSk/JA==';
final decodedBytes = base64Decoder.convert(base64Bytes);
print( utf8.decode(decodedBytes) );
final regex = RegExp(utf8.decode(decodedBytes));
print( regex.hasMatch('111.2') );
print( regex.hasMatch('111.22') );
print( regex.hasMatch('111.222') );
print( regex.hasMatch('111.') );
}

Change a tuple using pig

I need to substitute characters of a tuple using Pig UDF. For eg, if i have a line in the file as "hello world, Hello WORLD, hello\WORLD" required to be transformed as "hello_world,hello_world,hello_world". To accomplish this, i tried below UDF:
package myUDF;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class ReplaceValues extends EvalFunc<Tuple>
{
public Tuple exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try{
String str = (String)input.get(0);
str=str.replace(" ", "_");
str=str.replace("/","");
str=str.replace("\\","");
TupleFactory tf = TupleFactory.getInstance();
Tuple t = tf.newTuple();
t.append(str);
return t;
}catch(Exception e){
throw new IOException("Caught exception processing input row ", e);
}
}
}
but when calling this UDF via pig script i am facing issues, please help me in resolving this:
A = load '/user/cloudera/Stage/ActualDataSet.csv' using PigStorage(',') AS (Rank:chararray,NCTNumber:chararray,Title:chararray,Recruitment:chararray);
B = FILTER A by Rank == 'Rank';
C = FOREACH B GENERATE PigUDF.ReplaceValues(B);
Error: Pig script failed to parse:
Invalid scalar projection: B : A column needs to be projected from a relation for it to be used as a scalar
You have to pass the field that you are trying to modify and not the relation B.Assuming the field that you are trying to match is Title, then you would call the UDF like below
C = FOREACH B GENERATE B.Rank,B.NCTNumber,PigUDF.ReplaceValues(B.Title),B.Recruitment;
Note that if you are trying to replace it in the entire record then your load statement is incorrect.You will have to load the entire record as one line:chararray and then pass the line to your UDF.
Also, instead of an UDF you can use REGEX to match and replace the string of your choice.
In your Pig script, you are passing entire Bag "B" in the UDF, while it accepts tuple as an argument.
Instead pass the field like B.Title as given below.
C = FOREACH B GENERATE PigUDF.ReplaceValues(B.Title);
you can call this UDF on other fields also in the same line as:
C = FOREACH B GENERATE PigUDF.ReplaceValues(B.Title), PigUDF.ReplaceValues(B.Rank);

How to Set Height of PdfPTable in iTextSharp

i downloaded the last version of iTextSharp dll. I generated a PdfPTable object and i have to set it's height. Despite to set width of PdfPTable, im not able to set height of it. Some authors suggest to use 'setFixedHeight' method. But the last version of iTextSharp.dll has not method as 'setFixedHeight'. It's version is 5.5.2. How can i do it?
Setting a table's height doesn't make sense once you start thinking about it. Or, it makes sense but leaves many questions unanswered or unanswerable. For instance, if you set a two row table to a height of 500, does that mean that each cell get's 250 for a height? What if a large image gets put in the first row, should the table automatically respond by splitting 400/100? Then what about large content in both rows, should it squish those? Each of these scenarios produces different results that make knowing what a table will actually do unreliable. If you look at the HTML spec you'll see that they don't even allow setting a fixed height for tables.
However, there's a simple solution and that's just setting the fixed height of the cells themselves. As long as you aren't using new PdfPCell() you can just set DefaultCell.FixedHeight to whatever you want.
var t = new PdfPTable(2);
t.DefaultCell.FixedHeight = 100f;
t.AddCell("Hello");
t.AddCell("World");
t.AddCell("Hello");
t.AddCell("World");
doc.Add(t);
If you are manually creating cells then you need to set the FixedHeight on each:
var t = new PdfPTable(2);
for(var i=0;i<4;i++){
var c = new PdfPCell(new Phrase("Hello"));
c.FixedHeight = 75f;
t.AddCell(c);
}
doc.Add(t);
However, if you want normal table-ness and must set a fixed height that chops things that don't fit you can also use a ColumnText. I wouldn't recommend this but you might have a case for it. The code below will only show six rows.
var ct = new ColumnText(writer.DirectContent);
ct.SetSimpleColumn(100, 100, 200, 200);
var t = new PdfPTable(2);
for(var i=0;i<100;i++){
t.AddCell(i.ToString());
}
ct.AddElement(t);
ct.Go();
you can use any one of following
cell.MinimumHeight = 20f;
or
cell.FixedHeight = 30f;
The premise is that you have downloaded the jar iText jar,you can try this code,you can achive this function,In a A4 paper out a row of three columns of dataeg:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfPCell;
import com.lowagie.text.pdf.PdfPTable;
import com.lowagie.text.pdf.PdfWriter;
public class ceshi {
public static final String DEST = "D:\\fixed_height_cell.pdf";
public static void main(String[] args) throws IOException,
DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new ceshi().createPdf(DEST);
}
public void createPdf(String dest) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
PdfPTable table = new PdfPTable(3);// Set a row and the three column of
// A4 paper
table.setWidthPercentage(100);
PdfPCell cell;
for (int r = 1; r <= 2; r++) {// Set display two lines
for (int c = 1; c <= 3; c++) {// Set to display a row of three columns
cell = new PdfPCell();
cell.addElement(new Paragraph("test"));
cell.setFixedHeight(285);// Control the fixed height of each cell
table.addCell(cell);
}
}
document.add(table);
document.close();
}
}

docx4j create unnumbered / bullet list

I would like to create an unnumbered list with bullets using docx4j in my Word document. I have found the following code that is supposed to do the work. But whatever I try, the generated list is a numbered list! I use Word 2010, German version and docx4j-2.8.1.
wordMLPackage = WordprocessingMLPackage.createPackage();
ObjectFactory factory = new org.docx4j.wml.ObjectFactory();
P p = factory.createP();
org.docx4j.wml.Text t = factory.createText();
t.setValue(text);
org.docx4j.wml.R run = factory.createR();
run.getContent().add(t);
p.getContent().add(run);
org.docx4j.wml.PPr ppr = factory.createPPr();
p.setPPr(ppr);
// Create and add <w:numPr>
NumPr numPr = factory.createPPrBaseNumPr();
ppr.setNumPr(numPr);
// The <w:ilvl> element
Ilvl ilvlElement = factory.createPPrBaseNumPrIlvl();
numPr.setIlvl(ilvlElement);
ilvlElement.setVal(BigInteger.valueOf(0));
// The <w:numId> element
NumId numIdElement = factory.createPPrBaseNumPrNumId();
numPr.setNumId(numIdElement);
numIdElement.setVal(BigInteger.valueOf(1));
wordMLPackage.getMainDocumentPart().addObject(p);
Can someone help me to generate a real unordered, buletted list?!
Hope this helps you.
import org.docx4j.XmlUtils;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart;
import org.docx4j.wml.*;
import javax.xml.bind.JAXBException;
import java.io.File;
import java.math.BigInteger;
public class GenerateBulletedList {
private static final String BULLET_TEMPLATE ="<w:numbering xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
"<w:abstractNum w:abstractNumId=\"0\">" +
"<w:nsid w:val=\"12D402B7\"/>" +
"<w:multiLevelType w:val=\"hybridMultilevel\"/>" +
"<w:tmpl w:val=\"AECAFC2E\"/>" +
"<w:lvl w:ilvl=\"0\" w:tplc=\"04090001\">" +
"<w:start w:val=\"1\"/>" +
"<w:numFmt w:val=\"bullet\"/>" +
"<w:lvlText w:val=\"\uF0B7\"/>" +
"<w:lvlJc w:val=\"left\"/>" +
"<w:pPr>" +
"<w:ind w:left=\"360\" w:hanging=\"360\"/>" +
"</w:pPr>" +
"<w:rPr>" +
"<w:rFonts w:ascii=\"Symbol\" w:hAnsi=\"Symbol\" w:hint=\"default\"/>" +
"</w:rPr>" +
"</w:lvl>" +
"</w:abstractNum>"+
"<w:num w:numId=\"1\">" +
"<w:abstractNumId w:val=\"0\"/>" +
"</w:num>" +
"</w:numbering>";
public static void main(String[] args) throws Exception{
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
createBulletedList(wordMLPackage);
wordMLPackage.save(new File("Output.docx"));
}
private static void createBulletedList(WordprocessingMLPackage wordMLPackage)throws Exception{
NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
ndp.setJaxbElement((Numbering) XmlUtils.unmarshalString(BULLET_TEMPLATE));
wordMLPackage.getMainDocumentPart().addObject(createParagraph("India"));
wordMLPackage.getMainDocumentPart().addObject(createParagraph("United Kingdom"));
wordMLPackage.getMainDocumentPart().addObject(createParagraph("France"));
}
private static P createParagraph(String country) throws JAXBException {
ObjectFactory factory = new org.docx4j.wml.ObjectFactory();
P p = factory.createP();
String text =
"<w:r xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
" <w:rPr>" +
"<w:b/>" +
" <w:rFonts w:ascii=\"Arial\" w:cs=\"Arial\"/><w:sz w:val=\"16\"/>" +
" </w:rPr>" +
"<w:t>" + country + "</w:t>" +
"</w:r>";
R r = (R) XmlUtils.unmarshalString(text);
org.docx4j.wml.R run = factory.createR();
run.getContent().add(r);
p.getContent().add(run);
org.docx4j.wml.PPr ppr = factory.createPPr();
p.setPPr(ppr);
// Create and add <w:numPr>
PPrBase.NumPr numPr = factory.createPPrBaseNumPr();
ppr.setNumPr(numPr);
// The <w:numId> element
PPrBase.NumPr.NumId numIdElement = factory.createPPrBaseNumPrNumId();
numPr.setNumId(numIdElement);
numIdElement.setVal(BigInteger.valueOf(1));
return p;
}
}
The code you have posted says "use list number 1, level 0".
Evidently that list is a numbered list.
Have a look in your numbering definitions part for a bulleted list, and use that one.
If you don't have a bulleted list there, you'll need to add it. You can upload a sample docx to the docx4j online demo, to have it generate appropriate content for you. Or see ListHelper for an example of how it can be done.
private static P getBulletedParagraph(Text text, int i) {
ObjectFactory objCreator = Context.getWmlObjectFactory(); // Object used to
create other Docx4j Objects
P paragraph = objCreator.createP(); // create Paragraph object
PPr ppr = objCreator.createPPr(); // create ppr
NumPr numpr = objCreator.createPPrBaseNumPr();
PStyle style = objCreator.createPPrBasePStyle();// create Pstyle
NumId numId = objCreator.createPPrBaseNumPrNumId();
numId.setVal(BigInteger.valueOf(6));
numpr.setNumId(numId);
R run = objCreator.createR();
Br br = objCreator.createBr();
run.getContent().add(text);
Ilvl iLevel = objCreator.createPPrBaseNumPrIlvl(); // create Ilvl Object
numpr.setIlvl(iLevel);
iLevel.setVal(BigInteger.valueOf(i)); // Set ilvl value
ppr.setNumPr(numpr);
style.setVal("ListParagraph"); // set value to ListParagraph
ppr.setPStyle(style);
paragraph.setPPr(ppr);
paragraph.getContent().add(run);
// paragraph.getContent().add(br); Adds line breaks
return paragraph;
}
I believe this is what you're looking for. This method will return you a paragraph object that has bullets. If you uncomment out the last line before returning paragraph, your paragraph object will also contain line breaks. If you didn't know, "Ilvl", or "eye-level" means indentation. Its the same as clicking the tab button when typing to a document the conventional way. Setting the ilvl element to 1 is the same as clicking tab 1 time. Setting it to 2 is the same as clicking the tab button 2 times, and so on. So it doesn't matter what number you give it. Although it will change the type of bullet you get. What really matters is the numid. Setting the numid to 6 will give you bullets instead of numbers. You will also need to set the PPr style to "ListParagraph". I hope this helps.

Skip feature when classifying, but show feature in output

I've created a dataset which contains +/- 13000 rows with +/- 50 features. I know how to output every classification result: prediction and actual, but I would like to be able to output some sort of ID with those results. So i've added a ID column to my dataset but I don't know how disregard the ID when classifying while still being able to output the ID with every prediction result. I do know how to select features to output with every prediction.
Use FilteredClassifier. See this and this .
Let's say follwoing are the attributes in the bbcsport.arff that you want to remove and is in a file attributes.txt line by line..
serena
serve
service
sets
striking
tennis
tiebreak
tournaments
wimbledon
..
Here is how you may include or exclude the attributes by setting true or false. (mutually elusive) remove.setInvertSelection(false)
BufferedReader datafile = new BufferedReader(new FileReader("bbcsport.arff"));
BufferedReader attrfile = new BufferedReader(new FileReader("attributes.txt"));
Instances data = new Instances(datafile);
List<Integer> myList = new ArrayList<Integer>();
String line;
while ((line = attrfile.readLine()) != null) {
for (n = 0; n < data.numAttributes(); n++) {
if (data.attribute(n).name().equalsIgnoreCase(line)) {
if(!myList.contains(n))
myList.add(n);
}
}
}
int[] attrs = myList.stream().mapToInt(i -> i).toArray();
Remove remove = new Remove();
remove.setAttributeIndicesArray(attrs);
remove.setInvertSelection(false);
remove.setInputFormat(data); // init filter
Instances filtered = Filter.useFilter(data, remove);
'filtered' has the final attributes..
My blog .. http://ojaslabs.com/include-exclude-attributes-in-weka