PigUnit test failed due to brackets and commas - unit-testing

I am trying to create a Pig script that changes the order of the columns. This is what I have come up with so far:
inputdata = LOAD 'path/to/file/on/hdfs' USING PigStorage() AS (param1:chararray, param2:chararray, param3:chararray);
outputdata = FOREACH inputdata GENERATE param1, param3, param2;
DUMP outputdata;
I've not tried this yet on HDFS but I figured I'd go ahead and write the unit test first. Unfortunately it doesn't work.
Unit Test code :
PigTest test = new PigTest("path_to_script.pig");
FixHadoopOnWindows.runFix();
String[] input = {
"valueparam1\tvalueparam2\tvalueparam3"
};
String[] output = {
"valueparam1\tvalueparam3\tvalueparam2"
};
test.assertOutput("inputdata", input, "outputdata", output);
The FixHadoopOnWindows bit is a fix so I can run my unit tests on a Windows machine easily. I found it on a blog and it helped resolve the permission issues I was having.
So now my tests run, but the problem is that the assertOutput fails. When I check the difference, I get this:
Expected:
valueparam1 valueparam3 valueparam2
Actual:
(valueparam1,valueparam3,valueparam2)
So I'm getting these brackets and commas I never asked for. Now I'm not sure whether this is a bug in my unit testing code or in my actual script, so any advice to get me started would be great. Thanks.

Your code looks ok. The brackets mean that your relation outputdata consists of a tuple with three values.
If you later want to store your data separated by tabs just do STORE outputdata INTO 'dest' USING PigStorage('\t');
http://pig.apache.org/docs/r0.11.0/basic.html#Data+Types+and+More

Figured it out. PigUnit reads the outputdata relation directly, and each of its rows is a Tuple in Pig. The tuples are only converted to tab-separated records when I store them to a file.
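Given that, one way to make the original assertion pass is to write the expected rows in the tuple notation shown in the failure output. Here is a minimal sketch, assuming a hypothetical helper `toTupleString` (it is not part of PigUnit) and flat tuples of simple values only; nested bags or maps print differently and are not handled.

```java
// Hypothetical helper, not part of PigUnit: formats a tab-separated record
// in the parenthesised, comma-separated form seen in the failure output.
class TupleFormat {

    static String toTupleString(String tabSeparated) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = tabSeparated.split("\t", -1);
        return "(" + String.join(",", fields) + ")";
    }

    public static void main(String[] args) {
        // Expected rows for the assertion in the question, now in tuple form.
        String[] output = {
            toTupleString("valueparam1\tvalueparam3\tvalueparam2")
        };
        System.out.println(output[0]); // prints (valueparam1,valueparam3,valueparam2)
    }
}
```

The resulting `output` array could then be passed to `test.assertOutput(...)` in place of the tab-separated one.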


Stata : generate/replace alternatives?

I have used Stata for several years now, along with other languages like R.
Stata is great, but there is one thing that annoys me: the generate/replace behaviour, and especially the "... already defined" error.
It means that if we want to run a piece of code twice, and that piece of code contains the definition of a variable, the definition needs two lines:
capture drop foo
generate foo = ...
In other languages such as R, it takes just one line.
So is there another way to define variables that combines "generate" and "replace" in one command?
I am unaware of any way to do this directly. Further, as @Roberto's comment implies, there are reasons why simply issuing a generate command will not overwrite (see: replace) the contents of a variable.
To be able to do this while maintaining data integrity, you would need to issue two separate commands, as your question points out (explicitly dropping the existing variable before generating the new one). I see this as a way in which Stata forces the user to be clear about his/her intentions.
It might be noted that Stata is not alone in this regard. SQL Server, for example, requires the user drop an existing table before creating a table with the same name (in the same database), does not allow multiple columns with the same name in a table, etc. and all for good reason.
However, if you are really set on being able to issue a one-liner in Stata to do what you desire, you could write a very simple program. The following should get you started:
program mkvar
version 13
syntax anything=exp [if] [in]
capture confirm variable `anything'
if !_rc {
drop `anything'
}
generate `anything' `exp' `if' `in'
end
You would then save the program as mkvar.ado in a directory that Stata searches (e.g., C:\ado\personal\ on Windows; if you are unsure, type sysdir) and call it using:
mkvar newvar=expression [if] [in]
Now, I haven't tested the above code much, so you may have to do a bit of debugging, but it has worked fine in the examples I've tried.
On a closing note, I'd advise you to exercise caution when doing this - certainly you will want to be vigilant with regard to altering your data, retain a copy of your raw data while a do file manipulates the data in memory, etc.

Quick code advice for beginner

I am a beginner and I want to know what's wrong with this code. I want to count the number of times the push button on portA is pushed, then show this value using the LEDs on portC. Thanks
You need braces around a multi-statement block, if you want to use it as the body of an if (or for or whatever) statement:
else if (PORTA.RA2==1) {
count = count+1;
PORTC = count;
}
otherwise only the first statement is conditional; so your code executes PORTC = count; every time, whatever the result of the if tests.
I like to put braces around all such blocks, even if there's only a single statement, so I can't forget to add them if I add more statements later.
Also, main must return int not void, and you should take more care formatting your code to match its logical structure.
UPDATE: Also, you never initialise count, so it has an arbitrary floating-point value. You want a small integer type, since it's only supposed to take integer values from 0 to 16, and you need to initialise it:
char count = 0;
If you're setting TRISA to 1 that means the only input on that port is RA0, but you are trying to use RA2. Be sure to clear the ANSELA0 bit. Make sure you set the config bits properly or else your code might not run.
To avoid getting downvoted in the future:
Choose an informative question title.
Properly indent your code.
Say the exact PIC you are using and what board it is on.
Say what development environment and compiler you are using.
Provide pictures of your setup so we can check your wiring.
Most importantly, tell us exactly how you are testing the code, what the expected result is, and what you are actually observing.
My company offers more advice here: http://www.pololu.com/support

Comparing two documents

I have two very large lists, both originally in Excel. The larger one is a list of about 160,000 emails with other information such as name and address. The smaller one is a list of just 18,000 emails.
My question is what would be the easiest way to get rid of all 18,000 rows from the first document that contain the email addresses from the second?
I was thinking regex or maybe there is another application I can use? I have tried searching online but it seems like there isn't much specific to this. I also tried notepad++ but it freezes when I try to compare these large files.
-Thank You in Advance!!
Good question. One way I would tackle this is to write a C++ program (you could apply the same idea in the language of your choice; you never mentioned which languages you are proficient in) that reads each item of the smaller file into a vector of strings. First, of course, use Excel to save the files as CSV instead of XLS or XLSX, which comma-separates the values so they are easier to work with. For the larger list, "Save As" a copy containing just the email addresses, deleting the other columns for now.
Then, you could open the larger list and use a nested loop to decide whether each entry should be written to the output file. Something like:
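// Copy every entry of the large list that does not appear in the small list.
// The three vectors are std::vector<std::string> (needs <vector> and <string>).
for (std::size_t y = 0; y < LargeListVector.size(); y++) {
    bool foundMatch = false;
    for (std::size_t x = 0; x < SmallListVector.size(); x++) {
        if (SmallListVector[x] == LargeListVector[y]) {
            foundMatch = true;
            break;  // no need to keep scanning once a match is found
        }
    }
    if (!foundMatch) OutputVector.push_back(LargeListVector[y]);
}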
That might be partially pseudo-code, but do you get the idea?
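The nested loop compares every pair of entries, which for 160,000 and 18,000 emails is close to 2.9 billion comparisons. Below is a sketch of the same idea using a hash set, so that each row of the large list needs only one lookup. It is written in Java purely for illustration, and all names are made up. It assumes that emails match regardless of case and surrounding whitespace, and that the CSV fields contain no quoted commas (a real CSV parser would be needed otherwise).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: keeps the rows of the large list whose email
// does not appear in the small list.
class ListFilter {

    // Assumption: emails are equal if they match after trimming and lowercasing.
    static String normalize(String email) {
        return email.trim().toLowerCase(Locale.ROOT);
    }

    // largeRows: CSV lines; emailColumn: 0-based index of the email field.
    static List<String> removeListed(List<String> largeRows, int emailColumn,
                                     List<String> smallList) {
        Set<String> toRemove = new HashSet<>();
        for (String email : smallList) {
            toRemove.add(normalize(email));
        }
        List<String> kept = new ArrayList<>();
        for (String row : largeRows) {
            // Naive split: assumes no field contains a quoted comma.
            String[] cells = row.split(",", -1);
            boolean listed = emailColumn < cells.length
                    && toRemove.contains(normalize(cells[emailColumn]));
            if (!listed) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> large = Arrays.asList("Ann,ann@example.com", "Bob,bob@example.com");
        List<String> small = Arrays.asList("BOB@example.com");
        System.out.println(removeListed(large, 1, small));
    }
}
```

Because whole rows are kept, the name and address columns survive without a separate emails-only copy of the large list.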
So I read a forum post that suggested this formula:
=MATCH(B1,$A$1:$A$3,0)>0
Column B was the large list with the 160,000 entries, and column A was my list of 18,000 entries that I needed to delete.
I pasted this formula into a separate column to match everything. It printed either an error or TRUE; if the value was in both columns, it printed TRUE.
Then, because I suck with Excel, I pasted the text into Notepad++ and searched for all lines that contained TRUE (with match case enabled, because some of my data contained the word "true" without caps). I marked those lines, then removed all bookmarked lines via Search > Bookmark. I pasted the result back into Excel and voila.
I would like to thank you guys for helping and pointing me in the right direction :)

Repeating pattern matching in Eclipse regexp

I'm trying to turn XML files into SQL statements within Eclipse to save on the manual work of converting large files.
A line such as:
<TABLE_NAME COL_1="value1" COL_2="value2"/>
should be converted to:
insert into TABLE_NAME (COL_1, COL_2) values ("value1", "value2");
So far, I have managed to match and capture the table name and the first column/value pair with:
<(\w+)( \w+)=(".+?").*/>
The .* near the end is just there to test the first part of the pattern and should be removed once it is complete.
The following replace pattern yields the following result:
insert into $1 ($2) values ($3);
insert into TABLE_NAME ( COL_1) values ("value1");
The problem I'm having is that the number of columns is different for different tables, so I would like a generic pattern that would match n column/value pairs and repeatedly use the captured groups in the replace pattern. I haven't managed to understand how to do this yet although \G seems to be a good candidate.
Ideally, this would be solved in a single regexp statement, though I also wouldn't be against multiple statements that have to be run in sequence (though I really don't want to have to force the developer to execute once for every column/value pair).
Anyone have any ideas?
In the end I did not solve this with regular expressions as originally envisaged. After considering alternative solutions, I solved it by refactoring test code to make it reusable and then co-opting dbunit to use tests to write data to the different stages of the development databases.
I know that this is a misuse of tests, but it does make the whole thing much simpler since the user gets a green/red feedback as to whether the data is inserted or not and the same test data can be inserted for manual acceptance tests as is used for the automated component/integration tests (which used dbunit with an in-memory database). As long as these fake test classes are not added to the test suite, then they are only executed by hand.
The resulting fake tests ended up being extremely simple and lightweight, for instance:
public class MyComponentTestDataInserter extends DataInserter {

    @Override
    protected String getDeleteSql() {
        return "./input/component/delete.sql";
    }

    @Override
    protected String getInsertXml() {
        return "./input/component/TestData.xml";
    }

    @Test
    public void insertIntoDevStage() {
        insertData(Environment.DEV);
    }

    @Test
    public void insertIntoTestStage() {
        insertData(Environment.TEST);
    }
}
This also had the additional advantage that a developer can insert the data into a single environment by simply executing a single test method via the context menu in the IDE. At the same time, by running the whole test class, the exact same data can be deployed to all environments simultaneously.
Also, a failure at any point during data cleanup or insertion causes the whole transaction to be rolled back, preventing an inconsistent state.
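For the conversion originally asked about, a short program can do what a single find/replace pattern cannot: match the element once, then loop over however many column/value pairs follow. The sketch below is not the approach that was adopted above. The class name is illustrative, and it assumes self-closing elements on one line whose attribute values contain no escaped double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: turns <TABLE COL="value" .../> into an insert statement.
class XmlToInsert {

    // Group 1: element name; group 2: the whole run of attribute pairs.
    private static final Pattern ELEMENT =
            Pattern.compile("<(\\w+)((?:\\s+\\w+=\"[^\"]*\")*)\\s*/>");

    // Group 1: column name; group 2: quoted value (quotes included).
    private static final Pattern ATTRIBUTE =
            Pattern.compile("(\\w+)=(\"[^\"]*\")");

    static String convert(String line) {
        Matcher element = ELEMENT.matcher(line.trim());
        if (!element.matches()) {
            throw new IllegalArgumentException("Not a self-closing element: " + line);
        }
        List<String> columns = new ArrayList<>();
        List<String> values = new ArrayList<>();
        // find() is called repeatedly, once per column/value pair.
        Matcher attribute = ATTRIBUTE.matcher(element.group(2));
        while (attribute.find()) {
            columns.add(attribute.group(1));
            values.add(attribute.group(2));
        }
        return "insert into " + element.group(1)
                + " (" + String.join(", ", columns) + ") values ("
                + String.join(", ", values) + ");";
    }

    public static void main(String[] args) {
        System.out.println(convert("<TABLE_NAME COL_1=\"value1\" COL_2=\"value2\"/>"));
    }
}
```

The values keep their double quotes so that the output matches the target statement in the question.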

Unit Testing Object/Model Converters

There are several places where one has to convert one data object into another, for example incoming data from a web service or a REST service into an object that can be persisted.
Is there a way to unit test that all incoming data gets filled into the right places of the "outgoing" objects without copying the converter logic inside the test?
If the fields are all called the same, and one is feeling adventurous, reflection could do some of the work. But I don't feel like going down that path.
Acceptance tests won't catch a bug if say a Person that has a name and a firstname gets converted into a Person where name == firstname due to some copy+paste mistake.
So right now I just skip testing object/model conversion and rather take a really good look at my converter.
Has anyone any idea on how to do this differently?
If you need to test that multiplication works, you should not replicate the multiplication logic. Define test data that you know are correct, and test that the multiplication is ok.
assert( 4*5, 20 )
and not
assert( 4*5, 4*5 )
Here the test data are 4, 5, and 20, and the logic that ties them together, and that is under test, is the multiplication. The same principle holds in your case: define test data and test that the conversion produces the right results.
(As you point out, making the tests themselves generic with reflection, etc., defeats the purpose of testing.)
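Applied to the Person example from the question, this could look like the sketch below. All classes and field names are hypothetical stand-ins for the real model. Every input field gets its own distinct literal and the expected values are written out by hand, so a copy+paste slip such as name == firstname makes the check fail.

```java
// Hypothetical sketch: hand-written test data for a converter, no copied logic.
class PersonConverterCheck {

    public static void main(String[] args) {
        // Distinct literal per field, so swapped or duplicated fields are detectable.
        PersonEntity entity = PersonConverter.convert(new PersonDto("Doe", "Jane"));

        // Expected values are literals, not derived from the input object.
        check("Doe".equals(entity.lastName), "lastName");
        check("Jane".equals(entity.firstName), "firstName");
        System.out.println("converter test passed");
    }

    static void check(boolean ok, String field) {
        if (!ok) {
            throw new AssertionError("wrong value in field " + field);
        }
    }
}

// Incoming object, e.g. from a web service (hypothetical).
class PersonDto {
    final String name;
    final String firstName;

    PersonDto(String name, String firstName) {
        this.name = name;
        this.firstName = firstName;
    }
}

// Outgoing, persistable object (hypothetical).
class PersonEntity {
    final String lastName;
    final String firstName;

    PersonEntity(String lastName, String firstName) {
        this.lastName = lastName;
        this.firstName = firstName;
    }
}

// The converter under test (hypothetical).
class PersonConverter {
    static PersonEntity convert(PersonDto dto) {
        return new PersonEntity(dto.name, dto.firstName);
    }
}
```

If the converter passed `dto.name` for both fields, the `firstName` check would fail, because "Doe" and "Jane" differ.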