Pig: Bag to Tuple - mapreduce

I am new to Pig and still exploring efficient ways to do simple things.
For example, I have a bag of events
{"events":[{"event": ev1}, {"event": ev2}, {"event":ev3}, ....]}
And I want to collapse that as just a tuple, something like
{"events":[ev1, ev2, ev3, ....]}
Is there a way to achieve this in Pig?
I have been struggling with this for a while, but without much success :(.
Thanks in advance

Looking at your input it seems that your schema is something like:
A: {name:chararray, vals:{(inner_name:chararray, inner_value:chararray)}}
As I mentioned in a comment to your question, actually turning this into an array of nothing but inner_values will be extremely difficult since you don't know how many fields you could potentially have. When you don't know the number of fields you should always try to use a bag in Pig.
Luckily, if you can in fact use a bag for this it is trivial:
-- Project out only inner_value from the bag vals
B = FOREACH A GENERATE name, vals.inner_value ;

Thanks all for informative comments. They helped me.
However, I found I was missing an important feature of a Schema, namely, every field has a key, and a value (a map). So now I achieve what I wanted by writing a UDF converting the bag to a comma separated string of values:
package BagCondenser;
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
public class BagToStringOfCommaSeparatedSegments
extends EvalFunc<String> {
@Override
public String exec(Tuple input) throws IOException {
// Condensed bag to be returned
String listOfSegmentIds = new String("");
// Cast the input to a bag
Object inputObject = input.get(0);
// Throw error if not bag-able input
if (!(inputObject instanceof DataBag))
throw new IOException("Expected input to be a bag, but got: "
+ inputObject.getClass());
// The input bag
DataBag bag = (DataBag) inputObject;
Iterator it = bag.iterator();
// Collect the first field of each tuple and append it to the output string
while(it.hasNext()) {
// If the return string already had values, append a ','
if ( ! listOfSegmentIds.equals("") )
listOfSegmentIds += ",";
Tuple tuple = (Tuple) it.next();
listOfSegmentIds += tuple.get(0).toString();
}
return listOfSegmentIds;
}
}
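The string-building loop in the UDF above can also be written with java.util.StringJoiner, which handles the separator for you. This is only a sketch in plain Java: the nested lists are hypothetical stand-ins for Pig's DataBag and Tuple so that it runs without the Pig jars, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class BagJoinSketch {
    // Joins the first field of every row with commas.
    // Each inner List<Object> is a hypothetical stand-in for a Pig Tuple,
    // and the outer list is a stand-in for a DataBag.
    static String joinFirstFields(List<List<Object>> bag) {
        StringJoiner joiner = new StringJoiner(",");
        for (List<Object> tuple : bag) {
            // String.valueOf also copes with a null field
            joiner.add(String.valueOf(tuple.get(0)));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList("ev1"),
                Arrays.<Object>asList("ev2"),
                Arrays.<Object>asList("ev3"));
        System.out.println(joinFirstFields(bag)); // prints ev1,ev2,ev3
    }
}
```

Unlike the equals("") test in the UDF, the joiner still inserts a separator when the first values happen to be empty strings.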

How to sort an object list in Java?

I have a list of objects and I want to sort it by date in ascending order.
First I get the reservations that fall between these dates and save them to a new List.
Now I need some way to sort it.
I tried Collections.sort(reservationsByDate) and Collections.sort(reservationsByDate, Collections.reverseOrder()), but neither produced anything.
I'm fairly new to Java, so if there's something I'm missing, please help me implement this.
Here's my code:
public List<Reservation> getAllReservationsSortedByDate(LocalDate from, LocalDate to) {
int fromDate = from.getDayOfMonth();
int toDate = to.getDayOfMonth();
ArrayList<Reservation> reservationsByDate = new ArrayList<>();
for (Reservation reservation : reservations) {
if (reservation.getFromDate().getDayOfMonth() >= fromDate && reservation.getToDate().getDayOfMonth() <= toDate) {
reservationsByDate.add(reservation);
}
}
//reservationsByDate needs to be sorted by dates in ascending order...
return reservationsByDate;
}
Thank you for your help.
You are iterating over "reservations". The definition of this field is not shown. If it is empty the result would always be an empty list.
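Once the list is actually populated, the sort itself could look like the sketch below. The Reservation class here is a hypothetical minimal version, since the real one is not shown in the question. The filter compares whole LocalDate values instead of only the day of the month, which is an assumption about what was intended.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ReservationSortSketch {
    // Hypothetical minimal Reservation; the real class is not shown in the question.
    static final class Reservation {
        private final LocalDate fromDate;
        private final LocalDate toDate;

        Reservation(LocalDate fromDate, LocalDate toDate) {
            this.fromDate = fromDate;
            this.toDate = toDate;
        }

        LocalDate getFromDate() { return fromDate; }
        LocalDate getToDate() { return toDate; }
    }

    // Keeps reservations lying fully inside [from, to], sorted by start date ascending.
    static List<Reservation> sortedByDate(List<Reservation> reservations,
                                          LocalDate from, LocalDate to) {
        List<Reservation> result = new ArrayList<>();
        for (Reservation r : reservations) {
            // Assumption: compare complete dates, not just getDayOfMonth()
            if (!r.getFromDate().isBefore(from) && !r.getToDate().isAfter(to)) {
                result.add(r);
            }
        }
        // LocalDate is Comparable, so it can serve directly as the sort key
        result.sort(Comparator.comparing(Reservation::getFromDate));
        return result;
    }

    public static void main(String[] args) {
        List<Reservation> all = Arrays.asList(
                new Reservation(LocalDate.of(2021, 3, 20), LocalDate.of(2021, 3, 22)),
                new Reservation(LocalDate.of(2021, 3, 5), LocalDate.of(2021, 3, 7)),
                new Reservation(LocalDate.of(2021, 4, 10), LocalDate.of(2021, 4, 12)));
        for (Reservation r : sortedByDate(all, LocalDate.of(2021, 3, 1), LocalDate.of(2021, 3, 31))) {
            System.out.println(r.getFromDate()); // prints 2021-03-05, then 2021-03-20
        }
    }
}
```

For descending order, Comparator.comparing(Reservation::getFromDate).reversed() should do. The one-argument Collections.sort(list) only compiles when the element type implements Comparable, which is why a Comparator is passed here.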

FLUTTER / DART LISTS - How can you check if a sequence of elements contained in a list exists in the same order in another list?

I am working on a vocabulary quiz app (French to English / English to French).
I need to check if what user types is what is expected.
Example : "Une machine à laver" --> Expected answer is "A washing machine".
The user can make many different sorts of mistakes such as spelling : "A watching machine", a word order mistake : "A machine washing" or a total mistake "A garden tool".
I have managed to deal with the checking when the expected word is just one word : "un jardin : a garden".
But with compound words, the issue of words order comes up.
I split the Strings into two lists :
_expectedAnswer contains the different elements of the answer : [washing,machine]
_typedAnswer contains the different elements of what the user typed : [machine, washing] or [watching,machine] or [washing,machine], or [a,washing,machine] (Here, the user added an article before the noun, which shouldn't be seen as a mistake).
At the end, the algorithm must tell the user which type of mistake they made:
Word order mistake.
Word order is correct, but one or all elements contain spelling problems.
Word order mistake + spelling problems.
Completely wrong.
For the spelling check I use the Levenshtein algorithm, and it is satisfactory.
So first I thought I should check if _typedAnswer contains all the elements of _expectedAnswer.
-> Should check if the sequence of elements is the same : the order is OK, and no spelling pb.
-> Elements are all present but sequence is not respected : Problem with word order, no spelling mistakes.
-> Elements are not present
--> Then we check if "similar elements" are present (which would indicate user made a spelling mistake) .... and check the order of elements too...
Any advice to help me work out this algorithm ?
I have read a lot about all the functions linked to dart lists, but I have to admit, that I kinda got lost on which one would be pertinent to use...
I took the last hour to solve your problem. I made it easy for you to test it and understand the logic behind it.
To use it in your code, you have to put the control-flow statements into a separate function and pass it the list of user input elements as well as the list of expected result elements.
For BOTH lists you need each word as a String, trimmed of all whitespace!
I believe you are able to do that part, as already described in your question. I would deeply appreciate it if you upvoted my effort and accepted the answer, if it helped you out.
EDIT: Requires an official dart package on pub.dev
Add a line like this to your package's pubspec.yaml:
dependencies:
collection: ^1.15.0
Here is the logic, please copy it and test it inside DartPad:
import 'package:collection/collection.dart';
void main() {
List expectedAnswer = ["one", "two", "three"];
List listWrongSpelling = ["oe", "two", "three"];
List listWrongSpellings = ["oe", "twoo", "three"];
List listWrongOrder = ["two", "one", "three"];
List listEntirelyWrong = ["dskfrm", "twhoo", "111", "tedsf"];
List listIdentical = ["one", "two", "three"];
// FYI: Checks if there is any kind of mistake (is used below dynamically)
Function eq = const ListEquality().equals;
final result1 = eq(expectedAnswer, listWrongSpelling); // false
final result2 = eq(expectedAnswer, listWrongSpellings); // false
final result3 = eq(expectedAnswer, listWrongOrder); // false
final result4 = eq(expectedAnswer, listEntirelyWrong); // false
final result5 = eq(expectedAnswer, listIdentical); // true, the perfect answer
// print(result1);
// print(result2);
// print(result3);
// print(result4);
// print(result5);
// CHECK IF ANSWER IS PERFECT:
bool isPerfect(List toBeChecked, List expectedResult) {
Function eq = const ListEquality().equals;
return eq(toBeChecked, expectedResult) ? true : false;
}
// ONLY EXECUTE OTHERS IF THERE IS A MISTAKE:
// Checks for word order mistake
// Condition: Must contain each element with identical value, hence only order can be wrong
bool checkOrder(List toBeChecked, List expectedElements) {
List<bool> listOfResults = [];
for (var element in toBeChecked)
{
bool result = expectedElements.contains(element);
listOfResults.add(result);
}
// If one element is not in expectedElements return false
return listOfResults.contains(false) ? false : true;
}
// Checks for any spelling errors
bool checkSpelling(List toBeChecked, List expectedElements) {
// Once this function gets executed there are only two errors possible:
// 1st: Unexpected elements (e.g. an article) (and possibly spelling errors) >> return false
// 2nd: No unexpected elements BUT spelling errors >> return true
return toBeChecked.length == expectedElements.length ? true : false;
}
// FINAL RESULT
String finalResult = "";
// Please try out the other lists from above for all possible cases!
bool isPerfectAnswer = isPerfect(listIdentical, expectedAnswer);
bool isWordOrderIncorrect = checkOrder(listIdentical, expectedAnswer);
bool isSpellingIncorrect = checkSpelling(listIdentical, expectedAnswer);
if(isPerfectAnswer) {
// The user entered the correct solution perfectly
finalResult = "Everything is correct!";
} else if(isWordOrderIncorrect) {
// CHECKS IF ONLY WORD ORDER IS WRONG
// false means there are unexpected elements in the user input
// true means there are no unexpected elements, but the order is wrong, since the first check failed.
// If it is true, both lists contain the same elements.
finalResult = "Your word order is incorrect!";
} else if(isSpellingIncorrect) {
// Either unexpected elements (length of lists must differ) OR misspelled (same length, error in spelling)
finalResult = "Your spelling is incorrect!";
} else {
// If it gets here, the input has:
// Unexpected elements (e.g. an article), possibly spelling errors AND possibly order mistakes
// You could check if the order is correct, but what's the point of writing complex logic for that,
// knowing there are also unexpected elements, like additional prepositions or wrong words, in addition
// to spelling mistakes.
// Just show your user a message like this:
finalResult = "Unfortunately, your answer is incorrect. Try again!";
}
// PRINT RESULT:
print(finalResult);
}
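For the "same sequence in the same order" part of the question, here is a small sketch in Java rather than Dart. It relies on Collections.indexOfSubList to find the expected words as one contiguous run, so a leading article such as "a" is tolerated. The class and method names are made up for illustration, and the Levenshtein spelling check is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class WordOrderSketch {
    // True if 'expected' occurs inside 'typed' as one contiguous run, in the same order.
    static boolean containsInOrder(List<String> typed, List<String> expected) {
        return Collections.indexOfSubList(typed, expected) >= 0;
    }

    // True if both lists hold the same words the same number of times, ignoring order.
    static boolean sameWordsAnyOrder(List<String> typed, List<String> expected) {
        List<String> a = new ArrayList<>(typed);
        List<String> b = new ArrayList<>(expected);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    // Rough classification; spelling similarity would need a separate check.
    static String classify(List<String> typed, List<String> expected) {
        if (containsInOrder(typed, expected)) {
            return "correct order";
        }
        if (sameWordsAnyOrder(typed, expected)) {
            return "word order mistake";
        }
        return "spelling mistake or wrong answer";
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("washing", "machine");
        System.out.println(classify(Arrays.asList("a", "washing", "machine"), expected)); // correct order
        System.out.println(classify(Arrays.asList("machine", "washing"), expected));      // word order mistake
        System.out.println(classify(Arrays.asList("watching", "machine"), expected));     // spelling mistake or wrong answer
    }
}
```

The last branch is where a per-word distance check could decide between a spelling problem and a completely wrong answer.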

Comparing Substrings to JList Strings

In advance, please forgive me if I do not give adequate background information for my question. Long time reader, first time asker.
I am making a program where one has a database of cars accessed through a tab delimited .txt file (we did something like this recently in my programming class, so I wanted to expand upon it).
Instead of using the terminal window, my program displays the Car objects (containing make, model, year, price, etc.), which are stored in an ArrayList. I'm using a JFrame, a JList, and a ListModel since I'm using an array of Car objects.
In my program, I wanted to create a delete method where the user could delete items from the database. They first select the item in the JList and then click the delete button. This invokes the delete() method, which is shown below...
void delete()
{
int i = list.getSelectedIndex();
String string = (String)listModel.getElementAt(i);
for(Car c : cars)
{
String year = String.valueOf(c.getYear());
String conditionReport = String.valueOf(c.getConditionReport());
String price = String.valueOf(c.getPrice());
if(c.getMake().indexOf(string) != -1 && c.getModel().indexOf(string) != -1 && year.indexOf(string) != -1 && conditionReport.indexOf(string) != -1 && price.indexOf(string) != -1 && c.getWarranty().indexOf(string) != -1 && c.getComments().indexOf(string) != -1)
{
int choice = JOptionPane.showConfirmDialog(null, "Are you sure you would like to remove the " + cars.get(i).getYear() + " " + cars.get(i).getMake() + " " + cars.get(i).getModel() + "?", "Choose One", JOptionPane.YES_NO_OPTION);
if(choice == JOptionPane.NO_OPTION || choice == JOptionPane.CLOSED_OPTION)
{
return;
} else
{
cars.remove(c);
listModel.removeElementAt(i);
}
}
}
writeFile();
}
I have pinpointed my issue to be inside the if statement. (I printed things before and after it to find where the program goes wrong.) 'list' is my JList and 'listModel' is my default list model. Car is an object I created that contains the elements (as seen by the get methods). The elements shown in the listModel are merely Strings that show getMake(), getModel(), and so forth... (Each 'get' item is separated by about 10 spaces.)
What am I doing wrong in the if statement? I figured that the getMake() and getModel() (and so forth) would be substrings of the index selected.
Thank you so much for your assistance! Any input regarding ways I could make further questions more specific and clear would be greatly appreciated!
It seems like you are doing this to find the selected Car in some kind of data structure. You would be better off programming a custom list model that has access to cars itself. Then you could retrieve the selection more directly. If cars is an ArrayList that the list merely parallels, I also don't see why you can't do something to the effect of cars.remove(list.getSelectedIndex());. Or, since a JList can display any object, override Car's toString to display what the list currently displays and have the list display Cars. Then you can cars.remove((Car)list.getSelectedValue());.
But aside from that, based on your description it sounds like you mean to do the evaluation the other way. It's the list item that should contain all of the Car attributes, rather than all of the Car attributes containing the list item. So something like
if( string.contains(c.getMake()) && string.contains(year) // and so on
(Or with indexOf but since contains merely returns indexOf > -1, using contains makes your code somewhat shorter.)
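A minimal sketch of the reversed check, kept free of Swing so it runs on its own. The Car class is a hypothetical cut-down version with only three fields, and the row string stands in for the selected JList entry. Note that plain contains can give false positives (a year could also appear inside a price), so treat this as an illustration of the direction of the test, not a robust lookup.

```java
import java.util.ArrayList;
import java.util.List;

final class CarMatchSketch {
    // Hypothetical cut-down Car; the real class has more attributes.
    static final class Car {
        final String make;
        final String model;
        final int year;

        Car(String make, String model, int year) {
            this.make = make;
            this.model = model;
            this.year = year;
        }
    }

    // True when the displayed row contains every attribute of the car.
    static boolean rowDescribes(String row, Car c) {
        return row.contains(c.make)
                && row.contains(c.model)
                && row.contains(String.valueOf(c.year));
    }

    // Returns the index of the first car described by the row, or -1 if none matches.
    static int indexOfMatch(List<Car> cars, String row) {
        for (int i = 0; i < cars.size(); i++) {
            if (rowDescribes(row, cars.get(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Car> cars = new ArrayList<>();
        cars.add(new Car("Honda", "Civic", 2010));
        cars.add(new Car("Ford", "Focus", 2012));
        String row = "Ford          Focus          2012";
        int index = indexOfMatch(cars, row);
        if (index >= 0) {
            cars.remove(index); // remove after the search, not inside a for-each loop
        }
        System.out.println(cars.size()); // prints 1
    }
}
```

Looking up the index first and removing afterwards also sidesteps the ConcurrentModificationException that calling cars.remove(c) inside a for-each loop over cars will typically trigger.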

How to remove elements by substrings from a STL container

I have a vector of objects (the objects are term nodes that, amongst other fields, contain a string field with the term string)
class TermNode {
private:
std::wstring term;
double weight;
...
public:
...
};
After some processing and calculating the scores these objects get finally stored in a vector of TermNode pointers such as
std::vector<TermNode *> termlist;
A resulting list of this vector, containing up to 400 entries, looks like this:
DEBUG: 'knowledge' term weight=13.5921
DEBUG: 'discovery' term weight=12.3437
DEBUG: 'applications' term weight=11.9476
DEBUG: 'process' term weight=11.4553
DEBUG: 'knowledge discovery' term weight=11.4509
DEBUG: 'information' term weight=10.952
DEBUG: 'techniques' term weight=10.4139
DEBUG: 'web' term weight=10.3733
...
What I am trying to do is clean up that final list by removing terms that also occur inside phrases in the terms list. For example, looking at the above list snippet, there is the phrase 'knowledge discovery' and therefore I would like to remove the single terms 'knowledge' and 'discovery', because they are also in the list and redundant in this context. I want to keep the phrases containing the single terms. I am also thinking about removing all strings of 3 characters or fewer. But that is just a thought for now.
For this cleanup process I would like to code a class using remove_if / find_if (using the new C++ lambdas) and it would be nice to have that code in a compact class.
I am not really sure how to solve this. The problem is that I would first have to identify which strings to remove, probably by setting a flag as a delete marker. That would mean I would have to pre-process the list. I would have to find the single terms and the phrases that contain one of those single terms. I think that is not an easy task and would need some advanced algorithm. Using a suffix tree to identify substrings?
Another loop over the vector, and maybe a copy of the same vector, could do the cleanup. I am looking for the most time-efficient approach.
I have been playing with the idea or direction shown in std::list erase incompatible iterator using remove_if / find_if, and the idea used in Erasing multiple objects from a std::vector?.
So the question is basically: is there a smart way to do this and avoid multiple loops, and how could I identify the single terms for deletion? Maybe I am really missing something, but probably someone out there can give me a good hint.
Thanks for your thoughts!
Update
I implemented the removal of redundant single terms the way Scrubbins recommended as follows:
/**
* Functor gets the term of each TermNode object, checks whether the term string
* contains spaces (i.e. the term is a phrase), splits the phrase by spaces and finally
* stores these term tokens into a set. Only terms with a score of at least
* 'skipAtWeight' are taken into account.
*/
struct findPhrasesAndSplitIntoTokens {
private:
set<wstring> tokens;
double skipAtWeight;
public:
findPhrasesAndSplitIntoTokens(const double skipAtWeight)
: skipAtWeight(skipAtWeight) {
}
/**
* Implements operator()
*/
void operator()(const TermNode * tn) {
// --- skip all terms lower skipAtWeight
if (tn->getWeight() < skipAtWeight)
return;
// --- get term
wstring term = tn->getTerm();
// --- iterate over term, check for spaces (if this term is a phrase)
for (unsigned int i = 0; i < term.length(); i++) {
if (iswspace(term.at(i))) { // wide-character test from <cwctype>, since term is a wstring
if (0) {
wcout << "input term=" << term << endl;
}
// --- simply tokenize term by space and store tokens into
// --- the tokens set
// --- TODO: check if this really is UTF-8 aware, esp. for
// --- strings containing umlauts, etc !!
wistringstream iss(term);
copy(istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(iss),
istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(),
inserter(tokens, tokens.begin()));
if (0) {
wcout << "size of token set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());
}
}
}
}
/**
* return set of extracted tokens
*/
set<wstring> getTokens() const {
return tokens;
}
};
/**
* Functor to find terms in tokens set
*/
class removeTermIfInPhraseTokensSet {
private:
set<wstring> tokens;
public:
removeTermIfInPhraseTokensSet(const set<wstring>& termTokens)
: tokens(termTokens) {
}
/**
* Implements operator()
*/
bool operator()(const TermNode * tn) const {
if (tokens.find(tn->getTerm()) != tokens.end()) {
return true;
}
return false;
}
};
...
findPhrasesAndSplitIntoTokens objPhraseTokens(6.5);
objPhraseTokens = std::for_each(
termList.begin(), termList.end(), objPhraseTokens);
set<wstring> tokens = objPhraseTokens.getTokens();
wcout << "size of tokens set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());
// --- remove all extracted single tokens from the final terms list
// --- of similar search terms
removeTermIfInPhraseTokensSet removeTermIfFound(tokens);
termList.erase(
remove_if(
termList.begin(), termList.end(), removeTermIfFound),
termList.end()
);
for (vector<TermNode *>::const_iterator tl_iter = termList.begin();
tl_iter != termList.end(); tl_iter++) {
wcout << "DEBUG: '" << (*tl_iter)->getTerm() << "' term weight=" << (*tl_iter)->getNormalizedWeight() << endl;
if ((*tl_iter)->getNormalizedWeight() <= 6.5) break;
}
...
I couldn't use the C++11 lambda syntax, because my Ubuntu servers currently have g++ 4.4.1 installed. Anyway, it does the job for now.
The next step is to check the quality of the resulting weighted terms against other search result sets and see how I can improve the quality and find a way to boost the more relevant terms in conjunction with the original query term. It might not be an easy task; I wish there were some "simple heuristics".
But that might be another new question once I have got a little further :-)
So thanks to all for this rich contribution of thoughts!
What you need to do first is iterate through the list and split all the multi-word values into single words. If you're allowing Unicode, this means you will need something akin to ICU's BreakIterators; otherwise you can go with a simple punctuation/whitespace split. When each string is split into its constituent words, use a hash map to keep a list of all the current words. When you reach a multi-word value, you can then check whether its words have already been found. This should be the simplest way to identify duplicates.
I suggest using the "erase-remove" idiom in this way:
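To make the two passes concrete, here is a rough sketch in Java rather than C++, working on plain strings instead of TermNode pointers. The class and method names are invented for illustration, the split is a simple whitespace split (no ICU), and weights are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PhraseCleanupSketch {
    // Removes single-word terms that also occur as a word inside some multi-word phrase.
    static List<String> dropTermsCoveredByPhrases(List<String> terms) {
        // Pass 1: collect the words of every multi-word phrase
        Set<String> phraseTokens = new HashSet<>();
        for (String term : terms) {
            String[] words = term.trim().split("\\s+");
            if (words.length > 1) {
                phraseTokens.addAll(Arrays.asList(words));
            }
        }
        // Pass 2: drop every term that is exactly one of those words
        List<String> result = new ArrayList<>(terms);
        result.removeIf(phraseTokens::contains);
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "knowledge", "discovery", "applications", "process", "knowledge discovery");
        System.out.println(dropTermsCoveredByPhrases(terms));
        // prints [applications, process, knowledge discovery]
    }
}
```

With a hash set the whole cleanup stays roughly linear in the total number of words, so two passes over a list of about 400 entries should not be a concern.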
struct YourConditionFunctor {
bool operator()(TermNode* term) {
if (/* you have to remove term */) {
delete term;
return true;
}
return false;
}
};
and then write:
termlist.erase(
remove_if(
termlist.begin(),
termlist.end(),
YourConditionFunctor()
),
termlist.end()
);

How to add age to ageList =[12,13,23]

This may be too simple a Groovy question, but please help.
I have a list like this:
def ageList =[12,13,23]
I want to get this:
def newAgeList =[age:12,age:13,age:23]
Could someone help me out?
Thank you so much!
Does this work for you?
def newAgeList = ageList.inject([:]) { map, item -> if (!map['age']) map['age'] = []; map['age'] << item; map }
This would result in: ['age':[12, 13, 23]]
Otherwise, you can get the literal value as something like:
def newAgeList = ageList.collect { "age:$it" }
This would result in: ['age:12', 'age:13', 'age:23']
A third option:
def newAgeList = ageList.collect { ['age':it] }
This would result in: [['age':12], ['age':13], ['age':23]]
Unfortunately, you can't do this as a map like you showed above as map keys must be unique.
Really it all depends on what you are trying to do with the result.
Don't know if this is possible since you want to use the same map key 'age' for three different values. You'll end up overwriting the existing value with a new value.
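The same constraint holds for plain java.util maps, which Groovy maps are built on. The sketch below is in Java rather than Groovy and uses made-up class and method names. It shows the overwrite behaviour and the two workable shapes from the answers above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AgeListSketch {
    // One single-entry map per age: [{age=12}, {age=13}, {age=23}]
    static List<Map<String, Integer>> asSingleEntryMaps(List<Integer> ages) {
        List<Map<String, Integer>> result = new ArrayList<>();
        for (Integer age : ages) {
            result.add(Collections.singletonMap("age", age));
        }
        return result;
    }

    // One map whose single key holds all ages: {age=[12, 13, 23]}
    static Map<String, List<Integer>> asGroupedMap(List<Integer> ages) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        result.put("age", new ArrayList<>(ages));
        return result;
    }

    // Putting the same key repeatedly keeps only the last value.
    static Map<String, Integer> naivePut(List<Integer> ages) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Integer age : ages) {
            result.put("age", age);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ageList = Arrays.asList(12, 13, 23);
        System.out.println(asSingleEntryMaps(ageList)); // prints [{age=12}, {age=13}, {age=23}]
        System.out.println(asGroupedMap(ageList));      // prints {age=[12, 13, 23]}
        System.out.println(naivePut(ageList));          // prints {age=23}
    }
}
```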