How to create a key value map when there are duplicate keys? - mapreduce

I am new to pig. I have the below output.
(001,Kumar,Jayasuriya,1123456754,Matara)
(001,Kumar,Sangakkara,112722892,Kandy)
(001,Rajiv,Reddy,9848022337,Hyderabad)
(002,siddarth,Battacharya,9848022338,Kolkata)
(003,Rajesh,Khanna,9848022339,Delhi)
(004,Preethi,Agarwal,9848022330,Pune)
(005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(006,Archana,Mishra,9848022335,Chennai)
(007,Kumar,Dharmasena,758922419,Colombo)
(008,Mahela,Jayawerdana,765557103,Colombo)
How can I create a map of the above so that the output will look something like,
001#{(Kumar,Jayasuriya,1123456754,Matara),(Kumar,Sangakkara,112722892,Kandy),(001,Rajiv,Reddy,9848022337,Hyderabad)}
002#{(siddarth,Battacharya,9848022338,Kolkata)}
I tried the ToMap function.
mapped_students = FOREACH students GENERATE TOMAP($0,$1..);
But I am unable to dump the output from the above command, as the process throws an error and stops there. Any help would be much appreciated.

I think what you are trying to achieve is to group the records that share the same id.
TOMAP converts key/value expression pairs into a map, so it cannot group the remaining records, and it will fail with an error such as "unable to open iterator for alias".
For your desired output, here is the code.
A = LOAD 'path_to_data/data.txt' USING PigStorage(',') AS (id:chararray,first:chararray,last:chararray,phone:chararray,city:chararray);
If you do not want to give schema then:
A = LOAD 'path_to_data/data.txt' USING PigStorage(',');
B = GROUP A BY $0; -- groups all your records by the first column
DESCRIBE B; -- shows the schema of B
DUMP B;
Hope this helps..
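For readers who want to see what the grouping does independently of Pig, here is a minimal plain-Python sketch of the same idea. It is only an illustration of the semantics, not Pig code: the function name group_by_id is hypothetical, the rows are copied from the question, and unlike Pig's GROUP (which keeps the full tuple in each bag) it drops the id from the grouped records to match the output the question asks for.

```python
from collections import defaultdict

# Sample rows copied from the question: (id, first, last, phone, city).
students = [
    ("001", "Kumar", "Jayasuriya", "1123456754", "Matara"),
    ("001", "Kumar", "Sangakkara", "112722892", "Kandy"),
    ("001", "Rajiv", "Reddy", "9848022337", "Hyderabad"),
    ("002", "siddarth", "Battacharya", "9848022338", "Kolkata"),
]

def group_by_id(rows):
    """Collect every row that shares an id under that id, keeping duplicates."""
    grouped = defaultdict(list)
    for row in rows:
        key, rest = row[0], row[1:]
        grouped[key].append(rest)  # a list per key, so duplicate keys are not lost
    return dict(grouped)

grouped = group_by_id(students)
for key, records in grouped.items():
    print(key, records)
```

The point is that each key maps to a collection of records rather than to a single value, which is why a plain key/value map cannot hold the duplicate ids.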

Converting JSON into Table (PowerQuery)

What would be a correct PowerQuery syntax to extract the information from this Web JSON into a table:
I'm not very familiar with PowerQuery, and this is probably the only time I'll need this, so I'd be grateful if someone would help me out without referring me to documentation. Thanks
[{"time_entry_group": {"minutes": 301,"time_entries_params": {"locked": "0","from": "2021-02-01","to": "2021-02-28","customer_id": "11223344","project_id": "223388","service_id": "435248"},"revenue": 57691.6666666667,"project_id": 223388,"project_name": "Scrb","service_id": 435248,"service_name": "Meetings","month": "202102"}}
, {"time_entry_group": {"minutes": 1175,"time_entries_params": {"locked": "1","from": "2021-01-01","to": "2021-01-31","customer_id": "11223344","project_id": "223388","service_id": "421393"},"revenue": 225208.333333333,"project_id": 223388,"project_name": "Scrb","service_id": 421393,"service_name": "Design","month": "202101"}}
, {"time_entry_group": {"minutes": 24,"time_entries_params": {"locked": "1","from": "2021-01-01","to": "2021-01-31","customer_id": "11223344","project_id": "3168911","service_id": "95033"},"revenue": 4600.0,"project_id": 3168911,"project_name": "youkn Dev","service_id": 95033,"service_name": "Reviews","month": "202101"}}]
For future reference, if you have a column that you need to expand, you can click the arrow icon to the right of the column name. This should display a menu that lets you choose which nested columns to expand. To be clear, it will expand that column for all rows in the table, not just one.
The JSON you've included is basically an array of objects, so maybe use:
Json.Document to parse the JSON, which should give you a list of records
Table.FromRecords to turn the list of records into a table.
Table.ExpandRecordColumn to expand nested record columns.
Example implementation:
let
json = "[{""time_entry_group"":{""minutes"":301,""time_entries_params"":{""locked"":""0"",""from"":""2021-02-01"",""to"":""2021-02-28"",""customer_id"":""11223344"",""project_id"":""223388"",""service_id"":""435248""},""revenue"":57691.6666666667,""project_id"":223388,""project_name"":""Scrb"",""service_id"":435248,""service_name"":""Meetings"",""month"":""202102""}},{""time_entry_group"":{""minutes"":1175,""time_entries_params"":{""locked"":""1"",""from"":""2021-01-01"",""to"":""2021-01-31"",""customer_id"":""11223344"",""project_id"":""223388"",""service_id"":""421393""},""revenue"":225208.333333333,""project_id"":223388,""project_name"":""Scrb"",""service_id"":421393,""service_name"":""Design"",""month"":""202101""}},{""time_entry_group"":{""minutes"":24,""time_entries_params"":{""locked"":""1"",""from"":""2021-01-01"",""to"":""2021-01-31"",""customer_id"":""11223344"",""project_id"":""3168911"",""service_id"":""95033""},""revenue"":4600,""project_id"":3168911,""project_name"":""youkn Dev"",""service_id"":95033,""service_name"":""Reviews"",""month"":""202101""}}]",
parsed = Json.Document(json),
initialTable = Table.FromRecords(List.Transform(parsed, each [time_entry_group])),
expanded = Table.ExpandRecordColumn(initialTable, "time_entries_params", {"locked", "from", "to", "customer_id"})
in
expanded
One thing about the code above is that it doesn't expand nested fields project_id and service_id (present within time_entries_params). This is because these columns already exist in the table (and having duplicate column names would cause an error). I've assumed this isn't a problem, as the nested values aren't different.
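To make the flattening logic concrete outside of PowerQuery, here is a rough Python sketch of the same three steps (parse, take the inner record, expand the nested record). It is an illustration only: the function name flatten is hypothetical, the sample is a shortened copy of the JSON in the question, and nested fields whose names already exist at the top level are skipped, mirroring the choice described above.

```python
import json

# Shortened sample with the same shape as the JSON in the question.
raw = '''[{"time_entry_group": {"minutes": 301,
  "time_entries_params": {"locked": "0", "from": "2021-02-01", "to": "2021-02-28",
                          "customer_id": "11223344", "project_id": "223388",
                          "service_id": "435248"},
  "revenue": 57691.6666666667, "project_id": 223388, "project_name": "Scrb",
  "service_id": 435248, "service_name": "Meetings", "month": "202102"}}]'''

def flatten(entry, nested_key="time_entries_params"):
    """Return one flat dict per entry; nested fields never overwrite top-level ones."""
    group = dict(entry["time_entry_group"])
    nested = group.pop(nested_key)
    for name, value in nested.items():
        # Skip names that already exist at the top level (project_id, service_id).
        group.setdefault(name, value)
    return group

rows = [flatten(entry) for entry in json.loads(raw)]
print(rows[0])
```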

Ignite SqlFieldsQuery specific keys

Using the ignite C++ API, I'm trying to find a way to perform an SqlFieldsQuery to select a specific field, but would like to do this for a set of keys.
One way to do this, is to do the SqlFieldsQuery like this,
SqlFieldsQuery("select field from Table where _key in (" + keys_string + ")")
where the keys_string is the list of the keys as a comma separated string.
Unfortunately, this takes a very long time compared to just doing cache.GetAll(keys) for the set of keys, keys.
Is there an alternative, faster way of getting a specific field for a set of keys from an ignite cache?
EDIT:
After reading the answers, I tried changing the query to:
auto query = SqlFieldsQuery("select field from Table t join table(_key bigint = ?) i on t._key = i._key")
I then add the arguments from my set of keys like this:
for(const auto& key: keys) query.AddArgument(key);
but when running the query, I get the error:
Failed to bind parameter [idx=2, obj=159957, stmt=prep0: select field from Table t join table(_key bigint = ?) i on t._key = i._key {1: 159956}]
Clearly, this doesn't work because there is only one '?'.
So I then tried to pass a vector<int64_t> of the keys, but I got an error which basically says that std::vector<int64_t> did not specialize the ignite BinaryType. So I did this as defined here. When calling e.g.
writer.WriteInt64Array("data", data.data(), data.size())
I gave the field an arbitrary name "data". This then results in the error:
Failed to run map query remotely.
Unfortunately, the C++ API is neither well documented nor complete, so I'm wondering whether I'm missing something or whether the API does not allow passing an array as an argument to the SqlFieldsQuery.
A query that uses an IN clause doesn't always use indexes properly. The workaround for this is described here: https://apacheignite.readme.io/docs/sql-performance-and-debugging#sql-performance-and-usability-considerations
Also, if you have the option to use GetAll instead and look up by key directly, then you should use it. It will likely be more efficient anyway.
A query with the "IN" operator will not always use indexes. As a workaround, you can rewrite the query in the following way:
select field from Table t join table(id bigint = ?) i on t.id = i.id
and then invoke it like:
new SqlFieldsQuery(
"select field from Table t join table(id bigint = ?) i on t.id = i.id")
.setArgs(new Object[]{ new Integer[] {2, 3, 4} })

How to iterate on each line of a rdd which contains textFile

I'm trying to do something like this
file = sc.textFile('mytextfile')
def myfunction(mystring):
    new_value = mystring
    for i in file.toLocalIterator():
        if i in mystring:
            new_value = i
    return new_value
rdd_row = some_data_frame.map(lambda u: Row(myfunction(u.column_name)))
But I get this error
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers
The problem (as the error message clearly states) is that you are trying to work with an RDD inside the map. file is an RDD. You can apply transformations and actions to it on the driver (e.g. you are trying to get a local iterator over it), but here you are using it inside another transformation, the map, which runs on the workers.
UPDATE
If I understand correctly you have a dataframe df with a column URL. You also have a text file which contains blacklist values.
Let's assume for the sake of argument that your blacklist file is a csv with a column blacklistNames and that the dataframe df's URL column is already parsed, i.e. you just want to check whether URL is in the blacklistNames column.
What you can do is something like this:
df.join(blackListDF, df["URL"]==blackListDF["blacklistNames"], "left_outer")
This join basically adds a blacklistNames column to your original dataframe, which contains the matched name if the URL is in the blacklist and null otherwise. Now all you need to do is filter on whether the new column is null.
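If the blacklist is small enough to fit in memory, another option is to turn it into a plain Python list on the driver (for example with file.collect()) before the map, so that the function refers to a list rather than to an RDD. The sketch below shows only the matching function, in plain Python so it runs without Spark. The name replace_if_blacklisted and the sample values are hypothetical; the logic mirrors myfunction from the question, including the fact that the last matching entry wins.

```python
def replace_if_blacklisted(value, blacklist):
    """Return the last blacklist entry contained in value, or value itself if none match."""
    new_value = value
    for entry in blacklist:
        if entry in value:
            new_value = entry
    return new_value

# Hypothetical sample data; with Spark this list would come from file.collect() on the driver.
blacklist = ["spam.example", "ads.example"]

print(replace_if_blacklisted("http://ads.example/page", blacklist))
print(replace_if_blacklisted("http://ok.example/page", blacklist))
```

Because the function only touches an ordinary list, it can be called inside a map without referencing the SparkContext. For larger lists, the join described above is likely the better fit.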

How to search multiple strings in a string?

In a new PowerQuery column, I want to check whether a string like "This is a test string" contains any of the items in a list of strings such as {"dog","string","bark"}.
I already tried Text.PositionOfAny("This is a test string",{"dog","string","bark"}), but the function only accepts single-character values:
Expression.Error: The value isn't a single-character string.
Any solution for this?
This is a case where you'll want to combine a few M library functions together.
You'll want to use Text.Contains many times against a list, which is a good case for List.Transform. List.AnyTrue will tell you if any string matched.
List.AnyTrue(List.Transform({"dog","string","bark"}, (substring) => Text.Contains("This is a test string", substring)))
If you wished that there was a Text.ContainsAny function, you can write it!
let
Text.ContainsAny = (string as text, list as list) as logical =>
List.AnyTrue(List.Transform(list, (substring) => Text.Contains(string, substring))),
Invoked = Text.ContainsAny("This is a test string", {"dog","string","bark"})
in
Invoked
Another simple solution is this:
List.ContainsAny(Text.SplitAny("This is a test string", " "), {"dog","string","bark"})
It splits the text into a list of words, because the List functions include one that does what you need.
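The two approaches above are not quite equivalent: Text.Contains looks for substrings, while splitting the text and comparing list items matches whole words only. The following Python sketch illustrates the difference. It is not M code, and the helper names contains_any and contains_any_word are hypothetical.

```python
text = "This is a test string"
needles = ["dog", "string", "bark"]

def contains_any(text, substrings):
    """Substring match, like List.AnyTrue over Text.Contains."""
    return any(s in text for s in substrings)

def contains_any_word(text, words):
    """Whole-word match, like splitting on spaces and using List.ContainsAny."""
    return not set(text.split(" ")).isdisjoint(words)

print(contains_any(text, needles))       # True: "string" occurs in the text
print(contains_any_word(text, needles))  # True: "string" is also a whole word
print(contains_any(text, ["str"]))       # True: "str" is a substring of "string"
print(contains_any_word(text, ["str"]))  # False: no word equals "str"
```

Which one you want depends on whether partial matches such as "str" should count.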
If it's a specific (static) list of matches, you'll want to add a custom column with an if then else statement in PQ. Then use a filter on that column to keep or remove the columns. AFAIK PQ doesn't support regex so Alexey's solution won't work.
If you need the lookup to be dynamic, it gets more complicated, but it is doable. You essentially need to:
have an ID column for the original row.
duplicate the query so you have two queries, then in the newly created query
split the text field into separate columns, usually by space
unpivot the newly created columns.
get the list of intended names
use the List.Generate method to generate a list that shows 1 if there's a match and 0 if there isn't.
sum the values of the list
if sum > 0 then mark that row as a match, usually I use the value 1 in a new column. Then you can filter the table to keep only rows with value 1 in the new column. Then group this table on ID - this is the list of ID that contain the match. Now use the merge feature to merge in the first table ensuring you keep only rows that match the IDs. That should get you to where you want to be.
Thanks for giving me the lead. In my own case I needed to ensure two items exist in a string hence I replaced formula as:
List.AllTrue(List.Transform({"/","2017"},(substring) => Text.Contains("4/6/2017 13",substring)))
it returned true perfectly.
You can use regex here with logical OR - | expression :
/dog|string|bark/.test("This is a test string") // returns true

How can I SELECT records using a select list made of foreign keys?

I have a table, DEBTOR, with a structure like this:
and a second table, DEBTOR.INFO structured like this:
I have a select list made of record IDs from the DEBTOR.INFO table. How can I
select * from DEBTOR WHERE 53 IN (name of select list)?
Is this even possible?
I realize this query looks more like SQL than RetrieVe but I wrote it that way for an easier understanding of what I'm trying to accomplish.
Currently, I accomplish this query by writing
SELECT DEBTOR WITH 53 EQ [paste list of DEBTOR.INFO record IDs]
but obviously this is unwieldy for large lists.
It looks to me like you can't do that. Even if you use an I-descriptor, it only works in one direction. TRANS("DEBTOR.INFO",53,0,"X") works from the DEBTOR file but not the other way, so TRANS("DEBTOR",@ID,53,"X") from DEBTOR.INFO will return nothing.
See this article on U2's site for a possible solution.
Would something like this work (two steps):
SELECT DEBTOR.INFO SAVING PACKET
LIST DEBTOR ....
This creates a select list of the data in the PACKET field in the DEBTOR.INFO file and makes it active. (If that produces duplicate values, you can add the keyword UNIQUE after SAVING.)
The subsequent LIST command then uses that active select list, which contains values found in the @ID field of the file DEBTOR.
Not sure if you are still looking at this, but there is a simple option that will not require a lot of programming.
I did it with a program, a subroutine and a dictionary item.
First I set a named common variable to contain the list of DEBTOR.INFO ids:
SETLIST
*
* Use named common to hold list of keys
COMMON /MYKEYS/ KEYLIST
*
* Note for this example I am reading the list from SAVEDLISTS
OPEN "SAVEDLISTS" TO FILE ELSE STOP "CAN NOT OPEN SAVEDLISTS"
READ KEYLIST FROM FILE, "MIKE000" ELSE STOP "NO MIKE000 ITEM"
Now, I can create a subroutine that checks for a value in that list
CHECKLIST
SUBROUTINE CHECKLIST( RVAL, IVAL)
COMMON /MYKEYS/ KEYLIST
LOCATE IVAL IN KEYLIST <1> SETTING POS THEN
RVAL = 1
END ELSE RVAL = 0
RETURN
Lastly, I use a dictionary item to call the subroutine with the field I am looking for:
INLIST:
I
SUBR("CHECKLIST", FK)
IN LIST
10R
S
Now all I have to do is put the correct criteria on my list statement:
LIST DEBTOR WITH INLIST = 1 ACCOUNT STATUS FK
I'd use the very powerful EVAL with an XLATE:
SELECT DEBTOR WITH EVAL \XLATE('DEBTOR.INFO',@RECORD<53>,'-1','X')\ NE ""