How to parse through a column in Pig to create additional columns


New Apache Pig user here. I have data in one format and need to split it into 6 columns to match my desired schema, then load it into Pig so my existing script can run.
Sorry if the format below is untidy; I can't upload a picture because of my reputation score.
Existing format has 3 columns
User-Equipment values::key:bytearray values:value:bytearray
user1-mobile 20130306-AC 9
user1-mobile 20130306-AT 21
user2-laptop 20130306-BC 0
Required format:
User Equipment Date Type "Count or Time" Value
user1 mobile 20130306 A C 9
user1 mobile 20130306 A T 21
Any suggestions on how to get this done? Is there a regex I need to write?
The tricky part is that all the columns have a delimiter (-) between them except "Type" and "Count or Time" (the C or T column).

If you don't have a common delimiter, I can think of two possibilities:
1. You could implement your own LoadFunc as explained here: http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
2. You could use REGEX_EXTRACT_ALL as explained here: Apache Pig: Extra query parameters from web log
Here is the second option:
A = LOAD 'abc.txt' AS (line:CHARARRAY);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^(.+?)\\-(.+?)\\s(.+?)\\-(.)(.)\\s(.+)$')) AS (User:CHARARRAY,Equipment:CHARARRAY,Date:CHARARRAY,Type:CHARARRAY,CountorTime:CHARARRAY,Value:CHARARRAY);
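To check what the pattern captures before running it in Pig, the same regular expression can be tried in Python. This is only a sketch: the function name `split_line` is made up for illustration, and Python's `re` module is assumed to behave like Pig's Java regex engine for this simple pattern.

```python
import re

# Same pattern as the Pig answer, written as a Python raw string.
# A hyphen outside a character class needs no escaping here.
PATTERN = re.compile(r'^(.+?)-(.+?)\s(.+?)-(.)(.)\s(.+)$')

def split_line(line):
    """Return the six captured fields, or None when the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

# Sample rows taken from the question.
fields = split_line("user1-mobile 20130306-AC 9")
user, equipment, date, type_, count_or_time, value = fields
```

The two single-character groups `(.)(.)` are what split "AC" into Type and "Count or Time" without a delimiter.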

Related

How to analyze data with a column containing multiple values

I'm attempting to analyze idea submissions in Power BI where there is a column with multiple values separated by commas. Here is an example of the table layout (each row being a submission):
Key  ...  Tags
1    ...  Chat, Service, Dallas
2    ...  Banking, IVR, Miami, Zelle
3    ...  New York, Collections
...  ...  ...
The tags column has the data I'm trying to analyze, and the tags appear in whatever order the submitter entered them, so they don't necessarily follow a set structure. Some submissions have as few as 2 tags and some as many as 15. I'm trying to figure out how to structure the data so that Power BI can analyze each tag (if that makes sense; I'm sorry, I'm having a difficult time explaining).
For instance, I want to be able to see the number of submissions by department (like chat or collections). I know I can split the tags column and have a separate column created for each tag but the problem I run into is that the new columns created have different values in each row depending on the order. For example, the new table after splitting the tags column would look like this:
Key  ...  Tag1      Tag2         Tag3    Tag4
1    ...  Chat      Service      Dallas
2    ...  Banking   IVR          Miami   Zelle
3    ...  New York  Collections
...  ...  ...       ...          ...     ...
As you can see, the Tag1 column has mixed values: rows 1 and 2 contain a department (Chat and Banking) but row 3 contains a location (New York). I suppose my question is whether anyone has recommendations on how to better analyze the tags so I can answer questions like:
What departments are sending in more submissions than others?
Which site locations are sending in more submissions?
I appreciate any help and advice. I hope this makes sense!

I'd suggest expanding to new rows rather than new columns.
Then each row of your data holds a single key and a single tag.
If you have a table of tags where each one is categorized as "Department" or "Location" or whatever, you can then merge that table onto the one above to have a nice Category column to help filter in your reporting.
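The reshaping described above can be sketched outside Power BI in plain Python. The submissions come from the question; the `tag_categories` lookup is a hypothetical category table invented for illustration, and the function name `count_by_tag` is likewise an assumption.

```python
from collections import Counter

# Sample submissions from the question.
submissions = [
    {"Key": 1, "Tags": "Chat, Service, Dallas"},
    {"Key": 2, "Tags": "Banking, IVR, Miami, Zelle"},
    {"Key": 3, "Tags": "New York, Collections"},
]

# Hypothetical lookup table assigning each tag to a category.
tag_categories = {
    "Chat": "Department", "Banking": "Department", "Collections": "Department",
    "Dallas": "Location", "Miami": "Location", "New York": "Location",
}

# Expand to one row per (submission, tag) and attach the category.
long_rows = [
    {"Key": s["Key"], "Tag": tag.strip(),
     "Category": tag_categories.get(tag.strip(), "Uncategorized")}
    for s in submissions
    for tag in s["Tags"].split(",")
]

def count_by_tag(rows, category):
    """Count submissions per tag within one category."""
    return Counter(r["Tag"] for r in rows if r["Category"] == category)

department_counts = count_by_tag(long_rows, "Department")
location_counts = count_by_tag(long_rows, "Location")
```

Because every tag now sits in the same column, the order in which the submitter typed the tags no longer matters.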

Extracting String Portions in SQL using Regular expressions

Hi All,
I have a query related to Regular expressions in SQL.
I have a case where a portion of string has to be extracted from a column. The portion of that column will be prefixed with my column A. Please see the screenshot for the sample data. I have also added the output expected in a separate column (highlighted in green).
Scenarios:
If a column value contains more than one distinct number, the result should be Null.
Eg: To verify CAN06010025, CAN06010026 & CAN06010030 after the approval.
This string contains more than one number (CAN06010025, CAN06010026 and CAN06010030),
so this case should be ignored (meaning it has to give me a Null value).
If there is only one number, even if it is repeated, I have to extract that portion of the string.
Eg: Project USA12: Id USA12S001: Contact required -USA12S001- form to be updated
In this example, the portion I want to extract is repeated, and I want to extract only that portion (USA12S001).
The same applies to the other cases as well.
I tried the SQL below. The challenge is that my Col A value can also appear on its own in Col B (line 2 in the screenshot); REGEXP_COUNT counts that occurrence too, so the result comes out as Null. I expect to extract the USA12S001 portion from the column.
Could you please help me achieve this so that both conditions above are satisfied?
SQL:
SELECT
    ColA,
    ColB,
    case when REGEXP_COUNT(ColB,ColA) >2 THEN NULL
    ELSE REPLACE(REPLACE(concat(regexp_substr(ColB,ColA||'([[:alnum:]]+\.?)'),
                                nvl(regexp_substr(ColB,ColA||'(\-[[:digit:]]+)'),
                                    regexp_substr(ColB,ColA||'([[:space:]]\-[[:space:]][[:digit:]]+)'))),
                         ' ',''),'.','')
    END AS Result
FROM
    table
Test Data:
Col A  Col B
CAN06  to verify CAN06010025, CAN06010026 & CAN06010030 after the approval
USA12  Project USA12: Id USA12S001: Contact required -USA12S001- form to be updated
USA27  Project USA27: Id: USA27S001: Prod
HUN04  To review id HUN04S002-HUN04S004 after the due date.
CAN05  ID: CAN05S005 with the details as CAN05S005 are completed.
USA24  Project USA24: Id: USA24S009: Data Issue
CAN06  "Project: Subject CAN06S009: V2 & V3- Id CAN06S010: V1"

If the REGEXP_COUNT is the only issue, then the answer is simple: change
case when REGEXP_COUNT(ColB,ColA) >2
to:
case when REGEXP_COUNT(ColB,ColA || '[[:alnum:]]') >2
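The effect of appending `[[:alnum:]]` can be seen by counting matches in Python. This is a rough sketch, not the database's own behaviour: `[A-Za-z0-9]` stands in for the POSIX class, and both function names are invented for illustration.

```python
import re

def count_bare_prefix(col_a, col_b):
    """Count every occurrence of the prefix, like REGEXP_COUNT(ColB, ColA)."""
    return len(re.findall(re.escape(col_a), col_b))

def count_prefixed_ids(col_a, col_b):
    """Count only occurrences followed by a letter or digit."""
    # [A-Za-z0-9] is used here in place of the POSIX class [[:alnum:]].
    return len(re.findall(re.escape(col_a) + r"[A-Za-z0-9]", col_b))

# Second row of the test data: the bare project code "USA12:" inflates the count.
row2 = "Project USA12: Id USA12S001: Contact required -USA12S001- form to be updated"
bare = count_bare_prefix("USA12", row2)
with_suffix = count_prefixed_ids("USA12", row2)
```

With the suffix required, "USA12:" is no longer counted, so the second row stays at or below the threshold of 2 and is not forced to Null.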

SSRS grouping all but one field

I have a query in which all the fields are the same except for one. I want to group my SSRS report so that it shows "blanks" for all the duplicated fields and only shows the "different" one WHEN there is a "duplicate" record.
For instance:
Case Number  PersonID  Narrative
123          1         xxx
345          3
456          9         ABCD
                       KFL
So record 1 has a narrative and only one record. Record 2 has no narrative. Record 3 & 4 are the same case, same person, two different narratives.
I thought grouping by all the other fields would achieve this, but it is not working: I still get the 456 and 9 on my 4th record even though I have grouped by the other fields.
How can I get just the narrative to display when all the other fields in that record match the previous record?
Thanks,
Leslie

You can see my answer for:
How to get only one value in SSRS?
You have a similar situation. You need to use an expression for each of the first 2 columns:
=IIF(Fields!CaseNumber.Value = Previous(Fields!CaseNumber.Value), "", Fields!CaseNumber.Value)
=IIF(Fields!PersonID.Value = Previous(Fields!PersonID.Value), "", Fields!PersonID.Value)
This will hide all repeating "Case Number" and "PersonID" values.
Don't forget to replace "CaseNumber" and "PersonID" with the actual column names in your DataSet.
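The logic behind the two expressions can be illustrated in Python: each value is compared with the value in the previous row and blanked when it repeats. This is only a sketch of the idea, not SSRS code, and `blank_repeats` is a made-up helper name.

```python
def blank_repeats(rows, keys):
    """Blank each listed field when it equals the same field in the previous row."""
    shown_rows = []
    previous = None
    for row in rows:
        shown = dict(row)
        for key in keys:
            # Mirrors IIF(current = Previous(current), "", current).
            if previous is not None and row[key] == previous[key]:
                shown[key] = ""
        shown_rows.append(shown)
        previous = row
    return shown_rows

# Rows as the query returns them, based on the example in the question.
rows = [
    {"CaseNumber": 123, "PersonID": 1, "Narrative": "xxx"},
    {"CaseNumber": 345, "PersonID": 3, "Narrative": ""},
    {"CaseNumber": 456, "PersonID": 9, "Narrative": "ABCD"},
    {"CaseNumber": 456, "PersonID": 9, "Narrative": "KFL"},
]
report = blank_repeats(rows, ["CaseNumber", "PersonID"])
```

In this sketch each column is compared on its own, so a PersonID that repeats across two different cases would also be blanked; the rows need to be sorted so that duplicates are adjacent.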

Fuzzy match on google sheets

I'm trying to fuzzy match two columns in Google Sheets. I've tried numerous formulas, but I think it's going to come down to a script.
I have a column with product IDs, e.g.
E20067
and another sheet with a column of image URLs relating to this product code, such as
http://wholesale.test.com/product/E20067/web_images/E20067.jpg
http://wholesale.test.com/product/E20067/high_res/E20067.jpg
http://wholesale.test.com/product/E20067/high_res/E20067-2.jpg
What I want to do is "fuzzy" match these two columns on product ID and then create a new column for each match, so that each row has the product ID followed by each of its product image URLs in separate columns.
Is there a way to do this in google sheets using a script or a formula?

In Google sheets there are a few powerful 'regex' formulas.
Suppose you have the ID list in column A and the URL list in column B.
Then use this formula:
=REGEXEXTRACT(B1,JOIN("|",$A$1:$A$3))
It will match one of the IDs. Drag the formula down to fill in the result for each URL.
See more info here
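The formula joins the IDs with "|" to build one alternation pattern and returns whichever ID it finds in the URL. The same idea can be sketched in Python. The IDs E20066 and E20068 are invented to fill out the list, and `extract_id` is a hypothetical helper name.

```python
import re

def extract_id(url, product_ids):
    """Return the first listed product ID found in the URL, or None."""
    # Equivalent of JOIN("|", $A$1:$A$3): one alternative per ID.
    pattern = "|".join(re.escape(pid) for pid in product_ids)
    match = re.search(pattern, url)
    return match.group(0) if match else None

product_ids = ["E20066", "E20067", "E20068"]  # E20066 and E20068 are made up
url = "http://wholesale.test.com/product/E20067/web_images/E20067.jpg"
found = extract_id(url, product_ids)
```

Note that this is exact substring matching on the ID rather than fuzzy matching, which is enough when the ID appears verbatim in the URL.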

Old thread but, in case you find yourself here, search for my Google Sheets add-on called Flookup. It should do exactly what you want.
For this case, you can use this function:
Flookup (lookupValue, tableArray, lookupCol, indexNum, threshold, [rank], [range])
The parameter details are:
lookupValue: the value you're looking up
tableArray: the table you want to search
lookupCol: the column you want to search
indexNum: the column you want data to be returned from
threshold: the percentage similarity below which data shouldn't be returned
rank: the nth best match (i.e. if the first one isn't to your liking)
range: choose to return the percentage similarity or row number for each match
You can find out more at the official website (examples and such).
Please note that, whereas the OP appears to want the whole list of possible matches, Flookup will only return one result at a time.
Flookup can now return a list of all possible matches through its LRM mode.

Try the following. I am assuming the product codes are in Sheet1 and the URLs are in Sheet2. Both in column A:
=iferror(transpose(FILTER(Sheet2!$A$2:$A,Search("*"& A2 &"*",Sheet2!$A$2:$A))))
Copy down.
If you want to show the image instead of the url try:
=arrayformula(image(iferror(transpose(FILTER(Sheet2!$A$2:$A,Search("*"& A2 &"*",Sheet2!$A$2:$A))))))
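What FILTER plus TRANSPOSE does here is collect, for one product code, every URL that contains it and lay the matches out along a row. A minimal Python sketch of that step, assuming a case-insensitive substring test and using the made-up name `urls_for`:

```python
def urls_for(product_id, urls):
    """Return every URL containing the product ID, ignoring case."""
    needle = product_id.lower()
    return [u for u in urls if needle in u.lower()]

# URLs from the question, as they would appear in Sheet2 column A.
urls = [
    "http://wholesale.test.com/product/E20067/web_images/E20067.jpg",
    "http://wholesale.test.com/product/E20067/high_res/E20067.jpg",
    "http://wholesale.test.com/product/E20067/high_res/E20067-2.jpg",
]

# One output row: the product ID followed by its matching URLs.
row = ["E20067"] + urls_for("E20067", urls)
```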

Rails 4 + MongoDB + Search query LIKE does not give correct output

In Rails, I am trying to fetch data from MongoDB with a LIKE-style query by providing a regular expression, but I am not getting the correct output.
Model : User
_id, name, display_name, age, address, nick_name
a1, Johny, Johny K, 12, New York, John
b1, James, James Waltor, 15, New York, James
c1, Joshua, Joshua T, 13, California, Josh
Now I have 3 records.
Query 1 : Search User having 'Jo' as keyword in initial name
User.where(name: /^jo/i)
Output - Only One record - instead of two.
Query 2 :- Match the text with all column values
User.where($where: /^jo/i)
Not getting the proper output.

OK, on Query 1, can you output the documents? I believe one of your records has a character, such as white space, at the start of 'name'. I just ran the same query locally and it returned multiple records.
Try this:
User.where(name: /(.*)jo(.*)/i).count
and see what that returns. It should match 2. If that works, then you'll need to look at what is wrong with the stored value.
On Query 2, where have you seen this syntax? $where expects a string containing a JavaScript function to execute to match records. In your case, to match any field within the document against an expression, you would need a function that recurses across each field in each document.

For Query 2, to match against all fields:
One solution, although inefficient, is to do it within the Rails app instead of in the MongoDB query.
e.g.
User.all.select { |user| user.attributes.values.grep(/^jo/i).any? }
