I want to get the last element of a line using a Pig script. I can't use a $ positional reference because the index of the last element is not fixed. I tried a regular expression and I also tried $-1, but neither worked. I am posting only a sample, as my actual file contains many more PID segments.
Sample:
MSH|^~\&|LAB|LAB|HEATH|HEA-HEAL|20247||OU^R01|M1738000000001|P|2.3|||ER|ER|
PID|1|YXQ120185751001|YXQ120185751001||ELJKDP##PDUB||19790615|F||| H LGGH VW^^ZHVW FKHVWHU^SD^19380|||||||4002C340778A|000009561|ELJKDP##PDUB19790615F
I want to get the last value of the PID line, i.e. ELJKDP##PDUB19790615F.
For that I tried the code below, but it is not working.
Code 1:
STOCK_A = LOAD '/user/rt/PARSED' USING PigStorage('|');
data = FILTER STOCK_A BY ($0 matches '.*PID.*');
MSH_DATA = FOREACH data GENERATE $2 AS id, $5 AS ame , $7 AS dob, $8 AS gender, $-1 AS rk;
Code 2:
STOCK_A = LOAD '/user/rt/PARSED' USING PigStorage('|');
data = FILTER STOCK_A BY ($0 matches '.*PID.*');
MSH_DATA = FOREACH data GENERATE $2 AS id, $5 AS ame , $7 AS dob, $8 AS gender, REGEX_EXTRACT(data,'\\s*(\\w+)$',1) AS rk;
Error for Code 2:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed
to parse: Invalid scalar projection: data : A
column needs to be projected from a relation for it to be used as a
scalar
Please help
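This should work, as long as the first argument is a chararray field that holds the whole line (called line here, for example loaded with TextLoader) and not the relation alias data, which is what triggers the scalar projection error:
REGEX_EXTRACT(line,'([^|]+$)',1) AS rk
[^|]+$ matches everything to the right of the last pipe character.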
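To see what the pattern does outside Pig, here is a minimal Python sketch. It assumes the whole line is available as one string; the function name last_field and the shortened sample line are illustrative and not part of the original script.

```python
import re

# Everything after the last pipe, anchored at the end of the line.
LAST_FIELD = re.compile(r"[^|]+$")


def last_field(line):
    """Return the text after the final '|', or None if the line ends with '|'."""
    match = LAST_FIELD.search(line)
    return match.group(0) if match else None


# Shortened version of the PID sample line (middle fields omitted).
pid = "PID|1|YXQ120185751001||ELJKDP##PDUB||19790615|F|000009561|ELJKDP##PDUB19790615F"
print(last_field(pid))
```

A line that ends with a pipe, like the MSH sample, has no match at all, so such lines need to be filtered out first or handled as a missing value.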
I would like to list devices and put their prices next to them.
My goal is to check different sites every week and notice trends.
This is a hobby project, I know there are sites that already do this.
For instance:
Device | URL Site 1 | Site 1 | URL Site 2 | Site 2
Device a | http://... | €40,00 | http://... | €45,00
Device b | http://... | €28,00 | http://... | €30,50
Manually, this is a lot of work (checking every week), so I thought a macro in Excel would help. The thing is, I would like to put the data in a single cell, and Excel only recognises tables. Solution: view the source code, read the price, export the price to a specific cell.
I think this is all possible within Excel, but I can't quite figure out how to read the price or other given data and how to put it in one specific cell. Can I specify coordinates in the source code, or is there a more effective way of approaching this?
First of all, you have to find out how the website works. For the page you asked about, I did the following:
Opened http://www.mediamarkt.de page in Chrome.
Typed BOSCH WTW 85230 in the search box, suggestion list appeared.
Pressed F12 to open developer tools and clicked Network tab.
Each time I typed a character, a new request appeared (see yellow areas):
Clicked the request to examine general info:
You can see that it uses the GET method and some parameters, including the URL-encoded product name.
Clicked the Response tab to examine the data returned by the server:
You can see it is regular JSON; the full content is as follows:
{"suggestions":[{"attributes":{"energyefficiencyclass":"A++","modelnumber":"2004975","availabilityindicator":"10","customerrating":"0.00000","ImageUrl":"http://pics.redblue.de/artikelid/DE/2004975/CHECK","collection":"shop","id":"MediaDEdece2358813","currentprice":"444.00","availabilitytext":"Lieferung in 11-12 Werktagen"},"hitCount":0,"image":"http://pics.redblue.de/artikelid/DE/2004975/CHECK","name":"BOSCH WTW 85230 Kondensationstrockner mit Warmepumpentechnologie (8 kg, A++)","priority":9775,"searchParams":"/Search.ff?query=BOSCH+WTW+85230+Kondensationstrockner+mit+W%C3%A4rmepumpentechnologie+%288+kg%2C+A+%2B+%2B+%29\u0026channel=mmdede","type":"productName"}]}
Here you can find the "currentprice":"444.00" property with the price.
Simplified the request by dropping some optional parameters; it turned out that the same JSON response can be retrieved from the URL http://www.mediamarkt.de/FACT-Finder/Suggest.ff?channel=mmdede&query=BOSCH+WTW+85230
That data was enough to build some code, assuming that the first column is intended for product names:
Option Explicit

Sub TestMediaMarkt()
    Dim oRange As Range
    Dim aResult() As String
    Dim i As Long
    Dim sURL As String
    Dim sRespText As String
    ' set source range with product names from column A
    Set oRange = ThisWorkbook.Worksheets(1).Range("A1:A3")
    ' create one column array the same size
    ReDim aResult(1 To oRange.Rows.Count, 1 To 1)
    ' loop rows one by one, make XHR for each product
    For i = 1 To oRange.Rows.Count
        ' build up URL
        sURL = "http://www.mediamarkt.de/FACT-Finder/Suggest.ff?channel=mmdede&query=" & EncodeUriComponent(oRange.Cells(i, 1).Value)
        ' retrieve the JSON response
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", sURL, False
            .Send
            sRespText = .responseText
        End With
        ' regular expression for price property
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            .Pattern = """currentprice""\:""([\d.]+)""" ' capture digits after 'currentprice' in submatch
            With .Execute(sRespText)
                If .Count = 0 Then ' no matches, something went wrong
                    aResult(i, 1) = "N/A"
                Else ' store the price to the array from the submatch
                    aResult(i, 1) = .Item(0).Submatches(0)
                End If
            End With
        End With
    Next
    ' output the resulting array to column B
    Output Sheets(1).Range("B1"), aResult
End Sub

Function EncodeUriComponent(strText)
    Static objHtmlfile As Object
    ' create the helper document once and reuse it on later calls
    If objHtmlfile Is Nothing Then
        Set objHtmlfile = CreateObject("htmlfile")
        objHtmlfile.parentWindow.execScript "function encode(s) {return encodeURIComponent(s)}", "jscript"
    End If
    EncodeUriComponent = objHtmlfile.parentWindow.encode(strText)
End Function

Sub Output(oDstRng As Range, aCells As Variant)
    With oDstRng
        .Parent.Select
        ' resize the destination range to the dimensions of the array
        With .Resize( _
            UBound(aCells, 1) - LBound(aCells, 1) + 1, _
            UBound(aCells, 2) - LBound(aCells, 2) + 1 _
        )
            .NumberFormat = "#"
            .Value = aCells
            .Columns.AutoFit
        End With
    End With
End Sub
Filled worksheet with some product names:
Launched the sub and got the result:
This is just an example of how to retrieve data from a website via XHR and parse the response with RegExp. I hope it helps.
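For comparison, the two core steps (building the URL-encoded request and reading the price out of the response) can be sketched in Python. This is only an illustration under assumptions: the function names are made up, the sample string is a trimmed copy of the JSON shown earlier, no network request is made, and the response is parsed as JSON here instead of with a regular expression.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://www.mediamarkt.de/FACT-Finder/Suggest.ff"


def build_suggest_url(product_name):
    """Build the simplified suggestion URL with a URL-encoded product name."""
    return BASE_URL + "?" + urlencode({"channel": "mmdede", "query": product_name})


def extract_price(response_text):
    """Return the first suggestion's current price as a string, or "N/A"."""
    try:
        suggestions = json.loads(response_text)["suggestions"]
        return suggestions[0]["attributes"]["currentprice"]
    except (ValueError, KeyError, IndexError, TypeError):
        # Covers invalid JSON, missing keys and an empty suggestion list.
        return "N/A"


# Trimmed copy of the response shown above (most attributes omitted).
sample = '{"suggestions":[{"attributes":{"modelnumber":"2004975","currentprice":"444.00"},"type":"productName"}]}'
print(build_suggest_url("BOSCH WTW 85230"))
print(extract_price(sample))
```

Parsing the JSON structure is a design choice for the sketch: it avoids depending on the exact spacing and quoting of the response text, which the regular expression approach relies on.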
I have a phone book file rubrica.txt, which contains one record per line (name, surname, phone number, telephone number, date, date in seconds), with fields separated by a space:
andrea mantovani 3476589456 0451234567 2016/05/16 1463419858190456946
marco verratti 1265897654 3057634987 2016/05/16 1463419948782978926
zlatan ibrahimovic 2937485929 1938472639 2016/05/16 1463420078149548084
cesc fabregas 5641287659 3456789123 2016/05/16 1463420324574207170
andrea mantovani 3402948586 0459687124 2016/05/17 1463500810082293135
marco rossi 3951326586 0458793540 2016/05/17 1463500836814967504
I want to print all contacts that were added after a date that I enter.
First I read the date I want and convert it to seconds with the following script:
echo "Digit the date"
read data_jap # read a date (yyyy/mm/dd)
data_sec=$(date +%s -d "$data_jap") # convert the date to seconds
This part of the code works; I include it only for clarity.
I don't know how to compare this date with the date in seconds (the last field) in rubrica.txt.
I used:
cat $RUBRICA | awk '/$data_sec < \6/ { print }'
to display all contacts whose date in seconds in field 6 (for example, 1463419858190456946 on the first line of my file) is greater than data_sec.
I know that $data_sec < \6 is incorrect and that I must fix it.
Let's assume your date in seconds is the value of the third record and assign that:
data_sec=1463420078149548084
Now we get this value into awk using -v, then compare the sixth field to it:
$ awk -v mydate="$data_sec" '$6 > mydate' rubrica.txt
cesc fabregas 5641287659 3456789123 2016/05/16 1463420324574207170
andrea mantovani 3402948586 0459687124 2016/05/17 1463500810082293135
marco rossi 3951326586 0458793540 2016/05/17 1463500836814967504
If the expression $6 > mydate evaluates to true, the record gets printed.
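The same filter can be sketched in Python to make the comparison explicit. The function name contacts_after is made up for illustration, and the records are copied from the sample file. Note that the sample values have 19 digits, which looks like nanoseconds since the epoch, so a value produced by date +%s would probably need scaling before it is compared; that is an observation about the sample, not something the question states.

```python
def contacts_after(lines, threshold):
    """Return the records whose sixth field is greater than threshold."""
    selected = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(fields) >= 6 and fields[5].isdigit() and int(fields[5]) > threshold:
            selected.append(line)
    return selected


records = [
    "andrea mantovani 3476589456 0451234567 2016/05/16 1463419858190456946",
    "marco verratti 1265897654 3057634987 2016/05/16 1463419948782978926",
    "zlatan ibrahimovic 2937485929 1938472639 2016/05/16 1463420078149548084",
    "cesc fabregas 5641287659 3456789123 2016/05/16 1463420324574207170",
    "andrea mantovani 3402948586 0459687124 2016/05/17 1463500810082293135",
    "marco rossi 3951326586 0458793540 2016/05/17 1463500836814967504",
]

for record in contacts_after(records, 1463420078149548084):
    print(record)
```

Python integers have arbitrary precision, so the 19-digit values are compared exactly.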
from googlefinance import getQuotes
import json
import time as t
import re
List = ["A","AA","AAB"]
Time=t.localtime() # Sets variable Time to retrieve date/time info
Date2= ('%d-%d-%d %dh:%dm:%dsec'%(Time[0],Time[1],Time[2],Time[3],Time[4],Time[5])) #formats time stamp
while True:
    for i in List:
        try: #allows elements to be called and if an error does the next step
            Data = json.dumps(getQuotes(i.lower()),indent=1) #retrieves Data from google finance
            regex = ('"LastTradePrice": "(.+?)",') #sets parse
            pattern = re.compile(regex) #compiles parse
            price = re.findall(pattern,Data) #retrieves parse
            print(i)
            print(price)
        except: #sets Error coding
            Error = (i + ' Failed to load on: ' + Date2)
            print (Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Try the type() function to find out the data type, in your case type(price).
If the data type is list, use print(price[0]).
You will then get the output (number); for the brackets, you need to check the Google data and the regex.
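A small self-contained sketch of why the brackets appear: re.findall always returns a list, even for a single match, so printing the list shows its repr. The sample text below is a stand-in built with json.dumps, not real Google Finance output, and the price in it is made up.

```python
import json
import re

# Stand-in for json.dumps(getQuotes(...), indent=1); the values are invented.
data = json.dumps([{"LastTradePrice": "42.14", "StockSymbol": "A"}], indent=1)

pattern = re.compile('"LastTradePrice": "(.+?)",')
price = re.findall(pattern, data)

print(type(price))  # a list, which is why brackets and quotes are printed
print(price)        # the whole list
print(price[0])     # only the first captured string
```

Indexing with price[0] raises IndexError when nothing matched, so checking that the list is non-empty first is a reasonable precaution.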
At the moment I am not working as efficiently as I could be. I am almost certain that there is a smarter and better way to solve the problem I have.
What I am trying to do:
I got a string like this:
'NL 4633 4809 KTU'
NL is a country code from an existing table and KTU is a university code from an existing table. I need to pass this string to my function and check whether it is valid.
In my function (to validate the string) this is what I am working on. I have managed to split up the string with this:
countryCode := checkISIN; -- checkISIN is the full string ('NL 4633 4809 KTU') and I am giving the NL value to this variable. countryCode is the type varchar2(50)
countryCode := regexp_substr(countryCode, '[^ ]+', 1, 1);
Now that I have the country code as shown below:
NL
Has valid country code
I want to check that the country code exists in its own table. I tried this:
if countryCode in ('NL', 'FR', 'DE', 'GB', 'BE', 'US', 'CA')
then dbms_output.put_line('Has valid country code');
else
    dbms_output.put_line('Has invalid country code. Change the country code to a valid one');
end if;
This works, but it is not dynamic. If someone adds a country code, I have to change the function again.
So is there a smart, dynamic way to check the country code against the existing table?
I hope my question is not too vague
Cheers
If you have a country codes table that looks like this:
ID | NAME
----------
1 | NL
2 | FR
3 | BE
then after parsing the string you can do it like this:
-- v_quan is a number variable declared in the function
select count(1)
  into v_quan
  from CountryCodes cc
 where nvl(countryCode,'') = cc.name;
if v_quan > 0 then
    dbms_output.put_line('Has valid country code');
else
    dbms_output.put_line('Has invalid country code. Change the country code to a valid one');
end if;
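The count-based lookup can be tried out without an Oracle instance. The sketch below uses Python's built-in sqlite3 module as a stand-in database, so it only illustrates the idea and is not PL/SQL; the function name is_valid_country_code is made up, and the table contents mirror the example table above.

```python
import sqlite3


def is_valid_country_code(conn, code):
    """Return True if code exists in the CountryCodes table."""
    row = conn.execute(
        "SELECT COUNT(1) FROM CountryCodes WHERE name = ?", (code,)
    ).fetchone()
    return row[0] > 0


# In-memory stand-in for the existing country codes table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CountryCodes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO CountryCodes (id, name) VALUES (?, ?)",
    [(1, "NL"), (2, "FR"), (3, "BE")],
)

isin = "NL 4633 4809 KTU"
country_code = isin.split()[0]  # first space-separated token of the string
print(is_valid_country_code(conn, country_code))
```

Because the check reads the table on every call, adding a row for a new country makes that code valid without any change to the function.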
I have sampledata.csv, which contains data as below:
2,4/1/2010,5.97
2,4/6/2010,12.71
2,4/7/2010,34.52
2,4/12/2010,7.89
2,4/14/2010,17.17
2,4/16/2010,9.25
2,4/19/2010,26.74
I want to filter the data in a Pig script so that only rows with a valid date are kept.
For example, if the date is like '4//2010' or '/9/2010', the row has to be filtered out.
Below is the Pig script I have written and the output I get when dumping the data.
script:
data = load 'sampledata.csv' using PigStorage(',') as (custid:int, date:chararray,amount:float);
cleadata = FILTER data by REGEX_EXTRACT(date, '(([1-9])|(1[0-2]))/(([0-2][1-9])|([3][0-1]))/([1-9]{4})', 1) != null;
Output:
2014-09-14 18:21:30,587 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1003: Unable to find an operator for alias cleandata
I am a beginner in Pig scripting. If you have come across this kind of error, please let me know how to resolve it.
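Here is a solution for your problem. Note that your script defines the alias cleadata while the error refers to cleandata, so the alias name used when dumping most likely does not match. I have also modified the regex; you can change it according to your needs.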
input.txt
2,04/1/0000,5.97
2,04/1/2010,5.97
2,44/6/2010,12.71
2,4/07/2010,34.52
2,4/\12/2010,7.89
2,4/14/2010/,17.17
2,/16/2010,9.25
2,4/19//2010,26.74
2,4//19/2010,26.74
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (custid:int,date:chararray,amount:float);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '(0?[1-9]|1[0-2])/([1-2][0-9]|[3][0-1]|0?[1-9])/([1-2][0-9]{3})')) AS (month,day,year);
C = FOREACH B GENERATE CONCAT(month,'/',day,'/',year) AS extractedDate;
D = FILTER C BY extractedDate is not null;
DUMP D;
Output:
(04/1/2010)
(4/07/2010)
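The regex can be checked outside Pig with a short Python sketch. It assumes that the whole date string has to match the pattern, which is why re.fullmatch is used; the function name valid_dates is made up, and the dates are the second column of input.txt.

```python
import re

# Same pattern as in the Pig script: month / day / four-digit year starting with 1 or 2.
DATE_PATTERN = re.compile(
    r"(0?[1-9]|1[0-2])/([1-2][0-9]|[3][0-1]|0?[1-9])/([1-2][0-9]{3})"
)


def valid_dates(dates):
    """Keep only the strings that match the date pattern from start to end."""
    return [d for d in dates if DATE_PATTERN.fullmatch(d)]


dates = [
    "04/1/0000", "04/1/2010", "44/6/2010", "4/07/2010", r"4/\12/2010",
    "4/14/2010/", "/16/2010", "4/19//2010", "4//19/2010",
]
print(valid_dates(dates))
```

The pattern only checks the shape of the date, so an impossible date such as 2/31/2010 would still pass.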