Pig - Remove line feed, return and tab - regex
I'm trying to remove the characters: \n, \t and \r from a column in Pig but I'm getting the wrong output.
Here is what I'm doing:
qr_1 = LOAD 'hdfs://localhost:9000/sample.csv' USING PigStorage(',') as (Id:int,PostTypeId:int,AcceptedAnswerId:int,ParentId:int,CreationDate:chararray,DeletionDate:chararray,Score:int,ViewCount:int,Body:chararray,OwnerUserId:int,OwnerDisplayName:chararray,LastEditorUserId:int,LastEditorDisplayName:chararray,LastEditDate:chararray,LastActivityDate:chararray,Title:chararray,Tags:chararray,AnswerCount:int,CommentCount:int,FavoriteCount:int,ClosedDate:chararray,CommunityOwnedDate:chararray);
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\n','') as Body, OwnerUserId .. ;
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\r','') as Body, OwnerUserId .. ;
qr_1 = FOREACH qr_1 GENERATE Id .. ViewCount, REPLACE(Body,'\t','') as Body, OwnerUserId .. ;
Input:
5585779,1,5585800,,2011-04-07 18:27:54,,1432,3090250,"<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p>
<p>My String contains only numbers and I want to return the number it represents.</p>
<p>For example, given the string <code>""""1234""""</code> the result should be the number <code>1234</code>.</p>",537967,,2756409,user166390,2015-09-10 21:30:42,2016-03-07 00:42:49,Converting String to Int in Java?,<java><string><type-conversion>,12,0,239
Output:
(5585779,1,5585800,,2011-04-07 18:27:54,,1432,3090250,"<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p>,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,)
(,,,,,,,,,,,,,,,,,,,,,)
(,,537967,,2756409,user166390,,,Converting String to Int in Java?,,12,0,239,,,,,,,,,)
What am I doing wrong?
Thanks.
Also "\\n" doesn't make a difference.
There are commas in your data (inside the quoted Body field), which is why the fields and the schema do not match. Use CSVLoader and then use the REPLACE command to replace '\\t', '\\n' and '\\r'.
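To illustrate the idea outside Pig, here is a minimal Python sketch of the same two steps: parse with a quote-aware CSV reader first, then strip the control characters from each field. The sample row and the `clean_rows` helper are made up for illustration and are not part of the original data or of any Pig API.

```python
import csv
import io
import re

# Hypothetical sample: the quoted second field spans two lines and
# contains both a comma and a tab, like the Body column in the question.
SAMPLE = '1,"<p>first, line</p>\n<p>second\tline</p>",42\r\n'

def clean_rows(text):
    """Parse CSV text with a quote-aware reader, then strip \\n, \\r and \\t."""
    # newline="" keeps embedded line breaks intact so the csv module
    # can decide which ones are row terminators.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[re.sub(r"[\n\r\t]", "", field) for field in row] for row in reader]

rows = clean_rows(SAMPLE)
print(rows)
```

Because the reader honours the quotes, the record stays one row with three fields, and only then are the line breaks and tabs removed. A plain split on ',' would break the row apart in the same way PigStorage(',') does.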
Related
Regex that extract string of length that is encoded in string
I have the following string to parse: X4IitemX6Nabc123. It is structured as follows:

X ... marker for 'field identifier'
4 ... length of the item name; changes according to the length of the name
I ... identifier for item name; fixed, must not be extracted
item ... value that should be extracted as "name"
X ... marker for 'field identifier'
6 ... length of the item number; changes according to the length of the number
N ... identifier for item number; fixed, must not be extracted
abc123 ... value that should be extracted as "num"

Only these two values will be contained in the string, and the sequence is always the same (name, number). What I have so far is

    \AX(?<namelen>\d+)I(?<name>.+)X(?<numlen>\d+)N(?<num>.+)$

But that does not take into account that the length of the name is contained in the string itself. Somehow the .+ in the name group should be replaced by .{4}. I tried {$1} and {${namlen}}, but that does not yield the result I expect (on rubular.com or regex101). Any ideas or further references?
What you ask for is only possible in languages that allow code insertions in the regex pattern. Here is a Perl example:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $text = "X4IitemX6Nabc123";
    if ($text =~ m/^X(?<namelen>[0-9]+)I(?<name>(??{".{".$^N."}"}))X(?<numlen>[0-9]+)N(?<num>.+)$/) {
        print $text . ": PASS!\n";
    } else {
        print $text . ": FAIL!\n"
    }
    # -> X4IitemX6Nabc123: PASS!

In other languages, use a two-step approach:

1. Extract the number after X.
2. Build a regex dynamically using the result of the first step.

See a JavaScript example:

    const text = "X4IitemX6Nabc123";
    const rx1 = /^X(\d+)/;
    const m1 = rx1.exec(text)
    if (m1) {
      const rx2 = new RegExp(`^X(?<namelen>\\d+)I(?<name>.{${m1[1]}})X(?<numlen>\\d+)N(?<num>.+)$`)
      if (rx2.test(text)) {
        console.log(text, '-> MATCH!')
      } else console.log(text, '-> FAIL!');
    } else {
      console.log(text, '-> FAIL!')
    }

See the Python demo:

    import re
    text = "X4IitemX6Nabc123"
    rx1 = r'^X(\d+)'
    m1 = re.search(rx1, text)
    if m1:
        rx2 = fr'^X(?P<namelen>\d+)I(?P<name>.{{{m1.group(1)}}})X(?P<numlen>\d+)N(?P<num>.+)$'
        if re.search(rx2, text):
            print(text, '-> MATCH!')
        else:
            print(text, '-> FAIL!')
    else:
        print(text, '-> FAIL!')
    # => X4IitemX6Nabc123 -> MATCH!
REGEX_EXTRACT in Pig does not work
I want to remove the double quotes '"' from the beginning and end of each field. I'm trying to apply a regexp in Pig, but it doesn't seem to work.

Input:

    (main_170521230001.csv,"9","2017-05-21 23:00:01.472636")
    (main_170521230001.csv,"91","2017-05-21 23:00:01.472636")
    (main_170521230001.csv,"592","2017-05-21 23:00:01.472636")

Pig script:

    raw = LOAD '/data/csv' using PigStorage(',','-tagFile') as (
        fn:chararray, gid:chararray, createdts:chararray);
    res = foreach raw generate
        REGEX_EXTRACT(fn, '([^"](.*)[^"])',1) as (fn:chararray),
        REGEX_EXTRACT(gid, '([^"](.*)[^"])',1) as (gid:chararray),
        REGEX_EXTRACT(createdts, '([^"](.*)[^"])',1) as (createdts:chararray);
    dump res;

Output:

    (ain_170521230001.cs,,017-05-21 23:00:01.47263)
    (ain_170521230001.cs,91,017-05-21 23:00:01.47263)
    (ain_170521230001.cs,592,017-05-21 23:00:01.47263)

I expected:

    (main_170521230001.csv,9,2017-05-21 23:00:01.472636)
    (main_170521230001.csv,91,2017-05-21 23:00:01.472636)
    (main_170521230001.csv,592,2017-05-21 23:00:01.472636)

I want to receive all characters between the outer double quotes. Examples:

    "abc" -> abc
    abc -> abc
    ""abc""" -> abc
    "a"b"c" -> a"b"c

That's why I'm using this pattern: '([^"](.*)[^"])'

It works fine except in one case: if there is a single character between the double quotes, this pattern returns an empty string. Why does that happen?
Load the data into a single field and use REPLACE. You can then use STRSPLIT to get the individual fields.

    raw = LOAD '/data/csv' USING TextLoader();
    res = foreach raw generate REPLACE($0,'\\"','');
    res_new = foreach res generate STRSPLIT($0,',',3);
    dump res_new;
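On the "why" part of the question: the pattern `[^"](.*)[^"]` needs at least two non-quote characters, one for each `[^"]`, so a value such as "9" cannot match at all. The Python sketch below reproduces this with the same pattern and tries an anchored alternative. `ANCHORED` and `strip_quotes` are illustrative assumptions that do not come from the thread, and Pig uses Java regexes, which should behave the same way for these simple patterns.

```python
import re

# Pattern from the question: one non-quote char, anything, one non-quote char.
ORIGINAL = re.compile(r'([^"](.*)[^"])')

# Assumed alternative (not from the thread): skip leading quotes, capture
# lazily, then skip trailing quotes up to the end of the value.
ANCHORED = re.compile(r'^"*(.*?)"*$')

def strip_quotes(value):
    """Return value without its leading and trailing double quotes."""
    m = ANCHORED.match(value)
    return m.group(1) if m else value

# A single character between quotes gives the original pattern nothing to match.
print(ORIGINAL.search('"9"'))

for s in ['"abc"', 'abc', '""abc"""', '"a"b"c"', '"9"']:
    print(s, '->', strip_quotes(s))
```

If the anchored pattern behaves the same in Pig, it could be passed to REGEX_EXTRACT with group index 1. That is untested here.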
Remove text between two tags
I'm trying to remove some text between two tags, [ and ]:

    [13:00:00]

I want to remove 13:00:00 from the [] tags. The value is not the same every time. It is always a time of day, so it contains only digits and : symbols. Can someone help me?

UPDATE: I forgot to say something. The time (13:00:00) comes from a log file, which looks like this:

    [10:56:49] [Client thread/ERROR]: Item entity 26367127 has no item?!
    [10:57:25] [Dbutant] misterflo13 : ils coute chere les enchent aura de feu et T2 du spawn??*
    [10:57:35] [Amateur] firebow ?.SkyLegend.? : ouai 0
    [10:57:38] [Novice] iPasteque : ils sont gratuit me
    [10:57:41] [Novice] iPasteque : ils sont gratuit mec *
    [10:57:46] [Dbutant] misterflo13 : on ma dit k'ils etait payent :o
    [10:57:57] [Novice] iPasteque : on t'a mytho alors

Ignore the other text; I just want to remove the time between [ and ] (it needs to look like []). The time between [ and ] is updated every second.
It looks like your log has a specific format, and you seem to want to get rid of the time and keep all other information. OK, read the comments. I didn't test it, but it should work:

    ' Read log
    Dim logLines() As String = File.ReadAllLines("File_path")
    If logLines.Length = 0 Then Return

    ' Prepare array to fill with sliced data
    Dim lines(logLines.Length - 1) As String
    For i As Integer = 0 To logLines.Length - 1
        ' Just cut off the time part and add empty brackets for each line
        lines(i) = "[]" & logLines(i).Substring(10)
    Next

As shown above, if you know that your file comes in a certain format, just use the position in the string where you want to cut it off. Note: the code above can be written in one line using LINQ.

If you want to actually get the data out of it, use IndexOf. Since you are looking for the first occurrence of "[" or "]", just use start index 0:

    ' Get position of open bracket in string
    Dim openBracketPos As Integer = myString.IndexOf("[", 0, StringComparison.OrdinalIgnoreCase)
    ' Get position of close bracket in string
    Dim closeBracketPos As Integer = myString.IndexOf("]", 0, StringComparison.OrdinalIgnoreCase)
    ' Get string between open and close bracket (the second argument of Substring is a length)
    Dim data As String = myString.Substring(openBracketPos + 1, closeBracketPos - openBracketPos - 1)
This is another possibility using Regex:

    Public Function ReplaceTime(ByVal Input As String) As String
        Dim m As Match = Regex.Match(Input, "(\[)(\d{1,2}\:\d{1,2}(\:\d{1,2})?)(\])(.+)")
        Return m.Groups(1).Value & m.Groups(4).Value & m.Groups(5).Value
    End Function

It's more of a readability nightmare, but it's efficient and it takes only the brackets containing a time value. I also took the liberty of making it match, for example, 13:47 as well as 13:47:12.

Test: http://ideone.com/yogWfD

(EDIT) Multiline example: You can combine this with File.ReadAllLines() (if that's what you prefer) and a For loop to get the replacement done.

    Public Function ReplaceTimeMultiline(ByVal TextLines() As String) As String
        For x = 0 To TextLines.Length - 1
            TextLines(x) = ReplaceTime(TextLines(x))
        Next
        Return String.Join(Environment.NewLine, TextLines)
    End Function

Above code usage:

    Dim FinalT As String = ReplaceTimeMultiline(File.ReadAllLines(<file path here>))

Another multiline example:

    Public Function ReplaceTimeMultiline(ByVal Input As String) As String
        Dim ReturnString As String = ""
        Dim Parts() As String = Input.Split(Environment.NewLine)
        For x = 0 To Parts.Length - 1
            ReturnString &= ReplaceTime(Parts(x)) & If(x < (Parts.Length - 1), Environment.NewLine, "")
        Next
        Return ReturnString
    End Function

Multiline test: http://ideone.com/nKZQHm
If your problem is to remove numeric strings in the format 99:99:99 that appear inside [], I would use the regex shown below. The demo replaces each match with string.Empty, which removes the tag completely; to leave an empty [] behind instead, replace with "[]".

Here's a demo (in C#, not VB, but you get the point; you need the regex, not the syntax, anyway):

    List<string> list = new List<string> { "[13:00:00]", "[4:5:0]", "[5d2hu2d]", "[1:1:1000]", "[1:00:00]", "[512341]" };
    string s = string.Join("\n", list);
    Console.WriteLine("Original input string:");
    Console.WriteLine(s);

    Regex r = new Regex(@"\[\d{1,2}?:\d{1,2}?:\d{1,2}?\]");
    foreach (Match m in r.Matches(s))
    {
        Console.WriteLine("{0} is a match.", m.Value);
    }
    Console.WriteLine();
    Console.WriteLine("String with occurrences replaced with an empty string:");
    Console.WriteLine(r.Replace(s, string.Empty).Trim());
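For comparison, the same replacement can be sketched in Python. This is only an illustration under assumptions: the time tag is taken to sit at the very start of each line, and the names `TIME_TAG` and `blank_time` are invented for the example.

```python
import re

# Assumption: the time tag is the first thing on the line, in the form
# [h:mm], [hh:mm] or [hh:mm:ss]. Other bracketed tags are left alone.
TIME_TAG = re.compile(r'^\[\d{1,2}:\d{2}(?::\d{2})?\]')

def blank_time(line):
    """Replace a leading [hh:mm:ss] tag with empty brackets."""
    return TIME_TAG.sub('[]', line, count=1)

log = [
    "[10:56:49] [Client thread/ERROR]: Item entity 26367127 has no item?!",
    "[10:57:38] [Novice] iPasteque : ils sont gratuit me",
]
for line in log:
    print(blank_time(line))
```

Anchoring the pattern at the start of the line keeps tags such as [Novice] untouched, and it also avoids relying on the time always being exactly ten characters wide.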
Trying to parse string values into an array after a pattern match
I have the following lines in a text file:

    <Entry> <Key argument="ComputerNames"/>
    <Value type="string" argument="localhost,localhost,engine1,engine2"/></Entry>
    <Entry> <Key argument="BranchIDMultiple"/>
    <Value type="int" argument="1"/></Entry>

I know how to find the line that has ComputerNames, and I know how to read the next line as well. I need to parse that line as follows, where the number of arguments can be dynamic. The parse output should be:

    @result = $result[0]=localhost, $result[1]=localhost, $result[2]=engine1, $result[3]=engine2

There must be at least one argument, but there can be more. I'm not able to construct the right regex to accomplish the split. Any ideas?
Let's say input contains the XML line in question. Since you've mentioned that you know how to extract this line, I've left that part to you. Once you have the line, use the following regex:

    String regex = "argument=\"([a-zA-Z0-9,]*)\"";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    String[] op = new String[0];
    if (matcher.find()) {
        // group(1) is the text between the quotes, without the argument=" prefix
        op = matcher.group(1).split(",");
    }
OK, here is what I have:

---

After many different trials, I was finally able to get something that works. See below:

    BEGIN { require 5.8.0; }
    use strict;
    use warnings;

    # string to test regular expressions
    my $test_string = '<Value type="string" argument="400teets,localhost,localhost,engine1,engine2,engine50,engine100,100afdasfdas"/></Entry>';

    # print out the initial string
    print "The initial string is: $test_string\n\n";

    # first set of arguments - all words that have a comma after them
    my @first_words = ($test_string =~ /(\w+),/g);

    # print first set of arguments
    print "\nFirst set of arguments found\n";
    foreach my $word (@first_words) {
        print "$word\n";
    }

    # second set of arguments - all words that have a comma before them
    my @last_words = ($test_string =~ /,(\w+)/g);

    # print second set of arguments
    print "\nSecond set of arguments found\n";
    foreach my $word (@last_words) {
        print "$word\n";
    }

    # merge the sets by popping the last element off of the last_words array
    # and pushing it onto the first_words array
    push(@first_words, pop(@last_words));

    # print the results
    print "\nMerged Sets\n";
    foreach my $word (@first_words) {
        print "$word\n";
    }
    # END OF PROGRAM

---

Really, if you exclude all of the print statements and comments, all you need is these three lines:

    my @first_words = ($test_string =~ /(\w+),/g);
    my @last_words = ($test_string =~ /,(\w+)/g);
    push(@first_words, pop(@last_words));

---

Here is the output:

    The initial string is:

    First set of arguments found
    400teets
    localhost
    localhost
    engine1
    engine2
    engine50
    engine100

    Second set of arguments found
    localhost
    localhost
    engine1
    engine2
    engine50
    engine100
    100afdasfdas

    Merged Sets
    400teets
    localhost
    localhost
    engine1
    engine2
    engine50
    engine100
    100afdasfdas
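A shorter route is to capture the whole attribute value once and then split it on commas. The Python sketch below shows that idea; it is an illustration only, the names `ARGUMENT` and `parse_arguments` are invented for it, and it assumes the `<Value ...>` element sits on a single line as in the question.

```python
import re

# Assumption: the line holds one <Value ...> element whose argument
# attribute is double-quoted and contains no embedded double quotes.
ARGUMENT = re.compile(r'<Value\b[^>]*\bargument="([^"]*)"')

def parse_arguments(line):
    """Return the comma-separated values of the argument attribute as a list."""
    m = ARGUMENT.search(line)
    return m.group(1).split(',') if m else []

line = '<Value type="string" argument="localhost,localhost,engine1,engine2"/></Entry>'
result = parse_arguments(line)
print(result)
```

Capturing first and splitting afterwards handles a single argument (no commas) without a special case, which the two-list merge above does not.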
Verify and cut a string using regexp in matlab
I have the following string:

    {'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921}

I would like to verify the format of the string, namely that the word 'variable' is the second word, and I would like to retrieve the string after the last '/' in the third string (in this example, 'D_foo'). How can I verify this and retrieve the string I am looking for? I tried the following, without success:

    regexp(str,'{''\w+'',{''variable'',''([(a-z)|(A-Z)|/|_])+')

REMARK: The string to analyse is not split after the comma; any line break is only due to the length of the string.

EDIT: My string is

    '{''output'',{''variable'',''VGRG_Pos_Var1/Parameters/D_foo''},''date'',734704.60904050921}';

and not a cell, as could be misunderstood. I added the ' symbol at the start and end to indicate that it is a string.
I realise that you mention using regexp in the question, but I'm not sure if this is a requirement? If other solutions are acceptable, you could try this:

    str='{''output'',{''variable'',''VGRG_Pos_Var1/Parameters/D_foo''},''date'',734704.60904050921}';
    parts1=textscan( str, '%s','delimiter',{',','{','}'},'MultipleDelimsAsOne',1);
    parts2=textscan( parts1{1}{3}, '%s','delimiter',{'/',''''},'MultipleDelimsAsOne',1);
    string=parts2{1}{end}
    match=strcmp(parts1{1}{2},'variable')
To answer the first part of your question, you can write this:

    str = {'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921};
    temp = str(2); % this holds the cell containing the two strings
    if strcmp(temp{1}(1), 'variable')
        % do stuff
    end

For the second part you can do this:

    str = {'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921};
    temp = str(2);             % like before, this contains the cell
    temp = temp{1}(2);         % this picks out the second string in the cell
    temp = char(temp);         % turns the item from a cell to a string
    res = strsplit(temp, '/'); % splits the string where '/' are found; res is a cell array of strings
    string = res(3);           % assuming there will always be just 2 '/'s
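Since the EDIT in the question says the input is a plain string rather than a cell, a single anchored regex can do both the check and the extraction. The Python sketch below illustrates the shape of such a pattern. It is not MATLAB code, the names `PATTERN` and `last_path_part` are invented for the example, and porting it to MATLAB's regexp with the 'tokens' option is left as an untested assumption.

```python
import re

# One anchored pattern does both jobs: it only matches when 'variable'
# is the second word, and it captures the text after the last '/'.
PATTERN = re.compile(r"^\{'\w+',\{'variable','(?:[^'/]*/)*([^'/]+)'\}")

def last_path_part(s):
    """Return the segment after the last '/', or None if the format is wrong."""
    m = PATTERN.match(s)
    return m.group(1) if m else None

s = "{'output',{'variable','VGRG_Pos_Var1/Parameters/D_foo'},'date',734704.60904050921}"
print(last_path_part(s))
```

The group `(?:[^'/]*/)*` consumes every path segment that ends in '/', so the capture works for any number of slashes rather than exactly two.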