Regular Expression in Pig Latin - regex

I want to search for the string '15200' (without quotes) in tuples. So, for the following input:
15200
15200,4000
4000,15200
4000,15200,4025
152000
152000,4000
4000,152000
4000,152000,4025
115200
115200,4000
4000,115200
4000,115200,4025
The output should be :
15200,15200
15200,4000,15200
4000,15200,15200
4000,15200,4025,15200
152000,-1
152000,4000,-1
4000,152000,-1
4000,152000,4025,-1
115200,-1
115200,4000,-1
4000,115200,-1
4000,115200,4025,-1
My Pig code looks like this:
A = LOAD '/user/test' USING PigStorage() AS (logic:chararray);
B = FOREACH A GENERATE
logic,
((logic matches '(^|,)15200($|,)')? '15200' :'-1') AS expt;
But when I Dump B, I get:
(15200,15200)
(15200,4000,-1)
(4000,15200,-1)
(4000,15200,4025,-1)
(152000,-1)
(152000,4000,-1)
(4000,152000,-1)
(4000,152000,4025,-1)
(115200,-1)
(115200,4000,-1)
(4000,115200,-1)
(4000,115200,4025,-1)

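Pig's matches operator succeeds only when the pattern matches the whole string (it behaves like Java's String.matches), so '(^|,)15200($|,)' matches only the line that is exactly 15200. Let the pattern consume the rest of the line and use word boundaries instead:
.*?\b15200\b.*
Inside a Pig string literal the backslashes usually have to be doubled, i.e. '.*?\\b15200\\b.*'.
Regex Demo: https://regex101.com/r/n6EP1s/2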
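Since Pig hands the pattern to Java's regex engine, the whole-string behaviour can be checked in plain Java. The following is only a sketch: the class name MatchWholeField and the helper tag are made up for illustration, and it mimics the FOREACH expression rather than running Pig itself.

```java
import java.util.List;

public class MatchWholeField {
    // Pattern from the answer: 15200 delimited by word boundaries, anywhere in the line.
    static final String ANSWER = ".*?\\b15200\\b.*";
    // Pattern from the question, which only covers the 15200 field and its delimiters.
    static final String QUESTION = "(^|,)15200($|,)";

    // Mirrors the Pig expression (logic matches '<regex>') ? '15200' : '-1'.
    // String.matches, like Pig's matches, requires the regex to cover the whole string.
    static String tag(String logic, String regex) {
        return logic.matches(regex) ? "15200" : "-1";
    }

    public static void main(String[] args) {
        List<String> lines = List.of("15200", "15200,4000", "4000,15200,4025", "152000", "4000,115200");
        for (String line : lines) {
            System.out.println(line + "," + tag(line, ANSWER));
        }
    }
}
```

With the question's pattern, tag("15200,4000", QUESTION) yields -1, which reproduces the output the asker saw; with the answer's pattern the same line yields 15200.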

Related

REGEX_EXTRACT in Pig does not work

I want to remove double quotes '"' from the beginning and end of each field.
I'm trying to apply a regexp in Pig, but it doesn't seem to work.
Input:
(main_170521230001.csv,"9","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"91","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"592","2017-05-21 23:00:01.472636")
Pig script:
raw = LOAD '/data/csv' using PigStorage(',','-tagFile') as (
fn:chararray,
gid:chararray,
createdts:chararray);
res = foreach raw generate
REGEX_EXTRACT(fn, '([^"](.*)[^"])',1) as (fn:chararray),
REGEX_EXTRACT(gid, '([^"](.*)[^"])',1) as (gid:chararray),
REGEX_EXTRACT(createdts, '([^"](.*)[^"])',1) as (createdts:chararray);
dump res;
Output:
(ain_170521230001.cs,,017-05-21 23:00:01.47263)
(ain_170521230001.cs,91,017-05-21 23:00:01.47263)
(ain_170521230001.cs,592,017-05-21 23:00:01.47263)
I expected:
(main_170521230001.csv,9,2017-05-21 23:00:01.472636)
(main_170521230001.csv,91,2017-05-21 23:00:01.472636)
(main_170521230001.csv,592,2017-05-21 23:00:01.472636)
I want to receive all characters between "".
Examples:
"abc" -> abc
abc -> abc
""abc""" -> abc
"a"b"c" -> a"b"c
That's why I'm using this pattern:
'([^"](.*)[^"])'
It works fine except in one case: if there is a single character between the double quotes, this pattern returns an empty string.
Why does that happen?
Load the data into a single field and use REPLACE. You can then use STRSPLIT to get the individual fields. Note that this removes every double quote in the line, not only those at the ends of a field.
raw = LOAD '/data/csv' USING TextLoader();
res = foreach raw generate REPLACE($0, '"', '');
res_new = foreach res generate STRSPLIT($0,',',3);
dump res_new;
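As for why the original pattern fails on "9": [^"](.*)[^"] needs at least two non-quote characters, one for each character class, so a field with a single character between the quotes has no match at all. A pattern that removes only the runs of quotes at the two ends avoids that. The sketch below tries it in plain Java; the class and method names are invented for illustration, and using the same pattern with Pig's REPLACE is an untested assumption.

```java
public class StripQuotes {
    // Remove any run of double quotes at the start or at the end of the field,
    // leaving quotes in the middle untouched.
    static String strip(String field) {
        return field.replaceAll("^\"+|\"+$", "");
    }

    public static void main(String[] args) {
        String[] samples = {"\"9\"", "\"abc\"", "abc", "\"\"abc\"\"\"", "\"a\"b\"c\""};
        for (String s : samples) {
            System.out.println(s + " -> " + strip(s));
        }
    }
}
```

This reproduces the four examples from the question, including "a"b"c" -> a"b"c, and also handles the single-character case "9" -> 9.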

return first instance of unmatched regex scala

Is there a way to return the first instance of an unmatched string between 2 strings with Scala's Regex library?
For example:
val a = "some text abc123 some more text"
val b = "some text xyz some more text"
a.firstUnmatched(b) = "abc123"
Regex is good for matching & replacing in strings based on patterns.
But to look for the differences between strings? Not exactly.
However, diff can be used to find differences.
object Main extends App {
val a = "some text abc123 some more text 321abc"
val b = "some text xyz some more text zyx"
val firstdiff = (a.split(" ") diff b.split(" "))(0)
println(firstdiff)
}
prints "abc123"
Is regex desired after all? Then realize that the splits could be replaced by regex matching.
The regex pattern in this example looks for words:
val reg = "\\w+".r
val firstdiff = (reg.findAllIn(a).toList diff reg.findAllIn(b).toList)(0)

Scala Regular Expression

I am having some trouble getting this regular expression just right:
Sample string looks something like this:
"li3001easdasfsaasfasdi5ei1409529297ei1409529597ed16:acl_dec_tag_listl15:DEFAULT_CASE_11e18:avc_app_name_statsd29:Generic Search Engine Trafficd4:sizei5875e5:totali5ee16:Odnoklassniki.Rud4:sizei456e5:totali1ee7:Unknownd4:sizei6391e5:totali2ee5:Yahood4:sizei15673e5:totali1ee10:Yahoo Maild4:sizei5982e5:totali1e"
I want the string to be grouped like this:
(li<digit 1-4>e <string of varying length> i<single digit>e) (<string2 of varying length>)
This is my attempt at this regex so far: (li\d{1,}e.*i\de)(.*)
I would like only the first occurrence of li<digits 1-4>e as well.
Simple mistake: * is a greedy quantifier, meaning it matches as much as it can while still allowing the remainder of the regex to match. Use *? instead for a non-greedy match, meaning "zero or more, preferably as few as possible".
(li\d{1,}e.*?i\de)(.*)
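The difference between the two quantifiers is easiest to see on a short input that contains more than one i<digit>e token. The sketch below uses Java rather than Scala (both share the same regex engine); the sample string and the names GreedyVsLazy and head are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    static final String GREEDY = "(li\\d{1,}e.*i\\de)(.*)";
    static final String LAZY = "(li\\d{1,}e.*?i\\de)(.*)";

    // Returns group 1 of the first match, or null when nothing matches.
    static String head(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "li3001eabci5exyzi7erest";
        // Greedy .* runs to the end and backtracks to the LAST i<digit>e.
        System.out.println(head(GREEDY, input));
        // Lazy .*? stops at the FIRST i<digit>e.
        System.out.println(head(LAZY, input));
    }
}
```

On this input the greedy pattern captures li3001eabci5exyzi7e as the first group, while the lazy one captures li3001eabci5e and leaves xyzi7erest for the second group.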
val s = "li3001easdasfsaasfasdi5easdasfsafas"
val p = """li(\d{1,4})e([^i]*)i(\d)(.*)""".r
val p(first,second,third,fourth) = s
results in:
first: String = 3001
second: String = asdasfsaasfasd
third: String = 5
fourth: String = easdasfsafas
Not sure if that answers your question, but hope it helps.

Regex that excludes characters that repeat three (3) or more times

I'm looking for a regex that searches a file and does NOT return (i.e. it excludes) strings in which a character repeats 3 or more times consecutively. I've tried the expression below, but it's not doing the job. I need something that looks forward and backward and excludes strings that have 3 or more repeating back-to-back chars, i.e. it should return abcdefg, but not 3333ahg or gagjjjjagy or hdajgjga111
(?!(.)\1{3})
Try the following regex to match a string containing 3 or more repeating back-to-back characters:
(.)\1{2,}
Then invert the match using a flag. Most tools and languages support this.
For example, with grep
$ cat file
abcdefg
gagjagyyy
3333ahg
$ grep -v -E '(.)\1{2,}' file
abcdefg
If you are using C#, you may try this:
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
class Program
{
const string
isMatch = "IsMatch",
pattern = @"(?:(?<Open>\w*?(\w)\1{{2,}}\w*)|(?<{0}>\w*))";
static void Main(string[] args)
{
var input = File.ReadAllText("input.txt");
var regex = String.Format(pattern, isMatch);
var matches = Regex.Matches(input, regex)
.Cast<Match>()
.Select<Match, Group>(m => m.Groups[isMatch])
.Where(g => g.Value != string.Empty)
.ToList();
matches.ForEach(m => Console.WriteLine(m.Value));
}
}
Try this:
^(?!.*(.)\1\1.*$).+$ - matches whole string as one word
(?=\b|^)(?!\w*(\w)\1\1\w*)\w+(?:\b|$) - matches one word
Example: http://rubular.com/r/dkIHkDo67g
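Both ideas above (match the repeat and invert, or reject it with a negative lookahead) can be sketched in Java as follows. The class and method names are illustrative, and the lookahead is a slightly shortened variant of the one in the answer, not a verbatim copy. Note that the asker's original (?!(.)\1{3}) requires four identical characters in a row and is not anchored, which is why it did not behave as intended.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class NoTripleRepeats {
    // Any character followed by at least two more copies of itself.
    private static final Pattern TRIPLE = Pattern.compile("(.)\\1{2,}");

    // "Match, then invert": keep only the lines where TRIPLE is not found.
    static List<String> keep(List<String> lines) {
        return lines.stream()
                .filter(s -> !TRIPLE.matcher(s).find())
                .collect(Collectors.toList());
    }

    // Single-regex alternative: the negative lookahead rejects the whole line
    // if a triple repeat occurs anywhere in it.
    static boolean passesLookahead(String s) {
        return s.matches("^(?!.*(.)\\1\\1).+$");
    }

    public static void main(String[] args) {
        List<String> lines = List.of("abcdefg", "3333ahg", "gagjjjjagy", "hdajgjga111");
        System.out.println(keep(lines));
    }
}
```

For the four sample strings from the question, only abcdefg survives either check.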

Using RegEx split the string

I have a string like '[1]-[2]-[3],[4]-[5],[6,7,8],[9]' or '[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]', I'd like the Pattern to get the list result, but don't know how to figure out the pattern. Basically the comma is the split, but [6,7,8] itself contains the comma as well.
the string: [1]-[2]-[3],[4]-[5],[6,7,8],[9]
the result:
[1]-[2]-[3]
[4]-[5]
[6,7,8]
[9]
or
the string: [Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]
the result:
[Computers]-[Apple]-[Laptop]
[Cables]-[Cables,Connectors]
[Adapters]
,(?=\[)
This pattern splits on any comma that is followed by a bracket, but keeps the bracket within the result text.
The (?=*stuff*) is known as a "lookahead assertion". It acts as a condition for the match but is not itself part of the match.
In C# code:
String inputstring = "[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]";
foreach(String s in Regex.Split(inputstring, @",(?=\[)"))
System.Console.Out.WriteLine(s);
In Java code:
String inputstring = "[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]";
Pattern p = Pattern.compile(",(?=\\[)");
for(String s : p.split(inputstring))
System.out.println(s);
Either produces:
[Computers]-[Apple]-[Laptop]
[Cables]-[Cables,Connectors]
[Adapters]
Although I believe the best approach here is to use split (as presented by @j__m's answer), here's an approach that uses matching rather than splitting.
Regex:
(\[.*?\](?!-))
Example usage:
String input = "[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]";
Pattern p = Pattern.compile("(\\[.*?\\](?!-))");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
Resulting output:
[Computers]-[Apple]-[Laptop]
[Cables]-[Cables,Connectors]
[Adapters]
An answer that doesn't use regular expressions (if that's worth something in ease of understanding what's going on) is:
substitute "]#[" for "],["
split on "#"
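The two steps above might look like this in Java. It is a sketch that assumes the marker character '#' never occurs in the input; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class SplitWithoutLookahead {
    // Assumption: '#' does not appear anywhere in the input, so it is safe as a marker.
    static List<String> split(String input) {
        // Step 1: literal replacement (String.replace does not use regex).
        String marked = input.replace("],[", "]#[");
        // Step 2: split on the marker; "#" has no special meaning as a regex.
        return Arrays.asList(marked.split("#"));
    }

    public static void main(String[] args) {
        split("[1]-[2]-[3],[4]-[5],[6,7,8],[9]").forEach(System.out::println);
    }
}
```

The commas inside [6,7,8] are left alone because they are never directly between a closing and an opening bracket.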