Regular Expression in Pig Latin

I want to search for the string '15200' (without quotes) in tuples. So, for the following input:
15200
15200,4000
4000,15200
4000,15200,4025
152000
152000,4000
4000,152000
4000,152000,4025
115200
115200,4000
4000,115200
4000,115200,4025
The output should be:
15200,15200
15200,4000,15200
4000,15200,15200
4000,15200,4025,15200
152000,-1
152000,4000,-1
4000,152000,-1
4000,152000,4025,-1
115200,-1
115200,4000,-1
4000,115200,-1
4000,115200,4025,-1
My Pig code looks like this:
A = LOAD '/user/test' USING PigStorage() AS (logic:chararray);
B = FOREACH A GENERATE
logic,
((logic matches '(^|,)15200($|,)')? '15200' :'-1') AS expt;
But when I Dump B, I get:
(15200,15200)
(15200,4000,-1)
(4000,15200,-1)
(4000,15200,4025,-1)
(152000,-1)
(152000,4000,-1)
(4000,152000,-1)
(4000,152000,4025,-1)
(115200,-1)
(115200,4000,-1)
(4000,115200,-1)
(4000,115200,4025,-1)

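Pig's matches operator has to match the whole string, not just part of it, so the pattern must also allow for whatever comes before and after 15200. Try this:
.*?\b15200\b.*
Regex Demo: https://regex101.com/r/n6EP1s/2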
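
To see the difference between the two patterns outside of Pig, here is a small Java sketch. It assumes that Pig's matches behaves like Java's String.matches (whole-string matching); the class and helper names are made up for illustration.

```java
class MatchesDemo {
    // Hypothetical helper mirroring the Pig expression:
    // '15200' when the whole line matches the regex, '-1' otherwise.
    static String tag(String line, String regex) {
        return line.matches(regex) ? "15200" : "-1";
    }

    public static void main(String[] args) {
        String anchored = "(^|,)15200($|,)";  // pattern from the question
        String padded = ".*?\\b15200\\b.*";   // pattern from the answer
        String[] lines = {"15200", "15200,4000", "4000,15200,4025", "152000", "115200,4000"};
        for (String line : lines) {
            // Prints the line, then the result for each pattern.
            System.out.println(line + " -> " + tag(line, anchored) + " / " + tag(line, padded));
        }
    }
}
```

With the question's pattern only the lone 15200 line is tagged, because anything after the trailing comma is left unmatched; with the padded pattern every line that contains 15200 as a whole number is tagged.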

Related

REGEX_EXTRACT in Pig does not work

I want to remove double quotes '"' from the beginning and end of each field.
I'm trying to apply a regexp in Pig, but it doesn't seem to work.
Input:
(main_170521230001.csv,"9","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"91","2017-05-21 23:00:01.472636")
(main_170521230001.csv,"592","2017-05-21 23:00:01.472636")
Pig script:
raw = LOAD '/data/csv' using PigStorage(',','-tagFile') as (
fn:chararray,
gid:chararray,
createdts:chararray);
res = foreach raw generate
REGEX_EXTRACT(fn, '([^"](.*)[^"])',1) as (fn:chararray),
REGEX_EXTRACT(gid, '([^"](.*)[^"])',1) as (gid:chararray),
REGEX_EXTRACT(createdts, '([^"](.*)[^"])',1) as (createdts:chararray);
dump res;
Output:
(ain_170521230001.cs,,017-05-21 23:00:01.47263)
(ain_170521230001.cs,91,017-05-21 23:00:01.47263)
(ain_170521230001.cs,592,017-05-21 23:00:01.47263)
I expected:
(main_170521230001.csv,9,2017-05-21 23:00:01.472636)
(main_170521230001.csv,91,2017-05-21 23:00:01.472636)
(main_170521230001.csv,592,2017-05-21 23:00:01.472636)
I want to receive all characters between "".
Examples:
"abc" -> abc
abc -> abc
""abc""" -> abc
"a"b"c" -> a"b"c
That's why I'm using this pattern:
'([^"](.*)[^"])'
It works fine except in one case: if there is a single character between the double quotes, this pattern returns an empty string.
Why does this happen?

Load the data into a single field and use REPLACE. You can then use STRSPLIT to get the individual fields. Note that this removes every double quote in the line, including any inside a field.
raw = LOAD '/data/csv' USING TextLoader();
res = foreach raw generate REPLACE($0,'"','');
res_new = foreach res generate STRSPLIT($0,',',3);
dump res_new;
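
As for why the single-character case fails: the pattern [^"](.*)[^"] needs at least two non-quote characters, one for each [^"], so "9" gives it nothing to match. If only the leading and trailing quotes should go, a pattern such as ^"+|"+$ might be closer to what was asked. The Java sketch below illustrates both points; the class and method names are hypothetical, and passing the same regex to Pig's REPLACE is an untested assumption.

```java
import java.util.regex.Pattern;

class QuoteTrimDemo {
    // Remove any run of double quotes at the very start or very end;
    // quotes inside the field are left alone.
    static String trimQuotes(String field) {
        return field.replaceAll("^\"+|\"+$", "");
    }

    // True if the pattern from the question finds anything in the field.
    static boolean originalPatternFinds(String field) {
        return Pattern.compile("([^\"](.*)[^\"])").matcher(field).find();
    }

    public static void main(String[] args) {
        System.out.println(originalPatternFinds("\"9\""));   // false: only one non-quote character
        System.out.println(originalPatternFinds("\"91\""));  // true
        String[] fields = {"\"abc\"", "abc", "\"\"abc\"\"\"", "\"a\"b\"c\"", "\"9\""};
        for (String f : fields) {
            System.out.println(f + " -> " + trimQuotes(f));
        }
    }
}
```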

return first instance of unmatched regex scala

Is there a way to return the first instance of an unmatched string between 2 strings with Scala's Regex library?
For example:
val a = "some text abc123 some more text"
val b = "some text xyz some more text"
a.firstUnmatched(b) = "abc123"

Regex is good for matching & replacing in strings based on patterns.
But to look for the differences between strings? Not exactly.
However, diff can be used to find differences.
object Main extends App {
val a = "some text abc123 some more text 321abc"
val b = "some text xyz some more text zyx"
val firstdiff = (a.split(" ") diff b.split(" "))(0)
println(firstdiff)
}
prints "abc123"
Is regex desired after all? Then realize that the splits could be replaced by regex matching.
The regex pattern in this example looks for words:
val reg = "\\w+".r
val firstdiff = (reg.findAllIn(a).toList diff reg.findAllIn(b).toList)(0)
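
For comparison, the same word-level diff can be sketched in Java. This is only an illustration of the idea, not a library API: firstUnmatched is a made-up helper that mimics a multiset difference, and it returns an empty Optional when nothing differs, whereas indexing with (0) on an empty result would throw.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class FirstUnmatchedDemo {
    // Hypothetical helper: the first word of a that is not cancelled out
    // by an equal word in b (each word in b cancels one occurrence in a).
    static Optional<String> firstUnmatched(String a, String b) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String word : b.split(" ")) {
            remaining.merge(word, 1, Integer::sum);
        }
        for (String word : a.split(" ")) {
            int left = remaining.getOrDefault(word, 0);
            if (left == 0) {
                return Optional.of(word);
            }
            remaining.put(word, left - 1);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String a = "some text abc123 some more text 321abc";
        String b = "some text xyz some more text zyx";
        System.out.println(firstUnmatched(a, b).orElse("(no difference)"));  // abc123
    }
}
```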

Scala Regular Expression

I am having some trouble getting this regular expression just right:
Sample string looks something like this:
"li3001easdasfsaasfasdi5ei1409529297ei1409529597ed16:acl_dec_tag_listl15:DEFAULT_CASE_11e18:avc_app_name_statsd29:Generic Search Engine Trafficd4:sizei5875e5:totali5ee16:Odnoklassniki.Rud4:sizei456e5:totali1ee7:Unknownd4:sizei6391e5:totali2ee5:Yahood4:sizei15673e5:totali1ee10:Yahoo Maild4:sizei5982e5:totali1e"
I want the string to be grouped like this:
(li<digit 1-4>e <string of varying length> i<single digit>e) (<string2 of varying length>)
This is my attempt at this regex so far: (li\d{1,}e.*i\de)(.*)
I would like only the first occurrence of li<digits 1-4>e as well.

Simple mistake. * is a greedy operator, meaning it will match as much as it can and still allow the remainder of the regex to match. Use *? instead for a non-greedy match meaning "zero or more — preferably as few as possible".
(li\d{1,}e.*?i\de)(.*)
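
The effect of greedy versus non-greedy matching is easier to see on a short input. The Java sketch below uses a shortened, made-up sample string rather than the one from the question, and firstGroup is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLazyDemo {
    // Returns group 1 when the regex matches the whole input, otherwise null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "li3001eabci5exyzi7etail";  // made-up sample with two i<digit>e markers

        // Greedy .* runs to the LAST i<digit>e it can still match.
        System.out.println(firstGroup("(li\\d{1,}e.*i\\de)(.*)", s));   // li3001eabci5exyzi7e

        // Non-greedy .*? stops at the FIRST i<digit>e.
        System.out.println(firstGroup("(li\\d{1,}e.*?i\\de)(.*)", s));  // li3001eabci5e
    }
}
```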

val s = "li3001easdasfsaasfasdi5easdasfsafas"
val p = """li(\d{1,4})e([^i]*)i(\d)(.*)""".r
val p(first,second,third,fourth) = s
results in:
first: String = 3001
second: String = asdasfsaasfasd
third: String = 5
fourth: String = easdasfsafas
Not sure if that answers your question, but hope it helps.

Regex that excludes characters that repeat three (3) or more times

I'm looking for a regex that searches a file and does NOT return (i.e. it excludes) strings in which a character repeats 3 or more times consecutively. I've tried the expression below, but it's not doing the job. I need something that looks forward and backward and excludes strings that have 3 or more repeating back-to-back chars, i.e. it should return abcdefg, but not 3333ahg or gagjjjjagy or hdajgjga111
(?!(.)\1{3})

Try using the following regex to match a string containing 3 or more repeating back-to-back characters:
(.)\1{2,}
Then invert the match. Most tools and languages support this.
For example, with grep:
$ cat file
abcdefg
gagjagyyy
3333ahg
$ grep -v -E '(.)\1{2,}' file
abcdefg
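
The same "match, then invert" idea can be sketched in Java by keeping only the lines where the pattern is not found. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RepeatFilterDemo {
    // Any character followed by at least two more copies of itself.
    private static final Pattern TRIPLE = Pattern.compile("(.)\\1{2,}");

    // Keep only the lines that contain no run of 3 or more identical characters.
    static List<String> withoutTriples(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!TRIPLE.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("abcdefg", "gagjagyyy", "3333ahg", "gagjjjjagy", "hdajgjga111");
        System.out.println(withoutTriples(lines));  // [abcdefg]
    }
}
```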

If you are using C#, you may try this:
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
class Program
{
const string
isMatch = "IsMatch",
pattern = @"(?:(?<Open>\w*?(\w)\1{{2,}}\w*)|(?<{0}>\w*))";
static void Main(string[] args)
{
var input = File.ReadAllText("input.txt");
var regex = String.Format(pattern, isMatch);
var matches = Regex.Matches(input, regex)
.Cast<Match>()
.Select<Match, Group>(m => m.Groups[isMatch])
.Where(g => g.Value != string.Empty)
.ToList();
matches.ForEach(m => Console.WriteLine(m.Value));
}
}

Try this:
^(?!.*(.)\1\1.*$).+$ - matches whole string as one word
(?=\b|^)(?!\w*(\w)\1\1\w*)\w+(?:\b|$) - matches one word
Example: http://rubular.com/r/dkIHkDo67g

Using RegEx split the string

I have a string like '[1]-[2]-[3],[4]-[5],[6,7,8],[9]' or '[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]', I'd like the Pattern to get the list result, but don't know how to figure out the pattern. Basically the comma is the split, but [6,7,8] itself contains the comma as well.
the string: [1]-[2]-[3],[4]-[5],[6,7,8],[9]
the result:
[1]-[2]-[3]
[4]-[5]
[6,7,8]
[9]
or
the string: [Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]
the result:
[Computers]-[Apple]-[Laptop]
[Cables]-[Cables,Connectors]
[Adapters]

,(?=\[)
This pattern splits on any comma that is followed by a bracket, but keeps the bracket within the result text.
The (?=*stuff*) is known as a "lookahead assertion". It acts as a condition for the match but is not itself part of the match.
In C# code:
String inputstring = "[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]";
foreach(String s in Regex.Split(inputstring, @",(?=\[)"))
System.Console.Out.WriteLine(s);
In Java code:
String inputstring = "[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]";
Pattern p = Pattern.compile(",(?=\\[)");
for(String s : p.split(inputstring))
System.out.println(s);
Either produces:
[Computers]-[Apple]-[Laptop]
[Cables]-[Cables,Connectors]
[Adapters]

Although I believe the best approach here is to use split (as presented by @j__m's answer), here's an approach that uses matching rather than splitting.
Regex:
(\[.*?\](?!-))
Example usage:
String input = "[Computers]-[Apple]-[Laptop],[Cables]-[Cables,Connectors],[Adapters]";
Pattern p = Pattern.compile("(\\[.*?\\](?!-))");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
Resulting output:
[Computers]-[Apple]-[Laptop]
[Cables]-[Cables,Connectors]
[Adapters]

An answer that doesn't use regular expressions (if that's worth something in ease of understanding what's going on) is:
substitute "]#[" for "],["
split on "#"
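
The two steps above can be sketched in Java as follows. The sketch assumes that "#" never occurs in the data itself; the class and method names are made up for illustration.

```java
import java.util.Arrays;

class BracketSplitDemo {
    // Step 1: turn each separating "],[" into "]#[".
    // Step 2: split on "#". Commas inside brackets are never touched.
    // Assumption: the input contains no "#" of its own.
    static String[] splitGroups(String input) {
        return input.replace("],[", "]#[").split("#");
    }

    public static void main(String[] args) {
        String input = "[1]-[2]-[3],[4]-[5],[6,7,8],[9]";
        System.out.println(Arrays.toString(splitGroups(input)));
        // [[1]-[2]-[3], [4]-[5], [6,7,8], [9]]
    }
}
```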

We Keep Coding
