Applying Groovy RegEx with Conditional Matching - regex

Using Groovy and regular expression(s) how can I convert this:
String shopping = "SHOPPING LIST(TOMATOES, TEA, LENTIL SOUP: packets=2) for Saturday"
to print out
Shopping for Saturday
TOMATOES
TEA
LENTIL SOUP (2 packets)

I'm not a regex guru, so I couldn't find a regex to do the conversion in just one replaceAll step (I think it should be possible to do it that way). This works though:
def shopping = "SHOPPING LIST(TOMATOES, TEA, LENTIL SOUP: packets=2) for Saturday"
def (list, day) = (shopping =~ /SHOPPING LIST\((.*)\) for (\w+)/)[0][1,2]
println "Shopping for $day\n" +
list.replaceAll(/: packets=(\d+)/, ' ($1 packets)')
.replaceAll(', ', '\n')
First it captures the strings "TOMATOES, TEA, LENTIL SOUP: packets=2" and "Saturday" into the variables list and day respectively. Then it processes the list string to produce the desired output, rewriting the "packets=" occurrences and replacing each comma separator with a newline (.replaceAll(', ', '\n') is equivalent to .split(', ').join('\n')).
One thing to notice is that if the shopping string does not match the first regex, it will throw an exception for trying to access the first match ([0]). You can avoid that by doing:
(shopping =~ /SHOPPING LIST\((.*)\) for (\w+)/).each { match, list, day ->
println "Shopping for $day\n" +
list.replaceAll(/: packets=(\d+)/, ' ($1 packets)')
.replaceAll(', ', '\n')
}
Which won't print anything if the first regex doesn't match.

I like to use the String find method for these kinds of cases; I think it's clearer than the =~ syntax:
String shopping = "SHOPPING LIST(TOMATOES, TEA, LENTIL SOUP: packets=2) for Saturday"
def expected = """Shopping for Saturday
TOMATOES
TEA
LENTIL SOUP (2 packets)"""
def regex = /SHOPPING LIST\((.*)\) for (.+)/
assert expected == shopping.find(regex) { full, items, day ->
List<String> formattedItems = items.split(", ").collect { it.replaceAll(/: packets=(\d+)/, ' ($1 packets)') }
"Shopping for $day\n" + formattedItems.join("\n")
}
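For readers more at home in Python, the same capture-then-reformat idea used in the Groovy answers above can be sketched as follows. This is only an illustration, not part of the original answers; the function name format_shopping is made up for the example, and it returns None instead of throwing when the input does not match.

```python
import re


def format_shopping(text):
    """Return the formatted shopping list, or None if the text does not match."""
    m = re.search(r"SHOPPING LIST\((.*)\) for (\w+)", text)
    if m is None:
        return None
    items, day = m.groups()
    # Rewrite ": packets=N" as " (N packets)" on each item separately.
    lines = [re.sub(r": packets=(\d+)", r" (\1 packets)", item)
             for item in items.split(", ")]
    return "Shopping for " + day + "\n" + "\n".join(lines)


shopping = "SHOPPING LIST(TOMATOES, TEA, LENTIL SOUP: packets=2) for Saturday"
print(format_shopping(shopping))
```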

Related

Regex that extract string of length that is encoded in string

I have the following string to parse:
X4IitemX6Nabc123
that is structured as follows:
X... marker for 'field identifier'
4... length of item (name), will change according to length of item name
I... identifier for item name, must not be extracted, fixed
item... value that should be extracted as "name"
X... marker for 'field identifier'
6... length of item number, will change according to length of item number
N... identifier for item number, must not be extracted, fixed
abc123... value that should be extracted as "num"
Only these two values will be contained in the string, and the sequence is always the same (name, number).
What I have so far is
\AX(?<namelen>\d+)I(?<name>.+)X(?<numlen>\d+)N(?<num>.+)$
But that does not take into account that the length of the name is contained in the string itself. Somehow the .+ in the name group should be replaced by .{4}. I tried {$1} and {${namelen}}, but neither yields the result I expect (on rubular.com or regex101.com).
Any ideas or further references?
What you ask for is only possible in languages that allow code insertions in the regex pattern.
Here is a Perl example:
#!/usr/bin/perl
use warnings;
use strict;
my $text = "X4IitemX6Nabc123";
if ($text =~ m/^X(?<namelen>[0-9]+)I(?<name>(??{".{".$^N."}"}))X(?<numlen>[0-9]+)N(?<num>.+)$/) {
print $text . ": PASS!\n";
} else {
print $text . ": FAIL!\n"
}
# -> X4IitemX6Nabc123: PASS!
In other languages, use a two-step approach:
Extract the number after X,
Build a regex dynamically using the result of the first step.
See a JavaScript example:
const text = "X4IitemX6Nabc123";
const rx1 = /^X(\d+)/;
const m1 = rx1.exec(text)
if (m1) {
const rx2 = new RegExp(`^X(?<namelen>\\d+)I(?<name>.{${m1[1]}})X(?<numlen>\\d+)N(?<num>.+)$`)
if (rx2.test(text)) {
console.log(text, '-> MATCH!')
} else console.log(text, '-> FAIL!');
} else {
console.log(text, '-> FAIL!')
}
See the Python demo:
import re
text = "X4IitemX6Nabc123"
rx1 = r'^X(\d+)'
m1 = re.search(rx1, text)
if m1:
    rx2 = fr'^X(?P<namelen>\d+)I(?P<name>.{{{m1.group(1)}}})X(?P<numlen>\d+)N(?P<num>.+)$'
    if re.search(rx2, text):
        print(text, '-> MATCH!')
    else:
        print(text, '-> FAIL!')
else:
    print(text, '-> FAIL!')
# => X4IitemX6Nabc123 -> MATCH!
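If you also want the values themselves, and want both declared lengths enforced, a variant of the two-step approach is to read each header with a small regex and then slice the value by its length. The following is a rough sketch under that assumption; parse_fields is a hypothetical helper name, and unlike the patterns above it also checks the length of the second field.

```python
import re


def parse_fields(text):
    """Parse 'X<len>I<name>X<len>N<num>' and return (name, num), or None."""
    pos = 0
    values = []
    for tag in ("I", "N"):
        # Step 1: read the marker, the declared length and the field tag.
        header = re.compile(r"X(\d+)" + tag).match(text, pos)
        if header is None:
            return None
        length = int(header.group(1))
        start = header.end()
        # Step 2: slice exactly `length` characters as the value.
        value = text[start:start + length]
        if len(value) != length:
            return None
        values.append(value)
        pos = start + length
    # Reject trailing characters after the second field.
    if pos != len(text):
        return None
    return tuple(values)


print(parse_fields("X4IitemX6Nabc123"))
```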

Searching for multiple cases in a string using regex in Python 2.7x

So I get some data from a csv, I want to normalise all the rows on the nr_nou column so they have just "FARA NUMAR" in that cell, instead of "f n", "fn ", "Fara numar" etc...
I'm going to give out the chunks of code that are relevant:
pattern1 = re.compile(r"\b\s*f\s*a*r*a*\s*nu*m*a*r*\s*\b")
elif ind == nr_nou:
if re.search(pattern1, data):
data = "FARA NUMAR"
Part of a CSV row:
device2,120L,13/07/2019 12:51,Sat Daia,F.N.,Fara Numar,14,,,INCOMPLETA,,,45.8007164,24.2572791,"45.8007164,24.2572791"
So next I would like to change those two values "F.N." and "Fara Numar"
Regards!
Try using re.sub with an appropriate pattern:
import re
row = "device2,120L,13/07/2019 12:51,Sat Daia,F.N.,Fara Numar,14,,,INCOMPLETA,,,45.8007164,24.2572791,45.8007164,24.2572791"
row = re.sub(r'(?<![^,])(?:F\.N\.|Fara Numar)(?![^,])', 'FARA NUMAR', row)
print(row)
This prints:
device2,120L,13/07/2019 12:51,Sat Daia,FARA NUMAR,FARA NUMAR,14,,,INCOMPLETA,,,45.8007164,24.2572791,45.8007164,24.2572791
Here is an explanation of the regex pattern:
(?<![^,]) assert that what precedes is either a comma or the start of the input
(?:
F\.N\. match "F.N."
| OR
Fara Numar match "Fara Numar"
)
(?![^,]) assert that what follows is either a comma or the end of the input
Ok, so a friend gave me the answer. It has to be case insensitive, like this:
pattern1 = re.compile(r'\b\s*f\s*a*r*a*\s*nu*m*a*r*\s*\b', re.IGNORECASE)
elif ind == nr_nou:
if re.search(pattern1, data):
data = "FARA NUMAR"
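Another option is to parse the row with the csv module and test each cell as a whole, so a quoted field containing commas is not split. This is a sketch only. It assumes Python 3 (re.fullmatch does not exist in Python 2.7), and the set of spellings it accepts ("fn", "f n", "F.N.", "fara numar", in any case) is a guess at what the data contains. The names NO_NUMBER and normalise are made up for the example.

```python
import csv
import io
import re

# Assumed spellings: "fn", "f n", "F.N.", "fara numar", in any letter case.
NO_NUMBER = re.compile(r"\s*f(?:ara)?\.?\s*n(?:umar)?\.?\s*", re.IGNORECASE)


def normalise(cell):
    """Return 'FARA NUMAR' if the whole cell is one of the assumed spellings."""
    return "FARA NUMAR" if NO_NUMBER.fullmatch(cell) else cell


line = ('device2,120L,13/07/2019 12:51,Sat Daia,F.N.,Fara Numar,14,,,'
        'INCOMPLETA,,,45.8007164,24.2572791,"45.8007164,24.2572791"')
row = next(csv.reader(io.StringIO(line)))
cleaned = [normalise(cell) for cell in row]
print(cleaned)
```

In real code you would probably call normalise only on the nr_nou column rather than on every cell.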

Vim - use regex to lexicographically compare strings (to find earlier/later dates)

I want to write a simple regex, in vim, that will find all strings lexicographically smaller than another string.
Specifically, I want to use this to compare dates formatted as 2014-02-17. These dates are lexicographically sortable, which is why I use them.
My specific use case: I'm trying to run through a script and find all the dates that are earlier than today's date.
I'm also OK with comparing these as numbers, or any other solution.
I don't think there is any way to do this easily with a regex. To match any date earlier than the current date, you can use the functions below (some of the stuff was stolen from benjifisher).
function! Convert_to_char_class(cur)
if a:cur =~ '[2-9]'
return '[0-' . (a:cur-1) . ']'
endif
return '0'
endfunction
function! Match_number_before(num)
let branches = []
let init = ''
for i in range(len(a:num))
if a:num[i] =~ '[1-9]'
call add(branches, init . Convert_to_char_class(a:num[i]) . repeat('\d', len(a:num) - i - 1))
endif
let init .= a:num[i]
endfor
return '\%(' . join(branches, '\|') .'\)'
endfunction
function! Match_date_before(date)
if a:date !~ '\v\d{4}-\d{2}-\d{2}'
echo "invalid date"
return
endif
let branches =[]
let parts = split(a:date, '-')
call add(branches, Match_number_before(parts[0]) . '-\d\{2}-\d\{2}')
call add(branches, parts[0] . '-' . Match_number_before(parts[1]) . '-\d\{2}')
call add(branches, parts[0] . '-' . parts[1] . '-' .Match_number_before(parts[2]))
return '\%(' . join(branches, '\|') .'\)'
endfunction
Use the following to search for all dates before 2014-02-24.
/<C-r>=Match_date_before('2014-02-24')
You might be able to wrap it in a function to set the search register if you wanted to.
The generated regex for dates before 2014-02-24 is the following.
\%(\%([0-1]\d\d\d\|200\d\|201[0-3]\)-\d\{2}-\d\{2}\|2014-\%(0[0-1]\)-\d\{2}\|2014-02-\%([0-1]\d\|2[0-3]\)\)
It does not do any validation of dates. It assumes that anything in that format is a date.
Equivalent set of functions for matching after the passed in date.
function! Convert_to_char_class_after(cur)
if a:cur =~ '[0-7]'
return '[' . (a:cur+1) . '-9]'
endif
return '9'
endfunction
function! Match_number_after(num)
let branches = []
let init = ''
for i in range(len(a:num))
if a:num[i] =~ '[0-8]'
call add(branches, init . Convert_to_char_class_after(a:num[i]) . repeat('\d', len(a:num) - i - 1))
endif
let init .= a:num[i]
endfor
return '\%(' . join(branches, '\|') .'\)'
endfunction
function! Match_date_after(date)
if a:date !~ '\v\d{4}-\d{2}-\d{2}'
echo "invalid date"
return
endif
let branches =[]
let parts = split(a:date, '-')
call add(branches, Match_number_after(parts[0]) . '-\d\{2}-\d\{2}')
call add(branches, parts[0] . '-' . Match_number_after(parts[1]) . '-\d\{2}')
call add(branches, parts[0] . '-' . parts[1] . '-' .Match_number_after(parts[2]))
return '\%(' . join(branches, '\|') .'\)'
endfunction
The regex generated was
\%(\%([3-9]\d\d\d\|2[1-9]\d\d\|20[2-9]\d\|201[5-9]\)-\d\{2}-\d\{2}\|2014-\%([1-9]\d\|0[3-9]\)-\d\{2}\|2014-02-\%([3-9]\d\|2[5-9]\)\)
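To convince yourself that the branch-building idea is sound, it can be ported to another language and checked by brute force. Below is a rough Python sketch of the "before" variant. The names number_before and date_before are made up for the port, the output uses Python regex syntax rather than Vim's, and it assumes a valid date (each part contains at least one non-zero digit).

```python
import re
from datetime import date, timedelta


def number_before(num):
    """Regex for same-width digit strings that are numerically below `num`."""
    branches = []
    for i, ch in enumerate(num):
        if ch != "0":
            # Same prefix, a smaller digit at this position, then any digits.
            smaller = "0" if ch == "1" else "[0-%d]" % (int(ch) - 1)
            branches.append(num[:i] + smaller + r"\d" * (len(num) - i - 1))
    return "(?:" + "|".join(branches) + ")"


def date_before(day):
    """Regex for YYYY-MM-DD strings strictly earlier than `day`."""
    y, m, d = day.split("-")
    return "(?:%s|%s|%s)" % (
        number_before(y) + r"-\d{2}-\d{2}",
        y + "-" + number_before(m) + r"-\d{2}",
        y + "-" + m + "-" + number_before(d),
    )


pattern = re.compile(date_before("2014-02-24"))

# Brute-force check against plain string comparison, which is valid
# because YYYY-MM-DD strings sort chronologically.
current = date(2013, 1, 1)
while current <= date(2015, 12, 31):
    text = current.isoformat()
    assert bool(pattern.fullmatch(text)) == (text < "2014-02-24")
    current += timedelta(days=1)
print("regex agrees with string comparison")
```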
You do not say how you want to use this; are you sure that you really want a regular expression? Perhaps you could get away with
if DateCmp(date, '2014-02-24') < 0
" ...
endif
In that case, try this function.
" Compare formatted date strings:
" #param String date1, date2
" dates in YYYY-MM-DD format, e.g. '2014-02-24'
" #return Integer
" negative, zero, or positive according to date1 < date2, date1 == date2, or
" date1 > date2
function! DateCmp(date1, date2)
let [year1, month1, day1] = split(a:date1, '-')
let [year2, month2, day2] = split(a:date2, '-')
if year1 != year2
return year1 - year2
elseif month1 != month2
return month1 - month2
else
return day1 - day2
endif
endfun
If you really want a regular expression, then try this:
" Construct a pattern that matches a formatted date string if and only if the
" date is less than the input date. Usage:
" :echo '2014-02-24' =~ DateLessRE('2014-03-12')
function! DateLessRE(date)
let init = ''
let branches = []
for c in split(a:date, '\zs')
if c =~ '[1-9]'
call add(branches, init . '[0-' . (c-1) . ']')
endif
let init .= c
endfor
return '\d\d\d\d-\d\d-\d\d\&\%(' . join(branches, '\|') . '\)'
endfun
Does that count as a "simple" regex? One way to use it would be to type :g/ and then CTRL-R and = and then DateLessRE('2014-02-24') and Enter, followed by the rest of your command. In other words,
:g/<C-R>=DateLessRE('2014-02-24')<CR>/s/foo/bar
EDIT: I added a concat (:help /\&) that matches a complete "formatted date string". Now, there is no need to anchor the pattern.
Use nested subpatterns. It starts simple, with the century:
[01]\d\d\d-\d\d-\d\d|20
As for each digit to follow, use one of the following patterns; you may want to replace .* by an appropriate sequence of \d and -.
for 0: (0
for 1: (0.*|1
for 2: ([01].*|2
for 3: ([0-2].*|3
for 4: ([0-3].*|4
for 5: ([0-4].*|5
for 6: ([0-5].*|6
for 7: ([0-6].*|7
for 8: ([0-7].*|8
for 9: ([0-8].*|9
For the last digit, you only need the digit range, e.g.:
[0-6]
Finally, all parentheses should be closed:
)))))
In the example of 2014-02-17, this becomes:
[01]\d\d\d-\d\d-\d\d|20
(0\d-\d\d-\d\d|1
([0-3]-\d\d-\d\d|4
-
(0
([01]-\d\d|2
-
(0\d|1
[0-6]
)))))
Now in one line:
[01]\d\d\d-\d\d-\d\d|20(0\d-\d\d-\d\d|1([0-3]-\d\d-\d\d|4-(0([01]-\d\d|2-(0\d|1[0-6])))))
For VIM, let's not forget to escape (, ) and |:
[01]\d\d\d-\d\d-\d\d\|20\(0\d-\d\d-\d\d\|1\([0-3]-\d\d-\d\d\|4-\(0\([01]-\d\d\|2-\(0\d\|1[0-6]\)\)\)\)\)
Would be best to try and generate this (much like in FDinoff's answer), rather than write it yourself...
Update:
Here is a sample AWK script to generate the correct regex for any date yyyy-mm-dd.
#!/usr/bin/awk -f
BEGIN { # possible overrides for non-VIM users
switch (digit) {
case "ascii" : digit = "[0-9]"; break;
case "posix" : digit = "[[:digit:]]"; break;
default : digit = "\\d";
}
switch (metachar) {
case "unescaped" : escape = ""; break;
default : escape = "\\";
}
}
/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
print BuildRegex($0);
}
function BuildRegex(s) {
if (s ~ /^[1-9][^1-9]*$/) {
regex = LessThanOnFirstDigit(s);
}
else {
regex = substr(s, 1, 1) BuildRegex(substr(s, 2)); # recursive call
if (s ~ /^[1-9]/) {
regex = escape "(" LessThanOnFirstDigit(s) escape "|" regex escape ")";
}
}
return regex;
}
function LessThanOnFirstDigit(s) {
first = substr(s, 1, 1) - 1;
rest = substr(s, 2);
gsub(/[0-9]/, digit, rest);
return (first ? "[0-" first "]" : "0") rest;
}
Call it like this:
echo 2014-02-17 | awk -f genregex.awk
Of course, you can write such a simple generator in any language you like.
Would be nice to do it in Vimscript, but I have no experience with that, so I will leave that as a home assignment.
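If you wanted to search for dates in 2014 up to and including 2014-11-23, you could start from the following regex. Note that it is written in PCRE-style syntax, expects months and days without leading zeros, and limits the day to 1 - 23 in every month, not just in November.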
2014-(?:[1-9]|1[0-1])-(?:[1-9]|1[0-9]|2[0-3])
For a better explanation of the regex, visit regex101.com and paste the regex in. You can also test it there.
The basics of the regex are to search all dates that:
start with 2014-
either contain a single character from 1 - 9
or a 1 and a single character from 0 - 1, i.e. numbers from 1 - 11
finished by - and numbers from 1 - 23 done in the same style as the second term

highlighting phrase or words problem

function highlight_phrase($str, $phrase, $class='highlight')
{
if ($str == '')
{
return '';
}
if ($phrase != '')
{
return preg_replace('/('.preg_quote($phrase, '/').')/Ui', '<span class="'.$class.'">'."\\1".'</span>', $str);
}
return $str;
}
The above code is what I use to highlight phrases in a string. I have a problem with the following case:
If the phrase is new car, it matches both new car and new cars in a string, meaning it highlights the new car part of new cars, but I don't want new cars to be highlighted.
I could check for a space, but what if the phrase ends with , . ? or ! etc.?
Use the \b pattern to match word boundaries, i.e. in your case /\b(new car)\b/ will match
"the new car is blue"
"the new car."
"new car"
but not
"all the new cars".
Add (?!\w) to the regex. This will cause it to only match when the phrase is not followed by a word character [a-zA-Z0-9_], which also covers the end of the string.
return preg_replace('/('.preg_quote($phrase, '/').')(?!\w)/Ui', '<span class="'.$class.'">'."\\1".'</span>', $str);
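The two suggestions above can be combined. As an illustration only, here is a Python sketch of the same highlighter that uses a lookbehind and a lookahead instead of \b, so that a phrase which itself starts or ends with punctuation can still match. The function mirrors the PHP one, but this port is not from the original thread.

```python
import re


def highlight_phrase(text, phrase, css_class="highlight"):
    """Wrap whole-phrase matches in a span, ignoring letter case."""
    if not text or not phrase:
        return text
    # (?<!\w) and (?!\w): no word character directly before or after the phrase.
    pattern = r"(?<!\w)(" + re.escape(phrase) + r")(?!\w)"
    return re.sub(
        pattern,
        lambda m: '<span class="%s">%s</span>' % (css_class, m.group(1)),
        text,
        flags=re.IGNORECASE,
    )


print(highlight_phrase("The new car is blue, unlike all the new cars.", "new car"))
```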

Regular expression to match word pairs joined with colons

I don't know regular expressions at all. Can anybody help me with one very simple regular expression for
extracting 'word:word' from a sentence, e.g. "Java Tutorial Format:Pdf With Location:Tokyo Javascript"?
A small modification:
the first 'word' is from a list but the second can be anything: "word1 in [ABC, FGR, HTY]"
Guys, the situation demands a little more
modification.
The matching form can be "word11:word12 word13 .. " up to the next "word21: ... ".
Things are becoming complex with sec.....I have to learn regex :(
Thanks in advance.
You can use the regex:
\w+:\w+
Explanation:
\w - single char which is either a letter(uppercase or lowercase), digit or a _.
\w+ - one or more of above char..basically a word
so \w+:\w+
would match a pair of words separated by a colon.
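As a quick check of that pattern, here is a small Python sketch (the language is just picked for illustration) that adds capture groups so each pair comes back as a tuple:

```python
import re

text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript"
# One tuple per "word:word" occurrence: (left word, right word).
pairs = re.findall(r"(\w+):(\w+)", text)
print(pairs)
```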
Try \b(\S+?):(\S+?)\b. Group 1 will capture "Format" and group 2, "Pdf".
A working example:
<html>
<head>
<script type="text/javascript">
function test() {
var re = /\b(\S+?):(\S+?)\b/g; // without 'g' matches only the first
var text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript";
var match = null;
while ( (match = re.exec(text)) != null) {
alert(match[1] + " -- " + match[2]);
}
}
</script>
</head>
<body onload="test();">
</body>
</html>
A good reference for regexes is https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp
Use this snippet :
$str=" this is pavun:kumar hello world bk:systesm" ;
if ( preg_match_all ( '/(\w+\:\w+)/',$str ,$val ) )
{
print_r ( $val ) ;
}
else
{
print "Not matched \n";
}
Continuing Jaú's function with your additional requirement:
function test() {
var words = ['Format', 'Location', 'Size'],
text = "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript",
match = null;
var re = new RegExp( '(' + words.join('|') + '):(\\w+)', 'g');
while ( (match = re.exec(text)) != null) {
alert(match[1] + " = " + match[2]);
}
}
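For the last modification in the question (a value that runs on until the next "word:"), one possible approach is a lazy match followed by a lookahead. The Python sketch below assumes that any "word:" ends the current value, even when that word is not in the key list. KEYS and extract_pairs are made-up names for the example.

```python
import re

KEYS = ["Format", "Location", "Size"]  # hypothetical list of allowed keys


def extract_pairs(text, keys=KEYS):
    """Return (key, value) tuples; a value runs up to the next 'word:' or the end."""
    key_alt = "|".join(re.escape(k) for k in keys)
    # Lazily take the value until a following " word:" or the end of the text.
    pattern = r"\b(%s):(.*?)(?=\s+\w+:|\s*$)" % key_alt
    return re.findall(pattern, text)


text = "Java Tutorial Format:Pdf Document Location:Tokyo Japan Language:Javascript"
print(extract_pairs(text))
```

Language:Javascript is skipped here because Language is not in the key list, but it still marks the end of the Location value.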
I am currently solving that problem in my nodejs app and found that this is, what I guess, suitable for colon-paired wordings:
([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))
It also matches quoted values, like a:"b" c:'d e' f:g
Example code in ES6:
const regex = /([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))/g;
const str = `category:"live casino" gsp:S1aik-UBnl aa:"b" c:'d e' f:g`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Example code in PHP:
$re = '/([\w]+:)("(([^"])*)"|\'(([^\'])*)\'|(([^\s])*))/';
$str = 'category:"live casino" gsp:S1aik-UBnl aa:"b" c:\'d e\' f:g';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
You can check/test your regex expressions using this online tool: https://regex101.com
Here's the non-regex way: in your favourite language, split on whitespace, go through the elements, check each one for ":", and print it if found. E.g. in Python:
>>> s="Java Tutorial Format:Pdf With Location:Tokyo Javascript"
>>> for i in s.split():
...     if ":" in i:
...         print(i)
...
Format:Pdf
Location:Tokyo
You can do further checks to make sure it's really "someword:someword" by splitting again on ":" and checking that there are 2 elements in the resulting list, e.g.
>>> for i in s.split():
...     if ":" in i:
...         a=i.split(":")
...         if len(a) == 2:
...             print(i)
...
Format:Pdf
Location:Tokyo
([^:]+):(.+)
Meaning: (everything except : one or more times), :, (any character one or more times)
You'll find good manuals on the net... Maybe it's time for you to learn...