How to exclude a string in groovy using regex? - regex

I've got a string of characters, such as XXXabcdacefgabcdcbefgabcdmn. I need to exclude the string "efg" from the original string, and each item must start with "abcd" (so I cannot simply use split).
Here is my sample code:
def str = "XXXabcdacefgabcdcbefgabcdmn"
def matcher = (str =~ /^\/(?!efg)([a-z0-9]+)$/)
// I tried a solution found via Google, but it doesn't work.
matcher.each {
println it
}
The expected result should be:
abcdac
abcdcb
abcdmn
Any comment is very appreciated.

def s = "XXXabcdacefgabcdcbefgabcdmn"
def m = s =~ /abcd(?:(?!efg).)*/
(0..<m.count).each { print m[it] + '\n' }
Explanation:
abcd # 'abcd'
(?: # group, but do not capture (0 or more times):
(?! # look ahead to see if there is not:
efg # 'efg'
) # end of look-ahead
. # any character except \n
)* # end of grouping
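For comparison, the same pattern can be tried in Python. This is only a sketch of the idea explained above, not part of the original answer; the names `pattern` and `matches` are illustrative.

```python
import re

s = "XXXabcdacefgabcdcbefgabcdmn"

# 'abcd', then any run of characters in which no position starts 'efg'
pattern = re.compile(r"abcd(?:(?!efg).)*")

# findall returns whole matches because the only group is non-capturing
matches = pattern.findall(s)
print("\n".join(matches))
```

Each character is consumed only if the look-ahead confirms that "efg" does not begin at that position, so every match stops just before the next "efg" or at the end of the string.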
You could also split here:
def s = "XXXabcdacefgabcdcbefgabcdmn"
def m = s.split(/efg/)*.dropWhile { it != 'a' }
println m.join('\n')

I need to exclude the string "efg" from the original string and it should start with "abcd"
This might help you. Get matched group from desired index.
(abcd(.*?)(?=efg)|(?<=efg).*$)
OR try this one as well
(abcd(.*?)(?=efg|$))
Sample code:
def str = "XXXabcdacefgabcdcbefgabcdmn"
def matcher = str =~ /(abcd(.*?)(?=efg)|(?<=efg).*$)/
matcher.each { println it[0] }

Here is something without using regex but pure tools provided by Groovy. :)
def str = "XXXabcdacefgabcdcbefgabcdmn"
assert ['abcdac', 'abcdcb', 'abcdmn'] ==
str.split(/efg/).findAll { it.contains(/abcd/) }*.dropWhile { it != 'a' }
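A rough Python counterpart of this regex-free approach might look like the sketch below. It assumes that slicing from the first occurrence of "abcd" is acceptable, which is slightly stricter than dropping characters until the first 'a'; the name `parts` is illustrative.

```python
s = "XXXabcdacefgabcdcbefgabcdmn"

# Split on the unwanted separator, keep only chunks that contain 'abcd',
# and discard whatever precedes 'abcd' in each chunk
parts = [chunk[chunk.index("abcd"):] for chunk in s.split("efg") if "abcd" in chunk]
print(parts)
```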

Ok, you can use this pattern with a matcher:
(?:(?=abcd)|\G(?!\A)efg)((?:(?!efg).)*)
The substrings you need are in the first capturing group.
Another way:
(?:(?=abcd)|\G(?!\A)efg)((?>[^e]+|e(?!fg))*)

Related

Regex - get list comma separated allow spaces before / after the comma

I'm trying to extract an array/list of images from a commit message:
String commitMsg = "#build #images = image-a, image-b,image_c, imaged , image-e #setup=my-setup fixing issue with px"
I want to get a list that contains:
["image-a", "image-b", "image_c", "imaged", "image-e"]
NOTES:
A) should allow a single space before/after the comma (,)
B) ensure that #images = exists but exclude it from the group
C) I am also searching for other parameters like #build and #setup, so I need to ignore them when looking for #images
What I have until now is:
/(?i)#images\s?=\s?<HERE IS THE MISSING LOGIC>/
I use find() method:
def matcher = commitMsg =~ /(?i)#images\s?=\s?([^,]+)/
if(matcher.find()){
println(matcher[0][1])
}
You can use
(?i)(?:\G(?!^)\s?,\s?|#images\s?=\s?)(\w+(?:-\w+)*)
See the regex demo. Details:
(?i) - case insensitive mode on
(?:\G(?!^)\s?,\s?|#images\s?=\s?) - either the end of the previous regex match and a comma enclosed with single optional whitespaces on both ends, or #images string and a = char enclosed with single optional whitespaces on both ends
(\w+(?:-\w+)*) - Group 1: one or more word chars followed with zero or more repetitions of - and one or more word chars.
See a Groovy demo:
String commitMsg = "#build #images = image-a, image-b,image_c, imaged , image-e #setup=my-setup fixing issue with px"
def re = /(?i)(?:\G(?!^)\s?,\s?|#images\s?=\s?)(\w+(?:-\w+)*)/
def res = (commitMsg =~ re).collect { it[1] }
print(res)
Output:
[image-a, image-b, image_c, imaged, image-e]
An alternative Groovy code:
String commitMsg = "#build #images = image-a, image-b,image_c, imaged , image-e #setup=my-setup fixing issue with px"
def re = /(?i)(?:\G(?!^)\s?,\s?|#images\s?=\s?)(\w+(?:-\w+)*)/
def matcher = (commitMsg =~ re).collect()
for(m in matcher) {
println(m[1])
}
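Python's built-in re module has no \G anchor, so the pattern above cannot be used there directly. A two-step alternative (a different technique from the answer, shown only as a sketch) first captures the whole comma-separated list after #images = and then pulls the items out of it. The function name `extract_images` and the variable `ITEM` are illustrative.

```python
import re

commit_msg = ("#build #images = image-a, image-b,image_c, imaged , image-e "
              "#setup=my-setup fixing issue with px")

# One item: word chars, optionally followed by hyphen-separated word chars
ITEM = r"\w+(?:-\w+)*"

def extract_images(msg):
    # Step 1: capture the full list that follows '#images ='
    list_re = r"#images\s?=\s?(" + ITEM + r"(?:\s?,\s?" + ITEM + r")*)"
    m = re.search(list_re, msg, flags=re.IGNORECASE)
    if not m:
        return []
    # Step 2: extract the individual items from the captured list
    return re.findall(ITEM, m.group(1))

print(extract_images(commit_msg))
```

The list stops at " #setup" because a space followed by anything other than a comma cannot continue the repetition.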

python replace line text with weird characters

How do I replace the following using Python?
GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298
STEM*333*3001*0030303238
BHAT*3319*33*33377*23330706*031829*RTRCP
NUM4*41*2*My Break Room Place*****6*1133337
I want to replace all characters after the first occurrence of '*'. Every character except '*' must be replaced.
Example input:
NUM4*41*2*My Break Room Place*****6*1133337
example output:
NUM4*11*1*11 11111 1111 11111*****1*1111111
Fairly simple: use a callback that returns group 1 (if matched) unaltered, and otherwise returns the replacement character 1.
Note: this would also work on multi-line strings. If you need that, just add (?m) to the beginning of the regex: (?m)(?:(^[^*]*\*)|[^*\s])
You'd probably want to test the string for the * character first.
( ^ [^*]* \* ) # (1), BOS/BOL up to first *
| # or,
[^*\s] # Not a * nor whitespace
Python
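import re
def repl(m):
    # Keep group 1 (the prefix up to the first *) unaltered; mask anything else
    if m.group(1):
        return m.group(1)
    return "1"
s = 'NUM4*41*2*My Break Room Place*****6*1133337'
# str.find returns -1 (which is truthy) when nothing is found, so test membership instead
if '*' in s:
    newstr = re.sub(r'(^[^*]*\*)|[^*\s]', repl, s)
    print(newstr)
else:
    print('* not found in string')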
Output
NUM4*11*1*11 11111 1111 11111*****1*1111111
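The multi-line variant mentioned in the note can be sketched as follows. It assumes every line contains at least one *, since otherwise [^*]* could run past a line break; the function name `mask_after_first_star` is illustrative.

```python
import re

# re.MULTILINE has the same effect as the (?m) prefix: ^ matches at each line start
MASK_RE = re.compile(r"(^[^*]*\*)|[^*\s]", re.MULTILINE)

def mask_after_first_star(text):
    # Group 1 (line prefix up to the first *) is kept; other matches become "1"
    return MASK_RE.sub(lambda m: m.group(1) or "1", text)

text = "STEM*333*3001*0030303238\nNUM4*41*2*My Break Room Place*****6*1133337"
masked = mask_after_first_star(text)
print(masked)
```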
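If you want to use regex, you can use this one: (?<=\*)[^\*]+ with re.sub. Replacing each match with a single '1' would collapse every field to one character, so a callback masks the non-space characters of each field instead:
import re
inputs = ['GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298',
          'STEM*333*3001*0030303238',
          'BHAT*3319*33*33377*23330706*031829*RTRCP',
          'NUM4*41*2*My Break Room Place*****6*1133337']
# For every field that follows a *, replace each non-space character with '1'
outputs = [re.sub(r'(?<=\*)[^*]+', lambda m: re.sub(r'\S', '1', m.group()), inputline)
           for inputline in inputs]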

Extracting user-defined Groovy tokens from strings

Groovy here : I need to scan a string for a substring of the form:
${token}:<someValue>]
That is:
A user-defined (dynamic) token string (could be anything at runtime); then
A colon (:); then
Anything (<someValue>); then finally
A right square bracket (])
So basically something like:
def String fetchTokenValue(String toScan, String token) {
if(toScan.matches(".*${token}:.*]")) {
String everythingBetweenColonAndRBracket = ???
return everythingBetweenColonAndRBracket
} else {
return 'NO_DICE'
}
}
Such that the output would be as follows:
fetchTokenValue('swkokd sw:defroko swodjejr blah:fizzbuzz] wdkerko', 'blah') => 'fizzbuzz'
fetchTokenValue('swkokd sw:defroko swodjejr blah:fizzbuzz] wdkerko', 'boo') => 'NO_DICE'
I'm struggling with the regex as well as how to, if a match is made, extract all the text between the colon and the right square bracket. We can assume there will only ever be one match, or simply operate on the first match that is found (if it exists).
Any ideas where I'm going awry?
You may use the [^\]]* subpattern to match zero or more chars other than ] (a negated character class [^...] matches any char other than those defined inside it). Wrap it in a capturing group and return only the Group 1 contents. It is also a good idea to escape the input token automatically, so that regex metacharacters in it cannot cause illegal pattern syntax:
import java.util.regex.*;
def String fetchTokenValue(String toScan, String token) {
def matcher = ( toScan =~ /.*${Pattern.quote(token)}:([^\]]*)].*/ )
if(matcher.matches()) {
return matcher.group(1)
} else {
return 'NO_DICE'
}
}
println fetchTokenValue('swkokd sw:defroko swodjejr blah:fizzbuzz] wdkerko', 'blah')
See the online Groovy demo
You could use this regex which grabs anything up to a ] into a group
def String fetchTokenValue(String toScan, String token) {
def match = toScan =~ /.+${token}:([^\]]+)/
if(match) { match[0][1] } else { 'NO_DICE' }
}
def str = 'swkokd sw:defroko swodjejr blah:fizzbuzz] wdkerko'
assert fetchTokenValue(str, 'blah') == 'fizzbuzz'
assert fetchTokenValue(str, 'boo') == 'NO_DICE'
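The same idea (escape the token, then capture everything up to the first ]) can be sketched in Python. This is an illustration only; `fetch_token_value` is a hypothetical name mirroring the Groovy function above.

```python
import re

def fetch_token_value(to_scan, token):
    # re.escape makes the user-supplied token safe to embed in a pattern
    pattern = re.escape(token) + r":([^\]]*)\]"
    m = re.search(pattern, to_scan)
    # Group 1 holds the text between the colon and the right square bracket
    return m.group(1) if m else "NO_DICE"

text = "swkokd sw:defroko swodjejr blah:fizzbuzz] wdkerko"
print(fetch_token_value(text, "blah"))
print(fetch_token_value(text, "boo"))
```

Using re.search rather than matching the whole string avoids the need for leading and trailing .* in the pattern.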

Python Replacement of Shortcodes using Regular Expressions

I have a string that looks like this:
my_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
And I need it to be converted into this:
new_str = "This sentence has a <b>bolded</b> word, and <b>another</b> one too!"
Is it possible to use Python's string.replace or re.sub method to do this intelligently?
Capture all the characters before | inside [] into one group, and the part after | into another group. Then refer to the captured groups through back-references in the replacement string to get the desired output.
Regex:
\[([^\[\]|]*)\|([^\[\]]*)\]
Replacement string:
<\1>\2</\1>
>>> import re
>>> s = "This sentence has a [b|bolded] word, and [b|another] one too!"
>>> m = re.sub(r'\[([^\[\]|]*)\|([^\[\]]*)\]', r'<\1>\2</\1>', s)
>>> m
'This sentence has a <b>bolded</b> word, and <b>another</b> one too!'
Explanation...
Try this expression: [[]b[|](\w+)[]]. A shorter equivalent is \[b\|(\w+)\]
The expression looks for anything that starts with [b| and captures what lies between it and the closing ] using \w+, which means [a-zA-Z0-9_]. To include a wider range of characters, you can use .*? instead of \w+, which gives \[b\|(.*?)\]
Sample Demo:
import re
# Escaped form of the same pattern; Python back-references are written \1, not $1
p = re.compile(r'\[b\|(\w+)\]')
test_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
subst = r"<bold>\1</bold>"
result = re.sub(p, subst, test_str)
print(result)
Output:
This sentence has a <bold>bolded</bold> word, and <bold>another</bold> one too!
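To illustrate the difference between \w+ and .*? described in this answer, the following sketch runs both variants on a string whose second shortcode contains a space. The sample string and variable names are illustrative.

```python
import re

s = "A [b|bolded] word and [b|two words] here"

# \w+ only matches letters, digits and underscore, so "two words" is skipped
narrow = re.sub(r"\[b\|(\w+)\]", r"<b>\1</b>", s)

# .*? lazily matches any characters up to the first closing bracket
wide = re.sub(r"\[b\|(.*?)\]", r"<b>\1</b>", s)

print(narrow)
print(wide)
```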
Just for reference, in case you don't want two problems:
Quick answer to your particular problem:
my_str = "This sentence has a [b|bolded] word, and [b|another] one too!"
print(my_str.replace("[b|", "<b>").replace("]", "</b>"))
# output:
# This sentence has a <b>bolded</b> word, and <b>another</b> one too!
This has the flaw that it replaces every ] with </b>, regardless of whether that is appropriate. So you might want to consider the following:
Generalize and wrap it in a function
def replace_stuff(s, char):
    begin = s.find("[{}|".format(char))
    while begin != -1:
        end = s.find("]", begin)
        s = s[:begin] + s[begin:end+1].replace("[{}|".format(char),
            "<{}>".format(char)).replace("]", "</{}>".format(char)) + s[end+1:]
        begin = s.find("[{}|".format(char))
    return s
For example
s = "Don't forget to [b|initialize] [code|void toUpper(char const *s)]."
print(replace_stuff(s, "code"))
# output:
# "Don't forget to [b|initialize] <code>void toUpper(char const *s)</code>."

How to find all matches of a pattern in a string using regex

If I have a string like:
s = "This is a simple string 234 something else here as well 4334"
and a regular expression like:
regex = ~"[0-9]{3}"
How can I extract all the words from the string using that regex? In this case 234 and 433?
You can use CharSequence.findAll:
def triads = s.findAll("[0-9]{3}")
assert triads == ['234', '433']
Latest documentation of CharSequence.findAll
You can also use capturing groups. You can check Groovy's documentation about it:
http://mrhaki.blogspot.com/2009/09/groovy-goodness-matchers-for-regular.html
For instance, you can use this code:
s = "This is a simple string 234 something else here as well 4334"
regex = /([0-9]{3})/
matcher = ( s=~ regex )
if (matcher.find()) { // matches() would require the whole string to match the pattern
println(matcher.getCount()+ " occurrence of the regular expression was found in the string.");
println(matcher[0][1] + " found!")
}
As a side note:
m[0] is the first match object.
m[0][0] is everything that matched in this match.
m[0][1] is the first capture in this match.
m[0][n] is the nth capture in this match.
You could do something like this.
def s = "This is a simple string 234 something else here as well 4334"
def m = s =~ /[0-9]{3}/
(0..<m.count).each { println m[it][0] }
Output:
234
433
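For readers working in Python, a comparable sketch uses re.findall for the plain list of matches and re.finditer when the whole match and a capture group are both needed, mirroring the m[0][0] and m[0][1] indexing described above. The variable names are illustrative.

```python
import re

s = "This is a simple string 234 something else here as well 4334"

# All non-overlapping three-digit runs, scanning left to right
triads = re.findall(r"[0-9]{3}", s)

# group(0) is the whole match, group(1) is the first capture group
captures = [(m.group(0), m.group(1)) for m in re.finditer(r"([0-9]{3})", s)]

print(triads)
print(captures)
```

The last number yields 433 rather than 4334 because the pattern consumes exactly three digits and the remaining single digit cannot form another match.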
def INPUT = '''There,once,was,a man,from,"the , extremely,,,,bad .,, , edge",of
the,"universe, ultimately,, is mostly",empty,,,space''';
def REGEX = ~/(\"[^\",]+),([^\"]+\")/;
def m = (INPUT =~ REGEX);
while (true) {
m = (INPUT =~ REGEX);
if (m.getCount()>0) {
INPUT = INPUT.replaceAll(REGEX,'$1!-!$2');
System.out.println(INPUT );
} else {
break;
}
}