Match "\b(OneTwoThree|OneTwo|TwoThree)\b" with least repetition - regex

I managed this with PCRE only, but I'd like it to work with JavaScript's RegExp as well. Beyond that, the regex is ugly. Are there any other, saner ways of accomplishing this?
Note that while the title says "OneTwoThree", I'm using "qwe" for brevity.
$ cat test.txt | grep -oP '\b(q(\g<we>|\g<w>)|(?<we>(?<w>w)e))\b'
qwe
qw
we
File test.txt contains:
qwe qw we q w e qq qe wq ww eq ew ee qqq qqw qqe qwq qww qeq qew qee wqq wqw wqe wwq www wwe weq wew wee eqq eqw eqe ewq eww ewe eeq eew eee
(Only the first three should match.)

Either of these would work for your sample data:
/\b(qwe?|we)\b/
/\b(q?we|qw)\b/
For the full pattern you specified in the title, it would be one of:
/\b(OneTwo(Three)?|TwoThree)\b/
/\b((One)?TwoThree|OneTwo)\b/
Now, this is not more readable, but it does reduce redundancy slightly:
/\b(?!w\b)q?we?\b/
Or for your full pattern:
/\b(?!Two\b)(One)?Two(Three)?\b/
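As a quick sanity check, the patterns above can be run against the sample data from the question. This is only a sketch in Python, whose `re` module treats `\b` and negative lookahead the same way for these patterns; the names `SAMPLE`, `PATTERNS` and `matches` are made up for illustration.

```python
import re

# Sample data from the question; only the first three words should match.
SAMPLE = ("qwe qw we q w e qq qe wq ww eq ew ee qqq qqw qqe qwq qww qeq qew qee "
          "wqq wqw wqe wwq www wwe weq wew wee eqq eqw eqe ewq eww ewe eeq eew eee")

# The three short-form patterns suggested above.
PATTERNS = [
    r"\b(qwe?|we)\b",
    r"\b(q?we|qw)\b",
    r"\b(?!w\b)q?we?\b",
]

def matches(pattern, text=SAMPLE):
    """Return the full text of every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

for p in PATTERNS:
    print(p, "->", matches(p))
```

All three patterns should report the same three words. None of them uses lookbehind or subroutine calls, which is why they are candidates for JavaScript's RegExp as well.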

Maybe this, but I'm not sure:
# \b(?=..)q?we?\b
\b
(?= . . )
q? w e?
\b
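Since the answer is unsure, the lookahead variant can be checked the same way. One thing to watch: `.` also matches a space, so `(?=..)` is satisfied by a lone `w` followed by a space. The sketch below uses Python's `re.VERBOSE` to keep the spaced-out layout; the `(?=\w\w)` variant is an assumption of mine, not something the answer proposes.

```python
import re

SAMPLE = "qwe qw we q w e qq qe wq ww eq ew"

# The pattern as posted; re.VERBOSE ignores the whitespace and newlines.
posted = re.compile(r"""
    \b
    (?= . . )
    q? w e?
    \b
""", re.VERBOSE)

# Hypothetical tightening: require two *word* characters ahead.
tightened = re.compile(r"\b(?=\w\w)q?we?\b")

posted_hits = [m.group(0) for m in posted.finditer(SAMPLE)]
tightened_hits = [m.group(0) for m in tightened.finditer(SAMPLE)]

print(posted_hits)     # includes the lone "w"
print(tightened_hits)  # only the three wanted words
```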

Related

In regex capture group, exclude one word

I have this type of url:
https://example.com/en/app/893245
https://example.com/ru/app/wq23245
https://example.com/app/8984245
I want to extract only word between com and app
https://example.com/en/app/893245 -> en
https://example.com/ru/app/wq23245 -> ru
https://example.com/app/8984245 ->
I tried to exclude app from capture group but I don't know how to do it except like this:
.*com\/((?!app).*)\/app
Is it possible to do something like this, but excluding the word app from being captured? example\.com\/(\w+|?!app)\/
Rubular link: https://rubular.com/r/NnojSgQK7EuelE
If you need a plain regex you may use lookarounds:
/(?<=example\.com\/)\w+(?=\/app)/
Or, probably better in a context of a URL:
/(?<=example\.com\/)[^\/]+(?=\/app)/
In Ruby, you may use
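strs = ['https://example.com/en/app/893245', 'https://example.com/ru/app/wq23245', 'https://example.com/app/8984245']
# String#[] with a regex and a group index returns that capture group, or nil if there is no match
p strs.map { |s| s[/example\.com\/(\w+)\/app/, 1] }
# => ["en", "ru", nil]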
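For comparison, here is a rough Python sketch of the lookaround version. It assumes the host is always `example.com`, as in the question; `URLS`, `SEGMENT` and `segment` are illustrative names. Python's `re` only allows fixed-width lookbehind, which `example\.com/` satisfies.

```python
import re

URLS = [
    "https://example.com/en/app/893245",
    "https://example.com/ru/app/wq23245",
    "https://example.com/app/8984245",
]

# The match itself is only the path segment; the lookarounds consume nothing.
SEGMENT = re.compile(r"(?<=example\.com/)[^/]+(?=/app)")

def segment(url):
    """Return the segment between the host and /app, or None if there is none."""
    m = SEGMENT.search(url)
    return m.group(0) if m else None

print([segment(u) for u in URLS])
```

The third URL yields no match because `app` directly follows the host and is not itself followed by `/app`.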
You could also use sed:
sed -n -f script.sed yourinput.txt
and inside script.sed:
s/.*com\/\(.*\)\/app.*/\1/p
Example input:
https://example.com/en/app/893245
https://example.com/ru/app/wq23245
https://example.com/app/8984245
Example output:
$ sed -n -f comapp.sed comapp.txt
en
ru

Golang regexp to match string until a given sequence of characters

I have a string that could have a -name followed by a value (which can contain spaces), and there could also be a -descr after that, followed by a value (the -descr and its value may or may not be there):
Example strings:
runcmd -name abcd xyz -descr abc def
or
runcmd -name abcd xyz
In Go, how do I write a regexp that returns the string before -descr, if it exists? For both examples above, the result should be:
runcmd -name abcd xyz
I was trying:
regexp.MustCompile(`(-name ).+?=-descr`)
But that did not return any match. What is the correct regexp to get the string up to -descr, if it exists?
You could capture the first part, up to and including -name, in a group, then match what is in between, and use an optional second capturing group to match -descr and what follows.
Then you can use the capturing groups when building the desired result.
^(.*? -name\b).*?(-descr\b.*)?$
For example:
s := "runcmd -name abcd xyz -descr abc def"
re1 := regexp.MustCompile(`^(.*? -name\b).*?(-descr\b.*)?$`)
result := re1.FindStringSubmatch(s)
fmt.Println(result[1] + "..." + result[2]) // Println: the string is data, not a format string
Result:
runcmd -name...-descr abc def
By "Does not work", do you mean it doesn't match anything, or just not what you expect?
https://regex101.com/ is generally very helpful when testing regular expressions.
I do not believe there's a simple way to achieve what you want. Things become a lot simpler if we can assume the text between -name and -descr doesn't contain any -, in which case regexp.MustCompile(`-name ([^-]*)`) should work.
With this kind of thing, it's often easier and clearer to use two regular expressions: the first strips -descr and anything following it, and the second matches -name and all subsequent characters.
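The two-step idea might look roughly like this. The sketch is in Python purely for illustration; the function names are hypothetical, and the two patterns use only `\s`, `\b`, `$` and a plain capture group, so they should carry over to Go's regexp package.

```python
import re

def before_descr(s):
    """Step 1: drop -descr and everything after it, plus the whitespace before it."""
    return re.sub(r"\s*-descr\b.*$", "", s)

def name_value(s):
    """Step 2: on the stripped string, capture everything after -name."""
    m = re.search(r"-name\s+(.*)$", before_descr(s))
    return m.group(1) if m else None

for s in ("runcmd -name abcd xyz -descr abc def", "runcmd -name abcd xyz"):
    print(repr(before_descr(s)), repr(name_value(s)))
```

Splitting the work keeps each pattern simple: neither needs lazy quantifiers or optional groups.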
You are not dealing with a regular language here, so there is no reason to bust out the (slow) regexp engine. The strings package is quite enough:
package main
import (
	"fmt"
	"strings"
	"unicode"
)
func main() {
	fmt.Printf("%q\n", f("runcmd -name abcd xyz -descr abc def"))
	fmt.Printf("%q\n", f("runcmd -name abcd xyz"))
	fmt.Printf("%q\n", f("-descr abc def"))
}
func f(s string) string {
	if n := strings.Index(s, "-descr"); n >= 0 {
		return strings.TrimRightFunc(s[:n], unicode.IsSpace)
	}
	return s
}
// Output:
// "runcmd -name abcd xyz"
// "runcmd -name abcd xyz"
// ""
Try it on the playground: https://play.golang.org/p/RFC65CYe6mp

Match all prefixes of a string

I am looking for a regex that matches if the string or any prefix of the string is matched. For example, if I had the string 'abcd' it would match
- a
- abc
- aaaa
but not
- baa
- the
My current regex solution is a | ab | abc | abcd - but wondering if there is a more succinct way.
It looks like the easiest way to achieve what I was after is the solution I posted in the question, a | ab | abc | abcd
I'm not sure what you want, so here are two different solutions.
First solution
printf 'a\nabc\naaaa\nbaa\nthe\naaabcd\nadc\n' | egrep "^a*b*c*d*$" | egrep -v "^$"
This keeps only the words in which a, b, c and d appear in that order. The second egrep drops empty lines, which the first pattern would otherwise accept.
Output
a
abc
aaaa
aaabcd
Second solution
If you only want to require that the word starts with a (followed by any mix of b, c and d):
printf 'a\nabc\naaaa\nbaa\nthe\naaabcd\nadc\n' | egrep "^a+[bcd]{0,}$"
Output
a
abc
aaaa
aaabcd
adc
Try this regex:
^(?:abcd|abc|ab|a)+$
OR you can use this:
^(?:ab?c?d?)+$
Note that this second regex will also match strings like ad and acd, which may not be what you want.
Or, a slight modification of the answer posted by @Wiktor in the comments:
^(?:a(?:b(?:cd?)?)?)+$
Explanation (for the first regex):
^ - asserts the start of the string. You can also use a \b instead of it, in this case.
(?:abcd|abc|ab|a)+ - matches 1+ occurrences of either abcd or abc or ab or a. You wrote it the other way around.
$ - asserts the end of the string. You can also use a \b instead of it, in this case.
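For longer words, the nested-optional form can be generated instead of typed by hand. The following is a sketch under the assumption that "prefix" means a non-empty leading substring; `prefix_pattern`, `is_prefix` and `is_repeated_prefixes` are hypothetical helper names.

```python
import re

def prefix_pattern(word):
    """Build an a(?:b(?:c(?:d)?)?)?-style regex matching any non-empty prefix of word."""
    if not word:
        raise ValueError("word must be non-empty")
    tail = ""
    # Wrap from the last character inwards, so a character can only appear
    # if every character before it appeared too.
    for ch in reversed(word[1:]):
        tail = "(?:" + re.escape(ch) + tail + ")?"
    return re.escape(word[0]) + tail

def is_prefix(candidate, word):
    """True if candidate is exactly one non-empty prefix of word."""
    return re.fullmatch(prefix_pattern(word), candidate) is not None

def is_repeated_prefixes(candidate, word):
    """True if candidate is one or more prefixes in a row (the + form above)."""
    return re.fullmatch("(?:" + prefix_pattern(word) + ")+", candidate) is not None

print(prefix_pattern("abcd"))
```

Whether aaaa should match depends on which reading is intended: it is not a prefix of abcd, but it is a run of the prefix a, which is what the `+` in the patterns above allows.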

Grep pattern between quotes

I'm trying to grep a code base to find alphanumeric codes between quotes. So, for example, my code base might contain the line
some stuff "A234DG3" maybe more stuff
And I'd like to output: A234DG3
I'm lucky in that I know my string is 7 characters long and contains only digits and the letters A-Z, a-z.
After a bit of playing I've come up with the following, but it's not giving me what I'd like:
grep -ro '".*"' . | grep [A-Za-z0-9]{7} | less
Where am I going wrong here? It feels like grep should give me what I want, but am I better off using something else? Cheers!
The problem is that an RE is pretty much required to match the longest sequence it can. So, given something like:
a "bcd" efg "hij" klm "nop" q
A pattern of ".*" should match: "bcd" efg "hij" klm "nop" (everything from the first quote to the last quote), not just "bcd".
You probably want a pattern more like "[^"]*" to match the open-quote, an arbitrary number of other things, then a close quote.
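The difference is easy to see on the example line. A small sketch in Python (the variable names are illustrative); grep's matching is greedy in the same way for these two patterns:

```python
import re

line = 'a "bcd" efg "hij" klm "nop" q'

# Greedy: runs from the first quote to the last quote on the line.
greedy = re.findall(r'".*"', line)

# Negated class: cannot cross a quote, so each quoted string matches separately.
negated = re.findall(r'"[^"]*"', line)

print(greedy)
print(negated)
```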
Using basic or extended POSIX regular expressions alone, grep cannot extract just the value between the quotes. Because of that, I would use sed for a portable solution:
sed -n 's/.*"\([^"][^"]*\)".*/\1/p' <<< 'some stuff "A234DG3" maybe more stuff'
However, if you have the GNU tools available, GNU grep supports PCRE with the -P command-line option. You can use this:
grep -oP '.*?"\K[^"]+(?=")' <<< 'some stuff "A234DG3" maybe more stuff'
.*?" matches everything up to and including the first quote. The \K escape resets the start of the reported match, so it works like a handy, variable-length lookbehind assertion. (I could have used a real lookbehind, but I like \K.) [^"]+ matches the text between the quotes. (?=") is a lookahead assertion that ensures a " follows the match, without including it in the match.
So after more playing about I've come up with this which gives me what I'm after:
grep -r -E -o '"[A-Za-z0-9]{7}"' . | less
The -E option enables extended regular expressions, so the {7} repetition count can be written without backslashes.
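If the surrounding quotes should not be part of the output, a capture group is one way to drop them. This is a sketch in Python rather than grep, and `CODE` and `codes` are made-up names:

```python
import re

# Exactly seven alphanumeric characters between double quotes;
# the group keeps the code and leaves the quotes out.
CODE = re.compile(r'"([A-Za-z0-9]{7})"')

def codes(text):
    """Return every 7-character quoted code found in text, without the quotes."""
    return CODE.findall(text)

print(codes('some stuff "A234DG3" maybe more stuff'))
```

Quoted strings that are shorter or longer than seven characters are skipped, because the closing quote must come immediately after the seventh character.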

Unix egrep command how to create a pattern to match the following?

I want to ask about back reference in egrep.
I have a file, it contains:
aa aa someothertext
and there are something like 77 77
How do I use back reference to match the pattern 'aa aa' and '77 77'?
I tried:
egrep '(aa )\1' file.txt
and it will match 'aa aa'. Then I tried to replace 'aa' with '([a-zA-Z0-9])\1', which yields:
egrep '(([a-zA-Z0-9])\1 )\1' file.txt
It won't work.
I'd appreciate if you can help!
Remember that capturing groups are numbered by the position of their opening parenthesis: you were referring to the first group from inside it, before it had been closed.
In ((a)b), \1 is referring to (a)b and \2 to a.
To fix this, you can use the correct index:
(([a-zA-Z0-9])\2 )\1
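The numbering can be checked with a short sketch. It uses Python's `re` rather than egrep, so behaviour on the invalid pattern may differ (Python rejects it at compile time). The `no_trailing_space` variant is an assumption of mine: because group 1 in the corrected pattern ends with a space, a pair at the very end of a line such as `77 77` needs the space moved outside the group to match.

```python
import re

lines = ["aa aa someothertext", "and there are something like 77 77"]

# Corrected numbering: \2 is the inner group (one character),
# \1 is the outer group (the doubled character plus a trailing space).
fixed = re.compile(r"(([a-zA-Z0-9])\2 )\1")

# Hypothetical variant with the space moved out of group 1.
no_trailing_space = re.compile(r"(([a-zA-Z0-9])\2) \1")

def first_match(pattern, text):
    """Return the text of the first match, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

def compiles(pattern):
    """True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(([a-zA-Z0-9])\1 )\1"))  # the original attempt
for text in lines:
    print(first_match(fixed, text), first_match(no_trailing_space, text))
```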