regex: accepting but not capturing a pattern - regex

I want a pattern that matches on
ab
a-b
a b
a b
a-b
where a and b can be any pattern, but are reduced to a and b for simplicity.
I want to return "ab" in all these cases. Can I do it all by regex or do I have to receive the matched expressions along with the separator characters and process them in code, by replacing the said characters and the like?

I may have misunderstood what you mean; apologies if so.
You can group things in a regexp with parentheses ().
For example, in your case:
(a)(-|\s+)?(b)
You can later use \1 and \3 to refer to a and b, so \1\3 means ab.
Note that some tools need \\1\\3 instead.
Check the documentation for your language to find the exact regexp rules.
I'm not sure where you will use this, so here I use sed as an example:
$ echo -e "ab\na-b\na b\na b\n"|sed -E 's/^(a)(-| +)?(b)$/\1\3/'
ab
ab
ab
ab
Note that the regex used here is ^(a)(-| +)?(b)$; the ^ and $ match the beginning and end of a string/line.
In other words, the regexp accepts those lines, and in some cases that validation is all you need.
But if you want to return ab, simple matching is not enough: an additional replace/reorganize step is needed.
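As a rough sketch of the "match, then replace" step in JavaScript (the function name joinParts is made up for illustration, and a and b stand in for your real sub-patterns), the separator can sit in a non-capturing group (?:...) so that only a and b receive group numbers:

```javascript
// Hypothetical helper: normalizes "ab", "a-b", "a b" and "a   b" to "ab".
function joinParts(input) {
  // (?:-|\s+)? accepts an optional separator without capturing it,
  // so the two captured pieces are $1 and $2 (not $1 and $3).
  const pattern = /^(a)(?:-|\s+)?(b)$/;
  // Inputs that do not match the pattern are returned unchanged.
  return input.replace(pattern, '$1$2');
}

for (const s of ['ab', 'a-b', 'a b', 'a   b']) {
  console.log(JSON.stringify(s), '->', joinParts(s));
}
```

The regex still only accepts or rejects the input; the joined result comes from the replacement string, which is the extra step described above.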

Related

Regex AND search inside block which enclosed by something delimiter

I want to do a regex AND search, (?=)(?=), inside a block enclosed by a delimiter such as #.
In the following sample regex, I expect cat and ugly to match inside the block that runs from # cat B to just before # cat C.
But the regex matches nothing.
regex
^#(?=[\s\S]*(cat))(?=[\s\S]*(ugly))^#
text
# cat A
the cat is
very cute.
# cat B
the cat is
very ugly.
# cat C
the cat is
very good.
#
You can test the regex on https://regexr.com/
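In your pattern ^#(?=[\s\S]*(cat))(?=[\s\S]*(ugly))^# you match a # at the start of the string with ^#, followed by 2 positive lookaheads, and then try to match ^# again. The lookaheads consume no characters, so the second ^# would have to match right after the first #, which is not the start of a line. That is why you don't get a match.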
To get a more exact match, you could start the pattern with ^# cat B
If you want to use lookaheads, you might use 2 capturing groups inside the positive lookahead. If you want to search for cat and ugly as whole words, you might add word boundaries \b.
The (?s) is an inline modifier that lets the dot match a newline; you might also use the /s flag instead.
(?s)(?=^# cat B.*?(cat).*?(ugly).*?^# cat C)
Regex demo
But it might be easier to not use the lookahead and match instead:
(?s)^# cat B.*?(cat).*?(ugly).*?^# cat C$
Php demo
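If the block labels are not known in advance, a different technique from the single-regex patterns above is to split the text into #-delimited blocks first and then test each block for every word. The following JavaScript sketch does that; blocksContainingAll is a made-up helper name, and it assumes the search words contain no regex metacharacters:

```javascript
// Hypothetical helper: returns the '#'-delimited blocks that contain every word.
function blocksContainingAll(text, words) {
  // Split just before each line that starts with '#', so every block
  // begins with its own '#' line (m flag: ^ matches at line starts).
  const blocks = text.split(/^(?=#)/m);
  return blocks.filter((block) =>
    // Assumption: words are plain letters/digits, so no escaping is done.
    words.every((word) => new RegExp(`\\b${word}\\b`).test(block))
  );
}

const sampleText = [
  '# cat A', 'the cat is', 'very cute.',
  '# cat B', 'the cat is', 'very ugly.',
  '# cat C', 'the cat is', 'very good.',
  '#',
].join('\n');

console.log(blocksContainingAll(sampleText, ['cat', 'ugly']));
```

With the sample text from the question, only the block that starts with # cat B contains both cat and ugly.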
This RegEx might help you to design/match your target words by bounding them using \n.
((.+)(cat)(.+))\n((.+)(ugly)(.+))
To keep it simple, it creates four groups for each target keyword, cat and ugly, and the keywords themselves can be referenced as $3 and $7.
You could additionally bound it with start ^ and end $, if you wish.
This expression only works when your target keywords are in the middle of both lines.

How do I add characters to RegEx results?

I am not a programmer but I am having to use RegEx for a particular purpose. How do I add specific characters to what is being returned from RegEx?
For example, if I have a list as follows:
XYZ ABC 123
How do I use RegEx to add something specific to the end of each? For example, if I want all three to end with .com for example?
You can try this script, replacing XYZ abc 123 with your full list:
echo "XYZ abc 123" | sed -E 's/([a-zA-Z0-9]+)/\1.com/g'
Explanation:
s/ Starts a substitution regex
([a-zA-Z0-9]+) Capture at least one alphanumeric
/ End regex
\1.com replaces the capture with itself plus adds .com
/g Global modifier (for all matches)
Without knowing which regex engine you want to use, there are many other ways to do this. In the future, please give more information.
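For comparison, here is a minimal JavaScript sketch of the same substitution, assuming the list is a single space-separated string (appendSuffix is a name invented for this example):

```javascript
// Hypothetical helper: appends a suffix to every alphanumeric run in a string.
function appendSuffix(list, suffix) {
  // The g flag applies the replacement to every match, like /g in sed.
  // A function replacement avoids any special handling of '$' in the suffix.
  return list.replace(/[a-zA-Z0-9]+/g, (match) => match + suffix);
}

console.log(appendSuffix('XYZ ABC 123', '.com'));
```

For the list in the question this gives XYZ.com ABC.com 123.com.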

How to extract match pattern only using regex without other tools

I need to write a regular expression that after seeing "aaa" code, this regex should print only 6-digit code, not entire line. There is only one 6-digit code in a line, and it is after "aaa".
I can't use sed, awk, grep ... etc. My application only accepts regex.
Examples:
x aaa y z 123456 returns 123456
aaa x 654321 y z returns 654321
I tried this regex with backreference, not sure how not to repeat [\d]{6} though
(.*)(aaa)(.*)[\d]{6}((?(2)[\d]{6}|.+)
but it prints the entire line.
Any suggestions?
You could do something like
aaa.+?(\d{6})
and then return only the first group (with \1).
You could also use a lookbehind with a different regex:
(?<=aaa.+?)\d{6}
This matches the first 6 digits that come after aaa and at least one other character. Unfortunately, many languages don't support variable-length lookbehinds, so I'd go with the first one.
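To illustrate the difference between the two patterns, here is a small JavaScript sketch (the function names are hypothetical). JavaScript engines that implement ES2018 accept variable-length lookbehind; other regex flavors may reject the second pattern:

```javascript
// Hypothetical helper: capture-group version; the code is group 1.
function extractCode(line) {
  const m = line.match(/aaa.+?(\d{6})/);
  return m ? m[1] : null;
}

// Hypothetical helper: lookbehind version; the whole match is the code.
// Requires an engine with variable-length lookbehind (ES2018 and later).
function extractCodeLookbehind(line) {
  const m = line.match(/(?<=aaa.+?)\d{6}/);
  return m ? m[0] : null;
}

for (const line of ['x aaa y z 123456', 'aaa x 654321 y z']) {
  console.log(extractCode(line), extractCodeLookbehind(line));
}
```

If your application can only report the whole match and not a group, the lookbehind form is the one that returns just the digits, provided the engine supports it.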

matching lines in vi that contain any permutation of a set of strings

I am trying to search for lines that contain any permutation of a group of words (case-insensitively). For example, if I am interested in the words foo and bar, I would want to match the first four lines but not the last four lines in the following file:
Foo and bar.
Bar and foo.
The foo and the bar.
The bar and the foo.
Foobar.
Barfoo.
The foobar.
The barfoo.
Having looked at this post, I realize I can construct something like this in perl:
perl -n -e 'print if (/\bfoo\b.*?\bbar\b/i || /\bbar\b.*?\bfoo\b/i)' file
which correctly matches only the first four lines. Alternatively, using a look-ahead construct as suggested by this post, the match can be made with slightly more concise code:
perl -n -e 'print if (/(?=.*\bfoo\b)(?=.*\bbar\b)/i)' file
I cannot, however, figure out how to write these in vim regex syntax, which I find to be far more byzantine than perl regex syntax. I have tried many different expressions in vim using the search function (/ or ?), but none of them produce successful matches. I realize that instead of the (?=string) syntax used by perl, vim uses \(string\)\#= and string\&.
However, a variety of attempts, e.g.:
\c\(foo\)\#=\(bar\)#=
\c\(foo\)\#=\.*\(bar\)#=
\cfoo\&bar\&
(where \c is used for a case-insensitive match) have all been unsuccessful.
Could someone please demonstrate the correct vim syntax?
Try: \c.*\<foo\>.*\&.*\<bar\>.*. This should match the whole of each of the first four lines.
You were closest with \c\(foo\)\#=\(bar\)#=, but since you don't want e.g. foobar, barfoo to match it's necessary to use begin/end of word matching: \<\>.
Using \& simplifies the pattern a bit.
If you don't need the whole line matches from that pattern, just a hit on any line that matches, you can simplify this regex a bit more by killing the trailing .* pieces in the pattern: \c.*\<foo\>\&.*\<bar\>
Try the following:
/^\c\(.*\<foo\>\)\#=\(.*\<bar\>\)\#=/
This is the same thing as the lookahead version from Perl: \#= makes the previous element or group a positive lookahead. \< and \> are the vim equivalents of \b, and \c enables case-insensitive matching. I added the ^ anchor so it will match each line only once.
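To cross-check which lines a pattern of this shape should select, the same lookahead idea can be sketched outside vim. The JavaScript below builds one lookahead per word; linesWithAllWords is a hypothetical helper and assumes the words contain no regex metacharacters:

```javascript
// Hypothetical helper: keeps the lines that contain every word, in any order.
function linesWithAllWords(lines, words) {
  // One (?=.*\bword\b) lookahead per word; each is tested from the line start.
  const lookaheads = words.map((w) => `(?=.*\\b${w}\\b)`).join('');
  const pattern = new RegExp(`^${lookaheads}`, 'i'); // i: case-insensitive, like \c
  return lines.filter((line) => pattern.test(line));
}

const sampleLines = [
  'Foo and bar.', 'Bar and foo.', 'The foo and the bar.', 'The bar and the foo.',
  'Foobar.', 'Barfoo.', 'The foobar.', 'The barfoo.',
];

console.log(linesWithAllWords(sampleLines, ['foo', 'bar']));
```

Only the first four sample lines are kept, because foobar and barfoo fail the word-boundary checks.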

Replace repeating characters with one with a regex

I need a regex to collapse repeated occurrences of certain characters: if one of these characters occurs more than once in a row, replace the run with a single one.
/[\s.'-,{2,0}]
These are the characters for which a repeated run should be replaced with a single instance of the same character.
Is this the regex you're looking for?
/([\s.',-])\1+/
Okay, now that will match it. The hyphen is placed last in the character class so that it is a literal hyphen rather than forming the range '-,. If you're using Perl, you can replace it using the following expression:
s/([\s.',-])\1+/$1/g
Edit: If you're using :ahem: PHP, then you would use this syntax:
$out = preg_replace('/([\s.\',-])\1+/', '$1', $in);
The () group matches the character, and the \1 means that the same thing it just matched in the parentheses occurs at least once more. In the replacement, the $1 refers to the match in the first set of parentheses.
Note: this is Perl-Compatible Regular Expression (PCRE) syntax.
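The backreference approach carries over to JavaScript almost unchanged. This is a sketch rather than a drop-in solution: collapseRuns is an invented name, and the character class (with the hyphen placed last so that it is literal) is an assumption about which characters the question means:

```javascript
// Hypothetical helper: collapses runs of the same listed character to one.
function collapseRuns(text) {
  // ([\s.',-]) captures one listed character; \1+ requires that exact
  // character to repeat, so a mixed run such as ".," is left alone.
  return text.replace(/([\s.',-])\1+/g, '$1');
}

console.log(collapseRuns("a  b....,,--''"));
```

Because \1 repeats the exact character that was captured, a tab followed by a space is not collapsed, even though both match \s.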
From the perlretut man page:
Matching repetitions
The examples in the previous section display an annoying weakness. We were only matching 3-letter words, or chunks of words of 4 letters or less. We'd like to be able to match words or, more generally, strings of any length, without writing out tedious alternatives like \w\w\w\w|\w\w\w|\w\w|\w.
This is exactly the problem the quantifier metacharacters ?, *, +, and {} were created for. They allow us to delimit the number of repeats for a portion of a regexp we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:
a? means: match 'a' 1 or 0 times
a* means: match 'a' 0 or more times, i.e., any number of times
a+ means: match 'a' 1 or more times, i.e., at least once
a{n,m} means: match at least "n" times, but not more than "m" times.
a{n,} means: match at least "n" or more times
a{n} means: match exactly "n" times
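The quantifiers listed above behave the same way in most regex flavors. Here is a short JavaScript sketch with a few made-up patterns that can be used to check each one:

```javascript
// Illustrative patterns only; each is anchored so the whole string must match.
const quantifierExamples = {
  optional: /^colou?r$/, // u?     : 0 or 1 times
  star: /^ab*$/,         // b*     : 0 or more times
  plus: /^ab+$/,         // b+     : 1 or more times
  range: /^a{2,3}$/,     // a{2,3} : at least 2, at most 3 times
  atLeast: /^a{2,}$/,    // a{2,}  : 2 or more times
  exact: /^a{3}$/,       // a{3}   : exactly 3 times
};

console.log(quantifierExamples.range.test('aaa'));  // three a's are within 2..3
console.log(quantifierExamples.range.test('aaaa')); // four a's exceed the upper bound
```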
As others said, it depends on your regex engine, but here is a small example of how you could do this:
/([ _,.-])\1*/\1/g
With sed:
$ echo "foo   ,   bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo , bar
$ echo "foo,.   bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo,. bar
Using JavaScript, as mentioned in a comment, and assuming (it's not too clear from your question) that the characters you want to replace are whitespace characters, ., ', -, and ,:
var str = 'a b....,,';
str = str.replace(/(\s){2}|(\.){2}|('){2}|(-){2}|(,){2}/g, '$1$2$3$4$5');
// Now str === 'a b..,'
If I understand correctly, you want to do the following: given a set of characters, replace any multiple occurrence of each of them with a single character. Here's how I would do it in perl:
perl -pi.bak -e "s/\.{2,}/\./g; s/\-{2,}/\-/g; s/'{2,}/'/g" text.txt
If, for example, text.txt originally contains:
Here is . and here are 2 .. that should become a single one. Here's
also a double -- that should become a single one. Finally here we have
three ''' which should be substituted with one '.
it is modified as follows:
Here is . and here are 2 . that should become a single one. Here's
also a double - that should become a single one. Finally here we have
three ' which should be substituted with one '.
I simply use the same replacement regex for each character in the set: for example
s/\.{2,}/\./g;
replaces 2 or more occurrences of a dot character with a single dot. I concatenate several of these expressions, one for each character of your original set.
There may be more compact ways of doing this, but, I think this is simple and it works :)
I hope it helps.
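The "one substitution per character" idea above can also be written as a loop. The JavaScript sketch below assumes each entry in the set is a single character; collapseEach is a hypothetical name, and the escaping step is there so that characters such as . are treated literally:

```javascript
// Hypothetical helper: for each character in `chars`, replaces runs of
// two or more of that character with a single one.
function collapseEach(text, chars) {
  let out = text;
  for (const ch of chars) {
    // Escape regex metacharacters so that e.g. '.' means a literal dot.
    const escaped = ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // Equivalent in spirit to s/\.{2,}/./g, applied once per character.
    out = out.replace(new RegExp(`${escaped}{2,}`, 'g'), () => ch);
  }
  return out;
}

console.log(collapseEach("2 .. and a double -- and three '''", ['.', '-', "'"]));
```

This is longer than the single backreference pattern, but each substitution stays easy to read, and characters can be added or removed from the list without touching a character class.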