Matching on repeated substrings in a regex - regex

Is it possible for a regex to match based on other parts of the same regex?
For example, how would I match lines that begin and end with the same sequence of 3 characters, regardless of what the characters are?
Matches:
abcabc
xyz abc xyz
Doesn't Match:
abc123
Undefined: (Can match or not, whichever is easiest)
ababa
a
Ideally, I'd like something in the perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.

Use capture groups and backreferences.
/^(.{3}).*\1$/
The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.

You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
Applying this to your case:
/^(.{3}).*\1$/
(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)
A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):
^ matches the start of the line.
(.{3}) grabs three characters of any type and saves them in a group for later reference.
.* matches anything for as long as possible. (You don't care what's in the middle of the line.)
\1 matches the group that was captured in step 2.
$ matches the end of the line.
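The steps above can be tried out quickly. This is a minimal sketch in Python rather than Perl (the backreference syntax \1 is the same in both); the helper name begins_and_ends_alike is made up for illustration, and the samples are the ones from the question.

```python
import re

# Same pattern as above: capture three characters, then require them again at the end.
PATTERN = re.compile(r"^(.{3}).*\1$")

def begins_and_ends_alike(line):
    """Return True if the line starts and ends with the same 3 characters."""
    return PATTERN.search(line) is not None

for sample in ["abcabc", "xyz abc xyz", "abc123", "ababa", "a"]:
    print(repr(sample), begins_and_ends_alike(sample))
```

The two "undefined" cases (ababa and a) come out as non-matches with this pattern, because the captured group and the backreference cannot overlap.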

For the same characters at the beginning and end:
/^(.{3}).*\1$/
This is a backreference.

This works:
my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;
For matching the beginning and the end you should add ^ and $ anchors.

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm.
You flipped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
Regex | Demo | Explanation
\b([a-zA-Z0-9]+)- | Demo 1 | \b is a word boundary to prevent a partial word match, the value is in capture group 1.
\b[a-zA-Z0-9]+(?=-) | Demo 2 | Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-)
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex | Demo | Explanation
-([a-zA-Z0-9]+)\b | Demo 3 | The other way around with a leading - and get the value from capture group 1.
-\K[a-zA-Z0-9]+\b | Demo 4 | Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
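For comparison, here is a rough sketch of the same four ideas in Python instead of Perl. Python's standard re module does not support \K, so the last variant swaps in a positive lookbehind (?<=-), which gives the same result here; that substitution and the variable names are mine, not part of the answer above.

```python
import re

text = "sh0rt-t3rm"

# First part: capture group 1, followed by a literal hyphen.
first_by_group = re.search(r"\b([a-zA-Z0-9]+)-", text).group(1)

# First part: whole match, with the hyphen only asserted by a lookahead.
first_by_lookahead = re.search(r"\b[a-zA-Z0-9]+(?=-)", text).group(0)

# Second part: capture group 1, preceded by a literal hyphen.
second_by_group = re.search(r"-([a-zA-Z0-9]+)\b", text).group(1)

# Second part: whole match, using a lookbehind in place of Perl's \K.
second_by_lookbehind = re.search(r"(?<=-)[a-zA-Z0-9]+\b", text).group(0)

print(first_by_group, first_by_lookahead, second_by_group, second_by_lookbehind)
```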
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side to make the regex run in list context, because that's when it returns the captures (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
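The difference between the two patterns only shows up when the string has more than one hyphen. A small sketch in Python (the function names and the sample string mid-to-long-t3rm are invented for illustration):

```python
import re

def after_first_hyphen(s):
    """Everything after the first hyphen, or None if there is no hyphen."""
    m = re.search(r"-(.+)", s)
    return m.group(1) if m else None

def after_last_hyphen(s):
    """Everything after the last hyphen: the greedy .* swallows earlier hyphens."""
    m = re.search(r".*-(.+)", s)
    return m.group(1) if m else None

print(after_first_hyphen("mid-to-long-t3rm"))
print(after_last_hyphen("mid-to-long-t3rm"))
```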
There are of course many other variations in what the data may look like, other than just being one hyphenated word. In particular, if it's part of a larger text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre.
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.

regex in parenthesis at the beginning

I have a regex trying to divide questions by speciality. Say I have the following regex:
(?P<speciality>[0-9x]+)
It works fine for this question (correct match: 7)
(7)Which of the following is LEAST to be considered as a risk factor for esophageal cancer?;
And for this (correct match: 8 and 13)
(8,13)30 year old woman with amenorrhea, low serum estrogen and high serum LH/FSH, the most likely diagnosis is:
But not for this one (incorrect match: 20).
First trimester spontaneous abortion (before 20 wk) is most commonly due to:
I only need the numbers in parentheses at the beginning of the question, all other parentheses should be ignored. Is this possible with a regex alone (lookahead?).
If your regex flavor supports \G continuous matching and \K reset beginning of match, try:
(?:^\(|\G,)\K[\dx]+
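^\( matches an opening parenthesis at the start of the string; the alternative \G, matches a comma at the position where the previous match ended. \K then resets the start of the reported match, so only the following [\dx]+ (one or more digits or x; \d is shorthand for [0-9]) is returned. Matches will be in $0.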
Test at regex101.com; Regex FAQ
PHP example
$str = "(1x,2,3x) abc (1,2x,3) d";
preg_match_all('~(?:^\(|\G,)\K[\dx]+~', $str, $out);
print_r($out[0]);
Array
(
[0] => 1x
[1] => 2
[2] => 3x
)
Test at eval.in
Perhaps something like this will work (you don't mention the regex flavor that you're using, though I am guessing it is PCRE by the use of the named group - and yes, it does use positive lookahead):
^\((?P<speciality>(?:[0-9x]+,?)+)(?=\))/mg
The caret ^ combined with the multiline modifier m (which causes the anchors ^ and $ to match at the beginning and end of each line, respectively, instead of only at the beginning and end of the string) ensures that the match is at the start of a paragraph. The specialities are captured in the speciality named capture group. The only caveat is that if more than one speciality is given (as in your example starting with (8,13)), the capture is a comma-delimited list, just as in the input (for that example, the capture is 8,13).
Please see Regex Demo here.
(?P<speciality>[0-9x]+) matches any nonempty sequence of digits (or x characters) anywhere in the input. The parentheses just delimit the capturing group and are not part of the match.
To match a number (or several separated by commas) between parentheses at the beginning of the line, you could use something like this:
^\((\d+)(,(\d+))*\)
EDIT
It seems repeated capturing groups, as in (,(\d+))*, only return the last match. So to get the values it is necessary to capture the complete list of numbers and parse it afterwards:
^\((?P<specialities>(\d+)(,(\d+))*)\)
This captures one or more numbers separated by commas, between parentheses.
The start-of-line anchor ensures the match is at the beginning of the line.
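A sketch of the "capture the whole list, then parse it" approach in Python, whose re module accepts the (?P<name>...) syntax used in the question. Two things here are my assumptions rather than part of the answer: the character class is widened from \d to [0-9x] to match the question's own pattern, and the helper name specialities is invented.

```python
import re

# Whole comma-separated list in one named group, only at the start of the text.
LEADING_LIST = re.compile(r"^\((?P<specialities>[0-9x]+(?:,[0-9x]+)*)\)")

def specialities(question):
    """Return the codes from a leading '(a,b,c)' prefix, or [] if there is none."""
    m = LEADING_LIST.match(question)
    return m.group("specialities").split(",") if m else []

# A repeated capturing group keeps only its last iteration.
last_only = re.match(r"^\((\d+)(,(\d+))*\)", "(8,13,21)").group(3)

print(specialities("(8,13)30 year old woman with amenorrhea"))
print(specialities("First trimester spontaneous abortion (before 20 wk)"))
print(last_only)
```

Parentheses later in the question text are ignored because the pattern is anchored at the start.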

Regular expression to find specific string and add characters when they're not already there in notepad++

Okay, I have zero knowledge of regular expressions so if someone can direct me to a better way to figure this out then by all means please do.
I figured out that a series of files are missing a particular naming convention for the database they will write to. So some might be dbname1, dbname2, dbname3, abcdbname4, abcdbname5 and they all need to have that abc at the beginning. I want to write a regular expression that will find all tags in the file that are not immediately followed by abc and add in abc. Any ideas how I can do this?
Again, forgive me if this is poorly worded/expressed. I really have absolutely zero knowledge of regular expressions. I can't find any questions that are asking this. I know that there are questions asking how to add strings to lines but not how to add only to lines that are missing the string when some already have it.
I thought I had written this in but I'm looking at lines that look like this
<Name>dbname</Name>
or
<Name>abcdbname</Name>
and I need to get them all to have that abc at the beginning
Cameron's answer will work, but so will this. It's called a negative lookbehind.
(?<!abc)(dbname\d+)
This regex looks for dbname followed by 1 or more digits, and not prefixed by abc. So it will capture dbname113.
This looks for any occurrence of dbname not immediately prefixed by the string "abc". The original name is in capture group \1, so you can replace each match with abc\1 and all your files will be properly prefixed.
Not every program/language that implements regex supports lookbehinds (JavaScript, famously, lacked them before ES2018), but most do and Notepad++ certainly does. Lookarounds (lookbehinds / lookaheads) are exceedingly handy once you get the hang of them.
?<! negative lookbehind, ?<= positive lookbehind, ?! negative lookahead, and ?= positive lookahead must all be used within parentheses as I did above, but they do not create capture groups, which is why the second set of parentheses can be referenced as \1 (or $1 depending on the language).
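To see the negative lookbehind at work outside Notepad++, here is a minimal sketch using Python's re.sub, which accepts the same (?<!abc) syntax and the same \1 reference in the replacement. The function name add_prefix and the sample string are made up.

```python
import re

def add_prefix(text):
    """Prefix every dbname<digits> with abc unless abc is already right before it."""
    return re.sub(r"(?<!abc)(dbname\d+)", r"abc\1", text)

sample = "dbname1, dbname2, abcdbname4"
print(add_prefix(sample))
```

Running the substitution a second time changes nothing, since every name is then already preceded by abc.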
Edit: Given some better example criteria, this is possibly more what you're looking for.
Find: (<Name>)(.*?(?<!abc)dbname\d+)(</Name>)
Replace: \1abc\2\3
Alternatively, something a bit easier to understand, you can do this or something like this:
Find: (<Name>)(abc)?(dbname\d+)(</Name>)
Replace: \1abc\3\4
What this does is:
Matches <Name>, captures as backreference 1.
Looks for abc and captures it as backreference 2 if it's there; otherwise 2 contains nothing. The ? after (abc) means match 0 or 1 times.
Looks for the dbname and captures it as backreference 3.
Matches </Name>, captures as backreference 4.
By replacing with \1abc\3\4, you kind of drop abc off dbname if it exists and replace dbname with abcdbname in all instances.
You can take this a step further and prefix the abc with ?: to create a noncapturing group, so the backreferences for replacing are sequential:
Find: (<Name>)(?:abc)?(dbname\d+)(</Name>)
Replace: \1abc\2\3
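The optional-group version can be checked the same way. This is a sketch in Python, assuming the names always end in digits as the Find pattern requires; NAME_TAG and normalise are illustrative names only.

```python
import re

# Optional, non-capturing abc; the replacement always writes abc back exactly once.
NAME_TAG = re.compile(r"(<Name>)(?:abc)?(dbname\d+)(</Name>)")

def normalise(xml_text):
    """Make every <Name>dbnameN</Name> read <Name>abcdbnameN</Name>."""
    return NAME_TAG.sub(r"\1abc\2\3", xml_text)

sample = "<Name>dbname1</Name>\n<Name>abcdbname5</Name>"
print(normalise(sample))
```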
Replace \bdbname(\d+) with abcdbname\1.
The \b means "word boundary", so it won't match the abc versions, but will match the others. The (...) parentheses represent a capturing group, which captures everything matched in between into a numbered variable that can be referenced later (there's only one here, so it goes in \1). The \d+ matches one or more digit characters.

Greedy/non-greedy quantifiers in ABAP regular expressions

I would like to extract 2 things from this string: | 2013.10.10 FEL felsz
regex -> Date field -> the needed value will be only the 2013.10.10 (in this case)
regex -> String between 2013.10.10 and felsz string -> the needed value will be only the FEL string (in this case).
I tried the following regexes without much success:
(.*?<p/\s>.*?)(?=\s)
(.*?<p/("[0-9]+">.*?)(?=\s)
Do you have any suggestions?
As mentioned in comments, since ABAP doesn't allow non-greedy match with *?, if you can count on felsz occurring only immediately after the second portion you want to match you could use:
(\d{4}\.\d\d\.\d\d) (.*) felsz
(PS: Invalidated first answer: in non-ABAP systems where *? is supported, the following regex will get both values into submatches. The date will be in submatch 1 and the other value (FEL in this case) will be in submatch 2: (\d{4}\.\d\d\.\d\d) (.*?) felsz)
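ABAP itself is not used here; the following Python sketch only illustrates why the greedy version is safe when felsz occurs once, and where it would differ from the non-greedy one. The second input string is invented to show the difference.

```python
import re

GREEDY = re.compile(r"(\d{4}\.\d\d\.\d\d) (.*) felsz")
LAZY = re.compile(r"(\d{4}\.\d\d\.\d\d) (.*?) felsz")

line = "| 2013.10.10 FEL felsz"
tricky = "| 2013.10.10 FEL felsz and felsz"  # hypothetical input with felsz twice

# With a single felsz, both quantifiers give the same submatches.
print(GREEDY.search(line).groups(), LAZY.search(line).groups())

# With two, the greedy .* runs up to the last felsz, the lazy one stops at the first.
print(GREEDY.search(tricky).group(2), "/", LAZY.search(tricky).group(2))
```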
Is "felsz" variable? Can the white space vary? Can your date format vary? If not:
\| (\d{4}\.\d{2}\.\d{2}) (.*?) felsz
Otherwise:
\|\s+?(\d{4}\.\d{2}\.\d{2})\s+?(.*?)\s+?[a-z]+
Then access capture groups 1/2.
The regex
\d+\.\d+\.\d+
matches 2013.10.10 in the given string. Explanation and demonstration: http://regex101.com/r/bL7eO0
(?<=\d ).*(?= felsz)
should work to match FEL. Explanation and demonstration: http://regex101.com/r/pV2mW5
If you want them in capturing groups, you could use the regex:
\| (\d+\.\d+\.\d+) (.+?) .*
Explanation and demonstration: http://regex101.com/r/rQ6uU4
How about:
(?:\d+\.\d+\.\d+\s)(.*)\s
This matches FEL
Some things I took for granted:
the date always comes first and is a mix of numbers and periods
the date is always followed by a space
the word to capture is always followed by a space
the word to capture never contains a space
Assuming that FEL is always a single word (that is, delimited by a space), you could use the following expression:
(\d{4}\.\d\d\.\d\d) ([^\s]+) (.*)

Notepad++ regular expressions

First of all, regular expressions are quite possibly the most confusing thing I have ever dealt with. That being said, I cannot believe how efficient they can make one's life.
So I am trying to understand the wildcard regex with no luck
Need to turn
f_firstname
f_lastname
f_dob
f_origincountry
f_landing
Into
':f_firstname'=>$f_firstname,
':f_lastname'=>$f_lastname,
':f_dob'=>$f_dob,
':f_origincountry'=>$f_origincountry,
':f_landing'=>$f_landing,
In the answer can you please briefly describe the regex you are using, I have been reading the tutorials but they boggle my mind. Thanks.
Edit: As Chris points out, you can improve the regex by cleaning up any white space there may be in the target string. I also replace the dot with \w as he did because it's better practice than using the .
Search: ^f_(\w+)\s*$
^ # start at the beginning of the line
f_ # look for f_
(\w+) # capture in a group one or more word characters
\s* # skip over (don't capture) optional trailing whitespace
$ # end of the line
Replace: ':f_\1'=>$f_\1,
':f_ # beginning of replacement string
\1 # the group of characters captured above
'=>$f_ # some more characters for the replace
\1, # the capture group (again)
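The same search and replace can be sketched outside Notepad++ with Python's re.sub. One deliberate difference, which is my choice and not part of the answer: the trailing \s* is narrowed to [ \t]* so that, in multiline mode, the pattern cannot swallow the line break itself.

```python
import re

fields = "f_firstname\nf_lastname\nf_dob  \n"

# ^ and $ match at each line thanks to re.MULTILINE; \1 is the captured field name.
bound = re.sub(
    r"^f_(\w+)[ \t]*$",
    r"':f_\1'=>$f_\1,",
    fields,
    flags=re.MULTILINE,
)

print(bound)
```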
Find: (^.*)
Replace with: ':$1'=>\$$1,
Find What:
(f_\w+)
Here we're matching f_ followed by one or more word characters, \w+ (the plus means one or more times). Wrapping the whole thing in parentheses means we can reference this group in the replace pattern.
Replace With:
':\1'=>$\1,
This is simply your result phrase, but instead of hardcoding the f words I've put \1 to reference the group in the search.