How do I properly match Regular Expressions? - regex

I have a list of objects output from ldapsearch as follows:
dn: cn=HPOTTER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=HGRANGER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=RWEASLEY,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=DMALFOY,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL
dn: cn=SSNAPE,ou=FACULTY,ou=HOGWARTS,o=SCHOOL
dn: cn=ADUMBLED,ou=FACULTY,ou=HOGWARTS,o=SCHOOL
So far, I have the following regex:
/\bcn=\w*,/g
Which returns results like this:
cn=HPOTTER,
cn=HGRANGER,
cn=RWEASLEY,
cn=DMALFOY,
cn=SSNAPE,
cn=ADUMBLED,
I need a regex that returns results like this:
HPOTTER
HGRANGER
RWEASLEY
DMALFOY
SSNAPE
ADUMBLED
What do I need to change in my regex so the pattern (the cn= and the comma) is not included in the results?
EDIT: I will be using sed to do the pattern matching, and piping the output to other command line utilities.

You will have to perform a grouping. This is done by modifying the regex to:
/\bcn=\(\w*\),/g
This stores the matched name in a capture group. How you read the group back depends on the tool you use; with sed it is available as \1.
Note that in most regex flavors you don't have to escape the parentheses (), but since you're using sed (basic regex syntax) you need to, as shown above.
For an excellent resource on Regular Expressions I suggest: Mastering Regular Expressions
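As a rough illustration of the grouping idea outside sed, here is a minimal Python sketch. The names LDIF_LINES, CN_PATTERN and extract_cns are made up for this example, and the parentheses are unescaped because Python's re module uses the Perl-style syntax rather than sed's basic syntax.

```python
import re

# Two sample lines copied from the question.
LDIF_LINES = [
    "dn: cn=HPOTTER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL",
    "dn: cn=SSNAPE,ou=FACULTY,ou=HOGWARTS,o=SCHOOL",
]

# Same pattern as above, with a capture group around the name.
CN_PATTERN = re.compile(r"\bcn=(\w*),")


def extract_cns(lines):
    """Return the captured cn value from every line that matches."""
    names = []
    for line in lines:
        match = CN_PATTERN.search(line)
        if match:
            # group(1) plays the role of sed's \1.
            names.append(match.group(1))
    return names


print(extract_cns(LDIF_LINES))
```

The whole match still includes cn= and the comma; only the group leaves them out.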

OK, the place where you asked the more specific question was closed as "exact duplicate" of this, so I'm copying my answer from there to here:
If you want to use sed, you can use something like the following:
sed -e 's/dn: cn=\([^,]*\),.*$/\1/'
You have to use [^,]* because in sed, .* is "greedy" meaning it will match everything it can before looking at any following character. That means if you use \(.*\), in your pattern it will match up to the last comma, not up to the first comma.
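The difference between the greedy .* and the negated class [^,]* can be checked with a short Python sketch. It is only an illustration of the same substitution, not a replacement for the sed command; the variable names are arbitrary.

```python
import re

line = "dn: cn=HPOTTER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL"

# Greedy group: (.*) grabs as much as it can, so it runs up to the LAST comma.
greedy = re.sub(r"dn: cn=(.*),.*$", r"\1", line)

# Negated class: [^,]* cannot cross a comma, so it stops at the FIRST comma.
negated = re.sub(r"dn: cn=([^,]*),.*$", r"\1", line)

print(greedy)   # HPOTTER,ou=STUDENTS,ou=HOGWARTS
print(negated)  # HPOTTER
```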

Check out Expresso. I have used it in the past to build my regexes, and it is a good learning aid too.

The quick and dirty method is to use submatches assuming your engine supports it:
/\bcn=(\w*),/g
Then you would want to get the first submatch.

Without knowing what language you're using, we can't tell for sure, but in most regular expression engines, if you use parentheses, such as
/\bcn=(\w*),/g
then you'll be able to get the first matching pattern (often \1) as exactly what you are searching for. To be more specific, we need to know what language you are using.

If your regex supports Lookaheads and Lookbehinds then you can use
/(?<=\bcn=)\w*(?=,)/g
That will match
HPOTTER
HGRANGER
RWEASLEY
DMALFOY
SSNAPE
ADUMBLED
But not the cn= or the , on either side. The comma and cn= still have to be there for the match; they just aren't included in the result.
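For a quick check of the lookaround version, Python's re module is one engine that supports it (its lookbehind must be fixed width, which \bcn= is). A minimal sketch, with the sample text taken from the question:

```python
import re

text = (
    "dn: cn=HPOTTER,ou=STUDENTS,ou=HOGWARTS,o=SCHOOL\n"
    "dn: cn=ADUMBLED,ou=FACULTY,ou=HOGWARTS,o=SCHOOL\n"
)

# The lookbehind and lookahead assert the surrounding text without consuming it,
# so each match is the bare name.
names = re.findall(r"(?<=\bcn=)\w*(?=,)", text)

print(names)
```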

Sounds more like a simple parsing problem and not regex. An ANTLR grammar would sort this out in no time.

Related

Regular expression with a lookahead to capture text between two starting points with no explicit end point

I have a regular expression that works at https://regex101.com/r/VQkNze/1 that I've been trying to get to work in Tcl but cannot. Regular expressions tax my little brain so I'm likely doing something stupid. I've been trying in Tcl and found this regex web site searching through other SO questions; and tried my expression on the site in order to ask my question here and was surprised that it generated the desired result. So, I assume it has to do with a difference in Tcl or is a strange coincidence.
Would you please tell me what I'm doing wrong or overlooking? Thank you.
I tried the solution in this SO answer but couldn't get it to work in Tcl either.
I should have added that in Tcl I also tried:
regexp -all -inline {<span class="verse" id="V[[:digit:]]+">\
([[:digit:]])+? <\/span>(?=.+?(<span class="verse"|<\/div>))}
which separated the spans as desired but, of course, does not capture the text because it is in the lookahead. However I try to move the (.+?) for the text out of the lookahead, the spans are no longer separated as they are in the regex web site example.
In Tcl regex, the laziness/greediness is set with the first greedy/lazy quantifier. You need to use
<span class="verse" id="V[[:digit:]]+?">([[:digit:]]+?) </span>(.+?)(?=<span class="verse"|</div>)
to make it consistent with most other regex flavors, where V[[:digit:]]+? sets all quantifiers to lazy matching mode.
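For comparison, this is roughly how the same pattern behaves in a flavor where every quantifier keeps its own greediness. The sketch uses Python rather than Tcl, and the html string is a made-up sample, since the question does not show its input.

```python
import re

# Hypothetical input: two verse spans inside one div.
html = (
    '<div><span class="verse" id="V1">1 </span>In the beginning '
    '<span class="verse" id="V2">2 </span>And the earth</div>'
)

pattern = re.compile(
    r'<span class="verse" id="V[0-9]+">([0-9]+?) </span>(.+?)'
    r'(?=<span class="verse"|</div>)',
    re.DOTALL,  # let . cross line breaks, in case verses span lines
)

# Each tuple is (verse number, verse text); the lookahead stops the lazy
# (.+?) at the next span or at the closing div without consuming it.
verses = pattern.findall(html)

print(verses)
```

Here only (.+?) has to be lazy; in Tcl, as described above, the first quantifier decides the mode for the whole expression.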

PERL Regex Negation Issue

I have written a regex to pick files of the format
(ABC.*\.DAT) in perl.
How to write a negation for the above regex?
I already tried expressions like (?!ABC.*)\.DAT or (?!(ABC.*\.DAT))
Any help is appreciated.
(?s:(?!ABC).)*\.DAT
You can try this negation based regex. See demo.
The above can be safely embedded into a larger pattern. For example,
/^(?:(?!ABC).)*\.DAT\z/s
If you are trying to match the whole input, and if ABC doesn't end with ., .D, .DA or .DAT, then the following will be faster:
/^(?!.*ABC).*\.DAT\z/s
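The two approaches can be compared side by side in a small Python sketch. Python spells the end-of-string anchor \Z, and the file names below are invented for the test.

```python
import re

# Tempered dot: a character is consumed only if "ABC" does not start there.
TEMPERED = re.compile(r"^(?:(?!ABC).)*\.DAT\Z", re.DOTALL)

# One lookahead at the start rejects the string if "ABC" occurs anywhere,
# then .* is free to run to the extension.
LOOKAHEAD = re.compile(r"^(?!.*ABC).*\.DAT\Z", re.DOTALL)

# Hypothetical file names.
names = ["ABC123.DAT", "XYZ123.DAT", "XABC.DAT", "notes.txt"]

kept_tempered = [n for n in names if TEMPERED.search(n)]
kept_lookahead = [n for n in names if LOOKAHEAD.search(n)]

print(kept_tempered)
print(kept_lookahead)
```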

Perl Extended Regular Expressions - match with multiple question marks inside

I have got a weird thing to solve in perl using regular expressions.
Consider the strings -
abcdef000000123
blaDeF002500456
wefdEF120045423
All of these strings are matching with the below regular expression when I tried in C with pcre library support :
???[dD][eE][fF][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
But I'm unable to achieve the same in perl code. I'm getting some weird errors.
Please help with the piece of perl code with which these two things match.
Thanks in advance...
? is a quantifier: it makes the preceding pattern or group optional. On its own, with nothing before it, ? has no meaning in a regex, which is why you are getting an error like: Quantifier follows nothing in regex.
Following regex should work for you in perl:
...[dD][eE][fF][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
OR even more concise regex:
.{3}[dD][eE][fF][0-9]{9}
Each dot means match any character.
PS: You probably are getting confused by shell's glob vs regex.
That looks more like a shell glob (a file name pattern) than a PCRE. In Perl, the ? is a quantifier, not a wildcard. You may want to replace each ? with . to get the same results in anything Perl compatible.
I might use ...[dD][eE][fF][0-9]{9} or even replace the [0-9] with \d.
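The same point holds in other Perl-style engines. A minimal Python sketch, assuming the three strings from the question; the helper name compiles is made up for this example:

```python
import re

samples = ["abcdef000000123", "blaDeF002500456", "wefdEF120045423"]

# Three arbitrary characters, "def" in any case, then exactly nine digits.
pattern = re.compile(r".{3}[dD][eE][fF][0-9]{9}")

results = [bool(pattern.fullmatch(s)) for s in samples]


def compiles(pattern_text):
    """Report whether the regex engine accepts the pattern text at all."""
    try:
        re.compile(pattern_text)
        return True
    except re.error:
        return False


print(results)
# A leading ? has nothing to quantify, so the glob-style pattern is rejected.
print(compiles("???[dD][eE][fF]"))
```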
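qr/[a-z]{3}def[0-9]{9}/i
should be the Perl regex object used to validate the mentioned strings, assuming the first three characters are always letters. (With /i, [a-z] covers both cases; a range like [A-z] would also admit punctuation such as [ and _.)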
Regards

regular expression and substitution

In LaTeX, I have a lot of math expressions with subscripts written with the digits 1, 2 and 3. Now I need to change them to \alpha, \beta and \gamma instead of 1, 2 and 3.
for example:
$E_{223}$ to $E_{\beta\beta\gamma}$
and
$E_{31}$ to $E_{\gamma\alpha}$
However, I also have exponents, which are not supposed to be altered. For example, $E^3_{112}$ should be changed to $E^3_{\alpha\alpha\beta}$.
Is there a way to use regular expression to make this task easier? I know some regular expression from unix and perl, but seems inadequate for this problem.
thank you for anything!
I'm not 100% familiar with Latex, but typical regex would look like this:
(?<!\^)#
Where the # is 1, 2 or 3. Then, in your replace, you would replace the matches with \alpha, \beta and \gamma. The (?<!\^) is a negative look-behind that says to only replace instances of that number when they aren't preceded by a ^ character (your power indicator).
If typical regex doesn't permit, I'll delete my answer.
In Perl you could do things like:
$text =~ s#\$\w[^${\s]*_{\K([123]+)(?=}\$)#
local $_ = $1;
s/1/\\alpha/g; s/2/\\beta/g; s/3/\\gamma/g;
$_
#ge;
Try these:
replace (?<!\^\d|\d{2}|\d{3}|\d{4})1 with \alpha
replace (?<!\^\d|\d{2}|\d{3}|\d{4})2 with \beta
replace (?<!\^\d|\d{2}|\d{3}|\d{4})3 with \gamma
Edit: These regexes make sure that it won't replace a number from an exponent. You may have to tweak them to check for optional - if you have negative exponents.
Edit 2: @QTax pointed out that you can't use variable-length lookbehinds.
Subexp of look-behind must be fixed character length.
But different character length is allowed in top level
alternatives only.
Reference: http://tacosw.com/latexian/help/find/regex.html
I don't know what editor or regex engine you're using for this, but here's the basic idea I'd go with in Perl-ish regex:
Replace this:
(?<=\{\d*)1(?=\d*\})
With this:
\\alpha
I think you'll want to set the g flag as well.
Not sure if I have the right escaping syntax (it's been a while since I touched Perl) but I think so.
Repeat as necessary for \beta, \gamma, etc.
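An alternative that avoids lookbehind entirely is to match the whole subscript group and translate its digits in a callback. This is a different technique from the lookbehind answers above, sketched in Python under the assumption that every subscript to convert is written as _{...} and contains only the digits 1 to 3. The names GREEK and greek_subscripts are made up for the example.

```python
import re

GREEK = {"1": r"\alpha", "2": r"\beta", "3": r"\gamma"}


def greek_subscripts(tex):
    """Replace digits 1-3 inside _{...} with Greek letters.

    Exponents such as ^3 never match because the pattern requires "_{".
    """
    def swap(match):
        letters = "".join(GREEK[digit] for digit in match.group(1))
        return "_{" + letters + "}"

    # A function replacement sidesteps backslash escaping in the output.
    return re.sub(r"_\{([123]+)\}", swap, tex)


print(greek_subscripts("$E_{223}$"))
print(greek_subscripts("$E^3_{112}$"))
```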

How to search (using regex) for a regex literal in text?

I just stumbled on a case where I had to remove quotes surrounding a specific regex pattern in a file, and the immediate conclusion I came to was to use vim's search and replace util and just escape each special character in the original and replacement patterns.
This worked (after a little tinkering), but it left me wondering if there is a better way to do these sorts of things.
The original regex (quoted): '/^\//' to be replaced with /^\//
And the search/replace pattern I used:
s/'\/\^\\\/\/'/\/\^\\\/\//g
Thanks!
You can use almost any character as the regex delimiter. This will save you from having to escape forward slashes. You can also use groups to extract the regex and avoid re-typing it. For example, try this:
:s#'\(\\^\\//\)'#\1#
I do not know if this will work for your case, because the example you listed and the regex you gave do not match up. (The regex you listed will match '/^\//', not '\^\//'. Mine will match the latter. Adjust as necessary.)
Could you avoid using regex entirely by using a nice simple string search and replace?
Please check whether this works for you. Either give a line number before the substitute command or place the cursor on the line first:
:s:'\(.*\)':\1:
I used vim 7.1 for this. You can also visually select the lines the substitution should run on beforehand (press "v" or "V" and move the cursor accordingly).
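Outside vim, many regex libraries can do the escaping for you. As a sketch in Python, re.escape turns the literal into a pattern that matches it character for character, and plain string replacement avoids regex altogether. The text value is an invented sample line.

```python
import re

literal = r"/^\//"            # the regex literal to find, as plain text
text = r"pattern: '/^\//',"   # hypothetical line containing the quoted literal

# Build a pattern for the quoted literal without hand-escaping anything.
quoted = re.compile("'(" + re.escape(literal) + ")'")

# Keep the captured literal and drop the surrounding quotes.
unquoted = quoted.sub(lambda m: m.group(1), text)

# The same result with plain string replacement, no regex involved.
unquoted_plain = text.replace("'" + literal + "'", literal)

print(unquoted)
print(unquoted_plain)
```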