Lookahead in Oracle's regular expressions - regex

I've got a problem with Oracle's regexes. I have lots of phone-numbers in different tables. Now my task is to unify them. So I take out all blanks, underscores, minuses and whatnot. But then the tricky part comes - which seemed so easy at first.
There are numbers both with and without international code, so e.g. 0046812345678 and 0812345678. So I want to replace one single (!) leading zero with '0046'. I thought that ^0(?=[1-9]) would do the job but Oracle seems to think that lookaheads are useless.
(^0)(1|2|3|4|5|6|7|8|9) doesn't do the job either (or (^01|02|03|04|05|06|07|08|09) for that matter) since it would replace the first non-zero number as well making 0812345678 into 004612345678 (so, the first '8' disappears).
I searched and tried for quite some time now and can't come up with any more possibilities. Any help would be greatly appreciated. Thanks in advance!

You need to add the first 1-9 to the result so that only numbers starting with a single 0 are matched. To keep the first 1-9 we capture it (using parentheses) and add it to the replace part (using \1). This seems to work:
select regexp_replace('0812345678', '^0([1-9])', '0046\1') from dual;
Result: 0046812345678
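If you want to try the pattern outside Oracle, here is a minimal sketch in Python. Python is my assumption here (the question is about Oracle), and the helper names are made up for illustration. Python's re module does support lookahead, so the capture-group version and the lookahead version can be compared side by side:

```python
import re

def add_country_code(number: str) -> str:
    """Replace one single leading zero with 0046 (hypothetical helper)."""
    # Capture the first non-zero digit and put it back with \1,
    # mirroring the regexp_replace call above.
    return re.sub(r"^0([1-9])", r"0046\1", number)

def add_country_code_lookahead(number: str) -> str:
    """Same idea, but with a lookahead so nothing has to be put back."""
    return re.sub(r"^0(?=[1-9])", "0046", number)

print(add_country_code("0812345678"))     # national number
print(add_country_code("0046812345678"))  # already international
```

Both variants leave a number alone when the first zero is followed by another zero, because the character after the leading 0 has to be 1-9.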

Notepad++ masschange using regular expressions

I am having trouble performing a mass change in a huge logfile.
Apart from the file size, which causes problems for Notepad++, I have a problem using more than 10 parameters in the replacement; up to 9 it works fine.
I need to change numerical values in a file where these values are located within quotation marks and with leading and ending comma: ."123,456,789,012.999",
I used this exp to find and replace the format to:
,123456789012.999, (so that there are no quotation marks and no comma within the num.value)
The exp used to find is:
([,])(["])([0-9]+)([,])([0-9]+)([,])([0-9]+)([,])([0-9]+)([\.])([0-9]+)(["])([,])
and the exp to replace is:
\1\3\5\7\9\10\11\13
The problem is parameters \11 \13 are not working (the chars eg .999 as in the example will not appear in the changed values).
So now the question is - is there any limit for parameters?
It seems to me that it's not working above 10. For shorter num.values, where I need only up to 9 parameters, the search and replacement strings work fine. For the example above the search works but not the replacement: the end of the changed value gets corrupted.
Also, it came to my mind that instead of using Notepad++ I could maybe change the logfile on the unix server directly; however, I had issues building the correct perl syntax. Could anyone help with that?
After having a little play myself, it looks like back-references \11-\99 are invalid in notepad++ (which is not that surprising, since this is commonly omitted from regex languages.) However, there are several things you can do to improve that regular expression, in order to make this work.
Firstly, you should consider using fewer groups, or alternatively non-capturing groups. Did you really need to store 13 variables in that regex in order to do the replacement? Clearly not, since you're not even using half of them!
To put it simply, you could just remove some brackets from the regex:
[,]["]([0-9]+)[,]([0-9]+)[,]([0-9]+)[,]([0-9]+)[.]([0-9]+)["][,]
And replace with:
,\1\2\3\4.\5,
...But that's not all! Why are you using square brackets to say "match anything inside", if there's only one thing inside?? We can get rid of these, too:
,"([0-9]+),([0-9]+),([0-9]+),([0-9]+)\.([0-9]+)",
(Note I added a "\" before the ".", so that it matches a literal "." rather than "anything".)
Also, although this isn't a big deal, you can use "\d" instead of "[0-9]".
This makes your final, optimised regex:
,"(\d+),(\d+),(\d+),(\d+)\.(\d+)",
And replace with:
,\1\2\3\4.\5,
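Since the file is too big for Notepad++ anyway, the same search and replace could be run from a script instead. Below is a minimal sketch in Python rather than perl (my choice, not something from the question); the function name is hypothetical and the pattern is the optimised one from this answer:

```python
import re

# Final pattern from the answer above: five capture groups instead of 13.
QUOTED_NUMBER = re.compile(r',"(\d+),(\d+),(\d+),(\d+)\.(\d+)",')

def strip_quotes_and_commas(line: str) -> str:
    """Turn ,"123,456,789,012.999", into ,123456789012.999, (hypothetical helper)."""
    return QUOTED_NUMBER.sub(r",\1\2\3\4.\5,", line)

print(strip_quotes_and_commas('a,"123,456,789,012.999",b'))
```

Applied line by line, this never has to load the whole logfile at once. One thing to watch: the pattern consumes both surrounding commas, so two such values directly next to each other share a comma and the second one would be skipped in a single pass.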
Not sure if the regex groups have limitations, but you could use lookarounds to save 2 groups, and you could also merge some groups in your example. But first, let's get rid of some useless character classes:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
We could merge those groups:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
                                        ^^^^^^^^^^^^^^^^^^^^
We get:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(,)
Let's add lookarounds:
(?<=\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(?=,)
The replacement would be \2\4\6\8.
If you have a fixed length of digits at all times, it's fairly simple to do what you have done. Even though your expression is poorly written, it does the job. If this is the case, look at Tom Lord's answer.
I played around with it a little bit myself, and I would probably use two expressions - makes it much easier. If you have to do it in one, this would work, but be pretty unsafe:
(?:"|(\d+),)|(\.\d+)"(?=,) replace by \1\2
Live demo: http://regex101.com/r/zL3fY5

emacs syntax highlight numbers not part of words (with regex?)

I've moved to emacs recently and I am used to/like numbers being highlighted. A quick hack I took from here puts the following in my .emacs:
(add-hook 'after-change-major-mode-hook
'(lambda () (font-lock-add-keywords
nil
'(("\\([0-9]+\\)"
1 font-lock-warning-face prepend)))))
Which gives a good start, i.e. any digit is highlighted. However, I am a complete beginner with regex and would ideally like the following behaviour:
Also highlight the decimal point if it's part of a float, e.g. 12.34
Do not highlight any part of the number if it is next/part of a word. e.g. in these cases: foo11 ba11r 11spam, none of the '1's should be highlighted
Allow 'e' within two number integers to allow scientific notation (not required, bonus credit)
Unfortunately this looks very much like a 'do this for me' question which I am loathe to post, but I have failed thus far to make any decent progress myself.
About as far as I have got is discovering [^a-zA-Z][0-9]+[^a-zA-Z] to match anything but a letter either side (e.g. an equals sign), but all this does is include the adjacent symbol in the highlighting. I am not sure how to tell it 'only highlight the numbers if there isn't a letter on either side'.
Of course, I can't imagine regex is the way to go with complicated syntax highlighting, so any good number highlighting in emacs ideas are also welcome,
Any help very much appreciated. (In case it makes any difference, this is for use when Python coding.)
Start by going to your scratch buffer and typing in some test text. Put some numbers in there, some identifiers that contain numbers, some numbers with missing parts (like .e12), etc. These will be our test cases and will let us experiment rapidly. Now run M-x re-builder to enter the regex builder mode, which will let you try out any regex against the text of the current buffer to see what it matches. This is a very handy mode; you'll be able to use it all the time. Just note that because Emacs Lisp requires you to put regexes into strings, you must double up on all of your backslashes. You're already doing that correctly, but I'm not going to double them up in here.
So, limiting the match to numbers that are not part of identifiers is pretty easy. \b will match word boundaries, so putting one at either end of your regex will make it match a whole word.
You can match floats just by adding a period to the character class you started with, so that it becomes [0-9.]. Unfortunately, that can match a period all on its own; what we really want is [0-9]*\.?[0-9]+, which will match 0 or more digits followed by an optional period followed by one or more digits.
A leading sign can be matched with [-+]?, so that gets us negative numbers.
To match exponents we need an optional group: \(...\)?, and since we are only using this for highlighting, and don't actually need to separate out the content of the group, we can do \(?:...\), which will save the regex matcher a little time. Inside the group we will need to match an "e" ([eE]), an optional sign ([-+]?), and one or more digits ([0-9]+).
Putting it all together: [-+]?\b[0-9]*\.?[0-9]+\(?:[eE][-+]?[0-9]+\)?\b. Note that I've put the optional sign before the first word boundary, because the "+" and "-" characters create a word boundary.
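To see what the combined pattern accepts without opening re-builder, here is a rough sketch that translates it to Python syntax. This is an approximation, not Emacs itself: Python writes the shy group as (?:...) without backslashes, and its \b always treats the underscore as a word character, whereas Emacs consults the mode's syntax table. The sample string is made up from the cases in the question.

```python
import re

# Emacs: [-+]?\b[0-9]*\.?[0-9]+\(?:[eE][-+]?[0-9]+\)?\b
# Python drops the backslashes around the group.
NUMBER = re.compile(r"[-+]?\b[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?\b")

sample = "foo11 ba11r 11spam 12.34 1e5 -3 42"

# Only the free-standing numbers should be reported; digits glued
# to letters (foo11, ba11r, 11spam) fail the word-boundary checks.
print(NUMBER.findall(sample))
```

Note that a float written without a leading digit (such as .5) only has its digits matched, because there is no word boundary in front of the period.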
First of all, lose the add-hook and the lambda. The font-lock-add-keywords call doesn't need either. If you want this only for python-mode, pass the mode symbol as the first argument instead of nil.
Second, there are two main ways to do that.
Add a grouping construct around the digits. The numbers in the font-lock-keywords forms correspond to the groups, so this would be '(("\\([^a-zA-Z]\\([0-9]+\\)[^a-zA-Z]\\)" 2 font-lock-warning-face prepend)). The outer grouping is rather useless here, though, so this can be simplified to '(("[^a-zA-Z]\\([0-9]+\\)[^a-zA-Z]" 1 font-lock-warning-face prepend)).
Just use the beginning and end of symbol backslash constructs. Then the regexp looks like this: \_<[0-9]+\_>. We can highlight the whole match here, so no capture group is needed; subexpression 0 refers to the entire match: '(("\\_<[0-9]+\\_>" 0 font-lock-warning-face prepend)). As a variation, you could use the beginning-of-word and end-of-word constructs, but you probably don't want to highlight numbers adjacent to underscores or whatever other characters, if any, python-mode has in the syntax class symbol.
And lastly, there's probably no need for prepend. The numbers are likely all unhighlighted before this, and if you consider possible interaction with other minor modes like whitespace, you'd better choose append, or just omit this element entirely.
End result:
(font-lock-add-keywords nil '(("\\_<[0-9]+\\_>" . font-lock-warning-face)))

Regular Expression for input validation

I am trying to learn regular expressions and was hoping someone could help me out. I would appreciate it if someone could help me come up with a regular expression to validate that an input must be of the form
Graph: XY5, YZ4, ST7
Each part such as XY5 represents an edge in the graph, and the number represents the edge weight. There can be any number of such edges.
This is what I have till now. It's probably not correct
"^Graph:\\s{1}[A-ZA-Z\\d,\\s]+"
This might be what you're looking for:
/^Graph: (?:[A-Z]{2}\d(?:$|, ?))+/
See it here in action: http://regexr.com?309av
Here's an explanation of what the regex does (screenshot from RegexBuddy, which is probably the best tool for you if you're trying to learn Regular Expressions):
Try this
/^Graph:(\s+[A-Z][A-Z]\d+)+$/
You should explain your input format a little better. This might do it, from the single example I have and what you said. It does not allow a graph to be empty, which may or may not be part of your requirements.
"^Graph:(\s\w{2}\d+,?)+"
to explain:
^Graph: will cover the start of the line
(\s\w{2}\d+,?)+
\s is a space
\w{2} matches exactly 2 alphanumeric characters (hint: you could make this better!)
\d+ matches 1 or more digits, since I am assuming an edge can have a two-digit weight (such as 10)
,? matches a comma optionally. (hint: you could make this better as well, as it will not necessitate a comma between each entry!, maybe by using an or and the end of string delimiter!)
I purposely left some room for improvement, because if you think of some of it on your own, you will accomplish your goal of becoming better with regular expressions.
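If you want to check your own attempt against the hints above, here is one possible tightened version as a Python sketch. Python is my assumption (the question does not name a language), and so is the exact format: two capital letters, a weight of one or more digits, and a comma plus optional whitespace between edges. The function name is made up.

```python
import re

# Assumed format: "Graph: " followed by at least one edge. Every edge
# after the first must be preceded by a comma, so "XY5 YZ4" is rejected.
GRAPH = re.compile(r"Graph: [A-Z]{2}\d+(?:,\s*[A-Z]{2}\d+)*")

def is_valid_graph(text: str) -> bool:
    """True if the whole input has the assumed form (hypothetical helper)."""
    return GRAPH.fullmatch(text) is not None

for candidate in ["Graph: XY5, YZ4, ST7", "Graph: XY5 YZ4", "Graph: XY5,"]:
    print(candidate, is_valid_graph(candidate))
```

Using fullmatch plays the role of the ^ and $ anchors, so trailing junk such as a dangling comma fails the check.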

Regex to match a string that does not contain 'xxx'

One of my homework questions asked to develop a regex for all strings over x,y,z that did not contain xxx
After doing some reading I found out about negative lookahead and made this which works great:
(x(?!xx)|y|z)*
Still, in the spirit of completeness, is there any way to write this without negative lookahead?
Reading I have done makes me think it can be done with some combination of carets (^), but I cannot get the right combination so I am not sure.
Taking it a step further, is it possible to exclude a string like xxx using only the or (|) operator, but still check the strings in a recursive fashion?
EDIT 9/6/2010:
Think I answered my own question. I messed with this some more, trying to make this regex with only or (|) statements, and I am pretty sure I figured it out... and it isn't nearly as messy as I thought it would be. If someone else has time to verify this with a human eye I would appreciate it.
(xxy|xxz|xy|xz|y|z)*(xxy|xxz|xx|xy|xz|x|y|z)
Try this:
^(x{0,2}(y|z|$))*$
The basic idea is this: match at most 2 x's, followed by another letter or the end of the string.
When you reach a point where you have 3 X's, the regex has no rule that allows it to keep matching, and it fails.
Working example: http://rubular.com/r/ePH0fHlZxL
A less compact way to write the same is (with free spaces, usually the /x flag):
^(
y| # y is ok
z| # so is z
x(y|z|$)| # a single x, not followed by x
xx(y|z|$) # 2 x's, not followed by x
)*$
Based on the latest edit, here's an even flatter version of the pattern: I'm not entirely sure I understand your fascination with the pipe, but you can eliminate some more options - by allowing an empty match on the second group you don't need to repeat permutations from the first group. That regex also allows ε, which I think is included in your language.
^(xxy|xxz|xy|xz|y|z)*(xx|x|)$
Basically you have the right answer already - well done you. :)
A caret (^) in a set such as [^abc] only matches a character that is not in that set, so its application for matching sequences of characters (i.e. strings) is limited and weak.
Regex has numeric quantifiers {n} and {a,b}, which allow you to match a defined number of repetitions of a pattern. That would work for this specific pattern (because it's 'x' repeated), but it's not particularly expressive of the problem you're trying to solve (even for regex!) and is a bit brittle (it wouldn't be appropriate for a negative match on 'xyx', for example).
An or pattern again would be verbose and rather unexpressive but it could be done as the fragment:
(x|xx)[^x] // x OR xx followed by NOT x
Obviously you can do this with an iterative algorithm but that's highly inefficient compared to a regex.
Well done for thinking beyond the solution though.
I know you don't want to use lookahead, but here's another way to solve this:
^(?:(?!xxx)[xyz])*$
will match any line of characters x, y or z as long as it doesn't contain the string xxx.
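Rather than verifying these by eye, the patterns in this thread can be checked by brute force against the plain substring test. The following is a small sketch in Python (my choice of language; the names PATTERNS and mismatches are made up). It enumerates every string over x, y, z up to a small length and lists the strings where a pattern disagrees with "does not contain xxx":

```python
import re
from itertools import product

# Patterns quoted in this thread; ^ and $ are dropped because
# fullmatch already requires the whole string to match.
PATTERNS = {
    "bounded x": r"(x{0,2}(y|z|$))*",
    "pipes only": r"(xxy|xxz|xy|xz|y|z)*(xx|x|)",
    "lookahead": r"(?:(?!xxx)[xyz])*",
    "asker's edit": r"(xxy|xxz|xy|xz|y|z)*(xxy|xxz|xx|xy|xz|x|y|z)",
}

def mismatches(pattern, max_len=6):
    """Strings (up to max_len) where the regex and the substring test disagree."""
    regex = re.compile(pattern)
    wrong = []
    for length in range(max_len + 1):
        for letters in product("xyz", repeat=length):
            s = "".join(letters)
            accepted = regex.fullmatch(s) is not None
            if accepted != ("xxx" not in s):
                wrong.append(s)
    return wrong

for name, pattern in PATTERNS.items():
    print(name, mismatches(pattern))
```

This is only a finite check, not a proof. It does show the one difference mentioned above: the pattern from the asker's edit rejects the empty string, while the other three accept it.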

how to eliminate dots from filenames, except for the file extension

I have a bunch of files that look like this:
A.File.With.Dots.Instead.Of.Spaces.Extension
Which I want to transform via a regex into:
A File With Dots Instead Of Spaces.Extension
It has to be in one regex (because I want to use it with Total Commander's batch rename tool).
Help me, regex gurus, you're my only hope.
Edit
Several people suggested two-step solutions. Two steps really make this problem trivial, and I was really hoping to find a one-step solution that would work in TC. I did, BTW, manage to find a one-step solution that works as long as there's an even number of dots in the file name. So I'm still hoping for a silver bullet expression (or a proof/explanation of why one is strictly impossible).
It appears Total Commander's regex library does not support lookaround expressions, so you're probably going to have to replace a number of dots at a time, until there are no dots left. Replace:
([^.]*)\.([^.]*)\.([^.]*)\.([^.]*)$
with
$1 $2 $3.$4
(Repeat the sequence and the number of backreferences for more efficiency. You can go up to $9, which may or may not be enough.)
It doesn't appear there is any way to do it with a single, definitive expression in Total Commander, sorry.
Basically:
/\.(?=.*?\.)//
will do it in pure regex terms. This means, replace any period that is followed by a string of characters (non-greedy) and then a period with nothing. This is a positive lookahead.
In PHP this is done as:
$output = preg_replace('/\.(?=.*?\.)/', '', $input);
Other languages vary but the principle is the same.
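For comparison, here is the same lookahead as a minimal Python sketch (Python and the function name are assumptions for illustration). It substitutes a space instead of nothing, since the question asks for spaces:

```python
import re

def dots_to_spaces(filename: str) -> str:
    """Replace every dot except the last one with a space (hypothetical helper)."""
    # A dot is replaced only if another dot follows somewhere later,
    # so the dot in front of the extension is kept.
    return re.sub(r"\.(?=.*?\.)", " ", filename)

print(dots_to_spaces("A.File.With.Dots.Instead.Of.Spaces.Extension"))
```

A name with a single dot, or with none at all, comes back unchanged.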
Here's one based on your almost-solution:
/\.([^.]*(\.[^.]+$)?)/\1/
This is, roughly, "any dot stuff, minus the dot, and maybe plus another dot stuff at the end of the line." I couldn't quite tell if you wanted the dots removed or turned to spaces - if the latter, change the substitution to " \1" (minus the quotes, of course).
[Edited to change the + to a *, as Helen's below.]
Or substitute all dots with space, then substitute [space][Extension] with .[Extension]
A.File.With.Dots.Instead.Of.Spaces.Extension
to
A File With Dots Instead Of Spaces Extension
to
A File With Dots Instead Of Spaces.Extension
Another pattern to find all dots but the last in a (windows) filename that I've found works for me in Mass File Renamer is:
(?!\.\w*$)\.
I don't know how useful that is to other users, but this page was an early search result and if that had been on here it would have saved me some time.
It excludes the result if it's followed by an uninterrupted sequence of alphanumeric characters leading to the end of the input (filename) but otherwise finds all instances of the dot character.
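A quick way to see that behaviour, and its limit, is a small Python sketch (Python and the names are my assumption; Mass File Renamer's regex flavour may differ in details). The second file name is a made-up example whose last part contains a hyphen:

```python
import re

# Negative lookahead: skip a dot that is followed only by word
# characters up to the end of the name (i.e. the extension dot).
ALL_BUT_LAST_DOT = re.compile(r"(?!\.\w*$)\.")

def rename(filename: str) -> str:
    """Replace the matched dots with spaces (hypothetical helper)."""
    return ALL_BUT_LAST_DOT.sub(" ", filename)

print(rename("A.File.With.Dots.Instead.Of.Spaces.Extension"))
print(rename("notes.v1.tar-gz"))
```

In the second name the hyphen interrupts the run of word characters, so the last dot is no longer protected and gets replaced as well.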
You can do that with Lookahead. However I don't know which kind of regex support you have.
/\.(?=.*\.)//
Which roughly translates to Any dot /\./ that has something and a dot afterwards. Obviously the last dot is the only one not complying. I leave out the "optionality" of something between dots, because the data looks like something will always be in between and the "optionality" has a performance cost.
Check:
http://www.regular-expressions.info/lookaround.html