Intelligent RegEx Replacement - regex

I'm setting up a system to parse a string with very specified syntax and fix user errors. For example, the syntax requires dates in a m/d/yy format (no leading 0s), so I need to make the following substitutions:
10/01/13 -> 10/1/13
10/10/13 -> no change
10/1/13 -> no change
01/10/13 -> 1/10/13
I have a lot of rules like this by which I need to find portions of a string and fix those portions. I can use RegEx to identify what needs to be corrected easily. For an easier example, I want to find CBUx[2-9], but then I need to replace with something like this CBU x [2-9] (spaces around x if preceded by CBU and follwed by a digit). Example:
input text: "blah blah CBUx3"
matched: "CBUx3"
replace: "CBU x 3"
output text: "blah blah CBU x 3"
Is this possible? Note that I'm fully aware I could write code to find the slashes and digits. I'm specifically trying to do this with an "intelligence RegEx replace". I have a lot of different types of corrections that I can match with RegEx, and I would like to avoid writing specific correction procedures for each.

Maybe something like that for the leading zeroes:
\b0+([1-9])
And replace with $1 (or \1 depending on the language, though \1 is less common nowadays).
But something a bit better might be with the use of a negative lookbehind:
(?<![.,])\b0+([1-9])
So that the 0 in 10,001.002 are not changed to 10,1.2.
regex101 demo
The word boundary, \b, makes sure that the 0 (or more) are at the beginning of the number and the negative lookbehind is for cases of decimals and thousand separators, assuming that you have have floating numbers in the string. Note that this will however prevent the removal of zeroes in a date format of 11.01.13. A more complex regex can however be made with the assumption that such a date always have a least one number after a second dot (itself after 2 numbers since dates and months take at most 2 digits) without encountering anything other than other numbers, which makes the regex look like...
(?<![.,](?![0-9]{2}\.[0-9]))\b0+([1-9])
And which renders to something like this.
For the CBUx[2-9], you can use a capture group as well:
CBUx([2-9])
And replace with: CBU x $1 (or \1)
There might be some tweaks I didn't consider for the leading zero removal part, but that's what I can think about right now.

Related

Money amount regex currency name after the amount

I am trying to create a regular expression that matches money amount (various currencies either in front of or after the given amount. The decimals are separated by a dot or by a comma).
This is what I've got so far:
\$[0-9.,]+|\£[0-9.,]+|\€[0-9.,]+
However, if I put currencies in the square brackets together with the other signs, it does not work as I expect it to (it still doesn't match 20,000$, only $20,000 and I want it to match both).
Can you tell me how I can modify my regex so that it also matches the amounts with the currency after the digits?
Also, is the only way to include more than one currency in the regex to separate them with a pipe and rewrite the same regular expression over and over again?
Updated:
This regex should match numbers with decimal group separators (zero or more) and a decimal point (zero or one):
(?:\d{1,3},)*\d{1,3}(?:\.\d+)?
For your use-case you should be happy with this regex:
[\$£€](?:\d{1,3},)*\d{1,3}(?:\.\d/{1,2})?|(?:\d{1,3},)*\d{1,3}(?:\.\d{1,2})?[\$£€]
Legacy answer
There is no way in regular expressions (at least that I know of) that would allow you to swap the order of two groups of characters, thus you'll have to specify it like "AB or BA".
Hope, this one works for you:
[\$\£\€]\d+(?:[.,]\d+)?|\d+(?:[.,]\d+)?[\$\£\€]
The \d+(?:[.,]\d+)? part could be simplified back to [\d.,]+. The simplest for of regex (with a lot of information lost) is this:
[\$£€]?[\d.,]+[\$£€]?
... but that allows a lot of erroneous inputs, like 20.$ or $.,€ or simply 5.

Regular Expression to match most explicit string

I have some experience with regular expressions but I am far from expert level and need a way to match the record with the most explicit string in a file where each record begins with a unique 1-5 digit integer and is padded with various other characters when it is shorter than 5 digits. For example, my file has records that begin with:
32000
3201X
32014
320xy
In this example, the non-numeric characters represent wildcards. I thought the following regex examples would work but rather than match the record with the MOST explicit number, they always match the record with the LEAST explicit number. Remember, I do not know what is in the file so I need to test all possibilities to locate the MOST explicit match.
If I need to search for 32000, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3200\D|^32000/
It should match 32000 but it matches 320xy
If I need to search for 32014, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32014/
It should match 32014 but it matches 320xy
If I need to search for 32015, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32015/
It should match 3201x but it matches 320xy
In each case, the matched result is the LEAST specific numeric value. I also tried reversing the regex as follows by still get the same results:
/^32014|^3201\D|^320\D{2}|^32\D{3}|^3\D{4}/
Any help is much appreciated.
Okay, if you want to match a string literally then use anchors. Then specify the string you want matched. For instance match '123456xyz' where the xyz can be anything excep numeric use:
'^123456[^0-9]{3}$'
If you prefer specific letters to match at the end, if they will always be x y or z then use:
'^123456[xyz]{3}$'
Note the ^ and $ anchor the string to start with 12345 and end with three letters that are x y or z.
Good luck!
Ok, I did quite some tinkering here. I am 99% percent sure that this is pretty much impossible (if we don't cheat and interpolate code into the regex). The reason is you will need a negative lookbehind with variable length at some point.
However, I came up with two alternatives. One is if you want just to find the "most exact match", the second one is if you want to replace it with something. Here we go:
/(32000)|\A(?!.*32000).*(3200\D)|\A(?!.*3200[0\D]).*(320\D\D)|\A(?!.*320[0\D][0\D]).*(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)/m
Question:
So what is my "most exact match" here?
Answer:
The concatenation of the 5 matched groups - \1\2\3\4\5. In fact always only one of them will match, the other 4 will be empty.
/(32000)|\A(?!.*32000)(.*)(3200\D)|\A(?!.*3200[0\D])(.*)(320\D\D)|\A(?!.*320[0\D][0\D])(.*)(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D])(.*)(3\D\D\D\D)/m
Question:
How can I use this to replace my "most exact match"?
Answer:
In this case your "most exact match" will be the concatenation of \1\3\5\7\9, but we will have also matched some other things before that, namely \2\4\6\8 (again, only one of these can be non empty). Therefore if you want to replace your "most exact match" with fubar you can match with the above regex and replace with \2\4\6\8fubar
Another way you can think about it (and might be helpful) is that your "most exact match" will be the last matched line of either of the two regexes.
Two things to note here:
I used Ruby style RE, \A means the beginning of the string (not the beginning of a line - ^). \m means multi line mode. You should be able to find syntax for the same things in your language/technology as long as it uses some flavor of PCRE.
This can be slow. If we don't find exact match we might possibly have to match and replace the entire string (if the non exact match can be found at the end of the string).

Add two decimal digits to a number range regex

I've created a Regexp to validate a direction in degrees, between -359 and +359 (with optional sign). This is my regex:
const QString xWindDirectionPattern("[+-]{0,1}([0-9]{1,2}|[12][0-9]{2}|3[0-5][0-9])");
Now, I want to add two decimal numbers, in order to write numbers from -359.99 to +359.99. I've tried something like appending \.[0-9]{1,2}|[0-9]{1,3} but It does not work.
I'd like to have optional decimal point so I can have
23.3 valid
23.33 valid
23 valid
23.333 not valid
I've read some other questions, like this one, but I'm not able to modify the example to match a number range, like in my case.
How can I achieve this result?
Thanks in advance for your replies.
How can achieve this?
I've created a Regexp to validate a direction in degrees, between -359 and +359
No, you can't. You shouldn't. You are using the wrong tool. Regex cannot do the kinds of validation, which require it to dig into the semantics of the characters.
Regex can only process and match text, but cannot identify what they actually mean. Basically Regex are good for parsing regular language, and bad for almost everything else.
For e.g.:
A Regex can match 3 digits, but it would be extremely impractical to use it to match 3 digits that fall in range - [259, 634]. For that you would need to know the meaning of each individual digits in that number.
A Regex can match a pattern for date like - \d\d/\d\d/\d\d, but it cannot identify which part is date, and which part is month.
Similarly, it can find you two numbers x and y, but it cannot identify, whether x < y or not.
The task as above require you to understand the meaning of the text. Regex can't do that.
Well, of course you have come up with a regex for sure, but as you can see it is highly un-flexible. A little change in your requirement, will screw both - the regex and you.
You should better use corresponding language features - constructs like if-else to make sure you are reading degrees in that range, and not regex.
You can do this:
[+-]{0,1}((?:[0-9]{1,2}|[12][0-9]{2}|3[0-5][0-9])(?:\.[0-9]{1,2})?)
This will allow an a decimal point followed by one or two digits. You'll probably also want to use start and end anchors (^ / $) to ensure that there are no characters other than this pattern in your string—without this, 23.333 would be allowed because 23.33 matches the above pattern:
^[+-]{0,1}((?:[0-9]{1,2}|[12][0-9]{2}|3[0-5][0-9])(?:\.[0-9]{1,2})?)$
You can test it out here.
Try [+-]?([1-9]\d?|[12]\d{2}|3[0-5]\d)(\.\d{1,2})?.
[+-]? Optional Sign
[1-9]\d? 1 or 2 digit number
[12]\d{2} 100 to 299
3[0-5]\d 300 to 359
(\.\d{1,2})? Optional decimal point followed by 1 or two digits

emacs syntax highlight numbers not part of words (with regex?)

I've moved to emacs recently and I am used to/like numbers being highlighted. A quick hack I took from here puts the following in my .emacs:
(add-hook 'after-change-major-mode-hook
'(lambda () (font-lock-add-keywords
nil
'(("\\([0-9]+\\)"
1 font-lock-warning-face prepend)))))
Which gives a good start, i.e. any digit is highlighted. However, I am a complete beginner with regex and would ideally like the following behaviour:
Also highlight the decimal point if it's part of a float, e.g. 12.34
Do not highlight any part of the number if it is next/part of a word. e.g. in these cases: foo11 ba11r 11spam, none of the '1's should be highlighted
Allow 'e' within two number integers to allow scientific notation (not required, bonus credit)
Unfortunately this looks very much like a 'do this for me' question which I am loathe to post, but I have failed thus far to make any decent progress myself.
About as far as I have got is discovering [^a-zA-Z][0-9]+[^a-zA-Z] to match anything but a letter either side (e.g. an equals sign), but all this does is include the adjacent symbol in the highlighting. I am not sure how to tell it 'only highlight the numbers if there isn't a letter on either side'.
Of course, I can't imagine regex is the way to go with complicated syntax highlighting, so any good number highlighting in emacs ideas are also welcome,
Any help very much appreciated. (In case it makes any difference, this is for use when Python coding.)
Start by going to your scratch buffers and typing in a some test text. put some numbers in there, some identifiers that contain numbers, some numbers with missing parts (like .e12), etc. These will be our testcases and will let us experiment rapidly. Now run M-x re-builder to enter the regex builder mode, which will let you try out any regex against the text of the current buffer to see what it matches. This is a very handy mode; you'll be able to use it all the time. Just note that because Emacs lisp requires you to put regexes into strings, you must double up on all of your backslashes. You're already doing that correctly, but I'm not going to double them up in here.
So, limiting the match to numbers that are not part of identifiers is pretty easy. \b will match word boundaries, so putting one at either end of your regex will make it match a whole word
You can match floats just by adding a period to the character class you started with, so that it becomes [0-9.]. Unfortunately, that can match a period all on it's own; what we really want is [0-9]*\.?[0-9]+, which will match 0 or more digits followed by an optional period followed by one or more digits.
A leading sign can be matched with [-+]?, so that gets us negative numbers.
To match exponents we need an optional group: \(...\)?, and since we are only using this for highlighting, and don't actually need to separate out the content of the group, we can do \(?:...\), which will save the regex matcher a little time. Inside the group we will need to match an "e" ([eE]), an optional sign ([-+]?), and one or more digits ([0-9]+).
Putting it all together: [-+]?\b[0-9]*\.?[0-9]+\(?:[eE][-+]?[0-9]+\)?\b. Note that I've put the optional sign before the first word boundary, because the "+" and "-" characters create a word boundary.
First of all, lose the add-hook and the lambda. The font-lock-add-keywords call doesn't need either. If you want this only for python-mode, pass the mode symbol as the first argument instead of nil.
Second, there are two main ways to do that.
Add a grouping construct around the digits. The numbers in the font-lock-keywords forms correspond to the groups, so this would be '(("\\([^a-zA-Z]\\([0-9]+\\)[^a-zA-Z]\\)" 2 font-lock-warning-face prepend). The outer grouping is rather useless here, though, so this can be simplified to '(("[^a-zA-Z]\\([0-9]+\\)[^a-zA-Z]" 1 font-lock-warning-face prepend).
Just use the beginning and end of symbol backslash constructs. Then the regexp looks like this: \_<[0-9]+\_>. We can highlight the whole match here, so there's no need for the group number: '(("\\_<[0-9]+\\_>" . font-lock-warning-face prepend). As a variation, you could use the beginning-of-word and end-of-word constructs, but you probably don't want to highlight numbers adjacent to underscores or whatever other characters, if any, python-mode has in the syntax class symbol.
And lastly, there's probably no need for prepend. The numbers are likely all unhighlighted before this, and if you consider possible interaction with other minor modes like whitespace, you'd better choose append, or just omit this element entirely.
End result:
(font-lock-add-keywords nil '(("\\_<[0-9]+\\_>" . font-lock-warning-face)))

Regex for detecting numbers away from each other?

so basically I want to detect if in these strings:
Hello 123 My 222 dear 112 troll 12 8889
192.1.1.254:10000
the numbers are in a format like this:
[0 to 255][ANYTHING][0 to 255][ANYTHING][0 to 255][ANYTHING][0 to 255][ANYTHING][0 to 65536]
Does anyone know how I can build such a regex?
It is for detecting if anyone posts an IP:Port in unusual format to bypass default ip:port filters.
Edit: As for the first comment: I do not know regex and what I have tried is:
if(regex_match("192.168 najlepszy serwer SAMP!!1 1 join1!! 8080","/^[0-2](*)?[0-5](*)?[0-5](*).(*)[0-2](*)?[0-5](*)?[0-5](*).(*)[0-2](*)?[0-5](*)?[0-5](*).(*)[0-2](*)?[0-5](*)?[0-5](*)?$/"))
{
print("Cannot send message");
}
else
{
print("New message for everyone! :)");
}
and some other not working regexes.
If you don't want to complicate your life checking the exact ranges, the simple regex would be:
/^.*(\d)+.+(\d)+.+(\d)+.+(\d)+.+(\d)+.*$/
The first four (\d)+ parts can be replaced with more complicated check for 0-255 range:
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
the last (\d)+ replace with next for port range check:
(6553[0-5]|655[0-2]\d|65[0-4]\d\d|6[0-4]\d\d\d|[1-5]\d\d\d\d|[1-9]\d{0,3})
An exact, simple, and direct representation of your pattern as a regular expression is not possible in the general case. The reason are the number ranges. Something like "at this place any integral number with a value from a to b" is just to complex. A regular expression is executed by a finite state machine and these (theoretical) beasts are (basically) only able to look at strings character by character. Therefore you can match something like "ignore all characters until you find the first digit, then check whether the first digit is followed by at most two more digits".
As a workaround you may try to build a list of alternations of possible digit patterns that covers your desired range of values (in the extreme case list every single value like \b(?:1|2|3|4|...|154|155|...|255)\b). I have a pattern for the range 0-255, but I have none for the range of possible port numbers. So a first approximation may be (really, this is only an approximation and not thoroughly tested):
\b(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b.*\b(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b.*\b(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b.*\b(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b[^0-9]*[0-9]{1,5}
In the above pattern (?: .... ) means a shy group (not remembered for back references) and \b means word boundary.
I'd suggest you read up on Regex syntax. For starters . is special and matches any character. Also doing something like [0-2][0-5][0-5] won't catch something like 192 as 9 is not within 0-5.
According to your requirements here's a Regex that should roughly do what you want
([0-2]?\d{1,2}).*([0-2]?\d{1,2}).*([0-2]?\d{1,2}).*([0-2]?\d{1,2}).*(\d{1,5})?
Each of the ([0-2]?\d{1,2}) portions will match 1 or 2 digits preceded optionally with a 0,1, or 2. Each () will capture a group which you can then examine using a Regex engine. You will need to examine this group as the Regex for each of those portions will match numbers above 255 (specifically 256-299).
The last group (\d{1,5})? is to catch the port number, again you will have to examine this as it will catch any 1 to 5 digit number (hence the {1,5}). The ? makes the group optional, remove it if you want it to have to match against a port number.
As far as doing Regex in C, I haven't had much experience but there should be a way to get all the grouped matches and inspect them. Unfortunately they will be strings so you will have to convert them to integers to examine them.
Are you sure you need regex for this? In my opinion, you do not need regex for this.
Just split numbers into groups which are seperated by non-numeric characters. Then analyze.
What language?
As for actually looking for valid range, take a look at this;
http://www.regular-expressions.info/numericranges.html
I would do this simple regex
((\d|\D)+)*