Finding numbers with decimal comma in a TeX-File using regular expressions - regex

I have a very large TeX-File wherein we set mathematical text. Unfortunately, TeX sets a empty space after each comma, even though numbers (in Germany) are formatted without space after comma.
I am aware that there are packages that solve this problem automatically, but I tried to find such commas with a regular expression before the TeX formatter would do its thing, so I could wrap these commas in braces. For example, I want 1,2 to become 1{,}2. But my regular expression approach does not work as expected:
I used "[0-9],[0-9]" resp. "\d,\d" to look for all places where two digits are separated with a decimal comma. That works fine. But I was wondering why for input "1,2,3" it only finds the match "1,2" but not "2,3".
There must be something I don't quite understand about regular expressions. How would I have to alter the expression to find overlapping instances of the search pattern?

Once a character has been matched, it is not matched again. As in your example "2" is matched, the "cursor" of the regex engine has now passed that "2", and so the next match can only be found in ",3" -- and there no match is found.
To resolve that, use a look-ahead assertion for the decimal digits without actually capturing them:
\d,(?=\d)
The matches will be "1," and "2," and now you can make the replacement as you want (adding braces), knowing these matches guarantee there is a digit following.

Related

Regular expression to check strings containing a set of words separated by a delimiter

As the title says, I'm trying to build up a regular expression that can recognize strings with this format:
word!!cat!!DOG!! ... Phone!!home!!
where !! is used as a delimiter. Each word must have a length between 1 and 5 characters. Empty words are not allowed, i.e. no strings like !!,!!!! etc.
A word can only contain alphabetical characters between a and z (case insensitive). After each word I expect to find the special delimiter !!.
I came up with the solution below but since I need to add other controls (e.g. words can contain spaces) I would like to know if I'm on the right way.
(([a-zA-Z]{1,5})([!]{2}))+
Also note that empty strings are not allowed, hence the use of +
Help and advices are very welcome since I just started learning how to build regular expressions. I run some tests using http://regexr.com/ and it seems to be okay but I want to be sure. Thank you!
Examples that shouldn't match:
a!!b!!aaaaaa!!
a123!!b!!c!!
aAaa!!bbb
aAaa!!bbb!
Splitting the string and using the values between the !!
It depends on what you want to do with the regular expression. If you want to match the values between the !!, here are two ways:
Matching with groups
([^!]+)!!
[^!]+ requires at least 1 character other than !
!! instead of [!]{2} because it is the same but much more readable
Matching with lookahead
If you only want to match the actual word (and not the two !), you can do this by using a positive lookahead:
[^!]+(?=!!)
(?=) is a positive lookahead. It requires everything inside, i.e. here !!, to be directly after the previous match. It however won't be in the resulting match.
Here is a live example.
Validating the string
If you however want to check the validity of the whole string, then you need something like this:
^([^!]+!!)+$
^ start of the string
$ end of the string
It requires the whole string to contain only ([^!]+!!) one or more than one times.
If [^!] does not fit your requirements, you can of course replace it with [a-zA-Z] or similar.

Using regex to match multiple comma separated words

I am trying to find the appropriate regex pattern that allows me to pick out whole words either starting with or ending with a comma, but leave out numbers. I've come up with ([\w]+,) which matches the first word followed by a comma, so in something like:
red,1,yellow,4
red, will match, but I am trying to find a solution that will match like like the following:
red, 1 ,yellow, 4
I haven't been able to find anything that can break strings up like this, but hopefully you'll be able to help!
This regex
,?[a-zA-Z][a-zA-Z0-9]*,?
Matches 'words' optionally enclose with commas. No spaces between commas and the 'word' are permitted and the word must start with an alphanumeric.
See here for a demo.
To ascertain that at least one comma is matched, use the alternation syntax:
(,[a-zA-Z][a-zA-Z0-9]*|[a-zA-Z][a-zA-Z0-9]*,)
Unfortunately no regex engine that i am aware of supports cascaded matching. However, since you usually operate with regexen in the context of programming environments, you could repeatedly match against a regex and take the matched substring for further matches. This can be achieved by chaining or iterated function calls using speical delimiter chars (which must be guaranteed not to occur in the test strings).
Example (Javascript):
"red, 1 ,yellow, 4, red1, 1yellow yellow"
.replace(/(,?[a-zA-Z][a-zA-Z0-9]*,?)/g, "<$1>")
.replace(/<[^,>]+>/g, "")
.replace(/>[^>]+(<|$)/g, "> $1")
.replace(/^[^<]+</g, "<")
In this example, the (simple) regex is tested for first. The call returns a sequence of preliminary matches delimted by angle brackets. Matches that do not contain the required substring (, in this case) are eliminated, as is all intervening material.
This technique might produce code that is easier to maintain than a complicated regex.
However, as a rule of thumb, if your regex gets too complicated to be easily maintained, a good guess is that it hasn't been the right tool in the first place (Many engines provide the x matching modifier that allows you to intersperse whitespace - namely line breaks and spaces - and comments at will).
The issue with your expression is that:
- \w resolves to this: [a-zA-Z0-9_]. This includes numeric data which you do not want.
- You have the comma at the end, this will match foo, but not ,foo.
To fix this, you can do something like so: (,\s*[a-z]+)|([a-z]+\s*,). An example is available here.

Using Regex to find repeating groups in phone numbers

I'm looking for a way to use regex to search for obviously false phone numbers that have the same digit repeating. The numbers are all formatted and stored as follows:
(111)111-1111
I'm not able to alter the text in any way.
I've tried modifying a few of the regex lines I've seen such as:
^([0-9])\1{2}.\1{3}.\1{4}$
which was for finding repeating digits with a period in between the numbers. However, I haven't figured out how to get around the first character as a parenthesis.
Any help would be appreciated!
You misunderstand the purpose of the . Dot Operator. It is not to match a period, it matches anything. In that (quite badly) regex, it serves only to skip the - – and because it matches anything, it will also match something like 11121113111.
Use this regexp instead:
^\(?([0-9])\1{2}\)?\1{3}-?\1{4}$
This checks for parentheses around the first group, optionally so it will still work without; and specifically checks for the presence of a dash between the second and third group of digits, also optionally.

Regex how to match two similar numbers in separate match groups?

I got the following string:
[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS:
20.000
the bold numbers should be matched, each into its own capture group.
My current expression is (\d+.\d{3}) which matches 4.126 how can I match my 20.000 now into a second capture group? Adding the same capture group again makes it find nothing. So what I basically need is, "search for first number, then ignore everything until you find next digit."
You could use something like so: (\d+\.\d{3}).+?(\d+\.\d{3})$ (example here) which essentially is your regex (plus a minor fix) twice, with the difference that it will also look for the same pattern again at the end of the string.
Another minor note, your regex contains, a potential issue in which you are matching the decimal point with the period character. In regular expression language, the period character means any character, thus your expression would also match 4s222. Adding an extra \ in front makes the regex engine treat is as an actual character, and not a special one.

positive look ahead and replace

Recently I'm writing/testing regexps on https://regex101.com/.
My question is: Is it possible to do a positive look-ahead AND a replacement in the same "replacement"? Or just limited kind of replacement is possible.
Input is several lines with phone numbers. Let's say the correct phone number where the number of "numbers" are 11. No matter how the numbers are divided/group together with - / characters, no matter if starts with + 00 or it is omitted.
Some example lines:
+48301234567
+48/30/1234567
+48-30-12-345-67
+483011223344556677
0048301234567
+(48)30/1234567
Positive look-ahead able to check if from the beginning until the end of line there are only 11 digits, regardless how many other, above specified character separating them. This works perfectly.
Where the positive look-ahead check is fine, I would like to delete every character but numbers. The replacement works fine until I'm not involving look-ahead.
Checking the regexp itself working perfectly ("gm" modes):
^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$
Checking the replace part works perfectly (replace to nothing):
[^\d\n]
Put this into look-ahead, after the deletion of non new-line and non-digit characters from the matching lines:
(?=^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$)[^\d\n]
Even I put the ^ $ into look-ahead, seems the replacement working only from beginning of the lines until the very first digit.
I know in real life the replacement and the check should/would go separate ways, however I'm curious if I could mix look-ahead/look-behind with string operations like replace, delete, take the string apart and put together as I like.
UPDATE: This is what would do the trick, however I feel this one "ugly" a bit. Is there any prettier solution?
https://regex101.com/r/yT5dA4/2
Or the version which I asked originally, where only digits remains: regex101.com/r/yT5dA4/3
You cannot replace/delete text with regex. Regex is just a tool for matching certain strings and then taking certain action depending on the matching text, eg. perform a substitution, retrieve the second capture group.
However it is possible to perform certain decisions within a regex engine, by using conditionals. The common syntax for this, with a lookahead assertion, is (?(?=regex)then|else).
With conditionals you can change the behaviour depending on how the text matches the regex. For your example you could do something like:
^(\+)?(?(1)\(|\d)
If the phone number starts with a plus it must be followed by a bracket, else it should start with a digit. Although in your situation, this is not very useful.
If you want to read up more on conditionals in regex you can do so here.