How can I use regular expressions to insert commas into large integers? - regex

I have a text document with a lot of large integers, e.g. 123456789. I want to automatically insert commas into these to make them more readable: 123,456,789. However, my document also contains decimals, and these should remain untouched. Is there a regular expression that will insert these commas? An answer to a similar question suggested (?<=\d)(?=(\d\d\d)+(?!\d)), but this also matches inside decimal numbers. What's more, I am unable to insert the commas using either Notepad++ or Overleaf. What should I replace this regex with?

If you don't want to touch the decimals, you could use (*SKIP)(*FAIL): first match a dot followed by 1+ digits, which consumes the characters that should not be part of the match, and then discard that match.
(Tested on Notepad++ 7.7.1)
\.\d+(*SKIP)(*FAIL)|\B(?=(?:\d{3})+(?!\d))
In the replacement use a comma ,
In parts
\.\d+(*SKIP)(*FAIL) Match a dot literally and 1+ digits (match to be left untouched)
| Or
\B Anchor that matches where \b does not match
(?= Positive lookahead, assert what is directly on the right is
(?:\d{3})+ Repeat 1+ times matching 3 digits
(?!\d) Negative lookahead, assert what is directly on the right is not a digit
) Close lookahead
Regex demo
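Python's built-in re module does not support (*SKIP)(*FAIL), so the same idea needs a small workaround there. What follows is only a sketch, not part of the tested Notepad++ answer: the decimal part is matched as a real alternative and handed back unchanged by a replacement function, so only the empty matches turn into commas. The helper name add_commas is made up for illustration.

```python
import re

# Alternative 1 consumes a decimal part such as ".56789" so it can be kept as-is.
# Alternative 2 is the zero-width position where a comma belongs.
PATTERN = re.compile(r"\.\d+|\B(?=(?:\d{3})+(?!\d))")


def add_commas(text: str) -> str:
    """Insert thousands separators into integer parts only (hypothetical helper)."""
    # m.group(0) is "" at a comma position and ".digits" for a decimal part,
    # so decimal parts are returned unchanged and empty matches become ",".
    return PATTERN.sub(lambda m: m.group(0) or ",", text)


print(add_commas("123456789 and 1234.56789"))
# 123,456,789 and 1,234.56789
```

Like the original pattern, this still groups the integer part of a decimal number (1234.56789 becomes 1,234.56789) and leaves only the digits after the dot alone.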

My guess is that maybe,
(?<=\d)(?=(?:\d{3})+(?!\d|\.))
or
(?!^)(?=(?:\d{3})+(?!\.|\d))
Demo 2
or
\d+\.\d*(*SKIP)(*FAIL)|(?!^)(?=(?:\d{3})+(?!\.|\d))
Demo 3
might be close to what you're trying to write; simply replace each match with a comma.
If you wish to simplify, modify, or explore the expression, it is explained in the top right panel of regex101.com. If you'd like, you can also see in this link how it matches against some sample inputs.

Related

Using regex to find abbreviations

I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to RegEx and I am having difficulties creating an expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capitalised letters. The expression should also be able to pick up words where a dash has been used in between and report the whole word (both before and after the dash). If numbers are also present, they should also be reported with the word.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to only give me words that have at least two capitalised letters. I understand that it will also "mistakenly" take words such as "Abc-Abc", but I don't believe there is a way to avoid these.
If a lookahead is supported and you don't want to match double -- you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally the allowed characters, followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also not match when there are hyphens surrounding the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
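Since the question is about Python, here is a minimal sketch of how the first pattern above could be checked against the sample words from the question. The helper names is_abbreviation and find_abbreviations are made up for illustration.

```python
import re

# The first pattern from the answer: the lookahead requires two uppercase
# letters, with only lowercase letters, digits, or hyphens in between.
ABBREVIATION = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)


def is_abbreviation(word: str) -> bool:
    """True if the whole word is matched by the pattern (hypothetical helper)."""
    return ABBREVIATION.fullmatch(word) is not None


def find_abbreviations(text: str) -> list[str]:
    """Return every candidate abbreviation found in a longer string."""
    return ABBREVIATION.findall(text)


wanted = ["ABC", "AbC", "ABc", "A-ABC", "a-ABC", "ABC-a", "ABC123", "ABC-123", "123-ABC"]
unwanted = ["A-bc", "a-b-c"]

print(all(is_abbreviation(w) for w in wanted))    # True
print(any(is_abbreviation(w) for w in unwanted))  # False
print(find_abbreviations("ABC AbC A-bc a-b-c ABC-123 123-ABC"))
# ['ABC', 'AbC', 'ABC-123', '123-ABC']
```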

Regex with negative lookahead limited length

I have some text that looks like this:
UPPERCASE TEXT {wildcard amount of text} {Anchor word}
With this pattern repeating multiple times. I want to extract these multiple matches, which I can do with
[A-Z][A-Z ]+.+anchor
However I don't want it to match if there is UPPERCASE text within the wildcard text. I can check for this with a negative lookahead
[A-Z][A-Z ]+(?!.+[A-Z][A-Z ]+).+anchor
However the lookahead matches with all the other matches and cancels out. I can put limits on the size of the lookahead however sometimes the distance between uppercase words and the anchor is small and sometimes it is large, so I can't match everything.
In your negative lookahead you don't need the + quantifier. You want it to fail if there are two or more uppercase characters, which is equivalent to failing if there are two.
Your negative lookahead needs to be tested at every character of the intermediate section.
https://regex101.com/r/tcxch6/1
Why not just match uppercase letters, then space, then any amount of not-uppercase characters, then space then the anchor?
/([A-Z ]+) ([^A-Z]+) (anchor)/
I guess the problem is that if the text is
UPPERCASE TEXT {wildcard OTHERTEXT etc} anchor
Then this will find OTHERTEXT as the first match. Maybe the answer is to fix it to the start of the line, like this
/^([A-Z ]+) ([^A-Z]+) (anchor)/
If that isn't right, I recommend giving some more examples of the input and the required matches, because the question isn't all that clear at the moment.
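As a rough illustration of the start-of-line variant, here is a small Python sketch. It assumes each record sits on its own line and that the anchor word is literally "anchor", as in the question's placeholder. The function name split_line is made up.

```python
import re

# Start-anchored variant: uppercase header, then text without any
# uppercase letters, then the literal anchor word.
LINE = re.compile(r"^([A-Z ]+) ([^A-Z]+) (anchor)")


def split_line(line: str):
    """Return (header, wildcard text, anchor), or None if the line is rejected."""
    m = LINE.match(line)
    return m.groups() if m else None


print(split_line("UPPERCASE TEXT some wildcard text anchor"))
# ('UPPERCASE TEXT', 'some wildcard text', 'anchor')
print(split_line("UPPERCASE TEXT some OTHERTEXT etc anchor"))
# None
```

The second line is rejected because [^A-Z]+ cannot cross OTHERTEXT, which is the behaviour the question asks for.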

RegEx: Excluding a pattern from the match

I know some basics of RegEx but am not a pro at it, and I am still learning. Currently, I am using the following very simple regex to match any digit in the given sentence.
\d
Now I want to match all digits except those that are part of certain patterns, such as e074663, e123444 or e7736. So for the following input,
Edit 398e997979 the Expression 9798729889 & T900980980098ext to see e081815 matches. Roll over matches or e081815 the expression e081815 for details.e081815 PCRE & JavaScript flavors of RegEx are e081815 supported. Validate your expression with Tests mode e081815.
Only the bold digits should be matched, and not any e081815. I tried the following without success.
(^[e\d])(\d)
Also, going forward, some more patterns need to be added for exclusion, e.g. cg636553 or cg(any digits). Any help in this regard will be much appreciated. Thanks!
Try this:
(?<!\be)(?<!\d)\d+
Test it live on regex101.com.
Explanation:
(?<!\be) # make sure we're not right after a word boundary and "e"
(?<!\d) # make sure we're not right after a digit
\d+ # match one or more digits
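A short Python sketch of the pattern above, run on a shortened version of the sample input. The second pattern, with an extra lookbehind for the cg prefix, is not from the answer. It is one possible way to add the further exclusions the question mentions.

```python
import re

# Pattern from the answer: digit runs that do not directly follow a
# word-initial "e" and do not start in the middle of a number.
DIGITS = re.compile(r"(?<!\be)(?<!\d)\d+")

# Assumed extension (not from the answer): also exclude a word-initial "cg".
DIGITS_EXT = re.compile(r"(?<!\be)(?<!\bcg)(?<!\d)\d+")

text = "Edit 398e997979 the 9798729889 & T900980098ext to see e081815 matches."
print(DIGITS.findall(text))
# ['398', '997979', '9798729889', '900980098']

print(DIGITS_EXT.findall("cg636553 x42 e7736 17"))
# ['42', '17']
```

Note that 997979 is still matched: the e in 398e997979 is not at a word boundary, so the first lookbehind does not exclude it.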
If you want to match individual digits, you can achieve that using the \G anchor that matches at the position after a successful match:
(?:(?<!\be)(?<=\D)|\G)\d
Test it here
Another option is to use a capturing group with lookarounds
(?:\b(?!e|cg)|(?<=\d)\D)[A-Za-z]?(\d+)
(?: Non capture group
\b(?!e|cg) Word boundary, assert what is directly to the right is not e or cg
| Or
(?<=\d)\D Match any char except a digit, asserting what is directly on the left is a digit
) Close group
[A-Za-z]? Match an optional char a-zA-Z
(\d+) Capture 1 or more digits in group 1
Regex demo

Regex to capture a group of delimited words that must end with a specific word

I'm normalizing a bunch of Ansible group names, which have to change to use underscores instead of hyphens (thanks, Ansible). However, there's tons of other stuff in the file that is hyphenated, so I want to leave those lines alone. The ones I want to change always end with -servers. So, with a small sample, we might have:
foo-bar
foo-bar-servers
foo-bar-baz-servers
(\w)-(\w?)? very nicely captures things so I can just sub to $1_$2 to change the hyphens to underscores. However, as soon as I add -servers or ervers on the end, it grabs only the very last pair around the hyphen. I have tried many variations, read up a little on lookaheads, and I am thoroughly stumped. It seems like it ought to be simple. What is the magic incantation to match all the groups around the hyphens, for lines ending in -servers? Many thanks in advance.
Edit: desired results, with apologies:
foo-bar
foo_bar_servers
foo_bar_baz_servers
As long as your regex engine supports positive lookaheads and (fixed-length) positive lookbehinds (as do most engines, including PCRE (PHP) and Python, for example), you may use the following regular expression to match the desired hyphens, which may then be replaced with underscores.
(?<=\w)-(?=(?:\w+-)*servers$)
Demo
The regex engine performs the following operations.
(?<=\w) match a word char in a positive lookbehind
- match a hyphen
(?= begin a positive lookahead
(?:\w+-) match 1+ word chars then '-', in a non-capture group
* execute non-capture group 0+ times
servers match string
$ match end of line
) end positive lookahead
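In Python, the steps above could be sketched as follows. re.MULTILINE is needed so that $ matches at the end of every line rather than only at the end of the whole text. The function name normalize is made up for illustration.

```python
import re

# Hyphens that are followed by zero or more "word-" chunks and then
# "servers" at the end of the line.
HYPHEN = re.compile(r"(?<=\w)-(?=(?:\w+-)*servers$)", re.MULTILINE)


def normalize(text: str) -> str:
    """Replace hyphens with underscores on lines ending in -servers."""
    return HYPHEN.sub("_", text)


sample = "foo-bar\nfoo-bar-servers\nfoo-bar-baz-servers"
print(normalize(sample))
# foo-bar
# foo_bar_servers
# foo_bar_baz_servers
```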

Regular expression to match last number in a string

I need to extract the last number that is inside a string. I'm trying to do this with regex and negative lookaheads, but it's not working. This is the regex that I have:
\d+(?!\d+)
And these are some strings, just to give you an idea, and what the regex should match:
ARRAY[123] matches 123
ARRAY[123].ITEM[4] matches 4
B:1000 matches 1000
B:1000.10 matches 10
And so on. The regex matches the numbers, but all of them. I don't get why the negative lookahead is not working. Anyone care to explain?
Your regex \d+(?!\d+) says
match any number if it is not immediately followed by a number.
which is incorrect. A number is last if it is not followed (following it anywhere, not just immediately) by any other number.
When translated to regex we have:
(\d+)(?!.*\d)
Rubular Link
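To make the difference concrete, here is a small Python sketch that runs both the original and the corrected pattern on the sample strings from the question. The helper name last_number is made up.

```python
import re


def last_number(s: str):
    """Return the last run of digits in s, or None (hypothetical helper)."""
    # (?!.*\d) rejects any candidate that has a digit anywhere after it.
    m = re.search(r"(\d+)(?!.*\d)", s)
    return m.group(1) if m else None


for s in ["ARRAY[123]", "ARRAY[123].ITEM[4]", "B:1000", "B:1000.10"]:
    print(s, "->", last_number(s))
# ARRAY[123] -> 123
# ARRAY[123].ITEM[4] -> 4
# B:1000 -> 1000
# B:1000.10 -> 10

# The original pattern only checks the character right after each number,
# so every number in the string is matched.
print(re.findall(r"\d+(?!\d+)", "ARRAY[123].ITEM[4]"))
# ['123', '4']
```

Without the re.DOTALL flag the dot does not match a newline, so on multi-line input this finds the last number of each line rather than of the whole text.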
I took it this way: you need to make sure the match is close enough to the end of the string; close enough in the sense that only non-digits may intervene. What I suggest is the following:
/(\d+)\D*\z/
\z at the end means that that is the end of the string.
\D* before that means that an arbitrary number of non-digits can intervene between the match and the end of the string.
(\d+) is the matching part. It is in parentheses so that you can pick it up, as was pointed out by Cameron.
You can use
.*(?:\D|^)(\d+)
to get the last number; this is because the matcher will gobble up all the characters with .*, then backtrack to the first non-digit character or the start of the string, then match the final group of digits.
Your negative lookahead isn't working because on the string "1 3", for example, the 1 is matched by the \d+, then the space matches the negative lookahead (since it's not a sequence of one or more digits). The 3 is never even looked at.
Note that your example regex doesn't have any groups in it, so I'm not sure how you were extracting the number.
I still had issues with managing the capture groups
(for example, if using Inline Modifiers (?imsxXU)).
This worked for my purposes -
.(?:\D|^)\d(\D)