Removing comma between numbers in CSV using regex in Sublime - regex

I'm very new to regex. Pardon me for silly questions.
I was wondering if it was possible to use regex pattern matcher to replace commas in between numbers such as, $3,542 with $3542 in Sublime Editor.
I tried to use [0-9],[0-9][0-9][0-9] to detect all such occurrences but don't know why I can't retain just numbers :/
Puzzled me!

You may use capturing groups to retain digits:
(\$\d+),(\d+)
and replace with $1$2. You may remove \$ if you do not care if it is a currency or not.
The (\$\d+),(\d+) regex matches:
(\$\d+) - Group 1 matching $ as a literal symbol followed with 1 or more digits
, - a literal comma
(\d+) - Group 2 matching 1 or more digits
The $1 and $2 are backreferences that retrieve the text stored in the memory buffers for both groups.
Note that there are other ways to do the same: you can use lookarounds, a regex with \K, or both, but capturing seems to me the most efficient solution for this case.
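As a rough illustration outside Sublime, the capturing-group replacement can be sketched with Python's re module. The helper name strip_comma is made up for this example, and Python writes the backreferences as \g<1> and \g<2> rather than $1 and $2.

```python
import re

# Same pattern as above: Group 1 is "$" plus digits, Group 2 is the digits after the comma.
PATTERN = re.compile(r"(\$\d+),(\d+)")

def strip_comma(text: str) -> str:
    # Hypothetical helper: re-insert both captured groups, dropping the comma between them.
    return PATTERN.sub(r"\g<1>\g<2>", text)

print(strip_comma("Total: $3,542"))
```

Like the editor version, this removes one comma per dollar amount, which is enough for values such as $3,542.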

Ctrl + H, select "regular expression" (Alt + R) and replace:
\$\d+\K,(?=\d)
with nothing.
Explanation:
\$\d+\K matches a dollar sign followed by one or more digits; \K then resets the start of the match, so the text matched so far is dropped from the result (it behaves much like a variable-length positive lookbehind). The next token, ",", matches a comma, and finally the positive lookahead (?=\d) requires a digit right after the comma without consuming it.
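Python's built-in re module does not support \K, so a sketch of the same idea there has to capture the prefix and put it back. The helper name drop_comma is an assumption made for this example.

```python
import re

# No \K in Python's re: capture "$" plus digits instead, and keep the lookahead for the next digit.
PATTERN = re.compile(r"(\$\d+),(?=\d)")

def drop_comma(text: str) -> str:
    # Hypothetical helper: the comma is the only matched text that is not re-inserted.
    return PATTERN.sub(r"\g<1>", text)

print(drop_comma("$3,542"))
```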

Related

Regex - extract last term between _ and before . from path

This is the regex that I'm currently testing
[\w\. ]+(?=[\.])
My ultimate goal is to include a regex expression to extract using regexp_extract in Impala/Hive query.
regexp_extract(col, '[\w\. ]+(?=[\.])', 1)
This doesn't work in Impala however.
Examples of path to extract from:
D:\mypath\Temp\abs\device\Program1.lua
D:\mypath\Temp\abs\device\SE1_Test-program.lua
D:\mypath\Temp\abs\device\Test_program.lua
D:\mypath\Temp\abs\device\Device_Test_Case-general.lua
The regex I've tested extracts the term I'm looking for but it's not good enough, for the second and third, fourth cases I would need to extract only the part after the last underscore.
My expectations are:
Program1
Test-program
program
Case-general
Any suggestions? I'm also open to using something other than regexp_extract.
Note that Impala regex does not support lookarounds, and thus you need a capturing group to get a submatch out of the overall match. Also, if you use escaping \ in the pattern, make sure it is doubled.
You can use
regexp_extract(col, '([^-_\\\\]+)\\.\\w+$', 1)
See the regex demo.
The regex means
([^-_\\]+) - Group 1: one or more chars other than -, _ and \
\. - a dot
\w+ - one or more word chars
$ - end of string.
\w also matches an underscore, so you can use [a-zA-Z0-9] instead.
Add a dot and a hyphen to the character class, capture it in group 1, and then match the expected trailing dot.
Note that you don't have to escape dots in a character class.
([a-zA-Z0-9.-]+)[.]
See a regex101 demo
Example using regexp_extract where the , 1 gets the group 1 value:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.]', 1)
If the match should only be at the end of the string, match the last dot followed by characters that are neither backslashes nor dots:
regexp_extract(col, '([a-zA-Z0-9.-]+)[.][^\\\\.]+$', 1)
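To check the end-anchored pattern against the sample paths without an Impala cluster, here is a minimal sketch in Python. It assumes Python's re module treats this particular pattern the same way Impala does, and last_term is a hypothetical helper name. In a Python raw string the backslash inside the class is written once as \\, whereas the Impala string literal doubles it again.

```python
import re

# Regex-level pattern from the regexp_extract call above.
PATTERN = re.compile(r"([a-zA-Z0-9.-]+)[.][^\\.]+$")

def last_term(path: str):
    # Hypothetical helper: return group 1, or None when nothing matches.
    m = PATTERN.search(path)
    return m.group(1) if m else None

paths = [
    r"D:\mypath\Temp\abs\device\Program1.lua",
    r"D:\mypath\Temp\abs\device\SE1_Test-program.lua",
    r"D:\mypath\Temp\abs\device\Test_program.lua",
    r"D:\mypath\Temp\abs\device\Device_Test_Case-general.lua",
]

for p in paths:
    print(last_term(p))
```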

How can I remove something from the middle of a string with regex?

I have strings which look like this:
/xxxxx/xxxxx-xxxx-xxxx-338200.html
With my regex:
(?<=-)(\d+)(?=\.html)
It matches just the numbers before .html.
Is it possible to write a regex that matches everything that surrounds the numbers (matches the .html part and the part before the numbers)?
In your current pattern you already use a capturing group. In that case you might also match what comes before and after instead of using the lookarounds
-(\d+)\.html
To get what comes before and after the digits, you could use 2 capturing groups:
^(.*-)\d+(\.html)$
Regex demo
In the replacement use the 2 groups.
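A small sketch of the two-group replacement in Python, using the sample string from the question. The helper name remove_number is an assumption made for this example.

```python
import re

# Group 1: everything up to and including the last "-"; Group 2: the ".html" suffix.
PATTERN = re.compile(r"^(.*-)\d+(\.html)$")

def remove_number(url: str) -> str:
    # Hypothetical helper: keep both groups and drop the digits between them.
    return PATTERN.sub(r"\g<1>\g<2>", url)

print(remove_number("/xxxxx/xxxxx-xxxx-xxxx-338200.html"))
```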
This should do the job:
.*-\d+\.html
Explanation: .* matches anything up to the last -, \d+ matches the sequence of digits that follows it, and \.html matches the literal .html (where \. represents the character .).
To capture groups, just do (.*-)(\d+)(\.html). This will put everything before the number in a group, the number in another group and everything after the number in another group.

How can I use regular expressions to insert commas into large integers?

I have a text document with a lot of large integers, e.g. 123456789. I want to automatically insert commas into these to make them more readable: 123,456,789. However, my document also contains decimals, and these should remain untouched. Is there a regular expression that will insert these? An answer on a similar question suggested (?<=\d)(?=(\d\d\d)+(?!\d)), but this also detects decimal numbers. What's more, I am unable to insert the commas using either Notepad++ or Overleaf. What should I replace this regex with?
If you don't want to touch the decimals you could use (*SKIP)(*FAIL) to match a dot and 1+ digits to consume the characters that should not be part of the match.
(Tested on Notepad++ 7.7.1)
\.\d+(*SKIP)(*FAIL)|\B(?=(?:\d{3})+(?!\d))
In the replacement use a comma ,
In parts
\.\d+(*SKIP)(*FAIL) Match a dot literally and 1+ digits (match to be left untouched)
| Or
\B Anchor that matches where \b does not match
(?= Positive lookahead, assert what is directly on the right is
(?:\d{3})+ Repeat 1+ times matching 3 digits
(?!\d) Negative lookahead, assert what is directly on the right is not a digit
) Close lookahead
Regex demo
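Python's built-in re module has no (*SKIP)(*FAIL), but the same "consume what must stay untouched" idea can be approximated with a replacement function. In this sketch the decimal part is matched by the first alternative and put back unchanged, while an empty match from the second alternative becomes a comma. The helper name add_commas is an assumption made for this example.

```python
import re

# First alternative consumes a decimal part; the second finds the zero-width comma positions.
PATTERN = re.compile(r"\.\d+|\B(?=(?:\d{3})+(?!\d))")

def add_commas(text: str) -> str:
    # Non-empty match: a decimal part, returned as is.
    # Empty match: a position inside an integer where a comma belongs.
    return PATTERN.sub(lambda m: m.group(0) or ",", text)

print(add_commas("Pi is 3.14159265 and n is 123456789"))
```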
My guess is that maybe,
(?<=\d)(?=(?:\d{3})+(?!\d|\.))
or
(?!^)(?=(?:\d{3})+(?!\.|\d))
Demo 2
or
\d+\.\d*(*SKIP)(*FAIL)|(?!^)(?=(?:\d{3})+(?!\.|\d))
Demo 3
might be close to what you're trying to write; simply replace each match with a comma.
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.

XML Regex - Negative match

I have a problem with negative lookahead in XSD pattern.
When I specified:
<xs:pattern value="^(?!(00|\+\d))\d{6,}$"/>
then I got an error message:
Value '^(?!(00|\+\d))\d{6,}$' is not a valid XML regular expression.
Any idea why it does not work?
In online javascript validator it works fine (e.g. here under unit tests section click on "run test").
I need to validate phone numbers. The phone number cannot include international prefixes (+\d) and (00).
Thanks
Try the following regex:
[1-9][0-9]{5,}|0[1-9][0-9]{4,}
This matches a number which does not begin with zero and is followed by five or more digits (including zero), and it also matches a number which starts with zero, is not immediately followed by another zero, and then has four or more digits 0-9.
I will add my deleted comment as an answer:
([1-9][0-9]|[0-9][1-9])[0-9]{4,}
See the regex demo.
The regex should work well for your scenario because
([1-9][0-9]|[0-9][1-9]) - matches either a digit from 1 to 9 followed by any digit, or (|) any digit followed by any digit but 0, making up 2 digits
[0-9]{4,} - matches four or more digits.
This pattern only matches a full/entire string because all regex patterns inside XSD pattern are anchored by default (so, you do not have to and can't enclose the pattern with ^ and $).
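Outside an XSD validator, the implicit anchoring can be imitated in Python with re.fullmatch, which only succeeds when the whole string matches. This is just a sketch for trying out the pattern, not a replacement for schema validation, and is_valid_phone is a hypothetical helper name.

```python
import re

# Same pattern as above, written without ^ and $.
PATTERN = re.compile(r"([1-9][0-9]|[0-9][1-9])[0-9]{4,}")

def is_valid_phone(value: str) -> bool:
    # fullmatch stands in for XSD's rule that a pattern must cover the entire value.
    return PATTERN.fullmatch(value) is not None

for sample in ["123456", "012345", "001234", "+1234567", "12345"]:
    print(sample, is_valid_phone(sample))
```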
Right, there is no lookaround support in XSD regex (neither lookaheads nor lookbehinds). Besides, XSD regex has other interesting limitations/features:
^ and $ are not supported as anchors (patterns are anchored implicitly)
Non-capturing groups like (?:...) are not supported (use capturing ones instead)
/ should not be escaped, do not use \/
\d should be written as [0-9] to only match ASCII digits (same as in .NET)
Back-references like \1, \2 are not supported.
No word boundaries are supported either.
See some more XSD regex description at regular-expressions.info.

How to replace only part of found text?

I have a file with some comma-separated names and some comma-separated account numbers.
Names will always be something like Dow, John and numbers like 012394,19862.
Using Notepad++'s "Regex Find" feature, I'd like to replace commas between numbers with pipes |.
Basically, turn:
Dow,John     into: Dow,John
12345,09876  into: 12345|09876
13568,08642  into: 13568|08642
I've been using [0-9], to find the commas, but I can't get it to properly leave the number's last digit and replace just the comma.
Any ideas?
Search for ([0-9]), and replace it with \1|. Does that work?
use this regex
(\d),(\d)
and replace it with
$1|$2
OR
\1|\2
(?<=\d), should work. Oddly enough, this only works if I use replace all, but not if I use replace single. As an alternative, you can use (\d), and replace with $1|
General thoughts about replacing only part of a match
In order to replace a part of a match, you need to either 1) use capturing groups in the regex pattern and backreferences to the kept group values in the replacement pattern, or 2) lookarounds, or 3) a \K operator to discard left-hand context.
So, if you have a string like a = 10, and you want to replace the number after a = with, say, 500, you can
find (a = )\d+ and replace with \1500 / ${1}500 (if you use the $n backreference syntax and it is followed by a digit, you should wrap the number in braces)
find (?<=a = )\d+ and replace with 500 (since (?<=...) is a non-consuming positive lookbehind pattern, the text it matches is not added to the match value and hence is not replaced)
find a = \K\d+ and replace with 500 (where \K makes the regex engine "forget" the text it has matched up to the \K position, making it similar to the lookbehind solution, but allowing any quantifiers, e.g. a\h*=\h*\K\d+ will match even if there is any amount of horizontal whitespace, including none, around =).
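The first two approaches can be tried in Python as a quick sketch. It assumes the sample string is "a = 10" with a space after the equals sign, so the patterns here include that space. Python's re has no \K, so the third approach is left out, and the \g<1> form is used so that the group number cannot run into the digits that follow it.

```python
import re

text = "a = 10"

# 1) Capturing group: keep "a = " via \g<1>, then append the new number.
by_group = re.sub(r"(a = )\d+", r"\g<1>500", text)

# 2) Lookbehind: "a = " is only asserted, so just the digits are replaced.
by_lookbehind = re.sub(r"(?<=a = )\d+", "500", text)

print(by_group)
print(by_lookbehind)
```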
Current problem solution
In order to replace any comma in between two digits, you should use lookarounds:
Find What: (?<=\d),(?=\d)
Replace With: |
Details:
(?<=\d) - a positive lookbehind that requires a digit immediately to the left of the current location
, - a comma
(?=\d) - a positive lookahead that requires a digit immediately to the right of the current location.
See the regex demo.
Variations:
Find What: (\d),(?=\d)
Replace With: \1|
Find What: \d\K,(?=\d)
Replace With: |
Note: if there are comma-separated single digits, e.g. 1,2,3,4, you can't use (\d),(\d), since each match consumes the digit after the comma, the next match cannot start on that digit, and so only every other comma is replaced.
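The difference between the lookaround version and the two-group version can be seen in a short Python sketch. The helper names are made up for this example, and Python uses \g<1> where Notepad++ uses \1 or $1.

```python
import re

def pipes_lookaround(text: str) -> str:
    # Digits on both sides are only asserted, never consumed.
    return re.sub(r"(?<=\d),(?=\d)", "|", text)

def pipes_two_groups(text: str) -> str:
    # Both digits are consumed, so neighbouring matches cannot share a digit.
    return re.sub(r"(\d),(\d)", r"\g<1>|\g<2>", text)

print(pipes_lookaround("Dow,John"))     # names are left alone
print(pipes_lookaround("12345,09876"))
print(pipes_lookaround("1,2,3,4"))
print(pipes_two_groups("1,2,3,4"))      # misses the comma between 2 and 3
```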