regex - 4 digit number match and replace

I want to match and replace a number of four-digit numbers in a CSV file
1,1456,2,3,4,5
2,1455,2,3,4,5
so that all 1400 numbers in the second column are mapped to the range of two hundred
1456 -> 256
1455 -> 255
I have this regex to match the 1400 numbers
',[1][4][0-9][0-9],'
but how can I define the replacement so that it retains the last two digits of the match?
EDIT
Ended up changing the match regex to
,[1][4]([0-9][0-9])
and the replacement defined as
,2\1
in Notepad++

Replace /14(\d{2})/ with 2\1, where \1 is a backreference to the first capture group. Adapt to your regex flavor of choice.

sed -e 's/,[1][4]\([0-9][0-9]\),/,2\1,/'
Notice how the \( \) syntax captures a part of the matched expression, and \1 is used to say "the first captured data".

You need to use a backreference - by surrounding one or more parts of a regex in parentheses, you can later reference them in the output. Here is my final version (works with sed -r).
's/,[1][4]([0-9][0-9])/,2\1/'

You should use a group, i.e. something like
',[1][4]([0-9][0-9]),'
Some regex dialects will let you name groups, e.g. in .NET
',[1][4](?<LastTwoDigits>[0-9][0-9]),'
If you specify which language you are using, it will be easier to help you.
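For anyone doing this in a script rather than an editor, here is a minimal Python sketch of the same capture-group idea. The function name `remap_1400s` is made up for illustration, and `count=1` is my assumption that only the first `,14dd,` field on each line (the second column in the sample data) should change.

```python
import re

def remap_1400s(line: str) -> str:
    """Turn the first ',14dd,' field into ',2dd,', keeping the last two digits."""
    # (\d\d) captures the last two digits; \1 puts them back after the literal 2.
    return re.sub(r",14(\d\d),", r",2\1,", line, count=1)

rows = ["1,1456,2,3,4,5", "2,1455,2,3,4,5"]
converted = [remap_1400s(row) for row in rows]
print(converted)
```

Lines without a matching field are returned unchanged.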

Related

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces, i.e. the text/number after the date until the next whitespace? All I know is that the date will always have the same format, and there will always be a space after the date and then a space after the number I want to extract.
Can someone help me out with this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
                      ^^^
One or more non-whitespace characters are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
                     ^^
NOTE: the \K and capture group approaches allow one or more spaces after the date and are thus more flexible.
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
So the .+ matches anything plus the first space
then the \d+/\d+/\d+ is the date parsing plus a space
the capturing group is the number, as you can see I made the last part optional, so both floating point values and normal values can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.
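As a rough illustration of the capture-group approach in code, here is a small Python sketch. The names `DATE_THEN_TOKEN` and `token_after_date` are hypothetical, and the pattern assumes the date always looks like dd/dd/dd, as the question states.

```python
import re
from typing import Optional

# Date-like part, one or more spaces, then the next run of non-whitespace.
DATE_THEN_TOKEN = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

def token_after_date(line: str) -> Optional[str]:
    """Return the token that follows the date, or None if there is no date."""
    match = DATE_THEN_TOKEN.search(line)
    return match.group(1) if match else None

print(token_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(token_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```

Because the group is `\S+` rather than a number pattern, it returns whatever token follows the date, so validate it separately if it must be numeric.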

Fetch one out of two Numbers out of String

I have a list of strings, such as: Ø20X400
I need to extract the first of the numbers - between Ø and X
I've come so far to match the numbers in general with \d+ - as simple as it is...
But I need an expression to get the first value separated, not both of them...
You can use lookarounds (?<=..) and (?=..):
(?<=Ø)\d+(?=X)
or in Java style:
(?<=Ø)\\d+(?=X)
A second way is to use a capture group:
Ø(\d+)X
or
Ø(\\d+)X
Then you can extract the content of the group.
The regex engines I know parse \n as a newline. \d is used for numbers.
The following regex gives you the first number between a Ø and a X in a capture group:
^.*?Ø(\d+)X.*
This Regex will do it for you, (\d+?)X, and here is a Rubular to prove it. See, you want to group digits together, but make it non-greedy, ending the evaluation on X.
Try this one:
\d+(?=\D)
This should find the first number that is followed by a non-digit character.
With normal regular expressions, I would say:
Ø(\d+)X
This finds the Ø character, followed by one or more numbers, followed by an X. Also, the numbers will be stored in the first capture group. Capture groups differ from one regex implementation to another, but this would typically be denoted by \1. Capture group zero, \0, is usually the matched string itself. In this version, \d denotes digits 0-9, but if your regex engine uses \n for that purpose, use:
Ø(\n+)X
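Both the capture-group and the lookaround variants can be tried quickly in Python. This is only a sketch; the helper name `first_number` is invented for the example, and it assumes the source file is saved as UTF-8 so that Ø is read correctly.

```python
import re
from typing import Optional

def first_number(text: str) -> Optional[int]:
    """Return the number between Ø and X, or None when the pattern is absent."""
    match = re.search(r"Ø(\d+)X", text)  # group 1 holds the digits
    return int(match.group(1)) if match else None

# Lookaround variant: the whole match is the number, no group needed.
lookaround = re.search(r"(?<=Ø)\d+(?=X)", "Ø20X400")

print(first_number("Ø20X400"))
print(lookaround.group(0))
```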

How to replace only part of found text?

I have a file with a some comma separated names and some comma separated account numbers.
Names will always be something like Dow, John and numbers like 012394,19862.
Using Notepad++'s "Regex Find" feature, I'd like to replace commas between numbers with pipes |.
Basically :
turn:          into:
Dow,John       Dow,John
12345,09876    12345|09876
13568,08642    13568|08642
I've been using [0-9], to find the commas, but I can't get it to properly leave the number's last digit and replace just the comma.
Any ideas?
Search for ([0-9]), and replace it with \1|. Does that work?
Use this regex
(\d),(\d)
and replace it with
$1|$2
OR
\1|\2
(?<=\d), should work. Oddly enough, this only works if I use replace all, but not if I use replace single. As an alternative, you can use (\d), and replace with $1|
General thoughts about replacing only part of a match
In order to replace a part of a match, you need to either 1) use capturing groups in the regex pattern and backreferences to the kept group values in the replacement pattern, or 2) lookarounds, or 3) a \K operator to discard left-hand context.
So, if you have a string like a = 10, and you want to replace the number after a = with, say, 500, you can
find (a =)\d+ and replace with \1500 / ${1}500 (if you use $n backreference syntax and it is followed with a digit, you should wrap it with braces)
find (?<=a =)\d+ and replace with 500 (since (?<=...) is a non-consuming positive lookbehind pattern and the text it matches is not added to the match value, and hence is not replaced)
find a =\K\d+ and replace with 500 (where \K makes the regex engine "forget" the text it has matched up to the \K position, making it similar to the lookbehind solution, but allowing any quantifiers, e.g. a\h*=\K\d+ will still work when there is any amount of horizontal whitespace, including none, between a and =).
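To make the first two options concrete, here is a small Python sketch. The patterns include the space after `=` so that they match the sample string `a = 10`, which is my adjustment. Python's standard `re` module does not support `\K`, so that variant is left out. In Python the unambiguous backreference form is `\g<1>`, which plays the same role as `${1}` when a digit follows.

```python
import re

text = "a = 10"

# Option 1: capture the left-hand context and put it back with \g<1>.
# A bare \1 directly followed by 500 would not be read as "group 1, then 500".
with_group = re.sub(r"(a = )\d+", r"\g<1>500", text)

# Option 2: a lookbehind checks the context without consuming it,
# so only the digits are replaced.
with_lookbehind = re.sub(r"(?<=a = )\d+", "500", text)

print(with_group, "|", with_lookbehind)
```

Note that `re` only accepts fixed-width lookbehinds, which is one reason the capture-group form is often the more portable choice.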
Current problem solution
In order to replace any comma in between two digits, you should use lookarounds:
Find What: (?<=\d),(?=\d)
Replace With: |
Details:
(?<=\d) - a positive lookbehind that requires a digit immediately to the left of the current location
, - a comma
(?=\d) - a positive lookahead that requires a digit immediately to the right of the current location.
Variations:
Find What: (\d),(?=\d)
Replace With: \1|
Find What: \d\K,(?=\d)
Replace With: |
Note: if there are comma-separated single digits, e.g. 1,2,3,4, you can't use (\d),(\d): each match consumes the digit after the comma, so the next comma has no digit left to its left and only every other comma is replaced.
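The difference between the lookaround version and the two-group version is easy to see in a short Python sketch. The function names are made up for the comparison, and Python's `re` is standing in for Notepad++'s engine here.

```python
import re

def pipe_with_lookarounds(text: str) -> str:
    """Replace commas that sit between two digits; the digits are not consumed."""
    return re.sub(r"(?<=\d),(?=\d)", "|", text)

def pipe_with_two_groups(text: str) -> str:
    """Same idea with (\\d),(\\d); both digits are consumed by each match."""
    return re.sub(r"(\d),(\d)", r"\1|\2", text)

for sample in ["Dow,John", "12345,09876", "1,2,3,4"]:
    print(sample, "->", pipe_with_lookarounds(sample), "/", pipe_with_two_groups(sample))
```

On `1,2,3,4` the two-group version leaves the middle comma alone, while the lookaround version replaces all three.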

Regular expression to match 3 characters before match

I have a file with lines like
text text text 3424 text text 3423 50 US text 342 text
What I want to match is 50 US (yes, dollars) and ultimately extract that number.
Everything else changes in different lines, there may be more text or less surrounding, but in each line there is only one "US" anchor that I can match.
So what I want to do is to find a way to match US and get the preceding 3 or 4 characters.
Any ideas? Preferably with sed/awk, but any solution will do.
Perl regexes (or anything that understands non-greedy .*? expressions) are easier than sed for this:
perl -pe 's/^.*?(\d+\.?\d*)\s*US.*$/$1/'
That will handle things like "11.23" as well.
\d+ US
This should work given that US is present only once in the string.
Use lookarounds:
\d+(?= US)
This regex matches only the numeric amount. The lookahead (?= US) requires " US" to follow the digits but keeps it out of the match.
This is what you could use in VBA regex flavor, which also supports lookaheads:
" ((\S+)(?= US))"
Starts with a space.
Next is the capture group, (\S+). I use \S+ rather than \d+ so that values like 5,000 and 11.3 work; any run of non-whitespace works, so this returns whatever word or number precedes "US". (A greedy .+ here would start at the first space in the line and swallow everything up to " US", not just the last token.)
Next is the lookahead, so the group only matches when it is immediately followed by " US". The match gives you back the capture group, not the lookahead text.
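If sed/awk is not a hard requirement, the same extraction can be sketched in Python. The function name `amount_before_us` is hypothetical, and the trailing `\b` is my addition so that a word merely starting with "US" is not treated as the anchor.

```python
import re
from typing import Optional

def amount_before_us(line: str) -> Optional[str]:
    """Return the number that directly precedes the 'US' marker, if any."""
    # Digits with an optional decimal part, optional whitespace, then US.
    match = re.search(r"(\d+(?:\.\d+)?)\s*US\b", line)
    return match.group(1) if match else None

line = "text text text 3424 text text 3423 50 US text 342 text"
print(amount_before_us(line))
```

Earlier numbers such as 3424 are skipped because they are not followed by US.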

Matching on repeated substrings in a regex

Is it possible for a regex to match based on other parts of the same regex?
For example, how would I match lines that begin and end with the same sequence of 3 characters, regardless of what the characters are?
Matches:
abcabc
xyz abc xyz
Doesn't Match:
abc123
Undefined: (Can match or not, whichever is easiest)
ababa
a
Ideally, I'd like something in the perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.
Use capture groups and backreferences.
/^(.{3}).*\1$/
The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.
You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
Applying this to your case:
/^(.{3}).*\1$/
(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)
A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):
^ matches the start of the line.
(.{3}) grabs three characters of any type and saves them in a group for later reference.
.* matches anything for as long as possible. (You don't care what's in the middle of the line.)
\1 matches the group that was captured in step 2.
$ matches the end of the line.
For the same characters at the beginning and end:
/^(.{3}).*\1$/
This is a backreference.
This works:
my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;
For matching the beginning and the end you should add ^ and $ anchors.
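The same backreference pattern carries over to other flavors. As a quick check against the examples in the question, here is a Python sketch; `SAME_ENDS` and `same_start_and_end` are names chosen for this example only.

```python
import re

# (.{3}) captures the first three characters; \1 must match them again at the end.
SAME_ENDS = re.compile(r"^(.{3}).*\1$")

def same_start_and_end(line: str) -> bool:
    """True when the line begins and ends with the same three characters."""
    return SAME_ENDS.search(line) is not None

for sample in ["abcabc", "xyz abc xyz", "abc123", "ababa", "a"]:
    print(repr(sample), same_start_and_end(sample))
```

With this pattern the two "undefined" cases, ababa and a, do not match, because the captured prefix and the backreference cannot overlap.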