vim match group of numbers and replace - regex

I have a large file with data in this format:
regabc123456_user_domain_application_env_id
regdef789101_user_domain_application_env_id
In vim I want to do a search and replace ("_" with ", ") and match the machine name (regabc123456).
I am trying this:
:%s/^reg.*\{6}_/^reg.*\{6},\ /g
^ is for the beginning of the line, 'reg' because every line starts with it, then '.*' for anything after that but before the six-digit code, which I am trying to catch with {6}.
This doesn't seem to be doing what I want. I can match the machine name, but I can't replace it with what I want. Is there an easier way to identify the machine name with regular expressions? For example:
'reg' followed by three lowercase letters, followed by six digits, followed by an underscore, then replace?
Thanks.

The regex below replaces regabc123456_ with regabc123456,
:%s/^\(reg.*[0-9]\{6\}\)_/\1,/g
OR
:%s/^\(reg[a-z]\{3\}[0-9]\{6\}\)_/\1,/g
If you want a space after the comma, add a space after the comma in the replacement part:
:%s/^\(reg[a-z]\{3\}[0-9]\{6\}\)_/\1, /g
To match a 6-digit number, you need to use [0-9]\{6\}. The \{6\} repeats the previous token exactly 6 times.
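For comparison outside Vim, here is a minimal Python sketch of the same substitution. The function name is made up for illustration; note that Python writes the group and the quantifier without backslashes, where Vim's default syntax needs \( \) and \{6\}.

```python
import re

# Same idea as the Vim command: capture the machine name, then put it back
# followed by ", " in place of the underscore.
MACHINE = re.compile(r"^(reg[a-z]{3}[0-9]{6})_", re.MULTILINE)

def split_machine_name(text: str) -> str:
    """Replace only the underscore that directly follows the machine name."""
    return MACHINE.sub(r"\1, ", text)

if __name__ == "__main__":
    print(split_machine_name("regabc123456_user_domain_application_env_id"))
```

Only the first underscore on each line changes, because the pattern is anchored at the start of the line.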

Related

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g.
IN                                    OUT
OK THT PHP This is it 06222021        This is it
NO MTM PYT Get this content 111111    Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY
You don't need 4 groups. You can use a single capture group, refer to it as group 1 in the replacement, and match 6-8 digits for the last part instead of only 6.
Note that this \w{0,2} will also match an empty string, you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1, match any char as few times as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Example code
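Imports System.Text.RegularExpressions
' Keep only capture group 1: the text between the three leading words and the trailing digits.
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)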
Output
This is it
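The same single-group replacement can be sketched in Python, in case it helps to try the pattern outside .NET. The function name is an assumption for illustration; Python writes the group reference as \1 rather than $1.

```python
import re

# Three short leading words, the captured middle part, then 6-8 trailing digits.
PATTERN = re.compile(r"^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$")

def keep_middle(s: str) -> str:
    """Replace the whole line with capture group 1; non-matching input is returned unchanged."""
    return PATTERN.sub(r"\1", s)

if __name__ == "__main__":
    print(keep_middle("OK THT PHP This is it 06222021"))
    print(keep_middle("NO MTM PYT Get this content 111111"))
```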
My approach does not use groups and does not use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this also works if we have more than 6 digits, because the lookahead is satisfied as soon as it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value; // input is the string to search
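One portability note, as a hedged sketch: the lookbehind above has variable width, which .NET accepts, but Python's built-in re module only accepts fixed-width lookbehinds and rejects this pattern at compile time. A rough Python equivalent keeps the lookahead and swaps the lookbehind for an ordinary prefix plus one capture group. The function names are made up for illustration.

```python
import re

# Prefix is matched normally (non-capturing), the wanted text is group 1,
# and the trailing digits stay in a lookahead as in the .NET pattern.
PATTERN = re.compile(r"^(?:\w+\s){3}(.*?)(?=\s\d+\s?$)")

def extract(s: str):
    """Return the text between the three leading words and the trailing digits, or None."""
    m = PATTERN.search(s)
    return m.group(1) if m else None

def lookbehind_compiles() -> bool:
    """Check whether this Python accepts the variable-width lookbehind used in .NET."""
    try:
        re.compile(r"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)")
        return True
    except re.error:
        return False

if __name__ == "__main__":
    print(extract("OK THT PHP This is it 06222021"))
    print(lookbehind_compiles())
```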

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs, I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion", etc.; there are many possibilities here).
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) character; however, even without a dot in the data it would not work. I'm not sure if it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
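As far as I know, Python's built-in re module supports neither \K nor \h, so a rough Python equivalent captures the part to keep in a group and writes [ \t]+ for the horizontal whitespace. This is only a sketch of the same idea; the function name is made up.

```python
import re

# Group 1 is everything before the last quoted field; the greedy .+ backtracks
# until what follows is whitespace plus one quoted field at the end of the line.
LAST_QUOTED = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)

def drop_last_quoted(text: str) -> str:
    """Remove the final quoted field (and the whitespace before it) from each line."""
    return LAST_QUOTED.sub(r"\1", text)

if __name__ == "__main__":
    print(drop_last_quoted('201910031044 "00059" "11.31AG" "Senior Champion"'))
```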
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV in code.
You have a problem with the count of your characters:
You specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
You can specify exactly 12 digits by using {12} or, if it could be 12-17, {12,17}. I'll assume exactly 12 based on the current data.
Similarly, for the second column you specify exactly 8 digits surrounded by quotes ("[0-9]{8,8}"), but it only has 5 digits surrounded by quotes.
Again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
Finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes: "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more, using + as the quantifier: "[a-zA-Z0-9]+". If you can have zero or more, you can use *, and for any other count from m to n you can use {m,n} as before.
Not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot. It's usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+".
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of a literal space followed by +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
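To try the final pattern outside Notepad++, here is a minimal Python sketch. The function name is an assumption for illustration; re.MULTILINE makes ^ and $ work per line, which is roughly how Notepad++ treats them.

```python
import re

# 12 digits, a quoted 5-digit field, a quoted alphanumeric field that may
# contain dots, then anything else up to the end of the line.
RECORD = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$', re.MULTILINE)

def trim_records(text: str) -> str:
    """Keep only the first three fields of matching lines; other lines stay as they are."""
    return RECORD.sub(r"\1", text)

if __name__ == "__main__":
    print(trim_records('201910031044 "00362" "113.1LI" "Abcd"'))
```

Lines that do not fit the 12-digit / 5-digit layout are left untouched, which is the main difference from a pattern that simply strips the last quoted field.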
If you want to get rid of the last words/numbers/etc in the last quotation marks, you could capture what comes before them in a group, then match the last pair of quotation marks and everything between them using a negated character class, and remove that match.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2 or more spaces or tabs
"[^"\r\n]+" Match ", then 1+ characters other than " or a newline, then "
In the replacement, use group 1: $1

Regex capture group that excludes optional substring?

I'm trying to construct a regex to extract Swedish organization numbers from data. These numbers can be of the following formats:
999999999999 // 12 digits, first two should be ignored.
9999999999 // 10 digits, all should be included.
99999999-9999 // 12 digits with a dash, first two digits and the dash should be ignored
999999-9999 // 10 digits with a dash, dash should be ignored.
For the 12 digit cases, the first two digits are always 16, 19 or 20. My current attempt is:
(?:16|19|20)?(\d{6}\-?\d{4})
This will return a ten digit organization number in $1, but it will contain the dash if it's present. I want the dash to be stripped (or possibly added if it's missing), so that $1 has the same format regardless of dash or no dash in the input.
The regex is in a config and will be used in code that simply extracts $1, so I can't solve this in code - I need the regex to do it "by itself".
As a last resort, I could modify the code to allow config to specify a "replace string" in addition to the search regex, and have the code use the result of the replace as the end result of the extraction. In that case I could use this:
Regex: (?:16|19|20)?(\d{6})\-?(\d{4})
Replace string: $1$2
But this causes other problems, because for other config items, the regex will return multiple "data fields", one for each capture group. To get this to work I would need, in that case, to provide a sequence of replace strings, e.g. for a tab separated format with organization number in the middle:
Regex: ([^\t]*)\t(?:16|19|20)?(\d{6})\-?(\d{4})\t([\d]*)
Replace string 1: $1 (free text field)
Replace string 2: $2-$3 (the organization number with dash "enforced")
Replace string 3: $4 (numeric field)
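The "sequence of replace strings" idea described above can be sketched roughly as follows in Python. The function name, the config constants and the sample record are all made up for illustration; the templates use Python's \1 syntax where the question writes $1.

```python
import re

def extract_fields(pattern: str, templates: list, text: str):
    """Return one output field per template, or None if the pattern does not match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # Match.expand fills group references such as \2 into each template.
    return [m.expand(t) for t in templates]

# Hypothetical config entry: free text, organization number, numeric field.
PATTERN = r"([^\t]*)\t(?:16|19|20)?(\d{6})-?(\d{4})\t(\d*)"
TEMPLATES = [r"\1", r"\2-\3", r"\4"]

if __name__ == "__main__":
    # Made-up sample record, tab separated.
    print(extract_fields(PATTERN, TEMPLATES, "Acme AB\t195566778899\t42"))
```

With this layout the dash is enforced in the second field whether or not the input had one, at the cost of keeping one template per output field in the config.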
Workable, but rather awkward... So, any way to solve it within the search regex?

Regular Expression beginning of string with special characters

Using this as an example string
+$43073$7
I need the 5-digit sequence from it. I'm using the regex
@"\$+(?<lot>\d{5})"
which matches any +$ in the string. I tried
@"^\$+(?<lot>\d{5})"
as the +$ is always at the beginning of the string. What will work?
If you use the anchor ^, you need to include the + symbol first, and don't forget to escape it, because + is a special metacharacter in regex that repeats the previous token one or more times.
@"^\+\$(?<lot>\d{5})"
And without the anchor, it would be like
@"\$(?<lot>\d{5})"
Then get the 5-digit number you want from group index 1.
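A minimal Python sketch of the anchored version, for comparison. The function name is made up; note that Python spells the named group (?P<lot>...) where .NET uses (?<lot>...).

```python
import re

# Literal "+" and "$" at the start of the string, then the five digits we want.
LOT = re.compile(r"^\+\$(?P<lot>\d{5})")

def extract_lot(s: str):
    """Return the 5-digit lot number, or None when the string does not start with +$."""
    m = LOT.search(s)
    # The same value is also available by index, as m.group(1).
    return m.group("lot") if m else None

if __name__ == "__main__":
    print(extract_lot("+$43073$7"))
```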
I would match what you want:
\d+
or if you only want digits after "special" characters at the start of input:
^\W+(\d+)
grabbing group 1

Regex expression for digit followed by dot (.)

I want to find text consisting of a digit followed by a dot and replace it with the same text (digit with dot) plus the string "xyz".
For ex.
1. This is a sample
2. test
3. string
I want to change it to
1.xyz This is a sample
2.xyz test
3.xyz string
I learnt how to find the matching text (\d.), but the challenge is working out the "Replace with" text.
I'm using the Notepad++ editor for this. Can anyone suggest the "Replace with" string?
First of all, you need to escape the dot, since an unescaped dot means "match anything" (except newline, unless the s modifier is set): (\d\.).
Second, you need to add a quantifier in case you have a 2 digit number or more: (\d+\.).
Third, we don't need group 1 in this case: \d+\..
In the replacement, it's quite simple: just use $0xyz. $0 will refer to group 0 which is the whole match.
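The same whole-match replacement can be sketched in Python, where the whole match is written \g<0> instead of $0. The function name is an assumption for illustration.

```python
import re

# One or more digits followed by a literal dot; no capture group is needed
# because the replacement refers to the whole match.
NUMBERED = re.compile(r"\d+\.")

def tag_numbers(text: str) -> str:
    """Append "xyz" after every "digits followed by a dot" occurrence."""
    return NUMBERED.sub(r"\g<0>xyz", text)

if __name__ == "__main__":
    print(tag_numbers("1. This is a sample\n2. test\n3. string"))
```

Because the pattern is not anchored, it also matches inside the line (for example the "3." in "3.14"); adding ^ with re.MULTILINE would restrict it to line starts.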
For notepad++...
You must escape the period/dot character in the expression - precede it with a backslash:
\.
In my case, I needed to find all instances of "{EnvironmentName}.api.mycompany.com"
(dev.api.mycompany.com, stage.api.mycompany.com, prod.api.mycompany, etc.)
I used this search expression:
.*\.api.mycompany.com
I think the right answer is as follows:
Find: ^(\d)([.])(\s)
Replace: $1$2XYZ$3
That works for "n. " where n is a single digit [0-9]. If the input should also accept numbers of different lengths like 10, 100, 1000..., multiple dots after the number, or multiple spaces after the dot, then the answer is:
Find: ^(\d*)([.])([.]*)(\s*)
Replace: $1$2XYZ$4
Input:
1. This is a sample
2. test
3. string
30. string
10..... string
50005... string
Output:
1.XYZ This is a sample
2.XYZ test
3.XYZ string
30.XYZ string
10.XYZ string
50005.XYZ string