regex - capture group - regex

I am trying to write a regex to match the following at the beginning of a new line:
- a number followed by a closing parenthesis, e.g. 2) or 8)
- a number followed by a period, e.g. 5.
- the character '-'
- the character '*'
the following strings should match
"1. Sorting function. If you have a long checklist it's very difficult."
"5) This is another example"
"-this is yet another one"
"* last item in the list"
I have tried this but it doesn't quite get me what I am looking for.
re.findall(r'(?m)\s*^[-*(\d.)(\d\))]',item)

Try
re.findall(r'^\s*(\d+(\)|\.)|-|\*)', item, re.MULTILINE)
It will match all sequences of numbers followed by a closing parenthesis or period as well as dashes and stars at the beginning of the line.
Example: https://regex101.com/r/cR2lZ5/6
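As a quick check, the pattern above can be run against the four sample strings from the question. This is only a sketch: the fifth line and the variable names are made up for illustration, and finditer with group(1) is used instead of findall so that each result is the marker itself rather than a tuple of groups.

```python
import re

# Pattern from the answer above; re.MULTILINE makes ^ match at every line start.
PATTERN = re.compile(r'^\s*(\d+(\)|\.)|-|\*)', re.MULTILINE)

text = "\n".join([
    "1. Sorting function. If you have a long checklist it's very difficult.",
    "5) This is another example",
    "-this is yet another one",
    "* last item in the list",
    "no marker on this line",  # extra line added here to show a non-match
])

# group(1) is the list marker; leading whitespace matched by \s* is excluded.
markers = [m.group(1) for m in PATTERN.finditer(text)]
print(markers)  # ['1.', '5)', '-', '*']
```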

Assuming that your quote marks " are not included, and that each line is a separate string,
^\d\.|^\d\)|^\-|^\*
Would be the regular expression. | is OR, \d is a digit, and you escape the special characters ".", ")", "-", and "*" by putting a backslash in front of them.
You can test your regular expressions here. Good luck!

Related

How to find the first occurrence of sub-strings not ended with specified characters

I'm gonna select the first occurrence of an only-alphabet string which is not ended by any of the characters ".", ":" and ";"
For example:
"float a bbc 10" --> "float"
"float.h" --> null
"float:: namespace" --> "namespace"
"float;" --> null
I came up with the regex \G([A-z]+)(?![:;\.]) but it only ignores the character before the banned characters, while I need it to skip all string before banned characters.
You may use
/(?<!\S)[A-Za-z]++(?![:;.])/
See the regex demo. Make sure not to use the g modifier to get the first match only.
The main trick here is to use a possessive ++ quantifier to match all consecutive letters and check for :, ; or . only once, right after the last of the matched letters.
Pattern details
(?<!\S) - either whitespace or start of string should immediately precede the current location
[A-Za-z]++ - 1+ letters matched possessively allowing no backtracking into the pattern
(?![:;.]) - a negative lookahead that fails the match if there is a ;, : or . immediately to the right of the current location.
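For readers testing this in Python: the re module only accepts possessive quantifiers from version 3.11 on, so the sketch below swaps in a different but equivalent technique. It adds A-Za-z to the negative lookahead, which prevents the engine from backtracking to a shorter run of letters. The helper name first_word is hypothetical.

```python
import re

# Assumed portable stand-in for (?<!\S)[A-Za-z]++(?![:;.]):
# putting A-Za-z into the lookahead means a shortened run of letters
# (followed by another letter) can never be accepted as a match.
FIRST_WORD = re.compile(r'(?<!\S)[A-Za-z]+(?![A-Za-z:;.])')

def first_word(text):
    """Return the first qualifying word, or None if there is none."""
    match = FIRST_WORD.search(text)  # search() stops at the first match
    return match.group(0) if match else None

for sample in ["float a bbc 10", "float.h", "float:: namespace", "float;"]:
    print(repr(sample), "-->", first_word(sample))
```

The four samples from the question yield float, None, namespace and None respectively.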

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here)
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK I forgot the . (dot) sign however even if I do not have a . (dot) sign it would not work. Not sure if it has anything to do when using the + sign used more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
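The same idea can be sketched outside Notepad++. Python's re module supports neither \K nor \h, so this version makes two assumptions: a capture group stands in for \K, and [ \t] stands in for \h. The sample records are the ones from the question.

```python
import re

records = "\n".join([
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
])

# (.+) is greedy, so it keeps as much of the line as possible and leaves
# only the last whitespace-separated quoted field to the rest of the pattern.
trimmed = re.sub(r'^(.+)[ \t]+".*?"$', r'\1', records, flags=re.MULTILINE)
print(trimmed)
```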
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so probably easier opening in excel or parsing using code as a CSV.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
Not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot. It's usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of a space followed by +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
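To try the stricter pattern outside the editor, here is a minimal Python sketch using re.sub with the same search and replace strings. The header line in the sample data is invented to show that lines not matching the 12-digit / 5-digit layout are left alone.

```python
import re

SEARCH = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$', re.MULTILINE)

lines = "\n".join([
    'timestamp "id" "code" "title"',  # made-up header, does not match
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00362" "113.1LI" "Abcd"',
])

# \1 keeps the first three fields; the rest of the line is dropped.
cleaned = SEARCH.sub(r'\1', lines)
print(cleaned)
```

Unlike a pattern that trims the last quoted field of any line, this one only touches records with the expected shape.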
If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ spaces or tabs
"[^"\r\n]+" Match 1+ characters other than " or a newline, between "
In the replacement use group 1 $1

Regex expression for digit followed by dot (.)

I want to find text with a digit followed by a dot and replace it with the same text (digit with dot) plus the string "xyz".
For ex.
1. This is a sample
2. test
3. string
**I want to change it to**
1.xyz This is a sample
2.xyz test
3.xyz string
I learnt how to find the matching text (\d.) but the challenge is to find the replace with text.
I'm using notepad ++ editor for this, can anyone suggest the "Replace with" string.
First of all, you need to escape the dot since it means "match anything (except newline depending if the s modifier is set)": (\d\.).
Second, you need to add a quantifier in case you have a 2 digit number or more: (\d+\.).
Third, we don't need group 1 in this case: \d+\..
In the replacement, it's quite simple: just use $0xyz. $0 will refer to group 0 which is the whole match.
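For comparison, the same whole-match replacement can be sketched in Python, where the whole match is written \g<0> rather than $0. The fourth input line is added here to show a two-digit number.

```python
import re

text = "1. This is a sample\n2. test\n3. string\n10. tenth item"

# \d+\. matches one or more digits followed by a literal dot;
# \g<0> re-inserts the whole match before the appended "xyz".
result = re.sub(r'\d+\.', r'\g<0>xyz', text)
print(result)
```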
For notepad++...
You must escape the period/dot character in the expression - precede it with a backslash:
\.
In my case, I needed to find all instances of "{EnvironmentName}.api.mycompany.com"
(dev.api.mycompany.com, stage.api.mycompany.com, prod.api.mycompany, etc.)
I used this search expression:
.*\.api\.mycompany\.com
I think the right answer is as follows:
Find: ^(\d)([.])(\s)
Replace: $1$2XYZ$3
That will work for "n. " where "n" is a single digit [0-9]. If the input should accept numbers of different lengths like 10, 100, 1000..., multiple dots "." after the number, or multiple spaces after the dot, then the answer is:
Find: ^(\d*)([.])([.]*)(\s*)
Replace: $1$2XYZ$4
Input:
1. This is a sample
2. test
3. string
30. string
10..... string
50005... string
Output:
1.XYZ This is a sample
2.XYZ test
3.XYZ string
30.XYZ string
10.XYZ string
50005.XYZ string

How do I find a particular word followed by a space followed by a number?

Using regular expressions, how do I find a particular word followed by a space followed by a number?
Example:
Bug 125
Where "Bug" should always be the first word found in a line of text, followed by a space and then a number and nothing else.
In other words, I don't want to find "Bug 125" as written within some paragraph in the same text file I am parsing.
I haven't tried much because I am terrible at regular expressions. Any help is appreciated.
Sounds like you want this:
^Bug [0-9]+$
Where:
^ = Starts with
[0-9]+ = One or more digits
$ = Ends with this
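A short Python sketch of the anchored pattern, assuming the file is read as one multi-line string (the sample text is invented). re.MULTILINE is needed so that ^ and $ apply to each line rather than to the whole string.

```python
import re

text = "\n".join([
    "Bug 125",
    "This paragraph mentions Bug 125 in passing.",
    "Bug 7 is fixed",
    "Bug 42",
])

# Only lines consisting of exactly "Bug", one space and digits are returned.
hits = re.findall(r'^Bug [0-9]+$', text, flags=re.MULTILINE)
print(hits)  # ['Bug 125', 'Bug 42']
```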
This should do it:
^Bug [0-9]+$
"^" - beginning of the line
"Bug " - the word and follow-on space you want to match
"[0-9]" - a digit 0-9
"+" - one or more (digits)
"$" - end of the line

changing RegEx from 3 digits to 4

I'm not that great at RegEx, and have the following piece of code on my hands:
value.replace(/\s*.*(\d+[,\.]\d+)[^\d]*/m, "$1");
Now it works great at reducing this "\r\n\t\t\t\t& #36;0.05 USD\t\t\t" (please note I've intentionally left a space between the & and # as removing it converts it to a dollar sign on the site) to this "0.05". The issue I have is that if the number is a double digit (10.05 rather than 0.05) the expression removes the digit from the front and still outputs 0.05 rather than 10.05.
From what I can see in the expression, it's hard coded to pick up just 3 digits, so I was wondering if there's a way to amend it to also work in cases where there are 4 digits.
The .* after /\s* is greedy and swallows the leading digit(s) when there are 2 or more digits before the separator. Remove it and see if it works...
value.replace(/\s*(\d+[,.]\d+)[^\d]/m, "$1");
Given your example of the regex:
/\s*.*(\d+[,.]\d+)[^\d]/m
And the data:
\r\n\t\t\t\t&#36;0.05 USD\t\t\t
\r\n\t\t\t\t&#36;10.05 USD\t\t\t
In the regex, the leading "/" (forward-slash), and the "/" before the "m" delimits the regex and is not part of the matching.
The "\s" in the regex is shorthand for [ \t\r\n\f] which matches whitespace (space, tab, Carriage-return, Line-feed, Form-feed). So, "\s*" will match "\r\n\t\t\t\t"
The "." (dot) in the regex matches any single character (generally any character except "\n").
The "*" following the "." says to match any 0 or more characters. So, together the ".*", matches the "&#36;" (and possibly, additionally, one or more digits... see below).
Next, the "(" in the regex starts the part of the regex that will "capture" part of your data.
The "\d" in the regex will match any single digit. In JavaScript "\d" matches only [0-9]; in some other engines it also matches other digit characters, like the Eastern Arabic numerals "٠١٢٣٤٥٦٧٨٩".
The "+" following the "\d" says to match any 1 or more numbers (digits).
The "[,.]" in the regex will match one of either a literal "." (dot), or a "," (comma), to match the "decimal" separator.
Another "\d+" to match any 1 or more numbers (digits).
Next, the ")" in the regex closes the part of the regex that will "capture" part of your data.
The "[^\d]" will match any 1 character that is not a number (digit). So, in this case, it will match the " " (space).
The "m" at the end of the regex (following the second "/"): "m" changes the behavior of the "^" and "$" anchors, which are not used in your regex, so the "m" should have no effect. But, if you're using Ruby, "m" changes the behavior of the "." (dot).
Now, the "problem"... the ".*" (before the "("), is in regex terms, "greedy". This means it will match as "early" as possible, and for as "long" as possible. So, if there is more than 1 digit following the ";", then the ".*" will consume some digits.
Note: Using ".*" can cause all sorts of problems, especially with "/m" under Ruby. It's best to avoid using ".*" if possible.
There are 2 ways to fix this.
1) If the part before the number you want to capture is always "&#36;", then specify that in the regex instead of the ".*". So like this:
/\s*&#36;(\d+[,.]\d+)[^\d]/m
or, if it will always be "&#36;" or something very similar to that:
/\s*[^;]+;(\d+[,.]\d+)[^\d]/m
Here, "[^;]+;" means any string of 1 or more characters that does not contain a ";", followed by a ";".
2) If the part before the number you want to capture, which is shown as "&#36;", could be totally different in the data, then you just need to make sure that the part of the regex that is currently ".*" will not match a digit in the last position. So like this:
/\s[^.,]*[^\d](\d+[,.]\d+)[^\d]/m
Here, "[^.,]*[^\d]" means any string of 0 or more characters that does not contain a "." (dot) or a "," (comma), followed by one character that is not a digit.
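The greedy behaviour is easy to reproduce. The sketch below uses Python rather than the JavaScript of the question, so details such as the m flag are left out, and it assumes the question's trailing [^\d]* so that the whole string is replaced. The constant names are made up for illustration.

```python
import re

GREEDY = r'\s*.*(\d+[,.]\d+)[^\d]*'        # pattern from the question
ANCHORED = r'\s*[^;]+;(\d+[,.]\d+)[^\d]*'  # option 1, with ";" as a marker

single = "\r\n\t\t\t\t&#36;0.05 USD\t\t\t"
double = "\r\n\t\t\t\t&#36;10.05 USD\t\t\t"

# .* grabs as much as it can and gives back only what \d+ strictly needs,
# so the leading "1" of "10.05" is lost.
print(re.sub(GREEDY, r'\1', single))    # 0.05
print(re.sub(GREEDY, r'\1', double))    # 0.05

# [^;]+; cannot run past the semicolon, so all leading digits are captured.
print(re.sub(ANCHORED, r'\1', single))  # 0.05
print(re.sub(ANCHORED, r'\1', double))  # 10.05
```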
Try this
value.replace( /\s*.(\d+[,.]\d+)[^\d]/m, "$1");
The .* matches greedily and therefore matches as many characters, including digits, as it can, as long as the rest of the pattern can still match.
The rest of the pattern can still match if just one digit is left for the \d+ to match, so you only end up with one digit there.
If the semicolon in your example is always in that position in the strings you wish to match, use it as a marker like this
value.replace(/.*;(\d+[,\.]\d+).*/m, "$1");