Regex for matching an exact word

I am trying to match an exact word before the last dot, and whatever follows the last dot should be a number.
(\W*((?i)rocket\.jhagsc\.djagsh(?-i)(.*(?=\.).))\W*)((.*(?=\.).)(\d+))
Example:
rocket.jhagsc.djagsh.465465
It should match.

I would phrase this as:
\brocket\.jhagsc\.djagsh[^.]*\.(?!.*\.)\d.*$
Here is an explanation of the regex pattern:
\brocket\.jhagsc\.djagsh match your exact word (the dots are escaped so that they only match literal dots)
[^.]* then match zero or more non-dot characters (i.e. no dots are allowed here)
\. match the final dot
(?!.*\.) then assert that no more dots occur in the string
\d match a single digit immediately after the final dot
.* consume the remainder of the string
$ end of the string
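As a quick check, the pattern can be tried in Python. This is a minimal sketch, assuming the dots in the literal word are escaped and that re.search is an acceptable way to apply it; the helper name matches and the two extra test strings are made up for illustration.

```python
import re

# Pattern from the answer above, with the literal dots escaped (an assumption).
PATTERN = re.compile(r"\brocket\.jhagsc\.djagsh[^.]*\.(?!.*\.)\d.*$")

def matches(text):
    """Return True if the exact word is followed by a final dot and then a digit."""
    return PATTERN.search(text) is not None

print(matches("rocket.jhagsc.djagsh.465465"))  # example from the question
print(matches("rocket.jhagsc.djagsh.abc"))     # hypothetical: no digit after the last dot
print(matches("rocket.jhagsc.djagsh.12.34"))   # hypothetical: another dot follows
```

Note that \d.* only requires the first character after the last dot to be a digit. If everything after the last dot must be digits, replacing \d.* with \d+ would be one option.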

Related

Regex - All before an underscore, and all between second underscore and the last period?

How do I get everything before the first underscore, and everything between the last underscore and the period in the file extension?
So far, I have everything before the first underscore, not sure what to do after that.
.+?(?=_)
EXAMPLES:
111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf
222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf
DESIRED RESULTS:
111111_END TLD 6-01-20 THR LEWISHS
222222_G URS TO 7.25 2-28-19 SA COOPSHS
You can match the following regular expression that contains no capture groups.
^[^_]*|(?!.*_).*(?=\.)
This expression can be broken down as follows.
^ # match the beginning of the string
[^_]* # match zero or more characters other than an underscore
| # or
(?! # begin negative lookahead
.*_ # match zero or more characters followed by an underscore
) # end negative lookahead
.* # match zero or more characters greedily
(?= # begin positive lookahead
\. # match a period
) # end positive lookahead
.*_ means to match zero or more characters greedily, followed by an underscore. To match greedily (the default) means to match as many characters as possible. Here that includes all underscores (if there are any) before the last one. Similarly, .* followed by (?=\.) means to match zero or more characters, possibly including periods, up to the last period.
Had I written .*?_ (incorrectly) it would match zero or more characters lazily, followed by an underscore. That means it would match as few characters as possible before matching an underscore; that is, it would match zero or more characters up to, but not including, the first underscore.
If instead of capturing the two parts of the string of interest you wanted to remove the two parts of the string you don't want (as suggested by the desired results of your example), you could substitute matches of the following regular expression with empty strings.
_.*_|\.[^.]*$
This regular expression reads, "Match an underscore followed by zero or more characters followed by an underscore, or match a period followed by zero or more characters that are not periods, followed by the end of the string".
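The capture-free expression ^[^_]*|(?!.*_).*(?=\.) can be sketched in Python as follows. The helper name keep_parts and the choice to join the two matches with an underscore are assumptions made to reproduce the desired results from the question. Depending on the Python version, findall may also report an empty match at the final period, so empty strings are filtered out.

```python
import re

# Matches either the text before the first underscore, or the text between
# the last underscore and the last period.
PARTS = re.compile(r"^[^_]*|(?!.*_).*(?=\.)")

def keep_parts(filename):
    # Drop any empty match (one can occur right at the final period).
    parts = [p for p in PARTS.findall(filename) if p]
    return "_".join(parts)

names = [
    "111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf",
    "222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf",
]
for name in names:
    print(keep_parts(name))
```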
You could use 2 capture groups:
^([^_\n]+_).*\b([^\s_]*_.*)(?=\.)
^ Start of string
([^_\n]+_) Capture group 1, match one or more chars other than _ or a newline, followed by a _
.*\b Match the rest of the line and match a word boundary
([^\s_]*_.*) Capture group 2, optionally match any char except _ or a whitespace char, then match _ and the rest of the line
(?=\.) Positive lookahead, assert a . to the right
Another option could be using a non greedy version to get to the first _ and make sure that there are no following underscores and then match the last dot:
^([^_\n]+_).*?(\S*_[^_\n]+)\.[^.\n]+$
Looks like you're very close. You could eliminate the names between the underscores by finding this
(_.+?_)
and replacing the returned value with a single underscore.
I am assuming that you did not intend your second result to include the name MIKE.
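A minimal Python sketch of this find-and-replace idea follows. The replacement itself leaves the file extension in place; dropping it with os.path.splitext is an extra step that is not part of the answer above, and the function name strip_middle is made up for illustration.

```python
import os
import re

def strip_middle(filename):
    # Replace the lazily matched "_..._" section with a single underscore.
    shortened = re.sub(r"(_.+?_)", "_", filename)
    # Assumption: the extension should be removed as well.
    stem, _ext = os.path.splitext(shortened)
    return stem

print(strip_middle("111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf"))
print(strip_middle("222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf"))
```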

Regular expression to start with a number and then numbers or one letter

I have this regular expression
^[0-9]+[a-zA-Z0-9]*
But I need one that always starts with a number, which can then be followed by another number or a letter, but it cannot be number, letter, number. The letter will always be last. Like these examples:
102A OK
1A OK
2 OK
110 OK
10A1 WRONG
BV WRONG
The letter cannot be between two numbers.
You could use a negative lookahead.
^(?!\d+[a-z]\d)\d.*
with the case-insensitive flag set.
A match of this regular expression signifies that the string starts with a digit and does not begin with one or more digits followed by a letter and then a digit. The trailing .* consumes the rest of the line, so a successful match covers the entire string.
The regex engine performs the following operations.
^ match beginning of line
(?! begin negative lookahead
\d+[a-z]\d match digits-letter-digit
) end negative lookahead
\d match a digit
.* match zero or more characters (the rest of the line)
Note that \d at the end must follow the negative lookahead. If the regex were ^\d(?!.\d+[a-z]\d) and the string were 1A1 the negative lookahead would fail to find digit-letter-digit in A1 and the overall match would succeed (incorrectly).
Because the negative lookahead is anchored at the beginning of the line and consumes no characters, when it finds no digits-letter-digit prefix (so the assertion passes), the \d that follows it is also matched at the beginning of the line.
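The lookahead version can be checked against the examples from the question with a short Python sketch. re.IGNORECASE stands in for the case-insensitive flag; the variable names are illustrative only.

```python
import re

# Negative lookahead rejects a digits-letter-digit prefix; \d requires a leading digit.
LOOKAHEAD = re.compile(r"^(?!\d+[a-z]\d)\d.*", re.IGNORECASE)

samples = ["102A", "1A", "2", "110", "10A1", "BV"]
results = {s: bool(LOOKAHEAD.match(s)) for s in samples}
for s, ok in results.items():
    print(s, "OK" if ok else "WRONG")
```

One caveat of this pattern is that it only rules out the digits-letter-digit prefix, so an input such as 1AB would still be accepted.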
You could match a char 1-9 followed by optional digits 0-9 and optional chars a-zA-Z.
If you use [a-zA-Z0-9], the character class will match any of the listed characters in any order.
If you separate the chars and the digits, the letter can not come before the digits and, as the * quantifier matches 0 or more times, you can also match a single digit.
^[1-9][0-9]*[a-zA-Z]*$
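A small Python sketch of this pattern, run against the examples from the question, might look as follows. The second pattern, which uses ? instead of * to allow at most one trailing letter, is an assumption based on the question title and is not part of the answer above.

```python
import re

# Pattern from the answer: leading 1-9, optional digits, optional trailing letters.
SEPARATED = re.compile(r"^[1-9][0-9]*[a-zA-Z]*$")
# Hypothetical stricter variant: at most one trailing letter.
ONE_LETTER = re.compile(r"^[1-9][0-9]*[a-zA-Z]?$")

samples = ["102A", "1A", "2", "110", "10A1", "BV"]
accepted = [s for s in samples if SEPARATED.match(s)]
print(accepted)

print(bool(SEPARATED.match("1AB")), bool(ONE_LETTER.match("1AB")))
```

Because the first character must be 1-9, both patterns also reject numbers with a leading zero.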

Python regex match certain floating point numbers

I'm trying to match: 0 or more numbers followed by a dot followed by ( (0 or more numbers) but not (if followed by a d,D, or _))
Some examples and what should match/not:
match:
['1.0','1.','0.1','.1','1.2345']
not match:
['1d2','1.2d3','1._dp','1.0_dp','1.123165d0','1.132_dp','1D5','1.2356D6']
Currently I have:
"([0-9]*\.)([0-9]*(?!(d|D|_)))"
This correctly matches everything in the match list. But among the strings it should not match, it incorrectly matches:
['1.2d3','1.0_dp','1.123165d0','1.132_dp','1.2356D6']
and correctly does not match on:
['1d2','1._dp','1D5']
So it appears I have a problem with the ([0-9]*(?!(d|D|_))) part, which is trying to not match if there is a d, D or _ after the dot (with zero or more numbers in between). Any suggestions?
Instead of using a negative lookahead, you might use a negated character class to match any character that is not in the character class.
If you only want to match word characters without the dD_ or a whitespace char you could use [^\W_Dd\s].
You might also remove the \W and \s to match all except dD_
^[0-9]*\.[^\W_Dd\s]*$
Explanation
^ Start of string
[0-9]*\. Match 0+ times a digit 0-9 followed by a dot
[^\W_Dd\s]* Negated character class, match 0+ times a word character without _ D d or whitespace char
$ End of string
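The anchored pattern can be checked against the two lists from the question with a short Python sketch. The variable names are illustrative.

```python
import re

# Digits, a dot, then word characters other than _, D and d, up to the end.
FLOAT_LIKE = re.compile(r"^[0-9]*\.[^\W_Dd\s]*$")

should_match = ['1.0', '1.', '0.1', '.1', '1.2345']
should_not = ['1d2', '1.2d3', '1._dp', '1.0_dp', '1.123165d0',
              '1.132_dp', '1D5', '1.2356D6']

matched = [s for s in should_match if FLOAT_LIKE.match(s)]
wrongly_matched = [s for s in should_not if FLOAT_LIKE.match(s)]
print(matched)
print(wrongly_matched)
```

Since the negated class still allows other letters, a string such as 1.abc would also match; using [0-9]* after the dot instead would be one way to restrict it to digits.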
If you don't want to use anchors to assert the start and the end of the string, you could also use lookarounds to assert that there is no non-whitespace char directly to the left or to the right:
(?<!\S)[0-9]*\.[^\W_Dd\s]*(?!\S)
\d*[.](?!.*[_Dd]).* is what you are looking for.
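The following Python sketch illustrates why the pattern from the question still matches strings like 1.2d3, and how the lookahead placed right after the dot avoids that. When the lookahead after [0-9]* fails, the engine backtracks and lets [0-9]* match fewer digits, so the lookahead then sees a digit instead of d. Using re.fullmatch for the second pattern is an assumption about how it is meant to be applied.

```python
import re

# Pattern from the question: the lookahead can be satisfied by backtracking.
original = re.compile(r"([0-9]*\.)([0-9]*(?!(d|D|_)))")
partial = original.search("1.2d3").group(0)
print(repr(partial))  # only the "1." prefix is matched

# Lookahead directly after the dot scans the whole remainder for _, D or d.
alternative = re.compile(r"\d*[.](?!.*[_Dd]).*")
should_match = ['1.0', '1.', '0.1', '.1', '1.2345']
should_not = ['1d2', '1.2d3', '1._dp', '1.0_dp', '1.123165d0',
              '1.132_dp', '1D5', '1.2356D6']
ok = all(alternative.fullmatch(s) for s in should_match)
bad = [s for s in should_not if alternative.fullmatch(s)]
print(ok, bad)
```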

Why is this regex selecting this text

I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
Now the online tester says that with the above regex the mentioned string is accepted (matched). My understanding is that it should not be, because the string contains two digits, 2 and 3, while the regex has only one digit in it, i.e. \d. It should have needed \d+. My regex currently reads: zero or more of any character, followed by one digit, followed by .txt. My question is: why does the above string match the regex?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input: MyFile23.txt here is the breakup:
.* # matches MyFile2
\d # matches 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # will match literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
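The breakdown above can be reproduced in Python with a short sketch. The test string MyFile2_txt is made up to show the effect of the unescaped dot.

```python
import re

loose = re.compile(r"(.*)\d.txt")
m = loose.match("MyFile23.txt")
print(m.group(1))  # the greedy group takes everything up to the last digit

# The unescaped dot accepts any character in place of the period.
print(bool(loose.match("MyFile2_txt")))

# Anchored version: non-digits, exactly one digit, a literal period, txt.
strict = re.compile(r"^\D*\d\.txt$")
print(bool(strict.match("MyFile23.txt")))
print(bool(strict.match("MyFile2.txt")))
```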
The pattern you have has one group, (.*), which for your example would match MyFile2,
because the . allows any character, including digits.
Furthermore, the . in the pattern after this group is not escaped, so it allows one more character of any kind.
To avoid this use:
(\D*)\d+\.txt
the group (\D*) would now match all non digit characters.
Here is why your "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \. else it will match "any character".
And finally, (.*) matches all the string from the beginning to the last digit (MyFile2). Have a look at the "MATCH INFORMATION" area on the right at this page.
So, I'd suggest the following fix:
^\D*\d\.txt$ = beginning of the line/string, zero or more non-digit characters, a digit, a literal period, a literal txt, and the end of the string/line (depending on the m switch; whether you need it depends on the input, i.e. a list of names on separate lines or a single file name).
Here is a working example.

How can I detect last digits in python string

I need to detect the last digits in a string, as they are indexes for my strings. They may be as large as 2^64, so it's not convenient to check only the last character of the string, then the second-to-last, and so on.
A string may look like asdgaf1_hsg534, i.e. there may be other digits in the string too, but they are somewhere in the middle and are not adjacent to the index I want to get.
Here is a method using re.sub:
import re
strings = ['asdgaf1_hsg534', 'asdfh23_hsjd12', 'dgshg_jhfsd86']
for s in strings:
    # replace the whole string with the captured trailing digits
    print(re.sub(r'.*?([0-9]*)$', r'\1', s))
Output:
534
12
86
Explanation:
The function takes a regular expression, a replacement string, and the string you want to do the replacement on: re.sub(regex,replace,string)
The regex '.*?([0-9]*)$' matches the whole string and captures the number that precedes the end of the string. Parentheses are used to capture the parts of the match we are interested in; \1 refers to the first capture group, \2 to the second, etc.
.*? # Matches anything (non-greedy)
([0-9]*) # Zero or more digits (captured)
$ # Followed by the end-of-string identifier
So we are replacing the whole string with just the captured number we are interested in. In Python we need to use raw strings for this: r'\1'. If the string doesn't end with digits then a blank string will be returned.
twosixfour = "get_the_numb3r_2_^_64__18446744073709551615"
print(re.sub(r'.*?([0-9]*)$', r'\1', twosixfour))
>>> 18446744073709551615
A simple regex can detect digits at the end of the string:
'\d+$'
$ matches the end of the string. \d+ matches one or more digits. The + operator is greedy by default, meaning it matches as many digits as possible. So this will match all of the digits at the end of the string.
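A minimal Python sketch of this approach follows, using re.search so that the match does not have to start at the beginning of the string. The function name trailing_index and the conversion to int are assumptions; Python integers are not limited to 64 bits, so indexes as large as 2^64 are handled.

```python
import re

def trailing_index(s):
    """Return the digits at the end of s as an int, or None if there are none."""
    m = re.search(r"\d+$", s)
    return int(m.group()) if m else None

print(trailing_index("asdgaf1_hsg534"))
print(trailing_index("get_the_numb3r_2_^_64__18446744073709551615"))
print(trailing_index("test"))
```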
If you want to use re.sub and require at least one digit at the end of the line, use the + quantifier (\d+) to match one or more digits. The line is then left unchanged, rather than emptied, when it contains no digits or when its digits are not at the end of the line.
^.*?(\d+)$
^ Start of line
.*? Match any char except a newline, as few times as possible (non greedy)
(\d+) Capture group 1, match 1+ digits
$ End of line
Or using a negative lookbehind
^.*(?<!\d)(\d+)$
^ Start of line
.* Match any char except a newline as much as possible
(?<!\d)(\d+) Assert no digits directly to the left, then capture 1+ digits in group 1
$ End of line
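The effect of the lookbehind can be seen by comparing it with the same greedy pattern without it, as in this small Python sketch. The comparison pattern without the lookbehind is added here only for contrast.

```python
import re

s = "asdgaf1_hsg534"

# Greedy .* without the lookbehind leaves only the last digit for the group.
without = re.match(r"^.*(\d+)$", s).group(1)

# The lookbehind forces the group to start where no digit precedes it.
with_lookbehind = re.match(r"^.*(?<!\d)(\d+)$", s).group(1)

print(without, with_lookbehind)
```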
When using re.match, you can omit the ^ anchor, and you might also use \A and \Z to assert the start and the end of the string.
import re
strings = ['asdgaf1_hsg534', 'asdfh23_hsjd12', 'dgshg_jhfsd86', 'test']
for s in strings:
    # strings without trailing digits are left unchanged
    print(re.sub(r".*?(\d+)$", r'\1', s))
Output
534
12
86
test
If there should be a non digit present before matching a digit as in this comment you could use a negated character class with a single capture group.
^.*[^\d\r\n](\d+)
^ Start of line
.* Match any char except a newline as much as possible
[^\d\r\n] Negated character class, match any char except a digit or a newline
(\d+) Capture group 1, match 1+ digits
To get the last digits in the string (not necessarily at the end of the string)
^.*?(\d+)[^\r\n\d]*$
^ Start of line
.*? Match any char except a newline, as few times as possible (non greedy)
(\d+) Capture group 1, match 1+ digits
[^\r\n\d]* Negated character class, match 0+ times any char except a newline or digit
$ End of line
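For digits that are the last in the string but not necessarily at its end, the pattern above can be tried with a short Python sketch. The function name last_digits and the sample abc12def345xyz are made up for illustration.

```python
import re

LAST_DIGITS = re.compile(r"^.*?(\d+)[^\r\n\d]*$")

def last_digits(s):
    """Return the last run of digits in s, or None if s has no digits."""
    m = LAST_DIGITS.match(s)
    return m.group(1) if m else None

print(last_digits("abc12def345xyz"))
print(last_digits("asdgaf1_hsg534"))
print(last_digits("no digits here"))
```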