I'm trying to improve with regex as I'm tired of constantly having to look up existing solutions instead of creating my own. Having a bit of difficulty understanding why this isn't working though:
Trying to extract both phone numbers from the following string (numbers and address are random):
+1-541-754-3010 156 Alphand_St. <J Steeve>\n 133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010\n"
So I'm using the following expression:
/\+(.+)(?:\s|\b)/
These are the matches I'm getting back:
1-541-754-3010 156 Alphand_St.
1-541-914-3010
So I'm getting the last one correctly, but not the first one. Based on the expression, it should match anything from between a + and a space/boundary. But for some reason it's not stopping at the space after the first number. Am I going about this the wrong way?
Given the format of your search string, and since each number starts with a literal "+", I would just match the run of digits and separators (such as the hyphen) that follows it:
/\+([0-9\-]+)/
Your ".+" says to match everything until there's a \s. However that also includes \s on the way to the \s.
Remember that dashes - are not word characters, so \b will match between, for example, 1 and -, between - and 5, and so on. Also, your current regex is greedy: the repeated . tries to match as many characters as it can, which is why it goes all the way to the end of the first line (the position after the last character in the line satisfies \s or \b). Making it lazy (with .+?) wouldn't fix it, though, because then it would terminate right after the 1 in 1-541 (there is a word boundary between 1 and -).
Try using a character set of digits and - instead:
\+([\d-]+)
https://regex101.com/r/ktbcHJ/1
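As a rough check of the points above, here is a small Python sketch. Python's re module is an assumption (the question does not name an engine), and the \n sequences in the sample are taken to be real line breaks. It runs the original greedy pattern, a lazy variant, and the character-class pattern against the sample string.

```python
import re

# Sample string from the question; "\n" is assumed to be a real line break.
text = ("+1-541-754-3010 156 Alphand_St. <J Steeve>\n"
        " 133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010\n")

greedy = re.findall(r"\+(.+)(?:\s|\b)", text)  # .+ runs to the end of each line
lazy = re.findall(r"\+(.+?)(?:\s|\b)", text)   # stops at the first word boundary
fixed = re.findall(r"\+([\d-]+)", text)        # only digits and hyphens

print(greedy)
print(lazy)   # ['1', '1']
print(fixed)  # ['1-541-754-3010', '1-541-914-3010']
```

The greedy version drags the rest of the first line into the capture, the lazy version stops after a single digit, and only the character class returns both numbers cleanly.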
I would like to match an exact number in a string, but my regex keeps matching the exact number if it repeats together.
I have the following string:
SomePrefix1201-21,4,52
And I have the following regex to find a match for 21:
SomePrefix[\d]+-[,\d]*21[,$]*
It will match this string fine.
However, it also matches:
SomePrefix1201-2121,4,52
But I only want it to match if it is the exact number.
The number may exist at the end too, so it is not always following by a comma.
I've been racking my brain like anything
Update
Based on the corrected answer below, I managed to find the exact regex I need, with one addition of a lookahead too.
SomePrefix[\d]+-([\d]*,)*21(?!\d)[,$]*
The [,\d]* part matches any number of digits and commas in any order. What you probably wanted was ([\d]*,)* so that any preceding digits and commas must end in a comma (not a digit, which would become a part of the number).
SomePrefix[^-]+-(\d+,)*(21,|21$)
Match the prefix, followed by one or more non-dash characters, then a dash, then zero or more comma-terminated digit fields, followed either by 21, (and possibly more material) or just 21 anchored to the end.
If the comma-terminated fields can be empty, then of course \d* rather than \d+.
It's not clear that you can widely use the anchor operator $ inside a character class (perhaps some regex implementations have this feature), so I distributed it out into two matches for 21, which looks clear. The 21 can be factored out of this:
(21,|21$) -> 21(,|$)
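A quick way to try the factored pattern is the Python sketch below. Python's re module is an assumption here, and the extra sample strings beyond the two from the question are made up for illustration. The last two lines also show that, in Python at least, a $ inside a character class is just a literal dollar sign.

```python
import re

# Pattern from the answer above, with 21 factored out of the alternation.
pattern = re.compile(r"SomePrefix[^-]+-(\d+,)*21(,|$)")

samples = [
    "SomePrefix1201-21,4,52",    # 21 is the first field
    "SomePrefix1201-4,21",       # 21 is the last field (hypothetical sample)
    "SomePrefix1201-2121,4,52",  # 2121 is a different number
    "SomePrefix1201-4,521",      # 521 merely ends in 21 (hypothetical sample)
]
results = {s: bool(pattern.search(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)

# Inside [...] the $ is a literal character, not an end-of-string anchor.
dollar_literal = bool(re.search(r"21[,$]", "21$"))
dollar_anchor = bool(re.search(r"21[,$]", "21"))
print(dollar_literal, dollar_anchor)  # True False
```

Only the first two samples match, which is the "exact number" behaviour asked for.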
I'm having trouble finding the right regex for decimal numbers that include the comma separator.
I did find a few other questions regarding this issue in general, but none of the answers really worked when I tested them.
The best I got so far is:
[0-9]{1,3}(,([0-9]{3}))*(.[0-9]+)?
2 main problems so far:
1) It records numbers with spaces between them "3001 1" instead of splitting them to 2 matches "3001" "1" - I don't really see where I allowed space in the regex.
2) I have a general problem with the beginning\ending of the regex.
The regex should match:
3,001
1
32,012,111.2131
But not:
32,012,11.2131
1132,012,111.2131
32,0112,111.2131
32131
In addition I'd like it to match:
1.(without any number after it)
1,(without any number after it)
as 1
(a comma or point at the end of the number should be overlooked).
Many Thanks!
This is a very long and convoluted regular expression that fits all your requirements. It will work if your regex engine is based on PCRE (hopefully you're using PHP, Delphi or R..).
(?<=[^\d,.]|^)\d{1,3}(,(\d{3}))*((?=[,.](\s|$))|(\.\d+)?(?=[^\d,.]|$))
DEMO on RegExr
The things that make it so long:
Matching multiple numbers on the same line separated by only 1 character (a space) whilst not allowing partial matches requires a lookahead and a lookbehind.
Matching numbers ending with . and , without including the . or , in the match requires another lookahead.
(?=[,.](\s|$)) Explanation
When writing this explanation I realised the \s needs to be a (\s|$) to match 1, at the very end of a string.
This part of the regex is for matching the 1 in 1, or the 1,000 in 1,000. so let's say our number is 1,000. (with the . on the end).
Up to this point the regex has matched 1,000, then it can't find another , to repeat the thousands group so it moves on to our (?=[,.](\s|$))
(?=....) means it's a lookahead: from the point we have matched up to, look at what's coming but don't add it to the match.
So it checks if there is a , or a . and if there is, it checks that it's immediately followed by whitespace or the end of input. In this case it is, so it'd leave the match as 1,000
Had the lookahead not matched, it would have moved on to trying to match decimal places.
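For anyone who wants to experiment outside PCRE, here is a hedged Python adaptation. Python's re module only accepts fixed-width lookbehind, so the sketch swaps (?<=[^\d,.]|^) for the negative lookbehind (?<![\d,.]), which should accept the same positions; that swap, the non-capturing groups, and the helper name find_numbers are my changes, not part of the answer above.

```python
import re

# Adaptation for Python's re module, which needs fixed-width lookbehind:
# (?<![\d,.]) stands in for the answer's (?<=[^\d,.]|^).
NUMBER = re.compile(
    r"(?<![\d,.])\d{1,3}(?:,\d{3})*"
    r"(?:(?=[,.](?:\s|$))|(?:\.\d+)?(?=[^\d,.]|$))"
)

def find_numbers(text):
    """Return every full match of NUMBER in text."""
    return [m.group(0) for m in NUMBER.finditer(text)]

print(find_numbers("3,001 1 32,012,111.2131"))  # the three wanted numbers
print(find_numbers("1,000."))                   # trailing dot is left out
print(find_numbers("32,012,11.2131 1132,012,111.2131 32,0112,111.2131 32131"))  # []
```

The third call uses the "should not match" examples from the question, separated by spaces, and finds nothing.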
This works for all the ones that you have listed
^[0-9]{1,3}(,[0-9]{3})*(([\\.,]{1}[0-9]*)|())$
. means "any character". To use a literal ., escape it like this: \..
As far as I know, that's the only thing missing.
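To see the effect of the unescaped dot, here is a minimal Python sketch (Python's re is an assumption, and the groups are made non-capturing only so that findall returns whole matches). It runs the question's pattern with and without the escape on "3001 1".

```python
import re

# The question's pattern: the bare "." can match the space between the numbers.
unescaped = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})*(?:.[0-9]+)?")
# Same pattern with the dot escaped, so it only matches a literal ".".
escaped = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]+)?")

sample = "3001 1"
with_any_char = unescaped.findall(sample)
with_literal_dot = escaped.findall(sample)

print(with_any_char)     # ['300', '1 1']
print(with_literal_dot)  # ['300', '1', '1']
```

The '1 1' match is where the space sneaks in. Escaping the dot removes it, but the pattern still splits 3001 into pieces because nothing anchors the start or end of a number.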
I am trying to do the following match using regex.
The input should consist of capital letters only and be 2 to 10 characters long.
If it is exactly 2 characters long, neither the first nor the second character may be A, E, I, O or U.
I tried:
[B-DF-HJ-NP-TV-XZ]{2,10}
It works well, but I am not too sure if this is the right and most efficient way to do regex here.
All credit to Jerry, for his answer:
^(?:(?![AEIOU])[A-Z]{2}|[A-Z]{3,10})$
Explanation:
^ = "start of string", and $ = "end of string". This is useful for preventing false matches (e.g. a 10-character match from an 11 character input, or "MR" matching in "AMRXYZ").
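(?![AEIOU]) is a negative look-ahead for the characters A, E, I, O and U, i.e. the regex will not match at a position where the next character is a vowel. It only appears in the first half of the "OR" (|) alternation, so vowels are still allowed in longer matches. Note that, as written, it tests only the character immediately after it, which is the first of the two letters; to rule out a vowel in the second position as well, the look-ahead has to be repeated for each letter: (?:(?![AEIOU])[A-Z]){2}.
The rest should be fairly obvious, given the understanding of regex you've already demonstrated in your question above.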
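A small Python sketch (re module assumed) to probe what the look-ahead actually covers. It compares the pattern exactly as quoted with a variant that repeats the look-ahead for each of the two letters; the variant is my assumption about the intent, not something from the original answer.

```python
import re

# Pattern exactly as quoted in the answer.
quoted = re.compile(r"^(?:(?![AEIOU])[A-Z]{2}|[A-Z]{3,10})$")
# Variant (an assumption about the intent): repeat the look-ahead per letter.
per_letter = re.compile(r"^(?:(?:(?![AEIOU])[A-Z]){2}|[A-Z]{3,10})$")

for word in ["MR", "AB", "BA", "ABC", "ABCDEFGHIJK"]:
    print(word, bool(quoted.match(word)), bool(per_letter.match(word)))
```

The two patterns agree on everything except "BA": the quoted pattern accepts it because the look-ahead only inspects the first letter, while the per-letter variant rejects a vowel in either position.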
Thanks to everyone who has replied.
I think I have to tweak my first question a little bit.
I'm a little bit confused by the definition of the $ sign.
It just asserts that there are between 6 and 10 word chars at the very end of the string.
That's it! Right? Then, It has to be matched with my test string "123a56A781231231231241" in my opinion. Because it doesn't break the rule! 6-10 word chars at the very beginning of string, and at the very end of string. Perfect, isn't it?
Plus, I want to know the difference between ^(?=\w{6,10}$) and ^(?=\w{6,10})$.
One more thing: Casimir et Hippolyte, you said "The + doesn't change anything, this means only that the quantifier ( {6,10} here) is possessive and doesn't allow backtracks."
Does that mean the + sign disables the $ sign?
Thank you guys in advance.
Before I go any further, I want you guys to know that it's been only 2 days since I started studying regex. I'm a total newbie.
First. The pattern is ^(?=\w{6,10}$). Why does the dollar sign have to be inside the ()? I know it's a dumb question but I'm curious. I tried putting the dollar sign outside the (), but it didn't work as I expected.
Second. I found several tutorial sites, and they say the dollar sign means
"$ may appear at the end of a pattern to require the match to occur at the very end of a line. For example, abc$ matches 123abc but not abc123."
So $ is used to assert that the matched part of string is at the very end of a line. Right?
If that is true, why can't this pattern, "^(?=\w{6,10}$)", match my test string, "123a56A781231231231241"?
As you see, my test string contains 6~10 word characters at the very beginning of a line and 6~10 word characters at the very end of a line.
Third. As I mentioned earlier, the pattern ^(?=\w{6,10}$) can't match my test string "123a56A781231231231241". But if I add a + sign after \w{6,10}, like ^(?=\w{6,10}+$), it works.
Is it because the + sign is possessive? As far as I know, the + sign tells the engine not to backtrack once a match has been made. So my guess is that the $ sign doesn't do its job because no backtracking happens (I'm not sure about this, of course, as I don't know how the $ sign works behind the scenes). Is that right?
If that's your whole regex, you don't need a look-ahead. ie these two regexes are equivalent:
^(?=\w{6,10}$)
^\w{6,10}$
Why does the $ need to be inside the brackets? Because the (anchored) look-ahead ^(?=\w{6,10}) just asserts that there are between 6 and 10 word chars at the front of the input, so it will also succeed if there are more than 10 word chars at the front of the input.
By putting the $ inside the look ahead, it will only succeed if there are 6-10 word chars in the whole input.
You would only use a look ahead if you also wanted to have another restriction. For example, to match
6-10 word chars, and "a" appears before "b"
you would use the regex:
^(?=\w{6,10}$).*a.*b
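The difference between the placements of $ can be checked with a short Python sketch (re module assumed; the short strings passed to the last pattern are made up for illustration). It tries the $ inside the look-ahead, no $ at all, and the $ outside the look-ahead against the test string from the question.

```python
import re

s = "123a56A781231231231241"  # 22 word characters, the test string from the question

inside = re.search(r"^(?=\w{6,10}$)", s)    # needs 6-10 word chars, then end of input
no_dollar = re.search(r"^(?=\w{6,10})", s)  # only needs 6-10 word chars at the front
outside = re.search(r"^(?=\w{6,10})$", s)   # position must be start AND end of input

print(inside, no_dollar, outside)

# Look-ahead combined with a second restriction: "a" appears before "b".
combo = re.compile(r"^(?=\w{6,10}$).*a.*b")
print(bool(combo.search("xxaxxb")), bool(combo.search("xxbxxa")), bool(combo.search("ab")))
```

Only the version without $ matches the 22-character string, and its match is empty because the look-ahead consumes nothing. The version with $ outside the brackets cannot match any input: it asks for a position that is both the end of the string and followed by at least 6 word characters.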
The (?=..) is a lookahead, a zero-width assertion: it is just a check and consumes nothing. In other words, a lookahead means "followed by".
The pattern ^(?=\w{6,10}$) means:
beginning of the string followed by between 6 and 10 word characters until the end of the string.
Note that no character is matched, since everything is inside a lookahead except the ^, which is zero-width too.
A match function can only return an empty string as the match result, but will return true if the condition is met (otherwise false).
The + doesn't change anything; it only means that the quantifier ({6,10} here) is possessive and doesn't allow backtracking. More information about this feature here: www.regular-expressions.info/possessive.html
I can't help you with this because I don't know what you mean. Are you trying to match against the test string in 2 and 3?
^(?=\w{6,10}$) is trying to match the beginning of the string, followed by 6-10 word characters and the end of the string. Your string is longer than 10 characters, so that won't match.
When you add the + it matches one or more instances of the 6-10 character string.
Adding the + should still not match, because either way you are looking to match a string exactly 6-10 chars long, but your test string is longer. Making it possessive won't change the match in this instance.
I need to find lines that consist of 3 digits followed by 3 other characters. I thought I could use the following RegEx:
^\d{3}\D{3}$
But take the following sample text file and run the RegEx above (the text must have the empty lines in it):
1
12
123xxx
123y
aaabb
The problem is that there are two matches: 123xxx (which is fine), but also 123y is matched!
I suspect the reason is that "y" + the end-of-line + the beginning-of-next-line are also matched.
How can I tell the regex engine to ignore line beginnings and endings with \D and match characters only, not positions?
The behavior of $ in UltraEdit changes depending on whether you have "Match Whole Word Only" checked or not. To get the behavior you want you need to make sure that that option is checked. Your regular expression doesn't need to change.
Maybe:
/^\d{3}\D{3}$/m
The m means
Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.
http://perldoc.perl.org/perlre.html
I don't know about UltraEdit exactly but I expect it will have something similar.
Try this :
^\d{3}[\S]{3}$
Match lines with 3 digits followed by three characters that are not blank characters.
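The suspicion in the question can be reproduced outside UltraEdit with a Python sketch. Python's re with re.MULTILINE is an assumption (UltraEdit's engine may behave differently), and so is the guess that two blank lines follow "123y" in the sample. \D happily matches line breaks, so excluding the newline from the class, or using \S as suggested above, keeps the match on one line.

```python
import re

# Sample from the question, assuming two blank lines follow "123y".
text = "1\n12\n123xxx\n123y\n\n\naaabb\n"

original = re.findall(r"^\d{3}\D{3}$", text, re.MULTILINE)
no_newline = re.findall(r"^\d{3}[^\d\n]{3}$", text, re.MULTILINE)
non_blank = re.findall(r"^\d{3}\S{3}$", text, re.MULTILINE)

print(original)    # ['123xxx', '123y\n\n']
print(no_newline)  # ['123xxx']
print(non_blank)   # ['123xxx']
```

One difference between the two fixes: \S also accepts digits, so ^\d{3}\S{3}$ would match a line like 123456, while [^\d\n] keeps the "3 non-digits" meaning of the original.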