Here's my regex:
\b(https?|www)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|]*[.]{1,256}
I know I'm doing something wrong because I use RegEx very rarely.
The idea of the last [.]{1,256} was to make sure the match contains at least one ".".
Without it, "https://www" matched, so I wanted to require at least one dot.
But with the expression above, the match is cut off at the first dot instead of covering the whole URL.
First of all, www before :// does not make much sense: it can occur after ://, so it can be removed from the alternation.
Both [-a-zA-Z0-9+&##/%?=~_|!:,.;]* and [-a-zA-Z0-9+&##/%=~_|]* can match an empty string, and the [.]{1,256} at the end of your pattern matches 1 to 256 dots. That is why your matches always end at a dot.
You may refactor the pattern to match all those chars you allow before a dot, then match a dot, and then match any amount of chars you allow, together with a dot:
\bhttps?://[-a-zA-Z0-9+&##/%?=~_|!:,;]*\.[-a-zA-Z0-9+&##/%?=~_|!:,.;]*
Here,
[-a-zA-Z0-9+&##/%?=~_|!:,;]* - 0 or more of the chars you allow, except a dot
\. - this matches a dot
[-a-zA-Z0-9+&##/%?=~_|!:,.;]* - 0 or more allowed chars including a dot.
So, at least 1 dot will get matched.
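To see the difference in practice, here is a minimal sketch in Python (the language and the sample text are my assumptions, the question does not name an engine). Both patterns are copied as posted, including the "##" inside the character classes.

```python
import re

# Pattern from the question, verbatim.
ORIGINAL = r"\b(https?|www)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|]*[.]{1,256}"
# Refactored pattern from the answer: allowed chars, one mandatory dot, then the rest.
FIXED = r"\bhttps?://[-a-zA-Z0-9+&##/%?=~_|!:,;]*\.[-a-zA-Z0-9+&##/%?=~_|!:,.;]*"

# Hypothetical sample text: one full URL and one "URL" without any dot.
text = "see https://www.example.com/path?x=1 but not https://www here"

# The original pattern is forced to end on a dot, so the match is truncated.
original_match = re.search(ORIGINAL, text).group(0)
# The fixed pattern has no capture groups, so findall returns whole matches.
fixed_matches = re.findall(FIXED, text)

print(original_match)  # https://www.example.
print(fixed_matches)   # ['https://www.example.com/path?x=1']
```

The dot-less "https://www" is skipped by the fixed pattern because the mandatory \. can never be satisfied there.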
I have a conditional lookahead regex that tests to see if there is a number substring at the end of a string, and if so match for the numbers, and if not, match for another substring
The string in question: "H2K 101"
If just the lookahead is used, i.e. (?=\d{1,8}$)(\d{1,8}$), the lookahead succeeds, and "101" is found in capture group 1
When the lookahead is placed into a conditional, i.e. (?(?=\d{1,8}\z)(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)), the lookahead now fails, and the second pattern is used, matching "H2K", and a "2" is found in capture group 2.
If the test string has the "2" swapped for a letter, i.e. "HKK 101"
then the lookahead conditional works as expected, and the number "101" is once again found in capture group 1.
I've tested this in Regex101 and other PCRE engines, and all work the same, so clearly I'm missing something obvious about conditionals or the condition regex I'm using. Any insight greatly appreciated.
Thanks.
The look ahead starts at the current position, so initially it fails, and the alternative is used -- where it finds a match at the current position.
If you want the look ahead to succeed when still at the initial position, you need to allow for the intermediate characters to occur. Also, when the alternative kicks in, realise that there can follow a second match that still uses the look ahead, but now at a position where the look ahead is successful.
From what I understand, you are interested in one match only, not two consecutive matches (or more). So that means you should attempt to match the whole string, and capture the part of interest in a capture group. Also, the look ahead should be made to succeed when still at the initial position. This all means you need to inject several .*. There is no need for a conditional.
(?=.*\d{1,8}\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
Note also that (?=.*\d{1,8}\z) succeeds if and only if (?=.*\d\z) succeeds, so you can simplify that:
(?=.*\d\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*
There are two capture groups. If there is a match, exactly one of the capture groups will have non-empty content, which is the content you need.
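A quick way to check this is a small Python sketch. Note the assumptions: Python's re module stands in for PCRE here, it spells the end-of-string anchor \Z where the pattern above uses \z, and the helper name extract is made up for the illustration.

```python
import re

# Same pattern as above, with \Z in place of PCRE's \z.
PATTERN = re.compile(r"(?=.*\d\Z).*?(\d{1,8}\Z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*")

def extract(s):
    """Return the content of whichever capture group took part in the match."""
    m = PATTERN.match(s)  # match() anchors the attempt at the start of the string
    if m is None:
        return None
    return m.group(1) if m.group(1) is not None else m.group(2)

print(extract("H2K 101"))  # 101
print(extract("HKK 101"))  # 101
print(extract("H2K abc"))  # H2K
print(extract("abc"))      # None
```

Because the lookahead now scans ahead with .*, it already succeeds at position 0 for "H2K 101", so the second alternative is never tried on that input.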
You want to match a number of specific length at the end of the string, and if there is none, match something else.
There is no need for a conditional here. Conditional patterns are necessary to examine what to match next at the given position inside the string based either on a specific group match or a lookaround test. They are not useful when you want to give priority to a specific pattern.
Here, you can use a PCRE pattern based on the \K operator like
.*?\K\d{1,8}\z|[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+
Or, using capturing groups
(?|.*?(\d{1,8})\z|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+))
See the regex demo #1 and regex demo #2.
Details:
.*?\K\d{1,8}\z - any zero or more chars other than line break chars, as few as possible, then the match reset operator that discards the text matched so far, then one to eight digits at the end of string
| - or
[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+ - one or more letters, 1-8 digits, underscores or hyphens, and then one or more letters.
And
(?| - start of the branch reset group:
.*? - any zero or more chars other than line break chars, as few as possible
(\d{1,8}) - Group 1: one to eight digits
\z - end of string
| - or
( - Group 1 start:
[a-zA-Z]+ - one or more ASCII letters
[\d_-]{1,8} - one to eight digits, underscores, hyphens
[a-zA-Z]+ - one or more ASCII letters
) - Group 1 end
) - end of the group.
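If you need the same priority logic outside PCRE, here is a rough Python equivalent. As far as I know the standard re module supports neither \K nor the branch reset group, so this sketch swaps in two ordinary capture groups and \Z for \z. The function name is hypothetical.

```python
import re

# Two plain groups stand in for the branch reset group (?|...).
# Alternation order gives the trailing-digits branch priority.
PATTERN = re.compile(r".*?(\d{1,8})\Z|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)")

def preferred_match(s):
    m = PATTERN.search(s)
    if m is None:
        return None
    # Each branch holds exactly one group, so lastindex is 1 or 2:
    # the index of the group that actually matched.
    return m.group(m.lastindex)

print(preferred_match("H2K 101"))  # 101
print(preferred_match("H2K abc"))  # H2K
```

The first alternative is tried first at every position, which is what gives the trailing number priority without any conditional.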
I need some help here
Here is an example of what I'm trying to match:
1 ScreenMail Enable friendly none Internal any 5
I need to match everything excluding the last digits (5), meaning the first digit (1), spaces, letters, special characters, etc. I tried using /^(\d), but it stopped after matching the first digit. Your assistance would be appreciated.
The simplest way is probably to remove the last digits by replacing a match of
\d+$
with an empty string or, if whitespace may follow the digits, a match of
\d+\s*$
See the regex demo.
You may want to use a matching regex like
^.*[^\d\s]
that matches any zero or more chars other than line break chars (.*) as many as possible and then a char other than a digit and whitespace. See this regex demo.
However, if the digits are followed with an optional whitespace, or if you allow any text after the last digits, it will fail. You can then use
^.*[^\d\s](?=\s*\d)
See this regex demo. The (?=\s*\d) positive lookahead requires zero or more whitespaces and then a digit immediately to the right of the current location.
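Here is how the three approaches behave on the sample line, as a short Python sketch (the engine is my assumption, and the " trailing" suffix is invented to exercise the lookahead variant).

```python
import re

line = "1 ScreenMail Enable friendly none Internal any 5"

# Removal approach: delete the trailing digits and any whitespace after them.
removed = re.sub(r"\d+\s*$", "", line)

# Matching approach: greedy .* backtracks to the last non-digit, non-space char.
simple = re.search(r"^.*[^\d\s]", line).group(0)

# With extra text after the digits, the simple pattern runs to the end of the line...
with_tail = line + " trailing"
simple_on_tail = re.search(r"^.*[^\d\s]", with_tail).group(0)
# ...while the lookahead variant still stops before the last digits.
guarded = re.search(r"^.*[^\d\s](?=\s*\d)", with_tail).group(0)

print(repr(removed))  # note the space left before the removed digit
print(repr(simple))
print(repr(guarded))
```

Note that the removal approach leaves the space that preceded the digits, while the matching approaches stop right after "any".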
I am not an expert in regex, but I tried the pattern below and would like to improve it. Can someone please help me?
The pattern can contain ASCII letters, spaces, commas, periods, ' and - special characters, and there can be one digit at the end of the string.
So far, this works well:
/^[a-z ,.'-]+(\d{1})?$/i
But I want to add the condition that there must be at least 2 letters. Could you please tell me how to achieve this, and explain it a bit as well?
Note that {1} is always redundant in any regex, please remove it to make the regex pattern more readable. (\d{1})? is equal to \d? and matches an optional digit.
Taking into account the string must start with a letter, you can use
/^(?:[a-z][ ,.'-]*){2,}\d?$/i
Details:
^ - start of string
(?: - start of a non-capturing group (it is used here as a container for a pattern sequence to quantify):
[a-z] - an ASCII letter
[ ,.'-]* - zero or more spaces, commas, dots, single quotation marks or hyphens
){2,} - end of group, repeat two or more ({2,}) times
\d? - an optional digit
$ - end of string
i - case insensitive matching is ON.
See the regex demo.
The thing to change in your regex is + after the list of allowed characters.
+ means one or many occurrences of the provided characters. If you want to have 2 or more you can use {2,}
So your regex should look something like
/^[a-z ,.'-]{2,}\d?$/i
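The two suggestions are not equivalent, which a small Python comparison makes visible (Python and the sample strings are my choice, the question does not name a language). The {2,} on the character class counts characters from the whole allowed set, not letters.

```python
import re

# First answer: every repetition of the group consumes exactly one letter.
TWO_LETTERS = re.compile(r"^(?:[a-z][ ,.'-]*){2,}\d?$", re.IGNORECASE)
# Second answer: two or more characters from the allowed set, letters or not.
TWO_CHARS = re.compile(r"^[a-z ,.'-]{2,}\d?$", re.IGNORECASE)

samples = ["ab", "O'Neil, J. 5", "a", "a.", ".."]
results = {
    s: (bool(TWO_LETTERS.search(s)), bool(TWO_CHARS.search(s)))
    for s in samples
}

for s, (letters_ok, chars_ok) in results.items():
    print(f"{s!r:16} two-letters: {letters_ok!s:5} two-chars: {chars_ok}")
```

Inputs such as "a." or ".." pass the second pattern although they contain fewer than two letters, so the first pattern is the one that enforces the stated condition (at the cost of requiring the string to start with a letter).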
I am using regex to clean some text files.
In some places, spaces are missing as in the second line below:
1.9 Beef Curry
1.10Banana Pie
1.11 Corn Gravy
I need an expression to find a zero-length match at the position between 0 and B, so that I can replace it (in Notepad++) with a space. Note that numerators can be one or two digits, and there can also be one (i.e. 1. Exotic Dishes) or three levels (i.e. 2.5.1 Chicken).
Can someone please give the answer?
I would have thought one of the following should work, but Notepad++ calls it invalid. I would also appreciate it if someone can tell me why...
(?<=\.\d\d|\.\d)(?! )(?!\.)
(?<=\.\d{1,3)(?! )(?!\.)
Thanks in advance!
It may be enough to look for the zero-length positions matched by \B (non-word boundaries) between word characters, and check that each is preceded by a digit and not followed by a digit. If so, replace with a space.
\B(?<=\d)(?!\d)
See this demo at regex101
\B matches at any non-word boundary
(?<=\d) looks behind for a digit
(?!\d) looks ahead for no digit
For further restricting the digit part to a dot followed by 1-3 digits, try something like \.\d{1,3}\B\K(?!\d) where \K resets the beginning of the reported match. Or drop \K and replace with $0 followed by a space.
Just to mention: Also the underscore belongs to word characters. If your input contains underscores, e.g. something like 1_ and you don't want to add space here, change the lookahead to (?![\d_])
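For anyone who wants to try the zero-length replacement outside Notepad++, here is a sketch using Python's re module (an assumption on my side, the question targets Notepad++). The "2.5.1Chicken" and "1_x" inputs are made up to cover the three-level and underscore cases.

```python
import re

lines = ["1.9 Beef Curry", "1.10Banana Pie", "1.11 Corn Gravy", "2.5.1Chicken"]

# Zero-length position: inside a run of word chars, after a digit, before a non-digit.
GAP = re.compile(r"\B(?<=\d)(?!\d)")
fixed = [GAP.sub(" ", line) for line in lines]

# Underscore is a word character too, so the plain lookahead splits "1_x".
with_underscore = GAP.sub(" ", "1_x")
# Excluding "_" in the lookahead leaves such input alone.
underscore_safe = re.sub(r"\B(?<=\d)(?![\d_])", " ", "1_x")

print(fixed)
print(with_underscore, underscore_safe)
```

Lines that already have the space are left untouched, because the position between a digit and a space is a word boundary and \B rejects it.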
You may use one of
^\d[\d.]*+(?!\h)
^\d[\d.]*+(?! )
^(?>\d+(?:\.\d+)*\.?)(?!\h)
Replace with "$& " (the whole match followed by a space).
Details
^\d[\d.]*+(?!\h) matches a digit and then 0 or more digits/dots and once they are all matched, a horizontal whitespace is checked for. If there is no whitespace, there is a match.
^\d[\d.]*+(?! ) is the same, just the check is performed for a regular space.
^(?>\d+(?:\.\d+)*\.?)(?!\h) is more specific, it matches
^ - start of line
(?>\d+(?:\.\d+)*\.?) - an atomic group preventing backtracking:
\d+ - 1+ digits
(?:\.\d+)* - 0 or more sequences of . and 1+ digits
\.? - an optional dot
(?!\h) - no horizontal whitespace allowed immediately on the right
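To show why the possessive quantifier (or atomic group) matters, here is a Python sketch. Several things are assumptions: Python is not the engine from the question, [ \t] stands in for \h, and the possessive behaviour is emulated with a capturing lookahead plus a backreference so that it also runs on Python versions before 3.11 (3.11 and later accept *+ and (?>...) directly).

```python
import re

# Emulated possessive match: the lookahead grabs the whole numbering at once
# and is never re-entered on failure, then \1 consumes exactly that text.
POSSESSIVE = re.compile(r"^(?=(\d[\d.]*))\1(?![ \t])", re.MULTILINE)
# The same check with an ordinary greedy quantifier, for comparison.
GREEDY = re.compile(r"^\d[\d.]*(?![ \t])", re.MULTILINE)

text = "1.9 Beef Curry\n1.10Banana Pie\n2.5.1Chicken\n1. Exotic Dishes"

def add_space(match):
    # Equivalent of replacing with the whole match plus a space.
    return match.group(0) + " "

fixed = POSSESSIVE.sub(add_space, text)
broken = GREEDY.sub(add_space, text)

print(fixed)
print("---")
print(broken)
```

With the plain greedy version the engine backtracks into the numbering until the lookahead passes, so correct lines get damaged ("1.9 Beef Curry" becomes "1. 9 Beef Curry"). Preventing that backtracking is the whole point of *+ and (?>...).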
My alternative attempt also works:
Find what: ^(\d\.\d+) ?(?=\w)
Replace with: $1 followed by a space
I have a regex pattern that almost works, but I can't quite get it totally correct. My goal is that if a string starts with letters, to ignore them up to the first digit. The second part of the pattern needs to make the match stop at the last hyphen in the string, if one exists. Here are some examples of strings that I would be working on:
PCKG6JUB-0330M3-0-812 wanting returned 6JUB-0330M3-0
CCP352878 wanting returned 352878
0972543107 wanting returned 0972543107
This is the pattern that I have so far: \d[\S]*- The problem is that on the top example, it includes the last hyphen in the match, so I get 6JUB-0330M3-0-. Also, if no hyphen exists, then nothing is returned.
I'm using the VBScript engine.
Use this:
\d(?:\S*(?=-)|\S*)
First, I used a positive lookahead, (?=...), so we don't actually match the last hyphen. Then, I used alternation, |, to check for a match with a hyphen or without a hyphen. So that we don't need to match the digit on both sides of the alternation, I put this part in a non-capturing group, (?:...). Finally, \S is shorthand for a character class and doesn't need to be in brackets.
One would think we'd just be able to make the hyphen optional (i.e. \d\S*(?=-?)), but that doesn't work. This is because our \S* match is greedy (and it needs to be, since you want to match up until the last hyphen) and will just blow right past the hyphen.
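Here is a quick check against the three sample strings, sketched in Python rather than VBScript (an assumption made only so the example is easy to run; the pattern uses nothing beyond a lookahead and a non-capturing group).

```python
import re

# Pattern from the answer: prefer the branch that stops before a hyphen.
ANSWER = re.compile(r"\d(?:\S*(?=-)|\S*)")
# The tempting shortcut with an optional hyphen in the lookahead.
OPTIONAL_HYPHEN = re.compile(r"\d\S*(?=-?)")

samples = ["PCKG6JUB-0330M3-0-812", "CCP352878", "0972543107"]

results = [ANSWER.search(s).group(0) for s in samples]
shortcut = OPTIONAL_HYPHEN.search(samples[0]).group(0)

print(results)   # ['6JUB-0330M3-0', '352878', '0972543107']
print(shortcut)  # 6JUB-0330M3-0-812
```

In the first branch the greedy \S* runs to the end and then backtracks until a hyphen follows, which lands on the last hyphen. The optional-hyphen version never has a reason to backtrack, so it keeps the whole tail.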