How do I match name, name here in Regex?

I've tried everything I can think of to get this to work, and after hours of trying, I've got to ask for help.
This is the string I'm scanning.
F:\Downloads\Downloads\500 Comics CCC English\Jack, Byrd - Art #01.cbr
This is the current Regex I'm using to try to match what I want.
(?i)English\\(?<Writer>.*(?= ))(?-i)
It matches English\Jack, Byrd - Art
All I want it to match is Jack, Byrd (with no space after it.)
For some reason, the only space I can get it to match is the space after Art.
No matter what I try, it will only match that space. It's like it doesn't consider the other spaces to be spaces.

In the pattern (?i)English\\(?<Writer>.*(?= ))(?-i), the greedy .* first grabs as many characters as it can (the rest of the line) and then backtracks until the positive lookahead (?= ) succeeds. The first position it reaches while backtracking is the last space in the string, the one after Art, so the overall match is English\Jack, Byrd - Art.
There are several ways to fix it, depending on what your input looks like. If the value is always followed by space-hyphen-space, add the hyphen and the second space to the lookahead:
(?i)English\\(?<Writer>.*(?= - ))(?-i)
If the value is always a run of non-whitespace characters, a comma, optional whitespace, and another run of non-whitespace characters, use:
(?i)English\\(?<Writer>\S+,\s*\S+)(?-i)
where \S+ matches one or more non-whitespace chars.
For those who wonder what (?<Writer>...) is: the construct is called a named capturing group. It is referred to as \k<Writer> inside the same regex pattern, or as ${Writer} inside a replacement string.
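To compare the three patterns side by side, here is a minimal Python sketch. It is an illustration under assumptions, not part of the original answer: Python's re module spells named groups (?P<Writer>...) rather than (?<Writer>...), and it does not accept a bare trailing (?-i), so that part is dropped.

```python
import re

path = r"F:\Downloads\Downloads\500 Comics CCC English\Jack, Byrd - Art #01.cbr"

# Original pattern: the greedy .* backtracks only as far as the LAST space.
greedy = re.search(r"(?i)English\\(?P<Writer>.*(?= ))", path)

# Fix 1: require " - " immediately after the value.
with_hyphen = re.search(r"(?i)English\\(?P<Writer>.*(?= - ))", path)

# Fix 2: spell out the "word, word" shape explicitly.
explicit = re.search(r"(?i)English\\(?P<Writer>\S+,\s*\S+)", path)

print(greedy.group("Writer"))       # Jack, Byrd - Art
print(with_hyphen.group("Writer"))  # Jack, Byrd
print(explicit.group("Writer"))     # Jack, Byrd
```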

Related

Regex to get the last match of a pattern

Here is a string similar to what I'm trying to match (with the exception of a couple of specific patterns, for the sake of simplicity).
Hello, tonight I'm in the town of Trenton in New Jersey and I will be staying in Hotel HomeStay [123] and I have no money.
I'm trying to match only the last in Hotel HomeStay [123].
I'm not very familiar with regex concepts like lookahead and lookbehind right now. Similar questions here don't seem to solve my issue. I've tried a bunch of regex (to the best of my understanding) and this is what I came up with (?= (?:in|\d+))([\w \[]*\s*\d*\]*)(?!.*in). The digits and special characters may be part of what I'm actually trying to match.
The lookahead and lookbehind patterns are not restricted to containing only in. They can contain other common words as well, such as and and is. I'm only looking for the last occurrence of any of these, followed by the main pattern, which is quite distinctive. Edit: let's say the match must contain either HomeStay or LuxuryInn, for the sake of the example.
However, this matches the whole of in the town of Trenton in New Jersey and I will be staying in Hotel HomeStay [123].
Where am I going wrong? Also, could someone explain why the in is captured despite being placed in a non-capturing group?
Any help is greatly appreciated.
If you want to retrieve a text containing HomeStay prefixed by certain words and not containing those words, you can use a capture group using negative look-ahead inside. The regex below captures all occurrences (working fiddle).
\b(?:in|and|is)\s+((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)
Here, the regexp looks for :
a given prefix (in, and or is as a whole word, surrounded by word breakers \b)
... followed by at least one blank character,
... then a sequence of 0 or more characters each one not followed by a prefix,
... followed by HomeStay,
... followed by another sequence of 0 or more characters, each one still not followed by a prefix
If you just want the last occurrence, you can add another negative look-ahead after (fiddle).
\b(?:in|and|is)\s+((?:.(?!\b(?:in|and|is)\b))*HomeStay(?:.(?!\b(?:in|and|is)\b))*)(?!.*HomeStay.*)
Same as above, except the matched text must not be followed by a text containing HomeStay.
Finally, if the matching text has to contain at least a word from a list, just replace both occurrences of HomeStay with a list of alternatives. Example for HomeStay and Luxury: (?:HomeStay|Luxury) (fiddle).
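The pieces above can be assembled and tried out in Python. This is only a sketch: the names PREFIX, CHUNK and KEYWORD are made up here for readability, and the pattern they build is the "last occurrence" regex with the (?:HomeStay|Luxury) alternation substituted in.

```python
import re

text = ("Hello, tonight I'm in the town of Trenton in New Jersey and I will be "
        "staying in Hotel HomeStay [123] and I have no money.")

PREFIX = r"\b(?:in|and|is)\b"     # whole-word prefixes (hypothetical name)
CHUNK = rf"(?:.(?!{PREFIX}))*"    # characters, none directly followed by a prefix
KEYWORD = r"(?:HomeStay|Luxury)"  # the match must contain one of these

pattern = re.compile(
    rf"\b(?:in|and|is)\s+({CHUNK}{KEYWORD}{CHUNK})(?!.*{KEYWORD})"
)

match = pattern.search(text)
last = match.group(1) if match else None
print(last)  # Hotel HomeStay [123]
```

The earlier in and and candidates fail because no keyword appears before the next prefix word, so only the final in Hotel HomeStay [123] survives.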
In Java:
String s = "Hello, tonight I'm in the town of Trenton in New Jersey and I will be "
+ "staying in Hotel HomeStay [123] and I have no money.";
// Garbage: final String SUBP = "\\bin\\s+(\\S+)";
Pattern p = Pattern.compile("^.*\\sin\\s+(\\S+).*$", Pattern.DOTALL);
String last = p.matcher(s).replaceFirst("$1"); // If found
This will find the last "... in ...", because the greedy .* (as opposed to the reluctant .*?) takes the longest possible sequence.
The result above will be Hotel (the non-space characters after in), but the group could be written to capture anything.
Pattern.DOTALL makes . match line break characters as well.
The pattern spans the whole input, from the beginning ^ to the end $.
Any characters .* (as many as possible), followed by a whitespace character \s.
Then in and whitespace, then a word (non-space characters \S+) in group 1 (...).
Then any characters up to the end, .*. For purity it should have been .*? for the shortest sequence.
The end, $.
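The effect of the greedy prefix is easy to check in Python as well. The sketch below is an assumed translation of the Java snippet (re.DOTALL standing in for Pattern.DOTALL), with a reluctant variant added only for contrast.

```python
import re

s = ("Hello, tonight I'm in the town of Trenton in New Jersey and I will be "
     "staying in Hotel HomeStay [123] and I have no money.")

# Greedy prefix: .* swallows everything, then backtracks to the LAST " in ".
greedy = re.match(r"^.*\sin\s+(\S+).*$", s, re.DOTALL)

# Reluctant prefix: .*? stops at the FIRST " in ".
reluctant = re.match(r"^.*?\sin\s+(\S+).*$", s, re.DOTALL)

print(greedy.group(1))     # Hotel
print(reluctant.group(1))  # the
```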

Regex to find where space is missing between number and word

I am using regex to clean some text files.
In some places, spaces are missing as in the second line below:
1.9 Beef Curry
1.10Banana Pie
1.11 Corn Gravy
I need an expression to find a zero-length match at the position between 0 and B, so that I can replace it (in Notepad++) with a space. Note that each number can be one or two digits, and there can also be one level (e.g. 1. Exotic Dishes) or three levels (e.g. 2.5.1 Chicken).
Can someone please give the answer?
I would have thought one of the following should work, but Notepad++ calls it invalid. I would also appreciate it if someone can tell me why...
(?<=\.\d\d|\.\d)(?! )(?!\.)
(?<=\.\d{1,3)(?! )(?!\.)
Thanks in advance!
Maybe it is enough to look at the zero-length positions \B (non-word boundaries) between word characters, and check whether each one is preceded by a digit and not followed by a digit. If so, replace it with a space.
\B(?<=\d)(?!\d)
See this demo at regex101
\B matches at any non-word boundary
(?<=\d) looks behind for a digit
(?!\d) asserts that no digit follows
To further restrict the digit part to a dot followed by 1-3 digits, try something like \.\d{1,3}\B\K(?!\d), where \K resets the beginning of the reported match. Or drop the \K and replace with $0 followed by a space.
Just to mention: Also the underscore belongs to word characters. If your input contains underscores, e.g. something like 1_ and you don't want to add space here, change the lookahead to (?![\d_])
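A quick way to sanity-check the zero-length pattern outside Notepad++ is a short Python sketch. It assumes Python 3.7 or later, where re.sub replaces empty matches; note that Python's re has no \K, so only the basic pattern and the underscore variant are shown.

```python
import re

lines = "1.9 Beef Curry\n1.10Banana Pie\n1.11 Corn Gravy\n2.5.1 Chicken"

# Insert a space at every non-word-boundary that follows a digit
# and is not itself followed by a digit.
fixed = re.sub(r"\B(?<=\d)(?!\d)", " ", lines)
print(fixed)

# Underscore is a word character, so "1_a" is affected by the basic pattern...
with_underscore = re.sub(r"\B(?<=\d)(?!\d)", " ", "item 1_a")
# ...unless the lookahead also excludes the underscore.
kept = re.sub(r"\B(?<=\d)(?![\d_])", " ", "item 1_a")
```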
You may use one of
^\d[\d.]*+(?!\h)
^\d[\d.]*+(?! )
^(?>\d+(?:\.\d+)*\.?)(?!\h)
Replace with $& followed by a space ($& stands for the whole match).
Details
^\d[\d.]*+(?!\h) matches a digit and then 0 or more digits/dots and once they are all matched, a horizontal whitespace is checked for. If there is no whitespace, there is a match.
^\d[\d.]*+(?! ) is the same, just the check is performed for a regular space.
^(?>\d+(?:\.\d+)*\.?)(?!\h) is more specific, it matches
^ - start of line
(?>\d+(?:\.\d+)*\.?) - an atomic group preventing backtracking:
\d+ - 1+ digits
(?:\.\d+)* - 0 or more sequences of . and 1+ digits
\.? - an optional dot
(?!\h) - no horizontal whitespace allowed immediately on the right
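Why the possessive quantifier (or atomic group) matters can be shown with a small Python sketch. Two assumptions to flag: Python's re has no \h, so [ \t] is used instead, and because possessive quantifiers and atomic groups only exist in Python 3.11+, atomicity is emulated here with the lookahead-plus-backreference idiom (?=(...))\1. That idiom is a substitute technique, not what the answer itself uses.

```python
import re

text = "1.9 Beef Curry\n1.10Banana Pie\n1.11 Corn Gravy"

# Non-atomic: [\d.]* may give characters back, so "1." matches on line 1
# (the "9" after it is not whitespace) and a space lands in the wrong place.
naive = re.sub(r"^\d[\d.]*(?![ \t])", r"\g<0> ", text, flags=re.M)

# Emulated atomic group: the lookahead captures the full number once,
# \1 consumes it, and the engine cannot backtrack into it.
atomic = re.sub(r"^(?=(\d[\d.]*))\1(?![ \t])", r"\g<0> ", text, flags=re.M)

print(naive)
print(atomic)
```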
My alternative attempt also works:
Find what: ^(\d\.\d+) ?(?=\w)
Replace with: $1 followed by a space

Regex to match words after dot until a whitespace occurs

Given the following string
span.a.b this.is.really.confusing
I need to return the matches a and b. I've been able to get close with the following regex:
(?<=\.)[\w]+
But it's also matching is, really, and confusing. When I include a positive lookahead I get even closer, but I'm still not there.
(?<=\.)[\w]+(?=\s) # matches b, confusing
How can I match words after a dot until a whitespace occurs?
NB: this is language agnostic pseudo-code, but should work.
regex = "^[^\s.]+\.(\S+).*"
targets = <extracted_group>.split(".")
Regex explanation:
"^": begins with
"[^\s.]+\.": 1 or more non-whitespace, non-period characters, followed by a literal period.
"(\S+)": group and capture all of the following non-whitespace characters
".*": matches 0 or more of any non-newline character
If the split function takes a regex instead of a string, you'll need to escape the '.' or use a character class.
NB: You can do it without the split, but I think that the split is more transparent.
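As one concrete rendering of the pseudo-code, here is a minimal Python sketch. The variable names are illustrative, and the dot after the first word is escaped so it only matches a literal period.

```python
import re

line = "span.a.b this.is.really.confusing"

# Group 1 captures everything after the first dot up to the first whitespace.
match = re.match(r"^[^\s.]+\.(\S+).*", line)

# str.split takes a plain string here, so "." needs no escaping.
targets = match.group(1).split(".") if match else []
print(targets)  # ['a', 'b']
```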
I am not sure if this is good enough for all your possible cases, but it should work with the provided example:
\.([\w]+)\.([\w]+)\s
$1 = a, $2 = b

Regular expression to allow spaces between words

I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn't allow for spaces between words.
^[a-zA-Z0-9_]*$
For example, when using this regular expression "HelloWorld" is fine, but "Hello World" does not match.
How can I tweak it to allow spaces?
tl;dr
Just add a space in your character class.
^[a-zA-Z0-9_ ]*$
Now, if you want to be strict...
The above isn't exactly correct. Due to the fact that * means zero or more, it would match all of the following cases that one would not usually mean to match:
An empty string, "".
A string comprised entirely of spaces, "      ".
A string that leads and / or trails with spaces, "   Hello World  ".
A string that contains multiple spaces in between words, "Hello   World".
Originally I didn't think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn't a concern. Now that the question's gained some popularity however, I want to say...
...use @stema's answer.
Which, in my flavor (without using \w) translates to:
^[a-zA-Z0-9_]+( [a-zA-Z0-9_]+)*$
(Please upvote @stema regardless.)
Some things to note about this (and @stema's) answer:
If you want to allow multiple spaces between words (say, if you'd like to allow accidental double-spaces, or if you're working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
Here I suggest the + by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you'll need the + to catch both.
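The difference between the single-space, multi-space and \s+ variants can be checked with a short Python sketch (the variable names are illustrative only):

```python
import re

single = re.compile(r"^\w+( \w+)*$")       # exactly one space between words
multi_space = re.compile(r"^\w+( +\w+)*$") # one or more spaces
one_ws = re.compile(r"^\w+(\s\w+)*$")      # exactly one whitespace character
any_ws = re.compile(r"^\w+(\s+\w+)*$")     # one or more whitespace characters

print(bool(single.match("Hello  World")))       # False: double space
print(bool(multi_space.match("Hello  World")))  # True
print(bool(one_ws.match("Hello\r\nWorld")))     # False: \r\n is two characters
print(bool(any_ws.match("Hello\r\nWorld")))     # True
```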
Still not working?
Check which dialect of regular expressions you're using.* In languages like Java you'll have to escape your backslashes, i.e. \\w and \\s. In older or more basic languages and utilities, like sed, \w and \s may not be defined, so write them out as character classes, e.g. [a-zA-Z0-9_] and [ \f\n\r\t\v], respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I'm guessing it's not only those folks who are coming across this question. Currently it's the first hit on google for the search phrase, regular expression space word.
One possibility would be to just add the space to your character class, as acheong87 suggested. Whether that is acceptable depends on how strict your pattern needs to be, because it would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w, which in most regex flavours is the same as [a-zA-Z0-9_] (in some it is Unicode-based):
^\w+( \w+)*$
This will allow a series of at least one word and the words are divided by spaces.
^ Match the start of the string
\w+ Match a series of at least one word character
( \w+)* is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$ matches the end of the string
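To make the difference between the two approaches concrete, here is a small Python sketch that runs both patterns over a few edge cases. The sample list is made up for illustration, and re.ASCII is passed because Python's \w is Unicode-aware by default.

```python
import re

loose = re.compile(r"^[a-zA-Z0-9_ ]*$")          # space added to the class
strict = re.compile(r"^\w+( \w+)*$", re.ASCII)   # words separated by one space

samples = ["HelloWorld", "Hello World", "", "   ", " Hello World ", "Hello   World"]

loose_ok = [s for s in samples if loose.match(s)]
strict_ok = [s for s in samples if strict.match(s)]

print(loose_ok)   # every sample passes
print(strict_ok)  # ['HelloWorld', 'Hello World']
```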
This one worked for me
([\w ]+)
Try with:
^(\w+ ?)*$
Explanation:
\w - alias for [a-zA-Z_0-9]
a space followed by ? - allows an optional space after each word
I assume you don't want leading/trailing space. This means you have to split the regex into "first character", "stuff in the middle" and "last character":
^[a-zA-Z0-9_][a-zA-Z0-9_ ]*[a-zA-Z0-9_]$
or if you use a perl-like syntax:
^\w[\w ]*\w$
Also: If you intentionally worded your regex that it also allows empty Strings, you have to make the entire thing optional:
^(\w[\w ]*\w)?$
If you want to only allow single space chars, it looks a bit different:
^((\w+ )*\w+)?$
This matches 0..n words followed by a single space, plus one word without space. And makes the entire thing optional to allow empty strings.
This regular expression
^\w+(\s\w+)*$
will only allow a single space between words and no leading or trailing spaces.
Below is the explanation of the regular expression:
^ Assert position at start of the string
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\s\w+)*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s Match any white space character [\r\n\t\f ]
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ Assert position at end of the string
Just add a space to end of your regex pattern as follows:
[a-zA-Z0-9_ ]
This does not allow a space at the beginning, but allows spaces between words. It also allows special characters between words. A good regex for FirstName and LastName fields.
\w+.*$
For alphabets only:
^([a-zA-Z])+(\s)+[a-zA-Z]+$
For alphanumeric value and _:
^(\w)+(\s)+\w+$
If you are using JavaScript then you can use this regex:
/^[a-z0-9_.-\s]+$/i
For example:
/^[a-z0-9_.-\s]+$/i.test("") //false
/^[a-z0-9_.-\s]+$/i.test("helloworld") //true
/^[a-z0-9_.-\s]+$/i.test("hello world") //true
/^[a-z0-9_.-\s]+$/i.test("none alpha: ɹqɯ") //false
The only drawback with this regex is a string comprised entirely of spaces. "       " will also show as true.
It was my regex: @"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$"
I just added ([\w ]+) at the end of my regex, before the *:
@"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)([\w ]+)*$"
Now string is allowed to have spaces.
This regex allows only letters and spaces:
^[a-zA-Z ]*$
Try with this one:
result = re.search(r"\w+( )\w+", text)

Regular expression to match non-integer values in a string

I want to match the following rules:
One dash is allowed at the start of a number.
Only values between 0 and 9 should be allowed.
I currently have the following regex pattern. I'm matching the inverse so that I can throw an exception upon finding a match that doesn't follow the rules:
[^-0-9]
The downside to this pattern is that it works for all cases except a hyphen in the middle of the String will still pass. For example:
"-2304923" is allowed correctly but "9234-342" is also allowed and shouldn't be.
Please let me know what I can do to specify the first character as [^-0-9] and the rest as [^0-9]. Thanks!
This regex will work for you:
^-?\d+$
Explanation: start of the string ^, then an optional hyphen -?, then the digit \d repeated one or more times (+), and the string must end here $.
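A minimal validation sketch in Python, under a couple of assumptions: the helper name is_integer is made up, re.fullmatch is used in place of ^...$ because Python's $ also matches just before a trailing newline, and re.ASCII limits \d to 0-9.

```python
import re

INTEGER = re.compile(r"-?\d+", re.ASCII)

def is_integer(text: str) -> bool:
    """Return True if text is an optional hyphen followed only by digits."""
    return INTEGER.fullmatch(text) is not None

print(is_integer("-2304923"))  # True
print(is_integer("9234-342"))  # False
```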
You can do this:
(?:^|\s)(-?\d+)(?:["'\s]|$)
^^^^^^^^                     non-capturing group for start of line or space
        ^^^^^^^              capture number
               ^^^^^^^^^^^^  non-capturing group for end of line, space or quote
See it work
This will capture all strings of numbers in a line with an optional hyphen in front.
-2304923" "9234-342" 1234 -1234
++++++++                         captured
           ^^^^^^^^              NOT captured
                     ++++        captured
                          +++++  captured
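Whether the final -1234 is reported can depend on how the matches are collected. With Python's re.findall, which resumes scanning after the end of the previous match, the trailing group has already consumed the space that the next match would need for its leading (?:^|\s). The sketch below shows this, together with a lookahead variant that leaves the delimiter unconsumed; that variant is a suggestion added here, not part of the original answer.

```python
import re

line = '-2304923" "9234-342" 1234 -1234'

# Trailing delimiter is consumed, so "-1234" has no whitespace left in front of it.
consuming = re.findall(r'''(?:^|\s)(-?\d+)(?:["'\s]|$)''', line)

# Trailing delimiter is only asserted, so adjacent numbers can share a space.
lookahead = re.findall(r'''(?:^|\s)(-?\d+)(?=["'\s]|$)''', line)

print(consuming)  # ['-2304923', '1234']
print(lookahead)  # ['-2304923', '1234', '-1234']
```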
I don't understand how your pattern - [^-0-9] is matching those strings you are talking about. That pattern is just the opposite of what you want. You have simply negated the character class by using caret(^) at the beginning. So, this pattern would match anything except the hyphen and the digits.
Anyway, for your requirement, you first need to match an optional hyphen at the beginning, so keep it outside the character class and make it optional with ?. Then, to match the digits that follow, you can use [0-9]+ or \d+.
So, your pattern to match the required format should be:
-?[0-9]+ // or -?\d+
The above regex finds the pattern inside a larger string. If you want the entire string to match this pattern, add anchors at both ends of the regex:
^-?[0-9]+$
For a regular expression like this, it's sometimes helpful to think of it in terms of two cases.
Is the first character messed up somehow?
If not, are any of the other characters messed up somehow?
Combine these with |
(^[^-0-9]|^.+?[^0-9])
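Since this pattern matches the invalid inputs, it would be used with a search and the exception thrown on a hit. A small Python sketch (the helper name looks_invalid is made up) also shows two inputs that slip through and may need a separate check: a lone hyphen and the empty string.

```python
import re

INVALID = re.compile(r"(^[^-0-9]|^.+?[^0-9])")

def looks_invalid(text: str) -> bool:
    """True if the first character is bad, or any later character is a non-digit."""
    return INVALID.search(text) is not None

print(looks_invalid("-2304923"))  # False
print(looks_invalid("9234-342"))  # True
print(looks_invalid("-"))         # False, although "-" is not a number
```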