I'm trying to figure out how to capture a regex match and the two lines preceding it.
Example:
Santa Claus
North Pole, North Pole
H0H 0H0
The regex I have is for the Postal Code [a-z]{1}\d{1}[a-z]{1}\s\d{1}[a-z]{1}\d{1}
I want to be able to capture that result and the previous two lines as well, using one regex.
Does anyone have any ideas?
Thank you in advance.
You could use the following:
(.*\n.*\n[a-z]\d[a-z]\s\d[a-z]\d)
.*\n.*\n will match all characters on the previous two lines.
[a-z]\d[a-z]\s\d[a-z]\d - I removed {1} after each character class (since only one will be matched by default, this is redundant).
You may also need to add the case-insensitive i flag since [a-z] will only match lowercase characters. Otherwise that should be replaced with [A-Za-z] to catch the capital letters in the postal codes.
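As a rough illustration of the pattern above, here is a minimal Python sketch. Python is an assumption (the question does not name a language), the sample text and the names ADDRESS_BLOCK and find_address_block are made up for the example, and re.IGNORECASE stands in for the i flag.

```python
import re

# Two full lines followed by a postal code on the third line.
# IGNORECASE lets [a-z] match the capital letters in the code.
ADDRESS_BLOCK = re.compile(r"(.*\n.*\n[a-z]\d[a-z]\s\d[a-z]\d)", re.IGNORECASE)


def find_address_block(text):
    """Return the postal code plus the two lines before it, or None."""
    match = ADDRESS_BLOCK.search(text)
    return match.group(1) if match else None


# Hypothetical input: the address is surrounded by unrelated lines.
sample = "Dear Sir\nSanta Claus\nNorth Pole, North Pole\nH0H 0H0\nCanada"
block = find_address_block(sample)
print(block)
```

Because . does not match a newline by default, each .* stays on its own line, so only the two lines directly above the postal code are captured.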
I am trying to create a regex that checks if one or more middle-name initials have the following structure:
INITIAL.[BLANK]INITIAL.[BLANK]INITIAL.
There can be multiple Initials as long as they are followed by a dot (.) - blank spaces are only allowed between two initials (e.g. L. B.)
It should not be possible to have a space after an initial if there's no other initial following.
At the moment, I have the following Regex which doesn't work perfectly as of now:
([A-Z]\. (?=[A-Z]|$))+
Using regex101, this is an example:
As you can see, it still matches the string even though there's a blank space at the end, without having another Initial following.
I am not sure why this is happening. I am just learning regex and would be glad if anyone could provide me with a solution to my problem :)
The problem is that on the last iteration your expression consumes [A-Z]\. plus the trailing space, then looks ahead for $ (and finds it). I would express the pattern this way: (?:[A-Z]\. )*[A-Z]\.$. Treat the last initial specially because it has no trailing space.
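A quick way to check this idea is a small Python sketch (Python and the name is_valid_initials are assumptions for illustration). It uses fullmatch instead of a trailing $, since Python's $ also matches just before a final newline.

```python
import re

# Zero or more "X. " groups, then a final "X." with no trailing space.
INITIALS = re.compile(r"(?:[A-Z]\. )*[A-Z]\.")


def is_valid_initials(text):
    """True only if the whole string is a well-formed run of initials."""
    return INITIALS.fullmatch(text) is not None


for candidate in ["L.", "L. B.", "L. B. ", "L.B."]:
    print(repr(candidate), is_valid_initials(candidate))
```

With this sketch, "L. B. " is rejected because the space after the last initial is never part of the pattern.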
The pattern you tried, ([A-Z]\. (?=[A-Z]|$))+, uses a repeated capturing group, which will give you only the value of the last iteration.
In that repetition you match [A-Z]\. followed by a space, which means the space must be present in the match.
Instead, you could repeat 0+ times a char [A-Z], a dot, and a space to match multiple occurrences.
Then match a char [A-Z] and a dot, asserting that what is on the right is not a non-whitespace char.
\b(?:[A-Z]\. )*[A-Z]\.(?!\S)
If there can be multiple spaces but it should not match a newline:
\b(?:[A-Z]\.[^\S\r\n]*)*[A-Z]\.(?!\S)
I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which discards from the match everything the regex has matched up to that point and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by \W, which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The pattern above matches each pair of two words (note that it won't match the last word, since it does not have a pair). The first word is in the first capture group, (\w+). Then a positive lookahead includes a match for a sequence of non-word characters \W+ and then another string of word characters \w+. Because it sits inside the lookahead (?=...), the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1, group 2); group 2 also includes the separator characters in front of the second word.
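To see the overlapping matches in action outside RapidMiner, here is a minimal Python sketch (Python and the name word_pairs are assumptions for illustration; this extracts the pairs directly rather than acting as a tokenizer delimiter).

```python
import re

# Group 1 is the first word; group 2 (inside the lookahead) is the
# separator plus the next word, which is checked but not consumed.
PAIR = re.compile(r"(\w+)(?=(\W+\w+))")


def word_pairs(text):
    """Return overlapping two-word tokens, separators included."""
    return [first + rest for first, rest in PAIR.findall(text)]


pairs = word_pairs("That was a good movie.")
print(pairs)
```

Since the second word is only looked at, the next match can start on it, which is what produces the overlap.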
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that it is really a problem of overlapping regex matches.
I have to verify that strings match the following format before the first whitespace (if there is one):
Up to 3 leading letters
At least 4 consecutive digits
Up to 3 trailing letters
To give examples, the following are valid:
1234
Abc123456DeF
1234 blah+
XyZ01234
I'm having trouble avoiding this case however: 123a+b blah
So far I have (^\w{0,3}\d{4}\w{0,3})\s* but the problem lies in making sure a non-letter isn't caught in the first section.
I can see a couple solutions:
Run regex twice, first getting the string up to the first whitespace ([^\s]+) then apply regex again to that making sure it ends in up to 3 letters (^\w{0,3}\d{4}\w{0,3}$). This is what I do now, but surely there's a way to do this in one expression - I just can't figure out how
Make sure no non-letters exist between the (potential) 3 trailing letters and the (potential) whitespace. (^\w{0,3}\d{4}\w{0,3}no non-letters)\s*
I've tried negative lookahead (?!.*) but that doesn't seem to do anything.
This regex satisfies your specifications.
Regex: ^\w{0,3}\d{4,}\w{0,3}\s?$
Explanation:
According to your specifications.
\w{0,3}? Up to 3 leading letters
\d{4,} At least 4 consecutive digits
\w{0,3}? Up to 3 trailing letters
I have to verify that strings match the following format before the first whitespace (if there is one):
\s? hence an optional space.
Note: I am keeping the above struck out because many shortcomings were pointed out in the comments, and removing it would lose their context.
Solution:
As I said in my comment:
@JCK: The problem is that even the whitespace is optional, which makes it difficult to differentiate between the first and second parts.
Now employing a lookahead solves this problem. Complete regex goes like this.
Regex: ^(?=.*[0-9]{4,}[A-Za-z]{0,3}(?:\s|$))[A-Za-z]{0,3}[0-9]{4,}[A-Za-z]{0,3}\s*?(?:\S*\s*)*$
Explanation:
(?=.*[0-9]{4,}[A-Za-z]{0,3}(?:\s|$)) This positive lookahead makes sure that the first part defined by your specifications is matched. It looks for mentioned specs and either a \s or $ i.e end of string. Thus matching the first part.
[A-Za-z]{0,3}[0-9]{4,}[A-Za-z]{0,3}\s*?(?:\S*\s*)* Rest of the regex is as per the specifications.
Check by entering strings one by one.
Regex: (^[A-Za-z]{0,3}\d{4,}[A-Za-z]{0,3})(?:$|\s+)
\w is same as [A-Za-z0-9_], so to match just letters you should use [A-Za-z].
(?:$|\s+) matches end of string or at least one whitespace (hence ignoring the rest of the string).
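A small Python sketch of this last pattern, run against the examples from the question (Python and the name leading_code are assumptions for illustration):

```python
import re

# Up to 3 letters, at least 4 digits, up to 3 letters, and then either
# the end of the string or whitespace.
CODE = re.compile(r"(^[A-Za-z]{0,3}\d{4,}[A-Za-z]{0,3})(?:$|\s+)")


def leading_code(text):
    """Return the part before the first whitespace if valid, else None."""
    match = CODE.match(text)
    return match.group(1) if match else None


for sample in ["1234", "Abc123456DeF", "1234 blah+", "XyZ01234", "123a+b blah"]:
    print(repr(sample), leading_code(sample))
```

The last sample is rejected because only three digits appear before the first letter.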
I've been trying to use Regex tools online, but none seem to be working. I am close but not sure what I'm missing.
Here is the Text:
Valencia, Los Angeles, California - Map
I want to extract the first 2 letters of the state (so between "," and "-"), in this case "CA"
What I've done so far is:
[,/](.*)[-/]
$1
The output is:
Los Angeles, California
If anything I thought I would at least just get the state.
,\s*(\w\w)[^,]*-
will capture Ca in group 1.
, comma
\s* whitespace
(\w\w) capture the first two characters
[^,]* make sure there's no comma up to the next dash
- the dash that follows the state name
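Here is a minimal Python sketch of that pattern applied to the sample text (Python and the name state_prefix are assumptions for illustration; upper-casing the result is an extra step to get "CA" rather than "Ca").

```python
import re

# Comma, optional whitespace, two word characters (captured), then
# anything except a comma up to the dash.
STATE = re.compile(r",\s*(\w\w)[^,]*-")


def state_prefix(text):
    """Return the first two letters of the state, upper-cased, or None."""
    match = STATE.search(text)
    return match.group(1).upper() if match else None


prefix = state_prefix("Valencia, Los Angeles, California - Map")
print(prefix)
```

The first comma cannot start a match because another comma appears before the dash, so the search moves on to the comma in front of the state.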
,\s*(\S{2})[^,]*-
You're going to want to take just the first match.
I assume you use JavaScript.
Your regex fails this particular case because there are two commas in your input.
One possible fix is to modify the middle capture from . (any character) to [^,] (any character except comma). This will force the regex to match California only.
So, try [,/]([^,]*)[-/]. Here's a demo of how it works.
You can use this regex:
.*?,\s(\w\w)[^,]*-
$1 is the first two letters you're looking for.
I have difficulty using Regular Expression (Grep) in TextWrangler to find occurrences of lowercase letter followed by uppercase. For example:
This announcement meansStudents are welcome.
In fact, I want to split the occurrence by adding a colon so that it becomes means: Students
I have tried:
[a-z][A-Z]
But this expression does not work in TextWrangler.
*EDIT: here are the exact contexts in which the occurrences appear (I mean only with these font colors).*
<font color =#48B700> - Stột jlăm wẻ baOne hundred and three<br></font>
<font color =#C0C0C0> »» Qzống pguộc lyời ba yghìm fảy dyổiTo live a life full of vicissitudes, to live a life marked by ups and downs<br></font>
"baOne" and "dyổiTo" must be "ba: One" and "dyổi: To"
Could anyone help? Many thanks.
I do believe (don't have TextWrangler at hand though) that you need to search for ([a-z])([A-Z]) and replace it with: \1: \2
Hope this helps.
Replace ([a-z])([A-Z]) with \1: \2 - I don't have TextWrangler, but it works in Notepad++.
The parentheses are for capturing the data, which is referred to using \1 syntax in the replacement string.
This question is ages old, but I stumbled upon it, so someone else might as well. The OP's comment on Igor's response clarified how the task was meant to be described (and could have been added to the description).
To match only those font-specific lines of the HTML replace
(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-z])([A-Z])
with \1: \2
Explanation:
(?<=[fixed-length regex]) is a positive lookbehind and means "if my match has this just before it"
(?:48B700|C0C0C0) is an unnamed group to match only 2 colours. Since they are of the same length, they work in a lookbehind (that needs to be of fixed length)
(.*?[a-z])([A-Z]) will match everything after the > of those begin font tags up to your Capital letters.
The \1: \2 replacement is the same as in Igor's response, only that \1 will match the entire first string that needs separating.
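The same find/replace can be tried outside TextWrangler with a short Python sketch (Python and the name add_colon are assumptions for illustration; Python's re module also requires the lookbehind to be fixed-length, which this one is).

```python
import re

# Only lines that start with one of the two font tags are touched.
FONT_SPLIT = re.compile(
    r"(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-z])([A-Z])"
)


def add_colon(line):
    """Insert ': ' at the first lowercase-to-uppercase boundary after the tag."""
    return FONT_SPLIT.sub(r"\1: \2", line)


line = "<font color =#48B700> - Stột jlăm wẻ baOne hundred and three<br></font>"
fixed = add_colon(line)
print(fixed)
```

A line with any other colour value is returned unchanged, because the lookbehind never succeeds.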
Addition:
Your input strings contain special characters, and the part you want to split may very well end in one. In this case they won't be caught by [a-z] alone. You will need to add a character range that captures all the letters you care about, something like
(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-zḁ-ῼ])([A-Z])
That is the correct pattern for identifying lowercase and uppercase letters; however, you will need to tick the Case Sensitive option in the Find/Replace dialog.