match any number of consecutive words following a backslash - regex

I'm trying to use a regex to match TeX commands, i.e. a backslash followed by a word from a given list, allowing any number of them in a row. For example, if the list is test, other, list, then \test, \other, and \list should match, while \sdfsdf should not. I would also like \test\list and \test\other\list to match. However, I don't want to match things like \testagain (although the \test in \test again should match). I tried the following regex
(\\)(test|other|list)([^a-z])
to no avail, since it does not match \test\other. How would I do this? I am not very experienced with regex.

Use \b to match the word boundary at the end of the word.
\\(test|other|list)\b
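As a quick check of the word-boundary approach, here is a minimal Python sketch. The sample string and the names COMMANDS, SINGLE, and SEQUENCE are illustrative assumptions, and wrapping the pattern in (?:...)+ to grab a whole run of commands as one match is an extension of the answer above, not part of it.

```python
import re

COMMANDS = ["test", "other", "list"]  # hypothetical list of allowed command names
ALTERNATION = "|".join(re.escape(name) for name in COMMANDS)

# One command: a backslash, an allowed name, then a word boundary.
SINGLE = re.compile(r"\\(?:" + ALTERNATION + r")\b")
# A run of one or more such commands, matched as a single unit (assumed extension).
SEQUENCE = re.compile(r"(?:\\(?:" + ALTERNATION + r")\b)+")

sample = r"\test\other\list \testagain \test again \sdfsdf"

print(SINGLE.findall(sample))    # each allowed command on its own
print(SEQUENCE.findall(sample))  # consecutive commands grouped together
```

The \b fails between "test" and "again" in \testagain because both neighbours are word characters, while a following backslash or space is a non-word character and lets the match succeed.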

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is: how can I force the expression to skip every other whitespace?
First of all, we can use \W to identify the characters that separate the words. To treat a run of consecutive separators as a single one, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which discards everything matched so far from the reported match, so that only what follows \K is returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
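Python's standard re module does not support \K, so the sketch below imitates the two-step tokenization with a capture group instead: it finds "n words plus a trailing separator run" and cuts the text at that trailing run. The function names split_every_n_words and words are hypothetical, and this only illustrates the idea; it is not RapidMiner's implementation.

```python
import re

def split_every_n_words(text, n=2):
    """Cut text at the separator run that follows every n-th word."""
    # n words joined by separators, then the separator run to cut at (group 1).
    pattern = re.compile(r"\w+(?:\W+\w+){%d}(\W+)" % (n - 1))
    tokens, start = [], 0
    for m in pattern.finditer(text):
        tokens.append(text[start:m.start(1)])  # keep everything before the separator
        start = m.end(1)                       # resume after the separator
    if start < len(text):
        tokens.append(text[start:])            # leftover words that form no full group
    return tokens

def words(token):
    """Second step: split one token on \\W+ and drop empty pieces."""
    return [w for w in re.split(r"\W+", token) if w]

chunks = split_every_n_words("That was a good movie.")
print(chunks)
print([words(c) for c in chunks])
```

Note that this splitting produces non-overlapping groups (That was, a good, movie.), so it does not by itself yield the overlapping pairs from the question.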
Also note that in Java's regex flavor (RapidMiner is Java-based), \w by default matches only ASCII letters, digits, and the underscore, so if your documents contain text in a different script, those characters will be consumed by \W, which is supposed to match the separators. To capture text in non-Latin scripts, such as Greek, change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language but not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you would use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern produces one match per pair of adjacent words (note that the last word never starts a match, since no word follows it). The first word is in the first capture group, (\w+). A positive lookahead then requires a run of non-word characters \W+ followed by another run of word characters \w+. Because the second word is matched inside the lookahead (?=...), it is not "consumed" and can begin the next match.
Here is a link to a demo on Regex101
Note that each match has two capture groups: group 1 holds the first word, and group 2 holds the separator plus the second word.
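To see the overlapping matches concretely, here is a small Python sketch of the lookahead pattern applied to the sentence from the question. The names PAIR and overlapping_pairs are illustrative, and joining the two groups back into one string is just one possible way to present the result.

```python
import re

# Group 1: the first word. Group 2 (inside the lookahead): separator + next word.
PAIR = re.compile(r"(\w+)(?=(\W+\w+))")

def overlapping_pairs(text):
    # The lookahead does not consume the second word, so it can start the next match.
    return [m.group(1) + m.group(2) for m in PAIR.finditer(text)]

print(overlapping_pairs("That was a good movie."))
```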
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)), inspired by this SO question.
In hindsight, my question was poorly framed: this is really a problem of overlapping regex matches.

trying to find the correct regular expression

I have the following cases that should match a regular expression. I've tried several combinations and read a lot of answers, but I still have no clue how to solve it.
The rule is: find any combination of . inside a quoted string. At the moment I have the following regexp
\"\w*((..)|(.))\w*\"
that covers most of the cases:
mmmas"A.F"asdaAA
196.34.45.."asd."#
".add"
sss"a.aa"sss
".."
"a.."
"a..a"
"..A"
but still having problems with this one:
"WERA.HJJ..J"
I've been testing the regexp on the http://regexr.com/ site.
I would really appreciate any help with this.
Change your regex to
\"\w*(\.+\w*)+\"
Update: escape . to match the dot and not any character
From the question, it seems that you need to find every occurrence of one or more dot (along with optional word characters) inside a pair of quotes. The following regex would do this:
\"\w*(\.+\w*)+\"
In "WERA.HJJ..J", you have some word characters followed by a dot, then another run of word characters, then two dots and more word characters. Your regex only allows a single group of one or two dots, with an optional block of word characters on either side, so it cannot match a string containing a second group of dots.
The dots in the regex are escaped so that they match a literal dot rather than any character, since . is a metacharacter.
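The following Python sketch runs the suggested pattern over the test strings from the question; the variable names are illustrative, and the extra "abc" case (a quoted string without any dot) is an added assumption to show a non-match.

```python
import re

FIXED = re.compile(r'"\w*(\.+\w*)+"')          # pattern from the answers
ORIGINAL = re.compile(r'"\w*((..)|(.))\w*"')   # pattern from the question, dots unescaped

cases = [
    'mmmas"A.F"asdaAA', '196.34.45.."asd."#', '".add"', 'sss"a.aa"sss',
    '".."', '"a.."', '"a..a"', '"..A"', '"WERA.HJJ..J"',
]

for text in cases:
    m = FIXED.search(text)
    print(text, "->", m.group(0) if m else None)

# The question's pattern fails on the problem case; the fixed one matches it.
print(ORIGINAL.search('"WERA.HJJ..J"'))
print(FIXED.search('"abc"'))  # no dot inside the quotes, so no match
```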

Don't know how to use lookarounds properly to achieve my Regex match

I'm writing a perl script and part of it requires that I match all occurrences of a certain pattern in a string. Naturally, a regular expression seems like it would be powerful enough, but I just can't get it right for this particular string.
A hypothetical example of the type of text the regex might be applied to would be:
1cat;2dog;!3monkey;!4horse;
As you can see, several data entries (1cat, 2dog, etc.) are present in the line, delimited by semicolons. The beginning of the line contains no semicolon, but the end does. I want to be able to match all the stuff which hasn't been not'ed by the !. In the above example, 1cat and 2dog would be matched and returned in list context, while 3monkey and 4horse would not.
What I have tried to do so far is use negative lookbehinds to notice only the entries without a !. Something like this:
m/(?<!\!)(\w+)\;/g
However, this doesn't work, because for every !'ed entry the regex just matches what comes after the next character, up to the semicolon. In the example, 1cat and 2dog are captured, but then so are monkey and horse.
I feel like this is easily doable, but I'm new to regular expressions and I can't think of anything else.
Throw a word boundary (\b) in there and you should be good:
(?<!!)\b(\w+);
As you could tell, your negative lookbehind was working, but the regex could still start a match one character later (horse from !4horse), where the preceding character is no longer a !. A word boundary is a zero-width assertion: like the anchors ^ and $, it matches a position rather than any characters. It holds at the positions described by this: (^\w|\w\W|\W\w|\w$). In other words, anywhere a word character ([a-zA-Z0-9_]) is next to the beginning or end of the string or a non-word character.
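The question uses Perl, but the same two patterns behave the same way in Python's re module, which makes for a compact side-by-side check. This is only a sketch; the variable names are illustrative.

```python
import re

entries = "1cat;2dog;!3monkey;!4horse;"

# Lookbehind alone: the match can still start inside a negated entry,
# one character after the digit that follows the "!".
without_boundary = re.findall(r"(?<!!)(\w+);", entries)

# Adding \b forces the match to start at the beginning of a word,
# where the lookbehind actually sees the "!".
with_boundary = re.findall(r"(?<!!)\b(\w+);", entries)

print(without_boundary)
print(with_boundary)
```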

Regular expression match if there's non-alphabetical character at the end, or nothing?

I have some regular expressions that match homonyms, like tw?oo? would match either two, to, or too. (It also matches twoo, but that's ok).
My question is, I want the regular expression to match if there is punctuation or some other nonalphabetical character at the ends, like to, or two. or even ,too!. If there's nothing at the end, that's ok as well.
So I want it to match tw?oo? if there are no other characters on each side, or if there are non-alphabetical characters, but not if there are letters around: tomorrow shouldn't match.
I tried [^A-Za-z]?tw?oo?[^A-Za-z]? , but since the character classes are optional they just get omitted.
How would I do this, so the regex only matches the words if they are on their own or surrounded by punctuation? (Spaces aren't a problem; they've been cut out.)
Thanks!
Use word boundaries \b. They match wherever a word character (\w) is adjacent to a non-word character or to the start or end of the string:
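use feature 'say';  # say requires Perl 5.10+
for (qw/two to tomorrow/) {
    # \b on both sides rejects words that merely contain a homonym
    say "$_ ", /\b(?:two|to|too)\b/ ? "matches" : "doesn't match";
}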
Output:
two matches
to matches
tomorrow doesn't match
Edit
I changed the regex to /\b(?:two|to|too)\b/ per tobyink's suggestion. This is more readable than tw?oo? and more correct than tw?o+, and triggers the trie optimization, which transforms that part of the regex into a very efficient state machine.
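For comparison, here is a Python sketch of the same check, extended to the punctuated inputs from the question. One caveat: \b treats digits and the underscore as word characters, so the STRICT variant, which is an assumption added here and not part of the answer, uses lookarounds that reject only adjacent letters, which is closer to the "non-alphabetical" wording of the question.

```python
import re

HOMONYM = re.compile(r"\b(?:two|to|too)\b")
# Assumed alternative: only letters on either side block the match.
STRICT = re.compile(r"(?<![A-Za-z])(?:two|to|too)(?![A-Za-z])")

samples = ["two", "to", "too", "to,", "two.", ",too!", "tomorrow"]
results = {s: bool(HOMONYM.search(s)) for s in samples}

for s, matched in results.items():
    print(s, "matches" if matched else "doesn't match")

# A digit after the word: \b sees no boundary, the letter-only lookahead does.
print(bool(HOMONYM.search("to2")), bool(STRICT.search("to2")))
```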

Matching parts of string that contain no consecutive dashes

I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
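A short Python sketch of this pattern on the input from the question follows. Because every part of the pattern is optional, it can also match the empty string (for example at the end of the input), so the sketch filters empty matches out; that filtering step and the names used are assumptions of this example.

```python
import re

NO_DOUBLE_DASH = re.compile(r"-?(?:[^-]-?)*")

text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"

# Keep only non-empty matches; the pattern itself also matches "".
parts = [m.group(0) for m in NO_DOUBLE_DASH.finditer(text) if m.group(0)]
print(parts)
```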
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
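The splitting idea can be sketched in Python as well, assuming Python 3.7 or later, where re.split accepts patterns that match the empty string. The variable names are illustrative.

```python
import re

text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"

# Zero-width split point: a hyphen directly behind and a hyphen directly ahead.
pieces = re.split(r"(?<=-)(?=-)", text)
print(pieces)
```

Each piece keeps its own hyphen because the split position lies between the two hyphens and consumes neither of them.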
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.