Matching multiple letters and special characters in regex - regex

I am trying to catch strings around the acronym ADJ. The strings look like this:
·NOM·JJ·ADJ+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_DEF_ACC
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_GEN
·NOM·JJ·ADJ+CASE_INDEF_GEN
·NOM·JJ·ADJ+NSUFF_FEM_SG+CASE_INDEF_GEN
·NOM·DT+JJ·DET+ADJ+NSUFF_FEM_SG+CASE_DEF_ACC
So far I have this:
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*?/g
But it only matches from the beginning of the strings until "ADJ+" ·NOM·DT+JJ·DET+ADJ+.
Since the rest of the strings after ADJ have the same composition of the beginning of the strings before ADJ, I thought this /[A-Z·\+#_]*?[·\+]/g should work, but it doesn't.
How do I get it to match the rest of the string?

My guess is that you want to make sure if you have an ADJ in the string, which if so, maybe we could simplify our expression to something similar to:
([A-Z·+#_]*)\bADJ\b([A-Z·+#_]*)
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.

That *? quantifier after the +ADJ+ phrase is satisfied with the empty string right after it, since the ? makes the quantifier before it match "the minimum number of times possible" and for * that is zero times.
So drop the ?, which also has no purpose for the rest of the line
perl -wE'$_=q(-XADJX-JJ+ADJ-REST-);
($before, $after) = /(.*?)[+\-]ADJ[+\-](.*)/;
say for $before,$after'

Removing the ? at the end would match the whole strings,
/[A-Z·\+#_]*?[·\+]ADJ[·\+][A-Z_·\+#]*/g
I am not entirely sure why you needed a ? in a *.

Related

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches(I get only a the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
See Demo
Just Some clarifications why your regex is doing what is doing:
\$\$[^\W+]\w+$
Unescaped $ char means end of string, so, your pattern is matching something that must be on the end of the string, that's why its getting only the last match.
This group [^\W+] doesn't really makes sense, groups starting with [^..] means negate the chars inside here, and \W is the negation of words, and + inside the group means literally the char +, so you are saying match everything that is Not a Not word and that is not a + sign, i guess that was not what you wanted.
To match the next word just \w+ will do it. And the global modifier /g ensures that you will not stop on the first match.
This should work - Based on what you said you wanted to match this should work . Also it won't match $$lower_case_strings if that's what you wanted. If not, add the "i" flag also.
\${2}[A-Z_]+/g

Regular Expression for phrases starting with TO

I am pretty new to Regular Expression. I want to write a regular expression to get the TO Followed by the rest of it after each new line. I tried to use this but doesn't work properly.
^TO\n?\s?[A-Za-z0-9]\n?[A-Za-z0-9]
It only highlights properly the TO W11 which all are in one line. Highlights only TO from first data and the 3rd data only highlights the first line. Basically it doesn't read the new lines.
Some of my data looks like this:
TO
EXTERNAL
TRAVERSE
TO W11
TO CONTROL
TRAVERSE
I would appreciate if anybody can help me.
Make sure you use a multiline regex:
var options = RegexOptions.MultiLine;
foreach (Match match in Regex.Matches(input, pattern, options))
...
More at: http://msdn.microsoft.com/en-us/library/yd1hzczs(v=vs.110).aspx
It looks like your pattern isn't matching because the start of the string is really a space and not the T character. Also, [A-Za-z0-9] matches only one character, and you want the whole word. I used the + to denote that I want one or more matches of those characters.
(TO\n?\s?[A-Za-z0-9]+)
This regex matches "TO EXTERNAL", "TO W11" and "TO CONTROL". Be sure to use the global modifier so that you get all matches, not just the first one.

How to match any word from a word group

I'm trying to create a pattern that would identify a money in a string. My expression so far is:
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}[zl|zł|zlotych|złotych|pln|PLN]{0,1}
and my main problem is with the last part: [zl|zł|zlotych|złotych|pln|PLN], which should find one of the national notations for money value (sth like $ or usd or dollars) but I'm doing it wrong, since it also matches something like '108.1 z'. Is it possible to change the last part, so that it would match only expressions that contain the whole expressions like 'zl', 'pln' and so on, and not single letters?
Yes, don't use [], which defines a character class, but instead use () to group your words.
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}(zl|zł|zlotych|złotych|pln|PLN)?
As you had it written, [zl|zł|zlotych|złotych|pln|PLN], means "match any of the characters contained in the []", or the equivalent of: [zl|łotychpnPLN] (duplicates removed)
If you don't want the money symbol captured, then start the group with ?:, i.e.:
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}(?:zl|zł|zlotych|złotych|pln|PLN)?
Use parentheses (which delimit groups) rather than square brackets (which delimit character classes) around that last group.
As a matter of style, use ? instead of {0,1}.
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}(zl|zł|zlotych|złotych|pln|PLN)?
You have a few problems here. First off, inside [] characters are taken as literals, so the first two [] blocks should be [.,\s].
Next (as the other answers say), the last [] block needs to be a group, not a character class, so replace the [] with ().
Finally, at the end you can replace {0, 1} with ?. It won't make a difference, but it's neater.
The regex should look like this:
(\d{1,3}[.,\s]{0,2})*\d{3}[.,\s]{0,2}\d{0,2}[\s]{0,2}(zl|zł|zlotych|złotych|pln|PLN)?
For the future, for regex questions it's really helpful if you post a typical input string and desired match along with your question!

How to match a string that does not end in a certain substring?

how can I write regular expression that dose not contain some string at the end.
in my project,all classes that their names dont end with some string such as "controller" and "map" should inherit from a base class. how can I do this using regular expression ?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$`
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any non-empty lines not ending with "SomeText" as well.
You could write a regex that contains two groups, one consists of one or more characters before controller or map, the other contains controller or map and is optional.
^(.+)(controller|map)?$
With that you may match your string and if there is a group() method in the regex API you use, if group(2) is empty, the string does not contain controller or map.
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
finally I did it in this way
public.*class.*[^(controller|map|spec)]$
it worked

Regex is behaving lazy, should be greedy

I thought that by default my Regex would exhibit the greedy behavior that I want, but it is not in the following code:
Regex keywords = new Regex(#"in|int|into|internal|interface");
var targets = keywords.ToString().Split('|');
foreach (string t in targets)
{
Match match = keywords.Match(t);
Console.WriteLine("Matched {0,-9} with {1}", t, match.Value);
}
Output:
Matched in with in
Matched int with in
Matched into with in
Matched internal with in
Matched interface with in
Now I realize that I could get it to work for this small example if I simply sorted the keywords by length descending, but
I want to understand why this
isn't working as expected, and
the actual project I am working on
has many more words in the Regex and
it is important to keep them in
alphabetical order.
So my question is: Why is this being lazy and how do I fix it?
Laziness and greediness applies to quantifiers only (?, *, +, {min,max}). Alternations always match in order and try the first possible match.
It looks like you're trying to word break things. To do that you need the entire expression to be correct, your current one is not. Try this one instead..
new Regex(#"\b(in|int|into|internal|interface)\b");
The "\b" says to match word boundaries, and is a zero-width match. This is locale dependent behavior, but in general this means whitespace and punctuation. Being a zero width match it will not contain the character that caused the regex engine to detect the word boundary.
According to RegularExpressions.info, regular expressions are eager. Therefore, when it goes through your piped expression, it stops on the first solid match.
My recommendation would be to store all of your keywords in an array or list, then generate the sorted, piped expression when you need it. You would only have to do this once too as long as your keyword list doesn't change. Just store the generated expression in a singleton of some sort and return that on regex executions.