Regular Expression for first two characters - regex

I need to match records that start with a certain character, followed by one character from a given set. After the first two characters, any character or digit is allowed. For example, in the following dataset:
man
mbn
mcn
mdn
aan
adn
I need to extract the words that start with m followed by a-c, so only the first 3 records should match.

Maybe this will work for you:
^m[a-c]\w+$

m[a-c] does what you want here, as long as it is anchored to the start of the record (^m[a-c]).

What language? Perl, C#, Python? They're similar, but here's a regex from C#:
m[a-c]\w+
I'd also recommend that you take a look at Regulator if you're building C#-based regex strings. It works for other languages too, with the exception of .NET-specific features.
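As a quick check of the pattern against the sample records, here is a minimal sketch in Python. The question never names a language, so the choice of Python and the variable names are assumptions made for illustration.

```python
import re

# Sample records from the question.
records = ["man", "mbn", "mcn", "mdn", "aan", "adn"]

# 'm', then one character from a-c, then one or more word characters.
pattern = re.compile(r"m[a-c]\w+")

# fullmatch anchors the pattern at both ends of each record,
# which plays the role of ^ and $ in the first answer.
matches = [r for r in records if pattern.fullmatch(r)]
print(matches)
```

Only the first three records pass; mdn fails on its second character, and aan and adn fail on their first.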

Related

Lucene Regex for alphanumeric match but not all numeric

I want to find alphanumeric words with a Lucene automaton regex, excluding words that are entirely numeric or entirely alphabetic.
I have tried
(([a-zA-Z0-9]{1,10})&(.*[0-9].*))
but this also returns all-numeric words.
So I tried to negate all-numeric words as below, but it does not work:
(^[0-9])(([a-zA-Z0-9]{1,10})&(.*[0-9].*))
Input String:
DL200, dal2 , 700091
Expected output:
DL200 and dal2
but it should not return 700091
With the help of JvdV's answer and https://stackoverflow.com/a/38665819/9758194, I was able to get the desired output:
(([a-zA-Z0-9]{1,10})&(.*[0-9].*))&~([0-9]*)
I don't know much about the Lucene regex flavor, but a little research taught me that it does not support the PCRE library, although some standard operators are supported. I found that it includes neither lookarounds nor word boundaries. Have a look at the docs.
Either way, to overcome the lack of lookaround support I had a look at this older SO post and used ~ instead. Furthermore, I see you can use the & operator to check whether the string matches multiple patterns.
This leads me to assume the following pattern might work for you:
~[0-9]+&~[^0-9]+&[A-Za-z0-9]{2,10}
~[0-9]+ - Negate a string made of numbers only.
&
~[^0-9]+ - Negate a string made of non-numbers only.
&
[A-Za-z0-9]{2,10} - Matches a string that is made out of 2 to 10 alphanumeric characters.
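The logic of that pattern can be sanity-checked outside Lucene. Python's re module has no & (intersection) or ~ (complement) operators, so the sketch below emulates them with three separate whole-string checks. It illustrates the logic only and is not Lucene code; the helper name is_mixed_alnum is hypothetical.

```python
import re

def is_mixed_alnum(token: str) -> bool:
    """Emulate ~[0-9]+&~[^0-9]+&[A-Za-z0-9]{2,10} with whole-string checks."""
    all_digits = re.fullmatch(r"[0-9]+", token) is not None       # negated by ~
    no_digits = re.fullmatch(r"[^0-9]+", token) is not None       # negated by ~
    alnum = re.fullmatch(r"[A-Za-z0-9]{2,10}", token) is not None
    # & is intersection: every condition must hold at once.
    return alnum and not all_digits and not no_digits

# Input tokens from the question.
tokens = ["DL200", "dal2", "700091"]
kept = [t for t in tokens if is_mixed_alnum(t)]
print(kept)
```

DL200 and dal2 are kept, while 700091 is rejected because it consists of digits only.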

How to evaluate a RegExp in an array with match groups?

I need to parse an array-like text with regular expression and get the match groups.
One example of the text I want to parse is this:
['red','green', 'blue']
I want to use match groups, because I want to extract them.
I am using this regular expression, but the groups found by it are not like what I expected:
\[ *('.+?')( *, *('.+?'))* *\]
The idea is to parse in this order:
A square bracket
Any number of spaces
A group with:
Single quote
Any character
Single quote
Zero or more groups of:
Any number of spaces
A comma
Any number of spaces
A group with
Single quote
Any character
Single quote
Any number of spaces
A square bracket
And get one group with each parsed array element.
Can you help me?
Hint: an easy way to test a regexp is the site http://rubular.com
This isn't going to be a complete answer, but I'm fairly certain you can't reliably check for whitespace with " *", since it only matches literal space characters; at least, it may depend on the language you're using.
Here's a C# regex example that shows some of the language requirements to check for whitespace: regex check for white space in middle of string
Edit: I see you added Ruby as your language. Unfortunately I'm not well versed in Ruby, so I can't help you with specifics, sorry.
Edit2: Seeing as you're tied to Ruby for debugging your regex, might I suggest http://www.debuggex.com/, which tries to stay language independent?
Try this regex: '([^']+)'. According to rubular.com, it should give you the match groups red, green, blue.
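To see why the original expression disappoints and why scanning for each element helps, here is a minimal sketch in Python. The question is tagged Ruby, so Python is an assumption for illustration; Ruby's String#scan plays a similar role to re.findall. In Python, as in most engines, a capturing group inside a repetition only keeps its last iteration.

```python
import re

text = "['red','green', 'blue']"

# The question's pattern: the repeated groups (2 and 3) are overwritten on
# every iteration, so 'green' is lost and only the last element survives.
original = re.search(r"\[ *('.+?')( *, *('.+?'))* *\]", text)
print(original.groups())

# Scanning for every quoted element returns one capture per match instead.
elements = re.findall(r"'([^']+)'", text)
print(elements)
```

The trade-off is that the scanning approach no longer checks the surrounding brackets and commas; if that validation matters, it can be done in a separate pass.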
You can match an arbitrary number of groups with one regex:
^\[\s*|(?:\G'([^']+)'\s*(?:,\s*|]$))+
or like this (should be more performant):
^\[\s*+|(?>\G'([^']++)'\s*+(?>,\s*+|]$))++
This works in Ruby, as asked; I don't know about Delphi.

How can I match multiple sets of a regular expression pattern in the same string?

I am building a regex against strings that meet the following requirements:
The string has a maximum of 5 sets of alphanumeric characters.
Each set within the string is separated by a SINGLE whitespace character.
For example, we can have "asa22d asdcac3" or "Asdcd234 sacasW2 sas1 s sd1" (hopefully you get the picture). So far I have:
^[A-z 0-9]\s{0,1}
I am not using \w because it allows underscores. This works for one set of characters, but I need to allow five sets of the same sort of strings separated by a space.
How can I do that?
You haven't said what language you are using, but this should do it for you:
^[A-Za-z0-9]+(\s[A-Za-z0-9]+){0,4}$
A word, followed by up to four instances of space-then-word.
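A minimal Python sketch of that pattern follows. The language and the helper name is_valid are assumptions, since the question does not name a language. Note that \s accepts any single whitespace character (for example a tab), not only a space.

```python
import re

# One alphanumeric word, then up to four more, each preceded by
# exactly one whitespace character.
WORDS = re.compile(r"[A-Za-z0-9]+(\s[A-Za-z0-9]+){0,4}")

def is_valid(s: str) -> bool:
    # fullmatch stands in for the ^ and $ anchors of the answer.
    return WORDS.fullmatch(s) is not None

for sample in ["asa22d asdcac3", "Asdcd234 sacasW2 sas1 s sd1",
               "a b c d e f", "a  b", "a_b"]:
    print(repr(sample), is_valid(sample))
```

The two examples from the question pass; six words, a double space, and an underscore are all rejected.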
Tools You Need
To match multiple instances of a pattern in a regular expression, you can use any combination of match groups, backreferences, and interval expressions allowed by your regular expression engine.
Examples
Based on your sample code, your regular expression engine clearly supports intervals, so use that. Here are two examples that will accomplish the goal.
# Use POSIX character classes with an interval expression
^([[:alnum:]]+[[:space:]]?){1,5}$
# PCRE expression with intervals
\A(\p{Xan}+\s?){1,5}\Z

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have a text that consists of information enclosed by a certain pattern.
The only thing I know is the delimiting pattern: "${template.start}" and "${template.end}".
To keep it simple, I will substitute both ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there is more than one entry it fails, because the .* matches everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group, containing everything between the first and the last character of the text (both of which are "a"):
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to be able to extract each entry from the text and, from each entry, the information enclosed between the "a"s. By the way, I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variations of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use a non-greedy expression (where the engine supports the *? syntax; QRegExp offers setMinimal(true) instead), namely:
a(.*?)a
You need to match something like:
a[^a]*a
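The difference between the greedy, non-greedy and negated-class variants can be seen in a minimal Python sketch. Python stands in for QRegExp here, which is an assumption for illustration; a capturing group is added to the negated-class pattern so that the enclosed text is returned.

```python
import re

text = "aFOOOOOOaaASDADaaASDSDADa"

# Greedy: .* runs to the last "a", so everything lands in one group.
greedy = re.findall(r"a(.*)a", text)

# Non-greedy: .*? stops at the first "a" that closes the entry.
lazy = re.findall(r"a(.*?)a", text)

# Negated class: [^a]* can never cross a delimiter in the first place.
negated = re.findall(r"a([^a]*)a", text)

print(greedy)
print(lazy)
print(negated)
```

Both the non-greedy and the negated-class pattern return the three entries separately. The negated class only works while the delimiter is a single character; a multi-character delimiter such as ${template.start} needs the non-greedy form or a different approach.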
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all their power, flexibility and elegance, regular expressions are not sufficiently expressive to describe any but the simplest grammars. They are adequate for the problem asked here, but are not a suitable replacement for state machines or recursive descent parsers if the input language becomes more complicated.
So, choosing to use REs for parsing input streams is a decision that should be made with care and with an eye towards the future.

What are good regular expressions?

I have worked for 5 years mainly on Java desktop applications accessing Oracle databases, and I have never used regular expressions. Now I come to Stack Overflow and see a lot of questions about them; I feel like I missed something.
What do you use regular expressions for?
P.S. Sorry for my bad English.
Consider an example in Ruby:
puts "Matched!" unless /\d{3}-\d{4}/.match("555-1234").nil?
puts "Didn't match!" if /\d{3}-\d{4}/.match("Not phone number").nil?
The "/\d{3}-\d{4}/" is the regular expression, and as you can see it is a VERY concise way of finding a match in a string.
Furthermore, using groups you can extract information, as such:
match = /([^#]*)#(.*)/.match("myaddress#domain.com")
name = match[1]
domain = match[2]
Here, the parentheses in the regular expression mark capturing groups, so you can see exactly WHAT data you matched and do further processing with it.
This is just the tip of the iceberg... there are many, many different things you can do in a regular expression that make processing text REALLY easy.
Regular Expressions (or Regex) are used to pattern match in strings. You can thus pull out all email addresses from a piece of text because it follows a specific pattern.
In some cases regular expressions are enclosed in forward-slashes and after the second slash are placed options such as case-insensitivity. Here's a good one :)
/(bb|[^b]{2})/i
Spoken it can read "2 be or not 2 be".
The first part is the (parentheses), which are split by the pipe | character; the pipe acts as an "or", so (a|b) matches "a" or "b". The first half of the piped area matches "bb". I don't know the name of the second half's construct, but it's the square brackets; they match anything that is not "b", which is why there is a roof symbol thingie (technical term) in there. The squiggly brackets match a count of the thing before them, in this case two characters that are not "b".
After the second / is an "i" which makes it case insensitive. Use of the start and end slashes is environment specific, sometimes you do and sometimes you do not.
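The same expression can be tried in Python, where there are no enclosing slashes and the case-insensitivity option is passed as a flag instead. This is a minimal sketch; the candidate strings are made up for illustration.

```python
import re

# No /.../i literal in Python: the "i" option becomes re.IGNORECASE.
pattern = re.compile(r"(bb|[^b]{2})", re.IGNORECASE)

# Either exactly "bb" (in any case), or two characters that are not "b".
for candidate in ["bb", "BB", "xy", "ab"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

"ab" is rejected because it is neither "bb" nor two non-"b" characters.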
Two links that I think you will find handy for this are
regular-expressions.info
Wikipedia - Regular expression
Coolest regular expression ever:
/^1?$|^(11+?)\1+$/
It tests if a number is prime. And it works!!
N.B.: to make it work, a bit of set-up is needed; the number that we want to test has to be converted into a string of “1”s first, then we can apply the expression to test if the string does not contain a prime number of “1”s:
def is_prime(n)
  str = "1" * n
  return str !~ /^1?$|^(11+?)\1+$/
end
There’s a detailed and very approachable explanation over at Avinash Meetoo’s blog.
If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes from the very basic concepts all the way up to how different engines work underneath. The last 4 chapters are each dedicated to one of PHP, .NET, Perl, and Java. I learned a lot from it, and still use it as a reference.
If you're just starting out with regular expressions, I heartily recommend a tool like The Regex Coach:
http://www.weitz.de/regex-coach/
I've also heard good things about RegexBuddy:
http://www.regexbuddy.com/
As you may know, Oracle now has regular expressions: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html. I have used the new functionality in a few queries, but it hasn't been as useful as in other contexts. The reason, I believe, is that regular expressions are best suited for finding structured data buried within unstructured data.
For instance, I might use a regex to find Oracle messages that are stuffed in a log file. It isn't possible to know where the messages are, only what they look like, so a regex is the best solution to that problem. When you work with a relational database, the data is usually pre-structured, so a regex doesn't shine in that context.
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
A great resource for regular expressions: http://www.regular-expressions.info
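The wildcard comparison above can be tried directly. The minimal Python sketch below runs the *.txt wildcard and the .*\.txt$ regex over the same made-up file names; the names are hypothetical, and the standard-library fnmatch module stands in for a file manager's wildcard matching.

```python
import fnmatch
import re

# Hypothetical file names for illustration.
names = ["notes.txt", "report.txt.bak", "data.csv", "a.txt"]

# Shell-style wildcard (case-sensitive variant).
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]

# The regex equivalent: any characters, a literal dot, then "txt" at the end.
regex_hits = [n for n in names if re.search(r".*\.txt$", n)]

print(glob_hits)
print(regex_hits)
```

Both approaches select the same two names; the backslash before the dot matters, because an unescaped dot in a regex matches any character.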
These RE's are specific to Visual Studio and C++ but I've found them helpful at times:
Find all occurrences of "routineName" with non-default params passed:
routineName\(:a+\)
Conversely to find all occurrences of "routineName" with only defaults:
routineName\(\)
To find code enabled (or disabled) in a debug build:
\#if.*_DEBUG
Note that this will catch all the variants: ifdef, if defined, ifndef, if !defined
Validating strong passwords:
This one validates a password of 5 to 10 alphanumeric characters with at least one upper-case letter, one lower-case letter and one digit:
^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$
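Each lookahead in that expression checks one requirement without consuming any characters, and the final character class enforces the allowed characters and the length. Here is a minimal Python sketch; the helper name is_strong and the sample passwords are made up for illustration.

```python
import re

STRONG = re.compile(
    r"^(?=.*[A-Z])"       # at least one upper-case letter somewhere
    r"(?=.*[a-z])"        # at least one lower-case letter somewhere
    r"(?=.*[0-9])"        # at least one digit somewhere
    r"[a-zA-Z0-9]{5,10}$" # 5 to 10 alphanumeric characters in total
)

def is_strong(password: str) -> bool:
    return STRONG.match(password) is not None

for candidate in ["Abc12", "abc12", "ABCDE1", "Abcdef", "Ab1", "Abcdefgh123"]:
    print(candidate, is_strong(candidate))
```

Only Abc12 passes: the others are missing an upper-case letter, a lower-case letter or a digit, or fall outside the 5 to 10 character range.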