regular expression to match english words with some other characters - regex

I use this regular expression: ^[a-zA-Z0-9]*$ to match English phrases, however, I want this expression to match English phrases that may contain some or all of these characters at the beginning, between or at the end of them:
? > < ; , { } [ ] - _ + = ! # # $ % ^ & * | ' and also the space character.
How can I update this regular expression to satisfy this requirement?
Thank you so much in advance.

You could simply add all your desired characters to your character class.
^[a-zA-Z0-9?><;,{}[\]\-_+=!##$%\^&*|']*$
You will need to escape the following characters with a backslash, since they are considered as metacharacters inside character classes: ], -, ^.
Note that your regex will also match empty strings, since it uses the * quantifier. If you only want to match words having at least one character, replace it with the + quantifier.
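To make the difference between the * and + quantifiers concrete, here is a minimal Python sketch of the same character class. The names ALLOWED, ANY_LENGTH, NON_EMPTY and is_allowed are made up for illustration, and a literal space has been added to the class as an assumption, because the question asks for the space character as well.

```python
import re

# Character class body from the answer above, plus a trailing literal space
# (assumption: the space is added here because the question asks for it).
ALLOWED = r"a-zA-Z0-9?><;,{}\[\]\-_+=!#$%\^&*|' "

ANY_LENGTH = re.compile("^[" + ALLOWED + "]*$")  # '*' also accepts ""
NON_EMPTY = re.compile("^[" + ALLOWED + "]+$")   # '+' needs one char or more


def is_allowed(text, allow_empty=True):
    """Return True if text consists only of the allowed characters."""
    pattern = ANY_LENGTH if allow_empty else NON_EMPTY
    return pattern.match(text) is not None
```

With this sketch, is_allowed("") is accepted by default and rejected when allow_empty=False, while a string containing a character outside the list (for example a full stop) is rejected either way.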

You are looking for this pattern.
^[\s\w\d\?><;,\{\}\[\]\-_\+=!#\#\$%^&\*\|\']*$

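I gratefully reuse the \s\w\d classes from the previous answer and add the other delimiters and special characters as hexadecimal ASCII ranges (you can use Unicode ranges as well). Note that these ranges cover every ASCII punctuation character, which is more than the list in the question: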
^[\s\w\d\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e]*$
You can refer to a table of ASCII codes and of Unicode characters for the values.
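To see exactly which characters the four hexadecimal ranges admit, the following Python sketch enumerates the ASCII table against them. It uses Python's re syntax rather than the asker's engine (an assumption, since the question does not name one), and PUNCT_RANGES and matched are illustrative names.

```python
import re
import string

# The four hexadecimal ranges from the answer above, on their own.
PUNCT_RANGES = re.compile(r"[\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e]")

# Collect every ASCII character that the ranges accept, in code-point order.
matched = "".join(
    chr(code) for code in range(128) if PUNCT_RANGES.fullmatch(chr(code))
)

# The ranges turn out to be the full ASCII punctuation set.
same_as_punctuation = matched == string.punctuation
```

The ranges therefore also accept characters such as . / : @ and ~ that are not in the question's list, so this form is broader than the explicit character class.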

Related

Regular expression to match any word followed by a literal string

So I have the following:
^[a-zA-Z]+\b(myword+-)\b*
which I thought would match
^ start of string
[a-zA-Z] any alpha character
+ of one or more characters
\b followed by a word break
(myword+-) followed by myword which could include one or more special characters
\b followed by a word break
\* followed by anything at all
One: it does not work - it does not match anything
Two: any special characters included in (myword+-) throw an error
I could escape the special characters, but I don't know in advance what they might be, so I would have to escape all the possibilities, or perhaps I could just escape every character in (\m\y\w\o\r\d\\+\\-)
Edited to add:
Sorry, I knew I should have given more information
I have a series of strings to search through in the form:
extra android-sdk and more that is of no interest
extra android-ndk and more that is of no interest
extra anjuta-extra and more that is of no interest
community c++-gtk-utils and more that is of no interest
and I have a list of items to search for in the strings:
android-sdk
android-ndk
extra
c++-gtk-utils
The item should only match if the second word in the string is an exact match to the item, so:
android-sdk will match the first string
android-ndk will match the second string
extra will NOT match the third string
c++-gtk-utils will match the fourth string
So (myword+-) is the item I am searching for "which could include one or more special characters"
Thanks for the help
Andrew
OK, with the help from above I worked it out.
This regex does exactly what I wanted, bear in mind that I am working in tcl (note the spaces to delimit the search word):
^[a-zA-Z]+\y extra \y *
where the search word is "extra".
It is necessary to escape any characters in the search string which may be interpreted by regex as quantifiers etc., e.g. +
So this will also work:
^[a-zA-Z]+\y dbus-c\+\+ \y *
Andrew
Strong recommendation: if you want to match literal strings, don't use regular expressions.
If we have this sample data:
set strings {
{extra android-sdk and more that is of no interest}
{extra android-ndk and more that is of no interest}
{extra anjuta-extra and more that is of no interest}
{community c++-gtk-utils and more that is of no interest}
}
set search_strings {
android-sdk
android-ndk
extra
c++-gtk-utils
}
Then, to find matches in the 2nd word of each string, we'll just use the eq string equality operator
foreach string $strings {
foreach search $search_strings {
if {[lindex [split $string] 1] eq $search} {
puts "$search matches $string"
}
}
}
outputs
android-sdk matches extra android-sdk and more that is of no interest
android-ndk matches extra android-ndk and more that is of no interest
c++-gtk-utils matches community c++-gtk-utils and more that is of no interest
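For comparison, the same "compare the second word with plain equality" idea can be sketched in Python. This is only an illustration of the approach, not a translation the original poster asked for; the function name second_word_matches is hypothetical.

```python
lines = [
    "extra android-sdk and more that is of no interest",
    "extra android-ndk and more that is of no interest",
    "extra anjuta-extra and more that is of no interest",
    "community c++-gtk-utils and more that is of no interest",
]
items = ["android-sdk", "android-ndk", "extra", "c++-gtk-utils"]


def second_word_matches(lines, items):
    """Return (item, line) pairs where the line's second word equals item."""
    hits = []
    for line in lines:
        words = line.split()
        for item in items:
            # Plain string equality: no regex, so nothing needs escaping.
            if len(words) > 1 and words[1] == item:
                hits.append((item, line))
    return hits


hits = second_word_matches(lines, items)
```

Because no pattern is compiled, characters such as + and - in c++-gtk-utils need no special treatment.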
If you insist on regular expression matching, you can escape any special characters to take away their usual regex meaning. Here, we'll take the brute force approach: any non-word chars will get escaped, so that the pattern may look like ^\S+\s+c\+\+\-gtk\-utils
foreach string $strings {
foreach search $search_strings {
set pattern "^\\S+\\s+[regsub -all {\W} $search {\\&}]"
if {[regexp $pattern $string]} {
puts "$search matches $string"
}
}
}
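The escaping step can be sketched in Python as well, where re.escape plays the role of the regsub call. Two things here are assumptions rather than part of the answer: the helper name second_word_pattern, and the trailing (?!\S) lookahead, which is added so that the item must be the whole second word and not just its beginning.

```python
import re


def second_word_pattern(item):
    """Build a regex matching lines whose second word is exactly item."""
    # re.escape backslash-escapes regex metacharacters such as '+'.
    # (?!\S) is an addition: it rejects a longer word that merely starts
    # with item (e.g. item "android" against "android-sdk").
    return re.compile(r"^\S+\s+" + re.escape(item) + r"(?!\S)")


cpp_hit = second_word_pattern("c++-gtk-utils").match(
    "community c++-gtk-utils and more that is of no interest"
)
extra_hit = second_word_pattern("extra").match(
    "extra anjuta-extra and more that is of no interest"
)
prefix_hit = second_word_pattern("android").match(
    "extra android-sdk and more that is of no interest"
)
```

Here cpp_hit is a match object, while extra_hit and prefix_hit are both None.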
I was hoping to be able to make a portion of a regular expression to be a literal string, like
set pattern "^\\S+\\s+(***=$string)"
set pattern "^\\S+\\s+((?q)$string)"
but both failed.
Tcl regular expressions are documented at
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm
Also note your pattern ^[a-zA-Z]+\b(myword+-)\b* does not provide for any whitespace between the first and second words.
Disclaimer: Since your question lacks information about what input and output are expected, I will only try to explain why your regex isn't working at all. Since this is not a full answer, you might not want to mark it as accepted; someone may be able to give you a working solution once you provide the necessary information.
Notes:
quantifier characters (*, +, ? etc.) are applied to literal character or character class (a.k.a character group, namely characters/ranges inside [ ]) - when in your regex you write (myword+-) the only thing the + sign is applied to is letter 'd', nothing else.
what is myword in your regex? If you want a set of characters use [ ] combined with character ranges and/or character tokens such as \w (word characters: letters, digits and the underscore) or \d (digit characters)
you also seem to misunderstand and misuse groups ("( )"), character classes ("[ ]") and quantifier notation ("{ }")
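The first note, that a quantifier applies only to the item directly before it, can be checked with a short Python sketch. Python's re module is used here as a stand-in for the Tcl engine (an assumption; both treat + this way), and the variable names are illustrative.

```python
import re

# '+' binds only to the preceding atom, here the single letter 'd'.
letter_repeated = re.fullmatch(r"myword+", "myworddd") is not None
word_repeated = re.fullmatch(r"myword+", "mywordmyword") is not None

# Grouping makes the quantifier apply to the whole word instead.
group_repeated = re.fullmatch(r"(myword)+", "mywordmyword") is not None
```

letter_repeated and group_repeated are True, while word_repeated is False, because without the group only the final d may repeat.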

Which characters must be escaped in a Perl regex pattern

I'm trying to find files that look like this:
access_log-20160101
access_log-20160304
...
With a Perl regex I came up with something like this:
/^access_log-\d{8}$/
But I'm not sure about the "_" and the "-". Are these metacharacters?
What is the expression for this?
I read that "_" in a regex is covered by something like \w, but how do I use that in my expression?
/^access\wlog-\d{8}$/ ?
Underscore (_) is not a metacharacter and does not need to be quoted (though it won't change anything if you quote it).
Hyphen (-) IS a metacharacter that defines the range between two symbols inside a bracketed character class. However, in this particular position, it will be interpreted verbatim and doesn't need quoting since it is not inside [] with a symbol on both sides.
You can use your regexp as is; hyphens (-) might need quoting if your format changes in future.
Your regex pattern is exactly right
Neither underscore _ nor hyphen - need to be escaped. Outside a square-bracketed character class, the twelve Perl regex metacharacters are
Brackets ( ) [ {
Quantifiers * + ?
Anchors ^ $
Alternator |
Wild character .
The escape itself \
and only these must be escaped
If the pattern of your file names doesn't vary from what you have shown then the pattern that you are using
^access_log-\d{8}$
is correct, unless you need to validate the date string
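If the date string does need validating, one possible approach is to capture the eight digits and hand them to a date parser. The sketch below uses Python rather than Perl, and the names NAME and is_valid_log_name are hypothetical; it only illustrates the idea of "match first, then check the date".

```python
import re
from datetime import datetime

# Same pattern as above, with the eight digits captured for a second check.
NAME = re.compile(r"^access_log-(\d{8})$")


def is_valid_log_name(name):
    """True if name is access_log-YYYYMMDD with a real calendar date."""
    match = NAME.match(name)
    if match is None:
        return False
    try:
        # strptime raises ValueError for impossible dates such as month 13.
        datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False
    return True
```

With this sketch, access_log-20160101 is accepted, while access_log-20161340 passes the regex but fails the date check.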
Within a character class like [A-F] you must escape the hyphen if you want it to be interpreted literally. As it stands, that class is the equivalent to [ABCDEF]. If you mean just the three characters A, - or F then [A\-F] will do what you want, but it is usual to put the hyphen at the start or end of the class list to make it unambiguous. [-AF] and [AF-] are the same as [A\-F] and rather more readable
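The effect of the hyphen's position inside a class can be seen in a small Python sketch. Python's re module follows the same convention as Perl for these four classes; the sample string and variable names are made up for illustration.

```python
import re

sample = "ABC-F"

# Between two characters the hyphen forms a range: A, B, C, D, E or F.
range_hits = re.findall(r"[A-F]", sample)

# Escaped, leading or trailing, the hyphen is a literal character.
escaped_hits = re.findall(r"[A\-F]", sample)
leading_hits = re.findall(r"[-AF]", sample)
trailing_hits = re.findall(r"[AF-]", sample)
```

range_hits picks up A, B, C and F but not the hyphen, whereas the other three classes all pick up A, the hyphen and F.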

R Regular expression for string containing full stops

I have a bunch of strings, some of which end with ..t. (dot, dot, t, dot). I am trying to find a regular expression to match these strings but dealing with the full stops is giving me a headache!
I have tried
grep('^.+(..t.)$', myStrings)
but this also matches strings such as w...gate. I think I am dealing with the full stops incorrectly. Any help at all appreciated.
Note: I am using grep within R.
Since you are only checking if the end of the string ends with ..t., you can eliminate ^.+ in your pattern.
The dot . in regular expression syntax is a special character that matches any character except a newline sequence. To match a literal dot, or any other character with a special meaning, you need to escape it; in an R string that means writing a double backslash, \\.
> x <- c('foo..t.', 'w...gate', 'bar..t.foo', 'bar..t.')
> grep('\\.{2}t\\.$', x)
# [1] 1 4
Or place that character inside of a character class.
> x <- c('foo..t.', 'w...gate', 'bar..t.foo', 'bar..t.')
> grep('[.]{2}t[.]$', x)
# [1] 1 4
Note: I used the {2} quantifier, as in \\.{2}, to match two dots instead of writing the escaped dot twice, \\.\\.
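The same comparison can be reproduced outside R. The Python sketch below runs the unescaped pattern and both corrected patterns over the sample vector; note that Python indices are 0-based, unlike R's, and that matching_indices is a hypothetical helper.

```python
import re

x = ["foo..t.", "w...gate", "bar..t.foo", "bar..t."]


def matching_indices(pattern, strings):
    """Return the 0-based indices of the strings the pattern is found in."""
    return [i for i, s in enumerate(strings) if re.search(pattern, s)]


# Unescaped dots match any character, so "gate" satisfies "..t.".
unescaped = matching_indices(r"..t.$", x)

# Escaping the dots, or putting them in a class, makes them literal.
escaped = matching_indices(r"\.{2}t\.$", x)
in_class = matching_indices(r"[.]{2}t[.]$", x)
```

unescaped also contains the index of w...gate, while escaped and in_class contain only the first and last strings.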
k, a little bit of better googling provided the answer;
grep("^.+(\\.\\.t\\.)$", myStrings)
this works because we need to escape the point as \\. in R.
The dot (.) matches any single character. To remove the special meaning of the dot you should put a double backslash (\\) before it.
Try this instead:
grep('^.+(\\.\\.t\\.)$', myStrings)
Satheesh Appu

Regex Check Whether a string contains characters other than specified

How to check whether a string contains character other than:
Alphabets (Lower-Case/Upper-Case)
digits
Space
Comma(,)
Period (.)
Bracket ( )
&
'
$
+(plus) minus(-) (*) (=) arithmetic operator
/
using regular expression in ColdFusion?
I want to make sure a string doesn't contain even single character other than the specified.
You can find if there are any invalid characters like this:
<cfif refind( "[^a-zA-Z0-9 ,.&'$()\-+*=/]" , Input ) >
<!--- invalid character found --->
</cfif>
Where [...] is a character class (match any single character from within), and the ^ at the start means "NOT". If refind finds any character that is not in the accepted set, it returns that character's position (a non-zero value, which cfif treats as true); otherwise it returns 0.
I don't understand "Small Bracket(opening closing)", but guess you mean < and > there? If you want () or {} just swap them over. For [] you need to escape them as \[\]
Character Class Escaping
Inside a character class, only a handful of characters need escaping with a backslash, these are:
\ - if you want a literal backslash, escape it.
^ - a caret must be escaped if it's the first character, otherwise it negates the class.
- - a dash creates a range. It must be escaped unless first/last (but recommended always to be)
[ and ] - both brackets should be escaped.
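As an illustration of the negated-class check with the escaping rules above applied, here is a minimal Python sketch. It is not ColdFusion code; INVALID_CHAR and first_invalid_char are hypothetical names, and the class is the one from the refind example.

```python
import re

# Negated class: matches any single character that is NOT in the allowed set.
# The hyphen is escaped so that it is not read as a range operator.
INVALID_CHAR = re.compile(r"[^a-zA-Z0-9 ,.&'$()\-+*=/]")


def first_invalid_char(text):
    """Return the first disallowed character in text, or None if all are OK."""
    match = INVALID_CHAR.search(text)
    return match.group(0) if match else None
```

A string is acceptable exactly when first_invalid_char returns None, which mirrors the refind test returning 0.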
ColdFusion uses Java's engine to parse regular expressions, anyway to make sure a string doesn't contain one of the characters you mentioned then try:
^(?![a-zA-Z0-9 ,.&$']*[^a-zA-Z0-9 ,.&$']).*$
The above expression would only work if you are parsing the file line by line. If you want to apply this to text which contains multiple lines then you need to use the global modifier and the multi-line modifier and change the expression a bit like this:
^(?![a-zA-Z0-9 ,.&$']*[^a-zA-Z0-9 ,.&$'\r\n]).*$
The regular expression:
[^][a-zA-Z0-9 ,.&'$]
will match if the string contains any characters other than the ones in your list.

gvim search match multiple characters using regex

I am trying to search in gvim for the following pattern:
arrayA[*].entryx
hoping it would match the following:
arrayA[size].entryx
arrayA[i].entryx
arrayA[index].entryx
but it prints message saying Pattern not found even though the above lines are present in the file.
arrayA[.].entryx
only matches arrayA[i].entryx
i.e. with only one character between [] braces.
What should I do to match multiple characters between [] braces?
Here is the PCRE expression detail
/arrayA\[[^]]*]\.entryx/
         ^^^^^            # 0 or more characters before a ']'
       ^^      ^^         # Escaped '[' & '.'
              ^           # Closing ']' -- does not need to be escaped
 ^^^^^^          ^^^^^^   # Literal parts
If you want to look for arrayA[X].entryx where there is at least one character in the [],
You need to replace \[[^]]* with \[[^]]\+
ps: Note my edit -- I've changed the \* to just * -- you don't escape that either.
But, you need to escape the + :-)
Update on your comment:
While my comment answers your question on escaping ] broadly,
for more detail look at Perl Character Class details.
Specifically, the Special Characters Inside a Bracketed Character Class section.
The rules for what needs to be escaped change after a [ character starts a Character Class (CCL).
The * repeats the previous character; and [ starts a character class. So, you need something more like:
/arrayA\[[^]]*]\.entryx/
That looks for a literal [, a series of zero or more characters other than ], a literal ], a literal . and the entryx.
Always remember that in Vim you need to escape some special characters, such as [ and ., to match them literally (in Vim's default mode, { and } are literal unless written as \{ and \}). As said before, the * repeats the previous character, so you could simply use /arrayA\[.*\]\.entryx. However, * is greedy and may match more than you expect; add the following line to your file and you'll see: arrayA[size].entryx = arrayB[].entryx
A "safer" Regular Expression would be:
/arrayA\[.\{-\}\]\.entryx
The .\{-\} matches any character in a non-greedy way (as few characters as possible), which is safer in some cases.
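The difference between the greedy, non-greedy and negated-class forms can be demonstrated on the sample line from the answer. The sketch below uses Python's re syntax, which differs from Vim's: Python writes the non-greedy quantifier as *? where Vim writes \{-\}. The variable names are illustrative.

```python
import re

line = "arrayA[size].entryx = arrayB[].entryx"

# Greedy: '.*' runs to the last "].entryx" on the line.
greedy = re.search(r"arrayA\[.*\]\.entryx", line).group(0)

# Non-greedy: '.*?' stops at the first "].entryx" (Vim: .\{-\}).
lazy = re.search(r"arrayA\[.*?\]\.entryx", line).group(0)

# Negated class: '[^]]*' can never run past a closing bracket.
negated = re.search(r"arrayA\[[^]]*]\.entryx", line).group(0)
```

greedy spans the whole line, while lazy and negated both stop after the first arrayA[size].entryx.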