Check whether a sentence contains special characters or symbols using regex

I want to check whether a line in a text file contains a special character, using a regex in a shell script.
Assume the line is "assccÑasas".
How can I detect the 'Ñ' in that line so that it is reported as an error?
I also want to detect symbols such as '/', '^', '&', etc.
My code:
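VALID='^[a-zA-Z_0-9&.<>/|\-]+$' # my regex
checkError(){
    # the line is valid only if every character is in the allowed set
    if [[ $line =~ $VALID ]]; then
        echo "test"
    else
        echo "not okay"
        exit 1
    fi
}
# read the file given as the first argument, one line at a time
while IFS= read -r line
do
    checkError
done < "$1"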
For example, the line "StÀck Overflow;" should output an error, and the line "stack overflow;" should also output an error.
Only "stack overflow" (no symbols or special characters) should output "test".
So far it can check for symbols ('/', '\', etc.), but special characters are still a problem.
Any help is really appreciated.
Thank you in advance.

You could either
define which characters are valid and allow only those, or
define which characters are invalid and check whether one of them is present.
For the second approach, you could use this regex: [^a-zA-Z_0-9\s]. (In POSIX regex flavors such as the one bash uses for =~, \s is not recognized inside square brackets; write [[:space:]] there instead.)
The ^ inside the square brackets negates the character class, so the regex matches any string that contains a character that is not a letter (A-Z, a-z), a digit, an underscore, or whitespace.
Since you only need to detect a single such character, you don't need a quantifier.
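A minimal sketch of the second approach in a shell script. It uses grep instead of [[ =~ ]] so that it also runs under plain sh. The allowed set (ASCII letters, digits, underscore, whitespace) and the names has_invalid_char and check_file are assumptions for illustration. Running grep with LC_ALL=C is intended to keep a-z and A-Z as plain ASCII ranges, so the bytes of a UTF-8 encoded 'Ñ' fall outside them.

```shell
#!/bin/sh
# Sketch: report lines that contain a character outside an allowed set.
# The allowed set below is an assumption; adjust it to your needs.

has_invalid_char() {
    # $1 = one line of text.
    # Exit status 0 if the line contains a byte outside the allowed set.
    printf '%s\n' "$1" | LC_ALL=C grep -q '[^a-zA-Z0-9_[:space:]]'
}

check_file() {
    # $1 = path of the file to check; returns 1 at the first offending line.
    while IFS= read -r line; do
        if has_invalid_char "$line"; then
            echo "not okay: $line"
            return 1
        fi
    done < "$1"
    echo "okay"
}
```

Calling check_file "$1" at the end of the script would check the file passed as the first argument.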

Related

What's special about a "space" character in an "expr match" regexp?

In a bash shell, I set line like so:
line="total active bytes: 256"
Now, I just want to get the digits from that line so I do:
echo $(expr match "$line" '.*\([[:digit:]]*\)' )
and I don't get anything. But, if I add a space character before the first backslash in the regexp, then it works:
echo $(expr match "$line" '.* \([[:digit:]]*\)' )
Why?
The space isn't special at all. What's happening is that in the first case, the .* matches the entire string (i.e., it matches "greedily"), including the numbers, and since you've quantified the digits with * (as opposed to \+), that part of the regex is allowed to match 0 characters.
By putting a space before the digit match, the first part can only match up to but not including the last space in the string, leaving the digits to be matched by \([[:digit:]]*\).
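The difference can be reproduced with a small sketch. It uses the portable expr STRING : REGEX form rather than expr match, and the function names are hypothetical.

```shell
#!/bin/sh
line='total active bytes: 256'

digits_without_space() {
    # .* is greedy and takes the whole string, so the group matches zero digits.
    expr "$1" : '.*\([[:digit:]]*\)'
}

digits_after_space() {
    # The literal space stops .* at the last space, leaving the digits for the group.
    expr "$1" : '.* \([[:digit:]]*\)'
}

echo "without space: [$(digits_without_space "$line")]"
echo "after space:   [$(digits_after_space "$line")]"
```

The first call prints nothing between the brackets; the second prints 256.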

perl regex "\w" does not accept "$"

I have a regex that should accept only lines of exactly two words: a line with just one word is an error, and a line with three or more words is also rejected and reported with its line number (which is what I want).
#!/usr/bin/perl
use warnings;
use strict;
open( my $filehandle, "<", "/tmp/compare.cleartxt.tmpusers" ) || die "can't access the file";
while (<$filehandle>) {
    # report every line that is not exactly two words separated by one whitespace character
    if ($_ !~ /^\w+\s\w+$/) {
        print "LINE $., error on $_ ";
    }
}
the problem is that some of these words contain "$" signs.
like
LINE 700, error on ubs$iontest ubs$iontest
LINE 904, error on uho$jptest uho$jptest uho$jptest
LINE 1929, error on boa$jgb boa$jgb
LINE 2976, error on mitadel mitadel mitadel$001
LINE 3205, error on csfb csfb csfb$jpntest csfb$001 csfb$nytest
LINE 4762, error on mitsi$jgb2 mitsub$jgb2
LINE 6346, error on GOLDSTPTG GOLDSTPTG GOLDSTPTG
LINE 6660, error on jptest mizuho$jptest jptest
so I want to get rid of false positives like in line 700 or 1929, but keep errors like line 904.
I tried using this, but it reported a lot more errors; for example, it printed every word with an underscore in it, like "foo_bar".
if ($_ !~ /^[a-zA-Z$0-9]+\s[a-zA-Z$0-9]+$/)
\w doesn't match $ because $ isn't considered a word character.
It looks like what you want to match, in Perl terms, is either a word character or a $ character.
Try replacing \w by [\w\$]. (You need to escape the $ so it doesn't treat $] as a variable reference.)
If you want to match sequences of non-whitespace characters, \S will match any single non-whitespace character. That includes all word characters and $; it also includes other punctuation characters.
I just noticed something else you wrote in your question:
I tried using this, but it reported a lot more errors; for example, it printed
every word with an underscore in it, like "foo_bar".
Perl's definition of a "word character" is:
alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks
so the underscore _ will be treated as a word character. It sounds like you want to match letters and $, but not _. What about digits? Other punctuation? Accented and non-Latin letters?
Once you specify exactly what you want to match, it will be much easier to construct a regular expression that will do the job.
Try perldoc perlre for more information on Perl's regular expressions.
You can use:
/^[\w$]+\s[\w$]+$/
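For comparison, the same two-word check can be sketched with grep -E from a shell. This is an illustration under assumptions, not the Perl fix itself: [[:alnum:]_] stands in for \w (which matches more characters in Perl), and is_two_words is a hypothetical name. Inside a bracket expression, $ is an ordinary character.

```shell
#!/bin/sh
is_two_words() {
    # $1 = one line. Exit status 0 if the line is exactly two "words"
    # (letters, digits, underscore or $) separated by one whitespace character.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[[:alnum:]_$]+[[:space:]][[:alnum:]_$]+$'
}

# Sample lines taken from the question.
for sample in 'ubs$iontest ubs$iontest' 'uho$jptest uho$jptest uho$jptest'; do
    if is_two_words "$sample"; then
        echo "ok: $sample"
    else
        echo "error on $sample"
    fi
done
```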

Perl regexp tr// "I dont get why it does this?"

I did the following to my string $text
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
This split the string into lines, which is what I wanted,
but I don't get why it does this.
I thought that this line would take all chars a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()-,.?!:; and replace each of them with \n.
I don't understand what the cs at the end does either. I found this explanation of c and s, but I don't understand what it means:
"c - is used to specify that the SEARCHLIST character set is
complemented"
"s - is used to specify that the sequences of characters that were
transliterated to the same character are squashed down to a single
instance of the character"
Example:
$text= "a ar? å ..";
gives
a
ar?
å
..
c - is used to specify that the SEARCHLIST character set is complemented
In this usage, "complemented" is similar to "negated" or "reversed", so instead of replacing the characters listed in your expression every character not found in your expression is replaced. In your example string this means that all of the spaces are replaced with a newline because every other character is included in the set.
If you want to turn all spaces into newlines, listing out all the things which are not spaces is cumbersome and you're likely to forget some. You can instead work directly on the spaces with a regex.
s{\s+}{\n}g;
s{...}{...} is a "search and replace" using regular expressions rather than just characters. \s is regex speak for "whitespace" which includes spaces, tabs and newlines. + says to match 1 or more of them, so multiple spaces in a row will be turned into one newline. The g modifier says to do it "globally" or across every character in the string, otherwise it would stop at the first match.
foo bar baz
Becomes
foo
bar
baz
"c - is used to specify that the SEARCHLIST character set is complemented"
This means that it will replace anything not in the search list with \n. In your example, the only character not in the search list is a space. Therefore each space gets replaced with a newline. As Schwern pointed out, this is not a good way to do this.
"s - is used to specify that the sequences of characters that were transliterated to the same character are squashed down to a single instance of the character"
This means that if three characters in a row are translated (resulting in three \n in a row), the three \n will be "squashed" into a single \n. If you added some spaces to your example input, you could see this in action:
# Multiple spaces separating words
my $str = "a   ar?   å";
Without squashing:
$str =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/c;
Outputs:
a


ar?


å
With squashing:
$str =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
Outputs:
a
ar?
å
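The c and s modifiers correspond to the -c and -s options of the shell's tr utility, so the effect can also be tried from a shell. A minimal sketch, assuming an ASCII-only sample string and a shortened character list (many tr implementations work on bytes, so the accented letters are left out here); the function names are hypothetical.

```shell
#!/bin/sh
split_complement() {
    # -c: translate every character NOT in the list to a newline.
    printf '%s\n' "$1" | LC_ALL=C tr -c 'a-zA-Z?.' '\n'
}

split_squeezed() {
    # -cs: as above, then squeeze each run of newlines into a single one.
    printf '%s\n' "$1" | LC_ALL=C tr -cs 'a-zA-Z?.' '\n'
}

sample='a  ar?   b ..'
echo "complement only:"
split_complement "$sample"
echo "complement and squeeze:"
split_squeezed "$sample"
```

With the complement alone, every extra space leaves an empty line behind; with squeezing, each word ends up on its own line with no empty lines in between.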

confusion about basic rules of regex in Perl

I have a lot of trouble understanding the basic rules of regex and hope that someone could help explain them in "plain English".
$_ = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';
print "Enter a regular expression: ";
my $pattern = <STDIN>;
chomp($pattern);
if (/$pattern/) {
print "The text matches the pattern '$pattern'.\n";
print "\$1 is '$1'\n" if defined $1;
print "\$2 is '$2'\n" if defined $2;
print "\$3 is '$3'\n" if defined $3;
print "\$4 is '$4'\n" if defined $4;
print "\$5 is '$5'\n" if defined $5;
}
Three test outputs
Enter a regular expression: ([a-z]+)
The text matches the pattern '([a-z]+)'
$1 is 'silly'
Enter a regular expression: (\w+)
The text matches the pattern '(\w+)'
$1 is '1'
Enter a regular expression: ([a-z]+)(.*)([a-z]+)
The text matches the pattern '([a-z]+)(.*)([a-z]+)'
$1 is 'silly'
$2 is ' sentence (495,a) *BUT* one which will be usefu'
$3 is 'l'
My confusion is as follows
Doesn't ([a-z]+) mean "a lowercase letter, repeated one or more times"? If so, shouldn't "will" be picked up as well? Unless it has something to do with () being about memory (i.e. "silly" is a 5-letter word, so "will" will not be picked up, but "willx" will?)
Doesn't (\w+) mean "any word, repeated one or more times"? If so, why is the number "1" picked up, when there is no repeat but a colon ":" after it?
Does ([a-z]+)(.*)([a-z]+) mean "any lowercase letter, repeated", immediately followed by "anything, repeated 0 or more times", immediately followed by "any lowercase letter, repeated"? If so, why does the output look like the one shown above?
I tried to look up online as much as I could but still fail to understand them. Any help will be greatly appreciated. Thank you.
No, it means "one or more unaccented lowercase Latin letters".
Yes, "will" would also match, but the match op only returns the first match unless you use /g.
print "$1\n" while /([a-z]+)/g; # //g in scalar context
or
print "$_\n" for /([a-z]+)/g; # //g in list context
See m/PATTERN/ in perlop for details on how to use /g.
No, it means "one or more word chars", so it can indeed match a single character.
Or maybe you're surprised that 1 is a word char? In the ASCII range, the word chars are A-Z, a-z, 0-9 and _. Another 102,661 word chars are found outside of the ASCII range.
It means "one or more unaccented lowercase Latin letters, followed by any number of characters other than newline, followed by one or more unaccented lowercase Latin letters".
If you're asking why .* is matching so much, the engine will always match as much as possible at the current location. This is called greediness.
Maybe you're looking for /([a-z]+)([^a-z]+)([a-z]+)/.
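The two points above (only the first match is returned unless you ask for all of them, and .* is greedy) can also be observed from a shell. This is a sketch under assumptions: grep -o stands in for Perl's /g, sed -E stands in for the capture groups, both are run with LC_ALL=C so that [a-z] means ASCII lowercase, and the function names are hypothetical.

```shell
#!/bin/sh
text='1: A silly sentence (495,a) *BUT* one which will be useful. (3)'

all_lower_runs() {
    # Print every run of lowercase letters, one per line (like //g in Perl).
    printf '%s\n' "$1" | LC_ALL=C grep -oE '[a-z]+'
}

show_groups() {
    # $1 = text, $2 = an ERE with exactly three capture groups.
    # Prints the three groups of the first match, separated by |.
    printf '%s\n' "$1" | LC_ALL=C sed -nE 's/^[^a-z]*'"$2"'.*$/\1|\2|\3/p'
}

all_lower_runs "$text"
show_groups "$text" '([a-z]+)(.*)([a-z]+)'
show_groups "$text" '([a-z]+)([^a-z]+)([a-z]+)'
```

With the greedy (.*) the middle group runs up to "usefu" and the last group is only "l"; with ([^a-z]+) the groups are "silly", a single space, and "sentence".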
I'm really not sure why you would expect that. The regex looks at your sentence, finds the first lowercase letter, and keeps matching until it reaches a character that is not one (in your case, a space). The match is 'silly', as it should be, and matching stops at that point.
\w matches a "word character", which includes digits but no punctuation other than "_". ":" is not a word character, so you get "1" and nothing else.
This is because (.*) is "greedy" (and generally you shouldn't use it). You're telling Perl to match anything and everything to the end of the line. It then backtracks just far enough to give you a match for your last group, which is the last lowercase letter in your string.
EDIT: as @ikegami pointed out, \w actually matches quite a bit more than I was thinking.

Only allow some characters with grep?

I would like to check a string, so it only contains the characters 0-9 a-z -.
When I do
regex='[-a-z0-9]*'
string='abcd!'
if [[ $string =~ $regex ]]
then
echo "valid"
else
echo "not valid"
fi
it outputs valid, where I would have expected not valid because $string contains a !.
Try this: regex='^[-a-z0-9]*$'. It forces the complete line to match this class. Without the anchors, a match on any part of the string, or even an empty match (because of *), returns valid. ^...$ says that nothing outside the class may appear between the start and the end of the string.
You will have to add boundaries for this regex to work.
'[-a-z0-9]*' says: match these characters 0 or more times anywhere in the string.
So adding start and end of line characters to the regex will do what you are looking for:
regex='^[-a-z0-9]*$'
The next step is to limit the '-' to at most one occurrence. Can the dash character occur at the start or at the end of the string? If it can, try:
regex='^[a-z0-9]*-?[a-z0-9]*$'
Hope this helps.
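Since the question mentions grep, here is a minimal sketch of the same checks using grep -E, assuming the string is a single line. The function names are hypothetical.

```shell
#!/bin/sh
matches_somewhere() {
    # Unanchored: succeeds for any input, because * also allows an empty match.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '[-a-z0-9]*'
}

is_valid() {
    # Anchored: the whole line must consist of a-z, 0-9 and -.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[-a-z0-9]*$'
}

is_valid_one_dash() {
    # Anchored, with at most one dash anywhere in the string.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]*-?[a-z0-9]*$'
}

string='abcd!'
if is_valid "$string"; then
    echo "valid"
else
    echo "not valid"
fi
```

Note that the anchored patterns still accept the empty string; replace * with + if at least one character is required.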