Shell script variable assignment with two values (regular expression)

I'm trying to set a variable with two values. Here is an example:
letter='[[:alpha:]]'
digit='[[:digit:]]'
integer='$digit'
float='$digit.$digit'
The integer variable must match a digit one or more times. The float variable should allow the first field (before the dot) zero or more times. How can I do this?
Thanks for help!
-- UPDATE --
It's very good to have the support of all of you. Below is the solution that worked for me:
letter='[[:alpha:]]'
digit='[[:digit:]]'
integer="${digit}+"
float="[0-9]*\\.[0-9]+"
Thank you guys! :D

I haven't checked which flavor of regex the expr command uses (I assume that is what you are using), so you may need something like [a-zA-Z] instead of [[:alpha:]], and similar substitutions. Note that expr is an external utility rather than part of bash, and it takes basic regular expressions, so capturing parentheses are written \( and \) and a literal dot is written \. there. Assuming you have chosen the right values for letter and digit, this should work:
expr match "$string" "\(${digit}*\.${digit}*\)"
or, using your float variable:
float="\(${digit}*\.${digit}*\)"
expr match "$string" "$float"
Remove the parentheses if you want the number of characters matched rather than the matched text itself.
Any of the following would be equivalent regexes for the integer (written in extended syntax, as accepted by grep -E; with GNU expr, the parentheses, braces, and + each need a backslash):
integer="(${digit}+)"
integer="(${digit}{1,})"
integer="(${digit}${digit}*)"
Do be aware that there are different "flavors" of regex and in different contexts things need to be escaped where in another context they don't need it.
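For comparison, here is a minimal Python sketch of the same idea: build the larger patterns from a small fragment, then check a few strings. It assumes Python's re module, which has no [[:digit:]] class, so [0-9] stands in for it. The names DIGIT, INTEGER, FLOAT, is_integer and is_float are only illustrative and do not come from the question.

```python
import re

# Python's re module has no POSIX [[:digit:]] class, so [0-9] stands in for it.
DIGIT = r"[0-9]"
INTEGER = DIGIT + r"+"                  # one or more digits
FLOAT = DIGIT + r"*\." + DIGIT + r"+"   # optional digits, a literal dot, one or more digits


def is_integer(text):
    """Return True if the whole string is one or more digits."""
    return re.fullmatch(INTEGER, text) is not None


def is_float(text):
    """Return True if the whole string looks like 12.5 or .5."""
    return re.fullmatch(FLOAT, text) is not None


if __name__ == "__main__":
    for sample in ["42", "3.14", ".5", "3.", "abc"]:
        print(sample, is_integer(sample), is_float(sample))
```

Using fullmatch plays the role of anchoring the pattern with ^ and $, so partial matches inside a longer string are rejected.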

for egrep and grep -E on the bash command line:
float: [0-9]*\\.[0-9]+
integer: [0-9]+
See the chart of egrep regexes at http://www.cyberciti.biz/faq/grep-regular-expressions/ for some hints, but test against your specific situation.
for perl and java:
float: [0-9]*?\.[0-9]+?
integer: [0-9]+?
+ matches preceding char or char class >= 1 times
* matches preceding char or char class >= 0 times
. matches any char
\. matches an uninterpreted period
[0-9] matches the class of any digit
? forces reluctant (non-greedy) matching
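To see what the reluctant ? changes in practice, here is a small sketch in Python, whose re module follows the Perl-style syntax listed above. The sample text "12.345" is made up for illustration. When the pattern is not anchored, the reluctant version stops after the first digit that follows the dot.

```python
import re

GREEDY = r"[0-9]*\.[0-9]+"
RELUCTANT = r"[0-9]*?\.[0-9]+?"

text = "12.345"

# Unanchored search: greedy takes every digit, reluctant takes as few as it can.
greedy_match = re.search(GREEDY, text).group()
reluctant_match = re.search(RELUCTANT, text).group()

print(greedy_match)     # 12.345
print(reluctant_match)  # 12.3

# When the whole string must match, the reluctant quantifiers are forced to expand.
whole = re.fullmatch(RELUCTANT, text).group()
print(whole)            # 12.345
```

So for validating a complete number the two forms behave the same, while for extracting a number from surrounding text the greedy form is usually the one you want.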


How to use sed/perl to find only 2d arrays and replace text?

Currently I have tons of code that looks like this:
static double testVar1 [2][8] = {0.0} ; /* This is for testing var 1 */
static double var_test2 [3][2] = {0.0} ; /* This is for testing var 2 */
static double var_test3 [4] = {0.0} ; /* This is for testing var 3 */
2D arrays in C++ are initialized with double curly brackets, so I need to find only the 2D arrays and change them like this:
static double testVar1 [2][8] = {{0.0}} ; /* This is for testing var 1 */
static double var_test2 [3][2] = {{0.0}} ; /* This is for testing var 2 */
static double var_test3 [4] = {0.0} ; /* This is for testing var 3 */
I have been trying with sed to use groupings, but I can't figure out how to escape the brackets, some posts suggest not escaping at all. I have also tried without extended regular expressions.
Just now, I found out that only 9 groupings are possible in sed, so now I'm completely stuck. Any suggestions?
sed -i -r 's/(.*)(\[)([0-9]+)(\])(\[)([0-9]+)(\])(.*)(\{)(0.0)(\})(.*)/echo "\1\2\3"/ge'
Use a perl script with the following regex:
\w+\s*(?:\[\d+\]){2}\s*=\s*\K\{([\d.]*)\}
And replace this with \{\{\1\}\}, see a demo on regex101.com.
Broken down, this says:
\w+ # at least one word character
\s* # Zero or more spaces
(?:\[\d+\]){2} # [0][1] or any other two-dimensional array
\s*=\s* # spaces = and spaces
\K # "forget" everything matched so far
\{([\d.]*)\} # match and capture {number}
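The same substitution can be sketched in Python. Python's re module does not support \K, so this version captures the part before the initializer and puts it back in the replacement. The names TWO_D_INIT and double_braces are illustrative only.

```python
import re

# Group 1: two bracketed indices plus the "=" sign; group 2: the brace initializer.
TWO_D_INIT = re.compile(r"((?:\[\d+\]){2}\s*=\s*)(\{[\d.]*\})")


def double_braces(line):
    """Wrap the initializer of a 2D array declaration in an extra pair of braces."""
    return TWO_D_INIT.sub(r"\1{\2}", line)


if __name__ == "__main__":
    lines = [
        "static double testVar1 [2][8] = {0.0} ; /* This is for testing var 1 */",
        "static double var_test3 [4] = {0.0} ; /* This is for testing var 3 */",
    ]
    for line in lines:
        print(double_braces(line))
```

Lines with a single index, such as the var_test3 declaration, do not match and are returned unchanged.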
A Perl one-liner, cautious about literals such as 2u and 1e-06l (etc)
perl -pe's/(?:\[ [^]]+ \]){2} \s*=\s* \K (\{ [^}]+ \})/{$1}/x' in > out
The (?:...) construct groups without capturing, and (?:\[[^]]+\]){2} matches [n][m]. \K behaves like a variable-length positive lookbehind: it drops everything matched so far from the reported match, so we don't have to put it back.
With an integer inside [] being just digits and with a float in {} being n.m this simplifies
perl -pe's/(?:\[\d+\]){2}\s*=\s*\K( \{[\d.]+\} )/{$1}/x' in > out
Note that [\d.] allows for all kinds of wrong things, like .2..3, but that is a different issue.
However, watch out for use of literals for numbers such as 2u (with the suffix) which are fine as indices as well, along with vec[1.2e+01] or even vec[1.2]. The varied notation for float/double literals is also more likely to show up in data. Altogether I'd go with a more rounded pattern like
perl -pe's/(?:\[ [\w+.-]+ \]){2}\s*=\s*\K(\{ [\w+.-]+ \})/{$1}/x' in > out
Keep in mind that this allows various wrong formats and so it doesn't check data well.
Here is a sed attempt with the regex wrinkles ironed out.
sed -i -r 's/(.*\[[0-9]+\]\[[0-9]+\].*)(\{0\.0\})(.*)/\1{\2}\3/' file
You had significant amounts of unmotivated additional grouping parentheses so \1\2\3 would only refer to the very beginning of the match. I simply took them out. Remember, the captures are ordered from left to right, so the first left parenthesis creates group \1, the second captures into \2, etc.
The GNU sed extension /e allows you to invoke the shell on the replacement string but in this case this added no value and introduced significant additional possible errors, so taking it out was a no-brainer. The /g option would make sense if you expected multiple matches per line, but your example shows no examples of input lines with multiple matches, and the entire script would need to be rather more complex in order to support that, so I took that out as well.
Depending on the language you are attempting to process and the regularity of the files, you might want to permit whitespace between the closing and opening square brackets, or not. The "anything" wildcard between the closing square bracket and the opening curly bracket also looks somewhat prone to false positives (matching where you don't want it to), so maybe change it to permit only whitespace and an equals sign, like [ =]* instead of .*
Another approach with sed:
sed -i -r 's/((\[[0-9]\]){2} *= )(\{[^}]*\})/\1{\3}/' file
and the same in BRE mode:
sed -i 's/\(\(\[[0-9]\]\)\{2\} *= \)\({[^}]*}\)/\1{\3}/' file
Or simply double every brace on lines that contain ][:
sed -i '/]\[/s/[{}]/&&/g' file

Trim end of string

I'm having trouble trimming off some characters at the end of a string. The string usually looks like:
C:\blah1\blah2
But sometimes it looks like:
C:\blah1\blah2.extra
I need to extract out the string 'blah2'. Most of the time, that's easy with a substring command. But on the rare occasions when the '.extra' portion is present, I need to first trim that part off.
The thing is, '.extra' always begins with a dot, but then is followed by various combinations of letters with various lengths. So wildcards will be necessary. Essentially, I need to script, "If the string contains a dot, trim off the dot and anything following it."
$string.replace(".*","") doesn't work. Nor does $string.replace(".\*",""). Nor does $string.replace(".[A-Z]","").
Also, I can't get at it from the beginning of the string either. 'blah1' is unknown and of various lengths. I have to get at 'blah2' from the end of the string.
Assuming that the string is always a path to a file with or without an extension (such as ".extra"), you can use Path.GetFileNameWithoutExtension():
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2")
blah2
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2.extra")
blah2
The path doesn't even have to be rooted:
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("blah1\blah2.extra")
blah2
If you want to implement similar functionality on your own, that should be fairly simple as well: use String.LastIndexOf() to find the last \ in the string and use that as your starting argument for Substring():
function Extract-Name {
    param($NameString)
    # Extract part after the last occurrence of \
    if($NameString -like '*\*') {
        $NameString = $NameString.Substring($NameString.LastIndexOf('\') + 1)
    }
    # Remove anything after a potential . (assign the result, otherwise both
    # the trimmed and the untrimmed string are written to the output)
    if($NameString -like '*.*') {
        $NameString = $NameString.Remove($NameString.IndexOf("."))
    }
    $NameString
}
And you'll see similar results:
PS C:\> Extract-Name "C:\blah1\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2"
blah2
PS C:\> Extract-Name "abc124323\blah2"
blah2
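The same two steps (keep what follows the last backslash, then cut at the first dot) can be sketched in Python for comparison. The function name extract_name is just a port of the name used above. It works on plain strings and does not touch the file system.

```python
def extract_name(name_string):
    """Return the last path component with everything from the first dot removed."""
    # Keep only the part after the last backslash, if there is one.
    if "\\" in name_string:
        name_string = name_string[name_string.rfind("\\") + 1:]
    # Drop the first dot and everything after it, if there is one.
    if "." in name_string:
        name_string = name_string[:name_string.find(".")]
    return name_string


if __name__ == "__main__":
    for sample in [r"C:\blah1\blah2.extra", r"C:\blah124323\blah2", r"abc124323\blah2"]:
        print(extract_name(sample))
```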
As the other posters have said, you can use dedicated file name helpers for this. If you'd like to do it with regular expressions, use the -replace operator (the .Replace() string method only does literal replacement):
$string -replace "\..*",""
The \..* regex matches a dot (\.) and then any string of characters (.*).
Here is what each of the non-working patterns means when read as a regex:
$string.replace(".*","")
The reason this doesn't work is that . and * are both special characters in regular expressions: . is a wildcard character that matches any character, and * means "match the previous character zero or more times." So .* means "any string of characters."
$string.replace(".\*","")
In this instance, you're escaping the * character, meaning that the regex treats it literally, so the regex matches any single character (.) followed by a star (\*).
$string.replace(".[A-Z]","")
In this case, the regex will match any character (.) followed by any single capital letter ([A-Z]).
If the strings are paths to items that actually exist, using Get-Item would be another option:
$path = 'C:\blah1\blah2.something'
(Get-Item $path).BaseName
The Replace() method can't be used here, because it doesn't support wildcards or regular expressions.
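The split between literal and regex replacement is not specific to PowerShell. As a hedged illustration, here is the same contrast in Python, where str.replace() is literal and re.sub() takes a regex. The variable names are made up for the example.

```python
import re

path = r"C:\blah1\blah2.extra"

# str.replace() treats ".*" as the two literal characters "." and "*".
# That sequence does not occur in the path, so nothing changes.
literal_result = path.replace(".*", "")

# re.sub() treats the pattern as a regex: a literal dot, then everything after it.
regex_result = re.sub(r"\..*", "", path)

print(literal_result)  # C:\blah1\blah2.extra
print(regex_result)    # C:\blah1\blah2
```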

How do I specify a regex of certain length where I want to disallow characters instead of allowing them?

I need a regular expression which will validate a string to have length 7 and doesn't contain vowels, number 0 and number 1.
I know about character classes like [a-z] but it seems a pain to have to specify every possibility that way: [2-9~!##$%^&*()b-df-hj-np-t...]
For example:
If I pass a String June2013 - it should fail because length of the string is 8 and it contains 2 vowels and number 0 and 1.
If I pass a String XYZ2003 - it should fail because it contains 0.
If I pass a String XYZ2223 - it should pass.
Thanks in advance!
So that would be something like this:
^[^aeiouAEIOU01]{7}$
The ^$ anchors ensure there's nothing in there but what you specify, the character class [^...] means any character except those listed and the {7} means exactly seven of them.
That follows the English definition of a vowel; other cultures may have a different idea as to what constitutes voweliness.
Based on your test data, the results are:
pax> echo 'June2013' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2003' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2223' | egrep '^[^aeiouAEIOU01]{7}$'
XYZ2223
This is the briefest way to express it:
(?i)^[^aeiou01]{7}$
The term (?i) means "ignore case", which obviates typing both upper and lower vowels.
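A minimal Python sketch of the same check, using the three test strings from the question. The names CODE and is_valid are illustrative. fullmatch is used in place of the ^ and $ anchors, and both cases of each vowel are listed explicitly, as in the first answer.

```python
import re

# Exactly seven characters, none of which is a vowel (either case), 0, or 1.
CODE = re.compile(r"[^aeiouAEIOU01]{7}")


def is_valid(text):
    """Return True if the whole string is seven allowed characters."""
    return CODE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["June2013", "XYZ2003", "XYZ2223"]:
        print(sample, is_valid(sample))
```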

Difference between * and + regex

Can anybody tell me the difference between the * and + operators in the example below:
[<>]+ [<>]*
Both are quantifiers. The star quantifier (*) means that the preceding expression can match zero or more times, like {0,}, while the plus quantifier (+) means that the preceding expression must match at least once, like {1,}.
So to recap:
a* ---> a{0,} ---> Matches a or aa or aaaaa or an empty string
a+ ---> a{1,} ---> Matches a or aa or aaaa but not an empty string
* means zero-or-more, and + means one-or-more. So the difference is that the empty string would match the second expression but not the first.
+ means one or more of the previous atom. ({1,})
* means zero or more. This can match nothing, in addition to the characters specified in your square-bracket expression. ({0,})
Note that + is available in Extended and Perl-Compatible Regular Expressions, and is not available in Basic RE. * is available in all three RE dialects. Which dialect you're using most likely depends on the language or tool you're in.
Pretty much, the only things in modern operating systems that still default to BRE are grep and sed (both of which have ERE capability as an option) and non-vim vi.
* means zero or more of the previous expression.
In other words, the expression is optional and may also repeat.
You might define an integer like this:
-*[0-9]+
In other words, any number of negative signs (including none) followed by one or more digits. To allow at most one sign, use -? instead of -*.
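A quick way to see what -* accepts, compared with -?, is to try a few strings. This is only a sketch in Python; the names ANY_SIGNS, ONE_SIGN and accepts are made up for the example.

```python
import re

ANY_SIGNS = r"-*[0-9]+"  # zero or more minus signs, then digits
ONE_SIGN = r"-?[0-9]+"   # at most one minus sign, then digits


def accepts(pattern, text):
    """Return True if the pattern matches the whole string."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for sample in ["5", "-5", "--5", "-"]:
        print(sample, accepts(ANY_SIGNS, sample), accepts(ONE_SIGN, sample))
```

Both patterns accept 5 and -5 and reject a bare -, but only the * version accepts --5.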
They are quantifiers.
+ means 1 or many (at least one occurrence for the match to succeed)
* means 0 or many (the match succeeds regardless of the presence of the search string)
[<>]+ is same as [<>][<>]*
Here is an example to extend the answers above. Suppose we have this text:
100test10
test10
test
If we write \d+test\d+, the expression matches only 100test10, because it requires at least one digit on each side of "test", whereas \d*test\d* matches all three lines.
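The three sample lines can be checked directly. This is a small Python sketch, with LINES and matching as illustrative names, that also tries the two expressions from the question against an empty string.

```python
import re

LINES = ["100test10", "test10", "test"]


def matching(pattern):
    """Return the sample lines that the pattern matches in full."""
    return [line for line in LINES if re.fullmatch(pattern, line)]


plus_hits = matching(r"\d+test\d+")
star_hits = matching(r"\d*test\d*")

print(plus_hits)  # ['100test10']
print(star_hits)  # ['100test10', 'test10', 'test']

# The expressions from the question, tried against an empty string.
empty_matches_plus = re.fullmatch(r"[<>]+", "") is not None
empty_matches_star = re.fullmatch(r"[<>]*", "") is not None
print(empty_matches_plus, empty_matches_star)  # False True
```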

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".*?" with "[^(\?=)]*?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
In my experience, it is good practice to avoid .*? and instead use a plain * with a more specific character class. In this case, [^?]* matches a sequence of non-question-mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
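As a sketch of how the stricter pattern behaves, the Python snippet below wraps it in the =? ... ?= delimiters and captures the third field. The name ENCODED_WORD and the second sample string, which contains a question mark inside the text, are made up for illustration.

```python
import re

# "=?", charset, "?", one encoding character, "?", then the text:
# runs of non-"?" characters, or a "?" that is not followed by "=".
ENCODED_WORD = re.compile(r"=\?[^?]+\?.\?((?:[^?]*\?[^=])*[^?]*)\?=")

subject = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="
payloads = ENCODED_WORD.findall(subject)
print(payloads)  # ['Here is som', 'e text.']

# A question mark inside the text does not end the match early.
tricky = ENCODED_WORD.findall("=?utf-8?Q?Is it ok? yes?=")
print(tricky)    # ['Is it ok? yes']
```

No lazy quantifier is needed here, because the character classes themselves cannot run past the closing ?=.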
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Lazily match characters until the rest of the regular expression can match, here at the next '?='.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Lazily match characters until the rest of the regular expression can match, here at the next '?='.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?', followed by a '?'. Repeated twice to skip the string 'utf-8?Q?'
([^?]*) # Capture the following characters up to the next '?'
/g # Repeat the match until the end of the string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
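For readers not using Perl, the global match can be sketched in Python, where re.findall plays the role of the /g loop. The pattern is the one shown above; the variable names are illustrative.

```python
import re

# "=?", then two "field?" chunks to skip (charset and encoding), then the text.
THIRD_FIELD = re.compile(r"=\?(?:[^?]*\?){2}([^?]*)")

subject = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?="
groups = THIRD_FIELD.findall(subject)

for group in groups:
    print("Group ->", group)
```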
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression is that the first group is now non-greedy: (.*?) instead of (.+). Critical, and I'd forgotten it.
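The effect of that one change can be reproduced with a short Python sketch that runs the greedy and the non-greedy expressions against the sample subject line. The names GREEDY_FIRST and LAZY_FIRST are illustrative.

```python
import re

subject = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="

GREEDY_FIRST = r"=\?(.+)\?.\?(.*?)\?="
LAZY_FIRST = r"=\?(.*?)\?.\?(.*?)\?="

# The greedy first group runs to the last "?x?" it can find, so the
# whole line is swallowed by a single match.
greedy_result = re.findall(GREEDY_FIRST, subject)
print(greedy_result)
# [('utf-8?Q?Here is som?= =?utf-8', 'e text.')]

# With both groups lazy, each =?...?= pair is matched separately.
lazy_result = re.findall(LAZY_FIRST, subject)
print(lazy_result)
# [('utf-8', 'Here is som'), ('utf-8', 'e text.')]
```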