Creating a regex to parse a build version

I'm trying to grab a build version from a file that contains the following line:
<Assembly: AssemblyVersion("004.005.0862")>
and I would like it to return
4.5.862
I'm using sed in dos and got the following to spit out 004.005.0862
echo "<Assembly: AssemblyVersion("004.005.0862")>" | sed "s/[^0-9,.]//g"
How do I get rid of the leading zeros for each part of the build number?

The regex to do this in a single step looks like this:
^.*"0*([0-9]+\.)0*([0-9]+\.)0*([0-9]+).*
with sed-specific escaping and as a full expression, it becomes a little longer:
s/^.*"0*\([0-9]\+\.\)0*\([0-9]\+\.\)0*\([0-9]\+\).*/\1\2\3/g
The regex breaks down as
^ # start-of-string
.*" # anything, up to a double quote
0*([0-9]+\.) # any number of zeros, then group 1: at least 1 digit and a dot
0*([0-9]+\.) # any number of zeros, then group 2: at least 1 digit and a dot
0*([0-9]+) # any number of zeros, then group 3: at least 1 digit
.* # anything up to the end of the string
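For anyone who wants to try the same pattern outside sed, here is a minimal Python sketch of the single-step approach. The function name strip_leading_zeros is made up for illustration; the regex is the one broken down above, with a trailing $ added.

```python
import re

# Same pattern as above: skip to a double quote, drop leading zeros,
# and capture the three numeric parts (the first two keep their dot).
VERSION_RE = re.compile(r'^.*"0*([0-9]+\.)0*([0-9]+\.)0*([0-9]+).*$')

def strip_leading_zeros(line):
    """Hypothetical helper: return the version with leading zeros removed."""
    return VERSION_RE.sub(r"\1\2\3", line)

print(strip_leading_zeros('<Assembly: AssemblyVersion("004.005.0862")>'))
```

Because each group requires at least one digit, a part that is all zeros (for example 000) should come out as a single 0 rather than an empty string.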

Maybe ... | sed "s/[^0-9]*0*([1-9][0-9,.]*)/\1/g". I'm using a subpattern to filter out the part you need, ignoring leading zeros and non-numeric characters.

There are probably many more clever ways, but one that works (and is reasonably easy to understand) is to pipe it through additional calls:
echo "version(004.005.0862)" | sed "s/[^0-9,.]//g" | sed "s/^0*//g" | sed "s/\.0*/./g"
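The three piped sed calls can be mirrored step by step in Python, which makes it easier to see what each stage does. This is only a sketch; strip_zeros_stepwise is a hypothetical name.

```python
import re

def strip_zeros_stepwise(line):
    """Hypothetical helper mirroring the three sed stages above."""
    digits = re.sub(r"[^0-9,.]", "", line)  # keep only digits, commas and dots
    digits = re.sub(r"^0*", "", digits)     # drop zeros at the very start
    return re.sub(r"\.0*", ".", digits)     # drop zeros that follow each dot

print(strip_zeros_stepwise("version(004.005.0862)"))
```

One caveat of this stepwise form: a part consisting only of zeros disappears entirely, so 4.000.1 would become 4..1.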

How to use sed/perl to find only 2d arrays and replace text?

Currently I have tons of code that looks like this:
static double testVar1 [2][8] = {0.0} ; /* This is for testing var 1 */
static double var_test2 [3][2] = {0.0} ; /* This is for testing var 2 */
static double var_test3 [4] = {0.0} ; /* This is for testing var 3 */
2D arrays in C++ are initialized with double curly brackets, so I need to find only the 2D arrays and change them like this:
static double testVar1 [2][8] = {{0.0}} ; /* This is for testing var 1 */
static double var_test2 [3][2] = {{0.0}} ; /* This is for testing var 2 */
static double var_test3 [4] = {0.0} ; /* This is for testing var 3 */
I have been trying to use groupings with sed, but I can't figure out how to escape the brackets; some posts suggest not escaping at all. I have also tried without extended regular expressions.
I just found out that sed only allows 9 groupings, so now I'm completely stuck. Any suggestions?
sed -i -r 's/(.*)(\[)([0-9]+)(\])(\[)([0-9]+)(\])(.*)(\{)(0.0)(\})(.*)/echo "\1\2\3"/ge'
Use a perl script with the following regex:
\w+\s*(?:\[\d+\]){2}\s*=\s*\K\{([\d.]*)\}
And replace this with \{\{\1\}\}, see a demo on regex101.com.
Broken down, this says:
\w+ # at least one word character
\s* # Zero or more spaces
(?:\[\d+\]){2} # [0][1] or any other two-dimensional array
\s*=\s* # spaces = and spaces
\K # "forget" everything
\{([\d.]*)\} # match and capture {number}
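The same idea can be sketched in Python. Python's re module has no \K, so in this version the prefix is captured and put back in the replacement instead; the function name double_brace_2d is hypothetical.

```python
import re

# Group 1: name, two bracketed sizes and the equals sign (kept as-is).
# Group 2: the number inside the single pair of braces.
PATTERN = re.compile(r"(\w+\s*(?:\[\d+\]){2}\s*=\s*)\{([\d.]*)\}")

def double_brace_2d(line):
    """Hypothetical helper: wrap the initializer of a 2D array in extra braces."""
    return PATTERN.sub(r"\1{{\2}}", line)

print(double_brace_2d("static double testVar1 [2][8] = {0.0} ;"))
print(double_brace_2d("static double var_test3 [4] = {0.0} ;"))
```

The second call leaves the 1D declaration alone, because the pattern insists on exactly two bracketed sizes before the equals sign.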
A Perl one-liner, cautious about literals such as 2u and 1e-06l (etc)
perl -pe's/(?:\[ [^]]+ \]){2} \s*=\s* \K (\{ [^}]+ \})/{$1}/x' in > out
The (?:) groups without capturing, and (?:\[[^]]+\]){2} matches [n][m]. The \K works like a form of positive lookbehind: it drops everything matched so far, so we don't have to put it back.
With an integer inside [] being just digits and with a float in {} being n.m this simplifies
perl -pe's/(?:\[\d+\]){2}\s*=\s*\K( \{[\d.]+\} )/{$1}/x' in > out
Note that [\d.] allows for all kinds of wrong things, like .2..3, but that is a different issue.
However, watch out for use of literals for numbers such as 2u (with the suffix) which are fine as indices as well, along with vec[1.2e+01] or even vec[1.2]. The varied notation for float/double literals is also more likely to show up in data. Altogether I'd go with a more rounded pattern like
perl -pe's/(?:\[ [\d\w+-.]+ \]){2}\s*=\s*\K(\{ [\d\w+-.]+ \})/{$1}/x' in > out
Keep in mind that this allows various wrong formats and so it doesn't check data well.
Here is a sed attempt with the regex wrinkles ironed out.
sed -i -r 's/(.*\[[0-9]+\]\[[0-9]+\].*)(\{0.0\})(.*)/\1{\2}\3/'
You had significant amounts of unmotivated additional grouping parentheses so \1\2\3 would only refer to the very beginning of the match. I simply took them out. Remember, the captures are ordered from left to right, so the first left parenthesis creates group \1, the second captures into \2, etc.
The GNU sed extension /e allows you to invoke the shell on the replacement string but in this case this added no value and introduced significant additional possible errors, so taking it out was a no-brainer. The /g option would make sense if you expected multiple matches per line, but your example shows no examples of input lines with multiple matches, and the entire script would need to be rather more complex in order to support that, so I took that out as well.
Depending on the language you are attempting to process and the regularity of the files, you might want to permit whitespace between the closing and opening square brackets, or not; and the "anything" wildcard between the closing square bracket and the opening curly bracket looks somewhat prone to false positives (matching where you don't want it to) -- maybe change it to only permit whitespace and an equals sign, like [ =]* instead of .*
Another approach with sed:
sed -i -r 's/((\[[0-9]\]){2} *= )(\{[^}]*\})/\1{\3}/' file
and same in BRE mode :
sed -i 's/\(\(\[[0-9]\]\)\{2\} *= \)\({[^}]*}\)/\1{\3}/' file
sed -i '/]\[/s/[{}]/&&/g' file

sed: Why does "s/TR[0-9]*//2g" work but not "s/TR[0-9].*//2g"?

My file looks like this:
>TR45672|c1_g1_i1|m.87632TR21000
sometextherethatmayincludeTRbutnonumbers
>TR10000|c0_g1_i1|m.83558TR1702000
sometextherethatmayincludeTRbutnonumbers
....
....
I want it to look like this:
>TR45672|c1_g1_i1|m.87632
sometextherethatmayincludeTRbutnonumbers
>TR10000|c0_g1_i1|m.83558
sometextherethatmayincludeTRbutnonumbers
....
....
In other words, I want to remove second occurrence of the pattern TR in the headers (rows that start with ">") and everything after that, but not touch any TR patterns in lines that are not headers. In non-header lines, TR will never ever be followed by a number.
I try to use the following code:
sed "s/TR[0-9].*//2g"
It will, as I have understood it, match TR and then a number and remove all instances but the first one. Since there are always exactly two occurrences of TR[0-9] in the header and no occurrences of TR[0-9] in non-headers, this will accomplish my goals...
...or so I thought. In reality, using the above code has no effect whatsoever.
If I instead skip the dot and use:
sed "s/TR[0-9]*//2g"
It produces what looks like the desired result for those lines I have manually checked.
Questions:
(1) How come it works without the dot but not with it? My understanding is that ".*" is the key to removing everything after a pattern.
(2) Removing the dot seems to work, but it is not possible for me to manually check through the entire file. Is there a reason to suspect something unexpected happens when skipping the dot in this case?
sed "s/TR[0-9].*//2g"
...matches the whole line from the first TR to the end of the line, which means there is no following match (there's nothing left of the line to match since it has all been matched)
sed "s/TR[0-9]*//2g"
...first matches only the first TR<number> sequence, then finds the second match in the rest of the line.
Analyze the first line of your input file against the regex with the dot:
|-------------------------------- (1) TR matches 'TR' literally
|  |----------------------------- (2) [0-9] matches a single digit
|  | |--------------------------- (3) .* matches any char till the end
|  | |
TR 4 5672|c1_g1_i1|m.87632TR21000
11 2 3333333333333333333333333333
---------------------------------
1st and only match, so there is no 2nd (or later) match to replace
So using TR[0-9].* you have a single match per line starting with TR.
If you use the second regex instead:
|---------------------------------- (m1) TR matches 'TR' literally
|  |------------------------------- (m1) [0-9]* matches zero or more digits
|  |
|  |                       |------ (m2) TR matches 'TR' literally
|  |                       |  |--- (m2) [0-9]* matches zero or more digits
TR 45672 |c1_g1_i1|m.87632 TR 21000
--------                   --------
1st match                  2nd match
By the way, since there are only two TR sections, you can skip the global flag and use:
sed 's/TR[0-9]*//2' file
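The difference between the two patterns is easy to check by counting matches in Python. The sketch below also includes a hypothetical helper, drop_from_second, that cuts a header at its second match; it assumes the stricter pattern TR[0-9]+ (at least one digit), which is a choice made here and not something from the original commands.

```python
import re

header = ">TR45672|c1_g1_i1|m.87632TR21000"

# With ".*" the first match swallows the rest of the line, so there is only one.
greedy = re.findall(r"TR[0-9].*", header)
# Without the dot each TR<digits> run is a separate match.
digits_only = re.findall(r"TR[0-9]*", header)
print(len(greedy), digits_only)

def drop_from_second(line, pattern=r"TR[0-9]+"):
    """Hypothetical helper: cut a header line at the second match of pattern."""
    matches = list(re.finditer(pattern, line))
    if line.startswith(">") and len(matches) >= 2:
        return line[:matches[1].start()]
    return line  # non-header lines and single-match headers are left alone

print(drop_from_second(header))
```

Restricting the helper to lines that start with ">" is one way to guard against non-header lines that happen to contain TR twice, which TR[0-9]* with the 2 flag would also edit.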

Using grep to find keywords, and then list the following characters until the next ; character

I have a long list of chemical conditions in the following form:
0.2M sodium acetate; 0.3M ammonium thiosulfate;
The molarities can be listed in various ways:
x.xM, x.x M, x M
where the number of x digits varies. I want to do two things: select those numbers using grep, and then list only the characters that follow, up to the next ;. So if I select 0.2M in the example above, I want to be able to list sodium acetate.
For selecting, I have tried the following:
grep '[0-9]*.[0-9]*[[:space:]]*M' file
so that there are arbitrary number of digits and spaces, but it always ends with M. The problem is, it also selects the following:
0.05MRbCl+MgCl2;
I am not quite sure why this is selected. Ideally, I would want 0.05M to be selected, and then list RbCl+MgCl2. How can I achieve this?
(The system is OS X Yosemite)
It matches that because:
[0-9]* matches 0
. matches any character (this is the . in this case, but you probably meant to escape it)
[0-9]* matches 05
[[:space:]]* matches the empty string between 05 and M
M matches M
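The effect of the unescaped dot can be checked quickly in Python. This is only a sketch: Python's re module does not support [[:space:]], so \s stands in for it, and the stricter pattern is one possible tightening chosen for illustration.

```python
import re

# Loose pattern from the question ("." unescaped, every digit run optional).
LOOSE = re.compile(r"[0-9]*.[0-9]*\s*M")
# Stricter variant (an assumption): digits required, the dot escaped.
STRICT = re.compile(r"[0-9]+(?:\.[0-9]+)?\s*M")

# The loose pattern accepts any single character followed by M.
print(LOOSE.search("aM") is not None)
print(STRICT.search("aM") is not None)

# Both still find the molarity in the problematic line.
print(STRICT.search("0.05MRbCl+MgCl2;").group())
```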
As for how to do what you want: I think that if you don't want the numbers to be printed with the output, this would require either a lookbehind assertion or the ability to print a specific capture group, which it sounds like OS X's grep doesn't support. You could use a similar approach with a slightly more powerful tool, though:
$ cat test.txt
0.2M sodium acetate; 0.3M ammonium thiosulfate;
0.05MRbCl+MgCl2;
1.23M dihydrogen monoxide;
45 M xenon quadroxide;
$ perl -ne 'while (/([0-9]*\.)?[0-9]+\s*M\s*([^;]+)/g) { print "$2\n"; }' test.txt
sodium acetate
ammonium thiosulfate
RbCl+MgCl2
dihydrogen monoxide
xenon quadroxide
Written out, that regex is:
([0-9]*\.)? optionally, some digits and a decimal point
[0-9]+ one or more digits
\s*M\s* the letter M, with spacing around it
([^;]+) all the characters up until the next semicolon (the thing you want to print)
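For comparison, the same extraction can be sketched in Python with re.findall. The regex is the one written out above, with the first group made non-capturing so that findall returns only the compound names; the function name compounds is hypothetical.

```python
import re

# Optional "digits." prefix, digits, the letter M with optional spacing,
# then capture everything up to the next semicolon.
MOLARITY_RE = re.compile(r"(?:[0-9]*\.)?[0-9]+\s*M\s*([^;]+)")

def compounds(text):
    """Hypothetical helper: list what follows each molarity, up to ';'."""
    return [name.strip() for name in MOLARITY_RE.findall(text)]

print(compounds("0.2M sodium acetate; 0.3M ammonium thiosulfate;"))
print(compounds("0.05MRbCl+MgCl2;"))
```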
With GNU awk for multi-char RS, gensub() and \s:
$ awk -vRS=';\\s*' -vm='0.2M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.2M sodium acetate
$ awk -vRS=';\\s*' -vm='0.05M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.05MRbCl+MgCl2

How do I specify a regex of certain length where I want to disallow characters instead of allowing them?

I need a regular expression which will validate a string to have length 7 and doesn't contain vowels, number 0 and number 1.
I know about character classes like [a-z] but it seems a pain to have to specify every possibility that way: [2-9~!##$%^&*()b-df-hj-np-t...]
For example:
If I pass a String June2013 - it should fail because length of the string is 8 and it contains 2 vowels and number 0 and 1.
If I pass a String XYZ2003 - it should fail because it contains 0.
If I pass a String XYZ2223 - it should pass.
Thanks in advance!
So that would be something like this:
^[^aeiouAEIOU01]{7}$
The ^$ anchors ensure there's nothing in there but what you specify, the character class [^...] means any character except those listed and the {7} means exactly seven of them.
That's following the English definition of vowel, other cultures may have a different idea as to what constitutes voweliness.
Based on your test data, the results are:
pax> echo 'June2013' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2003' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2223' | egrep '^[^aeiouAEIOU01]{7}$'
XYZ2223
This is the briefest way to express it:
(?i)^[^aeiou01]{7}$
The term (?i) means "ignore case", which obviates typing both upper and lower vowels.
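A minimal Python sketch of the same check, using the question's three test strings. It uses re.IGNORECASE in place of (?i) and fullmatch in place of the ^ and $ anchors; is_valid is a hypothetical name.

```python
import re

# Exactly seven characters, none of them a vowel (either case), 0 or 1.
VALID_RE = re.compile(r"[^aeiou01]{7}", re.IGNORECASE)

def is_valid(s):
    """Hypothetical helper: True if s passes the length and character rules."""
    return VALID_RE.fullmatch(s) is not None

for candidate in ("June2013", "XYZ2003", "XYZ2223"):
    print(candidate, is_valid(candidate))
```

Using fullmatch also sidesteps a subtlety of $ in Python, which can match just before a trailing newline.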

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace("/--(?=[^']*'([^']|'[^']*')*$)/", "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
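As a quick cross-check, the lookahead can be run against the example from the question in Python. This is a sketch under the same assumption Greg states (quotes are always balanced); the function name remove_quoted_dashes is hypothetical.

```python
import re

# "--" followed by an odd number of single quotes before the end of the
# string, which means the "--" itself sits inside a quoted region.
INSIDE_QUOTES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")

def remove_quoted_dashes(text):
    """Hypothetical helper: delete '--' only where it appears inside quotes."""
    return INSIDE_QUOTES.sub("", text)

sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(remove_quoted_dashes(sample))
```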
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
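A minimal Python sketch of that character-by-character approach follows. It assumes single quotes are never escaped inside a quoted region; the function name and the target parameter are made up for illustration.

```python
def strip_dashes_in_quotes(text, target="--"):
    """Hypothetical helper: remove target only while inside single quotes."""
    out = []
    inside = False  # flips every time a single quote is seen
    i = 0
    while i < len(text):
        if text[i] == "'":
            inside = not inside
            out.append("'")
            i += 1
        elif inside and text.startswith(target, i):
            i += len(target)  # skip the dashes, emit nothing
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(strip_dashes_in_quotes(sample))
```

Supporting escaped quotes would need one more piece of state (whether the previous character was an escape), which is left out here.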
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
( # Group 1
(?:^[^']*')? # Start of string, up till the first single quote
[^']*? # Inside the single quotes, as few characters as possible
(?:
'[^']*' # No double dashes inside theses single quotes, jump to the next.
[^']*?
)*? # as few as possible
)
(-{2,}) # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix&1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.