Sed: Substitute letters on certain positions - regex

I have a file with the structure:
N1H3O1 C2H2
C1H4 H201
C1H1N1 N1H3
C2N1O1P1H3 P5
What I am trying to do is to compute the sum of the coefficients in each of the formulae. Thus, the desired output is:
1+3+1 5 2+2 4
1+4 5 2+1 3
1+1+1 3 1+3 4
2+1+1+1+3 8 5 5
What I did was simply replace each letter with "+" and then delete the leading " +".
However, I would like to know how to do it in a more proper way in sed, using branch and flow-control commands.

The problem with your input is the 0 that is used instead of O (in H201), which makes it difficult to design a regular expression for it, as you can see here:
([^A-Z]+)*([0-9]+)
Other than that, you might be able to capture the numbers by simply adding ([^A-Z]+).
However, you may not want to do this task with a regular expression: apart from that 0, your data is pretty structured, so you could write a short script instead.
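As a rough illustration of the script route, here is a minimal Python sketch. It is not the sed solution the question asks for. The function name `summarize` is made up for this example, and it assumes that every coefficient is a run of the digits 1-9, so that a stray 0 (as in H201) is treated as a mistyped letter O; a genuine coefficient such as 10 would break that assumption.

```python
import re


def summarize(line: str) -> str:
    """Return 'c1+c2+... total' for each whitespace-separated formula."""
    parts = []
    for formula in line.split():
        # Assumption: a coefficient is a run of digits 1-9; a "0" is taken
        # to be a mistyped letter O and therefore acts as a separator.
        coeffs = re.findall(r"[1-9]+", formula)
        parts.append("+".join(coeffs))
        parts.append(str(sum(int(c) for c in coeffs)))
    return " ".join(parts)


if __name__ == "__main__":
    for row in ["N1H3O1 C2H2", "C1H4 H201", "C1H1N1 N1H3", "C2N1O1P1H3 P5"]:
        print(summarize(row))
```

Each formula contributes two output fields, the joined coefficients and their sum, which mirrors the layout of the desired output in the question.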

Related

Vim: Placing (,) in between CERTAIN high numbers Issue

source txt file:
34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
command input:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/g
command output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1,985|6 (4)|China
Desired output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
Basically, 1985 is supposed to stay 1985 and not become 1,985. I tried adding a \? so that the pattern stops every time it matches, and a °+ after it so that it has to find a ° to match, but without success. It just replaces the ° and everything before it, a complete mess.
My knowledge of regular expressions combined with :substitute is weak, however, and I'm stuck here.
EDIT
The first 3 numbers represent heights of mountains; those 3 need to get a comma. The last number (1985) represents a year, which must not be changed.
Mathematical solutions are not going to work as a loophole, since there are mountains with a height of less than 1900.
You haven't told us what is the difference between 1985 and other numbers, so I assumed that your "small" numbers are less than 2000.
You almost got it:
:%s/\(\d*[2-90]\)\(\d\d\d\)/\1,\2/g
Alternatively, if that isn't what you want, you can use the c flag (:h s_flags) to confirm each substitution:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/gc
This line leaves the last 3 columns untouched and only substitutes in the content before them:
%s/\v(.*)((\|[^|]*){3}$)/\=substitute(submatch(1),'\v(\d+)(\d{3})','\1,\2','g').submatch(2)/g
Note that the above line will change 1000000 into 1000,000 instead of 1,000,000. It is a pity that Vim's printf() doesn't support %'d. If you do have numbers above one million, we can look for other solutions.
update
I solved it myself by using 3 separate commands, one for every number string in the file:
%s/^\(\d*|[^|]*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
In case you want to use perl:
:%!perl -F'\|' -lane 'for(@F[2..4]) { s/(\d+)(\d{3})/\1,\2/;} print join "|", @F'
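For comparison, the same field-based idea as the Perl filter can be sketched in Python. The function name `add_commas` is invented for this example. It assumes that a record starts with a numeric rank and that columns 3 to 5 hold the heights, so a line that does not start with a number (such as the continuation line carrying the year) is returned unchanged. Python's `,` format specifier also groups numbers above one million correctly.

```python
def add_commas(line: str) -> str:
    """Insert thousands separators into the three height columns only."""
    fields = line.split("|")
    # Assumption: a real record starts with a numeric rank; anything else
    # (e.g. a wrapped continuation line holding the year) is left alone.
    if not fields[0].isdecimal():
        return line
    for i in range(2, 5):  # zero-based indexes of the height columns
        if i < len(fields) and fields[i].isdecimal():
            fields[i] = f"{int(fields[i]):,}"
    return "|".join(fields)


if __name__ == "__main__":
    print(add_commas("34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|30°26'19\"N"))
    print(add_commas("81°17'48\"E|Dhaulagiri|1985|6 (4)|China"))
```

Because the year is selected out by its column rather than by its value, mountains lower than 1900 are still handled.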

RegEx to find numbers over certain value with commas, and another text value, on same line

I'm new to Regular Expressions, and I have been trying to figure out how to code this: I need to find numbers greater than 25000 where the same line also has the number " 19" somewhere on that line (that's a space then 19). The problem is that the numbers have commas in them. I tried a few options:
This finds lines with any numbers over 25000:
^.*(25,|26,|27,|28,|29,|30,|31,|32,|33,|34,|35,|36,|37,|38,|39,|40,|41,|42,|43,|44,|45,|46,|47,|48,|49,|50,|51,|52,|53,|54,|55,|56,|57,|58,|59,|60,|61,|62,|63,|64,|65,|66,|67,|68,|69,|70,|71,|72,|73,|74,|75,|76,|77,|78,|79,|80,|81,|82,|83,|84,|85,|86,|87,|88,|89,|90,|91,|92,|93,|94,|95,|96,|97,|98,|99,|100,|101,|102,|103,|104,|105,|106,|107,|108,|109,|110,|111,|112,|113,|114,|115,|116,|117,|118,|119,|120,|121,|122,|123,|124,).*$
This finds lines with both " 19" and 26 (but without the comma after the 26):
^.*( 19.*26).*$
Any help is appreciated!
Numbers over 25000 can be represented as follows :
\d{6,}|2[5-9]\d{3}|[3-9]\d{4}
That is, in English:
numbers of 6 digits or more
numbers of 5 digits starting with 2, followed by a digit equal to or greater than 5
numbers of 5 digits starting with a digit greater than 2
So the complete regex would look like this :
.*(\d{6,}|2[5-9]\d{3,}|[3-9]\d{4,}).* 19.*
That is, such a number somewhere in the line, followed by " 19" later in the line.
Here is a test run on regex101 for you to test with your data.
I also second the comment that this isn't a job for regular expressions, which, as you can see, work on characters rather than on numbers.
I would try something like this:
^(([0-9,]*([3-9][0-9]|2[5-9]),?[0-9]{3})\s?)$
That should handle the numeric part. You didn't really explain if the " 19" would come before or after that, and what would delimit that from the numeric part, but just insert (\s19) wherever that bit needs to go.
Thanks everyone. The following RegEx worked for me:
^.* 19.*(25,|26,|27,|28,|29,|30,|31,|32,|33,|34,|35,|36,|37,|38,|39,|40,|41,|42,|43,|44,|45,|46,|47,|48,|49,|50,|51,|52,|53,|54,|55,|56,|57,|58,|59,|60,|61,|62,|63,|64,|65,|66,|67,|68,|69,|70,|71,|72,|73,|74,|75,|76,|77,|78,|79,|80,|81,|82,|83,|84,|85,|86,|87,|88,|89,|90,|91,|92,|93,|94,|95,|96,|97,|98,|99,|100,|101,|102,|103,|104,|105,|106,|107,|108,|109,|110,|111,|112,|113,|114,|115,|116,|117,|118,|119,|120,|121,|122,|123,|124,).*$
This finds lines that have " 19" first in the line and then a number greater than 25K later in the line, when the numbers have commas in them. I couldn't use the shortcut "number ranges" that were suggested, because there are other numbers on the lines, without commas, that are over 25K and that I don't want to flag. Maybe there's an easier way than my brute-force method, but if not, at least this works. Thanks again!
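Since several answers point out that regular expressions work on characters rather than on numbers, here is a hedged Python sketch of the alternative: let a regex find only the comma-grouped numbers and do the comparison numerically. The names `COMMA_NUMBER` and `should_flag` are made up for this example. Unlike the regex above, it accepts " 19" anywhere on the line, which is what the question originally asked for.

```python
import re

# A number written with thousands separators, e.g. 26,500 or 1,234,567.
# Numbers without commas are deliberately not matched.
COMMA_NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")


def should_flag(line: str, threshold: int = 25000) -> bool:
    """True if the line contains ' 19' and a comma-grouped number > threshold."""
    if " 19" not in line:
        return False
    values = (int(text.replace(",", "")) for text in COMMA_NUMBER.findall(line))
    return any(value > threshold for value in values)


if __name__ == "__main__":
    print(should_flag("dept 19 budget 26,500"))
    print(should_flag("dept 19 budget 24,999 id 99999"))
```

The threshold is an ordinary parameter here, so changing it does not require rewriting a long alternation.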

2 digits only allowed once (Regex)

I'm trying to check if a level is valid or not.
The level is of the following form (but there are 998 more of these):
bbbbbbb
b41111b
b81400b
b81010b
b01121b
b08001b
bbbbbbb
The level must follow a few rules. I have written a regex that enforces all of the rules but one:
The level must contain exactly one 2 and exactly one 4.
(Notice that the level above contains two 4's and one 2, so it is not valid.)
This is a school project so please guide me through to the answer.
Thanks in advance.
EDIT:
My current regex is:
r'^b{' + str(length) + r'}\n(b{1}[0-8]{' + str(length - 2) + r'}b{1}\n)+b{' + str(length) + '}$'
For the level above, length = 7.
Note that it doesn't even try to reject the invalid level above.
The other rules are:
The level must be surrounded by 'b' characters.
The level can only contain the char 'b' and digits smaller than 9.
There can only be one 2.
There can only be one 4.
My regex above does take rules 1 and 2 into account, but I still need to figure out rules 3 and 4.
I have tried lookarounds and such, but couldn't figure it out.
This is the regex that will meet all of your conditions:
^b(?!(?:[^2]*2){2,})(?!(?:[^4]*4){2,})[b0-8]*b$
It starts and ends with b, using ^b and b$.
It consists only of the letter b and the digits 0-8, via [b0-8]*.
It won't allow more than one digit 2, by using (?!(?:[^2]*2){2,}).
It won't allow more than one digit 4, by using (?!(?:[^4]*4){2,}).
Well, a regex for exactly 1 a would be [^a]*a[^a]* (that is, a possibly empty sequence of non-a's, followed by an a, followed by a possibly empty sequence of non-a's). I'll leave it as an exercise how to handle multiple lines and how to make sure this covers the whole level.
For exactly 1 a and 1 b: [^ab]*((a[^ab]*b)|(b[^ab]*a))[^ab]*, with the same caveats. Explanation: a sequence of non-a-or-b's, followed by EITHER 1) an a, a run of non-a-or-b's, and a b, or 2) a b, a run of non-a-or-b's, and an a, with THAT followed by a run of non-a-or-b's.
The answer is to use negative lookaheads anchored to start of input.
It's unclear what you're trying to match, so I will just use a placeholder <your-regex> for your current regex:
^(?!.*?2.*?2)(?!.*?4.*?4)<your-regex>
See a live demo of this correctly rejecting more than 1 "2".
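To make the lookahead idea concrete for a multi-line level, here is a minimal Python sketch. The helper name `level_pattern` is invented for this example, and the border and row parts are a simplified version of the regex from the question. It uses positive lookaheads with negated character classes, because [^2] also matches newlines (whereas . does not without re.DOTALL) and because "exactly one" needs a stricter test than "not two or more".

```python
import re


def level_pattern(length: int) -> "re.Pattern[str]":
    """Build a regex for a level whose rows are `length` characters wide."""
    exactly_one_2 = r"(?=[^2]*2[^2]*\Z)"  # one 2 in the whole string
    exactly_one_4 = r"(?=[^4]*4[^4]*\Z)"  # one 4 in the whole string
    # Assumption: same shape as the question's regex (b border, 0-8 inside).
    body = r"b{%d}\n(?:b[0-8]{%d}b\n)+b{%d}" % (length, length - 2, length)
    return re.compile(exactly_one_2 + exactly_one_4 + body)


INVALID = "bbbbbbb\nb41111b\nb81400b\nb81010b\nb01121b\nb08001b\nbbbbbbb"
VALID = INVALID.replace("b81400b", "b81000b")  # remove the second 4

if __name__ == "__main__":
    pattern = level_pattern(7)
    print(bool(pattern.fullmatch(INVALID)))
    print(bool(pattern.fullmatch(VALID)))
```

Outside of an exercise that requires a single regex, counting with level.count("2") == 1 would be the simpler check.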

More efficient regex than "(cg[agct])|(ag[ag])"

I need a regex to match any of:
cgt, cgc, cga, cgg, aga, agg
They're DNA codons. Is the regex I've given, (cg[agct])|(ag[ag]), as efficient as it could be? It somehow seems clunky, and I wonder if I could use the fact that there has to be a g as the second character.
To sum up the comments:
It appears that what you have is pretty good.
The one suggestion is to change the capturing groups into non-capturing groups (or remove them altogether).
Something like this seems optimal:
cg[agct]|ag[ag]
If you had a set that was FAR more frequent than the others, you could possibly speed it up (slightly) by adding it literally to the alternation:
cgg|cg[act]|ag[ag]
Internally, most regex engines will turn small character classes like this into their own alternation. It may be fastest to expand out the alternation all the way, or in different groups, to see the performance impact.
I would suggest that you should profile all three of these approaches with your regex engine:
cg[agct]|ag[ag]
cga|cgc|cgg|cgt|aga|agg
[ac]g[agct](?<!agt|agc)
The last one is the closest to an answer to your question, since it leverages the fact that a "g" is required in the middle and used a "negative lookbehind" to eliminate the invalid sets.
One other thing to check would be if just finding all instances of [ac]g[agct] (including the undesired "agt" and "agc") and then filtering them in your language of choice would be fastest.
EDIT, FOR SCIENCE!
Here is a chart of the various types of matches and failures, along with their number of steps required to reach a conclusion (match or no match).
input   cg[agct]|ag[ag]   [ac]g[agct](?<!agt|agc)   cga|cgc|cgg|cgt|aga|agg
agg     4                 6                         10
agc     4                 8                         10
cga     3                 6                         3
axa     3                 2                         8
cxa     3                 2                         10
xxx     2                 1                         6
So, it appears that (as we guessed), the methods have entirely different properties.
My hunch about splitting everything into an alternation was wrong. Don't use that.
Your hunch about utilizing the "g" in the middle is warranted, except that partial matches (agg, for example) and full matches (cga, for example) take longer. However, throwing away bad results is slightly faster with the negative lookbehind version.
So, to compensate for the worst case, (8 checks versus 3 = delta -5) we would have to see at least 5 failing character positions. (2 checks versus 3 = delta 1 or 1 check versus 2 = delta 1)
I guess, then, that you should use the negative lookbehind version if you anticipate that you will fail a match at 5 positions for every match that you find.
EDIT 2, A SLIGHTLY BETTER VERSION
Looking at how exactly the regex is going to evaluate each match, we can craft a better version that will let about half of the matches "fast track", and will also reduce the number of characters checked when the match fails.
[ca]g(?:[ag]|(?<!ag)[ct])
agg 4
agc 7
cga 4
axa 2
cxa 2
xxx 1
Compared with the negative lookbehind version, this reduces the cost of every positive match by one or two comparisons.
Based on this, I would recommend using [ca]g(?:[ag]|(?<!ag)[ct]) if you expect to check 4 or more positions for each match.
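Before profiling, it is worth confirming that the candidate patterns really accept the same codons. The following Python sketch checks that over all 64 three-letter strings and adds a small timing helper. The names `PATTERNS`, `WANTED`, `matched_codons`, and `time_pattern` are made up for this example, and the step counts quoted above come from a different engine, so timings under Python's re module may rank the patterns differently.

```python
import re
import timeit
from itertools import product

PATTERNS = {
    "original": r"(cg[agct])|(ag[ag])",
    "no groups": r"cg[agct]|ag[ag]",
    "full alternation": r"cga|cgc|cgg|cgt|aga|agg",
    "lookbehind": r"[ac]g[agct](?<!agt|agc)",
    "fast track": r"[ca]g(?:[ag]|(?<!ag)[ct])",
}
WANTED = {"cgt", "cgc", "cga", "cgg", "aga", "agg"}


def matched_codons(pattern: str) -> set:
    """Return every 3-letter string over a, c, g, t that the pattern accepts."""
    regex = re.compile(pattern)
    codons = ("".join(letters) for letters in product("acgt", repeat=3))
    return {codon for codon in codons if regex.fullmatch(codon)}


def time_pattern(pattern: str, text: str, repeats: int = 100) -> float:
    """Seconds needed to scan `text` with the pattern `repeats` times."""
    regex = re.compile(pattern)
    return timeit.timeit(lambda: regex.findall(text), number=repeats)


if __name__ == "__main__":
    sample = "acgtaggcgatxcgg" * 1000  # made-up test sequence
    for name, pattern in PATTERNS.items():
        same = matched_codons(pattern) == WANTED
        print(name, same, round(time_pattern(pattern, sample), 4))
```

Timing on a sample that resembles the real data matters here, since the trade-off described above depends on how often a position fails to match.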

find a string with at least n matching elements

I have a list of numbers, and I want to find the entries that contain at least 3 of them.
Here is an example.
I have a large list of numbers in a SQL database in the format of (for example)
01-02-03-04-05-06
06-08-19-24-25-36
etc etc
Basically, each entry is 6 random numbers between 0 and 99.
Now I want to find the strings where at least 3 of a set of given numbers occurs.
For example:
given: 01-02-03-10-11-12
return the strings that have at least 3 of those numbers in them.
eg
01-05-06-09-10-12 would match
03-08-10-12-18-22 would match
03-09-12-18-22-38 would not
I am thinking that there might be some algorithm or even a regular expression that could match this, but I think my lack of computer science textbook experience is tripping me up.
No, this is not a homework question! This is for an actual application!
I am developing in Ruby, but an answer in any language would be appreciated.
You can use a string replacement to replace - with | to turn 01-02-03-10-11-12 into 01|02|03|10|11|12. Then wrap it like this:
((01|02|03|10|11|12).*){3}
This finds any of the digit pairs and then skips any number of characters, 3 times over. If it matches, the string contains at least 3 of the given numbers.
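The question accepts answers in any language, so here is a small Python sketch that builds that regex from the given string and compares it with a plain set intersection. The function names `shared_count` and `regex_for` are invented for this example. It assumes every number is zero-padded to two digits and that a row does not repeat a number; with repeats, the regex would count the same number more than once while the set version would not.

```python
import re


def shared_count(candidate: str, given: str) -> int:
    """Number of distinct values that two dash-separated rows have in common."""
    return len(set(candidate.split("-")) & set(given.split("-")))


def regex_for(given: str, minimum: int = 3) -> "re.Pattern[str]":
    """Build ((01|02|...).*){minimum} from a row such as 01-02-03-10-11-12."""
    alternatives = "|".join(re.escape(number) for number in given.split("-"))
    return re.compile(r"((%s).*){%d}" % (alternatives, minimum))


if __name__ == "__main__":
    given = "01-02-03-10-11-12"
    pattern = regex_for(given)
    for row in ["01-05-06-09-10-12", "03-08-10-12-18-22", "03-09-12-18-22-38"]:
        print(row, shared_count(row, given) >= 3, bool(pattern.search(row)))
```

The set version is usually easier to reason about in application code, while the regex form can be handed to a database that supports regular-expression matching.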