Trouble parsing custom structure with shell script utilities like sed / awk / grep - regex

I am trying to use a shell script to parse a complex list of structures out of a text file, and search those structures for a very specific set of values. If there is a match then I need to print the values of one variable. I am limited to lightweight utilities like sed, awk, and grep (but not Perl).
Here is an example of the structure, followed by an explanation of what I am looking for:
{
{ 1, 2,
{ 15, 25 },
{ 15, 25 }
},
{ 3, 4,
{ 35, 45 },
{ 35, 45 }
},
{ 5, 6,
{ 55, 65 },
{ 55, 65 }
}
};
In this example I would be parsing the three structures and searching for a structure which has a "3" as the first value, has any single digit (0-9) as the second value, and at least one set of "35" and "45" in the inner list of structures. Once I have located a match I would then print the value of the second value. In this case the second structure would match, and I would need to print out the value "4".
I don't want to assume anything about how the whitespace is organized, only that the format above is followed. I.e. it could all be on a single line or have different combinations of line breaks in random places.
Can someone please help me think about how to approach this problem?

This may be what you want, using GNU awk for various extensions (a regex RS, RT, and \s):
$ cat tst.awk
BEGIN { RS="[{}]"; FS="\\s*,\\s*" }
depth == 2 { split($0,outer) }
(depth == 3) && (outer[1]==3) && (outer[2]~/^[0-9]$/) &&
((($1==35) && ($2==45)) || (($1==45) && ($2==35))) { print outer[2] }
{ depth = depth + (RT=="{" ? 1 : -1) }
$ awk -f tst.awk file
4

A non-robust awk attempt that relies on the line layout shown in the question:
$ awk -F"[{,}]" '/{/ && !/}/{c=($2==3)?+$3:""}
c~/^[0-9]$/ && $2==35 && $3==45{print c;exit}' file
4

Thank you all for the help on this one. I was eventually able to solve it using only sed and tr, although it wasn't pretty. I used tr to join all of the lines together, then sed to remove the outer { };, sed again to split the lines along structure boundaries by using backreferences, sed again to clean up commas and whitespace between structures, and then sed -n -r "s//\1/p" to validate the expected values in the pattern and to print only the matching variable.
I will take a look at your examples and see if I can learn from them.
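For readers who are not limited to sed and tr, the same steps (join everything together, then match one structure at a time) can be sketched in Python. This is only an illustration, assuming every value is an unsigned integer and each structure holds two leading values followed by one or more pairs; the function name second_values and its parameters are made up for this sketch.

```python
import re

def second_values(text, first="3", inner=("35", "45")):
    """Return the second value of every structure that meets the criteria."""
    # Whitespace carries no meaning in this format, so drop all of it.
    # (Assumption: digits of one number are never separated by spaces.)
    flat = re.sub(r"\s+", "", text)
    # One structure: {a,b,{x,y},{x,y},...} with at least one inner pair.
    struct = re.compile(r"\{(\d+),(\d+),((?:\{\d+,\d+\},?)+)\}")
    hits = []
    for a, b, pairs in struct.findall(flat):
        inner_pairs = re.findall(r"\{(\d+),(\d+)\}", pairs)
        # First value must match, second must be one digit, and the
        # wanted pair must appear at least once in the inner list.
        if a == first and re.fullmatch(r"\d", b) and inner in inner_pairs:
            hits.append(b)
    return hits

sample = """{
{ 1, 2, { 15, 25 }, { 15, 25 } },
{ 3, 4,
{ 35, 45 }, { 35,
45 } },
{ 5, 6, { 55, 65 }, { 55, 65 } }
};"""
print(second_values(sample))  # ['4']
```

Because all whitespace is removed before matching, the result does not depend on where the line breaks fall.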


awk, skip current rule upon sanity check

How to skip current awk rule when its sanity check failed?
{
if (not_applicable) skip;
if (not_sanity_check2) skip;
if (not_sanity_check3) skip;
# the rest of the actions
}
IMHO, it's much cleaner to write code this way than,
{
if (!not_applicable) {
if (!not_sanity_check2) {
if (!not_sanity_check3) {
# the rest of the actions
}
}
}
}
1;
I need to skip the current rule because I have a catch all rule at the end.
UPDATE: the case I'm trying to solve.
There are multiple match points in a file that I want to match & alter; however, there is no other obvious sign I can use to identify the one I want.
Hmm, let me simplify it this way: I want to match & alter the first match, skip the rest of the matches, and print them as-is.
As far as I understand your requirement, you are looking for if / else if here. You could also use switch / case, which is available in newer versions of gawk.
Let's take an example Input_file:
cat Input_file
9
29
Following is the awk code here:
awk -v var="10" '{if($0<var){print "Line " FNR " is less than var"} else if($0>var){print "Line " FNR " is greater than var"}}' Input_file
This will print:
Line 1 is less than var
Line 2 is greater than var
If you look at the code carefully, it checks two conditions:
First, if the current line is less than var, the if block is executed.
Second, in the else if block, if the current line is greater than var, that message is printed instead.
I'm really not sure what you're trying to do, but if I focus on just that last sentence in your question, "I want to match & alter the first match and skip the rest of the matches and print them as-is", is this what you're trying to do?
{ s=1 }
s && /abc/ { $0="uvw"; s=0 }
s && /def/ { $0="xyz"; s=0 }
{ print }
e.g. to borrow @Ravinder's example:
$ cat Input_file
9
29
$ awk -v var='10' '
{ s=1 }
s && ($0<var) { $0="Line " FNR " is less than var"; s=0 }
s && ($0>var) { $0="Line " FNR " is greater than var"; s=0 }
{ print }
' Input_file
Line 1 is less than var
Line 2 is greater than var
I named the boolean flag variable s, for "sane", since you also mentioned in your question that the conditions being tested are sanity checks; each condition can then be read as "is the input sane so far, and is this next condition true?".

Bash - Extract a column from a tsv file whose header matches a given pattern

I've got a tab-delimited file called dataTypeA.txt. It looks something like this:
Probe_ID GSM24652 GSM24653 GSM24654 GSM24655 GSM24656 GSM24657
1007_s_at 1149.82818866431 1156.14191288693 743.515922643437 1219.55564561635 1291.68030259557 1110.83793199643
1053_at 253.507372571459 150.907554200493 181.107054946649 99.0610660103702 147.953428467212 178.841519788697
117_at 157.176825094869 147.807257232552 162.11169957066 248.732378039521 176.808414979907 112.885784025819
121_at 1629.87514240262 1458.34809770171 1397.36209234134 1601.83045996129 1777.53949459116 1256.89054921471
1255_g_at 91.9622298972477 29.644137111864 61.3949774595639 41.2554576367652 78.4403716513328 66.5624213750532
1294_at 313.633291641829 305.907304474766 218.567756319376 335.301256439494 337.349552407502 316.760658896597
1316_at 195.799277107983 163.176402437481 111.887056644528 194.008323756222 211.992656497053 135.013920706472
1320_at 34.5168433158599 19.7928225262233 21.7147425051394 25.3213322300348 22.4410631949167 29.6960283168278
1405_i_at 74.938724593443 24.1084307838881 24.8088845994911 113.28326338746 74.6406975005947 70.016519414531
1431_at 88.5010900723741 21.0652011409692 84.8954961447585 110.017339630928 84.1264201735067 49.8556999547353
1438_at 26.0276274326623 45.5977459152141 31.8633816890024 38.568939176828 43.7048363737468 28.5759163094148
1487_at 1936.80799770498 2049.19167519573 1902.85054762899 2079.84030768241 2088.91036902825 1879.84684705068
1494_f_at 358.11266607978 271.309665853292 340.738488775022 477.953251687206 388.441738062896 329.43505750512
1598_g_at 2908.90515715761 4319.04621682741 2405.62061966298 3450.85255814957 2573.97860992156 2791.38660060659
160020_at 416.089910909237 327.353902186303 385.030831004533 385.199279534446 256.512900212781 217.754025190117
1729_at 43.1079499314469 114.654670657195 133.191500889286 86.4106614983387 122.099426341898 218.536976034472
177_at 75.9653827137444 27.4348937420347 16.5837374743166 50.6758325717831 58.7568500760629 18.8061888366161
1773_at 31.1717741953018 158.225161489953 161.976679771553 139.173486349393 218.572194156366 103.916119454
179_at 1613.72113870554 1563.35465407698 1725.1817757679 1694.82209331327 1535.8108561345 1650.09670894426
Let's say I have a variable col="GSM24655". I want to extract the column from dataTypeA.txt that corresponds to this column name.
Additionally, I'd like to put this in a function, where I can just give it a file (i.e. dataTypeA.txt), and a column (i.e. GSM24655), and it'll return that column.
I'm not very proficient in Bash, so I've been having some trouble with this. I'd appreciate the help.
The awk script below can be used to achieve the objective.
col="GSM24655";
awk -v column_val="$col" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' dataTypeA.txt
How it works: the value of col is passed to the awk script with -v column_val="$col". On the first row (NR==1) the script loops over all fields (for(i=1;i<=NF;i++), where the awk variable NF holds the number of fields) and compares each one with column_val (if ($i == column_val)); when a match is found, its field number is stored (val=i). The matching field is then printed for every row, including the header row (print $val). If no header matches, val stays -1 and nothing is printed.
If you copy the code below into a file called, say, find_column.sh, you can call sh find_column.sh GSM24655 dataTypeA.txt to display the column whose header equals the first parameter (GSM24655) from the file named by the second parameter (dataTypeA.txt). $1 and $2 are positional parameters; the lines column=$1 and file=$2 assign them to variables.
column=$1;
file=$2;
awk -v column_val="$column" '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if ($i == column_val) {val=i;}}} if(val != -1) print $val} ' "$file"
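The same lookup (find the header's position on the first row, then take that field from every later row) can be sketched in Python for comparison. This is a minimal sketch rather than a drop-in replacement for the script above: the function name extract_column is hypothetical, the demo data is a small excerpt of the table in the question, and unlike the awk version it does not return the header itself.

```python
import csv
import io

def extract_column(lines, name):
    """Return the values under header `name` from tab-separated lines."""
    rows = csv.reader(lines, delimiter="\t")
    header = next(rows)
    idx = header.index(name)  # raises ValueError if the column is missing
    return [row[idx] for row in rows]

# Two columns and two rows taken from the question's data.
demo = io.StringIO(
    "Probe_ID\tGSM24654\tGSM24655\n"
    "1007_s_at\t743.515922643437\t1219.55564561635\n"
    "1053_at\t181.107054946649\t99.0610660103702\n"
)
print(extract_column(demo, "GSM24655"))
```

To read a real file, pass an open file object instead of the StringIO, for example open("dataTypeA.txt", newline="").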
I would use the following; it is quick and easy.
In your script, the file name arrives as $1 and the column name as $2.
The loop below walks over a hard-coded header and compares against a hard-coded name to keep the example short; in the script you can replace the echo with head -1 "$1" and the literal name with "$2". It prints the 1-based position of the matching column and stores it in col.
c=1;
for each in `echo "Probe_ID GSM24652 GSM24653 GSM24654 GSM24655 GSM24656 GSM24657"`;do if [[ $each == "GSM24655" ]];then
echo $c;
col=$c;
fi;
c=$(( c + 1 ));
done
Right after this, you just do a cut -d$'\t' -f"$col" "$1"

Regex match scss function / mixin

I am trying to match a function or mixin used in an SCSS string so I may remove it but I am having a bit of trouble.
For those unfamiliar with SCSS this is an example of the things I am trying to match (from bootstrap 4).
@mixin _assert-ascending($map, $map-name) {
  $prev-key: null;
  $prev-num: null;
  @each $key, $num in $map {
    @if $prev-num == null {
      // Do nothing
    } @else if not comparable($prev-num, $num) {
      @warn "Potentially invalid value for #{$map-name}: This map must be in ascending order, but key '#{$key}' has value #{$num} whose unit makes it incomparable to #{$prev-num}, the value of the previous key '#{$prev-key}' !";
    } @else if $prev-num >= $num {
      @warn "Invalid value for #{$map-name}: This map must be in ascending order, but key '#{$key}' has value #{$num} which isn't greater than #{$prev-num}, the value of the previous key '#{$prev-key}' !";
    }
    $prev-key: $key;
    $prev-num: $num;
  }
}
And a small function:
@function str-replace($string, $search, $replace: "") {
  $index: str-index($string, $search);
  @if $index {
    @return str-slice($string, 1, $index - 1) + $replace + str-replace(str-slice($string, $index + str-length($search)), $search, $replace);
  }
  @return $string;
}
So far I have the following regex:
@(function|mixin)\s?[[:print:]]+\n?([^\}]+)
However, it only matches up to the first } that it finds, which makes it fail; it needs to find the last occurrence of the closing curly brace instead.
My thought is that a regex capable of matching a function definition could be adapted, but I can't find a good one using my Google-fu!
Thanks in advance!
I would not recommend using a regex for that: a regular expression in the classic sense cannot handle the nesting (recursion) you would need here.
For instance:
@mixin test {
  body {
  }
}
This contains two »levels« of scope ({ { } }), so your regex would have to count brackets as they open and close to find the end of the mixin or function. A plain regular expression cannot do that.
This regex
/@mixin(.|\s)*\}/gm
will match the whole mixin, but if the input looks like this:
@mixin foo { … }
body { … }
it will match everything up to the last }, which includes the style definition for body. That is because the regex cannot know which } closes the mixin.
Have a look at this answer, it explains more or less the same thing but based on matching html elements.
Instead, you should use a parser to parse the whole stylesheet into a syntax tree, then remove the unneeded functions, and then write it back to a string.
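Short of a full parser, the bracket counting described above can be sketched in a few lines of Python. This is a rough sketch under loud assumptions: braces inside strings and comments are counted like any others, the first { after the keyword is taken to open the body (so an interpolation such as #{...} in the signature would confuse it), and the function name strip_blocks is made up for the example.

```python
import re

def strip_blocks(scss):
    """Remove every top-level @mixin / @function block by counting braces."""
    out = []
    pos = 0  # start of the text not yet copied to the output
    for m in re.finditer(r"@(?:mixin|function)\b", scss):
        if m.start() < pos:
            continue  # keyword sits inside a block that was already removed
        open_at = scss.find("{", m.end())
        if open_at == -1:
            break  # no body follows; leave the rest untouched
        depth = 0
        end = None
        for i in range(open_at, len(scss)):
            if scss[i] == "{":
                depth += 1
            elif scss[i] == "}":
                depth -= 1
                if depth == 0:  # this } closes the mixin or function
                    end = i + 1
                    break
        if end is None:
            break  # unbalanced braces; stop rather than guess
        out.append(scss[pos:m.start()])
        pos = end
    out.append(scss[pos:])
    return "".join(out)

source = "@mixin foo {\n  body {\n  }\n}\nbody { color: red; }\n"
print(strip_blocks(source))
```

In the example the nested body { } inside the mixin is removed together with it, while the top-level body rule after the mixin is kept.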
In fact, as @philipp said, regex can't replace the kind of syntax analysis that compilers do.
But here is a sed command which is a little ugly but could do the trick:
sed -r -e ':a' -e 'N' -e '$!ba' -e 's/\n//g' -e 's/}\s*@(function|mixin)/}\n@\1/g' -e 's/^@(function|mixin)\s*str-replace(\s|\()+.*}$//gm' <your file>
-e ':a' -e 'N' -e '$!ba' -e 's/\n//g' : Read the whole file in a loop and remove the newlines (see https://stackoverflow.com/a/1252191/7990687 for more information)
-e 's/}\s*@(function|mixin)/}\n@\1/g' : Make each @mixin or @function statement the start of a new line, and the preceding } the last character of the previous line
's/^@(function|mixin)\s*str-replace(\s|\()+.*}$//gm' : Remove the line corresponding to the @function str-replace or @mixin str-replace declaration
But the resulting output will lose its indentation, so you will have to reindent it afterwards.
I tried it on a file into which I copy/pasted the sample code you provided multiple times, so you will have to try it on your own file, because there could be cases where the regex matches more than wanted. If that happens, provide us a test file so we can try to resolve these issues.
After much headache here is the answer to my question!
The source needs to be split into lines and read line by line, maintaining a count of the open / closed braces to determine when the counter is back at 0.
$pattern = '/(?<remove>@(function|mixin)\s?[\w-]+[$,:"\'()\s\w\d]+)/';
$subject = file_get_contents('vendor/twbs/bootstrap/scss/_variables.scss'); // just a regular SCSS file containing what I've already posted.
$lines = explode("\n", $subject);
$total_lines = count($lines);
foreach ($lines as $line_no => $line) {
    if (preg_match($pattern, $line, $matches)) {
        $match = $matches['remove'];
        $counter = 0; // brace depth relative to the start of this block
        $open_braces = $closed_braces = 0;
        for ($i = $line_no; $i < $total_lines; $i++) {
            $current = $lines[$i];
            $open_braces = substr_count($current, "{");
            $closed_braces = substr_count($current, "}");
            $counter += ($open_braces - $closed_braces);
            if ($counter == 0) {
                // The block ends on line $i: drop lines $line_no..$i.
                $start = $line_no;
                $end = $i;
                foreach (range($start, $end) as $a) {
                    unset($lines[$a]);
                } // end foreach(range)
                break; // break out of the for loop
            } // end if($counter == 0)
        } // end for loop
    } // end preg_match
} // end foreach
And we have a $lines array without any functions or mixins.
There is probably a more elegant way to do this, but I don't have the time or the will to write an AST parser for SCSS.
This could quite easily be adapted into a hacked-together one, however!

VIM padding with appropriate number of ",0" to get CSV file

I have a file containing numbers like
1, 2, 3
4, 5
6, 7, 8, 9,10,11
12,13,14,15,16
...
I want to create a CSV file by padding each line such that there are 6 values separated by 5 commas, so I need to add to each line an appropriate number of ",0". It shall look like
1, 2, 3, 0, 0, 0
4, 5, 0, 0, 0, 0
6, 7, 8, 9,10,11
12,13,14,15,16, 0
...
How would I do this with VIM?
Can I count the number of "," in a line with regular expressions and add the correct number of ",0" to each line with the substitute s command?
You can achieve that by typing this command:
:g/^/ s/^.*$/&,0,0,0,0,0,0/ | normal! 6f,D
You can add six zeros to every line first, irrespective of how many numbers it has, and then delete everything from the sixth comma to the end of each line.
To append them,
:1,$ normal! A,0,0,0,0,0,0
To delete from the sixth comma to the end,
:1,$normal! ^6f,D
^ moves to the first non-blank character in the line (which is a number here)
6f, moves to the sixth comma
D deletes from the cursor to the end of the line
Example:
Original
1,2
3,6,7,0,0,0
4,5,6
11,12,13
After adding six zeroes,
1,2,0,0,0,0,0,0
3,6,7,0,0,0,0,0,0,0,0,0
4,5,6,0,0,0,0,0,0
11,12,13,0,0,0,0,0,0
After removing from the sixth comma to the end of the line,
1,2,0,0,0,0
3,6,7,0,0,0
4,5,6,0,0,0
11,12,13,0,0,0
With perl:
perl -lpe '$_ .= ",0" x (5 - tr/,//)' file.txt
With awk:
awk -v FS=, -v OFS=, '{ for(i = NF+1; i <= 6; i++) $i = 0 } 1' file.txt
With sed:
sed ':b /^\([^,]*,\)\{5\}/ b; { s/$/,0/; b b }' file.txt
As for how to do this from inside Vim, you can also pipe text through external programs, and the output will replace the input. That's an easy way to leverage sorting, deduping, grep-based filtering, etc., or some of Sato's suggestions. So, if you have a script called standardize_commas.py, try selecting your block with visual line mode (shift+v, then select), and then typing something like :! python /tmp/standardize_commas.py. Vim automatically prepends '<,'> to the command, indicating that it will run on the currently selected lines.
FYI, this was my /tmp/standardize_commas.py script:
import sys

max_width = 0
rows = []
# First pass: split every line and remember the widest row.
for line in sys.stdin:
    line = line.strip()
    existing_vals = line.split(",")
    rows.append(existing_vals)
    max_width = max(max_width, len(existing_vals))
# Second pass: pad every row with "0" up to the widest row.
for row in rows:
    zeros_needed = max_width - len(row)
    full_values = row + ["0"] * zeros_needed
    print(",".join(full_values))

Generating the shortest regex to match an arbitrary word list

I'm hoping someone might know of a script that can take an arbitrary word list and generate the shortest regex that could match that list exactly (and nothing else).
For example, suppose my list is
1231
1233
1234
1236
1238
1247
1256
1258
1259
Then the output should be:
12(3[13468]|47|5[589])
This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm
It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.
For example, this Perl command-line:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"
generates this output:
(?^:(?^:12(?:3[13468]|5[689]|47)))
(assuming you have installed Regexp::Optimizer), which matches the OP's expectation quite well.
Here's another example:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"
And the output:
(?^:(?^:3(?:[1238]|57)4))
For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:
3([1238]|57)4
You are probably better off saving the entire list, or if you want to get fancy, create a Trie:
1231
1234
1247
    1
    |
    2
   / \
  3   4
 / \   \
1   4   7
Now when you take a string, check whether it reaches a leaf node. If it does, it's valid.
If you have variable length overlapping strings (eg: 123 and 1234) you'll need to mark some nodes as possibly terminal.
You can also use the trie to generate the regex if you really like the regex idea:
Nodes from the root to the first branching are fixed (eg: 12)
Branches create |: (eg: 12(3|4))
Leaf nodes generate a character class (or single character) that follows the parent node: (eg: 12(3[14]|47))
This might not generate the shortest regex; to do that you might need some extra work:
"Compact" ranges if you find them (eg: [12345] becomes [1-5])
Add quantifiers for repeated elements (eg: [1234][1234] becomes [1234]{2})
???
I really don't think it's worth it to generate the regex.
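For anyone who does want to try it, the trie-to-regex steps listed above can be sketched roughly as follows in Python. This is an illustration, not a shortest-regex generator: it builds character classes for leaf children and marks terminal nodes as optional, but it skips the range compaction and quantifier steps. The names build_trie and to_regex, and the use of an empty-string key as the "possibly terminal" marker, are assumptions of this sketch.

```python
import re

def build_trie(words):
    """Nested dicts; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def to_regex(node):
    """Return a regex fragment matching every suffix stored below node."""
    ends_here = "" in node
    singles, longer = [], []
    for ch in sorted(k for k in node if k):
        tail = to_regex(node[ch])
        # Children with nothing after them can share a character class.
        (longer if tail else singles).append(re.escape(ch) + tail)
    if len(singles) > 1:
        alts = ["[" + "".join(singles) + "]"] + longer
    else:
        alts = singles + longer
    if not alts:
        return ""
    body = "|".join(alts)
    if len(alts) > 1 or ends_here:
        body = "(?:" + body + ")"  # branches become an alternation group
    # A word may also stop at this node, so the rest is optional.
    return body + ("?" if ends_here else "")

words = ["1231", "1233", "1234", "1236", "1238", "1247", "1256", "1258", "1259"]
pattern = to_regex(build_trie(words))
print(pattern)  # 12(?:3[13468]|47|5[689])
```

Use the result with re.fullmatch so that only whole words from the list are accepted.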
This project generates a regexp from a given list of words: https://github.com/bwagner/wordhierarchy
It does almost the same as the JavaScript solution in this thread, but avoids certain superfluous parentheses.
It only uses "|", non-capturing group "(?:)" and option "?".
There's room for improvement when there's a row of single characters:
Instead of e.g. (?:3|8|1|6|4) it could generate [38164].
The generated regexp could easily be adapted to other regexp dialects.
Sample usage:
java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:5(?:6|9|8)|47|3(?:3|8|1|6|4))
Here's what I came up with (JavaScript). It turned a list of 20,000 6-digit numbers into a 60,000-character regular expression. Compared to a naive (word1|word2|...) construction, that's almost 60% "compression" by character count.
I'm leaving the question open, as there's still a lot of room for improvement and I'm holding out hope that there might be a better tool out there.
var list = new listChar("");
function listChar(s, p) {
    this.char = s;
    this.depth = 0;
    this.parent = p;
    this.add = function(n) {
        if (!this.subList) {
            this.subList = {};
            this.increaseDepth();
        }
        if (!this.subList[n]) {
            this.subList[n] = new listChar(n, this);
        }
        return this.subList[n];
    }
    this.toString = function() {
        var ret = "";
        var subVals = [];
        if (this.depth >= 1) {
            for (var i in this.subList) {
                subVals[subVals.length] = this.subList[i].toString();
            }
        }
        if (this.depth === 1 && subVals.length > 1) {
            ret = "[" + subVals.join("") + "]";
        } else if (this.depth === 1 && subVals.length === 1) {
            ret = subVals[0];
        } else if (this.depth > 1) {
            ret = "(" + subVals.join("|") + ")";
        }
        return this.char + ret;
    }
    this.increaseDepth = function() {
        this.depth++;
        if (this.parent) {
            this.parent.increaseDepth();
        }
    }
}
function wordList(input) {
    var listStep = list;
    while (input.length > 0) {
        var c = input.charAt(0);
        listStep = listStep.add(c);
        input = input.substring(1);
    }
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
    wordList(words[i]);
}
document.write(list.toString());
Using
words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];
Here's the output:
(1(2(3[13468]|47|5[689])))