Using regex to parse a delimited array in bash

I have a file that contains the following line:
<polyline id="graph" points="0,287 100,470 200,509 300,459 400,471"/>
And I need to extract the following values:
287 470 509 459 471
I am currently using this code:
grep -oP '(?<=points=").*(?="/>)' "file.svg" | grep -oP '(?<=,)[[:digit:]]*'
I want to do it with a single grep. I tried using (?:), with no success. Any suggestions?

A sed solution could look like this:
$ sed -r '/points=/ s/[^,]+,?([0-9]*)/\1 /g' input
287 470 509 459 471
Or, for more robust handling, isolate the points attribute first:
$ sed -r '/points=/ s/.*points=("[^"]+").*/\1/g; s/[^,]+,?([0-9]*)/\1 /g' input
287 470 509 459 471
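On a real SVG file the commands above also print every other line of the file, and they leave trailing spaces. Here is a minimal sketch of a variant that prints only the extracted values, assuming a sed that accepts -E (GNU and BSD sed both do). The temporary file and the variable names are made up for the illustration.

```shell
# Hypothetical sample file wrapping the line from the question.
svg=$(mktemp)
cat > "$svg" <<'EOF'
<svg>
<polyline id="graph" points="0,287 100,470 200,509 300,459 400,471"/>
</svg>
EOF

# -n suppresses default output, so only lines containing points= are printed.
# First s: keep just the attribute value. Second s: drop each "x," prefix.
y_values=$(sed -nE '/points=/ {
  s/.*points="([^"]*)".*/\1/
  s/[^ ,]*,//g
  p
}' "$svg")

echo "$y_values"   # 287 470 509 459 471
rm -f "$svg"
```

Stripping the x values with a second substitution avoids the extra space that the capture-and-append form adds after each number.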

If you're just parsing a single line like that, you might get away with using XML::Simple, like this:
perl -MXML::Simple -lwe'$x = XMLin(<>); print $x->{points};' file.svg
With your line, this gives me the following output:
0,287 100,470 200,509 300,459 400,471
The entire structure in $x parsed from that line looks like this when printed with Data::Dumper:
$VAR1 = {
'points' => '0,287 100,470 200,509 300,459 400,471',
'id' => 'graph'
};
Note that you may need to pre-process your input, if it is more complex than you indicated in your question.

It's XML, so parse as XML.
use XML::Twig;
my $twig = XML::Twig -> new -> parse ( '<polyline id="graph" points="0,287 100,470 200,509 300,459 400,471"/>' );
print $twig -> root -> {'att'} -> {'points'};
You might need something slightly different to parse it out of an SVG file, but you can then use $twig -> parsefile.
This simplifies to a one-liner:
perl -MXML::Twig -e 'print XML::Twig -> new -> parsefile ("test.xml" ) -> root -> first_child("polyline") -> {"att"}{"points"};'

You can use GNU awk (RT is a gawk extension):
awk -v RS='points="[^"]+"' 'RT{s=RT; gsub(/[^[:digit:], ]|[[:digit:]]+,/, "", s);
print s}' file
287 470 509 459 471

This awk should do it:
awk -F\" '/points/ {gsub(/[0-9]+,/,"",$4);print $4}' file
287 470 509 459 471
If the position on the line changes, use:
awk -F"points=" 'NF==2{gsub(/[0-9]+,|[^0-9 ]/,"",$2);print $2}' file
287 470 509 459 471
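For the single grep call the question asks about, one option is the PCRE \G anchor, which matches where the previous match ended. This is only a sketch: it assumes GNU grep built with PCRE support (-P), and the variable names are illustrative.

```shell
line='<polyline id="graph" points="0,287 100,470 200,509 300,459 400,471"/>'

# (?:points="|\G(?!^))  start at points=" or where the previous match ended
# [^",]*,               skip the x value and its comma, staying inside the quotes
# \K                    discard everything matched so far
# [0-9]+                the y value, which is all that -o prints
y_values=$(printf '%s\n' "$line" |
  grep -oP '(?:points="|\G(?!^))[^",]*,\K[0-9]+' |
  paste -sd ' ' -)

echo "$y_values"   # 287 470 509 459 471
```

grep -o prints one match per line, so paste joins them back onto a single line. The (?!^) guard keeps \G from matching at the very start of the line, before points=" has been seen.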

Related

Using grep, how to match beginning of line with pattern from stdin

I have a one-liner that prints out a series of numbers:
124
132
186
I am then piping this output into grep to match these numbers to the beginning of lines in another file, but sometimes the second number in the line matches one of the patterns and I get an incorrect match, like so:
$ get_id_command | grep -f - users.list
124 => 3456, Charles Charmichael, ccharmichael
132 => 2498, Sarah Walker, swalker
186 => 8934, John Casey, jcasey
240 => 1245, Morgan Grimes, mgrimes
What options do I need for grep to only match patterns at the beginning of the line? I would really like to keep this as a one-liner.

Prepend a circumflex to each pattern and it will work. A circumflex anchors a pattern to the start of the line. The patterns are what get_id_command writes to stdin (users.list is the file being searched and should stay unchanged), so add the anchor inside the pipeline, e.g.
$ get_id_command | sed 's/^/^/' | grep -f - users.list
Each pattern then matches only at the beginning of a line.
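Since the patterns arrive on stdin, the anchor can be added to them on the fly. A circumflex alone would still let 124 match a line that starts with 1245, so this sketch also requires the space that follows the id. The shell function and temporary file are hypothetical stand-ins for get_id_command and users.list.

```shell
# Hypothetical stand-ins for users.list and get_id_command from the question.
users=$(mktemp)
cat > "$users" <<'EOF'
124 => 3456, Charles Charmichael, ccharmichael
132 => 2498, Sarah Walker, swalker
186 => 8934, John Casey, jcasey
240 => 1245, Morgan Grimes, mgrimes
EOF
get_id_command() { printf '%s\n' 124 132 186; }

# Turn each id N into the pattern "^N ": anchored at the line start and
# followed by a space, so 124 cannot match an id such as 1245.
matches=$(get_id_command | sed 's/.*/^& /' | grep -f - "$users")

echo "$matches"
rm -f "$users"
```

With the sample data this should print the first three lines only, because the 1245 on the fourth line is no longer at the start of the line.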

Bash/sed: delete everything from text file except match(es)

I have a text file which I need to extract a match from in a bash script. There might be more than one match and everything else is supposed to be discarded.
Sample snippet of input.txt file content:
PART TWO OF TWO PARTS-
E RESNO 56/20 56/30 54/40 52/50 TUDEP
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370 380 390
EUR RTS WEST NIL
NAR NIL-
REMARKS.
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
TMI NUMBER AS PART OF THE OCEANIC CLEARANCE READ BACK.
2.ADS-C AND CPDLC MANDATED OTS ARE AS FOLLOWS
TRACK A 350 360 370 380 390
TRACK B 350 360 370 380 390
I am trying to match 142 from the line
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
The match is always a number (one to three digits, may have leading zeroes) and always preceded by TMI IS.
My experiments so far led to nothing: I tried .*TMI IS ([0-9]+).* with the following sed command in my bash script
sed -n 's/.*TMI IS \([0-9]+\).*/\1/g' input.txt > output.txt
but only got an empty output.txt.
My script runs in GNU Bash-4.2. Where do I make my mistake? I ran out of ideas so your input is highly appreciated!
Thanks,
Chris

Two points are needed to make your sed approach work:
the + quantifier must be escaped in sed basic regular expressions (\+ is a GNU extension);
to print the matched pattern, use the p flag:
sed -n 's/.*TMI IS \([0-9]\+\).*/\1/gp' input.txt
142
To match only lines in your current format, where TMI IS directly follows a non-blank prefix at the start of the line, use:
sed -n 's/^\S\+TMI IS \([0-9]\+\).*/\1/gp' input.txt
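Because \+ is a GNU extension, a portable script may prefer an interval expression. Here is a sketch of two spellings that capture at most three digits, as the question describes; the variable names are illustrative.

```shell
line='1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE'

# Basic regex (BRE): group and interval are written with backslashes.
tmi_bre=$(printf '%s\n' "$line" | sed -n 's/.*TMI IS \([0-9]\{1,3\}\).*/\1/p')

# Extended regex (-E): the same expression without the backslashes.
tmi_ere=$(printf '%s\n' "$line" | sed -nE 's/.*TMI IS ([0-9]{1,3}).*/\1/p')

echo "$tmi_bre $tmi_ere"   # 142 142
```

In both forms, -n together with the p flag is what discards every line that does not contain a match.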

With GNU grep:
$ grep -oP 'TMI IS \K([0-9]*)' input.txt
142

You could also do this using perl as an alternative to the above:
$ perl -nle 'print $1 if /TMI IS (\d+)/;' < input.txt
142

Grep each line of a text file in another tab separated file [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
Closed 6 years ago.
I have a text file, file1, that has some IDs like:
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like
I used grep -o '^[^|]*' file1 to extract the string before | from file1.
I want each of these extracted strings to be matched against lines from another file, file2, and the whole line returned when it matches. file2 looks like this:
c10013_g2_i1 781 622.2 73 5.95 5.16
c10014_g1_i1 213 58.67 3 2.59 2.25
c10014_g2_i1 341 182.35 4 1.11 0.96
c10015_g1_i1 404 245.23 16 3.31 2.87
c10017_g1_i1 263 105.37 6 2.89 2.5
Finally the result should look like:
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87

You can use awk:
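awk 'FNR == NR {                # first file (file1): build the lookup table
    split($0, a, /[|]/)         # a[1] is the id before the first |
    seen[a[1]] = $0             # remember the whole file1 line under that id
    next
}
$1 in seen {                    # second file (file2): id is known
    $1 = seen[$1]               # swap the id for the full file1 line
    print
}' file1 file2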
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87

For structured text, awk is the king of tools.
$ awk 'NR==FNR{split($0,v,"|");a[v[1]]=$0; next}
$1 in a{k=$1; $1=""; print a[k] $0}' file1 file2
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87

Sounds like you're trying to join on the first field of each file. There's actually a join command that can do this. You'll need to change file1 slightly first, because join splits fields on blanks by default:
sed 's/^\([^|]*\)[|]/\1 |/' file1 | sort > file1-delimited
Then you can join them:
sort file2 | join file1-delimited -
c10013_g2_i1 |m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1 |m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87
This should get you 95% of the way there, but the format might not be perfect.

Sed command garbled with very easy mutiline regex in bash

I'm stuck on a sed command again, most probably because I have a very old version of sed, and due to my limitations I can't change the version of sed.
I wrote a simple regex that fits my string file:
/[^,]*$/mg
My string file is this:
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
23:53:22,539
23:53:23,109
23:53:23,110
23:53:23,115
23:53:23,117
23:53:23,118
23:53:23,119
23:53:23,690
23:53:23,721
23:53:23,722
23:53:24,275
23:53:24,276
23:53:24,313
23:53:24,316
23:53:24,317
23:53:24,318
23:53:24,854
23:53:24,888
23:53:24,889
23:53:24,890
23:53:24,891
23:53:50,676
23:53:50,677
23:53:50,711
23:53:50,713
23:53:50,714
23:53:51,257
23:53:51,258
23:53:51,296
23:53:51,297
23:53:51,298
23:53:51,820
23:53:51,822
23:53:51,823
23:53:52,358
23:53:52,364
23:53:52,367
23:53:52,909
23:53:52,910
23:53:52,936
23:53:52,939
23:53:52,941
23:53:52,944
23:53:52,945
23:53:52,946
23:53:52,949
23:53:52,953
23:53:52,956
23:53:52,959
23:53:52,963
23:53:52,966
23:53:52,970
23:53:52,971
23:53:52,974
23:53:52,978
23:53:52,980
23:53:52,983
23:53:52,984
23:53:52,986
23:53:52,987
23:53:52,989
23:53:52,990
23:53:52,991
23:53:52,994
23:53:52,995
23:53:52,999
23:53:53,001
23:53:53,002
23:53:53,004
23:53:53,005
23:53:53,007
23:53:53,010
23:53:53,026
23:53:53,027
23:53:53,081
23:53:53,082
23:53:53,083
23:53:53,085
07:32:54,519
07:32:54,521
07:32:54,537
07:32:54,538
07:32:54,539
07:32:54,540
07:32:54,541
07:32:54,542
07:32:54,543
07:32:54,544
07:32:54,545
07:32:54,546
07:32:54,547
07:32:54,548
07:32:54,549
07:32:54,550
I'm trying to get the values after the comma and then assign them to an array. When I used the sed command like this:
`sed -n '/[^,]*$/mg'` file
it says "command garbled". I read about multiline sed but still couldn't reach a solution. I am new to regexes, so any help will be appreciated.
Thank you in advance!

If you are using a "recent" bash, I think you can use cut and assign extracted values to an array:
numbers="$(cut -d',' -f2 filename.txt)"
array_numbers=( $numbers )
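The unquoted expansion above relies on word splitting. Here is a sketch of an alternative that stores one line per array element, assuming bash 4 or later, where the mapfile builtin is available. The temporary file stands in for filename.txt.

```shell
# Placeholder input with a few lines from the question.
times=$(mktemp)
printf '%s\n' '23:53:20,650' '23:53:20,654' '23:53:21,238' > "$times"

# mapfile -t reads one array element per line and strips the newline.
mapfile -t millis < <(cut -d',' -f2 "$times")

echo "${#millis[@]} values, first is ${millis[0]}"   # 3 values, first is 650
rm -f "$times"
```

The process substitution keeps mapfile in the current shell. If cut were piped into mapfile, the array would be filled in a subshell and lost.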

If you want the values after the comma, you could use the sed command below, which removes everything from the start of the line up to the first comma:
sed 's/^[^,]*,//' file
Or, removing everything up to the last comma:
sed 's/^.*,//' file
Example:
$ echo '23:53:22,492' | sed 's/^[^,]*,//'
492
$ echo '23:53:22,492' | sed 's/^.*,//'
492
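The two commands agree here only because each line contains a single comma. A small sketch of the difference, using a made-up value with two commas:

```shell
sample='a,b,c'

# [^,]* cannot cross a comma, so only the text up to the first comma goes.
after_first=$(printf '%s\n' "$sample" | sed 's/^[^,]*,//')

# .* is greedy, so everything up to the last comma goes.
after_last=$(printf '%s\n' "$sample" | sed 's/^.*,//')

echo "$after_first / $after_last"   # b,c / c
```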

sed 's/.*,//' file
matches everything up to the last comma (the only comma on each line here) and replaces the match with nothing, which effectively leaves the values after the comma.
For the input file
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
it produces the output
650
654
655
656
238
240
302
303
304
305
889
890
896
897
898
899
492
538

Match a word just once - AWK

I was reading the GNU awk manual but I didn't find a regular expression with which I can match a string just once.
For example, from the files aha_1.txt, aha_2.txt, aha_3.txt, ... I would like to print the second column, $2, from the first time ana appears in each file, and do the same for pedro.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
.
.
.
.
So far I have this, but it prints every occurrence, not just the first:
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)

Just call the exit command after printing the first value (the second column of the first line whose first field is ana). Note that exit stops awk entirely, so this handles a single file.
$ awk '$1~/^ana$/{print $2; exit}' file
454

Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...) each file has ana inside and I need just to take the first time ana appears in each file and the output has to be one file.
That's slightly different as a question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt
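The same idea extends to both names by resetting a per-file record whenever FNR returns to 1. This is a sketch under the assumption that the name is always the first field; the seen array and the temporary directory are illustrative.

```shell
# Recreate the two sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'luis 321 487' 'ana 454 345' 'pedro 341 435' 'ana 941 345' \
  > "$tmpdir/aha_1.txt"
printf '%s\n' 'pedro 201 723' 'gusi 837 134' 'ana 319 518' 'cindy 738 278' 'ana 984 265' \
  > "$tmpdir/aha_2.txt"

first_hits=$(awk '
  FNR == 1 { split("", seen) }   # new file: forget names seen in the last one
  ($1 == "ana" || $1 == "pedro") && !seen[$1]++ { print $1, $2 }
' "$tmpdir"/aha_*.txt)

echo "$first_hits"
rm -rf "$tmpdir"
```

Comparing $1 exactly, instead of matching /ana/ anywhere on the line, avoids hits on names that merely contain ana. split("", seen) is used as a portable way to empty the array.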
