Yet again my unfamiliarity with AWK lets me down: I can't figure out how to match a variable at the end of a line.
This would be fairly trivial with grep etc, but I'm interested in matching integers at the end of a string in a specific field of a tsv, and all the posts suggest (and I believe it to be the case!) that awk is the way to go.
If I just want to match a single one explicitly, that's easy.
Here's my example file:
PVClopT_11 PAU_02102 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A No DOI found.
PVCpnf_18 PAK_3526 PAK_03186 3fxq 3fxq_A 99.7 2.7e-21 7e-26 122.2 >3fxq_A LYSR type regulator of TSAMBCD; transcriptional regulator, LTTR, TSAR, WHTH, DNA- transcription, transcription regulation; 1.85A {Comamonas testosteroni} PDB: 3fxr_A* 3fxu_A* 3fzj_A 3n6t_A 3n6u_A* 10.1111/j.1365-2958.2010.07043.x
PVCunit1_19 PAU_02807 PAU_02793 3kx6 3kx6_A 19.7 45 0.0012 31.3 >3kx6_A Fructose-bisphosphate aldolase; ssgcid, NIH, niaid, SBRI, UW, emerald biostructures, glycolysis, lyase, STRU genomics; HET: CIT; 2.10A {Babesia bovis} No DOI found.
PVClumt_17 PAU_02231 PAU_02190 3lfh 3lfh_A 39.7 12 0.0003 28.9 >3lfh_A Manxa, phosphotransferase system, mannose/fructose-speci component IIA; PTS; 1.80A {Thermoanaerobacter tengcongensis} No DOI found.
PVCcif_11 plu2521 PLT_02558 3h2t 3h2t_A 96.6 2.6e-05 6.7e-10 79.0 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_16 PAU_03338 PAU_03377 5jbr 5jbr_A 29.2 22 0.00058 23.9 >5jbr_A Uncharacterized protein BCAV_2135; structural genomics, PSI-biology, midwest center for structu genomics, MCSG, unknown function; 1.65A {Beutenbergia cavernae} No DOI found.
PVCunit1_17 PAK_2892 PAK_02622 1cii 1cii_A 63.2 2.7 6.9e-05 41.7 >1cii_A Colicin IA; bacteriocin, ION channel formation, transmembrane protein; 3.00A {Escherichia coli} SCOP: f.1.1.1 h.4.3.1 10.1038/385461a0
PVCunit1_11 PAK_2886 PAK_02616 3h2t 3h2t_A 96.6 1.9e-05 4.9e-10 79.9 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_11 PAU_03343 PAU_03382 3h2t 3h2t_A 97.4 4.4e-07 1.2e-11 89.7 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCunit1_5 afp5 PAU_02779 4tv4 4tv4_A 63.6 2.6 6.7e-05 30.5 >4tv4_A Uncharacterized protein; unknown function, ssgcid, virulence, structural genomics; 2.10A {Burkholderia pseudomallei} No DOI found.
And I can pull out all the lines which have a "_11" at the end of the first column by running the following on the command line:
awk '{ if ($1 ~ /_11$/) { print } }' 02052017_HHresults_sorted.tsv
I want to enclose this in a loop to cover all integers from 1 - 5 (for instance), but I'm having trouble passing a variable in to the text match.
I expect it should be something like the following, but $i$ seems like it's probably incorrect, and my google-fu failed me:
awk 'BEGIN{ for (i=1;i<=5;i++){ if ($1 ~ /_$i$/) { print } } }' 02052017_HHresults_sorted.tsv
There may be other issues I haven't spotted with that awk command too, as I say, I'm not very awk-savvy.
EDIT FOR CLARIFICATION
I want to separate out all the matches, so can't use a character class. i.e. I want all the lines ending in "_1" in one file, then all the ones ending in "_2" in another, and so on (hence the loop).
You can't put variables inside //. Use string concatenation, which is done by simply putting the strings adjacent to each other in awk. You don't need to use a regexp literal when you use the ~ operator, it always treats the second argument as a regexp.
awk '{ for (i = 1; i <= 5; i++) {
    if ( $1 ~ ("_" i "$") ) { print; break; }
  }
}' 02052017_HHresults_sorted.tsv
It sounds like you're thinking about this all wrong and what you really need is just (with GNU awk for gensub()):
awk '{ print > ("out" gensub(/.*_/,"",1,$1)) }' 02052017_HHresults_sorted.tsv
or with any awk:
awk '{ n=$1; sub(/.*_/,"",n); print > ("out" n) }' 02052017_HHresults_sorted.tsv
No need to loop; use a regex character class [..] (the three-argument form of match() requires GNU awk):
awk 'match($1,/_([1-5])$/,a){ print >> (a[1] ".txt") }' 02052017_HHresults_sorted.tsv
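For comparison, the same split-by-suffix idea can be sketched outside awk. The Python below is only an illustration, assuming whitespace-separated fields and a first column that ends in `_<integer>`. The names `group_by_suffix` and `write_groups` and the `out<N>.txt` file naming are made up for this example.

```python
import re
from collections import defaultdict

# Matches an underscore followed by digits at the very end of a field.
SUFFIX = re.compile(r"_(\d+)$")

def group_by_suffix(lines):
    """Group lines by the integer that ends their first field (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        m = SUFFIX.search(fields[0])
        if m:
            groups[int(m.group(1))].append(line)
    return dict(groups)

def write_groups(groups, prefix="out"):
    """Write one file per suffix, e.g. out11.txt (the naming is an assumption)."""
    for n, members in groups.items():
        with open(f"{prefix}{n}.txt", "w") as fh:
            for line in members:
                fh.write(line.rstrip("\n") + "\n")
```

Grouping first and writing afterwards keeps the matching logic separate from the file handling, so the range of integers does not have to be known in advance.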
I have a simple but large data file. It's output from a neural network simulation. The first column is a time step, 1..200. The second is the target word (for the current simulation, 1..212). Then there are 212 columns, one for each word. That is, each row has the activation values of each word node at a particular time step given a particular target (input) word.
I need to do simple operations, such as converting each activation to a response strength (exp(constant x activation)) and then dividing each response strength by the row sum of response strength. Doing this in R is very slow (20 minutes), and doing it with conventional looping in perl is faster but still slow (7 minutes) given that later simulations will involve thousands of words.
It seems like PDL should be able to do this much more quickly. I've been reading the PDL documentation, but I'm really at a loss for how to do the second step. The first one seems as easy as selecting just the activation columns and putting them in $act and then:
$rp = exp($act * $k);
But I can't figure out how then to divide each value by its row sum. Any advice would be appreciated.
Thanks!
It looks like you need to make a copy of the matrix, then read from the first one and write to the second.
NOTE: a C-style loop with $c++ instead of the for $c (0..2) { ... } form might be more efficient!
$x = sequence(3,3)*2+1;
[ 1 3 5]
[ 7 9 11]
[13 15 17]
$y = $x->copy; # a plain $y = $x would make both names share the same data, so changes would show up in both x and y
for $c(0..2) { for $d(0..2) { $y($c,$d) .= $y($c,$d) / sum($x(,$d)) }}
p $y;
[0.11111111 0.33333333 0.55555556]
[0.25925926 0.33333333 0.40740741]
[0.28888889 0.33333333 0.37777778]
As is often the case in PDL, a good answer to this involves slicing and indices.
$k = 0.7; # made-up value
$data = zeroes 214,200;
$data((0)) .= sequence(200) + 1; # column 0=1..200
$data((1)) .= indx(zeroes(200)->random*212) + 1; # column 1 randomly 1..212
$data(2:-1)->inplace->random; # rest of columns random values for this demo
$indices = ($data(1)+1)->append($data((0))->sequence->transpose); # indices are [column 1 value,row index]
$act = $data->indexND($indices); # vector of the activation values
$rp = exp($act * $k);
$rp /= $data(2:-1)->sumover; # divide by sum of each row's non-index values
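As a cross-check on the arithmetic the question describes (not a translation of the PDL code), here is a minimal plain-Python sketch: each activation becomes exp(k * activation), and each result is divided by the sum of those values over its own row. The function name `response_probabilities` is an assumption, and plain lists are used for clarity rather than speed.

```python
import math

def response_probabilities(activations, k):
    """Return exp(k * a) for every value, divided by the sum over its row."""
    result = []
    for row in activations:
        strengths = [math.exp(k * a) for a in row]  # response strengths
        total = sum(strengths)                      # row sum of strengths
        result.append([s / total for s in strengths])
    return result
```

Every output row sums to 1, which is a quick sanity check to run against whatever vectorised version ends up being used.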
I have this ugly file.
{message:"What this does is, every time the mouse moves in the canvas
area, it sets mouseX and mouseY to the location of the
mouse.",},{message:"Then, when each ball is updated, it figures out
how far away from the mouse it is, and accelerates toward
it.",},{message:"The acceleration is the square root of the distance,
so it pulls harder when it is really far away. Imagine all the balls
being connected to the mouse by little rubber bands or springs. It's
a little like that.",},{message:"Try making the balls smaller! And
add more of them! I like it with about 40 small balls chasing the
mouse.",},{message:"Great job! Like what you learned? Was it
fun?",code:"",hiddenCode:"var c =
document.getElementById('pane').getContext('2d');\nfunction
rgba(r,g,b,a) {return 'rgba('+[r,g,b,a].join(',')+')';}\nfunction
rgb(r,g,b,a) {return
'rgb('+[r,g,b].join(',')+')';}\n\n",lessonSection:"The
End",},{message:"Wow, you did everything! Congratulations, nice work!
A lot of these are really hard. I'm impressed you finished! I hope
you enjoyed it!",code:'var pane =
document.getElementById(\'pane\');\nvar s = 3;\n\npane.onmousemove =
function(evt) {\n c.fillStyle = randomRGBA();\n var x =
evt.clientX;\n var y = evt.clientY;\n c.fillRect(x - s / 2, y - s /
2, s, s);};\n\nfunction randomRGBA() {\n var r = randInt(255);\n var
g = randInt(255);\n var b = randInt(255);\n var a = Math.random();\n
var rgba = [r,g,b,a].join(",");\n return "rgba(" + rgba +
")";\n}\nfunction randInt(limit) {\n var x =
I am trying to use a Perl regex to extract the body of each message.
I have been working on it for two or three hours, but I cannot seem to extract it.
My goal is to translate the messages from English into other languages, so I want the message strings in a clean file instead of working on this ugly file that combines both messages and code.
I was trying to use this code:
use strict;
use warnings;
my $filename = 'test.txt';
my $row = '';
if (open(my $fh, '<:encoding(UTF-8)', $filename)) {
while ($row = <$fh>) {
if ($row =~/message:(.*)/)
{
print $1 . "\n";
}
}
}
else {
warn "Could not open file '$filename' $!";
}
It gives me basically the entire file as output.
I tried \W+ or \s+ which gave me the first word only.
Any ideas?
The problem is that there are no newlines in the data so your .* matches the whole of the rest of the file. Try /message:"([^"]*)/ which matches only characters that aren't double quotes
I wrote this
use strict;
use warnings;
use 5.010;
my $data = do {
local $/;
<DATA>;
};
say "$1: $2" while $data =~ /[{,](\w+):"([^"]*)/g;
__DATA__
{message:"What this does is, every time the mouse moves in the canvas area, it sets mouseX and mouseY to the location of the mouse.",},{message:"Then, when each ball is updated, it figures out how far away from the mouse it is, and accelerates toward it.",},{message:"The acceleration is the square root of the distance, so it pulls harder when it is really far away. Imagine all the balls being connected to the mouse by little rubber bands or springs. It's a little like that.",},{message:"Try making the balls smaller! And add more of them! I like it with about 40 small balls chasing the mouse.",},{message:"Great job! Like what you learned? Was it fun?",code:"",hiddenCode:"var c = document.getElementById('pane').getContext('2d');\nfunction rgba(r,g,b,a) {return 'rgba('+[r,g,b,a].join(',')+')';}\nfunction rgb(r,g,b,a) {return 'rgb('+[r,g,b].join(',')+')';}\n\n",lessonSection:"The End",},{message:"Wow, you did everything! Congratulations, nice work! A lot of these are really hard. I'm impressed you finished! I hope you enjoyed it!",code:'var pane = document.getElementById(\'pane\');\nvar s = 3;\n\npane.onmousemove = function(evt) {\n c.fillStyle = randomRGBA();\n var x = evt.clientX;\n var y = evt.clientY;\n c.fillRect(x - s / 2, y - s / 2, s, s);};\n\nfunction randomRGBA() {\n var r = randInt(255);\n var g = randInt(255);\n var b = randInt(255);\n var a = Math.random();\n var rgba = [r,g,b,a].join(",");\n return "rgba(" + rgba + ")";\n}\nfunction randInt(limit) {\n var x =
which produced this output
message: What this does is, every time the mouse moves in the canvas area, it sets mouseX and mouseY to the location of the mouse.
message: Then, when each ball is updated, it figures out how far away from the mouse it is, and accelerates toward it.
message: The acceleration is the square root of the distance, so it pulls harder when it is really far away. Imagine all the balls being connected to the mouse by little rubber bands or springs. It's a little like that.
message: Try making the balls smaller! And add more of them! I like it with about 40 small balls chasing the mouse.
message: Great job! Like what you learned? Was it fun?
code:
hiddenCode: var c = document.getElementById('pane').getContext('2d');\nfunction rgba(r,g,b,a) {return 'rgba('+[r,g,b,a].join(',')+')';}\nfunction rgb(r,g,b,a) {return 'rgb('+[r,g,b].join(',')+')';}\n\n
lessonSection: The End
message: Wow, you did everything! Congratulations, nice work! A lot of these are really hard. I'm impressed you finished! I hope you enjoyed it!
No doubt the syntax, whatever it is, allows for embedding double quotes within each string, but there is no example of it in this fragment
Your problem is that the .* that you use in your regex is "greedy". It grabs as much of the input data as possible - which goes right to the end of the file.
You need to change that to .*? so that it grabs as little as possible. But you also need to define better markers for the beginning and end of the regex. Looks to me like your message is always in double-quotes. So let's use that.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $input = do { local $/; <> };
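# Look for 'message:', then capture the following " and
# the minimal amount of text until you get the next ". Also
# check for a following comma - to be safe.
# The /g flag makes each iteration resume after the previous match;
# without it the loop would match the first message forever.
while ($input =~ /message:(".*?"),/g) {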
say $1;
}
This will work unless your messages have embedded double-quote marks (which will presumably be escaped as \"). If that's the case, you'll need something more complex.
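One way the "something more complex" could look is a pattern that treats a backslash plus the following character as a single unit, so an escaped \" does not end the string. The sketch below shows that pattern in Python purely as an illustration. It assumes escapes are written with a backslash and that messages are double-quoted, and the name `extract_messages` is made up.

```python
import re

# Body of a double-quoted string: any run of characters that are neither
# a quote nor a backslash, or a backslash followed by any one character.
MESSAGE = re.compile(r'message:"((?:[^"\\]|\\.)*)"', re.DOTALL)

def extract_messages(text):
    """Return the raw message bodies, with escape sequences left untouched."""
    return MESSAGE.findall(text)
```

The same alternation, (?:[^"\\]|\\.)*, can be used unchanged inside a Perl regex. Single-quoted values such as the code:'...' entry are not covered by this sketch.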
I do not know why you need to do this with the minified and concatenated source code, but, you can reverse that:
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Class;
use JavaScript::Beautifier qw/js_beautify/;
my $js = file('combined.min.js')->slurp(iomode => '<:encoding(UTF-8)');
my $pretty_js = js_beautify($js);
my @messages = ($pretty_js =~ /message: (.+?)\n/g);
print "$_\n" for @messages;
You already have some perl answers, but you may also be interested in the
xgettext tool which is designed specifically to extract strings for internationalisation. Run it like this:
xgettext -a --from-code UTF-8 combined.min.js -o -
It gives you output on each string like this:
#: combined.min.js:36
msgid ""
"Here is a ball that sticks to the mouse. Every time the mouse moves, the "
"ball redraws on top of the mouse."
msgstr ""
It is part of the GNU gettext package.
I have created the following script, which returns the percentage of free space, the total space, etc. of every disk on the remote servers. The problem is that I want an extra column, "Warning", that prints Yes if the free space is below 10% and No otherwise. I tried an if statement, but with no success. Please help.
Get-WmiObject Win32_LogicalDisk -filter "DriveType=3" -computer (Get-Content .\servers.txt) | Select SystemName,DeviceID,VolumeName,@{Name="Size(GB)";
Expression={"{0:N1}" -f($_.size/1gb)}},@{Name="FreeSpace(GB)";
Expression={"{0:N1}" -f($_.freespace/1gb)}},@{Name=" % Free(GB)";
Expression={"{0:N1}" -f(($_.freespace/$_.size)*100 )}},@{Name=" Warning";
Expression={????????}} |Format-Table -AutoSize |Out-File disk_monitor.txt
You can try something like
@{Name="Warning";Expression={ if((100 / $_.Size * $_.FreeSpace) -lt 10) { "Yes" } else { "No" }} };
This will calculate what percentage of disk space is available (100 / Size * FreeSpace) and if it's less than 10 (as in, percent), will return "Yes" or otherwise "No".
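The threshold arithmetic can be checked in isolation. The Python sketch below is only an illustration of the same comparison. The function name `free_space_warning` is hypothetical, and returning "Yes" for a zero or missing size is an assumption added here to avoid dividing by zero, not something the PowerShell expression does.

```python
def free_space_warning(size_bytes, free_bytes, threshold_percent=10.0):
    """Return "Yes" when free space is below the threshold, else "No"."""
    if not size_bytes:
        # Assumption: a disk reporting no size is flagged rather than skipped.
        return "Yes"
    percent_free = 100.0 * free_bytes / size_bytes
    return "Yes" if percent_free < threshold_percent else "No"
```

A disk with exactly 10% free gives "No", matching the strict -lt comparison.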
I have a script which helps me log in to a Cisco switch, run the mac-address-table command and save the output to an array @ver. The script is as follows:
#!/usr/bin/perl
use strict;
use warnings;
use Net::Telnet::Cisco;
my $host = '192.168.168.10';
my $session = Net::Telnet::Cisco->new(Host => $host, -Prompt=>'/(?m:^[\w.&-]+\s?(?:\(config[^\)]*\))?\s?[\$#>]\s?(?:\(enable\))?\s*$)/');
$session->login(Name => 'admin',Password => 'password');
my @ver = $session->cmd('show mac-address-table dynamic');
for my $line (@ver)
{
print "$line";
if ($line =~ m/^\*\s+\d+\s+(([0-9a-f]{4}[.]){2}[0-9a-f]{4})\s+/ ){
my $mac_addr = $1;
print ("$mac_addr \n");
}
}
$session->close();
It gets the following results:
Legend: * - primary entry
age - seconds since last seen
n/a - not available
vlan mac address type learn age ports
------+----------------+--------+-----+----------+--------------------------
* 14 782b.cb87.b085 dynamic Yes 5 Gi4/39
* 400 c0ea.e402.e711 dynamic Yes 5 Gi6/17
* 400 c0ea.e45c.0ecf dynamic Yes 0 Gi11/43
* 400 0050.5677.c0ba dynamic Yes 0 Gi1/27
* 400 c0ea.e400.9f91 dynamic Yes 0 Gi6/3
Now, with the above script I am trying to get the mac address and store it in $mac_addr. But I am not getting the desired results. Please can someone guide me. Thank you.
I'm not clear what you mean when you say you're not getting the desired results. I did notice that you print $line first and then print $mac_addr afterwards; besides that, your expression seems to match.
Your regular expression matches your desired data.
If you simply just want the matches, you could do..
for my $line (@ver) {
if (my ($mac_addr) = $line =~ /((?:[0-9a-f]{4}\.){2}[0-9a-f]{4})/) {
print $mac_addr, "\n";
}
}
Output
782b.cb87.b085
c0ea.e402.e711
c0ea.e45c.0ecf
0050.5677.c0ba
c0ea.e400.9f91
If you want to print out the mac addresses, you can do the following:
/^\*/ and print +(split)[2], "\n" for @ver;
Note that this splits the line (implicitly on whitespace) if it begins with *; the mac address is the third element (index 2) in the resulting list (in case you still need to set $mac_addr).
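The split-and-index approach carries over to other languages directly. The Python sketch below is only an illustration, with a made-up generator name `mac_addresses`. It assumes the table layout shown in the question, where entry rows start with * and the address is the field at index 2.

```python
def mac_addresses(lines):
    """Yield the field at index 2 of every line that starts with '*'."""
    for line in lines:
        if line.startswith("*"):
            fields = line.split()  # split on runs of whitespace
            if len(fields) > 2:    # ignore rows too short to hold an address
                yield fields[2]
```

Header and legend lines are skipped because they do not begin with *, so no regular expression is needed for this layout.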
Hope this helps!