Error executing a perl command using popen in C++ - c++

In my C++ program I want to execute a perl command and read the output it produces. I use popen for that, but I get an error when executing my command:
Command:
string cmd = "perl -ne 's/^\\S+\\s//; if ((/" +
pattern1+ " START/ .. /" + pattern2+ " END/) && /find/)"
" { print \"$_\"}' file";
stream = popen(cmd.c_str(),"r");
If I run this command on the command line it works, but from C++ I get this error:
Search pattern not terminated at -e line 1.
The command that works on the command line is shown below; in C++ I already escaped the '\' and '"':
perl -ne 's/^\\S+\\s//; if ((/aaa START/ .. /bbb END/) && /find/) { print "$_"}' file
If I execute this command, it works: "perl -ne print $_ file".
But my initial command doesn't.
What am I doing wrong? Thanks.

It's your escape characters (\). You'll have to double them up, because in a C++ string literal \\ gets turned into \. The shell then does its own processing, as you see on the command line, i.e. another round of \\ being turned into \.
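The first of those layers (the string literal) can be checked in isolation. Below is a minimal sketch in Python, whose ordinary string literals treat \\ the same way C++ literals do. The variable names are made up for illustration, and the shell layer is not modelled here.

```python
# Each pair of backslashes in the source literal becomes ONE backslash
# in the string that is actually handed to popen().
two = "s/^\\S+\\s//"        # what the question used
four = "s/^\\\\S+\\\\s//"   # what the answers suggest

print(two)   # s/^\S+\s//
print(four)  # s/^\\S+\\s//
```

Printing the command string right before the popen call is a cheap way to see exactly what the next layer receives.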

You need to escape your backslashes (by adding more backslashes!).
std::string cmd = "perl -ne 's/^\\\\S+\\\\s//; if ((/" +
pattern1 + " START/ .. /" +
pattern2+ " END/) && /find/)"
" { print \"$_\"}' file";
In C++0x (now C++11) you can use raw string literals, R"(...)", to avoid the extra backslashes. Compile with GCC like this:
g++ -std=c++0x -Wall popen.cpp
example:
std::string cmd_raw = R"(perl -ne 's/^\\S+\\s//; if ((/)" +
pattern1 + R"( START/ .. /)" +
pattern2 + R"( END/) && /find/))"
R"( { print "$_"}' file)";

This worked:
cmd = "perl -ne 's/^\\\\S+\\\\s//; if ((/" +
pattern1+ " START/ .. /" + pattern2+ " END/) && /find/)"
" { print \"$_\"}' file";
stream = popen(cmd.c_str(),"r");

Related

How to use sed to extract numbers from a comma separated string?

I managed to extract the following response and comma-separate it. It's a comma-separated string and I'm only interested in the comma-separated values of the account IDs. How do I pattern-match these using sed?
Input: ACCOUNT_ID,711111111119,ENVIRONMENT,dev,ACCOUNT_ID,111111111115,dev
Expected Output: 711111111119, 111111111115
My $input variable stores the input
I tried the below but it joins all the numbers and I would like to preserve the comma ','
echo $input | sed -e "s/[^0-9]//g"
I think you're better served with awk:
awk -v FS=, '{for(i=1;i<=NF;i++)if($i~/[0-9]/){printf sep $i;sep=","}}'
If you really want sed, you can go for
sed -e "s/[^0-9]/,/g" -e "s/,,*/,/g" -e "s/^,\|,$//g"
$ awk '
BEGIN {
FS = OFS = ","
}
{
c = 0
for (i = 1; i <= NF; i++) {
if ($i == "ACCOUNT_ID") {
printf "%s%s", (c++ ? OFS : ""), $(i + 1)
}
}
print ""
}' file
711111111119,111111111115
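The idea behind the awk script above (take whatever field follows each ACCOUNT_ID) can also be sketched in Python. This is only an illustration under the assumption that every ACCOUNT_ID label is immediately followed by its value; the function name account_ids is hypothetical.

```python
def account_ids(line):
    """Return the values that directly follow each ACCOUNT_ID field."""
    fields = line.strip().split(",")
    # Pair every field with its successor and keep the successors of ACCOUNT_ID.
    return [nxt for cur, nxt in zip(fields, fields[1:]) if cur == "ACCOUNT_ID"]


line = "ACCOUNT_ID,711111111119,ENVIRONMENT,dev,ACCOUNT_ID,111111111115,dev"
print(",".join(account_ids(line)))  # 711111111119,111111111115
```

Keying on the label rather than on "anything numeric" avoids picking up other numeric fields that might appear in the response.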

copying first string into second line

I have a text file in this format:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375 aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375 abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Here I call the first string, before the first space, the word (for example abacası).
The string which starts after the first space and ends with a number is a definition (for example Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875).
I want to do this: if a line includes more than one definition (the first line has one, the second line has two, the third line has three), start a new line for each extra definition and put the first string (the word) at the beginning of the new line. Expected output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
I have almost 1,500,000 lines in my text file, and the number of definitions per line is not fixed. It can be anywhere from 1 to 5.
A small Python script does the job. Input is expected in input.txt; output goes to output.txt.
import re

# The word: the first run of non-whitespace plus the whitespace after it.
rf = re.compile(r'([^\s]+\s).+')
# One definition: "key : 12.345".
r = re.compile(r'([^\s]+\s\:\s\d+\.\d+)')

with open("input.txt", "r") as f:
    text = f.read()

with open("output.txt", "w") as f:
    for l in text.split('\n'):
        offset = 0
        first = ""
        match = re.search(rf, l[offset:])
        if match:
            first = match.group(1)
            offset = len(first)
        while True:
            match = re.search(r, l[offset:])
            if not match:
                break
            s = match.group(1)
            # Advance past the whole match, including any skipped separator.
            offset += match.end()
            # first already ends with the separating whitespace.
            f.write(first + s + "\n")
I am assuming the following format:
word definitionkey : definitionvalue [definitionkey : definitionvalue …]
None of those elements may contain a space and they are always delimited by a single space.
The following code should work:
awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }' file
Explanation (this is the same code but with comments and more spaces):
awk '
# match any line
{
# iterate over each "key : value"
for (i=2; i<=NF; i+=3)
print $1, $i, $(i+1), $(i+2) # prints each "word key : value"
}
' file
awk has some tricks that you may not be familiar with. It works on a line-by-line basis. Each stanza has an optional conditional before it (awk 'NF >=4 {…}' would make sense here, since a line with fewer than four fields cannot hold a complete definition). NF is the number of fields and a dollar sign ($) indicates we want the value of the given field, so $1 is the value of the first field, $NF is the value of the last field, and $(i+1) is the value of the third field (assuming i=2). print defaults to putting spaces between its arguments and adds a line break at the end (otherwise, we'd need printf "%s %s %s %s\n", $1, $i, $(i+1), $(i+2), which is a bit harder to read).
With perl:
perl -a -F'[^]:]\K\h' -ne 'chomp(@F);$p=shift(@F);print "$p ",shift(@F),"\n" while(@F);' yourfile.txt
With bash:
while read -r line
do
pre=${line%% *}
echo "$line" | sed 's/\([0-9]\) /\1\n'$pre' /g'
done < "yourfile.txt"
This script reads the file line by line. For each line, the prefix is extracted with a parameter expansion (everything up to the first space), and each space preceded by a digit is replaced with a newline and the prefix using sed.
edit: as tripleee suggested, it's much faster to do it all with sed:
sed -i.bak ':a;s/^\(\([^ ]*\).*[0-9]\) /\1\n\2 /;ta' yourfile.txt
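For readers who find the substitution easier to follow outside sed, here is a small Python sketch of the same rule (a space preceded by a digit ends a definition). It assumes, like the commands above, that the word itself does not end in a digit; the function name split_definitions is hypothetical.

```python
import re


def split_definitions(line):
    """Put each definition on its own line, repeating the leading word."""
    prefix = line.split(" ", 1)[0]
    # A space that follows a digit marks the end of one definition.
    # A function is used as the replacement so the prefix is inserted literally.
    return re.sub(r"(?<=[0-9]) ", lambda m: "\n" + prefix + " ", line)


print(split_definitions("w A[N] : 1.5 B[N] : 2.25"))
# w A[N] : 1.5
# w B[N] : 2.25
```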
Assuming there are always 4 space-separated words for each definition:
awk '{for (i=1; i<NF; i+=4) print $i, $(i+1), $(i+2), $(i+3)}' file
Or if the split should occur after that floating point number
perl -pe 's/\b\d+\.\d+\K\s+(?=\S)/\n/g' file
(This is the perl equivalent of Avinash's answer)
Bash and grep:
#!/bin/bash
while IFS=' ' read -r in1 in2 in3 in4; do
if [[ -n $in4 ]]; then
prepend="$in1"
echo "$in1 $in2 $in3 $in4"
else
echo "$prepend $in1 $in2 $in3"
fi
done < <(grep -o '[[:alnum:]][^:]\+ : [[:digit:].]\+' "$1")
The output of grep -o is putting all definitions on a separate line, but definitions originating from the same line are missing the "word" at the beginning:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
The while loop now reads this output, using a space as the input field separator. If in4 is a zero-length string, we're on a line where the "word" is missing, so we prepend it.
The script takes the input file name as its argument, and saving output to an output file can be done with simple redirection:
./script inputfile > outputfile
Using perl:
$ perl -nE 'm/([^ ]*) (.*)/; my $word=$1; $_=$2; say $word . " " . $_ for / *(.*?[0-9]+\.[0-9]+)/g;' < input.log
Output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625
Explanation:
Split the line to separate first field as word.
Then split the remaining line using the regex .*?[0-9]+\.[0-9]+.
Print word concatenated with every match of above regex.
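The three steps above translate fairly directly into other languages. A minimal Python sketch using the same regex follows; the names DEFINITION and expand are illustrative and not part of the original answer.

```python
import re

# Lazily match up to and including the next floating point number.
DEFINITION = re.compile(r" *(.*?[0-9]+\.[0-9]+)")


def expand(line):
    """Return one 'word definition' string per definition on the line."""
    # Step 1: split off the first field as the word.
    word, _, rest = line.partition(" ")
    # Steps 2 and 3: find every definition and prefix it with the word.
    return [word + " " + d for d in DEFINITION.findall(rest)]


for out in expand("w A[A3sg] : 1.5 B[N] : 2.25"):
    print(out)
# w A[A3sg] : 1.5
# w B[N] : 2.25
```

The lazy .*? matters: a greedy .* would swallow everything up to the last number on the line and yield a single match.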
I would approach this with one of the excellent Awk answers here; but I'm posting a Python solution to point to some oddities and problems with the currently accepted answer:
It reads the entire input file into memory before processing it. This is harmless for small inputs, but the OP mentions that the real-world input is kind of big.
It needlessly uses re when simple whitespace tokenization appears to be sufficient.
I would also prefer a tool which prints to standard output, so that I can redirect it where I want it from the shell; but to keep this compatible with the earlier solution, this hard-codes output.txt as the destination file.
with open('input.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            word = tokens[0]
            # Each analysis is three tokens: key, ":", value.
            for idx in range(1, len(tokens), 3):
                print(word, ' '.join(tokens[idx:idx+3]), file=outfile)
If you really, really wanted to do this in pure Bash, I suppose you could:
while read -r word analyses; do
set -- $analyses
while [ $# -gt 0 ]; do
printf "%s %s %s %s\n" "$word" "$1" "$2" "$3"
shift; shift; shift
done
done <input.txt >output.txt
The following bash code does this:
#!/bin/bash
# read.sh
while read variable
do
for i in "$variable"
do
var=`echo "$i" |wc -w`
array_1=( $i )
counter=0
for((j=1 ; j < $var ; j++))
do
if [ $counter = 0 ] #1
then
echo -ne ${array_1[0]}' '
fi #1
echo -ne ${array_1[$j]}' '
counter=$(expr $counter + 1)
if [ $counter = 3 ] #2
then
counter=0
echo
fi #2
done
done
done
I have tested it and it works.
To test it, run the following command at a bash prompt:
$ ./read.sh < input.txt > output.txt
where read.sh is the script, input.txt is the input file and output.txt is where the output is written.
Here is a sed solution in action:
sed -r '/^indirger(ken|di)/{s/([0-9]+[.][0-9]+ )(indirge)/\1\n\2/g}' my_file
output
indirgerdi indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]+YDH[Past] : 22.2626953125
indirge[Verb]+[Pos]+Hr[Aor]+YDH[Past]+[A3sg] : 18.720703125
indirgerken indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]-Yken[Adv+While] : 19.6201171875

Escaping special characters with sed

I have a script to generate char arrays from strings:
#!/bin/bash
while [ -n "$1" ]
do
echo -n "{" && echo -n "$1" | sed -r "s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
shift
done
It works great as is:
$ wchar 'test\n' 'test\\n' 'test\123' 'test\1234' 'test\x12345'
{'t','e','s','t','\n',0}
{'t','e','s','t','\\','n',0}
{'t','e','s','t','\123',0}
{'t','e','s','t','\123','4',0}
{'t','e','s','t','\x12345',0}
But because sed considers each new line to be a brand new thing it doesn't handle actual newlines:
$ wchar 'test
> test'
{'t','e','s','t',
't','e','s','t',0}
How can I replace special characters (Tabs, newlines etc) with their escaped versions so that the output would be like so:
$ wchar 'test
> test'
{'t','e','s','t','\n','t','e','s','t',0}
Edit: Some ideas that almost work:
echo -n "{" && echo -n "$1" | sed -r ":a;N;;s/\\n/\\\\n/;$!ba;s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
Produces:
$ wchar 'test\n\\n\1234\x1234abg
test
test'
{test\n\\n\1234\x1234abg\ntest\ntest0}
While removing the !:
echo -n "{" && echo -n "$1" | sed -r ":a;N;;s/\\n/\\\\n/;$ba;s/((\\\\x[0-9a-fA-F]+)|(\\\\[0-7]{1,3})|(\\\\?.))/'\1',/g" && echo "0}"
Produces:
$ wchar 'test\n\\n\1234\x1234abg
test
test'
{'t','e','s','t','\n','\\','n','\123','4','\x1234ab','g','\n','t','e','s','t',
test0}
This is close...
The first isn't performing the final replacement, and the second isn't handling the last line correctly.
You can pre-filter before passing to sed. Perl will do:
$ set -- 'test1
> test2'
$ echo -n "$1" | perl -0777 -pe 's/\n/\\n/g'
test1\ntest2
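If leaving sed is an option, the whole transformation (tokenizing existing escapes and escaping real control characters) fits in a few lines. The Python sketch below reuses the question's regex alternatives. The names TOKEN, SPECIAL and char_array are hypothetical, only newline and tab are mapped, and, like the original script, it does not handle a literal single quote or a trailing lone backslash.

```python
import re

# An existing escape (\x.., octal, or backslash + any character)
# or a single plain character. DOTALL lets "." match a real newline.
TOKEN = re.compile(r"\\x[0-9a-fA-F]+|\\[0-7]{1,3}|\\?.", re.DOTALL)

# Real control characters and the escapes that should replace them.
SPECIAL = {"\n": "\\n", "\t": "\\t"}


def char_array(text):
    """Render text as a C char array initializer ending in 0."""
    items = ["'" + SPECIAL.get(tok, tok) + "'" for tok in TOKEN.findall(text)]
    return "{" + ",".join(items + ["0"]) + "}"


print(char_array("test\ntest"))
# {'t','e','s','t','\n','t','e','s','t',0}
print(char_array(r"test\1234"))
# {'t','e','s','t','\123','4',0}
```

Because the input is handled as one string rather than line by line, an embedded newline is just another character to map.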
This is a very convoluted solution, but it might work for your needs. It requires GNU awk 4.1:
#!/usr/bin/awk -f
@include "join"
@include "ord"
BEGIN {
RS = "\\\\(n|x..)"
FS = ""
}
{
for (z=1; z<=NF; z++)
y[++x] = ord($z)<0x20 ? sprintf("\\x%02x",ord($z)) : $z
y[++x] = RT
}
END {
y[++x] = "\\0"
for (w in y)
y[w] = "'" y[w] "'"
printf "{%s}", join(y, 1, x, ",")
}
Result
$ cat file
a
b\nc\x0a
$ ./foo.awk file
{'a','\x0a','b','\n','c','\x0a','\0'}

Print only '+' or '-' if string matches (two files)

I would like to print only a '+' or '-' symbol depending on whether a string is found or not. Basically, I have two files:
Input file 1 (tab-delimited):
HPNK_00457
HPNK_00458
HPNK_00459
Input file 2 (tab-delimited):
HPNK_00457 AAA50325 1e-43 437 28 43 83 ATP-binding protein.
HPNK_00458 P25256 8e-43 429 28 43 82 RecName: Full=Tylosin resistance ATP-binding protein tlrC.
HPNK_00458 CAM96590 1e-42 429 27 42 87 ABC transporter ATP-binding protein [Streptomyces ambofaciens].
Desired output (tab-delimited, maintaining order of strings in file 1):
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
This is what I've been using up to now, but need to update:
while read vl; do grep "^$vl " file2 || printf -- "- -\n" ; done < file1
Thanks, trying to learn everyday here.
Here's one way using awk:
awk 'FNR==NR { a[$1]; next } { print $1, ($1 in a ? "+" : "-" ) }' file2 file1
Results:
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
You can use:
while read -r line
do
grep -q "$line" f2 && echo "$line +" || echo "$line -"
done < f1
grep -q only reports whether something matched, without printing it. If it did, we print the line followed by +; otherwise, we print the line followed by -.
It returns:
$ while read -r line; do grep -q "$line" f2 && echo "$line +" || echo "$line -"; done < f1
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
perl -lane'
BEGIN{ $, ="\t"; $x=shift; @h{ map /(\S+)/, <> } =(); @ARGV=$x }
print @F, exists $h{$F[0]} ? "+" : "-";
' file1 file2
output
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
Here's the algorithm:
Read file 2. For each line,
Get the first word
Store it in a hash.
Read file 1. For each line, chomp it, then
print $hash{$_}? '+' : '-'
I can write the code for you, but if you want to learn every day, writing it yourself will be a useful exercise.
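For comparison once you have tried it yourself, the algorithm above could look roughly like this in Python. It is a sketch only: the function name mark_presence and the in-memory sample lists are illustrative, and real code would read the two files instead.

```python
def mark_presence(ids, hit_lines):
    """Yield 'id<TAB>+' or 'id<TAB>-' for each id, keeping the order of ids."""
    # File 2: store the first word of every non-empty line in a set.
    seen = {line.split()[0] for line in hit_lines if line.strip()}
    # File 1: strip the newline, then look the id up.
    for raw in ids:
        key = raw.strip()
        if key:
            yield key + "\t" + ("+" if key in seen else "-")


file1 = ["HPNK_00457\n", "HPNK_00458\n", "HPNK_00459\n"]
file2 = [
    "HPNK_00457\tAAA50325\t1e-43\n",
    "HPNK_00458\tP25256\t8e-43\n",
    "HPNK_00458\tCAM96590\t1e-42\n",
]
print("\n".join(mark_presence(file1, file2)))
```

Reading file 2 first and looking ids up in a set keeps the work to one pass over each file, and iterating over file 1 preserves its order in the output.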
This simple Perl script should do the work
#!/usr/local/bin/perl
use strict;
use warnings;
## f1 and f2 are the 2 files containing your input data
open my $fh1, '<', 'f1' or die "f1: $!";
open my $fh2, '<', 'f2' or die "f2: $!";
my @file1data = <$fh1>;
my @file2data = <$fh2>;
foreach my $data (@file1data) {
    chomp($data);
    # Search every line of file 2, not only the line at the same position
    if (grep { /^\Q$data\E\b/ } @file2data) {
        print $data . " " . "+\n";
    }
    else {
        print $data . " " . "-\n";
    }
}
awk 'FNR==NR {a[$1]; next}
{b[$1]}
END{
# note: for (i in a) visits the keys in an unspecified order
for(i in a)
if(i in b){print i,"+"}
else{print i,"-"}
}' file1 file2

Execute a command using popen

I have a C++ program in which I want to execute the following command:
cmd = "(diff <(perl -ne 's/^\\S+\\s//; if ((/aaa/ .. /bbb/) && /ccc/)"
" { print \"$_\"}' file1)"
"<(perl -ne 's/^\\S+\\s//; if ((/aaa/ .. /bbb/) && /ccc/)"
" { print \"$_\"} file2)) ";
I get this error when I want to execute this command:
Search pattern not terminated at -e line 1.
I've noticed that the following commands work like this:
cmd = "diff <(echo aa) <(echo bb)"
string strCall = "bash -c \"( " + cmd + " ) 2>&1\"";
stream = popen(strCall.c_str(),"r"); // doesn't work: popen(str.c_str(),"r")
and an example perl command containing '"' works like this:
cmd = "perl -ne '{print \"$1\"}' file"
stream = popen(str.c_str(),"r"); // doesn't work: popen(strCall.c_str(),"r");
but if the perl command doesn't contain '"', it works both ways:
cmd = "perl -ne '{print $1}' file"
string strCall = "bash -c \"( " + cmd + " ) 2>&1\"";
stream = popen(str.c_str(),"r"); // also works: popen(strCall.c_str(),"r");
How can I use both diff and perl in the same command? I assume I have to use strCall.
I've also tried escaping the perl cmd like this, but it doesn't work:
cmd = "perl -ne '{print \\\"$1\\\"}' file" // one '/' for '/', one for "'" and one for '"'.
This didn't work either, and in any case I am not allowed to use R"(str)":
cmd = R"(perl -ne '{print \"$1\"}' file)"
string strCall = "bash -c \"( " + cmd + " ) 2>&1\"";
stream = popen(strCall.c_str(),"r")
Thanks.
I know I am not answering your question, but a common solution once you reach this many levels of quoting is to write a simple shell script and then call that from popen.
E.g., popen("/path/diffscript.sh", "r");