sed matching multiple line pattern - regex

I have a log of the following format:
<<
[ABC] some other data
some other data
>>
<<
DEF some other data
some other data
>>
<<
[ABC] some other data
some other data
>>
I want to select all log blocks that contain [ABC]. The expected result is:
<<
[ABC] some other data
some other data
>>
<<
[ABC] some other data
some other data
>>
What would the expression for the sed command be?
To fetch the contents between << and >>, the expression is:
sed -e '/<</,/>>/!d'
But how can I also require [ABC] in between?

This might work for you:
sed '/^<</,/^>>/{/^<</{h;d};H;/^>>/{x;/^<<\n\[ABC\]/p}};d' file
<<
[ABC] some other data
some other data
>>
<<
[ABC] some other data
some other data
>>
sed comes equipped with a register called the hold space (HS).
You can use the HS to collect data of interest, in this case the lines between /^<</ and /^>>/.
h replaces whatever is in the HS with the contents of the pattern space (PS)
H appends a newline \n and then the PS to the HS
x swaps the HS and the PS
N.B. This deletes all lines other than those between <<...>> containing [ABC].
If you want to retain other lines use:
sed '/^<</,/^>>/{/^<</{h;d};H;/^>>/{x;/^<<\n\[ABC\]/p};d}' file
<<
[ABC] some other data
some other data
>>
<<
[ABC] some other data
some other data
>>
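For anyone who wants to try this locally, here is the same command run against a two-block sample (GNU sed assumed; the \n in the /^<<\n\[ABC\]/ address relies on GNU sed matching newlines inside the pattern space):

```shell
# Sample log with one [ABC] block and one DEF block, as in the question.
cat > log.txt <<'EOF'
<<
[ABC] some other data
some other data
>>
<<
DEF some other data
some other data
>>
EOF

# Collect each block in the hold space; print it only if it opens with [ABC].
sed '/^<</,/^>>/{/^<</{h;d};H;/^>>/{x;/^<<\n\[ABC\]/p}};d' log.txt
```

Only the [ABC] block is printed; the DEF block is swallowed by the final d.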

This works on my side:
awk '$0~/ABC/{print "<<";print;getline;print;getline;print }' temp.txt
tested as below:
pearl.242> cat temp.txt
<<
[ABC] some other data
some other data
>>
<<
DEF some other data
some other data
>>
nkeem
<<
[ABC] some other data
some other data
>>
pearl.243> awk '$0~/ABC/{print "<<";print;getline;print;getline;print }' temp.txt
<<
[ABC] some other data
some other data
>>
<<
[ABC] some other data
some other data
>>
pearl.244>
If you do not want to hard-code the print "<<"; statement, you can go for the below:
pearl.249> awk '$0~/ABC/{print x;print;getline;print;getline;print}{x=$0}' temp.txt
<<
[ABC] some other data
some other data
>>
<<
[ABC] some other data
some other data
>>
pearl.250>

To me, sed is line-based. You can probably talk it into being multi-line, but it would be easier to start the job with awk or perl rather than trying to do it in sed.
I'd use perl and make a little state machine like this pseudo code (I don't guarantee it'll catch every little detail of what you are trying to achieve):
state = 0
for each line:
    if state == 0:
        if line == '<<':
            state = 1
    else if state == 1:
        if line starts with '[ABC]':
            buffer = line
            state = 2
        else:
            state = 0        # first line of the block is not [ABC]; skip the block
    else if state == 2:
        if line == '>>':
            do something with buffer
            state = 0
        else:
            buffer += line
See also http://www.catonmat.net/blog/awk-one-liners-explained-part-three/ for some hints on how you might do it with awk as a one-liner...
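The state machine above translates almost line for line into awk. This is a sketch under the question's assumption that << and >> appear alone on their delimiter lines:

```shell
# Sample input reusing the question's block layout.
cat > log.txt <<'EOF'
<<
[ABC] some other data
some other data
>>
<<
DEF some other data
some other data
>>
EOF

# Open a buffer on "<<"; mark the block when an "[ABC]" line appears;
# on ">>", emit the buffered block only if it was marked.
awk '
    /^<</ { buf = $0; inblk = 1; keep = 0; next }
    inblk { buf = buf "\n" $0 }
    inblk && /^\[ABC\]/ { keep = 1 }
    inblk && /^>>/ { if (keep) print buf; inblk = 0 }
' log.txt
```

Unlike the getline answer above, this does not assume every block is exactly two lines long.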

TXR: built for multi-line stuff.
@(collect)
<<
[ABC] @line1
@line2
>>
@ (output)
<<
[ABC] @line1
@line2
>>
@ (end)
@(end)
Run:
$ txr data.txr data
<<
[ABC] some other data
some other data
>>
<<
[ABC] some other data
some other data
>>
Very basic stuff; you're probably better off sticking to awk until you have a very complicated multi-line extraction job with irregular data with numerous cases, lots of nesting, etc.
If the log is very large, we should write @(collect :vars ()) so the collect doesn't implicitly accumulate lists; then the job will run in constant memory.
Also, if the logs are not always two lines, it becomes a little more complicated. We can use a nested collect to gather the variable number of lines.
@(collect :vars ())
<<
[ABC] @line1
@ (collect)
@line
@ (until)
>>
@ (end)
@ (output)
<<
[ABC] @line1
@{line "\n"}
>>
@ (end)
@(end)

Related

How to identify '\N' character in data using Pig

I'm getting a very weird character '\N' in my data. I want to remove or replace this character. Below is a data sample:
Girls Shoes,1325051884
\N,\N
Men's Shirts,\N
Delimiter: comma (,)
I tried a couple of ways to replace/identify this \N character, but none worked.
In Pig, positional notation is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1, $2.
So, in the data mentioned above, the first field is identified by $0 (for e.g. "Girls Shoes") and second is identified by $1 (for e.g. 1325051884).
Following script has logic to replace '\N':
A = LOAD '/data.txt' USING PigStorage(',');
B = FILTER A BY ($0 != '\\N') OR ($1 != '\\N');
dump B;
C = FOREACH B GENERATE ($0 == '\\N' ? '' : $0), ($1 == '\\N' ? '' : $1);
dump C;
Where '/data.txt' contains following data:
Girl's Shoes,1325051884
\N,\N
Men's Shirts,\N
\N,Boy's Pants
Logic:
A = LOAD '/data.txt' USING PigStorage(',');
Loads data, by assuming the delimiter to be comma (,).
B = FILTER A BY ($0 != '\\N') OR ($1 != '\\N');
For each loaded record, filter the records by condition: $0 (first field) NOT EQUALS '\N' OR $1 (second field) NOT EQUALS '\N'
Output of this stage would be (2nd record containing both '\N' is filtered out):
(Girl's Shoes,1325051884)
(Men's Shirts,\N)
(\N,Boy's Pants)
C = FOREACH B GENERATE ($0 == '\\N' ? '' : $0), ($1 == '\\N' ? '' : $1);
For each of the records generated in the 2nd step, it checks: if $0 is equal to '\N'. If yes, it emits blank (''), else emits $0. Similar logic is applied to $1.
Output of this stage would be:
(Girl's Shoes,1325051884)
(Men's Shirts,)
(,Boy's Pants)
You can see that, '\N' is replaced by blank ('').
I am using Apache Pig 0.15. This script worked perfectly for your data.
A = FILTER data BY $2 == '\\N';
This will list out all records where that field contains the '\N' character.
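Pig isn't always at hand for a quick check; the same keep-unless-both, blank-out-'\N' logic from steps B and C can be sketched in awk (this is an illustration, not Pig):

```shell
# The question's sample rows.
cat > data.txt <<'EOF'
Girl's Shoes,1325051884
\N,\N
Men's Shirts,\N
\N,Boy's Pants
EOF

# Drop rows where both fields are \N (step B), then blank out any
# remaining \N fields (step C). "\\N" is a literal backslash-N in awk.
awk -F, -v OFS=, '!($1 == "\\N" && $2 == "\\N") {
    if ($1 == "\\N") $1 = ""
    if ($2 == "\\N") $2 = ""
    print
}' data.txt
```

The output matches the Pig script's stage C output shown above.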

C++: Output file of system command not generated

I am trying to run a shell command from my code. However, the output file (ts.dat) is not generated.
Can somebody let me know how to solve this problem?
string cmd1, input;
cout << "Enter the input file name: ";
cin >> input;
cmd1 = "grep 'DYNA>' input | cut -c9-14 > ts.dat";
system((cmd1).c_str());
Edit this line:
cmd1="grep 'DYNA>' input | cut -c9-14 > ts.dat";
To this:
cmd1="grep 'DYNA>' " + input + " | cut -c9-14 > ts.dat";
You need to actually use the value from the input string. As your code is currently written, you are putting the literal word input in your command string rather than the value stored in the variable.
cmd1="grep 'DYNA>' "+input+" | cut -c9-14 > ts.dat";
Keeping input outside the quotes makes the compiler treat it as the variable; placed inside the quotes it is parsed as literal text instead.
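Once input is concatenated in, the command that reaches system() is an ordinary grep | cut pipeline. A small shell sketch of what it does (the file name and the DYNA> line layout here are made up for illustration):

```shell
# Hypothetical input: one matching DYNA> line and one non-matching line.
# In "DYNA> T 123456 more", characters 9 through 14 are "123456".
printf 'DYNA> T 123456 more\nOTHER 9 999999 skip\n' > dyna.log

# Keep lines containing DYNA>, then cut out character columns 9-14.
grep 'DYNA>' dyna.log | cut -c9-14 > ts.dat
cat ts.dat
```

If ts.dat still doesn't appear after the fix, check system()'s return value and the shell error output.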

Finding columns with only white space in a text file and replace them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way to separate the columns is to find the character columns that contain only spaces. How can we identify these columns and replace them with a unique separator like ,?
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find every contiguous block of character columns that contains only white space (nothing else) and replace each whole block with a single ,, the problem is solved.
Better explanation of the question by josifoski:
Per block of matrix characters, if all are 'space' then the whole block should be replaced vertically with one , on every line.
$ cat tst.awk
BEGIN { FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR {
    for (i=1; i<=NF; i++) {
        if ($i == " ") {
            space[i]
        }
        else {
            nonSpace[i]
        }
    }
    next
}
FNR==1 {
    for (i in nonSpace) {
        delete space[i]
    }
}
{
    for (i in space) {
        $i = ","
    }
    gsub(/,+/,",")
    print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk:
awk 'BEGIN{OFS=FS=""}        # Set the field separator to nothing so each character is a field
FNR==NR{                     # First pass over the file
    for(i=1;i<=NF;i++)
        a[i]+=$i!=" "        # Per character position, count the lines that
                             # have a non-space in that position
    next                     # Skip the remaining commands on the first pass
}
{                            # Second pass (same file read a second time)
    for(i=1;i<=NF;i++)       # Loop through the fields
        if(!a[i]){           # If no line had a non-space at this position
            $i=","           # Change the field to ","
            x=i              # Set x to the field number
            while(!a[++x]){  # While incrementing x stays in the all-space run
                $x=""        # Change the field to nothing
                i=x          # Set i to x so those fields are not visited again
            }
        }
}1' test{,}                  # 1 prints each line; test{,} expands to "test test",
                             # so the same file is read twice
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)

Separate a line to columns using several spaces (>1) as a delimiter using C++ or linux

I have several lines looking like this:
4539(random number of spaces)07235001(random number of spaces)Aach(random number of spaces)Trier Saarburg
I want to separate it to 4 columns using C++ or linux. The output I want will look like this:
4539|07235001|Aach|Trier Saarburg
So I want to treat several spaces as the delimiter but not the single one.
(random number of spaces thankfully is always > 1)
Lines do not always consist of 4 columns and the space problem is not always at the last column.
Thanks in advance
You should read each field individually. The last field can be read until a newline character is received:
std::string column1;
std::string column2;
std::string column3;
std::string column4;

while (input_file >> column1)
{
    input_file >> column2;
    input_file >> column3;
    getline(input_file, column4);
}
Another method is to read the entire line using getline and then fetch out the substring fields using std::string::find and std::string::substr.
You can use awk with regular expressions for this:
echo "4539    07235001   Aach   Trier Saarburg" | awk 'BEGIN { FS = "[ ]{2,}" } { OFS = "|" }; { $1 = $1; print $0 }'
FS variable is used to set the field separator for each record and may contain any regular expression. OFS is the output equivalent of the FS variable.
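As a quick sanity check, here is the same idea with FS spelled "  +" (a space followed by one or more spaces), which behaves like "[ ]{2,}" but does not depend on interval-expression support in older awks:

```shell
# Two-or-more spaces separate columns; the single space inside
# "Trier Saarburg" is left alone.
printf '4539    07235001   Aach   Trier Saarburg\n' |
    awk 'BEGIN { FS = "  +"; OFS = "|" } { $1 = $1; print }'
```

The assignment $1 = $1 forces awk to rebuild the record with the new OFS.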

Parsing of file with Key value in C/C++

Need some help in parsing the file
Device#   Device Name    Serial No.      Active   Policy   Disk#    P.B.T.L   ALB   Paths
-----------------------------------------------------------------------------------------
1 AB OPEN-V-CM 50 0BC1F1621 1 SQST Disk 2 3.1.4.0 N/A
2 AB OPEN-V-CM 50 0BC1F1605 1 SQST Disk 3 3.1.4.1 N/A
3 AB OPEN-V*2 50 0BC1F11D4 1 SQST Disk 4 3.1.4.2 N/A
4 AB OPEN-V-CM 50 0BC1F005A 1 SQST Disk 5 3.1.4.3 N/A
The above information is in a devices.txt file, and I want to extract the device number corresponding to the disk number I input.
The disk number I input is just an integer (and not "Disk 2" as shown in the file).
Open the file and skip the first 3 lines.
Start reading line by line from the 4th line onward. You can get the device number easily, as it is the first column.
To get the disk number, scan each line for space characters: each run of spaces means you have passed one column. Ignore repeated spaces and continue until you reach the disk number. You must handle spaces within a column's data separately if any exist.
Load the disk number and device number into, say, a map, and later use your input to query the device info from this map.
#include <sstream>
#include <fstream>
#include <iostream>
#include <cctype>
using namespace std;

int main(int argc, char* argv[])
{
    int wantedDisknum = 4;
    int finalDeviceNum = -1;

    ifstream fin("test.txt");
    if (!fin.is_open())
        return -1;

    string line;
    while (getline(fin, line))   // read line by line; avoids the while(!fin.eof()) pitfall
    {
        stringstream ss(line);

        int deviceNum;
        ss >> deviceNum;
        if (ss.fail())           // header/separator lines have no leading number
            continue;

        // Skip the 7 whitespace-separated tokens between the device number
        // and the disk number (name, serial, active, policy, "Disk").
        string unused;
        int diskNum;
        ss >> unused >> unused >> unused >> unused >> unused >> unused >> unused >> diskNum;

        if (diskNum == wantedDisknum)
        {
            finalDeviceNum = deviceNum;
            break;
        }
    }
    fin.close();

    cout << finalDeviceNum << endl;
    return 0;
}
In UNIX, you can easily achieve this using awk or another scripting language. For example, to print the device number (field 1) of the line whose disk number (the number after the "Disk" token) is 2:
awk '$8 == "Disk" && $9 == 2 { print $1 }' Device.txt
In C++, you would extract the relevant column using strtok and compare it with the input value; if it matches, print that line.
Assuming "Disk" does not appear in any of the following columns:
1) Skip lines until you encounter '-' as the first character of a line, then skip that line too.
2) Read a line.
2.a) Skip characters of the current line until isdigit(line[i]) returns true, then read that character and those following it into a temporary buffer until isdigit(line[i]) returns false. This is the device id.
2.b) Skip characters of the current line until you find a 'D'.
2.b.i) Match the 'i', 's', 'k' characters; if any of them fails, go back to 2.b.
2.c) Skip characters of the current line until isdigit(line[i]) returns true, then read that character and those following it into another buffer until isdigit(line[i]) returns false. This is the disk id.
3) Print out both buffers.
I don't have my Regular Expression cheat sheet handy, but I'm pretty sure it would be straightforward to run each line of the file through a regex that:
1) captures the integer at the start of the line (the device number)
2) skips the intermediate columns
3) matches "Disk", one space, and captures the digits that follow (the disk number)
Boost, Qt, and most other common C++ class libraries have a regex parser for just this kind of thing.
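To make the regex idea concrete, here is a sed/awk sketch (not C++): capture the leading Device# and the number after "Disk", then select by the wanted disk number. The column spacing below only approximates the file in the question.

```shell
cat > devices.txt <<'EOF'
Device#  Device Name   Serial No.     Active  Policy  Disk#   P.B.T.L  ALB Paths
--------------------------------------------------------------------------------
1   AB OPEN-V-CM   50 0BC1F1621   1   SQST   Disk 2   3.1.4.0   N/A
2   AB OPEN-V-CM   50 0BC1F1605   1   SQST   Disk 3   3.1.4.1   N/A
3   AB OPEN-V*2    50 0BC1F11D4   1   SQST   Disk 4   3.1.4.2   N/A
4   AB OPEN-V-CM   50 0BC1F005A   1   SQST   Disk 5   3.1.4.3   N/A
EOF

disk=4   # the disk number we were given
# \1 = leading device number, \2 = the digits after "Disk ".
# Header and dashed lines don't start with a digit, so -n skips them.
sed -nE 's/^([0-9]+).*Disk ([0-9]+).*/\1 \2/p' devices.txt |
    awk -v d="$disk" '$2 == d { print $1 }'
```

For disk number 4 this prints device number 3, matching the sample data.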