I need to extract messages from a log file. Messages are logged in two different ways: in a single line, like this:
2018-09-21 10:03:54,145 <message-content>
2018-09-21 10:05:02,008 <next-message-content>
or in several lines like this:
2018-09-21 10:03:54,145 <message-content-part 1>
<message-content-part 2>
...
<message-content-part n>
2018-09-21 10:04:12,198 <next-message-content>
Each message starts with header \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}.
Messages have no specific ending tag.
I want to extract all messages, both single- and multi-line, with specific text.
For example, a search for "XYZ" could produce output like this:
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
You may use
cat file | \
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' | \
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}'
Details
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/\n\n&/' - This sed command finds lines that start with the datetime format and prepends a double newline to them.
awk 'BEGIN { RS = "\n\n"; ORS=""} /XYZ/ {print}' - This awk command splits its input into records at "\n\n" (RS is the record separator) and prints only the records that contain the XYZ substring. Setting ORS="" (ORS is the output record separator) keeps awk from appending anything after each printed record. Note that gawk and mawk accept a multi-character RS like this one, while POSIX leaves its behavior unspecified.
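For comparison, the same record-splitting idea can be sketched in Python. This is only an illustration, not part of the answer above: the function name find_messages is made up here, and the header pattern is the one given in the question.

```python
import re

# Header pattern taken from the question: date, time, comma, milliseconds.
HEADER = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def find_messages(lines, needle):
    """Group lines into messages and return the messages containing needle.

    find_messages is a hypothetical helper name used for this sketch.
    """
    messages = []
    current = []
    for line in lines:
        # A header line closes the previous message and opens a new one.
        if HEADER.match(line) and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        # The last message has no following header, so flush it explicitly.
        messages.append("".join(current))
    return [message for message in messages if needle in message]
```

To run it on a file, something like `find_messages(open("file"), "XYZ")` should work, since iterating over a file object yields lines with their trailing newlines.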
Using perl. I added 2 more messages in the sample input, which should not appear in the output.
> cat pattern_xyz.dat
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:03:54,145 AAA BBB PPP CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
2018-09-21 10:10:55,347 BBB
CCC QQQW
DDD
>
> cat pattern_xyz.pl
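#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0];
# Slurp the whole file into one string.
open(my $fh, '<', $file) or die "Cannot open $file: $!";
my $x = do { local $/; <$fh> };
close($fh);
# Split in front of every line that starts with a date, so the last message is checked too.
for my $content (split /^(?=\d{4}-\d{2}-\d{2} )/m, $x)
{
print $content if $content =~ /XYZ/;
}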
> pattern_xyz.pl pattern_xyz.dat #executing script
2018-09-21 10:03:54,145 AAA BBB XYZ CCC
2018-09-21 10:10:55,347 BBB
CCC XYZW
DDD
2018-09-21 10:12:56,060 EEE XYZFFF
GGG
>
>
Related
This is the script:
sed 's\[/][*]\//\g ; s/[*][/]\s\+/\n/g ; s/[*][/]/\n/g' inputFile > outputFile
This is input file:
aaa /* bbb */ ccc /* ddd */ eee /* fff
ggg */ hhh /* iii */ jjj
kkk
/* lll
mmm
nnn */
ooo
This is output file:
aaa // bbb
ccc // ddd
eee // fff
ggg
hhh // iii
jjj
kkk
// lll
mmm
nnn
ooo
Expected output:
aaa // bbb
ccc // ddd
eee // fff
// ggg
hhh // iii
jjj
kkk
// lll
// mmm
// nnn
ooo
The script I am currently using cannot handle multiline comments. Is there any way to achieve this with sed?
If Perl is an option, you could try the following:
perl -0777pe 's#/\*\s*(.+?)\s*\*/\s*#join("\n", map {"// " . $_ } split("\n", $1)) . "\n"#sge' inputFile
Output:
aaa // bbb
ccc // ddd
eee // fff
// ggg
hhh // iii
jjj
kkk
// lll
// mmm
// nnn
ooo
The -0777 option tells perl to slurp the whole file at once.
The -p option loops over the input and prints the result, and -e supplies the code on the command line.
The s switch to the s/pattern/replacement/ operator makes a dot match a newline character.
The g switch replaces every match, not just the first one.
The e switch to the s/pattern/replacement/ operator enables the replacement to be a perl expression.
The join .. map .. split() functions prefix every line of a multiline comment with "// ".
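The same substitution can be sketched in Python for readers who want to see the pieces spelled out. This is an illustrative translation under the same assumptions as the one-liner (comments are not nested, and the whole input fits in memory); the name to_line_comments is made up for this example.

```python
import re

# re.DOTALL plays the role of Perl's s switch: the dot also matches newlines.
COMMENT = re.compile(r"/\*\s*(.+?)\s*\*/\s*", re.DOTALL)


def to_line_comments(text):
    """Rewrite /* ... */ comments as // comments, one per line.

    to_line_comments is a hypothetical name used for this sketch.
    """
    def replace(match):
        # Prefix every line of the comment body, like join/map/split in Perl.
        body_lines = match.group(1).split("\n")
        return "\n".join("// " + line for line in body_lines) + "\n"

    return COMMENT.sub(replace, text)
```

Passing a function as the replacement to `sub` corresponds to the e switch in the Perl version.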
This might work for you (GNU sed):
sed -E ':a;\#/\*.*\*/#{s#/[*] ?#// #;s# \*/ ?#\n#;P;D};\#/\*#{N;s#(.* ?)\n#\1\n// #;ba}' file
If the current line contains both a starting and ending comment delimiter: replace the starting delimiter by // and the ending delimiter by \n, print the first line in the pattern space, remove it and then repeat.
Otherwise, if the current line contains a starting comment delimiter: append the following line, append // to the introduced newline and repeat (see above).
If there are no comment delimiters in the current line, print the line as normal.
N.B. The use of the alternate matching delimiter \#...# and the same for substitution s#...#...#, to avoid confusion because the comment delimiters contain /'s. The markdown may be acting up with the above solution owing to the nature of the *'s in the above text. Also for formatting purposes, spaces have been pinched and added to the result as per requirement.
For example:
I have to scrape addresses from multiple websites. Sometimes an address contains a repeated country name or a repeated part of the address.
$string1="No 3, 3rd street mumbai india 3rd street";
$string2="#3 1019 GM Amsterdam Funda Real Estate BV 1019 GM Amsterdam The Netherlands";
I need to remove a repeated group of n words from the given string.
In the examples above,
$string1 contains "3rd street" twice, and the duplicate needs to be removed.
$string2 contains "1019 GM Amsterdam" twice.
The output should be:
$string1="No 3, 3rd street mumbai india";
$string2="#3 1019 GM Amsterdam Funda Real Estate BV The Netherlands";
I have tried a brute-force method. Try the following:
use warnings;
use strict;
use POSIX;
my $string1="aaa bbb aaa ccc aaa bbb";
#my $string1="fff ggg hhh ddd jjj fff ggg hhh";
#my $string2 = "fff ggg hhh ddd jjj fff ggg hhh fff ggg mmm";
my $string1_count = () = $string1=~m/\s+/g;
my $string_divide = ceil($string1_count/2);
for(my $i = $string_divide; $i > 1; $i--)
{
last if($string1 =~s/((?:\w+\s?){$i}).+\K\1//g);
}
print "$string1\n";
Just try this:
my $string1="aaa bbb aaa ccc aaa bbb";
my $string2="fff ggg hhh ddd jjj fff ggg hhh";
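my @split = split / /, $string1;
# Note: this drops repeated single words (not word groups) and sorts the rest alphabetically.
my %seen = map {$_ => 1} @split;
my @unique = keys %seen;
my $string3 = join " ", sort @unique;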
print $string3;
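Since the question asks for repeated groups of words to be removed while the remaining words keep their order, here is a word-based sketch of that idea in Python. It is one possible reading of the requirement, not a definitive solution: the name remove_repeated_groups and the min_words parameter (groups shorter than two words are left alone) are assumptions made for this example.

```python
def remove_repeated_groups(text, min_words=2):
    """Remove later copies of any word group that already occurred earlier.

    Hypothetical helper: min_words=2 is an assumption, so a single
    repeated word on its own is not treated as a duplicate.
    """
    words = text.split()
    # Try the longest possible group first so a long repeat is removed whole.
    for size in range(len(words) // 2, min_words - 1, -1):
        start = 0
        while start + 2 * size <= len(words):
            group = words[start:start + size]
            later = start + size
            while later + size <= len(words):
                if words[later:later + size] == group:
                    del words[later:later + size]  # drop the later copy
                else:
                    later += 1
            start += 1
    return " ".join(words)


print(remove_repeated_groups("No 3, 3rd street mumbai india 3rd street"))
```

Comparing whole words rather than using a regex avoids partial-word matches, at the cost of also normalizing runs of whitespace to single spaces.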
I have the following strings in a file
1. aaa bbb zccc ddd eee;
2. yyaaa bbb zccc dzdd eee; ('z' is present multiple times)
3. yyaaa bbb ccc *zddd eee; (special character '*' present)
4. yyaaa bbb ccc * zddd eee; (special character '*' present)
5. aaa bbb ccc* zddd eee; (special character '*' present)
6. aaa bbb ccc ddd eee; ('z' is absent)
Another example file
1. aaa bbb zccc ddd eee;
2. yyaaa bbb zccc dzdd eee;
3. yyaaa bbb *ccc * zddd eee;
4. yyaaa bbb * ccc zddd eee;
5. aaa bbb* ccc zddd eee;
6. aaa bbb ccc ddd eee;
In each line, I want to extract the substring from the end of aaa to the first presence of z (minus the z). If z is absent, it should print the whole string. If there are special characters it should omit them.
REQUIRED OUTPUT
bbb
bbb
bbb ccc
bbb ccc
bbb ccc
aaa bbb ccc ddd eee
I have tried the following but it doesn't give the output I am seeking
my $file = qq(test.txt);
open (my $IN, '<', $file) || die "Cannot open $file for read: $!";
my @lines=<$IN>;
close($IN);
foreach (@lines)
{
if( $_ =~ m/aaa\b(.*?)z/)
{
print "$1\n";
}
}
MY OUTPUT
bbb
bbb
bbb ccc *
bbb ccc *
bbb ccc*
I am not sure how to exclude the special character (tried character classes) and it doesn't output anything for line#6 where there is no 'z' character present.
I think this is what you want.
Note that there's no way to exclude the "special" characters in a single capture, so this must be done in two stages.
Your "required output" has fewer spaces than the corresponding input line, but you don't mention anything about that in the text, so there's no way of knowing what it is that you really want.
use strict;
use warnings 'all';
while ( <DATA> ) {
next unless /a+\s+((?:(?!\s*z).)+)/;
(my $val = $1) =~ tr/*;//d;
print $val, "\n";
}
__DATA__
1. aaa bbb zccc ddd eee;
2. yyaaa bbb zccc dzdd eee;
3. yyaaa bbb *ccc * zddd eee;
4. yyaaa bbb * ccc zddd eee;
5. aaa bbb* ccc zddd eee;
6. aaa bbb ccc ddd eee;
output
bbb
bbb
bbb ccc
bbb ccc
bbb ccc
bbb ccc ddd eee
You can use a negated character class:
if( $_ =~ m/aaa\b([^z;]*)/)
{
$string = $1;
$string =~ s/\*//g;
print "$string\n";
}
# Outputs
# bbb
# bbb
# bbb ccc
# bbb ccc
# bbb ccc
# bbb ccc ddd eee
[^z;]* Matches anything other than z or ;
$string =~ s/\*//g; substitute * in the group with nothing.
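The same two-stage approach (capture up to the first z or ;, then strip the asterisks) can be sketched in Python. The function name before_first_z is made up for this example, and collapsing the leftover whitespace is an extra step assumed here because the required output in the question has single spaces.

```python
import re


def before_first_z(line):
    """Return the text between 'aaa' and the first 'z' (or ';'), cleaned up.

    before_first_z is a hypothetical name used for this sketch.
    """
    # [^z;]* stops at the first z, or at the semicolon when there is no z.
    match = re.search(r"aaa\b([^z;]*)", line)
    if match is None:
        return None
    without_stars = match.group(1).replace("*", "")
    # Assumption: squeeze runs of whitespace left behind by the removed '*'.
    return " ".join(without_stars.split())
```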
I have a pipe-delimited file that looks like this:
34ab1 | aaa bbb ccc fff vf | 2015-01-01
35ab1 | aaa bbb ccc dddefd ddff ssss fff vi | 2015-01-01
I want to remove everything in the second field that starts with bbb and ends with fff.
I used this:
BEGIN {
FS = OFS = "|"
}
{
sub(/[0-9].*[0-9]/, "", $2); sub(/bbb.*fff/, "", $2);
print
}
The regex part for the numbers worked, but the second part of the regex didn't.
Output I want:
34ab1 | aaa vf | 2015-01-01
35ab1 | aaa vi | 2015-01-01
Use a single gsub function for both.
BEGIN {
FS = OFS = "|"
}
{
gsub(/[0-9].*[0-9]|bbb.*fff/, "", $2);
print
}
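As a cross-check of the per-field substitution, here is a small Python sketch of the same idea. It is an illustration only: the name strip_span is made up, and the trailing `\s*` is an addition assumed here so that no double space is left behind, matching the desired output shown in the question.

```python
import re

# Same alternation as the awk answer, plus \s* (an assumption for this
# sketch) to swallow the whitespace that follows the removed span.
SPAN = re.compile(r"[0-9].*[0-9]|bbb.*fff\s*")


def strip_span(line):
    """Apply the substitution to the second pipe-delimited field only.

    strip_span is a hypothetical name used for this sketch.
    """
    fields = line.split("|")
    fields[1] = SPAN.sub("", fields[1])
    return "|".join(fields)


print(strip_span("34ab1 | aaa bbb ccc fff vf | 2015-01-01"))
```

Because `.*` is greedy, the match runs to the last fff in the field, which is what the second sample line needs.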
I want to find every occurrence of a search term, e.g. ddd, in a file and output the surrounding words, like this:
File.txt
aaa bbb ccc ddd eee fff
ttt uuu iii eee ddd
ddd
ggg jjj kkk ddd lll
output
ccc ddd eee
eee ddd
ddd
kkk ddd lll
As a starting point, I am using this piece of code
#!/usr/bin/perl -w
while(<>) {
while (/ddd(\d{1,3})/g) {
print "$1\n"
}
}
You can try the following; it gives the output you want:
while(<>) {
if(/((?:\w+ )?ddd(?: \w+)?)/) {
print "$1\n";
}
}
Regex used:
( # open the grouping.
(?:\w+ )? # an optional word of at least one char followed by a space.
ddd # 'ddd'
(?: \w+)? # an optional space followed by a word of at least one char.
) # close the grouping.
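The regex above can be tried out in Python as well. This sketch is only an illustration: the function name contexts is made up, and the `\b` word boundaries are an addition assumed here so that the term does not match inside a longer word.

```python
import re


def contexts(line, term="ddd"):
    """Return each occurrence of term with its neighbouring words, if any.

    contexts is a hypothetical name used for this sketch.
    """
    # Optional previous word, the term as a whole word, optional next word.
    pattern = r"(?:\w+ )?\b" + re.escape(term) + r"\b(?: \w+)?"
    # The pattern has no capturing groups, so findall returns whole matches.
    return re.findall(pattern, line)
```

Unlike the `if` in the Perl loop, `findall` reports every non-overlapping match on a line. Two adjacent occurrences that share a neighbouring word are still reported as a single match.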
#!/usr/bin/perl -w
while (<>) {
if (/((?:[a-z]{3} )?ddd(?: [a-z]{3})?)/) {
print "$1\n";
}
}
while (<>) {
chomp;
my @words = split;
for my $i (0..$#words) {
if ($words[$i] eq 'ddd') {
print join ' ', $i > 0 ? $words[$i-1] : (), $words[$i], $i < $#words ? $words[$i+1] : ();
print "\n";
}
}
}
#!/usr/bin/perl
while (<>) {
chomp;
@F = split /\s+/;
if (/^ddd$/) {print $_."\n";next};
for ($i=0; $i<=$#F;$i++) {
if ($F[$i] eq 'ddd') {
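# Use a neighbour only when it exists ($F[-1] would wrap around to the last word).
@out = ($i > 0 ? $F[$i-1] : (), $F[$i], $i < $#F ? $F[$i+1] : ());
print "@out\n";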
}
}
}