Processing multi-line input - Loop vs Regex

I am trying to look for a specific keyword in a multi-line input string like this:
this is input line 1
this is the keyword line
this is another input line
this is the last input line
The multi-line input is stored in a variable called "$inputData". I have two ways in mind to look for the word "keyword".
Method 1:
Use split with the "\n" separator to put the lines into an array, then iterate over it and process each line in a foreach loop, like this:
my @opLines = split("\n", $inputData);
# process each line individually
foreach my $opLine ( @opLines )
{
# look for presence of "keyword" in the line
if(index($opLine, "keyword") > -1)
{
# further processing
}
}
Method 2:
Using regex, as below,
if($inputData =~ /keyword/m)
{
# further processing
}
I would like to know how these two methods compare with each other and which would be the better one with regard to actual code performance and execution time. Also, is there a better and more efficient way to go about this task?

my @opLines = split("\n", $inputData);
This creates the variable @opLines, allocates memory, searches for "\n" through the whole of $inputData and writes the lines it finds into the array.
# process each line individually
foreach my $opLine ( @opLines )
{
This runs the whole block of code for each value in the array @opLines.
# look for presence of "keyword" in the line
if(index($opLine, "keyword") > -1)
This searches for "keyword" in every line.
{
# further processing
}
}
Now compare that with
if($inputData =~ /keyword/m)
This searches for "keyword" and stops when it finds the first occurrence.
{
# further processing
}
Now guess which one will be faster and consume less memory (which affects speed as well). If you are bad at guessing, use the Benchmark module.
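If you would rather measure than guess, here is a minimal sketch using the core Benchmark module. The sample text, the sub names and the third variant (a single index call on the whole string) are my own assumptions for illustration, and unlike the loop in the question this one returns as soon as it finds a hit. Absolute numbers will depend on your Perl build and on how large the real input is.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample input; substitute your real data.
my $inputData = join "\n",
    'this is input line 1',
    'this is the keyword line',
    'this is another input line',
    'this is the last input line';

# Method 1: split into lines, then test each line with index().
sub has_keyword_loop {
    my ($text) = @_;
    for my $line ( split /\n/, $text ) {
        return 1 if index( $line, 'keyword' ) > -1;
    }
    return 0;
}

# Method 2: one regex match against the whole string.
sub has_keyword_regex {
    my ($text) = @_;
    return $text =~ /keyword/ ? 1 : 0;
}

# Extra variant: index() on the whole string, no split and no regex.
sub has_keyword_index {
    my ($text) = @_;
    return index( $text, 'keyword' ) > -1 ? 1 : 0;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    loop  => sub { has_keyword_loop($inputData) },
    regex => sub { has_keyword_regex($inputData) },
    index => sub { has_keyword_index($inputData) },
} );
```

With only four short lines the differences may be small, so it is worth rerunning this with input of a realistic size before drawing conclusions.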
According to the documentation, the m regular expression modifier means: treat the string as multiple lines. That is, change "^" and "$" from matching the start or end of line only at the left and right ends of the string to matching them anywhere within the string. I don't see either ^ or $ in your regexp, so the modifier is useless there.
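A small sketch of that point, using a made-up two-line string: without an anchor in the pattern the result is the same with or without /m, and the modifier only starts to matter once ^ (or $) appears.

```perl
use strict;
use warnings;

# Hypothetical two-line input.
my $inputData = "this is input line 1\nthis is the keyword line\n";

# No anchors in the pattern: /m makes no difference.
my $plain  = $inputData =~ /keyword/  ? 1 : 0;
my $with_m = $inputData =~ /keyword/m ? 1 : 0;

# With an anchor, /m matters: ^ may then match right after any newline,
# not only at the very start of the string.
my $anchored   = $inputData =~ /^this is the keyword/  ? 1 : 0;
my $anchored_m = $inputData =~ /^this is the keyword/m ? 1 : 0;

print "plain=$plain with_m=$with_m anchored=$anchored anchored_m=$anchored_m\n";
```

Here the anchored match fails without /m because the string as a whole starts with "this is input", while with /m the second line is allowed to satisfy ^.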

Related

Find number and replace + 1

I have a large file with a list of objects that have an incrementing page # ie
[
{page: 1},
{page: 2},
{page: 3}
]
I can find each instance of page: # with page: (\d) in vscode's ctrl+f finder. How would I replace each of these numbers with # + 1?

It can be done rather easily in vscode using one of emmet's built-in commands:
Emmet: Increment by 1
Use your regex to find all the page: \d+ in your file.
Ctrl-Shift-L to select all those occurrences.
Trigger the Emmet: Increment by 1 command.

It's not possible to perform arithmetic with regex. I use LINQPad to run this kind of small script. An example of how I would do it is in the C# program below.
void Main()
{
var basePath = @"C:\";
// Get all files with extension .cs in the directory and all its subdirectories.
foreach (var filePath in Directory.GetFiles(basePath, "*.cs", SearchOption.AllDirectories))
{
// Read the content of the file.
var fileContent = File.ReadAllText(filePath);
// Replace the content by using a named capture group.
// The named capture group allows one to work with only a part of the regex match.
var replacedContent = Regex.Replace(fileContent, @"page: (?<number>[0-9]+)", match => $"page: {int.Parse(match.Groups["number"].Value) + 1}");
// Write the replaced content back to the file.
File.WriteAllText(filePath, replacedContent);
}
}
I also took the liberty of changing your regex to the one below.
page: (?<number>[0-9]+)
page: matches with "page: " literally.
(?<number> is the start of a named capture group called number. We can then use this group during replacement.
[0-9]+ matches a digit between 0 and 9 one or more times. This is more specific than using \d, as \d also matches other digit characters.
The + makes it match more than one digit, allowing for the number 10 and onwards.
) is the end of the named capture group.

You could do that in Ruby as follows.
FileIn = "in"
FileOut = "out"
First let's construct a sample file (containing 37 characters).
File.write FileIn, "[\n{page: 1},\n{page: 2},\n{page: 33}\n]\n"
#=> 37
We may now read the input file FileIn, convert it and write it to a new file FileOut.
File.write(FileOut, File.read(FileIn).
gsub(/\{page: (\d+)\}/) { "{page: #{$1.next}}" })
Let's look at what's been written.
puts File.read(FileOut)
[
{page: 2},
{page: 3},
{page: 34}
]
I've gulped the entire file, made the changes in memory and spit out the modified file. If the original file were large this could be easily modified to read from and write to the files line-by-line.
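For the line-by-line variant mentioned above, a rough sketch in Perl could look like this. The sub name increment_pages is made up for illustration, and in-memory filehandles stand in for the real input and output files so the example is self-contained.

```perl
use strict;
use warnings;

# Add 1 to every "page: N" on one line and return the new line.
sub increment_pages {
    my ($line) = @_;
    # /e evaluates the replacement as Perl code; /g handles every match.
    $line =~ s/(page: )(\d+)/$1 . ( $2 + 1 )/ge;
    return $line;
}

# In-memory filehandles stand in for real input and output files here.
my $in_text  = "[\n{page: 1},\n{page: 2},\n{page: 33}\n]\n";
my $out_text = '';
open my $in,  '<', \$in_text  or die "Cannot read input: $!";
open my $out, '>', \$out_text or die "Cannot write output: $!";

# Only one line is held in memory at a time.
while ( my $line = <$in> ) {
    print {$out} increment_pages($line);
}
close $in;
close $out or die "Cannot close output: $!";

print $out_text;
```

To use it on real files, the two scalar references would be replaced by file names; the loop itself stays the same.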

Adding another answer as it is significantly different from the others. I wrote an extension, Find and Transform, which makes it easy to do math in a find and replace within a file.
In this case with this keybinding (in your keybindings.json file):
{
"key": "alt+r", // whatever keybinding you want
"command": "findInCurrentFile",
"args": {
"find": "page: (\\d)",
"replace": "page: $${ return $1 + 1 }$$",
"isRegex": true
}
}
[That could also be a setting in your settings.json file if you wish with slightly different syntax of course.]
The $${ return $1 + 1 }$$ represents a javascript operation. Here 1 will be added to capture group 1 from the find regex.
Within the $${ ... }$$ almost any javascript operation can be inserted. There are many examples in the repo.

How to check multiline text with regex after first match and only before the second one

I'm trying to find a particular log entry in a pretty big log file (let's say 250 MB).
Every single log entry starts with
YYYY-MM-DD time:
Next comes some single-line or multi-line text that I want to match.
Finally it ends with a newline and a new DateTime pattern.
The question is how to match the text inside a log entry if it is multi-line, and only up to the next entry.
The order of the values to match is unknown, as is the line they appear on.
I have tried the following solution:
grep -Pzio '^(\d{4}-\d{2}-\d{2} timePattern)(?=[\s\S]*?Value1)(?=[\s\S]*?Value2)(?=[\s\S]*?Value3)[\s\S]*?(?=(\n\1|\Z))' file.log
But it either exceeds the PCRE limit, even with the non-greedy [\s\S]*?, or it starts at an earlier entry that does not match and swallows many other entries in [\s\S]* until it finally finds all three values, so it just gives me back a huge chunk of text.
So I think the only difficulty here is the multi-line matching.
I will appreciate any help!
EDIT 0: I need to find only one log that has all the values that I'm trying to match.
EDIT 1: Example
2018-02-09 03:52:46,347 Activity=SomeAct
#Request=<S:Body><S:RQ><S:Info><S:Key><S:First>Value1</S:First><S:Second>Value2</S:Second></S:Key></S:Info></S:RQ></S:Body>
#Response=<SOAP-ENV:Body><S:RS><S:StatusCode>FAILURE</S:StatusCode></S:RS></SOAP-ENV:Body>
2018-02-09 03:52:51,377 Activity=SomeAct
#Request=<S:Body><S:RQ><S:Info><S:Key><S:First>Value1</S:First><S:Second>Value2</S:Second></S:Key></S:Info></S:RQ></S:Body>
#Response=<SOAP-ENV:Body><S:RS><S:StatusCode>SUCCESSFUL</S:StatusCode></S:RS></SOAP-ENV:Body>
2018-02-09 03:52:52,112 Activity=SomeAct
#Response=<SOAP-ENV:Body><S:RS><S:StatusCode>FAILURE</S:StatusCode></S:RS></SOAP-ENV:Body>
#Request=<S:Body><S:RQ><S:Info><S:Key><S:First>Value1</S:First><S:Second>Value3</S:Second></S:Key></S:Info></S:RQ></S:Body>
I need to get only the record with Value1 and Value2 in SUCCESSFUL status. But the response does not necessarily come after the request, <First> does not necessarily come before <Second>, and RS/RQ are not necessarily single lines.

It's not really clear what you want to find but a common approach is to use Awk with a custom record separator so that a record can be multiple lines. Or you can collect the records manually:
awk '/^YYYY-MM-DD time: / { if (seen1 && seen2 && seen3) print rec;
seen1 = seen2 = seen3 = 0; rec = "" }
{ rec = (rec ? rec "\n" $0 : $0) }
/Value1/ { seen1++ }
/Value2/ { seen2++ }
/Value3/ { seen3++ }
END { if (seen1 && seen2 && seen3) print rec; }' file
This collects into rec the lines we have seen since the previous separator, and when we see a new separator, we print the previous value from rec before starting over if all the "seen" flags are set, indicating that we have matched all the regexes with the text in the current rec.
A common omission is forgetting to also do this in the END block, when we reach the end of the file.
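The same collect-and-flush idea can be sketched in Perl. This is only an illustration under a few assumptions of mine: a record is taken to start at any line beginning with a YYYY-MM-DD date, the values are matched as plain substrings (so Value1 would also match Value10), and the sub name and the simplified sample log are made up.

```perl
use strict;
use warnings;

# Return every log record that contains all of the given substrings.
# A record starts at a line beginning with a YYYY-MM-DD date (an assumption
# based on the sample in the question).
sub records_with_all {
    my ( $fh, @needles ) = @_;
    my ( @found, $rec );

    # Keep the current record only if no needle is missing from it.
    my $flush = sub {
        return unless defined $rec;
        push @found, $rec
            unless grep { index( $rec, $_ ) < 0 } @needles;
    };

    while ( my $line = <$fh> ) {
        if ( $line =~ /^\d{4}-\d{2}-\d{2} / ) {
            $flush->();    # a new record starts, so judge the previous one
            $rec = '';
        }
        $rec .= $line if defined $rec;    # ignore anything before the first date
    }
    $flush->();            # do not forget the last record
    return @found;
}

# Simplified, made-up log in the shape of the question's example.
my $log = <<'LOG';
2018-02-09 03:52:46,347 Activity=SomeAct
Request: Value1 Value2
Response: FAILURE
2018-02-09 03:52:51,377 Activity=SomeAct
Request: Value1 Value2
Response: SUCCESSFUL
2018-02-09 03:52:52,112 Activity=SomeAct
Response: FAILURE
Request: Value1 Value3
LOG

open my $fh, '<', \$log or die "Cannot open in-memory log: $!";
my @hits = records_with_all( $fh, 'Value1', 'Value2', 'SUCCESSFUL' );
print @hits;
```

Because the file is read one line at a time and only the current record is kept, memory use stays small even for a 250 MB log.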

Stopping regex at the first match, it shows two times

I am writing a perl script and I have a simple regex to capture a line from a data file. That line starts with IG-XL Version:, followed by the data, so my regex matches that line.
if($row =~/IG-XL Version:\s(.*)\;/)
{
print $1, "\n";
}
Let's say $1 prints out 9.0.0. That's my desired outcome. However, another part of the same data file has an identical IG-XL Version: line, so 9.0.0 now gets printed twice.
I only want to match the first one so that I get just one value. I have tried /IG-XL Version:\s(.*?)\;/, the most commonly suggested solution (adding a ? to make it .*?), but it still outputs two. Any help?
EDIT:
The value of $row is:
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
The desired value I want is 8.00.01_uflx (P7) which I did get, but two times.

The only way to do this while reading the file line by line is to keep a status flag that records whether you have already found that pattern. But if you are storing the data in a hash, as you were in your previous question, then it won't matter, as you will just overwrite the hash element with the same value.
if ( $row =~ /IG-XL Version:\s*([^;]+)/ and not $seen_igxl_vn ) {
print $1, "\n";
$seen_igxl_vn = 1;
}
Or, if the file is reasonably small, you could read the whole thing into memory and search for just the first occurrence of each item.
I suggest you post a question showing your complete program, your input data, and your required output, so that we can give you a complete solution rather than seeing your problem bit by bit.
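As a rough sketch of the read-everything-then-match-once approach: the sub name and the sample text below are assumptions for illustration. For a real file the text could be slurped with something like my $data = do { local $/; <$fh> };.

```perl
use strict;
use warnings;

# Return the first IG-XL version found in the text, or undef if none.
sub first_igxl_version {
    my ($text) = @_;
    # Without /g, a match stops at the first occurrence.
    return $text =~ /IG-XL Version:\s*([^;]+)/ ? $1 : undef;
}

# Hypothetical file contents, based on the sample in the question.
my $data = <<'DATA';
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
some other line
Current IG-XL Version: 8.00.01_uflx (P7); Build: 11.10.12.01.31
DATA

my $version = first_igxl_version($data);
print "$version\n" if defined $version;
```

Since the match runs once against the whole text, the second IG-XL line is never looked at and no flag is needed.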

Perl - Regexp to manipulate .csv

I've got a function in Perl that reads the last modified .csv in a folder and parses its values into variables.
I'm having some problems with the regular expressions.
My .csv looks like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second lines (the title and the column names from the .csv) be ignored;
The file sometimes comes with 2-5 empty lines at the end, as I show in my sample (ignore the dot at the end of it, it doesn't exist in the file).
How can I do this?

When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};

open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while (my $line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle CSV format

1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;
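Putting both checks into one complete loop might look like the sketch below. It uses parse_line from the core Text::ParseWords module, which is what the question's code appears to call; the in-memory sample data and the @rows array are my own additions for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical stand-in for the newest .csv file; note the blank lines at the end.
my $csv_text = <<'CSV';
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"


CSV

open my $file, '<', \$csv_text or die "Cannot open in-memory data: $!";

my @rows;
while ( my $line = <$file> ) {
    chomp $line;
    next if $. <= 2;             # title line and column-name line
    next if $line =~ /^\s*$/;    # blank or whitespace-only lines
    my @fields = parse_line( ',', 0, $line );
    push @rows, \@fields;
}
close $file;

print scalar(@rows), " data rows\n";
print join( ', ', @$_[ 4, 5, 8 ] ), "\n" for @rows;
```

The $. check relies on the title and header always being the first two lines; if that is not guaranteed, matching on the content of the line is the safer test.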

You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}\/\d{2}\/\d{4}/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation mark.
\d{2} - Followed by two digits.
\/ - Followed by a slash (escaped, because / also delimits the pattern).
\d{2} - Followed by two more digits.
\/ - Followed by another slash.
\d{4} - Followed by four more digits.
This should describe the first part of your line, which is an opening quote followed by a date such as 04/12/2014. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, and are one of the reasons why Perl has such a reputation of being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions are an extremely powerful tool, and worth the effort to learn. And with some experience, you'll be able to easily decode them.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}\/\d{2}\/\d{4}/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a false value. The or states that if open did not return a true value, the code that follows is executed, which kills your program.
So, look through your program, find any place where you assume that something works as expected, and think about what happens if it doesn't. Then, add checks in your program to do something if you get that exception. It could be that you want to report the error or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. Whatever you do, check for possible errors (especially from user input) and handle them.
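As a small illustration of checking instead of assuming, here is a sketch with two such checks. The helper name split_country_name, the sample values and the file path are hypothetical; the point is only that each step reports failure and the caller decides what to do about it.

```perl
use strict;
use warnings;

# Split "COUNTRY - NAME" and report failure instead of assuming success.
# Returns ($country, $name), or an empty list if the format is unexpected.
sub split_country_name {
    my ($country_name) = @_;
    return unless defined $country_name;
    if ( $country_name =~ /^(.+?) - (.*)$/ ) {
        return ( $1, $2 );
    }
    return;
}

for my $value ( 'ARG - NAME 1', 'garbage' ) {
    my ( $country, $name ) = split_country_name($value);
    if ( !defined $country ) {
        # Here the choice is to report the problem and carry on.
        warn qq(Skipping unexpected value "$value"\n);
        next;
    }
    print "$country, $name\n";
}

# open returns false on failure and sets $! to the reason.
my $missing = '/no/such/dir/no_such_file.csv';    # hypothetical path
if ( open my $fh, '<', $missing ) {
    close $fh;
}
else {
    warn qq(Cannot open "$missing" for reading: $!\n);
}
```

Whether a failure should warn and continue, as here, or die outright depends on how much of the remaining work still makes sense without that input.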
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.
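A tiny sketch of what those warn statements produce. The __WARN__ handler and the @seen array are not needed for debugging; I added them only so the messages can be inspected afterwards, and the sample line is made up.

```perl
use strict;
use warnings;

# Optional: collect every warning as well as printing it to STDERR.
my @seen;
local $SIG{__WARN__} = sub { push @seen, $_[0]; print STDERR $_[0] };

my $line = '"04/12/2014 11:00:00","3600","1"';

# No trailing newline: Perl appends " at SCRIPT line N." to the message.
warn qq(line is $line);

# Trailing newline: the message is printed exactly as written.
warn qq(line is $line\n);
```

Note that the quotation marks inside $line cause no trouble, which is the advantage of qq(...) over plain double quotes mentioned above.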

Does using multiline in logstash filter print out the data?

I am trying to use multiline to combine a number of lines in a logfile that have the same starting symbol. In my case the starting symbol is #S#. It would look something like this:
#S# dsifj sdfojosf sfjosdfoisdjf
#S# dsfj sdojifoig dfpkgokdfgk 89s7fsjlk sdf
#S# lsdffm dg;;dfgl djfg 930`e`fsd
...
...
...
Note: The random characters are just used to imitate the content of the actual log.
The following is what I wrote for the multiline statement:
multiline {
type => "table_init"
pattern => "#S#"
negate => true
what => "next"
}
I am assuming that what I wrote does combine them into one line, but I am wondering whether this prints out the line or whether I need to use grok to parse the entire line before it prints. Any thoughts and input will be helpful. Thank you.

If you are trying to match up all lines that DO match "#S#", then you should have negate set to false. You use negate when you want to get all lines that DO NOT match a certain pattern.
As for your actual question, multiline takes all the relevant lines and puts them into the "message" field, including newline characters (\n, and I assume \r if you are running Windows as well though I have never checked). You can then grok this entire message to get the data you want.
So if you set up your output like so:
output { stdout { codec => rubydebug } }
You should find that the outputted message will read something like:
"message" => "#S# dsifj sdfojosf sfjosdfoisdjf\n#S# dsfj sdojifoig dfpkgokdfgk 89s7fsjlk sdf\n#S# lsdffm dg;;dfgl djfg 930`e`fsd"
if you set up your multiline filter correctly.
Hope this helps!
