Keep all lines of a list with identical beginning (Notepad++) - list

From a list, how to keep all occurrences of those lines only whose "first part or beginning" (defined from the beginning of the line to the ^ character) is present in other lines? (The pattern of lines in the list: beginning-of-line^rest_of_line_012345)
The type of characters, length, etc. after the ^ is irrelevant (but needs to be kept). Every line has only one (1) ^ character. The "beginning" string that determines identity must be present in the same (analogous) position in other lines (i.e., from the beginning of the line to ^, and must be exact match). (Lines contain characters that trouble regex, such as \/()*., so these need to be summarily escaped.)
For example: Original list:
abc^123
0xyz^xxx
aaa-123^123
aaa-12^0xyz
0xyz^098
00xyz^098
0xyz^x111xx
Keep all occurrences of lines with identical first part:
0xyz^xxx
0xyz^098
0xyz^x111xx
This elegant script by @Lars Fischer, ((.*)\R(\2\R?)+)*\K.* (after pre-sorting), keeps all occurrences of duplicate lines, but it considers the entire line (it was designed to do so).
In this Q, I am looking for a solution that considers only the "beginning" of the line to see if it occurs more than once, and if yes, then keep the entire line. Any guidance?

Note: in this solution the characters # and % are used based on the assumption that these characters do not show up ANYWHERE in the file to begin with. If that's not the case for you, just use different patterns that you know don't show up anywhere in the file, such as ##### and %%%%%.
Start by sorting the file Lexicographically with Notepad++ by going to Edit -> Line Operations -> Sort Lines Lexicographically Ascending
Do a regex Find-and-Replace (UNcheck the box for ". matches newline"):
Find what:
^(.*?)\^[^\r\n]+[\r\n]+(\1\^.*?[\r\n]+)*\1\^.*?$
Replace with:
#$&%
Now do another regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
%.*?#
Replace with:
\r\n
Finally, do one last regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
^.*?#|%.*
Replace with nothing.

You said in comments that a perl script is OK for you.
#!/usr/bin/perl
use Modern::Perl;
my %values;
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file': $!";
while(<$fh>) {
    chomp;
    # get the prefix value
    my ($prefix) = split('\^', $_);
    # push the whole line onto the array stored in the hash under its prefix
    push @{$values{$prefix}}, $_;
}
foreach (keys %values) {
    # skip the prefixes that have only one line
    next if scalar @{$values{$_}} == 1;
    local $" = "\n";
    say "@{$values{$_}}";
}
Output:
0xyz^xxx
0xyz^098
0xyz^x111xx
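If the original line order matters, or the prefixes contain regex metacharacters, a two-pass variant may be easier to reason about. The following is only a sketch: the helper names prefix_of and keep_shared_prefix are made up for illustration, and the prefix is taken with index/substr so nothing needs escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: the text before the first '^' (or the whole line
# if there is no '^'). Uses index/substr, so no regex escaping is needed.
sub prefix_of {
    my ($line) = @_;
    my $i = index($line, '^');
    return $i < 0 ? $line : substr($line, 0, $i);
}

# Hypothetical helper: return the lines whose prefix occurs on more than
# one line, in their original order.
sub keep_shared_prefix {
    my @lines = @_;
    my %count;
    $count{ prefix_of($_) }++ for @lines;                 # pass 1: count prefixes
    return grep { $count{ prefix_of($_) } > 1 } @lines;   # pass 2: filter
}

# Sample data taken from the question.
my @sample = qw(abc^123 0xyz^xxx aaa-123^123 aaa-12^0xyz 0xyz^098 00xyz^098 0xyz^x111xx);
print "$_\n" for keep_shared_prefix(@sample);
```

Unlike the hash-of-arrays version, this does not depend on hash ordering when several prefixes repeat, at the cost of computing each prefix twice.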

grep a pattern and return all characters before and after another specific character bash

I'm interested in searching a variable inside a log file, in case the search returns something then I wish for all entries before the variable until the character '{' is met and after the pattern until the character '}' is met.
To be more precise let's take the following example:
something something {
entry 1
entry 2
name foo
entry 3
entry 4
}
something something test
test1 test2
test3 test4
In this case I would search for 'name foo' which will be stored in a variable (which I create before in a separate part) and the expected output would be:
{
entry 1
entry 2
name foo
entry 3
entry 4
}
I tried finding something on grep, awk or sed. I was able to only come up with options for finding the pattern and then return all lines until '}' is met, however I can't find a suitable solution for the lines before the pattern.
I found a regex in Perl that could be used but I'm not able to use the variable, in case I switch the variable with 'foo' then I will have output.
grep -Poz '.*(?s)\{[^}]*name\tfoo.*?\}'
The regex is quite simple, once the whole file is read into a variable
use warnings;
use strict;
use feature 'say';
die "Usage: $0 filename\n" if not @ARGV;
my $file_content = do { local $/; <> }; # "slurp" file with given name
my $target = qr{name foo};
# braces are escaped so that newer perls accept them as literals
while ( $file_content =~ /(\{ .*? $target .*? \})/gsx ) {
    say $1;
}
Since we undefine the input record separator inside the do block using local, the following read via the null filehandle <> pulls the whole file at once, as a string ("slurps" it). That string is returned by the do block and assigned to the variable. The <> reads from the file(s) named in @ARGV, that is, whatever was submitted on the command line at the program's invocation.
In the regex pattern, the ? after the quantifier makes .* non-greedy, so it matches only up to the first occurrence of the next subpattern: after { the .*? matches up to the first (evaluated) $target, then the $target is matched, then .*? matches everything up to the first }. All that is captured by the enclosing () and is thus later available in $1.
The /s modifier makes . match newlines, which it normally doesn't, and which is necessary in order to match patterns that span multiple lines. With the /g modifier it keeps going through the string searching for all such matches. With /x whitespace in the pattern is ignored, so we can spread out the pattern for readability (even over lines, and use comments!).
The $target is compiled as a proper regex pattern using the qr operator.
See regex tutorial perlretut, and then there's the full reference perlre.
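Since the question asks for the search text to come from a variable, here is a hedged variation on the approach above. It is a sketch, not the answer's exact method: the helper name blocks_containing is made up, \Q...\E is used so the variable is matched literally, and .*? is swapped for [^{}]* so a match cannot start in one block and end in another.

```perl
use strict;
use warnings;

# Hypothetical helper: return every {...} block of $text that contains
# the literal string $needle.
sub blocks_containing {
    my ($text, $needle) = @_;
    my @blocks;
    # [^{}]* cannot cross a brace, so each match stays inside one block;
    # \Q...\E quotes the needle, including its spaces under /x.
    while ( $text =~ / ( \{ [^{}]* \Q$needle\E [^{}]* \} ) /gx ) {
        push @blocks, $1;
    }
    return @blocks;
}

# Sample log modelled on the question.
my $log = "something something {\nentry 1\nentry 2\nname foo\nentry 3\nentry 4\n}\n"
        . "something something test\ntest1 test2\n";
my $wanted = 'name foo';    # would come from elsewhere in a real script
print "$_\n" for blocks_containing($log, $wanted);
```

Like the Awk answer, this does not handle nested braces.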
Here's an Awk attempt which tries to read between the lines to articulate an actual requirement. What I'm guessing you are trying to say is that "if there is an opening brace, print all content between it and the closing brace in case of a match inside the braces. Otherwise, just print the matching line."
We accomplish this by creating a state variable in Awk which keeps track of whether you are in a brace context or not. This simple implementation will not handle nested braces correctly; if that's your requirement, maybe post a new and better question with your actual requirements.
awk -v search="foo" 'n { context[++n] = $0 }
/{/ { delete context; n=0; matched=0; context[++n] = $0 }
/}/ && n { if (matched) for (i=1; i<=n; i++) print context[i];
delete context; n=0 }
$0 ~ search { if(n) matched=1; else print }' file
The variable n is the number of lines in the collected array context; when it is zero, we are not in a context between braces. If we find a match and are collecting lines into context, defer printing until we have collected the whole context. Otherwise, just print the current line.

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple space, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here), so it would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
    push @positions, $-[0];
}
my @matches;
for my $pos (@positions) {
    pos($text) = $pos;
    push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, we also remove the need for two different regex patterns to agree on what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
    push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
    my $word = $1;
    push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
[1] For example, a proper tokenizer could be used.
In the absence of more information, I can only suggest the pos function
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and displays the character found, if any
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the file and leave information about only the very last match
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a  b  c  d  e  f  g  h  i  j  k  l  m  n';
#          0123456789012345678901234567890123456789
#                    1         2         3
for ( 4, 31, 16, 22 ) {
    pos($str) = $_;
    say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i
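For the word-hopping use case in the question, pos can be combined with the \G anchor and the /c modifier. The sketch below is an illustration under assumptions: the helper name words_from is invented, a "word" is taken to be any run of non-whitespace, and the text is passed by reference so the large string is not copied.

```perl
use strict;
use warnings;

# Hypothetical helper: starting at offset $start, return [offset, word]
# pairs for each run of non-whitespace in the string behind $text_ref.
sub words_from {
    my ($text_ref, $start) = @_;
    my @found;
    pos($$text_ref) = $start // 0;        # where the next m//g begins
    # \G anchors at pos(); /c keeps pos() intact when the match fails
    while ( $$text_ref =~ /\G\s*(\S+)/gc ) {
        push @found, [ $-[1], $1 ];       # $-[1] is where group 1 started
    }
    return @found;
}

# Made-up sample text with mixed whitespace.
my $text = "alpha  beta\tgamma\n\"delta\"";
printf "%2d: %s\n", @$_ for words_from(\$text, 5);
```

A start offset that falls inside a word yields the remainder of that word as the first result, so callers would normally pass an offset that lies on a boundary.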

Deleting a line with a pattern unless another pattern is found?

I have a very messy data file, that can look something like this
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
I would like to delete all the lines with "=" only in them, but keep the line if TOTAL is also in the line.
My code is as follows:
for my $file (glob '*.csv') {
    open my $in, '<', $file;
    my @lines;
    while (<$in>) {
        next if /===/; #THIS IS THE PROBLEM
        push @lines, $_;
    }
    close $in;
    open my $out, '>', $file;
    print $out $_ for @lines;
    close $out;
}
I was wondering if there was a way to do this in perl with regular expressions. I was thinking something like letting "TOTAL" be condition 1 and "===" be condition 2. Then, perhaps if both conditions are satisfied, the script leaves the line alone, but if only one or zero are fulfilled, then the line is deleted?
Thanks in advance!
You need \A or ^ to check whether the string starts with = or not. Put an anchor in the regex, like:
next if /^===/;
or, if only = is going to exist, then:
next if /^=+/;
It will skip all the lines beginning with =. The + matches one or more occurrences of the previous token.
Edit:
Then you should use a negative lookbehind, like
next if /(?<!TOTAL)===/;
This will ensure that === is not immediately preceded by TOTAL.
As any number of characters may occur between TOTAL and ===, I suggest you use two regexes to ensure the string contains === but doesn't contain TOTAL, like:
next if (($_ =~ /===/) && ($_ !~ /TOTAL/));
You can use a negative lookbehind assertion:
next if /(?<!TOTAL)===/;
This matches === when it is NOT immediately preceded by TOTAL.
As a general rule, you should avoid making your regexes more complicated. Compressing too many things into a single regex may seem clever, but it makes it harder to understand and thus debug.
So why not just do a compound condition?
E.g. like this:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
    next if ( m/====/ and not m/TOTAL/ );
    push @lines, $_;
}
print $_ for @lines;
__DATA__
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
This will skip any line containing ====, as long as it doesn't contain TOTAL, and it doesn't need advanced regex features, which I assure you will get your maintenance programmers cursing you.
Your current regex will pick up anything that contains the string === anywhere in the string.
Hello=== Match
===goodbye Match
======= Match
foo======bar Match
=== Match
= No Match
Hello== No Match
========= Match
If you wanted to ensure it picks up only strings made up of = signs then you would need to anchor to the start and the end of the line and account for any number of = signs. The regex that will work will be as follows:
next if /^=+$/;
Each symbol's meaning:
^ The start of the string
= A literal "=" sign
+ One or more of the previous
$ The end of the string
This will pick up a string of any length from the start of the string to the end of the string made up of only = signs.
Hello=== No Match
===goodbye No Match
======= Match
foo======bar No Match
=== Match
= Match
Hello== No Match
========= Match
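One way to check which strings the anchored pattern accepts is to run it over a few samples. This is a small sketch; the helper name only_equals is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line consists solely of '=' characters.
# Note that $ also matches just before a trailing newline, so lines that
# have not been chomped are handled too.
sub only_equals {
    my ($line) = @_;
    return $line =~ /^=+$/ ? 1 : 0;
}

for my $s ('Hello===', '===goodbye', '=======', 'foo======bar', '===', '=', 'Hello==') {
    printf "%-14s %s\n", $s, only_equals($s) ? 'Match' : 'No Match';
}
```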
I suggest you read up on Perl's regexes and what each symbol means; they can be a very powerful tool if you know what's going on.
http://perldoc.perl.org/perlre.html#Regular-Expressions
EDIT:
If you want to skip a line on matching both TOTAL and the = then just put in 2 checks:
next if(/TOTAL/ and /=+/)
This can probably be done with a single line of regex. But why bother making it complicated and less readable?

Matching the end of line $ in perl; print showing different behavior with chomp

I am reading a file and matching a regex for lines with a hex number at the start, followed by a few dot-separated hex values, followed by an optional array name which may contain an optional index. For example:
010c10 00000000.00000000.0000a000.02300000 myFooArray[0]
while (my $rdLine = <RDHANDLE>) {
chomp $rdLine;
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$/) {
...
My source file containing these hex strings is also script generated. This match works fine for some files but other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition.
If I modify the condition to not have the end $, lines match as expected.
Another curious thing is for debugging this, I added a print statement like this:
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+/) {
print "Hey first part matched for $rdLine \n";
}
if ($rdLine =~ m/^([0-9a-z]+)[ \t]+([0-9.a-z]+)/) {
print "Hey second part matched for $rdLine \n";
}
The output on the terminal for the following input eats the first character :
010000 00000000 foo
"ey first part matched for 010000 00000000 foo
ey second part matched for 010000 00000000 foo"
If I remove the chomp, it prints the Hey correctly instead of just ey.
Any clues appreciated!
"other files produced thru the exact same script (ie no extra spaces, formats etc) do not match when the last $ is present on the match condition"
Although you deny it, I am certain that your file contains a single space character directly before the end of the line. You should check by using Data::Dump to display the true contents of each file record. Like this
use Data::Dump;
dd \$read_line;
It is probably best to use
$read_line =~ s/\s+\z//;
in place of chomp. That will remove all spaces and tabs, as well as line endings like carriage-return and linefeed from the end of each line.
"If I remove the chomp, it prints the Hey correctly instead of just ey."
It looks like you are working on a Linux machine, processing a file that was generated on a Windows platform. Windows uses the two characters CR LF as a record separator, whereas Linux uses just LF, so a chomp removes just the trailing LF, leaving CR to cause the start of the string to be overwritten.
If it wasn't for your secondary problem of having trailing whitespace, the best solution here would be to replace chomp $read_line with $read_line =~ s/\R\z//. The \R escape matches the Unicode idea of a line break sequence, and was introduced in version 10 of Perl 5. However, the aforementioned s/\s+\z// will deal with your line endings as well, and should be all that you need.
Borodin is right, \r\n is the culprit.
I used a less elegant solution, but it works:
$rdLine =~ s/\r//g;
followed by:
chomp $rdLine;
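The effect is easy to reproduce in isolation. The sketch below assumes a record with a Windows-style CR LF ending (the sample string is made up from the question's input) and uses a hypothetical helper, strip_eol, that wraps the s/\s+\z// suggestion.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): remove every kind of
# trailing whitespace, which covers "\n", "\r\n", spaces and tabs.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return $line;
}

# The pattern from the question, unchanged.
my $pattern = qr/^([0-9a-z]+)[ \t]+([0-9.a-z]+)[ \t]*([A-Za-z_0-9]*)\[*[0-9]*\]*$/;

# Assumed sample record with a CR LF line ending.
my $dos_line = "010000 00000000 foo\r\n";

my $chomped = $dos_line;
chomp $chomped;                          # removes "\n" only, the "\r" stays
my $stripped = strip_eol($dos_line);

print "after chomp:     ", ( $chomped  =~ $pattern ? "match" : "no match" ), "\n";
print "after strip_eol: ", ( $stripped =~ $pattern ? "match" : "no match" ), "\n";
```

The leftover "\r" is also what makes the terminal jump back to the start of the line, so the text printed after $rdLine overwrites the first character of the debug message.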

Gene expression data in hashes

I have two data files: one contains gene expression data, the other genome annotation data. I have to compare values in columns 1 and 2 of one file and if 1 > 2 then output that line as well as the refseq id found on the same line of the annotation data file.
So far I have opened both files for reading:
#!usr/bin/perl
use strict;
use warnings;
open (my $deg, "<", "/data/deg/DEG_list.txt") or die $!;
open (my $af "<", "/data/deg/Affy_annotation.txt") or die $!;
# I want to store data in hash
my %data;
while (my $records = <$deg>) {
chomp($records);
# the first line is labels so we want to skip this
if($records =~ /^A-Z/) {
next;
else {
my @columns = split("/\s/", $records);
if ($columns[2] > $columns[1]) {
print $records;
}
}
}
I want to print the line every time this happens, but I also want to print the gene id which is found in the other data file. I'm not sure how to do this, plus the code I have now is not working, in that it doesn't just print the line.
Besides your missing parentheses here and there, your problem is probably your regex
if($records =~ /^A-Z/) {
This looks for lines that begin with this literal string, e.g. A-Zfoobar, and not, as you might be thinking, any string beginning with a capital letter. You probably want:
if($records =~ /^[A-Z]/) {
The square brackets denote a character class with a range inside.
You should also know that split /\s/, ... splits on a single whitespace, which may not be what you want, in that it creates empty fields for every extra whitespace you have. Unless you explicitly want to split on a single whitespace, you probably want
split ' ', $records;
Which will split on multiple consecutive whitespace, and strip leading whitespace.
Two main problems in the code
if($records =~ /^A-Z/) ...
if you want to detect letters at the beginning of a line, you had better use
if($records =~ /^[a-z]/i) ... # starting with any letter
if($records =~ /^[A-Z]/) ... # starting with a capital letter
And in
my @columns = split("/\s/", $records);
the regex here is a string (since it is quoted); to have a regex, remove the quotes
my @columns = split(/\s/, $records);
but if you want to split fields even if there is more than one space, use
my @columns = split(/\s+/, $records);
instead.
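To see how the split forms mentioned in these answers differ, it can help to count the fields each one produces. This is a sketch on a made-up record (the real file layout is not shown in the question), and the helper name split_three_ways is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record in the three ways discussed and
# return the resulting field lists as array references.
sub split_three_ways {
    my ($record) = @_;
    return (
        [ split /\s/,  $record ],   # every single whitespace character separates
        [ split /\s+/, $record ],   # runs collapse, a leading empty field remains
        [ split ' ',   $record ],   # runs collapse, leading whitespace is skipped
    );
}

# Made-up record with leading whitespace and uneven spacing.
my ($single, $runs, $awk_style) = split_three_ways("  probe_1   12.5\t 7.25");
printf "%-6s %d fields\n", '/\s/',  scalar @$single;
printf "%-6s %d fields\n", '/\s+/', scalar @$runs;
printf "%-6s %d fields\n", "' '",   scalar @$awk_style;
```

For this record the counts come out as 8, 4 and 3, which is why column indices such as $columns[1] only line up reliably with the ' ' form when the input has irregular spacing.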