Gene expression data in hashes - regex

I have two data files: one contains gene expression data, the other genome annotation data. I have to compare values in columns 1 and 2 of one file and if 1 > 2 then output that line as well as the refseq id found on the same line of the annotation data file.
So far I have opened both files for reading:
#!usr/bin/perl
use strict;
use warnings;
open (my $deg, "<", "/data/deg/DEG_list.txt") or die $!;
open (my $af "<", "/data/deg/Affy_annotation.txt") or die $!;
# I want to store data in hash
my %data;
while (my $records = <$deg>) {
chomp($records);
# the first line is labels so we want to skip this
if($records =~ /^A-Z/) {
next;
else {
my @columns = split("/\s/", $records);
if ($columns[2] > $columns[1]) {
print $records;
}
}
}
I want to print the line every time this happens, but I also want to print the gene id which is found in the other data file. I'm not sure how to do this, plus the code I have now is not working, in that it doesn't just print the line.

Besides your missing parentheses here and there, your problem is probably your regex
if($records =~ /^A-Z/) {
This looks for lines that begin with this literal string, e.g. A-Zfoobar, and not, as you might be thinking, any string beginning with a capital letter. You probably want:
if($records =~ /^[A-Z]/) {
The square brackets denote a character class with a range inside.
You should also know that split /\s/, ... splits on a single whitespace character, which may not be what you want, because it creates an empty field for every extra whitespace character you have. Unless you explicitly want to split on a single whitespace character, you probably want
split ' ', $records;
This splits on runs of consecutive whitespace and strips leading whitespace.
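A minimal sketch of the difference between the two forms. The sample record and the variable names are made up for illustration; they are not taken from the asker's files.

```perl
use strict;
use warnings;

# Hypothetical record with one leading space and a doubled space in the middle.
my $record = " probe_1  2.5 1.0";

my @single = split /\s/, $record;   # every whitespace character is a separator
my @awk    = split ' ',  $record;   # awk-style: runs of whitespace, leading ones dropped

print scalar(@single), " fields: ", join('|', @single), "\n";
print scalar(@awk),    " fields: ", join('|', @awk),    "\n";
```

With this input the first form yields five fields (two of them empty) and the second yields three, so column indexes only line up as expected with the second form.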

There are two main problems in the code. The first is
if($records =~ /^A-Z/) ...
If you want to detect letters at the beginning of a line, you had better use one of
if($records =~ /^[a-z]/i) ... # starting with any letter
if($records =~ /^[A-Z]/) ... # starting with a capital letter
The second is in
my @columns = split("/\s/", $records);
Here the regex is a string (since it is quoted). To get a regex, remove the quotes:
my @columns = split(/\s/, $records);
But if you want to split fields even when there is more than one space between them, use
my @columns = split(/\s+/, $records);
instead.
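Neither answer covers the second half of the question, looking up the RefSeq id for each printed line. A rough sketch of one way to do it with a hash follows. It rests on assumptions that the question does not confirm: both files are tab-separated, both have a header line, both share a probe id in the first column, and the RefSeq id is the second column of the annotation file. The in-memory strings stand in for the real files, and the comparison follows the posted code ($columns[2] > $columns[1]).

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two files; the column layout is assumed.
my $annotation_text = "ProbeID\tRefSeq\np1\tNM_0001\np2\tNM_0002\n";
my $deg_text        = "ProbeID\tCond1\tCond2\np1\t1.5\t3.0\np2\t4.0\t2.0\n";

# Pass 1: map probe id => RefSeq id.
open my $af, '<', \$annotation_text or die $!;
my %refseq_for;
while (my $line = <$af>) {
    chomp $line;
    next if $. == 1;                         # skip the header line
    my ($probe, $refseq) = split /\t/, $line;
    $refseq_for{$probe} = $refseq;
}
close $af;

# Pass 2: keep expression lines that pass the test and append the RefSeq id.
open my $deg, '<', \$deg_text or die $!;
my @hits;
while (my $line = <$deg>) {
    chomp $line;
    next if $. == 1;                         # skip the header line
    my @columns = split /\t/, $line;
    next unless $columns[2] > $columns[1];
    push @hits, join "\t", $line, $refseq_for{ $columns[0] } // 'NA';
}
close $deg;

print "$_\n" for @hits;
```

If the two files are instead aligned line by line with no shared id, reading one line from each handle per iteration would be the closer fit.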

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple spaces, tabs, newlines, etc.). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (OK, I'm using the term 'word' a bit generously here), so it would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
    push @positions, $-[0];
}
my @matches;
for my $pos (@positions) {
    pos($text) = $pos;
    push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, this also removes the need for two different regex patterns to agree on what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
    push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
    my $word = $1;
    push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
[1] For example, a proper tokenizer could be used.
In the absence of more information, I can only suggest the pos function.
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string.
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and the program displays the character found, if any.
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the string and leave information about only the very last match.
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a  b  c  d  e  f  g  h  i  j  k  l  m  n';
#          0123456789012345678901234567890123456789
#                    1         2         3
for ( 4, 31, 16, 22 ) {
    pos($str) = $_;
    say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i
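For the word-hopping use case in the question, pos can be combined with the \G anchor and the /c modifier so that each search begins exactly where the previous one stopped. The following is only a sketch: the sample text, the starting offset, and the choice of whitespace and double quotes as delimiters are assumptions for illustration.

```perl
use strict;
use warnings;

my $text  = qq{alpha "beta"\tgamma\n delta};
my $start = 6;                             # hypothetical offset to resume from

my @words;
pos($text) = $start;
while (1) {
    $text =~ /\G[\s"]+/gc;                 # skip delimiters; /c keeps pos on failure
    last unless $text =~ /\G([^\s"]+)/gc;  # capture the next word, if any
    push @words, [ $1, $-[1] ];            # the word and its starting offset
}

printf "%-6s at %d\n", @$_ for @words;
```

Because the string is never copied or truncated, this avoids the repeated substr calls the question was worried about, and the delimiter class can be extended without touching the loop.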

Deleting a line with a pattern unless another pattern is found?

I have a very messy data file, that can look something like this
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
I would like to delete all the lines with "=" only in them, but keep the line if TOTAL is also in the line.
My code is as follows:
for my $file (glob '*.csv') {
    open my $in, '<', $file;
    my @lines;
    while (<$in>) {
        next if /===/; #THIS IS THE PROBLEM
        push @lines, $_;
    }
    close $in;
    open my $out, '>', $file;
    print $out $_ for @lines;
    close $out;
}
I was wondering if there was a way to do this in perl with regular expressions. I was thinking something like letting "TOTAL" be condition 1 and "===" be condition 2. Then, perhaps if both conditions are satisfied, the script leaves the line alone, but if only one or zero are fulfilled, then the line is deleted?
Thanks in advance!
You need \A or ^ to check whether the string starts with =. Put an anchor in the regex, like:
next if /^===/;
or, if only = is going to exist, then:
next if /^=+/;
This will skip all the lines beginning with =. The + matches one or more occurrences of the previous token.
Edit:
Then you should use a negative lookbehind, like
next if /(?<!TOTAL)===/;
This ensures that === is not immediately preceded by TOTAL.
Since any number of characters may occur between TOTAL and ===, I suggest you use two regexes to ensure that the string contains === but doesn't contain TOTAL, like:
next if (($_ =~ /===/) && ($_ !~ /TOTAL/));
You can use a negative lookbehind assertion:
next if /(?<!TOTAL)===/;
This matches === when it is NOT immediately preceded by TOTAL.
As a general rule, you should avoid making your regexes more complicated. Compressing too many things into a single regex may seem clever, but it makes it harder to understand and thus debug.
So why not just do a compound condition?
E.g. like this:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
    next if ( m/===/ and not m/TOTAL/ );
    push @lines, $_;
}
print $_ for @lines;
__DATA__
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
This will skip any line containing ===, as long as it doesn't also contain TOTAL. It also doesn't need advanced regex features, which I assure you will get your maintenance programmers cursing you.
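To compare the lookbehind suggested earlier with the compound condition, here is a small check against the sample data from the question. The array names are made up for illustration; each array holds the lines that the corresponding test would delete.

```perl
use strict;
use warnings;

my @sample = (
    "========",
    "Line 1",
    "dfa====dsfdas==",
    "Line 2",
    "df as TOTAL ============",
);

# Lines each test would delete.
my @by_lookbehind = grep { /(?<!TOTAL)===/ }   @sample;
my @by_compound   = grep { /===/ && !/TOTAL/ } @sample;

print "lookbehind deletes: $_\n" for @by_lookbehind;
print "compound deletes:   $_\n" for @by_compound;
```

The lookbehind only inspects the five characters directly before ===, so on the TOTAL line it still finds a run of = signs that is preceded by a space or by another = sign and deletes that line too. The compound condition keeps it.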
Your current regex will pick up anything that contains the string === anywhere in the string.
Hello===        Match
===goodbye      Match
=======         Match
foo======bar    Match
===             Match
=               No Match
Hello==         No Match
=========       Match
If you wanted to ensure it picks up only strings made up of = signs then you would need to anchor to the start and the end of the line and account for any number of = signs. The regex that will work will be as follows:
next if /^=+$/;
Each symbol's meaning:
^ The start of the string
= A literal "=" sign
+ One or more of the previous
$ The end of the string
This will pick up a string of any length from the start of the string to the end of the string made up of only = signs.
Hello===        No Match
===goodbye      No Match
=======         Match
foo======bar    No Match
===             Match
=               Match
Hello==         No Match
=========       Match
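The anchored pattern can be checked directly against the same test strings. This is just a quick sketch; the hash name is made up for illustration.

```perl
use strict;
use warnings;

my @cases = ('Hello===', '===goodbye', '=======', 'foo======bar',
             '===', '=', 'Hello==', '=========');

# Record, for each test string, whether it consists solely of = signs.
my %only_equals;
$only_equals{$_} = /^=+$/ ? 1 : 0 for @cases;

printf "%-14s %s\n", $_, $only_equals{$_} ? 'Match' : 'No Match' for @cases;
```

Note that $ also matches just before a trailing newline, so the pattern works on lines read from a file without chomp.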
I suggest you read up on Perl's regexes and what each symbol means. They can be a very powerful tool if you know what's going on.
http://perldoc.perl.org/perlre.html#Regular-Expressions
EDIT:
If you want to skip a line on matching both TOTAL and the = then just put in 2 checks:
next if(/TOTAL/ and /=+/)
This can probably be done with a single line of regex. But why bother making it complicated and less readable?

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one, because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
print $line;
}
else {
print "Non-english\n";
}
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically, the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator.
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
    chomp;
    my @fields = split /<SEP>/;
    say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
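Also, for ASCII-only matching the word character class \w matches exactly [a-zA-Z0-9_] (and \W matches the complement). On Unicode strings \w also matches accented letters such as è unless the /a modifier is used. With that caveat, you can rewrite your if statement like this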
if ( $line =~ /\W/ ) {
    print "Non-English\n";
}
else {
    print $line;
}
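Since song titles contain spaces and punctuation, which \W also matches, another option is to test for printable ASCII explicitly. This is a sketch under the assumption that "English characters" means the printable ASCII range; is_ascii_title is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the whole title is printable ASCII
# (space through tilde), false otherwise, including for an empty title.
sub is_ascii_title {
    my ($title) = @_;
    return $title =~ /\A[\x20-\x7E]+\z/ ? 1 : 0;
}

# The second title contains e-grave, written as an escape to keep this file ASCII.
my @titles = ('Ai Wo Qing shut up', "Ach\x{e8}te-moi");

for my $title (@titles) {
    print is_ascii_title($title) ? "$title\n" : "Non-English\n";
}
```

The \A and \z anchors matter here: without them the pattern could succeed on a short ASCII run inside an otherwise non-ASCII title.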

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there will also be instances where a comma is immediately followed by a space and then another word, as that is the correct way to use commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
                                                     ^        ^
                                                     |        |
                                        note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have a tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
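A quick sketch of how this pattern behaves on ordinary prose versus one of the animal lists. The variable names are made up, the prose sentence is shortened from the question's example, and a non-capturing group is used so that a list-context match returns whole matches.

```perl
use strict;
use warnings;

# Same pattern as above, with a non-capturing group.
my $list_re = qr/[^,\s]+(?:,[^,\s]+)+/;

my $prose   = "Saturn was long thought to be the only ringed planet, however, this is now known.";
my $animals = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig";

my @from_prose   = $prose =~ /$list_re/g;
my @from_animals = map { split /,/ } $animals =~ /$list_re/g;

print scalar(@from_prose), " list(s) found in prose\n";
print join('|', @from_animals), "\n";
```

Every comma in the prose sentence is followed by a space, so nothing matches there, while the animal list is picked up whole and then split on commas.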
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}
print join("\n",@matches),"\n";
Though you can probably construct a single regex, a combination of regexes, split, grep and map reads decently:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Keep only the elements that have no comma at either end but have one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts, e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my @matches = map {
    $_ =~ /$regex/;
    if ($1) {
        my $comma_sep_list = $1;
        [split ',', $comma_sep_list];
    }
    else {
        []
    }
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
    push (@arr, $&);
}
The regular expression matches three cases:
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig

Perl - Search for specific string in csv and extract characters that immediately follow it

I have a csv file that has 2 columns: an ID column and a free-text column. The ID column contains a 16-character alphanumeric id, but it may not be the only data present in the cell: the cell may be blank, or contain only the 16-character id, or contain a bunch of stuff along with the following buried in it - "user_id=xxxxxxxxxxxxxxxx"
What I want is to somehow extract the 16-character id from whichever cells have it. So I need to:
(a) ignore blank cells
(b) extract the whole cell's content if all it has is a continuous 16-character string with no spaces in between
(c) look for the pattern "user_id=" and then extract the 16 characters that immediately follow it
I see a lot of Perl scripts for either pattern matching or find/replace string etc., but I am not sure how I can do different kinds of parsing/pattern searching and extraction one after the other on the same column. As you may have already realized, I am fairly new to Perl.
I understand that you want to (1) skip lines that contain nothing, or that fail to match your spec. (2) Capture 16 non-space characters if they are the only content of the cell. (3) Capture 16 non-space characters following the literal pattern "user_id=".
If it's ok to capture space characters too, if they follow a "user_id=" literal, you can change \S to . in the appropriate place.
My solution uses Text::CSV to handle the details of dealing with a CSV file. Here's how you might do it:
use strict;
use warnings;
use autodie;
use open ':encoding(utf8)';
use utf8;
use feature 'unicode_strings';
use Text::CSV;
binmode STDOUT, ':utf8';
my $csv = Text::CSV->new( {binary => 1} )
or die "Cannot use CSV: " . Text::CSV->error_diag;
while( my $row = $csv->getline( \*DATA ) ) {
    my $column = $row->[0];
    if( $column =~ m/^(\S{16})$/ || $column =~ m/user_id=(\S{16})/ ) {
        print $1, "\n";
    }
}
__DATA__
abcdefghijklmnop
user_id=abcdefghijklmnop
abcd fghij lmnop
randomdatAuser_id=abcdefghijklmnopMorerandomdata
user_id=abcd fghij lmnop
randomdatAuser_id=abcd fghij lmnopMorerandomdata
In your own code you would not be using the DATA filehandle, but I assume you know how to open a file already.
CSV is a format that is deceptively simple. Don't confuse its high readability with parsing simplicity though. When dealing with CSV, it's best to use a well-proven module to extract the columns. Other solutions can fail on quote-embedded commas, escaped commas, unbalanced quotes, and other irregularities that our brain fixes for us on the fly, but that make a pure-regex solution fragile.
Well I can set you up with a basic file and regex commands that might do what you need (in a sorta basic format for someone not familiar with perl):
use strict;
use warnings;
open FILE, "<:utf8", "myfile.csv" or die "Cannot open myfile.csv: $!";
# "slurp" the file into an array, each element is a line
my @lines = <FILE>;
close FILE;
my @idArray;
foreach my $line (@lines) {
    # make two captures; the first we can ignore, and both are optional
    next unless $line =~ /^(user_id=|)([A-Za-z0-9]{16}|),/;
    # for display purposes, this is just the second captured group
    my $id = $2;
    # if the group actually has something in it, add it to your final array
    if ($id) { push @idArray, $id; }
}
For example, in the following sample only lines 2 and 3 are valid, because the first cell (column 1) must either
be a string that is exactly 16 characters long, or
have the form "user=16charshere".
Anything else is not valid.
use 5.014;
use warnings;
while(<DATA>) {
    chomp;
    my($col1, @remainder) = split /\t/;
    say $2 if $col1 =~ m/^(|user=)(.{16})$/;
}
__DATA__
ToShort	col2	not_valid
a123456789012345	col2	valid
user=b123456789012345	col2	valid
TooLongStringHereSoNotValidOne	col2	not_valid
In this example the columns are TAB separated.
Please provide (a) some example data which can be used for testing solutions and (b) please try supplying code you have written so far for this problem.
However, you will probably want to go through all rows of your table, split each one into fields, perform all your operations on a certain field, apply your business logic, and then write everything back.
Problem (c) is solved by $idField =~ /user_id=(.{16})/; my $id = $1;
If the user_id always appears at the beginning of a line, this does the trick: for (<FILE>) {/^user_id=(.{16})/; ...}
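Putting cases (a), (b) and (c) from the question together, the per-cell logic could be sketched as a small function. extract_id is a hypothetical helper name, and restricting the id to letters and digits is an assumption based on the question's "16-character alphanumeric id". Splitting the CSV into cells is deliberately left out here and is best done with a CSV module.

```perl
use strict;
use warnings;

# Hypothetical helper covering cases (a), (b) and (c) for a single cell.
sub extract_id {
    my ($cell) = @_;
    return unless defined $cell && length $cell;          # (a) blank cell
    return $1 if $cell =~ /\A([A-Za-z0-9]{16})\z/;        # (b) the id on its own
    return $1 if $cell =~ /user_id=([A-Za-z0-9]{16})/;    # (c) id after user_id=
    return;                                               # nothing usable
}

# Made-up sample cells for illustration.
for my $cell ('', 'abcdefghijklmnop', 'xx user_id=ABCDEFGH12345678 yy', 'too short') {
    my $id = extract_id($cell);
    printf "%-34s => %s\n", "'$cell'", defined $id ? $id : '(none)';
}
```

Trying the checks one after another keeps each pattern simple, and a cell that fits none of them yields undef in scalar context rather than a stale capture.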