Remove up to _ in Perl using regex?

How would I go about removing all characters before a "_" in Perl? So if I had the string "124312412_hithere", it would be replaced with just "hithere". I imagine there is a very simple way to do this using a regex, but I am still new to regexes, so I need help here.

Remove all characters up to and including "_":
s/^[^_]*_//;
Remove all characters before "_":
s/^[^_]*(?=_)//;
Remove all characters before "_" (assuming the presence of a "_"):
s/^[^_]*//;
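A minimal sketch comparing the substitutions above on the sample string from the question. The variable names are illustrative, and the /r modifier (Perl 5.14 or later) is used only so that each result is returned as a copy instead of modifying the input.

```perl
use strict;
use warnings;

my $input = "124312412_hithere";

# Up to and including the first "_".
my $without_sep = $input =~ s/^[^_]*_//r;

# Everything before the first "_", keeping the "_" itself.
my $with_sep = $input =~ s/^[^_]*(?=_)//r;

# With no "_" in the string the lookahead fails, so nothing is removed.
my $no_sep = "hithere" =~ s/^[^_]*(?=_)//r;

# Without the lookahead, a string that has no "_" is emptied completely.
my $emptied = "hithere" =~ s/^[^_]*//r;

print "$without_sep\n";   # hithere
print "$with_sep\n";      # _hithere
print "$no_sep\n";        # hithere
print "[$emptied]\n";     # []
```

The last case is why the third variant is only safe when a "_" is known to be present.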

This is a bit more verbose than it needs to be, but it is probably more useful for seeing what's going on:
my $astring = "124312412_hithere";
my $find = "^[^_]*_";
my $replace = "_";
$astring =~ s/$find/$replace/;
print $astring;
Also, the requirements in your question conflict a bit. If you just want hithere (without the leading _), change the substitution to:
$astring =~ s/$find//;

I know it's slightly different than what was asked, but in cases like this (where you KNOW the character you are looking for exists in the string) I prefer to use split:
$str = '124312412_hithere';
$str = (split (/_/, $str, 2))[1];
Here I am splitting the string into parts, using the '_' as a delimiter, but to a maximum of 2 parts. Then, I am assigning the second part back to $str.
There's still a regex in this solution (the /_/) but I think this is a much simpler solution to read and understand than regexes full of character classes, conditional matches, etc.
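A short sketch of how the limit argument behaves, under the assumption that the input may contain more than one "_" or none at all. The sample strings here are made up for illustration.

```perl
use strict;
use warnings;

my $str = '124312412_hi_there';

# Limit of 2: split only at the first "_", so later underscores stay in the tail.
my $tail = (split /_/, $str, 2)[1];
print "$tail\n";   # hi_there

# No "_" at all: there is no second field, so the slice yields undef.
my $missing = (split /_/, 'hithere', 2)[1];
print defined $missing ? $missing : '(undef)', "\n";   # (undef)
```

The undef in the second case is the reason this approach is best kept for input where the delimiter is known to exist.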

You can try this:
$_ = "124312412_hithere";
s/^[^_]*_//;
print $_; # hithere
Note that this also removes the _ (which is what I infer from your sample output). If you want to keep the _ (your first sentence leaves it unclear which you want), you would need to use a look-ahead as in @ikegami's answer.
Also, to make it a little clearer: substitution and matching operate on $_ by default, so you don't need to bind them to $_ explicitly. That is implied.
So, s/^[^_]*_//; is essentially the same as $_ =~ s/^[^_]*_//;, but the latter form is not required.
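As a small illustration of the implicit $_ binding, a for loop aliases each element to $_, so a bare s/// edits the array in place. The array contents below are invented for the example.

```perl
use strict;
use warnings;

my @ids = ("124312412_hithere", "99_there");

# Each element is aliased to $_, so the unbound s/// modifies it directly.
for (@ids) {
    s/^[^_]*_//;
}

print "@ids\n";   # hithere there
```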

Related

How can I remove the last comma from a string in Perl

I have a string coming in from raw data. I can't guarantee whether or not it will have an extra trailing comma. I thought I might be able to remove it like this:
$value = "cat, dog, fish, ";
$value =~ s/,//r;
Sadly that doesn't work. Of course I could do a loop to check the last char of the string one by one, but I would like to learn how to do it with the Regex backslash method.
Can someone help me please?
Try this
$value =~ s/,\s*$//;
The pattern ,\s*$ matches a comma (,) followed by zero or more space-chars (\s*), followed by the end of the line/input ($).
s/,// removes the first comma. So,
$value = reverse(reverse($value) =~ s/,//r);
Not sure why you are specifying /r in your code but not using the return value. If in fact you are using it, add it back.
s/.*\K,//
Ah, if the string might not have an unwanted trailing comma, this won't work; it always deletes the last comma. Use Bart's answer in that case.
The accepted answer removes a comma followed by zero or more white space characters at the end of a string. But you asked about removing the last comma. Either is consistent with your example, but if you really want to remove the last comma, one way is:
$value =~ s/,([^,]*$)/$1/
This will, for example, change "foo,bar,baz" to "foo,barbaz", and in your example "cat, dog, fish, " to "cat, dog, fish " (leaving the trailing space).
The reverse trick in choruba's answer also works.
If nothing else, this shows the importance of a precise problem statement.
Using a positive lookahead:
$value =~ s/,(?=[^,]*\z)//;
I suggest this pattern: ,*\s*$. It matches any run of commas (if present) followed by any whitespace (if present) at the end of the string.
A full example:
use 5.18.2;
use strict;
use warnings;
my $data = "cat, dog, fish,,,,,,,,,,,,, ";
$data =~ s/,*\s*$//;
print $data;
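The difference between "remove a trailing comma" and "remove the last comma" only shows up on input without a trailing comma. A hedged sketch using two of the patterns from the answers above; the second test string is an assumption added for contrast, and /r is used so the inputs stay unchanged.

```perl
use strict;
use warnings;

my $value = "cat, dog, fish, ";

# Trailing comma plus optional whitespace at the end of the string.
my $trimmed = $value =~ s/,\s*$//r;               # "cat, dog, fish"

# Last comma wherever it is, keeping what follows it.
my $last_removed = $value =~ s/,(?=[^,]*\z)//r;   # "cat, dog, fish "

# On input without a trailing comma the two patterns behave differently.
my $clean   = "cat, dog, fish";
my $kept    = $clean =~ s/,\s*$//r;               # unchanged
my $damaged = $clean =~ s/,(?=[^,]*\z)//r;        # "cat, dog fish"

print "[$trimmed]\n[$last_removed]\n[$kept]\n[$damaged]\n";
```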

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
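A runnable sketch of both approaches on a shortened version of the example line (using the value 0.39 from the question text). The hash version assumes every field has the form key=value; a field without "=" would shift the key/value pairing.

```perl
use strict;
use warnings;

my $info = 'AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.39';

# One value: the capture stops at the next ";" or at the end of the string.
my ($eur_af) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;

# The (?:^|;) prefix keeps a key such as XEUR_AF from matching by accident.
my ($anchored) = 'XEUR_AF=1;EUR_AF=0' =~ /(?:^|;)EUR_AF=([^;]*)/;

# All values: alternate keys and values become a hash.
my %rec = split /[=;]/, $info;

print "$eur_af $anchored $rec{AFR_AF}\n";   # 0.39 0 0.11
```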
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - match only at a position preceded by EUR_AF=
\d+(\.\d+)? - one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The captured value ($1) will be 0 or 0.39; $INFO itself is not modified.

QRegex look ahead/look behind

I have been pondering on this for quite awhile and still can't figure it out. The regex look ahead/behinds. Anyway, I'm not sure which to use in my situation, I am still having trouble grasping the concept. Let me give you an example.
I have a name....My Business,LLC (Milwaukee,WI)_12345678_12345678
What I want to do is if there is a comma in the name, no matter how many, remove it. At the same time, if there is not a comma in the name, still read the line. The one-liner I have is listed below.
s/(.*?)(_)(\d+_)(\d+$)/$1$2$3$4/gi;
I want to remove any comma from $1 (My Business,LLC (Milwaukee,WI)). I could spell out the comma in the regex as a literal, as in ((.?),(.?),(.*?)(_.*?$)), if it were this EXACT situation every time; however, it is not.
I want it to omit commas and match 'My Business, LLC_12345678_12345678' or just 'My Business_12345678_12345678', even though there is no comma.
In any situation I want it to match the line, comma or not, and remove any commas(if any) no matter how many or where.
If someone can help me understand this concept, it will be a breakthrough!!
Use Perl's /e modifier, which evaluates the replacement part of s/// as code, so you can call a function there:
$str = 'My Business,LLC (Milwaukee,WI)_12345678_12345678';
## modified your regex as well using lookahead
$str =~ s/(.*?)(?=_\d+_\d+$)/funct($1)/ge;
print $str;
sub funct {
    my $val = shift;
    ## replace , with the empty string; use anything you want here
    $val =~ s/,//g;
    return $val;
}
Using funct($1) in substitute you are basically calling the funct() function with parameter $1
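For a replacement this small, the helper function can also be inlined. A sketch under the assumption that Perl 5.14 or later is available, since it relies on tr///r returning a modified copy of $1; the second sample line is the comma-free case from the question.

```perl
use strict;
use warnings;

my @lines = (
    'My Business,LLC (Milwaukee,WI)_12345678_12345678',
    'My Business_12345678_12345678',
);

for my $line (@lines) {
    # /e evaluates the replacement as Perl code;
    # tr{,}{}dr returns a copy of $1 with every comma deleted.
    $line =~ s/^(.*?)(?=_\d+_\d+$)/ $1 =~ tr{,}{}dr /e;
    print "$line\n";
}
```

Lines without a comma still match and are simply replaced with an identical name part.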

Better way to remove specific characters from a Perl string

I have dynamically generated strings like ###!efq#!#!, and I want to remove specific characters from the string using Perl.
Currently I am doing something like this (replacing the characters with nothing):
$varTemp =~ s/['\$','\@','\#','\~','\!','\&','\*','\(','\)','\[','\]','\;','\.','\,','\:','\?','\^',' ', '\`','\\','\/']//g;
Is there a better way of doing this? I am looking for something clean.
You've misunderstood how character classes are used:
$varTemp =~ s/[\$\@#~!&*()\[\];.,:?^ `\\\/]+//g;
does the same as your regex (assuming you didn't mean to remove ' characters from your strings).
Edit: The + allows several of those "special characters" to match at once, so it should also be faster.
You could use tr instead:
$p =~ tr/fo//d;
will delete every f and every o from $p. In your case it should be:
$p =~ tr/\$@#~!&*()[];.,:?^ `\\\///d;
See Perl's tr documentation.
tr/SEARCHLIST/REPLACEMENTLIST/cdsr
Transliterates all occurrences of the characters found (or not found if the /c modifier is specified) in the search list with the positionally corresponding character in the replacement list, possibly deleting some, depending on the modifiers specified.
[…]
If the /d modifier is specified, any characters specified by SEARCHLIST not found in REPLACEMENTLIST are deleted.
With a character class this big it is easier to say what you want to keep. A caret in the first position of a character class inverts its sense, so you can write
$varTemp =~ s/[^"%'+\-0-9<=>a-z_{|}]+//gi
or, using the more efficient tr
$varTemp =~ tr/"%'+\-0-9<=>A-Z_a-z{|}//cd
See the tr documentation in perlop.
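A brief sketch contrasting the deny-list and allow-list forms of tr. The input string and the choice of allowed characters (letters, digits, underscore) are assumptions made for the example.

```perl
use strict;
use warnings;

my $raw = '#!efq#!#! name_01';

# Deny-list: delete exactly the listed characters.
(my $denied = $raw) =~ tr/#! //d;

# Allow-list: /c complements the set, /d deletes everything in the complement.
(my $allowed = $raw) =~ tr/A-Za-z0-9_//cd;

print "$denied\n";    # efqname_01
print "$allowed\n";   # efqname_01
```

The allow-list form keeps working when a new unwanted character shows up in the data, whereas the deny-list has to be extended by hand.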
Well, if you're using the randomly generated string because it has a low probability of matching any real text you would normally find in the data, then you probably want one such string per file.
You take that string and call it, say, $place_holder. Then, when you want to eliminate the text, you call quotemeta on it and use the result in the substitution:
my $subs = quotemeta $place_holder;
s/$subs//g;
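To show why the quotemeta step matters, here is a small sketch with a hypothetical placeholder that happens to contain regex metacharacters. Both the placeholder and the text are invented for illustration.

```perl
use strict;
use warnings;

my $place_holder = '[x]+';     # hypothetical marker containing metacharacters
my $text         = 'x[x]+x';

# Interpolated raw, the marker is read as a pattern: one or more "x".
(my $wrong = $text) =~ s/$place_holder//g;

# quotemeta backslash-escapes the metacharacters so the marker matches literally.
my $subs = quotemeta $place_holder;
(my $right = $text) =~ s/$subs//g;

# \Q...\E applies the same escaping inline.
(my $inline = $text) =~ s/\Q$place_holder\E//g;

print "$wrong\n";    # []+
print "$right\n";    # xx
print "$inline\n";   # xx
```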

Regex to match anything but more than two spaces

I'm trying to create a regex to add quotes to some values in a string. Basically the string would be like this:
999 date Doe, John E. London 123456789
And I want to surround the name so that if this file is exported to a csv, it won't be separated. This is what I have so far
$line =~ s/([^\s{2,}]*,[^\s{2,}]*)/"$1"/g;
I think it should find any comma and anything near it until it finds two or more spaces but it's not working. Thanks for the help.
You asked for anything except 2 or more spaces.
I agree that unpack is the more natural way to do this. But split is a way to use a cookie-cutter in the shape of a pattern. Anything not in that pattern is a return field. So this:
@fields = split /\h{2,}/, $line;
$line = join(" " x 2 => map { "($_)" } @fields);
might be enough.
If this is fixed-width data (and my guess is that it is), it is better to use unpack (or plain old substr, etc.) than regular expressions.
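A minimal sketch of the unpack approach. The column widths (5, 10, 20, 10) are purely hypothetical, since the real layout is not given in the question, and the sample line is built with sprintf so that it matches those assumed widths.

```perl
use strict;
use warnings;

# Build a sample record with the assumed column widths.
my $line = sprintf '%-5s%-10s%-20s%-10s%s',
    '999', 'date', 'Doe, John E.', 'London', '123456789';

# "A<n>" reads n characters and strips trailing spaces; "A*" takes the rest.
my @fields = unpack 'A5 A10 A20 A10 A*', $line;

# Quote only the fields that contain a comma before joining as CSV.
my $csv = join ',', map { /,/ ? qq("$_") : $_ } @fields;

print "$csv\n";   # 999,date,"Doe, John E.",London,123456789
```

Because the fields are cut by position, a comma in any other column gets quoted as well, which avoids the "Bronx, New York" problem that pattern-based approaches run into.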
[] contains the set of characters that are allowed at that position; a two-space sequence isn't a single character, so it can't be expressed inside a character class.
Maybe:
$line =~ s/ (.*? .*?) / "$1" /g;
You'll probably need to be more explicit about the format to avoid matches against ' '.
$line =~ s/ (\w+?, [\w ]+?.) / "$1" /g;
To avoid repeating the space in the replacement, look-around assertions could be used, which could also fix the issue of items at the beginning and end of the line:
$line =~ s/(?<=^| )(\w+?, [\w ]+?.)(?=$| )/"$1"/g;
Also be careful of your original format - are you sure it isn't just column aligned? (In which case a sufficiently long name or date might not allow 2+ spaces between columns).
Try this:
s/.* \K(.*),(.*?) /"$1,$2" /
Logically, this means: Find a substring between two spaces and a comma, where the two spaces are as far right as possible, and then a substring between that comma and two spaces, where the substring is as short as possible.
Your approach can work too, if you get the syntax for negative lookaheads right.
The sample text you supplied seems to be separated either by tabs or spaces (column aligned?). It is important to know which, or the regexp will not work. It is also important to know whether the pattern is consistent throughout the file.
If it is aligned by columns, the easiest and probably safest way is to simply count off characters. E.g.:
s/(^.{20})(\S*) /$1"$2"/;
(You will have to adjust the number 20 yourself. I just approximated.)
Note that I am chopping off two spaces at the end of the name field in a reckless manner. This is to not screw up the format for the following values. If however the field is filled to the brim, there might not be two spaces at the end, and the regexp will miss. But then, on the other hand, you would not be able to fit quotes there anyway.
When dealing with these types of files, I do not think it is safe to use generic searches. If you are counting on commas to only appear in the names, sooner or later you will find someone who thought that "Bronx, New York" should be in the city field, and your regexp will be screwed.
A somewhat more strict, but complicated regexp would include the previous fields:
$date='\d{2}-\d{2}-\d{2}'; # this might work for dates such as 11-10-23
s/^(\d+\s+$date\s+)(\S+) /$1"$2"/;
Same thing here, if the name field is not big enough to fit two quotes, it won't be added. You should check your file and see if that is ever the case. If it is the case, you will need to deal with it somehow.
I sometimes find that putting the regexps for certain fields in separate variables helps with legibility, as with $date above.
Good luck!