Wild card matching - regex

I need to match sentences that contain both of the special characters / and . in the same sentence. How can I do it with Perl?
Suppose my file has the following sentences:
ttterfasghti.
ddseghies/affag
hhail/afgsh.
asfsdgagh/
adterhjc/sgsagh.
My expected output should be :
hhail/afgsh.
adterhjc/sgsagh.

Given a clarification from a comment:
Any order, but the matching line should contain both / and .
An easy way:
perl -wne'print if m{/} and m{\.}' filename
This is inefficient in the sense that it starts the regex engine twice and scans each string twice. However, in most cases that is unnoticeable, while this code is much clearer than a single regex for the task.
I use {} delimiters so as not to have to escape the /, in which case the m in front is compulsory. Then I use the same m{...} on the other pattern for consistency.
A most welcome inquiry asks that this be done in a script, not a one-liner! Happy to oblige.
use warnings;
use strict;
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    print if m{/} and m{\.};
}
close $fh;
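
For comparison, here is a minimal sketch of the "single regex" alternative mentioned above, next to the two-match version. The sub names has_both_simple and has_both_single are made up for this illustration; the single-regex form uses two lookaheads anchored at the start of the line, so the order of / and . does not matter.

```perl
use strict;
use warnings;

# Two plain matches: scans the line twice, but is easy to read.
sub has_both_simple {
    my ($line) = @_;
    return $line =~ m{/} && $line =~ m{\.} ? 1 : 0;
}

# One regex: each lookahead checks for one of the characters
# anywhere in the line, without consuming anything.
sub has_both_single {
    my ($line) = @_;
    return $line =~ m{^(?=.*/)(?=.*\.)}s ? 1 : 0;
}

# The sample lines from the question.
my @lines = ('ttterfasghti.', 'ddseghies/affag', 'hhail/afgsh.',
             'asfsdgagh/', 'adterhjc/sgsagh.');

for my $line (@lines) {
    print "$line\n" if has_both_single($line);
}
```

Both subs should agree on every input; the lookahead version is harder to read, which is the trade-off described above.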

This feels like a duplicate, but I just can't find a good previous question for this.
For / there are two ways:
use m// operator with different separator characters, e.g. m,<regex with />,, m{<regex with />}, or
escape it, i.e. /\//
For . use escaping.
Note that inside a character class ([...]) many special characters no longer need escaping.
Hence we get:
$ perl <dummy.txt -ne 'print "$1: $_" if m,(\w+/\w*\.),'
hhail/afgsh.: hhail/afgsh.
adterhjc/sgsagh.: adterhjc/sgsagh.
i.e. the line is printed if it contains one or more word characters, followed by a /, zero or more word characters, and then a literal dot.
Recommended reading: perlrequick, perlretut and perlre.
UPDATE after OP clarified the requirement in a comment:
$ perl <dummy.txt -ne 'print if m,/, && m{\.}'
hhail/afgsh.
adterhjc/sgsagh.
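
As a small sketch of the character-class remark above: inside [...] the dot is literal, so [.] can stand in for \. in the first pattern. The sub name first_word_slash_dot is hypothetical. Note that a character class does not help with the delimiter itself, because Perl finds the closing delimiter before it parses the regex, so m{...} is still used here to keep the / unescaped.

```perl
use strict;
use warnings;

# Returns the first "word/word." piece of the line, or undef if
# there is none. [.] is a literal dot, equivalent to \. here.
sub first_word_slash_dot {
    my ($line) = @_;
    return $line =~ m{(\w+/\w*[.])} ? $1 : undef;
}

for my $line ('ddseghies/affag', 'hhail/afgsh.', 'asfsdgagh/') {
    my $hit = first_word_slash_dot($line);
    print "$hit: $line\n" if defined $hit;
}
```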

Related

Perl, use regex to find a match and replace just the last character of the match (in this case a line break)

I have to clean several CSV files before I put them in a database. Some of the files have an unexpected line break in the middle of a line. As each line should always end with a number, I managed to fix the files with this one-liner:
perl -pe 's/[^0-9]\r?\n//g'
While it did work, it also removes the last character before the line break:
foob
ar
turns into
fooar
Is there any Perl one-liner that I can call that would follow the same rule without removing the last character before the line break?
A negative lookbehind, which is an assertion and does not consume characters, can also be used:
(?<!\d)\R
\d is shorthand for a digit
\R matches any linebreak sequence
See this demo at regex101
One way is to use \K, which works like a variable-length lookbehind:
perl -pe 's/[^0-9]\K\r?\n//g'
It keeps everything matched up to \K out of the match, so only what follows \K is subject to the replacement.
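
A minimal sketch comparing the \K form with the negative lookbehind form on an in-memory string. The sub names and the sample data are made up for this illustration, and it assumes plain LF line endings; CRLF input would need extra care, since a \r directly before the \n is itself a non-digit.

```perl
use strict;
use warnings;

# \K version: a non-digit must precede the newline, but only the
# newline itself is removed.
sub join_with_keep {
    my ($text) = @_;
    $text =~ s/[^0-9]\K\n//g;
    return $text;
}

# Lookbehind version: remove a newline that is not preceded by a digit.
sub join_with_lookbehind {
    my ($text) = @_;
    $text =~ s/(?<![0-9])\n//g;
    return $text;
}

# Hypothetical sample: the first record is broken across two lines.
my $broken = "1,foob\nar,10\n2,baz,20\n";

print join_with_keep($broken);
print join_with_lookbehind($broken);
```

One difference to keep in mind: the lookbehind version also matches a newline at the very start of the string (nothing precedes it), while the \K version requires an actual non-digit character before the newline.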
However, I'd rather recommend to process your CSV with a library, even as it's a little more code. There's already been one problem, that linefeed inside a field, what else may be there? A good library can handle a variety of irregularities.
A simple example with Text::CSV
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
    s/\n+//g for @$row;
    $csv->say(\*STDOUT, $row);
}
Consider other constructor options, also available via accessors, that are good for all kinds of unexpected problems. Like allow_whitespace for example.
This can be done as a command-line program ("one-liner") as well, if there is a reason for that. The library's functional interface via csv is then convenient
perl -MText::CSV=csv -we'
    csv in => *ARGV, on_in => sub { s/\n+//g for @{$_[1]} }' filename
With *ARGV the input is taken either from a file named on command line or from STDIN.
Just capture the last char and put it back:
perl -pe 's/([^0-9])\r?\n/$1/g'

Regex to match C integer literals

I would like to use egrep/grep -E to print out the lines in C source files that contain integer literals (as described here). The following works for the most part, except it matches floats too:
egrep '\b[0-9]+' *.c
Any suggestions for how to fix this?
You can use negative lookarounds to make sure the number isn't followed by or preceded by a . (lookarounds need a PCRE-capable tool such as GNU grep -P; grep -E does not support them):
\b(?<!\.)[0-9]+(?!\.)\b
Edit:
Since you want to only match the 0 of 0x in hex literals as you mentioned in the comments, use the following pattern instead. It works exactly like your original regex except that it doesn't match float numbers.
\b(?<!\.)[0-9]+(?![\.\d])
Try it online.
References:
Regular expressions: Lookahead and Lookbehind
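
As a rough sketch, the second lookaround pattern can be tried in Perl, whose regex engine supports lookarounds natively. The sub name integer_literals and the sample line are made up for this illustration.

```perl
use strict;
use warnings;

# Returns every run of digits that is neither preceded by a dot
# nor followed by a dot or another digit.
sub integer_literals {
    my ($line) = @_;
    return $line =~ /\b(?<!\.)[0-9]+(?![.\d])/g;
}

# Hypothetical line of C source.
my @found = integer_literals('int x = 42; float y = 3.14; int z = 0x1F;');
print "@found\n";    # prints "42 0"
```

The 3.14 is skipped entirely, and for the hex literal only the leading 0 is reported, as described above.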
I would not try to overoptimize a pattern like this and just convert each integer literal type and the possible suffixes literally into a regex with alternations:
(?i)(?:0x(?:[0-9a-f]+(?:'?[0-9a-f]+)*)|0b(?:[10]+(?:'?[10]+)*)|\d+(?:'?\d+)*)(?:ull|ll|ul|l|u)?
Only the digit separators require some more work: a separator cannot be followed by another separator and can only appear between numbers.
Suffixes are allowed for hex and binary too, as tested with C++14 here.
Demo
Note: The pattern is designed to be case-insensitive.
Run it like this (assuming GNU grep built with PCRE support, since (?:...) and \d are not part of grep -E syntax): grep -Pi "(?:0x(?:[0-9a-f]+(?:'?[0-9a-f]+)*)|0b(?:[10]+(?:'?[10]+)*)|\d+(?:'?\d+)*)(?:ull|ll|ul|l|u)?" input.txt
PS: If you just want to extract the values a Perl script could come handy:
use strict;
my $file = '/some/where/input.txt';
my $regex = qr/(?:0x(?:[0-9a-f]+(?:'?[0-9a-f]+)*)|0b(?:[10]+(?:'?[10]+)*)|\d+(?:'?\d+)*)(?:ull|ll|ul|l|u)?/ip;
open my $input, '<', $file or die "can't open $file: $!";
while (<$input>) {
    chomp;
    while ($_ =~ /($regex)/g) {
        print "${^MATCH}\n";
    }
}
close $input or die "can't close $file: $!";

Replacing a single character in a perl regex match

How can I replace the 6th "_" that appears in the regex match?
Here is the literal input to be searched. It is not representing a path to the input:
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa
Here is my code, which parses out what I need. I just now need to replace the last matched "_" with a "/":
#!/usr/bin/perl
use strict;
use warnings;
open(IN, '<', '/Users/roblogan/Test_Database.txt') or die $!;
open(OUT, '>', '/Users/roblogan/Test_Output.txt') or die $!;
while (my $line = <IN>){
    if ($line =~ m/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/){
        print OUT $1, "\n";
    }
}
Current output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30
Desired output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
I have tried:
if ($line =~ s/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0\/[0-9]*)/){
Any help would be appreciated.
This Perl code will do what I think you need, as determined from your subject line and example output.
It finds the sixth occurrence of an underscore in the target string and, if that underscore is followed by decimal digits, it changes the underscore to a slash and removes everything following the digits.
I have used the pipe character | as the delimiter for the substitute operator s/// to avoid the need to escape forward slashes.
use strict;
use warnings 'all';
my $path = q{/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa};
$path =~ s|^(?:[^_]*_){5}[^_]*\K_(\d+).*|/$1|s;
print $path, "\n";
output
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
From your description, the easiest way is:
$line =~ s!(m160505_031746_42156_c100980652550000001823221307061611_s1_p0)_!$1/!
I've chosen ! as the delimiter because / is used in the replacement part.
$1 is a variable containing the text matched by the first ( ) group in the regex (I didn't want to repeat the whole thing twice).
The final _ is not included in $1 (it's outside of the parens); instead we put / in the replacement part.
See perldoc perlretut for more information.
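
To address the title question more generally, here is a minimal sketch that replaces only the Nth underscore in a string, using the same "skip N-1 underscores, then \K" idea as the first answer. The sub name replace_nth_underscore is hypothetical, and unlike that answer this version leaves the rest of the string untouched.

```perl
use strict;
use warnings;

# Replace the $n-th underscore (counting from 1) with $replacement.
# The string is returned unchanged if it has fewer than $n underscores.
sub replace_nth_underscore {
    my ($str, $n, $replacement) = @_;
    die "n must be a positive integer\n" unless $n =~ /^[1-9][0-9]*$/;
    my $skip = $n - 1;
    # Skip $skip underscores, then any non-underscore text; \K keeps
    # all of that out of the match, so only the next _ is replaced.
    my $re = qr/^(?:[^_]*_){$skip}[^_]*\K_/;
    $str =~ s/$re/$replacement/;
    return $str;
}

my $path = '/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa';
print replace_nth_underscore($path, 6, '/'), "\n";
```

With the path from the question this turns p0_30 into p0/30 and keeps the trailing _0_59.fsa, so the original capture-and-print step can then be applied to the result.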

Perl removing words from file1 with file2

I am using a Perl script to remove all stopwords in a text. The stopwords are stored one per line. I am using the Mac OS X command line and Perl is installed correctly.
This script is not working properly and has a word-boundary problem.
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;
# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";
my @words;
# get all words into an array
while ($_=<WORDS>) {
    chop; # strip eol
    push @words, split; # break up words on line
}
# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;
# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;
# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) {
    $text =~ s/\b\Q$word\E\s?//sg;
}
# output "fixed" text
print $text;
sample.txt
$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i
think id rather go up and above
stopwords.txt
I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..
Output:
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out d i
think id rather go up d bove
As you can see, it turns afterwards into fterwards because of the stopword a. I think it's a regex problem. Can somebody please help me patch this quickly? Thanks for all the help :J
Use word-boundary on both sides of your $word. Currently, you are only checking for it at the beginning.
You won't need the \s? condition with the \b in place:
$text =~ s/\b\Q$word\E\b//sg;
Your regex is not strict enough.
$text =~ s/\b\Q$word\E\s?//sg;
When $word is a, the command is effectively s/\ba\s?//sg. This means: remove every a that starts a new word, along with one optional whitespace character after it. In afterwards, this will successfully match the first a.
You can make the match stricter by ending the word with another \b, like:
$text =~ s/\b\Q$word\E\b\s?//sg;
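
As a variation on the fix above (not what the original script does), the stopwords can be joined into one alternation so the text is scanned once instead of once per word. This is only a sketch; the sub name remove_stopwords is made up, and matching stays case-sensitive as in the original script.

```perl
use strict;
use warnings;

# Remove every whole-word occurrence of the given stopwords,
# together with one optional whitespace character after each.
sub remove_stopwords {
    my ($text, @words) = @_;
    # An empty alternation would match at every word boundary.
    return $text unless @words;
    my $alternation = join '|', map { quotemeta } @words;
    $text =~ s/\b(?:$alternation)\b\s?//g;
    return $text;
}

print remove_stopwords('how about i decide to look at it afterwards',
                       qw(a about how it)), "\n";
```

Because \b is required on both sides, a no longer matches inside at or afterwards, so sorting the words by length is not needed for correctness here.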

Regex question!

I'm not too familiar with regex but I know what I need to find-
I have a long list of data separated by newlines, and I need to delete all the lines of data that contain a string "(V)". The lines are of variable length, so I guess something to do with selecting everything between two newline characters if there's a (V) inside?
Try searching for this regular expression:
^.*\(V\).*$
Explanation:
^ start of line
.* any characters apart from new line
\( open parenthesis (escaped to avoid special behaviour)
V V
\) close parenthesis (escaped to avoid special behaviour)
.* any characters apart from new line
$ end of line (not strictly needed here, included only for clarity)
Depending on your language you may need to add delimiters such as / and/or quotes " around the regular expression and you may need to enable multiline mode.
Here's an online example showing it working: Rubular
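
In Perl, for example, the pattern could be applied to a whole string roughly like this. It is only a sketch: the sub name drop_lines_with_v and the sample data are made up, the /m flag is the multiline mode mentioned above, and a trailing \n? is added so that the line break is removed along with the line.

```perl
use strict;
use warnings;

# Delete every line that contains the literal text "(V)".
sub drop_lines_with_v {
    my ($data) = @_;
    # /m lets ^ match at the start of each line; /g repeats the
    # substitution for every matching line.
    $data =~ s/^.*\(V\).*\n?//mg;
    return $data;
}

print drop_lines_with_v("apple\nbanana (V) ripe\ncherry\n");
```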
If the data is indeed rather large, then running a single regex against the whole string would be a bad idea. Instead, a simple solution like this Perl script could work for you:
open my $fh, '<', 'data.txt' or die $!;
while (my $line = <$fh>) {
    if ($line =~ m/\(V\)/) {
        next;
    }
    print $line;
}
close $fh;
This script reads the data file one line at a time and prints the lines that do not contain "(V)" to stdout. (You could obviously replace the print with a different data processing task.)
Use the UNIX command grep, if you have access to such a system.
$ grep -v '(V)' data.txt
Grep matches all lines containing "(V)" in data.txt, and shows only the lines not matching (-v).