perl regex to replace 2 substrings

I wrote a perl snippet that strips http:// and www from the front of a domain name input from the console
#!/usr/bin/perl
use strict;
print "Enter the domain name to be queried:\n";
my $input_domain = <>;
chomp ($input_domain);
my $inter_domain = $input_domain =~ s/http:\/\///r;
my $domain = $inter_domain =~ s/www.//r;
print $domain."\n";
When http://domain-name.tld or http://www.domain-name.tld or even www.domain-name.tld is entered, this code returns domain-name.tld.
The question I have is, can the same be achieved using a Perl one-liner that combines both the search and replace lines into one?

If you make both the http:// and the www. optional but look for both of them, then it will remove either one or both. The only difference from the original code is that it will change www.http://domain-name.tld to http://domain-name.tld, which I don't think is a disadvantage.
It seems odd to ask for a one-liner that modifies user input, so I've written this sample that processes four different strings from the DATA file handle. Also note that it's much tidier to use different delimiters for the substitution to avoid having to escape the slashes.
use strict;
use warnings;
while ( <DATA> ) {
    s|^(?:http://)?(?:www\.)?||;
    print;
}
__END__
http://www.domain-name.tld
http://domain-name.tld
www.domain-name.tld
domain-name.tld
output
domain-name.tld
domain-name.tld
domain-name.tld
domain-name.tld

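Combine the two patterns with alternation: (http:\/\/)|(www\.)
s/(http:\/\/)|(www\.)//gr;
With the /g flag this removes http:// and/or www. wherever they occur. Without /g only the first match is removed, so http://www.domain-name.tld would keep its www.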
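As a rough check of the two approaches in this thread, here is a small sketch that compares the anchored pattern with the alternation plus /g. The sub names strip_anchored and strip_anywhere and the third test string are made up for illustration, and /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: remove an optional leading http:// followed by an optional www.
sub strip_anchored {
    my ($domain) = @_;
    return $domain =~ s{^(?:http://)?(?:www\.)?}{}r;
}

# Hypothetical helper: remove every http:// or www. wherever it occurs.
sub strip_anywhere {
    my ($domain) = @_;
    return $domain =~ s{http://|www\.}{}gr;
}

for my $input (qw(http://www.domain-name.tld www.domain-name.tld my-www.example.tld)) {
    printf "%-28s anchored: %-20s anywhere: %s\n",
        $input, strip_anchored($input), strip_anywhere($input);
}
```

The anchored version leaves a "www." in the middle of a host name alone, while the /g alternation removes it too, which may or may not be what you want.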

Related

Perl Regex regular expression to split //

I went through s'flow and other sites for simple solution with regex in perl.
$str = q(//////);
Say I've six slash or seven, or other chars like q(aaaaa)
I want them to split like ['//','//'],
I tried @my_split = split ( /\/\/,$str); but it didn't work
Is it possible with regex?
Reason for this question is, say I have this domain name:
$site_name = q(http://www.yahoo.com/blah1/blah2.txt);
I wanted to split along single slash to get 'domain-name', I couldn't do it.
I tried
split( '/'{1,1}, $sitename); #didn't work. I expected it split on one slash than two.
Thanks.

The question is rather unclear.
To break a string into pairs of consecutive characters
my @pairs = $string =~ /(..)/g;
or to split a string on a repeated slash
my @parts = split /\/\//, $string;
The separator pattern, in /.../, is an actual regex so we need to escape / inside it.
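A minimal sketch of the difference between the two, using the six-slash string from the question plus a made-up a//b//c string. It also shows why split is probably not what the asker wants for a string that consists only of separators.

```perl
use strict;
use warnings;

my $slashes = q(//////);

# A global match in list context returns every captured pair.
my @pairs = $slashes =~ /(..)/g;
print scalar(@pairs), " pairs: @pairs\n";

# split treats // as a separator and returns what lies between the separators.
my @fields = split m{//}, q(a//b//c);
print scalar(@fields), " fields: @fields\n";

# A string made only of separators has nothing but empty fields,
# and split drops trailing empty fields, so the result is an empty list.
my @none = split m{//}, $slashes;
print scalar(@none), " fields from the all-slash string\n";
```

Using m{//} instead of /\/\// is only a delimiter choice; it avoids escaping the slashes.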
But then you say you want to parse URI?
Use a module, please. For example, there is URI
use warnings;
use strict;
use feature 'say';
use URI;
my $string = q(http://www.yahoo.com/blah1/blah2.txt);
my $uri = URI->new($string);
say "Scheme: ", $uri->scheme;
say "Path: ", $uri->path;
say "Host: ", $uri->host;
# there's more, see docs
and then there's URI::Split
use URI::Split qw(uri_split uri_join);
my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
A number of other modules or frameworks, which you may already be using, nicely handle URIs.

Here's a quick way to split the full URL into its components:
my $u = q(http://www.yahoo.com/blah1/blah2.txt);
my ($protocol, $server, $path) = split(/:\/\/([^\/]+)/, $u);
print "($protocol, $server, $path)\n";
h/t @Mike

The following piece of code does the trick:
use strict;
use warnings;
use Data::Dumper;
my %url;
while( <DATA> ) {
    chomp;
    # skip lines that do not look like scheme://host/path/file
    next unless m|^(\w+)://([\w.-]+)/(.+)/([^/]+)$|;
    @url{qw(proto dn path file)} = ($1,$2,$3,$4);
    print Dumper(\%url);
}
__DATA__
http://www.yahoo.com/blah1/blah2.txt
http://www.google.com/dir1/dir2/dir3/file.ext
ftp://www.server.com/dir1/dir2/file.ext
https://www.inter.net/dir/file.ext

So it seems you want to simply get the Domain name:
my $url = q(http://www.yahoo.com/blah1/blah2.txt);
my @vars = split /\//, $url;
print $vars[2];
results:
www.yahoo.com

Replace strings only within a regex match in perl

I have an XML document with text in attribute values. I can't change how the XML file is generated, but need to extract the attribute values without losing \r\n. The XML parser of course strips them out.
So I'm trying to replace \r\n in attribute values with entity references.
I'm using perl to do this because of its non-greedy matching. But I need help getting the replace to happen only within the match. Or I need an easier way to do this :)
Here is what I have so far:
perl -i -pe 'BEGIN{undef $/;} s/m_description="(.*?)"/m_description="$1"/smg' tmp.xml
This matches what I need to work with: (.*?). But I don't know how to expand that pattern to match \r\n inside it, and do the replacement in the results. If I knew how many \r\n I have I could do it, but it seems I need a variable number of capture groups or something like that? There's a lot to regex I don't understand and it seems like there should be something to do this.
Example:
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines
Should go to:
preceding lines
stuff m_description="Over&#xD;&#xA;any number&#xD;&#xA;of lines" other stuff
more lines
Solution
Thanks to ikegami and ysth for the solution I used, which for 5.14+ is:
perl -i -0777 -pe's/m_description="\K(.*?)(?=")/ $1 =~ s!\n!&#xA;!gr =~ s!\r!&#xD;!gr /sge' tmp.xml

. should already match \n (because you specify the /s flag) and \r.
To do the replacement in the results, use /e:
perl -i -0777 -pe's/(?<=m_description=")(.*?)(?=")/ my $replacement=$1; $replacement=~s!\n!&#xA;!g; $replacement=~s!\r!&#xD;!g; $replacement /sge' tmp.xml
I've also changed it to use lookbehind/lookahead to make the code simpler and to use -0777 to set $/ to slurp mode and to remove the useless /m.
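To try the /e technique without touching a file, here is a small sketch that does the same nested substitution on an in-memory string. The sub name encode_attr_newlines and the sample text are made up, and the &#xD; / &#xA; character references are an assumption about which entity form is wanted. It needs Perl 5.14+ for /r.

```perl
use strict;
use warnings;

# Hypothetical helper: encode CR and LF as character references,
# but only inside m_description="..." attribute values.
sub encode_attr_newlines {
    my ($text) = @_;
    # \K keeps the attribute name out of the replaced text; /e runs the
    # replacement as code, so the inner substitutions only see the value in $1.
    return $text =~ s{m_description="\K(.*?)(?=")}{
        $1 =~ s/\r/&#xD;/gr =~ s/\n/&#xA;/gr
    }sger;
}

my $sample = qq{stuff m_description="Over\r\nany number\r\nof lines" other stuff\n};
print encode_attr_newlines($sample);
```

The newline after "other stuff" is outside the attribute value, so it is left as it is.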

OK, so whilst this looks like an XML problem, it isn't. The XML problem is the person generating it. You should probably give them a prod with a rolled up copy of the spec as your first port of call for "fixing" this.
But failing that - I'd do a two pass approach, where I read the text, find all the 'blobs' that match a description, and then replace them all.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $text = do { local $/ ; <DATA> };
#filter text for 'description' text:
my @matches = $text =~ m{m_description=\"([^\"]+)\"}gms;
print Dumper \@matches;
#Generate a search-and-replace hash
my %replace = map { $_ => s/[\r\n]+/&#xA;/gr } @matches;
print Dumper \%replace;
#turn the keys of that hash into a search regex, escaping any regex metacharacters
my $search = join ( "|", map { quotemeta($_) } keys %replace );
$search = qr/\"($search)\"/ms;
print "Using search regex: $search\n";
#search and replace text block
$text =~ s/m_description=$search/m_description="$replace{$1}"/mgs;
print "New text:\n";
print $text;
__DATA__
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines

How can I use regex to remove /1 or /2?

Regex gurus,
Here is the following line of code I want to parse with regex:
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
I want to obtain the following:
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0
I have written the following regex on rubular.com:
(@.* *.)(!?(\/.))
My idea is to use negation to remove /1 by (!?(\/.)). However, this produces the entire line?
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
Why is (?!thisismystring) not removing /1? I googled the fire out of this, but they seemed to suggest similar things I am already trying? I deeply appreciate your help.

I think what you are trying to write is /(\@.* .*)(?=\/\d)/ (you need to escape the at sign @ to prevent Perl from treating it as an array). You need a positive look-ahead rather than a negative one, because you want to match everything up to the point where the following characters are a slash followed by a digit.
Here is a program that demonstrates.
use strict;
use warnings;
use 5.010;
my $s = '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
$s =~ /(\@.* .*)(?=\/.)/;
print $1, "\n";
But you would be much better off copying the whole string and removing the slash and everything after it, like this
use strict;
use warnings;
my $s = '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
(my $fixed = $s) =~ s{/\d+$}{};
print $fixed, "\n";
output
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0
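For comparison, here is a short sketch with both approaches side by side, assuming the suffix is always /1 or /2 at the very end of the header. The sub names and the short test strings in the comments are made up for illustration; /r needs Perl 5.14+.

```perl
use strict;
use warnings;

# Hypothetical helper: delete a trailing /1 or /2, leave anything else untouched.
sub strip_pair_suffix {
    my ($header) = @_;
    return $header =~ s{/[12]\z}{}r;
}

# Hypothetical helper: capture everything before a trailing /1 or /2 using a
# positive look-ahead. Returns undef when the header has no such suffix.
sub prefix_before_suffix {
    my ($header) = @_;
    my ($prefix) = $header =~ m{\A(.*)(?=/[12]\z)}s;
    return $prefix;
}

my $header = '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
print strip_pair_suffix($header), "\n";
print prefix_before_suffix($header), "\n";
```

The substitution is the more forgiving of the two, because a header without the suffix comes back unchanged instead of failing to match.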

How do I get the host name from a URL in Perl using regex?

so what I want to do is remove everything after and including the first "/" to appear after a "."
so: http://linux.pacific.net.au/primary.xml.gz
would become: http://linux.pacific.net.au
How do I do this using regex? The system I'm running on can't use URI tool.

$url = 'http://linux.pacific.net.au/primary.xml.gz';
($domain) = $url =~ m!(https?://[^:/]+)!;
print $domain;
output:
http://linux.pacific.net.au
And this is the regular expression given in RFC 3986 (Appendix B) for breaking a URI into its components, here with the outer groups made non-capturing:
my($scheme, $authority, $path, $query, $fragment) =
$uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
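A quick sketch of that expression applied to the URL from the question. The wrapper sub split_uri is a made-up name; the pattern itself is the one quoted above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the URI-splitting pattern quoted above.
# In list context the match returns the five captures; parts that are
# absent from the URI come back as undef.
sub split_uri {
    my ($uri) = @_;
    return $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
}

my ($scheme, $authority, $path, $query, $fragment) =
    split_uri('http://linux.pacific.net.au/primary.xml.gz');

print "${scheme}://${authority}\n";
print "path: $path\n";
```

Since every part of the pattern is optional it matches any string, so it splits a URI but does not validate one.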

I suggest you use URI::Split, which will separate a standard URL into its constituent parts for you and rejoin them. You want the first two parts: the scheme and the host.
use strict;
use warnings;
use URI::Split qw/ uri_split uri_join /;
my $scheme_host = do {
    my (@parts) = uri_split 'http://linux.pacific.net.au/primary.xml.gz';
    uri_join @parts[0,1];
};
print $scheme_host;
output
http://linux.pacific.net.au
Update
If your comment "The system I'm running on can't use URI tool" means you can't install modules, then here is a regular expression solution.
You say you want to remove everything after and including the first "/" to appear after a ".", so /^.*?\./ finds the first dot, and m|[^/]+| finds everything after it up to the next slash.
The output is identical to that of the preceding code
use strict;
use warnings;
my $url = 'http://linux.pacific.net.au/primary.xml.gz';
my ($scheme_host) = $url =~ m|^( .*?\. [^/]+ )|x;
print $scheme_host;

The system I'm running on can't use URI tool.
I really recommend doing whatever you can to fix that problem first. If you're not able to use CPAN modules then you'll be missing out on a lot of the power of Perl and your Perl programming life will be far more frustrating than it needs to be.

How can I extract URLs from plain text with Perl?

I need the Perl regex to parse plain text input and convert all links to valid HTML HREF links. I've tried 10 different versions I found on the web but none of them seem to work correctly. I also tested other solutions posted on StackOverflow, none of which seem to work. The correct solution should be able to find any URL in the plain text input and convert it to:
<a href="$1">$1</a>
Some cases other regular expressions I tried didn't handle correctly include:
URLs at the end of a line which are followed by returns
URLs that included question marks
URLs that start with 'https'
I'm hoping that another Perl guy out there will already have a regular expression they are using for this that they can share. Thanks in advance for your help!

You want URI::Find. Once you extract the links, you should be able to handle the rest of the problem just fine.
This is answered in perlfaq9's answer to "How do I extract URLs?", by the way. There is a lot of good stuff in those perlfaq. :)

Besides URI::Find, also check out the big regular expression database Regexp::Common. There is a Regexp::Common::URI module that gives you something as easy as:
my ($uri) = $str =~ /$RE{URI}{-keep}/;
If you want different pieces (hostname, query parameters etc) in that uri, see the doc of Regexp::Common::URI::http for what's captured in the $RE{URI} regular expression.

When I tried URI::Find::Schemeless with the following text:
Here is a URL and one bare URL with
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)
Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://user@example.org/test/me
How about one without a protocol www.example.com?
it messed up http://example.org/(9.3). So, I came up with the following with the help of Regexp::Common:
#!/usr/bin/perl
use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;
my $heuristic = URI::Find::Schemeless->schemeless_uri_re;
my $pattern = qr{
$RE{URI}{HTTP}{-scheme=>'https?'} |
$RE{URI}{FTP} |
$heuristic
}x;
local $/ = '';
while ( my $par = <DATA> ) {
    chomp $par;
    $par =~ s/</&lt;/g;
    $par =~ s/( $pattern ) / linkify($1) /gex;
    print "<p>$par</p>\n";
}
sub linkify {
    my ($str) = @_;
    $str = "http://$str" unless $str =~ /^[fh]t(?:p|tp)/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}
This worked for the input shown. Of course, life is never that easy as you can see by trying (http://example.org/(9.3)).

Here is some sample code that extracts URLs. It reads lines from STDIN, checks whether each line contains a validly formatted HTTP URL, and prints the URL if it finds one.
use strict;
use warnings;
use Regexp::Common qw /URI/;
print "Enter the line: \n";
while ( defined( my $line = <> ) )    # stop at end of input instead of looping forever
{
    chomp ($line); #removing the unwanted new line character
    if ( my ($uri) = $line =~ /$RE{URI}{HTTP}{-keep}/ ) {
        print "Contains an HTTP URI.\n";
        print "URL : $uri\n";
    }
    print "Enter the line: \n";
}
Sample output I am getting is as follows
Enter the line:
http://stackoverflow.com/posts/2565350/
Contains an HTTP URI.
URL : http://stackoverflow.com/posts/2565350/
Enter the line:
this is not valid url line
Enter the line:
www.google.com
Enter the line:
http://
Enter the line:
http://www.google.com
Contains an HTTP URI.
URL : http://www.google.com
