Regular Expression to match an ssh connection string - regex

I'm trying in vain to write a regular expression to match valid ssh connection strings.
I really only need to recognise strings of the format:
user@hostname:/some/path
but it would be nice to also match an implicit home directory:
user@hostname:
I've so far come up with this regex:
/^[:alnum:]+\@\:(\/[:alnum:]+)*$/
which doesn't work as intended.
Any suggestions welcome before my brain explodes and I start speaking in line noise :)

Your supplied regex doesn't have the hostname section. Try:
/^[:alnum:]+\@[:alnum:\.]\:(\/[:alnum:]+)*$/
or
/^[A-Za-z][A-Za-z0-9_]*\@[A-Za-z][A-Za-z0-9_\.]*\:(\/[A-Za-z][A-Za-z0-9_]*)*$/
since I don't trust alnum without double brackets.
Also, :alnum: may not give you the required range for your sections. You can have "." characters in your host name and may also need to allow for "_" characters. And it's rare I've seen usernames or hostnames start with a non-alphabetic.
Just as a side note, I try to avoid the enhanced regexes since they don't run on all regex engines (I've been using UNIX for a long time). Unfortunately that makes my regexes ungainly (see above) and not overly internationalizable. Apologies for that.

POSIX character classes such as [:alnum:] go inside their own brackets, i.e. [[:alnum:]]. As written, you are matching any one of colon, 'a', 'l', 'm', 'n', or 'u'.
And like Pax said, you missed the hostname. But the bracket expressions are still wrong.

After some more revisions I'm using:
/^\w+\@(\w|\.)+\:(\/\w+)*$/
which seems to match my test cases and accounts for hostnames, FQDNs and IP addresses in the host portion. It also makes the path after the colon optional to allow for implicit home directories.
Thanks for the help so far - I didn't spot the lack of hostname until it was pointed out.

What sgm is getting at, is you are doing
/^[:alnum:]+\@\:(\/[:alnum:]+)*$/
Where you should be doing
/^[[:alnum:]]+\@\:(\/[[:alnum:]]+)*$/
Pax's answer is also practical, but won't work without the proper double bracketing.
my $at = q{@};
my @res = (
    qr/^[:alnum:]+${at}[:alnum:]+:(\/[:alnum:]+)*$/,
    qr/^[[:alnum:]]+${at}[[:alnum:]]+:(\/[[:alnum:]]+)*$/,
    qr/^[a-z][[:alnum:]_]*${at}[a-z][[:alnum:]_.]*:(\/[^\/]*)*$/i,
);
my @u = qw{
    user@hostname:/some/path
    bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_
    9foo@9foo.org:/9foo/9foo
    baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually)
};
for my $str (@u) {
    for my $re (@res) {
        if ( $str =~ $re ) {
            print "$str =~ $re\n";
        }
        else {
            print "NOT $str =~ $re\n";
        }
    }
}
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/^[:alnum:] <-- HERE +@[:alnum:]+:(/[:alnum:]+)*$/ at /tmp/egl.pl line 27.
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/^[:alnum:]+@[:alnum:] <-- HERE +:(/[:alnum:]+)*$/ at /tmp/egl.pl line 27.
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/^[:alnum:]+@[:alnum:]+:(/[:alnum:] <-- HERE +)*$/ at /tmp/egl.pl line 27.
NOT user@hostname:/some/path =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
user@hostname:/some/path =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
user@hostname:/some/path =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)
NOT bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_ =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
NOT bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_ =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_ =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)
NOT 9foo@9foo.org:/9foo/9foo =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
NOT 9foo@9foo.org:/9foo/9foo =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
NOT 9foo@9foo.org:/9foo/9foo =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)
NOT baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually) =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
NOT baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually) =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually) =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)

OK, further revision to:
/^\w+\@(\w|\.)+\:(\/(\w|\.)+)*$/
to account for the . that might be present in a file name.

Final go:
/^\w+\@(\w|\.)+\:(\/(\w|\.)+\/?)*$/
This also allows for an optional trailing slash.

These didn't quite do it for what I needed, as some were broken or not liberal enough. For example, if you have a folder named stackoverflow.com, a pattern that doesn't allow dots in the path would break. Implementations are inconsistent about what \w means, so I wouldn't recommend using it, especially since we know darn well which characters we need.
The following is a bash example to construct the regex:
#should match 99.9% of SSH users
user_regex='[a-zA-Z][a-zA-Z0-9_]+'
#match domains
host_regex='([a-zA-Z][a-zA-Z0-9\-]*\.)*[a-zA-Z][a-zA-Z0-9\-]*'
#match paths starting with / and empty strings (which is valid for our use!)
path_regex='(\/[A-Za-z0-9_\-\.]+)*\/?'
#the complete regex
master_regex="^$user_regex\@$host_regex\:$path_regex\$"
This affords the modularity to check your parts later, if need be. To enable IP addresses in the match add 0-9 to the two first letter match portions of the host regex.
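For comparison, the same modular construction can be sketched in Perl with qr// pieces. This is only an illustration using the same character sets as the bash version; the names $user_re, $host_re, $path_re, $ssh_re and parse_ssh_target are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical building blocks mirroring the bash variables above.
my $user_re = qr/[a-zA-Z][a-zA-Z0-9_]+/;
my $host_re = qr/(?:[a-zA-Z][a-zA-Z0-9-]*\.)*[a-zA-Z][a-zA-Z0-9-]*/;
my $path_re = qr{(?:/[A-Za-z0-9_.-]+)*/?};

# The complete pattern, with one capture group per part.
my $ssh_re = qr/\A($user_re)\@($host_re):($path_re)\z/;

# Returns (user, host, path) on success and an empty list otherwise.
sub parse_ssh_target {
    my ($target) = @_;
    my ($user, $host, $path) = $target =~ $ssh_re
        or return;
    return ($user, $host, $path);
}
```

Because each part is captured separately, the caller can inspect the user, host and path afterwards. An implicit home directory (nothing after the colon) comes back as an empty path.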

Related

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
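A small self-contained sketch of both approaches, wrapped in subs so they can be tried on strings with and without a trailing semicolon. The sub names eur_af and info_fields are invented for illustration. The hash version here splits pair by pair rather than on /[=;]/, which is an assumption made so that a field without "=" does not shift the remaining keys and values.

```perl
use strict;
use warnings;

# Extract one value; returns undef when EUR_AF is absent.
sub eur_af {
    my ($info) = @_;
    my ($value) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;
    return $value;
}

# Extract all KEY=VALUE pairs into a hash reference.
sub info_fields {
    my ($info) = @_;
    my %fields;
    for my $pair (split /;/, $info) {
        my ($key, $value) = split /=/, $pair, 2;
        $fields{$key} = $value if defined $value;   # skip fields without "="
    }
    return \%fields;
}
```

The [^;]* in eur_af stops at the next semicolon or at the end of the string, so it does not matter whether the field is the last one.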
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceded by EUR_AF=
\d+(\.\d+)? - consisting of one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The captured value in $1 will be 0 or 0.39.

How to limit match length before a certain character?

I am using the following regular expression to scan input text files for valid emails.
[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
Now I also need to limit the matches to 20 characters before the '@' sign in the email address, but not sure how to do it.
PS. I am using the Perl regular expression library (TPerlRegex) found in Delphi XE2.
Please can you help me?
Since your library is supposed to be Perl-compatible, it should support lookaheads. These are convenient for enforcing several "orthogonal" restrictions in the pattern:
(?=[^@]{1,20}@)[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
The lookahead will only match if there is an @ after no more than 20 non-@ characters. However, the lookahead does not actually advance the position of the regex engine in your subject string, so after the condition has been checked, the engine is still at the beginning of the email (or whichever position it is checking at the moment) and will continue with your pattern as before.
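A rough Perl sketch of the lookahead approach when scanning free text. One caveat worth stating: without a guard on the left, the engine may begin a match partway through a longer local part, which would defeat the limit. The negative lookbehind below is an added assumption to prevent that, and the names $atom, $label, $email_re and short_local_emails are invented for this example.

```perl
use strict;
use warnings;

# Character classes copied from the question. They are kept in single-quoted
# strings so that Perl does not try to interpolate anything inside them.
my $atom  = q{[A-Za-z0-9!#$%&*+/=?^_`{|}~-]};
my $label = q{[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?};

my $email_re = qr/
    (?<! $atom | \. )          # assumption: never start inside a longer local part
    (?= [^\@]{1,20} \@ )       # at most 20 characters before the at sign
    $atom+ (?: \. $atom+ )*    # local part
    \@
    (?: $label \. )+ $label    # domain
/x;

# Returns every address in $text whose local part has 20 or fewer characters.
sub short_local_emails {
    my ($text) = @_;
    my @found = $text =~ /($email_re)/g;
    return @found;
}
```

With this guard, a 31-character local part is skipped entirely instead of being matched from its twelfth character onwards, while a local part of exactly 20 characters is still accepted.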
Consider using Email::Address to capture email addresses, and then grepping the results for those having 20 or fewer characters before the @:
use strict;
use warnings;
use Email::Address;
my @addresses;
while ( my $line = <DATA> ) {
    push @addresses, $_
        for grep { /([^@]+)/ and length $1 < 21 }
        Email::Address->parse($line);
}
print "$_\n" for @addresses;
__DATA__
ABCDEFGHIJKLMNOPQRSTUVWXYZguest@host.com frank@email.net Line noise. test@host.com
Some stuff here... help@perl.org And even more here!
Nothing to see here. 01234567890123456789@numbers.com Nothing to see.
Output:
frank@email.net
test@host.com
help@perl.org
01234567890123456789@numbers.com

remove up to _ in perl using regex?

How would I go about removing all characters before a "_" in perl? So if I had a string that was "124312412_hithere" it would replace the string as just "hithere". I imagine there is a very simple way to do this using regex, but I am still new dealing with that so I need help here.
Remove all characters up to and including "_":
s/^[^_]*_//;
Remove all characters before "_":
s/^[^_]*(?=_)//;
Remove all characters before "_" (assuming the presence of a "_"):
s/^[^_]*//;
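To see how the three substitutions differ, here is a short sketch that applies each one non-destructively. It assumes Perl 5.14 or later for the /r modifier, and the sub name strip_variants is made up for the example.

```perl
use strict;
use warnings;
use 5.014;    # the non-destructive /r modifier needs Perl 5.14+

# Returns the result of each of the three substitutions, in order.
sub strip_variants {
    my ($s) = @_;
    return (
        $s =~ s/^[^_]*_//r,        # up to and including the first "_"
        $s =~ s/^[^_]*(?=_)//r,    # up to, but not including, the first "_"
        $s =~ s/^[^_]*//r,         # every leading character that is not "_"
    );
}
```

For "124312412_hithere" the results are "hithere", "_hithere" and "_hithere". The difference between the last two only shows up when there is no "_" at all: the lookahead version leaves the string untouched, while the third one empties it.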
This is a bit more verbose than it needs to be, but would be probably more valuable for you to see what's going on:
my $astring = "124312412_hithere";
my $find = "^[^_]*_";
my $replace = "_";
$astring =~ s/$find/$replace/;
print $astring;
Also, there's a bit of conflicting requirements in your question. If you just want hithere (without the leading _), then change it to:
$astring =~ s/$find//;
I know it's slightly different than what was asked, but in cases like this (where you KNOW the character you are looking for exists in the string) I prefer to use split:
$str = '124312412_hithere';
$str = (split (/_/, $str, 2))[1];
Here I am splitting the string into parts, using the '_' as a delimiter, but to a maximum of 2 parts. Then, I am assigning the second part back to $str.
There's still a regex in this solution (the /_/) but I think this is a much simpler solution to read and understand than regexes full of character classes, conditional matches, etc.
You can try out this: -
$_ = "124312412_hithere";
s/^[^_]*_//;
print $_; # hithere
Note that this will also remove the _ (as I infer from your sample output). If you want to keep the _ (your first statement leaves it unclear which you want), you would need to use a look-ahead as in @ikegami's answer.
Also, just to make it a little clearer: substitution and matching operate on $_ by default, so you don't need to bind to $_ explicitly. That is implied.
So, s/^[^_]*_//; is essentially the same as $_ =~ s/^[^_]*_//;, but the latter is not really required.

Perl - regex - I want to read and search each line for a string followed by a ";"

I'm playing and learning Perl so that I can read log files. I want to search every line and look for a string of alphanumeric followed by this ; at the beginning of each line.
This is part of what I have:
if ($line =~ /\S([a-zA-Z][a-zA-Z0-9]*)/)
but I think this is wrong.
Please advise.
"Alphanumeric" is a bit ambiguous now, since many people still infected with ASCII think it means A-Z with 0-9, but Perl thinks about it differently depending on the version (Know your character classes under different semantics). As with any regular expression, your job is to design a pattern that includes only what you want and doesn't exclude anything that you do want.
Also, many people still use the ^ to mean the beginning of the string, which it does if there's no /m flag. However, the re module can now set default flags, so your regex might not be what you think it is when another programmer tries to be helpful.
I tend to write things like:
my $alphanum = qr/[a-z0-9]/i;
my $regex = qr/
    \A               # absolute start of string
    (?:$alphanum)+   # I can change this elsewhere
    ;
    /x;
if( $line =~ $regex ) { ... }
Try:
if ($line =~ /^[a-z0-9]+;/i) { ... }
^ matches the start of a line. The + matches once or more. /i makes the search case-insensitive.
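If you also want the matched token itself, a capture group around the alphanumeric run is enough. The following is a minimal sketch that treats "alphanumeric" as ASCII letters and digits; the sub name leading_token is hypothetical.

```perl
use strict;
use warnings;

# Returns the alphanumeric run at the very start of the line if it is
# immediately followed by ";", and undef otherwise.
sub leading_token {
    my ($line) = @_;
    return $line =~ /\A([a-z0-9]+);/i ? $1 : undef;
}
```

In a loop over a log file this could be used as: my $token = leading_token($line); next unless defined $token;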

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs the match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors you can use the g global flag; in Java, for example, there is the replaceFirst/replaceAll pair of methods instead).
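Both points can be reproduced in Perl as a quick sketch. It assumes Perl 5.14+ for /r, uses made-up sub names, and matches [ \t]* instead of \s* after the # so that a bare "#" line does not swallow the following newline.

```perl
use strict;
use warnings;
use 5.014;    # for the /r modifier

# A repeated capturing group only remembers its final iteration.
sub last_repeated_capture {
    my ($block) = @_;
    my ($last) = $block =~ /\A(?:#(.*)\n)+\z/;
    return $last;
}

# Replacing one line at a time: /m makes ^ match at every line start and
# /g repeats the substitution for each commented line.
sub uncomment_block {
    my ($block) = @_;
    return $block =~ s/^#[ \t]*//mgr;
}
```

Given the three commented print lines from the question, last_repeated_capture returns only the text of the third line, whereas uncomment_block strips the leading # from all three.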
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg;   # /m lets ^ match at the start of every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg;
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the global g switch, plus m so that ^ matches at the start of every line:
s/^#(.+)/$1/mg
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]*)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]*) and matches all leading whitespace (spaces and tabs), if any.
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.
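The same idea as a Perl substitution, for anyone doing this outside Eclipse. This is a sketch only: it assumes Perl 5.14+ for /r, drops the separate group for the # itself, and uses an invented sub name.

```perl
use strict;
use warnings;
use 5.014;    # for the /r modifier

# Removes the first "#" of every line where it is preceded only by spaces
# or tabs, keeping the indentation and the rest of the line.
sub uncomment_keep_indent {
    my ($code) = @_;
    return $code =~ s/^([ \t]*)#(.*)$/$1$2/mgr;
}
```

Lines whose # appears after some code, such as a trailing comment, are left alone because the pattern only allows whitespace before the #.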