Regex to match anything but more than two spaces

I'm trying to create a regex to add quotes to some values in a string. Basically the string would be like this:
999  date  Doe, John E.  London  123456789
And I want to surround the name so that if this file is exported to a csv, it won't be separated. This is what I have so far
$line =~ s/([^\s{2,}]*,[^\s{2,}]*)/"$1"/g;
I think it should find any comma and anything near it until it finds two or more spaces but it's not working. Thanks for the help.

You asked for anything except 2 or more spaces.
I agree that unpack is the more natural way to do this. But split is a way to use a cookie-cutter in the shape of a pattern. Anything not in that pattern is a return field. So this:
@fields = split /\h{2,}/, $line;
$line = join(" " x 2 => map { "($_)" } @fields);
might be enough.
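A minimal sketch of the split idea, assuming columns are separated by runs of two or more spaces and that only fields containing a comma need quoting. The helper name quote_comma_fields and the sample line are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of 2+ horizontal whitespace characters,
# quote any field that contains a comma, then rejoin with two spaces.
sub quote_comma_fields {
    my ($line) = @_;
    my @fields = split /\h{2,}/, $line;
    return join '  ', map { /,/ ? qq("$_") : $_ } @fields;
}

# Sample line with two spaces between columns (an assumption about the input).
my $sample = "999  date  Doe, John E.  London  123456789";
print quote_comma_fields($sample), "\n";
```

Note that \h needs Perl 5.10 or later, and that rejoining with exactly two spaces discards any wider column padding.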

If this is fixed-width data (and my guess is that it is), it is better to use unpack (or plain old substr, etc.) rather than regular expressions.
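A rough sketch of the unpack route. The column widths in the template and the sample values are assumptions chosen for illustration; they would have to be measured from the real file:

```perl
use strict;
use warnings;

# Hypothetical helper: cut a fixed-width line into fields and emit CSV.
sub to_csv {
    my ($line) = @_;
    # Assumed widths: id 5, date 10, name 20, city 10, remainder.
    # The "A" format strips trailing spaces from each field.
    my @fields = unpack 'A5 A10 A20 A10 A*', $line;
    # Quote every field that contains a comma.
    return join ',', map { /,/ ? qq("$_") : $_ } @fields;
}

# Build a sample line with sprintf so its widths match the template.
my $sample = sprintf '%-5s%-10s%-20s%-10s%s',
    '999', '2011-10-23', 'Doe, John E.', 'London', '123456789';
print to_csv($sample), "\n";
```

Because the fields are cut by position, a comma in any column is handled the same way and no assumption about the spacing between columns is needed.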

[] contains a set of characters that are allowed at that position; a run of two spaces isn't a single character, and inside the brackets {2,} is just a handful of literal characters.
Maybe:
$line =~ s/ (.*? .*?) / "$1" /g;
You'll probably need to be more explicit about the format to avoid matches against ' '.
$line =~ s/ (\w+?, [\w ]+?.) / "$1" /g;
To avoid repeating the space in the replacement, look-around assertions could be used, which could also fix the issue of items at the beginning and end of the line:
$line =~ s/(?<=^| )(\w+?, [\w ]+?.)(?=$| )/"$1"/g;
Also be careful of your original format - are you sure it isn't just column aligned? (In which case a sufficiently long name or date might not allow 2+ spaces between columns).

Try this:
s/.*  \K(.*),(.*?)  /"$1,$2"  /
Logically, this means: Find a substring between two spaces and a comma, where the two spaces are as far right as possible, and then a substring between that comma and two spaces, where the substring is as short as possible.
Your approach can work too, if you get the syntax for negative lookaheads right.
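For the negative-lookahead route, one possible shape is sketched below. It is not taken from any of the answers; the helper name quote_name and the sample line (with two spaces between columns) are assumptions:

```perl
use strict;
use warnings;

# Hypothetical helper: quote any comma-containing run of text that is
# delimited by two or more spaces (or by the ends of the line).
sub quote_name {
    my ($line) = @_;
    # (?:(?! {2}).)* consumes one character at a time and stops before
    # any position where two consecutive spaces begin. The leading \S
    # keeps a stray single space from being pulled inside the quotes.
    $line =~ s/(\S(?:(?! {2}).)*,(?:(?! {2}).)*)/"$1"/g;
    return $line;
}

print quote_name("999  date  Doe, John E.  London  123456789"), "\n";
```

This expresses "anything except two or more spaces" as a per-character check, which is what the bracketed character class in the question cannot do.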

The sample text you supplied seems to be separated either by tabs or spaces (column aligned?). It is important to know which, or the regexp will not work. It is also important to know whether the pattern is consistent throughout the file.
If it is aligned by columns, the easiest and probably safest way is to simply count off characters. E.g.:
s/(^.{20})(\S*)  /$1"$2"/;
(You will have to adjust the number 20 yourself. I just approximated.)
Note that I am chopping off two spaces at the end of the name field in a reckless manner. This is to not screw up the format for the following values. If however the field is filled to the brim, there might not be two spaces at the end, and the regexp will miss. But then, on the other hand, you would not be able to fit quotes there anyway.
When dealing with these types of files, I do not think it is safe to use generic searches. If you are counting on commas to only appear in the names, sooner or later you will find someone who thought that "Bronx, New York" should be in the city field, and your regexp will be screwed.
A somewhat more strict, but complicated regexp would include the previous fields:
$date='\d{2}-\d{2}-\d{2}'; # this might work for dates such as 11-10-23
s/^(\d+\s+$date\s+)(\S+)  /$1"$2"/;
Same thing here, if the name field is not big enough to fit two quotes, it won't be added. You should check your file and see if that is ever the case. If it is the case, you will need to deal with it somehow.
I sometimes find that putting the regexps for certain fields in separate variables helps with legibility, as with $date above.
Good luck!

Related

Writing a regex pattern for text if not 255/255

I'm currently writing a Perl program where I need a regex pattern, but there is something I don't know how to do.
Text
Reliability: 255/255
I need to write a regex where if reliability is 255/255 then it's okay (I have that one: Reliability:\s+(255\/255))
but if Reliability is different from 255/255 it's not.
I need to do it without an else. I just need the regex if possible.
Clarification
I send a command to a router: show interface $WAN_Int, where $WAN_Int is an IP address. I put the result in a table. Then I read the table and search for Reliability: 255/255. If it's 255, it's OK. If it's not, there is a problem.
my $ok = $string eq 'Reliability: 255/255';
or, if it must be a regex
my $ok = $string =~ m{^Reliability: 255/255$};
$ok is 1 if $string is equal to the text, or an empty string otherwise.
Both the eq operator and the regex match operator in scalar context return true (1) on success and false (an empty string) otherwise, so you can use them in an expression and assign their return value (or decide based on it).
Please keep the precedence table in mind when doing this, or use parentheses liberally.
Given the clarification a regex is better here, to allow for flexibility in input as it comes from another interface which may have (or pick up) slight variations in the expected format.
Then allow for multiple spaces between words (\s+), for the phrase being only a part of the string (drop anchors ^ and $), perhaps for the word not being capitalized
my $ok = $string =~ m{\b[Rr]eliability:\s+255/255\b};
where \b is the word-boundary anchor, since trailing numbers after 255 may compromise this.
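A small sketch of how that pattern might be applied to the stored command output. The helper name reliability_ok and the sample lines are invented; real router output will look different:

```perl
use strict;
use warnings;

# Hypothetical helper: true if any line reports full reliability.
sub reliability_ok {
    my (@lines) = @_;
    my $count = grep { m{\b[Rr]eliability:\s+255/255\b} } @lines;
    return $count > 0;
}

# Made-up command output standing in for the real table contents.
my @output = (
    'Interface is up, line protocol is up',
    '  Reliability:   255/255, txload 1/255',
);
print reliability_ok(@output) ? "OK\n" : "PROBLEM\n";
```

The trailing \b is what rejects a value such as 255/2550, since there is no word boundary between the final 5 and a following digit.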
This may work for you:
Reliability:\s+(?!255)\d+/255\s*$
https://regex101.com/r/1JAsG6/5/
Also, for matching 255 with your original regexp, it can be simplified (you don't need classes and you can avoid the group):
Reliability:\s+255/255\s*$
https://regex101.com/r/LnfXVI/3/
NOTE: If you use / as the delimiter for the regexp, you will need to escape each / in the pattern as \/
Extra: If you also want to disallow strings like Reliability: 123456/255, use this instead: Reliability:\s+(?!255|0\d)\d{1,3}/255\s*$
https://regex101.com/r/1JAsG6/8/
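To see how the stricter pattern behaves, here is a short sketch that runs it over a few made-up strings. The helper name reliability_degraded is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line reports a 1-3 digit value
# other than 255 (and without a leading zero) out of 255.
sub reliability_degraded {
    my ($line) = @_;
    return $line =~ m{Reliability:\s+(?!255|0\d)\d{1,3}/255\s*$} ? 1 : 0;
}

for my $line ('Reliability: 255/255',
              'Reliability: 254/255',
              'Reliability: 123456/255') {
    printf "%-24s %s\n", $line,
        reliability_degraded($line) ? 'problem' : 'no match';
}
```

Only the 254/255 line is flagged; 255/255 is rejected by the lookahead, and 123456/255 fails because \d{1,3} must be followed directly by the slash.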

How to grab certain number of characters before and after a perl regex match?

I am crafting regexes that match certain terms best within html code. I'm doing this in an iterative process to whittle down matches to exclude things I don't want. So I craft a regex, run it, and spit out data that I then look through to see how well my match is working. For example, if I am looking for the term "tema" (the name of a trade association that provides standards) I might notice that it also matches "sitemap" and alter my regex in some way to exclude the unwanted items.
To make this easier, I want to print out my match along with some context, say 20 characters before and after the match, rather than the entire line, to make it easier to scan through the results. This is proving frustratingly hard to accomplish in a simple fashion.
For example, I would think this would work:
$line =~ /(.{,20}tema.{,20})/i;
That is, I want to match up to 20 of anything before and after my keyword and include it in the "context" I print out for scanning.
But it doesn't. Am I missing something here? If a{,20} will match up to 20 'a' characters, why won't .{,20} match 20 of anything that '.' will match?
Scratching my head.
Syntax:
atom{n} (exactly n)
atom{n,} (n or more)
atom{n,m} (n or more, but no more than m)
So,
say $1 if $line =~ /(.{0,20}tema.{0,20})/i;
Or if you're using /g and might get overlapping matches:
say "$1$2$3" while $line =~ /(.{0,20})\K(tema)(?=(.{0,20}))/ig;
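(In Perl before 5.34, a{,20} doesn't "match up to 20 a characters": without a lower bound the braces are taken literally. Support for an omitted lower bound was added in Perl 5.34.)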
How about searching with m/^(.*)tema(.*)$/ then use substr or similar to get the last characters of $1 and the first from $2.
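A related sketch of the substr idea, using the match offsets in @- and @+ rather than captured groups. The helper name contexts and its parameters are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $re in $text together with
# up to $n characters of context on each side.
sub contexts {
    my ($text, $re, $n) = @_;
    my @out;
    while ($text =~ /$re/g) {
        my ($start, $end) = ($-[0], $+[0]);   # offsets of the whole match
        my $from = $start > $n ? $start - $n : 0;
        # substr silently truncates if the requested length runs past the end.
        push @out, substr($text, $from, $end - $from + $n);
    }
    return @out;
}

print "$_\n" for contexts('see the sitemap and the TEMA standards page',
                          qr/\btema\b/i, 10);
```

Because the context is cut out after the match rather than being part of it, overlapping contexts of neighbouring matches are not a problem.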

Simple regex - finding words including numbers but only on occasion

I'm really bad at regex, I have:
/(@[A-Za-z-]+)/
which finds words after the @ symbol in a textbox, however I need it to ignore email addresses, like:
foo@things.com
however it finds @things
I also need it to include numbers, like:
@He2foo
however it only finds the @He part.
Help is appreciated, and if you feel like explaining regex in simple terms, that'd be great :D
/(?:^|(?<=\s))@([A-Za-z0-9]+)(?=[.?]?\s)/
@This (matched) regex ignores@this but matches on @separate tokens as well as tokens at the end of a sentence like @this. or @this? (without picking the . or the ?) And yes email@addresses.com are ignored too.
The regex, while matching on @, also lets you quickly access what's after it (like userid in @userid) by picking up regex group 1. Check the PHP documentation on how to work with regex groups.
You can just add 0-9 to your regex, like so:
/(@[A-Za-z0-9-]+)/
Don't think any more explanation is needed since you've been able to come this far by yourself. 0-9 is just like a-z (though numeric of course).
In order to ignore email addresses you will need to provide more specific requirements. You could try preceding @ with (^| ), which basically states that your value MUST be preceded by either the start of the string (so nothing really, though at the start) or a space.
Extending this, you can also use ($| ) on the end to require the value to be followed by the end of the string or a space (which means no period is allowed, and a period is required for a valid email address).
Update
$subject = "@a @b a@b a@ @b";
preg_match_all("/(^| )@[A-Za-z0-9-]+/", $subject, $matches);
print_r($matches[0]);

remove up to _ in perl using regex?

How would I go about removing all characters before a "_" in perl? So if I had a string that was "124312412_hithere" it would replace the string with just "hithere". I imagine there is a very simple way to do this using regex, but I am still new to dealing with that so I need help here.
Remove all characters up to and including "_":
s/^[^_]*_//;
Remove all characters before "_":
s/^[^_]*(?=_)//;
Remove all characters before "_" (assuming the presence of a "_"):
s/^[^_]*//;
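The three variants differ mainly in what they do to the "_" itself and to input without one. A small sketch that applies each to a copy of the string (the helper name variants and the second test string are made up):

```perl
use strict;
use warnings;

# Apply each substitution to its own copy of the input.
sub variants {
    my ($s) = @_;
    (my $incl  = $s) =~ s/^[^_]*_//;       # up to and including "_"
    (my $ahead = $s) =~ s/^[^_]*(?=_)//;   # before "_", only if one exists
    (my $any   = $s) =~ s/^[^_]*//;        # before "_", or everything if none
    return ($incl, $ahead, $any);
}

for my $s ('124312412_hithere', 'nounderscore') {
    printf "%-18s => %s\n", $s, join ' | ', map { "'$_'" } variants($s);
}
```

For the string without an underscore, the first two leave it untouched while the third empties it, which is why that variant assumes a "_" is present.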
This is a bit more verbose than it needs to be, but would be probably more valuable for you to see what's going on:
my $astring = "124312412_hithere";
my $find = "^[^_]*_";
my $replace = "_";
$astring =~ s/$find/$replace/;
print $astring;
Also, there's a bit of conflicting requirements in your question. If you just want hithere (without the leading _), then change it to:
$astring =~ s/$find//;
I know it's slightly different than what was asked, but in cases like this (where you KNOW the character you are looking for exists in the string) I prefer to use split:
$str = '124312412_hithere';
$str = (split (/_/, $str, 2))[1];
Here I am splitting the string into parts, using the '_' as a delimiter, but to a maximum of 2 parts. Then, I am assigning the second part back to $str.
There's still a regex in this solution (the /_/) but I think this is a much simpler solution to read and understand than regexes full of character classes, conditional matches, etc.
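One thing to watch with the split version is input that has no "_" at all, where the second element is undef. A hedged sketch that guards against that; the helper name after_underscore and the choice to fall back to the original string are assumptions:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text after the first "_".
sub after_underscore {
    my ($str) = @_;
    # Limit of 2: everything after the first "_" stays in one piece.
    my (undef, $rest) = split /_/, $str, 2;
    # Assumed policy: return the input unchanged when there is no "_".
    return defined $rest ? $rest : $str;
}

print after_underscore($_), "\n" for '124312412_hithere', 'a_b_c', 'plain';
```

The limit argument is what keeps later underscores, as in a_b_c, inside the returned part.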
You can try this:
$_ = "124312412_hithere";
s/^[^_]*_//;
print $_; # hithere
Note that this will also remove the _ (as I infer from your sample output). If you want to keep the _ (your first statement leaves it unclear which you want), you would probably need to use a look-ahead as in @ikegami's answer.
Also, just to make it a little clearer: substitution and matching are applied to $_ by default, so you don't need to bind to $_ explicitly. That is implied.
So, s/^[^_]*_//; is essentially the same as $_ =~ s/^[^_]*_//;, but the latter form is not required.

Regular expression to select entire word except first letter, including words such as "Jack's" and "merry-go-round"

I'm trying to use a regular expression to select all of each word except the first character, much as @mahdaeng wanted to do here. The solution offered to his question was to use \B[a-z]. This works fine, except when a word contains some form of punctuation, such as "Jack's" and "merry-go-round". Is there a way to select the entire word including any contained punctuation? (Not including outside punctuation such as "? , ." etc.)
If you can enumerate the acceptable in-word punctuation, you could just expand upon the answer you linked:
\B[a-zA-Z'-]+
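Since the question names no language, here is a quick Perl sketch of that pattern in use. The helper name word_tails is made up, and the sample sentence is the one used elsewhere in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: return each word minus its first character.
# \B only succeeds away from a word boundary, so a match cannot begin
# on the first letter of a word.
sub word_tails {
    my ($text) = @_;
    my @tails = $text =~ /\B[a-zA-Z'-]+/g;
    return @tails;
}

print join(' ', word_tails("Jack's merry-go-round revolves way too fast!")), "\n";
```

One caveat: a word that starts with an apostrophe or hyphen (for example 'tis at the start of a string) is matched whole, because there is no word boundary in front of the punctuation.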
A regex really isn't necessary here, since you can just split your word on spaces and deal with each word accordingly. Since you don't mention an underlying language, here's an implementation in Perl:
use strict;
use warnings;
$_ = "Jack's merry-go-round revolves way too fast!";
my @words = split /\s+/;
foreach my $word (@words)
{
    my $stripped_word = substr($word, 1);
    $stripped_word =~ s/[^a-z]$//i; # stripping out end punctuation
    print "$stripped_word\n";
}
The output is:
ack's
erry-go-round
evolves
ay
oo
ast
\B[^\s]+
(where ^\s means "not whitespace") should get you what you want assuming the words are whitespace-delimited. If they're also punctuation-delimited, you might need to enumerate the punctuation:
\B[^\s,.?!]+