string pattern and regex - regex

I have a file with different lines, among which I have some lines like
173.194.034.006.00080-138.096.201.072.49934
the pattern is 3 numbers and then a dot and then 3 numbers and then a dot, etc.
I want to use awk, grep, or sed for this purpose. How do I express this regular expression?

Assuming you want to get lines with 1 series like 123. exists, do
grep '[0-9][0-9][0-9]\.' file > numbersFile
If you want 2 series like 123.345., then do
grep '[0-9][0-9][0-9]\.[0-9][0-9][0-9]\.' file > numbersFile
etc, etc.
Each [0-9] means match only one occurance of characters in the range between 0-9 (0,1,2,3,4,5,6,7,8,9).
Because the '.' char has a special meaning in a normal grep regexp, you nave to escape it like \. to indicate "Just match the '.' char (only!) ;-)
There are fancy extensions to grep that allow you to specify the pattern once, and include a qualifier like {3} or sometimes \{3\} (to indicate 3 repetitions). But this extension isn't portable to older Unix like Solaris, AIX, and others.
Here's a simple test to see if your system supports qualifiers. (Super Grep-heads are welcome to correct my terminology :-).
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{10\}\.'
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{2\}\.'
The first test should fail, the 2nd will succeed if your grep supports qualifiers.
It doesn't hurt to learn the long-hand solution (as above), and you can be sure this will work with any grep.
IHTH.

In awk I'd probably build up the string and then search for it as:
BEGIN {
p = "[.]"
d = "[[:digit:]]"
d3 = d d d # or d"{3}"
d5 = d d d d d # or d"{5}"
re = d3 p d3 p d3 p d3 p d5 # or "(" d3 p "){4}" d5
}
$0 ~ re "-" re
but it really all depends what you want to do with it.

By the look of it, these are IP addresses, followed by a port number, a dash and then the IP address/port number combination again.
If you're on a modern UNIX/Linux system then
grep -P '(\d{3}\.){4}\d{5}-(\d{3}\.){4}\d{5})'
would do the trick -- although may not be the most portable way to do it. This uses the '-P' for "use Perl regular expressions" option, which some people might consider to be cheating!
You didn't say if you've got extra text either before or after these strings on the line. If you have then you can use the '-o' option just to extract the matched text and ignore everything else.

Related

How to use the literal string "STDIN" in negative lookbehind in perl? [duplicate]

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.
use strict;
use warnings;
my $text = "M Y H A P P Y T E X T";
my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';
if ($text =~ m/$regex/){
print "true\n";
}
else {
print "false\n";
}
This gives the error "Variable length lookbehind not implemented in regex."
I am hoping you can help with several issues:
I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i) the error actually goes away. How will perl interpret this part of the regex? I would think the first two characters are evaluated to "optional literal parentheses" except that the parentheses isn't escaped and also in that case I would get a different syntax error because the closing parentheses would then not be matched.
This behavior starts somewhere between Perl 5.16.3_64 and 5.26.1_64, at least in Strawberry Perl. The former version is fine with the code, the latter is not. Why did it start?
I have reduced your problem to this:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");
Due to presence of /i (case insensitive) modifier and presence of certain character combinations such as "ss" or "st" that can be replaced by a Typographic_ligature causing it to be a variable length (/August/i matches for instance on both AUGUST (6 characters) and august (5 characters, the last one being U+FB06)).
However if we remove /i (case insensitive) modifier then it works because typographic ligatures are not matched.
Solution: Use aa modifiers i.e.:
/(?<!st)A/iaa
Or in your regex:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");
From perlre:
To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
See a closely related discussion here
That's because st can be a ligature. The same happens to fi and ff:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
my $fi = 'fi';
print $fi =~ /fi/i;
So imagine something like fi|fi where, indeed, the lengths of alternatives isn't the same.
st could be represented in a 1-character stylistic ligature as st or ſt, so its length could be 2 or 1.
Quickly finding perl's full list of 2→1-character ligatures using a bash command:
$ perl -e 'print $^V'
v5.26.2
$ for lig in {a..z}{a..z}; do \
perl -e 'print if /(?<!'$lig')x/i' 2>/dev/null || echo $lig; done
ff fi fl ss st
These respectively represent the ff, fi, fl, ß, and st/ſt ligatures. (ſt represents ſt, using the obsolete long s character; it matches st and it does not match ft.)
Perl also supports the remaining stylistic ligatures, ffi and ffl for ffi and ffl, though this isn't noteworthy in this context since lookbehinds already have issues with ff and fi/fl separately.
Future releases of perl may include more stylistic ligatures, though all that remain are font-specific (e.g. Linux Libertine has stylistic ligatures for ct and ch) or debatably stylistic (such as the Dutch ij for ij or the obsolete Spanish ꝇ for ll). It doesn't seem appropriate to have this treatment for ligatures that are not entirely interchangeable (nobody would accept dœs for does), though there are other scenarios, such as including ß thanks to its uppercase form being SS.
Perl 5.16.3 (and similarly old versions) only stumble on ss (for ß) and fail to expand the other ligatures in lookbehinds (they have fixed width and will not match). I didn't seek out the bugfix to itemize exactly which versions are affected.
Perl 5.14 introduced ligature support, so earlier versions don't have this problem.
Workarounds
Workarounds for /(?<!August)x/i (only the first will properly avoid August):
/(?<!Augus[t])(?<!Augu(?=st).)x/i (absolutely comprehensive)
/(?<!Augu(?aa:st))x/i (just the st in the lookbehind is "ASCII-safe" ²)
/(?<!(?aa)August)x/i (the whole the lookbehind is "ASCII-safe" ²)
/(?<!August)x/iaa (the whole regex is "ASCII-safe" ²)
/(?<!Augus[t])x/i (breaks ligature seeking ¹)
/(?<!Augus.)x/i (slightly different, matches more)
/(?<!Augu(?-i:st))x/i (case-sensitive st in lookbehind, won't match AugusTx)
These toy with removing the case-insensitive modifier¹ or adding the ASCII-safe modifier² in various places, often requiring the regex writer to specifically know of the variable-width ligature.
The first variation (which is the only comprehensive one) matches the variable widths with two lookbehinds: first for the six character version (no ligatures as noted in the first quote below) and second for any ligatures, employing a forward lookahead (which has zero width!) for st (including the ligatures) and then accounting for its single character width with a .
Two segments of the perlre man page:
¹ Case-insensitive modifier /i & ligatures
There are a number of Unicode characters that match a sequence of
multiple characters under /i. For example, "LATIN SMALL LIGATURE
FI" should match the sequence fi. Perl is not currently able to
do this when the multiple characters are in the pattern and are
split between groupings, or when one or more are quantified. Thus
"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches [in perl 5.14+]
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
² ASCII-safe modifier /aa (perl 5.14+)
To forbid ASCII/non-ASCII matches (like k with \N{KELVIN SIGN}),
specify the a twice, for example /aai or /aia. (The first
occurrence of a restricts the \d, etc., and the second occurrence
adds the /i restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for /i matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice gives
added protection.
Put (?i) after lookbehind:
(?<!(Mon|Fri|Sun)day |August )(?i)abcd(?-i)
or
(?<!(Mon|Fri|Sun)day |August )(?i:abcd)
To me it seems to be a bug.

grep to search for a specified pattern

i want to grep all the texts in a file which contain symbols (non alpha numeric) and start with a number and which have spaces between them
grep -i "^[0-9]\|[^a-zA-Z0-9]\| "
I have written the following grep command which works perfectly , however i also wish to include those texts which are not in a particular limit say for example all those texts which are less than 3 and more than 15 should be greped
How can include that limit pattern as well in one command
I tried using
{3,15}
and all but could not get the desired output
sample input
aa
9dsa
abcd
abc#$
ab d
Sample output
aa //because length less than 3
ab d //because has space in between
9dsa // because starts with a number
abc#$ //because has special symbols in it
For clarity, simplicty, robustness, portability, etc. just use awk instead of grep to search for non-trivial conditions:
$ awk 'length()<3 || length()>15 || /[^[:alnum:]]/ || /[[:space:]]/ || /^[0-9]/' file
aa
9dsa
abc#$
ab d
I mean seriously, that couldn't get much clearer/simpler and it will work in any POSIX awk and it's trivial to change if/when your requirements change.
Below expression should help you find the required lines. I am assuming you will use grep -E so the alternation will work properly
^[[:digit:]]|[##$%^&*()]|^.{0,3}$|^.{15,}$
Below is the explanation for the regex
^[[:digit:]] - Match a line that starts with a number
[##$%^&*()] - Match a line containing the specified symbols.
Alternatively you can use [^[:alnum:]], if you want
the symbol to match any non alpha numeric character.
Beware that a space, underscore, tab, quote, etc are all
examples of non alpha numeric characters
^.{0,3}$ - Match a line containing less than 3 characters
^.{15,}$ - Match a line containing more than 15 characters

Using grep to find keywords, and then list the following characters until the next ; character

I have a long list of chemical conditions in the following form:
0.2M sodium acetate; 0.3M ammonium thiosulfate;
The molarities can be listed in various ways:
x.xM, x.x M, x M
where the number of x digits vary. I want to do two things, select those numbers using grep, and then list only the following characters until ;. So if I select 0.2M in the example above, I want to be able to list sodium acetate.
For selecting, I have tried the following:
grep '[0-9]*.[0-9]*[[:space:]]*M' file
so that there are arbitrary number of digits and spaces, but it always ends with M. The problem is, it also selects the following:
0.05MRbCl+MgCl2;
I am not quite sure why this is selected. Ideally, I would want 0.05M to be selected, and then list RbCl+MgCl2. How can I achieve this?
(The system is OS X Yosemite)
It matches that because:
[0-9]* matches 0
. matches any character (this is the . in this case, but you probably meant to escape it)
[0-9]* matches 05
[[:space:]]* matches the empty string between 05 and M
M matches M
As for how to do what you want: I think that if you don't want the numbers to be printed with the output, this would require either a lookbehind assertion or the ability to print a specific capture group, which it sounds like OS X's grep doesn't support. You could use a similar approach with a slightly more powerful tool, though:
$ cat test.txt
0.2M sodium acetate; 0.3M ammonium thiosulfate;
0.05MRbCl+MgCl2;
1.23M dihydrogen monoxide;
45 M xenon quadroxide;
$ perl -ne 'while (/([0-9]*\.)?[0-9]+\s*M\s*([^;]+)/g) { print "$2\n"; }' test.txt
sodium acetate
ammonium thiosulfate
RbCl+MgCl2
dihydrogen monoxide
xenon quadroxide
Written out, that regex is:
([0-9]*\.)? optionally, some digits and a decimal point
[0-9]+ one or more digits
\s*M\s* the letter M, with spacing around it
([^;]+) all the characters up until the next semicolon (the thing you want to print)
With GNU awk for multi-char RS, gensub() and \s:
$ awk -vRS=';\\s*' -vm='0.2M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.2M sodium acetate
$ awk -vRS=';\\s*' -vm='0.05M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.05MRbCl+MgCl2

How to write this in different regex flavours

I have the following data:
a b c d FROM:<uniquepattern1>
e f g h TO:<uniquepattern2>
i j k l FROM:<uniquepattern1>
m n o p TO:<uniquepattern3>
q r s t FROM:<uniquepattern4>
u v w x TO:<uniquepattern5>
I would like a regex query that can find the contents of TO: when FROM:<uniquepattern1> is encountered, so the results would be uniquepattern2 and uniquepattern3.
I am hopeless with regex, I would appreciate any pointers on how to write this (lookahead parameters?) and any differences between regex on different platforms (eg the C# .NET Regex versus Grep vs Perl) that might be relevant here.
Thank you.
Try:
/FROM:<uniquepattern1>.*\r?\n.*?TO:<(.*?)>/
This works by first finding the FROM anchor and then use a dot wildcard. The dot operator does not match a newline so this will consume the rest of the line. A non-greedy dot wildcard match then consumes up to the next TO and captures what's between the angle brackets.
your requirement for file parsing is simple. there is no need to use regular expression. Open the file for reading, go through each line check for FROM:<uniquepattern1>, get the next line and print them out. Furthermore, your TO lines are only separated by ":". therefore you can use that as field delimiter.
eg with awk
$ awk -F":" '/FROM:<uniquepattern1>/{getline;print $2}' file
<uniquepattern2>
<uniquepattern3>
the same goes for other languages/tools

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurances of the pattern '--' that are inside single quotes in long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace('/--(?=[^\']*'([^']|'[^']*')*$)/', "", input)
Python: re.sub(r'--(?=[^\']*'([^']|'[^']*')*$)', "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print re.sub(p, r'\1-', txt)
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
( # Group 1
(?:^[^']*')? # Start of string, up till the first single quote
[^']*? # Inside the single quotes, as few characters as possible
(?:
'[^']*' # No double dashes inside theses single quotes, jump to the next.
[^']*?
)*? # as few as possible
)
(-{2,}) # The dashes themselves (Group 2)
If there where different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
return "'".join(
part.replace("--", "") if (ix&1) else part
for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdotdot.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.