How to write this in different regex flavours - regex

I have the following data:
a b c d FROM:<uniquepattern1>
e f g h TO:<uniquepattern2>
i j k l FROM:<uniquepattern1>
m n o p TO:<uniquepattern3>
q r s t FROM:<uniquepattern4>
u v w x TO:<uniquepattern5>
I would like a regex query that can find the contents of TO: when FROM:<uniquepattern1> is encountered, so the results would be uniquepattern2 and uniquepattern3.
I am hopeless with regex, I would appreciate any pointers on how to write this (lookahead parameters?) and any differences between regex on different platforms (eg the C# .NET Regex versus Grep vs Perl) that might be relevant here.
Thank you.

Try:
/FROM:<uniquepattern1>.*\r?\n.*?TO:<(.*?)>/
This works by first finding the FROM anchor and then using a dot wildcard. The dot operator does not match a newline, so this consumes the rest of the line. After the line break, a non-greedy dot wildcard consumes up to the next TO: and the group captures what is between the angle brackets.
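For comparison, here is a minimal Python sketch of the same pattern. It assumes the data is already in a string (the name data is only for illustration) and relies on the default behaviour of Python's re module, where "." does not match a newline unless re.DOTALL is set.

```python
import re

# Sample data from the question, held in a string for illustration.
data = (
    "a b c d FROM:<uniquepattern1>\n"
    "e f g h TO:<uniquepattern2>\n"
    "i j k l FROM:<uniquepattern1>\n"
    "m n o p TO:<uniquepattern3>\n"
    "q r s t FROM:<uniquepattern4>\n"
    "u v w x TO:<uniquepattern5>\n"
)

# "." stops at a newline by default, so the first ".*" only eats the rest
# of the FROM line; "\r?\n" then steps onto the following line.
pattern = re.compile(r"FROM:<uniquepattern1>.*\r?\n.*?TO:<(.*?)>")

# findall returns the contents of the single capture group for each match.
matches = pattern.findall(data)
print(matches)
```

The same pattern text should also be accepted by Perl and .NET, since it only uses features those flavours share; plain grep works line by line, so the \n part would not carry over as-is.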

Your requirement for file parsing is simple, so there is no need to use a regular expression. Open the file for reading, check each line for FROM:<uniquepattern1>, then get the next line and print it. Furthermore, each TO line contains only one ":", so you can use that as the field delimiter.
For example, with awk:
$ awk -F":" '/FROM:<uniquepattern1>/{getline;print $2}' file
<uniquepattern2>
<uniquepattern3>
The same approach works in other languages and tools.
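As a rough illustration of "the same goes for other languages", here is a Python sketch of the line-by-line approach. The function name to_after_from and the in-memory list of lines are assumptions made for the example; with a real file you would iterate over the open file object instead.

```python
def to_after_from(lines, sender):
    """Return the field after ':' on each line that follows FROM:<sender>."""
    results = []
    it = iter(lines)
    for line in it:
        if f"FROM:<{sender}>" in line:
            # Like awk's getline: pull the next line from the same iterator.
            following = next(it, "")
            # Like awk -F":" with $2: keep what comes after the first colon.
            _, _, field = following.partition(":")
            results.append(field.strip())
    return results


lines = [
    "a b c d FROM:<uniquepattern1>",
    "e f g h TO:<uniquepattern2>",
    "i j k l FROM:<uniquepattern1>",
    "m n o p TO:<uniquepattern3>",
    "q r s t FROM:<uniquepattern4>",
    "u v w x TO:<uniquepattern5>",
]
results = to_after_from(lines, "uniquepattern1")
print(results)
```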

Related

Using grep to find keywords, and then list the following characters until the next ; character

I have a long list of chemical conditions in the following form:
0.2M sodium acetate; 0.3M ammonium thiosulfate;
The molarities can be listed in various ways:
x.xM, x.x M, x M
where the number of x digits vary. I want to do two things, select those numbers using grep, and then list only the following characters until ;. So if I select 0.2M in the example above, I want to be able to list sodium acetate.
For selecting, I have tried the following:
grep '[0-9]*.[0-9]*[[:space:]]*M' file
so that there are arbitrary number of digits and spaces, but it always ends with M. The problem is, it also selects the following:
0.05MRbCl+MgCl2;
I am not quite sure why this is selected. Ideally, I would want 0.05M to be selected, and then list RbCl+MgCl2. How can I achieve this?
(The system is OS X Yosemite)
It matches that because:
[0-9]* matches 0
. matches any character (this is the . in this case, but you probably meant to escape it)
[0-9]* matches 05
[[:space:]]* matches the empty string between 05 and M
M matches M
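To see what the unescaped dot lets through, here is a small Python sketch. It is not grep: the POSIX class [[:space:]] is written as \s because Python's re module does not support POSIX classes, and the variable names are made up for the example.

```python
import re

# Pattern from the question, with [[:space:]] written as \s for Python.
unescaped = re.compile(r"[0-9]*.[0-9]*\s*M")
# Same pattern with the dot escaped, so it only matches a literal ".".
escaped = re.compile(r"[0-9]*\.[0-9]*\s*M")

# No molarity here at all, yet the unescaped dot matches the "+" before "M".
sample = "RbCl+MgCl2;"
loose_hit = unescaped.search(sample)
strict_hit = escaped.search(sample)
print(loose_hit.group(0))
print(strict_hit)

# On the line from the question both variants find the molarity itself.
full_hit = escaped.search("0.05MRbCl+MgCl2;")
print(full_hit.group(0))
```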
As for how to do what you want: I think that if you don't want the numbers to be printed with the output, this would require either a lookbehind assertion or the ability to print a specific capture group, which it sounds like OS X's grep doesn't support. You could use a similar approach with a slightly more powerful tool, though:
$ cat test.txt
0.2M sodium acetate; 0.3M ammonium thiosulfate;
0.05MRbCl+MgCl2;
1.23M dihydrogen monoxide;
45 M xenon quadroxide;
$ perl -ne 'while (/([0-9]*\.)?[0-9]+\s*M\s*([^;]+)/g) { print "$2\n"; }' test.txt
sodium acetate
ammonium thiosulfate
RbCl+MgCl2
dihydrogen monoxide
xenon quadroxide
Written out, that regex is:
([0-9]*\.)? optionally, some digits and a decimal point
[0-9]+ one or more digits
\s*M\s* the letter M, with spacing around it
([^;]+) all the characters up until the next semicolon (the thing you want to print)
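If Perl is not at hand, the same idea can be sketched in Python. This assumes the file contents are already in a string (called text here for illustration) and makes the first group non-capturing, so that findall returns only the compound names.

```python
import re

# Contents of test.txt from above, held in a string for illustration.
text = (
    "0.2M sodium acetate; 0.3M ammonium thiosulfate;\n"
    "0.05MRbCl+MgCl2;\n"
    "1.23M dihydrogen monoxide;\n"
    "45 M xenon quadroxide;\n"
)

# (?:[0-9]*\.)?  optionally, some digits and a decimal point (not captured)
# [0-9]+         one or more digits
# \s*M\s*        the letter M, with optional spacing around it
# ([^;]+)        everything up to the next semicolon (captured)
pattern = re.compile(r"(?:[0-9]*\.)?[0-9]+\s*M\s*([^;]+)")

names = pattern.findall(text)
for name in names:
    print(name)
```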
With GNU awk for multi-char RS, gensub() and \s:
$ awk -vRS=';\\s*' -vm='0.2M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.2M sodium acetate
$ awk -vRS=';\\s*' -vm='0.05M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.05MRbCl+MgCl2

Regex: Match any character (including whitespace) except a comma

I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
To work with multiple lines in sed using regular expressions, you should look here.
EDIT:
In sed, working with newlines is a bit different. sed supports three commands for multi-line operations: N, P and D. To see how they work, see this explanation (Working with Multiple Lines), which discusses all three.
My guess is that the N command is the piece that is missing here. Adding N appends the next line to the pattern space, which lets the pattern see \n in the string.
An example from here:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y
So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure these are not useable within vim though, and I don't know how to replicate this within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.
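Outside sed and vim, the original question (does [^,]* also match whitespace and newlines?) can be checked directly. The Python sketch below assumes the whole input is in one string, so no N-style line joining is needed; a negated character class such as [^,] matches any character that is not a comma, newlines included.

```python
import re

# The example input from above, as one multi-line string.
source = (
    'func("bla bla bla"\n'
    '"asdfasdfasdfasdfasdfasdf"\n'
    '"asdfasdfasdf", "more strings")\n'
)

# [^,]* runs across the line breaks and stops at the first comma.
# count=1 limits the replacement to the first occurrence of func.
result = re.sub(r"func[^,]*", 'func("ok"', source, count=1)
print(result, end="")
```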

string pattern and regex

I have a file with different lines, among which I have some lines like
173.194.034.006.00080-138.096.201.072.49934
the pattern is 3 numbers and then a dot and then 3 numbers and then a dot, etc.
I want to use awk, grep, or sed for this purpose. How do I express this regular expression?
Assuming you want the lines that contain at least one group like 123., do
grep '[0-9][0-9][0-9]\.' file > numbersFile
If you want 2 series like 123.345., then do
grep '[0-9][0-9][0-9]\.[0-9][0-9][0-9]\.' file > numbersFile
etc, etc.
Each [0-9] matches exactly one occurrence of a character in the range 0-9 (0,1,2,3,4,5,6,7,8,9).
Because the '.' char has a special meaning in a normal grep regexp, you have to escape it as \. to indicate "just match the '.' char (only!)" ;-)
There are fancy extensions to grep that allow you to specify the pattern once and add a quantifier like {3} or sometimes \{3\} (to indicate 3 repetitions). But this extension isn't portable to older Unix systems like Solaris, AIX, and others.
Here's a simple test to see if your system supports quantifiers. (Super Grep-heads are welcome to correct my terminology :-).
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{10\}\.'
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{2\}\.'
The first test should fail; the second will succeed if your grep supports quantifiers.
It doesn't hurt to learn the long-hand solution (as above), and you can be sure this will work with any grep.
IHTH.
In awk I'd probably build up the string and then search for it as:
BEGIN {
p = "[.]"
d = "[[:digit:]]"
d3 = d d d # or d"{3}"
d5 = d d d d d # or d"{5}"
re = d3 p d3 p d3 p d3 p d5 # or "(" d3 p "){4}" d5
}
$0 ~ re "-" re
but it really all depends what you want to do with it.
By the look of it, these are IP addresses, followed by a port number, a dash and then the IP address/port number combination again.
If you're on a modern UNIX/Linux system then
grep -P '(\d{3}\.){4}\d{5}-(\d{3}\.){4}\d{5}' file
would do the trick -- although may not be the most portable way to do it. This uses the '-P' for "use Perl regular expressions" option, which some people might consider to be cheating!
You didn't say if you've got extra text either before or after these strings on the line. If you have then you can use the '-o' option just to extract the matched text and ignore everything else.
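The same pattern carries over to other Perl-style engines. As a sketch, here it is in Python, with finditer playing the role of grep's -o option. The names ENDPOINT and PAIR and the sample lines are made up for the example.

```python
import re

# Four groups of three digits, each followed by a dot, then five digits.
ENDPOINT = r"(?:\d{3}\.){4}\d{5}"
# Two such endpoints joined by a dash.
PAIR = re.compile(ENDPOINT + "-" + ENDPOINT)

lines = [
    "some unrelated text",
    "173.194.034.006.00080-138.096.201.072.49934",
    "prefix 173.194.034.006.00080-138.096.201.072.49934 suffix",
]

# Keep only the matched text, ignoring whatever surrounds it on the line.
found = [m.group(0) for line in lines for m in PAIR.finditer(line)]
print(found)
```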

R: Find the last dot in a string

In R, is there a better/simpler way than the following of finding the location of the last dot in a string?
x <- "hello.world.123.456"
g <- gregexpr(".", x, fixed=TRUE)
loc <- g[[1]]
loc[length(loc)] # returns 16
This finds all the dots in the string and then returns the last one, but it seems rather clumsy. I tried using regular expressions, but didn't get very far.
Does this work for you?
x <- "hello.world.123.456"
g <- regexpr("\\.[^\\.]*$", x)
g
\. matches a dot
[^\.] matches everything but a dot
* specifies that the previous expression (everything but a dot) may occur between 0 and unlimited times
$ marks the end of the string.
Taking everything together: find a dot that is followed by anything but a dot until the string ends. R requires \ to be escaped, hence \\ in the expression above. See regex101.com to experiment with regex.
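The regex itself is not specific to R. As a quick cross-check, here is a Python sketch of the same pattern; note that Python positions are 0-based, so 1 is added to get the R-style position. Inside a character class the dot needs no escape, so [^.] is enough.

```python
import re

x = "hello.world.123.456"

# A dot followed only by non-dot characters until the end of the string.
match = re.search(r"\.[^.]*$", x)

# match.start() is 0-based; add 1 for the 1-based position R reports.
position = match.start() + 1
print(position)

# For a literal dot, the built-in string method gives the same answer.
print(x.rfind(".") + 1)
```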
How about a minor syntax improvement?
This will work for your literal example where the input vector is of length 1. Use escapes to get a literal "." search, and reverse the result to get the last index as the "first":
rev(gregexpr("\\.", x)[[1]])[1]
A more proper vectorized version (in case x is longer than 1):
sapply(gregexpr("\\.", x), function(x) rev(x)[1])
and another tidier option to use tail instead:
sapply(gregexpr("\\.", x), tail, 1)
Someone posted the following answer which I really liked, but I notice that they've deleted it:
regexpr("\\.[^\\.]*$", x)
I like it because it directly produces the desired location, without having to search through the results. The regexp is also fairly clean, which is a bit of an exception where regexps are concerned :)
There is a slick stri_locate_last function in the stringi package, that can accept both literal strings and regular expressions.
To just find a dot, no regex is required, and it is as easy as
stringi::stri_locate_last_fixed(x, ".")[,1]
If you need to use this function with a regex, to find the location of the last regex match in the string, you should replace _fixed with _regex:
stringi::stri_locate_last_regex(x, "\\.")[,1]
Note the . is a special regex metacharacter and should be escaped when used in a regex to match a literal dot char.
See an R demo online:
x <- "hello.world.123.456"
stringi::stri_locate_last_fixed(x, ".")[,1]
stringi::stri_locate_last_regex(x, "\\.")[,1]

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace("/--(?=[^']*'([^']|'[^']*')*\$)/", "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
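Here is a runnable Python sketch of the same lookahead, applied to the example string from the question. The only changes are a non-capturing inner group and double quotes around the pattern so that the single quotes need no escaping; like the original, it assumes the quotes are balanced and never escaped.

```python
import re

text = ("xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' "
        "ghh-g '--ggh--' vcbcvb")

# "--" counts as inside quotes when the rest of the string holds a closing
# quote followed only by plain characters or complete 'quoted' groups.
inside_quotes = re.compile(r"--(?=[^']*'(?:[^']|'[^']*')*$)")

cleaned = inside_quotes.sub("", text)
print(cleaned)
```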
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
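A minimal Python sketch of that character-by-character approach follows. The function name is made up for the example, and it assumes single quotes are never escaped; handling escapes would need an extra check before toggling the flag.

```python
def strip_double_dashes_in_quotes(text):
    """Remove every '--' that sits between a pair of single quotes."""
    out = []
    inside = False  # True while we are between an opening and closing quote
    i = 0
    while i < len(text):
        if text[i] == "'":
            inside = not inside
            out.append("'")
            i += 1
        elif inside and text.startswith("--", i):
            i += 2  # skip both dashes
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


text = ("xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' "
        "ghh-g '--ggh--' vcbcvb")
cleaned = strip_double_dashes_in_quotes(text)
print(cleaned)
```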
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
( # Group 1
(?:^[^']*')? # Start of string, up till the first single quote
[^']*? # Inside the single quotes, as few characters as possible
(?:
'[^']*' # No double dashes inside these single quotes, jump to the next.
[^']*?
)*? # as few as possible
)
(-{2,}) # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix&1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.