Is it possible to conditionally insert text via regex substitution in Vim? - regex

I have several lines from a table that I’m converting from Excel to the Wiki format, and want to add link tags for part of the text on each line, if there is text in that field. I have started the conversion and come to this point:
|10.20.30.9||x|-||
|10.20.30.10||x|s04|Server 4|
|10.20.30.11||x|s05|Server 5|
|10.20.30.12|||||
|10.20.30.13|||||
What I want is to change the fourth column from, e.g., s04 to [[server:s04]]. I do not wish to add the link brackets if the line is empty, or if it contains -. If that - is a big problem, I can remove it.
All my attempts at a regex that matches only part of the line end up replacing the whole line.

Consider using awk to do this:
#!/bin/bash
awk -F'|' '
BEGIN { OFS = "|" }
{
    # Wrap column 5 in link brackets unless it is empty or "-".
    if ($5 != "" && $5 != "-")
        $5 = "[[server:" $5 "]]";
    print $0
}'
NOTE: I've edited this script since the first version. This current one is, IMO, better.
Then you can process it with:
cat $FILENAME | sh $AWK_SCRIPTNAME
The -F'|' switch tells awk to use | as a field separator. The if and print statements are fairly self-explanatory: every record is printed, with column 5 wrapped in [[server:...]] only if it is neither "-" nor "".
Why column 5 and not column 4? Because you use | at the beginning of each record, awk takes the 'first' field ($1) to be an empty string that it believes should have occurred before this first |.
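For comparison, the same field logic can be sketched in Python. This is only an illustration, not part of the answer above: the function name link_server_column and the zero-based index 4 (mirroring awk's $5) are assumptions made for the example.

```python
def link_server_column(line, index=4):
    """Wrap one '|'-separated field in [[server:...]] unless it is empty or '-'."""
    fields = line.split("|")
    # A leading '|' yields an empty first element, so the fourth visible
    # column sits at index 4 (awk's $5, because awk counts fields from 1).
    if len(fields) > index and fields[index] not in ("", "-"):
        fields[index] = "[[server:" + fields[index] + "]]"
    return "|".join(fields)


rows = [
    "|10.20.30.9||x|-||",
    "|10.20.30.10||x|s04|Server 4|",
    "|10.20.30.12|||||",
]
for row in rows:
    print(link_server_column(row))
```

Only the second row changes; rows whose fourth column is empty or "-" come back untouched.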

This seems to do the job on the sample you give up there (with Vim):
%s/^|\%([^|]*|\)\{3}\zs[^|]*/\=(empty(submatch(0)) || submatch(0) == '-') ? submatch(0) : '[[server:'.submatch(0).']]'/
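Outside Vim, the same "decide per match" idea can be sketched with a replacement function, which plays a role similar to Vim's \= expression. The names FOURTH_COLUMN and wrap_fourth_column are made up for this example, and the pattern assumes every line starts with |, as in the sample.

```python
import re

# Skip the leading '|' and three '|'-terminated columns, then capture the fourth.
FOURTH_COLUMN = re.compile(r"^(\|(?:[^|]*\|){3})([^|]*)")


def wrap_fourth_column(line):
    def replace(match):
        prefix, value = match.group(1), match.group(2)
        if value in ("", "-"):
            return match.group(0)  # leave empty and '-' fields untouched
        return prefix + "[[server:" + value + "]]"

    return FOURTH_COLUMN.sub(replace, line, count=1)


print(wrap_fourth_column("|10.20.30.11||x|s05|Server 5|"))
print(wrap_fourth_column("|10.20.30.9||x|-||"))
```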

It's probably better to use awk as ArjunShankar writes, but this should work if you remove "-" ;) Didn't get it to work with it there.
:%s/^\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]\+\)|/\1\2\3\4[[server:\5]]|/
It's just stupid though. The first 4 groups are identical (match anything up to | 4 times). Didn't get it to work with {4}. The fifth matches the s04/s05 strings (it just requires that the field is not empty, therefore "-" must be removed).

Adding a bit more readability to the ideas given by others:
:%s/\v^%(\|.{-}){3}\|\zs(\w+)/[[server:\1]]/
Job done.
Note how {3} indicates the number of columns to skip. Also note the use of \v for very magic regex mode. This reduces the complexity of your regex, especially when it uses more 'special' characters than literal text.

Let me recommend the following substitution command.
:%s/^|\%([^|]*|\)\{3}\zs[^|-]\+\ze|/[[server:&]]/

try
:1,$s/|\(s[0-9]\+\)|/|[[server:\1]]|/
assuming that your s04, s05 are always s and a number

A simpler substitution can be achieved with this:
%s/^|.\{-}|.\{-}|.\{-}|\zs\(\w\{1,}\)\ze|/[[server:\1]]/
   ^^^^^^^^^^^^^^^^^^^^ -> Match the first 3 columns (empty or not);
                       ^^^ -> Marks the "start of match";
                          ^^^^^^^^^^^ -> Match only if the 4th column contains letters, numbers and `_` ([0-9A-Za-z_]);
                                     ^^^ -> Marks the "end of match";
If _, like -, can appear in the column but must not be substituted, use the following regex instead: %s/^|.\{-}|.\{-}|.\{-}|\zs\([0-9a-zA-Z]\{1,}\)\ze|/[[server:\1]]/

Related

Regex match until third occurrence of a char is found, counting occurrence of said char starting from the end of string

Let's dive in : Input :
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
Desired output :
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
Starting from the beginning of my string, I need to match until the third occurrence of " _ " (underscore) is found, but I need to count " _ " (underscore) occurrence starting from end of string.
Any tips are appreciated,
Best regards
I believe this regex should do the trick!
^.*?(?=_[^_]*_[^_]*_[^_]*$)
Explanation:
^ the start of the line
.*? matches as few characters as possible (a lazy match)
(?=...) asserts that its contents follow our match
_[^_]*_[^_]*_[^_]* Looks for exactly three underscores after our match.
$ the end of the line
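As a quick check, the pattern can be run against the sample names in Python, whose re module supports both lazy quantifiers and lookahead. This is a minimal sketch; the names PREFIX and prefix_of are invented for the example.

```python
import re

# Lazily match from the start until exactly three underscores remain before the end.
PREFIX = re.compile(r"^.*?(?=_[^_]*_[^_]*_[^_]*$)")


def prefix_of(name):
    match = PREFIX.match(name)
    # Names with fewer than three underscores do not match; return them unchanged.
    return match.group(0) if match else name


for name in ["p9_rec_tonly_.cr_called.seg",
             "p9_tonly_.cr_called.seg",
             "p26_rec_nor_nor_.cr_called.seg"]:
    print(prefix_of(name))
```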
You should think beyond regex to solve this problem. For example, if you are using Python just use rsplit with a limit of 3 and get the first resulting string:
>>> data = [
'p9_rec_tonly_.cr_called.seg',
'p9_tonly_.cr_called.seg',
'p10_nor_nor_.cr_called.seg',
'p10_rec_tn_.cr_called.seg',
'p10_tn_.cr_called.seg',
'p26_rec_nor_nor_.cr_called.seg',
'p26_rec_tn_.cr_called.seg',
'p26_tn_.cr_called.seg',
]
>>> for d in data:
...     print(d.rsplit('_', 3)[0])
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
bash you say? Well it's not a regular expression but you can do pattern substitutions (or stripping with bash):
while read var ; do echo ${var%_*_*_*} ; done <<EOT
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
EOT
${var%_*_*_*} expands variable var, stripping the shortest suffix that matches _*_*_*.
Otherwise to perform regex operations in shell, you could normally ask a utility like sed for help and feed your lines through for instance this:
sed -e 's#_[^_]*_[^_]*_[^_]*$##'
or for short:
sed -e 's#\(_[^_]*\)\{3\}$##'
Find three groups of _ and zero or more characters of not _ at the end of line $ replacing them with nothing ('').
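The second sed command has a direct counterpart in most regex libraries. A hedged Python sketch, where strip_last_three is a hypothetical helper name:

```python
import re


def strip_last_three(name):
    # Remove three groups of '_' followed by non-underscore characters,
    # anchored at the end of the string.
    return re.sub(r"(?:_[^_]*){3}$", "", name)


for name in ["p10_nor_nor_.cr_called.seg", "p10_tn_.cr_called.seg"]:
    print(strip_last_three(name))
```

Like the sed version, a name with fewer than three underscores is left unchanged because the pattern cannot match.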

awk finding a column and trimming

I have a text file with irregular structure like following
first_name1 last_name1 designation1 email1 phone_number1
first_name2 last_name2 designation2 email2
first_name3 last_name3 designation3 email3 phone_number3 address3
As you see, the email could be the last column, the second-to-last column or the third-to-last column. This means one simply cannot use $NF to get the email. My goal is to get the email address wherever it is on the line and then extract the portion before the @, so for instance if email1 = foobar@dept.company.com then I want to extract foobar. How can I write an awk command to extract the first portion of the email address? I tried this, but it looks for an exact match. How can I turn it into a regex to get the job done?
awk '{for(i=1;i<=NF;i++){ if($i=="foobar@dept.company.com"){print $i} } }' users.txt
You are comparing $i to the string "foobar@dept.company.com", so yes, of course this will only make an exact comparison. What it seems you are looking for is whether or not $i matches (~) a regular expression (/.../ instead of "..."), then tailor the regex to your needs. Try something like:
awk '{for(i=1;i<=NF;++i){if ($i ~ /.+@.+/){sub(/@.*$/, "", $i); print $i; next}}}'
The regex /.+@.+/ matches a string with an @ in it, and some non-empty thing before it and after it. This will not match, for example, @foobar or foobar@, or just @. You might want to consider using something more like /.+@.+\..+/ which would match (something)@(something).(something) since domain names usually have a . in them. You can tailor this regex to be more specific, if you wish.
The sub(/@.*$/, "", $i) means to substitute in $i everything after (and including) the first @ until the end of the field ($) with an empty string "", thus leaving only the part before the @ (i.e. the username). The print $i prints it, and the next moves on to the next line (skipping any remaining fields for the current record).
I don't know awk at all but I looked the regex reference up and this should be supported: \b([^ ]*@.*?)($|[^\w@.]) in which group 1 matches the email. This just searches for something after a word boundary that contains @. The match ends at the next non-word character, excluding @ and the dot.
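The same scan over whitespace-separated fields can be sketched in Python. The function name first_local_part and its sep parameter are hypothetical; sep stands for whichever character separates the user name from the domain in the data.

```python
import re


def first_local_part(line, sep="@"):
    """Return the text before sep in the first field that looks like an address."""
    # Require something non-empty on both sides of the separator.
    pattern = re.compile(".+" + re.escape(sep) + ".+")
    for field in line.split():
        if pattern.fullmatch(field):
            return field.split(sep, 1)[0]
    return None  # no field on this line contains a usable separator


print(first_local_part("first last designation foobar@dept.company.com 555-0100"))
```

Returning after the first hit mirrors the next statement in the awk version: later fields on the same line are ignored.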

Regex: Match any character (including whitespace) except a comma

I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
To work with multiple lines in sed using regular expressions, you should look here.
EDIT:
In sed, working with newlines is a bit different. sed supports three commands for multi-line operations: N, P and D. To see how they work, see this explanation (Working with Multiple Lines), which discusses all three.
My guess is that the N command is the piece that is missing here. Adding N appends the next line to the pattern space, which lets the pattern see \n in the string.
An example from here:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y
So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure these are not useable within vim though, and I don't know how to replicate this within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.
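For comparison only: in regex engines where a negated character class also matches newlines (Python's re is one), [^,]* already spans lines once the whole text is held in one string, so no explicit loop is needed. A minimal sketch, with replace_up_to_comma as a made-up name:

```python
import re


def replace_up_to_comma(text):
    # [^,] matches any character except a comma, newlines included,
    # so the match runs from 'func' across line breaks to the first comma.
    return re.sub(r"func[^,]*", 'func("ok"', text, count=1)


source = (
    'func("bla bla bla"\n'
    '"asdfasdfasdfasdfasdfasdf"\n'
    '"asdfasdfasdf", "more strings")\n'
)
print(replace_up_to_comma(source), end="")
```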

Regex: how to determine odd/even number of occurrences of a char preceding a given char?

I would like to replace the | with OR only in unquoted terms, eg:
"this | that" | "the | other" -> "this | that" OR "the | other"
Yes, I could split on space or quote, get an array and iterate through it, and reconstruct the string, but that seems ... inelegant. So perhaps there's a regex way to do this by counting "s preceding | and obviously odd means the | is quoted and even means unquoted. (Note: Processing doesn't start until there is an even number of " if there is at least one ").
It's true that regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.
str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
Breaking that down, (?:[^"]*"){2} matches the next pair of quotes if there is one, along with the intervening non-quotes. After you've done that as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quotes until the end of the string.
Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.
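The same lookahead works unchanged in Python, which makes it easy to try against the examples. A sketch under the same well-formedness assumption (balanced, unescaped double quotes); UNQUOTED_PIPE and pipes_to_or are names invented here:

```python
import re

# A pipe is outside quotes when an even number of double quotes follows it.
UNQUOTED_PIPE = re.compile(r'\|(?=(?:(?:[^"]*"){2})*[^"]*$)')


def pipes_to_or(query):
    return UNQUOTED_PIPE.sub("OR", query)


print(pipes_to_or('"this | that" | "the | other"'))
print(pipes_to_or('"this | that" | "the | other" | yet | another'))
```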
Regexes do not count. That's what parsers are for.
You might find the Perl FAQ on this issue relevant.
#!/usr/bin/perl
use strict;
use warnings;
my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";
You don't need to count, because you don't nest quotes. This will do:
#!/usr/bin/perl
my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";
while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
$str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}
print "$str\n";
Now, let's explain that expression.
^           -- means you'll always match everything from the beginning of the string, otherwise
               the match might start inside a quote, and break everything
(...)\|     -- this means you'll match a certain pattern, followed by a |, which appears
               escaped here; so when you replace it with $1OR, you keep everything, but
               replace the |.
(?:...)*    -- This is a non-capturing group, which can be repeated multiple times; we
               use a group here so we can repeat alternative patterns multiple times.
[^"|\\]*    -- This is the first pattern. Anything that isn't a pipe, an escape character
               or a quote.
\\.         -- This is the second pattern. Basically, an escape character and anything
               that follows it.
"(?:...)*"  -- This is the third pattern. Open quote, followed by another
               non-capturing group repeated multiple times, followed by a closing
               quote.
[^\\"]      -- This is the first pattern in the second non-capturing group. It's anything
               except an escape character or a quote.
\\.         -- This is the second pattern in the second non-capturing group. It's an
               escape character and whatever follows it.
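The same idea, consuming quoted strings and escaped characters as whole tokens so that only the pipes outside them are touched, can also be sketched in Python as a single pass with a replacement function instead of a loop. This is a swapped-in variant rather than a port of the Perl code; TOKEN and replace_unquoted_pipes are hypothetical names, and it assumes every opening quote is closed.

```python
import re

# Alternatives are tried in order: a quoted string (escapes allowed),
# an escaped character outside quotes, or a bare pipe.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|\\.|\|')


def replace_unquoted_pipes(text):
    def swap(match):
        token = match.group(0)
        # Quoted strings and escapes are returned unchanged; only a bare pipe changes.
        return "OR" if token == "|" else token

    return TOKEN.sub(swap, text)


sample = r'" this \" | that" | "the | other" | "still | something | else"'
print(sample)
print(replace_unquoted_pipes(sample))
```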
The result is as follows:
" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"
Another approach (similar to Alan M's working answer):
str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');
The part inside the first group (spaced for readability):
".+?" | \w+
... basically means, something quoted, or a word. The remainder means that it was followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".
Perhaps you're looking for something like this:
(?<=^([^"]*"[^"]*")+[^"|]*)\|
Thanks everyone. Apologies for neglecting to mention this is in javascript and that terms don't have to be quoted, and there can be any number of quoted/unquoted terms, eg:
"this | that" | "the | other" | yet | another -> "this | that" OR "the | other" OR yet OR another
Daniel, it seems that's in the ballpark, ie basically a matching/massaging loop. Thanks for the detailed explanation. In js, it looks like a split, a forEach loop on the array of terms, pushing a term (after changing a | term to OR) back into an array, and a re join.
@Alan M, works nicely, escaping not necessary due to the sparseness of sqlite FTS capabilities.
@epost, accepted solution for brevity and elegance, thanks. It merely needed to be put in a more general form for unicode etc.
(".+?"|[^\"\s]+)\s*\|\s*
My solution in C# to count the quotes and then regex to get the matches:
// Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
// Count the number of quotes.
var quotesOnly = Regex.Replace(searchText, @"[^""]", string.Empty);
var quoteCount = quotesOnly.Length;
if (quoteCount > 0)
{
    // If the quote count is an odd number there's a missing quote.
    // Assume a quote is missing from the end - executive decision.
    if (quoteCount % 2 == 1)
    {
        searchText += @"""";
    }
    // Get the matching groups of strings. Exclude the quotes themselves.
    // e.g. The following line:
    // "this and that" or then and "this or other"
    // will result in the following groups:
    // 1. "this and that"
    // 2. "or"
    // 3. "then"
    // 4. "and"
    // 5. "this or other"
    var matches = Regex.Matches(searchText, @"([^\""]*)", RegexOptions.Singleline);
    var list = new List<string>();
    foreach (var match in matches.Cast<Match>())
    {
        var value = match.Groups[0].Value.Trim();
        if (!string.IsNullOrEmpty(value))
        {
            list.Add(value);
        }
    }
    // TODO: Do something with the list of strings.
}

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace("/--(?=[^']*'([^']|'[^']*')*$)/", "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
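A quick way to check the lookahead pattern outside Ruby is a short Python run against the sample string from the question. This sketch assumes balanced single quotes, as the quoted explanation does; the names QUOTED_DASHES and remove_quoted_dashes are made up for the example.

```python
import re

# '--' counts as inside quotes when an odd number of single quotes follows it:
# one closing quote, then only single characters or complete quoted groups.
QUOTED_DASHES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")


def remove_quoted_dashes(text):
    return QUOTED_DASHES.sub("", text)


sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(remove_quoted_dashes(sample))
```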
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
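The character-by-character approach described above might look like the following in Python. It is a minimal sketch that assumes single quotes are never escaped; the function name strip_quoted_dashes is hypothetical.

```python
def strip_quoted_dashes(text):
    """Drop every '--' that occurs between a pair of single quotes."""
    out = []
    inside = False  # True while the scan is between an opening and a closing quote
    i = 0
    while i < len(text):
        if text[i] == "'":
            inside = not inside
        elif inside and text.startswith("--", i):
            i += 2  # skip both dashes without copying them
            continue
        out.append(text[i])
        i += 1
    return "".join(out)


sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(strip_quoted_dashes(sample))
```

Because dashes are consumed two at a time, three dashes inside quotes leave a single dash behind.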
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
(                   # Group 1
  (?:^[^']*')?      # Start of string, up till the first single quote
  [^']*?            # Inside the single quotes, as few characters as possible
  (?:
    '[^']*'         # No double dashes inside these single quotes, jump to the next.
    [^']*?
  )*?               # as few as possible
)
(-{2,})             # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix&1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.