Vim Regex Capture Groups [bau -> byau : ceu -> cyeu] - regex

I have a list of words:
bau
ceu
diu
fou
gau
I want to turn that list into:
byau
cyeu
dyiu
fyou
gyau
I unsuccessfully tried the command:
:%s/(\w)(\w\w)/\1y\2/g
Given that this doesn't work, what do I have to change to make the regex capture groups work in Vim?

One way to fix this is to escape the grouping parentheses, because in Vim's default ("magic") mode a bare ( or ) is matched literally:
:%s/\(\w\)\(\w\w\)/\1y\2/g
Slightly shorter (and more magic-al) is to use \v, meaning that in the pattern after it all ASCII characters except '0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning:
:%s/\v(\w)(\w\w)/\1y\2/g
See:
:help \(
:help \v
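The same distinction exists outside Vim, which may make it easier to remember. As a rough sketch (assuming a POSIX shell and a sed that accepts -E for extended regular expressions), sed's basic regular expressions need \( \) for groups, much like Vim's default mode, while -E lets bare parentheses group, much like \v. The class [[:alnum:]_] stands in for \w here:

```shell
# Sample word list from the question.
words='bau
ceu
diu
fou
gau'

# Basic regex: grouping parentheses must be escaped, as in Vim's default mode.
bre=$(printf '%s\n' "$words" | sed 's/^\([[:alnum:]_]\)\([[:alnum:]_][[:alnum:]_]\)/\1y\2/')

# Extended regex (-E): bare parentheses group, comparable to Vim's \v.
ere=$(printf '%s\n' "$words" | sed -E 's/^([[:alnum:]_])([[:alnum:]_][[:alnum:]_])/\1y\2/')

printf '%s\n' "$bre"
```

Both variables end up holding the same five rewritten words.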

You can also use this pattern which is shorter:
:%s/^./&y
%s applies the pattern to the whole file.
^. matches the first character of the line.
&y replaces the match with itself (& stands for the whole matched text) followed by y.
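For comparison, & also means "the whole match" in sed, so the same idea can be tried from a shell. A minimal sketch, assuming a POSIX sed:

```shell
# & in the replacement stands for the text matched by the pattern
# (here: the first character of each line).
result=$(printf 'bau\nceu\n' | sed 's/^./&y/')
printf '%s\n' "$result"
```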

What you've missed is escaping the capturing groups with backslashes. If you don't want to escape them, prepend \v to switch Vim's regular expression engine into very magic mode:
:%s/\v(\w)(\w\w)/\1y\2/g

You also have to escape the grouping parentheses:
:%s/\(\w\)\(\w\w\)/\1y\2/g
That does the trick.

In Vim, on a selection, the following
:'<,'>s/^\(\w\+ - \w\+\).*/\1/
or
:'<,'>s/\v^(\w+ - \w+).*/\1/
parses
Space - Commercial - Boeing
to
Space - Commercial
Similarly,
apple - banana - cake - donuts - eggs
is parsed to
apple - banana
Explanation
^ : match start of line
escape (, + and ) with a backslash, as in the first regex (accepted answer), or prepend \v (@ingo-karkat's answer)
\w\+ matches a whole word (\w alone matches only a single word character); in this example, I search for a word followed by - followed by another word
.* after the capturing group is needed to find / match / exclude the remaining text
Addendum. This is a bit off topic, but I would suggest that Vim is not well-suited for the execution of more complex regex expressions / captures. [I am doing something similar to the following, which is how I found this thread.]
In those instances, it is likely better to dump the lines to a text file and edit it "in place"
sed -i ...
or in a redirect
sed ... > out.txt
In a terminal (or BASH script, ...):
echo 'Space Sciences - Private Industry - Boeing' | sed -r 's/^((\w+ ){1,2}- (\w+ ){1,2}).*/\1/'
Space Sciences - Private Industry
cat in.txt
Space Sciences - Private Industry - Boeing
sed -r 's/^((\w+ ){1,2}- (\w+ ){1,2}).*/\1/' ~/in.txt > ~/out.txt
cat ~/out.txt
Space Sciences - Private Industry
## Note: without the > redirect, sed only prints the result to the terminal;
## the source file itself is modified only when -i is used.
## Subsequent > redirects also overwrite the output; use >> to append
## subsequent iterations to the output (preserving the previous output).
## To edit "in place" (`-i` argument/flag):
sed -i -r 's/^((\w+ ){1,2}- (\w+ ){1,2}).*/\1/' ~/in.txt
cat in.txt
Space Sciences - Private Industry
In
sed -r 's/^((\w+ ){1,2}- (\w+ ){1,2}).*/\1/'
the {1,2} interval gives the flexibility of matching between x and y repetitions ({x,y}) of a word; see https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html .
Here, since my phrases are separated by -, I can simply tweak those parameters to get what I want.
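A small sketch of how the {1,2} counts behave, assuming a sed that supports -E; [[:alnum:]_] replaces the GNU-specific \w so the command is less tied to GNU sed. Because each word inside the group is matched together with the space that follows it, the captured text ends with a trailing space:

```shell
pattern='s/^(([[:alnum:]_]+ ){1,2}- ([[:alnum:]_]+ ){1,2}).*/\1/'

# One word per phrase and two words per phrase both fit {1,2}.
one=$(printf '%s\n' 'Space - Commercial - Boeing' | sed -E "$pattern")
two=$(printf '%s\n' 'Space Sciences - Private Industry - Boeing' | sed -E "$pattern")

# Brackets make the trailing space visible.
printf '[%s]\n[%s]\n' "$one" "$two"
```

If the trailing space matters, it would have to be trimmed in a further step.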

Related

Substitute any other character except for a specific pattern in Perl

I have text files with lines like this:
U_town/u_LN0_pk_LN3_bnb_LN155/DD0 U_DESIGN/u_LNxx_pk_LN99_bnb_LN151_LN11_/DD5
U_master/u_LN999_pk_LN767888_bnb_LN9772/Dnn111 u_LN999_pk_LN767888_bnb_LN9772_LN9999_LN11/DD
...
I am trying to delete every character except /, while keeping any word matching the pattern _LN\d+_, using a Perl one-liner.
So the edited version would look like:
/_LN0__LN3__LN155/ /_LN99__LN151_LN11_/
/_LN999__LN767888_/ _LN999__LN767888__LN9772_LN9999_/
I tried below which returned empty lines
perl -pe 's/(?! _LN\d+_)[^\/].+//g' file
Below returned only '/'.
perl -pe 's/(?! _LN\d+_)\w+//g' file
Is it perhaps not possible with a one-liner and I should consider writing a code to parse character by character and see if a matching word _LN\d+_ or a character / is there?
To merely remove everything other than these patterns, you can simply match the patterns and join the matches back together
perl -wnE'say join "", m{/ | _LN[0-9]+_ }gx' file
or perhaps, depending on details of the requirements
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(See explanation in the last bullet below.)
Prints, for the first line (of the two) of the shown sample input
/_LN0__LN3_//_LN99__LN151_/
...
or, in the second version
/_LN0_LN3//_LN99_LN151_LN11/
...
The _LN155 is not there because it is not followed by _. See below.
Questions:
Why are there spaces after some / in the "edited version" shown in the question?
The pattern to keep is shown as _LN\d+_ but _LN155 is shown to be kept even though it is not followed by a _ in the input (but by a /) ...?
Are underscores optional by any chance? If so, append ? to them in the pattern
perl -wnE'say join "", m{/ | _?LN[0-9]+_? }gx' file
with output
/_LN0__LN3__LN155//_LN99__LN151_LN11_/
(It's been clarified that the extra space in the shown desired output is a mistake.)
If the underscores "overlap," like in _LN155_LN11_, in the regex they won't be both matched by the _LN\d+_ pattern, since the first one "takes" the underscore.
But if such overlapping instances need to be kept, replace the trailing _ with a lookahead for it; the lookahead doesn't consume the underscore, so it is still there to serve as the leading _ of the next match
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(if the underscores are optional and you use _?LN\d+_? pattern then this isn't needed)
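The "first match takes the underscore" effect can also be reproduced without Perl. This sketch assumes a grep that supports the widely available but non-POSIX -o option (print only the matched parts); the sample string is a fragment of the question's input:

```shell
fragment='_LN151_LN11_'

# Both underscores required: the match _LN151_ consumes the shared underscore,
# so LN11_ no longer has a leading underscore and is skipped.
strict=$(printf '%s\n' "$fragment" | grep -oE '_LN[0-9]+_' | tr -d '\n')

# Optional underscores: LN11_ is matched on its own.
loose=$(printf '%s\n' "$fragment" | grep -oE '_?LN[0-9]+_?' | tr -d '\n')

printf '%s\n%s\n' "$strict" "$loose"
```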

How to replace spaces after a certain pattern with commas?

I am new to coding and I'm trying to format some bioinformatics data. I am trying to replace all the spaces after GT:GL:GOF:GQ:NR:NV with commas, but not anything outside of the format xx:xx:xx:xx:xx (like the example). I know I need to use sed with the regex option, but I'm not very familiar with how to use it. I've never actually used sed before and got confused trying, so any help would be appreciated. Sorry if I formatted this poorly (this is my first post).
EDIT 2: I got actual data from the file this time which may help solve the problem. Removed the bad example.
New Example: I pulled this data from my actual file (this is just two samples), and it is surrounded by other data. Essentially the line has a bunch of data followed by "GT:GL:GOF:GQ:NR:NV ", after this there is more data in the format shown below, and finally there is some more random data. Unfortunately I can't post a full line of the data because it is extremely long and will not fit.
Input
0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0
Output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
With Basic Regular Expressions, you can use character classes and backreferences to accomplish your task, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\)[ ]\([0-9][0-9]*:[0-9][0-9]*\)/\1,\2/g' file
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT BB
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 10:13:12,41:41:1:13,13:131:1:1 AB GT RT
1/0 ./. 0/1 GT:GL:GOF:GQ:NR:NV 1:12:314,213:132:13:31,14:31:31 AB GT
Which basically says:
find and capture any [0-9][0-9]* one or more digits,
separated by a :, and
followed by [0-9][0-9]* one or more digits -- as capture group 1,
match a space following capture group 1 followed by capture group 2 (which is the same as capture group 1),
then replace the space separating the capture groups with a comma reinserting the capture group text using backreference 1 and 2 (e.g. \1 and \2), finally
make the replacement global (e.g. g) to replace all matching occurrences.
Edit Based On New Input Posted
If you still need all of the original commas added, and you now also want a comma between ,0 0/ (that is, where a comma precedes a single digit, followed by the space to be replaced, followed by a single digit and a forward slash), then all you need to do is make your capture groups conditional: either capture the original data as above, or capture this new segment. You do that by including an OR (\| in GNU sed's basic regex syntax) between the conditions.
For instance by adding \|,[0-9] at the end of the first capture group and \|[0-9][/] at the end of the second, e.g.
$ sed 's/\([0-9][0-9]*:[0-9][0-9]*\|,[0-9]\)[ ]\([0-9][0-9]*:[0-9][0-9]*\|[0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
If you have other caveats in your file, I suggest you post several complete lines of input, and if they are too long, then create a zip, gzip, bzip or xz file and post it to a site like pastebin and add the link to your question.
If all you really care about now is the space in ,0 0/, then you can shorten the sed command to:
$ sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g' file
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
(note: I've included [[:space:]] to handle any whitespace (space, tab, ...) instead of just the literal [ ] (space) in the new example)
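To illustrate the difference the note describes, here is a small sketch using a hypothetical variant of the sample data in which the two samples are separated by a tab rather than a space (assuming a POSIX sed):

```shell
# Hypothetical fragment: ",0" and "0/" separated by a tab.
line=$(printf '4,0\t0/1')

# [ ] matches only a literal space, so the tab is left alone.
space_only=$(printf '%s\n' "$line" | sed 's/\(,[0-9]\)[ ]\([0-9][/]\)/\1,\2/g')

# [[:space:]] matches any whitespace character, including the tab.
any_space=$(printf '%s\n' "$line" | sed 's/\(,[0-9]\)[[:space:]]\([0-9][/]\)/\1,\2/g')

printf '%s\n%s\n' "$space_only" "$any_space"
```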
Let me know if this fixes the issue.
I'm assuming that the xx:xx:xx or xx:xx:xx:xx can have any number of parts, since some have 3, and some have 4.
This is quite difficult to do reliably with sed, as it does not support lookarounds, which seem like they might be needed for this example.
You can try something like:
perl -pe 's/(?<=\d) (?=\d+(:\d+){2,})/,/g' input.txt
If you've got your heart set on sed, you can try this, but it may miss some cases:
sed -r 's/(:[0-9]+) ([0-9]+:)/\1,\2/g' input.txt
Could you please try the following? It also prints the lines that do NOT match the regex. The regex in match() could probably be made a bit shorter with an interval such as [0-9]+\.{4}, but I only had an old awk to test on, so I couldn't try that.
awk '
BEGIN{
OFS=","
}
match($0,/GT:GL:GOF:GQ:NR:NV [0-9]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+/){
value=substr($0,RSTART!=1?1:RSTART,RSTART+RLENGTH-1)
value1=substr($0,RSTART+RLENGTH+1)
gsub(/[[:space:]]+/,",",value1)
print value,value1
next
}
1
' Input_file
You may also achieve your desired result without regex, using awk:
awk '{printf "%s", $1FS$2FS$3FS$4FS$5","$6","$7; for (i=8;i<=NF;i++) printf "%s", FS$i; print ""}' input.txt
Basically, it outputs from field 1 to 5 with the default field separator ("space"), then from field 5 to 7 with the comma separator, then from field 8 onwards with default separator again.
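A quick way to see which separators change is to run the same awk program on a short, made-up line. The nine single-letter fields below are purely hypothetical stand-ins for the real columns:

```shell
# Hypothetical nine-field line; only the separators after fields 5 and 6 become commas.
result=$(printf '%s\n' 'a b c d e f g h i' |
  awk '{printf "%s", $1FS$2FS$3FS$4FS$5","$6","$7; for (i=8;i<=NF;i++) printf "%s", FS$i; print ""}')
printf '%s\n' "$result"
```

Note that the field positions are hard-coded, so this only fits input where the data to be joined always sits in fields 5 to 7.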
perl myscript.pl '0/1:-1,-1,-1:146:28:14,14:4,0 0/1:-1,-1,-1:134:6:2,2:1,0'
myscript.pl:
#!/usr/local/ActivePerl-5.20/bin/env perl
my $input = $ARGV[0];
$input =~ s/ /\,/g;
print $input, "\n";
__DATA__
output
0/1:-1,-1,-1:146:28:14,14:4,0,0/1:-1,-1,-1:134:6:2,2:1,0
This will replace all spaces with commas, not just the space in question.

Replace spaces with dashes, but only for text found between quotes in the text TAGS=""

Is it possible to do the following with Notepad++'s FIND/REPLACE function?
I have a text file where I want to replace spaces found in between the quotes of the text TAGS="*" with dashes.
Example:
TAGS="tag1,tag2,tag 3,tag4,tag 5"
should become:
TAGS="tag1,tag2,tag-3,tag4,tag-5"
So far I can find the text I want using:
FIND WHAT: TAGS="*"
But how do I have it replace spaces with dashes?
--------------------- UPDATE -----------------
My question before used tag1,tag2, but the actual data in the file does not have numbers, only words.
These following are three actual lines from the file. I need to find spaces between the quotes of TAGS="*" and replace only those spaces with dashes:
<DT>Kundalini Yoga - Pranayama - Breathing Techniques
<DT>40 Ways The World Makes Awesome Hot Dogs | Food Republic
<DT>Fix Windows boot, Fix your Boot sequence with BcdEdit, BootSect, BCDboot, WINRE,...
In the lines above, there are 3 instances of TAGS="*" which I've extracted here to make them easy to see:
TAGS="kundalini,yoga,fire breath,breathing,breath of fire"
TAGS="recipe,cooking,hot dog"
TAGS="windows stuff,bcdboot,bcdsect,repair,boot"
which, after the FIND/REPLACE, should look like:
TAGS="kundalini,yoga,fire-breath,breathing,breath-of-fire"
TAGS="recipe,cooking,hot-dog"
TAGS="windows-stuff,bcdboot,bcdsect,repair,boot"
Use the following regex:
Find what: (?:\G(?!^)|\bTAGS=")[^\s"]*\K\s+
Replace with: -
Details:
(?:\G(?!^)|\bTAGS=") - matches either the end of the previous successful match (\G(?!^)) or the literal text TAGS=" starting at a word boundary (\bTAGS=")
[^\s"]* - 0+ chars other than whitespace and "
\K - match reset operator discarding the text matched so far
\s+ - 1+ whitespaces
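Outside Notepad++, \G and \K are not available in sed, but a similar result can be approximated with a different technique: a substitution that replaces one space inside the quotes per pass and loops until none is left. This is only a sketch, assuming that the tag list contains no embedded double quotes and that sed supports -E; the text around TAGS="..." in the sample line is made up:

```shell
# Hypothetical line: only the spaces inside TAGS="..." should change.
line='some title TAGS="kundalini,yoga,fire breath,breathing,breath of fire" trailing text'

# Each pass replaces the last remaining space inside the quotes with a dash;
# "t again" jumps back to the label as long as a substitution was made.
result=$(printf '%s\n' "$line" |
  sed -E -e ':again' \
         -e 's/(TAGS="[^"]*) ([^"]*")/\1-\2/' \
         -e 't again')

printf '%s\n' "$result"
```

Unlike the Notepad++ pattern, this turns each single space into one dash and does not collapse runs of whitespace.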
Use the following find/replace pattern in regex mode, and do a replace all to cover the entire document (or selection which you want). Note that I make no effort to check for TAGS="...", under the assumption that you don't have strings of the form tag123 or tag 123 anywhere else in your document.
Find:
tag\s+(\d*)
Replace:
tag-$1
Input:
tag1,tag2,tag 3,tag4,tag 5
Output:
tag1,tag2,tag-3,tag4,tag-5

Using the star sign in grep

I am trying to search for the substring "abc" in a specific file in linux/bash
So I do:
grep '*abc*' myFile
It returns nothing.
But if I do:
grep 'abc' myFile
It returns matches correctly.
Now, this is not a problem for me. But what if I want to grep for a more complex string, say
*abc * def *
How would I accomplish it using grep?
The asterisk is just a repetition operator, but you need to tell it what you repeat. /*abc*/ matches a string containing ab and zero or more c's (because the second * is on the c; the first is meaningless because there's nothing for it to repeat). If you want to match anything, you need to say .* -- the dot means any character (within certain guidelines). If you want to just match abc, you could just say grep 'abc' myFile. For your more complex match, you need to use .* -- grep 'abc.*def' myFile will match a string that contains abc followed by def with something optionally in between.
Update based on a comment:
* in a regular expression is not exactly the same as * in the console. In the console, * is part of a glob construct, and just acts as a wildcard (for instance ls *.log will list all files that end in .log). However, in regular expressions, * is a modifier, meaning that it only applies to the character or group preceding it. If you want * in regular expressions to act as a wildcard, you need to use .* as previously mentioned -- the dot is a wildcard character, and the star, when modifying the dot, means find zero or more dots; i.e. find zero or more of any character.
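The difference can be seen side by side in a short script. This is a minimal sketch, assuming a POSIX shell and a grep that follows the POSIX rule that a leading * in a basic regular expression is a literal asterisk; the sample string is made up:

```shell
line='xxabcyy'

# Glob (shell pattern): * on its own is the wildcard.
case $line in
  *abc*) glob=match ;;
  *)     glob=nomatch ;;
esac

# Regex: the leading * is a literal asterisk, which the line does not contain.
if printf '%s\n' "$line" | grep -q '*abc*'; then re_star=match; else re_star=nomatch; fi

# Regex: .* is the "anything" wildcard.
if printf '%s\n' "$line" | grep -q '.*abc.*'; then re_dot=match; else re_dot=nomatch; fi

printf '%s %s %s\n' "$glob" "$re_star" "$re_dot"
```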
The dot character means match any character, so .* means zero or more occurrences of any character. You probably mean to use .* rather than just *.
Use grep -P - which enables support for Perl style regular expressions.
grep -P "abc.*def" myfile
The "star sign" is only meaningful if there is something in front of it. If there isn't, the tool may treat it as an error or, as grep does in basic regular expressions, as a literal asterisk. For example:
'*xyz' has nothing for the star to repeat
'a*xyz' means zero or more occurrences of 'a' followed by xyz
This worked for me:
grep ".*${expr}" - with double-quotes, preceded by the dot.
Where ${expr} is whatever string you need in the end of the line.
So in your case:
grep ".*abc.*" myFile
Standard unix grep.
The expression you tried, like those that work on the shell command line in Linux for instance, is called a "glob". Glob expressions are not full regular expressions, which is what grep uses to specify strings to look for. Here is (old, small) post about the differences. The glob expressions (as in "ls *") are interpreted by the shell itself.
It's possible to translate from globs to REs, but you typically need to do so in your head.
If you're not using regular expressions, your grep variant of choice should be fgrep (grep -F), which searches for fixed strings. Note that it treats * as a literal character, not as a wildcard.
Try grep -E for extended regular expression support
Also take a look at:
The grep man page
'*' works as a modifier for the previous item. So 'abc*def' searches for 'ab' followed by 0 or more 'c's followed by 'def'.
What you probably want is 'abc.*def' which searches for 'abc' followed by any number of characters, followed by 'def'.
This may be the answer you're looking for:
grep abc MyFile | grep def
Only thing is... it will output lines where "def" is before OR after "abc"
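The ordering caveat is easy to check by counting matches. A minimal sketch with two made-up lines, assuming a POSIX grep:

```shell
lines='abc then def
def then abc'

# Two chained greps: the order of abc and def does not matter, so both lines pass.
either_order=$(printf '%s\n' "$lines" | grep 'abc' | grep -c 'def')

# One regex with .* between the parts: abc has to come first.
abc_first=$(printf '%s\n' "$lines" | grep -c 'abc.*def')

printf '%s %s\n' "$either_order" "$abc_first"
```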
$ cat a.txt
123abcd456def798
123456def789
Abc456def798
123aaABc456DEF
* matches the preceding character zero or more times.
$ grep -i "abc*def" a.txt
$
It would match, for instance "abdef" or "abcdef" or "abcccccccccdef". But none of these are in the file, so no match.
. means "match any character" Together with *, .* means match any character any number of times.
$ grep -i "abc.*def" a.txt
123abcd456def798
Abc456def798
123aaABc456DEF
So we get matches.
There are a lot of online references about regular expressions, which is what is being used here.
To summarize the other answers, here are some examples that show how regex and glob patterns work.
There are three files
echo 'abc' > file1
echo '*abc' > file2
echo '*abcc' > file3
Now I run the same commands on these 3 files; let's see what happens.
(1)
grep '*abc*' file1
As you said, this one returns nothing. * wants to repeat whatever is in front of it. For the first *, there is nothing in front of it to repeat, so grep treats it as a literal * character. The string in the file is abc, which contains no *, so nothing is found. The second *, after c, means that c is repeated 0 or more times.
(2)
grep '*abc*' file2
This one returns *abc: there is a * at the front, so it matches the pattern *abc*.
(3)
grep '*abc*' file3
This one returns *abcc: there is a * at the front and 2 c's at the end, so it matches the pattern *abc*.
(4)
grep '.*abc.*' file1
This one returns abc because .* indicates 0 or more repetitions of any character.

matching text in quotes (newbie)

I'm getting totally lost in shell programming, mainly because every site I use offers different tool to do pattern matching. So my question is what tool to use to do simple pattern matching in piped stream.
context: I have named.conf file, and i need all zones names in a simple file for further processing. So I do ~$ cat named.local | grep zone and get totally lost here. My output is ~hundred or so newlines in form 'zone "domain.tld" {' and I need text in double quotes.
Thanks for showing a way to do this.
J
I think what you're looking for is sed... it's a stream editor which will let you do replacements on a line-by-line basis.
As you're explaining it, the command `cat named.local | grep zone' gives you an output a little like this:
zone "domain1.tld" {
zone "domain2.tld" {
zone "domain3.tld" {
zone "domain4.tld" {
I'm guessing you want the output to be something like this, since you said you need the text in double quotes:
"domain1.tld"
"domain2.tld"
"domain3.tld"
"domain4.tld"
So, in reality, from each line we just want the text between the double-quotes (including the double-quotes themselves.)
I'm not sure you're familiar with Regular Expressions, but they are an invaluable tool for any person writing shell scripts. For example, the regular expression /.o.e/ would match any line containing a four-character sequence whose 2nd character is a lower-case o and whose 4th is e. This would match strings containing words like "zone", "tone", or even "I am tone-deaf."
The trick there was to use the . (dot) character to mean "any character". There are a couple of other special characters, such as * which means "repeat the previous character 0 or more times". Thus a regular expression like a* would match "a", "aaaaaaa", or an empty string: ""
So you can match the string inside the quotes using: /".*"/
There's another thing you should know about sed (and by the comments, you already do!) - it supports backreferences. Once you've told it how to recognize a word, you can have it use that word as part of the replacement. For example, let's say that you wanted to turn this list:
Billy "The Kid" Smith
Jimmy "The Fish" Stuart
Chuck "The Man" Norris
Into this list:
The Kid
The Fish
The Man
First, you'd look for the string inside the quotes. We already saw that, it was /".*"/.
Next, we want to use what's inside the quotes. We can group it using parens: /"(.*)"/
If we wanted to replace the text with the quotes with an underscore, we'd do a replace: s/"(.*)"/_/, and that would leave us with:
Billy _ Smith
Jimmy _ Stuart
Chuck _ Norris
But we have backreferences! They let us recall what was inside the parens, using the symbol \1. So if we do now: s/"(.*)"/\1/ we'll get:
Billy The Kid Smith
Jimmy The Fish Stuart
Chuck The Man Norris
Because the quotes weren't in the parens, they weren't part of the contents of \1!
To only leave the stuff inside the double-quotes, we need to match the entire line. To do that we have ^ (which means "beginning of line"), and $ (which means "end of line".)
So now if we use s/^.*"(.*)".*$/\1/, we'll get:
The Kid
The Fish
The Man
Why? Let's read the regular expression s/^.*"(.*)".*$/\1/ from left-to-right:
s/ - Start a substitution regular expression
^ - Look for the beginning of the line. Start from there.
.* - Keep going, reading every character, until...
" - ... until you reach a double-quote.
( - start a group of characters we might want to recall later with a backreference.
.* - Keep going, reading every character, until...
) - (pssst! close the group!)
" - ... until you reach a double-quote.
.* - Keep going, reading every character, until...
$ - The end of the line!
/ - use what's after this to replace what you matched
\1 - paste the contents of the first group (what was in the parens) matched.
/ - end of regular expression
In plain English: "Read the entire line, copying aside the text between the double-quotes. Then replace the entire line with the content between the double quotes."
You can even add double quotes around the replacement text, s/^.*"(.*)".*$/"\1"/, so we'll get:
"The Kid"
"The Fish"
"The Man"
And that can be used by sed to replace the line with the content from within the quotes:
sed -e "s/^.*\"\(.*\)\".*$/\"\1\"/"
(The double quotes are backslash-escaped for the shell, and the parentheses are backslash-escaped because sed's basic regular expressions need \( and \) for groups.)
So the whole command would be something like:
cat named.local | grep zone | sed -e "s/^.*\"\(.*\)\".*$/\"\1\"/"
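The walk-through above can be tried directly on the sample list. A minimal sketch, assuming a POSIX sed; wrapping the sed script in single quotes means the double quotes no longer need escaping for the shell:

```shell
names='Billy "The Kid" Smith
Jimmy "The Fish" Stuart
Chuck "The Man" Norris'

# \( \) is the group syntax of sed's basic regular expressions; \1 recalls the group.
nicknames=$(printf '%s\n' "$names" | sed 's/^.*"\(.*\)".*$/\1/')

printf '%s\n' "$nicknames"
```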
Well, nobody mentioned cut yet, so, to prove that there are many ways to do something with the shell:
% grep '^zone' /etc/bind/named.conf | cut -d' ' -f2
"gennic.net"
"generic-nic.net"
"dyn.generic-nic.net"
"langtag.net"
1.
zoul@naima:etc$ cat named.conf | grep zone
zone "." IN {
zone "localhost" IN {
file "localhost.zone";
zone "0.0.127.in-addr.arpa" IN {
2.
zoul@naima:etc$ cat named.conf | grep ^zone
zone "." IN {
zone "localhost" IN {
zone "0.0.127.in-addr.arpa" IN {
3.
zoul@naima:etc$ cat named.conf | grep ^zone | sed 's/.*"\([^"]*\)".*/\1/'
.
localhost
0.0.127.in-addr.arpa
The regexp is .*"\([^"]*\)".*, which matches:
any number of any characters: .*
a quote: "
starts to remember for later: \(
any characters except quote: [^"]*
ends group to remember: \)
closing quote: "
and any number of characters: .*
When calling sed, the syntax is 's/what_to_match/what_to_replace_it_with/'. The single quotes are there to keep your regexp from being expanded by bash. When you “remember” something in the regexp using parens, you can recall it as \1, \2 etc. Fiddle with it for a while.
You should have a look at awk.
As long as someone is pointing out sed/awk, I'm going to point out that grep is redundant.
sed -ne '/^zone/{s/.*"\([^"]*\)".*/\1/;p}' /etc/bind/named.conf
This gives you what you're looking for without the quotes (move the quotes inside the parenthesis to keep them). In awk, it's even simpler with the quotes:
awk '/^zone/{print $2}' /etc/bind/named.conf
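Both one-liners can be tried without touching a real configuration file. The fragment below is a hypothetical named.conf excerpt fed through standard input, and the sed command is a slightly more portable variant of the one above (the p flag is attached to the substitution instead of using a { ... } block):

```shell
# Hypothetical named.conf fragment; the indented "file" line must not be picked up.
conf='zone "." IN {
zone "localhost" IN {
    file "localhost.zone";
zone "0.0.127.in-addr.arpa" IN {'

# sed: only lines starting with "zone"; print the text between the quotes.
names=$(printf '%s\n' "$conf" | sed -n '/^zone/s/.*"\([^"]*\)".*/\1/p')

# awk: second whitespace-separated field, quotes included.
quoted=$(printf '%s\n' "$conf" | awk '/^zone/{print $2}')

printf '%s\n' "$names" "$quoted"
```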
I try to avoid pipelines as much as possible (but not more). Remember: don't pipe cat. It's not needed. And, inasmuch as awk and sed duplicate grep's work, don't pipe grep, either. At least, not into sed or awk.
Personally, I'd probably have used perl. But that's because I probably would have done the rest of whatever you're doing in perl, making it a minor detail (and being able to slurp the whole file in and regex against everything simultaneously, ignoring \n's would be a bonus for cases where I don't control /etc/bind, such as on a shared webhost). But, if I were to do it in shell, one of the above two would be the way I'd approach it.