Vim complex regex - regex

I have these strings in a file:
a b
a-b
a / b / c
I want to replace these with:
"a b" => a_b
"a-b" => a_b
"a / b / c" => a_b_c
How do I write the regex ? Please also explain the regex and name the concepts involved.

Yet another way:
:g/^/co.|-s/.*/"&" =>/|+s/\W\+/_/g|-j
Overview:
For every line (:g/^/), copy the line (:copy), then substitute to add the "..." => on the first copy and replace each run of non-word characters on the second copy with _. Finally, join the two lines with -j.
Glory of Details:
:g/{pat}/{cmd} - run {cmd} on each line matching {pat}. Use ^ to match every line
copy . - copy the current line below the current line (.). Short: co.
-1s/.*/.../ - :s the line above (-1). Replace entire line, .*
"&" => - & is the entire match (or \0 in PCRE)
+s/\W\+/_/g - do a global :s on the next line (+1), replacing each run of non-word characters with _
-j - do a :join starting from the line above with the next line
For more help:
:h :g
:h :copy
:h :s
:h :j
:h :range

This is beyond simple capturing and reordering in the replacement. The modification of the non-alphabetic characters to _ requires a contained substitution of the match. This can be done via :help sub-replace-expr:
:%substitute/.*/\='"' . submatch(0) . '" => ' . substitute(submatch(0), '\A\+', '_', 'g')/
Basically, this matches entire lines, then replaces with the match in double quotes, followed by =>, followed by the match with non-alphabetic character sequences (\A\+) replaced with a single _.
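For comparison outside Vim, the same nested-substitution idea can be sketched in Python. This is only an illustration: the helper name `quote_and_slug` is made up, and `[^A-Za-z]` is an ASCII approximation of Vim's `\A`.

```python
import re


def quote_and_slug(line: str) -> str:
    """Return '"<line>" => <line with non-letter runs collapsed to _>'."""
    # Inner substitution: each run of non-alphabetic characters becomes one "_".
    slug = re.sub(r"[^A-Za-z]+", "_", line)
    # Outer step: quote the untouched line and append the converted copy.
    return f'"{line}" => {slug}'


lines = ["a b", "a-b", "a / b / c"]
converted = [quote_and_slug(line) for line in lines]
print("\n".join(converted))
```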
Alternative
You can also do this in two separate steps: First duplicating and quoting the line:
:%substitute/.*/"&" => &/
Then, the second copy needs to be modified. To apply the substitution only to matches after the => separator, a positive lookbehind (the match must come after => plus any characters) must be given:
:%substitute/\%(=> .*\)\@<=\A\+/_/g

This achieves what you're asking for, although the question is somewhat ambiguous:
%s/\(\a\)\A\+/\1_/g
%s/[find_pattern]/[replace_pattern]/g does find and replace on every line (%) in the file, and replaces every match on a line (g), as opposed to the default behaviour of replacing just the first one.
\(\a\) captures a group (the parentheses have to be escaped) containing one alphabetic character.
\A\+ means one or more non-alphabetic characters.
\1 is a backreference to the first captured group in the pattern, in this case the alphabetic character in the parentheses.
_ is just the literal.
So together it replaces every letter followed by 1 or more non-letters with that letter followed by _. So this only works when the line ends with the last letter.
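To see the trailing-character caveat in action, here is a rough Python equivalent. It is a sketch only: `letters_then_underscore` is a made-up name and `[A-Za-z]` stands in for Vim's `\a`.

```python
import re


def letters_then_underscore(text: str) -> str:
    # ([A-Za-z]) captures one letter; [^A-Za-z]+ is the run of non-letters after it.
    # The replacement keeps the letter (\1) and emits a single underscore.
    return re.sub(r"([A-Za-z])[^A-Za-z]+", r"\1_", text)


clean = letters_then_underscore("a / b / c")
trailing = letters_then_underscore("a b ")  # note the trailing space
print(clean, trailing)
```

When the line ends in a non-letter, the last letter also gets an underscore appended, which is why the command only gives the wanted result for lines ending in a letter.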

One way of doing this:
:%s/[\ -]\/*\ */_/g
[\ -] looks for either a space \ (note the space between \ and -) or a dash -.
The asterisk * means 0 or N occurrences. So \/* 0 or N occurrences of slash /; \ * 0 or N occurrences of space. Finally g replace all occurrences in the line.
[Edit]
I had misunderstood the question. Your problem can be solved using multiple sub-expressions in 2 steps.
step 1) Put an underscore before the c
:%s/c/_c/g
step 2) find and replace
:%s/a\([\ -]\/*\ *\)b\(\1\)*\(_\)*\(c\)*/"a\1b\2\4" => a_b\3\4/g
This will give you
"a b" => a_b
"a-b" => a_b
"a / b / c" => a_b_c
Explanation:
\(\) denotes a sub-expression, order of appearance matters so \1 matches to sub-expression one and so forth.
The trick is to add a _ somewhere so we can use it and at the same time get information about the length. Because it only appears before c, the subexpression \3 will only match _ for that line.
Now, by replacing with "a\1b\2\4" we skip \3, which avoids adding an underscore inside the quotes.

:%s:[\ /-]\+:_:g
Explanation:
s: : : - Substitute command (with delimiter `:`)
[\ /-] - Match a ` ` (space), `/`, or `-` character
\+ - Match one or more of the previous group consecutively
_ - Replace with one `_` character
g - Replace all matches in line
% - Execute command on every line in file (optional)
I interpreted your question to be very generic. If you need to match more specific patterns, please indicate exactly what needs to be matched.
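If it helps to experiment outside Vim, the character-class approach looks like this in Python (a sketch; the name `join_with_underscores` is invented for the example):

```python
import re

# A space, "/" or "-", repeated one or more times; "-" is last in the class so it is literal.
SEPARATORS = re.compile(r"[ /-]+")


def join_with_underscores(text: str) -> str:
    # Every run of separator characters collapses into a single underscore.
    return SEPARATORS.sub("_", text)


results = [join_with_underscores(s) for s in ("a b", "a-b", "a / b / c")]
print(results)
```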
[Edit]
If you need to match ' / ' exactly, use:
:%s:\ /\ \|[\ -]:_:g
s: : : - Substitute command (with delimiter `:`)
\| - Match left pattern OR right pattern
\ /\ - Match ` / ` exactly
[\ -] - Match a ` ` (space) or `-` character
_ - Replace with one `_` character
g - Replace all matches in line
% - Execute command on every line in file (optional)
[Edit 2]
I misunderstood what you wanted to substitute.
You're making your life very difficult if you're trying to do this with a
single regex. It will get so complicated, at that point you're better off
writing a small function, like some of the other answers. But you should be
able to get away with two substitution commands without it getting too crazy.
One for the first two strings (a b and a-b), and one for the third
(a / b / c).
%s:\v(\a+)[\ -](\a+):"\0"\ =>\ \1_\2
%s:\v(\a+)\s*/\s*(\a+)\s*/\s*(\a+):"\0"\ =>\ \1_\2_\3
Explanation:
%s:\v(\a+)[\ -](\a+):"\0"\ =>\ \1_\2
s: : - Substitute command (with delimiter `:`)
\v - Very Magic mode *
( ) ( ) - Capture contained matches into numbered sub-expressions
\a+ \a+ - Match at least one alphabetic character
[\ -] - Match either ` ` (space) or `-`
" "\ =>\ _ - Literal text
\0 - Replace with entire matched text
\1 \2 - Replace with first and second `()` sub-expression, respectively
% - Execute command on every line in file (optional)
%s:\v(\a+)\s*/\s*(\a+)\s*/\s*(\a+):"\0"\ =>\ \1_\2_\3
s: : - Substitute command (with delimiter `:`)
\v - Very Magic mode *
( ) ( ) ( ) - Capture contained matches into numbered sub-expressions
\a+ \a+ \a+ - Match at least one alphabetic character
\s*/\s* \s*/\s* - Match a `/` and any surrounding spaces
" "\ =>\ _ _ - Literal text
\0 - Replace with entire matched text
\1 \2 \3 - Replace with first, second, and third `()` sub-expression, respectively
% - Execute command on every line in file (optional)
* This eliminates the need for a lot of ugly backslashes.
See `:h /magic` and `:h /\v`
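The two-command approach translates fairly directly to Python, which may be handy for testing the patterns. Treat this as a sketch: `convert` is a made-up helper, `[A-Za-z]` approximates `\a`, and `\g<0>` is Python's spelling of the whole match.

```python
import re

# Two words separated by a single space or dash.
TWO_PARTS = re.compile(r"([A-Za-z]+)[ -]([A-Za-z]+)")
# Three words separated by slashes with optional surrounding whitespace.
THREE_PARTS = re.compile(r"([A-Za-z]+)\s*/\s*([A-Za-z]+)\s*/\s*([A-Za-z]+)")


def convert(line: str) -> str:
    # Apply both substitutions in turn; a line only matches one of them.
    line = TWO_PARTS.sub(r'"\g<0>" => \1_\2', line)
    line = THREE_PARTS.sub(r'"\g<0>" => \1_\2_\3', line)
    return line


converted = [convert(s) for s in ("a b", "a-b", "a / b / c")]
print("\n".join(converted))
```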

Related

How can I get the first and last part of one wordcombination using regex

How can I get only the middle part of a combined name with PCRE regex?
name: 211103_TV_storyname_TYPE
result: storyname
I have used this single line: .(\d)+.(_TV_) to remove the first part: 211103_TV_
Another idea is to use (_TYPE)$, but the problem is that not all variations of the names contain a space that would separate a second word, so I can't anchor the first word with ^ and the second with $.
The _TYPE and _TV_ parts of the combined name are fixed.
The numbers change according to the date, and the storyname is variable.
Any ideas?
Thanks
With your shown samples, please try the following regex; it creates one capturing group that contains the matched value.
.*?_TV_([^_]*)(?=_TYPE)
Or, as a small variation on the above (following the fourth bird's nice suggestion), the same thing without the lazy match .*?:
_TV_([^_]*)(?=_TYPE)
Explanation: Adding detailed explanation for above.
.*?_ ##Using Lazy match to match till 1st occurrence of _ here.
TV_ ##Matching TV_ here.
([^_]*) ##Creating 1st capturing group which has everything before next occurrence of _ here.
(?=_TYPE) ##Making sure previous values are followed by _TYPE here.
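A quick way to try the capture-group version is Python's `re` module, which handles this pattern unchanged. This is a minimal sketch; the second sample name is hypothetical and only shows the limitation of `[^_]*`.

```python
import re

# _TV_, then a captured run of non-underscore characters, which must be followed by _TYPE.
MIDDLE = re.compile(r"_TV_([^_]*)(?=_TYPE)")

match = MIDDLE.search("211103_TV_storyname_TYPE")
middle = match.group(1) if match else None
print(middle)

# Hypothetical name with an underscore inside the story part: [^_]* cannot cross it.
with_underscore = MIDDLE.search("211103_TV_story_name_TYPE")
```

Because the capture group excludes underscores, a story name that itself contains _ does not match with this pattern.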
You could match as few characters as possible after _TV_ until you reach _TYPE
\d_TV_\K.*?(?=_TYPE)
\d_TV_ Match a digit and _TV_
\K Forget what is matched until now
.*? Match as few characters as possible
(?=_TYPE) Assert _TYPE to the right
Another option without a non greedy quantifier, and leaving out the digit at the start:
_TV_\K[^_]*+(?>_(?!TYPE)[^_]*)*(?=_TYPE)
_TV_ Match literally
\K[^_]*+ Forget what is matched until now and optionally match any char except _
(?>_(?!TYPE)[^_]*)* Only allow matching _ when not directly followed by TYPE
(?=_TYPE) Assert _TYPE to the right
Edit
If you want to replace the 2 parts, you can use an alternation and replace with an empty string.
If it should be at the start and the end of the string, you can prepend ^ and append $ to the pattern.
\b\d{6}_TV_|_TYPE\b
\b\d{6}_TV_ A word boundary, match 6 digits and _TV_
| Or
_TYPE\b Match _TYPE followed by a word boundary
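The replace-both-ends variant can be checked the same way. A small sketch in Python, using the sample name from the question:

```python
import re

# Either the six-digit date plus _TV_ at a word boundary, or a trailing _TYPE.
AFFIXES = re.compile(r"\b\d{6}_TV_|_TYPE\b")

# Replacing both alternatives with an empty string leaves the middle part.
stripped = AFFIXES.sub("", "211103_TV_storyname_TYPE")
print(stripped)
```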
Here are some additional screenshots, together with the documentation that appears behind the help button, so you can see the forms and what I see.
Documentation
The regular expressions we use are based on PCRE - Perl Compatible Regular Expressions. Full specification can be found here: http://www.pcere.org and http://perldoc.perl.org/perlre.html
Summary of some useful terms:
Metacharacters
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
Quantifiers
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Character Classes
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
Capture buffers
The bracketing construct (...) creates capture buffers. To refer to the current contents of a buffer later on, within the same pattern, use \1 for the first, \2 for the second, and so on. Outside the match use "$" instead of "\". The \<digit> notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.
Referring back to another part of the match is called a backreference.
Examples
Replace story with certain prefix letters M N or E to have the prefix "AA":
`srcPattern "(M|N|E ) ([A-Za-z0-9\s]*)"`
`trgPattern "AA$2" `
`"N StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"E StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"M StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
"NoMatchWord StoryWord1 StoryWord2" -> "NoMatchWord StoryWord1 StoryWord2" (no match found, name remains the same)

Regex POSIX - How can i find if the start of a line contains a word from a word that appears later in line

I have a UNIX passwd file and I need to find, using egrep, whether the first 7 characters of the GECOS field appear inside the username. I want to check if the username (jkennedy) contains the word "kennedy" from the GECOS.
I was planning to use back-references, but the username comes before the GECOS, so I don't know how to implement it.
For example the passwd file contains this line:
jkennedy:x:2473:1067:kennedy john:/root:/bin/bash
As per my original comment, the regex below works for me.
See it in use here - note this regex differs slightly as it's more used for display purposes. The regex below is the POSIX version of this and removes non-capture groups and the unneeded capture group around the backreference.
^[^:]*([^:]{7})([^:]*:){4}\1.*$
^ assert position at the start of the line
[^:]* match any character except : any number of times
([^:]{7}) capture exactly seven of any character except :
([^:]*:){4} match the following exactly four times
[^:]*: match any character except : any number of times, followed by : literally
\1 match the backreference; matches what was previously matched by the first capture group
.* match any character (except newline characters) any number of times
$ assert position at the end of the line
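To convince yourself how the backreference lines up with the fields, the same pattern can be run through Python's `re` (a sketch only; the `asmith` line is a made-up counter-example, not from a real passwd file):

```python
import re

# Capture 7 characters of the username, skip four ":"-terminated chunks,
# then require the GECOS field to start with the captured text.
GECOS_IN_USER = re.compile(r"^[^:]*([^:]{7})([^:]*:){4}\1.*$")

line = "jkennedy:x:2473:1067:kennedy john:/root:/bin/bash"
hit = GECOS_IN_USER.match(line)
shared = hit.group(1) if hit else None
print(shared)

# Hypothetical entry whose username is shorter than 7 characters: cannot match.
miss = GECOS_IN_USER.match("asmith:x:1001:1001:john smith:/home/asmith:/bin/sh")
```

The first `[^:]*` backtracks until the 7 captured characters are ones that also start the fifth field, so the capture ends up being the shared text.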
Assuming you do NOT want case sensitivity to foul your matching -
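declare -l tmpUsr tmpName   # -l lowercases values on assignment
while IFS=: read -r usr x x x name x
do tmpUsr="$usr"; tmpName="$name"
# skip empty GECOS fields, then test the first 7 characters against the username
(( ${#name} )) && [[ "$tmpUsr" =~ ${tmpName:0:7} ]] &&
printf '%s (%s<%s>)\n' "$usr" "$name" "${tmpName:0:7}"
done </etc/passwd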

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way i can split the name and the version using regex. I want to use the results to make two columns 'Name' and 'Version' in excel.
So i want the results from regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when i do it for a large text file, the results from the two regex are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
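If you would rather script this than use the editor dialog, the same find/replace can be sketched in Python. The pattern is unchanged; only `\1`/`\2` replace `$1`/`$2`, and `re.MULTILINE` makes `^` and `$` work per line.

```python
import re

# Lazy name, then a dash or underscore, then a version starting with a digit.
NAME_VERSION = re.compile(r"^(.+?)[-_](\d.*)$", re.MULTILINE)

text = "\n".join([
    "andal-4.1.0.jar",
    "besc_2.1.0-beta",
    "prov-3.0.jar",
    "add4lib-1.0.jar",
    "com_lab_2.0.jar",
    "astrix",
    "lis-2_0_1.jar",
])

# Name and version end up tab-separated, ready to paste into a spreadsheet.
tabbed = NAME_VERSION.sub(r"\1\t\2", text)
print(tabbed)
```

Lines without a version, such as astrix, are left untouched, so the row order is preserved.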
I'm not quite sure what you meant by the problem with large files, and I believe the two regexes you showed do the opposite of what you said: the first one should get you the name and the second one the version.
Anyway, here are the assumptions I had to make about what may make sense to you:
"Name" may be followed by - or _, followed by the version string.
The "Version" string is something preceded by - or _, made of some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the Version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^ start of line
(.+?) "Name" part, using reluctant match of
at least 1 character
(?: )? Optional group of "Version String", which
consists of:
[-_] - or _
( ) Followed by the "Version" , which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line
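Here is how the two groups could be pulled out in Python, as a rough sketch (the helper name `split_entry` is made up; a missing version comes back as `None`):

```python
import re
from typing import Optional, Tuple

# Name, optionally followed by - or _ and a version of the form digits, . or _, digits, rest.
ENTRY = re.compile(r"^(.+?)(?:[-_](\d+[._]\d+.*))?$")


def split_entry(entry: str) -> Tuple[str, Optional[str]]:
    match = ENTRY.match(entry)
    if match is None:  # e.g. an empty line
        return entry, None
    return match.group(1), match.group(2)


rows = [split_entry(e) for e in ("andal-4.1.0.jar", "com_lab_2.0.jar", "astrix", "lis-2_0_1.jar")]
for name, version in rows:
    print(name, version)
```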

Regular Expressions - Greedy but stop before a string match

I have some data and I'd like to convert it into a table format.
Here's the input data
1- This is the 1st line with a
newline character
2- This is the 2nd line
Each line may contain multiple newline characters.
Output
<td>1- This the 1st line with
a new line character</td>
<td>2- This is the 2nd line</td>
I've tried the following
^(\d{1,3}-)[^\d]*
but it seems to match only up to the digit 1 in 1st.
I'd like to be able to stop matching when I find another \d{1,3}\- in my string.
Any suggestions?
EDIT:
I'm using EditPad Lite.
This is for vim, and uses a zero-width positive lookahead:
/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\@=
Steps:
/^\d\{1,3\}- 1 to 3 digits followed by -
\_.* any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\@= followed by a newline/linefeed ONLY if it is followed
by 1 to 3 digits followed by - (the first condition)
EDIT: This is how it would be in pcre/ruby:
/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m
Note you need a string ending with a newline to match the last entry.
SEARCH: ^\d+-.*(?:[\r\n]++(?!\d+-).*)*
REPLACE: <td>$0</td>
[\r\n]++ matches one or more carriage-returns or linefeeds, so you don't have to worry about whether the file uses Unix (\n), DOS (\r\n), or older Mac (\r) line separators.
(?!\d+-) asserts that the first thing after the line separator is not another line number.
I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could match the \r and (?!\d+-) could match the \n.
Tested in EditPad Pro, but it should work in Lite as well.
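For anyone wanting to try the negative-lookahead idea outside EditPad, here is a simplified Python sketch. It assumes plain \n line endings, so the possessive `[\r\n]++` is reduced to a single `\n`; the sample data is the text from the question.

```python
import re

# A numbered line, then any following lines that do NOT start with "<digits>-".
RECORD = re.compile(r"^\d+-.*(?:\n(?!\d+-).*)*", re.MULTILINE)

data = "1- This is the 1st line with a\nnewline character\n2- This is the 2nd line"

# \g<0> is the whole match, like $0 in the search/replace above.
table = RECORD.sub(r"<td>\g<0></td>", data)
print(table)
```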
You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.
Search for lookahead in the documentation of whatever language you are using.
Edit: the following sample seems to work in vim.
:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>
Annotation below:
% - for all lines
s# - substitute the following (you can use any delimiter, and slash is most
common, but as that will require that we escape slashes in the command
I chose to use the number sign)
\v - very magic mode, lets us use fewer backslashes
( - start group for back referencing
^ - start of line
\d+ - one or more digits (as many as possible)
- - a literal dash!
\_. - any character, including a newline
{-} - zero or more of these (as few as possible)
) - end group
\ze - end match (anything beyond this point will not be included in the match)
( - start a new group
\n - newline (in Vim patterns \n matches the end of line in any file format - thanks Alan)
\d+ - one or more digits
- - a dash
| - or
%$ - end of file
) - end group
# - start substitute string
<td>\1</td> - a TD tag around the first matched group
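The lazy-match-plus-lookahead idea carries over to other engines too. A rough Python sketch, assuming \n line endings and no trailing newline (Python has no `\ze`, so the lookahead `(?=...)` plays that role and `\Z` stands in for Vim's `%$`):

```python
import re

# Numbered line start, then as little as possible (newlines included, via DOTALL)
# up to the next "<newline><digits>-" or the end of the text.
RECORD = re.compile(r"^\d+-.*?(?=\n\d+-|\Z)", re.MULTILINE | re.DOTALL)

data = "1- This is the 1st line with a\nnewline character\n2- This is the 2nd line"

cells = RECORD.findall(data)
table = "\n".join(f"<td>{cell}</td>" for cell in cells)
print(table)
```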
(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)
You can match only the separators and split on them. In C#, for example, it could be done like this:
string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
Would it be good for you to do it in 3 steps?
(these are perl regex):
Replace the first:
$input =~ s/^(\d{1,3})/<td>$1/;
Replace the rest:
$input =~ s/\n(\d{1,3})/<\/td>\n<td>$1/gm;
Add the last:
$input .= '</td>';

Trying to understand this perl regex bracketed character class?

Below is a script that I was playing with. With the script below it will print a
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.])/ ) {
print $1."\n";
}
BUT if I change it to:
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.]+)/ ) {
print $1."\n";
}
then it prints: cd abc/test/.
From my understanding the + matches one or more of the matching sequence; correct me if I am wrong, please. But why does it match only a in the first case? I thought it should match nothing!
Thank you.
You are correct. In the first case you match a single character from that character class, while in the second you match at least one, with as many as possible after the first one.
First one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
)
"
Second one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
In regexes, characters in brackets only count for a match of one character within the given bracket. In other words, [\w\/\.] matches exactly one of the following characters:
An alphanumeric character or "_" (the \w).
A forward slash (the \/--notice that the forward slash needs to be escaped, since it is used as the default marker for the beginning and end of a regex)
A period (the \.--again, escaped since . denotes any character except the newline character).
Because /cd ([\w\/\.])/ only captures one character into $1, it grabs the first character, which in this case is "a".
You are correct in that the + allows for a match of one or more such characters. Since regexes match greedily by default, you should get all of "abc/test/." for $1 in the second match.
If you haven't already done so, you might want to peruse perldoc perlretut.
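The same behaviour is easy to reproduce in other regex flavours. For instance, a small Python sketch of both patterns (Python needs no escaping for / inside the class):

```python
import re

tmp = "cd abc/test/."

# Without a quantifier the class matches exactly one character.
single = re.search(r"cd ([\w/.])", tmp).group(1)

# With + it greedily matches as many class characters as it can.
greedy = re.search(r"cd ([\w/.]+)", tmp).group(1)

print(single, greedy)
```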