Vim regex match a space, a number and everything else until ; - regex

I'm trying to make this regex but it's driving me insane. I've strings like this:
foobar 34;lorem ipsum;
foo 34/ABC;dolerm sit;
bar 3445b;amet;
I need to transform them like this:
foobar;34;lorem ipsum;
foo;34/ABC;dolerm sit;
bar;3445b;amet;
The regex I came up with is this one, but it matches only numbers: \s\d*; and this one matches the whole line: \s\d*\p*;
I need something that matches only a white space, a number and then everything until the first ";".

Does this work for you?
%s/ \ze\d/;/g
If you want to change
foo bar 3 r e p l a c e;bar;
to
foo bar;3;r;e;p;l;a;c;e;bar;
%s/ \d[^;]*/\=substitute(submatch(0)," ",";","g")/
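For readers without Vim at hand, the two substitutions above can be sketched with Python's re module. This is only an illustration, not the answerer's code: Python has no \ze, so a lookahead (?=\d) is assumed here as the nearest equivalent, and both function names are made up for this example.

```python
import re

def space_before_digit_to_semicolon(line):
    # Rough counterpart of :s/ \ze\d/;/g
    # Replace a space only when a digit follows it (the digit is not consumed).
    return re.sub(r" (?=\d)", ";", line)

def spaces_in_field_to_semicolons(line):
    # Rough counterpart of :s/ \d[^;]*/\=substitute(submatch(0)," ",";","g")/
    # Take the first run " <digit>... up to the next ;" and turn its spaces into ";".
    return re.sub(r" \d[^;]*",
                  lambda m: m.group(0).replace(" ", ";"),
                  line,
                  count=1)

for line in ["foobar 34;lorem ipsum;", "foo 34/ABC;dolerm sit;", "bar 3445b;amet;"]:
    print(space_before_digit_to_semicolon(line))

print(spaces_in_field_to_semicolons("foo bar 3 r e p l a c e;bar;"))
```

The space inside "lorem ipsum" is left alone because no digit follows it.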

You probably could get your original patterns working, if you used "non-greedy" matches, for example \p\{-} for "any number of printable characters, but as few as possible", or by explicitly excluding the ';' character with [^;]* (any number of any character that is not a ';').
:help non-greedy
:help /[ (then scroll down below the E769 topic)
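The difference between the greedy, non-greedy and negated-class variants can be seen in a small sketch. It uses Python's re syntax (.*? instead of Vim's \{-}) purely for illustration, and the variable names are assumptions of this example.

```python
import re

line = "foobar 34;lorem ipsum;"

# Greedy: .* runs to the end of the line and backtracks to the LAST ";".
greedy = re.search(r"\s\d.*;", line).group(0)

# Non-greedy: .*? grows only until the FIRST ";" is reached.
lazy = re.search(r"\s\d.*?;", line).group(0)

# Negated class: [^;]* cannot cross a ";" at all.
negated = re.search(r"\s\d[^;]*;", line).group(0)

print(repr(greedy))   # ' 34;lorem ipsum;'
print(repr(lazy))     # ' 34;'
print(repr(negated))  # ' 34;'
```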

Related

How to replace the last two words with a regular expression

I have an input which is formatted like this:
1.[variable length whitespace][aaa] [bbb] [ccc] NAME [01] [-] [ADDITIONAL NAMES OF VARIABLE LENGTH] [endword1] [endword2]
where everything in [] is optional, but aaa bbb ccc and endword1 and endword2 are fixed key words. The first number is a counter from 0 to n, and the second number has two digits [0-9][0-9] (if they exist).
I can match everything except the last two words, which sometimes (they are not necessary) end the line, with this:
[0-9]*\.[^\S\r\n]{1,}(\baaa\b)?[^\S\r\n]*(\bbbb\b)?[^\S\r\n]{1,}?(\bccc\b)?[^\S\r\n]{1,}[A-Za-z0-9\s]*(\-)?[^\S\r\n]{1,}[A-Za-z0-9\s]*
So how do I check for my last two end words?
Additionally: I do not know whether the first part, which works, is a good regex or not, so if you think something could be written better or more cleanly, feel free to improve it.
You can use the following:
^\d*\.\h+(\baaa\b)?\h*(\bbbb\b)?\h*(\bccc\b)?\h*[A-Za-z0-9\s]*(\-)?\h+[A-Za-z0-9\s]*?(\bendword1\b)?\h*(\bendword2\b)?$
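[^\S\r\n] is replaced with \h (a horizontal whitespace character, supported in PCRE among other flavors); the quantifier after it handles the variable length.
The last [A-Za-z0-9\s]* is made non-greedy so that the end words, if they exist, are matched by their own groups.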
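Why the non-greedy quantifier matters for the optional end words can be shown with a deliberately simplified pattern. This sketch is in Python, whose re module does not support \h, so [^\S\r\n] is kept; the reduced patterns and the sample lines are assumptions for illustration, not the full regex above.

```python
import re

# Lazy body: gives up text so the optional end-word groups can match.
lazy = re.compile(r"^(.*?)[^\S\r\n]*(\bendword1\b)?[^\S\r\n]*(\bendword2\b)?$")

# Greedy body: swallows the whole line, so the optional groups stay empty.
greedy = re.compile(r"^(.*)[^\S\r\n]*(\bendword1\b)?[^\S\r\n]*(\bendword2\b)?$")

def split_end_words(pattern, line):
    # Returns (body, endword1 or None, endword2 or None).
    return pattern.match(line).groups()

print(split_end_words(lazy, "1. aaa NAME 01 endword1 endword2"))
print(split_end_words(greedy, "1. aaa NAME 01 endword1 endword2"))
print(split_end_words(lazy, "1. NAME endword2"))
print(split_end_words(lazy, "1. NAME 01"))
```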

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
In the script's regexp, if I specify "number n1", it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Right now it always prints the final line (numbers 56,4,5,5:) as output. How can I resolve this issue?
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or whitespace (\s) character should work. (This of course means that there must be a newline or other whitespace character after every data field: if a numbers field is the last character range in the variable, that field can't be located.)
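The greedy versus non-greedy behaviour described above can be reproduced outside Tcl. The sketch below uses Python's re module as a stand-in, with one caveat: in Python each quantifier is greedy or lazy on its own, so both .* are written as .*? explicitly, whereas the answer describes Tcl's engine as not switching back. re.DOTALL is assumed so that . also matches newlines.

```python
import re

test = """
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
"""

# Greedy: the first .* runs to the LAST "numbers", the second to the end of the text.
greedy = re.search(r"number n1.*(numbers .*)", test, re.DOTALL).group(1)

# Lazy: stop at the nearest "numbers"; the trailing \n forces the lazy
# capture to extend to the end of that line instead of matching nothing.
lazy_n1 = re.search(r"number n1.*?(numbers .*?)\n", test, re.DOTALL).group(1)
lazy_n2 = re.search(r"number n2.*?(numbers .*?)\n", test, re.DOTALL).group(1)

print(repr(greedy))
print(repr(lazy_n1))
print(repr(lazy_n2))
```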
Documentation: regular expression syntax, regexp
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
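The point about variable contents being treated as regular expression fragments applies in other languages too. As a sketch in Python rather than Tcl, re.escape plays the role the answer assigns to a regsub clean-up; the data string and the target value below are hypothetical and chosen only to make the difference visible.

```python
import re

# Hypothetical data in the ":"-joined form used above.
data = "number n1x5:numbers 1,2:number n1.5:numbers 3,4:"
target = "n1.5"  # contains ".", which is RE metasyntax

# Unescaped: the "." in target matches ANY character, so "n1x5" matches first.
unescaped = re.search("number " + target + r"[^:]*:(numbers [^:]*)", data).group(1)

# Escaped: the variable is used as a literal string.
escaped = re.search("number " + re.escape(target) + r"[^:]*:(numbers [^:]*)", data).group(1)

print(unescaped)  # numbers 1,2
print(escaped)    # numbers 3,4
```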

split text into words and exclude hyphens

I want to split a text into its individual words using regular expressions. The obvious solution would be to use the regex \\b; unfortunately, this one also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String[] b = s.split("\\b+");
for (int i = 0; i < b.length; i++) {
    System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but it does not fit 100%; it gives me:
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anybody who needs this for another purpose (e.g. inserting something), the following is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
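As a quick check of the [^\w\-]+ suggestion, here is a sketch in Python rather than Java (the character class has the same meaning in both). Unlike the \b approach, punctuation is discarded instead of being returned as separate tokens, and the empty string left after the final "." is filtered out here.

```python
import re

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")

# Split on runs of characters that are neither word characters nor hyphens.
# The trailing "." produces an empty last piece, so empty pieces are dropped.
words = [w for w in re.split(r"[^\w\-]+", s) if w]

for w in words:
    print(w)
```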
You can use this:
(?<!-)\\b(?!-)
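The lookaround version can be tried the same way. This is a sketch in Python, assuming version 3.7 or later (where re.split accepts patterns that match the empty string); the shortened sample sentence is an assumption of this example. The word boundaries on either side of a hyphen are rejected by the lookbehind and lookahead, so the hyphenated word stays in one piece.

```python
import re

s = "odd words like user-generated and need"

# Split at word boundaries that have no hyphen directly before or after them.
pieces = re.split(r"(?<!-)\b(?!-)", s)

# The split also returns the whitespace between words and empty edge pieces;
# keep only pieces with visible content.
tokens = [p for p in pieces if p.strip()]

print(tokens)
```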

Regexp x.get(y) -> x[y]

While porting many lines of code from one language to another, I must replace every array access written as a function call, x.get(y), with the square-bracket notation x[y]. There are a few text editors around that can do regular-expression-based replacement.
What should be typed in the "text to find" field and what should be typed in the "replace with" field in this situation? Both x and y can vary, so the original code can have lines like:
... state.get(1);
... text.get(i);
... result.get(line);
after conversion:
... state[1];
... text[i];
... result[line];
You can search for \.get\((\w+)\) and replace with [$1].
The above pattern assumes only alphanumeric characters between the parentheses, but there are other alternatives:
.* (with ". matches newline" unchecked) is greedy and runs to the end of the line before backtracking, so it captures everything up to the last ) on that line.
[^)]* matches any run of characters that are not ), and it also matches across newlines.
In both cases, you may want to include the ; in your pattern.
Note that this is very fragile either way - you might encounter code like state.get(a.get(3 + sin(6))), and probably get incorrect results.
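Both the basic replacement and the fragility warning can be demonstrated with a short sketch. It uses Python's re.sub, where the replacement is written [\1] rather than [$1]; the two function names are made up for this example.

```python
import re

def get_to_brackets(code):
    # Strict variant: only a single word (letters, digits, underscore) inside get(...).
    return re.sub(r"\.get\((\w+)\)", r"[\1]", code)

def get_to_brackets_loose(code):
    # Loose variant: anything up to the first ")" inside get(...).
    return re.sub(r"\.get\(([^)]*)\)", r"[\1]", code)

for line in ["... state.get(1);", "... text.get(i);", "... result.get(line);"]:
    print(get_to_brackets(line))

nested = "state.get(a.get(3 + sin(6)))"
print(get_to_brackets(nested))        # left unchanged: no simple word argument
print(get_to_brackets_loose(nested))  # mangled: state[a.get(3 + sin(6]))
```

The strict variant skips the nested call entirely, while the loose one rewrites it incorrectly, so nested expressions need manual review either way.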
For Notepad++, I would write in Find what: ([0-9a-zA-Z_-]+)\.get\(([0-9a-zA-Z_-]+)\)
and in Replace with: \1[\2]
Input:
x.get(1);
text.get(i);
result.get(line);
Output:
x[1];
text[i];
result[line];

Regex for comparing Strings with spaces

I'm trying to check whether a string is present in a list of strings using regex.
I tried using the following...
(?!MyDisk1$|MyDisk2$)
But this isn't working for scenarios like
(?!My disk1$|My Disk2$)
Can you suggest a better approach to deal with such situations?
I get the list of strings from an SQL query, so I am not sure where the spaces are. The list of strings varies, e.g. My Disk1, MyDisk2, My_Disk3, ABCD123, XYZ_123, MNP 123, or any other string made of [a-zA-Z0-9_ ].
You can make the spaces optional using a zero-or-one quantifier (?):
(?!My ?disk1$|My ?Disk2$)
This assertion will reject substrings like MyDisk2 or My Disk2. Or to handle potentially many spaces, use a zero-or-more quantifier (*):
(?!My *disk1$|My *Disk2$)
Note that if you're running this in an engine which ignores whitespace in the pattern you may need to use a character class, like this:
(?!My[ ]*disk1$|My[ ]*Disk2$)
Or to handle spaces or underscores:
(?!My[ _]*disk1$|My[ _]*Disk2$)
Unfortunately if the spaces can be anywhere in the string, (but you still care about matching the other letters in order), you'd have to do something like this:
(?! *M *y *d *i *s *k *1$| *M *y *D *i *s *k *2$)
Or to handle spaces or underscores:
(?![ _]*M[ _]*y[ _]*d[ _]*i[ _]*s[ _]*k[ _]*1$|[ _]*M[ _]*y[ _]*D[ _]*i[ _]*s[ _]*k[ _]*2$)
But to be honest, at that point, you may be better off preprocessing your data before you try to use your regex with it.
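One way to read "preprocessing your data" is to normalize every name before comparing, so no regex has to anticipate where the separators fall. The sketch below is in Python with hypothetical helper names; stripping spaces and underscores and folding case are assumptions about what should count as "the same" name, not something the question specifies.

```python
import re

def normalize(name):
    # Assumption: spaces, underscores and letter case are all insignificant.
    return re.sub(r"[ _]+", "", name).lower()

# Names to reject, stored in normalized form.
excluded = {normalize(n) for n in ["MyDisk1", "My Disk2"]}

def is_excluded(name):
    return normalize(name) in excluded

for candidate in ["My Disk1", "My_Disk2", "M y D i s k 2", "My_Disk3", "ABCD123"]:
    print(candidate, is_excluded(candidate))
```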
Use this regex; appending i at the end makes it case-insensitive:
/my\s?disk[12]$/i
This matches the disk1 and disk2 names with or without a single whitespace character between "my" and "disk", in any letter case.
You can do this:
/(?[^\s_-]+(\s|_|-)?[^\s_-]*?$)/i
The ? quantifier means 0 or 1 of the preceding pattern.
/i makes the match case-insensitive. The separator can be a space, an underscore or a dash. I have replaced My and disk with a string of length 1 or more that contains no space, underscore or dash. It will now match "Shikhar Subedi", "dprpradeep" or "MyDisk 54".
The + quantifier means 1 or more, ^ at the start of a character class means negation, and * means 0 or more, so the string after the separator is optional.