Lua: Replace characters in a string - replace

I have strings like
abcdef
abcd|(
abcde|(foo
abcd|)
abcde|)foo
which should be modified to
abcdef
abcd
abcde \foo
abcd
abcde \foo
if there is no | then do nothing
if nothing follows the |( or |) then delete these two characters
if something follows then replace |( or |) with <space>\
I am interested in short pattern expressions, if possible. I can do this by several string.find and string.sub but then I have a lot of if statements.

You may use
function repl(v)
  -- first drop a trailing |( or |), then turn any remaining one into " \"
  local res = string.gsub(v:gsub('|[()]$', ''), '|[()]', ' \\')
  return res
end
See Lua demo online
Details
'|[()]$' matches | and then either ( or ) at the end of the string, and string.gsub replaces these occurrences with an empty string
|[()] then matches | and then either ( or ) anywhere in the string, and string.gsub replaces these occurrences with a space and \.
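For comparison, here is a minimal sketch of the same two-pass idea in Python (chosen only for illustration; the answer itself is Lua). The name repl is taken from the answer. A callable is used as the replacement so the backslash needs no extra escaping in a replacement template.

```python
import re

def repl(s: str) -> str:
    # Pass 1: drop a "|(" or "|)" that ends the string.
    s = re.sub(r'\|[()]$', '', s)
    # Pass 2: turn any remaining "|(" or "|)" into a space plus a backslash.
    return re.sub(r'\|[()]', lambda m: ' \\', s)

for sample in ['abcdef', 'abcd|(', 'abcde|(foo', 'abcd|)', 'abcde|)foo']:
    print(repr(sample), '->', repr(repl(sample)))
```

The order of the two passes matters: if the general replacement ran first, a trailing |( would already have become a space and a backslash.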

Related

Hive regex: Positive lookahead to match '&' or end of string

I would like to match text between two strings, although the last string/character might not always be available.
String1: 'www.mywebsite.com/search/keyword=toys'
String2: 'www.mywebsite.com/search/keyword=toys&lnk=hp1'
Here I want to match the value in keyword= that is 'toys' and I am using
(?<=keyword=)(.*)(?=&|$)
It works for String1, but for String2 the match also includes the '&' and everything after it.
What am I doing wrong?
.* is greedy. It takes everything it can, therefore stops at the end of the string ($) and not at the & character.
Change it to its non-greedy version - .*?
with t as
(
select explode
(
array
(
'www.mywebsite.com/search/keyword=toys'
,'www.mywebsite.com/search/keyword=toys&lnk=hp1'
)
) as (val)
)
select regexp_extract(val,'(?<=keyword=)(.*?)(?=&|$)',0)
from t
;
+------+
| toys |
+------+
| toys |
+------+
You do not need to bother with greediness when you need to match zero or more occurrences of any characters but a specific character (or set of characters). All you need is to get rid of the lookahead and the dot pattern and use [^&]* (or, if the value you expect should not be an empty string, [^&]+):
(?<=keyword=)[^&]+
Code:
select regexp_extract(val,'(?<=keyword=)[^&]+', 0) from t
See the regex demo
Note you do not even need a capturing group since the 0 argument instructs regexp_extract to retrieve the value of the whole match.
Pattern details
(?<=keyword=) - a positive lookbehind that matches a location that is immediately preceded with keyword=
[^&]+ - any 1+ chars other than & (if you use * instead of +, it will match 0 or more occurrences).
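The difference between the three patterns can be checked with a short Python sketch. This assumes Python's re module as a stand-in for Hive's regex engine; the two behave the same for these particular patterns. The helper name extract is hypothetical.

```python
import re

S1 = 'www.mywebsite.com/search/keyword=toys'
S2 = 'www.mywebsite.com/search/keyword=toys&lnk=hp1'

GREEDY = r'(?<=keyword=)(.*)(?=&|$)'    # .* runs to the end of the string
LAZY = r'(?<=keyword=)(.*?)(?=&|$)'     # .*? stops at the first & (or the end)
NEGATED = r'(?<=keyword=)[^&]+'         # no lookahead needed at all

def extract(pattern: str, text: str):
    # Return the whole match, like passing 0 as the last regexp_extract argument.
    m = re.search(pattern, text)
    return m.group(0) if m else None

for name, pattern in [('greedy', GREEDY), ('lazy', LAZY), ('negated', NEGATED)]:
    print(name, [extract(pattern, s) for s in (S1, S2)])
```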

Regex in PHP: take all the words after the first one in string and truncate all of them to the first character

I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? should I use again preg_match and use offset = 1 in the arguments? but that offset is in characters or bytes right?
Anyway, after I have matched the words following the first, if they exist, should I do for each of them something like:
$second_word = substr( $following_word, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could keep the first two words whole when the string contains three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice way around the restriction that a lookbehind cannot be variable-length, which is what excluding the first word would otherwise need in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches first word in a name and skips it (does nothing)
(\w)\w+ matches all other words and replaces it with first letter and a dot.
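Python's built-in re module does not support (*SKIP)(*F), but the same "match it first, then leave it alone" idea can be sketched with an alternation and a replacement callback. This is an emulation under that assumption, not the PCRE technique itself; the function name abbreviate is hypothetical.

```python
import re

def abbreviate(name: str) -> str:
    def shorten(m: re.Match) -> str:
        # Group 1 is only set when the second alternative matched,
        # i.e. for every word except the first one.
        if m.group(1) is None:
            return m.group(0)          # first word: keep unchanged
        return m.group(1) + '.'        # later word: first letter plus a dot

    # The first alternative consumes the leading word so that the
    # second alternative can never match it.
    return re.sub(r'^\w+|(\w)\w+', shorten, name)

for name in ['John Smith', 'David X. Cohen', 'Kim Jong Un', 'Bob']:
    print(name, '->', abbreviate(name))
```

A single-letter word such as the X in 'David X. Cohen' is left alone because (\w)\w+ needs at least two word characters.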
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) captures every uppercase letter that is preceded by a horizontal whitespace character.
\w+ matches one or more word characters.
Replacing each match with the contents of group 1 (\1) plus a dot gives the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).

Regex Matching a string

I currently am writing a regex to match strings such as this:
( expr ) | id | num
term * factor | factor
expr
I want the regex to match each occurrence of a set of characters between each ' | ', but also match solo expressions such as:
expr
I currently have this, but I am doing my negative lookahead wrong and I am not really sure how to proceed.
((.*) \|) (.*)$
P.s. I am not really fond of using .* in this situation, but I cannot think of another way to match, because the characters between ' | 's can be word characters, digits, or anything in between.
EDIT:
I would like the output matches to look like this:
Regex ran on line 1, output:
3 matches - ( expr ), id, num
Regex ran on line 2:
2 matches - term * factor, factor
Regex ran on line 3:
1 match - expr
This could be your simple regex:
[^|]+
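- matches one or more characters up to the next "|" (or the end of the string)
Alternatively, you could use String.split. Its argument is a regex, so the | has to be escaped. Either way, the pieces keep their surrounding spaces, so trim them if needed: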
String line = "term * factor | factor";
String[] split = line.split("\\|");
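As a rough cross-check of the [^|]+ pattern, here is a small Python sketch (Python is used only for illustration; the answer's code is Java). The function name alternatives is hypothetical, and stripping the surrounding spaces is an added assumption based on the output the question asks for.

```python
import re

def alternatives(line: str) -> list[str]:
    # [^|]+ grabs each run of characters that contains no "|";
    # strip() removes the spaces left around each piece.
    return [part.strip() for part in re.findall(r'[^|]+', line)]

for line in ['( expr ) | id | num', 'term * factor | factor', 'expr']:
    parts = alternatives(line)
    print(len(parts), 'matches -', ', '.join(parts))
```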

Grep ambiguity nested square bracket

sample.txt contains
abcde
abde
Can anybody explain the output of following commands -
grep '[[ab]]' sample.txt - no output
grep '[ab[]]' sample.txt - no output
grep '[ab[]' sample.txt - output is abcde , abde
grep '[ab]]' sample.txt - no output
And what does [(ab)] and [^(ab)] mean? Is it the same as [ab] and [^ab] ?
First thing to understand is that, inside a character class, most regex meta-characters lose their special meaning and are matched literally (the exceptions are ], a leading ^ and a - between two characters). For example, an * will match a literal * and will not mean zero or more repetitions. Similarly, () will match ( and ), and will not create a capture group.
Now, if a ] is found in a character class, that automatically closes the character class, and the further character won't be the part of that character class. Now, let's understand what is happening above:
In 1, 2, and 4, your character class ends at the first closing ]. So, the last closing bracket - ], is not the part of character class. It has to be matched separately. So, your pattern will match something like this:
'[[ab]]' is same as '([|a|b)(])' // The last `]` has to match.
'[ab[]]' is same as '(a|b|[)(])' // Again, the last `]` has to match.
'[ab]]' is same as '(a|b)(])' // Same, the last `]` has to match.
In each of these patterns the character class closes at the first `]`.
Now, since neither of the two strings contains a ], no match is found.
Whereas, in the 3rd pattern, your character class is closed only by the last ]. And hence everything comes inside the character class.
'[ab[]' means match string that contains 'a', or 'b', or '['
which is perfectly valid and match both the string.
And what does [(ab)] and [^(ab)] mean?
[(ab)] means match any of the (, a, b, ). Remember, inside a character class, no meta-character of regex has any special meaning. So, you can't create groups inside a character class.
[^(ab)] means the exact opposite of [(ab)]. It matches any single character that is not one of (, a, b, ). So grep prints every line that contains at least one such character, even if the line also contains a or b.
Is it the same as [ab] and [^ab] ?
No. These two do not include ( and ), so they are slightly different.
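These rules can be checked with a short Python sketch. It assumes Python's re module as a stand-in for grep; the two agree on how these particular bracket expressions are parsed. The helper grep below is a hypothetical imitation of the command, returning the lines in which the pattern matches somewhere.

```python
import re

def grep(pattern: str, lines: list[str]) -> list[str]:
    # Like grep: keep every line in which the pattern matches anywhere.
    return [line for line in lines if re.search(pattern, line)]

sample = ['abcde', 'abde', '[abcd', 'abcd]', 'abc[]foo',
          'abc]bar', '[ab]cdef', 'a(b)cde', 'ab']

# [ab] followed by a literal ]: only "[ab]cdef" has "b]" in it.
print(grep(r'[ab]]', sample))
# One class holding a, b and [: every sample line contains an "a".
print(grep(r'[ab[]', sample))
# Any character other than ( a b ): every line except "ab" has one.
print(grep(r'[^(ab)]', sample))
```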
I give it a try:
grep '[[ab]]' - matches lines that contain one of "[,a,b" immediately followed by a "]"
grep '[ab[]]' - matches lines that contain one of "a,b,[" immediately followed by a "]"
grep '[ab[]' - matches lines that contain one of "a,b,["
grep '[ab]]' - matches lines that contain one of "a,b" immediately followed by a "]"
grep '[(ab)]' - matches lines that contain one of "(,a,b,)"
grep '[^(ab)]' - matches lines that contain at least one char other than "(,a,b,)"
grep '[ab]' - matches lines that contain one of "a,b"
grep '[^ab]' - matches lines that contain at least one char other than "a" and "b"
you can go through those grep cmds on this example:
#create a file with below lines:
abcde
abde
[abcd
abcd]
abc[]foo
abc]bar
[ab]cdef
a(b)cde
you will see the difference, and think about it with my comment/explanation.

Regex to remove EVEN lines

I need help to build a regex that can remove EVEN lines in a plain textfile.
Given this input:
line1
line2
line3
line4
line5
line6
It would output this:
line1
line3
line5
Thanks !
Actually, you don't use regex for that. With your favourite language, iterate the file, use a counter and do modulus. eg with awk (*nix)
$ awk 'NR%2==1' file
line1
line3
line5
even lines:
$ awk 'NR%2==0' file
line2
line4
line6
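The counter-and-modulus approach looks much the same in other languages. A minimal Python sketch, with the hypothetical function name odd_lines and line numbers counted from 1 as awk's NR does:

```python
def odd_lines(text: str) -> list[str]:
    # enumerate(..., start=1) plays the role of awk's NR counter.
    return [line
            for number, line in enumerate(text.splitlines(), start=1)
            if number % 2 == 1]

sample = 'line1\nline2\nline3\nline4\nline5\nline6\n'
print('\n'.join(odd_lines(sample)))
```

Changing the test to number % 2 == 0 keeps the even lines instead.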
Well, if you do a search-and-replace-all-matches on
^(.*)\r?\n.*
in "^ matches start-of-line mode" and ". doesn't match linebreaks mode"; replacing with
\1
then you lose every even line.
E. g. in C#:
resultString = Regex.Replace(subjectString, @"^(.*)\r?\n.*", "$1", RegexOptions.Multiline);
or in Python:
result = re.sub(r"(?m)^(.*)\r?\n.*", r"\1", subject)
First, I fully agree with the consensus that this is not something regex should be doing.
Here's a Java demo:
public class Test {
public static String voodoo(String lines) {
return lines.replaceAll("\\G(.*\r?\n).*(?:\r?\n|$)", "$1");
}
public static void main(String[] args) {
System.out.println("a)\n"+voodoo("1\n2\n3\n4\n5\n6"));
System.out.println("b)\n"+voodoo("1\r\n2\n3\r\n4\n5\n6\n7"));
System.out.println("c)\n"+voodoo("1"));
}
}
output:
a)
1
3
5
b)
1
3
5
7
c)
1
A short explanation of the regex:
\G # match the end of the previous match
( # start capture group 1
.* # match any character except line breaks and repeat it zero or more times
\r? # match the character '\r' and match it once or none at all
\n # match the character '\n'
) # end capture group 1
.* # match any character except line breaks and repeat it zero or more times
(?: # start non-capture group 1
\r? # match the character '\r' and match it once or none at all
\n # match the character '\n'
| # OR
$ # match the end of the input
) # end non-capture group 1
\G begins at the start of the string. Every pair of lines (where the second line is optional, to handle a trailing odd-numbered line) gets replaced by the first line in the pair.
But again: using a normal programming language (if one can call awk "normal" :)) is the way to go.
EDIT
And as Tim suggested, this also works:
replaceAll("(?m)^(.*)\r?\n.*", "$1")
I use capture groups (.*) --> $1 in Sublime Text's regex find-and-replace mode to
remove the line break in every other line and place a tab character between the values using
replace (.*)\n(.*)\n
with $1\t$2\n
For this specific question the OP could change this to
replace (.*)\n(.*)\n
with $1\n
Well, this will remove the EVEN lines from the text file, but only because every line in the sample happens to end with its own line number:
grep '[13579]$' textfile > textfilewithoddlines
And output this:
line1
line3
line5
Perhaps you are on the command line. In PowerShell:
$x = 0; gc .\foo.txt | ? { $x++; $x % 2 -eq 1 }