Looking for some regex which will create a capture group for words occurring within parentheses, ignoring the parentheses themselves. The regex must be either PCRE or ICU.
Input: ( lakshd asd___ asa1123 Name : _____)
Desired Output: Name
What I've tried:
\\((Name|name|NAME)\\)
(?<=\\()name|Name|NAME(?=\\))
\\(name|Name|NAME\\)
All these patterns look for name, Name or NAME with a ( immediately before and a ) immediately after; they differ only in what is captured or returned as the match. To match a word anywhere inside parentheses, put \([^()]* before the value you need and [^()]*\) after it.
Also, there is no point in extracting something you already know.
So, if you plan to extract the last word from the parentheses, you may use
> library(stringr)
> s = "( lakshd asd___ asa1123 Name : _____)"
> res <- str_match(s, "(?i)\\([^()]*\\b([a-z]\\w*)\\b[^()]*\\)")
> res[,2]
[1] "Name"
Note that str_match allows accessing captured values.
The (?i)\\([^()]*\\b([a-z]\\w*)\\b[^()]*\\) pattern matches a parenthesized substring and captures the last whole word inside it that starts with a letter.
If nested parentheses are not expected, it is enough to check that the current position is followed by a closing parenthesis with no other parenthesis in between, assuming an opening parenthesis precedes it (works with both ICU and PCRE):
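For readers not working in R, the same pattern can be tried with Python's re module. This is only a sketch; the variable names are illustrative and not part of the original answer.

```python
import re

s = "( lakshd asd___ asa1123 Name : _____)"

# Same pattern as above, written with single backslashes in a raw string.
# Group 1 holds the last letter-initial word inside the parentheses.
pattern = re.compile(r"\([^()]*\b([a-z]\w*)\b[^()]*\)", re.IGNORECASE)

match = pattern.search(s)
last_word = match.group(1) if match else None
print(last_word)
```

The greedy [^()]* before the group is what pushes the capture to the last candidate word rather than the first.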
(Name|name|NAME)(?=[^()]*\))
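A quick way to check the lookahead is to run it on one string where the word sits inside parentheses and one where it does not. The sketch below uses Python's re module and two made-up inputs; it assumes no nested parentheses, as stated above.

```python
import re

# Illustrative inputs: in the second one "Name" is outside the parentheses.
inside = "( lakshd asd___ asa1123 Name : _____)"
outside = "Name ( lakshd asd___ )"

# The lookahead only succeeds if a ")" is reached before any other parenthesis.
pattern = re.compile(r"(Name|name|NAME)(?=[^()]*\))")

found_inside = pattern.findall(inside)
found_outside = pattern.findall(outside)
print(found_inside, found_outside)
```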
I'm trying to capture a group from a string with ~, ~~ and ~~~ symbols. I was successful with extracting single symbols but it doesn't ignore the other occurrences in the string.
This is my code I tried experimenting with:
String f = '~the calculator is on and working~I entered 50 into the calculator' +
    '~~I press add button~~holding equal button ~~~The result should be 50';
List<String> givens = f.split(RegExp(r'~+'));
List<String> whens = f.split(RegExp(r'~~+'));
List<String> thens = f.split(RegExp(r'~~~+'));
for (String ss in givens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in whens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in thens) {
  print(ss);
}
This results in the following:
The givens capture group also captured the ones with ~~ and ~~~, which is not intended.
The whens capture group also captured the ones with a single ~, which made it very confusing.
Lastly, the thens capture group also captured the others, which is also not intended.
I only need to capture the strings that start with the specific pattern, stopping when a different pattern is encountered.
Example: givens should capture only 'the calculator is on and working' and 'I entered 50 into the calculator'.
Any hints or help is greatly appreciated!
I think the problem is that you started off by splitting the string into pieces. It might be easier to search for the elements with a pattern that looks for text preceded by exactly one, two or three ~ chars.
This can be done with regex positive lookbehind patterns.
Typically, if you want to find a string preceded by one tilde, you have to make sure it does not match when there are other tildes before it.
Find givens
(?<=(?:[^~]|^)~)[^~]+ would be the pattern to find only givens.
Test it here: https://regex101.com/r/9WLbM3/2
Explanation
[^~] means search for any character which is not a ~. This is because [abc] means any char which is in the list, so a, b or c. If you add the ^ char at the beginning of the list then it means "not these chars".
[^~]+ means search for one or more characters that are not ~. This will capture the phrases between the tildes.
A positive lookbehind is written (?<=something present). We want to search for a tilde, so we would put (?<=~) as the positive lookbehind. The problem is that it will also match the phrases with several tildes in front. To avoid that, we can say that the tilde should be preceded either by ^ (meaning the beginning of the string) or by [^~] (meaning not a tilde). To say "either this or that", we use the syntax (this|that|or even that). But parentheses capture their content, and we don't need that. To disable capturing we can add ?: at the beginning of the group, leading finally to (?:[^~]|^), meaning either a non-tilde char or the beginning of the string, without capturing it.
Find whens and thens
The regular expression is almost the same. It's just that we replace ~ by ~{2} or ~{3}.
Pattern for whens: (?<=(?:[^~]|^)~{2})[^~]+
Pattern for thens: (?<=(?:[^~]|^)~{3})[^~]+
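The lookbehinds above mix a one-character branch with a zero-width ^, which Python's re module rejects as a variable-width lookbehind. The sketch below therefore swaps in a different but equivalent idea for Python: a negative lookbehind (?<!~) marks the first tilde of a run, and requiring a non-tilde right after the run fixes its length. The helper name extract is hypothetical.

```python
import re

f = ('~the calculator is on and working~I entered 50 into the calculator'
     '~~I press add button~~holding equal button ~~~The result should be 50')

def extract(text, tildes):
    """Return the phrases introduced by a run of exactly `tildes` tilde chars."""
    # (?<!~) : the run must not be preceded by another tilde
    # ~{n}   : exactly n tildes
    # ([^~]+): the phrase; it must start with a non-tilde, so the run cannot be longer
    return re.findall(r'(?<!~)~{%d}([^~]+)' % tildes, text)

givens = extract(f, 1)
whens = extract(f, 2)
thens = extract(f, 3)
print(givens, whens, thens, sep='\n')
```

Trailing spaces inside a phrase (as in 'holding equal button ') are kept; call strip() on the results if that is not wanted.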
I want to select dot-separated names (such as test.test.test) from a very long string (SQL). The full string can be a single line or several lines separated by newlines, and the names can appear on the first line, on a later line, or on both.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add prefix or suffix words (or a space) that must be present wherever this logic is checked? In the case above, if the prefix is update, the regex should pick test.table, but if the statement is like select test.table, the regex should not pick up this combination; the same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
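As a rough check, the pattern can be run over the two examples from the question. The sketch below uses Python's re.findall; the variable names are illustrative.

```python
import re

sentence = "I am testing something like test.test.test in sentence."
sql = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

# Non-whitespace run, a literal dot, then another non-whitespace run.
dotted = re.compile(r"[^\s]+\.[^\s]+")

in_sentence = dotted.findall(sentence)
in_sql = dotted.findall(sql)
print(in_sentence, in_sql)
```

The trailing "sentence." is not matched because nothing but the end of the string follows its dot.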
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
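The keyword-restricted version can be tried on Example 3 from the question. This sketch assumes Python's re module, which accepts this lookbehind because both alternatives (INTO and FROM) have the same length; the flag re.IGNORECASE stands in for the inline (?i).

```python
import re

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Only dotted names directly preceded by INTO or FROM plus one whitespace char.
after_keyword = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

tables = after_keyword.findall(sql)
print(tables)
```

Keywords of different lengths (for example UPDATE next to FROM) would make the lookbehind variable-width, which some engines, Python's re among them, do not accept; a capture group after the keyword is a common workaround.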
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java (the backslash must be doubled inside a string literal):
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): it will also match something like '.word' or 'word..word', which I assume does not occur in your input.
I am using regex to find few keywords after colon(:) and the best I have reached so far is:
sample test case
test {
test1 {
sadffd(test: "aff", aaa: "aa1") {}
}
}
Now I have to find a keyword inside the () brackets. It works for 'aaa', but when I use test it fails: it matches the rest of the string instead of just the value.
my regex so far
\btest(.*\w") (failed case) expected "aff" returned "aff", aaa: "aa1"
\baaa(.*\w") (pass case) returned "aa1"
please let me know if more information is needed
You may try
:\s*"(.*?)"
And the data you need is in the first capturing group.
Explanation
:\s*"(.*?)"
: colon
\s* followed by optionally any number of spaces
" followed by quote
( ) capturing group, containing...
.*? any number of character, matching as few as possible
" followed by quote
Demo:
https://regex101.com/r/WnvzdG/1
Update:
If you want to match ONLY after specific keywords, followed by colon, you can do something like:
(KEYWORD1|KEYWORD2|KEYWORD3)\s*:\s*"(.*?)"
First capture group will be the keyword matched, second capture group will be the value.
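A minimal sketch of the keyword variant in Python, using test and aaa from the question in place of the KEYWORD placeholders:

```python
import re

text = """test {
test1 {
sadffd(test: "aff", aaa: "aa1") {}
}
}"""

# Group 1 is the keyword, group 2 the quoted value (non-greedy, so it stops
# at the first closing quote).
pairs = re.findall(r'(test|aaa)\s*:\s*"(.*?)"', text)
values = dict(pairs)
print(values)
```

The block names test and test1 are skipped because no colon follows them.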
One more approach (executed in Python)
import re

items = ['test{test1 {sadffd(test: "aff", aaa: "aa1") {}}}']
for item in items:
    # every quoted word
    print(re.findall(r'"(\w+)"', item))
    # only quoted words preceded by a colon and a space
    print(re.findall(r'(?<=: )"(\w+)"', item))
Output
['aff', 'aa1']
['aff', 'aa1']
I believe a simple regex would work to get everything inside the double quotes in your case:
("\w+")
Note that your question above says you want to capture "aff" and not just aff so I've included the surrounding quotes within the capturing group.
It's pretty crude but this should be OK for the input you've presented. (It wouldn't handle things like an escaped double quote in the string, for example).
I'm doing a search and replace in Notepad++ and am looking for a regex that will literally give me the first ( in a given string, so I can replace it.
I am not interested in any preceding or succeeding characters, literally just the first (.
An example string is:
"starLan(11), -- Deprecated via RFC3635 ethernetCsmacd (6) should be used instead
I'd like to find the first ( (near starLan(11) in this case) so I can replace that character with something else.
It should not match any other ( in the same line, so in this case it should not match the second ( near (6).
All of the examples I've come across seem to be returning everything up to and including the given character, which is not what I'm after in this case.
I would match the following pattern :
^([^(]*)\((.*)$
And replace it with this :
\1X\2
Where X is the text you want to replace your ( with.
It uses back-references to refer to the parts before and after the first (.
Edit : as mentioned by OP, matching ^([^(]*)\( and replacing with \1X is enough.
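Outside Notepad++, the same idea can be sketched with Python's re.sub. The replacement character [ is only a stand-in for X, and Python writes the backreference as \1 just as above.

```python
import re

line = '"starLan(11), -- Deprecated via RFC3635 ethernetCsmacd (6) should be used instead'

# ^([^(]*) captures everything before the first "(", which is then matched
# literally and replaced; the captured prefix is put back with \1.
result = re.sub(r'^([^(]*)\(', r'\1[', line)
print(result)
```

Because the pattern is anchored at the start of the line, later parentheses such as the one in (6) are left untouched.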
you can use this
^(.*?)\(
The text captured inside () will be available in back-reference $1, so you can replace it like:
$1someText
where someText is the text you want to put in place of the removed '('.
If you want the text after the removed '(' to remain intact as well, you can use:
^(.*?)\((.*)
and replacement as:
$1someText$2
How to grep data item from this html string
a <- "<div class=\"tst-10\">100%</div>"
so that the result is 100%? The main idea is to get data between > <.
I would use gsub() in this case:
gsub("(<.*>)(.*)(<.*>)", "\\2", a)
[1] "100%"
Basically, this breaks the string up into three groups, each enclosed in parentheses ( and ). We can then refer to them with backreferences. The contents matched by the first group can be referred to as \1 (in R, write a double backslash, \\1, because the backslash itself must be escaped inside a string), those matched by the second as \2, and so on.
So, essentially, we're saying parse this string, figure out what matches my conditions, and return only the second backreference.
Piece by piece:
<.*> says to look for a "<" followed by any number of any characters ".*" up until you get to a ">"
.* means to match any number of characters (up until the next condition)
Keeping this in mind, you could actually probably use gsub("(.*>)(.*)(<.*)", "\\2", a) and get the same result.
I always use this regular expression to remove HTML tags:
gsub("<(.|\n)*?>","",a)
Gives:
[1] "100%"
This differs from mrdwab's answer in that I just remove every HTML tag, whereas his extracts the content from within the tags, which is probably more appropriate for this example. Note that the two give different results if there are more tags:
> gsub("(<.*>)(.*)(<.*>)", "\\2", paste(a,"<lalala>foo</lalala>"))
[1] "foo"
> gsub("<(.|\n)*?>","", paste(a,"<lalala>foo</lalala>"))
[1] "100% foo"
I think that I found it here on SO once, not sure which answer.
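The difference between the two approaches can also be reproduced outside R. The sketch below is a Python translation of the two gsub() calls using re.sub; the helper names are hypothetical.

```python
import re

a = '<div class="tst-10">100%</div>'
b = a + ' <lalala>foo</lalala>'

def extract_between(s):
    # Keep only group 2: the text between the last opening tag and the closing tag.
    return re.sub(r'(<.*>)(.*)(<.*>)', r'\2', s)

def strip_tags(s):
    # Remove every <...> tag; (.|\n) lets a tag span line breaks.
    return re.sub(r'<(.|\n)*?>', '', s)

print(extract_between(a), strip_tags(a))
print(extract_between(b), strip_tags(b))
```

With a single element both return the same text; with two elements the greedy first group swallows everything up to the last opening tag, so only the last element's content survives.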