Preserve all groups in a regexp - regex

I have a question about regular expressions. I have text like this:
embedded-software-entwickler
Basically, I want to replace each - with something else while preserving the groups, so that I can easily write $1#$2#$3, with # as the replacement for -.
My current regexp is ([a-zäöüß]+)(-), but it does not match the third word, entwickler.

How about something simple like this:
([\w]*?)-([\w]*?)-([\w]*)
Replace with:
$1#$2#$3
Here \w matches any word character. The first two groups use the lazy quantifier *? and the last group uses the greedy quantifier *, and the groups are separated by -.
If each section may also contain spaces or special characters (\w already covers letters, digits, and the underscore), you can use something like this:
([\s\S]*?)-([\s\S]*?)-([\s\S]*)
If you prefer something dynamic that handles any number of sections, you could try something like this:
([^\-]+)-
Replace with:
$1#
Demo: https://regex101.com/r/p6zQTO/1/
An alternative way to match each group, plus the replacement:
([^\-]*)-([^\-]*)
Replace with:
$1#$2
Demo: https://regex101.com/r/p6zQTO/2/
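As a rough illustration, here is how the fixed three-group pattern and the dynamic pattern above might behave in Python. This is only a sketch: Python writes the replacement as \1 rather than $1, and the variable names are made up for the example.

```python
import re

text = "embedded-software-entwickler"

# Fixed pattern: exactly three groups, re-joined with '#'.
three_parts = re.sub(r"(\w*?)-(\w*?)-(\w*)", r"\1#\2#\3", text)

# Dynamic pattern: one group plus its trailing hyphen, applied repeatedly,
# so it does not care how many sections there are.
any_parts = re.sub(r"([^-]+)-", r"\1#", text)

# With four sections, the fixed pattern leaves the last hyphen untouched,
# while the dynamic one still replaces all of them.
four_fixed = re.sub(r"(\w*?)-(\w*?)-(\w*)", r"\1#\2#\3", "a-b-c-d")
four_dynamic = re.sub(r"([^-]+)-", r"\1#", "a-b-c-d")

print(three_parts, any_parts, four_fixed, four_dynamic)
```

The four-section input shows why the dynamic form is usually the safer choice when the number of words is not known in advance.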

If all you need is to change every '-' into '#', a plain transliteration such as tr/-/#/ is simpler and better than a regex substitution.
If you need to group and extract for other purposes, then try something like /(\w+)(?:-(\w+))*/
((?: ... ) groups without capturing.)
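A small Python sketch of both suggestions, assuming Python's re module rather than Perl. One thing worth knowing about the grouped pattern is that a capture group inside a repeated group only keeps its last iteration, so splitting on the hyphen may be the easier way to get every word.

```python
import re

text = "embedded-software-entwickler"

# Plain character replacement, no regex needed (the tr/-/#/ idea).
swapped = text.replace("-", "#")

# Grouped pattern: group 2 sits inside a repeated non-capturing group,
# so after matching it holds only the last repetition.
match = re.fullmatch(r"(\w+)(?:-(\w+))*", text)
groups = match.groups()

# To get every word, splitting on the hyphen is simpler.
parts = text.split("-")

print(swapped, groups, parts)
```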

Related

Regex to match(extract) string between dot(.)

I want to select certain string combinations (joined with dots (.)) from a very long string (SQL). The full string could be a single line or several lines separated by newlines, and the combination could appear at the start (on the first line), on a later line, or in both places.
I need help writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix words or spaces that must be present wherever this logic is checked? In the case above, if the statement is an update, the regex should pick test.table, but if the statement is something like select test.table, the regex should not pick up this combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things not needed in the selection: as_t.numb, as_t.old, as_t.date
Once I have the regex, I can use it in a program to extract these words.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words that are joined together with a . (dot).
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
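For what it's worth, here is a sketch of both patterns run against Example3 from the question, using Python's re module. Python only accepts fixed-width lookbehinds; this one qualifies because INTO and FROM are both four letters long. The variable names are just for illustration.

```python
import re

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Every run of non-whitespace characters that contains a dot.
all_dotted = re.findall(r"[^\s]+\.[^\s]+", sql)

# Only the dotted names that directly follow INTO or FROM (case-insensitive).
after_keyword = re.findall(r"(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", sql)

print(all_dotted)
print(after_keyword)
```

The first list also contains the unwanted column references such as as_t.old, which is why the keyword lookbehind is needed for this input.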
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w zero or more times, followed by a dot, followed by another run of word characters repeated zero or more times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note: this solution is written for your use case (SQL strings). It will also match things like '.word' or 'word..word', which I assume do not occur in your input.
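If inputs like '.word' or 'word..word' do turn up, a stricter variant that requires at least one word character on each side of every dot might be worth trying. The pattern below is my own assumption, not part of the answer above, shown here as a Python sketch.

```python
import re

# Hypothetical stricter pattern: word characters, then one or more ".word" parts.
dotted_name = re.compile(r"\w+(?:\.\w+)+")

sentence = "I am testing something like test.test.test in sentence."
found = dotted_name.findall(sentence)

# Leading dots, doubled dots and a trailing full stop are not matched.
edge_cases = dotted_name.findall(".word word..word a.b")

print(found, edge_cases)
```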

RegEx substract text from inside

I have an example string:
*DataFromAdHoc(cbgv)
I would like to extract by RegEx:
DataFromAdHoc
So far I have come up with something like this:
^[^#][^\(]+
Unfortunately, it does not give the result I want. Do you have any idea why it's not working?
The regex you tried, ^[^#][^\(]+, works like this:
^[^#] matches one character at the beginning of the string that is not a #.
[^\(]+ then matches until a parenthesis is encountered (I think you don't have to escape the parenthesis inside a character class).
So this matches *DataFromAdHoc, including the *, because * is not a #.
What you could do is capture the [^\(]+ part in a group, like ([^(]+)
Then your regex would look like:
^[^#]([^(]+)
And DataFromAdHoc would be in group 1.
Use ^\*(\w+)\(\w+\)$
It just gets everything between the * and the stuff in brackets.
Your answer may depend on which language you're running your regex in, please include that in your question.
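Since the language was not stated, here is a sketch of both suggested patterns in Python, purely as an assumption about how they would be used. In each case the wanted text ends up in capture group 1.

```python
import re

s = "*DataFromAdHoc(cbgv)"

# Loose version: skip one leading non-# character, capture up to the "(".
loose = re.match(r"^[^#]([^(]+)", s)

# Strict version: literal "*", captured name, then "(...)" up to the end.
strict = re.match(r"^\*(\w+)\(\w+\)$", s)

name_loose = loose.group(1)
name_strict = strict.group(1)

print(name_loose, name_strict)
```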

textmate regex date search

I'm using TextMate to search for date strings in the format 30/08/2016 and tried adding this to the regex find panel: ^\d{1,2}\/\d{1,2}\/\d{4}$
but it yields no results.
You are anchoring the pattern with ^ (beginning of the string) and $ (end of the string). It therefore only matches when the date is the entire string, like 12/12/1212, and not when it is embedded in other text, like abc12/12/1212def. You also have to escape the forward slash /.
If you don't care about this string being inside the text, you can use
\d{1,2}\/\d{1,2}\/\d{4}
If you know that there is whitespace around it, you can use this:
(?<=\s)(\d{1,2}\/\d{1,2}\/\d{4})(?=\s)
Or the simpler solution with word boundaries \b
\b\d{1,2}\/\d{1,2}\/\d{4}\b
Don't forget to use the global flag if you are trying to match multiple instances of this date pattern.
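To see the difference between the anchored, unanchored and word-boundary versions, here is a small sketch in Python (an assumption on my part, since the question is about TextMate). In Python the forward slash needs no escaping, and findall plays the role of the global flag.

```python
import re

anchored = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")
unanchored = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")
bounded = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# The anchored pattern only matches when the date is the whole string.
whole = anchored.search("12/12/1212")
embedded = anchored.search("abc12/12/1212def")

# Without anchors, the date is found inside other text.
inside = unanchored.findall("abc12/12/1212def")

# Word boundaries find dates in running text, but not ones glued to letters.
in_text = bounded.findall("due 30/08/2016, paid 1/9/2016")
glued = bounded.findall("abc12/12/1212def")

print(bool(whole), bool(embedded), inside, in_text, glued)
```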

RegEx to match string between delimiters or at the beginning or end

I am processing a CSV file and want to search and replace strings as long as it is an exact match in the column. For example:
xxx,Apple,Green Apple,xxx,xxx
Apple,xxx,xxx,Apple,xxx
xxx,xxx,Fruit/Apple,xxx,Apple
I want to replace 'Apple' if it is the EXACT value in the column (if it is contained in text within another column, I do not want to replace). I cannot see how to do this with a single expression (maybe not possible?).
The desired output is:
xxx,GRAPE,Green Apple,xxx,xxx
GRAPE,xxx,xxx,GRAPE,xxx
xxx,xxx,Fruit/Apple,xxx,GRAPE
So the expression I want is: match the beginning of input OR a comma, followed by desired string, followed by a comma OR the end of input.
You cannot put ^ or $ in character classes, so I tried \A and \Z but that didn't work.
([\A,])Apple([\Z,])
This didn't work, sadly. Can I do this with one regular expression? Seems like this would be a common enough problem.
It will depend on your language, but if the one you use supports lookarounds, then you would use something like this:
(?<=,|^)Apple(?=,|$)
Replace with GRAPE.
Otherwise, you will have to put back the commas:
(^|,)Apple(,|$)
Or
(\A|,)Apple(,|\Z)
And replace with:
\1GRAPE\2
Or
$1GRAPE$2
Depending on what's supported.
The above are raw regex (and replacement) strings. Escape as necessary.
Note: The disadvantage of the latter solution is that it will not work on strings like:
xxx,Apple,Apple,xxx,xxx
Since the comma after the first Apple has already been consumed. You'd have to call the regex replacement at most twice if you have such cases.
Oh, and I forgot to mention: you can have some 'hybrids', since languages have different levels of support for lookbehinds (in all of the below, ^ and \A, $ and \Z, \1 and $1 are interchangeable; I'm leaving the variants out so this doesn't get longer than it already is):
(?:(?<=,)|(?<=^))Apple(?=,|$)
The one above is for flavors where lookbehinds cannot be of variable width. Replace with GRAPE.
(^|,)Apple(?=,|$)
And the one above is for flavors where lookaheads are supported but lookbehinds are not. Replace with \1GRAPE.
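As a sketch of how the consuming and the lookaround versions differ, here is a Python example (my choice of language, since the question does not name one). Python's re rejects the variable-width lookbehind (?<=,|^), so this uses ^ or a fixed-width (?<=,) instead, along the lines of the hybrid patterns above.

```python
import re

rows = [
    "xxx,Apple,Green Apple,xxx,xxx",
    "Apple,xxx,xxx,Apple,xxx",
    "xxx,xxx,Fruit/Apple,xxx,Apple",
]

# Consuming version: the delimiters are captured and put back.
consuming = re.compile(r"(^|,)Apple(,|$)")

# Lookaround version: the delimiters are only checked, never consumed.
lookaround = re.compile(r"(?:^|(?<=,))Apple(?=,|$)")

replaced = [lookaround.sub("GRAPE", row) for row in rows]

# Consecutive columns: one pass of the consuming version misses the second Apple.
tricky = "xxx,Apple,Apple,xxx,xxx"
one_pass = consuming.sub(r"\1GRAPE\2", tricky)
with_lookaround = lookaround.sub("GRAPE", tricky)

print(replaced, one_pass, with_lookaround)
```

Each row is processed separately here; for a whole file held in one string, the re.MULTILINE flag would be needed so that ^ and $ apply per line.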
This does as you wish:
Find what: (^|,)(?:Apple)(,|$)
Replace with: $1GRAPE$2
This works on regex101, in all flavors.
http://regex101.com/r/iP6dZ8
I wanted to share my original work-around (before the other answers), though it feels like more of a hack.
I simply prepend and append a comma on the string before doing the simpler:
s/,Apple,/,GRAPE,/g
then cut off the first and last character.
PHP looks like:
$line = substr(preg_replace($search, $replace, ','.$line.','), 1, -1);
This still suffers from the problem of consecutive columns (e.g. ",Apple,Apple,").

Replacing part of delimited string with R's regex

I have the following list of strings:
name <- c("hsa-miR-555p","hsa-miR-519b-3p","hsa-let-7a")
What I want to do is, for each of the above strings, replace the text after the second delimiter (-) with "zzz".
Yielding:
hsa-miR-zzz
hsa-miR-zzz
hsa-let-zzz
What's the way to do it?
Might as well use something like:
gsub("^((?:[^-]*-){2}).*", "\\1zzz", name)
(?:[^-]*-) is a non-capturing group that matches zero or more non-dash characters followed by a single dash, and the {2} right after it means the group must occur exactly twice. The outer parentheses capture both repetitions as group 1, and .* then matches everything else so that it gets replaced. Note that I used the ^ anchor, just in case, to avoid unintended substitutions.
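For comparison, the same pattern can be tried outside R. Below is a rough Python port (not R code, and the variable names are only for illustration); the backreference is written \1 in a raw string instead of "\\1".

```python
import re

names = ["hsa-miR-555p", "hsa-miR-519b-3p", "hsa-let-7a"]

# Keep the first two dash-terminated parts (group 1), replace the rest.
pattern = re.compile(r"^((?:[^-]*-){2}).*")
replaced = [pattern.sub(r"\1zzz", name) for name in names]

# A string with fewer than two dashes does not match and is left unchanged.
unchanged = pattern.sub(r"\1zzz", "hsa-let")

print(replaced, unchanged)
```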
Perhaps something like this:
> gsub("([A-Za-z]+-)([A-Za-z]+-)(.*)", "\\1\\2zzz", name)
[1] "hsa-miR-zzz" "hsa-miR-zzz" "hsa-let-zzz"
There are actually several ways to approach this, depending on how "regular" your expressions actually are. For example, do they all start with "hsa-"? What are the options for the "middle" group? Might there be more than three dashes?