Regex search character inside capturing group


I have a LOB describing the rows of a CSV, and each column is delimited by a semicolon.
Some of the columns are strings, delimited by pipes, which may contain a semicolon. I must replace that semicolon with a comma, but only inside the pipe-delimited string columns, or the column order will be destroyed.
Example of a row:
1;4;|1.Simple response|;|once upon a time; I used to...|;|my favorite
character is ; I really love it.|
Response example:
1;4;|1.Simple response|;|once upon a time, I used to...|;|my favorite
character is , I really love it.|
This is the regex I wrote:
(\|)(.*?)(\|[\n\;])
What I need is to replace that .*? with [;]+, but if I try, nothing is matched.
I don't understand how to match with a regex inside an already captured group.
Any advice?
Thanks

It appears that the third part of the pattern (second pipe |) won't match because it must always be followed by a newline (\n) or semicolon (;), which is not the case with your input.
Did you mean something like this:
\|;([\|\n;]) //would allow newline OR semicolon OR second pipe
https://regex101.com/r/S9MrWe/1
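One way to avoid matching inside an already captured group is to match each whole pipe-delimited field and rewrite it in a callback. Below is a minimal Python sketch, not the original poster's tooling. It assumes that a field never contains a pipe itself and that the replacement character is the comma shown in the expected output. The name fix_semicolons is made up for illustration.

```python
import re

# Matches one pipe-delimited field, including fields that span a line break.
# Assumption: a field never contains a literal pipe.
FIELD = re.compile(r"\|[^|]*\|")


def fix_semicolons(row: str) -> str:
    """Replace ';' with ',' inside pipe-delimited fields only."""
    return FIELD.sub(lambda m: m.group(0).replace(";", ","), row)


row = ("1;4;|1.Simple response|;|once upon a time; I used to...|;|my favorite\n"
       "character is ; I really love it.|")
print(fix_semicolons(row))
```

The semicolons that separate columns sit outside every matched field, so they are left alone.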

Related

Regex to select only specific characters between two strings

I have the following HTML:
<i>This is my first sentence.
This is my second sentence.</i>
Using Regex (in SublimeText FYI) how can I select only the whitespace (including line breaks) between the two <i></i> brackets?
I have got this far where I can select all the characters, but how do I limit it to whitespace and new lines only?:
(?<=<i.).*?(?=</i>)
https://regex101.com/r/eZ1gT7/1986

You cannot do it with a single regex, but you can use a combination of two. First:
<\s*i[^>]*>([\s\S]+?)<\s*\/\s*i\s*>
Demo
This matches the <i> element and makes the text between the tags available in captured group 1. You can then loop through the matched values and find the whitespace runs with:
\s+
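The two-step idea can be sketched in Python as follows. This is an illustration under assumptions, not SublimeText behavior: the tag pattern uses [^>]* so that a bare <i> with no attributes also matches, and the function name is hypothetical.

```python
import re

# Step 1: capture everything between <i ...> and </i> (group 1).
# [^>]* allows a bare <i> as well as one with attributes (assumption).
ITALIC = re.compile(r"<\s*i[^>]*>([\s\S]+?)<\s*/\s*i\s*>")
# Step 2: runs of whitespace, line breaks included.
WHITESPACE = re.compile(r"\s+")


def whitespace_inside_italics(html: str) -> list:
    """Return every whitespace run found between <i> and </i> tags."""
    runs = []
    for match in ITALIC.finditer(html):
        runs.extend(WHITESPACE.findall(match.group(1)))
    return runs


html = "<i>This is my first sentence.\nThis is my second sentence.</i>"
print(whitespace_inside_italics(html))
```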

I'm guessing that maybe this expression,
(?=\s*[\n\r])(\s*)(?=\S)
replaced with a single space, might be close to what you have in mind.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.

You'll have to loop through the captured group 2:
(<i[^>]+?>)?([ \n]*)(<\/i>)?
https://regex101.com/r/wcPwkU/1

Regular expression in Oracle

This is my string
'SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;'
I need output like 'SEPA1,SEPA2,SEPA3' using a regular expression.
SELECT REGEXP_REPLACE ('SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;',
'([^,]+)(\1)+', '')
FROM dual;
This query is not working: it leaves the input string unchanged. Also, I am looking for a regular expression solution (in particular, no use of a CONNECT BY LEVEL query to split the string into pieces).

MT0 has already provided the correct solution (most likely, but see the discussion of commas - perhaps escaped - within token values). Let me explain here what is happening in your attempt - you may find this helpful.
[^,]+ in the search pattern means one or more non-comma characters. This part is probably OK, but it raises two questions.
Can the input string contain substrings like 'SEPA6,;'? This is how a "row in a table" (presented as a single string, where "rows" are terminated by a semicolon and, within each row, values are separated by commas) would look when the "date" is null. So the question is: can there be null dates in your string, represented by ,; with nothing between the comma and the semicolon? If that is possible, you would need to change the + quantifier to *, to allow zero or more non-comma characters before the semicolon.
Can there ever be a comma, a few characters, another comma, a few more characters, and then a semicolon? Presumably not in the "date" portion of each token; but where you show SEPA1 etc., whatever they mean, could there be a comma in the name (probably escaped, something like SE","TG)? In that case, you really want something like what you did, with the negated character class. The Answer posted by MT0 will delete everything from the FIRST comma (even if it's in the middle of the "name") to the semicolon.
Then, in your attempt you use a backreference, (\1), in the search pattern. There is no reason for that; you want to match non-commas followed by a semicolon, so that's what you must write in the search pattern. There is no repetition of the substring of non-commas found by the first part of the pattern.
Replacing something with null is the default for regexp_replace, so you may - optionally - leave out the last argument - the '' in your attempt.
So, your solution can be rewritten like so:
... regexp_replace(input_string, '[^,]*;')
(I left out the last argument, which was '' in your attempt - that is the default third argument anyway; but you may prefer to show it for clarity. OK either way.)
This will leave a comma at the END of the output string. I asked you a question in the Comments - it is not clear why you are changing from a terminator (the semicolon in the inputs) to a separator (the comma in the output); normally the delimiter should be of the same kind, either terminator in both input and output, or separator in both. (It is also odd that you are changing from semicolon to comma as the primary delimiter, but you must have your reasons.) In any case, that's why MT0 needed to wrap the return string from the regexp replace operation within a call to trim(), to remove the trailing comma.
A note about efficiency:
If you can have commas (perhaps escaped) within the "values" in your input string, the solution will have to be more complicated to handle all the possibilities. If commas are not possible in the "date" portion that you must eliminate from your input, but they are possible elsewhere, then the solution you were trying (which I fixed for you a little earlier in this Answer) will produce the required result; MT0's Answer will not, since it will start at the first comma after a semicolon, regardless of where it is.
However, if there are no commas anywhere except as true delimiters, then MT0's solution will be correct, and much faster than replacing [^,]*;. Regular expressions are (very) slow by nature, and writing them efficiently is exceptionally important. The difference between the solutions seems minor, so let's see what it is.
When you search for '[^,]*;' the regexp engine will try to find a match from the first character. It's not a comma... it reads the second character in, the third, ... and then it finds a comma before it finds a semicolon. So the pattern can't be matched. Then the engine tries to find a match from the second character, which also fails when the first comma is encountered. Etc. This will take a lot of time.
If you search for ',.*?;', the engine starts at the first character in the input string. The first character is not a comma, so there will be no match. The engine can already move on to the second character. It is not a comma either, to match the first character in the regexp pattern, so there will be no match at the second character of the input string either. These conclusions are drawn much faster, so the actual matches are found much faster too. MT0's solution differs from yours by a leading comma - that helps the regexp engine a lot.
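The difference in what the two patterns match can be checked with a short Python sketch. Python's re module is only a stand-in for Oracle's REGEXP_REPLACE here. The engines differ, so this shows which text each pattern removes, not their relative speed. The function names and the 'SE,TG' input are made up for illustration.

```python
import re

s = "SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;"


def strip_dates_negated(text: str) -> str:
    """Delete each run of non-commas that ends in a semicolon."""
    return re.sub(r"[^,]*;", "", text).rstrip(",")


def strip_dates_lazy(text: str) -> str:
    """Replace everything from a comma to the next semicolon with one comma."""
    return re.sub(r",.*?;", ",", text).rstrip(",")


print(strip_dates_negated(s))
print(strip_dates_lazy(s))

# Hypothetical input with a comma inside the "name" part of the first token.
with_comma = "SE,TG,30-NOV-17;SEPA2,30-NOV-17;"
print(strip_dates_negated(with_comma))  # keeps the comma inside the name
print(strip_dates_lazy(with_comma))     # cuts from the first comma onward
```

On the sample string both functions return the same result. They diverge only when a comma appears inside a name.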

Use the regular expression ,.*?; to find each comma and then the minimum amount of characters until the next semi-colon to match the portion of the string you want to replace:
SELECT TRIM(
TRAILING ',' FROM
REGEXP_REPLACE(
'SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;',
',.*?;',
','
)
) AS sepas
FROM DUAL
Output:
SEPAS
-----------------
SEPA1,SEPA2,SEPA3

How can I remove everything before the first ":" in Notepad++?

I have a file like this in notepad++
n1:n1:n1
n1:n1:n2
n1:n1:n3
I want to delete everything before the first ":", including the ":" itself,
so that it looks like this:
n1:n1
n1:n2
n1:n3
and thanks..
hope i was clear enough in my explanation of my problem
Ken White :
Thanks, but the problem is that my file has over 10k lines, and the first "n1" changes to "n2" after about 1000 lines,
and then it becomes "o1" instead of "n1".
I want to delete everything before the first ":".

Use Replace and use a regular expression to find any chars at the start of the line that are not a colon :, followed by a colon, and replace them with nothing
Find what: ^([^:]+:)(.)
Replace with: \2
Search Mode: Regular Expression
This actually answers your question and doesn't assume anything about what is before or after the first colon.
1. The first ^ indicates that the search must start at the beginning of a line
2. Parentheses are groupers and savers. They're not actually needed for this first bit, since you are just deleting the stuff before the colon, but this makes it parallel with Ken White's solution
3. Square brackets [ ] indicate which characters you want to look for
   a. The second ^ right after the first square bracket switches from chars you want to look for to chars you do not want to look for
   b. So [^:] means look for any char other than a colon
4. The plus + means look for 1 or more occurrences of this set of chars
   a. If some lines may start with a colon, and you still want to replace that colon, you'd want to look for 0 or more occurrences of non-colon chars at the start of a line
   b. To do that, replace the + with a *
5. Select the colon (so it will be deleted also)
6. Right Paren ends the first group
7. Left Paren starts the 2nd group
8. Dot . says look for any char. If you don't have this here, then it will delete everything before the first colon and then the next set will be at the start of the line, so you'll delete too much. You could technically put a plus or star here, but you don't need it.
9. Right Paren ends the 2nd group
10. In the Replace with box, \2 (that's a backslash or reverse solidus if you prefer) will take the contents of the 2nd group and replace everything it found with those contents
Here is the test input and output:
Input (stuck some tabs and spaces and other stuff in there for good measure)
n1:n1:n1
n1:n1:n2
n1:n1:n3
n2:n1:n3
n4:n7:n5
o1:n1:n1:m1:m1:l1:l7b:l1011
z99:
-- Here's some more data
o1:o2:o3:o4:o5
:o2:o3:o4:o5:o6
o1:o1:o3:x37:n99
n2:o1:o3:o44:z76
n4:n7:n5:u72:j9:
Output
n1:n1
n1:n2
n1:n3
n1:n3
n7:n5
n1:n1:m1:m1:l1:l7b:l1011
z99:
o2:o3:o4:o5
:o2:o3:o4:o5:o6
o1:o3:x37:n99
o1:o3:o44:z76
n7:n5:u72:j9:
Notice it removed any line without a colon, which in some cases may be preferable. It also missed the two lines I threw in there with a colon at the beginning or end of the line.
If you wanted to leave these blank lines in, add an \r\n in the brackets in step 3 above (and again these are backslashes). Then it will look for any char that's not a colon or end-of-line (Step 3), followed by a colon (Step 5). Therefore, it only removes chars on the line with a colon. Change Find what to this string:
Find what: ^([^:\r\n]+):(.)
To catch the lines starting with a colon or with nothing after the first colon, change the plus to a star and add a question mark after the dot:
Find what: ^([^:\r\n]*):(.?)
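For comparison, the same per-line deletion can be sketched outside Notepad++ in Python. This is an illustration, not a description of how Notepad++ works, and the function name is hypothetical. Python's re.sub scans the original text without restarting at the replaced position, so no second group is needed to protect the rest of the line.

```python
import re

# ^ anchors at each line start (MULTILINE); [^:\r\n]* cannot cross a line
# break, so lines without a colon are left untouched.
UP_TO_FIRST_COLON = re.compile(r"^[^:\r\n]*:", re.MULTILINE)


def drop_first_field(text: str) -> str:
    """Delete everything up to and including the first colon on each line."""
    return UP_TO_FIRST_COLON.sub("", text)


sample = "n1:n1:n1\nn1:n1:n2\nn1:n1:n3"
print(drop_first_field(sample))
```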

regex - Removing text from around numbers in Notepad++

I have a large subset of data that looks like this:
MyApp.Whatever\app.config(115): More stuff here, but possibly with numbers or parenthesis...
I'd like to create a replace filter using Notepad++ that would identify and replace the line number "(115):" and replace it with a tab character followed by the same number.
I've been trying filters such as (\(\d+\):) and (\(\[0-9]+\):), but they keep returning the entire value in the \1 output.
How would I create a filter using Notepad++ that would successfully replace (115): with tab character + 115?

Use a lazy quantifier: (\(\d+?\):), where the ? will prevent it from being greedy. Also, since everything is inside the outer ( ), it will group it all and treat it as \1.
If it was in perl I'd say \((\d+?)\): which should match only the inner part.
Edit:
Just talked with my colleague - he said s/\((\d+)\)/\t\1/ and if you needed app config in front you could just put that in the front.

this should work for your needs
replace
\((\d+)\):
with
\t$1

Replacing (\(\d+\):) with \t\1 will keep the parentheses and the colon since you've included them in the group (the outer parentheses), and I think that's what you mean by "they keep returning the entire value."
Instead of escaping those inner parentheses, escape the outer ones like the other answers have suggested: \((\d+)\): - this says to match a left paren, then match and capture a group of digits, then match a right paren and a colon. Replacing that with \t\1 will get rid of the parens and colon that were not in the captured group.
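The effect of moving the capture group inside the escaped parentheses can be tried in Python. This is a sketch rather than Notepad++ itself. The function name is hypothetical, and limiting the replacement to the first match per line is an assumption, made because the rest of the line may contain further numbers in parentheses.

```python
import re

# Escaped outer parens are matched literally; only the digits are captured.
LINE_NUMBER = re.compile(r"\((\d+)\):")


def tab_before_line_number(line: str) -> str:
    # In the replacement template, \t is a tab and \1 is the captured digits.
    # count=1 leaves any later "(n):" in the message text alone (assumption).
    return LINE_NUMBER.sub(r"\t\1", line, count=1)


line = r"MyApp.Whatever\app.config(115): More stuff here"
print(tab_before_line_number(line))
```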

Convert a list of values to CSV values

I have a list of words and I want to convert them into a CSV.
a
b
c
d
to a,b,c,d
I replaced \n with , and it worked, but that was my second attempt.
I first tried the regex ^([A-Za-z ]+)$\n with the replacement \1, . This particular regex only joins adjacent pairs of lines, like this:
a,b
c,d
What can I change in it to get it to work?
I am doing it in Eclipse, so I guess it is Java, but I don't have to take the \ escape into consideration; it is the same as edit+.

This regex:
^([A-Za-z ]+)$\n
matches the beginning of a line, letters and space, then the end of the line.
Once you perform your first replacement, the line contains a comma, so it would no longer match that pattern.
The regex is also a bit redundant. Because \n only comes at the end of a line anyway, you don't need both $ and \n in your pattern.
In order to fix it, you simply need to let your pattern match a comma:
^([A-Za-z ,]+)\n
Note: the specifics might vary based on your Eclipse version and/or file encoding. I needed \r\n to match a newline in mine.

From your example, you don't even need to use regular expressions. Simply replace each newline (\n) with a comma (,) and you're set.
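The same no-regex approach, sketched in Python rather than in the Eclipse dialog. It assumes the words contain no commas or quotes, so no CSV escaping is needed. The function name is made up for illustration.

```python
def lines_to_csv(text: str) -> str:
    """Join the non-empty lines of text with commas."""
    # splitlines() handles both \n and \r\n line endings; blank lines are skipped.
    return ",".join(line for line in text.splitlines() if line)


print(lines_to_csv("a\nb\nc\nd\n"))
```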
