I need to replace sequences (of various length) of characters by another character. I am working in Eclipse on xml files
For exemple -------- should be replaced by ********.
The replacement should be done only for sequences of at least 3 characters, not 1 or 2.
It is easy to find the matching sequences in regex for example with -{3,30} but I don't understand how to specify the replacement sequence.
I made this regex solution ready when question was posted but didn't submit an answer because I kept testing in eclipse and even though regex was working for find feature, a * in replacement wasn't changing text in Eclipse editor.
Here is a shorter and a bit more efficient regex:
(?!^)\G-|(?=-{3})-
Replace with a *
RegEx Demo
Breakdown:
(?!^)\G: Match from end of the previous match
-: Match a -
|: OR
(?=-{3}): Make sure we have 3 hyphens ahead
-: Match a -
Here is a screenshot from my Eclipse that shows selected match for this regex:
You can use
(?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-)))-
See the regex pattern. Details:
(?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-))) - either
\G(?!\A) - end of the preceding match
| - or
(?<!-)(?=-{3,30}(?!-)) - a position that is not immediately preceded with a - char and is immediately followed with 3 to 30 hyphens (not followed with another hyphen).
- - a hyphen.
The (?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-)))- regex goes into the "Find What" filed, * goes to the "Replace With" field.
Note that regular expressions are only meant to be used in search fields, replacement fields must only contain replacement patterns. Usually, the replacement pattern is a string containing literal string(s) and/or backreferences. Here, we do not need any backreference as the regex does not capture anything.
I am trying to use Regex to grab a substring of a large string.
The overall string has certain text, 'cow/', then any number of characters or spaces that are not digits. The first digit hit is the start of the desired substring I want.
This desired substring consists of only digits and periods, the first character or space seen that is not a digit or period indicates the end of the desired substring.
For example:
'cow/ a12.34 -123'
The desired substring is '12.34'.
So far I have this regex that partially works (I think the '| .' is not entirely correct):
(?<=([A-z]|[0-9])/\s*).?(?=\s[^0-9 |.])
Thanks in advance.
This should be easy to achieve by relying on capturing groups:
cow/[^0-9]*([0-9.]+)
The group will contain the text that you want to extract, in Java group(index), in C# with Groups[index]. Other languages provide similar features.
Don't try to solve everything inside the regular expression, but leverage the power of your runtime :)
Edit after comment on the OP:
Azure Kusto has the extract(regex, captureGroup, text [, typeLiteral]) function to extract groups from regular expression matches:
extract("cow/[^0-9]*([0-9.]+)", 1, "cow/ a12.34 -123") == "12.34";
The argument 1 tells Kusto to extract the first capturing group (the expression inside the parentheses).
I have been looking at the various topics on Regex on SO, and they are all saying that to find the invert (select all that doesn't fit the criteria) you simply use the[^] syntax or negative lookahead.
I have tried using both of these methods on my Regex but the results are not adequate the [^] especially seems to take all its contents literally (even when escaped).
What I need this for:
I have a massive SQL line with a SQL dump I'm trying to remove all characters that are not the line id, and the numerical value of one column.
My regex works in matching exactly what I'm looking for; what I need to do is to invert this match so I can remove all non-matching parts in my IDE.
My regex:
/(\),\(\d{1,4},)|(,\d{10},)/
This matches a "),(<number upto 4 digits>," or ",<number of ten digits>," .
The subject
My subject is a 500Kb line of an SQL dump looking something like this (I have already removed a-z and other unwanted characters in previous simple find/replaces):
),(39,' ',1,'01761472100','#','9 ','20',1237213277,0,1237215419,''),(40,' ',3,'01445731203','#',' ','-','22 2','210410//816',1237225423,0,1484651768,''),(4270,' /
My aim is to use a regex to achive the following output:
),(39,,1237213277,,1237215419,),(40,,1237225423,,1484651768,),(4270,
Which I can then go over again and easily remove repetitions such as commas.
I have read that Negation in Regex is tricky, So, what is the syntax to get the regex I've made to work inverted? To remove all non-matching groups? What can you recommend as a way of solving this without spending hours manually reading the lines?
You may use a really helpful (*SKIP)(?!) (=(*SKIP)(*F) or (*SKIP)(*FAIL)) construct in PCRE to match these texts you know and then skip and match all other text to remove:
/(?:\),\(\d{1,4},|,\d{10},)(*SKIP)(?!)|./s
See the regex demo
Details:
(?:\),\(\d{1,4},|,\d{10},) - match 1 of the 2 alternatives:
\),\(\d{1,4}, - ),(, then 1 to 4 digits and then ,
| - or
,\d{10}, - a comma, 10 digits, a comma
(*SKIP)(?!) - omit the matched text and proceed to the next match
| - or
. - any char (since /s DOTALL modifier is passed to the regex)
The same can be done with
/(\),\(\d{1,4},|,\d{10},)?./s
and replacing with $1 backreference (since we need to put back the text captured with the patterns we need to keep), see another regex demo.
I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces ? So extract the text/number after the date until the next whitespace ? All I know is that the date will always have the same format and there will always be a space after the text and then a space after the number I want to extract.
Can someone help me out on this ?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
^^^
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
^^
See another demo
NOTE: the \K and a capture group approaches allow 1 or more spaces after the date and are thus more flexible.
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
So the .+ matches anything plus the first space
then the \d+/\d+/\d+ is the date parsing plus a space
the capturing group is the number, as you can see I made the last part optional, so both floating point values and normal values can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractal part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.
To answer another user's question I knocked together the below regular expression to match numbers within a string.
\b[+-]?[0-9]+(\.[0-9]+)?\b
After providing my answer I noticed that I was getting unwanted matches in cases where there was a sequence of digits with more than one period among them due to \b matching the period character. For example "2.3.4" would return matches "2.3" and "4".
A negative lookahead and lookbehind could help me here, giving me a regex like this:
\b(?<!\.)[+-]?[0-9]+(\.[0-9]+)?\b(?!\.)
...except that for some unknown reason VBScript Regex (and by extension VBA) doesn't support lookbehind.
Is there some workaround that allows me to affirm that the word boundary at the start of the match is not a period without including it in the match?
Perhaps you don't need a look behind. If you are able to extract specific capture groups instead of the entire match then you can use:
(?:[^.]|^)\b([+-]?([0-9]+(\.[0-9]+)))\b(?!\.)
Will match:
2.5
54.5
+3.45
-0.5
Won't match:
1.2.3
3.6.
.3.5
Capture group 1 will output the whole number and sign
Capture group 2 will output the whole number
Capture group 3 will output the fraction (like capture group 1 in your original expression)