How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)

/###(.+?)###/
If you want the ###'s as well, then you need
/(###.+?###)/
The ? makes the quantifier non-greedy. If you didn't have the ?, it would grab too much,
e.g. all of '###token1### text text ###token2###' would get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
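To see the greedy versus non-greedy difference in action, here is a small Python sketch (Python is my choice for illustration, the question did not name a language). It runs both forms of the pattern against the sample string from the question:

```python
import re

text = "text ###token1### text text ###token2### text text"

# Non-greedy: .+? stops at the first closing ### it can reach.
lazy = re.findall(r"###.+?###", text)

# Greedy: .+ runs to the last ### in the string.
greedy = re.findall(r"###.+###", text)

print(lazy)    # ['###token1###', '###token2###']
print(greedy)  # ['###token1### text text ###token2###']
```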

In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
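For comparison, a rough Python equivalent of the loop above, assuming the same sample string. re.finditer plays the role of the /g match in a while loop; group(0) is the token with its delimiters and group(1) is the text between them:

```python
import re

text = "text ###token1### text text ###token2### text text"

tokens = []
for match in re.finditer(r"###(.+?)###", text):
    # group(0): whole match including ###, group(1): inner text only
    tokens.append((match.group(0), match.group(1)))
    print(match.group(0), match.group(1))
```

If you only need the inner text, re.findall with the same pattern returns the group(1) values as a list.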

Assuming you want to match ###token2### as well...
/###.+###/

Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 (\1 for the first set, \2 for the second) in the replacement expression, assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.

Well, when you are using delimiters such as these, you basically grab the opening delimiter, then anything that does not match the ending delimiter, followed by the ending delimiter. One caution: in cases like the example above, [^#] would not work as the check that the end delimiter is not there, since a single # inside the token would cause the regex to fail (e.g. "###foo#bar###"). For the case above, the regex to parse it would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
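A quick Python check of that caution, as a sketch. The test string is the "###foo#bar###" case mentioned above. I have made the group non-capturing (?:...) so that findall returns the whole match; that tweak is mine, not part of the answer:

```python
import re

text = "###foo#bar###"

# Naive version: the token may not contain any # at all.
naive = re.findall(r"###[^#]*###", text)

# Alternation version: allows one or two #'s inside, as long as
# they are not followed by enough #'s to form the closing ###.
robust = re.findall(r"###(?:[^#]|#[^#]|##[^#])*###", text)

print(naive)   # []
print(robust)  # ['###foo#bar###']
```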

Related

Removing quotation marks around text in IntelliJ (with Regex?)

I have a piece of code where I want to remove the quotation marks around property names.
// Current format
var user = {
'name': 'A',
'loggedIn': true
}
// Desired format
var user = {
name: 'A',
loggedIn: true
}
I've managed to find all the places I wish to change with this regular expression:
'(.+)'\:
Now I want to remove the quotation marks in those strings. I tried to enter (.+)\: into the "replace with" field, but it did not work. Is there some way to do what I want to do with this tool?
Find in Path documentation explains how to use the references:
if you specify the search pattern through a regular expression, use the $n format in back references (to refer to a previously found and saved pattern).
$1 will contain whatever is matched by the parenthesis, so your replacement string would look like $1:.
See also Regular Expression Syntax Reference.
Your regex matches the strings you want, but you are not using the captured groups. $1 returns the first group, and the second and third are available as $2 and $3, and so on.
Additional notes:
You can back-reference a group with \1 inside your find regex to avoid repeating the capture group's pattern.
In general, I suggest using this regex instead of your own:
^\s*(['"])(.*?)\1\s?:
and replacing with $2: to extract the string between the ' or " characters.
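The same find-and-replace idea can be tried outside the IDE. This is a minimal Python sketch, not IntelliJ's own syntax: Python writes back references in the replacement as \1 rather than $1, and I added an extra group around the leading whitespace (my assumption, so the indentation survives the replace):

```python
import re

source = """var user = {
  'name': 'A',
  'loggedIn': true
}"""

# Group 1: indentation, group 2: the opening quote, group 3: property name.
# \2 in the pattern requires the same quote character to close the name.
pattern = re.compile(r"""^(\s*)(['"])(.*?)\2\s?:""", re.MULTILINE)

result = pattern.sub(r"\1\3:", source)
print(result)
```

Because the pattern is anchored at the start of the line, quoted values such as 'A' are left alone.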

Replace string using regular expression in KETTLE

I would like to use a regular expression to replace a certain pattern in Kettle. For example, given AAAA >5< BBBB, I want to replace this with AAAA 555 BBBB. I know how to find the pattern, but I am not sure how to replace it with a new string. The one thing I have to keep in mind is that I have to find the pattern >< together, not separately as > or <, because there is another pattern <5>.
You can use the "Replace in String" step in a transformation.
Set use RegEx to "Y", type your regex on the Search box, with capturing groups if necessary, and the replacement string in the replacement box, referring to capture groups as $1, $2, ...
It'll replace all occurrences of the regex in the original string.
If the Out Stream field is omitted, it'll overwrite the In stream field.
If you want the pattern >\d< replaced by a triple of the found digit, you can use Replace-In-String in regex mode:
Search: (.*)(>(\d)<)(.*)
Replace: $1$3$3$3$4
If you want all such patterns treated the same:
Search: (>(\d)<)
Replace: $2$2$2
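To sanity-check the second search/replace pair outside Kettle, here is a small Python sketch. It assumes Python's re module, which writes group references as \1 instead of $1; I dropped the outer group since only the digit is reused:

```python
import re

text = "AAAA >5< BBBB <5> CCCC"

# >(\d)< only matches a digit wrapped as >d<, so <5> is left untouched.
result = re.sub(r">(\d)<", r"\1\1\1", text)

print(result)  # AAAA 555 BBBB <5> CCCC
```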
EDIT due to your improved requirement
Since you intend to convert your "simple" markup to a more HTML-like markup, you had better use a User-Defined-Java-Expression. Also, you must avoid reintroducing simple markup when replacing repeatedly.

Need RegEx Pattern to get text between delimiters at start of text

My source text could be any number of characters between "[" and "]" at the beginning of the line. I will have ONLY one line.
For example:
[1] and some other text here
[10] more text, but maybe some brackets [KEY]
[1000000] a lot more text
I want to match/return the text between the "[" and "]".
EDIT AFTER ANSWER PROVIDED
The first answer, provided by @nickb, worked for me with this AppleScript:
Note that I had to convert the RegEx to a quoted string to use in AS. This uses the Satimage AppleScript Additions find text command, which provides the RegEx engine for AppleScript.
set strRegEx to "^\\[(.*?)\\]" -- Original: "^\[(.*?)\]"
set strTextToSearch to "[10] My Note title with [KEY] "
set strCaptureGroup to find text strRegEx in strTextToSearch using {"\\1"} with regexp and string result
log strCaptureGroup
-->10
The most simple regex you could use would be this:
^\[(.*?)\]
You can see it matching your input here.
Alternatively a pure AppleScript solution
set theText to "[1] and some other text here
[10] more text, but maybe some brackets [KEY]
[1000000] a lot more text"
set resultList to {}
set {TID, text item delimiters} to {text item delimiters, "]"}
repeat with aLine in (get paragraphs of theText)
if aLine starts with "[" then set end of resultList to text 2 thru -1 of text item 1 of aLine
end repeat
set text item delimiters to TID
resultList -- {"1", "10", "1000000"}
I think this will fit your criteria:
^\[([^]]*)\].*
With the stuff in brackets in the first matching group returned.
You can try running the following reg. exp. on each line:
[^\[]\w+[^\]]
I tested it at regex101 and it matches the contents inside the [], excluding the brackets.
/^\[(.*?)\]/
is really the most simple regex for this case, but it matches surrounding brackets too.
The exact value (without brackets) is stored in 1st capture group.
If you don't want to match brackets, you will need this:
/(?<=^\[).*?(?=\])/
… unless you're using JavaScript – unfortunately, JS doesn't support lookbehinds.
In this case you'll need this regex:
/^[^\[\]]+/
(assuming that every input will start with […] component, and will not be empty)
The regex to use depends on how you are going to use it for the input it will parse. Some of the answers here have a trailing .* and some do not. Both are correct, it just depends on what exactly you are trying to match, and crucially how you ask it about a match. For example, in Java, with the regex ^\[(.*?)\], if you feed it the whole string "[1000000] a lot more text" and call matches(), it will return false because the regex pattern does not account for any of the trailing text outside the brackets. However, if you call find() after feeding in the same string, it will match because find() works on each substring as it parses and will return true on the first match it hits, while matches() will only return true if the entire input matches the regex. find() will also find subsequent substring matches to the regex in the string each time find() is called until the parser reaches the end of the input.
Personally, I like to use regex that account for the entire input and use capture groups to isolate the actual text I want to grab from the input. But your mileage may vary.
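The same distinction shows up in other languages. As a rough Python analogue (my comparison, not part of the Java API): re.fullmatch behaves like matches() in that the whole input must match, while re.search behaves like a first find():

```python
import re

line = "[1000000] a lot more text"
pattern = re.compile(r"^\[(.*?)\]")

# Whole-input match fails: the pattern says nothing about the trailing text.
whole = pattern.fullmatch(line)

# Searching succeeds and the capture group holds the bracket contents.
partial = pattern.search(line)

# Adding a trailing .* lets the whole-input match succeed as well.
padded = re.fullmatch(r"^\[(.*?)\].*", line)

print(whole)             # None
print(partial.group(1))  # 1000000
print(padded.group(1))   # 1000000
```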

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turns your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search and replace will fail if the input varies from the pattern described, as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also, I tested that regexp using Zeus, which can create a regexp with more than 9 groups.
As you can see, the regexp requires 10 groups.
I'm not familiar with Notepad++, so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
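If leaving the editor is an option, the "find within an existing match" step is easy to express in a script. A minimal Python sketch, assuming (as the question's edit states) that escaped quotes have already been dealt with so quotes pair up cleanly; the function name is made up for illustration:

```python
import re

def dash_inside_quotes(text):
    # Outer match: one quoted run. Inner step: replace every 'a' within it.
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace("a", "-"), text)

result = dash_inside_quotes('aa"bbabaavv"kdjhas"bbabaavv"x')
print(result)  # aa"bb-b--vv"kdjhas"bb-b--vv"x
```

This works for any number of quoted strings per line, since the callback runs once per quoted run.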
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working :) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOLs apart from ones that might exist within a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)
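For anyone able to script this instead of using a macro, here is a hedged Python sketch of a different technique: letting the standard csv module parse the tab-delimited clipboard text, quoted multi-line cells included. The function name and the sample string are made up for illustration, and it assumes the pasted text uses tabs between cells and double quotes around cells that contain line feeds:

```python
import csv
import io

def to_values_list(clipboard_text):
    # newline="" hands line endings to the csv module untouched,
    # so line feeds inside quoted cells are preserved.
    rows = csv.reader(io.StringIO(clipboard_text, newline=""), delimiter="\t")
    lines = []
    for row in rows:
        # Double any single quotes, then wrap each cell in single quotes.
        cells = ", ".join("'" + cell.replace("'", "''") + "'" for cell in row)
        lines.append("(" + cells + ")")
    return "\n, ".join(lines)

sample = 'line1\tline 2\r\n"line3\nline3b\nline3c"\tline 4\r\n'
print(to_values_list(sample))
```

No dummy column is needed here, because the parser already tells row endings apart from line feeds inside a quoted cell.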

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
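The same approach carries over to other engines without variable-length lookbehind. Python's built-in re module is one of them, so here is a small sketch of the capture-group version there, using the sample data with the second field lengthened to 84677 as the question describes:

```python
import re

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;84677;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Match the variable-length prefix normally and capture only the value.
pattern = re.compile(r"^bdfg34f;\d{4,};(\d{0,9})", re.MULTILINE)
values = pattern.findall(text)

print(values)  # ['72719']
```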
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
to match the first number with a minimum of 4 digits
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[3]; // third column
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
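As a sketch of the split-based approach in Python (assuming the data is already in a string, and that the row of interest is identified by its first field, which is my reading of the question):

```python
text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Split each non-empty line into fields on the semicolon.
rows = [line.split(";") for line in text.splitlines() if line.strip()]

# Third column of every row (index 2, since lists are zero-indexed).
third_fields = [fields[2] for fields in rows]

# Third column only for the row whose first field is bdfg34f.
wanted = [fields[2] for fields in rows if fields[0] == "bdfg34f"]

print(third_fields)  # ['72159', '72719', '72119']
print(wanted)        # ['72719']
```

The length of the second field no longer matters, since nothing here depends on how many digits it has.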