Regex for specific file structure - regex

I need to parse a file with the following simple structure:
some string 1
some string 2
some string 3
some string x
some string y
some string z
...
The file consists of two parts separated by "\n\n" or "\r\n\r\n". In my example this separator comes after "some string 3". Each part is optional: if the first part is omitted, there will be one empty line (\n or \r\n) before the second part (but my regex requires two). If the second part is omitted, there may be any number of empty lines after the first part, including none at all.
I'm trying to achieve desired result with regex like this:
(?isx: \h* (.+)? \h* (?:(?:\n|\r\n){2,} \h* (.+))? \s*)
But with no success, because the first "(.+)?" is very greedy, and if I make the second part non-optional it violates my requirement that both parts be optional. I know that I could use split /(?:\n|\r\n)/, $str in this case, but in the future this file could have a more complex structure, so I can't use split.
Can someone help me with this?

You actually might want to use a non-greedy group, since you don't want to match your separator.
(?isx: (?:
(.*?) # Non greedy
(?:\r?\n){2,} # also matches \r\n\n but that might not be of concern
|\r?\n) # one empty line.
(.*) # second group
)
I don't know what you wanted to achieve with the \hs. If you want to ensure that there is something in the lines (right now, the . also could all match \n or spaces) you could try something like (?:[^\n]+\n)*? for the groups.
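For readers following along in Python rather than Perl, here is a minimal sketch of the same non-greedy pattern. The names TWO_PARTS and split_parts and the sample strings are illustrative assumptions, not part of the question. As with the pattern above, an input that contains no line break at all does not match.

```python
import re

# Same structure as the pattern above, using Python's verbose and dot-all flags.
TWO_PARTS = re.compile(
    r"""
    (?:
        (.*?)            # first part, non-greedy
        (?:\r?\n){2,}    # separator: two or more line breaks
      |
        \r?\n            # or a single empty line when the first part is omitted
    )
    (.*)                 # second part
    """,
    re.VERBOSE | re.DOTALL,
)


def split_parts(text):
    """Return (first, second); first is None when the first part is omitted."""
    match = TWO_PARTS.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


both = split_parts("some string 1\nsome string 2\n\nsome string x\nsome string y")
only_second = split_parts("\nsome string x\nsome string y")
print(both)
print(only_second)
```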
Also, for brevity's sake, I avoided the explicit ? you used. There might be a difference in results: if a starred group matches nothing, you get the empty string; if a group does not take part in the match at all, the value of the group variable is undefined. Here is a short example to show the difference:
"aa" =~ /(c)?(d*)aa/
Here $1 is undefined, while $2 is the empty string. This minor difference might yield some annoying warnings or unexpected results if someone tested with defined for the contents of a group.
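The same distinction exists outside Perl. As a small sketch in Python (the variable names are made up for illustration), a group that did not take part in the match yields None, while a group that matched zero characters yields the empty string:

```python
import re

match = re.match(r"(c)?(d*)aa", "aa")
skipped_group = match.group(1)  # optional group that did not take part in the match
empty_group = match.group(2)    # starred group that matched zero characters
print(repr(skipped_group), repr(empty_group))
```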

Related

Dart regex for capturing groups but ignoring certain similar patterns

I'm trying to capture a group from a string with ~, ~~ and ~~~ symbols. I was successful with extracting single symbols but it doesn't ignore the other occurrences in the string.
This is my code I tried experimenting with:
String f = '~the calculator is on and working~I entered 50 into the calculator' +
    '~~I press add button~~holding equal button ~~~The result should be 50';
List<String> givens = f.split(RegExp(r'~+'));
List<String> whens = f.split(RegExp(r'~~+'));
List<String> thens = f.split(RegExp(r'~~~+'));
for (String ss in givens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in whens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in thens) {
  print(ss);
}
Which will result with:
The givens capture group also captured the ones with ~~ and ~~~ which is not intended.
The whens capture group also captured the ones single ~ which made it very confusing.
Lastly, the thens capture group also captured the others which is also not intended.
I only need to capture the strings starting with the specific pattern but will stop when they see a different one.
Example: givens should only capture 'the calculator is on and working' and 'I entered 50 into the calculator' only.
Any hints or help is greatly appreciated!
I think the problem is that you started off by splitting the string into pieces. It might be easier to search for the elements with a pattern that looks for text preceded by one, two or three ~ characters.
This can be done with regex positive lookbehind patterns.
Typically, if you want to find a string preceded by a single tilde, you have to prevent a match when there are other tildes before it.
Find givens
(?<=(?:[^~]|^)~)[^~]+ would be the pattern to find only givens.
Test it here: https://regex101.com/r/9WLbM3/2
Explanation
[^~] means search for any character which is not a ~. This is because [abc] means any char which is in the list, so a, b or c. If you add the ^ char at the beginning of the list then it means "not these chars".
[^~]+ means search for one or more characters that are not ~. This will capture phrases between the tildes.
A positive lookbehind is written (?<=something present). We want to search for a tilde, so we would put (?<=~) as the positive lookbehind. But the problem is that it will also match the ones with several tildes in front. To avoid that, we can say that the tilde should be preceded either by ^ (meaning the beginning of the string) or by [^~] (meaning not a tilde). To say "either this or that", we use the syntax (this|that|or even that). But parentheses capture their content, and we don't need that. To disable capturing we can add ?: at the beginning of the group, leading finally to (?:[^~]|^), meaning either a non-tilde character or the beginning of the string, without capturing it.
Find whens and thens
The regular expression is almost the same. It's just that we replace ~ by ~{2} or ~{3}.
Pattern for whens: (?<=(?:[^~]|^)~{2})[^~]+
Pattern for thens: (?<=(?:[^~]|^)~{3})[^~]+
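As a cross-check outside Dart, here is a hedged sketch in Python. Python's re module only accepts fixed-width lookbehinds, so the (?:[^~]|^) alternation above is rejected there. This sketch swaps in a negative lookbehind and a negative lookahead around an exact run of tildes, which is a different technique aimed at the same result. The helper name capture_after is hypothetical.

```python
import re

TEXT = (
    "~the calculator is on and working~I entered 50 into the calculator"
    "~~I press add button~~holding equal button ~~~The result should be 50"
)


def capture_after(tilde_count, text):
    """Return the texts that follow a run of exactly `tilde_count` tildes."""
    # (?<!~) and (?!~) make sure the run is neither preceded nor followed by another tilde.
    pattern = re.compile(r"(?<!~)~{%d}(?!~)([^~]+)" % tilde_count)
    return pattern.findall(text)


givens = capture_after(1, TEXT)
whens = capture_after(2, TEXT)
thens = capture_after(3, TEXT)
print(givens)
print(whens)
print(thens)
```

Note that the second "when" keeps its trailing space, because the source string has a space before the ~~~ marker.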

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix word (or space) that must be present wherever this logic is checked? In the case above, if the prefix is UPDATE the regex should pick test.table, but if the statement is like SELECT test.table the regex should not pick up this combination. The same applies to a suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
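To make the two patterns concrete, here is a minimal sketch in Python run against the question's own examples. The variable names are illustrative, and the case-insensitive flag is passed as re.IGNORECASE instead of the inline (?i).

```python
import re

SQL = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Any run of non-whitespace characters that contains a dot in the middle.
DOTTED = re.compile(r"[^\s]+\.[^\s]+")
# The same, but only directly after the keywords INTO or FROM plus one whitespace character.
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

all_dotted = DOTTED.findall("I am testing something like test.test.test in sentence.")
after_keyword = AFTER_KEYWORD.findall(SQL)
print(all_dotted)
print(after_keyword)
```

The trailing "sentence." is not picked up because the pattern requires at least one non-whitespace character after the dot.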
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w between 0 and unlimited times, followed by a dot, followed by another series of word characters repeated between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): input such as '.word' or 'word..word' would also be matched, which I assume does not occur in your data.

regex to match specific pattern of string followed by digits

Sample input:
___file___name___2000___ed2___1___2___3
DIFFERENT+FILENAME+(2000)+1+2+3+ed10
Desired output (e.g., all letters, 4-digit numbers, and the literal 'ed' followed immediately by digits of arbitrary length):
file name 2000 ed2
DIFFERENT FILENAME 2000 ed10
I am using:
[A-Za-z]+|[\d]{4}|ed\d+ which only returns:
file name 2000 ed
DIFFERENT FILENAME 2000 ed
I see that there is a related Q+A here:Regular Expression to match specific string followed by number?
eg using ed[0-9]* would match ed#, but unsure why it does not match in the above.
As written, your regex is correct. Remember, however, that a regex tries its alternatives from left to right. Your ed\d+ is never going to match, because the ed was already consumed by your [A-Za-z]+ alternative. Reorder your regex and it'll work just fine:
ed\d+|[a-zA-Z]+|\d{4}
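A quick way to see the effect of the ordering is to run both versions over the sample input. The following is a small sketch in Python; the variable names are only for illustration.

```python
import re

SAMPLES = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

original = re.compile(r"[A-Za-z]+|[\d]{4}|ed\d+")   # letters alternative comes first
reordered = re.compile(r"ed\d+|[a-zA-Z]+|\d{4}")    # ed + digits is tried first

original_results = [original.findall(s) for s in SAMPLES]
reordered_results = [reordered.findall(s) for s in SAMPLES]
print(original_results)
print(reordered_results)
```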
Nick's answer is right, but because in-order matching can be a less readable "gotcha", the best (order-insensitive) ways to do this kind of search are 1) with specified delimiters, and 2) by making each search term unique.
Jan's answer handles #1 well. But you would have to specify each specific delimiter, including its length (e.g. ___). It sounds like you may have some unusual delimiters, so this may not be ideal.
For #2, then, you can make each search term unique. (That is, you want the thing matching "file" and "name" to be distinct from the thing matching "2000", and both to be distinct from the thing matching "ed2".)
One way to do this is [A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+. This is saying that for the first type of search term, you want an alphabetic string that is not followed by an alphanumeric character. This keeps it distinct from the third search term, which is an alphabetic string followed by some digit(s). This also allows you to specify any range of delimiters inside of that negative lookahead.
You might very well use (just grab the first capturing group):
(?:^|___|[+(]) # delimiter before
([a-zA-Z0-9]{2,}) # the actual content
(?=$|___|[+)]) # delimiter afterwards
See a demo on regex101.com
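The delimiter-based pattern above can be tried out as follows. This is a sketch in Python using the verbose flag so the comments can stay in the pattern; the name DELIMITED is an assumption made for the example.

```python
import re

DELIMITED = re.compile(
    r"""
    (?:^|___|[+(])          # delimiter before
    ([a-zA-Z0-9]{2,})       # the actual content (first capturing group)
    (?=$|___|[+)])          # delimiter afterwards, not consumed
    """,
    re.VERBOSE,
)

first = DELIMITED.findall("___file___name___2000___ed2___1___2___3")
second = DELIMITED.findall("DIFFERENT+FILENAME+(2000)+1+2+3+ed10")
print(first)
print(second)
```

The {2,} quantifier is what drops the single digits 1, 2 and 3; it would also drop any one-letter word.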

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
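As a quick check of which inputs this pattern accepts, here is a small sketch in Python (the names are illustrative only):

```python
import re

# Optional gap between letters: a newline followed by any amount of whitespace.
NEWLINE_GAPS = re.compile(r"c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s")

accepted = [s for s in ["cats", "c\n ats", "ca\n ts", "c ats"] if NEWLINE_GAPS.search(s)]
print(accepted)
```

Only the input with a plain space and no newline ("c ats") is rejected.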
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
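Building the pattern dynamically could look like the following sketch in Python. The helper name spaced_pattern is hypothetical. Each character is escaped first, so search strings containing regex metacharacters stay literal.

```python
import re


def spaced_pattern(word):
    """Compile a pattern that allows optional whitespace between the characters of `word`."""
    return re.compile(r"\s*".join(re.escape(ch) for ch in word))


match = spaced_pattern("cats").search("my c ats are here")
span = match.span()        # begin and end index in the original string
matched_text = match.group()
print(span, repr(matched_text))
```

Because the search runs on the original string, the span can be used directly for highlighting, whitespace included.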
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this (the following example is in Python, although it can obviously be ported to any language):
You can strip the whitespace beforehand and save the positions of the non-whitespace characters, so that you can later use them to find the boundaries of the match in the original string, like this:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it). A non-empty match is assumed.
    start = char_positions[match.start()]  # in the original string
    # match.end() is exclusive, so map the last matched character and add 1;
    # this also avoids an IndexError when the match ends at the end of the string.
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips both the regex and the test string of whitespace. The test is then conducted as though neither contained any whitespace.

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like to have a regex that matches all <![FruitName]!>, between these <![FruitName]!>, there may be some garbage text, my first attempt is like this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
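Both patterns can be exercised with a short sketch in Python. The second test string, with ], ! and > inside the names, is a made-up example that shows the lazy quantifier no longer discarding such names.

```python
import re

TEXT = (
    "<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here"
    "<![Orange]!><![Pear]!><![Pineapple]!>"
)

# Fruit names only: lazy match up to the first closing delimiter.
names = re.findall(r"<!\[(.*?)]!>", TEXT)

# Names may contain ], ! or > individually, just not the full sequence ]!>.
tricky = re.findall(r"<!\[(.*?)]!>", "<![Star]fruit!]!>junk<![Kiwi>]!>")

# Each fruit together with the text that follows it, using the lookahead version.
pairs = [
    (m.group(1), m.group(2))
    for m in re.finditer(r"<!\[(.*?)]!>(.*?)(?=(<!\[)|$)", TEXT)
]
print(names)
print(tricky)
print(pairs)
```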
Depending on what language you are using, you can use the string methods your language provides and do simple splitting (plus a simple regex that is easier to understand). Split your string using "!>" as the separator. Go through each field and check for <!. If it is found, remove all characters from the front up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
eg gawk
# set field separator as !>
awk -F'!>' '
{
    # for each field
    for (i = 1; i <= NF; i++) {
        # check if there is <!
        if ($i ~ /<!/) {
            # if <! is found, substitute from front till <!
            gsub(/.*<!/, "", $i)
        }
        # print result
        print $i
    }
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.
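The same split-based idea can be sketched in Python using only string methods. This is an illustrative port, not the answer's own code. Unlike the awk script, it skips fields that contain no "<!" instead of printing them.

```python
TEXT = (
    "<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here"
    "<![Orange]!><![Pear]!><![Pineapple]!>"
)

fruits = []
for field in TEXT.split("!>"):
    if "<!" in field:
        # Drop everything up to and including the last "<!", like gsub(/.*<!/, "") does.
        fruits.append(field[field.rfind("<!") + 2:])
print(fruits)
```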