regular expressions boost c++ - c++

I am trying to match certain record markers at the start of the string and after each newline. The string is
.V/1LBOG\n.F/AV0094/08NOV/SAL/Y\n.E/0134249356001"
From the string above I need to catch .V/ and .E/. The regular expression I am using is
^.[VE]/*
But it only seems to catch .V/. Can anyone see why? I thought ^ matched after newlines as well as at the start of the string. Any help would be much appreciated, as I have had this problem for a while now. If this is not the right way to do it, could you propose a different one?

Regex 101:
^ means start of string. And you guessed it right. There can only be one start of string.
^.[VE]/*
means :
Match start of string, followed by any character (other than newline), followed by either a V or an E, followed by zero or more / characters (greedy).
Probably you want something like this :
\.[VE].*?(?:\\n|$)
Which means match a dot, followed by V or E and match everything until \n or end of string.
Comment if I am wrong.
So .V/1LBOG\n.F/AV0094/08NOV/SAL/Y\n.E/0134249356001"
Looks like this ?
.V/1LBOG
.F/AV0094/08NOV/SAL/Y
.E/0134249356001"
If yes, then you need to change your regex a little bit:
\.[VE].*
Abusing the fact that . does not match newlines by default.

. in regular expressions matches any single character, not a literal .. If you want to match a literal period, you need to escape it (\.). * doesn't match any number of any characters (as it would in most shells), but instead matches zero or more instances of whatever you put before it. For example, A* will match the empty string, A, AAAA etc., and .* will match any string (without newlines, by default).
^ means the beginning of a line. ^\.[VE]/ will match .V/ and .E/ (but only at the start of the line).
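The difference between the original pattern and the escaped, line-anchored one can be sketched as follows. This is only an illustration in Python's re module, not Boost.Regex, whose defaults for ^ may differ; it also assumes the string contains real newline characters rather than a literal backslash followed by n.

```python
import re

# Sample from the question, assuming real newline characters between records.
text = ".V/1LBOG\n.F/AV0094/08NOV/SAL/Y\n.E/0134249356001"

# Original pattern: unescaped dot, and ^ anchored only at the start of the string.
original = re.findall(r"^.[VE]/*", text)

# Escaped dot plus MULTILINE, so ^ also matches right after each newline.
markers = re.findall(r"^\.[VE]/", text, flags=re.MULTILINE)

# Whole records: . does not cross a newline, so .* stops at the end of the line.
records = re.findall(r"^\.[VE]/.*", text, flags=re.MULTILINE)

print(original)  # ['.V/']
print(markers)   # ['.V/', '.E/']
print(records)   # ['.V/1LBOG', '.E/0134249356001']
```

In Python the line-anchoring behaviour has to be requested with re.MULTILINE; other engines expose the same idea through their own flags.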

If you need .V or .E, try ^.(V|E)/*. The alternation operator | is useful here: it checks for either ^.V/* or ^.E/*.

Related

Regex to match words after dot until a whitespace occurs

Given the following string
span.a.b this.is.really.confusing
I need to return the matches a and b. I've been able to get close with the following regex:
(?<=\.)[\w]+
But it's also matching is, really, and confusing. When I include a positive lookahead I get even closer, but I'm still not there.
(?<=\.)[\w]+(?=\s) # matches b, confusing
How can I match words after a dot until a whitespace occurs?
NB: this is language agnostic pseudo-code, but should work.
regex = "^[^\s.]+\.(\S+).*"
targets = <extracted_group>.split(".")
Regex explanation:
"^": begins with
"[^\s.]+\.": 1 or more non-whitespace, non-period characters, followed by a literal period.
"(\S+)": group and capture all of the following non-whitespace characters
".*": matches 0 or more of any non-newline character
If the split function takes a regex instead of a string, you'll need to escape the '.' or use a character class.
NB: You can do it without the split, but I think that the split is more transparent.
I am not sure if this is good enough for all your possible cases, but it should work with the provided example:
\.([\w]+)\.([\w]+)\s
$1 = a, $2 = b
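The capture-then-split approach from the pseudo-code answer can be sketched in Python as below. The function name words_after_first_token is made up for this illustration, and the sketch assumes the first whitespace-delimited token is the one of interest.

```python
import re

def words_after_first_token(line):
    """Return the dot-separated words that follow the first name in the first token."""
    # [^\s.]+ consumes the leading name, \. the first dot,
    # and (\S+) captures everything up to the first whitespace.
    m = re.match(r"^[^\s.]+\.(\S+)", line)
    return m.group(1).split(".") if m else []

print(words_after_first_token("span.a.b this.is.really.confusing"))  # ['a', 'b']
```

Splitting on a plain string avoids having to escape the dot a second time, which is the transparency the answer refers to.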

Understanding regex in shell

I came across the concept of single grouping in a shell script.
cat employee.txt
101,John Doe,CEO
I was practising the sed substitute command and came across the example below.
sed 's/\([^,]*\).*/\1/g' employee.txt
It was given that above expression matches the string up to the 1st comma.
I am unable to understand how this matches the 1st comma.
Below is my understanding:
s - substitute command
/ - delimiter
\ - escape character for (
( - opening parenthesis for grouping
^ - beginning of the line (anchor)
[^,] - I am confused by this. Is it the negation of a comma, or does it mean something else?
Why are * and then .* used to match the string up to the 1st comma?
^ matches beginning of line outside of a character class []. At the beginning of a character class, it means negation.
So, it says: non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
I know 'link only' answers are to be avoided - Choroba has correctly pointed out that this is:
non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
However I'd like to add that for this sort of thing, I find regulex quite a useful tool for visualising what's going on with a regular expression.
Given the string "foo, bar" and the command s/\([^,]*\).*/\1/g, the group \([^,]*\) means "match any character that is not a comma, zero or more times". Since "f" is not a comma, it matches "f" and "remembers" it. Because the group repeats zero or more times, it tries again. The next character is not a comma either (it is o), so the regex engine adds that o to the group as well. The same thing happens for the 2nd o.
The next character is indeed a comma, but [^,] forbids it, as @choroba affirmed. What is in the group now is "foo". Then the regex uses .* outside the group, which matches zero or more characters without remembering them.
In the replacement part of the command, \1 is used to place the contents of the remembered text ("foo"). The rest of the matched text is lost, and that is how you are left with only the text up to the first comma.
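The same capture-and-replace idea can be tried outside sed. The sketch below uses Python's re.sub as a stand-in; note that Python writes groups as plain ( ) rather than sed's \( \), and the helper name up_to_first_comma is invented for this example.

```python
import re

def up_to_first_comma(line):
    """Keep only the text before the first comma, mirroring the sed command."""
    # ([^,]*) captures the run of non-comma characters; .* swallows the rest.
    # count=1 keeps this to a single substitution per line.
    return re.sub(r"([^,]*).*", r"\1", line, count=1)

print(up_to_first_comma("101,John Doe,CEO"))  # 101
print(up_to_first_comma("foo, bar"))          # foo
```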

Regular exp to match string from beginning until certain char is met

I have a long string from which I'm trying to capture a substring up to a certain character.
Let's suppose I have the following string, and I would like to get the text up to the first ampersand.
abc.8965.aghtj&hgjkiyu5.8jfhsdj
I would like to extract what comes before the ampersand, so: abc.8965.aghtj
I thought this would work:
grep '^.*&{1}'
I would translate it as
^ start of string
.* match whatever chars
&{1} until the first ampersand is matched
Any advice?
I'm afraid this will take me weeks
{1} does not match the first occurrence; instead it means "match exactly one of the preceding pattern/character", which is identical to just matching the character (&{3} would match &&&).
In order to match the first occurrence of &, you need to use .*?:
grep '^.*?&'
Normally, .* is greedy, meaning it matches as much as possible. This means your pattern would match the last ampersand rather than the first one. .*? is the non-greedy version, matching as little as possible while fulfilling the pattern.
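Update: grep may not support that syntax. Lazy quantifiers such as .*? are not part of POSIX basic or extended regular expressions; GNU grep accepts them only with -P (Perl-compatible mode). Here is another option: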
'^[^&]*&'
It matches anything that is not an ampersand, up to the first ampersand.
You also may have to enable extended regular expression in grep (-E).
Try this one:
^.*?(?=&)
It does not capture the ampersand itself, only the text before it.
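The greedy, lazy and negated-class variants discussed in these answers can be compared side by side. This is a small Python sketch rather than grep; the string two_amps is a made-up input with two ampersands, added because the sample from the question contains only one.

```python
import re

sample = "abc.8965.aghtj&hgjkiyu5.8jfhsdj"  # from the question
two_amps = "a&b&c"                          # hypothetical input with two ampersands

greedy = re.match(r"^.*&", two_amps).group()      # runs to the LAST ampersand
lazy = re.match(r"^.*?&", two_amps).group()       # stops at the FIRST ampersand
negated = re.match(r"^[^&]*&", two_amps).group()  # same result without lazy syntax
prefix = re.match(r"^.*?(?=&)", sample).group()   # lookahead leaves the & out

print(greedy)   # a&b&
print(lazy)     # a&
print(negated)  # a&
print(prefix)   # abc.8965.aghtj
```

The negated class is the most portable of the three, since it needs neither lazy quantifiers nor lookahead.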

regular expression no characters

I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How do I change the regex so that it doesn't match if there are any characters after the space?
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers
Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (upper- or lowercase) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.
^([A-Z]+, )?$
The difference between my regex and Donut's is that his will match , (a comma and a space with no letters) and fail for the empty string, while mine will match the empty string and fail for , alone. His also accepts both upper- and lowercase letters; with mine you'll have to pass a case-insensitive option to your regex function, but it stays closer to your example.
I am not sure which regex engine/language you are using, but there is often something like a negated character class, e.g. [^a-z], meaning "any character other than a lowercase letter".
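The edge cases that separate the two patterns can be checked directly. The sketch below uses Python's re module, which is an assumption since the question does not name an engine; the variable names donut, other and other_ci are labels chosen for this example only.

```python
import re

donut = re.compile(r"^[A-Za-z]*, $")                     # first answer
other = re.compile(r"^([A-Z]+, )?$")                     # second answer, case-sensitive
other_ci = re.compile(r"^([A-Z]+, )?$", re.IGNORECASE)   # second answer with the option set

def matches(pattern, text):
    """True when the anchored pattern accepts the whole string."""
    return pattern.search(text) is not None

for s in ["test, ", "test, test", ", ", ""]:
    print(repr(s), matches(donut, s), matches(other, s), matches(other_ci, s))
```

Trailing text after the space is rejected by all three, because $ has to line up with the end of the string.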

Regex Pattern - Allow alpha numeric, a bunch of special chars, but not a certain sequence of chars

I have the following regex:
(?!^[&#]*$)^([A-Za-z0-9-'.,&#:?!()$#/\\]*)$
So allow A-Z, a-z, 0-9, and these special chars '.,&#:?!()$#/\
I want to NOT match if the following set of chars is encountered anywhere in the string in this order:
&#
When I run this regex with just "&#" as input, it does not match my pattern and I get an error, great. When I run the regex with '.,&#:?!()$#/\ABC123 it does match my pattern, no errors.
However when I run it with:
'.,&##:?!()$#/\ABC123
It does not error either. I'm doing something wrong with the check for the &# sequence.
Can someone tell me what I've done wrong, I'm not great with these things.
Borrowing a technique for matching quoted strings, remove & from your character class, add an alternative for & not followed by #, and allow the string to optionally end with &:
^((?:[A-Za-z0-9-'.,#:?!()$#/\\]+|&[^#])*&?)$
I would actually do it in two parts:
Check your allowed character set. To do this I would look for characters that are not allowed, and return false if there's a match. That means I have a nice simple expression:
[^A-Za-z0-9'\.&#:?!()$#^]
Check your banned substring. And since it is just a substring, I probably wouldn't even use a regex for that part.
You didn't mention your language, but if in C#:
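// Requires: using System.Text.RegularExpressions;
bool IsValid(string input)
{
    // Reject the banned substring first, then any character outside the allowed set.
    return !( input.Contains("&#")
           || Regex.IsMatch(input, @"[^A-Za-z0-9'\.&#:?!()$#^]")
           );
}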
^((?!&#)[A-Za-z0-9-'.,&#:?!()$#/\\])*$
note that the last \ is escaped (doubled)
SO automatically turns \\ into \ if not in backticks
Assuming Perl compatible RegExp
To not match on the string '&#':
(?![^&]*&#)^([A-Za-z0-9-'.,&#:?!()$#/\\]*)$
Although you don't need the parentheses, because you are matching the entire string.
Just FYI, although Ben Blank's regex works, it's more complicated than it needs to be. I would do it like this:
^(?:[A-Za-z0-9-'.,#:?!()$#/\\]+|&(?!#))+$
Because I used a negative lookahead instead of a negated character class, the regex doesn't need any extra help to match an ampersand at the end of the string.
I'd recommend using two regular expressions in a conditional:
if (string has sequence "&#")
return false
else
return (string matches sequence "A-Za-z0-9-'.,&#:?!()$#/\")
I believe your second "main" regex of
^([A-Za-z0-9-'.,&#:?!()$#/\])$"
has several errors:
It will test only one character in your set
The \ character in regular expressions is an escape: it either gives the next character a special meaning (e.g. \n is the line feed character) or makes a special character literal. The sequence \] is therefore a literal ], so your bracketed list is never terminated.
You may be better off using
^[A-Za-z0-9-'.,&#:?!()$#/\\]+$
Note that the backslash character is represented by a double backslash.
The + character indicates that at least one character being tested has to match the regex; if it is fine to pass a zero-length string, replace the + with a *.
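Two of the approaches suggested in the answers above, the two-part check and the per-character negative lookahead, can be compared in a short sketch. It is written in Python as an assumption, since the question does not name a language. The character class is adapted from the question (hyphen escaped, duplicate # dropped), the function names are invented for this example, and * is used so the empty string passes.

```python
import re

# Character class adapted from the question; an assumption, adjust to the real set.
ALLOWED_CLASS = r"[A-Za-z0-9\-'.,&#:?!()$/\\]"

two_step_re = re.compile(ALLOWED_CLASS + r"*")
# Each character may only be consumed if "&#" does not start at that position.
single_re = re.compile(r"(?:(?!&#)" + ALLOWED_CLASS + r")*")

def is_valid_two_step(text):
    """Plain substring test for the banned sequence, then the character-set check."""
    return "&#" not in text and two_step_re.fullmatch(text) is not None

def is_valid_single(text):
    """One regex: allowed characters only, with a lookahead guarding each position."""
    return single_re.fullmatch(text) is not None

for s in ["ABC&123", "ABC&#123", "a&b&#", "AB C", "&", ""]:
    print(repr(s), is_valid_two_step(s), is_valid_single(s))
```

Both functions should agree on every input; the two-step version is arguably easier to read and to change later.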