Regular exp to match string from beginning until certain char is met - regex

I have a long string from which I'm trying to capture a substring up to a certain character.
Let's suppose I have the following string, and I would like to get the text up to the first ampersand.
abc.8965.aghtj&hgjkiyu5.8jfhsdj
I would like to extract what is present before the ampersand so: abc.8965.aghtj
I thought this would work:
grep '^.*&{1}'
I would translate it as
^ start of string
.* match whatever chars
&{1} until the first ampersand is matched
Any advice?
I'm afraid this will take me weeks

{1} does not match the first occurrence; instead it means "match exactly one of the preceding pattern/character", which is identical to just matching the character (&{3} would match &&&).
In order to match the first occurrence of &, you need to use .*?:
grep '^.*?&'
Normally, .* is greedy, meaning it matches as much as possible. This means your pattern would match the last ampersand rather than the first one. .*? is the non-greedy version, matching as little as possible while fulfilling the pattern.
Update: The non-greedy syntax is a Perl-style extension that plain grep does not support (GNU grep accepts it only with -P). Here is another option:
'^[^&]*&'
It matches anything that is not an ampersand, up to the first ampersand.
You may also have to enable extended regular expressions in grep (-E) for syntax such as {1}; the pattern above works without it.
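To see the difference between the three patterns, here is a minimal sketch in Python. Using Python's re module is an assumption for illustration (the question itself uses grep), and the input is a made-up variant of the sample string with a second ampersand so that the greedy and non-greedy results differ.

```python
import re

# Hypothetical variant of the sample with a second ampersand,
# so that greedy and non-greedy matching give different results.
text = "abc.8965.aghtj&hgjkiyu5&8jfhsdj"

greedy = re.match(r"^.*&", text).group(0)      # runs to the last ampersand
lazy = re.match(r"^.*?&", text).group(0)       # stops at the first ampersand
negated = re.match(r"^[^&]*&", text).group(0)  # same result, no lazy quantifier needed

print(greedy)   # abc.8965.aghtj&hgjkiyu5&
print(lazy)     # abc.8965.aghtj&
print(negated)  # abc.8965.aghtj&
```

The negated class is the most portable of the three, since it does not depend on the engine supporting lazy quantifiers.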

Try this one:
^.*?(?=&)
It won't include the ampersand, just the text before it. Note that lookahead requires a Perl-compatible engine (for example grep -P).
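A small Python sketch of the lookahead idea, again assuming Python's re module rather than grep. It also shows a capture-group alternative for engines without lookahead; the variable names are only illustrative.

```python
import re

text = "abc.8965.aghtj&hgjkiyu5.8jfhsdj"

# The lookahead (?=&) requires an ampersand next but does not consume it.
before = re.match(r"^.*?(?=&)", text).group(0)

# Alternative without lookahead: capture the run of non-ampersands in a group.
captured = re.match(r"^([^&]*)&", text).group(1)

print(before)    # abc.8965.aghtj
print(captured)  # abc.8965.aghtj
```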

Understanding regex in shell

I came across single grouping concept in shell script.
cat employee.txt
101,John Doe,CEO
I was practising the sed substitute command and came across the example below.
sed 's/\([^,]*\).*/\1/g' employee.txt
It was stated that the expression above matches the string up to the first comma.
I am unable to understand how it matches up to the first comma.
Below is my understanding:
s - substitute command
/ delimiter
\ escape character for (
( opening braces for grouping
^ beginning of the line - anchor
[^,] - I am confused by this: does it negate the comma, or does it mean something else?
Why are * and then .* used to match the string up to the first comma?
^ matches beginning of line outside of a character class []. At the beginning of a character class, it means negation.
So, it says: non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
I know 'link only' answers are to be avoided - Choroba has correctly pointed out that this is:
non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
However I'd like to add that for this sort of thing, I find regulex quite a useful tool for visualising what's going on with a regular expression.
Given the string "foo, bar" and s/\([^,]*\).*/\1/g, the part \([^,]*\) means "match any character that is not a comma" (zero or more times) and remember it. Since "f" is not a comma, it matches "f" and "remembers" it. Because it is "zero or more times", it tries again. The next character is not a comma either (it is o), so the regex engine adds that o to the group as well. The same thing happens for the second o.
The next character is indeed a comma, but [^,] forbids it, as @choroba affirmed. What is in the group now is "foo". Then the regex uses .* outside the group, which causes zero or more characters to be matched but not remembered.
In the replacement part of the regex, \1 is used to place the contents of the remembered text ("foo"). The rest of the matched text is lost and that is how you remain with only the text up to the first comma.
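The walk-through above can be reproduced with a short Python sketch. This assumes Python's re module instead of sed, so the group parentheses are written without backslashes; the second group is added only to make the discarded part visible.

```python
import re

line = "101,John Doe,CEO"

# Group 1 collects non-comma characters; group 2 (added for illustration)
# shows what .* swallows from the first comma onward.
m = re.match(r"([^,]*)(.*)", line)
print(m.group(1))  # 101
print(m.group(2))  # ,John Doe,CEO

# Same substitution as the sed command: keep group 1, drop the rest.
result = re.sub(r"([^,]*).*", r"\1", line, count=1)
print(result)  # 101
```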

Match pattern anywhere in string?

I want to match the following pattern:
Exxxx49 (where x is a digit 0-9)
For example, E123449abcdefgh, abcdefE123449987654321 are both valid. I.e., I need to match the pattern anywhere in a string.
I am using:
^*E[0-9]{4}49*$
But it only matches E123449.
How can I allow any amount of characters in front or after the pattern?
Remove the ^ and $ to search anywhere in the string.
In your case the * are probably not what you intended; E[0-9]{4}49 should suffice. This will find an E, followed by four digits, followed by a 4 and a 9, anywhere in the string.
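As a minimal sketch of the unanchored search, assuming Python's re module (the question does not name a language): search() scans the whole string for the pattern, so no leading or trailing wildcard is needed. The third sample is a made-up negative case.

```python
import re

pattern = re.compile(r"E[0-9]{4}49")

# The first two samples come from the question; the third is a
# hypothetical string that has only four digits after the E.
samples = ["E123449abcdefgh", "abcdefE123449987654321", "E12349"]
found = [bool(pattern.search(s)) for s in samples]

print(found)  # [True, True, False]
```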
I would go for
^.*E[0-9]{4}49.*$
EDIT:
since it fulfills all requirements stated by the OP:
"[match] Exxxx49 (where x is digit 0-9)"
"allow for any amount of characters in front or after pattern"
It will match
^.* everything from, including the beginning of the line
E[0-9]{4}49 the requested pattern
.*$ everything after the pattern, including the end of the line
Your original regex had a regex pattern syntax error at the first *. Fix it and change it to this:
.*E\d{4}49.*
This pattern is for engines whose match methods are anchored to the whole string, such as Java's matches(), since you did not specify a language.
.* matches any number of characters. As it surrounds the pattern, the regex will match the entire string as long as the pattern occurs somewhere in it.
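To illustrate whole-string matching, here is a small sketch that uses Python's re.fullmatch as a stand-in for an anchored matcher like the Java one mentioned above. That mapping is an assumption for illustration only.

```python
import re

s = "abcdefE123449987654321"

# Without the surrounding .*, a whole-string match fails because of the
# extra characters before and after the pattern.
bare = re.fullmatch(r"E\d{4}49", s)

# With .* on both sides, the whole string is consumed.
padded = re.fullmatch(r".*E\d{4}49.*", s)

print(bare)                  # None
print(padded.group(0) == s)  # True
```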
Just simply use this:
E[0-9]{4}49
"How do I allow for any amount of characters in front or after pattern?" "but it only matches E123449"
Use global flag /E\d{4}49/g if supported by the language
OR
Try with capturing groups (E\d{4}49)+ that is grouped by enclosing inside parenthesis (...)

Regex match whole string

I have the following pattern:
[ \n\t]*([a-zA-Z][a-zA-Z0-9_]*)[ \n\t]+((char)[ \n\t]*\[[ \n\t]*([0-9]+)[ \t\n]*\]|(char)|(int)|(double)|(bool)|(blob)[ \n\t]*\[[ \n\t]*([0-9]+)[ \t\n]*\])[ \n\t]*
You can try it here: http://regex101.com/r/vA0xG9
In the first capturing group ([a-zA-Z][a-zA-Z0-9_]*), I want to grab only words that start with a-zA-Z.
The following two strings match equally:
cpf char[12]
,
9cpf char[12]
It ignores the digit 9 and matches the same way as the first string.
I've tried to use this capturing group: (ˆ[a-zA-Z][a-zA-Z0-9_]*$), but it didn't work.
I'm using lib regex.h.
What should I do?
Thanks.
Put ^ at the beginning of the whole thing and $ at the end:
^[ \n\t]*([a-zA-Z][a-zA-Z0-9_]*)[ \n\t]+((char)[ \n\t]*\[[ \n\t]*([0-9]+)[ \t\n]*\]|(char)|(int)|(double)|(bool)|(blob)[ \n\t]*\[[ \n\t]*([0-9]+)[ \t\n]*\])[ \n\t]*$
I would also suggest \s instead of [ \n\t] if you want to match whitespace (with POSIX regex.h, where \s is not standard, the portable spelling is [[:space:]]).
In C++, there is a handy regex method that anchors the match to the whole string automatically: std::regex_match:
Determines if the regular expression e matches the entire target character sequence, which may be specified as std::string, a C-string, or an iterator pair.
This way, you will avoid issues with mistyped ^ as ˆ as well as cases when you have alternation (e.g. ^A|B$ won't match strings only equal to A or B, you need ^(A|B)$ or ^(?:A|B)$).
Note that there is an equivalent boost::regex_match method.
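The same whole-string idea can be sketched in Python, where re.fullmatch plays the role that std::regex_match plays in C++. The pattern below is a shortened, hypothetical version of the one in the question, and the "Axyz" input is made up to show the alternation pitfall.

```python
import re

# Simplified, hypothetical version of the declaration pattern from the question.
decl = re.compile(
    r"\s*([a-zA-Z][a-zA-Z0-9_]*)\s+(char\s*\[\s*[0-9]+\s*\]|char|int|double|bool)\s*"
)

# search() may start anywhere, so it silently skips the leading 9.
skipped = decl.search("9cpf char[12]").group(1)
# fullmatch() must consume the whole string, so the leading 9 is rejected.
rejected = decl.fullmatch("9cpf char[12]")
accepted = decl.fullmatch("cpf char[12]").group(1)

print(skipped)   # cpf
print(rejected)  # None
print(accepted)  # cpf

# Alternation pitfall: ^A|B$ means "starts with A" or "ends with B".
loose = bool(re.search(r"^A|B$", "Axyz"))
strict = bool(re.search(r"^(?:A|B)$", "Axyz"))
print(loose, strict)  # True False
```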

regular expression no characters

I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How do I change the regular expression so that it doesn't match if there are any characters after the space?
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers
Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (case-insensitive) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.
^([A-Z]+, )?$
The difference between mine and Donut's is that his will match ", " and fail for the empty string, while mine will match the empty string and fail for ", ". His is also case-insensitive; with mine you'll have to add case-insensitivity to the options of your regex function, but it follows your example.
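A quick way to compare the two patterns is to run both over a few inputs. The sketch below assumes Python's re module (the question does not name an engine), and the names donut and other are just labels for the two answers.

```python
import re

donut = re.compile(r"^[A-Za-z]*, $")  # first answer
other = re.compile(r"^([A-Z]+, )?$")  # second answer

# Each input maps to (first pattern matches, second pattern matches).
inputs = ["test, ", "test, test", ", ", "", "TEST, "]
results = {s: (bool(donut.search(s)), bool(other.search(s))) for s in inputs}

for s, (a, b) in results.items():
    print(repr(s), a, b)
# 'test, ' True False
# 'test, test' False False
# ', ' True False
# '' False True
# 'TEST, ' True True
```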
I am not sure which regex engine/language you are using, but there is often something like a negated character class such as [^a-z], meaning "any character other than a lowercase letter".

Why do I get successful but empty regex matches?

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
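So if your regex were /f(.*)\1/, the group would be allowed to match the empty string between the 'f' and the first 'o' of "foo"; in practice the engine prefers the longer match, in which the group captures 'o' and \1 matches the second 'o'.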
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
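The behaviour can be checked with a short Python sketch. Python's re module is an assumption here; the question uses POSIX regexec(), which may differ in details, though the empty match on "blabl" agrees with what the question reports.

```python
import re

# No prefix of "blabl" is immediately repeated, so the only way to
# succeed is with an empty group followed by an empty backreference.
empty = re.search(r"(.*)\1", "blabl")
print(empty.span(), repr(empty.group(1)))  # (0, 0) ''

# On "foo" the engine finds the longer match: the group takes one 'o'
# and the backreference matches the other.
with_f = re.search(r"f(.*)\1", "foo")
print(with_f.group(0), with_f.group(1))  # foo o

# With + the group cannot be empty, so the match is the doubled 'o'.
plus = re.search(r"(.+)\1", "foo")
print(plus.group(0), plus.group(1))  # oo o
```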
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
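Both variants can be tried on the two example strings with a minimal Python sketch (Python's re module is assumed; the variable names are illustrative).

```python
import re

# Immediate repeat: the group must be followed directly by a copy of itself.
immediate = re.search(r"(.+)\1", "BLABLA")
print(immediate.group(0), immediate.group(1))  # BLABLA BLA

# Later repeat: .* allows other text between the group and its copy.
later = re.search(r"(.+).*\1", "BLAahemBLA")
print(later.group(0), later.group(1))  # BLAahemBLA BLA

# The immediate form finds nothing in the second string, because no
# substring there is directly followed by a copy of itself.
none_found = re.search(r"(.+)\1", "BLAahemBLA")
print(none_found)  # None
```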