Understanding regex in shell

Understanding regex in shell - regex

I came across single grouping concept in shell script.
cat employee.txt
101,John Doe,CEO
I was practising SED substitute command and came across with below example.
sed 's/\([^,]*\).*/\1/g' employee.txt
It was given that above expression matches the string up to the 1st comma.
I am unable to understand how this matches the 1st comma.
Below is my understanding
s - substitute command
/ delimiter
\ escape character for (
( opening braces for grouping
^ beginning of the line - anchor
[^,] - i am confused in this , is it negate of comma or mean something else?
why * and again .* is used to match the string up to 1st comma?

^ matches beginning of line outside of a character class []. At the beginning of a character class, it means negation.
So, it says: non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.

I know 'link only' answers are to be avoided - Choroba has correctly pointed out that this is:
non-comma ([^,]) repeated zero or more times () followed by anything (.). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
However I'd like to add that for this sort of thing, I find regulex quite a useful tool for visualising what's going on with a regular expression.
The image representation of your regular expression is:

Given the string "foo, bar", s/\([^,]*\).*/\1/g, and more specifically \([^,]\)*) means, "match any character that is not a comma" (zero or more times). Since "f" is not a comma, it matches "f" and "remembers" it. Because it is "zero or more times", it tries again. The next character is not a comma either (it is o), then, the regex engine adds that o to the group as well. The same thing happens for the 2nd o.
The next character is indeed a comma, but [^,] forbids it, as #choroba affirmed. What is in the group now is "foo". Then, the regex uses .* outside the group which causes zero or more characters to be matched but not remembered.
In the replacement part of the regex, \1 is used to place the contents of the remembered text ("foo"). The rest of the matched text is lost and that is how you remain with only the text up to the first comma.

Related

Removing last character from a line using regex

I just started learning regex and I'm trying to understand how it possible to do the following:
If I have:
helmut_rankl:20Suzuki12
helmut1195:wasserfall1974
helmut1951:roller11
Get:
helmut_rankl:20Suzuki1
helmut1195:wasserfall197
helmut1951:roller1
I tried using .$ which actually match the last character of a string, but it doesn't match letters and numbers.
How do I get these results from the input?

You could match the whole line, and assert a single char to the right if you want to match at least a single character.
.+(?=.)
Regex demo
If you also want to match empty strings:
.*(?=.)

This will do what you want with regex's match function.
^(.*).$
Broken down:
^ matches the start of the string
( and ) denote a capturing group. The matches which fall within it are returned.
.* matches everything, as much as it can.
The final . matches any single character (i.e. the last character of the line)
$ matches the end of the line/input

Regex - No "p" at second position

I am learning Regex and after reading this post, I started doing some exercises and I got stuck on this exercise. Here are the two lists of words that should be matched and not matched
I started with
^(.).*\1$
and get bothered with sporous that get matched although it should not. So I found
^(.)(?!p).*\1$
that did the trick.
The best solution (uses one less character than my solution) given here is
^(.)[^p].*\1$
but I don't really understand this pattern. Actually I think I am confused about seeing the ^ anchor in a group [] and I am confused about seeing the ^ anchor somewhere else than at the beginning of the regex.
Can you help to understand what this regex is doing?

Anything in square brackets is a character class. This context uses its own mini-syntax which simply lists the allowed characters [abc] or a range of allowed characters [a-z] or disallowed characters by adding a caret as the very first character in the character class [^a-z].

Your solution uses a negative look-ahead (?!p) that does not consume characters, and just checks if the next character is not p.
The other solution uses a negated character class [^p] that will consume a character other than p.
So, the final solution depends on what you need to match/capture.

Here is the pattern explanation of ^(.)[^p].*\1$
^ start of the string/line
(.) group first character
[^p] any character except p
.* zero or more characters
\1 first matched group again
$ end of the string/line
The above regex matches any string that starts and ends with the same character and not contains p at second position.
For detail explanation visit at regex101.
Read more about Negated Character Classes.

[^p] simply means that any character will match, which is not p.
I'll explain the regex step by step in the following sentences.
^ start of the string
(.) matches any character as group 1
[^p] matches any character that is not p
.* matches any character that repeats zero or more times
\1 matches the exact matched character(s) from group 1
$ end of the string
A good source for learning regex is regex101.

^ means assert position at start of the line, however, in a character class [ ] it equates to match character other than ...
Example:
^test-[^p]-1234
Result:
test-q-1234 // match
test-p-1234 // no match
test-o-1234 // match
https://regex101.com/r/wN4zF9/1

Perl code understanding

I am new to perl language - I have been trying to understand the below code
if ( $nextvalue !~ /^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126/)
Can you please help us understand what is above meant for?

condition will be true when $nextvalue will NOT match following regular expression.
Regular expressiion will match if that string
either
starts with at least one character,
followed by double quote sign ("),
followed by at least one non-whitespace character,
followed by whitespace (),
followed by string "/cs/",
followed by at least one character,
followed by whitespace and string HTTP/,
followed by one of digits from 1 to 9 inclusive,
followed by dot
followed by one of digits from 0 to 9,
followed by double quote mark (")
or contains two forward slashes (//)
or contains sunstring "/Images/fold/1.jpg"
or contains substring "/busines"
or contains substring "/Type= OPTIONS"
or contains substring "/203.176.111.126"

Whenever i am unsure what some cryptic regular expression does, i turn to Debuggex:
^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126
Debuggex Demo
This is a railroad diagram, every string that has a substring fitting the description along any of the grey tracks will match your regex. As your condition uses !~ meaning "does not match", those strings will then fail the check.
Debuggex certainly has issues (for example it displays ^, meaning you would have to know that this means the beginning of the string, same for dots and other, whitespaces show up as underscroes, etc.) but it certainly helps in understanding the structure of the expression and possibly gives you an idea what the author had in mind.

Regex to match anything

I know it seems a bit redundant but I'd like a regex to match anything.
At the moment we are using ^*$ but it doesn't seem to match no matter what the text.
I do a manual check for no text but the test view we use is always validated with a regex. However, sometimes we need it to validate anything using a regex. i.e. it doesn't matter what is in the text field, it can be anything.
I don't actually produce the regex and I'm a complete beginner with them.

The regex .* will match anything (including the empty string, as Junuxx points out).

The chosen answer is slightly incorrect, as it wont match line breaks or returns. This regex to match anything is useful if your desired selection includes any line breaks:
[\s\S]+
[\s\S] matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character. the + matches one or more of the preceding expression

^ is the beginning-of-line anchor, so it will be a "zero-width match," meaning it won't match any actual characters (and the first character matched after the ^ will be the first character of the string). Similarly, $ is the end-of-line anchor.
* is a quantifier. It will not by itself match anything; it only indicates how many times a portion of the pattern can be matched. Specifically, it indicates that the previous "atom" (that is, the previous character or the previous parenthesized sub-pattern) can match any number of times.
To actually match some set of characters, you need to use a character class. As RichieHindle pointed out, the character class you need here is ., which represents any character except newlines (and it can be made to match newlines as well using the appropriate flag). So .* represents * (any number) matches on . (any character). Similarly, .+ represents + (at least one) matches on . (any character).

I know this is a bit old post, but we can have different ways like :
.*
(.*?)

What does /([^.])\.(.)/ mean?

When I searched about something, I found an answered question in this site. 2 of the answers contain
/([^.]*)\.(.*)/
on their answer.
The question is located at Find & replace jquery. I'm newbie in javascript, so I wonder, what does it mean? Thanks.

/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*) The parentheseis are there just to group a string together. The square brackets denote a "character group", meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverse its meaning. Since the only character beside the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new paranthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as needs to.
In summary, the expression finds any sequence of characters (that isn't a period), followed by a single period, again followed by any character.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.

It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.

Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1]Match all characters which are not . [ period ] [2] till you find a .[ period ] then [3] match all characters.

Anything except a dot, followed by a dot, followed by anything.
You can test regex'es on regexpal

It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.

That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".

It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).

This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar

This regular expression generates two matching expressions that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contain at least one dot. It will match "." contrary to some of the other answers, but the retrived groups will be empty.

IMO /.*\..*/g Would do the same thing.
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]

the . character matches any character except line break characters the \r or \n.
the ^ negates what follows it (in this case the dot)
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star)
so this ([^.]*) means any line break repeated zero or more times (it just eats up carriage returns).
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more line breaks followed by a dot followed by any number of characters.
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html

It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Understanding regex in shell - regex

Related

Removing last character from a line using regex

Regex - No "p" at second position

Perl code understanding

Regex to match anything

What does /([^.])\.(.)/ mean?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Understanding regex in shell - regex

Related

Removing last character from a line using regex

Regex - No "p" at second position

Perl code understanding

Regex to match anything

What does /([^.]*)\.(.*)/ mean?

Categories

Resources

What does /([^.])\.(.)/ mean?