what does (?=.*[^a-zA-Z]) mean

What does (?=.*[^a-zA-Z]) mean?
I am a beginner in regex and do not understand what it means.
Is it that dot (.) means any character, so .* means any character any number of times, and [^a-zA-Z] means any one character except a-z and A-Z?
What string will match it?
Thanks,
Puneet

That is a positive lookahead assertion.
It means that there is at least one character other than a-z or A-Z somewhere to the right of the current position.
Example:
$ echo 12abc | grep -P '2(?=.*[^a-zA-Z])'
$ echo 12abc. | grep -P '2(?=.*[^a-zA-Z])'
12abc.
In the first command there is no non-letter character after the 2, so the line is not printed.
In the second command I've added a period at the end. Now there is a non-letter character after the 2, so the line matches and is printed.
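To see that a lookahead only checks what follows and does not become part of the match, the -o option can print just the matched text. This is a small sketch, assuming GNU grep built with PCRE support (-P); the sample strings and variable names are made up for illustration.

```shell
# On its own, the lookahead is satisfied by any line that contains a non-letter.
matching_lines=$(printf '%s\n' 'abc' 'abc1' 'a b' | grep -P '(?=.*[^a-zA-Z])')
printf '%s\n' "$matching_lines"

# -o prints only the matched text. The lookahead is zero-width,
# so the match itself is just the "2".
matched_text=$(echo '12abc.' | grep -oP '2(?=.*[^a-zA-Z])')
printf '%s\n' "$matched_text"
```

The first command prints abc1 and "a b" (the digit and the space are non-letters); the second prints only 2.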

Related

I want to grep words that have a hyphen in the middle and start with uppercase letter + words that start with uppercase letter without hyphen

I want a regex that matches words that start with an uppercase letter and have a hyphen in the middle, as well as words that start with an uppercase letter and have no hyphen.
Also, I want only the first letter to be uppercase and all the others lowercase; something like ENGLAND is not what I need, because all its letters are uppercase.
Here are examples of the word structures I want:
Wilkes-Barre
California
I have tried:
[A-Z][a-z-]\+[A-Z][a-z]\+
but it only matches things like Wilkes-Barre; it doesn't match California.
I also tried
[A-Z][a-z-]\+
This one matches things like California, but it matches Wilkes-Barre as two words: Wilkes- and Barre.
So could someone please help me find a regex that matches both types of words, so that if I grep a file that has
Wilkes-Barre
California
ENGLAND
rome
it will only match the first two lines, giving 2 matches, not 3.

You do not specify whether a single uppercase letter should match. Let's assume the answer is yes. The following should do what you want:
$ grep -E '^((^|-)[A-Z][a-z]*)+$' data.txt
Wilkes-Barre
California
It matches entire lines (because of the leading ^ and trailing $) of one or more tokens (one or more because of the +) where each token is a hyphen or the beginning of the line ((^|-)) followed by a single upper case letter ([A-Z]) and zero or more lower case letters ([a-z]*).
If there must be at least one lower case letter after the upper case letter, just replace the * by a +:
grep -E '^((^|-)[A-Z][a-z]+)+$' data.txt
These regexes also match a line like -Foobar. If this is not wanted the following excludes lines that start with a hyphen:
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$' data.txt
or (if at least one lower case letter is required):
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$' data.txt
Finally, if there is at most one hyphen (no Foo-Bar-Baz):
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)?$' data.txt
or:
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$' data.txt
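The difference between the * and ? variants is easiest to see on data that contains the edge cases. The sketch below is illustrative only: the extra lines -Foobar and Foo-Bar-Baz and the variable names are assumptions, and LC_ALL=C is set so that [A-Z] and [a-z] are plain ASCII ranges.

```shell
# Hypothetical sample data; -Foobar and Foo-Bar-Baz exercise the edge cases.
data=$(printf '%s\n' 'Wilkes-Barre' 'California' 'ENGLAND' 'rome' '-Foobar' 'Foo-Bar-Baz')

# Any number of hyphenated parts, each with at least one lowercase letter.
any_hyphens=$(printf '%s\n' "$data" | LC_ALL=C grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$')

# At most one hyphen.
one_hyphen=$(printf '%s\n' "$data" | LC_ALL=C grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$')

printf 'any number of hyphens:\n%s\n' "$any_hyphens"
printf 'at most one hyphen:\n%s\n' "$one_hyphen"
```

The first pattern keeps Wilkes-Barre, California and Foo-Bar-Baz; the second drops Foo-Bar-Baz as well. Neither accepts ENGLAND, rome or -Foobar.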

You can use
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'
See the online demo:
#!/bin/bash
s='Wilkes-Barre
California'
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$' <<< "$s"
Output:
Wilkes-Barre
California
POSIX ERE pattern details:
^ - start of string
[[:upper:]] - an uppercase letter
[[:lower:]]+ - one or more lowercase letters
(-[[:upper:]][[:lower:]]*)? - an optional occurrence of a hyphen, an uppercase letter and then zero or more lowercase letters
$ - end of string.
NOTE: If you need to match strings with more than one hyphen, replace the last ? with *.

Normally the answer should be:
grep "^[A-Z][a-z-]+" test.txt
However, on my system the plus sign is not recognised (plain grep uses basic regular expressions, where an unescaped + is a literal character), so I have to go for:
grep "^[A-Z][a-z-][a-z-]*" test.txt
Explanation:
^ : start of the line
[A-Z] : any one uppercase letter
[a-z-] : any one lowercase letter or a hyphen
Edit after comment
This, however, only shows the first part of Wilkes-Barre. If you want both, you might try this:
egrep "^[A-Z][a-z-]+|^[A-Z][a-z-]+[A-Z][a-z-]+" test.txt
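The two observations in this answer (the unrecognised plus sign, and only the first part of Wilkes-Barre being shown) can be reproduced with a short sketch. It assumes a grep that supports -E, -c and -o; the variable names are illustrative.

```shell
# In a basic regular expression (the default), + is a literal plus sign,
# so this looks for an actual "+" and finds nothing.
# grep exits with status 1 when nothing matches, hence the "|| true".
bre_count=$(echo 'California' | LC_ALL=C grep -c '^[A-Z][a-z-]+') || true

# With -E (extended regular expressions), + means "one or more".
ere_count=$(echo 'California' | LC_ALL=C grep -Ec '^[A-Z][a-z-]+') || true

# -o shows which part of the line matched: the match stops before
# the second capital letter.
matched_part=$(echo 'Wilkes-Barre' | LC_ALL=C grep -Eo '^[A-Z][a-z-]+')

echo "BRE: $bre_count, ERE: $ere_count, matched: $matched_part"
```

This prints BRE: 0, ERE: 1, matched: Wilkes-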

Allow only one number in grep Regex

I have to accept only strings that contain a single digit. The rest of the string's content doesn't matter; it just needs to contain exactly one digit.
I was trying something like this:
echo "exaaaamplee1" | grep '[0-9]\{1\}'
This string is accepted, but this string is also accepted:
echo "exaaaamplee11" | grep '[0-9]\{1\}'

You probably want to use something like [^0-9]. This represents any character except a digit 0-9, and you can use [0-9] (or \d, if you use grep -P) for the one digit that is allowed.
Something like ^[^0-9]*[0-9][^0-9]*$ should match any string with exactly one digit. (^ being the start and $ the end of the string)

If you want to match a string with only one digit character using grep, it's
echo whatever1 | grep '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$'
Start of line followed by any number of non-digits, one digit, and then any number of non-digits until the end of the line.
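A quick way to check the pattern against both strings from the question is to wrap it in a small helper. The function name classify and the extra test strings are assumptions made for this sketch.

```shell
# Prints "accepted" if the argument contains exactly one digit, else "rejected".
classify() {
    if printf '%s\n' "$1" | grep -q '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$'; then
        echo accepted
    else
        echo rejected
    fi
}

for s in 'exaaaamplee1' 'exaaaamplee11' 'a1b' 'nodigits'; do
    printf '%s: %s\n' "$s" "$(classify "$s")"
done
```

exaaaamplee1 and a1b are accepted; exaaaamplee11 (two digits) and nodigits (none) are rejected.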

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text file with the following lines:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change these lines with sed into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to move the minus sign so that I get -12,90 instead of 12,90- with sed. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the regex, but I do not really understand what. Please help.

You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])

In your example, both files are identical, but I think I know what you mean.
For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line.
Then you want to put the dash in front of the matched number, keeping the space before it. This will do the trick:
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/ -\1/' <file.txt >file1.txt

Your first regular expression attempts to match against a string of numbers and .s, but the text contains a comma, not a .. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed, which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)
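The end-of-line anchored version can be tried on a couple of lines from the question. This is only a sketch: the lines are shortened for readability, the variable names are made up, and sed -E is used here in place of -r (both enable extended regular expressions in GNU sed).

```shell
# Two sample lines, shortened from the question's data.
input=$(printf '%s\n' \
    'J Student 25-07 11585 SPOED BEZORGEN 1ST 25,00' \
    'ST Student-Student 30-06 14956 METOPROLOLSUCC RET T 100MG 180ST 12,90-')

# Anchoring the match at $ leaves 25-07, 30-06 and Student-Student untouched;
# only a trailing "digits,digits-" has its minus sign moved to the front.
output=$(printf '%s\n' "$input" | sed -E 's/([0-9]+,[0-9]+)-$/-\1/')
printf '%s\n' "$output"
```

The first line comes out unchanged and the second one ends in 180ST -12,90.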

Using grep to find a pattern beginning in a $

I need to find a pattern that starts with a $, is followed by two digits, then a single character that is not a digit, and then anything else.
I know how to find a pattern starting with a dollar sign and followed by two digits, but I can't figure out how to check for one character that is not a digit.
I also need to count how many lines have this pattern.
I have this so far:
grep -Ec '\$[0-9][0-9].....
I don't know what to do. Can someone please help? Any help would be much appreciated.

A caret at the start of a bracket expression negates it, so if [0-9] is "match any digit" then [^0-9] is "match any non-digit".

You can try this regex:
\$[0-9][0-9][^0-9].*
\$ matches the character $ literally
[0-9] matches a single character in the range 0 to 9
[0-9] matches a single character in the range 0 to 9
[^0-9] matches a single character that is not in the range 0 to 9
.* matches any character (except newline), between zero and unlimited times, as many times as possible, giving back as needed (greedy)

I would second @realspirituals' answer. If you need to count how many lines have this pattern, you can count how many lines grep outputs by piping to wc -l. To both show the lines and count them in one fell swoop, pipe the output like so:
grep -E '\$[0-9]{2}[^0-9].*' | tee >(wc -l)
where tee splits the output between wc and STDOUT (the >(...) process substitution needs a shell such as bash). {2} makes the preceding [0-9] match twice; it requires -E, and the single quotes keep the shell from removing the backslash before the $.
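If only the number of matching lines is needed, grep -c (which the question already uses) is enough. A minimal sketch with made-up input lines; the trailing .* is left out because it does not change which lines match.

```shell
# Hypothetical input lines.
lines=$(printf '%s\n' '$12abc' '$123' 'cost: $45 total' '$1x' 'no dollar')

# Single quotes keep the shell from touching \$.
# -E enables {2}; -c prints the number of matching lines instead of the lines.
count=$(printf '%s\n' "$lines" | grep -Ec '\$[0-9]{2}[^0-9]')
echo "$count"
```

Here the count is 2: $12abc and cost: $45 total match, while $123 has a digit where the non-digit is required and $1x has only one digit.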

Regex to match ZIP code without punctuation

I have a file with a bunch of different ZIP codes:
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
I want to only match on codes that have the format 12345 or 12345-6789, but ignore all other forms.
I have my regex as:
grep -E '\<[0-9]{5}\>[^[:punct:]]|\<[0-9]{5}\>-[0-9]{4}' samplefile
It matches on the 12345-6789 because the "or" clause matches on that particular one. I am confused as to why it won't match on the first 12345 since my expression should say "match on 5 numbers but ignore any punctuation."

An expression that matches your desired output is:
egrep "^[0-9]{5}([-][0-9]{4})?$" samplefile
The expression breakdown:
^[0-9]{5} - Find a line that starts with 5 digits. ^ means start of line and [0-9]{5} means exactly five digits between zero and nine.
([-][0-9]{4})?$ - May end with a dash and four digits or nothing at all. () groups the expressions together, [-] represents the dash character, [0-9]{4} represents exactly four digits between zero and nine, ? indicates the grouped expression either exists entirely or does not exist and $ marks the end of the line.
test.dat
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
Running the expression on the test data:
mike@test:~$ egrep "^[0-9]{5}([-][0-9]{4})?$" test.dat
12345
12345-6789
12345-7890
Additional info: grep -E can alternatively be written as egrep. This also works for grep -F which is the same as fgrep and grep -r which is the same as rgrep.

It won't match a line that contains only "12345", but it will match "12345" followed by a space. The way you wrote it, the first clause needs one more character after the five digits, and that character must not be punctuation.
Consider Mike's answer; it's clearer.
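Both points can be checked directly. The sketch below assumes GNU grep (for the \< and \> word-boundary extensions); the variable names are illustrative.

```shell
# First clause of the pattern from the question.
first_clause='\<[0-9]{5}\>[^[:punct:]]'

# grep -c prints 0 and exits with status 1 when nothing matches, hence "|| true".
bare=$(echo '12345' | grep -Ec "$first_clause") || true      # nothing follows the digits
spaced=$(echo '12345 x' | grep -Ec "$first_clause") || true  # a space follows the digits
echo "bare: $bare, followed by a space: $spaced"

# The anchored pattern accepts exactly the two wanted formats.
zip_codes=$(printf '%s\n' '12345' '12345-6789' '1234567890' '12345:6789' '12345-7890' '12:1234678')
valid=$(printf '%s\n' "$zip_codes" | grep -E '^[0-9]{5}(-[0-9]{4})?$')
printf '%s\n' "$valid"
```

The first check prints bare: 0, followed by a space: 1, and the anchored pattern keeps 12345, 12345-6789 and 12345-7890.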

We Keep Coding
