what does (?=.*[^a-zA-Z]) mean - regex

What does (?=.*[^a-zA-Z]) mean
I am a beginner in regex and do not understand what it means.
Is it like this: dot (.) means any character, so .* means any character any number of times, and [^a-zA-Z] means any one character except a-z and A-Z?
What string will match it?
Thanks,
Puneet

That is a positive lookahead assertion.
It means that there is at least one character that is not in a-zA-Z somewhere to the right of the current position. The assertion only checks; it does not consume any characters.
Example:
$ echo 12abc | grep -P '2(?=.*[^a-zA-Z])'
$ echo 12abc. | grep -P '2(?=.*[^a-zA-Z])'
12abc.
In the first line there is no character outside a-zA-Z after the 2, so the line is not shown.
In the second line I've added a period to the end. Now there is a character outside a-zA-Z after the 2, so the line is found and shown.
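To see that the lookahead checks the rest of the line without becoming part of the match, the example above can be rerun with -o, which prints only the matched text. This is a small sketch that assumes GNU grep built with PCRE support (-P); the variable names are only for illustration.

```shell
# Assumes GNU grep with PCRE support (-P), as in the example above.
# -o prints only the matched text. The lookahead inspects what follows
# the 2 but is not part of the match, so only "2" is printed.
matched=$(printf '12abc.\n' | grep -oP '2(?=.*[^a-zA-Z])')
echo "$matched"

# Here only letters follow the 2, so the assertion fails and the line
# is not counted (grep -c prints 0 and exits non-zero, hence "|| true").
count=$(printf '12abc\n' | grep -cP '2(?=.*[^a-zA-Z])' || true)
echo "$count"
```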

Related

I want to grep words that have a hyphen in the middle and start with uppercase letter + words that start with uppercase letter without hyphen

I want the regex that allows me to match words that have hyphen in the middle and start with uppercase letter + words that start with uppercase letter without hyphen.
Also, I want only the first letter to be uppercase and all the others lowercase. Something like ENGLAND is not what I need, because all its letters are uppercase.
I will give examples for all the wanted words' structure:
Wilkes-Barre
California
I have tried:
[A-Z][a-z-]\+[A-Z][a-z]\+
but it only matches things like Wilkes-Barre; it doesn't match California.
also tried
[A-Z][a-z-]\+
This one matches things like California, but it matches Wilkes-Barre as if it were 2 words: Wilkes- and Barre.
Can someone please help me find the regex that matches those 2 types of words, so that if I grep a file that has
Wilkes-Barre
California
ENGLAND
rome
It will only match the first 2 and it will give 2 matches not 3.
You do not specify if a single upper-case letter should match. Let's assume the answer is yes. The following should do what you want:
$ grep -E '^((^|-)[A-Z][a-z]*)+$' data.txt
Wilkes-Barre
California
It matches entire lines (because of the leading ^ and trailing $) of one or more tokens (one or more because of the +) where each token is a hyphen or the beginning of the line ((^|-)) followed by a single upper case letter ([A-Z]) and zero or more lower case letters ([a-z]*).
If there must be at least one lower case letter after the upper case letter, just replace the * by a +:
grep -E '^((^|-)[A-Z][a-z]+)+$' data.txt
These regexes also match a line like -Foobar. If this is not wanted the following excludes lines that start with a hyphen:
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$' data.txt
or (if at least one lower case letter is required):
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$' data.txt
Finally, if there is at most one hyphen (no Foo-Bar-Baz):
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)?$' data.txt
or:
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$' data.txt
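As a quick check, one of the variants above can be run against the four sample lines from the question. This is only a sketch: the variable names are made up, and LC_ALL=C is set as an assumption so that [A-Z] and [a-z] mean plain ASCII ranges regardless of the system locale.

```shell
# Assumption: force the C locale so [A-Z] and [a-z] are ASCII ranges.
export LC_ALL=C

# Sample input taken from the question.
data='Wilkes-Barre
California
ENGLAND
rome'

# Capitalized words, optionally joined by hyphens; each part needs at
# least one lowercase letter after its capital.
matches=$(printf '%s\n' "$data" | grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$')
echo "$matches"

# -c counts the matching lines instead of printing them.
count=$(printf '%s\n' "$data" | grep -Ec '^[A-Z][a-z]+(-[A-Z][a-z]+)*$')
echo "$count"
```

Because the pattern is anchored with ^ and $, Wilkes-Barre counts as one match rather than two.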
You can use
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'
See the online demo:
#!/bin/bash
s='Wilkes-Barre
California'
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$' <<< "$s"
Output:
Wilkes-Barre
California
POSIX ERE pattern details:
^ - start of string
[[:upper:]] - an uppercase letter
[[:lower:]]+ - one or more lowercase letters
(-[[:upper:]][[:lower:]]*)? - an optional occurrence of a hyphen, an uppercase letter and then zero or more lowercase letters
$ - end of string.
NOTE: If you need to match strings with more than one hyphen, replace the last ? with *.
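The effect of swapping the last ? for * can be sketched as follows. Stratford-Upon-Avon is a made-up extra sample line, not from the question, and the variable names are only illustrative.

```shell
# Two sample words from the question plus a hypothetical one with two hyphens.
words='Wilkes-Barre
Stratford-Upon-Avon
California'

# With ? at most one hyphenated part is allowed.
one_hyphen=$(printf '%s\n' "$words" |
  grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$')
echo "$one_hyphen"

# With * any number of hyphenated parts is allowed.
any_hyphens=$(printf '%s\n' "$words" |
  grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)*$')
echo "$any_hyphens"
```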
Normally the answer should be:
grep "^[A-Z][a-z-]+" test.txt
However, on my system the plus sign is not recognised as a quantifier (plain grep without -E treats + as a literal character), so I have to go for:
grep "^[A-Z][a-z-][a-z-]*" test.txt
Explanation:
^ : start of the line
[A-Z] : all possible uppercase letters
[a-z-] : all possible lowercase letters or a hyphen
Edit after comment
This, however, only shows the first part of Wilkes-Barre. If you want both, you might try this:
egrep "^[A-Z][a-z-]+|^[A-Z][a-z-]+[A-Z][a-z-]+" test.txt
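On the plus sign mentioned in this answer: the behaviour depends on whether grep is using basic or extended regular expressions. The sketch below compares the spellings on a tiny made-up input; it assumes GNU grep for the \+ form, and LC_ALL=C is set so the ranges are plain ASCII.

```shell
# Assumption: C locale, so [A-Z] and [a-z] are ASCII ranges.
export LC_ALL=C

input='California
rome'

# Basic regex (plain grep): an unescaped + is a literal plus sign,
# so nothing matches here ("|| true" absorbs grep's non-zero exit).
literal_plus=$(printf '%s\n' "$input" | grep '^[A-Z][a-z-]+' || true)

# GNU grep accepts \+ as a quantifier in basic regexes (an extension).
gnu_plus=$(printf '%s\n' "$input" | grep '^[A-Z][a-z-]\+')

# Extended regex (-E): + is a quantifier without a backslash.
ere_plus=$(printf '%s\n' "$input" | grep -E '^[A-Z][a-z-]+')

# Portable basic regex: repeat the bracket expression and use *.
portable=$(printf '%s\n' "$input" | grep '^[A-Z][a-z-][a-z-]*')

printf 'literal: [%s]\ngnu: [%s]\nere: [%s]\nportable: [%s]\n' \
  "$literal_plus" "$gnu_plus" "$ere_plus" "$portable"
```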

Allow only one number in grep Regex

I have to accept the strings that only have a single number, it doesn't matter the content of the string, it just needs to be a single number.
I was trying something like this:
echo "exaaaamplee1" | grep '[0-9]\{1\}'
This string is accepted, but this string also is accepted:
echo "exaaaamplee11" | grep '[0-9]\{1\}'
You probably want to use something like [^0-9]. This represents any character except a digit 0-9, and you can use [0-9] (or \d, but only with grep -P) for the one digit that is allowed.
Something like ^[^0-9]*[0-9][^0-9]*$ should match any string with exactly one digit. (^ being the start and $ the end of the string)
If you want to match a string with only one digit character using grep, it's
echo whatever1 | grep '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$'
Start of line followed by any number of non-digits, one digit, and then any number of non-digits until the end of the line.
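A small sketch of how this pattern could be wrapped for testing several strings. The function name has_one_digit is made up for this example; the first two strings come from the question and the third is an extra case with no digit at all.

```shell
# Hypothetical helper: succeeds only if the argument contains exactly one digit.
has_one_digit() {
  printf '%s\n' "$1" | grep -q '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$'
}

# Try the two strings from the question plus one without any digit.
for s in exaaaamplee1 exaaaamplee11 example; do
  if has_one_digit "$s"; then
    echo "$s: accepted"
  else
    echo "$s: rejected"
  fi
done
```

The anchors ^ and $ are what make the difference: without them, grep only needs to find one digit somewhere in the line, which is why the original attempt accepted both strings.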

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text file with the following lines:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change these lines with SED into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to swap the minus sign so that I get -12,90 instead of 12,90- with SED. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the regex, but I do not really understand it. Please help.
You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])
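The two branches of Group 2 can be tried out on short inputs. This sketch assumes GNU sed, since \+ and \| in a basic regular expression are GNU extensions; the sample strings are shortened, made-up fragments in the style of the question's data.

```shell
# Assumes GNU sed (\+ and \| are GNU extensions to basic regexes).
swap='s/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'

# Trailing minus at the end of the line: Group 2 matches the end of line.
end_of_line=$(printf '30-06 14956 12,90-\n' | sed "$swap")
echo "$end_of_line"

# Trailing minus followed by other text: Group 2 captures the space,
# and \2 puts it back after the number.
mid_line=$(printf '25-07 12,90- EUR\n' | sed "$swap")
echo "$mid_line"
```

Values such as 30-06 and 25-07 are left alone because the hyphen there is followed by a digit, which neither branch of Group 2 accepts.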
In your example, both files are identical, but I think I know what you mean.
For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line.
Then you want to replace the space in front of the matched digits and the comma with a dash. This will do the trick:
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/-\1/' <file.txt >file1.txt
Your first regular expression attempts to match against a string of numbers and .s, but the text contains a comma, not a .. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed, which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)
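To compare the anchored and unanchored versions from this answer, here is a short sketch. It uses the third line from the question plus a shortened made-up line, and it spells the extended-regex option as -E, which GNU sed accepts as an equivalent of -r; the \+ in the second command is a GNU extension.

```shell
# Third sample line from the question.
line='706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-'

# Anchored with $: only the amount at the end of the line is changed.
fixed=$(printf '%s\n' "$line" | sed -E 's/([0-9]+,[0-9]+)-$/-\1/')
echo "$fixed"

# Unanchored version with /g: the 25-07 field is rewritten as well.
unanchored=$(printf '25-07 12,90-\n' | sed 's/\([0-9,]\+\)-/-\1/g')
echo "$unanchored"
```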

Using grep to find a pattern beginning in a $

I need to find a pattern that starts with a $, followed by two digits, then a single character that is not a digit, and then anything else.
I know how to find a pattern starting in a dollar sign and followed by two numbers but I can't figure out how to check for one character that is not a number.
I also need to count how many lines have this pattern.
I have this so far:
grep -Ec '\$[0-9][0-9].....
I don't know what to do. Can someone please help? Any help would be much appreciated.
The caret character inverts a selection group, so if [0-9] is "match any digit" then [^0-9] is "match any non-digit".
You can possibly try this regex \$[0-9][0-9][^0-9].*
\$[0-9][0-9][^0-9].*
\$ matches the character $ literally
[0-9] match a single character present in the list below.
0-9 a single character in the range between 0 and 9
[0-9] match a single character present in the list below.
0-9 a single character in the range between 0 and 9
[^0-9] match a single character not present in the list below.
0-9 a single character in the range between 0 and 9
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
I would second @realspirituals' answer. If you need to count how many lines have this pattern, you can count the lines grep outputs by piping to wc -l. To both show the lines and count them in one fell swoop, pipe the output like so:
grep -E '\$[0-9]{2}[^0-9].*' | tee >(wc -l)
where tee splits the output between wc and STDOUT, and {2} makes the preceding [0-9] match twice. The -E option is needed for {2}, single quotes keep the shell from touching \$, and the >( ) process substitution requires a shell such as bash.
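If only the count is needed, grep -c is enough on its own. The sketch below runs the pattern over a few made-up lines; the sample data and variable names are assumptions for illustration.

```shell
# Made-up sample lines. Single quotes keep the $ signs literal.
lines='$12a foo
$123
cost $45.
$1a'

# Lines containing: $, two digits, then one non-digit character.
matching=$(printf '%s\n' "$lines" | grep -E '\$[0-9]{2}[^0-9]')
echo "$matching"

# -c prints the number of matching lines instead of the lines themselves.
count=$(printf '%s\n' "$lines" | grep -Ec '\$[0-9]{2}[^0-9]')
echo "$count"
```

The line $123 is rejected because the character after the two digits is itself a digit, and $1a has only one digit after the dollar sign.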

Regex to match ZIP code without punctuation

I have a file with a bunch of different ZIP codes:
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
I want to only match on codes that have the format 12345 or 12345-6789, but ignore all other forms.
I have my regex as:
grep -E '\<[0-9]{5}\>[^[:punct:]]|\<[0-9]{5}\>-[0-9]{4}' samplefile
It matches on the 12345-6789 because the "or" clause matches on that particular one. I am confused as to why it won't match on the first 12345 since my expression should say "match on 5 numbers but ignore any punctuation."
An expression that matches your desired output is:
egrep "^[0-9]{5}([-][0-9]{4})?$" samplefile
The expression breakdown:
^[0-9]{5} - Find a line that starts with 5 digits. ^ means start of line and [0-9]{5} means exactly five digits between zero and nine.
([-][0-9]{4})?$ - May end with a dash and four digits or nothing at all. () groups the expressions together, [-] represents the dash character, [0-9]{4} represents exactly four digits between zero and nine, ? indicates the grouped expression either exists entirely or does not exist and $ marks the end of the line.
test.dat
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
Running the expression on the test data:
mike@test:~$ egrep "^[0-9]{5}([-][0-9]{4})?$" test.dat
12345
12345-6789
12345-7890
Additional info: grep -E can alternatively be written as egrep. This also works for grep -F which is the same as fgrep and grep -r which is the same as rgrep.
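It won't match "12345" on a line by itself. The way you wrote it, the first clause needs one more character after the five digits, and that character must not be punctuation; because of \>, it also cannot be a letter, digit, or underscore, so in practice it has to be whitespace, as in "12345 ".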
Consider Mike's answer; it's clearer.
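Both points can be checked from the shell. This sketch runs the anchored expression (with -E in place of egrep) over the sample data, then tries the first clause of the original pattern on its own. It assumes GNU grep, since \< and \> are GNU extensions, and the variable names are only illustrative.

```shell
# Sample data from the question.
zips='12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678'

# Anchored expression: five digits, optionally a dash and four digits.
valid=$(printf '%s\n' "$zips" | grep -E '^[0-9]{5}(-[0-9]{4})?$')
echo "$valid"

# First clause of the original pattern: it needs one more character
# after the word boundary, so a bare 12345 gives a count of 0
# ("|| true" absorbs grep's non-zero exit) ...
bare=$(printf '12345\n' | grep -cE '\<[0-9]{5}\>[^[:punct:]]' || true)
# ... while 12345 followed by a space gives a count of 1.
spaced=$(printf '12345 x\n' | grep -cE '\<[0-9]{5}\>[^[:punct:]]' || true)
echo "$bare $spaced"
```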