Regex to match ZIP code without punctuation - regex

I have a file with a bunch of different ZIP codes:
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
I want to only match on codes that have the format 12345 or 12345-6789, but ignore all other forms.
I have my regex as:
grep -E '\<[0-9]{5}\>[^[:punct:]]|\<[0-9]{5}\>-[0-9]{4}' samplefile
It matches on the 12345-6789 because the "or" clause matches on that particular one. I am confused as to why it won't match on the first 12345 since my expression should say "match on 5 numbers but ignore any punctuation."

An expression that matches your desired output is:
egrep "^[0-9]{5}([-][0-9]{4})?$" samplefile
The expression breakdown:
^[0-9]{5} - Find a line that starts with 5 digits. ^ means start of line and [0-9]{5} means exactly five digits between zero and nine.
([-][0-9]{4})?$ - May end with a dash and four digits or nothing at all. () groups the expressions together, [-] represents the dash character, [0-9]{4} represents exactly four digits between zero and nine, ? indicates the grouped expression either exists entirely or does not exist and $ marks the end of the line.
test.dat
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
Running the expression on the test data:
mike#test:~$ egrep "^[0-9]{5}([-][0-9]{4})?$" test.dat
12345
12345-6789
12345-7890
Additional info: grep -E can alternatively be written as egrep, and grep -F as fgrep, although recent GNU grep releases mark egrep and fgrep as obsolescent and recommend the option forms. Some systems also provide rgrep as a shorthand for grep -r.

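It won't match "12345" on a line by itself, but it will match "12345 " (five digits followed by a space, for example). The way you wrote it, the first clause has to consume one more character after the five digits, and that character must not be punctuation.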
Consider Mike's answer; it's clearer.
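To see the difference between the two patterns, here is a small sketch that feeds the sample codes to grep on standard input. It assumes GNU grep (the \< and \> word anchors are an extension), and the variable names are made up for illustration.

```shell
# Sample data from the question, one code per line.
zips='12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678'

# First clause of the original regex on its own: after the five digits it
# still has to consume one non-punctuation character, so nothing here matches.
first_clause_hits=$(printf '%s\n' "$zips" | grep -cE '\<[0-9]{5}\>[^[:punct:]]' || true)

# The same clause does match once a space follows the digits.
with_space_hits=$(printf '12345 \n' | grep -cE '\<[0-9]{5}\>[^[:punct:]]' || true)

# Anchored pattern from Mike's answer: the whole line must be a ZIP code.
anchored=$(printf '%s\n' "$zips" | grep -E '^[0-9]{5}(-[0-9]{4})?$')

echo "first clause alone: $first_clause_hits match(es)"
echo "first clause with trailing space: $with_space_hits match(es)"
printf '%s\n' "$anchored"
```

The anchored pattern sidesteps the problem because $ asserts the end of the line without consuming a character.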

Related

How do I filter lines in a text file that start with a capital letter and end with a positive integer with regex on the command line in linux?

I am attempting to use regex with the grep command in the Linux terminal to filter lines in a text file that start with a capital letter and end with a positive integer. Is there a way to modify my command so that it does this all in one line with one call of grep instead of two? I am using Windows Subsystem for Linux with Ubuntu from the Microsoft Store.
Text File:
C line 1
c line 2
B line 3
d line 4
E line five
The command that I have gotten to work:
grep ^[A-Z] cap*| grep [0-9]$ cap*
The Output
C line 1
B line 3
This works, but I feel like the regex statements could be combined somehow. However,
grep ^[A-Z][0-9]$
does not yield the same result as the command above.
You need to use
grep '^[A-Z].*[0-9]$'
grep '^[[:upper:]].*[0-9]$'
The regex matches:
^ - start of string
[A-Z] / [[:upper:]] - an uppercase letter
.* - any zero or more chars ([^0-9]* matches zero or more non-digit chars)
[0-9] - a digit.
$ - end of string.
Also, if you want to make sure there is no - before the number at the end of string, you need to use a negated bracket expression, like
grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$'
Here, the POSIX ERE regex (due to the -E option) matches
^[[:upper:]] - an uppercase letter at the start and then
(.*[^-0-9])? - an optional occurrence of any text and then any char other than a digit and -
[1-9] - a non-zero digit
[0-9]* - zero or more digits
$ - end of string.
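A quick way to check the pattern against edge cases is to pipe a few lines through it. In this sketch the last three input lines (a negative number, a zero, and a two-digit number) are invented test cases, not part of the question's file.

```shell
export LC_ALL=C   # keep bracket expressions ASCII-only for this demo

lines='C line 1
c line 2
B line 3
d line 4
E line five
F line -5
G line 0
H line 10'

# Uppercase first letter, then a number at the end that has no leading
# minus sign and does not start with 0.
positive=$(printf '%s\n' "$lines" | grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$')
printf '%s\n' "$positive"
```

With this input, only "C line 1", "B line 3" and "H line 10" should be printed.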
When you use a pipeline, you want the second grep to act on standard input, not on the file you originally grepped from.
grep ^[A-Z] cap*| grep [0-9]$
However, you need to expand the second regex if you want to exclude negative numbers. Anyway, a better solution altogether might be to switch to Awk:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0' cap*
The output format will be slightly different than from grep; if you want to include the name of the matching file, you have to specify that separately:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0 { print FILENAME ":" $0 }' cap*
The regex ^[A-Z][0-9]$ matches exactly two characters: the first must be an uppercase letter and the second must be a digit. If you want to permit arbitrary text between them, that would be ^[A-Z].*[0-9]$ (for something less arbitrary, replace .* with something a bit more specific, like (.*[^-0-9])? perhaps; that needs grep -E for the parentheses and the question mark for optional, or backslashes before each of these in the BRE regex dialect you get out of the box with POSIX grep).
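As a rough comparison of the two approaches, the sketch below runs the chained greps and the Awk one-liner on standard input instead of the cap* files. The "F line -5" line is an added test case to show where they differ.

```shell
export LC_ALL=C   # keep [A-Z] limited to ASCII uppercase letters

lines='C line 1
c line 2
B line 3
d line 4
E line five
F line -5'

# Chained greps: the second grep reads the first one's output, not the file.
piped=$(printf '%s\n' "$lines" | grep '^[A-Z]' | grep '[0-9]$')

# Awk variant: the extra $NF > 0 test drops the line ending in -5.
awked=$(printf '%s\n' "$lines" | awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0')

echo "grep | grep:"; printf '%s\n' "$piped"
echo "awk:";         printf '%s\n' "$awked"
```

Note that $NF > 0 is only a numeric comparison when the last field looks like a number, so this is a sketch for input shaped like the sample rather than a general validator.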

I want to grep words that have a hyphen in the middle and start with uppercase letter + words that start with uppercase letter without hyphen

I want a regex that matches words that have a hyphen in the middle and start with an uppercase letter, as well as words that start with an uppercase letter and have no hyphen.
I also want only the first letter to be uppercase and all the others lowercase. Something like ENGLAND is not what I need, because all of its letters are uppercase.
I will give examples for all the wanted words' structure:
Wilkes-Barre
California
I have tried:
[A-Z][a-z-]\+[A-Z][a-z]\+
but it only matches things like Wilkes-Barre; it doesn't match California.
also tried
[A-Z][a-z-]\+
this one matches things like California, but it matches Wilkes-Barre as it is 2 words: Wilkes- and Barre
So if someone please can help me find the regex that matches those 2 types of words, so if grep a file that has
Wilkes-Barre
California
ENGLAND
rome
It will only match the first 2 and it will give 2 matches not 3.
You do not specify whether a single uppercase letter should match. Let's assume the answer is yes. The following should do what you want:
$ grep -E '^((^|-)[A-Z][a-z]*)+$' data.txt
Wilkes-Barre
California
It matches entire lines (because of the leading ^ and trailing $) of one or more tokens (one or more because of the +) where each token is a hyphen or the beginning of the line ((^|-)) followed by a single upper case letter ([A-Z]) and zero or more lower case letters ([a-z]*).
If there must be at least one lower case letter after the upper case letter, just replace the * by a +:
grep -E '^((^|-)[A-Z][a-z]+)+$' data.txt
These regexes also match a line like -Foobar. If this is not wanted the following excludes lines that start with a hyphen:
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$' data.txt
or (if at least one lower case letter is required):
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$' data.txt
Finally, if there is at most one hyphen (no Foo-Bar-Baz):
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)?$' data.txt
or:
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$' data.txt
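To compare the strictest and the loosest of these variants, here is a sketch that reads the words from a shell variable instead of data.txt. The last three words are extra test cases added for illustration.

```shell
export LC_ALL=C   # keep [A-Z] and [a-z] limited to ASCII letters

words='Wilkes-Barre
California
ENGLAND
rome
-Foobar
Foo-Bar-Baz
A'

# At least one lowercase letter per part, at most one hyphen.
strict=$(printf '%s\n' "$words" | grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$')

# Lowercase letters optional, any number of hyphenated parts.
loose=$(printf '%s\n' "$words" | grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$')

echo "strict:"; printf '%s\n' "$strict"
echo "loose:";  printf '%s\n' "$loose"
```

The strict form should print only Wilkes-Barre and California; the loose form additionally accepts Foo-Bar-Baz and the single letter A.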
You can use
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'
See the online demo:
#!/bin/bash
s='Wilkes-Barre
California'
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$' <<< "$s"
Output:
Wilkes-Barre
California
POSIX ERE pattern details:
^ - start of string
[[:upper:]] - an uppercase letter
[[:lower:]]+ - one or more lowercase letters
(-[[:upper:]][[:lower:]]*)? - an optional occurrence of a hyphen, an uppercase letter and then zero or more lowercase letters
$ - end of string.
NOTE: If you need to match strings with more than one hyphen, replace the last ? with *.
Normally the answer should be:
grep "^[A-Z][a-z-]+" test.txt
However, on my system the plus sign is not recognised (without -E, grep uses basic regular expressions, where an unescaped + is a literal character), so I have to go for:
grep "^[A-Z][a-z-][a-z-]*" test.txt
Explanation:
^ : start of the line
[A-Z] : all possible uppercase letters
[a-z-] : all possible lowercase letters or a hyphen
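One likely reason the plus sign is not recognised is the regex dialect: plain grep uses basic regular expressions (BRE), where an unescaped + is an ordinary character. The sketch below, using the sample words from the question, counts matching lines under three spellings; the variable names are made up.

```shell
export LC_ALL=C   # keep [A-Z] and [a-z] limited to ASCII letters

words='Wilkes-Barre
California
ENGLAND
rome'

# BRE: "+" is a literal plus sign, so no line matches.
bre_plus=$(printf '%s\n' "$words" | grep -c '^[A-Z][a-z-]+' || true)

# ERE (-E): "+" means "one or more".
ere_plus=$(printf '%s\n' "$words" | grep -cE '^[A-Z][a-z-]+' || true)

# BRE with a POSIX interval expression, equivalent to "one or more".
bre_interval=$(printf '%s\n' "$words" | grep -c '^[A-Z][a-z-]\{1,\}' || true)

echo "BRE +: $bre_plus, ERE +: $ere_plus, BRE \\{1,\\}: $bre_interval"
```

The [a-z-][a-z-]* workaround above is the same idea as the \{1,\} form, just written out by hand.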
Edit after comment
This, however, only shows the first part of Wilkes-Barre. If you want both, you might try this:
egrep "^[A-Z][a-z-]+|^[A-Z][a-z-]+[A-Z][a-z-]+" test.txt

Allow only one number in grep Regex

I have to accept the strings that only have a single number, it doesn't matter the content of the string, it just needs to be a single number.
I was trying something like this:
echo "exaaaamplee1" | grep '[0-9]\{1\}'
This string is accepted, but this string also is accepted:
echo "exaaaamplee11" | grep '[0-9]\{1\}'
You probably want to use something like [^0-9]. This represents any character except a digit 0-9, and you can use [0-9] (or \d, if you use grep -P) for the one digit that is allowed.
Something like ^[^0-9]*[0-9][^0-9]*$ should match any string with exactly one digit. (^ being the start and $ the end of the string)
If you want to match a string with only one digit character using grep, it's
echo whatever1 | grep '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$'
Start of line followed by any number of non-digits, one digit, and then any number of non-digits until the end of the line.
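A small sketch to try the pattern on a few strings; has_one_digit is a hypothetical helper name, and the last two inputs are added test cases.

```shell
# Succeeds (exit status 0) only if the argument contains exactly one digit.
has_one_digit() {
  printf '%s\n' "$1" | grep -q '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$'
}

for s in exaaaamplee1 exaaaamplee11 e1x2 nodigits; do
  if has_one_digit "$s"; then
    echo "$s: accepted"
  else
    echo "$s: rejected"
  fi
done
```

Only exaaaamplee1 should be accepted. The original '[0-9]\{1\}' accepts the others too because, without the ^ and $ anchors, grep only needs to find one digit somewhere in the line.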

Insert Decimal After Character Match in Text File

I have a CSV file that has some data values. I need to insert a decimal point after the second character when the value has 3 digits and after the third character when the value has 4 digits.
CSV File:
956,938,987,964,1004,934,1018,912
Attempted Code:
sed -e "s/\([0-9]\{2\}\)/\1./g"
Current Result:
95.6,93.8,98.7,96.4,10.04.,93.4,10.18.,91.2
Expected Result:
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
My current code (using sed) appears to work for 3-digit values but fails on 4-digit values.
You may capture 2 or more digits into 1 group, and then capture a trailing digit into another group:
s='956,938,987,964,1004,934,1018,912'
echo $s | sed 's/\([0-9]\{2,\}\)\([0-9]\)/\1.\2/g'
Output: 95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2.
Details:
\([0-9]\{2,\}\) - Group 1: two or more (\{2,\}) digits ([0-9])
\([0-9]\) - Group 2: a single digit.
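To compare the attempted command with the two-group version side by side, here is a short sketch on the question's CSV line (variable names are illustrative):

```shell
csv='956,938,987,964,1004,934,1018,912'

# Attempt from the question: a dot after every pair of digits,
# which splits 1004 into 10. and 04.
attempt=$(printf '%s\n' "$csv" | sed 's/\([0-9]\{2\}\)/\1./g')

# Two groups: as many digits as possible, then exactly one last digit.
fixed=$(printf '%s\n' "$csv" | sed 's/\([0-9]\{2,\}\)\([0-9]\)/\1.\2/g')

echo "attempt: $attempt"
echo "fixed:   $fixed"
```

The greedy \{2,\} first grabs the whole number and then gives back one digit so that the second group can match, which is why the dot always lands before the last digit.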
In awk:
$ awk '{gsub(/.(,|$)/,".&")}1' file
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
In case there are spaces or other characters around the values, you could use:
$ awk '{gsub(/[0-9] *(,|$)/,".&")}1' file
How about simply replacing
\B([0-9])\b
with
.\1
like
sed 's/\B\([0-9]\)\b/.\1/g'
Explanation:
\B Matches if the position being matched is inside a word/number sequence (not at a word boundary)
([0-9]) Matches and captures a digit
\b Matches if the position being matched is at a word/number boundary
By your examples I gather you simply want to have all numbers with one decimal. What this regex does is to match, and capture, the last digit in a multi digit number. Replacing it with itself preceded by a . gives you the desired output.
Edit
If Wiktor's concerns are an issue, change it to
\B([0-9])([0-9])\b
replaced by
\1.\2
like
sed 's/\B\([0-9]\)\([0-9]\)\b/\1.\2/g'
Looks like you are just dividing all numbers by 10, hence you can use this non-regex approach:
awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i/=10} 1' file
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
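One caveat with plain division: a value that is a multiple of 10 comes out without a decimal part. If a fixed one-decimal format is wanted, a possible variant is to format each field with sprintf. The input line below is invented to show the difference.

```shell
input='1000,956,5'

# Plain division, as above: 1000 becomes 100, not 100.0.
plain=$(printf '%s\n' "$input" |
  awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i/=10} 1')

# Same loop, but force exactly one digit after the decimal point.
one_decimal=$(printf '%s\n' "$input" |
  awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i=sprintf("%.1f", $i/10)} 1')

echo "plain:       $plain"
echo "one decimal: $one_decimal"
```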

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text-file with the following line:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change this line with SED into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to swap the minus sign so that I get -12,90 instead of 12,90- with SED. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the regex, but I do not really understand it. Please help.
You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])
In your example, both files are identical, but I think I know what you mean.
For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line.
Then you want to move the dash in front of the matched number while keeping the space that separates it from the previous field. This will do the trick:
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/ -\1/' <file.txt >file1.txt
Your first regular expression attempts to match against a string of numbers and .s, but the text contains a comma, not a .. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed (GNU sed; -E is the more widely supported spelling), which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)
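To check that anchoring at the end of the line leaves the 25-07 style fields alone, here is a sketch on shortened versions of the question's lines (the leading columns are dropped for brevity). It uses -E instead of -r, on the assumption that your sed accepts it, as current GNU and BSD versions do.

```shell
lines='Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-'

# Anchored: only a trailing "digits,digits-" at the end of the line is touched.
swapped=$(printf '%s\n' "$lines" | sed -E 's/([0-9]+,[0-9]+)-$/-\1/')

# Unanchored, for comparison: this also rewrites 25-07 into -2507.
unanchored=$(printf '%s\n' "$lines" | sed -E 's/([0-9,]+)-/-\1/g')

echo "anchored:";   printf '%s\n' "$swapped"
echo "unanchored:"; printf '%s\n' "$unanchored"
```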