Regex to match ZIP code without punctuation


I have a file with a bunch of different ZIP codes:
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
I want to only match on codes that have the format 12345 or 12345-6789, but ignore all other forms.
I have my regex as:
grep -E '\<[0-9]{5}\>[^[:punct:]]|\<[0-9]{5}\>-[0-9]{4}' samplefile
It matches on the 12345-6789 because the "or" clause matches on that particular one. I am confused as to why it won't match on the first 12345 since my expression should say "match on 5 numbers but ignore any punctuation."

An expression that matches your desired output is:
egrep "^[0-9]{5}([-][0-9]{4})?$" samplefile
The expression breakdown:
^[0-9]{5} - Find a line that starts with 5 digits. ^ means start of line and [0-9]{5} means exactly five digits between zero and nine.
([-][0-9]{4})?$ - May end with a dash and four digits or nothing at all. () groups the expressions together, [-] represents the dash character, [0-9]{4} represents exactly four digits between zero and nine, ? indicates the grouped expression either exists entirely or does not exist and $ marks the end of the line.
test.dat
12345
12345-6789
1234567890
12345:6789
12345-7890
12:1234678
Running the expression on the test data:
mike#test:~$ egrep "^[0-9]{5}([-][0-9]{4})?$" test.dat
12345
12345-6789
12345-7890
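Additional info: egrep is an older spelling of grep -E, and fgrep of grep -F; where an rgrep command is installed, it is likewise grep -r. Recent GNU grep releases print an obsolescence warning for egrep and fgrep, so grep -E and grep -F are the safer spellings.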

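The first clause cannot match a line that is just "12345": [^[:punct:]] has to consume one more character after the five digits, and at the end of the line there is none. Because of the \> in front of it, that character also has to be a non-word character, so the clause matches "12345 " (followed by a space) but not "12345a".
Consider Mike's answer; it's clearer.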
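To see the difference on concrete input, here is a minimal sketch that runs the question's first alternative and the anchored pattern over a few made-up lines. The sample lines and variable names are invented for illustration, and \< and \> are assumed to be available (they are GNU grep extensions).

```shell
# Invented sample: bare ZIP, ZIP followed by a space, ZIP followed by a letter, ZIP+4.
sample='12345
12345 x
12345a
12345-6789'

# First alternative from the question: needs one extra non-punctuation
# character after the five digits, and \> forces it to be a non-word character.
first_clause=$(printf '%s\n' "$sample" | grep -E '\<[0-9]{5}\>[^[:punct:]]')

# Anchored pattern from the answer: the whole line must be a ZIP or ZIP+4.
anchored=$(printf '%s\n' "$sample" | grep -E '^[0-9]{5}(-[0-9]{4})?$')

printf 'first clause matches:\n%s\n' "$first_clause"
printf 'anchored pattern matches:\n%s\n' "$anchored"
```

Only the line with a trailing space satisfies the first clause, while the anchored pattern accepts the two well-formed codes.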

Related

How do I filter lines in a text file that start with a capital letter and end with a positive integer with regex on the command line in linux?

I am attempting to use Regex with the grep command in the linux terminal in order to filter lines in a text file that start with Capital letter and end with a positive integer. Is there a way to modify my command so that it does this all in one line with one call of grep instead of two? I am using windows subsystem for linux and the microsoft store ubuntu.
Text File:
C line 1
c line 2
B line 3
d line 4
E line five
The command that I have gotten to work:
grep ^[A-Z] cap*| grep [0-9]$ cap*
The Output
C line 1
B line 3
This works, but I feel like the regex statement could be combined somehow, but
grep ^[A-Z][0-9]$
does not yield the same result as the command above.

You need to use
grep '^[A-Z].*[0-9]$'
grep '^[[:upper:]].*[0-9]$'
The regex matches:
^ - start of string
[A-Z] / [[:upper:]] - an uppercase letter
.* - any zero or more chars ([^0-9]* matches zero or more non-digit chars)
[0-9] - a digit.
$ - end of string.
Also, if you want to make sure there is no - before the number at the end of string, you need to use a negated bracket expression, like
grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$'
Here, the POSIX ERE regex (due to the -E option) matches
^[[:upper:]] - an uppercase letter at the start and then
(.*[^-0-9])? - an optional occurrence of any text and then any char other than a digit and -
[1-9] - a non-zero digit
[0-9]* - zero or more digits
$ - end of string.
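A small sketch comparing the two patterns above on sample lines. The lines with -7 and 0 are added here as assumptions to exercise the "positive integer" requirement, and LC_ALL=C is set only to keep character-class behaviour predictable.

```shell
# Sample lines from the question plus two invented edge cases (-7 and 0).
sample='C line 1
c line 2
B line 3
E line five
F line -7
G line 0'

# Simple pattern: uppercase first character, any digit last.
simple=$(printf '%s\n' "$sample" | LC_ALL=C grep '^[[:upper:]].*[0-9]$')

# Stricter ERE: the trailing number must start with 1-9 and must not be
# preceded by a minus sign or another digit.
strict=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[[:upper:]](.*[^-0-9])?[1-9][0-9]*$')

printf 'simple:\n%s\n' "$simple"
printf 'strict:\n%s\n' "$strict"
```

The simple pattern also lets the -7 and 0 lines through; the stricter one keeps only the lines ending in 1 and 3.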

When you use a pipeline, you want the second grep to act on standard input, not on the file you originally grepped from.
grep '^[A-Z]' cap* | grep '[0-9]$'
However, you need to expand the second regex if you want to exclude negative numbers. Anyway, a better solution altogether might be to switch to Awk:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0' cap*
The output format will be slightly different than from grep; if you want to include the name of the matching file, you have to specify that separately:
awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0 { print FILENAME ":" $0 }' cap*
The regex ^[A-Z][0-9]$ matches exactly two characters, the first of which must be an uppercase letter, and the second of which has to be a digit. If you want to permit arbitrary text between them, that would be ^[A-Z].*[0-9]$ (and for less arbitrary, use something a bit more specific than .*, like (.*[^-0-9])? perhaps, where you need grep -E for the parentheses and the question mark for optional; in the BRE dialect you get out of the box with grep, the parentheses need backslashes, and \? is a GNU extension).
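As a rough illustration of the pipeline and the Awk variant, the sketch below writes a throwaway file and runs both. The file name cap1.txt, the temporary directory, and the extra "-7" line are assumptions made for the example.

```shell
# Create a scratch file whose name matches the cap* glob from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 'C line 1' 'c line 2' 'B line 3' 'E line five' 'F line -7' > "$tmpdir/cap1.txt"

# Pipeline: the second grep reads standard input, not the files again.
piped=$(LC_ALL=C grep '^[A-Z]' "$tmpdir"/cap* | LC_ALL=C grep '[0-9]$')

# Awk: both regex conditions plus a numeric test on the last field.
positive=$(awk '/^[A-Z]/ && /[0-9]$/ && $NF > 0' "$tmpdir"/cap*)

printf 'pipeline:\n%s\n' "$piped"
printf 'awk:\n%s\n' "$positive"

rm -r "$tmpdir"
```

The pipeline still reports the line ending in -7, because it only checks that the last character is a digit; the numeric comparison in Awk filters it out.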

I want to grep words that have a hyphen in the middle and start with uppercase letter + words that start with uppercase letter without hyphen

I want a regex that matches words that have a hyphen in the middle and start with an uppercase letter, plus words that start with an uppercase letter and have no hyphen.
I also want only the first letter to be uppercase and all the others lowercase; something like ENGLAND is not what I need, because all of its letters are uppercase.
I will give examples for all the wanted words' structure:
Wilkes-Barre
California
I have tried:
[A-Z][a-z-]\+[A-Z][a-z]\+
but it only matches things like Wilkes-Barre; it doesn't match California
also tried
[A-Z][a-z-]\+
this one matches things like California, but it matches Wilkes-Barre as it is 2 words: Wilkes- and Barre
So if someone please can help me find the regex that matches those 2 types of words, so if grep a file that has
Wilkes-Barre
California
ENGLAND
rome
It will only match the first 2 and it will give 2 matches not 3.

You do not specify whether a single upper-case letter should match. Let's assume the answer is yes. The following should do what you want:
$ grep -E '^((^|-)[A-Z][a-z]*)+$' data.txt
Wilkes-Barre
California
It matches entire lines (because of the leading ^ and trailing $) of one or more tokens (one or more because of the +) where each token is a hyphen or the beginning of the line ((^|-)) followed by a single upper case letter ([A-Z]) and zero or more lower case letters ([a-z]*).
If there must be at least one lower case letter after the upper case letter, just replace the * by a +:
grep -E '^((^|-)[A-Z][a-z]+)+$' data.txt
These regexes also match a line like -Foobar. If this is not wanted the following excludes lines that start with a hyphen:
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)*$' data.txt
or (if at least one lower case letter is required):
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$' data.txt
Finally, if there is at most one hyphen (no Foo-Bar-Baz):
grep -E '^[A-Z][a-z]*(-[A-Z][a-z]*)?$' data.txt
or:
grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$' data.txt
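The difference between the * and ? quantifiers on the hyphenated group can be checked with a short sketch. The extra lines -Foobar and Foo-Bar-Baz are taken from the discussion above; the variable names and the use of LC_ALL=C (to keep [A-Z] and [a-z] limited to ASCII) are assumptions.

```shell
# Sample from the question plus the two edge cases mentioned in the answer.
sample='Wilkes-Barre
California
ENGLAND
rome
-Foobar
Foo-Bar-Baz'

# Any number of hyphenated parts, each capitalised, at least one lowercase letter.
any_hyphens=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)*$')

# At most one hyphen.
one_hyphen=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[A-Z][a-z]+(-[A-Z][a-z]+)?$')

printf 'any number of hyphens:\n%s\n' "$any_hyphens"
printf 'at most one hyphen:\n%s\n' "$one_hyphen"
```

Both variants reject ENGLAND, rome and -Foobar; only the * variant accepts Foo-Bar-Baz.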

You can use
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$'
For example:
#!/bin/bash
s='Wilkes-Barre
California'
grep -E '^[[:upper:]][[:lower:]]+(-[[:upper:]][[:lower:]]*)?$' <<< "$s"
Output:
Wilkes-Barre
California
POSIX ERE pattern details:
^ - start of string
[[:upper:]] - an uppercase letter
[[:lower:]]+ - one or more lowercase letters
(-[[:upper:]][[:lower:]]*)? - an optional occurrence of a hyphen, an uppercase letter and then zero or more lowercase letters
$ - end of string.
NOTE: If you need to match strings with more than one hyphen, replace the last ? with *.

With extended regular expressions the answer would be:
grep -E "^[A-Z][a-z-]+" test.txt
Plain grep uses basic regular expressions, where an unescaped plus sign is a literal character, so without -E I have to go for:
grep "^[A-Z][a-z-][a-z-]*" test.txt
Explanation:
^ : start of the line
[A-Z] : all possible uppercase letters
[a-z-] : all possible lowercase letters or a hyphen
Edit after comment
This, however, only shows the first part of Wilkes-Barre. If you want both, you might try this:
egrep "^[A-Z][a-z-]+|^[A-Z][a-z-]+[A-Z][a-z-]+" test.txt

Allow only one number in grep Regex

I have to accept the strings that only have a single number, it doesn't matter the content of the string, it just needs to be a single number.
I was trying something like this:
echo "exaaaamplee1" | grep '[0-9]\{1\}'
This string is accepted, but this string also is accepted:
echo "exaaaamplee11" | grep '[0-9]\{1\}'

You probably want to use something like [^0-9]. This represents any character except a digit 0-9, and you can use [0-9] for the one digit that is allowed (\d is a Perl-style shorthand that grep only understands with -P).
Something like ^[^0-9]*[0-9][^0-9]*$ should match any string with exactly one digit. (^ being the start and $ the end of the string)

If you want to match a string with only one digit character using grep, it's
echo whatever1 | grep '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$'
Start of line followed by any number of non-digits, one digit, and then any number of non-digits until the end of the line.
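A quick sketch of that pattern against a few strings; the sample strings other than the two from the question are invented for illustration.

```shell
# Two strings from the question plus invented ones with zero and two digits.
sample='exaaaamplee1
exaaaamplee11
nodigits
a1b
1a2'

# Whole line: non-digits, exactly one digit, non-digits.
single_digit=$(printf '%s\n' "$sample" | grep '^[^[:digit:]]*[[:digit:]][^[:digit:]]*$')

printf '%s\n' "$single_digit"
```

The anchors are what make this work: without ^ and $, any line containing at least one digit would match.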

Insert Decimal After Character Match in Text File

I have a CSV file that has some data values. I need to insert a decimal point after the second character when the string has 3 values and after the third character when the string has 4 values.
CSV File:
956,938,987,964,1004,934,1018,912
Attempted Code:
sed -e "s/\([0-9]\{2\}\)/\1./g"
Current Result:
95.6,93.8,98.7,96.4,10.04.,93.4,10.18.,91.2
Expected Result:
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
My current code (using sed) appears to work for 3-digit values but fails on 4-digit values.

You may capture 2 or more digits into 1 group, and then capture a trailing digit into another group:
s='956,938,987,964,1004,934,1018,912'
echo "$s" | sed 's/\([0-9]\{2,\}\)\([0-9]\)/\1.\2/g'
Output: 95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2.
Details:
\([0-9]\{2,\}\) - Group 1: two or more (\{2,\}) digits ([0-9])
\([0-9]\) - Group 2: a single digit.

In awk:
$ awk '{gsub(/.(,|$)/,".&")}1' file
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
In case there are spaces or other characters after the numbers, you could use:
$ awk '{gsub(/[0-9] *(,|$)/,".&")}1' file

How about simply replacing
\B([0-9])\b
with
.\1
like
sed 's/\B\([0-9]\)\b/.\1/g'
Explanation:
\B Matches if the position being matched is inside a word/number sequence (not a word boundary)
([0-9]) Matches and captures a digit
\b Matches if the position being matched is on a word/number boundary
By your examples I gather you simply want to have all numbers with one decimal. What this regex does is to match, and capture, the last digit in a multi digit number. Replacing it with itself preceded by a . gives you the desired output.
Edit
If Wiktor's concerns are an issue, change it to
\B([0-9])([0-9])\b
replaced by
\1.\2
like
sed 's/\B\([0-9]\)\([0-9]\)\b/\1.\2/g'

Looks like you are just dividing all numbers by 10, hence you can use this non-regex approach:
awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i/=10} 1' file
95.6,93.8,98.7,96.4,100.4,93.4,101.8,91.2
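The regex and arithmetic approaches agree on the question's data but diverge on shorter numbers. The sketch below uses an invented line containing a two-digit and a one-digit value to show this; which behaviour is wanted depends on what the data means, which the question does not say.

```shell
# Invented input with a 3-digit, a 4-digit, a 2-digit and a 1-digit value.
line='956,1004,12,7'

# sed with two capture groups: only numbers with at least three digits change.
by_regex=$(printf '%s\n' "$line" | sed 's/\([0-9]\{2,\}\)\([0-9]\)/\1.\2/g')

# awk division: every field is divided by 10, whatever its length.
by_division=$(printf '%s\n' "$line" | LC_ALL=C awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i/=10} 1')

printf 'regex:    %s\n' "$by_regex"
printf 'division: %s\n' "$by_division"
```

The sed version leaves 12 and 7 untouched, while the awk version turns them into 1.2 and 0.7.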

Swap minus sign from after the number to in front of the number using SED (and Regex)

I've got a text-file with the following line:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST 12,90-
I want to change this line with SED into:
201174480 11-01-1911 J Student 25-07 11585 2 0 SPOED BEZORGEN 1ST 25,00
320819019 11-01-1911 T. Student 28-07 13561 1 15786986 DESLORATADINE TABL OMH 5MG 60ST 3,60
706059901 11-01-1911 ST Student-Student 30-06 14956 1 15356221 METOPROLOLSUCC RET T 100MG 180ST -12,90
So I want to swap the minus sign so that I get -12,90 instead of 12,90- with SED. I tried:
try 1:
sed 's/\([0-9.]\+\)-/-\1/g' file.txt > file1.txt
try 2:
sed 's/\([0-9].\+\)-$/-\1/g' file.txt > file1.txt
So there must be something wrong with the regex, but I do not really understand it. Please help.

You may use
sed 's/\([0-9][0-9,.]\+\)-\($\|[^0-9]\)/-\1\2/g'
The point is that after matching a number and a - (see \([0-9][0-9,.]\+\)-), there should come either end of string or non-digit (\($\|[^0-9]\)). Thus, we have 2 capturing groups now, and that is why we need a second backreference in the replacement pattern (\2).
I added a dot . to the bracket expression just in case you have mixed number formats, you may remove it if you always have a comma as the decimal separator.
Pattern details:
\([0-9][0-9,.]\+\) - Group 1 capturing
[0-9] - a digit
[0-9,.]\+ - one or more digits, commas or dots
- - a literal hyphen
\($\|[^0-9]\) - Group 2 capturing the end of string $ or a non-digit ([^0-9])

In your example, both files are identical, but I think I know what you mean.
For this particular file, you want to match a space, followed by zero or more digits, followed by a comma, followed by at least one digit, followed by a dash, followed by zero or more spaces to the end of the line.
Then you want to put the dash in front of the matched number, keeping the space that separates it from the previous field. This will do the trick:
sed -e 's/ \([0-9]*,[0-9][0-9]*\)- *$/ -\1/' <file.txt >file1.txt

Your first regular expression attempts to match against a string of numbers and .s, but the text contains a comma, not a .. It does the substitution you want if you replace [0-9.] with [0-9,], giving:
sed 's/\([0-9,]\+\)-/-\1/g' file.txt > file1.txt
However, it also replaces 25-07 in that case with -2507. I suggest you explicitly match against the end of the line:
sed 's/\([0-9,]\+\)-$/-\1/g'
or alternatively, you can demand that the match contains exactly one comma:
sed 's/\([0-9]\+,[0-9]\+\)-$/-\1/g'
I also find these things easier to read if you use the -r option to sed, which enables "extended regular expressions":
sed -r 's/([0-9]+,[0-9]+)-$/-\1/g'
Fewer special characters need to be escaped (on the other hand, more literal characters need to be escaped, but I find that tends to be a rarer occurrence).
(Aside: note that . usually means "any character", but inside a character class [.] it means "literally a .", since after all having it mean "any character" in there would be pretty useless.)
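To make the effect of the end-of-line anchor visible, here is a small sketch on a shortened, invented line that keeps the two relevant fields (a 30-06 date-like field and the trailing 12,90-). It uses sed -E so that + needs no backslash.

```shell
# Shortened, invented line with a hyphenated field and a trailing-minus amount.
line='30-06 14956 180ST 12,90-'

# Without an anchor, every "digits followed by a dash" is rewritten.
unanchored=$(printf '%s\n' "$line" | sed -E 's/([0-9,]+)-/-\1/g')

# Anchored at end of line and requiring one comma: only the amount changes.
anchored=$(printf '%s\n' "$line" | sed -E 's/([0-9]+,[0-9]+)-$/-\1/')

printf 'unanchored: %s\n' "$unanchored"
printf 'anchored:   %s\n' "$anchored"
```

The unanchored version mangles 30-06 into -3006, which is the same problem described above for 25-07.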
