Matching only 5xx using regex

I want to find all numbers between 500 and 599. I'm very new to regex, and I came up with this:
5[0-9][0-9]+
This works fine, matching 566, 577, and 500. But it also matches 6578, which I don't want.
Edit:
Here is my file contents:
asd 554
sad
sads
dsa
456
sa
d
dsa
asda
d500
521
519 asdasd
524 asdasdsdsadsdasd sadsadsadasdsd asdsa dsa dsadsad sad asdas dsa sad sad asds a 543
As many suggested, I tried:
grep "^5[0-9]{2}$" test
which isn't finding any numbers at all!
How do I put a constraint on this?

If you want to match 5xx only when it is alone on a line, and not when 5xx occurs as part of x5xx:
^5\d{2}$
\d = digit (Perl-style shorthand; GNU grep needs -P for it, otherwise write [0-9])
^ = beginning of line
$ = end of line
EDIT:
Based on additional details in the question, the number is not always alone on its line (there may be spaces or other text around it), so you want the following instead:
(^|\s)5\d{2}(\s|$)
This matches 5xx only when it has whitespace, or the start or end of the line, on either side.
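A likely reason the anchored command from the question finds nothing: without -E, grep uses basic regular expressions, where {2} is read literally rather than as a repetition count. Here is a minimal sketch of the extended form, assuming a few sample lines modelled on the question's file (the variable names are made up for illustration). Note that it only finds lines consisting of the number alone.

```shell
# Sample lines loosely modelled on the question's file (illustrative assumption).
sample=$(printf '%s\n' 'asd 554' '456' 'd500' '521' '519 asdasd' '6578')

# -E switches to extended regular expressions, where {2} means "exactly two".
# ^ and $ restrict the match to lines that contain nothing but the number.
anchored=$(printf '%s\n' "$sample" | grep -E '^5[0-9]{2}$')
printf '%s\n' "$anchored"
```

In basic mode the same repetition would be written \{2\}.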

With grep the easiest way is to use -w to only match whole words:
grep --color=always -w "5[0-9][0-9]" test
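To see what -w selects, here is a small sketch on sample lines modelled on the question's file (the sample and variable names are assumptions). Lines where 5xx is glued to other word characters, such as d500 or 6578, are skipped.

```shell
# Sample lines loosely modelled on the question's file (illustrative assumption).
sample=$(printf '%s\n' 'asd 554' '456' 'd500' '521' '519 asdasd' '6578')

# -w only accepts matches that are not adjacent to letters, digits or underscores.
whole_words=$(printf '%s\n' "$sample" | grep -w '5[0-9][0-9]')
printf '%s\n' "$whole_words"
```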

Remove the + sign:
5[0-9][0-9]
This will match a "5" followed by exactly two digits. Note that without anchors or word boundaries it still matches inside a longer number, e.g. the 578 in 6578.

You need to describe more precisely what should happen with, e.g., 6578. If you want 578 in the output (because after the "6" there is a sequence of characters matching your format 5xx), you can simply do
grep -o "5[0-9][0-9]"
Note that, unlike the other answers, the -o flag prints only the matched parts, so it emits multiple numbers from a single line if needed.
If, on the other hand, you want to match words of format 5xx, you can add -w flag, too:
grep -o -w "5[0-9][0-9]"
For more complex matching rules, use the -E flag (extended regular expressions) instead, possibly with a much more complex regex.
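The difference between the two commands can be sketched on a single made-up line (the line and variable names are assumptions for illustration):

```shell
# One hypothetical input line with two standalone 5xx numbers and one embedded in 6578.
line='524 asdasd 6578 sad 543'

# -o alone prints every 5xx sequence, including the 578 inside 6578.
any_5xx=$(printf '%s\n' "$line" | grep -o '5[0-9][0-9]')

# Adding -w keeps only the sequences that stand alone as words.
word_5xx=$(printf '%s\n' "$line" | grep -o -w '5[0-9][0-9]')

printf 'with -o:\n%s\n' "$any_5xx"
printf 'with -o -w:\n%s\n' "$word_5xx"
```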

Related

Match lines containing even numbers with grep

I am doing a series of questions regarding grep and I have gotten stuck on trying to match lines containing even numbers in any way (so, it should match 'hello22 23', '8', '2222 2999 1', 'hello2hello9', etc.)
The problem is that, while I managed to match all of those cases, I cannot find a way to match cases in which the line either contains exclusively an even number or it's the last occurrence before an EOL ('22', 'hello8', anything that ends with a number which should match).
So far, this is what I'm using:
grep -P '((.)*[02468][^0-9](.)*)'
The above matches anything followed by an even number with no numbers whatsoever after it, followed by anything else.
I have tried playing with the '$' regex which should match it, with no effect. Could it be maybe that grep isn't detecting my EOLs properly?
I think I understand what you're after: you want to skip lines that contain even digits but whose numbers are all odd, such as 3, 23, a23, 23a, 3a49. You want to match lines that have at least one even number: 2, 22, 32, a32, 32a, 45a5bb44, etc.
The command grep -P '[02468](?=\D|$)' uses a lookahead to require at least one even digit that is followed by a non-digit or by the end of the line, which should fit your requirements.
$ cat test.txt
3
23
a23
23a
3a49
2
22
32
a32
32a
45a5bb44
$ grep -P '[02468](?=\D|$)' test.txt
2
22
32
a32
32a
45a5bb44
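If grep -P is not available, the same check can probably be done without a lookahead, since only the matching lines are printed and not the match itself. A sketch using extended regular expressions, on the same sample data (the variable names are assumptions):

```shell
# Same sample lines as in test.txt above.
numbers=$(printf '%s\n' 3 23 a23 23a 3a49 2 22 32 a32 32a 45a5bb44)

# An even digit followed by either a non-digit or the end of the line.
even_lines=$(printf '%s\n' "$numbers" | grep -E '[02468]([^0-9]|$)')
printf '%s\n' "$even_lines"
```

Unlike the lookahead version, this one consumes the character after the digit, which only matters if you add -o.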

Remove artificially generated words from a word list in Linux?

Hello everyone and good day!
I have the following question: I have a word list that consists of normal words as well as artificially generated words.
example:
Ford
09mKGmaePnCmjkxm
Opel
0AACyvG0FtRHAU7i
Audi
0AR6V7cCy2phgXcv
BMW
0bDOlBY5VGAe5Vai
Alfa-Romeo
Mercedes
Pegout-323
0BDTwSCCrCy4VgEc
0cmolI8g4CerXKaH
0dL2m36014PmOetH
0dqjCZU7ZeRuovFF
0ekelbAnWcGC1c7n
Lada 2109
Lada 2106
0ER4tS8jhESXuISp
0Gao8qHgbEyZ06Bh
0j1pjZBAW2avxU6Z
0j5zBVhdPDyaVoZL
Toyouta
0Jn0qoKdnM6neGdx
0KlzXttiw81AvU2C
0kXzuEtHxiWfECw7
mitsubisi
0l8qW9Uv0V1DZPei
0LJQxUNuEp42txme
jeep
0m8G1GUytcETbtWv
0MexVW3TQ2sRqLjr
I want to remove all artificially generated words from this list.
I have converted such words to REGEX and saved them in a new file "Generic.txt":
[0-9][0-9][a-z][A-Z][A-Z][a-z][a-z][a-z][A-Z][a-z][A-Z][a-z][a-z][a-z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][a-z][A-Z][0-9][A-Z][a-z][A-Z][A-Z][A-Z][A-Z][0-9][a-z]
[0-9][A-Z][A-Z][0-9][A-Z][0-9][a-z][A-Z][a-z][0-9][a-z][a-z][a-z][A-Z][a-z][a-z]
[0-9][a-z][A-Z][A-Z][a-z][A-Z][A-Z][0-9][A-Z][A-Z][A-Z][a-z][0-9][A-Z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][A-Z][a-z][A-Z][a-z][0-9][A-Z][a-z][A-Z][a-z]
[0-9][a-z][a-z][a-z][a-z][A-Z][0-9][a-z][0-9][A-Z][a-z][a-z][A-Z][A-Z][a-z][A-Z]
[0-9][a-z][A-Z][0-9][a-z][0-9][0-9][0-9][0-9][0-9][A-Z][a-z][A-Z][a-z][a-z][A-Z]
[0-9][a-z][a-z][a-z][A-Z][A-Z][A-Z][0-9][A-Z][a-z][A-Z][a-z][a-z][a-z][A-Z][A-Z]
[0-9][a-z][a-z][a-z][a-z][a-z][A-Z][a-z][A-Z][a-z][A-Z][A-Z][0-9][a-z][0-9][a-z]
[0-9][A-Z][A-Z][0-9][a-z][A-Z][0-9][a-z][a-z][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][a-z]
[0-9][A-Z][a-z][a-z][0-9][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][0-9][0-9][A-Z][a-z]
[0-9][a-z][0-9][a-z][a-z][A-Z][A-Z][A-Z][A-Z][0-9][a-z][a-z][a-z][A-Z][0-9][A-Z]
[0-9][a-z][0-9][a-z][A-Z][A-Z][a-z][a-z][A-Z][A-Z][a-z][a-z][A-Z][a-z][A-Z][A-Z]
[0-9][A-Z][a-z][0-9][a-z][a-z][A-Z][a-z][a-z][A-Z][0-9][a-z][a-z][A-Z][a-z][a-z]
[0-9][A-Z][a-z][a-z][A-Z][a-z][a-z][a-z][a-z][0-9][0-9][A-Z][a-z][A-Z][0-9][A-Z]
[0-9][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][a-z][a-z][A-Z][a-z][A-Z][A-Z][a-z][0-9]
[0-9][a-z][0-9][a-z][A-Z][0-9][A-Z][a-z][0-9][A-Z][0-9][A-Z][A-Z][A-Z][a-z][a-z]
[0-9][A-Z][A-Z][A-Z][a-z][A-Z][A-Z][a-z][A-Z][a-z][0-9][0-9][a-z][a-z][a-z][a-z]
[0-9][a-z][0-9][A-Z][0-9][A-Z][A-Z][a-z][a-z][a-z][A-Z][A-Z][a-z][a-z][A-Z][a-z]
[0-9][A-Z][a-z][a-z][A-Z][A-Z][0-9][A-Z][A-Z][0-9][a-z][A-Z][a-z][A-Z][a-z][a-z]
Now I want to delete from the word list "base.txt" all words that match this regex. They can also be larger than 16 characters!
I use the following command:
LC_ALL=C grep -F -f generic.txt base.txt > test.txt
Unfortunately I get no results, but also no error messages. What am I doing wrong?
Basically I want grep to check the file "base.txt" for every line from the file "generic.txt" and extract these lines into a new file.
The following list should remain at the end:
Ford
Opel
Audi
BMW
Alfa-Romeo
Mercedes
Pegout-323
Lada 2109
Lada 2106
Toyouta
mitsubisi
jeep
TIA
Sergio
The immediate error is that the -F option disables regular expressions entirely, and requires the text to match the pattern literally. (So for example [0-9] matches the literal string [0-9] and no other strings.)
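A quick way to see the effect of -F, using two made-up input lines (assumptions for illustration only):

```shell
# With -F the pattern is a fixed string, so only the literal text "[0-9]" matches.
fixed=$(printf '%s\n' 'abc5' 'see [0-9] here' | grep -F '[0-9]')

# Without -F the same pattern is a bracket expression that matches any digit.
regex=$(printf '%s\n' 'abc5' 'see [0-9] here' | grep '[0-9]')

printf 'with -F:\n%s\n' "$fixed"
printf 'without -F:\n%s\n' "$regex"
```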
Probably a better approach entirely is to try to generalize this absurd list of patterns to a single pattern, or a very small list of patterns. How did you come up with this list?
For example
grep -E '^[A-Za-z0-9]{16}$' base.txt
seems to extract only the (apparently) generated strings in your example; adding -v inverts the match and keeps the remaining words instead.
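Since the goal is to keep the normal words, the match can be inverted with -v. A sketch on a few lines from the question's list, assuming that no real word is exactly 16 alphanumeric characters long (such a word would be dropped too):

```shell
# A few entries taken from the question's word list.
words=$(printf '%s\n' 'Ford' '09mKGmaePnCmjkxm' 'Opel' '0AACyvG0FtRHAU7i' 'Alfa-Romeo' 'Lada 2109' 'mitsubisi')

# -v prints the lines that do NOT match, i.e. everything except
# lines made of exactly 16 letters and digits.
kept=$(printf '%s\n' "$words" | LC_ALL=C grep -v -E '^[A-Za-z0-9]{16}$')
printf '%s\n' "$kept"
```

Generated words longer than 16 characters would need {16,} instead of {16}, at the cost of also removing long real words without spaces or hyphens.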
The problem is the definition of a "word": why should Ford be a valid word while, e.g., F0rd is not? That said, for your given list, you could use
^[a-zA-Z]+(?:[- ]\w+)?$
See a demo on regex101.com.
Another solution relies on the observation that a real word never starts with a digit, so any 16-character line that starts with a digit is skipped (this uses PCRE backtracking control verbs, so it needs a PCRE engine such as grep -P):
^[0-9].{15}$(*SKIP)(*FAIL)|^.+
See another demo for this one on regex101.com.
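If the rule "a real word never starts with a digit" is good enough for your list, plain grep can apply it without PCRE. This is a simplification of the pattern above, not an exact equivalent, and the sample lines are only a subset of the question's list:

```shell
# A few entries taken from the question's word list.
words=$(printf '%s\n' 'Ford' '0bDOlBY5VGAe5Vai' 'Pegout-323' '0dL2m36014PmOetH' 'Lada 2106' 'jeep')

# Drop every line whose first character is a digit, regardless of its length.
no_digit_start=$(printf '%s\n' "$words" | grep -v '^[0-9]')
printf '%s\n' "$no_digit_start"
```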

How to use zgrep to display all words of a x size from a wordlist?

I want to display all the words from my wordlist that start with a w and are 9 letters long. Yesterday I learnt a bit more about how to use zgrep, so I came up with:
zgrep '\(^w\)\(^.........$\)' a.gz
But this doesn't work, and I think it's because I don't know how to do an AND between the two conditions. I found that it should be (?=expr)(?=expr), but I can't figure out how to build my command with it.
So how can I build my command using the (?=expr) ?
for example if I have a wordlist like this:
Washington
Sausage
Walalalalalaaaa --> shouldn't match
Wwwwwwwww --> should match
You may use
zgrep '^w[[:alpha:]]\{8\}$' a.gz
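The POSIX BRE pattern will match a string that
^w - starts with w
[[:alpha:]]\{8\} - then has eight letters
$ - and then ends (end-of-string anchor).
The match is case-sensitive, so add -i if words starting with an uppercase W, as in your example, should match too.
Also, see section 9.3, Basic Regular Expressions, of the POSIX specification.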
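The pattern can be tried on plain text before running it against the compressed file. The sketch below uses grep on a made-up word list (the words are assumptions, not the contents of a.gz) and adds -i so that a capital W counts as well; the same pattern and option should carry over to zgrep.

```shell
# Hypothetical word list; only two entries are 9 letters long and start with w/W.
wordlist=$(printf '%s\n' 'Washington' 'Sausage' 'Wonderful' 'wordlists' 'water')

# ^w, then exactly eight more letters, then end of line; -i ignores case.
nine_letter_w=$(printf '%s\n' "$wordlist" | grep -i '^w[[:alpha:]]\{8\}$')
printf '%s\n' "$nine_letter_w"
```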

regex to match lines with coordinates ending in zero

given the following:
1803 1004 -4.2
1807 1005 3.3
1809 1006 -8.9
1800 1007 -3.7
1805 1008 9.1
1808 1009 -4.3
1800 1000 3.2
I'd like regex to match a line with the two first coordinates that are ending in zero, so we'd only return:
1800 1000 3.2
I only want lines that have both the first two digits ending in zero, and yes the lines will have large quantities of whitespace either at the start or between the digits.
I've tried various combinations of '\s*\d+0\z*\d+0*' and '\d+0\s\d+0*' with no result.
I'm using this in combination with grep.
I recommend grep's -E option:
$ grep -E '^ *([0-9]*0) +([0-9]*0) +.*$' dataFile
Result:
1800 1000 3.2
In action: https://regex101.com/r/h4on2q/1
Additionally, about -E, from man grep:
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Basic vs Extended Regular Expressions:
https://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html
Give this a try (with grep, \d requires the -P option): ^\s*\d+0\s+\d+0\s+.*$
In action: https://regex101.com/r/t0hhDL/2
It's not clear from your question whether the data you're working with is all one big string, or these are multiple lines being returned. I assumed the latter with the answer above, but the pattern will need to be slightly different if that's not the case.
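For plain grep -E without \d or \s, a variant with POSIX character classes might look like this. It is a sketch assuming one record per line, with arbitrary spaces or tabs before and between the columns; the sample spacing is made up:

```shell
# Sample rows from the question with made-up amounts of whitespace.
data=$(printf '%s\n' '  1803   1004  -4.2' '1800 1007 -3.7' '   1800    1000   3.2' '1808 1009 -4.3')

# Optional leading whitespace, a number ending in 0, whitespace,
# a second number ending in 0, then whitespace before the third column.
both_zero=$(printf '%s\n' "$data" | grep -E '^[[:space:]]*[0-9]*0[[:space:]]+[0-9]*0[[:space:]]')
printf '%s\n' "$both_zero"
```

The trailing [[:space:]] keeps the second number from matching only a prefix, e.g. the 100 in 1007.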

Regex match and grouping

Here's a sample string that I want to run a regex on:
101-nocola_conte_-_fuoco_fatuo_(koop_remix)
The first digit in "101" is the disc number and the next 2 digits are the track numbers. How do I match the track numbers and ignore the disc number (first digit)?
Something like
/^\d(\d\d)/
would match one digit at the start of the string, then capture the following two digits.
Do you mean that you don't mind what the disc number is, but you want to match, say, track number 01?
In perl you would match it like so: "^[0-9]01.*"
or more simply "^.01.*" - which means that you don't even mind if the first char is not a digit.
^\d(\d\d)
You may need \ in front of the ( and ) depending on the environment in which you intend to run the regex (like vi(1)).
Which programming language? For the shell, something with grep -E will do the job:
echo '101-nocola_conte_-_fuoco_fatuo_(koop_remix)' | grep -E -o '^[0-9]{3}' | grep -E -o '[0-9]{2}$'
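The capture-group idea from the earlier answers can also be used directly in the shell. A sketch with sed -E, assuming the name always starts with one disc digit followed by two track digits (the variable names are made up):

```shell
name='101-nocola_conte_-_fuoco_fatuo_(koop_remix)'

# Skip the first digit, capture the next two, and replace the whole line with the capture.
track=$(printf '%s\n' "$name" | sed -E 's/^[0-9]([0-9]{2}).*/\1/')
printf '%s\n' "$track"
```

If the name does not start with three digits, sed leaves the line unchanged, so you may want to check the result before using it.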