grep to search for a specified pattern - regex

I want to grep all the lines in a file that contain symbols (non-alphanumeric characters), start with a number, or have spaces in them:
grep -i "^[0-9]\|[^a-zA-Z0-9]\| "
The grep command above works perfectly. However, I also want to include lines whose length falls outside a given range, for example all lines shorter than 3 or longer than 15 characters.
How can I include that length limit in the same command?
I tried using
{3,15}
and similar, but could not get the desired output.
Sample input
aa
9dsa
abcd
abc#$
ab d
Sample output
aa //because length less than 3
ab d //because has space in between
9dsa // because starts with a number
abc#$ //because has special symbols in it

For clarity, simplicity, robustness, portability, etc., just use awk instead of grep to search for non-trivial conditions:
$ awk 'length()<3 || length()>15 || /[^[:alnum:]]/ || /[[:space:]]/ || /^[0-9]/' file
aa
9dsa
abc#$
ab d
I mean seriously, that couldn't get much clearer/simpler and it will work in any POSIX awk and it's trivial to change if/when your requirements change.

The expression below should find the required lines. I am assuming you will use grep -E so that the alternation works properly:
^[[:digit:]]|[@#$%^&*()]|^.{0,2}$|^.{16,}$
Explanation of the regex:
^[[:digit:]] - Match a line that starts with a number
[@#$%^&*()] - Match a line containing one of the specified symbols.
Alternatively you can use [^[:alnum:]] if you want
the symbol to match any non-alphanumeric character.
Beware that a space, underscore, tab, quote, etc. are all
examples of non-alphanumeric characters
^.{0,2}$ - Match a line containing fewer than 3 characters
^.{16,}$ - Match a line containing more than 15 characters
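As a quick check, the combined expression can be run against the sample input from the question. This is only a sketch: it assumes "less than 3" means at most 2 characters and "more than 15" means at least 16, it uses [^[:alnum:]] for the symbol test so that spaces are caught as well, and the function name find_odd_lines is made up for illustration.

```shell
# Hypothetical helper: print lines that start with a digit, contain a
# non-alphanumeric character (spaces included), or have a length outside 3..15.
find_odd_lines() {
    grep -E '^[[:digit:]]|[^[:alnum:]]|^.{0,2}$|^.{16,}$'
}

# Sample input from the question; abcd is the only line that is filtered out.
printf '%s\n' 'aa' '9dsa' 'abcd' 'abc#$' 'ab d' | find_odd_lines
```

Note that an empty line also counts as "shorter than 3" here, so it would be printed too.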

Related

Sed: Searching for a string length and character specific

I've been searching for a few hours now for a way to find a string containing 21 numeric characters and place a newline in front of the string itself. I finally found a solution using:
sed -r 's/\b[0-9]{21}\b/\n&/g'
Works great!
Now I have a new set of data that contains 21 numeric characters, but in addition there are some alphabetic characters at the end of the string, with a variable length of 3 to 10 characters.
Sample input:
169349870913736210308ABC
168232727246529300209DEFGHI
166587299965005120122JKLMNOPQRS
162411281984306600005TUVWXYZ
What I would like is to have a space between the numeric and the alphabetic characters:
169349870913736210308 ABC
168232727246529300209 DEFGHI
166587299965005120122 JKLMNOPQRS
162411281984306600005 TUVWXYZ
Do note the 16 which every number starts with. I've tried using:
sed -r 's/^\b[0-9]{21}\+[A-Z]{3,10}\b/ /g' filename
But I couldn't get it to work because I don't know, and couldn't find, how to search specifically for a string containing an exact number of numeric characters followed by alphabetic characters of a variable length. I've found a lot of helpful questions on this website, but not this one.

Use capturing groups:
sed -r 's/^([0-9]{21})([A-Z]{3,10})$/\1 \2/' filename
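To see the capture groups at work, here is a small sketch run against two of the sample lines. It uses -E as the more widely accepted spelling of -r (an assumption: your sed takes -E, as current GNU and BSD versions do), and the function name split_id is made up for illustration.

```shell
# Hypothetical helper: insert a space between a 21-digit prefix and a
# trailing run of 3 to 10 uppercase letters; other lines pass through unchanged.
split_id() {
    sed -E 's/^([0-9]{21})([A-Z]{3,10})$/\1 \2/'
}

printf '%s\n' '169349870913736210308ABC' '166587299965005120122JKLMNOPQRS' | split_id
```

Because the pattern is anchored at both ends, a line with too few digits or with more than 10 letters is left untouched rather than split in the wrong place.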

Search from left to right for the first non-numeric character ([^0-9]) and replace it with a space followed by the matched character (&):
sed 's/[^0-9]/ &/' file
Output:
169349870913736210308 ABC
168232727246529300209 DEFGHI
166587299965005120122 JKLMNOPQRS
162411281984306600005 TUVWXYZ

Why not just add a space after the first 21 characters, like so:
sed 's/^...................../& /g'

Using grep to find keywords, and then list the following characters until the next ; character

I have a long list of chemical conditions in the following form:
0.2M sodium acetate; 0.3M ammonium thiosulfate;
The molarities can be listed in various ways:
x.xM, x.x M, x M
where the number of x digits varies. I want to do two things: select those numbers using grep, and then list only the characters that follow, up to the next ;. So if I select 0.2M in the example above, I want to be able to list sodium acetate.
For selecting, I have tried the following:
grep '[0-9]*.[0-9]*[[:space:]]*M' file
so that there are arbitrary number of digits and spaces, but it always ends with M. The problem is, it also selects the following:
0.05MRbCl+MgCl2;
I am not quite sure why this is selected. Ideally, I would want 0.05M to be selected, and then list RbCl+MgCl2. How can I achieve this?
(The system is OS X Yosemite)

It matches that because:
[0-9]* matches 0
. matches any character (this is the . in this case, but you probably meant to escape it)
[0-9]* matches 05
[[:space:]]* matches the empty string between 05 and M
M matches M
As for how to do what you want: I think that if you don't want the numbers to be printed with the output, this would require either a lookbehind assertion or the ability to print a specific capture group, which it sounds like OS X's grep doesn't support. You could use a similar approach with a slightly more powerful tool, though:
$ cat test.txt
0.2M sodium acetate; 0.3M ammonium thiosulfate;
0.05MRbCl+MgCl2;
1.23M dihydrogen monoxide;
45 M xenon quadroxide;
$ perl -ne 'while (/([0-9]*\.)?[0-9]+\s*M\s*([^;]+)/g) { print "$2\n"; }' test.txt
sodium acetate
ammonium thiosulfate
RbCl+MgCl2
dihydrogen monoxide
xenon quadroxide
Written out, that regex is:
([0-9]*\.)? optionally, some digits and a decimal point
[0-9]+ one or more digits
\s*M\s* the letter M, with spacing around it
([^;]+) all the characters up until the next semicolon (the thing you want to print)
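If Perl is not an option, roughly the same pieces can be put together with tr and sed. This is a sketch under a few assumptions: every entry ends with ; and starts with its molarity, your sed accepts -E, and the function name list_compounds is made up for illustration. Unlike the Perl loop, it anchors the pattern at the start of each entry.

```shell
# Hypothetical helper: print the text that follows each molarity, one per line.
list_compounds() {
    # Split the input into one entry per line, then keep only what follows
    # "<number> M" at the start of each entry.
    tr ';' '\n' |
        sed -nE 's/^[[:space:]]*([0-9]*\.)?[0-9]+[[:space:]]*M[[:space:]]*(.+)$/\2/p'
}

printf '%s\n' '0.2M sodium acetate; 0.3M ammonium thiosulfate;' '0.05MRbCl+MgCl2;' |
    list_compounds
```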

With GNU awk for multi-char RS, gensub() and \s:
$ awk -vRS=';\\s*' -vm='0.2M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.2M sodium acetate
$ awk -vRS=';\\s*' -vm='0.05M' 'm==gensub(/\s*([0-9.]+)\s*M.*/,"\\1M","")' file
0.05MRbCl+MgCl2

How do I specify a regex of certain length where I want to disallow characters instead of allowing them?

I need a regular expression which will validate a string to have length 7 and doesn't contain vowels, number 0 and number 1.
I know about character classes like [a-z] but it seems a pain to have to specify every possibility that way: [2-9~!@#$%^&*()b-df-hj-np-t...]
For example:
If I pass a String June2013 - it should fail because length of the string is 8 and it contains 2 vowels and number 0 and 1.
If I pass a String XYZ2003 - it should fail because it contains 0.
If I pass a String XYZ2223 - it should pass.
Thanks in advance!

So that would be something like this:
^[^aeiouAEIOU01]{7}$
The ^$ anchors ensure there's nothing in there but what you specify, the character class [^...] means any character except those listed and the {7} means exactly seven of them.
That's following the English definition of vowel, other cultures may have a different idea as to what constitutes voweliness.
Based on your test data, the results are:
pax> echo 'June2013' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2003' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2223' | egrep '^[^aeiouAEIOU01]{7}$'
XYZ2223

In regex flavors that support inline flags (Perl, PCRE, Java and similar), this is the briefest way to express it:
(?i)^[^aeiou01]{7}$
The term (?i) means "ignore case", which saves typing both upper- and lowercase vowels.
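grep -E has no inline (?i) flag, so on the command line the same idea would use the -i option instead. A minimal sketch, assuming your grep applies -i to negated bracket expressions as well (GNU grep should); the function name is_valid_code is made up for illustration.

```shell
# Hypothetical helper: succeed if the argument is exactly 7 characters long
# and contains no vowel (either case), no 0 and no 1.
is_valid_code() {
    printf '%s\n' "$1" | grep -Eiq '^[^aeiou01]{7}$'
}

# The three test strings from the question.
for s in June2013 XYZ2003 XYZ2223; do
    if is_valid_code "$s"; then echo "$s: pass"; else echo "$s: fail"; fi
done
```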

Regex: Match any character (including whitespace) except a comma

I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")

To work with multiple lines in sed using a regex, you should look at sed's multi-line commands.
EDIT:
In sed, working with newlines is a bit different. sed supports three commands for multi-line operations: N, P and D. To see how they work, see the "Working with Multiple Lines" explanation, which discusses all three.
My guess is that the N command is the piece that is missing here. Adding N lets the pattern see \n in the pattern space.
An example from that explanation:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y

So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure these are not useable within vim though, and I don't know how to replicate this within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.

string pattern and regex

I have a file with different lines, among which I have some lines like
173.194.034.006.00080-138.096.201.072.49934
the pattern is 3 numbers and then a dot and then 3 numbers and then a dot, etc.
I want to use awk, grep, or sed for this purpose. How do I express this regular expression?

Assuming you want the lines in which one series like 123. exists, do
grep '[0-9][0-9][0-9]\.' file > numbersFile
If you want 2 series like 123.345., then do
grep '[0-9][0-9][0-9]\.[0-9][0-9][0-9]\.' file > numbersFile
etc, etc.
Each [0-9] means match only one occurrence of a character in the range 0-9 (0,1,2,3,4,5,6,7,8,9).
Because the '.' char has a special meaning in a normal grep regexp, you have to escape it like \. to indicate "Just match the '.' char (only!)" ;-)
There are fancy extensions to grep that allow you to specify the pattern once, and include a quantifier like {3} or sometimes \{3\} (to indicate 3 repetitions). But this extension isn't portable to older Unix like Solaris, AIX, and others.
Here's a simple test to see if your system supports quantifiers. (Super Grep-heads are welcome to correct my terminology :-).
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{10\}\.'
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{2\}\.'
The first test should fail; the 2nd will succeed if your grep supports quantifiers.
It doesn't hurt to learn the long-hand solution (as above), and you can be sure this will work with any grep.
IHTH.

In awk I'd probably build up the string and then search for it as:
BEGIN {
    p = "[.]"
    d = "[[:digit:]]"
    d3 = d d d # or d"{3}"
    d5 = d d d d d # or d"{5}"
    re = d3 p d3 p d3 p d3 p d5 # or "(" d3 p "){4}" d5
}
$0 ~ re "-" re
but it really all depends what you want to do with it.

By the look of it, these are IP addresses, followed by a port number, a dash and then the IP address/port number combination again.
If you're on a modern UNIX/Linux system then
grep -P '(\d{3}\.){4}\d{5}-(\d{3}\.){4}\d{5}'
would do the trick, although it may not be the most portable way to do it. This uses the '-P' option for "use Perl regular expressions", which some people might consider to be cheating!
You didn't say if you've got extra text either before or after these strings on the line. If you have then you can use the '-o' option just to extract the matched text and ignore everything else.
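The same match can be written without -P by using extended regular expressions, combined with -o to pull the pair out of a longer line. A sketch only: -o is not part of POSIX grep but is accepted by GNU and BSD grep, and the function name extract_pairs is made up for illustration.

```shell
# Hypothetical helper: print only the address/port pairs found on each line.
extract_pairs() {
    grep -oE '([0-9]{3}\.){4}[0-9]{5}-([0-9]{3}\.){4}[0-9]{5}'
}

echo 'seen 173.194.034.006.00080-138.096.201.072.49934 today' | extract_pairs
```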
