Greedy regex behavior not wanted; usual cures don't work - regex

I have a large document that I needed to put anchors in, so I appended a number to the end of each line. The format is "Area 1", and the list goes on for hundreds of entries.
I tried to awk out the slice I wanted using the anchor, but this is what I get:
cat file | awk '/Area 5/{print $0}'
Area 5
Area 50
Area 51
Area 52
Area 53
Area 54
Area 55
Area 56
Area 57
Area 58
Area 59
As you can see, I wanted just "Area 5", but the regex matched "Area 5" as well as "Area 50" through "Area 59". Yes, I know it is being greedy. I tried to limit that behavior with:
/Area 5{1}/
and I still had this problem. I also tried {0} and {0,1} to no effect.
Question 1: What can I do to force awk (and grep as well) to limit it to the requested number?
Question 2: I used awk '/pattern/ { $0=$0 "" ++i }1' to append the number. It leaves "Area 1"; I would like it to be "Area1". Any ideas?
Thanks for the help.
B

To avoid also matching the longer numbers (50, 51, ...), you can use a word boundary.
(Explanation)
In GNU awk (gawk), word boundaries are matched using \y.
To eliminate the space between 'Area' and the number, I simply print the two fields with no separator between them.
In my tests, the following worked:
cat test.txt | awk '/Area 5\y/{print $1 $2}'
Output
Area5

/Area 5([^0-9]|$)/ would account for end of line, as well as anything but a digit.
But a more awk way of doing things would be:
awk '/^Area/ && $2==5' file

If the '5' is at the end of the line, you can use /Area 5$/. The $ matches end-of-line.
If it's followed by further text, /Area 5[^0-9]/ should work. The [^0-9] matches one character that is anything except a digit.
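For example (not from the answer above, assuming the data sits in a file named file), the grep equivalents would be:
grep 'Area 5$' file
grep 'Area 5[^0-9]' file
The first prints only the line that ends in 5; the second catches a 5 followed by more text, as long as the next character is not a digit.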
Good luck!

Some proposals.
awk '$2==5' file
Area 5
awk '$2 ~ /^[5]$/' file
Area 5

Related

Unable to match multiple digits in regex

I am simply trying to print the 5- or 6-digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6-digit numbers. It skips the first digit of the 6-digit number for some reason, and it prints the last line, which I don't need as it doesn't contain any 5- or 6-digit number whatsoever.
grep is better suited for this, with the -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E is for enabling extended regex mode.
If you really want a sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output
(^|.*[^0-9]): Match the start of the line, or any leading text that ends with a non-digit
([0-9]{5,6}): Match 5 or 6 digits in capture group #2
.*: Match the remaining text
\2: The replacement, which puts the captured digits back
/p: Print the substituted result
With awk, you could try the following. A simple explanation: use awk's match function with a regex that matches 5 or 6 digits on each line; if a match is found, print the matched part.
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file
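A note in addition to the answer above: if your awk does not enable interval expressions such as {5,6} by default (older gawk without --re-interval, or older mawk), spelling the repetition out should behave the same way:
awk 'match($0,/[0-9][0-9][0-9][0-9][0-9][0-9]?/){print substr($0,RSTART,RLENGTH)}' Input_file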

Bash - count a pattern and print the line containing the pattern

Hi everyone! While I was reading this discussion, "Count number of occurrences of a pattern in a file (even on same line)", I wondered if I could add the line containing the pattern next to the count values.
Somehow I wasn't able to add a comment on that discussion, so I'm posting a new question. Can somebody enlighten me?
To avoid any misunderstanding, here is an example.
Let's say, I have a DNA sequence like below and want to find out how many 'CG' are present in each line.
ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
Additionally, I want to print each line (not the pattern) along with the pattern counts.
0 ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
1 AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
0 GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
4 CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
I hope the example above helps to make the question clearer.
Thank you!
You can do:
printf 'pattern' | tee >(sed 's/$/ : /') | grep -cf - input.txt
This takes advantage of tee and process substitution.
Example:
% cat file.txt
foobar
spamegg
foo
% printf 'foo' | tee >(sed 's/$/ : /') | grep -cf - file.txt
foo : 2
cat fileName | grep pattern | uniq -c
I just found a really simple and elegant solution using EXCEL.
The formula goes like this:
=(LEN(B2)-LEN(SUBSTITUTE(B2,"CG","")))/2
What this formula basically does is count the total length of the string in a cell and its length after removal of the pattern ("CG" in this case), then subtract the two. Since each "CG" is replaced by an empty string, 2 characters go missing per occurrence, so you get the number of occurrences by dividing the difference by the length of your pattern, which is 2 in this case.
For example, the following sequence contains 50 characters and 13 occurrences of CG.
CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG 50
After substituting "CG" with nothing, you are left with 24 characters.
CAGTGCACACAACACATGTATGTG 24
To count the "CG" occurrences:
(50-24)/2 = 13
If you are looking for "CAG", enter "CAG" instead of "CG" and divide by 3.
How simple is that!
You can see the original post in the following link.
http://fiveminutelessons.com/learn-microsoft-excel/count-occurrences-single-character-cell-excel#sthash.H4VfOkGB.dpbs
English is not my primary language, so please excuse any errors in my writing.
People are geniuses!
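As a side note (not from any of the answers above), the same per-line counts can also be produced directly in the shell: awk's gsub returns the number of substitutions it makes, so replacing CG with itself counts the occurrences without changing the line. Assuming the sequences are in a file, here named sequences.txt:
awk '{ print gsub(/CG/, "CG"), $0 }' sequences.txt
This prints the count followed by the line, matching the expected output shown in the question.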

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second data column. I copied the line twice (flipping the sign on one copy), and I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk that becomes the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want. Because the expansion is not quoted inside the awk program, awk treats it as an (uninitialized, hence empty) variable, so match() is called on an empty string and returns 0, which is exactly the slew of 0's you are seeing. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although GNU awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
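For instance (a sketch, not from the original answer, and assuming the real DATA.txt does use multi-space separators as described), the original command rewritten with single quotes and literal spaces in place of \s would be:
awk '{ print match($0, /  +/) }' DATA.txt
This prints the character position of the first run of two or more spaces on each line, while awk '{print $4}' DATA.txt simply prints the column you are after.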
Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to the value in $4 and performs the default action, print.
PS: do not use cat with a program that can read the data itself, like awk.
In case field 4 contains 0, you can make it more robust like this:
awk '{$0=$4}1' Data.txt
If you're trying to split the input on runs of 3 or 4 spaces, then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as FS value. This regex get parsed and set the Field Separator value to three or four spaces. In regex {min,max} called range quantifier which repeats the previous token from min to max times.

Line-insensitive pattern-matching – How can some context be displayed?

I'm looking for a technique to search a file for a pattern (typically a phrase) that may span multiple lines, and print the match with some surrounding context on one line. The file's lines may be too long or too short for a sensible amount of context; I'm not concerned to print a single line of the file, as you might do with grep, but rather to print onto a single line of my terminal.
Basic requirements
Show a specified number of characters before and after the match, even if it straddles lines.
Show newlines as ‘\n’ to prevent flooding the terminal with whitespace if there are many short lines.
Prefix output line with line and column number of the start of the match.
Preferably a sed oneliner.
So far, I'm assuming that the pattern has a constant length shorter than the width of the terminal, which is okay and very useful for most phrases I might want to search for.
Further considerations
I would be interested to see how the following could also be achieved using sed or the likes:
Prefix output line with line and column number range of the match.
Generalise for variable length patterns, truncating the middle of the match to ‘[…]’ if too long.
Can I avoid using something like ‘[ \n]’ between words in a phrase regex on a file that has been ‘hard-wrapped’ using newlines, without altering what's printed?
Using the output of stty size to dynamically determine the terminal width may be useful, though I'd probably prefer to leave it static in case I want to resize the terminal or use it from screen attached from terminals of different sizes.
Examples
The basic idea for 10 characters of context would be something like:
‘excessively long line with match in the middle\n’ → ‘line with match in the mi’
‘short\nlines\n\nmatch\nlots\nof\nshort\nlines\n’ → ‘rt\nlines\n\nmatch\nlots\nof\ns’
Here's a command to return the 20 characters surrounding a pattern, spanning newlines and including them as a character:
$ input="test.txt"
$ pattern="match"
$ tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g'
line with match in the mi
rt\nlines\n\nmatch\nlots\nof\ns
With row number of the match as well:
$ paste <(grep -n ${pattern} "$input" | cut -d: -f1) \
<(tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g')
1 line with match in the mi
5 rt\nlines\n\nmatch\nlots\nof\ns
I realise this doesn't quite fulfill all of your basic requirements, but am not good enough with awk to do better (guess this is technically possible in sed, but I don't want to think about what it would look like).
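For what it's worth, here is a rough awk sketch (an addition, not part of the answer above) that prints the line and column of the start of the first occurrence of a fixed-string pattern, plus the surrounding context; pat, ctx and test.txt are just illustrative names:
awk -v pat="match" -v ctx=10 '
{ start[NR] = total + 1; text = text $0 "\n"; total += length($0) + 1 }
END {
  s = index(text, pat)                    # first occurrence only
  if (s) {
    for (ln = NR; start[ln] > s; ln--) ;  # line containing the match start
    col = s - start[ln] + 1
    from = (s - ctx < 1) ? 1 : s - ctx
    out = substr(text, from, (s - from) + length(pat) + ctx)
    gsub(/\n/, "\\\\n", out)              # show newlines as \n
    print ln ":" col "\t" out
  }
}' test.txt
It only handles the first match and assumes the pattern contains no newline, so treat it as a starting point rather than a full solution.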

Problem with regular expression using grep

I've got some textfiles that hold names, phone numbers and region codes. One combination per line.
The syntax is always "Name Region_code number", with any number of spaces between the three fields.
What I want to do is search for specific region codes, like 23 or 493, for example.
The problem is that these digits might appear inside the longer phone numbers too, which could produce a match that shouldn't be returned.
I was thinking of this sort of command:
grep '04' numbers.txt
But if I do that, a line that contains 04 in the number but not as region code will show as a result too... which is not correct.
I'm sure you are about to get buried in clever regular expressions, but I think in this case all you need to do is include one of the spaces on each side of your region code in the grep.
grep ' 04 ' numbers.txt
I'd do:
awk '$2 == "04"' < numbers.txt
and with grep:
grep -e '^[^ ]*[ ]*04[ ]*[^ ]*$' numbers.txt
If you want region codes alone, you should use:
grep "[[:space:]]04[[:space:]]"
This way it will only find the number when it stands alone in the middle column, with the surrounding whitespace acting as the boundary.
You can even do:
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" FILE
}
replacing FILE with the name of your file,
and use
search_region_codes 04
or even
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" $2
}
and using
search_region_codes NUMBER FILE
Are you searching for an entire region code, or a region code that contains the subpattern?
If you want the whole region code, and there is at least one space on either side, then you can format the grep by adding a single space on either side of the specific region code. There are other ways to indicate word boundaries using regular expressions.
grep ' 04 ' numbers.txt
If there can be spaces in the name or phone number fields, then that solution might not work. Also, if the pattern can be a sub-part of the region code, then awk is a better tool. This assumes that the 'name' field contains no spaces. The matching operator '==' requires that the pattern exactly match the field. This can be tricky when there is whitespace on either side of the field.
awk '$2 == "04" {print $0}' < numbers.txt
If the file has a delimiter, it can be set in awk using the '-F' argument, which sets the field separator character. In this example, a comma is used as the field separator. In addition, the matching operator in this example is '~', allowing the pattern to be any part of the region code (if that is applicable). The \y is a GNU awk way to match word boundaries at the beginning and end of the expression.
awk -F , '$2 ~ /\y04\y/ {print $0}' < numbers.txt
In both examples, the {print $0} is optional if all you want is the full line, since printing the line is the default action. However, if you want to do any formatting on the output, that can be done inside that block.
Use word boundaries. Not sure if this works in grep, but in other regex implementations I'd surround it with whitespace or word-boundary patterns:
'\s+04\s+' or '\b04\b'
Something like that.
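For reference (an addition, not part of the answer above): GNU grep does support \b, and -w matches whole words, so either of these should work there:
grep '\b04\b' numbers.txt
grep -w '04' numbers.txt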