grep regular expression not matching zero correctly - regex

I'm having some issues with grep regular expressions. I'm trying to grep some ASCII-coded hexadecimal data where the characters are all lower case.
My grep statement is as follows:
grep -E "01[a-f0-9]{2}81[a-f0-9]0" log.log
Most of the matches in the file look OK, except that there are numerous matches like the following:
010481ec070000
01b481ec070000
01508129070521
I can't work out why these strings match. They should not match, because 81 must be followed by a hex character and then a zero.
I have done some further investigation: if I place these three strings in a separate file and grep that file, I get no matches. I'm not quite sure what is going on here.
This is grep 2.12.
Here is part of the raw data in the file. These are all lines that have matched, and they still match after exporting LC_ALL=C:
input data : 011a81a907000b3002004070eaa3d2240fa81272011763dd0040002001
input data : 010481e1070000
input data : 010481ea070000
input data : 011a81a207000b980f0040681f2b11d2f60202dc003669ba0140006100
input data : 014681ab07002140010040d2e457f8c00494ed5e014362bf0240006101ae0500404ee311f402feb2165401c562450240005801db08044068f09ff6a6005af953008062470640004d01
input data : 010481e3070000
input data : 013081ac070016c0000040f6d963fcb4f7e8127c0103637b0140006f01bf0200408ae344fdd2043eed72018362a30240006f01
input data : 010481e4070000
input data : 011a81ad07000b5c06006064f96804901154fed2008e66ff0f4000a401
input data : 010481e5070000
input data : 014681ae070021170d004069f196134cf6a805b4000769b6034000be014e0e004092e80820da0b82fbfa000c6c5c014000bf01880a004020d9ce21f4efd40954011469a1004000ae01
input data : 011a81a607000bef0d0060d60dd6edf8f18e104e015b63d3014000da00
input data : 011a81af07000b4c0800401cfbb0184a0c28f7fa00516931024000e101
input data : 015c81a007002c12050020f2ff640028007afd00801205f70540000400280c00404f016a0a10fbd0012a00e769ff0f400018005d020040e3fabd21e00830f4d200c769d80140000300030a004042030

Try executing it with the environment variable LC_ALL=C. The locale affects the way grep interprets character ranges.
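For a single invocation, the variable can be set inline, reusing the command from the question:
LC_ALL=C grep -E "01[a-f0-9]{2}81[a-f0-9]0" log.log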

Assuming that the command is exactly as you say... The quotes are right, there's no filename glob going on before grep gets the arguments, you don't have {0} instead of 0, etc....
I wonder if the -a (treat binary file as text) is the culprit. Binary output could be processed by the terminal. (That's how we change colors or do curses positioning or whatnot.)
What if you had binary in there that erased part of the line? Say control-H's...
What happens if you pipe the grep output through od -c (or perhaps od -a or od -t a if you have it)?
What happens if you store the output in a file, pull out just one such line with grep, and look at it with od?
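Concretely, that investigation might look something like this (out.txt and the sample string are placeholders):
grep -E "01[a-f0-9]{2}81[a-f0-9]0" log.log | od -c | less
grep -E "01[a-f0-9]{2}81[a-f0-9]0" log.log > out.txt
grep '010481ec070000' out.txt | od -c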

Related

Extract a string from vcf file

I need to extract the RS=368138379 string from lines like the following in a VCF file of a few billion lines. I am wondering how we can use grep -o "" with a regular expression to extract it quickly.
AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379
Thanks very much indeed.
Something along the lines of RS=\d+ should do the trick for the expression you're looking for (note that \d needs grep -P; with grep -E use [0-9], as below).
Let's say test.log contains your log; you can use:
grep -oE "RS=[0-9]+" test.log
If you want to print also the line numbers:
grep -noE "RS=[0-9]+" test.log
Best to avoid using grep to parse VCF/BCF files. Use bcftools query instead:
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz
A simple zgrep -oE "RS=[0-9]+" clinvar.vcf.gz will miss RS values for records that contain more than one ID, which can be pipe-delimited:
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
Number is . when the number of possible values varies, is unknown, or is unbounded. Please see: https://samtools.github.io/hts-specs/VCFv4.2.pdf
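As a made-up illustration of the problem, the regex stops at the first ID and silently drops the rest:
$ echo 'GENEINFO=HSPA1L:3305;RS=368138379|12345' | grep -oE "RS=[0-9]+"
RS=368138379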

How to extract values between 2 known strings

I have some huge files containing mixed binary and XML data. I want to extract all values between two XML tags that occur multiple times in the file. The pattern is as follows: <C99><F1>050</F1><F2>random value</F2></C99>. The XML portions are not formatted; everything is on a single line.
I need all values between <F1> and </F1> inside <C99> where the value is in the range 050 to 999 (<F1> exists under other fields as well, but I need only the values of F1 within C99). I need to count them, to see how many C99 elements have an F1 value between 050 and 999.
I want a hint on how I could easily reach and extract those values (using cat and grep? or sed?). Sorting and counting are easy once the values are exported to a file.
My temporary solution:
After removing all binary data from the file, I can run the following command:
grep -o "<C99><F1>..." filename > file.txt
This will export the first 12 characters of every string starting with <C99><F1>.
<C99><F1>001
<C99><F1>056
<C99><F1>123
<C99><F1>445
.....
Once they are exported to a text file, I replace <C99><F1> with nothing, then sort and count the remaining values.
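Spelled out as one pipeline, that post-processing might look like this (a sketch; it assumes the binary data has already been stripped, and relies on awk's numeric coercion to apply the 050 to 999 range):
grep -o "<C99><F1>..." filename | sed 's/<C99><F1>//' | awk '$0 + 0 >= 50 && $0 + 0 <= 999' | wc -l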
Thank you!
Using XMLStarlet:
$ xml sel -t -v '//C99/F1[. >= 50 and . <= 999]' -nl data.xml | wc -l
Not much of a hint there, sorry.
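If only the count is needed, XPath can do the counting itself; a variant of the command above (not tested against the original file):
$ xml sel -t -v 'count(//C99/F1[. >= 50 and . <= 999])' data.xml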

Using grep, how can I extract every number from a blob of text?

Unlike this previous question, I want to do this on the command line (just grep).
How can I grep every number from a text file and display the results?
Sample text:
This is a sentence with 1 number, while the number 2 appears here, too.
I would expect to be able to extract the "1" and "2" from the text (my actual text is substantially longer, of course).
I think you want something like this:
$ echo 'This is a sentence with 1 number, while the number 2 appears here, too.' | grep -o '[0-9]\+'
1
2
Since grep without -E uses BRE (Basic Regular Expressions), you need to escape the + symbol so that it repeats the previous character one or more times.
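With extended regular expressions the + needs no escaping; an equivalent using grep -E:
$ echo 'This is a sentence with 1 number, while the number 2 appears here, too.' | grep -Eo '[0-9]+'
1
2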

How to parse a text file with a compound-expression filter in shell scripting

I want to parse a text file for particular words: every row that contains the words "cluster", "week", and "8.2" should be written to the output file.
Sample text in the file:
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~monthly~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
2013032308470272~800000102507~Cluster-Mode~yearly~8.1.2~V6240
Desired output, written to another text file, given the above-mentioned filters:
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
I have written some code using the awk command; however, the output file contains rows that are outside the scope of the filters.
Code used to extract the text:
awk '/Cluster/ && /WEEK/ && /8.2/ { print $NF > "/u/nbsvc/Data/Lookup/derived_asup_2010404_201409_2.txt" }' /u/nbsvc/Data/Lookup/cmode_asup_lookup.txt
Obtained output:
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
Note: the first line of the obtained output is not wanted. How can I change my script to get only the line that I want?
To remove any ambiguity and false matches on partial fields or the wrong field, THIS is the command you need to run:
$ awk -F'~' '$3~/^Cluster/ && $4=="WEEK" && $5~/^8\.2/' file
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
I don't think that awk is needed at all here. Just use grep to match the line that you're interested in:
grep 'Cluster.*WEEK.*8\.2' file > output_file
The .* matches zero or more of any character, and > redirects the output to a new file. I have escaped the . in "8.2" so that it is interpreted literally rather than matching any character (although it would work either way here).
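A quick demonstration of the difference, using made-up values:
$ printf '%s\n' 8.2 812 8.1.2 | grep '8.2'
8.2
812
$ printf '%s\n' 8.2 812 8.1.2 | grep '8\.2'
8.2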
There is actually a little more to my requirement: I need to read this text file, split each line and push the values into an array, and then check whether those values match my pattern. If they match, the line should be written to an output text file; otherwise it is simply ignored. I did it as below:
awk 'BEGIN { IGNORECASE = 1 } { split($0, a, "~"); if (a[1] ~ /201404/ && a[3] ~ /Cluster/ && a[4] ~ /WEEK/ && a[5] ~ /8.2/) print }' /inputfolder_path/lookup_filename.txt > /outputfolder_path/derived_output_filename.txt
This is working exactly for my requirement.
Just thought I'd post an update for everyone, as it may help someone.
Thanks,
Siva

Line-insensitive pattern-matching – How can some context be displayed?

I'm looking for a technique to search a file for a pattern (typically a phrase) that may span multiple lines, and to print the match with some surrounding context on one line. The file's lines may be too long or too short for a sensible amount of context; I'm not concerned with printing a single line of the file, as you might do with grep, but rather with printing onto a single line of my terminal.
Basic requirements
Show a specified number of characters before and after the match, even if it straddles lines.
Show newlines as ‘\n’ to prevent flooding the terminal with whitespace if there are many short lines.
Prefix output line with line and column number of the start of the match.
Preferably a sed one-liner.
So far, I'm assuming that the pattern has a constant length shorter than the width of the terminal, which is okay and very useful for most phrases I might want to search for.
Further considerations
I would be interested to see how the following could also be achieved using sed or the likes:
Prefix output line with line and column number range of the match.
Generalise for variable length patterns, truncating the middle of the match to ‘[…]’ if too long.
Can I avoid using something like ‘[ \n]’ between words in a phrase regex on a file that has been ‘hard-wrapped’ using newlines, without altering what's printed?
Using the output of stty size to dynamically determine the terminal width may be useful, though I'd probably prefer to leave it static in case I want to resize the terminal or use it from screen attached from terminals of different sizes.
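For reference, reading the width would look something like this (the variable names are illustrative):
cols=$(stty size < /dev/tty | awk '{ print $2 }')
ctx=$(( (cols - 20) / 2 ))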
Examples
The basic idea for 10 characters of context would be something like:
‘excessively long line with match in the middle\n’ → ‘line with match in the mi’
‘short\nlines\n\nmatch\nlots\nof\nshort\nlines\n’ → ‘rt\nlines\n\nmatch\nlots\nof\ns’
Here's a command to return the 20 characters surrounding a pattern, spanning newlines and including them as a character:
$ input="test.txt"
$ pattern="match"
$ tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g'
line with match in the mi
rt\nlines\n\nmatch\nlots\nof\ns
With row number of the match as well:
$ paste <(grep -n "${pattern}" "$input" | cut -d: -f1) \
        <(tr '\n' '~' < "$input" | grep -o ".\{10\}${pattern}.\{10\}" | sed 's/~/\\n/g')
1 line with match in the mi
5 rt\nlines\n\nmatch\nlots\nof\ns
I realise this doesn't quite fulfill all of your basic requirements, but I'm not good enough with awk to do better (I guess this is technically possible in sed, but I don't want to think about what it would look like).
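For what it's worth, here is a rough GNU awk sketch along those lines. It treats the pattern as a fixed string via index(), which sidesteps regex escaping inside the phrase; the file name, pattern, and context width are placeholders:
gawk -v pat="match" -v ctx=10 '
BEGIN { RS = "^$" }                        # gawk idiom: slurp the whole file into $0
{
    off = 1
    while (p = index(substr($0, off), pat)) {
        pos = off + p - 1                  # 1-based offset of this match in the file
        pre = substr($0, 1, pos - 1)
        line = 1 + gsub(/\n/, "&", pre)    # newlines before the match, plus one
        nl = 0                             # offset of the last newline before the match
        for (i = pos - 1; i >= 1; i--)
            if (substr($0, i, 1) == "\n") { nl = i; break }
        col = pos - nl
        start = (pos - ctx < 1) ? 1 : pos - ctx
        out = substr($0, start, pos - start + length(pat) + ctx)
        gsub(/\n/, "\\\\n", out)           # display newlines as literal \n
        printf "%d:%d\t%s\n", line, col, out
        off = pos + 1
    }
}' test.txt
Unlike the paste pipeline above, the line and column here come from the same match, so the two columns cannot drift out of step.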