Bash, grep between two lines with specified strings

example_file.txt:
a43
<un:Test1 id="U111">
abc1
cvb1
bnm1
</un:Test1>
<un:Test1 id="U222">
abc2
cvb2
bnm2
</un:Test1>
I need all lines between <un:Test1 id="U111"> and the first </un:Test1> only. The number of these lines differs from one input file to another. I have tried
grep -E -A100000 '<un:Test1 id=\"U111\">' example_file.txt | grep -B100000 '</un:Test1>'
but it also returns all the lines below <un:Test1 id="U222">. I know that it's better to use an XML parser for this kind of file, but installing additional libraries on the server is not allowed, so I can only use grep, awk, sed, etc. Please help.

Do you mean this?
sed -n '/<un:Test1 id="U111">/,/<\/un:Test1>/p' file
Update, with xmllint:
If your input is XML, you can try:
xmllint --xpath "//*[local-name()='Test1'][@id='U111']" file.xml
Note: If the same local name ("Test1") occurs in different namespaces, you also need to test namespace-uri().
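If the tag lines themselves are not wanted, the same idea can be sketched with awk. This is only a sketch under the assumption that the opening and closing tags each sit on their own line, as in the question's sample; the variable names (sample, inclusive, inner) are made up for the example.

```shell
# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
a43
<un:Test1 id="U111">
abc1
cvb1
bnm1
</un:Test1>
<un:Test1 id="U222">
abc2
cvb2
bnm2
</un:Test1>
EOF

# Inclusive range, as in the sed answer: the two tag lines are printed too.
inclusive=$(sed -n '/<un:Test1 id="U111">/,/<\/un:Test1>/p' "$sample")

# Exclusive variant: print only the lines strictly between the tags.
inner=$(awk '
  /<\/un:Test1>/ && inside { exit }       # first closing tag after the start: stop reading
  inside                   { print }      # between the tags: print the line
  /<un:Test1 id="U111">/   { inside = 1 } # opening tag seen: print from the next line on
' "$sample")

printf '%s\n' "$inner"
rm -f "$sample"
```

Because awk exits at the first closing tag, the rest of a large file is never read.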

Related

Extract a string from vcf file

I need to extract the RS=368138379 string from lines like the following in a VCF file of a few thousand million lines. How can I use grep -o "" and a regular expression to extract it quickly?
AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379
Thanks very much indeed.

Something along the lines of RS=\d+ should do the trick for the expression you're looking for.

Let's say test.log contains your log; you can use:
grep -oE "RS=[0-9]+" test.log
If you also want to print the line numbers:
grep -noE "RS=[0-9]+" test.log
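To see what the pattern returns without a full file, here is a small sketch that runs it on a shortened copy of the INFO string from the question. The variable names are invented for the example.

```shell
# Shortened copy of the INFO string from the question.
info='CLNVC=single_nucleotide_variant;GENEINFO=HSPA1L:3305;ORIGIN=4;RS=368138379'

# -o prints only the matched part; -E enables extended regex syntax (the + quantifier).
rs_pair=$(printf '%s\n' "$info" | grep -oE 'RS=[0-9]+')

# Strip the "RS=" prefix to keep only the number.
rs_value=${rs_pair#RS=}

echo "$rs_pair"
echo "$rs_value"
```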

Best to avoid using grep to parse VCF/BCF files. Use bcftools query instead:
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz
A simple zgrep -oE "RS=[0-9]+" clinvar.vcf.gz will miss RS values for records that contain more than one ID, which can be pipe-delimited:
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
Number is . when the number of possible values varies, is unknown, or is unbounded. Please see: https://samtools.github.io/hts-specs/VCFv4.2.pdf
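To illustrate the point about several IDs, here is a text-only sketch. It assumes, as stated above, that extra IDs are pipe-delimited; the INFO string is invented for the example, and this is not a substitute for bcftools.

```shell
# Hypothetical INFO string with two pipe-delimited IDs (invented for illustration).
info='ALLELEID=1;RS=111|222;ORIGIN=1'

# The plain pattern stops at the first non-digit, so it only sees the first ID.
first_only=$(printf '%s\n' "$info" | grep -oE 'RS=[0-9]+')

# Split INFO into one key=value per line, keep the RS value, then split it on "|".
ids=$(printf '%s\n' "$info" | tr ';' '\n' | sed -n 's/^RS=//p' | tr '|' '\n')

echo "$first_only"
printf '%s\n' "$ids"
```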

Remove the data before the second repeated specified character in linux

I have a text file containing data like the following:
AB-NJCFNJNVNE-802ac94f09314ee
AB-KJNCFVCNNJNWEJJ-e89ae688336716bb
AB-POJKKVCMMMMMJHHGG-9ae6b707a18eb1d03b83c3
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
I need to remove the data before the second hyphen (-) and produce another text file with the below output:
802ac94f09314ee
e89ae688336716bb
9ae6b707a18eb1d03b83c3
55c3375fb1ee8bcd8c491e24b2
I am pretty new to Linux and have been trying the sed command unsuccessfully for the last couple of hours. How can I get the desired output with sed or any other useful command like awk?

You can use a simple cut call:
$ cat myfile.txt | cut -d"-" -f3- > myoutput.txt
Edit:
Some explanation, as requested in the comments:
cut breaks up a string of text to fields according to a given delimiter.
-d defines the delimiter, - in this case.
-f defines which fields to output. In this case, we want to eliminate everything before the second hyphen, or, in other words, return the third field and onwards (3-).
The rest of the command is just plumbing: cat feeds the file into cut, and the result is saved to an output file.
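A quick way to check the explanation is to run cut on a couple of the sample lines. In this sketch the temporary file names are placeholders and the last input line (AB-XYZ-12-34) is invented to show what -f3- does when more hyphens follow.

```shell
infile=$(mktemp)
outfile=$(mktemp)
cat > "$infile" <<'EOF'
AB-NJCFNJNVNE-802ac94f09314ee
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
EOF

# Same options as the cut call above; cut can also read the file directly.
cut -d'-' -f3- "$infile" > "$outfile"
result=$(cat "$outfile")

# "3-" means field 3 and everything after it, so later hyphens are kept.
extra=$(printf '%s\n' 'AB-XYZ-12-34' | cut -d'-' -f3-)

printf '%s\n' "$result"
echo "$extra"
rm -f "$infile" "$outfile"
```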

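Or, using sed (this pattern removes everything up to and including the second hyphen; a greedy ^.\+- would cut at the last hyphen on the line instead):
cat myfile.txt | sed -e 's/^[^-]*-[^-]*-//'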
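How the sed pattern is written matters once the remaining text contains a hyphen of its own. The sketch below compares a greedy pattern with an explicit "two hyphens" pattern and with plain shell parameter expansion. The input line is invented for the example, and .* is used in place of the GNU-specific .\+ so it runs with any sed.

```shell
line='AB-XYZ-12-34'   # invented input with a third hyphen

# Greedy: .* runs to the LAST hyphen, so only "34" survives.
greedy=$(printf '%s\n' "$line" | sed -e 's/^.*-//')

# Explicit: two runs of non-hyphens, each followed by a hyphen.
second=$(printf '%s\n' "$line" | sed -e 's/^[^-]*-[^-]*-//')

# No external command: remove the shortest prefix matching "*-*-".
expanded=${line#*-*-}

echo "$greedy"
echo "$second"
echo "$expanded"
```

On the sample data from the question all three give the same result, because each line has exactly two hyphens.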

print multiple patterns with sed

I am trying to print multiple patterns with sed.
Here's a typical string to process:
(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>
and I would like : (1.15)
For this, I tried :
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
but I get (1.)15</span>)</td></tr>
Can anyone see what's wrong?
Thanks

If you are Chuck Norris, use regex, brainfuck or assembly. If you're not, don't use regex to parse HTML; instead, use a tool that supports XPath, like xmllint. In 2014, it's a solved problem:
xmllint --html --xpath '//span[@class="arabic"]/text()' file_or_URL
Check the famous RegEx match open tags except XHTML self-contained tags
xmllint comes from libxml2-utils package (for debian and derivatives)

Reason why you are getting (1.)15</span>)</td></tr> as your output:
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
The two characters "> at the end of the pattern need to be placed before the second \([0-9]*\), because in your line "> comes before each number. As written, the match ends at the second ">, so the second group matches an empty string and the 15 is left untouched after the replacement.
The correct sed command:
sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
Correct command line:
echo '(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'|sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
Result of the command line above:
(1.15)
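For this particular line there is also a shorter route than capturing groups: delete every tag and keep what is left. This is a different technique from the command above and only a sketch; it assumes no ">" appears inside an attribute value.

```shell
html='(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'

# Remove every "<...>" run; [^>]* cannot cross a closing bracket, so each tag is matched separately.
text=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

echo "$text"
```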

If data is at the same place all the time, awk may be a simpler solution than sed:
awk -F"[<>]" '{print "("$3"."$7")"}' file
(1.15)
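To see why fields 3 and 7 are the right ones, it can help to print every field with its index. A small sketch using the line from the question (variable names invented for the example):

```shell
html='(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'

# Split on "<" or ">" and list each field as index:value.
fields=$(printf '%s\n' "$html" | awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d:%s\n", i, $i }')

# Same print statement as the answer above.
result=$(printf '%s\n' "$html" | awk -F'[<>]' '{ print "(" $3 "." $7 ")" }')

printf '%s\n' "$fields"
echo "$result"
```

Tag names land in the even-numbered fields and the text between tags in the odd-numbered ones, which is why this only works while the layout stays fixed.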

$ lynx -dump -nomargins file.htm
(1.15)

Regular expression to extract text from XML-ish data using GNU sed

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". Every line in the file has one of these two formats:
<field number='1' name='Account' type='STRING'W/>
<field number='2' name='AdvId' type='STRING'W>
I've inserted a 'W' at the end, which represents optional whitespace. The order and number of properties are not necessarily the same in all lines throughout the file, although "number" always comes before "type".
What I'm searching for is a regular expression "regexp" that I can give to gnu sed so that this command:
gsed regexp -i FILENAME
gives me a file with lines looking like this:
1 STRING
2 STRING
I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.
I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?
Thanks a lot,
jules

Using xsh, a Perl wrapper around XML::LibXML:
open file.xml ;
for //field echo @number @type ;

I'm sure this can be optimized, but it works for me and answers your question:
sed "s/^.*number='\([0-9]*\)'.*type='\(.*\)'.*$/\1 \2/" <filename>
That said, I think the others are right: if you have an XML file, you should use an XML parser.
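One possible tightening, as a sketch only: capture the type with [^']* instead of .*, so that an attribute following type does not end up in the output. The third input line is invented to show that case, and the W placeholder from the question is replaced by real whitespace.

```shell
infile=$(mktemp)
cat > "$infile" <<'EOF'
<field number='1' name='Account' type='STRING' />
<field number='2' name='AdvId' type='STRING'>
<field number='3' type='INT' required='Y'/>
EOF

# [^']* stops at the closing quote of each value, so later attributes are ignored.
result=$(sed "s/^.*number='\([0-9]*\)'.*type='\([^']*\)'.*/\1 \2/" "$infile")

printf '%s\n' "$result"
rm -f "$infile"
```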

I think you're much better off using a command-line XML tool such as XMLStarlet. That will integrate well with the shell and let you perform XPath searches. It's XML-aware, so it will handle character encodings, whitespace, etc. correctly.

Simple cut should work for you:
cut -f2,6 -d"'" --output-delimiter=" "
If you really want sed:
sed -r "s/.*number='([^']*)'.*type='([^']*)'.*/\1 \2/"

You can use this:
sed -r "s/<field [^>]*number='([0-9]+)'[^>]*type='([^']+)'[^>]*>/\1 \2/"

You would be better off using an XML parser, but if you had to use sed:
sed -E "s/.*<field number='([^']*)'.*type='([^']*)'.*/\1 \2/"

sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*#\1 \2#p" FILENAME
Or if you don't mind contents of number and type to be optional:
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*#\1 \2#p" FILENAME
Just change from [^']\\+ to [^']* at your preference.
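The -n option together with the trailing p flag means that only lines where the substitution succeeded are printed. Here is a sketch of that behaviour without -i, so nothing is modified in place. The <message> wrapper lines are invented for the example, and [^'][^']* stands in for the GNU-specific [^']\+.

```shell
infile=$(mktemp)
cat > "$infile" <<'EOF'
<message name='Logon'>
  <field number='1' name='Account' type='STRING' />
  <field number='2' name='AdvId' type='STRING'>
</message>
EOF

# The address /<field .*>/ selects field lines; "#" is used as the s delimiter; p prints only rewritten lines.
result=$(sed -n "/<field .*>/s#^.*[[:space:]]number='\([^'][^']*\).*[[:space:]]type='\([^'][^']*\).*#\1 \2#p" "$infile")

printf '%s\n' "$result"
rm -f "$infile"
```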

grep - search for "<?\n" at start of a file

I have a hunch that I should probably be using ack or egrep instead, but what should I use to basically look for
<?
at the start of a file? I'm trying to find all files that contain the php short open tag since I migrated a bunch of legacy scripts to a relatively new server with the latest php 5.
I know the regex would probably be '/^<\?\n/'

I RTFM and ended up using:
grep -RlIP '^<\?\n' *
The -P option enables full Perl-compatible regexes.

If you're looking for all php short tags, use a negative lookahead
/<\?(?!php)/
will match <? but will not match <?php
[meder ~/project]$ grep -rP '<\?(?!php)' .

find . -name "*.php" | xargs grep -nHo "<?[^px]"
The x in the bracket expression excludes the XML start tag (<?xml); only one leading ^ is needed to negate the whole set.
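One thing to watch with a bracket expression here: it needs a character after <? to match, so a bare <? at the end of a line is skipped. The sketch below compares that form with an extended-regex variant that also accepts end of line. The sample file content is invented for the example.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?php echo 1;
<?xml version="1.0"?>
<? echo 2;
<?
EOF

# Basic regex: "?" is literal; the bracket requires one more character that is not p or x.
with_char=$(grep -n '<?[^px]' "$sample")

# Extended regex: "\?" is a literal question mark; also accept "<?" at end of line.
with_eol=$(grep -nE '<\?([^px]|$)' "$sample")

printf '%s\n' "$with_char"
printf '%s\n' "$with_eol"
rm -f "$sample"
```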

If you're worried about Windows line endings, just add \r?.

grep '^<?$' filename
In case that does not display correctly, here is the same pattern with a space between each character (for display only; do not type the spaces):
grep ' ^ < ? $ ' filename

Do you mean a literal "backslash n" or do you mean a newline?
For the former:
grep '^<?\\n' [files]
For the latter:
grep '^<?$' [files]
Note that grep will search all lines, so if you want to find matches just at the beginning of the file, you'll need to either filter each file down to its first line, or ask grep to print out line numbers and then only look for line-1 matches.
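The "filter each file down to its first line" idea could look roughly like this. It is a sketch only: the temporary directory and file names are invented, and carriage returns are stripped so that Windows line endings do not hide a match.

```shell
workdir=$(mktemp -d)
printf '<?\necho "short";\n' > "$workdir/short.php"
printf '<?php\necho "long";\n' > "$workdir/long.php"
printf 'text\n<?\n'           > "$workdir/later.php"

matches=""
for f in "$workdir"/*.php; do
  # Look at line 1 only; -x requires the whole line to match, -F treats "<?" as a fixed string.
  if head -n 1 "$f" | tr -d '\r' | grep -qxF '<?'; then
    matches="$matches${f##*/} "
  fi
done

echo "$matches"
rm -rf "$workdir"
```

Only short.php is reported here; later.php contains <? on its second line and is ignored.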
