Grep Syntax Dilemma - regex

I'm trying to use grep to copy out lines in a text file that match a certain pattern, but I'm running into some issues. I would like to grab the values in the "title=" attribute.
Code:
get_tmax=`grep '[0-9][0-9]°C' K0G7_ec_tmp`
echo "${get_tmax}" > K0G7_ec_tmp2
Text File Contents:
<p class="one" title="19°C">19</p>
<p class="two" title="26°C">26</p>

You can use grep -P with match reset \K:
grep -ioP 'title="\K[^"]+' K0G7_ec_tmp
19°C
26°C
However, be careful when parsing HTML with shell utilities such as grep, awk, or sed. A dedicated HTML parser is a better fit for this job.

grep is shorthand for g/re/p (globally match a regular expression and print the matching lines), which is not exactly what you're trying to do, so I'd look at sed for this:
$ sed 's/.*title="\([^"]*\).*/\1/' file
19°C
26°C
That will work with any sed version on any OS.
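One caveat with the sed command above: lines that contain no title attribute are passed through unchanged. A small sketch of a variant that prints only the extracted values, using -n and the p flag. The function name extract_titles and the extra sample line without a title are assumptions for illustration.

```shell
# extract_titles [FILE...]: hypothetical helper; prints the value of the last
# title="..." attribute on each line and skips lines that have none.
extract_titles() {
  sed -n 's/.*title="\([^"]*\)".*/\1/p' "$@"
}

# Demo on the two lines from the question plus one line without a title.
printf '%s\n' \
  '<p class="one" title="19°C">19</p>' \
  '<div>no title here</div>' \
  '<p class="two" title="26°C">26</p>' | extract_titles
```

The demo prints 19°C and 26°C, one per line. Redirect the output to K0G7_ec_tmp2 to get the file the question asks for.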

Related

How to do a grep regex search for single-quotes?

How do you use grep to do a text file search for a pattern like ABC='123'?
I'm currently using:
grep -rnwi some/path -e "ABC\s*=\s*[\'\"][^\'\"]+[\'\"]"
but this only finds text like ABC="123". It misses any instances that use single-quotes. What's wrong with my regex?
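Your pattern uses Perl-style (PCRE) syntax, for example + as a quantifier, which grep's default basic regex syntax treats as a literal character. So you need the -P flag: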
grep -rnwi some/path -P "ABC\s*=\s*[\'\"][^\'\"]+[\'\"]"
You don't need a backslash before single quotes inside the character classes, so your regex can also be written as:
"ABC\s*=\s*['\"][^'\"]+['\"]"
Input file:
ABC="123"
ABC='123'
Run grep with your PCRE:
grep -P "ABC\s*=\s*['\"][^'\"]+['\"]" input.txt
Output:
ABC="123"
ABC='123'
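If -P is not available (it is a GNU grep option), the same search can probably be written as a POSIX extended regex with -E, where + is a quantifier and [[:space:]] stands in for \s. This is only a sketch; the function name match_assignments is made up for the example.

```shell
# match_assignments [FILE...]: hypothetical helper; prints lines where ABC is
# assigned a value wrapped in single or double quotes.
# Inside the double-quoted shell string only the double quote needs a backslash.
match_assignments() {
  grep -E "ABC[[:space:]]*=[[:space:]]*['\"][^'\"]+['\"]" "$@"
}

# Demo: the unquoted assignment on the third line is not printed.
printf '%s\n' 'ABC="123"' "ABC='123'" 'ABC=123' | match_assignments
```

The recursive search from the question would then be grep -rnwiE with the same pattern and some/path as the target.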

Use regular expressions to identify HTML form action tags

I am trying to use sed -i to update all my HTML forms for URL shortening. Basically I need to delete the .php from all the action="..." attributes in my HTML forms.
But I am stuck at just identifying these instances. I am trying this testfile:
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
And I am using this expression:
grep -R "action\s?=\s?(.*)php(\"|\')" testfile
And grep returns nothing at all.
I've tried a bunch of variations, and I can see that even the \s? isn't working because just this grep command also returns nothing:
grep -R "action\s?=\s?" testfile
grep -R "action\\s?=\\s?" testfile
(the latter I tried thinking maybe I had to escape the \ in \s).
Can someone tell me what's wrong with these commands?
Edit:
Fix 1 - apparently I need to escape the question mark in \s? so that it is treated as an optional-character quantifier rather than a literal question mark.
The way you're using it, grep accepts basic POSIX regex syntax. The single quote does not need to be escaped in it (see footnote 1), but some of the metacharacters you use do: in particular, ?, (), and |. You can use
grep -R "action\s\?=\s\?\(.*\)php\(\"\|'\)" testfile
I recommend, however, that you use extended posix regex syntax by giving grep the -E flag:
grep -E -R "action\s?=\s?(.*)php(\"|')" testfile
As you can see, that makes the whole thing much more readable.
Addendum: To remove the .php extension from all action attributes in a file, you could use
sed -i 's/\(action\s*=\s*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' testfile
Shell strings make this look scarier than it is; the sed code is simply
s/\(action\s*=\s*["'][^"']*\)\.php\(["']\)/\1\2/g
I amended the regex slightly so that in a line action='foo.php' somethingelse='bar.php' the right .php would be removed. I tried to make this as safe as I can, but be aware that handling HTML with sed is always hacky.
Combine this with find and its -exec filter to handle a whole directory.
Footnote 1: The double quote needs to be escaped because you use a double-quoted shell string, not because the regex requires it.
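The find step mentioned above could look roughly like this. It is a sketch that assumes GNU sed (for -i without a suffix argument and for \s) and that the forms live in files named *.html; strip_php_actions is a made-up name.

```shell
# strip_php_actions DIR: hypothetical helper; edits every *.html file under DIR
# in place, removing ".php" from action="..." and action='...' values.
# Assumes GNU sed (-i without a suffix argument, \s for whitespace).
strip_php_actions() {
  find "$1" -type f -name '*.html' -exec \
    sed -i 's/\(action\s*=\s*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' {} +
}

# Demo in a throwaway directory.
demo_dir=$(mktemp -d)
printf '%s\n' 'action = "yo.php"' "action='test.php'" > "$demo_dir/form.html"
strip_php_actions "$demo_dir"
cat "$demo_dir/form.html"
rm -rf "$demo_dir"
```

Since the edit is done in place, it is worth trying it on a copy of the directory first.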
You need to use the -P option to use Perl regexes:
$ grep -P "action\s?=\s?(.*)php(\"|\')" test
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
Try this unescaped plain regex, which only selects text within quotes:
action\s?=\s?["'](.*)\.php["']
You can fiddle around with it here:
https://regex101.com/r/lN8iG0/1
On the command line this would be:
grep -P "action\s?=\s?[\"'](.*)\.php[\"']" test

Copy matched regex to new file

I want to copy regex matched text to a new file.
<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>
([\s\S]*?) = any text, any line
This works (it finds the matches) in the Sublime editor, but what would this regex look like for sed or grep (or any other Unix tool)?
sed and grep normally search line by line rather than in multiline mode, although multiline matching is still possible under certain conditions.
I would advise using Perl, which is probably already installed on your computer:
perl -e 'undef $/; $_ = <>; print $& if /<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>/i;' myfile.xml
Be aware that this regex won't work if you have nested <shopitem> tags, or even multiple occurrences. Use an XML parser instead.
You can also write a program that parses your XML file; this version captures all the matches.
myparser.pl:
#!/usr/bin/env perl
undef $/;
$_ = <>;
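# Print each matched block ($&) rather than $_, which holds the whole file;
# the lazy *? keeps every match as short as possible.
print "$&\n" while /<(shopitem)>[\s\S]*?<(year)>2015<\/\2>[\s\S]*?<\/\1>/ig;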
That you can execute:
$ chmod u+x myparser.pl
$ ./myparser.pl myfile.xml
I'm not the best scripter, but I think this should work:
grep "<SHOPITEM>" infile | grep "<YEAR>2015" | sed -e "s/<[^>]*>//g" | sed "s/2015/ /g" > outfile
Edit: I didn't match the regex, instead I got SHOPITEMs with YEAR 2015 tag and removed all the unwanted parts.
Edit: I'd do it this way, but I'm not sure it's the most elegant solution.
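For the case where Perl is not an option, here is a line-based awk sketch that buffers each item and prints it only if it contains the wanted year. It assumes every <SHOPITEM> and </SHOPITEM> tag sits on a line that belongs to that item only and that items are not nested. The function name items_for_year and the <NAME> element in the demo are assumptions, not part of the question.

```shell
# items_for_year YEAR [FILE...]: hypothetical helper; prints every
# <SHOPITEM>...</SHOPITEM> block that contains <YEAR>YEAR</YEAR>.
items_for_year() {
  year=$1
  shift
  awk -v want="<YEAR>$year</YEAR>" '
    /<SHOPITEM>/   { buf = ""; inside = 1 }   # a new item starts: reset the buffer
    inside         { buf = buf $0 "\n" }      # collect the lines of the current item
    /<\/SHOPITEM>/ {                          # item ends: print it if the year matches
      if (inside && index(buf, want)) printf "%s", buf
      inside = 0
    }
  ' "$@"
}

# Demo: only the second item is printed.
printf '%s\n' \
  '<SHOPITEM>' '<NAME>old</NAME>' '<YEAR>2014</YEAR>' '</SHOPITEM>' \
  '<SHOPITEM>' '<NAME>new</NAME>' '<YEAR>2015</YEAR>' '</SHOPITEM>' |
  items_for_year 2015
```

Used on a file, this would be items_for_year 2015 infile > outfile. Unlike the single regex, it cannot run from one item into the next, because the buffer is reset at every opening tag.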

Regular expression to extract everything between two tags

I have a few million custom txt files generated with content like this in each one of them.
I previously used Ruby (Nokogiri) to parse these files one by one, extract their contents, and store them in the database.
<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">
...
...
...
few hundred lines of text
...
</doc>
However, Ruby seems to be too slow: a single process needs more than two weeks to get through this overwhelming number of article files. So I was trying to extract the data I need with shell commands alone and skip Ruby entirely. But I am still a novice at regular expressions.
So far I have been able to extract these data.
informations=`grep -E '<doc' F1.txt`
id=`echo $informations | grep -Po '\bid="[0-9]+"' | grep -Eo '[0-9]+'`
url=`echo $informations | grep -Po 'https?:\/\/(.*?)([A-Za-z]|[.]|[\/]|[?]|[=]|[0-9])*'`
title=`echo $informations | grep -Po '(?<=title=").*(?=">)'`
But I also need to capture everything in between the doc tag as body.
body=`a command to take those few hundred lines between the two doc tags`.
I tried to use this in the grep environment /(?<=">)(.)*(?=</doc>)/m .
grep -Po '(?<=">)(.)*(?=<\/doc>)' F1.txt
But it does not return any match.
Any suggestions on how to get this done?
awk '/<doc/,/<\/doc>/' YourFile
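Each printed range stops at the first line containing </doc>. The <doc ...> and </doc> lines themselves are included in the output.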
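Use this pattern:
<doc.*?</doc>
UPDATE:
grep -Pzo '(?s)<doc.*?</doc>' file.txt
Use the -P option for Perl syntax. grep matches line by line by default, so also add -z (GNU grep: treat the whole file as one NUL-terminated record) and -o (print only the matched text); (?s) lets . match newlines. Each match is written with a trailing NUL byte instead of a newline.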
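To get only the body, without the <doc ...> and </doc> lines, a small awk state machine is one option. This sketch assumes the opening and closing tags each sit on their own line, as in the sample; doc_body is a hypothetical name.

```shell
# doc_body [FILE...]: hypothetical helper; prints the lines strictly between
# a line containing "<doc " and the next line containing "</doc>".
doc_body() {
  awk '
    /<\/doc>/ { inside = 0 }   # closing tag: stop before printing it
    inside    { print }        # body line
    /<doc /   { inside = 1 }   # opening tag: start with the following line
  ' "$@"
}

# Demo on a shortened version of the sample file.
printf '%s\n' \
  '<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">' \
  'first line of text' \
  'second line of text' \
  '</doc>' | doc_body
```

In the script from the question this would be used as body=$(doc_body F1.txt).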

Match & Extract Multi-line Pattern In File

I made a Bash script to download this page http://php.net/downloads.php and then search for the first occurrence of the latest PHP filename, version and MD5sum. Right now I have it working, but broken up into two different sed commands. When I try to put the regexps into a single one it won't match. I believe it has to do with the newlines present.
How do I go about using one single sed pattern where I get all three matches, either in an array (preferred) or separated by spaces?
Btw, it does not have to be sed. I just want something that is likely to be available on the system the script will run on, so no Perl for instance.
wget -q http://php.net/downloads.php
FILE_INFO=$(sed -nr "s/.*(php-([0-9\.]+)\.tar\.bz2).*/\1 \2/p;T;q" downloads.php)
MD5SUM=$(sed -nr "s/.*md5: ([0-9a-f]{32}).*/\1/p;T;q" downloads.php)
echo $FILE_INFO
echo $MD5SUM
These are the two lines from the file in question and it needs to extract the info from:
PHP 5.4.5 (tar.bz2) [10,754Kb] - 19 July 2012<br />
<span class="md5sum">md5: ffcc7f4dcf2b79d667fe0c110e6cb724</span>
This might work for you (GNU sed):
sed '\|<a href="/get/php|!d;N;s/.*\(php-\([0-9\.]\+\)\.tar\.bz2\).*md5: \([0-9a-f]\{32\}\).*/\1 \2 \3/;q' file
sed -nr 's/.*(php-([0-9\.]+)\.tar\.bz2).*/\1 \2/p;s/.*md5: ([0-9a-f]{32}).*/\1/p;T;' downloads.php
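Since the question says it does not have to be sed, here is a hedged awk sketch that returns all three values on one line, separated by spaces. It assumes the download line still contains the link target, for example /get/php-5.4.5.tar.bz2/from/a/mirror (the href in the demo is an assumed example), and that the md5 line follows it. php_release_info is a made-up name.

```shell
# php_release_info [FILE...]: hypothetical helper; prints "filename version md5"
# for the first php-X.Y.Z.tar.bz2 reference and the first md5 line after it.
php_release_info() {
  awk '
    !file && match($0, /php-[0-9.]+\.tar\.bz2/) {
      file = substr($0, RSTART, RLENGTH)
      version = substr(file, 5, length(file) - 12)   # strip "php-" and ".tar.bz2"
    }
    file && match($0, /md5: [0-9a-f]+/) {
      print file, version, substr($0, RSTART + 5, RLENGTH - 5)   # skip "md5: "
      exit
    }
  ' "$@"
}

# Demo with an assumed link line followed by the md5 line from the question.
printf '%s\n' \
  '<a href="/get/php-5.4.5.tar.bz2/from/a/mirror">PHP 5.4.5 (tar.bz2)</a> [10,754Kb] - 19 July 2012<br />' \
  '<span class="md5sum">md5: ffcc7f4dcf2b79d667fe0c110e6cb724</span>' | php_release_info
```

In Bash the result can be turned into an array with info=( $(php_release_info downloads.php) ), after which ${info[0]}, ${info[1]} and ${info[2]} hold the filename, version and checksum.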