Awk pattern matching - regex

I want to print
userId = 1234
userid = 12345
timestamp = 88888888
js = abc
from my data
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
How can I do this with AWK (or whatever)? Assume that my data is stored in the "$info" variable (single line data).
Edit: by "single line data" I mean that all the data sits on one line, like this:
messss...<input name="userId" value="1234" type="hidden">messsss...<input ....>messssssss
So I can't use grep to extract the section of interest.

I'm not sure I understand your "single line data" comment but if this is in a file, you can just do something like:
cat file |
grep '^<input ' |
sed 's/^<input name="//' |
sed 's/" value="/ = /' |
sed 's/".*$//'
Here's the cut'n'paste version:
cat file | grep '^<input ' | sed 's/^<input name="//' | sed 's/" value="/ = /' | sed 's/".*$//'
This turns:
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
quite happily into:
userId = 1234
userid = 12345
timestamp = 88888888
js = abc
The grep simply extracts the lines you want, while the sed commands, respectively:
strip off up to the first quote.
replace the section between the name and value with an "=".
remove everything following the value closing quote (including that quote).
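The same three steps can also be folded into a single sed call, which drops both the cat and the separate grep. This is only a minimal sketch, assuming each <input> tag starts its own line as in the sample; the function name extract_inputs is made up for illustration:

```shell
# extract_inputs (hypothetical name): read HTML on stdin and print
# "name = value" for every line that starts with "<input ".
extract_inputs() {
  sed -n '/^<input /{
    s/^<input name="//
    s/" value="/ = /
    s/".*$//
    p
  }'
}
```

It would be called as extract_inputs < file.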

This part should probably be a comment on Pax's answer, but it got a bit long for that little box. I'm thinking 'single line data' means you don't have any newlines in your variable at all? Then this will work:
echo "$info" | sed -n -r '/<input/s/<input +name="([^"]+)" +value="([^"]+)"[^>]*>[^<]*/\1 = \2\n/gp'
Notes on interesting bits:
-n means don't print by default - we'll say when to print with that p at the end.
-r means extended regex
/<input/ at the beginning makes sure we don't even bother to work on lines that don't contain the desired pattern
That \n at the end is there to ensure all records end up on separate lines - any original newlines will still be there, and the fastest way to get rid of them is to tack on a '| grep .' on the end - you could use some sed magic but you wouldn't be able to understand it thirty seconds after you typed it in.
I can think of ways to do this in awk, but this is really a job for sed (or perl!).
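If the -r flag or the \n in the replacement is not available in your sed, one alternative is to break the single line at every < first and then treat each piece as its own line. A rough sketch, assuming the attributes always appear in the order name, value; extract_inputs_oneline is a hypothetical name:

```shell
# extract_inputs_oneline (hypothetical name): takes the single-line data
# as its first argument and prints one "name = value" pair per line.
extract_inputs_oneline() {
  printf '%s\n' "$1" |
    tr '<' '\n' |   # every tag now starts its own line, minus the "<"
    sed -n 's/^input  *name="\([^"]*\)"  *value="\([^"]*\)".*/\1 = \2/p'
}
```

It would be called as extract_inputs_oneline "$info".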

To process variables that contain more than one line, you need to put the variable name in double quotes:
echo "$info"|sed 's/^\(<input\( \)name\(=\)"\([^"]*\)" value="\([^"]*\)"\)\?.*/\4\2\3\2\5/'

using perl
cat file | perl -ne 'print($1 . "=" . $2 . "\n") if(/name="(.*?)".*value="(.*?)"/);'

IMO, parsing HTML should be done with a proper HTML/XML parser. For example, Ruby has an excellent package, Nokogiri, for parsing HTML/XML:
ruby -e '
require "rubygems"
require "nokogiri"
doc = Nokogiri::HTML.parse(ARGF.read)
doc.search("//input").each do |node|
atts = node.attributes
puts "%s = %s" % [atts["name"], atts["value"]]
end
' mess.html
produces the output you're after

AWK:
BEGIN {
    # Use record separator "<", instead of "\n".
    RS = "<"
    first = 1
}
# Skip the first record, as that begins before the first tag
first {
    first = 0
    next
}
/^input[^>]*>/ { #/
    # make sure we don't match outside of the tag
    end = match($0,/>/)
    # locate the name attribute
    pos = match($0,/name="[^"]*"/)
    if (pos == 0 || pos > end) { next }
    name = substr($0,RSTART+6,RLENGTH-7)
    # locate the value attribute
    pos = match($0,/value="[^"]*"/)
    if (pos == 0 || pos > end) { next }
    value = substr($0,RSTART+7,RLENGTH-8)
    # print out the result
    print name " = " value
}

Tools like awk and sed can be used together with XMLStarlet and HTML Tidy to parse HTML.

Related

Extract all but last field from a variable in bash

I have a file with lines similar to this:
01/01 THIS IS A DESCRIPTION 123.45
12/23 SHORTER DESC 9.00
11/16 DESC 1,234.00
Three fields: date, desc, amount. The first field will always be followed by a space. The last field will always be preceded by a space. But the middle field will usually contain spaces.
I know bash/regex well enough to get the first and last fields (for example, echo ${LINE##* } or cut -f1 -d' '). But how do I get the middle field? Essentially everything except the first and last fields.
You can use sed for that:
$ sed -E 's/^[^[:space:]]*[[:space:]](.*)[[:space:]][^[:space:]]*$/\1/' file
THIS IS A DESCRIPTION
SHORTER DESC
DESC
Or with awk:
$ awk '{$1=$NF=""; gsub(/^[ \t]+|[ \t]+$/,"")}1' file
# same output
You can also use cut and rev to delete the first and last fields:
$ cut -d ' ' -f2- file | rev | cut -d ' ' -f2- | rev
# same output
Or GNU grep:
$ grep -oP '^\H+\h\K(.*)(?=\h+\H+$)' file
# same output
Or, with a Bash loop and parameter expansion:
$ while read -r line; do line="${line#* }"; echo "${line% *}"; done <file
# same output
Or, if you want to capture the fields as variables in Bash:
while IFS= read -r line; do
date="${line%% *}"
amt="${line##* }"
line="${line#* }"
desc="${line% *}"
printf "%5s %10s \"%s\"\n" "$date" "$amt" "$desc"
done <file
Prints:
01/01 123.45 "THIS IS A DESCRIPTION"
12/23 9.00 "SHORTER DESC"
11/16 1,234.00 "DESC"
If you want to remove the first and last fields, you can just extend the parameter expansion technique you referenced:
var=${var#* } var=${var% *}
A single # or % removes the shortest substring that matches the glob.
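To make the difference between the single and doubled operators concrete, here is a small sketch using one of the sample lines (the variable names are only for illustration):

```shell
line='01/01 THIS IS A DESCRIPTION 123.45'

first=${line%% *}     # %% strips the longest suffix matching " *": first field remains
last=${line##* }      # ## strips the longest prefix matching "* ": last field remains
middle=${line#* }     # #  strips the shortest prefix matching "* ": drops the first field
middle=${middle% *}   # %  strips the shortest suffix matching " *": drops the last field

printf '%s|%s|%s\n' "$first" "$middle" "$last"
# prints: 01/01|THIS IS A DESCRIPTION|123.45
```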
bash: read the line into an array of words, and pick out the wanted elements from the array
while read -ra words; do
    date=${words[0]}
    amount=${words[-1]}
    description=${words[*]:1:${#words[@]}-2}
    printf "%s=%s\n" date "$date" desc "$description" amt "$amount"
done < file
outputs
date=01/01
desc=THIS IS A DESCRIPTION
amt=123.45
date=12/23
desc=SHORTER DESC
amt=9.00
date=11/16
desc=DESC
amt=1,234.00
This is the fun bit: ${words[*]:1:${#words[@]}-2}
take a slice of the words array, from index 1 (the 2nd element) for a length of "number of elements minus 2"
the words will be joined into a single string with a space separator.
See Shell Parameter Expansion and scroll down a bit for the ${parameter:offset:length} discussion.
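The slice can be tried on its own, outside the loop. A minimal sketch, assuming bash (arrays are not available in plain sh) and a made-up sample record:

```shell
# Sample record split into words (made-up data for illustration).
words=(11/16 DESC WITH SPACES 1,234.00)

# ${#words[@]} is the element count (5), so the slice starts at index 1
# and takes 5-2 = 3 elements; [*] inside quotes joins them with a space.
description="${words[*]:1:${#words[@]}-2}"

printf '%s\n' "$description"
# prints: DESC WITH SPACES
```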
If you want to use a regex in bash, then you can use capturing parentheses and the BASH_REMATCH array
while IFS= read -r line; do
if [[ $line =~ ([^[:blank:]]+)" "(.+)" "([^[:blank:]]+) ]]; then
echo "date=${BASH_REMATCH[1]}"
echo "desc=${BASH_REMATCH[2]}"
echo "amt=${BASH_REMATCH[3]}"
fi
done < file
Same output as above.
Notice in the pattern that the spaces need to be quoted (or backslash-escaped).
You could try the one below with awk:
awk '{$1="";$NF="";gsub(/^[ \t]+|[ \t]+$/,"")}1' file_name

Get multiple values in an xml file

<!-- someotherline -->
<add name="core" connectionString="user id=value1;password=value2;Data Source=datasource1.comapany.com;Database=databasename_compny" />
I need to grab the values of user id, password, Data Source and Database. Not all lines are in the same format. My desired result would be (username=value1, password=value2, DataSource=datasource1.comapany.com, Database=databasename_compny).
The regex for this seems fairly complicated to me, so please explain your answer if possible.
I realised it's better to loop through each line. The code I have written so far:
while read p || [[ -n $p ]]; do
#echo $p
if [[ $p =~ .*connectionString.* ]]; then
echo $p
fi
done <a.config
Now inside the if I have to grab the values.
For this solution I am assuming:
Some lines can contain no data
No semi-colon ; is inside the data itself (nor field names)
No equal sign = is inside the data itself (nor field names)
A possible solution for your problem would be:
#!/bin/bash
while read -r p || [[ -n $p ]]; do
    # 1. Only keep what is between the quotes after connectionString=
    filteredLine=$(echo "$p" | sed -n -e 's/^.*connectionString="\(.\+\)".*$/\1/p')
    # 2. Ignore empty lines (that do not contain the expected data)
    if [ -z "$filteredLine" ]; then
        continue
    fi
    # 3. split each field on a line
    oneFieldByLine=$(echo "$filteredLine" | sed -e 's/;/\r\n/g')
    # 4. For each field
    while IFS= read -r field; do
        # extract field name + field value
        fieldName=$(echo "$field" | sed 's/=.*$//')
        fieldValue=$(echo "$field" | sed 's/^[^=]*=//' | sed 's/[\r\n]//')
        # do stuff with it
        echo "'$fieldName' => '$fieldValue'"
    done < <(printf '%s\n' "$oneFieldByLine")
done <a.xml
Explanations
General sed replacement syntax :
sed 's/a/b/' will replace what matches the regex a by the content of b
Step 1
The -n argument tells sed not to print lines automatically; combined with the p flag below, only lines where the substitution succeeded are printed. In this case this is useful to ignore useless lines.
^.* - anything at the beginning of the line
connectionString=" - literally connectionString="
\(.\+\)" - capturing group that stores everything up to the closing quote " (\+ is a GNU sed extension)
.*$ - anything until the end of the line
\1 tells sed to replace the whole match with only the capturing group (which contains only the data between the quotes)
p tells sed to print out the replacement
Step 3
Replace ; by \r\n ; it is equivalent to splitting by semi-colon because bash can loop over line breaks
Step 4 - field name
Replaces literal = and the rest of the line with nothing (it removes it)
Step 4 - field value
Removes everything from the beginning of the field up to and including the first equal sign ([^=] matches any character except =).
Another sed command removes the carriage return left over from step 3.
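The same splitting can also be sketched in pure bash, without sed and without the intermediate \r\n step, under the same assumptions (no ; or = inside the data). The function name parse_connection_string is hypothetical:

```shell
# parse_connection_string (hypothetical name): takes one line of the config
# file and prints each field as 'name' => 'value'. Requires bash.
parse_connection_string() {
  local line=$1 rest field
  local -a fields
  # Skip lines that do not carry a connection string.
  [[ $line == *'connectionString="'* ]] || return 1
  rest=${line#*connectionString=\"}   # drop everything up to the opening quote
  rest=${rest%%\"*}                   # drop the closing quote and what follows
  IFS=';' read -ra fields <<< "$rest" # split on semi-colons
  for field in "${fields[@]}"; do
    # name is the text before the first "=", value is the text after it
    printf "'%s' => '%s'\n" "${field%%=*}" "${field#*=}"
  done
}
```

Inside the read loop it would be called as parse_connection_string "$p".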

use sed to replace "file={{bla-bla}}" with "file={bla-bla}"

my bibtex file is corrupted in a sense that I need to change
file = {{name:/path/to/file.pdf:application/pdf}},
with file = {name:/path/to/file.pdf:application/pdf}, that is, remove the first pair of curly brackets.
All the strings I am interested start with file = {{.
My first attempt is
echo "file = {{name:/path/to/file.pdf:application/pdf}}," | sed 's/file = {{/file = {/g;s/}}/}/g'
The problem with this one is that it also alters lines like
title = {{ blablabla }}, which I don't want it to do.
How does one write a REGEX with something like s/file = {{EVERYTHING-IN-BETWEEN/file = {KEEP-WHAT-WAS-THERE}/g ?
p.s. if it's not possible with sed, any other unix commands are welcome.
p.p.s. I am on OS-X, sed here is apparently different to GNU, so some answers below do not work for me, unfortunately.
You can do the following:
sed 's/\(file = \){\({[^}]*}\)}/\1\2/g'
This is probably wrong for your situation, but this will change any double brace to a single brace:
sed 's/\([{}]\)\1/\1/g' <<END
{{
}}
file={{bar-blah}}
{}
}{
END
{
}
file={bar-blah}
{}
}{
The search part \([{}]\)\1 finds a single open or close brace followed by what was just captured. The replacement part is the single captured character.
Assuming both opening and closing braces are on the same line and no other pairs of braces exist on that same line then this should do what you want:
sed '/file *= *{{/{s/{{/{/; s/}}/}/}' file
That's:
/file *= *{{/ - match lines that contain file = {{ (with any number of spaces around the =)
{ - start a group of commands
s/{{/{/ - replace {{ with { once
s/}}/}/ - replace }} with } once
} - end a group of commands
If OS X sed cannot handle that command, and this version without the command grouping does not work either:
sed '/file *= *{{/s/}}/}/; /file *= *{{/s/{{/{/'
then this, hopefully, should:
sed -e '/file *= *{{/s/}}/}/' -e '/file *= *{{/s/{{/{/'
or, to steal from glenn jackman's answer a bit:
sed -e '/file *= *{{/s/\([{}]\)\1/\1/g'
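A quick way to check that the address restricts the edit to file lines is to feed in one line of each kind. A small sketch, assuming one field per line as in the question; fix_file_braces is a hypothetical wrapper around the two-expression command above:

```shell
# fix_file_braces (hypothetical name): filter stdin, collapsing the doubled
# braces only on lines that contain "file = {{".
fix_file_braces() {
  # Order matters: collapse "}}" first, while the line still contains
  # "file = {{" for the second address to match.
  sed -e '/file *= *{{/s/}}/}/' -e '/file *= *{{/s/{{/{/'
}

printf '%s\n' \
  'file = {{name:/path/to/file.pdf:application/pdf}},' \
  'title = {{ blablabla }},' |
  fix_file_braces
# prints:
# file = {name:/path/to/file.pdf:application/pdf},
# title = {{ blablabla }},
```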
huh, GNU sed
echo "file={{bla-bla}}" | sed 's/\(file\s*=\s*\){\s*{\s*\([^}]*\)}\s*}\s*/\1{\2}/'
file={bla-bla}
\s matches any whitespace characters that may be present
This assumes that '}' does not occur inside the inner string.

Extracting string from html file or curl output

I have HTML files, some of which are "minified", which means that a whole website can be on just one line.
I want to extract the value of ?idsite=, which consists of digits. So the HTML contains something like this: img src="//stats.domains.com/piwik.php?idsite=44.
So the plain output should be "44".
I tried grep, but it echoes the whole line and just highlights the value.
With perl it could be something like:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| perl -nE 'say /.*idsite=(..)\"/ '
(assumes that idsite is always two characters ! :-). Your regex will need to be more sophisticated than this most likely).
Putting the snippet from the page you reference above in an HTML file (non-minified) and substituting 44 for the parameter variable, this bit of perl will extract the "44":
perl -nE 'say /.*idsite=(..)/ if /idsite/ ' idsite.html
Translating the one liner to a sed command line would be similar:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| sed -En "s/^.*idsite=(..)\"/\1/p"
This is POSIX sed from FreeBSD (should work on OS X); the -E switch enables "modern" (extended) regexes.
Doing it in awk is left as an exercise for another community member :-)
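The two-character assumption can be dropped by letting grep print only the matching part (-o) and then cutting off the idsite= prefix. A sketch, assuming the value is always a run of digits; extract_idsite is a hypothetical name:

```shell
# extract_idsite (hypothetical name): read HTML on stdin and print the digits
# of every idsite= parameter, one per line, even if the page is one long line.
extract_idsite() {
  grep -o 'idsite=[0-9][0-9]*' | cut -d= -f2
}
```

It would be called as extract_idsite < idsite.html, or at the end of a curl pipeline.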
Here is a perl way to extract only the trailing digits of strings like src="//stats.domains.com/piwik.php?idsite=44" and run on a bash command line:
echo $src|perl -ne '$_ =~m /(\d+$)/; print $1'
Here is a python way to do the same thing:
import re
print(', '.join(re.findall(r'\d+$', src)))
If there will be a lot of src strings to process, it would be best to compile the regex when using Python as follows:
import re
p = re.compile(r'\d+$')
print(', '.join(p.findall(src)))
The import and the compilation only have to be done once.
Here is a Ruby way to do it:
puts src.scan( /\d+$/ ).first
In all cases the regexes end with "$" which matches the end of the string. That is why they match and extract only digits (\d+) at the end of the string.
If you don't need to check whether the idsite is in the value of a src attribute, then all you need is
perl -nE'say $1 if /\bidsite=(\d+)/' myfile.html
$ cat site.html
lorem ipsum idsite='4934' fasdf a
other line
$ sed -n '/idsite/ { s/.*idsite=[^0-9]*\([0-9]\+\).*$/\1/; p }' < site.html
4934
Let me know in case you need an explanation of what is going on.

Find string in a file after a string pattern using shell script

I have an output file with 4 lines:
storefront/storefront.war/location/header-info.jsp:30:<input type="hidden" id="welcomeConfigValue" value="${welcomeConfig}"/>
storefront/storefront.war/location/header-info.jsp:31:<span id="selected-location" class="top-txt top-nav-fix">
storefront/storefront.war/location/header-info.jsp:33:<span id="headRestName"></span><span class="header-spacing"> | </span><span id="headRestPhone"></span><span class="header-spacing"> | </span>
storefront/storefront.war/location/header-info.jsp:35:<a href="#" class="capitalize link-wht" id="location-show"><fmt:message
I'd like to get output string after id= with the UNIX shell.
I.e., output should be like this:
welcomeConfigValue
selected-location
headRestName
headRestPhone
location-show
you can try with grep:
grep -Po '\sid="\K[^"]*' file
Command:
sed -r 's/(^.*id=")([^"]+)(.*$)/\2/g' < file.txt
Output:
sdlcb@Goofy-Gen:~/AMD$ sed -r 's/(^.*id=")([^"]+)(.*$)/\2/g' < ff.txt
welcomeConfigValue
selected-location
headRestPhone
location-show
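Here, we group the pattern into 3 sets using "(" and ")". The first set contains all characters from the beginning of the line up to and including 'id="'. The second set contains the characters between the quotes (i.e. between 'id="' and the closing '"'). The third set contains the remaining characters up to the end of the line. We then keep only the second set and discard the 1st and 3rd. Because the leading .* is greedy, only the last id on each line is printed, which is why headRestName is missing from the output above.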
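To list every id, including several on one line, one option is to let grep print each match separately and then cut out the quoted part. A sketch, assuming ids are always double-quoted and preceded by a space; extract_ids is a hypothetical name:

```shell
# extract_ids (hypothetical name): read lines on stdin and print the value of
# every id="..." attribute, one per line.
extract_ids() {
  grep -o ' id="[^"]*"' |   # one output line per id attribute
    cut -d'"' -f2           # keep only the text between the quotes
}
```

It would be called as extract_ids < file.txt.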