Regular expression to extract text from XML-ish data using GNU sed - regex

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". Every line in the file has one of these two formats:
<field number='1' name='Account' type='STRING'W/>
<field number='2' name='AdvId' type='STRING'W>
I've inserted a 'W' at the end, which represents optional whitespace. The order and number of properties are not necessarily the same in all lines throughout the file, although "number" always comes before "type".
What I'm searching for is a regular expression "regexp" that I can give to GNU sed so that this command:
gsed regexp -i FILENAME
gives me a file with lines looking like this:
1 STRING
2 STRING
I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.
I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?
Thanks a lot,
jules

Using xsh, a Perl wrapper around XML::LibXML:
open file.xml ;
for //field echo @number @type ;

I'm sure this can be optimized, but it works for me and answers your question:
sed "s/^.*number='\([0-9]*\)'.*type='\(.*\)'.*$/\1 \2/" <filename>
That said, I think the others are right: if you have an XML file, you should use an XML parser.
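A minimal sketch of a slightly stricter variant of the command above, assuming a sed that accepts -E (GNU and BSD sed both do). The helper name extract_fields and the sample lines are made up for illustration. Using [^']* inside the quotes, instead of .*, keeps the second capture from running past the closing quote when further attributes follow type.

```shell
# Hypothetical helper: print "number type" for each <field ...> line on stdin.
# Lines that do not match are suppressed by -n ... p.
extract_fields() {
  sed -n -E "s/.*[[:space:]]number='([^']*)'.*[[:space:]]type='([^']*)'.*/\1 \2/p"
}

# Made-up sample input: trailing whitespace and an extra attribute after type.
printf '%s\n' \
  "<field number='1' name='Account' type='STRING' />" \
  "<field number='2' name='AdvId' type='STRING' required='Y'>" |
  extract_fields
```

This still relies on number appearing before type on each line, as the question states.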

I think you're much better off using a command-line XML tool such as XMLStarlet. It integrates well with the shell and lets you perform XPath searches. It's XML-aware, so it'll handle character encodings, whitespace, etc. correctly.

Simple cut should work for you:
cut -f2,6 -d"'" --output-delimiter=" "
If you really want sed:
sed -r "s/.*number='([^']*)'.*type='([^']*)'.*/\1 \2/"

You can use this:
sed -r "s/<field [^>]*number='([0-9]+)'[^>]*type='([^']+)'[^>]*>/\1 \2/"

You would be better off using an XML parser, but if you had to use sed:
sed -E "s/.*<field number='([^']*)'.*type='([^']*)'.*/\1 \2/"

sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*#\1 \2#p" FILENAME
Or if you don't mind contents of number and type to be optional:
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*#\1 \2#p" FILENAME
Just change from [^']\\+ to [^']* at your preference.

Related

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of regex to match the co-ordinates. In fact the input file is the result of extracting from apache access logs. It might be easier to read/understand answers if they just match positive integer numbers, I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you wish, it is important to be able to access the last four values on each line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, a comma, or = as the field delimiters. This makes it simple to access the columns of interest.
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
Btw, sed can also be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command captures each of the four numbers at the end of the line in its own capturing group and reassembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
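A quick sanity check, sketched on one sample line from the question, that the extended-regex spelling and the portable basic-regex spelling produce the same result. The variable names are arbitrary, and -E is assumed to be available (older GNU sed spells it -r).

```shell
# Sample line taken from the question.
line='/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979'

# ERE form: + means "one or more".
ere=$(printf '%s\n' "$line" |
  sed -E 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/')

# BRE form: \{1,\} is the portable spelling of "one or more".
bre=$(printf '%s\n' "$line" |
  sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/')

printf '%s\n' "$ere"
[ "$ere" = "$bre" ] && echo "both forms agree"
```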
sed strips off items prior to numbers, then awk splits on comma and outputs in different order. Assuming data is in a file called "td.txt"
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Or with more toothpicks:
sed 's/^.*[?][^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]

print multiple patterns with sed

I'm trying to print multiple patterns with sed.
Here's a typical string to process:
(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>
and I would like to get: (1.15)
For this, I tried:
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
but I get (1.)15</span>)</td></tr>
Can anyone see what's wrong?
Thanks
If you are Chuck Norris, use regex, brainfuck or assembly. If you're not, don't use regex to parse HTML; instead, use a tool that supports XPath, like xmllint. In 2014, it's a solved problem:
xmllint --html --xpath '//span[@class="arabic"]/text()' file_or_URL
Check the famous question "RegEx match open tags except XHTML self-contained tags".
xmllint comes from the libxml2-utils package (on Debian and derivatives).
The reason you are getting (1.)15</span>)</td></tr> as your output:
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
                                          ^^
The two characters "> need to be placed before the second \([0-9]*\), since in your line "> comes directly before the digits (in this case). That way sed anchors the second group on the digits instead of matching it as empty.
The correct sed command:
sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
                              ^^
The complete command line:
echo '(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'|sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
Result of the command line above:
(1.15)
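Since the greedy .* is what made the original attempt misfire, an alternative sketch is to avoid .* altogether and simply delete every tag. This is not the approach of the answer above; the helper name strip_tags is made up, and it only suits simple single-line markup where no attribute value contains a > character.

```shell
# Hypothetical helper: delete every <...> tag on each input line.
# [^>]* cannot run past the closing > of a tag, so there is no greedy overshoot.
strip_tags() {
  sed 's/<[^>]*>//g'
}

printf '%s\n' '(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>' |
  strip_tags
```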
If data is at the same place all the time, awk may be a simpler solution than sed:
awk -F"[<>]" '{print "("$3"."$7")"}' file
(1.15)
$ lynx -dump -nomargins file.htm
(1.15)

sed replace exact match

I want to change some names in a file using sed. This is what the file looks like:
#! /bin/bash
SAMPLE="sample_name"
FULLSAMPLE="full_sample_name"
...
Now I want to change only sample_name, not full_sample_name, using sed.
I tried this
sed s/\<sample_name\>/sample_01/g ...
I thought \< and \> could be used to find an exact match, but when I use this, nothing is changed.
Adding single quotes ('') around the sed command helped to change only sample_name. However, there is another problem now: my situation is a bit more complicated than explained above, since my sed command is embedded in a loop:
while read SAMPLE
do
name=$SAMPLE
sed -e 's/\<sample_name\>/$SAMPLE/g' /path/coverage.sh > path/new_coverage.sh
done < $1
So sample_name should be replaced with the value of $SAMPLE. However, when I run the command, sample_name is changed to the literal text $SAMPLE and not to its value.
I believe \< and \> work with gnu sed, you just need to quote the sed command:
sed -i.bak 's/\<sample_name\>/sample_01/g' file
In GNU sed, the following command works:
sed 's/\<sample_name\>/sample_01/' file
The only difference here is that I've enclosed the command in single quotes. Even when it is not necessary to quote a sed command, I see very little disadvantage to doing so (and it helps avoid these kinds of problems).
Another way of achieving what you want more portably is by adding the quotes to the pattern and replacement:
sed 's/"sample_name"/"sample_01"/' script.sh
Alternatively, the syntax you have proposed also works in GNU awk:
awk '{sub(/\<sample_name\>/, "sample_01")}1' file
If you want to use a variable in the replacement string, you will have to use double quotes instead of single, for example:
sed "s/\<sample_name\>/$var/" file
Variables are not expanded within single quotes, which is why you are getting the name of your variable rather than its contents.
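A minimal sketch of the double-quote approach wrapped in a function, assuming GNU sed (the \< and \> word boundaries are a GNU extension). The helper name rename_sample and the temporary file are made up for illustration. Note that the value is pasted into the sed script as-is, so a value containing / or & would need escaping first.

```shell
# Hypothetical helper: replace the whole word sample_name with $1 in file $2
# and write the result to stdout.
rename_sample() {
  # Double quotes let the shell expand $1 before sed sees the script;
  # \< and \> pass through unchanged because < and > are not special there.
  sed "s/\<sample_name\>/$1/g" "$2"
}

# Demo on a throwaway copy of the two lines from the question.
tmp=$(mktemp)
printf '%s\n' 'SAMPLE="sample_name"' 'FULLSAMPLE="full_sample_name"' > "$tmp"
rename_sample sample_01 "$tmp"
rm -f "$tmp"
```

Only the first line changes, because the underscore in full_sample_name counts as a word character and so no word boundary precedes sample_name there.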
@user1987607
You can do this the following way:
sed 's/"sample_name"/"sample_01"/g' file
Including the surrounding quotes " " in the pattern makes it match only the exact string value.
The trailing g makes the replacement global.
If sample_name can also occur unquoted and followed by a space, including inside a longer word such as ifsample_name, and you want to replace that as well,
then you should use the following:
sed 's/sample_name /sample_01 /g' file
The trailing space keeps the match from continuing into a longer word. For example, the same approach with "the " replaces the word "the" in a text file but not the start of words like thereby.
If you are interested in replacing only the first occurrence on each line, then this would work fine:
sed 's/"sample_name"/"sample_01"/' file
Hope it helps

select part of filename using regex

I have a file that looks like
dcdd62defb908e37ad037820f7 /sp/dir/su1/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/su2/89/sfdh.gz
ee1d443b8a0cc27749f4b31e56 /sp/dir/su3/89/24.gz
33c02e311fd0a894f7f0f8aae4 /sp/dir/su4/89/dfad.gz
43f6cdce067f6794ec378c4e2a /sp/dir/su5/89/adf.gz
2f6c584116c567b0f26dfc8703 /sp/dir/su6/895/895.gz
a864b7e327dac1bb6de59dedce /sp/dir/su7/895/895.gz
How do I use sed to substitute all the su* directories with a single value, with something like
sed "s/REGEXP/newfolder/g" myfile
Thanks in advance
I think you want
sed 's/su./newfolder/g'
If you actually want to keep the number in su1...su7 as a part of newfolder (for example newfolder1...newfolder7), you can do:
sed 's/su\(.\)/newfolder\1/g'
It also depends on how strict you want your patterns to be. The above will match su followed by any character and do the replacement. On the other hand, a command like s#/su\([0-9]\)/#/newfolder\1/#g will only match /su followed by a digit, followed by /. So you may need to adjust your pattern accordingly.
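To make the difference concrete, here is a small sketch comparing the loose and strict patterns on a made-up line whose file name happens to contain "su" as well. The path and variable names are illustrative only.

```shell
# Made-up line: the file name "results.gz" also contains "su".
line='2f6c584116c567b0f26dfc8703 /sp/dir/su6/895/results.gz'

# Loose: "su" followed by any one character, anywhere on the line.
loose=$(printf '%s\n' "$line" | sed 's/su./newfolder/g')

# Strict: only a whole path component of the form /su<digit>/.
strict=$(printf '%s\n' "$line" | sed 's#/su\([0-9]\)/#/newfolder\1/#g')

printf 'loose:  %s\nstrict: %s\n' "$loose" "$strict"
```

The loose pattern also rewrites the "sul" inside results, while the strict one touches only the directory component.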
$ sed -e 's|/su[^/]*|/newfolder|' /tmp/files
dcdd62defb908e37ad037820f7 /sp/dir/newfolder/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/newfolder/89/sfdh.gz
...
If you want to get rid of the checksums as well:
$ sed -r -e 's|/su[^/]*|/newfolder|' -e 's/^[^ ]+ +//' /tmp/files
/sp/dir/newfolder/89/asga.gz
/sp/dir/newfolder/89/sfdh.gz
...
su[0-9] will match a single digit.
sed requires a dirty amount of metacharacter escaping, some of it may be slightly off.
sed -i -E -e 's/\/su[^\/]+\//\/newFolder\//g' myfile
I vote for Wayne Conrad's answer as the most likely to be what the OP wants, but I'd suggest using an alternate character for the sed expression separator, thus:
sed 's|/su[^/]*|/newfolder|' /tmp/files
That makes it a bit cleaner.
Note also that the trailing 'g' is probably not wanted.
Use awk. Since the path has a delimiter, '/', you can split on it; column 4 is then the one you want to change. That way, if you have paths like /sp/su3dir/su2/89/sfdh.gz, su3dir will not be affected.
awk -F"/" '{$4="newfolder";}1' OFS="/" file
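A small sketch of the same idea as a reusable function, run on the tricky path mentioned above. The helper name replace_component is made up, and it assumes, as in the question, that the directory to change is always the fourth "/"-separated field (the checksum and the space before the path make up the first field).

```shell
# Hypothetical helper: set the 4th "/"-separated field of each line to $1.
replace_component() {
  # Assigning to $4 makes awk rebuild the line using OFS, so OFS must be "/".
  awk -F"/" -v OFS="/" -v new="$1" '{ $4 = new } 1'
}

printf '%s\n' '7d59319afca23b02f572a4034b /sp/su3dir/su2/89/sfdh.gz' |
  replace_component newfolder
```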

Regular expression to find a line containing certain characters and remove that line

I have a text file with a lot of entries, one per line.
I want to find all lines which start with :: and delete all those lines.
What is the regular expression to do this?
-AD
Regular expressions don't "do" anything. They only match text.
What you want is a tool that uses regular expressions to identify a line and then applies some command to the matching lines.
One such tool is sed (there's also awk and many others). You'd use it like this:
sed -e "/^::/d" < input.txt > output.txt
The part "/^::/" tells sed to apply the following command to all lines that start with "::" and "d" simply means "delete that line".
Or the simplest solution (which my brain didn't produce for some strange reason):
grep -v "^::" input.txt > output.txt
sed -i -e '/^::/d' yourfile.txt
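A short sketch, on made-up sample text, showing that the sed and grep forms above keep the same lines and that a "::" in the middle of a line is left alone because of the ^ anchor. The variable names are arbitrary.

```shell
# Made-up sample: two lines starting with "::" and two ordinary lines.
sample=$(printf '%s\n' ':: header comment' 'keep me' '::another comment' 'a :: in the middle stays')

# Delete matching lines with sed, or keep non-matching lines with grep -v.
by_sed=$(printf '%s\n' "$sample" | sed '/^::/d')
by_grep=$(printf '%s\n' "$sample" | grep -v '^::')

printf '%s\n' "$by_sed"
[ "$by_sed" = "$by_grep" ] && echo "sed and grep agree"
```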
^::.*[\r\n]*
If you're reading the file line-by-line you won't need the [\r\n]* part.
Simple as:
^::
If you don't have sed or grep, find this and replace with empty string:
^::.*[\r\n]
Thanks for the pointers.
The following worked for me. Any character could be present after "::" in the text file, so I used:
^::[a-zA-Z0-9 I put all punctuation symbols here]*$
-AD
Here's my contribution in C#:
Text stream:
string stream = ":: This is a comment line";
Syntax:
Regex commentsExp = new Regex("^::.*", RegexOptions.Singleline);
Usage:
Console.WriteLine(commentsExp.Replace(stream, string.Empty));
Alternatively, if I wanted to simply take a text file that included comments and produce an exact duplicate without the comment lines I could use a simple but effective combination of the type and findstr commandline tools:
type commented.txt | findstr /v /R "^::" > uncommented.txt