Geany regex to extract data inside and outside parenthesis separately - regex

I have an incomplete XML file that I am trying to convert to CSV so I can map it to a spreadsheet. To create the header, I need to extract the label before each = and separate the labels with a ,.
Conversely, I need to capture everything between the quotes on every line so the values line up with the header.
My trouble is that some of the data fields contain spaces, which breaks the anchors I was trying to use, and some fields have no data at all, just "". Here is a sample with both cases, from which I was trying to create my header.
lvendor="EBL" lxref="1304112" linked="0" ltrnqty="" labeltype="ITEM W/DATE,VENDOR" taxcode="1" foodstamp="false" nonstock="false" detail="true" ars2="false"
The Geany regex I tried is:
[=]["](\S+)?["][\s]
This works until I run into a space in the data field, but replacing (\S+)? with (.+)? gives me other problems. I'm just not sure how to anchor my regex properly, or if I need to use a capture group to get it done.
I'm not even positive if Geany is the right tool here. I'm on an Arch Linux box, so I'm open to any tools that are available to me.

You could do:
(\w+)(?==)|"([^"]*)"
This saves the variable names in the first capturing group and their corresponding values in the second capturing group.
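For anyone scripting this instead of using Geany's search dialog, here is a minimal Python sketch of how the two capture groups could be turned into a header row and a data row. The names `split_attributes` and `to_csv` and the use of the `csv` module are my own assumptions, not part of the answer above; the sample line is the one from the question.

```python
import csv
import io
import re

# Same pattern as above: group 1 is a name followed by "=",
# group 2 is the text between a pair of double quotes.
ATTR_RE = re.compile(r'(\w+)(?==)|"([^"]*)"')

SAMPLE = ('lvendor="EBL" lxref="1304112" linked="0" ltrnqty="" '
          'labeltype="ITEM W/DATE,VENDOR" taxcode="1" foodstamp="false" '
          'nonstock="false" detail="true" ars2="false"')


def split_attributes(line):
    """Return (names, values) for one line of name="value" pairs."""
    names, values = [], []
    for m in ATTR_RE.finditer(line):
        if m.group(1) is not None:
            names.append(m.group(1))
        else:
            # An empty "" still matches group 2, as an empty string.
            values.append(m.group(2))
    return names, values


def to_csv(rows):
    """Serialize rows as CSV; fields containing commas get quoted."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


names, values = split_attributes(SAMPLE)
print(to_csv([names, values]), end="")
```

Letting the `csv` module do the quoting matters here because one of the values ("ITEM W/DATE,VENDOR") itself contains a comma.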

Since you are open to other tools, you can also pull out the comma-separated row of quoted values in the terminal with sed:
sed -r 's/\s?\S+=/,/g; s/^,//' file.xml

Related

Remove matching strings using regex

I have a semicolon-delimited list of name/value pairs like this:
make=mazda;model=cx-5;year=2016;moonroof=yes;radio=yes;navigation=no;color=gray;
I would like to remove the moonroof, radio, and navigation pairs. I can capture these pairs using a regex like this:
(radio|navigation|moonroof)=.*?(?:;|$)
Is there a way to remove the captured group(s) using regex alone, without writing code? Alternatively, is there a way to get the rest of the pairs excluding the captured groups?
If your data set is small, you can use an online tool such as https://regex101.com/ to do it. With Linux or, I imagine, Windows Subsystem for Linux, you should be able to use a close variant of the above expression with sed or bash regexps (sed's POSIX regexes have no lazy .*?, so [^;]* stands in for it):
sed -ri 's/(radio|navigation|moonroof)=[^;]*(;|$)//g' <filename>
That sed command edits the file in place, so back up your data first.
Without bash/sed/perl to help you from a suitable command line, I'm sorry to say you need code, or rather the regexp engine that comes with it!
Hope that helps!
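If a few lines of code turn out to be acceptable after all, here is a minimal Python sketch that uses the pattern from the question unchanged (Python's `re` module does support the lazy `.*?`). The function name `drop_pairs` is made up for illustration.

```python
import re

# Pattern taken verbatim from the question.
PAIR_RE = re.compile(r'(radio|navigation|moonroof)=.*?(?:;|$)')


def drop_pairs(line):
    """Remove the radio, navigation, and moonroof pairs from one line."""
    return PAIR_RE.sub('', line)


cleaned = drop_pairs(
    "make=mazda;model=cx-5;year=2016;moonroof=yes;"
    "radio=yes;navigation=no;color=gray;"
)
print(cleaned)
```

Note that the pattern is not anchored at the start of a name, so a key that merely ends in one of these words (a hypothetical `xmradio=`, say) would be partially removed as well.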

fnr.exe regex capture groups output

I'm using a batch script with fnr.exe to perform bulk search and replace operations over multiple files, but I'm having trouble figuring out how to use capture groups in fnr.exe. I can use them fine in a similar utility called FAR that uses Java regexes, but FAR can't be run from the command line, so I'm unable to automate it. Fnr.exe says in its documentation that it uses .NET regular expressions. The documentation for .NET regular expressions is great when it comes to how to capture a group, but when it comes to outputting the captured group, it's rather lacking and assumes I'm writing C# or VB code where I can call things like:
Console.WriteLine("Match: {0}", match.Value)
I have a bunch of strings like the following, with the original string on the left and my desired replacement string on the right:
include "fooPrintDriver.h"; | include "barPrintDriver.h";
include "fooSearchAgent.h"; | include "barSearchAgent.h";
include "fooEventListener.h"; | include "barEventListener.h";
In FAR, I could find the strings on the left with
foo(.*?)\.h"
And then replace it with my desired string using
bar\1.h"
where '\1' would be PrintDriver, SearchAgent, or EventListener. However, when I try the same thing in fnr.exe, the '\1' is emitted literally, so my input and output become:
include "fooPrintDriver.h"; | include "bar\1.h";
include "fooSearchAgent.h"; | include "bar\1.h";
include "fooEventListener.h"; | include "bar\1.h";
Anyone know how to get it working in fnr.exe?
I figured it out. I need to use the following:
bar$1.h"
Good to see you figured it out. However, having tried it and initially failed, I'd add that it's important to wrap the part you want to reuse in parentheses. For example, if you used \w+ to find the letters after foo (a more reliable method than .*?), $1 would have been inserted as a literal $1. You'd need to use (\w+); then $1 is replaced correctly.
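To show the same idea outside fnr.exe, here is a small Python sketch. Note the assumption: this uses Python's `re` module rather than .NET, and Python writes the group reference as `\1` (or `\g<1>`) where .NET uses `$1`. The function name `rename_include` is made up for illustration.

```python
import re

# The parentheses around \w+ create group 1; without them there is
# nothing for the replacement string to refer to.
INCLUDE_RE = re.compile(r'foo(\w+)\.h"')


def rename_include(line):
    """Swap the foo prefix for bar, keeping the captured suffix."""
    return INCLUDE_RE.sub(r'bar\1.h"', line)


for original in ('include "fooPrintDriver.h";',
                 'include "fooSearchAgent.h";',
                 'include "fooEventListener.h";'):
    print(original, "|", rename_include(original))
```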

Is it possible to add characters to a string as part of a regular expression (regex)

I use an application to find specific text patterns in free text fields in XML records. It uses regex to identify the pattern, which is then tagged in the XML. For a specific project, it would be a great time saver (I am working with about 18 million records) if I could add the two characters 27 in front of one of the patterns I have to use.
Can this be done or am I just going to have to go the long way around?
No, you can't have a regex match text that isn't there. A regex will only be able to return text that is part of the original text.
However, if you matched into groups, you could potentially use the group name for extra information about what you're matching.
Regex is not the right tool if you'd like to edit an XML file. Instead, use a modern language like Python, Perl, Ruby, PHP, or Java with a proper XML parser module. If you work in a Unix-like shell, I recommend xmlstarlet.
That said, if you'd like to go ahead with a substitution, you can try sed (at your own risk):
sed -i -r 's/987654/27&/g' files*.xml
(use the -i switch only if you want to modify the files in place)
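The distinction above (a match can only return text that is already there, but a substitution can insert new text around it) can be sketched in Python as follows. This is an illustration only: `add_prefix` is a hypothetical helper, `987654` is the same placeholder pattern used in the sed command, and whether the tagging application exposes a substitution step at all is not something I can tell from the question.

```python
import re


def add_prefix(text, pattern, prefix):
    """Insert prefix in front of every match of pattern in text."""
    # A function replacement sidesteps any escaping issues in the
    # replacement string; m.group(0) is the whole match (sed's "&").
    return re.sub(pattern, lambda m: prefix + m.group(0), text)


print(add_prefix("<id>987654</id>", r"987654", "27"))
```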

Replace exact part of text in a string (fstab) using sed

I'm in the process of migrating some data between 2 servers. The data is held in the same folder structure on each server.
Once the data has been moved I want to update the fstab file on all of the affected Linux machines. I have a bash script that rsyncs the data between the servers and then logs on to each machine in a list and updates the fstab with the new IP address using sed.
sed "s/\(172.16.0.30\)\(.*\)\(${share}\)\(.*\)/172.16.0.35\2\3\4/"
This has worked absolutely fine in the past, however this time I'm migrating a folder which has a name very similar to a few others, let's say $share is 'home':
home
home-old
home-ancient
The problem I'm having is that this regex is picking up all of the shares with the text contained in $share and not just the one I want.
Is there a way to adjust the regex so that it will only replace the IP on the single line that I want? I've looked at \b but can't seem to get it to work. Unfortunately, regular expressions usually confuse me!
\b is a GNU extension, and it won't help here: it matches a word boundary, and both the space and the - are non-word characters, so home\b still matches home, home-old, and home-ancient. One simple option is to require a space (or the end of the line) after $share, like:
sed "s/\(172.16.0.30\)\(.*\)\(${share}\)\( \(.*\)\|$\)/172.16.0.35\2\3\4/"
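The same "share name must be followed by whitespace or end of line" rule can be sketched in Python. Everything concrete here is an assumption for illustration: the function name `update_ip`, the sample fstab lines, and the idea that the share name ends the first whitespace-delimited field. Unlike the sed command, this sketch also escapes the dots in the IP address so they only match literal dots.

```python
import re


def update_ip(fstab_text, share, old_ip, new_ip):
    """Replace old_ip only on lines whose first field ends in share."""
    pattern = re.compile(
        # The lookahead requires the share name to be followed by
        # whitespace or the end of the line, so home-old is skipped.
        rf'^{re.escape(old_ip)}(?=\S*{re.escape(share)}(?:\s|$))',
        re.MULTILINE,
    )
    return pattern.sub(lambda m: new_ip, fstab_text)


# Hypothetical fstab entries, made up for this example.
SAMPLE_FSTAB = (
    "172.16.0.30:/export/home /mnt/home nfs defaults 0 0\n"
    "172.16.0.30:/export/home-old /mnt/home-old nfs defaults 0 0\n"
    "172.16.0.30:/export/home-ancient /mnt/home-ancient nfs defaults 0 0\n"
)

updated = update_ip(SAMPLE_FSTAB, "home", "172.16.0.30", "172.16.0.35")
print(updated, end="")
```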

How do I extract data between HTML tags using Regex?

I've been assigned some sed homework in my class and am one step away from finishing the assignment. I've racked my brain trying to come up with a solution, and nothing has worked, to the point where I'm about to give up.
Basically, in the file I've got...I'm supposed to replace this:
<b>Some text here...each bold tag has different content...</b>
with
Some text here...each bold tag has different content...
I've got it partially completed, but what I can't figure out is how to "echo" the extracted content using sed (regexp).
I manage to substitute the content out just fine, but it's when I'm trying to actually OUTPUT the content that's between the HTML tags that it goes wrong.
If that's confusing, I truly apologize. I've been at this project a couple of hours now and am getting a bit frustrated. Basically, why does this not work?
s/<b>.*<\/b>/.*/g
I simply want to output the content WITHOUT the bold tags.
Thanks a bunch!
If you want to reference a part of your regex match in the replacement, you need to place that portion of the regex into a capturing group, and then refer to it using the group number preceded by a backslash. Try the following:
s/<b>\(.*\)<\/b>/\1/g
You need to use a capturing group, which is written with parentheses ().
So, it's just this:
s/<b>(.*)<\/b>/\1/g
Capturing groups are numbered from left to right, starting at one.
The unescaped parentheses above are extended-regex syntax; sed's default (basic) syntax requires them to be escaped. The sed command is either
sed 's/<b>\(.*\)<\/b>/\1/g' [file]
or
sed -r 's/<b>(.*)<\/b>/\1/g' [file]
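One caveat worth illustrating: `.*` is greedy, so when a line contains more than one bold tag it captures everything from the first <b> to the last </b>. sed has no lazy quantifier, but the difference is easy to see in a short Python sketch (Python is my choice for the illustration, not part of the assignment, and the function names are made up).

```python
import re


def strip_bold(html):
    """Remove <b>...</b> tags, keeping their content (lazy match)."""
    return re.sub(r'<b>(.*?)</b>', r'\1', html)


def strip_bold_greedy(html):
    """Same substitution with a greedy .* for comparison."""
    return re.sub(r'<b>(.*)</b>', r'\1', html)


line = "x <b>one</b> y <b>two</b>"
print(strip_bold(line))         # both pairs of tags removed
print(strip_bold_greedy(line))  # only the outermost <b> and </b> removed
```

In sed itself, a common workaround is to replace `.*` with a character class that cannot run past the next tag, such as `[^<]*`, assuming the bold text contains no other tags.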
Of course, if you just want to remove the bold tags, another solution is to delete the HTML tags themselves (this strips every tag on the line, not only <b>):
sed 's/<\([^>"]\|"[^"]*"\)*>//g' [file]
(I dislike sed's need to escape everything.) With sed -r, the same expression needs no backslashes:
s/<([^>"]|"[^"]*")*>//g
I think this question is best answered by sed's documentation, for example: http://www.grymoire.com/Unix/Sed.html#uh-4