Remove matching strings using regex - regex

I have a semicolon-delimited list of name/value pairs like this:
make=mazda;model=cx-5;year=2016;moonroof=yes;radio=yes;navigation=no;color=gray;
I would like to remove the moonroof, radio, and navigation pairs. I can capture these pairs using a regex like this:
(radio|navigation|moonroof)=.*?(?:;|$)
Is there a way to remove the captured group(s) using regex alone, without writing code? Alternatively, is there a way to get the rest of the pairs excluding the captured groups?

If your data set is small, you can use an online tool such as https://regex101.com/ to do it. On Linux or, I imagine, Windows Subsystem for Linux, you should be able to use the above expression with sed or a bash regexp, with one change: sed's extended regular expressions have no lazy quantifiers, so .*? becomes [^;]*:
sed -ri 's/(radio|navigation|moonroof)=[^;]*(;|$)//g' <filename>
That sed command edits the file in place, so back up your data first.
Without bash/sed/perl to help you from a suitable command line, I'm sorry to say you need code, or rather the regexp engine that comes with it!
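If you do end up writing code, a minimal Python sketch could look like the one below. It assumes the pairs are separated by semicolons as in the sample line; the name remove_pairs and the anchoring on the start of a pair are my own additions for illustration, not something from the question.

```python
import re

# Keys to drop, taken from the question's example.
UNWANTED = ("radio", "navigation", "moonroof")

# (?:^|(?<=;)) anchors the key at the start of a pair, so a key such as
# "hdradio" is not matched by accident. [^;]* stops at the next semicolon.
PATTERN = re.compile(r"(?:^|(?<=;))(?:%s)=[^;]*(?:;|$)" % "|".join(UNWANTED))

def remove_pairs(line: str) -> str:
    """Return the line with the unwanted name=value pairs removed."""
    return PATTERN.sub("", line)

line = "make=mazda;model=cx-5;year=2016;moonroof=yes;radio=yes;navigation=no;color=gray;"
print(remove_pairs(line))
```

What is left after the substitution is exactly "the rest of the pairs", so the same call covers the alternative asked about in the question.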
Hope that helps!

Related

Notepad++ RegEx : Shuffle text and paste

I am trying to shuffle these words in Notepad++ with a regex:
word1|word2|word3|word4|word5
and the result should be:
word2|word1|word3|word4|word5
word3|word1|word2|word4|word5
word4|word1|word3|word2|word5
word5|word1|word3|word4|word2
Can Notepad++ do that?
Sure, just capture the words and output them:
Search: ^(.*?)\|(.*?)\|(.*?)\|(.*?)\|(.*?)$
Regex mode
Replace: \2|\1|\3|\4|\5\r\n\3|\1|\2|\4|\5\r\n\4|\1|\3|\2|\5\r\n\5|\1|\3|\4|\2
But if you want something more general (i.e. a variable number of words, generating all permutations rather than specific ones, etc.) then you will need a script of some kind. Personally, I'd whip up a quick and dirty PHP script to do the job, but others may use Node, Python, etc. There are plenty of options.
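As a rough illustration of the scripted route, here is a Python sketch of the "all permutations" case for any number of words. The function name all_orderings and the default separator are assumptions of mine; note that the output grows factorially, so this is only practical for short lists.

```python
from itertools import permutations

def all_orderings(line: str, sep: str = "|") -> list[str]:
    """Return every ordering of the sep-separated words in line."""
    words = line.split(sep)
    return [sep.join(p) for p in permutations(words)]

for ordering in all_orderings("word1|word2|word3"):
    print(ordering)
```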

Geany regex to extract data inside and outside parenthesis separately

I have an incomplete XML file I am trying to convert to CSV to map to a spreadsheet. To create the header I need to extract the label before each = and separate them with a ,.
Inversely, I need to capture everything between the "" on all the lines to match up to the header.
Where I'm having trouble is there are some spaces in some of the data fields which is messing me up in creating anchors, and some fields have no data at all with just "". Here is a sample with both cases in which I was trying to create my header.
lvendor="EBL" lxref="1304112" linked="0" ltrnqty="" labeltype="ITEM W/DATE,VENDOR" taxcode="1" foodstamp="false" nonstock="false" detail="true" ars2="false"
The Geany regex I tried with is:
[=]["](\S+)?["][\s]
This works until I run into a space in the data field, but replacing (\S+)? with (.+)? gives me other problems. I'm just not sure how to anchor my regex properly, or if I need to use a capture group to get it done.
I'm not even positive if Geany is the right tool here. I'm on an Arch Linux box, so I'm open to any tools that are available to me.
You could do:
(\w+)(?==)|"([^"]*)"
This saves the attribute names in the first capturing group and their corresponding values in the second capturing group.
Since you are open to new tools, you can also turn each line's values into a comma-separated row in the terminal with sed:
sed -r 's/\s?\S+=/,/g; s/^,//' file.xml
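If you would rather get the header and the values in one pass, a small Python sketch along these lines might do. It assumes every line carries the same attributes in the same order and that values never contain a double quote; the name attrs_to_csv is made up for the example.

```python
import csv
import io
import re

# One attribute: a name, an equals sign, and a double-quoted value that
# may be empty or contain spaces and commas.
ATTR = re.compile(r'(\w+)="([^"]*)"')

def attrs_to_csv(lines):
    """Build CSV text: a header from the first line's names, then one row of values per line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for line in lines:
        pairs = ATTR.findall(line)
        if not pairs:
            continue  # skip lines without attributes
        if not header_written:
            writer.writerow([name for name, _ in pairs])
            header_written = True
        writer.writerow([value for _, value in pairs])
    return out.getvalue()

sample = 'lvendor="EBL" lxref="1304112" linked="0" ltrnqty="" labeltype="ITEM W/DATE,VENDOR"'
print(attrs_to_csv([sample]))
```

The csv module takes care of quoting values such as ITEM W/DATE,VENDOR that contain a comma themselves, which the plain sed approach leaves to the quotes already in the data.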

Bash - Regex for HTML contents

I'm learning about Bash scripting, and need some help understanding regex's.
I have a variable that is basically the html of a webpage (exported using wget):
currentURL="https://www.example.com"
currentPage=$(wget -q -O - "$currentURL")
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
I started with this, but I need to modify the regex:
Test string (this is what currentPage contains; there can be zero to many instances of this):
<img src="./download/file.php?id=123456&t=1">
Current Regex:
.\/download\/file.php\?id=[0-9]{6}\&mode=view
Here's the regex I created, but it doesn't seem to work in bash.
The best solution would be to have the ID of each file. In this case, simply 123456. But if we can start with getting the /download/file.php?id=123456, that'd be a good start.
Don't parse XML/HTML with regex; use a proper XML/HTML parser.
Theory:
According to compiler theory, HTML can't be parsed with regular expressions, which are based on finite state machines. Because of HTML's hierarchical structure, you need a pushdown automaton, for example an LALR grammar handled by a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint often installed by default with libxml2, xpath1
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over Michael Kay's Saxon-HE Java library, xpath3
Or you can use a high-level language with a proper library, for example:
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Check: Using regular expressions with HTML tags
Example using xidel:
xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
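If none of these tools is installed, the same parser-based idea can be sketched with Python's standard library alone. This is only an illustration: the class name PhotoIdParser, the helper photo_ids, and the check on download/file.php are assumptions based on the test string in the question.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

class PhotoIdParser(HTMLParser):
    """Collect the id query parameter of every <img> whose src points at download/file.php."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        url = urlparse(src)
        if url.path.endswith("download/file.php"):
            # parse_qs maps each query parameter to a list of its values
            self.ids.extend(parse_qs(url.query).get("id", []))

def photo_ids(html: str) -> list[str]:
    """Return the ids of all matching images in the given HTML text."""
    parser = PhotoIdParser()
    parser.feed(html)
    parser.close()
    return parser.ids

page = '<p><img src="./download/file.php?id=123456&t=1"><img src="logo.png"></p>'
print(photo_ids(page))
```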
Let's first clarify a couple of misunderstandings.
"I'm learning about Bash scripting, and need some help understanding regex's."
You seem to be implying some sort of relation between Bash and regex.
As if Bash was some sort of regex engine.
It isn't. The =~ operator of the [[ ... ]] construct is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
There are some common commands executed in Bash that support some implementation of regular expressions such as grep or sed and others. Maybe that's what you meant. It's good to be specific and accurate.
"I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be."
This suggests an underlying assumption that if you want to extract content from HTML, then regex is the way to go. That assumption is incorrect.
Although it's best to extract content from HTML using an XML parser (using one of the suggestions in Gilles' answer),
and reaching for regex is not a good reflex,
for simple cases like yours it might just be good enough:
grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html
Take note that you escaped the wrong characters in the regex:
/ and & don't have a special meaning and don't need to be escaped
. and ? have special meaning and need to be escaped
Some extra tricks in the above regex are worth explaining:
The -P flag of grep enables Perl style (powerful) regular expressions
\K is a Perl-specific symbol; it means that the content before the \K is not included in the match
The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
The \K and the lookahead are there to work with grep -o, which outputs only the matched part. Without these tricks, the matched part would be, for example, ./download/file.php?id=123456&mode=view, which is more than what you want.
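For comparison, here is how the same extraction might look in Python. Its built-in re module has no \K, so this sketch uses a capturing group for the digits and keeps the lookahead; the HTML snippet and the name ID_PATTERN are made up for the example.

```python
import re

# Python's built-in re module has no \K, so a capturing group selects the
# digits instead; the lookahead still requires &mode=view to follow.
ID_PATTERN = re.compile(r"\./download/file\.php\?id=(\d+)(?=&mode=view)")

html = (
    '<a href="./download/file.php?id=123456&mode=view">one</a>\n'
    '<a href="./download/file.php?id=654321&mode=view">two</a>\n'
    '<a href="./download/other.php?id=111111&mode=view">skip</a>\n'
)

# With a single group in the pattern, findall returns just that group.
print(ID_PATTERN.findall(html))
```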

Textmate Regex Issue

I have a very large .CSV document with text I need removing. The data looks like this
774431994&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774431994
774431996&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774431996
774431998&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774431998
774432000&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432000
774432003&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432003
774432006&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432006
774432009&images=774431994,774431996,774431998,774432000,774432003,774432006,774432009&formats=0,0,0,0,0 /1/6/9/5/2/6/8/webimg/774432009
I'm using the following Regex which is working on http://regexr.com/3a6oa
/.{128}(?=webimg).{10}/g
It just doesn't seem to work with Textmate Search. Does anyone know why? I need to select all of this junk and replace it with nothing, the numbers are unique each time.
Thanks very much
Why are you using a lookahead in your pattern? The lookahead consumes nothing, so the .{10} after it starts at webimg; the equivalent pattern without it is: /.{128}webimg.{4}/g
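To see how a lookahead interacts with the quantifier that follows it, here is a small Python sketch on a made-up line (128 filler characters followed by webimg/774431994). Because (?=webimg) consumes nothing, the .{10} after it begins at the w of webimg.

```python
import re

# Made-up line: 128 filler characters, then the marker and an id.
line = "x" * 128 + "webimg/774431994"

# The lookahead only checks for "webimg"; .{10} then consumes "webimg/774".
with_lookahead = re.sub(r".{128}(?=webimg).{10}", "", line)

# Same effect without a lookahead: "webimg" (6 chars) plus 4 more.
without_lookahead = re.sub(r".{128}webimg.{4}", "", line)

# Consuming "webimg" and then 10 more characters removes the whole line here.
longer_match = re.sub(r".{128}webimg.{10}", "", line)

print(with_lookahead)
print(without_lookahead)
print(repr(longer_match))
```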
Why are you using Textmate search at all? I'd need to know more context of the problem to say for sure, but I bet a simple sed command could be used instead. For example, this deletes every line that contains webimg:
sed -i '/webimg/d' ./filename.csv

Is it possible to add characters to a string as part of a regular expression (regex)

I use an application to find specific text patterns in free text fields in XML records. It uses regex to identify the pattern, which is then tagged in the XML. For a specific project, it would be a great time saver (I am working with about 18 million records) if I could add the two characters 27 in front of one of the patterns I have to use.
Can this be done or am I just going to have to go the long way around?
No, you can't have a regex match text that isn't there. A regex will only be able to return text that is part of the original text.
However, if you matched into groups, you could potentially use the group name for extra information about what you're matching.
Regex is not the right tool if you'd like to edit an XML file. Instead, use a modern language like Python, Perl, Ruby, PHP, or Java with a proper XML parser module. If you work in a Unix-like shell, I recommend xmlstarlet.
That said, if you'd like to go ahead with a substitution, you can try sed (at your own risk):
sed -i -r 's/987654/27&/g' files*.xml
(use the -i switch only if you want to modify the files in place)
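The same substitution can be sketched in Python, which may be easier to combine with an XML parser later on. The function name prepend_27 and the sample record are made up for illustration, and 987654 is just the placeholder pattern from the sed command.

```python
import re

def prepend_27(text: str, pattern: str) -> str:
    """Insert "27" in front of every match of pattern in text."""
    # \g<0> stands for the whole match, like & in the sed command above.
    return re.sub(pattern, r"27\g<0>", text)

record = "<note>ref 987654 and 987654</note>"
print(prepend_27(record, r"987654"))
```

So while a match on its own can only return text that is already there, a substitution can add characters around whatever was matched.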