Alternative to many substitutions in a row? [closed] - regex

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Most of my Perl scripts deal with converting ugly formats to plain text content. So far, I've done this with dozens and dozens of substitutions all in a row. Is there a more elegant way to do this in Perl? For instance, a hash containing all the s/// pairs, or even an external file containing the substitutions?
I'm just wondering how other people handle this kind of formatting script, or if just having a novel's worth of s/// expressions is the normal way to go. It gets hard to manage at a certain point.
Thanks!

Sometimes the most efficient approach is to parse the old data format into a memory structure and then output the new format.
Depending on the structure, this can be done line by line. If you have to load the whole document at once, that works too, as long as the documents aren't gigantic.
As an example, this is how you'd do an image file conversion: read a GIF into a bitmap and then produce a JPEG output. You wouldn't use regular expressions for that; even if you could, it would be horribly inefficient.

I have a utility method I use all the time for this:
sub subst($@) {
    my ($x, @map) = @_;
    @map % 2 == 0 or die 'subst requires an odd number of params';
    while (@map) {
        my $from = shift(@map);
        my $to = shift(@map);
        $x =~ s/$from/$to/g;
    }
    return $x;
}
I use a list instead of a hash for map so I can control the order. Use it like this:
my $new_x = subst($x,
pattern1 => replacement1,
pattern2 => replacement2);
Even with a single pattern, it's simpler if you aren't substituting something in place. I.e. it's cleaner than this:
my $new_x = $x;
$new_x =~ s/pattern1/replacement1/g;

Zan Lynx gives good advice, but to answer your question about hash-driven substitutions, you can use the following:
my %replacements = (
foo => 'bar',
bar => 'baz',
baz => 'foo',
# ...
);
my $pat = join '|', map quotemeta, keys %replacements;
s/($pat)/$replacements{$1}/g;
This solves two problems:
It scans the input only once, not once per pattern.
It takes away the risk of accidentally matching the replacements of earlier substitutions. In fact, the example I used couldn't be accomplished with three substitutions.
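For readers who want to try the single-pass idea outside Perl, here is a minimal sketch in Python (swapped in for Perl purely for illustration). The names `replacements` and `replace_all` are made up for this example; sorting the keys longest-first is an extra precaution I added, not something the Perl snippet does.

```python
import re

# Hypothetical replacement table, mirroring the Perl example above.
replacements = {"foo": "bar", "bar": "baz", "baz": "foo"}

# Escape every key and join them into one alternation. Longest keys go
# first so that a longer key is preferred over a key that is its prefix.
pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(replacements, key=len, reverse=True))
)

def replace_all(text):
    """Replace every key in a single scan of the input."""
    # The callback looks up whatever was matched, so replaced text is
    # never scanned again and cannot be matched by a later rule.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(replace_all("foo bar baz"))
```

Because each match is looked up once and the output is not rescanned, the three-way rotation works, which three sequential substitutions could not do.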

This is an awkward question because the order of the substitutions affects the result.
If your changes are so extensive, then I would much prefer a sequence of s/// statements to anything held in a hash, which would shuffle the substitutions into a different order each time the program was run.
I would still be worried about the accuracy of “dozens and dozens” (two hundred plus?) of substitutions, but at least the result would be under control.
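To make the ordering point concrete, here is a small sketch in Python (not Perl, and not anyone's actual script) that keeps the rules in an ordered list of pairs instead of a hash. `apply_in_order` and the two sample rules are hypothetical names chosen for the illustration.

```python
import re

def apply_in_order(text, rules):
    """Apply (pattern, replacement) pairs one after another, in list order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

# The same two rules in two different orders.
rules_a = [(r"foo", "bar"), (r"bar", "baz")]
rules_b = [(r"bar", "baz"), (r"foo", "bar")]

print(apply_in_order("foo bar", rules_a))  # later rule also rewrites earlier output
print(apply_in_order("foo bar", rules_b))
```

The two calls give different results for the same input, which is why a list (or a file read top to bottom) is a safer container than an unordered hash when rules can feed into each other.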

Related

Questions about using regular expressions [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I'm trying to learn more about regex and I'm running into a block
my current query:
function telephoneCheck(str) {
return str.match(/[0-9]{3}[-][0-9]{3}[-][0-9]{4}/g)? true : false
}
This only works for specific inputs such as "555-555-5555"; for other inputs such as "1 (555) 555-5555" it does not. I'm at a loss on how to query for optional characters and whitespace. Moreover, bracket handling is odd, and I've found some crazy queries such as /(\d+-)\1\d{4}/g, but I have no idea what it's doing and I don't want to use code I don't understand.
Can someone show me a query that solves for "1 (555) 555-5555" where the first two characters (the one and space) are optional inputs?
These are inputs that the regex should be able to handle:
"1 (555) 555-5555"
"1(555)555-5555"
"1 555-555-5555"
"555-555-5555"
"(555)555-5555"
"5555555555"
I found a solution:
function telephoneCheck(str) {
var regex = /^(1\s?)?(\(\d{3}\)|\d{3})[\s\-]?\d{3}[\s\-]?\d{4}$/;
return regex.test(str);
}
telephoneCheck("555-555-5555");
But I have no idea what's going on in here. If someone could explain what's happening, I'd be happy to give you the answer for this posted question :)
You have to be wary of trying to do everything within one regex, and question why the data is so varied in the first place.
If you are just parsing a bunch of what you think should be phone numbers, for example, and notice a lot of different formats, it might actually be more readable to use logic.
There is probably a really clever way of doing the above, but I tend to be a bit more brute force with regex until I need more.
The pattern below combines both patterns into one regular expression. Use the | separator to say "or". Also, if your strings are exactly as you say, you should use ^ (starts with) and $ (ends with) to ensure you don't get false positives.
var pattern = /^[0-9] \([0-9]{3}\) [0-9]{3}-[0-9]{4}$|^[0-9]{3}[-][0-9]{3}[-][0-9]{4}$/
pattern.test('555-555-5555') //true
pattern.test('1 (555) 555-5555') // true
pattern.test('(555) 555-5555') // false
As I said, if you have lots of different formats in one data set, question why, and whether there is a way to clean things up first. Then perhaps use logic and separate statements.
var parensPattern = /^[0-9] \([0-9]{3}\) [0-9]{3}-[0-9]{4}$/
var noParensPattern = /^[0-9]{3}[-][0-9]{3}[-][0-9]{4}$/
if(parensPattern.test('1 (555) 555-5555')) {
// do something
} else if (noParensPattern.test('555-555-5555')) {
// do something
}
Check out http://regex101.com, it is a great resource.
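Since the asker wanted to know what the pattern they found actually does, here is the same pattern written out piece by piece. This is a sketch in Python using its verbose regex mode (an assumption on my part that a commented layout helps; the original is JavaScript), and `telephone_check` is just a renamed stand-in for the asker's function.

```python
import re

PHONE = re.compile(
    r"""
    ^                    # start of the string
    (1\s?)?              # optional leading 1, optionally followed by one whitespace
    (\(\d{3}\)|\d{3})    # area code: three digits, in parentheses or bare
    [\s\-]?              # optional separator: one whitespace or one hyphen
    \d{3}                # next three digits
    [\s\-]?              # optional separator again
    \d{4}                # last four digits
    $                    # end of the string
    """,
    re.VERBOSE,
)

def telephone_check(text):
    """Return True when the whole string looks like one of the listed formats."""
    return PHONE.fullmatch(text) is not None

samples = [
    "1 (555) 555-5555",
    "1(555)555-5555",
    "1 555-555-5555",
    "555-555-5555",
    "(555)555-5555",
    "5555555555",
]
for sample in samples:
    print(sample, telephone_check(sample))
```

Each `?` makes the preceding group or character class optional, which is how the leading "1 " and the separators can be present or absent.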

How to replace in a particular region in many files in Perl? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am working with many XML files, and I want to replace some particular content only in a specific region of each file. For example, the files may contain many elements like the following:
<h2>Content comes here</h2>
Now I want to replace a word only inside the above <h2>...</h2> region in all files.
Please advise. Thanks in advance.
General text replacement in Perl is usually done using regexes and the s/// operator. However, it is considered very inadvisable to try to interpret the structure of an XML file using only regexes.
You should use a module which parses XML. XML::Simple will allow you to load the whole document as a Perl object (using hashrefs for attributes and subtags, etc.) and you can then traverse it and do the replacement you want to. However you then have to write that structure back as you choose.
XML::Parser is a good bet in my opinion. It is conceptually a bit more tricky, but is designed to do exactly the sort of thing you want. You set up handler functions which get called every time the parser finds the start or end of a tag. In your case all these have to do is output the tag and its contents, except when it's a h2 tag, in which case you do some extra processing.
There are also some DOM-oriented parsers which you might want to use if you are used to doing stuff like this in JavaScript or some other DOM-based XML library.
Last and for the sake of completeness, you can probably write a (very short) XSLT file which will do this transformation (not an expert, so not sure exactly how) and apply it using XML::XSLT, basically in one line.
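To show what "parse, then replace only inside h2" looks like in practice, here is a rough sketch using Python's standard xml.etree.ElementTree instead of the Perl modules named above. The function name `replace_in_h2` and the sample document are invented for the example, and it only touches text directly inside each h2, not text nested in child elements.

```python
import xml.etree.ElementTree as ET

def replace_in_h2(xml_text, old, new):
    """Replace `old` with `new` only in the direct text of <h2> elements."""
    root = ET.fromstring(xml_text)
    for h2 in root.iter("h2"):
        # h2.text is None for an empty element, so guard before replacing.
        if h2.text:
            h2.text = h2.text.replace(old, new)
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")

sample = "<doc><h2>Content comes here</h2><p>Content stays</p></doc>"
print(replace_in_h2(sample, "Content", "Text"))
```

For many files, the same function could be called in a loop over the file names and the result written back out.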

RegExp Remove content outside of commas [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Alright, so I have a database that returns information laid out like this:
ID, Display name, Likes Cake, Likes Coffee, Likes Dogs
So if you get the information, it would look a little like this:
1,anonymous,1,0,1
Now it's not very popular, so I would like to show the people who have answered. For that I need everything outside the !'s in "1,!anonymous!,1,0,1" gone. I looked around and found a RegExp that removes everything outside quotes, but it's rather hard, and I'm too impatient to put all the display names in quotes.
So if there were a RegExp that would erase the numbers so I could put the usernames up, that would be delicious.
Well, you could do something like this:
Replace '^[^,]+,([^,]+).*' With '$1'
How it looks exactly in your language may vary, of course.
But in your case this looks like CSV, so isn't parsing the CSV file easier in that case? E.g. in PowerShell you could do
Import-Csv foo.csv | select 'Display name'
and likewise for other languages that have such parsing built in somewhere. Besides, most other options may break depending on the input, because fields in CSV may contain commas too, which breaks both the regex above and a naïve splitting method.
You can split the database result string and then get the relevant array index.
string dbString = "1,anonymous,1,0,1";
string username = dbString.Split(',')[1];
//value of username will be "anonymous"
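As a hedged illustration of why a real CSV parser is safer than splitting on commas, here is a short Python sketch using the standard csv module. The helper name `display_names` and the second sample row (with a quoted comma) are made up for the example.

```python
import csv
import io

def display_names(csv_text):
    """Return the second column (the display name) of every row."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip rows that are too short to have a display name column.
    return [row[1] for row in reader if len(row) > 1]

data = '1,anonymous,1,0,1\n2,"Doe, Jane",0,1,1\n'
print(display_names(data))
```

The quoted field with an embedded comma stays in one piece here, whereas a plain split on "," would cut it in two.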

Regular expression to search for Gadaffi [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?
This is a list of 30 variants:
Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi
My best attempt so far is:
\b[KG]h?add?af?fi$\b
But I still seem to be missing some variants. Any suggestions?
Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.
Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
\b[KGQ]h?add?h?af?fi\b
Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).
Btw, why is there a $ at the end of the regex?
Btw, nice article on the topic:
Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.
EDIT
This one should match all the names in the article you mentioned later. Let's just hope it won't match a lot of other stuff :D
\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
One interesting thing to note from your list of potential spellings is that there's only 3 Soundex values for the contained list (if you ignore the outlier 'Kazzafi')
G310, K310, Q310
Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.
<?
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');
$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";
$wordArray = preg_split('/[\s,.;-]+/',$text);
foreach ($wordArray as $item){
$rate = in_array(soundex($item),$soundexMatch) + in_array(metaphone($item),$metaphoneMatch);
if ($rate > 1){
$matches[] = $item;
}
}
$pattern = implode("|",$matches);
$text = preg_replace("/($pattern)/","<b>$1</b>",$text);
echo $text;
?>
A few tweaks, and let's say some Cyrillic transliteration, and you'll have a fairly robust solution.
Using CPAN module Regexp::Assemble:
#!/usr/bin/env perl
use feature 'say';    # needed for say() below
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;
This produces the following regular expression:
(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))
I think you're overcomplicating things here. The correct regex is as simple as:
\u0627\u0644\u0642\u0630\u0627\u0641\u064a
It matches the concatenation of the seven Arabic Unicode code points that form the word القذافي (i.e. Gadaffi).
If you want to avoid matching things that no-one has used (ie avoid tending towards ".+") your best approach would be to create a regular expression that's just all the alternatives (eg. (Qadafi|Kadafi|...)) then compile that to a DFA, and then convert the DFA back into a regular expression. Assuming a moderately sensible implementation that would give you a "compressed" regular expression that's guaranteed not to contain unexpected variants.
If you've got a concrete listing of all 30 possibilities, just concatenate them all together with a bunch of "ors". Then you can be sure that it only matches the exact things you've listed, and no more. Your RE engine will probably be able to optimize it further, and, well, with 30 choices even if it doesn't it's still not a big deal. Trying to fiddle around with manually turning it into a "clever" RE can't possibly turn out better and may turn out worse.
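The "just list them all" approach is easy to generate rather than type by hand. Here is a minimal Python sketch (Python chosen only for illustration) that builds the alternation from the 30 spellings in the question; `VARIANTS`, `pattern` and `find_variants` are names invented for this example.

```python
import re

# The 30 spellings listed in the question.
VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi
Ghadafi Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi
Khadaffy Khadafy Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi
Quathafi Qudhafi Kad'afi
""".split()

# Escape each spelling and join with "|"; longest first so that a longer
# spelling is tried before a shorter one that shares its beginning.
alternation = "|".join(
    re.escape(name) for name in sorted(VARIANTS, key=len, reverse=True)
)
pattern = re.compile(r"\b(?:" + alternation + r")\b")

def find_variants(text):
    """Return every listed spelling that occurs as a whole word in text."""
    return pattern.findall(text)

print(find_variants("Mr. Gaddafi met Gazzafy and Kad'afi"))
```

Because the pattern is nothing more than the list itself, a spelling that nobody supplied (such as "Gazzafy") cannot match by accident.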
(G|Gh|K|Kh|Q|Qh|Q|Qu)(a|au|e|u)(dh|zz|th|d|dd)(dh|th|a|ha|)(\x27|)(a|)(ff|f)(i|y)
Certainly not the most optimized version, split on syllables to maximize matches while trying to make sure we don't get false positives.
Well since you are matching small words why don't you try a similarity search engine with the Levenshtein distance? You can allow at most k insertions or deletions. This way you can change the distance function to other things that work better for your specific problem. There are many functions available in the simMetrics library.
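For anyone curious what the distance-based idea looks like, here is a small self-contained Python sketch of the classic Levenshtein distance (which counts substitutions as well as insertions and deletions). It does not use the simMetrics library mentioned above; `levenshtein` and `close_matches` are hypothetical helper names, and the threshold k is yours to tune.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]

def close_matches(words, target, k):
    """Keep the words within distance k of target, ignoring case."""
    return [w for w in words if levenshtein(w.lower(), target.lower()) <= k]

print(close_matches(["Gadafi", "Godfrey", "Kaddafi"], "Gaddafi", 2))
```

A small k keeps near spellings and drops unrelated names, at the cost of having to pick a threshold that suits the data.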
A possible alternative is an online tool that generates regular expressions from examples: http://regex.inginf.units.it.
Give it a chance!
Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.
Regex is about pattern matching, and I can't see one pattern covering all variants in the list. Trying to force one will also match things like "Gazzafy" or "Quud'haffi", which are most probably not used variants and definitely not on the list.
But I can see patterns for some of the variants, and so I ended up with this:
\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b
At the beginning I list the ones where I can't see a pattern, then followed by some variants where there are patterns.
See it here on www.rubular.com
I know this is an old question, but...
Neither of these two regexes is the prettiest, but they are optimized and both match ALL the variations in the original post.
"Little Beauty" #1
(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)
"Little Beauty" #2
(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y
Rest in Peace, Muammar.
Just an addendum: you should add "Gheddafi" as alternate spelling. So the RE should be
\b[KG]h?[ae]dd?af?fi$\b
[GQK][ahu]+[dtez]+\'?[adhz]+f{1,2}(i|y)
In parts:
[GQK]
[ahu]+
[dtez]+
\'?
[adhz]+
f{1,2}(i|y)
Note: Just wanted to give a shot at this.
What else starts with Q, G, or K, has a d, z or t in the middle, and ends in "fi" that people actually search for?
/\b[GQK].+[dzt].+fi\b/i
Done.
>>> import re
>>> a = re.compile(r"\b[GQK].+[dzt].+fi\b", re.I)  # the pattern above
>>> print(re.search(a, "Gadasadasfiasdas") is not None)
False
>>> print(re.search(a, "Gadasadasfi") is not None)
True
>>> print(re.search(a, "Qa'dafi") is not None)
True
Interesting that I'm getting downvoted. Can someone leave some false positives in the comments?

I'm looking for an application/text editor that [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
can best help me systematically modify the "replace" field of a regex search as it encounters each match.
For example, I have an xml file that needs the phrase "id = $number" inserted at regular points in the text, and basically, $number++ each time the regex matches (id = 1, id = 2, etc) until the end of the file.
I know I could just write a bash/perl/python script or some such, but I'd like it to be at least moderately user-friendly so I could teach my intelligent (but less technically-inclined) workers how to use it and make their own modifications. Regexing is not a problem for them.
The closest I've come so far is Notepad++'s Column Editor and 'increase [number] by' function, but with this I have to write a separate regex to align everything, add the increments, and then write another to put it back. Unfortunately, I need to use this function on too many different types of files and 'replace's to make macros feasible.
Ideally, the program would also be available for both Windows & Linux (WINE is acceptable but native is much preferred), and have a 'VI/VIM input' option (if it's a text editor), but these are of secondary importance.
Of course, it'd be nice if there is an OSS solution, and I'd be glad to donate $20-$50 to the developer(s) if it provides the solution I'm looking for.
Apologies for the length, and thanks so much for your help!
emacs (version 22 and later) can do what you're looking for. See Steve Yegge's blog for a really interesting read about it. I think this should work:
M-x replace-regexp
Replace regexp: insert pattern regexp here
Replace regexp with: id = \#
\# is a special metacharacter that gets replaced by the total number of replacements that have occurred so far, starting from 0. If you want the list to start from 1 instead of 0, use the following replacement string:
id = \,(1+ \#)
JEdit can probably help you:
http://www.jedit.org/
You can do all kinds of regex replacing with it, and even replace with the result of a BeanShell snippet.
UltraEdit32 is great and I believe it has the features you need. There is a free 30-day download so you can make sure. :)
I know you want an app available on Windows/Linux, but there's another solution on Mac : TextWrangler, and it's free.
Take a look at UltraEdit32. It's very good. Not free, but available in Windows, Linux and Mac platforms. It has regex based search & replace.
This script should let you do what you want in Vim.
Vim functions can do the incrementing number trick and aren't too hard to write. For example the Vim wiki says how to do this. See also :h sub-replace-\=.
function! Counter()
let i = g:c
let g:c = g:c + 1
return i
endfunction
:let c=1|%s/<\w\+\zs/\=' id="' . Counter() . '"'/g
We've probably left user-friendliness long behind at this point but Vim's Ruby support can do this kind of thing easily too:
:ruby c=0
:rubydo $_.gsub!(/<\w+/){|m| c += 1; m + ' id="' + c.to_s + '"'}
Or Perl:
:perl $c=1
:perldo s/<\w+/$& . ' id="' . $c++ . '"'/eg
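If it does come down to a script after all, the counter trick carries over directly to a replacement callback. This is a minimal Python sketch, assuming the same `<\w+` pattern as the Vim examples; `number_tags` and its parameters are names invented here.

```python
import itertools
import re

def number_tags(text, pattern=r"<\w+", start=1):
    """Append id="N" after every match, with N counting up from start."""
    counter = itertools.count(start)
    # The callback runs once per match, so each match gets the next number.
    return re.sub(
        pattern,
        lambda m: '{} id="{}"'.format(m.group(0), next(counter)),
        text,
    )

print(number_tags("<a><b/></a>"))
```

Closing tags are left alone because `</a>` has a slash right after the `<`, so `<\w+` does not match there.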
To me, this sounds like it might be a job for awk, rather than a job for an editor.