Grep replace é, ü, ï etc - replace

For my workflow, I need to replace accented characters with a unique string (to be re-replaced in another step of the process).
But my current GREP rule (in InDesign's FindChangeList) does not recognize the accented letters:
grep {findWhat:"é"} {changeTo:"!e"} //Doesn't do anything
To verify:
grep {findWhat:"\/"} {changeTo:"\+"} //Does work: it replaces a slash with a plus sign.
grep {findWhat:"e"} {changeTo:"f"} //Does work, and does not replace é to f

Do you clean the Preferences before and after the changes?
app.findGrepPreferences = app.changeGrepPreferences = null;


extract only words that do not contain words ending with a particular letter combination (using regex only)

I have this list of Portuguese language words https://raw.githubusercontent.com/pythonprobr/palavras/master/palavras.txt. I want to extract only words that do not end in "er" or "ar". I have been trying to apply the methods in the answers to this question Regex not matching words ending with "Impl" but I can't make it work.
I've been using the command like this from this answer https://stackoverflow.com/a/22964675/10824251 : $ grep -oP '[A-Z][A-Za-z\d]*(\?<! er) [ [A-Z] [A-Za-z \\ d] * (\? <! er)] ' palavra.txt > output.txt
To get all lines that do not end with er and ar, you may use
grep -v '[ea]r$' palavras.txt > output.txt
NOTES:
-v - inverts the result, we get all the lines that do not match the regex
[ea]r$ - matches e or a, then r at the end of the string
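If you would rather do the same filtering inside a script, here is a minimal Python sketch of the idea behind grep -v '[ea]r$'. The function name and the sample words are made up for illustration; the real input would be the lines of palavras.txt.

```python
import re

# Same pattern as the grep command: "e" or "a", then "r" at the end of the word.
ENDS_IN_ER_OR_AR = re.compile(r"[ea]r$")

def words_not_ending_in_er_or_ar(words):
    """Keep only the words that do NOT match the pattern (like grep -v)."""
    return [w for w in words if not ENDS_IN_ER_OR_AR.search(w)]

# Hypothetical sample; in practice read the lines of palavras.txt instead.
sample = ["amar", "casa", "comer", "livro", "partir"]
print(words_not_ending_in_er_or_ar(sample))
```

Words ending in "ir" or "or" are kept, because the character class only lists e and a.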

Trim end of string

I'm having trouble trimming off some characters at the end of a string. The string usually looks like:
C:\blah1\blah2
But sometimes it looks like:
C:\blah1\blah2.extra
I need to extract out the string 'blah2'. Most of the time, that's easy with a substring command. But on the rare occasions when the '.extra' portion is present, I need to first trim that part off.
The thing is, '.extra' always begins with a dot, but then is followed by various combinations of letters with various lengths. So wildcards will be necessary. Essentially, I need to script, "If the string contains a dot, trim off the dot and anything following it."
$string.replace(".*","") doesn't work. Nor does $string.replace(".\*",""). Nor does $string.replace(".[A-Z]","").
Also, I can't get at it from the beginning of the string either. 'blah1' is unknown and of various lengths. I have to get at 'blah2' from the end of the string.
Assuming that the string is always a path to a file with or without an extension (such as ".extra"), you can use Path.GetFileNameWithoutExtension():
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2")
blah2
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2.extra")
blah2
The path doesn't even have to be rooted:
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("blah1\blah2.extra")
blah2
If you want to implement similar functionality on your own, that should be fairly simple as well - use String.LastIndexOf() to find the last \ in the string and use that as your starting argument for Substring():
function Extract-Name {
    param($NameString)
    # Extract the part after the last occurrence of \
    if ($NameString -like '*\*') {
        $NameString = $NameString.Substring($NameString.LastIndexOf('\') + 1)
    }
    # Remove the first . and anything after it
    if ($NameString -like '*.*') {
        $NameString = $NameString.Remove($NameString.IndexOf("."))
    }
    $NameString
}
And you'll see similar results:
PS C:\> Extract-Name "C:\blah1\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2"
blah2
PS C:\> Extract-Name "abc124323\blah2"
blah2
As the other posters have said, you can use special file name manipulators for this. If you'd like to do it with regular expressions, use the -replace operator (the .Replace() string method only does literal replacement):
$string -replace '\..*',''
The \..* regex matches a dot (\.) and then any string of characters (.*).
Let me address each of the non-working regexes individually:
$string.replace(".*","")
The reason this doesn't work is that . and * are both special characters in regular expressions: . is a wildcard character that matches any character, and * means "match the previous character zero or more times." So .* means "any string of characters."
$string.replace(".\*","")
In this instance, you're escaping the * character, meaning that the regex treats it literally, so the regex matches any single character (.) followed by a star (\*).
$string.replace(".[A-Z]","")
In this case, the regex will match any character (.) followed by any single capital letter ([A-Z]).
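To see what each of these patterns matches when it really is treated as a regular expression, here is a small sketch using Python's re module. Python is used only for illustration; the sample path is the one from the question, and the variable names are my own.

```python
import re

path = r"C:\blah1\blah2.extra"  # sample path from the question

# ".*" matches any run of characters, so everything is removed.
everything_gone = re.sub(r".*", "", path)
# ".\*" needs a literal "*" after some character; the path has none, so nothing changes.
no_star = re.sub(r".\*", "", path)
# ".[A-Z]" needs a capital letter preceded by some character; the only capital
# is the leading "C", so nothing changes here either.
no_capital = re.sub(r".[A-Z]", "", path)
# "\..*" matches a literal dot and everything after it.
trimmed = re.sub(r"\..*", "", path)

print(repr(everything_gone), no_star, no_capital, trimmed, sep="\n")
# The last path component is then whatever follows the final backslash.
print(trimmed.rsplit("\\", 1)[-1])
```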
If the strings are actual paths using Get-Item would be another option:
$path = 'C:\blah1\blah2.something'
(Get-Item $path).BaseName
The Replace() method can't be used here, because it doesn't support wildcards or regular expressions.

Replace inside matched string with Notepad++ and regex

I have some lines in a text file :
Joëlle;Dupont;123456
Alex;Léger;134234
And I want to replace them by :
Joëlle;Dupont;123456;joelle.dupont#mail.com
Alex;Léger;134234;alex.leger#mail.com
I want to replace all characters with accents (é, ë…) by characters without accents (e, e…), but only in the mail address, i.e. only in one part of the line.
I know I can use \L\E to change uppercase letter into lowercase letter but it's not the only thing I have to do.
I used :
(.*?);(.*?);(\d*?)\n
To replace it by :
$1;$2;$3;\L$1.$2#mail.com\E\n
But it wouldn't replace characters with accents :
Joëlle;Dupont;123456;joëlle.dupont#mail.com
Alex;Léger;134234;alex.léger#mail.com
If you have any idea how I could do this with Notepad++, even with more than one replacement, maybe you can help me.
I don't know your whole population, but you could use the below to replace the variations of e with an e:
[\xE8-\xEB](?!.*;)
And replace with e.
[I got the range above from this webpage, taking the column names]
regex101 demo
This regex matches any è, é, ê or ë and replaces them with an e, if there is no ; on the same line after it.
For variations of o:
[\xF2-\xF6](?!.*;)
For c (there's only one, so you can also put in ç directly):
\xE7(?!.*;)
For a:
[\xE0-\xE5](?!.*;)
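If you want to try these patterns outside Notepad++, here is a rough Python sketch that applies them one after another. The function name is hypothetical, and it assumes the accented letters are stored as single precomposed characters (as in Latin-1 or NFC Unicode), since that is what the \xE8-style ranges match.

```python
import re

# Ranges taken from the patterns above; the (?!.*;) lookahead only lets a
# match through when no ";" follows it on the same line, i.e. in the last field.
REPLACEMENTS = [
    (r"[\xE8-\xEB](?!.*;)", "e"),
    (r"[\xF2-\xF6](?!.*;)", "o"),
    (r"\xE7(?!.*;)", "c"),
    (r"[\xE0-\xE5](?!.*;)", "a"),
]

def strip_accents_in_last_field(line):
    """Apply each replacement in turn; earlier fields are left untouched."""
    for pattern, plain in REPLACEMENTS:
        line = re.sub(pattern, plain, line)
    return line

print(strip_accents_in_last_field("Joëlle;Dupont;123456;joëlle.dupont#mail.com"))
print(strip_accents_in_last_field("Alex;Léger;134234;alex.léger#mail.com"))
```

In Notepad++ this corresponds to running one Replace All per pattern after the first replacement has appended the mail column.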

string pattern and regex

I have a file with different lines, among which I have some lines like
173.194.034.006.00080-138.096.201.072.49934
the pattern is 3 numbers and then a dot and then 3 numbers and then a dot, etc.
I want to use awk, grep, or sed for this purpose. How do I express this regular expression?
Assuming you want the lines that contain at least one series like 123., do
grep '[0-9][0-9][0-9]\.' file > numbersFile
If you want 2 series like 123.345., then do
grep '[0-9][0-9][0-9]\.[0-9][0-9][0-9]\.' file > numbersFile
etc, etc.
Each [0-9] means match only one occurrence of a character in the range 0-9 (0,1,2,3,4,5,6,7,8,9).
Because the '.' char has a special meaning in a normal grep regexp, you have to escape it like \. to indicate "Just match the '.' char (only!)" ;-)
There are fancy extensions to grep that allow you to specify the pattern once, and include a qualifier like {3} or sometimes \{3\} (to indicate 3 repetitions). But this extension isn't portable to older Unix like Solaris, AIX, and others.
Here's a simple test to see if your system supports qualifiers. (Super Grep-heads are welcome to correct my terminology :-).
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{10\}\.'
echo "173.194.034.006.00080-138.096.201.072.49934" | grep '[0-9]\{2\}\.'
The first test should fail, the 2nd will succeed if your grep supports qualifiers.
It doesn't hurt to learn the long-hand solution (as above), and you can be sure this will work with any grep.
IHTH.
In awk I'd probably build up the string and then search for it as:
BEGIN {
p = "[.]"
d = "[[:digit:]]"
d3 = d d d # or d"{3}"
d5 = d d d d d # or d"{5}"
re = d3 p d3 p d3 p d3 p d5 # or "(" d3 p "){4}" d5
}
$0 ~ re "-" re
but it really all depends what you want to do with it.
By the look of it, these are IP addresses, followed by a port number, a dash and then the IP address/port number combination again.
If you're on a modern UNIX/Linux system then
grep -P '(\d{3}\.){4}\d{5}-(\d{3}\.){4}\d{5}'
would do the trick -- although it may not be the most portable way to do it. This uses the '-P' for "use Perl regular expressions" option, which some people might consider to be cheating!
You didn't say if you've got extra text either before or after these strings on the line. If you have then you can use the '-o' option just to extract the matched text and ignore everything else.
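For comparison, here is a minimal Python sketch of the same pattern, including the "extract only the matched text" behaviour of -o. The names ENDPOINT, PAIR and extract_pairs are mine, and the surrounding text in the sample lines is invented.

```python
import re

# Four dot-terminated groups of three digits, then a five-digit port.
ENDPOINT = r"(?:\d{3}\.){4}\d{5}"
# Two such endpoints joined by a dash.
PAIR = re.compile(ENDPOINT + "-" + ENDPOINT)

def extract_pairs(line):
    """Return every address/port pair found in the line (like grep -o)."""
    return PAIR.findall(line)

print(extract_pairs("seen 173.194.034.006.00080-138.096.201.072.49934 today"))
print(extract_pairs("no match here: 173.194.34.6.80"))
```

The groups are non-capturing so that findall returns the whole match rather than the last repeated group.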

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace("/--(?=[^']*'([^']|'[^']*')*$)/", "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
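Here is a short Python check of the same regex against the example string from the question. It is only a sketch and relies on the same assumption as the regex itself: the single quotes in the input are balanced and never escaped.

```python
import re

# "--" followed by an odd number of single quotes before the end of the string,
# which (with balanced quotes) means the "--" sits inside a quoted region.
INSIDE_QUOTES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")

text = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
cleaned = INSIDE_QUOTES.sub("", text)
print(cleaned)
```

The -- between the quoted parts is left alone, because an even number of quotes follows it.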
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
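A minimal sketch of that character-by-character approach in Python could look like this. The function name is made up, and it assumes that a single quote always toggles the quoted region (no escaped quotes).

```python
def remove_dashes_inside_quotes(text):
    """Drop every "--" that occurs while we are inside single quotes."""
    result = []
    inside = False  # flag: are we currently between two single quotes?
    i = 0
    while i < len(text):
        if text[i] == "'":
            inside = not inside  # every quote flips the state
            result.append("'")
            i += 1
        elif inside and text.startswith("--", i):
            i += 2  # skip the pair of dashes
        else:
            result.append(text[i])
            i += 1
    return "".join(result)

sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(remove_dashes_inside_quotes(sample))
```

Dashes are consumed two at a time from the left, so '---' inside quotes becomes '-'.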
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
(                        # Group 1
  (?:^[^']*')?           # Start of string, up till the first single quote
  [^']*?                 # Inside the single quotes, as few characters as possible
  (?:
    '[^']*'              # No double dashes inside these single quotes, jump to the next.
    [^']*?
  )*?                    # as few as possible
)
(-{2,})                  # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix&1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.